NVIDIA's 19-bit hybrid: FP32 range meets FP16 precision for accelerated tensor math
A TF32 number uses 19 bits divided into three fields: 1 sign bit, 8 exponent bits, and 10 mantissa bits.
TF32 combines the 8-bit exponent of FP32/BF16 with the 10-bit mantissa of FP16, giving it FP32's range with FP16's precision.
TF32 (TensorFloat-32) was introduced by NVIDIA with the Ampere architecture (A100 GPU) in 2020. Despite the "32" in its name, TF32 is actually a 19-bit format. The name reflects that it is used as a drop-in replacement for FP32 in tensor operations.
TF32 is unique because it's not a storage format: data is stored in FP32 memory layout, and TF32 is only used internally by Tensor Cores during matrix multiply-accumulate operations. The GPU automatically truncates FP32 inputs to TF32 precision (10 mantissa bits) before computation, then accumulates results in FP32.
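The truncation step described above can be modeled in plain Python by clearing the low 13 of FP32's 23 mantissa bits. This is a minimal sketch: the function name fp32_to_tf32_truncate is hypothetical, and it models pure truncation as stated in the text, so real hardware or library conversion paths may round differently.

```python
import struct

def fp32_to_tf32_truncate(x: float) -> float:
    """Return x reduced to TF32 precision by truncation (illustrative model only)."""
    # Reinterpret the value as a 32-bit pattern; packing as "<f" also
    # rounds the Python double to FP32 first.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep sign (1 bit) + exponent (8 bits) + top 10 mantissa bits,
    # and zero the remaining 13 low mantissa bits.
    bits &= 0xFFFFE000
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

print(fp32_to_tf32_truncate(1.0 + 2**-10))  # one TF32 step above 1.0, kept as-is
print(fp32_to_tf32_truncate(1.0 + 2**-11))  # finer than a TF32 step, falls back to 1.0
print(fp32_to_tf32_truncate(0.1))           # loses the low mantissa bits of FP32's 0.1
```

The result is still an ordinary FP32 value, which matches the point that TF32 lives in the FP32 memory layout rather than in a separate storage format.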
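This gives up to 8× speedup over FP32 matrix math on A100, with negligible accuracy loss for most deep learning workloads. In PyTorch, TF32 is enabled by default for cuDNN convolutions. For matrix multiplications (torch.matmul, torch.nn.Linear) it was enabled by default in PyTorch 1.7 through 1.11 and has been disabled by default since PyTorch 1.12; it can be turned on by setting torch.backends.cuda.matmul.allow_tf32 = True.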
TF32 uses the same exponent encoding as FP32, but keeps only 10 mantissa bits (like FP16). The bias of 127 is calculated as 2^(e-1) - 1, where e is the number of exponent bits (8); this is identical to FP32 and BF16.
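To make the field layout and bias concrete, here is a small sketch that decodes a 19-bit pattern into its value. The function name decode_tf32 is hypothetical, and the handling of subnormals, infinities, and NaNs assumes TF32 mirrors the usual IEEE 754 conventions of FP32; hardware may treat some of these cases differently.

```python
def decode_tf32(bits: int) -> float:
    """Decode a 19-bit TF32 pattern: 1 sign bit, 8 exponent bits, 10 mantissa bits."""
    sign = (bits >> 18) & 0x1
    exponent = (bits >> 10) & 0xFF
    mantissa = bits & 0x3FF
    bias = 2 ** (8 - 1) - 1  # 127, same as FP32 and BF16

    if exponent == 0xFF:
        # Assumed IEEE-style specials: infinity or NaN
        value = float("inf") if mantissa == 0 else float("nan")
    elif exponent == 0:
        # Assumed IEEE-style subnormal: no implicit leading 1
        value = (mantissa / 1024) * 2.0 ** (1 - bias)
    else:
        # Normal number: implicit leading 1 plus a 10-bit fraction
        value = (1 + mantissa / 1024) * 2.0 ** (exponent - bias)
    return -value if sign else value

print(decode_tf32(0x1FC00))  # exponent field 127, mantissa 0
print(decode_tf32(0x1FE00))  # same exponent, top mantissa bit set
print(decode_tf32(0x60000))  # sign bit set, exponent field 128
```

Because the TF32 fields are the top 19 bits of the FP32 layout, shifting an FP32 bit pattern right by 13 gives a pattern this function can decode.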
PyTorch controls TF32 matrix multiplication through the torch.backends.cuda.matmul.allow_tf32 flag. The cuBLASDx framework accepts the proprietary cublasdx::tfloat32_t type for explicit TF32 matrix kernels. CUTLASS provides tfloat32_t as a fundamental numeric type, and the MLIR NVVM dialect supports TF32 MMA fragment layouts for m16n8k4 and m16n8k8 shapes. PTX defines .tf32 as an alternate floating-point data format with dedicated MMA instruction encodings.