BF16 (Brain Float 16) | Floating Point Format Guide

Bit Layout

A BF16 number uses 16 bits divided into three fields: 1 sign bit, 8 exponent bits, and 7 mantissa bits.

BF16 uses the same 8-bit exponent as FP32, giving it identical dynamic range. The trade-off is a smaller 7-bit mantissa, providing less precision than FP16's 10 bits.
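As a minimal sketch of this layout, the snippet below splits a 16-bit pattern into its three fields. The function name split_bf16 is hypothetical and chosen only for illustration. It assumes the usual ordering with the sign in the top bit, followed by the exponent and then the mantissa.

```python
from typing import Tuple


def split_bf16(bits: int) -> Tuple[int, int, int]:
    """Split a 16-bit BF16 pattern into (sign, exponent, mantissa) fields."""
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned integer")
    sign = (bits >> 15) & 0x1      # bit 15
    exponent = (bits >> 7) & 0xFF  # bits 14..7 (8 bits)
    mantissa = bits & 0x7F         # bits 6..0 (7 bits)
    return sign, exponent, mantissa


print(split_bf16(0x3F80))  # pattern of 1.0 -> (0, 127, 0)
```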

Overview

BF16 (Brain Floating Point 16) was developed by Google Brain for use in their TPU (Tensor Processing Unit) hardware. It has since been adopted by virtually every major ML hardware vendor including NVIDIA (Ampere and later), AMD, Intel, and ARM.

The key insight behind BF16 is that deep learning training is more sensitive to dynamic range than to precision. Neural network weights, activations, and gradients can span many orders of magnitude, and FP16's narrow range (maximum finite value 65,504) often causes overflow. BF16 solves this by keeping FP32's full exponent range while truncating the mantissa.
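To make the range gap concrete, here is a small sketch that computes the largest finite value of each format from its field widths. The helper max_finite is a hypothetical name. It assumes an IEEE-style layout in which the all-ones exponent is reserved for infinity and NaN.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent field is reserved, so the largest usable field
    # value is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    # The largest significand is 1.111...1 in binary, i.e. 2 - 2^-m.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


fp16_max = max_finite(5, 10)
bf16_max = max_finite(8, 7)
print(fp16_max)  # 65504.0
print(bf16_max)  # roughly 3.39e38
```

A gradient of 100,000 is beyond the FP16 maximum but sits comfortably inside the BF16 range.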

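Converting FP32 to BF16 is simple: drop the lower 16 bits of the FP32 representation. Plain truncation rounds toward zero, so many implementations instead round to nearest even before dropping those bits. Either way, the conversion needs no exponent rebiasing, which makes BF16 very hardware-friendly.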

Think of BF16 as "truncated FP32": BF16 is literally the upper 16 bits of an FP32 number. You can convert FP32 → BF16 by chopping off the lower 16 bits, and BF16 → FP32 by padding with 16 zeros.
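The chop-and-pad conversion can be sketched with Python's struct module. All function names here are hypothetical. The round-to-nearest-even variant is an addition that this guide does not describe, included only as a commonly used alternative to plain truncation.

```python
import math
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_to_bf16_truncate(x: float) -> int:
    """FP32 -> BF16 by chopping off the lower 16 bits (rounds toward zero)."""
    return fp32_bits(x) >> 16


def bf16_to_fp32(bits: int) -> float:
    """BF16 -> FP32 by padding the pattern with 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp32_to_bf16_rne(x: float) -> int:
    """Alternative (assumption, not from this guide): round to nearest even."""
    bits = fp32_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Set a mantissa bit so the result stays a NaN after truncation.
        return (bits >> 16) | 0x0040
    # Add 0x7FFF, plus 1 when the kept LSB is odd, so that ties go to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


pi_bf16 = fp32_to_bf16_truncate(math.pi)
print(hex(pi_bf16))           # 0x4049
print(bf16_to_fp32(pi_bf16))  # 3.140625
```

The round trip shows the precision cost: pi comes back as 3.140625, since only 7 mantissa bits survive.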

Encoding Rules

Normal Numbers

value = (-1)^sign × 2^(exponent - 127) × (1 + mantissa / 2⁷)

BF16 follows the same rules as FP32, but with only 7 mantissa bits instead of 23. The bias of 127 is calculated as 2^(e-1) - 1, where e is the number of exponent bits (8), identical to FP32.
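As a worked instance of the formula, the sketch below evaluates it for given field values. The name decode_normal is hypothetical. The check that the exponent field lies between 1 and 254 assumes the usual IEEE convention that 0 and 255 are reserved for subnormals and special values.

```python
EXPONENT_BITS = 8
BIAS = 2 ** (EXPONENT_BITS - 1) - 1  # 127, the same bias as FP32


def decode_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^sign * 2^(exponent - bias) * (1 + mantissa / 2^7)."""
    if not 1 <= exponent <= 254:
        raise ValueError("normal numbers use exponent fields 1..254")
    if not 0 <= mantissa <= 0x7F:
        raise ValueError("mantissa must fit in 7 bits")
    return (-1.0) ** sign * 2.0 ** (exponent - BIAS) * (1 + mantissa / 2 ** 7)


# Fields of the pattern 0x4049: sign 0, exponent 128, mantissa 0x49 (73).
print(decode_normal(0, 128, 0x49))  # 2 * (1 + 73/128) = 3.140625
```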

Subnormal Numbers

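value = (-1)^sign × 2^-126 × (0 + mantissa / 2⁷)

Subnormals have an exponent field of 0 and a nonzero mantissa. The implicit leading 1 is replaced by 0, and the exponent is fixed at -126.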
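The following sketch evaluates the subnormal formula at its two extremes and compares them with the smallest normal number. The name decode_subnormal is hypothetical.

```python
def decode_subnormal(sign: int, mantissa: int) -> float:
    """Evaluate (-1)^sign * 2^-126 * (0 + mantissa / 2^7)."""
    if not 1 <= mantissa <= 0x7F:
        raise ValueError("subnormals have a nonzero 7-bit mantissa")
    return (-1.0) ** sign * 2.0 ** -126 * (mantissa / 2 ** 7)


smallest_subnormal = decode_subnormal(0, 0x01)  # 2^-133
largest_subnormal = decode_subnormal(0, 0x7F)   # 127 * 2^-133
smallest_normal = 2.0 ** -126                   # exponent field 1, mantissa 0

print(smallest_subnormal)
print(largest_subnormal < smallest_normal)  # True
```

The gap between the largest subnormal and the smallest normal equals the smallest subnormal, so values are evenly spaced all the way down to zero.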

Special Values

Zero: Exponent = 0, Mantissa = 0. The sign bit distinguishes +0 from -0.
Infinity: Exponent = 255, Mantissa = 0 (same as FP32).
NaN: Exponent = 255, Mantissa ≠ 0.
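These rules, together with the normal and subnormal cases, can be sketched as a small classifier over raw bit patterns. The function name classify_bf16 and the category strings are hypothetical.

```python
def classify_bf16(bits: int) -> str:
    """Classify a 16-bit BF16 pattern by its exponent and mantissa fields."""
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "nan"
    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    return "normal"


for pattern in (0x0000, 0x8000, 0x0001, 0x3F80, 0x7F80, 0xFF80, 0x7FC0):
    print(f"0x{pattern:04X}: {classify_bf16(pattern)}")
```

The sign bit plays no part in the classification, so 0x8000 is still a zero and 0xFF80 is still an infinity.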

Where BF16 Is Used

Hardware acceleration: NVIDIA Ampere (A100) and later GPUs accelerate BF16 via HMMA Tensor Core instructions at the same throughput as FP16. ARM processors support BF16 natively through the Armv8.6 BFDOT/BFMMLA instructions, and Intel AMX provides BF16 tile multiply-accumulate.
LLM training: BF16 is widely used as the training precision for large language models (GPT, LLaMA, Gemma). Because its exponent range matches FP32, it usually avoids the loss scaling that FP16 mixed-precision training typically needs. PyTorch exposes it as torch.bfloat16.
Model interchange: The ONNX specification defines BFLOAT16 as a first-class tensor data type, enabling cross-framework model export and inference in BF16.
Kernel compilers: CUTLASS provides a bfloat16_t type for templated GPU matrix kernels. Triton maps tl.bfloat16 to MLIR backend targets, and the MLIR arith dialect supports bf16 as a builtin type.
NumPy ecosystem: The ml_dtypes library registers bfloat16 as a NumPy custom dtype for JAX and TensorFlow interop, since NumPy does not yet include a native BF16 type.