The signed byte: key to ML quantization and inference acceleration
An 8-bit integer occupies one byte, the smallest addressable unit of memory on most computer architectures. INT8 is the signed variant: it uses two's complement to cover the range -128 to 127. For the unsigned variant (0–255), see UINT8.
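As a small illustration of the two's complement encoding described above, the following Python sketch converts between a raw byte pattern and its signed INT8 value. The helper names `to_int8` and `to_byte` are hypothetical, chosen only for this example.

```python
def to_int8(byte_value: int) -> int:
    """Interpret an unsigned byte pattern (0..255) as a two's complement INT8."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("expected a value in 0..255")
    # Bit 7 is the sign bit: patterns 0x80..0xFF map to -128..-1.
    return byte_value - 256 if byte_value & 0x80 else byte_value


def to_byte(value: int) -> int:
    """Return the bit pattern (0..255) that stores a given INT8 value."""
    if not -128 <= value <= 127:
        raise ValueError("expected a value in -128..127")
    return value & 0xFF


# Show the two ends of the positive and negative ranges.
for pattern in (0x00, 0x7F, 0x80, 0xFF):
    print(f"{pattern:08b} -> unsigned {pattern:3d}, signed {to_int8(pattern):4d}")
```

Because there are only 256 bit patterns, looping over `range(256)` instead of the four samples enumerates the entire format.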
The 8-bit integer is one of the most basic data types in computing. A single byte can represent 2^8 = 256 distinct values, which is enough for ASCII characters, pixel color channels and, increasingly, quantized neural network parameters.
In machine learning, INT8 quantization has become one of the most important techniques for deploying models efficiently. Converting FP32 weights and activations to INT8 cuts storage from 4 bytes to 1 byte per value, a 4× memory reduction, and typically gives significantly faster inference on hardware with INT8 support (virtually all modern CPUs and GPUs).
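The 4× figure follows directly from the element sizes. The sketch below checks it with Python's standard `array` module; the parameter count of one million is a made-up value for illustration.

```python
from array import array

num_params = 1_000_000  # hypothetical parameter count, for illustration only

fp32_weights = array("f", [0.0]) * num_params  # 'f' = C float, 4 bytes per element
int8_weights = array("b", [0]) * num_params    # 'b' = signed char, 1 byte per element

fp32_bytes = fp32_weights.itemsize * len(fp32_weights)
int8_bytes = int8_weights.itemsize * len(int8_weights)

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes, "
      f"ratio: {fp32_bytes / int8_bytes:.0f}x")
```

A real quantized model also stores scales and zero points, so the overall saving is slightly below 4×.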
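ML quantization maps floating-point values to integers using a scale factor and an optional zero point: q = clamp(round(x / scale) + zero_point, -128, 127). Dequantization recovers an approximation of the original value: x ≈ scale × (q - zero_point). Symmetric schemes fix zero_point at 0.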
The scale and zero point are chosen to map the typical range of activations or weights onto the INT8 range. Rounding introduces a quantization error of at most half a scale step for values inside that range, and values outside it are clamped. For well-calibrated models the impact on accuracy is typically minimal.
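A minimal sketch of this mapping in plain Python is given below. It assumes the simplest calibration, taking the scale and zero point from the observed minimum and maximum; production toolchains often use percentiles or other statistics instead. The function names and the sample data are hypothetical.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point from the observed min/max (asymmetric)."""
    # Include 0.0 in the range so that real zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0  # all-zero input: any scale works
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """q = clamp(round(x / scale) + zero_point, qmin, qmax)."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in values]


def dequantize(q_values, scale, zero_point):
    """x_hat = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 3.0]  # made-up example data
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(x - r) for x, r in zip(weights, restored))

print("scale:", scale, "zero point:", zero_point)
print("quantized:", q)
print("max round-trip error:", max_error)
```

Widening the range to include 0.0 is a design choice in this sketch: it keeps values such as padding or ReLU outputs exactly representable after the round trip.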
.s8x4 packed types with MMA shapes (m16n8k16, m16n8k32) for INT8 matrix multiply, accumulating into INT32. H100 delivers 2000 INT8 TOPS (4000 with sparsity).
OCP_MXINT8Spec for MXINT8 quantization on ROCm.
torch.int8 for signed 8-bit quantized tensors. The ONNX specification defines INT8 = 3 as a core data type.
s8 types to MMA shapes with i32 accumulator. The AMDGPU dialect supports i8 × i8 → i32 signed dot-product (sdot4) operations and WMMA shapes on AMD GPUs.
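The entries above share one pattern: INT8 operands are multiplied and the products are accumulated in INT32. The following Python sketch emulates a 4-way signed dot product in that style to show why the wider accumulator is needed. It is an illustration of the arithmetic only, not the semantics of any particular instruction, and the function names are hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def dot4_i8_i32(a, b, acc=0):
    """Emulate a 4-way signed dot product: acc + sum(a[i] * b[i])."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-element vectors")
    for v in (*a, *b):
        if not INT8_MIN <= v <= INT8_MAX:
            raise ValueError(f"{v} does not fit in INT8")
    result = acc + sum(x * y for x, y in zip(a, b))
    # Overflow behavior on real hardware varies by instruction;
    # this sketch raises instead of wrapping or saturating.
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError("accumulator left the INT32 range")
    return result


def dot_i8(a, b):
    """Dot product of equal-length INT8 vectors, four elements at a time."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("vectors must have equal length, a multiple of 4")
    acc = 0
    for i in range(0, len(a), 4):
        acc = dot4_i8_i32(a[i:i + 4], b[i:i + 4], acc)
    return acc


a = [127, -128, 5, -7, 1, 2, 3, 4]
b = [127, -128, -3, 2, 4, 3, 2, 1]
print(dot_i8(a, b))  # 32504
```

A single product can reach (-128) × (-128) = 16384, so two such products already exceed the INT16 maximum of 32767, while an INT32 accumulator can absorb more than 100,000 worst-case products before overflowing.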