INT8 - 8-Bit Signed Integer | Integer Format Guide

Bit Layout

An 8-bit integer (one byte) is the fundamental addressable unit in most computer architectures. INT8 uses two's complement for the range -128 to 127. For the unsigned variant (0–255), see UINT8.

Overview

The 8-bit integer is perhaps the most fundamental data type in computing. A single byte can represent 256 distinct values, which is sufficient for ASCII characters, pixel color channels, and increasingly, quantized neural network parameters.

In machine learning, INT8 quantization has become one of the most important techniques for deploying models efficiently. By converting FP32 weights and activations to INT8, you get 4× memory reduction and significantly faster inference on hardware with INT8 support (virtually all modern CPUs and GPUs).

INT8 in ML Quantization

ML quantization maps floating-point values to integers using a scale factor and optional zero point:

quantized_value = round(float_value / scale) + zero_point

The scale and zero point are chosen to map the typical range of activations or weights to the INT8 range. This introduces quantization error but typically has minimal impact on model accuracy for well-calibrated models.

Why INT8 is so popular for inference INT8 matrix multiplications are 2-4× faster than FP32 on modern CPUs (via VNNI/AMX instructions) and GPUs. Combined with 4× memory savings, INT8 quantization often gives 2-4× end-to-end speedup with less than 1% accuracy loss on well-quantized models.

Range & Properties

Key Bit Patterns

Interactive Bit Visualizer

Click any of the 8 bits to flip them. With only 256 possible values, you can explore the entire format.

Decimal:

Hex:

Format Comparison

Where INT8 Is Used

NVIDIA Tensor Cores: The PTX ISA documents .s8x4 packed types with MMA shapes (m16n8k16, m16n8k32) for INT8 matrix multiply, accumulating into INT32. H100 delivers 2000 INT8 TOPS (4000 with sparsity).
Inference engines: INT8 is the most broadly supported quantization format, with hardware acceleration across TensorRT, ONNX Runtime, Core ML, and TFLite. cuDNN supports INT8x4/INT8x32 packed tensor formats for Tensor Core inference.
OCP Microscaling: The OCP MX Spec v1.0 defines MXINT8 (INT8 elements, 32-element blocks, E8M0 scale). AMD Quark provides OCP_MXINT8Spec for MXINT8 quantization on ROCm.
ML frameworks: PyTorch provides torch.int8 for signed 8-bit quantized tensors. The ONNX specification defines INT8 = 3 as a core data type.
Compiler support: The MLIR NVVM dialect maps s8 types to MMA shapes with i32 accumulator. The AMDGPU dialect supports i8 × i8 → i32 signed dot-product (sdot4) operations and WMMA shapes on AMD GPUs.