INT4 - 4-Bit Signed Integer | Integer Format Guide

Bit Layout

A 4-bit integer (also called a nibble) stores one of 16 possible values. For INT4, two's complement gives the range -8 to 7. For the unsigned variant (0–15), see UINT4.

Overview

4-bit integers are a sub-byte format, meaning two INT4 values can be packed into a single byte. While not a standard hardware data type on most processors, INT4 has become critically important in machine learning for extreme quantization of large language models (LLMs).

With only 16 possible values, INT4 provides 8× compression over FP32 and 2× over INT8. This makes it possible to run large models (7B, 13B, 70B parameters) on consumer GPUs and mobile devices that would otherwise require enterprise hardware.

INT4 in LLM Quantization

Techniques like GPTQ, AWQ, and GGML/GGUF Q4 quantize LLM weights to 4 bits. The typical approach:

Group weights into blocks (e.g., 32 or 128 values per block)
Compute a per-block scale factor (stored in higher precision)
Quantize each weight to the nearest INT4 value
At inference time, dequantize on-the-fly: weight = int4_value × scale

A nibble = half a byte A 4-bit value is traditionally called a "nibble" (or "nybble"). One byte contains exactly two nibbles. Each nibble corresponds to one hexadecimal digit (0–F), which is why hex is so convenient for displaying binary data.

Range & Properties

All Representable Values

With only 16 possible bit patterns, here is the complete set:

Interactive Bit Visualizer

With only 4 bits, you can click through every possible value. Try flipping the sign bit (leftmost) to see two's complement in action.

Decimal:

Hex:

Format Comparison

Where INT4 Is Used

NVIDIA Tensor Cores: The PTX ISA documents .s4 packed types for signed 4-bit MMA instructions (m16n8k32, m16n8k64 shapes on Turing/Ampere). A100 supports sparse INT4 Tensor Core acceleration for inference.
LLM weight quantization: INT4 is the element type in quantization schemes like GPTQ, AWQ, and GGUF Q4_x formats. These methods store 4-bit weights with group-level scale factors, enabling 7B–70B models to fit on consumer GPUs.
TensorRT inference: TensorRT lists INT4 as a supported quantized type for weight-only quantization.
Model interchange: The ONNX specification defines INT4 with a packing rule of two 4-bit elements per byte (first element in LSB). CUTLASS packs eight int4b_t elements into a single 32-bit GPU register.
Python libraries: The ml_dtypes library provides int4 as a narrow signed integer NumPy extension (4-bit, stored in a byte, with defined cast behavior).