TGI supports several quantization backends: AWQ, GPTQ, EETQ, BitsAndBytes, and FP8, plus Marlin kernels for fast 4-bit inference. The right choice depends on the model architecture, the VRAM you have, and how much quality degradation you can tolerate on your dedicated GPU server. Here is the decision tree.
Formats
AWQ
```shell
text-generation-launcher --model-id TheBloke/Llama-3-8B-AWQ --quantize awq
```
AWQ (Activation-aware Weight Quantization) is a 4-bit format with typically excellent quality retention and broad model support. It requires a pre-quantized checkpoint. Inference is fast via Marlin kernels on Ampere and newer GPUs.
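For containerized deployments, the same flag passes straight through the official TGI Docker image (the image tag, port mapping, and volume path here are illustrative; adjust for your server):

```shell
# Serve a pre-quantized AWQ checkpoint from the TGI container.
# Everything after the image name is forwarded to text-generation-launcher.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-3-8B-AWQ \
  --quantize awq
```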
GPTQ
```shell
text-generation-launcher --model-id TheBloke/Llama-3-8B-GPTQ --quantize gptq
```
GPTQ is 4-bit (sometimes 3 or 8). It shows slightly more quality degradation than AWQ on average, but historically a wider range of pre-quantized models, though AWQ is catching up. Uses ExLlama kernels for speed.
BitsAndBytes (bnb)
```shell
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize bitsandbytes
```
Quantizes an FP16 model on the fly at load time. Inference is slower than AWQ/GPTQ, but it works on any model without a pre-quantized checkpoint. Good for experimentation.
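The launcher also accepts 4-bit bitsandbytes variants; to the best of my knowledge these are spelled `bitsandbytes-nf4` and `bitsandbytes-fp4`, halving memory again relative to 8-bit at some additional quality cost:

```shell
# 4-bit NF4 on-the-fly quantization -- still no pre-quantized checkpoint needed.
# (Flag value assumed from TGI's launcher options; verify against your TGI version.)
text-generation-launcher \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize bitsandbytes-nf4
```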
FP8
```shell
text-generation-launcher --model-id neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --quantize fp8
```
FP8 needs a pre-quantized checkpoint and hardware support (Blackwell, Hopper, Ada with some caveats). Smaller than FP16, less quality loss than INT4. On RTX 5090 it is the sweet spot for 7-13B models.
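A rough weight-memory estimate shows why FP8 sits between FP16 and INT4: bytes per parameter halve at each step. A quick back-of-the-envelope sketch for an 8B-parameter model (weights only, ignoring KV cache and activation overhead):

```shell
# Approximate weight memory: params_in_billions * bytes_per_param.
params_b=8
fp16_gb=$(( params_b * 2 ))   # FP16: 2 bytes/param
fp8_gb=$(( params_b * 1 ))    # FP8:  1 byte/param
int4_gb=$(( params_b / 2 ))   # INT4: ~0.5 bytes/param
echo "FP16: ${fp16_gb} GB  FP8: ${fp8_gb} GB  INT4: ${int4_gb} GB"
# prints: FP16: 16 GB  FP8: 8 GB  INT4: 4 GB
```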
Decision
| If | Use |
|---|---|
| You have a Hopper/Ada/Blackwell GPU and an FP8 checkpoint exists | FP8 |
| You need 4-bit quality and speed | AWQ |
| Only GPTQ pre-quantized exists for your model | GPTQ |
| No pre-quantized available, fast experimentation | BitsAndBytes |
| Running on very old GPU without kernels | BitsAndBytes |
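The table above can be sketched as a small helper that picks a `--quantize` value from yes/no availability facts (the function and argument names are illustrative, not part of TGI):

```shell
# Pick a --quantize value from simple availability facts.
# Args: fp8_gpu fp8_ckpt awq_ckpt gptq_ckpt (each "yes" or "no").
pick_quantize() {
  fp8_gpu=$1    # GPU has FP8 support (Hopper/Ada/Blackwell)?
  fp8_ckpt=$2   # FP8 checkpoint published?
  awq_ckpt=$3   # AWQ checkpoint published?
  gptq_ckpt=$4  # GPTQ checkpoint published?
  if [ "$fp8_gpu" = yes ] && [ "$fp8_ckpt" = yes ]; then
    echo fp8
  elif [ "$awq_ckpt" = yes ]; then
    echo awq
  elif [ "$gptq_ckpt" = yes ]; then
    echo gptq
  else
    echo bitsandbytes   # on-the-fly fallback: works without any checkpoint
  fi
}

pick_quantize yes yes yes yes   # prints: fp8
pick_quantize no no yes no      # prints: awq
```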
TGI Configured With Your Quantization
We pre-test quantization paths on the target GPU before deployment.
Browse GPU Servers