
TGI Quantization Flags Deep Dive

TGI supports half a dozen quantization formats, each with its own flags, precision, and supported architectures. Here is a cheat sheet for each one.

TGI supports AWQ, GPTQ, EETQ, BitsAndBytes, FP8, and Marlin kernels. The right choice depends on the model architecture, the VRAM you have, and how much quality degradation you can tolerate on your dedicated GPU server. Here is the decision tree.
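TGI is most commonly run via its official Docker image, with launcher flags appended after the image name. A minimal sketch, assuming the standard image and default ports (the model ID and volume path are placeholders):

```shell
# Sketch: running TGI via the official Docker image with a quantization flag.
# Model ID and volume path are placeholders; --quantize takes the values
# covered below (awq, gptq, bitsandbytes, fp8, ...).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-3-8B-AWQ \
  --quantize awq
```

Everything after the image name is passed straight to `text-generation-launcher`, so the per-format commands below drop into the same pattern.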

Formats

AWQ

text-generation-launcher --model-id TheBloke/Llama-3-8B-AWQ --quantize awq

AWQ (Activation-aware Weight Quantization) is a 4-bit format with typically excellent quality retention and broad model support. It requires a pre-quantized checkpoint. Fast inference via Marlin kernels on Ampere and newer.

GPTQ

text-generation-launcher --model-id TheBloke/Llama-3-8B-GPTQ --quantize gptq

GPTQ is 4-bit (3- and 8-bit variants exist). On average it shows slightly more quality degradation than AWQ. Historically it had the wider availability of pre-quantized models, though AWQ is catching up. Uses Exllama kernels for speed.

BitsAndBytes (bnb)

text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize bitsandbytes

BitsAndBytes quantizes an FP16 model on-the-fly at load time. Inference is slower than AWQ/GPTQ, but it works on any model without a pre-quantized checkpoint. Good for experimentation.
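Assuming a recent TGI release, the launcher also exposes 4-bit on-the-fly BitsAndBytes variants (NF4/FP4), which trade a little more quality for roughly half the weight memory of 8-bit bnb:

```shell
# 4-bit on-the-fly bnb variants -- still no pre-quantized checkpoint needed.
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize bitsandbytes-nf4
# or: --quantize bitsandbytes-fp4
```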

FP8

text-generation-launcher --model-id neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --quantize fp8

FP8 needs a pre-quantized checkpoint and hardware support (Blackwell, Hopper, or Ada, with some caveats). It is half the size of FP16 with less quality loss than INT4. On RTX 5090 it is the sweet spot for 7-13B models.
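EETQ, the remaining format from the list in the intro, is 8-bit weight-only quantization applied on-the-fly at load time, so no pre-quantized checkpoint is needed; it is generally faster than 8-bit BitsAndBytes. A sketch, assuming a recent TGI release:

```shell
# EETQ: on-the-fly int8 weight-only quantization (no pre-quantized checkpoint).
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize eetq
```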

Decision

| If | Use |
|---|---|
| You have a Blackwell/Ada GPU and an FP8 checkpoint exists | FP8 |
| You need 4-bit quality and speed | AWQ |
| Only a GPTQ pre-quantized checkpoint exists for your model | GPTQ |
| No pre-quantized checkpoint available; fast experimentation | BitsAndBytes |
| Running on a very old GPU without quantization kernels | BitsAndBytes |
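The decision table above can be sketched as a small helper. The function and argument names are hypothetical; the logic just encodes the rows above (Hopper is included on the FP8 row per the hardware-support note):

```shell
# choose_quant: hypothetical helper encoding the decision table above.
# $1 = GPU architecture (blackwell/hopper/ada/ampere/pascal/...)
# $2 = best available pre-quantized checkpoint format (fp8/awq/gptq/none)
choose_quant() {
  local arch="$1" ckpt="$2"
  case "$arch:$ckpt" in
    blackwell:fp8|hopper:fp8|ada:fp8) echo "fp8" ;;
    *:awq)  echo "awq" ;;
    *:gptq) echo "gptq" ;;
    *)      echo "bitsandbytes" ;;  # no pre-quantized checkpoint, or old GPU
  esac
}

choose_quant ada fp8      # fp8
choose_quant ampere awq   # awq
choose_quant pascal none  # bitsandbytes
```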

TGI Configured With Your Quantization

We pre-test quantization paths on the target GPU before deployment.

Browse GPU Servers

See also: AWQ vs GPTQ vs GGUF vs EXL2, and TGI batch tuning.


