
TGI Quantization Flags Deep Dive

TGI supports half a dozen quantization formats, each with its own flags, precision, and supported architectures. Here is a cheat sheet for each one.

TGI supports AWQ, GPTQ, EETQ, BitsAndBytes, FP8, and Marlin kernels. The right choice depends on the model architecture, the VRAM you have, and how much quality degradation you can tolerate on your dedicated GPU server. Here is the decision tree.
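TGI is most commonly run via its official Docker image, with launcher flags appended after the image name. A minimal sketch, assuming the standard image and default ports (the model ID and volume path are placeholders):

```shell
# Sketch: running TGI via the official Docker image with a quantization flag.
# Model ID and volume path are placeholders; --quantize takes the values
# covered below (awq, gptq, bitsandbytes, fp8, ...).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-3-8B-AWQ \
  --quantize awq
```

Everything after the image name is passed straight to `text-generation-launcher`, so the per-format commands below drop into the same pattern.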

Formats

AWQ

text-generation-launcher --model-id TheBloke/Llama-3-8B-AWQ --quantize awq

AWQ (Activation-aware Weight Quantization) is a 4-bit format with typically excellent quality retention and broad model support. It requires a pre-quantized checkpoint. Fast inference via Marlin kernels on Ampere and newer.

GPTQ

text-generation-launcher --model-id TheBloke/Llama-3-8B-GPTQ --quantize gptq

GPTQ is 4-bit (3- and 8-bit variants exist). On average it shows slightly more quality degradation than AWQ. Historically it had the wider availability of pre-quantized models, though AWQ is catching up. Uses Exllama kernels for speed.

BitsAndBytes (bnb)

text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize bitsandbytes

BitsAndBytes quantizes an FP16 model on-the-fly at load time. Inference is slower than AWQ/GPTQ, but it works on any model without a pre-quantized checkpoint. Good for experimentation.
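Assuming a recent TGI release, the launcher also exposes 4-bit on-the-fly BitsAndBytes variants (NF4/FP4), which trade a little more quality for roughly half the weight memory of 8-bit bnb:

```shell
# 4-bit on-the-fly bnb variants -- still no pre-quantized checkpoint needed.
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize bitsandbytes-nf4
# or: --quantize bitsandbytes-fp4
```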

FP8

text-generation-launcher --model-id neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --quantize fp8

FP8 needs a pre-quantized checkpoint and hardware support (Blackwell, Hopper, or Ada, with some caveats). It is half the size of FP16 with less quality loss than INT4. On RTX 5090 it is the sweet spot for 7-13B models.
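EETQ, the remaining format from the list in the intro, is 8-bit weight-only quantization applied on-the-fly at load time, so no pre-quantized checkpoint is needed; it is generally faster than 8-bit BitsAndBytes. A sketch, assuming a recent TGI release:

```shell
# EETQ: on-the-fly int8 weight-only quantization (no pre-quantized checkpoint).
text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --quantize eetq
```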

Decision

| If | Use |
|---|---|
| You have a Blackwell/Ada GPU and an FP8 checkpoint exists | FP8 |
| You need 4-bit quality and speed | AWQ |
| Only a GPTQ pre-quantized checkpoint exists for your model | GPTQ |
| No pre-quantized checkpoint available; fast experimentation | BitsAndBytes |
| Running on a very old GPU without quantization kernels | BitsAndBytes |
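The decision table above can be sketched as a small helper. The function and argument names are hypothetical; the logic just encodes the rows above (Hopper is included on the FP8 row per the hardware-support note):

```shell
# choose_quant: hypothetical helper encoding the decision table above.
# $1 = GPU architecture (blackwell/hopper/ada/ampere/pascal/...)
# $2 = best available pre-quantized checkpoint format (fp8/awq/gptq/none)
choose_quant() {
  local arch="$1" ckpt="$2"
  case "$arch:$ckpt" in
    blackwell:fp8|hopper:fp8|ada:fp8) echo "fp8" ;;
    *:awq)  echo "awq" ;;
    *:gptq) echo "gptq" ;;
    *)      echo "bitsandbytes" ;;  # no pre-quantized checkpoint, or old GPU
  esac
}

choose_quant ada fp8      # fp8
choose_quant ampere awq   # awq
choose_quant pascal none  # bitsandbytes
```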

TGI Configured With Your Quantization

We pre-test quantization paths on the target GPU before deployment.

Browse GPU Servers

See also: AWQ vs GPTQ vs GGUF vs EXL2, and TGI batch tuning.


