GPTQ remains widely available for models that predate AWQ adoption. On the RTX 5060 Ti 16GB servers we host, GPTQ runs well via ExLlama kernels, though AWQ is usually the better choice in 2026.
What GPTQ Is
GPTQ is a 4-bit post-training quantisation method published in 2022. It was widely adopted before AWQ emerged, and it remains functional and supported by most inference engines.
vs AWQ
| Aspect | AWQ | GPTQ |
|---|---|---|
| Quality | Marginally better (0-2%) | Slightly worse on MMLU |
| Kernel | Marlin (fast) | ExLlama / Marlin-compatible |
| Speed | Slightly faster | Comparable on Blackwell |
| Checkpoint count | Growing | Very wide (legacy) |
In 2026, AWQ has overtaken GPTQ as the default quantised serving format for new models; GPTQ remains supported for backward compatibility.
Serving
```shell
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.3-GPTQ \
  --quantization gptq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
```
On Blackwell, vLLM uses Marlin kernels for both AWQ and GPTQ where possible, giving similar performance. For older GPTQ checkpoints that Marlin cannot handle (bits=3 or group_size=32), the ExLlama kernel is used as a fallback.
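Once the server is up, it speaks the OpenAI-compatible chat completions protocol. A minimal sketch of building a request body for `/v1/chat/completions` (the endpoint path is standard; the host, port, and sampling parameters here are assumptions to adjust for your deployment):

```python
import json

# Build a request body for vLLM's OpenAI-compatible chat completions
# endpoint. The model name must match the --model flag used at launch.
def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,  # illustrative default, tune per workload
    }

body = build_chat_request("TheBloke/Mistral-7B-Instruct-v0.3-GPTQ", "Hello")
payload = json.dumps(body)

# Send it with any HTTP client, e.g.:
#   POST http://localhost:8000/v1/chat/completions
#   Content-Type: application/json
```

Because the API surface is identical for GPTQ and AWQ checkpoints, clients do not need to change when you migrate formats later.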
When to Prefer GPTQ
- An AWQ checkpoint is not available for your specific model
- Your fine-tune or derivative was only published in GPTQ format
- A legacy deployment already has GPTQ in place
For new deployments: prefer FP8 > AWQ > GPTQ in that order.
GPTQ Ready on Blackwell
ExLlama + Marlin kernels on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: full format comparison, AWQ guide.