When a model's FP8 weights no longer fit comfortably on the RTX 5060 Ti 16GB, AWQ INT4 is the next step down on our hosting: strong quality retention, Marlin kernels for speed, and wide checkpoint availability.
Why AWQ
AWQ (Activation-aware Weight Quantization) is 4-bit with strong quality retention:
- Typically within 1-2% of FP16 quality on MMLU, HumanEval
- Faster than GPTQ on modern NVIDIA GPUs via Marlin kernels
- Widely supported – vLLM, TGI, SGLang all handle AWQ
- Pre-quantised checkpoints available for most popular models
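The VRAM savings behind those points come down to simple arithmetic. A rough sketch of the AWQ weight footprint — the 0.5-byte-per-weight figure and the per-group scale overhead are approximations for illustration, not exact checkpoint sizes:

```python
def awq_weight_gib(params_billions: float, bits: int = 4, group_size: int = 128) -> float:
    """Rough AWQ weight footprint in GiB (illustrative estimate, not exact)."""
    bytes_per_weight = bits / 8  # 0.5 bytes per weight for INT4
    # Each group of `group_size` weights also stores an FP16 scale (2 bytes)
    # and a packed 4-bit zero-point (~0.5 bytes) -- rough per-weight overhead.
    overhead_per_weight = 2.5 / group_size
    total_bytes = params_billions * 1e9 * (bytes_per_weight + overhead_per_weight)
    return total_bytes / 2**30

print(round(awq_weight_gib(14), 1))  # Qwen 14B: roughly 6-7 GiB of weights
```

At roughly 7 GiB of weights, a 14B model leaves ample room on a 16 GB card for KV cache and activations — the same model in FP16 would need ~28 GiB for weights alone.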
Checkpoints
Common production AWQ checkpoints:
- Qwen/Qwen2.5-14B-Instruct-AWQ
- Qwen/Qwen2.5-Coder-14B-Instruct-AWQ
- Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
- TheBloke/Mistral-7B-Instruct-v0.3-AWQ
- hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
- TheBloke/Llama-2-13B-Chat-AWQ (legacy)
Serving
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct-AWQ \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--dtype half
--dtype half explicitly asks for FP16 compute with INT4 weights. --dtype auto is also fine.
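Once the server is up it speaks the OpenAI-compatible chat-completions API. A minimal client sketch — the localhost port (vLLM's default 8000), prompt, and sampling parameters here are illustrative; send the payload with any HTTP client:

```python
import json

def chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the AWQ model."""
    return {
        "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

# POST to the vLLM server (default port 8000), e.g. with requests:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 json=chat_payload("Summarise AWQ in one sentence."))
print(json.dumps(chat_payload("hello"), indent=2))
```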
Marlin Kernels
Marlin is an optimised CUDA kernel for INT4 quantised matmul. On Blackwell it runs within 5-10% of FP16 tensor core throughput while using 4x less memory bandwidth. vLLM uses Marlin automatically for AWQ on Blackwell – no flag needed.
Throughput with Marlin:
- Mistral 7B AWQ batch 1: ~130 t/s
- Qwen 14B AWQ batch 1: ~44 t/s
- Llama 3 8B AWQ batch 16 aggregate: ~900 t/s
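Those single-stream figures translate directly into response latency. A quick sketch using the numbers above — it deliberately ignores prefill time and scheduler overhead, so treat it as a lower bound:

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time, ignoring prefill and scheduling overhead."""
    return output_tokens / tokens_per_second

# A 500-token answer from Qwen 14B AWQ at ~44 t/s:
print(round(decode_seconds(500, 44), 1))  # ~11.4 s
# The same answer from Mistral 7B AWQ at ~130 t/s:
print(round(decode_seconds(500, 130), 1))  # ~3.8 s
```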
AWQ vs FP8
| Criterion | FP8 | AWQ |
|---|---|---|
| Quality | Near-FP16 | Within 1-2% of FP16 |
| Weight size | 50% of FP16 | 25% of FP16 |
| Speed on Blackwell | Fastest (native) | Marlin, slightly slower |
| Checkpoint availability | Growing | Very wide |
Use FP8 when a quality-preserving FP8 checkpoint of the model exists. Use AWQ when one doesn't, or when the extra VRAM savings are needed to fit a larger model in 16 GB.
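That rule of thumb can be written as a rough sizing check. The ~1-byte (FP8) and ~0.5-byte (INT4) per-weight figures and the fixed KV-cache headroom are simplifying assumptions for illustration:

```python
def pick_quant(params_billions: float, vram_gb: float = 16.0,
               kv_headroom_gb: float = 3.0) -> str:
    """Rough format choice: prefer FP8 if it fits, else fall back to AWQ INT4."""
    fp8_gb = params_billions * 1.0   # ~1 byte per weight in FP8
    awq_gb = params_billions * 0.5   # ~0.5 bytes per weight in INT4
    if fp8_gb + kv_headroom_gb <= vram_gb:
        return "fp8"
    if awq_gb + kv_headroom_gb <= vram_gb:
        return "awq"
    return "does not fit"

print(pick_quant(7))   # fp8: 7 GB weights + headroom fits in 16 GB
print(pick_quant(14))  # awq: 14 GB FP8 weights would not leave room
```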
AWQ-Optimised Hosting
Blackwell 16GB with Marlin kernels for AWQ serving. UK dedicated.
Order the RTX 5060 Ti 16GB
See also: quantisation format comparison, GPTQ guide.