AWQ — Activation-aware Weight Quantization — is the technique that lifts a 24 GB consumer GPU into the heavyweight model class. With Marlin INT4 kernels, the Ada AD102 in a single RTX 4090 24GB dedicated server serves Qwen 2.5 14B at 135 t/s decode, Qwen 2.5 32B at 65 t/s, Mixtral 8x7B at 85 t/s and Llama 3.1 70B at 22-24 t/s, all from one card. This guide explains how AWQ works, why Marlin matters on Ada, when to choose AWQ over FP8, the exact launch flags per model, and the calibration pitfalls that bite teams six months in. For the wider hardware menu see dedicated GPU hosting.
Contents
- How AWQ works in 90 seconds
- Marlin kernels on Ada and why they matter
- AWQ versus FP8: the decision matrix
- Which models to AWQ on a 4090
- Per-model launch commands explained
- Calibration: when to roll your own
- Quality, evaluation and pitfalls
- Verdict and decision criteria
How AWQ works in 90 seconds
The original observation behind AWQ is that not all weights matter equally. In a transformer linear layer, a small fraction (typically around 1%) of weight channels carry the bulk of activation magnitude. Naively quantising those channels to INT4 alongside the rest produces large reconstruction error in exactly the dimensions that matter most. AWQ scales the salient channels up before quantisation and applies the inverse scale to the corresponding activations at runtime. The net effect: the salient channels effectively use more bits, the others use the standard INT4 group quantisation, and total quality loss collapses from catastrophic to barely measurable.
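To make the scaling trick concrete, here is a toy sketch in plain PyTorch: fake (quantise-then-dequantise) INT4 group quantisation of a random layer, with the per-channel scaling exponent grid-searched the way AWQ searches per layer. It illustrates the idea only; it is not the production AutoAWQ path, and every name and shape is invented for the example.
import torch

def fake_int4_groupwise(w, group_size=128):
    # Asymmetric INT4 group quantisation: quantise, then immediately dequantise.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    lo, hi = wg.amin(-1, keepdim=True), wg.amax(-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 15            # 4-bit range 0..15
    q = ((wg - lo) / scale).round().clamp(0, 15)
    return (q * scale + lo).reshape(out_f, in_f)

torch.manual_seed(0)
w = torch.randn(512, 512)                     # toy linear layer, [out, in]
chan_mag = torch.rand(512) ** 4 * 20 + 0.1    # a few input channels dominate activation magnitude
x = torch.randn(256, 512) * chan_mag          # stand-in calibration activations

def output_mse(w_deq):
    # The error that matters is in the layer's output, not in the raw weights.
    return (x @ (w - w_deq).T).pow(2).mean().item()

naive = output_mse(fake_int4_groupwise(w))    # quantise W directly

# AWQ-style: scale salient input channels up before quantising, fold the inverse
# scale back afterwards; grid-search the exponent alpha (alpha = 0 recovers naive).
best_mse, best_alpha = min(
    (output_mse(fake_int4_groupwise(w * chan_mag**a) / chan_mag**a), a)
    for a in [i / 20 for i in range(21)]
)
print(f"naive output MSE {naive:.4f} -> AWQ-style {best_mse:.4f} at alpha {best_alpha:.2f}")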
The standard configuration uses group size 128 — every 128 weights along the input dimension share a scale and zero-point pair stored in FP16. Per-parameter cost: 4 bits for the weight plus 32 bits / 128 = 0.25 bits for the shared scale and zero point, totalling 4.25 effective bits. A 14-billion-parameter model thus weighs 14e9 × 4.25 bits ÷ 8 ≈ 7.4 GB raw, plus embedding and output tables that often stay at higher precision. The publicly distributed checkpoints land at 10.2 GB for 14B AWQ, 19.1 GB for 32B AWQ, and 17.0 GB for 70B AWQ (the 70B is more aggressively packed). See Qwen 14B, Qwen 32B and Llama 70B INT4 for the per-model footprints.
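As a quick back-of-envelope restatement of that arithmetic:
# Effective bits per parameter at group size 128: a 4-bit weight plus a shared
# FP16 scale and zero point (16 + 16 = 32 bits) amortised over 128 weights.
bits_per_param = 4 + (16 + 16) / 128          # 4.25
raw_weight_gb = 14e9 * bits_per_param / 8 / 1e9
print(f"{bits_per_param} bits/param -> {raw_weight_gb:.1f} GB of packed weights for 14B")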
Marlin kernels on Ada and why they matter
AWQ is a storage format. The throughput depends entirely on the GEMM kernel that consumes it. vLLM ships two AWQ backends: the legacy awq kernel (FP16 dequant followed by FP16 GEMM) and the modern awq_marlin kernel (fused 4-bit dequant inside the matrix multiply, FP16 accumulation, FP16 output). Marlin is the right choice on Ada by a margin of 1.7x to 2.4x. The fused dequant means the INT4 weights stay packed in shared memory and the 4090’s 1,008 GB/s memory bandwidth is no longer wasted streaming dequantised FP16. On the 4090’s sm_89 silicon, Marlin saturates the 4th-gen tensor cores at FP16 accumulation and reaches throughputs that approach FP16 dense for memory-bound decode and exceed FP16 for batched prefill. The flag is --quantization awq_marlin, not awq.
Two operational points worth noting. First, Marlin JIT-compiles per shape on first use; expect an additional 30-90 seconds of cold-start time the first time a model loads on a fresh box. Subsequent boots reuse the kernel cache at ~/.cache/vllm. Second, Marlin requires CUDA 12.1 or later and PyTorch 2.3 or later; vLLM 0.6.x bundles compatible versions but if you build from source, mind the constraint.
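A quick way to confirm the Marlin path is being used, and to pay the one-off kernel-compile cost before the box takes traffic, is vLLM's offline Python API. A minimal sketch, assuming vLLM 0.6.x and the same settings as the launch commands later in this guide:
from vllm import LLM, SamplingParams

# Loads the AWQ checkpoint through the Marlin backend; the first run on a fresh
# box also warms the JIT kernel cache under ~/.cache/vllm.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    quantization="awq_marlin",
    kv_cache_dtype="fp8",
    max_model_len=32768,
    gpu_memory_utilization=0.92,
)
out = llm.generate(["Explain AWQ in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)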
AWQ versus FP8: the decision matrix
| Criterion | AWQ INT4 (Marlin) | FP8 (E4M3) |
|---|---|---|
| VRAM per parameter | ~0.5 byte | 1 byte |
| Decode speed (small model fits both) | Comparable | Slightly faster |
| Decode speed (large model) | Wins by fitting at all | Doesn’t fit |
| Quality vs FP16 | 0.1-0.4 pt perplexity rise | 0.02-0.05 pt perplexity rise |
| Math reasoning sensitivity | Higher (GSM8K -0.5 to -1.0) | Negligible |
| Calibration required | Yes (offline, ~128 samples) | No (load-time activation stats) |
| Pre-built checkpoints available | Yes for popular models | Yes via vLLM auto-quantise |
| Best fit on 24 GB | 14B, 32B, 70B | 7B, 8B, 13B |
Rule of thumb on a 24 GB 4090: use FP8 if the model is 13B or below and quality matters more than VRAM efficiency; use AWQ INT4 if the model is 14B or above and only AWQ fits. Pair both with FP8 KV cache (--kv-cache-dtype fp8) which halves cache memory at negligible quality cost. The full FP8 picture lives in the FP8 Llama deployment guide and the FP8 tensor cores on Ada piece.
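If you want that rule of thumb in a deployment script, a rough sketch follows; the thresholds come from the table above, and your own evals should have the final word:
def pick_weight_quant(params_b: float) -> str:
    # 13B and below: FP8 fits comfortably and is near-lossless.
    # Above 13B: AWQ INT4 via Marlin is what actually fits in 24 GB.
    return "fp8" if params_b <= 13 else "awq_marlin"

# Either way, pair with --kv-cache-dtype fp8 to halve KV cache memory.
assert pick_weight_quant(8) == "fp8"
assert pick_weight_quant(32) == "awq_marlin"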
Which models to AWQ on a 4090
| Model | Repo (AWQ) | VRAM | Decode t/s | Notes |
|---|---|---|---|---|
| Qwen 2.5 14B Instruct | Qwen/Qwen2.5-14B-Instruct-AWQ | 10.2 GB | 135 | The workhorse |
| Qwen 2.5 32B Instruct | Qwen/Qwen2.5-32B-Instruct-AWQ | 19.1 GB | 65 | Hard reasoning |
| Qwen 2.5 Coder 32B | Qwen/Qwen2.5-Coder-32B-Instruct-AWQ | 19.1 GB | 65 | HumanEval 92.7 |
| Mixtral 8x7B | casperhansen/mixtral-8x7b-awq | 20.5 GB | 85 | MoE, only 13B active |
| Llama 3.1 70B Instruct | hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 | 17.0 GB + KV | 22-24 | Tight; FP8 KV essential |
| Codestral 22B | solidrust/Codestral-22B-AWQ | 13.0 GB | 80 | Code-only, fill-in-middle |
| Qwen Coder 14B | Qwen/Qwen2.5-Coder-14B-Instruct-AWQ | 10.2 GB | 130 | Lower-cost code lane |
For benchmarks at depth see Qwen 14B benchmark, Qwen 32B benchmark, Mixtral benchmark and Llama 70B INT4 benchmark.
Per-model launch commands explained
Standard 14B AWQ launch:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct-AWQ \
--quantization awq_marlin \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 16 \
--enable-chunked-prefill \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--port 8000
Why each flag for 14B specifically. awq_marlin over plain awq for the 1.7-2.4x throughput uplift. --kv-cache-dtype fp8 halves cache so 16 sequences at 32k context fit. --max-model-len 32768 rather than the model’s full 128k because beyond 32k the KV budget collapses; if you genuinely need 128k context, drop concurrency to 4 and shorten elsewhere. --max-num-seqs 16 matches the KV budget at 32k. --gpu-memory-utilization 0.92 is conservative for 14B; you can push to 0.94 if no other CUDA process shares the card.
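Once the server is up, a quick request from any OpenAI-compatible client confirms end-to-end health. A minimal sketch with the openai Python package; the api_key value is arbitrary unless the server was started with --api-key:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",    # must match the --model value
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)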
32B AWQ is tighter: drop --max-num-seqs to 8 and --max-model-len to 16384 by default. 70B AWQ is tighter still and merits its own dedicated walkthrough at the 70B INT4 deployment page; the short version is --max-num-seqs 4 --max-model-len 16384 --gpu-memory-utilization 0.95 and skip chunked prefill. Mixtral 8x7B AWQ uses --max-num-seqs 8 and benefits from MoE’s only-13B-active runtime, giving it batch-1 throughput closer to a dense 13B than a dense 47B.
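For scripted deployments, those per-model adjustments condense to a small override table; each key maps one-to-one onto the CLI flag of the same name (or the matching vLLM Python kwarg), with everything else left as in the 14B launch:
# Differences from the 14B baseline, per the guidance above.
PER_MODEL_OVERRIDES = {
    "Qwen/Qwen2.5-32B-Instruct-AWQ": {
        "max_num_seqs": 8, "max_model_len": 16384,
    },
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4": {
        "max_num_seqs": 4, "max_model_len": 16384,
        "gpu_memory_utilization": 0.95, "enable_chunked_prefill": False,
    },
    "casperhansen/mixtral-8x7b-awq": {
        "max_num_seqs": 8,
    },
}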
Calibration: when to roll your own
Public AWQ checkpoints are calibrated against Wikipedia and C4 — broad, English-heavy, formal prose. For most general chat, support, coding and RAG workloads this is fine and quality matches reported benchmarks within noise. You should consider rolling your own calibration when:
- Your application is heavily non-English in domains the original calibration did not see (e.g. financial Mandarin, medical Spanish).
- Your prompts have unusual structural features the calibration set lacked (e.g. very long structured JSON, extensive code with rare languages).
- You see consistent degradation in your eval set on AWQ relative to FP16 of more than 2 percentage points on a metric that matters.
The recipe with AutoAWQ is short:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-14B-Instruct"
quant_path = "out/qwen-14b-awq-custom"

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 256 in-domain calibration samples, ~512 tokens each.
# load_my_in_domain_samples is a placeholder for your own loader; AutoAWQ's
# calib_data accepts a plain list of raw text strings.
calibration_data = load_my_in_domain_samples(n=256, max_len=512)

# Standard AWQ settings: asymmetric (zero-point) INT4, group size 128, GEMM-packed output.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Calibration takes 20-40 minutes on a 4090 for a 14B model and roughly 2-3 hours for 70B (the 70B run requires CPU offloading; mind your system RAM). Validate the calibrated checkpoint against the original on your eval set before promoting to production.
Quality, evaluation and pitfalls
Empirical quality losses across publicly distributed AWQ checkpoints, measured against the FP16 base on standard benchmarks:
| Model | MMLU delta | HumanEval delta | GSM8K delta |
|---|---|---|---|
| Qwen 2.5 14B AWQ | -0.12 | -0.18 | -0.6 |
| Qwen 2.5 32B AWQ | -0.21 | -0.30 | -0.7 |
| Llama 3.1 70B AWQ | -0.40 | -0.50 | -1.0 |
| Mixtral 8x7B AWQ | -0.18 | -0.25 | -0.8 |
Math benchmarks (GSM8K, MATH) are the most sensitive because chain-of-thought reasoning compounds small per-token errors. If math accuracy is load-bearing for your application, run your own GSM8K subset against both AWQ and FP16 before committing.
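One low-effort way to run that check is lm-evaluation-harness, which ships a vLLM backend. A sketch assuming the lm_eval[vllm] extra is installed; treat the model_args string and the 200-question limit as illustrative rather than canonical:
import lm_eval

# Score one checkpoint on a GSM8K subset. Run it once for the AWQ build and once for
# the FP16 base (the FP16 14B weighs ~28 GB, so that side needs a larger card).
def gsm8k_subset(model_name: str):
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=f"pretrained={model_name},gpu_memory_utilization=0.90,max_model_len=4096",
        tasks=["gsm8k"],
        num_fewshot=5,
        limit=200,    # a few hundred questions is enough to spot a one-point gap
    )
    return results["results"]["gsm8k"]

print(gsm8k_subset("Qwen/Qwen2.5-14B-Instruct-AWQ"))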
Pitfalls to know about. The legacy awq kernel is dramatically slower; never ship it. Group size 64 gains roughly 0.05 perplexity at +12% VRAM versus group size 128 — rarely worth it. Quantising your own fine-tunes requires merging the LoRA into the base weights before running AWQ; serving an AWQ base model with hot-swappable LoRAs at runtime works only through vLLM's specific LoRA-aware AWQ path (verify against your version's release notes). Public AWQ checkpoints are distributed under the original model's licence — Llama 3.1's community licence applies to the 70B AWQ download.
Verdict and decision criteria
AWQ INT4 with Marlin is the correct choice on a 4090 whenever the model exceeds 13B and the workload tolerates a small (sub-1 percentage point on most benchmarks, larger on math) quality cost. It is the only practical way to run 70B on one card and the most efficient way to run 32B. At 13B and below, FP8 is preferred for quality and convenience. Above 70B (e.g. Llama 3.1 405B, DeepSeek-V3) no quantisation fits the model on a single 4090; consider multi-card pairing or step up to a cloud H100.
The decision flow: choose AWQ if (model size > 13B) AND (FP16 does not fit) AND (math sensitivity is moderate). Choose FP8 if (model size <= 13B) OR (every quality fraction matters and you can fit FP8). Always pair with FP8 KV cache regardless of weight quantisation. For day-one verification use the vLLM setup tutorial and the first day checklist; for cost framing see monthly hosting cost and vs OpenAI API cost.
14B-70B inference on a single 24 GB card
AWQ INT4 with Marlin kernels makes the 4090 punch far above its weight class. UK dedicated hosting.
Order the RTX 4090 24GB
See also: FP8 deployment, 70B INT4 deployment, Qwen 14B, Qwen 32B, Mixtral 8x7B, Qwen Coder 32B, vLLM setup.