
AWQ INT4 Deep Dive on RTX 4090 24GB: Marlin Kernels, Calibration, and the 24GB Sweet Spot

Activation-aware INT4 quantisation with Marlin kernels turns the RTX 4090 24GB into a credible 14B-70B inference card; this walkthrough covers the mechanics, the kernels, the launch flags and the calibration pitfalls.

AWQ — Activation-aware Weight Quantization — is the technique that lifts a 24 GB consumer GPU into the heavyweight model class. With Marlin INT4 kernels, the Ada AD102 in a single RTX 4090 24GB dedicated server serves Qwen 2.5 14B at 135 t/s decode, Qwen 2.5 32B at 65 t/s, Mixtral 8x7B at 85 t/s and Llama 3.1 70B at 22-24 t/s, all from one card. This guide explains how AWQ works, why Marlin matters on Ada, when to choose AWQ over FP8, the exact launch flags per model, and the calibration pitfalls that bite teams six months in. For the wider hardware menu see dedicated GPU hosting.


How AWQ works in 90 seconds

The original observation behind AWQ is that not all weights matter equally. In a transformer linear layer, a small fraction (typically around 1%) of weight channels carry the bulk of activation magnitude. Naively quantising those channels to INT4 alongside the rest produces large reconstruction error in exactly the dimensions that matter most. AWQ scales the salient channels up before quantisation and applies the inverse scale to the corresponding activations at runtime. The net effect: the salient channels effectively use more bits, the others use the standard INT4 group quantisation, and total quality loss collapses from catastrophic to barely measurable.
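To make the scaling trick concrete, here is a toy pure-Python sketch (not the AutoAWQ implementation; sizes, the scale factor `s = 4` and the dominant channel are all illustrative) showing why scaling a salient channel up before group quantisation, then dividing its activation back down at runtime, cuts output error:

```python
import random

random.seed(0)

def quantize_rows(W, group=32):
    """Symmetric INT4: one scale per row per group of `group` input channels."""
    Wq = []
    for row in W:
        qrow = []
        for g0 in range(0, len(row), group):
            blk = row[g0:g0 + group]
            scale = max(abs(w) for w in blk) / 7        # levels -7..7
            qrow += [max(-8, min(7, round(w / scale))) * scale for w in blk]
        Wq.append(qrow)
    return Wq

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

out_f, in_f = 64, 64
W = [[random.gauss(0, 0.02) for _ in range(in_f)] for _ in range(out_f)]
x = [random.gauss(0, 1) for _ in range(in_f)]
x[0] = 100.0                   # channel 0 carries the bulk of activation magnitude

y_ref = matvec(W, x)
mean_abs = lambda a, b: sum(abs(u - v) for u, v in zip(a, b)) / len(a)

# naive INT4: quantise the weights as-is
err_naive = mean_abs(matvec(quantize_rows(W), x), y_ref)

# AWQ-style: scale the salient weight channel up before quantisation,
# apply the inverse scale to the activation at runtime
s = 4.0
Ws = [[w * s if j == 0 else w for j, w in enumerate(row)] for row in W]
xs = [v / s if j == 0 else v for j, v in enumerate(x)]
err_awq = mean_abs(matvec(quantize_rows(Ws), xs), y_ref)

print(f"mean abs output error  naive: {err_naive:.5f}  awq-style: {err_awq:.5f}")
```

The salient channel's quantisation error shrinks by roughly the scale factor, while the group's shared scale grows only modestly, so total output error drops.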

The standard configuration uses group size 128 — every 128 weights along the input dimension share a scale and zero-point pair stored at FP16. Per parameter cost: 4 bits for the weight plus 32 bits / 128 = 0.25 bits for the shared scale, totalling 4.25 effective bits. A 14 billion parameter model thus weighs 14e9 * 4.25 / 8 = 7.4 GB raw, plus embedding and output tables that often stay at higher precision. The publicly distributed checkpoints land at 10.2 GB for 14B AWQ, 19.1 GB for 32B AWQ, and 17.0 GB for 70B AWQ (the 70B is more aggressively packed). See Qwen 14B, Qwen 32B and Llama 70B INT4 for the per-model footprints.
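The per-parameter arithmetic is easy to sanity-check; `awq_weight_gib` is just an illustration of the formula above, not a real library function:

```python
def awq_weight_gib(n_params: float, w_bits: int = 4, group: int = 128,
                   meta_bits: int = 32) -> float:
    """Raw AWQ weight footprint: w_bits per weight plus an FP16 scale and
    zero-point pair (32 bits) shared by each `group` weights. Returns GB (1e9 bytes)."""
    eff_bits = w_bits + meta_bits / group      # 4 + 32/128 = 4.25 bits/param
    return n_params * eff_bits / 8 / 1e9

print(awq_weight_gib(14e9))   # 7.4375 -> the ~7.4 GB quoted above
print(awq_weight_gib(32e9))   # 17.0
```

The raw figure excludes embedding and output tables, which is why distributed checkpoints (10.2 GB for 14B, 19.1 GB for 32B) land above it.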

Marlin kernels on Ada and why they matter

AWQ is a storage format. The throughput depends entirely on the GEMM kernel that consumes it. vLLM ships two AWQ backends: the legacy awq kernel (FP16 dequant followed by FP16 GEMM) and the modern awq_marlin kernel (fused 4-bit dequant inside the matrix multiply, FP16 accumulation, FP16 output). Marlin is the right choice on Ada by a margin of 1.7x to 2.4x. The fused dequant means the INT4 weights stay packed in shared memory and the 4090’s 1,008 GB/s memory bandwidth is no longer wasted streaming dequantised FP16. On the 4090’s sm_89 silicon, Marlin saturates the 4th-gen tensor cores at FP16 accumulation and reaches throughputs that approach FP16 dense for memory-bound decode and exceed FP16 for batched prefill. The flag is --quantization awq_marlin, not awq.

Two operational points worth noting. First, Marlin JIT-compiles per shape on first use; expect an additional 30-90 seconds of cold-start time the first time a model loads on a fresh box. Subsequent boots reuse the kernel cache at ~/.cache/vllm. Second, Marlin requires CUDA 12.1 or later and PyTorch 2.3 or later; vLLM 0.6.x bundles compatible versions but if you build from source, mind the constraint.
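If you build from source, a preflight check along these lines catches the version constraints before a failed launch. This is a hedged sketch, not part of vLLM; `marlin_ready` and `meets_min` are hypothetical helpers:

```python
def meets_min(version: str, minimum: str) -> bool:
    """Compare the first two components of dotted version strings numerically."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:2])
    return parse(version) >= parse(minimum)

def marlin_ready() -> bool:
    """True if this box satisfies the Marlin constraints noted above."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    if torch.version.cuda is None or not meets_min(torch.version.cuda, "12.1"):
        return False                                     # needs CUDA 12.1+
    if not meets_min(torch.__version__.split("+")[0], "2.3"):
        return False                                     # needs PyTorch 2.3+
    major, minor = torch.cuda.get_device_capability(0)   # (8, 9) on Ada / sm_89
    return (major, minor) >= (8, 9)

print("Marlin-ready:", marlin_ready())
```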

AWQ versus FP8: the decision matrix

| Criterion | AWQ INT4 (Marlin) | FP8 (E4M3) |
| --- | --- | --- |
| VRAM per parameter | ~0.5 byte | 1 byte |
| Decode speed (small model, fits both) | Comparable | Slightly faster |
| Decode speed (large model) | Wins by fitting at all | Doesn't fit |
| Quality vs FP16 | 0.1-0.4 pt perplexity rise | 0.02-0.05 pt perplexity rise |
| Math reasoning sensitivity | Higher (GSM8K -0.5 to -1.0) | Negligible |
| Calibration required | Yes (offline, ~128 samples) | No (load-time activation stats) |
| Pre-built checkpoints available | Yes for popular models | Yes via vLLM auto-quantise |
| Best for on 24 GB | 14B, 32B, 70B | 7B, 8B, 13B |

Rule of thumb on a 24 GB 4090: use FP8 if the model is 13B or below and quality matters more than VRAM efficiency; use AWQ INT4 if the model is 14B or above and only AWQ fits. Pair both with FP8 KV cache (--kv-cache-dtype fp8) which halves cache memory at negligible quality cost. The full FP8 picture lives in the FP8 Llama deployment guide and the FP8 tensor cores on Ada piece.

Which models to AWQ on a 4090

| Model | Repo (AWQ) | VRAM | Decode t/s | Notes |
| --- | --- | --- | --- | --- |
| Qwen 2.5 14B Instruct | Qwen/Qwen2.5-14B-Instruct-AWQ | 10.2 GB | 135 | The workhorse |
| Qwen 2.5 32B Instruct | Qwen/Qwen2.5-32B-Instruct-AWQ | 19.1 GB | 65 | Hard reasoning |
| Qwen 2.5 Coder 32B | Qwen/Qwen2.5-Coder-32B-Instruct-AWQ | 19.1 GB | 65 | HumanEval 92.7 |
| Mixtral 8x7B | casperhansen/mixtral-8x7b-awq | 20.5 GB | 85 | MoE, only 13B active |
| Llama 3.1 70B Instruct | hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 | 17.0 GB + KV | 22-24 | Tight; FP8 KV essential |
| Codestral 22B | solidrust/Codestral-22B-AWQ | 13.0 GB | 80 | Code-only, fill-in-middle |
| Qwen Coder 14B | Qwen/Qwen2.5-Coder-14B-Instruct-AWQ | 10.2 GB | 130 | Lower-cost code lane |

For benchmarks at depth see Qwen 14B benchmark, Qwen 32B benchmark, Mixtral benchmark and Llama 70B INT4 benchmark.

Per-model launch commands explained

Standard 14B AWQ launch:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

Why each flag for 14B specifically. awq_marlin over plain awq for the 1.7-2.4x throughput uplift. --kv-cache-dtype fp8 halves cache so 16 sequences at 32k context fit. --max-model-len 32768 rather than the model’s full 128k because beyond 32k the KV budget collapses; if you genuinely need 128k context, drop concurrency to 4 and shorten elsewhere. --max-num-seqs 16 matches the KV budget at 32k. --gpu-memory-utilization 0.92 is conservative for 14B; you can push to 0.94 if no other CUDA process shares the card.

32B AWQ is tighter: drop --max-num-seqs to 8 and --max-model-len to 16384 by default. 70B AWQ is tighter still and merits its own dedicated walkthrough at the 70B INT4 deployment page; the short version is --max-num-seqs 4 --max-model-len 16384 --gpu-memory-utilization 0.95 and skip chunked prefill. Mixtral 8x7B AWQ uses --max-num-seqs 8 and benefits from MoE’s only-13B-active runtime, giving it batch-1 throughput closer to a dense 13B than a dense 47B.
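Applying those adjustments to the 14B command above gives a plausible 32B launch; treat it as a starting point and tune `--max-num-seqs` and `--max-model-len` to your own traffic:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000
```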

Calibration: when to roll your own

Public AWQ checkpoints are calibrated against Wikipedia and C4 — broad, English-heavy, formal prose. For most general chat, support, coding and RAG workloads this is fine and quality matches reported benchmarks within noise. You should consider rolling your own calibration when:

  • Your application is heavily non-English in domains the original calibration did not see (e.g. financial Mandarin, medical Spanish).
  • Your prompts have unusual structural features the calibration set lacked (e.g. very long structured JSON, extensive code with rare languages).
  • You see consistent degradation in your eval set on AWQ relative to FP16 of more than 2 percentage points on a metric that matters.

The recipe with AutoAWQ is short:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-14B-Instruct"
quant_path = "out/qwen-14b-awq-custom"

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 256 in-domain calibration samples, ~512 tokens each;
# load_my_in_domain_samples is your own loader returning a list of text strings
calibration_data = load_my_in_domain_samples(n=256, max_len=512)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Calibration takes 20-40 minutes on a 4090 for a 14B model and roughly 2-3 hours for 70B (the 70B run requires CPU offloading; mind your system RAM). Validate the calibrated checkpoint against the original on your eval set before promoting to production.
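That validation step can be a simple threshold gate. A minimal sketch, assuming you already have benchmark scores from whatever harness you use; the scores below are illustrative and the 2-point threshold echoes the degradation rule above:

```python
def should_promote(candidate_scores: dict, baseline_scores: dict,
                   max_regression_pts: float = 2.0) -> bool:
    """Promote the custom-calibrated checkpoint only if no metric in the
    baseline regresses by more than max_regression_pts percentage points."""
    return all(
        baseline_scores[m] - candidate_scores.get(m, float("-inf")) <= max_regression_pts
        for m in baseline_scores
    )

# illustrative numbers: public checkpoint vs your custom calibration
baseline = {"mmlu": 79.9, "gsm8k": 83.1}
custom   = {"mmlu": 79.6, "gsm8k": 82.8}
print(should_promote(custom, baseline))   # small regressions, within threshold
```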

Quality, evaluation and pitfalls

Empirical quality losses across publicly distributed AWQ checkpoints, measured against the FP16 base on standard benchmarks:

| Model | MMLU delta | HumanEval delta | GSM8K delta |
| --- | --- | --- | --- |
| Qwen 2.5 14B AWQ | -0.12 | -0.18 | -0.6 |
| Qwen 2.5 32B AWQ | -0.21 | -0.30 | -0.7 |
| Llama 3.1 70B AWQ | -0.40 | -0.50 | -1.0 |
| Mixtral 8x7B AWQ | -0.18 | -0.25 | -0.8 |

Math benchmarks (GSM8K, MATH) are the most sensitive because chain-of-thought reasoning compounds small per-token errors. If math accuracy is load-bearing for your application, run your own GSM8K subset against both AWQ and FP16 before committing.

Pitfalls to know about:

  • The legacy awq kernel is dramatically slower; never ship it.
  • Group size 64 buys roughly 0.05 perplexity at +12% VRAM versus group size 128; rarely worth it.
  • Quantising your own fine-tunes requires merging the LoRA before AWQ; serving an AWQ base model with hot-swappable LoRAs at runtime works only with vLLM's LoRA-aware AWQ path (verify against your version's release notes).
  • Public AWQ checkpoints inherit the original model's licence: Llama 3.1's community licence applies to the 70B AWQ download.

Verdict and decision criteria

AWQ INT4 with Marlin is the correct choice on a 4090 whenever the model exceeds 13B and the workload tolerates a small (sub-1 percentage point on most benchmarks, larger on math) quality cost. It is the only practical way to run 70B on one card and the most efficient way to run 32B. Below 13B, FP8 is preferred for quality and convenience. Above 70B (e.g. Llama 3.1 405B, DeepSeek-V3) you cannot single-card on a 4090 regardless of quantisation; consider multi-card pairing or step up to cloud H100.

The decision flow: choose AWQ if (model size > 13B) AND (FP16 does not fit) AND (math sensitivity is moderate). Choose FP8 if (model size <= 13B) OR (every quality fraction matters and you can fit FP8). Always pair with FP8 KV cache regardless of weight quantisation. For day-one verification use the vLLM setup tutorial and the first day checklist; for cost framing see monthly hosting cost and vs OpenAI API cost.
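The decision flow above, condensed into a sketch. `pick_quantization` is illustrative; the thresholds mirror this guide's rules of thumb, not vLLM behaviour:

```python
def pick_quantization(model_size_b: float, math_sensitive: bool = False) -> str:
    """Sizes in billions of parameters, targeting a single 24 GB RTX 4090."""
    if model_size_b > 70:
        return "multi-card or cloud H100"        # no single-4090 option
    if model_size_b <= 13:
        return "fp8"                             # fits in 24 GB, near-lossless
    if math_sensitive:
        return "awq_marlin + validate GSM8K"     # run your own GSM8K subset first
    return "awq_marlin"

for size in (8, 14, 32, 70, 405):
    print(f"{size}B -> {pick_quantization(size)}")
```

Whatever the branch, pair it with `--kv-cache-dtype fp8` as noted above.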


See also: FP8 deployment, 70B INT4 deployment, Qwen 14B, Qwen 32B, Mixtral 8x7B, Qwen Coder 32B, vLLM setup.
