
AWQ INT4 Deep Dive on RTX 4090 24GB: Marlin Kernels, Calibration, and the 24GB Sweet Spot

Activation-aware INT4 quantisation with Marlin kernels turns the RTX 4090 24GB into a credible 14B-70B inference card; this walkthrough covers the mechanics, the kernels, the launch flags and the calibration pitfalls.

AWQ — Activation-aware Weight Quantization — is the technique that lifts a 24 GB consumer GPU into the heavyweight model class. With Marlin INT4 kernels, the Ada AD102 in a single RTX 4090 24GB dedicated server serves Qwen 2.5 14B at 135 t/s decode, Qwen 2.5 32B at 65 t/s, Mixtral 8x7B at 85 t/s and Llama 3.1 70B at 22-24 t/s, all from one card. This guide explains how AWQ works, why Marlin matters on Ada, when to choose AWQ over FP8, the exact launch flags per model, and the calibration pitfalls that bite teams six months in. For the wider hardware menu see dedicated GPU hosting.


How AWQ works in 90 seconds

The original observation behind AWQ is that not all weights matter equally. In a transformer linear layer, a small fraction (typically around 1%) of weight channels carry the bulk of activation magnitude. Naively quantising those channels to INT4 alongside the rest produces large reconstruction error in exactly the dimensions that matter most. AWQ scales the salient channels up before quantisation and applies the inverse scale to the corresponding activations at runtime. The net effect: the salient channels effectively use more bits, the others use the standard INT4 group quantisation, and total quality loss collapses from catastrophic to barely measurable.
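To make the scaling trick concrete, here is a toy pure-Python sketch (not the AutoAWQ implementation; sizes, the scale factor `s = 4` and the dominant channel are all illustrative) showing why scaling a salient channel up before group quantisation, then dividing its activation back down at runtime, cuts output error:

```python
import random

random.seed(0)

def quantize_rows(W, group=32):
    """Symmetric INT4: one scale per row per group of `group` input channels."""
    Wq = []
    for row in W:
        qrow = []
        for g0 in range(0, len(row), group):
            blk = row[g0:g0 + group]
            scale = max(abs(w) for w in blk) / 7        # levels -7..7
            qrow += [max(-8, min(7, round(w / scale))) * scale for w in blk]
        Wq.append(qrow)
    return Wq

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

out_f, in_f = 64, 64
W = [[random.gauss(0, 0.02) for _ in range(in_f)] for _ in range(out_f)]
x = [random.gauss(0, 1) for _ in range(in_f)]
x[0] = 100.0                   # channel 0 carries the bulk of activation magnitude

y_ref = matvec(W, x)
mean_abs = lambda a, b: sum(abs(u - v) for u, v in zip(a, b)) / len(a)

# naive INT4: quantise the weights as-is
err_naive = mean_abs(matvec(quantize_rows(W), x), y_ref)

# AWQ-style: scale the salient weight channel up before quantisation,
# apply the inverse scale to the activation at runtime
s = 4.0
Ws = [[w * s if j == 0 else w for j, w in enumerate(row)] for row in W]
xs = [v / s if j == 0 else v for j, v in enumerate(x)]
err_awq = mean_abs(matvec(quantize_rows(Ws), xs), y_ref)

print(f"mean abs output error  naive: {err_naive:.5f}  awq-style: {err_awq:.5f}")
```

The salient channel's quantisation error shrinks by roughly the scale factor, while the group's shared scale grows only modestly, so total output error drops.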

The standard configuration uses group size 128 — every 128 weights along the input dimension share a scale and zero-point pair stored at FP16. Per parameter cost: 4 bits for the weight plus 32 bits / 128 = 0.25 bits for the shared scale, totalling 4.25 effective bits. A 14 billion parameter model thus weighs 14e9 * 4.25 / 8 = 7.4 GB raw, plus embedding and output tables that often stay at higher precision. The publicly distributed checkpoints land at 10.2 GB for 14B AWQ, 19.1 GB for 32B AWQ, and 17.0 GB for 70B AWQ (the 70B is more aggressively packed). See Qwen 14B, Qwen 32B and Llama 70B INT4 for the per-model footprints.
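The per-parameter arithmetic is easy to sanity-check; `awq_weight_gib` is just an illustration of the formula above, not a real library function:

```python
def awq_weight_gib(n_params: float, w_bits: int = 4, group: int = 128,
                   meta_bits: int = 32) -> float:
    """Raw AWQ weight footprint: w_bits per weight plus an FP16 scale and
    zero-point pair (32 bits) shared by each `group` weights. Returns GB (1e9 bytes)."""
    eff_bits = w_bits + meta_bits / group      # 4 + 32/128 = 4.25 bits/param
    return n_params * eff_bits / 8 / 1e9

print(awq_weight_gib(14e9))   # 7.4375 -> the ~7.4 GB quoted above
print(awq_weight_gib(32e9))   # 17.0
```

The raw figure excludes embedding and output tables, which is why distributed checkpoints (10.2 GB for 14B, 19.1 GB for 32B) land above it.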

Marlin kernels on Ada and why they matter

AWQ is a storage format. The throughput depends entirely on the GEMM kernel that consumes it. vLLM ships two AWQ backends: the legacy awq kernel (FP16 dequant followed by FP16 GEMM) and the modern awq_marlin kernel (fused 4-bit dequant inside the matrix multiply, FP16 accumulation, FP16 output). Marlin is the right choice on Ada by a margin of 1.7x to 2.4x. The fused dequant means the INT4 weights stay packed in shared memory and the 4090’s 1,008 GB/s memory bandwidth is no longer wasted streaming dequantised FP16. On the 4090’s sm_89 silicon, Marlin saturates the 4th-gen tensor cores at FP16 accumulation and reaches throughputs that approach FP16 dense for memory-bound decode and exceed FP16 for batched prefill. The flag is --quantization awq_marlin, not awq.

Two operational points worth noting. First, Marlin JIT-compiles per shape on first use; expect an additional 30-90 seconds of cold-start time the first time a model loads on a fresh box. Subsequent boots reuse the kernel cache at ~/.cache/vllm. Second, Marlin requires CUDA 12.1 or later and PyTorch 2.3 or later; vLLM 0.6.x bundles compatible versions but if you build from source, mind the constraint.
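If you build from source, a preflight check along these lines catches the version constraints before a failed launch. This is a hedged sketch, not part of vLLM; `marlin_ready` and `meets_min` are hypothetical helpers:

```python
def meets_min(version: str, minimum: str) -> bool:
    """Compare the first two components of dotted version strings numerically."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:2])
    return parse(version) >= parse(minimum)

def marlin_ready() -> bool:
    """True if this box satisfies the Marlin constraints noted above."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    if torch.version.cuda is None or not meets_min(torch.version.cuda, "12.1"):
        return False                                     # needs CUDA 12.1+
    if not meets_min(torch.__version__.split("+")[0], "2.3"):
        return False                                     # needs PyTorch 2.3+
    major, minor = torch.cuda.get_device_capability(0)   # (8, 9) on Ada / sm_89
    return (major, minor) >= (8, 9)

print("Marlin-ready:", marlin_ready())
```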

AWQ versus FP8: the decision matrix

| Criterion | AWQ INT4 (Marlin) | FP8 (E4M3) |
| --- | --- | --- |
| VRAM per parameter | ~0.5 byte | 1 byte |
| Decode speed (small model, fits both) | Comparable | Slightly faster |
| Decode speed (large model) | Wins by fitting at all | Doesn't fit |
| Quality vs FP16 | 0.1-0.4 pt perplexity rise | 0.02-0.05 pt perplexity rise |
| Math reasoning sensitivity | Higher (GSM8K -0.5 to -1.0) | Negligible |
| Calibration required | Yes (offline, ~128 samples) | No (load-time activation stats) |
| Pre-built checkpoints available | Yes for popular models | Yes via vLLM auto-quantise |
| Best for on 24 GB | 14B, 32B, 70B | 7B, 8B, 13B |

Rule of thumb on a 24 GB 4090: use FP8 if the model is 13B or below and quality matters more than VRAM efficiency; use AWQ INT4 if the model is 14B or above and only AWQ fits. Pair both with FP8 KV cache (--kv-cache-dtype fp8) which halves cache memory at negligible quality cost. The full FP8 picture lives in the FP8 Llama deployment guide and the FP8 tensor cores on Ada piece.

Which models to AWQ on a 4090

| Model | Repo (AWQ) | VRAM | Decode t/s | Notes |
| --- | --- | --- | --- | --- |
| Qwen 2.5 14B Instruct | Qwen/Qwen2.5-14B-Instruct-AWQ | 10.2 GB | 135 | The workhorse |
| Qwen 2.5 32B Instruct | Qwen/Qwen2.5-32B-Instruct-AWQ | 19.1 GB | 65 | Hard reasoning |
| Qwen 2.5 Coder 32B | Qwen/Qwen2.5-Coder-32B-Instruct-AWQ | 19.1 GB | 65 | HumanEval 92.7 |
| Mixtral 8x7B | casperhansen/mixtral-8x7b-awq | 20.5 GB | 85 | MoE, only 13B active |
| Llama 3.1 70B Instruct | hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 | 17.0 GB + KV | 22-24 | Tight; FP8 KV essential |
| Codestral 22B | solidrust/Codestral-22B-AWQ | 13.0 GB | 80 | Code-only, fill-in-middle |
| Qwen Coder 14B | Qwen/Qwen2.5-Coder-14B-Instruct-AWQ | 10.2 GB | 130 | Lower-cost code lane |

For benchmarks at depth see Qwen 14B benchmark, Qwen 32B benchmark, Mixtral benchmark and Llama 70B INT4 benchmark.

Per-model launch commands explained

Standard 14B AWQ launch:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

Why each flag for 14B specifically. awq_marlin over plain awq for the 1.7-2.4x throughput uplift. --kv-cache-dtype fp8 halves cache so 16 sequences at 32k context fit. --max-model-len 32768 rather than the model’s full 128k because beyond 32k the KV budget collapses; if you genuinely need 128k context, drop concurrency to 4 and shorten elsewhere. --max-num-seqs 16 matches the KV budget at 32k. --gpu-memory-utilization 0.92 is conservative for 14B; you can push to 0.94 if no other CUDA process shares the card.

32B AWQ is tighter: drop --max-num-seqs to 8 and --max-model-len to 16384 by default. 70B AWQ is tighter still and merits its own dedicated walkthrough at the 70B INT4 deployment page; the short version is --max-num-seqs 4 --max-model-len 16384 --gpu-memory-utilization 0.95 and skip chunked prefill. Mixtral 8x7B AWQ uses --max-num-seqs 8 and benefits from MoE’s only-13B-active runtime, giving it batch-1 throughput closer to a dense 13B than a dense 47B.
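Applying those adjustments to the 14B command above gives a plausible 32B launch; treat it as a starting point and tune `--max-num-seqs` and `--max-model-len` to your own traffic:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000
```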

Calibration: when to roll your own

Public AWQ checkpoints are calibrated against Wikipedia and C4 — broad, English-heavy, formal prose. For most general chat, support, coding and RAG workloads this is fine and quality matches reported benchmarks within noise. You should consider rolling your own calibration when:

  • Your application is heavily non-English in domains the original calibration did not see (e.g. financial Mandarin, medical Spanish).
  • Your prompts have unusual structural features the calibration set lacked (e.g. very long structured JSON, extensive code with rare languages).
  • You see consistent degradation in your eval set on AWQ relative to FP16 of more than 2 percentage points on a metric that matters.

The recipe with AutoAWQ is short:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-14B-Instruct"
quant_path = "out/qwen-14b-awq-custom"

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 256 in-domain calibration samples, ~512 tokens each;
# load_my_in_domain_samples is your own loader returning a list of text strings
calibration_data = load_my_in_domain_samples(n=256, max_len=512)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Calibration takes 20-40 minutes on a 4090 for a 14B model and roughly 2-3 hours for 70B (the 70B run requires CPU offloading; mind your system RAM). Validate the calibrated checkpoint against the original on your eval set before promoting to production.
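That validation step can be a simple threshold gate. A minimal sketch, assuming you already have benchmark scores from whatever harness you use; the scores below are illustrative and the 2-point threshold echoes the degradation rule above:

```python
def should_promote(candidate_scores: dict, baseline_scores: dict,
                   max_regression_pts: float = 2.0) -> bool:
    """Promote the custom-calibrated checkpoint only if no metric in the
    baseline regresses by more than max_regression_pts percentage points."""
    return all(
        baseline_scores[m] - candidate_scores.get(m, float("-inf")) <= max_regression_pts
        for m in baseline_scores
    )

# illustrative numbers: public checkpoint vs your custom calibration
baseline = {"mmlu": 79.9, "gsm8k": 83.1}
custom   = {"mmlu": 79.6, "gsm8k": 82.8}
print(should_promote(custom, baseline))   # small regressions, within threshold
```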

Quality, evaluation and pitfalls

Empirical quality losses across publicly distributed AWQ checkpoints, measured against the FP16 base on standard benchmarks:

| Model | MMLU delta | HumanEval delta | GSM8K delta |
| --- | --- | --- | --- |
| Qwen 2.5 14B AWQ | -0.12 | -0.18 | -0.6 |
| Qwen 2.5 32B AWQ | -0.21 | -0.30 | -0.7 |
| Llama 3.1 70B AWQ | -0.40 | -0.50 | -1.0 |
| Mixtral 8x7B AWQ | -0.18 | -0.25 | -0.8 |

Math benchmarks (GSM8K, MATH) are the most sensitive because chain-of-thought reasoning compounds small per-token errors. If math accuracy is load-bearing for your application, run your own GSM8K subset against both AWQ and FP16 before committing.

Pitfalls to know about:

  • The legacy awq kernel is dramatically slower; never ship it.
  • Group size 64 buys roughly 0.05 perplexity at +12% VRAM versus group size 128; rarely worth it.
  • Quantising your own fine-tunes requires merging the LoRA before AWQ; serving an AWQ base model with hot-swappable LoRAs at runtime works only with vLLM's LoRA-aware AWQ path (verify against your version's release notes).
  • Public AWQ checkpoints inherit the original model's licence: Llama 3.1's community licence applies to the 70B AWQ download.

Verdict and decision criteria

AWQ INT4 with Marlin is the correct choice on a 4090 whenever the model exceeds 13B and the workload tolerates a small (sub-1 percentage point on most benchmarks, larger on math) quality cost. It is the only practical way to run 70B on one card and the most efficient way to run 32B. Below 13B, FP8 is preferred for quality and convenience. Above 70B (e.g. Llama 3.1 405B, DeepSeek-V3) you cannot single-card on a 4090 regardless of quantisation; consider multi-card pairing or step up to cloud H100.

The decision flow: choose AWQ if (model size > 13B) AND (FP16 does not fit) AND (math sensitivity is moderate). Choose FP8 if (model size <= 13B) OR (every quality fraction matters and you can fit FP8). Always pair with FP8 KV cache regardless of weight quantisation. For day-one verification use the vLLM setup tutorial and the first day checklist; for cost framing see monthly hosting cost and vs OpenAI API cost.
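The decision flow above, condensed into a sketch. `pick_quantization` is illustrative; the thresholds mirror this guide's rules of thumb, not vLLM behaviour:

```python
def pick_quantization(model_size_b: float, math_sensitive: bool = False) -> str:
    """Sizes in billions of parameters, targeting a single 24 GB RTX 4090."""
    if model_size_b > 70:
        return "multi-card or cloud H100"        # no single-4090 option
    if model_size_b <= 13:
        return "fp8"                             # fits in 24 GB, near-lossless
    if math_sensitive:
        return "awq_marlin + validate GSM8K"     # run your own GSM8K subset first
    return "awq_marlin"

for size in (8, 14, 32, 70, 405):
    print(f"{size}B -> {pick_quantization(size)}")
```

Whatever the branch, pair it with `--kv-cache-dtype fp8` as noted above.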


See also: FP8 deployment, 70B INT4 deployment, Qwen 14B, Qwen 32B, Mixtral 8x7B, Qwen Coder 32B, vLLM setup.
