
Deploying Llama 3.1 70B AWQ INT4 on a Single RTX 4090 24GB: The Definitive Tutorial

Exhaustive memory math, max-model-len calculation, FP8 KV essentials, max-num-seqs trade-offs and a real chat session for Llama 3.1 70B AWQ INT4 on a single RTX 4090 24GB.

Running Llama 3.1 70B Instruct on a single 24 GB consumer GPU is the defining workload of the Ada generation. Not long ago, serving a model of this size meant a multi-GPU A100 80 GB node. Today, a single RTX 4090 24GB dedicated server using AWQ INT4 weights, Marlin kernels and FP8 KV cache delivers 22 to 24 decode tokens per second at batch 1, roughly 70 to 110 t/s aggregate at batch 4 to 8, and a 16,384-token usable context, all within the silicon’s 24,564 MiB of GDDR6X. The constraints are tight, the memory math is unforgiving, and the difference between a working deployment and an out-of-memory crash at the 4,000th token of decode comes down to four flags. This tutorial walks through every line. For the wider hardware menu see dedicated GPU hosting.


Why 70B fits at all on a 24 GB card

Llama 3.1 70B Instruct has 70.55 billion parameters across 80 transformer layers, 64 attention heads with 8 grouped-query KV heads and a head dimension of 128. At native FP16 the weights alone occupy 141.1 GB. At INT8 the figure is still 70.6 GB. Only when you push to INT4 with grouped quantisation does the model collapse below the 24 GB ceiling. AWQ (Activation-aware Weight Quantization) at group size 128 stores each weight as 4 bits plus a small per-group scale and zero-point, costing 4.25 effective bits per parameter. Multiply: 70.55 billion times 4.25 bits divided by 8 yields 37.5 GB raw. The publicly distributed hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 further packs the embedding and output layers, lands at 17.0 GB on disk and loads with the same footprint in VRAM under vLLM’s awq_marlin backend.

That leaves roughly 7 GB for everything else: CUDA context, vLLM workspace, scratch buffers and, most importantly, the KV cache. The KV cache is what dominates concurrent capacity. For an 80-layer model with 8 KV heads at 128 head_dim, each token requires 80 layers * 8 heads * 128 dim * 2 (K + V) * 2 bytes = 320 KB at FP16. Halve to 160 KB at FP8. Multiply by your concurrent token budget and you discover quickly that the 4090 is bandwidth-rich but cache-poor. The reason this hardware works for 70B at all is the convergence of three things: AWQ’s compression ratio, Marlin’s near-FP16 throughput on Ada’s 4th-gen tensor cores, and FP8 KV’s halving of the cache footprint. Drop any one and the build collapses. See the AWQ quantisation guide for the technique deep dive and the FP8 tensor cores on Ada piece for the silicon details.
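As a quick sanity check, the per-token figure follows directly from the model shape quoted above; the same arithmetic runs in a shell:

# Per-token KV cost: layers * kv_heads * head_dim * 2 (K and V) * bytes per element
echo "FP16 KV per token: $((80 * 8 * 128 * 2 * 2)) bytes"   # 327,680 bytes, ~320 KB
echo "FP8 KV per token:  $((80 * 8 * 128 * 2 * 1)) bytes"   # 163,840 bytes, ~160 KB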

Memory math, line by line

Let us reconstruct the budget exactly. The 4090 reports 24,564 MiB total, of which roughly 200 MiB is consumed by the kernel and display driver before any process attaches. Effective ceiling: 24.0 GB.

| Component | VRAM | How calculated |
| --- | --- | --- |
| AWQ INT4 weights (gs=128) | 17.0 GB | 70.55B params * 4.25 bits / 8, with embed packing |
| CUDA context + driver overhead | 0.6 GB | Roughly fixed across all vLLM workloads |
| vLLM workspace (Marlin scratch) | 0.4 GB | Dequant + accumulation buffers |
| Activations during decode | 0.5 GB | Decode-only; prefill spikes briefly |
| KV cache budget remaining | ~5.5 GB | 24.0 - 17.0 - 0.6 - 0.4 - 0.5 = 5.5 GB |
| Total under load | ~24.0 GB | 97-98% utilisation |

The 5.5 GB KV budget is the variable that determines everything downstream. KV per token at FP16 is 320 KB; at FP8 it is 160 KB. Convert the budget: 5.5 GB equals 5,632 MB equals 5,767,168 KB. Divide by 160 KB per token and you get 36,044 total tokens of KV at FP8, or 18,022 tokens at FP16. That total is split across all concurrent sequences. With --max-num-seqs 4 the per-sequence allocation at FP8 is 9,011 tokens; with batch 2 it is 18,022 tokens; at batch 1 you get the full 36k. vLLM’s PagedAttention does not strictly require uniform allocation, but the scheduler reserves blocks pessimistically to avoid mid-decode preemption.
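The same division, runnable as a shell check against the 5.5 GB budget from the table above:

# KV token budget: 5.5 GB = 5,767,168 KB, at 160 KB/token (FP8) or 320 KB/token (FP16)
BUDGET_KB=$((5632 * 1024))                                        # 5,767,168 KB
echo "Total KV tokens at FP8:  $((BUDGET_KB / 160))"              # 36,044
echo "Total KV tokens at FP16: $((BUDGET_KB / 320))"              # 18,022
echo "Per-sequence at --max-num-seqs 4: $((BUDGET_KB / 160 / 4))" # 9,011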

| max-num-seqs | Per-seq KV at FP8 | Practical max-model-len | Use case |
| --- | --- | --- | --- |
| 1 | 36,000 tok | 32,768 | Single long-context job |
| 2 | 18,000 tok | 16,384 | Two long sessions |
| 4 | 9,000 tok | 8,192-16,384* | Recommended default |
| 8 | 4,500 tok | 4,096 | Burst absorption only |

* At --max-model-len 16384 with batch 4, vLLM relies on the fact that not every sequence will be at max context simultaneously; the scheduler swaps blocks. Steady-state-full sessions need batch 2.

Max-model-len and max-num-seqs trade-offs

This is the single most consequential decision in the deployment. Three operating points are worth considering:

Profile A: long-context single user. --max-model-len 32768 --max-num-seqs 1. One concurrent session at full 32k context, ideal for a research workstation summarising a long PDF or running an evaluation against a 30k-token context. Throughput in steady-state decode: 22-24 t/s. Time-to-first-token at 30k context: roughly 9.8 seconds.

Profile B: balanced (recommended default). --max-model-len 16384 --max-num-seqs 4. Four concurrent users at up to 16k tokens each. Aggregate throughput: ~70 t/s. Time-to-first-token at 8k context: ~3.2 seconds. This is the right choice for an internal team API or a small SaaS hard-question fallback. See the 70B use-case page for fit guidance.

Profile C: burst-absorbing. --max-model-len 8192 --max-num-seqs 8. Eight short-context sessions, optimal for a moderation or routing tier where 70B is invoked rarely. Aggregate: ~110 t/s peaking. Each call is short so KV pressure is transient.

For everything else, profile B is correct. The tempting fourth profile, --max-num-seqs 16 --max-model-len 4096, looks attractive on paper but in practice triggers KV thrash under any non-uniform context length distribution and produces highly variable p99 latency. Avoid it unless you have measured your traffic shape carefully and know it is uniform.
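For quick reference, the three profiles differ only in two flags; drop the relevant pair into the launch command shown in the next section:

# Profile A: long-context single user
--max-model-len 32768 --max-num-seqs 1
# Profile B: balanced, recommended default
--max-model-len 16384 --max-num-seqs 4
# Profile C: burst-absorbing, short contexts
--max-model-len 8192 --max-num-seqs 8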

Why FP8 KV cache is essential, not optional

Several tutorials online suggest FP8 KV cache is a “tuning knob”. On 70B AWQ on a 24 GB card, it is not a knob. It is the difference between fitting and not fitting. Run the math without it: at FP16 KV, the 5.5 GB budget yields 18,022 total tokens. With --max-num-seqs 4 that is 4,505 per sequence — well below the model’s effective 8k context floor for useful long-form reasoning. You would have to drop concurrency to 1 just to get 16k context per session. Worse, vLLM’s block allocator wants headroom for swapping; at 18k total, you are constantly evicting and re-prefilling, and p99 decode latency spikes by 3-5x.

FP8 KV halves the per-token cost. The quality cost across summarisation, math reasoning and long-context retrieval benchmarks is consistently under 0.2 ROUGE-L points and under 0.05 perplexity. Llama Guard, MMLU and HumanEval scores are statistically indistinguishable from FP16 KV in our internal evals. The same holds for safety-critical workloads: if your application can run on FP16-KV Llama, it can run on FP8-KV Llama. The flag is --kv-cache-dtype fp8 and the underlying format is E4M3 on Ada (sm_89). The startup log will confirm with the line KV cache dtype: fp8_e4m3; if you see auto falling back to fp16, your driver is below 550 or your vLLM build does not have the FP8 KV path compiled.

The launch command, every flag explained

Assuming the vLLM setup tutorial is complete (driver 550+, CUDA 12.4, vLLM 0.6.3 in a Python 3.11 venv) and you have accepted the Llama 3.1 community licence on Hugging Face:

source ~/vllm-env/bin/activate
export HF_TOKEN=hf_yourtoken
export HF_HUB_ENABLE_HF_TRANSFER=1

python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 8000

Line by line:

  • HF_HUB_ENABLE_HF_TRANSFER=1 activates the multipart Rust downloader; on a 1 Gbps link it pulls the 17 GB checkpoint in roughly 2.5 minutes versus 9 minutes with the Python downloader.
  • --model points at the canonical AWQ-INT4 release maintained by hugging-quants. Do not use random forks; several have miscalibrated scales.
  • --quantization awq_marlin selects vLLM’s high-throughput Marlin GEMM kernels, which on Ada’s sm_89 deliver 1.7 to 2.4x the throughput of the legacy awq kernel; forget this flag and you leave roughly half the speed on the table.
  • --kv-cache-dtype fp8 is non-negotiable, as discussed above.
  • --max-model-len 16384 bounds the per-sequence context; the model itself supports 131k, but you cannot fit that in 24 GB.
  • --max-num-seqs 4 caps continuous batching at four concurrent sequences, matching the KV budget.
  • --enable-prefix-caching stores prefill blocks indexed by token-prefix hash, so repeated system prompts and shared context (e.g. RAG headers) skip re-prefill on a hit.
  • --gpu-memory-utilization 0.95 is intentionally aggressive because we know exactly what fits; lower it to 0.92 if other CUDA processes contend for memory.
  • We deliberately omit --enable-chunked-prefill: at batch 4 the chunking overhead exceeds the latency-smoothing benefit. Turn it back on only if you serve a heavy mix of very short and very long prompts.
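To keep the server running across reboots and to match the journalctl -u vllm commands used below, wrap the command in a systemd unit. This is a minimal sketch rather than official vLLM tooling; the llm user and the /home/llm/vllm-env path are placeholders for the user and virtual environment created in the setup tutorial.

sudo tee /etc/systemd/system/vllm.service > /dev/null <<'EOF'
[Unit]
Description=vLLM OpenAI-compatible server (Llama 3.1 70B AWQ INT4)
After=network-online.target

[Service]
User=llm
Environment=HF_TOKEN=hf_yourtoken
Environment=HF_HUB_ENABLE_HF_TRANSFER=1
ExecStart=/home/llm/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 8000
Restart=on-failure
RestartSec=10
# First boot downloads the checkpoint and JIT-compiles the Marlin kernels; allow for it
TimeoutStartSec=1200

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now vllm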

Post-deploy verification and monitoring hooks

First boot takes 8 to 15 minutes (download plus AWQ kernel JIT compilation). Watch the journal: sudo journalctl -u vllm -f. Look for the lines confirming quantization=awq_marlin and KV cache dtype: fp8_e4m3. Subsequent boots reuse the on-disk cache and start in 60-90 seconds.

Sanity request:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
       "messages":[{"role":"user","content":"Explain dropout in two sentences."}],
       "max_tokens":160, "temperature":0.2}' | jq .

Expected wall time: ~7.5 seconds (1.6 s prefill on a sub-1k context plus ~5.8 s of decode at 23 t/s for ~135 output tokens). Token usage will appear in the response. If you receive a 500 with “out of memory”, lower --gpu-memory-utilization to 0.93 and restart; if it persists, another CUDA process is holding VRAM, and nvidia-smi will reveal the offender.
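The wall time above is for a non-streaming call. To get a feel for interactive latency (time-to-first-token plus per-token cadence), stream the same request; this uses the standard OpenAI-compatible stream parameter rather than anything vLLM-specific:

# -N disables curl buffering so server-sent chunks print as tokens decode
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
       "messages":[{"role":"user","content":"Explain dropout in two sentences."}],
       "max_tokens":160, "temperature":0.2, "stream": true}'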

| Signal | Command | Expected steady value |
| --- | --- | --- |
| VRAM idle (after warm-up) | nvidia-smi | ~23,400 MiB |
| VRAM under load | nvidia-smi | ~24,000 MiB (97-98%) |
| Power, decode | nvidia-smi | 360-400 W |
| Temperature, decode steady | nvidia-smi | 74-80 degrees C |
| Decode t/s, batch 1 | vllm bench latency | 22-24 |
| Decode aggregate t/s, batch 4 | vllm bench throughput | ~70 |
| vllm:gpu_cache_usage_perc | /metrics | 40-85% during traffic |
Prometheus monitoring hooks: alert on vllm:gpu_cache_usage_perc above 92% sustained for more than 60 seconds (KV thrash imminent), vllm:num_requests_waiting > 2 sustained (capacity bottleneck), and vllm:time_per_output_token_seconds p95 above 0.050 s (decode below 20 t/s, which suggests power or thermal throttling). For the nvidia-smi-derived metrics, GPU utilisation should sit at 95-99% during decode bursts; if it stays under 80%, you are CPU- or network-bound. See thermal performance and power draw and efficiency for the underlying envelopes.
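A sketch of those three alerts as Prometheus alerting rules, assuming Prometheus already scrapes the vLLM /metrics endpoint. The rules file path is a placeholder, and the metric names and scales (in particular whether the cache-usage gauge reports a 0-1 fraction or a 0-100 percentage) should be checked against /metrics on your running vLLM build, since they have shifted between versions:

sudo tee /etc/prometheus/rules/vllm-llama70b.yml > /dev/null <<'EOF'
groups:
  - name: vllm-llama70b
    rules:
      - alert: VllmKvCacheNearFull
        # the gauge is a 0-1 fraction on recent builds; 0.92 = 92%
        expr: vllm:gpu_cache_usage_perc > 0.92
        for: 1m
        annotations:
          summary: "KV cache above 92% for 60s - thrash imminent"
      - alert: VllmRequestsQueueing
        expr: vllm:num_requests_waiting > 2
        for: 1m
        annotations:
          summary: "Requests queueing - capacity bottleneck"
      - alert: VllmSlowDecode
        # p95 time per output token above 50 ms means decode below 20 t/s
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m]))) > 0.050
        for: 5m
        annotations:
          summary: "p95 time-per-output-token above 50 ms"
EOF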

A real chat session, decoded

Worked example: a senior engineer asks the model to refactor a Python function with a 600-token system prompt and a 1,200-token code context. The conversation runs four turns, each with 200-300 token replies. Here is what actually happens in the deployment.

Turn 1: prefill of 1,800 tokens at roughly 850 prefill tokens per second on AWQ-Marlin works out to 2.1 seconds time-to-first-token. The model streams 250 tokens of decode at 23 t/s, completing in 10.9 seconds. Total: 13.0 s wall.

Turn 2: thanks to --enable-prefix-caching, the 1,800-token prefix from turn 1 plus the assistant’s 250-token reply (now 2,050 tokens of cached prefix) is reused. The new user turn adds 80 tokens; only those 80 tokens go through fresh prefill. TTFT: 0.18 s. The 230-token reply takes 10.0 s. Total: 10.2 s — roughly two seconds of prefill wall time saved entirely by prefix caching, a gain that compounds across multi-turn chat. See the prefill/decode benchmark for the underlying numbers.

Turn 3: the conversation has grown to ~2,400 tokens of context. Prefix cache hit on the first 2,330 tokens, fresh prefill on the new 70 tokens. TTFT: 0.16 s. The model writes a 280-token explanation in 12.2 s. Total: 12.4 s.

Turn 4: at this point the cumulative context is 3,140 tokens. Prefix cache still warm. TTFT: 0.20 s, decode 11.0 s. Total: 11.2 s. Across four turns the user experienced an interactive feel from turn 2 onwards, with a noticeable but tolerable delay on turn 1. The pattern matters because real chat is rarely cold-start; once a session is established, prefix caching pulls the perceived experience meaningfully closer to a 14B-class model. For pure cold-start first-token latency, a 14B AWQ deployment (see Qwen 14B) remains the better choice; 70B’s value is in answer quality on hard questions, not interactive snappiness.
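The prefix-cache hits above rely on the client resending the full, unmodified history each turn, which is exactly what the OpenAI chat format does. A sketch of turn 2 as a request; the angle-bracket message bodies are placeholders for the system prompt, code context and turn-1 exchange described above:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
       "messages":[
         {"role":"system","content":"<600-token refactoring guidelines>"},
         {"role":"user","content":"<1,200-token code context plus the turn-1 question>"},
         {"role":"assistant","content":"<turn-1 reply, ~250 tokens>"},
         {"role":"user","content":"Now extract the retry logic into a decorator."}],
       "max_tokens":300, "temperature":0.2}' | jq .usage

Because the first three messages are byte-identical to turn 1, their prefill blocks are served from the cache; only the final user message goes through fresh prefill.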

| Batch | Decode t/s aggregate | Per-stream t/s | Notes |
| --- | --- | --- | --- |
| 1 | 22-24 | 22-24 | Memory-bandwidth limited |
| 2 | ~40 | ~20 | Slight per-stream cost |
| 4 | ~70 | ~17.5 | Recommended steady state |
| 8 | ~110 | ~13.7 | KV-pressure ceiling |
| 16 | memory-capped | n/a | Will OOM at 16k context |

Production gotchas and verdict

  • FP8 KV requires driver 550+. Earlier drivers silently fall back to FP16 KV; the model loads, but you OOM around the 4,000th token of decode. Always check the startup log for fp8_e4m3.
  • The Marlin kernel JIT-compiles on first launch. Add 90 seconds to your first health check timeout. Subsequent boots use the cache at ~/.cache/vllm.
  • Prefix caching plus per-tenant data is a leakage surface. If you serve multiple tenants from one endpoint, namespace cache keys at the gateway. See multi-tenant SaaS.
  • Thermal throttling reduces throughput. At 30 degrees C ambient with poor airflow, the 4090 down-clocks at ~83 degrees C and decode drops from 23 to 18 t/s. Cap at 400 W via nvidia-smi -pl 400 for steadier latency at a 3-4% throughput cost.
  • Do not enable chunked prefill at batch 4. The scheduling overhead exceeds the latency benefit in this profile. Re-enable it only if your prompt-length distribution is heavily bimodal.
  • The 17 GB checkpoint download counts against your monthly bandwidth allowance. Snapshot it to local NVMe and reuse it on subsequent boots; the model rarely changes (see the sketch after this list).
  • vLLM 0.5.x and earlier did not support FP8 KV with AWQ on Ada. Pin to 0.6.3 or later. Newer vLLM may have changed flag names; check vllm --help against the running version.
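For the NVMe snapshot point above, the simplest approach is to put the Hugging Face cache itself on local NVMe before the first launch. HF_HOME is the standard environment variable for this; the /nvme/hf-cache path is just a placeholder:

# Keep the Hugging Face cache (including the 17 GB AWQ checkpoint) on local NVMe
sudo mkdir -p /nvme/hf-cache && sudo chown $USER /nvme/hf-cache
export HF_HOME=/nvme/hf-cache
# Persist the export in your shell profile or the systemd unit so every boot reuses the snapshot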

Verdict. The single 4090 70B AWQ INT4 deployment is the correct hardware choice when you need genuine 70B-class reasoning quality on a fixed-cost in-house box, you can tolerate 22-24 t/s and ~4 concurrent users, and your contexts stay under 16k tokens. It is the right pick for a hard-question fallback model behind a faster 8B or 14B frontline (see customer support), for an internal research assistant (see research lab), or for an agent backend’s harder planning tier (see agent backend). It is the wrong pick when you need 32k+ context routinely (consider 5090 32GB or cloud H100), when you need more than ~6 million decoded tokens per day (scale to two 4090s or a single H100), or when interactive snappiness for a high-concurrency frontend matters more than reasoning depth (use Llama 3 8B FP8 instead). Cost comparison against API alternatives is on the 70B monthly cost page and vs OpenAI API cost.

Run Llama 3.1 70B on a single 24 GB card

AWQ INT4 plus FP8 KV at 22-24 t/s decode, 4 concurrent sessions, 16k context. UK dedicated hosting.

Order the RTX 4090 24GB

See also: AWQ deep dive, 70B INT4 benchmark, Llama 70B INT4 VRAM requirements, FP8 Llama deployment, vLLM setup, spec breakdown, FP8 tensor cores on Ada.
