RTX 4090 24GB Llama 3.1 70B AWQ INT4 Benchmark: 70B on Consumer Silicon

Comprehensive RTX 4090 24GB Llama 3.1 70B AWQ INT4 benchmark - 23 t/s decode, 16k workable context, 110 t/s aggregate at b=8 with memory pressure. Full per-batch tables, prefill sweep, quality retention and the operational config that holds at the VRAM ceiling.

This is the headline benchmark: Llama 3.1 70B Instruct AWQ INT4 on a single RTX 4090 24GB from Gigagpu UK hosting. The model fits – just – at roughly 17 GB of marlin-packed weights plus FP8 KV cache, leaving room for a 16k-token context. Decode lands at 23 t/s, fast enough for interactive chat and small-batch agent workloads on a single consumer card. This benchmark walks through the memory math, the decode and prefill rate envelope, the concurrency ceiling (which is unforgiving), and the operational config that actually holds up at the 24 GB VRAM edge. We also include a cross-card comparison so you understand exactly what you give up by going single-4090 versus a 5090, H100, or two-card tensor parallel deployment.

Methodology and test rig

All measurements were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition at stock 450 W TDP, Ryzen 9 7950X with 64 GB DDR5-5600, and a Samsung 990 Pro 2TB Gen 4 NVMe with the model files on a dedicated dataset; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4, PyTorch 2.5, FlashAttention 2.6. The model under test is hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4, the calibrated INT4 build packed for marlin kernels.

Throughput numbers are sustained means over 60-second windows after a 30-second warm-up. Concurrency was driven by a custom asyncio harness wrapped around vLLM’s benchmark_throughput.py, with prompts drawn from ShareGPT v3, MT-Bench and HumanEval-style coding tasks. Tail latencies are p50 and p99 over 200 samples per batch (a smaller sample size than usual, because per-stream decode is slow and we wanted comparable wall-clock test budgets). VRAM headroom was tracked via nvidia-smi dmon -s m at 1 Hz to validate the safe-config claims. See our vLLM setup guide for installation steps and our Llama 70B INT4 deployment guide for the full operational walkthrough.
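
For reference, here is a minimal sketch of the concurrency-driving approach. It is not the exact harness behind the tables; it assumes vLLM's OpenAI-compatible server is listening on localhost:8000 and uses httpx, with warm-up and the 60-second sustained windows omitted for brevity.

# sweep.py - rough sketch of an asyncio concurrency driver against a running
# vLLM OpenAI-compatible server (assumed at http://localhost:8000).
import asyncio
import time
import httpx

MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
URL = "http://localhost:8000/v1/completions"

async def one_stream(client: httpx.AsyncClient, prompt: str, max_tokens: int) -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = await client.post(URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def sweep(batch: int, prompt: str, max_tokens: int = 200) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        # Launch `batch` concurrent streams and wait for all of them.
        counts = await asyncio.gather(*[
            one_stream(client, prompt, max_tokens) for _ in range(batch)
        ])
        elapsed = time.perf_counter() - start
        total = sum(counts)
        # Note: elapsed includes TTFT, so this slightly understates steady-state decode.
        print(f"batch={batch}  aggregate={total / elapsed:.1f} t/s  "
              f"per-user={total / elapsed / batch:.1f} t/s")

if __name__ == "__main__":
    asyncio.run(sweep(batch=4, prompt="Explain KV caching in two paragraphs."))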

Memory map at the VRAM ceiling

Component | Size | Notes
AWQ INT4 weights (70B, marlin) | 17.0 GB | Calibrated build
FP8 KV @ 4k context, b=1 | 0.6 GB | Cheapest config
FP8 KV @ 8k context, b=1 | 1.3 GB | Comfortable
FP8 KV @ 16k context, b=1 | 2.5 GB | Workable ceiling
FP8 KV @ 16k context, b=2 | 5.0 GB | OOM with weights + activations
Activations + CUDA scratch | 1.6 GB | Constant overhead
Reserved by vLLM scheduler | 0.5 GB | Block manager state
Peak at 16k, b=1 | ~21.6 GB | Stable
Headroom for spikes | ~2.4 GB | Tight but workable
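
The KV figures in the table follow directly from the published Llama 3.1 70B architecture (80 layers, grouped-query attention with 8 KV heads, head dim 128) at one byte per element for FP8. A quick back-of-envelope check:

# Back-of-envelope KV cache sizing for Llama 3.1 70B with FP8 KV.
# Architecture constants are the published Llama 3.1 70B config.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 1  # FP8; use 2 for FP16 KV

def kv_bytes(context_tokens: int, batch: int = 1) -> int:
    # K and V, per layer, per token, per sequence.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * context_tokens * batch

for ctx in (4096, 8192, 16384):
    print(f"{ctx:>6} tokens, b=1: {kv_bytes(ctx) / 2**30:.2f} GiB")
print(f" 16384 tokens, b=2: {kv_bytes(16384, batch=2) / 2**30:.2f} GiB")
# -> roughly 0.6, 1.3, 2.5 and 5.0 GiB, matching the table above.
# Setting BYTES_PER_ELEM = 2 reproduces the 5 GB FP16-KV figure in the gotchas.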

Decode throughput and concurrency

Aggregate batch sweep

Batch | Aggregate t/s | Per-user t/s | VRAM used | Stable?
1 | 23 | 23 | 21.6 GB | Yes
2 | 41 | 20 | 22.8 GB | Yes (4k context)
4 | 72 | 18 | 23.5 GB | Tight, no spare
8 | 110 | 14 | 23.8 GB | Memory pressure – watch for OOM
16 | OOM | – | – | No

Per-context-length decode at batch 1

Context | Decode t/s | VRAM at peak | Notes
2k | 24.5 | 20.8 GB | Comfortable
4k | 24.2 | 21.0 GB | Comfortable
8k | 23.4 | 21.3 GB | Workable
12k | 22.7 | 21.5 GB | Workable
16k | 22.1 | 21.6 GB | Practical ceiling
32k | OOM | – | Activations push past 24 GB

p50/p99 latency at AWQ INT4, varying context

Batch | Context | Per-stream t/s | p50 latency (200 tok) | p99 latency | p50 TTFT (1k prompt) | p99 TTFT
1 | 16k | 22.1 | 9.0 s | 9.6 s | 880 ms | 1,020 ms
1 | 8k | 23.4 | 8.5 s | 9.2 s | 880 ms | 1,020 ms
1 | 4k | 24.2 | 8.3 s | 9.0 s | 880 ms | 1,020 ms
2 | 4k | 20 | 10.0 s | 11.4 s | 1,020 ms | 1,360 ms
4 | 4k | 18 | 11.1 s | 13.2 s | 1,180 ms | 1,720 ms
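
One useful property for capacity planning: with TTFT broken out in its own columns, the p50 generation latency is essentially the 200-token output divided by the per-stream decode rate. A trivial check against the table:

# p50 generation latency follows from per-stream decode rate
# (TTFT is reported separately in the table above).
for tps in (22.1, 23.4, 24.2, 20.0, 18.0):
    print(f"{tps:>5.1f} t/s -> {200 / tps:.1f} s for 200 tokens")
# -> 9.0, 8.5, 8.3, 10.0, 11.1 s, matching the p50 column.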

TTFT and prefill rate

Prefill is compute-bound on Llama 70B INT4: roughly 70 billion multiply-accumulates per prompt token (one per weight), dispatched across the 4090’s 16,384 CUDA cores and fourth-generation Tensor Cores.

Prompt tokens | TTFT | Prefill rate | VRAM at end of prefill
256 | 290 ms | 880 t/s | 20.7 GB
1,024 | 880 ms | 1,160 t/s | 20.9 GB
2,048 | 1,720 ms | 1,190 t/s | 21.0 GB
4,096 | 3,400 ms | 1,200 t/s | 21.3 GB
8,192 | 7,150 ms | 1,150 t/s | 21.5 GB
16,384 | 14,600 ms | 1,120 t/s | 21.6 GB
32,768 | OOM | – | –
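
These prefill rates combine with the decode figures above into a simple end-to-end latency budget. A rough estimator; the two constants are sustained batch-1 values measured on this rig, not general-purpose figures, so expect a few percent of error at the extremes:

# Rough end-to-end latency estimator for this rig: TTFT from the prefill rate,
# generation from the per-stream decode rate. Constants are measured above.
PREFILL_TPS = 1150.0   # ~1.1-1.2k t/s for prompts over ~1k tokens
DECODE_TPS = 22.0      # batch 1, long context

def latency_estimate(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    ttft = prompt_tokens / PREFILL_TPS
    total = ttft + output_tokens / DECODE_TPS
    return ttft, total

for prompt, output in ((1_024, 200), (8_192, 400), (16_384, 800)):
    ttft, total = latency_estimate(prompt, output)
    print(f"{prompt:>6} in / {output:>3} out: TTFT ~{ttft:.1f} s, total ~{total:.1f} s")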

Quality retention

Benchmark | FP16 reference | AWQ INT4 (calibrated) | Drop
MMLU | 83.6 | 82.7 | 1.1%
HumanEval | 80.5 | 78.7 | 2.2%
GSM8K | 95.1 | 94.4 | 0.7%
MT-Bench | 9.0 | 8.9 | 1.1%
BBH (Big-Bench Hard) | 83.1 | 82.0 | 1.3%
MATH | 57.8 | 56.4 | 2.4%

AWQ INT4 quality retention on Llama 3.1 70B is genuinely strong – the 1-2% drops are well within run-to-run variance for chat use cases, and the retained quality decisively beats Qwen 2.5 32B on most reasoning tasks.
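
If you want to reproduce the retention numbers on your own build, lm-evaluation-harness can drive the same AWQ checkpoint through vLLM. This is a sketch assuming lm-eval 0.4+ with its vLLM backend installed; the engine arguments mirror the serving flags used later in this post, but exact model_args names depend on your installed versions, so treat it as a starting point rather than a verified recipe.

# Sketch: re-running a subset of the retention benchmarks via
# lm-evaluation-harness's vLLM backend. Verify argument names against your
# installed lm_eval version before relying on this.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4,"
        "quantization=awq_marlin,kv_cache_dtype=fp8_e4m3,"
        "max_model_len=4096,gpu_memory_utilization=0.95"
    ),
    tasks=["gsm8k", "mmlu"],
    batch_size=1,
)
for task, metrics in results["results"].items():
    print(task, metrics)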

Cross-card comparison

Setup | VRAM | Decode b=1 t/s | Max context | Concurrent users | Aggregate b=8 t/s
2x RTX 4090 24GB (TP) | 48 GB | 38 | 32k | 4-8 | 180
RTX 5090 32GB | 32 GB | 42 | 32k | 2-4 | 140
H100 80GB | 80 GB | 72 | 32k+ | 16+ | 420
RTX 4090 24GB | 24 GB | 23 | 16k | 1-2 | 110 (memory pressure)
A100 80GB AWQ | 80 GB | 32 | 32k+ | 8+ | 180
RTX 3090 24GB AWQ | 24 GB | 16 (FP16 KV) | 4k | 1 | OOM at 4

The 4090 is the smallest and cheapest path to single-card 70B deployment. For higher throughput see the two-card tensor parallel option, the 5090 32GB upgrade, or the H100. See 4090 vs 5090, 4090 vs H100, and Llama 70B INT4 VRAM requirements for the wider picture.

Production gotchas

  1. Always disable CUDA graphs. --enforce-eager is mandatory at this VRAM ceiling. The ~0.6 t/s cost is worth the stability gain – CUDA graph capture reserves extra VRAM and intermittently triggers OOM on long-running workers.
  2. FP8 KV is non-negotiable. FP16 KV at 16k consumes 5 GB; the model OOMs immediately.
  3. Treat one 4090 as one 70B user. Batch 2 works for short contexts; anything longer or any multi-user load needs two cards or a 5090/H100.
  4. Pre-warm extensively. First request after worker start can have 3+ second TTFT vs steady-state 880ms. Probe with a 1k-token prompt before opening to traffic.
  5. Long prefill blocks the scheduler hard. A 16k-token prompt holds the GPU for 14.6 seconds; chunked prefill is mandatory.
  6. Watch nvidia-smi for thermal throttling. Sustained 70B inference at 95% GPU utilisation pushes the card to 78-82°C; monitor and ensure airflow (a minimal watchdog sketch follows this list).
  7. Two-card TP needs PCIe Gen4 x16/x16. Tensor parallel splits attention across cards over PCIe (no NVLink on 4090); slot configuration matters for sustained throughput. See PCIe Gen4 x16 notes.
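
The watchdog mentioned in gotcha 6 can be as simple as polling nvidia-smi once per second. A minimal sketch for a single-GPU node; the warning thresholds are assumptions tuned to this rig, not universal values:

# Minimal VRAM/temperature watchdog for a single-4090 70B worker.
# Polls nvidia-smi at 1 Hz and warns when headroom or temperature approaches
# the limits discussed above. Assumes exactly one GPU on the node.
import subprocess
import time

VRAM_WARN_MIB = 23_500   # ~0.5 GB of headroom left on a 24 GB card
TEMP_WARN_C = 83         # above the 78-82°C band seen under sustained load

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,temperature.gpu",
    "--format=csv,noheader,nounits",
]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    used_mib, temp_c = [int(x) for x in out.strip().split(", ")]
    if used_mib >= VRAM_WARN_MIB:
        print(f"WARN: {used_mib} MiB used - OOM risk at the 24 GB ceiling")
    if temp_c >= TEMP_WARN_C:
        print(f"WARN: {temp_c} C - check airflow / thermal throttling")
    time.sleep(1)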

Verdict and recommended config

Working config for single-user 70B at 16k context:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-chunked-prefill
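
Once the worker is up, a quick smoke test against the OpenAI-compatible endpoint confirms the config took. vLLM serves on port 8000 by default, and since no API key is configured in the command above, any placeholder string works:

# Smoke test against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Summarise AWQ INT4 quantization in two sentences."}],
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].message.content)
print(resp.usage)

If this first probe takes several seconds to start streaming rather than the steady-state ~880 ms, that is the cold-start effect from gotcha 4, not a misconfiguration.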

For two concurrent short-context users, swap in --max-num-seqs 2 --max-model-len 4096. For multi-user serving at 70B class, the 4090 is the wrong tool – either go to two cards with tensor parallel for ~2x throughput, or to a single 5090 32GB or H100 80GB. The 4090 path is right when you need 70B-class quality on a single-tenant UK box at consumer pricing for one-to-two users; outside that envelope, look elsewhere. See the Llama 70B model guide for use-case discussion and the deployment walkthrough for the full ops detail.

70B on a single consumer card

22-24 t/s decode, 16k context, FP8 KV, AWQ-Marlin INT4. UK dedicated hosting with full root access.

Order the RTX 4090 24GB

See also: Llama 70B model guide, deployment guide, VRAM requirements, AWQ guide, 4090 vs 5090, vLLM setup, Llama 70B monthly cost, Qwen 32B benchmark, Mixtral vs Llama 70B.
