Home / Blog / Benchmarks / RTX 4090 24GB Llama 3.1 70B AWQ INT4 Benchmark: 70B on Consumer Silicon

Benchmarks

RTX 4090 24GB Llama 3.1 70B AWQ INT4 Benchmark: 70B on Consumer Silicon

Comprehensive RTX 4090 24GB Llama 3.1 70B AWQ INT4 benchmark - 23 t/s decode, 16k workable context, 110 t/s aggregate at b=8 with memory pressure. Full per-batch tables, prefill sweep, quality retention and the operational config that holds at the VRAM ceiling.

Benchmarks May 4, 2026 4 min read gigagpu

This is the headline benchmark: Llama 3.1 70B Instruct AWQ INT4 on a single RTX 4090 24GB from Gigagpu UK hosting. The model fits – just – at roughly 17 GB of marlin-packed weights plus FP8 KV cache, leaving room for a 16k-token context. Decode lands at 23 t/s, fast enough for interactive chat and small-batch agent workloads on a single consumer card. This benchmark walks through the memory math, the decode and prefill rate envelope, the concurrency ceiling (which is unforgiving), and the operational config that actually holds up at the 24 GB VRAM edge. We also include a cross-card comparison so you understand exactly what you give up by going single-4090 versus a 5090, H100, or two-card tensor parallel deployment.

Methodology and test rig

All measurements were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition at stock 450 W TDP, Ryzen 9 7950X with 64 GB DDR5-5600, Samsung 990 Pro 2TB Gen 4 NVMe with the model files mounted on a dedicated dataset; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4, PyTorch 2.5, FlashAttention 2.6. The model under test is hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4, the calibrated INT4 build packed for marlin kernels. Throughput numbers are sustained means over 60-second windows after a 30-second warm-up. Concurrency was driven by a custom asyncio harness wrapped around vLLM’s benchmark_throughput.py with prompts drawn from ShareGPT v3, MT-Bench and HumanEval-style coding tasks. Tail latencies are p50 and p99 over 200 samples per batch (smaller sample size because per-stream decode is slow and we wanted comparable wall-clock test budgets). VRAM headroom was tracked via nvidia-smi dmon -s u at 1 Hz to validate the safe-config claims. See our vLLM setup guide for installation steps and our Llama 70B INT4 deployment guide for the full operational walkthrough.

Memory map at the VRAM ceiling

Component	Size	Notes
AWQ INT4 weights (70B, marlin)	17.0 GB	Calibrated build
FP8 KV @ 4k context, b=1	0.6 GB	Cheapest config
FP8 KV @ 8k context, b=1	1.3 GB	Comfortable
FP8 KV @ 16k context, b=1	2.5 GB	Workable ceiling
FP8 KV @ 16k context, b=2	5.0 GB	OOM with weights + activations
Activations + CUDA scratch	1.6 GB	Constant overhead
Reserved by vLLM scheduler	0.5 GB	Block manager state
Peak at 16k, b=1	~21.6 GB	Stable
Headroom for spikes	~2.4 GB	Tight but workable

Decode throughput and concurrency

Aggregate batch sweep

Batch	Aggregate t/s	Per-user t/s	VRAM used	Stable?
1	23	23	21.6 GB	Yes
2	41	20	22.8 GB	Yes (4k context)
4	72	18	23.5 GB	Tight, no spare
8	110	14	23.8 GB	Memory pressure – watch for OOM
16	OOM	—	—	No

Per-context-length decode at batch 1

Context	Decode t/s	VRAM at peak	Notes
2k	24.5	20.8 GB	Comfortable
4k	24.2	21.0 GB	Comfortable
8k	23.4	21.3 GB	Workable
12k	22.7	21.5 GB	Workable
16k	22.1	21.6 GB	Practical ceiling
32k	OOM	—	Activations push past 24 GB

p50/p99 latency at AWQ INT4, varying context

Batch	Context	Per-stream t/s	p50 latency (200 tok)	p99 latency	p50 TTFT (1k)	p99 TTFT
1	16k	22.1	9.0 s	9.6 s	880 ms	1,020 ms
1	8k	23.4	8.5 s	9.2 s	880 ms	1,020 ms
1	4k	24.2	8.3 s	9.0 s	880 ms	1,020 ms
2	4k	20	10.0 s	11.4 s	1,020 ms	1,360 ms
4	4k	18	11.1 s	13.2 s	1,180 ms	1,720 ms

TTFT and prefill rate

Prefill is compute-bound on Llama 70B INT4: 70 billion INT4 multiply-accumulates per token of prompt, dispatched across the 4090’s 16,384 CUDA cores and 4th-gen tensor pipeline.

Prompt tokens	TTFT	Prefill rate	VRAM at end of prefill
256	290 ms	880 t/s	20.7 GB
1,024	880 ms	1,160 t/s	20.9 GB
2,048	1,720 ms	1,190 t/s	21.0 GB
4,096	3,400 ms	1,200 t/s	21.3 GB
8,192	7,150 ms	1,150 t/s	21.5 GB
16,384	14,600 ms	1,120 t/s	21.6 GB
32,768	OOM	—	—

Quality retention

Benchmark	FP16 reference	AWQ INT4 (calibrated)	Drop
MMLU	83.6	82.7	1.1%
HumanEval	80.5	78.7	2.2%
GSM8K	95.1	94.4	0.7%
MT-Bench	9.0	8.9	1.1%
BBH (Big-Bench Hard)	83.1	82.0	1.3%
MATH	57.8	56.4	2.4%

AWQ INT4 quality retention on Llama 3.1 70B is genuinely strong – the 1-2% drops are well within run-to-run variance for chat use cases, and the retained quality decisively beats Qwen 2.5 32B on most reasoning tasks.

Cross-card comparison

Setup	VRAM	Decode b=1 t/s	Max context	Concurrent users	Aggregate b=8
2x RTX 4090 24GB (TP)	48 GB	38	32k	4-8	180
RTX 5090 32GB	32 GB	42	32k	2-4	140
H100 80GB	80 GB	72	32k+	16+	420
RTX 4090 24GB	24 GB	23	16k	1-2	110 (memory pressure)
A100 80GB AWQ	80 GB	32	32k+	8+	180
RTX 3090 24GB AWQ	24 GB	16 (FP16 KV)	4k	1	OOM at 4

The 4090 is the smallest and cheapest path to single-card 70B deployment. For higher throughput see the two-card tensor parallel option, the 5090 32GB upgrade, or the H100. See 4090 vs 5090, 4090 vs H100, and Llama 70B INT4 VRAM requirements for the wider picture.

Production gotchas

Always disable CUDA graphs. --enforce-eager is mandatory at this VRAM ceiling. The 0.6 t/s cost is worth the stability gain – CUDA graphs intermittently evict the kernel cache and trigger OOM on long-running workers.
FP8 KV is non-negotiable. FP16 KV at 16k consumes 5 GB; the model OOMs immediately.
Treat one 4090 as one 70B user. Batch 2 works for short contexts; anything longer or any multi-user load needs two cards or a 5090/H100.
Pre-warm extensively. First request after worker start can have 3+ second TTFT vs steady-state 880ms. Probe with a 1k-token prompt before opening to traffic.
Long prefill blocks the scheduler hard. A 16k-token prompt holds the GPU for 14.6 seconds; chunked prefill is mandatory.
Watch nvidia-smi for thermal throttling. Sustained 70B inference at 95% GPU utilisation pushes the card to 78-82°C; monitor and ensure airflow.
Two-card TP needs PCIe Gen4 x16/x16. Tensor parallel splits attention across cards over PCIe (no NVLink on 4090); slot configuration matters for sustained throughput. See PCIe Gen4 x16 notes.

Verdict and recommended config

Working config for single-user 70B at 16k context:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-chunked-prefill

For two concurrent short-context users, swap --max-num-seqs 2 --max-model-len 4096. For multi-user serving at 70B class, the 4090 is the wrong tool – either go to two cards with tensor parallel for ~2x throughput, or to a single 5090 32GB or H100 80GB. The 4090 path is right when you need 70B-class quality on a single-tenant UK box at consumer pricing for one-to-two users; outside that envelope, look elsewhere. See the Llama 70B model guide for use-case discussion and the deployment walkthrough for the full ops detail.

70B on a single consumer card

22-24 t/s decode, 16k context, FP8 KV, AWQ-Marlin INT4. UK dedicated hosting with full root access.

Order the RTX 4090 24GB

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

Benchmarks

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

Ready to deploy your AI workload?

Dedicated GPU servers from our UK datacenter. NVMe storage, 1Gbps networking, full root access.

Browse GPU Servers Contact Sales

RTX 4090 24GB Llama 3.1 70B AWQ INT4 Benchmark: 70B on Consumer Silicon

Contents

Methodology and test rig

Memory map at the VRAM ceiling

Decode throughput and concurrency

Aggregate batch sweep

Per-context-length decode at batch 1

p50/p99 latency at AWQ INT4, varying context

TTFT and prefill rate

Quality retention

Cross-card comparison

Production gotchas

Verdict and recommended config

70B on a single consumer card

Need a Dedicated GPU Server?

gigagpu

GPU Hosting

Blog Categories

AI Model Hosting

Benchmarks & Tools

Deploy a GPU Server

Ready to deploy your AI workload?

Have a question? Need help?

RTX 4090 24GB Llama 3.1 70B AWQ INT4 Benchmark: 70B on Consumer Silicon

Contents

Methodology and test rig

Memory map at the VRAM ceiling

Decode throughput and concurrency

Aggregate batch sweep

Per-context-length decode at batch 1

p50/p99 latency at AWQ INT4, varying context

TTFT and prefill rate

Quality retention

Cross-card comparison

Production gotchas

Verdict and recommended config

70B on a single consumer card

Need a Dedicated GPU Server?

gigagpu

Related Articles

Mixed Precision Training Guide

Mistral 7B on RTX 4060: Performance Benchmark & Cost, Category: Benchmarks, Slug: mistral-7b-on-rtx-4060-benchmark, Excerpt: Mistral 7B benchmarked on RTX 4060: 22.0 tok/s at 4-bit GGUF Q4_K_M, VRAM usage, cost per 1M tokens, and deployment configuration., Internal links: 9 –>

Token/sec Benchmark Update: April 2026

GPU Hosting

Blog Categories

AI Model Hosting

Benchmarks & Tools

Deploy a GPU Server

Ready to deploy your AI workload?

Have a question? Need help? Contact us

Have a question? Need help?