
RTX 4090 24GB Llama 3.1 8B Benchmark: FP16, FP8, AWQ, GPTQ, GGUF, EXL2, concurrency, TTFT, energy

Deep Llama 3.1 8B benchmark on the RTX 4090 24GB across six quantisations, batch 1 to 64, full TTFT curve, energy and a five-card cross comparison.

The RTX 4090 24GB remains, even three years after launch, the highest decode-throughput consumer GPU per pound that an inference team can rent on a single-tenant Gigagpu dedicated host. This article is a long-form benchmark of Llama 3.1 8B on that card, covering six weight formats, batch sizes from 1 to 64, time-to-first-token across prompt lengths from 256 to 32k tokens, energy efficiency, and how the result lines up against a 5060 Ti, 5080, 3090, 5090 and H100. All measurements use vLLM 0.6.4 with PyTorch 2.5 and CUDA 12.6 on Ubuntu 24.04, sustained averages over 60-second windows after a 30-second warm-up. There is no overclock and no undervolt — the card is at stock 450 W TDP throughout.

Methodology and rig

The chassis is a single-socket Ryzen 9 7950X with 64 GB of DDR5-5600, a Samsung 990 Pro 2 TB Gen 4 NVMe holding the model cache, and a single RTX 4090 24GB Founders Edition in the primary x16 Gen 4 slot. The RTX 4090 itself is the Ada AD102 silicon: 16,384 CUDA cores, 24 GB of GDDR6X on a 384-bit bus delivering 1,008 GB/s of memory bandwidth, 72 MB of L2 cache, and 4th-generation tensor cores with native FP8 (E4M3 and E5M2) support. There is no NVLink connector. Power is capped at 450 W stock and the cooler is the triple-fan reference; ambient is 22 °C and the card sits in an open, Open Compute-style test frame, so thermals never throttle.

Software is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4 built from source against PyTorch 2.5.1, FlashAttention 2.6.3, and Marlin kernels for AWQ and GPTQ. The benchmark harness is the stock vLLM benchmark_throughput.py for closed-loop tests and a small custom asyncio client for the open-loop concurrency table — the open-loop client maintains a target arrival rate so p50 and p99 TTFT figures are honest tail latencies, not just minimum-arrival warm starts. Every figure below is the median of three 60-second windows; per-run variance is under 2 percent for decode and under 5 percent for tail TTFT.
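The open-loop client's Poisson arrival schedule is the part worth sketching, because it is what makes the p99 figures honest: requests land at exponentially distributed intervals rather than whenever the previous one finishes. The snippet below is a minimal illustration of that schedule generator, not the harness itself; the function name and parameters are ours.

```python
import random

def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Generate request arrival timestamps with exponentially distributed
    inter-arrival gaps, i.e. a Poisson process at rate_per_s requests/second."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap => Poisson arrivals
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# At 2 req/s over a 60 s window we expect roughly 120 arrivals.
schedule = poisson_arrivals(rate_per_s=2.0, duration_s=60.0)
print(len(schedule))
```

In the real client each timestamp fires one request coroutine regardless of whether earlier requests have completed, so queueing delay shows up in the measured TTFT instead of being hidden by back-pressure.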

Standard launch line

Almost every server-side number in this article comes from this command. It enables FP8 weights, FP8 KV cache, chunked prefill so long inputs do not block decode steps, prefix caching for chat-pattern reuse, and a sane --max-num-seqs ceiling — vLLM’s default of 256 will happily allocate KV blocks until the card OOMs at high concurrency.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 65536 --max-num-seqs 32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92 --port 8000

For a deeper walk-through of every flag and how each one interacts with Ada FP8 hardware, see the vLLM setup guide and the FP8 Llama deployment walk-through. A complementary AWQ guide covers the INT4 path used in the AWQ-Marlin rows below.
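Once the server above is up, it speaks the standard OpenAI-compatible chat API on port 8000. The sketch below builds and sends a request using only the Python standard library; the prompt and helper names are ours, and the network call obviously requires the server to be running.

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    # Payload shape for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # needs the server above running
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# reply = chat("Summarise PCIe 4.0 in one sentence.")  # uncomment with a live server
```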

Per-quantisation decode results

Single-stream decode at batch 1 with a 128-token input and 512-token output, sustained for 60 seconds. The “weights” column is the on-disk weight footprint; “VRAM peak” is total allocator residency including activations and a 4k-token KV cache. Quality figures (MMLU 5-shot delta from the BF16 reference) are from the lm-evaluation-harness 0.4.4 run on each quantised checkpoint and are included for context, not for headline ranking.

| Precision | Weights | Decode t/s | VRAM peak | MMLU delta |
|---|---|---|---|---|
| FP16 / BF16 | 16.0 GB | 95 | 18.0 GB | 0.0 |
| FP8 E4M3 | 8.0 GB | 195 | 11.0 GB | -0.3 |
| FP8 + FP8 KV | 8.0 GB | 198 | 9.0 GB | -0.4 |
| AWQ-Marlin INT4 | 5.5 GB | 225 | 8.0 GB | -1.2 |
| GPTQ-Marlin INT4 | 5.6 GB | 220 | 8.0 GB | -1.5 |
| GGUF Q4_K_M | 4.9 GB | 165 | 7.0 GB | -1.5 |
| EXL2 4.0 bpw | 4.8 GB | 240 | 7.0 GB | -1.7 |

Three things are worth pulling out of that table. First, FP16 at 95 t/s is a memory-bandwidth story — at 16 GB of weights and 1,008 GB/s of GDDR6X bandwidth, the full weight tensor can stream from VRAM at most roughly 63 times per second, giving a naive ceiling of about 63 t/s; the measured 95 t/s clears that ceiling only because part of each pass is served from on-chip cache rather than DRAM. Second, FP8 roughly doubles throughput because halving the weight footprint roughly doubles the read rate, and the Ada tensor cores accept FP8 operands natively without an upcast. Third, EXL2 at 240 t/s is the single-stream king on the 4090 because the ExLlamaV2 kernels are aggressively hand-tuned for low batch sizes; the gap closes once you start batching, as the next section shows. The choice between AWQ-Marlin and FP8 is therefore not “which is faster” but “which engine does my serving stack speak natively” — vLLM, TensorRT-LLM and SGLang all run FP8 first-class, while EXL2 lives in TabbyAPI or the ExLlamaV2 server.
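The bandwidth-ceiling arithmetic above is worth being able to reproduce for any format, since decode throughput on this card is first and foremost a function of weight footprint. A minimal sketch, using the weight sizes from the table:

```python
def decode_ceiling_tps(weight_gb: float, bandwidth_gbps: float = 1008.0) -> float:
    """Naive decode ceiling: each generated token must stream the full weight
    tensor from VRAM once, so tokens/s <= memory bandwidth / weight size."""
    return bandwidth_gbps / weight_gb

print(round(decode_ceiling_tps(16.0)))  # FP16: 63 t/s ceiling
print(round(decode_ceiling_tps(8.0)))   # FP8: 126 t/s
print(round(decode_ceiling_tps(4.8)))   # EXL2 4.0 bpw: 210 t/s
```

Measured figures that land above these ceilings (95 vs 63 for FP16, 240 vs 210 for EXL2) are the visible effect of on-chip reuse; the ratio between formats still tracks the weight footprint almost exactly.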

Concurrency scaling, batch 1 to 64

This is the table that matters for any production deployment. The harness opens a fixed number of concurrent client streams, each with a 256-token input and a 512-token output, and measures aggregate generated tokens per second across all streams plus per-user steady-state rate. p50 and p99 TTFT come from the open-loop run with Poisson arrivals at the rate that just saturates the configured batch ceiling.

| Batch | Aggregate t/s | Per-user t/s | p50 TTFT | p99 TTFT |
|---|---|---|---|---|
| 1 | 198 | 198 | 80 ms | 110 ms |
| 2 | 365 | 182 | 95 ms | 140 ms |
| 4 | 620 | 155 | 130 ms | 220 ms |
| 8 | 880 | 110 | 200 ms | 380 ms |
| 16 | 1,020 | 64 | 320 ms | 620 ms |
| 32 | 1,100 | 34 | 530 ms | 1,100 ms |
| 64 | 1,140 | 18 | 880 ms | 2,400 ms |

The aggregate throughput curve is the classic shape: near-linear scaling to batch 4, sub-linear to batch 16, and asymptotic above batch 32 where memory bandwidth is the bottleneck and additional concurrent streams only steal SM time from one another. The per-user column is the operationally honest metric — at batch 32 each individual chat user perceives the model typing at 34 t/s, which is still faster than most humans can read but no longer feels instant. Above batch 32 the p99 TTFT climbs through one second; for an interactive chatbot that is the wall. For an asynchronous batch summariser or RAG ingestion job it is fine. A fuller decomposition of where each millisecond goes lives in the prefill/decode benchmark and the concurrent users writeup.
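The knee in that curve is easiest to see as scaling efficiency against perfect linear scaling from batch 1. A small sketch over the table's own numbers:

```python
# (batch, aggregate t/s) pairs from the concurrency table above
measured = [(1, 198), (2, 365), (4, 620), (8, 880), (16, 1020), (32, 1100), (64, 1140)]

for batch, agg in measured:
    per_user = agg / batch                # what each concurrent user perceives
    efficiency = agg / (198 * batch)      # vs ideal linear scaling from batch 1
    print(f"b={batch:2d}  per-user={per_user:5.1f} t/s  efficiency={efficiency:4.0%}")
```

Efficiency falls from 92 percent at batch 2 to under 10 percent at batch 64, which is the quantitative version of "additional streams only steal SM time from one another".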

Prefill rate and TTFT vs prompt length

Prefill is a compute-bound regime — each layer’s matmuls run over the entire input sequence at once — so the FP8 tensor cores get to flex in a way decode never lets them. The 4090’s theoretical FP8 dense throughput is roughly 660 TFLOPS without sparsity, and on Llama 3.1 8B prefill we measure around 12,200 single-sequence input tokens per second once the sequence is long enough to saturate the SMs.

| Quant | Prefill t/s (single seq) |
|---|---|
| FP8 | 12,000 |
| FP8 + FP8 KV | 12,200 |
| AWQ-Marlin | 7,200 |
| GGUF Q4_K_M | 5,400 |
| EXL2 4.0 bpw | 8,800 |

The prefill ranking inverts the decode ranking: FP8 wins because the tensor cores accept FP8 operands directly, while AWQ INT4 has to dequantise to FP16 inside the Marlin kernel before the matmul, paying a roughly 40 percent prefill tax for its decode advantage. GGUF Q4_K_M comes out worst at prefill because the llama.cpp backend used inside vLLM does not have a Marlin-class fused kernel and falls back to a less optimised path. If your workload is RAG with very long retrieved contexts and short generations, FP8 is unambiguously the right pick — the FP8 tensor cores Ada writeup explains why in more detail.
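Those prefill rates translate into a rough model-FLOPs-utilisation figure via the standard back-of-envelope estimate of ~2 FLOPs per parameter per token for a dense transformer forward pass. A sketch, under that assumption:

```python
def prefill_mfu(prefill_tps: float, params: float = 8e9, peak_flops: float = 660e12) -> float:
    """Rough MFU: ~2 FLOPs per parameter per token, against the 4090's
    ~660 dense FP8 TFLOPS. (For INT4 paths the effective peak differs,
    since Marlin dequantises to FP16 before the matmul.)"""
    achieved_flops = 2 * params * prefill_tps
    return achieved_flops / peak_flops

print(f"FP8 prefill MFU: {prefill_mfu(12_200):.0%}")  # ~30%
print(f"AWQ prefill MFU: {prefill_mfu(7_200):.0%}")
```

Roughly 30 percent MFU at FP8 is a respectable figure for single-GPU serving through vLLM, and it makes clear there is compute headroom left — prefill is limited by kernel efficiency, not by the silicon.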

Time-to-first-token is dominated by prefill at any prompt longer than about 512 tokens. The table below is FP8 weights, FP8 KV, batch 1, FlashAttention-2 prefill kernel:

| Prompt length | TTFT |
|---|---|
| 256 tok | 80 ms |
| 1k | 145 ms |
| 2k | 230 ms |
| 4k | 415 ms |
| 8k | 880 ms |
| 16k | 1,850 ms |
| 32k | 4,100 ms |

TTFT scales roughly linearly with prompt length up to 8k and then degrades faster as KV writes start to compete with the prefill matmuls for memory bandwidth. The 32k figure of 4.1 seconds is genuinely usable for asynchronous workloads but is too slow for an interactive autocomplete — at that point you want chunked prefill with a lower chunk size, or you want to keep prompts under 8k and rely on prefix caching for the system message. For chat-pattern workloads where the system prompt and conversation history are stable across turns, prefix caching cuts effective TTFT to the cost of just the new user turn, often returning sub-200 ms first tokens even at 16k total context.
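A first-order TTFT model — a fixed scheduling overhead plus prompt length divided by the sustained prefill rate — tracks the table well up to about 4k and then starts underestimating, which is exactly the KV-write contention described above. The 60 ms overhead constant below is fitted by eye to the short-prompt rows, not measured independently:

```python
def predict_ttft_ms(prompt_tokens: int, prefill_tps: float = 12_200,
                    overhead_ms: float = 60.0) -> float:
    """First-order TTFT model: fixed overhead + prompt_tokens / prefill rate."""
    return overhead_ms + 1000.0 * prompt_tokens / prefill_tps

measured = {256: 80, 1024: 145, 2048: 230, 4096: 415, 8192: 880, 32768: 4100}
for n, ttft in measured.items():
    print(f"{n:6d} tok  predicted {predict_ttft_ms(n):6.0f} ms  measured {ttft} ms")
```

The model predicts ~2.7 s at 32k against a measured 4.1 s; the gap is the bandwidth contention tax, and it is the reason chunked prefill tuning matters most at long context.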

Memory consumption at long context

The 24 GB ceiling is the headline constraint. The table below shows allocator residency at three context lengths and four weight configurations, all measured at idle batch 1 immediately after a fresh 4k-prompt prefill so the KV cache is realistically populated.

| Config | Weights | KV at ctx | Free for batching |
|---|---|---|---|
| FP16, ctx 8k | 16.0 GB | 1.0 GB | ~6.5 GB |
| FP8, ctx 32k | 8.0 GB | 2.1 GB (FP8 KV) | ~13 GB |
| FP8, ctx 128k | 8.0 GB | 8.4 GB (FP8 KV) | ~7 GB |
| AWQ, ctx 32k | 5.4 GB | 4.2 GB (FP16 KV) | ~14 GB |
| AWQ + FP8 KV, ctx 32k | 5.4 GB | 2.1 GB | ~16 GB |

FP8 weights with FP8 KV is the sweet spot: 8 GB for weights, 2 GB for a 32k cache, and 13 GB free for batching. That free pool is what lets the 4090 hold roughly 32 concurrent conversations with a 32k context window on a single card — typical chat occupancy sits well below the window, so the pool stretches far further than a worst-case 32k-per-user calculation would suggest — and that is the configuration most production teams settle on. AWQ at INT4 frees even more headroom but, as the prefill table showed, gives up roughly 40 percent of prefill throughput — a trade-off worth taking only if you need to push concurrent users above the 32 mark or run multiple smaller models in parallel via a router. The RTX 4090 24GB for Llama 3 8B primer covers this trade-off in plain language for first-time deployers.
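The KV figures in the table follow directly from Llama 3.1 8B's architecture — 32 layers, 8 KV heads (GQA), head dimension 128 — so the per-token cost is easy to derive. A sketch:

```python
def kv_cache_gb(ctx_tokens: int, dtype_bytes: int) -> float:
    """KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype size,
    for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128)."""
    per_token = 2 * 32 * 8 * 128 * dtype_bytes  # 64 KiB/token at FP8
    return ctx_tokens * per_token / 1e9

print(f"FP8 KV,  32k ctx: {kv_cache_gb(32_768, 1):.1f} GB")   # ~2.1 GB
print(f"FP8 KV, 128k ctx: {kv_cache_gb(131_072, 1):.1f} GB")  # ~8.6 GB
print(f"FP16 KV, 32k ctx: {kv_cache_gb(32_768, 2):.1f} GB")   # ~4.3 GB
```

The small gaps against the measured table figures (8.4 vs 8.6 GB, 4.2 vs 4.3 GB) come down to allocator accounting; the formula is close enough for capacity planning.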

Cross-card comparison

How does the 4090 line up against the rest of the stable? All numbers below are Llama 3.1 8B in FP8 (or AWQ where FP8 hardware is unavailable, marked) at batch 1 single-stream and batch 32 aggregate, plus a tokens-per-Joule energy figure measured at the wall socket via a Kill-A-Watt P3 P4400 minus idle baseline.

| Card | b=1 t/s | b=32 aggregate t/s | tokens/Joule |
|---|---|---|---|
| 5060 Ti 16GB | 112 | 720 | 4.6 |
| 5080 16GB | 185 | 1,100 | 3.8 |
| 4090 24GB | 198 | 1,100 | 3.4 |
| 3090 24GB (AWQ) | 150 | 950 | 3.3 |
| 5090 32GB | 280 | 1,700 | 3.4 |
| H100 80GB | 330 | 2,200 | 5.0 |

Three observations. The 5080 16GB matches the 4090 on raw aggregate throughput at batch 32 because both are bandwidth-bound and the 5080 has only marginally less bandwidth, but the 5080 caps you at 16 GB which means no FP16 fallback and a tighter context budget — see the 4090 vs 5060 Ti and 4090 vs 5090 comparisons for the full picture. The 3090 24GB is the value floor: you give up FP8 entirely (Ampere has no native FP8) and have to live with AWQ INT4 for any kind of speed, but the price differential makes it competitive for some workloads, as the 4090 vs 3090 and 4090 or 3090 decision guides explore. The H100 wins everywhere on tokens per Joule because HBM3 at 3,350 GB/s plus a 700 W TDP is genuinely a different class of memory subsystem; whether you can justify the price differential is covered in the 4090 vs H100 writeup.

Production gotchas

The benchmark numbers above are best-case sustained throughput on a clean rig. The list below is what you actually trip over in production, in rough order of how often we see each one bite a deployment.

  • vLLM --max-num-seqs default is 256, which is too high for 24 GB. The scheduler will allocate KV blocks aggressively and either OOM mid-request or fragment the cache so badly that aggregate throughput drops 20 percent. Override to 32 for chat workloads and 64 for short-completion workloads. The default exists because vLLM was originally tuned for A100 80 GB.
  • AWQ + Marlin needs vLLM 0.5.0 or newer. Older builds fall back to the reference AWQ kernel, which is roughly 2.5 times slower at decode and invalidates the per-quantisation table above entirely. Pin at least vllm>=0.5.0 in your requirements; the numbers in this article were taken on 0.6.4.
  • GGUF Q4 in vLLM goes through the llama.cpp backend, not Marlin. If you measure GGUF inside vLLM and it looks slow, that is why — the GGUF path is included for compatibility, not speed. Use AWQ-Marlin if you want INT4 throughput inside vLLM, or run GGUF in llama.cpp directly where it has a much more optimised CUDA kernel.
  • The 4090 is a Gen 4 PCIe card; a Gen 3 host slot loses about 5 percent prefill throughput. This matters more than people expect, because long-context prefill streams a lot of activations across the bus during chunked prefill. Verify with nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv.
  • The 4090 has no NVLink connector. Two-card tensor parallel runs over PCIe Gen 4 x16, which is roughly 32 GB/s per direction — a small fraction of what NVLink would give you. For Llama 3.1 70B INT4 across two 4090s the all-reduce overhead eats 25 to 30 percent of decode throughput; details in the Llama 70B INT4 benchmark.
  • Power draw is not what the spec sheet says. Under prefill the card pulls roughly 400 W; under decode at batch 32 it sits at 360 W; idle but loaded with a CUDA context it still draws 70 W. Provision the PSU for 500 W headroom and remember the idle floor when calculating monthly hosting cost — see the monthly hosting cost breakdown for a worked example.
  • Prefix caching is not optional for chat. Without --enable-prefix-caching every turn re-runs the full system prompt through prefill, which on a 4k system prompt adds 415 ms to every single response. With it on, the second-and-later turns of a conversation pay only the cost of the new user message. The prefix caching writeup explains the cache eviction behaviour you need to plan for.
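The first gotcha — choosing --max-num-seqs — reduces to dividing the free KV pool by the per-sequence cache at your *actual* average context occupancy, not max-model-len. A rough sizing helper, assuming the FP8-everything config above and Llama 3.1 8B's 64 KiB/token FP8 KV cost (the function name and defaults are ours):

```python
def max_safe_seqs(free_kv_gb: float, avg_ctx_tokens: int,
                  kv_bytes_per_token: int = 65_536) -> int:
    """How many concurrent sequences fit in the free KV pool if each holds
    avg_ctx_tokens of FP8 KV cache (64 KiB/token for Llama 3.1 8B)."""
    return int(free_kv_gb * 1e9 // (avg_ctx_tokens * kv_bytes_per_token))

# ~13 GB free after FP8 weights: chat at ~4k average vs fully packed 32k contexts
print(max_safe_seqs(13.0, 4_096))    # 48 chat-sized sessions
print(max_safe_seqs(13.0, 32_768))   # 6 fully occupied 32k contexts
```

This is why --max-num-seqs 32 with a 32k window is safe in practice for chat (real turns occupy a few thousand tokens) while a workload that genuinely fills 32k per request needs a far lower ceiling.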

Verdict and when to pick

The 4090 24GB is the right card if you need a single-tenant box that serves Llama 3.1 8B at 1,100 aggregate tokens per second with sub-second p99 TTFT up to batch 32, and you do not want to pay H100 money. It is the wrong card if your workload is a fleet of 70B-class models, a multi-card tensor-parallel deployment that would benefit from NVLink, or a fine-tuning rig that needs to fit optimiser state for anything larger than 8B in BF16 — in that case the 4090 or 5090 decision tilts toward the newer card despite the price.

For a focused 8B serving deployment with mixed RAG and chat traffic, the configuration we land on for most clients is FP8 weights with FP8 KV cache, --max-num-seqs 32, chunked prefill on, prefix caching on, --max-model-len 65536. That gives you 32 concurrent users at 34 t/s each, p99 TTFT around 1.1 seconds, and 13 GB of free VRAM headroom for spike absorption. Energy efficiency lands at roughly 3.4 tokens per Joule, which on UK retail electricity at 28 p/kWh works out to a marginal cost per million output tokens that beats the OpenAI 4o-mini API list price by a meaningful multiple — the vs OpenAI API cost calculator walks through the arithmetic.
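The electricity arithmetic in that last sentence, spelled out — energy cost only, excluding the idle floor, server rental, and hardware amortisation, at the article's 3.4 tokens per Joule and 28 p/kWh:

```python
def pence_per_million_tokens(tokens_per_joule: float, pence_per_kwh: float = 28.0) -> float:
    joules_per_mtok = 1e6 / tokens_per_joule   # energy to generate 1M output tokens
    kwh_per_mtok = joules_per_mtok / 3.6e6     # 1 kWh = 3.6 MJ
    return kwh_per_mtok * pence_per_kwh

print(f"{pence_per_million_tokens(3.4):.1f}p per million output tokens")  # ~2.3p
```

About 2.3 pence of marginal electricity per million output tokens; the full per-token economics, including the 70 W idle floor and hosting fees, are worked through in the linked calculator.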

Llama 3.1 8B at 1,100 aggregate t/s

Single-tenant RTX 4090 24GB on UK dedicated hosting, FP8 native, vLLM-tuned out of the box.

Order the RTX 4090 24GB

See also: spec breakdown, TFLOPS class, power draw efficiency, tokens per watt, Qwen 14B benchmark, Mixtral benchmark, fine-tune throughput.
