The RTX 4090 24GB remains, even three years after launch, the highest decode-throughput consumer GPU per pound that an inference team can rent on a single-tenant Gigagpu dedicated host. This article is a long-form benchmark of Llama 3.1 8B on that card, covering six weight formats, batch sizes from 1 to 64, time-to-first-token across prompt lengths from 256 to 32k tokens, energy efficiency, and how the results line up against a 5060 Ti, 5080, 3090, 5090 and H100. All measurements use vLLM 0.6.4 with PyTorch 2.5 and CUDA 12.6 on Ubuntu 24.04, sustained averages over 60-second windows after a 30-second warm-up. There is no overclock and no undervolt — the card is at stock 450 W TDP throughout.
Contents
- Methodology and rig
- Per-quantisation decode results
- Concurrency scaling, batch 1 to 64
- Prefill rate and TTFT vs prompt length
- Memory consumption at long context
- Cross-card comparison
- Production gotchas
- Verdict and when to pick it
Methodology and rig
The chassis is a single-socket Ryzen 9 7950X with 64 GB of DDR5-5600, a Samsung 990 Pro 2 TB Gen 4 NVMe holding the model cache, and a single RTX 4090 24GB Founders Edition in the primary x16 Gen 4 slot. The RTX 4090 itself is Ada AD102 silicon: 16,384 CUDA cores, 24 GB of GDDR6X on a 384-bit bus delivering 1,008 GB/s of memory bandwidth, 72 MB of L2 cache, and 4th-generation tensor cores with native FP8 (E4M3 and E5M2) support. There is no NVLink connector. Power is capped at 450 W stock and the cooler is the stock Founders Edition flow-through design; ambient is 22 °C and the card sits in an open Open Compute-style frame, so thermals never throttle.
Software is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4 built from source against PyTorch 2.5.1, FlashAttention 2.6.3, and Marlin kernels for AWQ and GPTQ. The benchmark harness is the stock vLLM benchmark_throughput.py for closed-loop tests and a small custom asyncio client for the open-loop concurrency table — the open-loop client maintains a target arrival rate so p50 and p99 TTFT figures are honest tail latencies, not just minimum-arrival warm starts. Every figure below is the median of three 60-second windows; per-run variance is under 2 percent for decode and under 5 percent for tail TTFT.
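For reference, the open-loop pattern looks roughly like the sketch below: requests arrive on a Poisson clock whether or not earlier ones have finished, so queueing delay shows up in the tail percentiles. The endpoint, model name and payload are illustrative and assume the launch command in the next section; this is a sketch of the pattern, not the harness itself.
# Sketch of the open-loop client: Poisson arrivals at a fixed rate, TTFT taken
# as time to the first streamed chunk. Illustrative only -- endpoint and model
# assume the launch line in the next section.
import asyncio, random, time
import aiohttp

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {"model": "meta-llama/Llama-3.1-8B-Instruct",
           "prompt": "benchmark prompt " * 16,
           "max_tokens": 512, "stream": True}

async def one_request(session, ttfts):
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        async for _ in resp.content:                 # first streamed line ~ first token
            ttfts.append(time.perf_counter() - start)
            break
        async for _ in resp.content:                 # drain the rest of the stream
            pass

async def open_loop(rate_hz=2.0, duration_s=60):
    ttfts, tasks = [], []
    async with aiohttp.ClientSession() as session:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            tasks.append(asyncio.create_task(one_request(session, ttfts)))
            await asyncio.sleep(random.expovariate(rate_hz))   # Poisson inter-arrivals
        await asyncio.gather(*tasks)
    ttfts.sort()
    print(f"p50 {ttfts[len(ttfts) // 2] * 1000:.0f} ms, "
          f"p99 {ttfts[int(len(ttfts) * 0.99)] * 1000:.0f} ms")

asyncio.run(open_loop())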
Standard launch line
Almost every server-side number in this article comes from this command. It enables FP8 weights, FP8 KV cache, chunked prefill so long inputs do not block decode steps, prefix caching for chat-pattern reuse, and a sane --max-num-seqs ceiling — vLLM’s default of 256 will happily allocate KV blocks until the card OOMs at high concurrency.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --quantization fp8 --kv-cache-dtype fp8 \
    --max-model-len 65536 --max-num-seqs 32 \
    --enable-chunked-prefill --enable-prefix-caching \
    --gpu-memory-utilization 0.92 --port 8000
For a deeper walk-through of every flag and how each one interacts with Ada FP8 hardware, see the vLLM setup guide and the FP8 Llama deployment walk-through. A complementary AWQ guide covers the INT4 path used in the AWQ-Marlin rows below.
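Once the server is up, any OpenAI-compatible client can talk to it. A minimal smoke test with the official openai Python package, pointed at the local endpoint (the api_key value is a placeholder; vLLM ignores it unless --api-key is set at launch):
# Smoke test against the server launched above via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what is FP8 E4M3?"},
    ],
    max_tokens=64,
)
print(resp.choices[0].message.content)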
Per-quantisation decode results
Single-stream decode at batch 1 with a 128-token input and 512-token output, sustained for 60 seconds. The “weights” column is the on-disk weight footprint; “VRAM peak” is total allocator residency including activations and a 4k-token KV cache. Quality figures (MMLU 5-shot delta from the BF16 reference) are from the lm-evaluation-harness 0.4.4 run on each quantised checkpoint and are included for context, not for headline ranking.
| Precision | Weights | Decode t/s | VRAM peak | MMLU delta |
|---|---|---|---|---|
| FP16 / BF16 | 16.0 GB | 95 | 18.0 GB | 0.0 |
| FP8 E4M3 | 8.0 GB | 195 | 11.0 GB | -0.3 |
| FP8 + FP8 KV | 8.0 GB | 198 | 9.0 GB | -0.4 |
| AWQ-Marlin INT4 | 5.5 GB | 225 | 8.0 GB | -1.2 |
| GPTQ-Marlin INT4 | 5.6 GB | 220 | 8.0 GB | -1.5 |
| GGUF Q4_K_M | 4.9 GB | 165 | 7.0 GB | -1.5 |
| EXL2 4.0 bpw | 4.8 GB | 240 | 7.0 GB | -1.7 |
Three things are worth pulling out of that table. First, FP16 at 95 t/s is purely a memory-bandwidth ceiling — with 16 GB of weights and 1,008 GB/s of GDDR6X bandwidth you cannot stream the full weight set from VRAM more than roughly 63 times per second, and the measured figure only clears that naive per-token cap through cache and activation reuse. Second, FP8 doubles throughput because halving the weight footprint roughly doubles the read rate, and the Ada tensor cores accept FP8 operands natively without an upcast. Third, EXL2 at 240 t/s is the single-stream king on the 4090 because the ExLlamaV2 kernels are aggressively hand-tuned for low batch sizes; the gap closes once you start batching, as the next section shows. The choice between AWQ-Marlin and FP8 is therefore not “which is faster” but “which engine does my serving stack speak natively” — vLLM, TensorRT-LLM and SGLang all run FP8 first-class, while EXL2 lives in TabbyAPI or the ExLlamaV2 server.
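The arithmetic behind that first point, as a back-of-envelope check (ideal streaming ceilings only, ignoring L2 hits, activations and KV traffic):
# Naive decode ceiling: each generated token streams the full weight tensor
# from VRAM at least once, so bandwidth / weight size bounds tokens per second.
bandwidth_gb_s = 1008                       # RTX 4090 GDDR6X
for name, weights_gb in [("FP16", 16.0), ("FP8", 8.0), ("AWQ INT4", 5.5)]:
    print(f"{name:8s}: ~{bandwidth_gb_s / weights_gb:5.0f} t/s naive ceiling")
# FP16 ~63, FP8 ~126, INT4 ~183. Measured figures that land above these
# ceilings are relying on reuse out of cache rather than a fresh VRAM read.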
Concurrency scaling, batch 1 to 64
This is the table that matters for any production deployment. The harness opens a fixed number of concurrent client streams, each with a 256-token input and a 512-token output, and measures aggregate generated tokens per second across all streams plus per-user steady-state rate. p50 and p99 TTFT come from the open-loop run with Poisson arrivals at the rate that just saturates the configured batch ceiling.
| Batch | Aggregate t/s | Per-user t/s | p50 TTFT | p99 TTFT |
|---|---|---|---|---|
| 1 | 198 | 198 | 80 ms | 110 ms |
| 2 | 365 | 182 | 95 ms | 140 ms |
| 4 | 620 | 155 | 130 ms | 220 ms |
| 8 | 880 | 110 | 200 ms | 380 ms |
| 16 | 1,020 | 64 | 320 ms | 620 ms |
| 32 | 1,100 | 34 | 530 ms | 1,100 ms |
| 64 | 1,140 | 18 | 880 ms | 2,400 ms |
The aggregate throughput curve is the classic shape: near-linear scaling to batch 4, sub-linear to batch 16, and asymptotic above batch 32 where memory bandwidth is the bottleneck and additional concurrent streams only steal SM time from one another. The per-user column is the operationally honest metric — at batch 32 each individual chat user perceives the model typing at 34 t/s, which is still faster than most humans can read but no longer feels instant. Above batch 32 the p99 TTFT climbs through one second; for an interactive chatbot that is the wall. For an asynchronous batch summariser or RAG ingestion job it is fine. A fuller decomposition of where each millisecond goes lives in the prefill/decode benchmark and the concurrent users writeup.
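To translate "just saturates batch 32" into an arrival rate, Little's law with the numbers from the batch-32 row gives a rough target (an estimate, not a measured figure):
# Little's law: concurrency = arrival_rate * time_in_system.
# Batch-32 row: 34 t/s per user, 512 output tokens, ~530 ms median TTFT.
output_tokens  = 512
per_user_tps   = 34
p50_ttft_s     = 0.53
time_in_system = p50_ttft_s + output_tokens / per_user_tps   # ~15.6 s per request
arrival_rate   = 32 / time_in_system                          # ~2.1 requests/s
print(f"~{arrival_rate:.1f} req/s keeps ~32 requests in flight")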
Prefill rate and TTFT vs prompt length
Prefill is a compute-bound regime — you run each layer's matmuls over the entire input sequence at once — so the FP8 tensor cores get to flex in a way decode never lets them. The 4090’s theoretical FP8 dense throughput is roughly 660 TFLOPS without sparsity, and on Llama 3.1 8B prefill we measure around 12,200 single-sequence input tokens per second once the sequence is long enough to saturate the SMs.
| Quant | Prefill t/s (single seq) |
|---|---|
| FP8 | 12,000 |
| FP8 + FP8 KV | 12,200 |
| AWQ-Marlin | 7,200 |
| GGUF Q4_K_M | 5,400 |
| EXL2 4.0 bpw | 8,800 |
The prefill ranking inverts the decode ranking: FP8 wins because the tensor cores accept FP8 operands directly, while AWQ INT4 has to dequantise to FP16 inside the Marlin kernel before the matmul, paying a roughly 40 percent prefill tax for its decode advantage. GGUF Q4_K_M comes out worst at prefill because the llama.cpp backend used inside vLLM does not have a Marlin-class fused kernel and falls back to a less optimised path. If your workload is RAG with very long retrieved contexts and short generations, FP8 is unambiguously the right pick — the FP8 tensor cores Ada writeup explains why in more detail.
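To put the measured 12,200 t/s against the 660 TFLOPS ceiling quoted above, a rough model-FLOPs-utilisation estimate (prefill cost taken as 2 × parameter count FLOPs per token, attention ignored, so treat the percentage as indicative only):
# Rough prefill MFU: dense prefill costs ~2 * params FLOPs per input token.
params          = 8.0e9
flops_per_token = 2 * params                          # ~16 GFLOP per token
peak_fp8_flops  = 660e12                              # Ada FP8 dense, no sparsity
ceiling_tps     = peak_fp8_flops / flops_per_token    # ~41,000 t/s theoretical
measured_tps    = 12_200
print(f"prefill MFU ~ {measured_tps / ceiling_tps:.0%}")   # roughly 30%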
Time-to-first-token is dominated by prefill at any prompt longer than about 512 tokens. The table below is FP8 weights, FP8 KV, batch 1, FlashAttention-2 prefill kernel:
| Prompt length | TTFT |
|---|---|
| 256 tok | 80 ms |
| 1k | 145 ms |
| 2k | 230 ms |
| 4k | 415 ms |
| 8k | 880 ms |
| 16k | 1,850 ms |
| 32k | 4,100 ms |
TTFT scales roughly linearly with prompt length up to 8k and then degrades faster as KV writes start to compete with the prefill matmuls for memory bandwidth. The 32k figure of 4.1 seconds is genuinely usable for asynchronous workloads but is too slow for an interactive autocomplete — at that point you want chunked prefill with a lower chunk size, or you want to keep prompts under 8k and rely on prefix caching for the system message. For chat-pattern workloads where the system prompt and conversation history are stable across turns, prefix caching cuts effective TTFT to the cost of just the new user turn, often returning sub-200 ms first tokens even at 16k total context.
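As a consistency check between the two tables, the linear region of that TTFT curve is approximately prompt length divided by the prefill rate plus a small fixed overhead (rough estimate using the FP8 prefill figure):
# TTFT ~ prompt_len / prefill_rate + fixed overhead in the linear region.
prefill_tps = 12_200                        # FP8 + FP8 KV, single sequence
for prompt_len, measured_ms in [(4096, 415), (8192, 880), (32768, 4100)]:
    predicted_ms = prompt_len / prefill_tps * 1000
    print(f"{prompt_len:6d} tok: ~{predicted_ms:4.0f} ms predicted, {measured_ms} ms measured")
# The gap grows from ~80 ms at 4k to ~1.4 s at 32k as KV writes start to
# compete with the prefill matmuls for bandwidth, as noted above.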
Memory consumption at long context
The 24 GB ceiling is the headline constraint. The table below shows allocator residency for four weight-and-KV configurations, measured at idle batch 1 with a single sequence whose context is filled to the listed length, so the KV column reflects a realistically populated cache rather than an empty reservation.
| Config | Weights | KV cache (full ctx) | Free for batching |
|---|---|---|---|
| FP16, ctx 8k | 16.0 GB | 1.0 GB | ~6.5 GB |
| FP8, ctx 32k | 8.0 GB | 2.1 GB (FP8 KV) | ~13 GB |
| FP8, ctx 128k | 8.0 GB | 8.4 GB (FP8 KV) | ~7 GB |
| AWQ, ctx 32k | 5.4 GB | 4.2 GB (FP16 KV) | ~14 GB |
| AWQ + FP8 KV, ctx 32k | 5.4 GB | 2.1 GB | ~16 GB |
FP8 weights with FP8 KV is the sweet spot: 8 GB for weights, 2 GB for a 32k cache, and 13 GB free for batching. That free pool is what lets the 4090 serve roughly 32 concurrent conversations with a 32k context ceiling on a single card (paged KV blocks are only allocated as each context actually fills), which is the configuration most production teams settle on. AWQ at INT4 frees even more headroom but, as the prefill table showed, gives up roughly 40 percent of the prefill rate and a corresponding chunk of TTFT — a trade-off worth taking only if you need to push concurrent users above the 32 mark or run multiple smaller models in parallel via a router. The RTX 4090 24GB for Llama 3 8B primer covers this trade-off in plain language for first-time deployers.
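The KV figures in that table fall straight out of the Llama 3.1 8B attention geometry (32 layers, 8 KV heads of dimension 128 under grouped-query attention) at one byte per value for FP8 and two for FP16. A quick derivation:
# Per-token KV cache for Llama 3.1 8B: K and V for 32 layers x 8 KV heads x
# 128 head dim. GQA (8 KV heads vs 32 query heads) is what keeps this small.
layers, kv_heads, head_dim = 32, 8, 128
values_per_token = 2 * layers * kv_heads * head_dim          # 65,536 values
for dtype, bytes_per_value in [("FP8", 1), ("FP16", 2)]:
    per_token_kib = values_per_token * bytes_per_value / 1024
    for ctx in (32_768, 131_072):
        gib = values_per_token * bytes_per_value * ctx / 1024**3
        print(f"{dtype}: {per_token_kib:.0f} KiB/token, {ctx // 1024}k ctx -> {gib:.1f} GiB")
# FP8 works out to 64 KiB/token: 2.0 GiB at 32k and 8.0 GiB at 128k, matching
# the table within allocator block rounding; FP16 doubles both figures.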
Cross-card comparison
How does the 4090 line up against the rest of the stable? All numbers below are Llama 3.1 8B in FP8 (or AWQ where FP8 hardware is unavailable, as marked) at batch 1 single-stream and batch 32 aggregate, plus a tokens-per-Joule energy figure measured at the wall with a P3 Kill A Watt P4400 meter, minus the idle baseline.
| Card | b=1 t/s | b=32 aggregate t/s | tokens/Joule |
|---|---|---|---|
| 5060 Ti 16GB | 112 | 720 | 4.6 |
| 5080 16GB | 185 | 1,100 | 3.8 |
| 4090 24GB | 198 | 1,100 | 3.4 |
| 3090 24GB (AWQ) | 150 | 950 | 3.3 |
| 5090 32GB | 280 | 1,700 | 3.4 |
| H100 80GB | 330 | 2,200 | 5.0 |
Three observations. The 5080 16GB matches the 4090 on raw aggregate throughput at batch 32 because both are bandwidth-bound and the 5080 has only marginally less bandwidth, but the 5080 caps you at 16 GB which means no FP16 fallback and a tighter context budget — see the 4090 vs 5060 Ti and 4090 vs 5090 comparisons for the full picture. The 3090 24GB is the value floor: you give up FP8 entirely (Ampere has no native FP8) and have to live with AWQ INT4 for any kind of speed, but the price differential makes it competitive for some workloads, as the 4090 vs 3090 and 4090 or 3090 decision guides explore. The H100 wins everywhere on tokens per Joule because HBM3 at 3,350 GB/s plus a 700 W TDP is genuinely a different class of memory subsystem; whether you can justify the price differential is covered in the 4090 vs H100 writeup.
Production gotchas
The benchmark numbers above are best-case sustained throughput on a clean rig. The list below is what you actually trip over in production, in rough order of how often we see each one bite a deployment.
- vLLM's --max-num-seqs default is 256, which is too high for 24 GB. The scheduler will allocate KV blocks aggressively and either OOM mid-request or fragment the cache so badly that aggregate throughput drops 20 percent. Override to 32 for chat workloads and 64 for short-completion workloads. The default exists because vLLM was originally tuned for A100 80 GB.
- AWQ + Marlin needs vLLM 0.5.0 or newer. Older builds fall back to the reference AWQ kernel, which is roughly 2.5 times slower at decode and breaks the per-quantisation table above entirely. Pin vllm>=0.6.0 in your requirements.
- GGUF Q4 in vLLM goes through the llama.cpp backend, not Marlin. If you measure GGUF inside vLLM and it looks slow, that is why — the GGUF path is included for compatibility, not speed. Use AWQ-Marlin if you want INT4 throughput inside vLLM, or run GGUF in llama.cpp directly, where it has a much more optimised CUDA kernel.
- The 4090 is a Gen 4 PCIe card; a Gen 3 host slot loses about 5 percent prefill throughput. This matters more than people expect, because long-context prefill streams a lot of activations across the bus during chunked prefill. Verify with nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv.
- The 4090 has no NVLink connector. Two-card tensor parallel runs over PCIe Gen 4 x16, which is roughly 32 GB/s bidirectional — about a tenth of what NVLink would give you. For Llama 3.1 70B INT4 across two 4090s the all-reduce overhead eats 25 to 30 percent of decode throughput; details in the Llama 70B INT4 benchmark.
- Power draw is not what the spec sheet says. Under prefill the card pulls roughly 400 W; under decode at batch 32 it sits at 360 W; idle but loaded with a CUDA context it still draws 70 W. Provision the PSU for 500 W headroom and remember the idle floor when calculating monthly hosting cost — see the monthly hosting cost breakdown for a worked example.
- Prefix caching is not optional for chat. Without --enable-prefix-caching every turn re-runs the full system prompt through prefill, which on a 4k system prompt adds 415 ms to every single response. With it on, the second-and-later turns of a conversation pay only the cost of the new user message; a sketch of the cache-friendly request pattern follows this list. The prefix caching writeup explains the cache eviction behaviour you need to plan for.
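A prefix-cache hit requires the cached prefix to be token-identical across requests, so the chat pattern that benefits keeps the system prompt and earlier turns byte-for-byte stable and only appends. A sketch, reusing the client object from the smoke test earlier (the prompt text is made up):
# Cache-friendly chat loop: the system prompt and prior turns are resent
# unchanged every request, so vLLM reuses their KV blocks and only the new
# user message pays prefill cost.
SYSTEM = {"role": "system",
          "content": "You are the support assistant for an example service. "
                     "Answer briefly and cite the relevant policy section."}

history = [SYSTEM]
for user_turn in ["How do I reset my password?", "It says my token expired."]:
    history.append({"role": "user", "content": user_turn})
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=history,          # identical prefix -> KV blocks already cached
        max_tokens=256,
    )
    history.append({"role": "assistant", "content": resp.choices[0].message.content})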
Verdict and when to pick it
The 4090 24GB is the right card if you need a single-tenant box that serves Llama 3.1 8B at 1,100 aggregate tokens per second with sub-second p99 TTFT up to batch 32, and you do not want to pay H100 money. It is the wrong card if your workload is a fleet of 70B-class models, a multi-card tensor-parallel deployment that would benefit from NVLink, or a fine-tuning rig that needs to fit optimiser state for anything larger than 8B in BF16 — in that case the 4090 or 5090 decision tilts toward the newer card despite the price.
For a focused 8B serving deployment with mixed RAG and chat traffic, the configuration we land on for most clients is FP8 weights with FP8 KV cache, --max-num-seqs 32, chunked prefill on, prefix caching on, --max-model-len 65536. That gives you 32 concurrent users at 34 t/s each, p99 TTFT around 1.1 seconds, and 13 GB of free VRAM headroom for spike absorption. Energy efficiency lands at roughly 3.4 tokens per Joule, which on UK retail electricity at 28 p/kWh works out to a marginal cost per million output tokens that beats the OpenAI 4o-mini API list price by a meaningful multiple — the vs OpenAI API cost calculator walks through the arithmetic.
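The electricity arithmetic behind that claim, as a rough marginal figure that counts GPU energy only (host power, the 70 W idle floor and hardware amortisation are deliberately excluded):
# Marginal electricity cost per million output tokens at 3.4 tokens/Joule
# and 28 p/kWh, GPU energy only.
tokens_per_joule = 3.4
pence_per_kwh    = 28
joules_per_m     = 1_000_000 / tokens_per_joule     # ~294 kJ per million tokens
kwh_per_m        = joules_per_m / 3.6e6             # ~0.082 kWh
print(f"~{kwh_per_m * pence_per_kwh:.1f} p per million output tokens")   # ~2.3 p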
Llama 3.1 8B at 1,100 aggregate t/s
Single-tenant RTX 4090 24GB on UK dedicated hosting, FP8 native, vLLM-tuned out of the box.
Order the RTX 4090 24GB
See also: spec breakdown, TFLOPS class, power draw efficiency, tokens per watt, Qwen 14B benchmark, Mixtral benchmark, fine-tune throughput.