The RTX 4090 24GB is the highest TFLOPS-per-pound accelerator NVIDIA has ever shipped, and on dense FP16 it comfortably outperforms an A100 40GB. On dense FP8 it sits roughly halfway to a single H100 PCIe. The interesting question is not the marketing peak number on the box but what fraction of those teraflops actually reach a real LLM, diffusion or fine-tune kernel. That depends on Ada’s 4th-gen tensor cores, the 72 MB L2 (Ada’s defining architectural change), and how much your workload is bandwidth-bound rather than maths-bound. Spin up a card from the RTX 4090 24GB hosting page or browse the wider dedicated GPU range first, then read on.
Contents
- Dense theoretical TFLOPS by datatype
- Sparsity acceleration and when it really applies
- 4th-gen tensor cores and what changed from Ampere
- A100, H100, Blackwell and 5090 in context
- Real measured utilisation on production kernels
- Benchmark class: where the 4090 actually lands
- Production gotchas
- Verdict and when to pick the 4090 24GB
Dense theoretical TFLOPS by datatype
The Ada AD102 die powering the 4090 carries 16,384 active CUDA cores (out of 18,432 on the full die), 128 streaming multiprocessors and a typical observed boost clock of 2.55-2.6 GHz on a thermally healthy card. Every SM holds four 4th-generation tensor cores, and Ada is the first generation in the consumer line to expose native FP8 (E4M3 and E5M2) arithmetic, doubling rates over FP16 dense. The CUDA-core FP32 figure is 82.6 TFLOPS: 16,384 cores × 2 FMA FLOPs per clock per core (a per-core rate the architecture has held since Pascal) × the 2.52 GHz nominal boost clock.
| Format | Dense TFLOPS | Sparse TFLOPS | Tensor cores | Accumulator |
|---|---|---|---|---|
| FP32 (CUDA cores) | 82.6 | n/a | No | FP32 |
| TF32 | 82.6 | 165.2 | Yes | FP32 |
| BF16 / FP16 (FP32 accum) | 165.2 | 330.3 | Yes | FP32 |
| FP16 (FP16 accum) | 330.3 | 660.6 | Yes | FP16 |
| FP8 E4M3 / E5M2 | 660.6 | 1321.2 | Yes | FP16/FP32 |
| INT8 | 660.6 TOPS | 1321.2 TOPS | Yes | INT32 |
| INT4 | 1321 TOPS | 2642 TOPS | Yes | INT32 |
The single most important row is the FP8 dense rate. 660 TFLOPS at FP8 is roughly twice what a 3090 can muster at FP16 dense and within a factor of three of an H100 PCIe, achieved on a card with a £1,750 list price rather than £25,000. The accumulator column matters too: when an LLM kernel uses FP16 multiplies with an FP32 accumulator, the effective rate is the FP32-accum row (165 TFLOPS), not the FP16-accum 330 TFLOPS. Many older inference paths in PyTorch silently used FP32 accum until vLLM, FlashAttention and TensorRT-LLM closed the gap with explicit FP16-accum kernels.
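The dense column is one multiplier chain. A quick back-of-envelope sketch reproducing the table rows from core count and clock (the 2.52 GHz figure is the nominal boost clock NVIDIA's headline numbers assume, not the observed 2.55-2.6 GHz):

```python
# Reproduce the dense rows of the table from first principles.
CUDA_CORES = 16_384
BOOST_HZ = 2.52e9

fp32_tflops = CUDA_CORES * 2 * BOOST_HZ / 1e12   # 2 FMA FLOPs/clock/core
fp16_fp32_accum = fp32_tflops * 2                # tensor cores, FP32 accumulate
fp16_fp16_accum = fp32_tflops * 4                # FP16 accumulate doubles again
fp8_dense = fp32_tflops * 8                      # native FP8 doubles once more

print(round(fp32_tflops, 1), round(fp16_fp16_accum, 1), round(fp8_dense, 1))
# → 82.6 330.3 660.6
```

Each sparse column entry is simply the matching dense figure doubled, which is why the 2:4 caveats in the next section matter so much.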
Sparsity acceleration and when it really applies
Ada inherits Ampere’s 2:4 structured sparsity scheme. The pattern is rigid: of every four contiguous weights along the inner dimension of a matmul, exactly two must be zero. Tensor cores then skip the zero multiplies and double the throughput. NVIDIA quotes the doubled number as the headline because it doubles the marketing TFLOPS, but in practice almost no off-the-shelf LLM ships pre-pruned to 2:4 because the constraint forces accuracy loss in the 1-3 percent range on MMLU-class benchmarks unless you re-train.
Where sparse TFLOPS do reach the workload: NVIDIA’s apex.contrib.sparsity pruner during a fine-tune; Sparse Marlin kernels (added to vLLM 0.6) when serving a checkpoint produced by NVIDIA’s TensorRT-LLM 2:4 quantiser; and Sparse FlashAttention if you accept the additional pruning step. Treat the 1.32 PetaOPS INT8 sparse number as a ceiling, not a forecast.
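The 2:4 constraint is easy to state in code. A minimal illustrative checker (not NVIDIA's API; their pruners enforce exactly two zeros per group, this sketch accepts at least two):

```python
# Check the 2:4 structured-sparsity constraint along the inner (reduction)
# dimension: every contiguous group of four weights needs two zeros.
def is_2_4_sparse(row) -> bool:
    return all(sum(1 for w in row[i:i + 4] if w == 0) >= 2
               for i in range(0, len(row), 4))

print(is_2_4_sparse([0, 3, 0, 1, 5, 0, 0, 2]))   # → True
print(is_2_4_sparse([1, 2, 3, 0, 0, 0, 4, 5]))   # → False: first group has one zero
```

The rigidity is the point: the hardware can hard-wire the skip logic only because the zero positions are constrained to a fixed pattern per group, which is also why unstructured pruning earns no speedup here.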
Bandwidth shapes the achievable fraction
A tensor core maths peak is only reachable when the operand throughput keeps the cores fed. Decode is the inverse case: every generated token streams the entire model through the bus, so 1008 GB/s of GDDR6X is the binding wall and tensor cores idle for most cycles. See the GDDR6X bandwidth deep-dive for the per-token decode formula.
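The naive decode ceiling is one division. A sketch of the bound (it ignores KV-cache traffic and L2 reuse, so treat it as an upper limit, not a forecast):

```python
# Naive decode ceiling: every generated token streams the full weight
# tensor across the bus once, so tokens/s <= bandwidth / model bytes.
def decode_ceiling_tps(model_bytes: float, bw_bytes_per_s: float) -> float:
    return bw_bytes_per_s / model_bytes

# Llama 3.1 8B at FP8 (~8 GB of weights) on the 4090's 1008 GB/s bus
print(round(decode_ceiling_tps(8e9, 1008e9)))   # → 126
```

No amount of tensor-core headroom moves this number at batch 1; only smaller weights (quantisation), weight reuse (batching) or cache residency do.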
4th-gen tensor cores and what changed from Ampere
Ada’s tensor cores ship three changes over Ampere’s 3rd-gen units that matter for AI workloads:
| Feature | Ampere (3rd-gen) | Ada (4th-gen) | Hopper (4th-gen) |
|---|---|---|---|
| FP8 native | No | Yes (E4M3 + E5M2) | Yes (E4M3 + E5M2) |
| Transformer Engine | No | Software fallback | Native scaling |
| L2 cache | 6 MB (GA102) / 40 MB (A100) | 72 MB | 50 MB |
| FP16 dense per SM | ~2.9 TFLOPS (A100) | ~2.6 TFLOPS | ~7.5 TFLOPS |
| SM count (top die) | 108 (A100) | 128 (4090) | 132 (H100) |
FP8 is the headline upgrade. The 12x larger L2 is the under-celebrated one: on Ada a small model’s weights can sit hot in cache between layer accesses, which is why a Phi-3-mini FP8 model regularly hits 480 t/s on a 4090, well above the naive bandwidth ceiling of 265 t/s. The 72 MB L2 also boosts FlashAttention throughput because attention tiles can be re-used across query blocks without round-tripping to VRAM. FP8 tensor cores on Ada covers the kernel side in more detail.
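The Phi-3-mini figure lets you estimate how much work the L2 is doing. A back-of-envelope sketch, assuming ~3.8 GB of FP8 weights (the 3.8B parameter count at one byte each):

```python
# Infer the L2 hit fraction from the measured decode rate: at 480 t/s,
# DRAM can only have supplied 1008/480 ≈ 2.1 GB per token, so the
# remaining weight reads must have come from the 72 MB L2.
BW = 1008e9
MODEL_BYTES = 3.8e9          # Phi-3-mini ≈ 3.8B params at FP8

naive_ceiling = BW / MODEL_BYTES             # DRAM-only bound
measured_tps = 480
dram_bytes_per_token = BW / measured_tps
l2_fraction = 1 - dram_bytes_per_token / MODEL_BYTES

print(round(naive_ceiling), round(l2_fraction, 2))   # → 265 0.45
```

Roughly 45 percent of weight traffic served from cache is only plausible because repeated layers and hot tiles fit in 72 MB; a 17 GB 70B INT4 checkpoint gets no such help.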
A100, H100, Blackwell and 5090 in context
The numbers below are dense, peak, with default boost clock and standard tensor format. They are headline rates only; achievable percentages are in the next section.
| GPU | FP16 dense | FP16 sparse | FP8 dense | VRAM | Bandwidth |
|---|---|---|---|---|---|
| RTX 3090 24GB | 142 TFLOPS | 284 TFLOPS | n/a | 24 GB GDDR6X | 936 GB/s |
| RTX 4090 24GB | 330 TFLOPS | 660 TFLOPS | 660 TFLOPS | 24 GB GDDR6X | 1008 GB/s |
| RTX 5090 32GB | 419 TFLOPS | 838 TFLOPS | 838 TFLOPS | 32 GB GDDR7 | 1792 GB/s |
| A100 80GB SXM | 312 TFLOPS | 624 TFLOPS | n/a | 80 GB HBM2e | 2039 GB/s |
| H100 SXM | 989 TFLOPS | 1979 TFLOPS | 1979 TFLOPS | 80 GB HBM3 | 3350 GB/s |
| L40S | 362 TFLOPS | 725 TFLOPS | 725 TFLOPS | 48 GB GDDR6 ECC | 864 GB/s |
The 4090 beats an A100 80GB on dense FP16 by 6 percent, on FP8 it is the only sub-£10k consumer card with native support, and on bandwidth it is roughly half an A100. Compute is rarely the wall on a 4090; bandwidth is. The 5090 closes that gap with GDDR7 1792 GB/s, which is why the 4090 vs 5090 decision hinges almost entirely on whether you need the extra 8 GB of VRAM and 78 percent more bandwidth or can extract value from a card that costs less and ships in volume today.
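The per-pound claim is worth making concrete. A sketch using the prices quoted earlier in this piece (£1,750 list for the 4090, £25,000 for an H100) and the FP8 dense rates from the table; note the H100 rate is the SXM figure, so this flatters the H100 side if anything:

```python
# FP8 dense TFLOPS per pound, using this article's quoted prices.
tflops_per_pound_4090 = 660.6 / 1_750
tflops_per_pound_h100 = 1_979 / 25_000

print(round(tflops_per_pound_4090 / tflops_per_pound_h100, 1))   # → 4.8
```

A near-5x compute-per-pound lead on paper is the whole economic argument; the rest of this article is about how much of it survives contact with real kernels.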
Real measured utilisation on production kernels
Theoretical TFLOPS are advertised; real utilisation is measured. Numbers below come from gigagpu.com production hosts running vLLM 0.6.x, FlashAttention 3 and the Marlin/Machete FP8 kernel families, captured with NVIDIA Nsight Compute and validated against nvidia-smi dmon SM activity counters.
| Workload | Kernel class | % of dense peak | Bound on |
|---|---|---|---|
| vLLM prefill, Llama 3.1 8B FP16 | cuBLAS GEMM | ~70% | Tensor cores |
| vLLM decode, Llama 3.1 8B FP16, batch 1 | FlashAttention 3 | ~9% | VRAM bandwidth |
| vLLM decode, Llama 3.1 8B FP8, batch 32 | Marlin FP8 | ~36% | Mixed |
| vLLM prefill, Llama 3 70B AWQ INT4 | Marlin AWQ | ~62% | Tensor cores |
| SDXL UNet step, BF16 | cuDNN conv + GEMM | ~58% | Tensor cores |
| FLUX.1-dev FP16, 30-step | FlashAttention 3 + GEMM | ~52% | Mixed |
| QLoRA Llama 3.1 8B BF16, FA3 | Triton attention + cuBLAS | ~64% | Tensor cores |
| Whisper large-v3-turbo INT8 batched | cuBLASLt INT8 | ~48% | Encoder GEMM |
Two patterns are worth absorbing. First, prefill (large batch matmul) reaches 60-70 percent of peak across formats, while single-stream decode falls to single-digit utilisation because every token forces a full weight scan across the bus. Second, batched decode at batch 32 climbs back to a third of peak because weights are reused across the batch, amortising the bandwidth cost. This is why the concurrent-user count matters so much for your effective TFLOPS-per-pound: idle hardware is wasted hardware.
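Turning a utilisation percentage into a delivered rate is a single multiply; which dense peak to multiply against depends on the kernel's format and accumulator, which is why the accumulator column in the first table matters:

```python
# Effective throughput = dense peak × measured utilisation (a sketch;
# pick the dense peak that matches the kernel's format and accumulator).
def effective_tflops(dense_peak_tflops: float, utilisation: float) -> float:
    return dense_peak_tflops * utilisation

print(round(effective_tflops(660.6, 0.36), 1))   # FP8 batch-32 decode → 237.8
print(round(effective_tflops(165.2, 0.09), 1))   # FP16 batch-1 decode → 14.9
```

Batched FP8 decode delivering ~238 real TFLOPS versus ~15 for single-stream FP16 is the entire case for quantising and batching before buying more silicon.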
Benchmark class: where the 4090 actually lands
By raw FP16 dense, the 4090 is in the A100 class. By FP8 dense, it is between an A100 (no native FP8) and an H100. By bandwidth, it is firmly in the consumer tier – 1008 GB/s versus 2-3.3 TB/s of HBM. For a small-batch interactive workload (1-8 concurrent users, 7-13B model, FP8 weights and FP8 KV) the 4090 will land within 10 percent of an H100 on tokens per second per user, at roughly one-eighth the rental cost.
| Workload | 4090 24GB | A100 80GB | H100 80GB | 4090/H100 |
|---|---|---|---|---|
| Llama 3.1 8B FP8 single-user decode | 195 t/s | 140 t/s (BF16 only) | 225 t/s | 0.87x |
| Llama 3.1 8B FP8 batch 32 aggregate | 1100 t/s | 1300 t/s | 2400 t/s | 0.46x |
| Llama 3.1 70B AWQ INT4 decode | 23 t/s | n/a (offload) | 40 t/s | 0.58x |
| SDXL 1024×1024 30-step | 2.0 s | 1.8 s | 1.4 s | 0.70x |
| FLUX.1-dev FP16 30-step | 6.0 s | 5.5 s | 3.8 s | 0.63x |
| Whisper large-v3-turbo INT8 | 80x RT | 105x RT | 140x RT | 0.57x |
| QLoRA Llama 8B (tok/s) | 14000 | 16000 | 22000 | 0.64x |
Single-user decode is where the 4090 shines: the ratio against H100 is 0.87x because both cards are bandwidth-bound and the 4090’s 1008 GB/s is ~30 percent of H100’s 3350 GB/s, but the 72 MB L2 reclaims a lot of that gap on small models. As you scale to batch 32, H100’s HBM3 pulls ahead because there are more tokens to feed through the bus per unit time. Compare against a 4090 vs H100 head-to-head for a more granular split, or the 4090 vs A100 piece if you are migrating an Ampere fleet.
The kernel choice that determines your TFLOPS
What you launch matters as much as what you launch on. A representative vLLM startup for the bandwidth-bound case, with line-by-line notes:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 `# halves bytes/token, doubles BW ceiling` \
  --kv-cache-dtype fp8 `# halves KV bytes per token` \
  --max-model-len 65536 `# leaves ~6 GB free for batch` \
  --max-num-seqs 32 `# forces batched decode -> tensor reuse` \
  --enable-chunked-prefill `# caps p99 TTFT on long prompts` \
  --enable-prefix-caching `# reuse system prompt across requests` \
  --gpu-memory-utilization 0.92 `# leaves headroom for cuBLAS workspace` \
  --port 8000
```

(The backtick comments are a shell trick: a plain `#` after a line-continuation backslash would terminate the command, whereas an empty command substitution carrying the comment keeps it valid.)
The two FP8 flags are the headline. --quantization fp8 dispatches Marlin FP8 GEMM kernels that hit ~36 percent of dense FP8 peak in batched decode, versus 9 percent of FP16 dense for an unquantised model in the same regime. The --max-num-seqs 32 flag is what makes batched tensor reuse feasible; without it, weights stream from VRAM on every token. --enable-prefix-caching hits the L2 cache because shared prompt prefixes deduplicate to a single set of KV blocks. See the full vLLM setup guide for the rest of the production flag set.
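The memory flags hang together arithmetically. A sizing sketch assuming Llama 3.1 8B geometry (32 layers, 8 KV heads, head dimension 128) and ~8 GB of FP8 weights; this is illustrative arithmetic, not vLLM's actual allocator:

```python
# KV-cache budget behind --max-model-len and --gpu-memory-utilization.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * 1   # K+V at FP8, 1 byte each

vram_budget = 24e9 * 0.92      # --gpu-memory-utilization 0.92
weight_bytes = 8e9             # Llama 3.1 8B at FP8
kv_pool_tokens = (vram_budget - weight_bytes) / kv_bytes_per_token

print(kv_bytes_per_token, round(kv_pool_tokens))   # → 65536 214844
```

Roughly 215k pooled KV tokens is far less than 32 sequences × 65,536 context; vLLM's paged KV cache makes this workable because sequences allocate blocks on demand and prefix caching deduplicates shared prompts, rather than every slot reserving its maximum up front.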
For the maximum-quality path on a single 4090, the AWQ-INT4 70B kernel is the headline trick:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin `# AWQ packed weights via Marlin INT4 GEMM` \
  --kv-cache-dtype fp8 `# FP8 KV halves attention memory` \
  --max-model-len 16384 `# 16k is the sustainable target` \
  --max-num-seqs 4 `# KV constraint, not compute constraint` \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 `# pushed high because weights are static` \
  --port 8000
```
Marlin INT4 reaches roughly 62 percent of dense FP8 peak on prefill because it dequantises weights into FP16 registers on the fly, then runs the GEMM on the 4th-gen tensor cores at FP16 rates. The 70B-on-one-card configuration is unlocked entirely by AWQ Marlin plus FP8 KV; without either, the model falls off the card. Detailed walkthrough in the 70B INT4 deployment guide.
Production gotchas
- FP32 accumulator silently halves your FP16 throughput. Stock PyTorch nn.Linear uses FP32 accum. Switch to torch.compile or vLLM/FlashAttention paths to get the 330 TFLOPS FP16-accum number.
- Marlin requires AWQ checkpoints with group_size=128. Other group sizes fall back to slower kernels (~50 percent throughput drop).
- Sparsity is not free. A 2:4 pruned checkpoint loses 1-3 MMLU points unless you fine-tune to recover. Treat sparse TFLOPS as ceiling.
- Single-stream decode wastes 90 percent of your tensor cores. Batch your traffic with vLLM’s continuous batching or Triton’s dynamic batcher; do not run llama.cpp single-stream in production unless you must.
- L2 cache effect dominates for small models. Phi-3-mini and Qwen 2.5 0.5B can exceed bandwidth ceiling because the model fits in 72 MB L2. Larger models (Llama 70B INT4 at 17 GB) cannot benefit.
- FP8 quality degradation is real on long contexts. E5M2 KV at 128k context can lose 1-2 perplexity points. Validate on your eval before production.
- Driver matters. CUDA 12.4+ and driver 550+ are required for the most recent FP8 kernels in vLLM 0.6.3+. Older drivers silently fall back to FP16.
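The driver gate in the last point is easy to enforce at startup. A minimal illustrative helper (not part of vLLM) that parses nvidia-smi-style version strings against the thresholds stated above:

```python
# Refuse the FP8 path below driver 550 / CUDA 12.4, per the note above.
def fp8_ready(driver_version: str, cuda_version: str) -> bool:
    driver_major = int(driver_version.split(".")[0])
    cuda = tuple(int(p) for p in cuda_version.split("."))
    return driver_major >= 550 and cuda >= (12, 4)

print(fp8_ready("550.54.15", "12.4"), fp8_ready("535.161.08", "12.2"))
# → True False
```

Failing loudly here beats the silent FP16 fallback, which would otherwise show up only as a mysterious halving of throughput in your benchmarks.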
Verdict and when to pick the 4090 24GB
Pick a 4090 24GB if:
- Your model fits 24 GB at FP8 or AWQ INT4 (everything up to Llama 70B INT4, Mistral Small 3 24B INT4, Qwen 2.5 32B AWQ).
- You serve 1-32 concurrent users, where the 4090 is within 10-50 percent of an H100 at one-eighth the cost.
- You can use FP8 (Ada native) or AWQ Marlin kernels – this is where the per-pound TFLOPS lead translates into real product economics.
- You want UK-hosted dedicated metal at a known monthly cost rather than per-second cloud billing surprises.
Skip the 4090 24GB if you need >24 GB VRAM (look at the 5090 32GB, A6000 Ada, or RTX 6000 Pro 96GB), if your batch size is consistently >64 (where H100 HBM3 pulls decisively ahead), or if you need NVLink for tensor parallel (Ada dropped it; consider the 3090, which retains NVLink, for tightly coupled multi-GPU). For a tier map of the modern lineup see tier positioning 2026.
Bench-class throughput at consumer-class price
UK-hosted RTX 4090 24GB ready in minutes, vLLM and FP8 kernels pre-built. UK dedicated hosting.
Order the RTX 4090 24GB. See also: spec breakdown, FP8 tensor cores on Ada, GDDR6X bandwidth, tokens per watt, prefill vs decode benchmark, FP8 Llama deployment, all infrastructure posts.