
RTX 4090 24GB TFLOPS: AI Benchmark Class Explained

A senior engineer's tour of the RTX 4090 24GB throughput envelope: dense and sparse FP32, BF16, FP16, FP8 and INT8 numbers, achievable utilisation per kernel class, and how it lands against A100, H100 and Blackwell on real workloads.

The RTX 4090 24GB is the highest TFLOPS-per-pound accelerator NVIDIA has ever shipped: on dense FP16 it edges out an A100, and on dense FP8 it sits a third of the way to an H100 SXM. The interesting question is not the marketing peak number on the box but what fraction of those teraflops actually reaches a real LLM, diffusion or fine-tune kernel. That depends on Ada's 4th-gen tensor cores, the 72 MB L2 (Ada's defining architectural change), and how far your workload is bandwidth-bound rather than maths-bound. Spin up a card from the RTX 4090 24GB hosting page or browse the wider dedicated GPU range first, then read on.

Dense theoretical TFLOPS by datatype

The Ada AD102 die powering the 4090 carries 16,384 active CUDA cores (out of 18,432 on the full die), 128 streaming multiprocessors and a typical observed boost clock of 2.55-2.6 GHz on a thermally healthy card. Every SM holds four 4th-generation tensor cores, and Ada is the first consumer generation to expose native FP8 (E4M3 and E5M2) arithmetic, doubling rates over FP16 dense. The plain CUDA-core FP32 figure is 82.6 TFLOPS, derived from 2 ops per clock per core (one fused multiply-add) at the official 2.52 GHz boost, the same counting rule the architecture has used since Pascal.

| Format | Dense TFLOPS | Sparse TFLOPS | Tensor cores | Accumulator |
|---|---|---|---|---|
| FP32 (CUDA cores) | 82.6 | n/a | No | FP32 |
| TF32 | 82.6 | 165.2 | Yes | FP32 |
| BF16 / FP16 (FP32 accum) | 165.2 | 330.3 | Yes | FP32 |
| FP16 (FP16 accum) | 330.3 | 660.6 | Yes | FP16 |
| FP8 E4M3 / E5M2 | 660.6 | 1321.2 | Yes | FP16/FP32 |
| INT8 | 660.6 TOPS | 1321.2 TOPS | Yes | INT32 |
| INT4 | 1321 TOPS | 2642 TOPS | Yes | INT32 |

The single most important row is the FP8 dense rate. 660 TFLOPS at FP8 is more than four times what a 3090 can muster at FP16 dense and within a factor of three of an H100 PCIe, achieved on a card with a £1,750 list price rather than £25,000. The accumulator column matters too: when an LLM kernel uses FP16 multiplies with an FP32 accumulator, the effective rate is the FP32-accum row (165 TFLOPS), not the FP16-accum 330 TFLOPS. Many older inference paths in PyTorch silently used FP32 accum until vLLM, FlashAttention and TensorRT-LLM closed the gap with explicit FP16-accum kernels.
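
As a sanity check, the whole dense ladder can be reproduced from the CUDA-core figure. A minimal sketch, assuming NVIDIA's official 2.52 GHz boost (observed clocks run a little higher) and the per-format doubling pattern in the table; the output matches the rows above to within rounding:

CUDA_CORES = 16_384
BOOST_HZ = 2.52e9                              # official boost; healthy cards clock higher

fp32 = CUDA_CORES * 2 * BOOST_HZ / 1e12        # one FMA = 2 ops/clock -> ~82.6
ladder = {
    "FP32 (CUDA cores)":      fp32,
    "BF16/FP16 (FP32 accum)": fp32 * 2,
    "FP16 (FP16 accum)":      fp32 * 4,
    "FP8 E4M3/E5M2":          fp32 * 8,
}
for fmt, dense in ladder.items():
    print(f"{fmt:24s} {dense:6.1f} dense / {dense * 2:6.1f} sparse TFLOPS")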

Sparsity acceleration and when it really applies

Ada inherits Ampere's 2:4 structured sparsity scheme. The pattern is rigid: of every four contiguous weights along the inner dimension of a matmul, exactly two must be zero. Tensor cores then skip the zero multiplies and double throughput. NVIDIA quotes the doubled figure as the headline number, but in practice almost no off-the-shelf LLM ships pre-pruned to 2:4, because the constraint costs 1-3 points on MMLU-class benchmarks unless you re-train.
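
To make the rigidity concrete, here is a minimal magnitude-based sketch of the 2:4 constraint in PyTorch. Illustrative only, not NVIDIA's apex ASP pruner; prune_2_4 is a hypothetical helper:

import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    # Zero the two smallest-magnitude weights in every contiguous group of
    # four along the inner (reduction) dimension, as 2:4 sparsity requires.
    rows, cols = w.shape                       # cols must be divisible by 4
    groups = w.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
print((prune_2_4(w) != 0).float().mean())      # exactly 0.50 density

Magnitude pruning like this is what costs the accuracy; the recovery fine-tune is what makes it usable.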

Where sparse TFLOPS do reach the workload: NVIDIA’s apex.contrib.sparsity pruner during a fine-tune; Sparse Marlin kernels (added to vLLM 0.6) when serving a checkpoint produced by NVIDIA’s TensorRT-LLM 2:4 quantiser; and Sparse FlashAttention if you accept the additional pruning step. Treat the 1.32 PetaOPS INT8 sparse number as a ceiling, not a forecast.

Bandwidth shapes the achievable fraction

A tensor core maths peak is only reachable when operand throughput keeps the cores fed. Prefill gets close because a large batched matmul re-uses each weight tile many times per byte fetched. Decode is the inverse case: every generated token streams the entire model through the bus, so 1008 GB/s of GDDR6X is the binding wall and the tensor cores idle for most cycles. See the GDDR6X bandwidth deep-dive for the per-token decode formula.
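
The formula itself is one line. A sketch under the assumption that weight traffic dominates (KV reads ignored) and nothing stays resident in cache:

def decode_ceiling_tps(weight_gb: float, bw_gbs: float = 1008.0) -> float:
    # Naive ceiling: every generated token streams all weight bytes once.
    return bw_gbs / weight_gb

print(decode_ceiling_tps(8.0))   # Llama 3.1 8B FP8, ~8 GB  -> ~126 t/s
print(decode_ceiling_tps(3.8))   # Phi-3-mini FP8, ~3.8 GB  -> ~265 t/s

Measured figures can land above this ceiling when the 72 MB L2 keeps tiles hot, which is exactly the Phi-3-mini effect discussed in the next section.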

4th-gen tensor cores and what changed from Ampere

Ada's 4th-gen tensor cores bring three changes over Ampere's 3rd-gen units that matter for AI workloads; the table below puts them alongside Hopper for context:

| Feature | Ampere (3rd-gen) | Ada (4th-gen) | Hopper (4th-gen) |
|---|---|---|---|
| FP8 native | No | Yes (E4M3 + E5M2) | Yes (E4M3 + E5M2) |
| Transformer Engine | No | Software fallback | Native scaling |
| L2 cache | 6 MB (GA102) | 72 MB | 50 MB |
| FP16 dense per SM | ~256 GFLOPS | ~256 GFLOPS | ~830 GFLOPS |
| SM count (top die) | 108 (A100) | 128 (4090) | 132 (H100) |

FP8 is the headline upgrade. The 12x larger L2 is the under-celebrated one: on Ada a small model's weights can sit hot in cache between layer accesses, which is why a Phi-3-mini FP8 model regularly hits 480 t/s on a 4090, well above the naive bandwidth ceiling of 265 t/s. The 72 MB L2 also boosts FlashAttention throughput, because attention tiles can be re-used across query blocks without round-tripping to VRAM. The FP8 tensor cores on Ada piece covers the kernel side in more detail.

A100, H100, L40S and the Blackwell 5090 in context

The numbers below are dense, peak, with default boost clock and standard tensor format. They are headline rates only; achievable percentages are in the next section.

| GPU | FP16 dense | FP16 sparse | FP8 dense | VRAM | Bandwidth |
|---|---|---|---|---|---|
| RTX 3090 24GB | 142 TFLOPS | 284 TFLOPS | n/a | 24 GB GDDR6X | 936 GB/s |
| RTX 4090 24GB | 330 TFLOPS | 660 TFLOPS | 660 TFLOPS | 24 GB GDDR6X | 1008 GB/s |
| RTX 5090 32GB | 419 TFLOPS | 838 TFLOPS | 838 TFLOPS | 32 GB GDDR7 | 1792 GB/s |
| A100 80GB SXM | 312 TFLOPS | 624 TFLOPS | n/a | 80 GB HBM2e | 2039 GB/s |
| H100 SXM | 989 TFLOPS | 1979 TFLOPS | 1979 TFLOPS | 80 GB HBM3 | 3350 GB/s |
| L40S | 362 TFLOPS | 725 TFLOPS | 725 TFLOPS | 48 GB GDDR6 ECC | 864 GB/s |

The 4090 beats an A100 80GB on dense FP16 by 6 percent, carries native FP8 that the entire Ampere line lacks, and on bandwidth sits at roughly half an A100. Compute is rarely the wall on a 4090; bandwidth is. The 5090 closes that gap with 1792 GB/s of GDDR7, which is why the 4090 vs 5090 decision hinges almost entirely on whether you need the extra 8 GB of VRAM and 78 percent more bandwidth, or can extract value from a card that costs less and ships in volume today.
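
To put the per-pound lead in numbers, a rough calculation from the list prices quoted earlier in this piece (£1,750 for the 4090, £25,000 for an H100; the H100 rate below is the SXM figure from the table, so treat the ratio as indicative only):

cards = {"RTX 4090": (660.6, 1_750), "H100": (1979.0, 25_000)}
for name, (fp8_dense_tflops, price_gbp) in cards.items():
    per_pound = fp8_dense_tflops / price_gbp * 1000
    print(f"{name:9s} {per_pound:5.0f} GFLOPS FP8 dense per GBP")

Roughly a five-fold FP8-per-pound lead on paper, before utilisation enters the picture.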

Real measured utilisation on production kernels

Theoretical TFLOPS are quoted; real utilisation is measured. The numbers below come from gigagpu.com production hosts running vLLM 0.6.x, FlashAttention 3 and the Marlin/Machete FP8 kernel families, captured with NVIDIA Nsight Compute and validated against nvidia-smi dmon SM-activity counters.

| Workload | Kernel class | % of dense peak | Bound by |
|---|---|---|---|
| vLLM prefill, Llama 3.1 8B FP16 | cuBLAS GEMM | ~70% | Tensor cores |
| vLLM decode, Llama 3.1 8B FP16, batch 1 | FlashAttention 3 | ~9% | GDDR6X bandwidth |
| vLLM decode, Llama 3.1 8B FP8, batch 32 | Marlin FP8 | ~36% | Mixed |
| vLLM prefill, Llama 3 70B AWQ INT4 | Marlin AWQ | ~62% | Tensor cores |
| SDXL UNet step, BF16 | cuDNN conv + GEMM | ~58% | Tensor cores |
| FLUX.1-dev FP16, 30-step | FlashAttention 3 + GEMM | ~52% | Mixed |
| QLoRA Llama 3.1 8B BF16, FA3 | Triton attention + cuBLAS | ~64% | Tensor cores |
| Whisper large-v3-turbo INT8 batched | cuBLASLt INT8 | ~48% | Encoder GEMM |

Two patterns are worth absorbing. First, prefill (large batched matmul) reaches 60-70 percent of peak across formats, while single-stream decode falls to single-digit utilisation because every token forces a full weight scan across the bus. Second, batched decode at batch 32 climbs back to a third of peak because weights are re-used across the batch, amortising the bandwidth cost. This is why concurrency matters so much for your effective TFLOPS-per-pound: idle hardware is wasted hardware.
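
The amortisation is easy to see on paper. A sketch assuming 8 GB of FP8 weights and an illustrative per-token KV read cost (both numbers are assumptions, not measurements):

def gb_per_token(weight_gb: float, kv_gb: float, batch: int) -> float:
    # Weights are read once per step and shared across the whole batch;
    # KV reads are per-sequence and do not amortise.
    return weight_gb / batch + kv_gb

for batch in (1, 8, 32):
    traffic = gb_per_token(8.0, 0.0001, batch)
    print(f"batch {batch:2d}: {traffic:6.3f} GB/token -> {1008 / traffic:6.0f} t/s aggregate ceiling")

Once weight traffic is amortised, the bound migrates from bandwidth towards the tensor cores, which is the "Mixed" regime in the table.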

Benchmark class: where the 4090 actually lands

By raw FP16 dense, the 4090 is in the A100 class. By FP8 dense, it is between an A100 (no native FP8) and an H100. By bandwidth, it is firmly in the consumer tier – 1008 GB/s versus 2-3.3 TB/s of HBM. For a small-batch interactive workload (1-8 concurrent users, 7-13B model, FP8 weights and FP8 KV) the 4090 will land within 10 percent of an H100 on tokens per second per user, at roughly one-eighth the rental cost.

| Workload | 4090 24GB | A100 80GB | H100 80GB | 4090/H100 |
|---|---|---|---|---|
| Llama 3.1 8B FP8 single-user decode | 195 t/s | 140 t/s (BF16 only) | 225 t/s | 0.87x |
| Llama 3.1 8B FP8 batch 32 aggregate | 1100 t/s | 1300 t/s | 2400 t/s | 0.46x |
| Llama 3.1 70B AWQ INT4 decode | 23 t/s | n/a (offload) | 40 t/s | 0.58x |
| SDXL 1024×1024 30-step | 2.0 s | 1.8 s | 1.4 s | 0.70x |
| FLUX.1-dev FP16 30-step | 6.0 s | 5.5 s | 3.8 s | 0.63x |
| Whisper large-v3-turbo INT8 | 80x RT | 105x RT | 140x RT | 0.57x |
| QLoRA Llama 8B (tok/s) | 14,000 | 16,000 | 22,000 | 0.64x |

Single-user decode is where the 4090 shines: the ratio against H100 is 0.87x because both cards are bandwidth-bound and the 4090’s 1008 GB/s is ~30 percent of H100’s 3350 GB/s, but the 72 MB L2 reclaims a lot of that gap on small models. As you scale to batch 32, H100’s HBM3 pulls ahead because there are more tokens to feed through the bus per unit time. Compare against a 4090 vs H100 head-to-head for a more granular split, or the 4090 vs A100 piece if you are migrating an Ampere fleet.

The kernel choice that determines your TFLOPS

What you launch matters as much as what you launch on. A representative vLLM startup for the bandwidth-bound case, with per-flag notes after the command:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

# --quantization fp8            halves bytes/token, doubles the bandwidth ceiling
# --kv-cache-dtype fp8          halves KV bytes per token
# --max-model-len 65536         leaves ~6 GB free for batch
# --max-num-seqs 32             forces batched decode -> tensor re-use
# --enable-chunked-prefill      caps p99 TTFT on long prompts
# --enable-prefix-caching       re-uses the system prompt across requests
# --gpu-memory-utilization 0.92 leaves headroom for cuBLAS workspace

The two FP8 flags are the headline. --quantization fp8 dispatches Marlin FP8 GEMM kernels that hit ~36 percent of dense FP8 peak in batched decode, versus the ~9 percent of FP16 dense peak an unquantised model manages at batch 1. The --max-num-seqs 32 flag is what makes batched tensor re-use feasible; without it, weights move from VRAM every token. --enable-prefix-caching helps because shared prompt prefixes deduplicate to a single set of KV blocks. See the full vLLM setup guide for the rest of the production flag set.
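
Once the server is up, a quick smoke test against its OpenAI-compatible endpoint. A minimal sketch in Python with requests; the model name and port match the launch command above:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with one word: ready"}],
        "max_tokens": 8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])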

For the maximum-quality path on a single 4090, the AWQ-INT4 70B kernel is the headline trick:

python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 8000

# --quantization awq_marlin     AWQ packed weights via the Marlin INT4 GEMM
# --kv-cache-dtype fp8          FP8 KV halves attention memory
# --max-model-len 16384         16k is the sustainable context target
# --max-num-seqs 4              a KV constraint, not a compute constraint
# --gpu-memory-utilization 0.95 pushed high because the weights are static

Marlin INT4 reaches roughly 62 percent of dense FP16 peak on prefill because it dequantises weights into FP16 registers on the fly, then runs the GEMM on the 4th-gen tensor cores at FP16 rates. The 70B-on-one-card configuration is unlocked entirely by AWQ Marlin plus FP8 KV; without either, the model falls off the card. There is a detailed walkthrough in the 70B INT4 deployment guide.
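
A rough Python picture of the groupwise dequantisation step, to show why group size matters. Illustrative only; the real Marlin kernel fuses this into the GEMM per tile, in registers, and dequant_int4 is a hypothetical helper:

import torch

def dequant_int4(q, scales, zeros, group_size=128):
    # q holds unpacked 4-bit values 0..15; scales and zeros are stored per
    # output row and per group of `group_size` input channels.
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (g - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    return w.reshape(out_f, in_f).half()       # FP16 operand for tensor cores

q = torch.randint(0, 16, (4096, 4096))
scales = torch.rand(4096, 32)                  # 4096 / 128 = 32 groups
zeros = torch.full((4096, 32), 8.0)
w16 = dequant_int4(q, scales, zeros)

The group_size=128 granularity here is the same one gotcha 2 below warns about.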

Production gotchas

  1. FP32 accumulator silently halves your FP16 throughput. Stock PyTorch nn.Linear uses FP32 accum. Switch to torch.compile or vLLM/FlashAttention paths to get the 330 TFLOPS FP16-accum number.
  2. Marlin requires AWQ checkpoints with group_size=128. Other group sizes fall back to slower kernels (~50 percent throughput drop).
  3. Sparsity is not free. A 2:4 pruned checkpoint loses 1-3 MMLU points unless you fine-tune to recover. Treat sparse TFLOPS as ceiling.
  4. Single-stream decode wastes 90 percent of your tensor cores. Batch your traffic with vLLM’s continuous batching or Triton’s dynamic batcher; do not run llama.cpp single-stream in production unless you must.
  5. L2 cache effects dominate for small models. Phi-3-mini and Qwen 2.5 0.5B can exceed the naive bandwidth ceiling because enough of the hot working set stays resident in the 72 MB L2 between accesses. Larger models (Llama 70B INT4 at 17 GB) see no such benefit.
  6. FP8 quality degradation is real on long contexts. E5M2 KV at 128k context can lose 1-2 perplexity points. Validate on your eval before production.
  7. Driver matters. CUDA 12.4+ and driver 550+ are required for the most recent FP8 kernels in vLLM 0.6.3+; older drivers silently fall back to FP16. A quick runtime check follows this list.
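
For gotcha 7, a sanity check using only calls that ship with PyTorch; the driver version itself still comes from nvidia-smi:

import torch

print(torch.version.cuda)                      # toolkit version: want 12.4+
print(torch.cuda.get_device_capability(0))     # (8, 9) on Ada / AD102
print(torch.cuda.get_device_name(0))           # e.g. NVIDIA GeForce RTX 4090
# Driver version: nvidia-smi --query-gpu=driver_version --format=csv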

Verdict and when to pick the 4090 24GB

Pick a 4090 24GB if:

  • Your model fits 24 GB at FP8 or AWQ INT4 (everything up to Llama 70B INT4, Mistral Small 3 24B INT4, Qwen 2.5 32B AWQ).
  • You serve 1-32 concurrent users, where the 4090 is within 10-50 percent of an H100 at one-eighth the cost.
  • You can use FP8 (Ada native) or AWQ Marlin kernels – this is where the per-pound TFLOPS lead translates into real product economics.
  • You want UK-hosted dedicated metal at a known monthly cost rather than per-second cloud billing surprises.

Skip the 4090 24GB if you need >24 GB VRAM (look at the 5090 32GB, A6000 Ada, or RTX 6000 Pro 96GB), if your batch size is consistently >64 (where H100 HBM3 pulls decisively ahead), or if you need NVLink for tensor parallelism (Ada drops it; consider the 3090 with NVLink for tightly coupled multi-GPU). For a tier map of the modern lineup see tier positioning 2026.

Bench-class throughput at consumer-class price

UK-hosted RTX 4090 24GB ready in minutes, vLLM and FP8 kernels pre-built. UK dedicated hosting.

Order the RTX 4090 24GB

See also: spec breakdown, FP8 tensor cores on Ada, GDDR6X bandwidth, tokens per watt, prefill vs decode benchmark, FP8 Llama deployment, all infrastructure posts.
