
RTX 4090 24GB TFLOPS: AI Benchmark Class Explained

A senior engineer's tour of the RTX 4090 24GB throughput envelope: dense and sparse FP32, BF16, FP16, FP8 and INT8 numbers, achievable utilisation per kernel class, and how it lands against A100, H100 and Blackwell on real workloads.

The RTX 4090 24GB is the highest TFLOPS-per-pound accelerator NVIDIA has ever shipped: on dense FP16 it edges out an A100, and on dense FP8 it sits a third of the way to an H100 SXM. The interesting question is not the marketing peak number on the box but what fraction of those teraflops actually reaches a real LLM, diffusion or fine-tune kernel. That depends on Ada's 4th-gen tensor cores, the 72 MB L2 (Ada's defining architectural change), and how far your workload is bandwidth-bound rather than maths-bound. Spin up a card from the RTX 4090 24GB hosting page or browse the wider dedicated GPU range first, then read on.

Dense theoretical TFLOPS by datatype

The Ada AD102 die powering the 4090 carries 16,384 active CUDA cores (out of 18,432 on the full die), 128 streaming multiprocessors and a typical observed boost clock of 2.55-2.6 GHz on a thermally healthy card. Every SM holds four 4th-generation tensor cores, and Ada is the first consumer generation to expose native FP8 (E4M3 and E5M2) arithmetic, doubling rates over FP16 dense. The plain CUDA-core FP32 figure is 82.6 TFLOPS, derived from 2 ops per clock per core (one fused multiply-add) at the official 2.52 GHz boost, the same counting rule the architecture has used since Pascal.

| Format | Dense TFLOPS | Sparse TFLOPS | Tensor cores | Accumulator |
|---|---|---|---|---|
| FP32 (CUDA cores) | 82.6 | n/a | No | FP32 |
| TF32 | 82.6 | 165.2 | Yes | FP32 |
| BF16 / FP16 (FP32 accum) | 165.2 | 330.3 | Yes | FP32 |
| FP16 (FP16 accum) | 330.3 | 660.6 | Yes | FP16 |
| FP8 E4M3 / E5M2 | 660.6 | 1321.2 | Yes | FP16/FP32 |
| INT8 | 660.6 TOPS | 1321.2 TOPS | Yes | INT32 |
| INT4 | 1321 TOPS | 2642 TOPS | Yes | INT32 |

The single most important row is the FP8 dense rate. 660 TFLOPS at FP8 is more than four times what a 3090 can muster at FP16 dense and within a factor of three of an H100 PCIe, achieved on a card with a £1,750 list price rather than £25,000. The accumulator column matters too: when an LLM kernel uses FP16 multiplies with an FP32 accumulator, the effective rate is the FP32-accum row (165 TFLOPS), not the FP16-accum 330 TFLOPS. Many older inference paths in PyTorch silently used FP32 accum until vLLM, FlashAttention and TensorRT-LLM closed the gap with explicit FP16-accum kernels.
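
As a sanity check, the whole dense ladder can be reproduced from the CUDA-core figure. A minimal sketch, assuming NVIDIA's official 2.52 GHz boost (observed clocks run a little higher) and the per-format doubling pattern in the table; the output matches the rows above to within rounding:

CUDA_CORES = 16_384
BOOST_HZ = 2.52e9                              # official boost; healthy cards clock higher

fp32 = CUDA_CORES * 2 * BOOST_HZ / 1e12        # one FMA = 2 ops/clock -> ~82.6
ladder = {
    "FP32 (CUDA cores)":      fp32,
    "BF16/FP16 (FP32 accum)": fp32 * 2,
    "FP16 (FP16 accum)":      fp32 * 4,
    "FP8 E4M3/E5M2":          fp32 * 8,
}
for fmt, dense in ladder.items():
    print(f"{fmt:24s} {dense:6.1f} dense / {dense * 2:6.1f} sparse TFLOPS")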

Sparsity acceleration and when it really applies

Ada inherits Ampere's 2:4 structured sparsity scheme. The pattern is rigid: of every four contiguous weights along the inner dimension of a matmul, exactly two must be zero. Tensor cores then skip the zero multiplies and double throughput. NVIDIA quotes the doubled figure as the headline number, but in practice almost no off-the-shelf LLM ships pre-pruned to 2:4, because the constraint costs 1-3 points on MMLU-class benchmarks unless you re-train.
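
To make the rigidity concrete, here is a minimal magnitude-based sketch of the 2:4 constraint in PyTorch. Illustrative only, not NVIDIA's apex ASP pruner; prune_2_4 is a hypothetical helper:

import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    # Zero the two smallest-magnitude weights in every contiguous group of
    # four along the inner (reduction) dimension, as 2:4 sparsity requires.
    rows, cols = w.shape                       # cols must be divisible by 4
    groups = w.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
print((prune_2_4(w) != 0).float().mean())      # exactly 0.50 density

Magnitude pruning like this is what costs the accuracy; the recovery fine-tune is what makes it usable.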

Where sparse TFLOPS do reach the workload: NVIDIA’s apex.contrib.sparsity pruner during a fine-tune; Sparse Marlin kernels (added to vLLM 0.6) when serving a checkpoint produced by NVIDIA’s TensorRT-LLM 2:4 quantiser; and Sparse FlashAttention if you accept the additional pruning step. Treat the 1.32 PetaOPS INT8 sparse number as a ceiling, not a forecast.

Bandwidth shapes the achievable fraction

A tensor core maths peak is only reachable when operand throughput keeps the cores fed. Prefill gets close because a large batched matmul re-uses each weight tile many times per byte fetched. Decode is the inverse case: every generated token streams the entire model through the bus, so 1008 GB/s of GDDR6X is the binding wall and the tensor cores idle for most cycles. See the GDDR6X bandwidth deep-dive for the per-token decode formula.
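
The formula itself is one line. A sketch under the assumption that weight traffic dominates (KV reads ignored) and nothing stays resident in cache:

def decode_ceiling_tps(weight_gb: float, bw_gbs: float = 1008.0) -> float:
    # Naive ceiling: every generated token streams all weight bytes once.
    return bw_gbs / weight_gb

print(decode_ceiling_tps(8.0))   # Llama 3.1 8B FP8, ~8 GB  -> ~126 t/s
print(decode_ceiling_tps(3.8))   # Phi-3-mini FP8, ~3.8 GB  -> ~265 t/s

Measured figures can land above this ceiling when the 72 MB L2 keeps tiles hot, which is exactly the Phi-3-mini effect discussed in the next section.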

4th-gen tensor cores and what changed from Ampere

Ada's 4th-gen tensor cores bring three changes over Ampere's 3rd-gen units that matter for AI workloads; the table below puts them alongside Hopper for context:

| Feature | Ampere (3rd-gen) | Ada (4th-gen) | Hopper (4th-gen) |
|---|---|---|---|
| FP8 native | No | Yes (E4M3 + E5M2) | Yes (E4M3 + E5M2) |
| Transformer Engine | No | Software fallback | Native scaling |
| L2 cache | 6 MB (GA102) | 72 MB | 50 MB |
| FP16 dense per SM | ~256 GFLOPS | ~256 GFLOPS | ~830 GFLOPS |
| SM count (top die) | 108 (A100) | 128 (4090) | 132 (H100) |

FP8 is the headline upgrade. The 12x larger L2 is the under-celebrated one: on Ada a small model's weights can sit hot in cache between layer accesses, which is why a Phi-3-mini FP8 model regularly hits 480 t/s on a 4090, well above the naive bandwidth ceiling of 265 t/s. The 72 MB L2 also boosts FlashAttention throughput, because attention tiles can be re-used across query blocks without round-tripping to VRAM. The FP8 tensor cores on Ada piece covers the kernel side in more detail.

A100, H100, L40S and the Blackwell 5090 in context

The numbers below are dense, peak, with default boost clock and standard tensor format. They are headline rates only; achievable percentages are in the next section.

| GPU | FP16 dense | FP16 sparse | FP8 dense | VRAM | Bandwidth |
|---|---|---|---|---|---|
| RTX 3090 24GB | 142 TFLOPS | 284 TFLOPS | n/a | 24 GB GDDR6X | 936 GB/s |
| RTX 4090 24GB | 330 TFLOPS | 660 TFLOPS | 660 TFLOPS | 24 GB GDDR6X | 1008 GB/s |
| RTX 5090 32GB | 419 TFLOPS | 838 TFLOPS | 838 TFLOPS | 32 GB GDDR7 | 1792 GB/s |
| A100 80GB SXM | 312 TFLOPS | 624 TFLOPS | n/a | 80 GB HBM2e | 2039 GB/s |
| H100 SXM | 989 TFLOPS | 1979 TFLOPS | 1979 TFLOPS | 80 GB HBM3 | 3350 GB/s |
| L40S | 362 TFLOPS | 725 TFLOPS | 725 TFLOPS | 48 GB GDDR6 ECC | 864 GB/s |

The 4090 beats an A100 80GB on dense FP16 by 6 percent, carries native FP8 that the entire Ampere line lacks, and on bandwidth sits at roughly half an A100. Compute is rarely the wall on a 4090; bandwidth is. The 5090 closes that gap with 1792 GB/s of GDDR7, which is why the 4090 vs 5090 decision hinges almost entirely on whether you need the extra 8 GB of VRAM and 78 percent more bandwidth, or can extract value from a card that costs less and ships in volume today.
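
To put the per-pound lead in numbers, a rough calculation from the list prices quoted earlier in this piece (£1,750 for the 4090, £25,000 for an H100; the H100 rate below is the SXM figure from the table, so treat the ratio as indicative only):

cards = {"RTX 4090": (660.6, 1_750), "H100": (1979.0, 25_000)}
for name, (fp8_dense_tflops, price_gbp) in cards.items():
    per_pound = fp8_dense_tflops / price_gbp * 1000
    print(f"{name:9s} {per_pound:5.0f} GFLOPS FP8 dense per GBP")

Roughly a five-fold FP8-per-pound lead on paper, before utilisation enters the picture.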

Real measured utilisation on production kernels

Theoretical TFLOPS are quoted; real utilisation is measured. The numbers below come from gigagpu.com production hosts running vLLM 0.6.x, FlashAttention 3 and the Marlin/Machete FP8 kernel families, captured with NVIDIA Nsight Compute and validated against nvidia-smi dmon SM-activity counters.

| Workload | Kernel class | % of dense peak | Bound by |
|---|---|---|---|
| vLLM prefill, Llama 3.1 8B FP16 | cuBLAS GEMM | ~70% | Tensor cores |
| vLLM decode, Llama 3.1 8B FP16, batch 1 | FlashAttention 3 | ~9% | GDDR6X bandwidth |
| vLLM decode, Llama 3.1 8B FP8, batch 32 | Marlin FP8 | ~36% | Mixed |
| vLLM prefill, Llama 3 70B AWQ INT4 | Marlin AWQ | ~62% | Tensor cores |
| SDXL UNet step, BF16 | cuDNN conv + GEMM | ~58% | Tensor cores |
| FLUX.1-dev FP16, 30-step | FlashAttention 3 + GEMM | ~52% | Mixed |
| QLoRA Llama 3.1 8B BF16, FA3 | Triton attention + cuBLAS | ~64% | Tensor cores |
| Whisper large-v3-turbo INT8 batched | cuBLASLt INT8 | ~48% | Encoder GEMM |

Two patterns are worth absorbing. First, prefill (large batched matmul) reaches 60-70 percent of peak across formats, while single-stream decode falls to single-digit utilisation because every token forces a full weight scan across the bus. Second, batched decode at batch 32 climbs back to a third of peak because weights are re-used across the batch, amortising the bandwidth cost. This is why concurrency matters so much for your effective TFLOPS-per-pound: idle hardware is wasted hardware.
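
The amortisation is easy to see on paper. A sketch assuming 8 GB of FP8 weights and an illustrative per-token KV read cost (both numbers are assumptions, not measurements):

def gb_per_token(weight_gb: float, kv_gb: float, batch: int) -> float:
    # Weights are read once per step and shared across the whole batch;
    # KV reads are per-sequence and do not amortise.
    return weight_gb / batch + kv_gb

for batch in (1, 8, 32):
    traffic = gb_per_token(8.0, 0.0001, batch)
    print(f"batch {batch:2d}: {traffic:6.3f} GB/token -> {1008 / traffic:6.0f} t/s aggregate ceiling")

Once weight traffic is amortised, the bound migrates from bandwidth towards the tensor cores, which is the "Mixed" regime in the table.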

Benchmark class: where the 4090 actually lands

By raw FP16 dense, the 4090 is in the A100 class. By FP8 dense, it is between an A100 (no native FP8) and an H100. By bandwidth, it is firmly in the consumer tier – 1008 GB/s versus 2-3.3 TB/s of HBM. For a small-batch interactive workload (1-8 concurrent users, 7-13B model, FP8 weights and FP8 KV) the 4090 will land within 10 percent of an H100 on tokens per second per user, at roughly one-eighth the rental cost.

| Workload | 4090 24GB | A100 80GB | H100 80GB | 4090/H100 |
|---|---|---|---|---|
| Llama 3.1 8B FP8 single-user decode | 195 t/s | 140 t/s (BF16 only) | 225 t/s | 0.87x |
| Llama 3.1 8B FP8 batch 32 aggregate | 1100 t/s | 1300 t/s | 2400 t/s | 0.46x |
| Llama 3.1 70B AWQ INT4 decode | 23 t/s | n/a (offload) | 40 t/s | 0.58x |
| SDXL 1024×1024 30-step | 2.0 s | 1.8 s | 1.4 s | 0.70x |
| FLUX.1-dev FP16 30-step | 6.0 s | 5.5 s | 3.8 s | 0.63x |
| Whisper large-v3-turbo INT8 | 80x RT | 105x RT | 140x RT | 0.57x |
| QLoRA Llama 8B (tok/s) | 14,000 | 16,000 | 22,000 | 0.64x |

Single-user decode is where the 4090 shines: the ratio against H100 is 0.87x because both cards are bandwidth-bound and the 4090’s 1008 GB/s is ~30 percent of H100’s 3350 GB/s, but the 72 MB L2 reclaims a lot of that gap on small models. As you scale to batch 32, H100’s HBM3 pulls ahead because there are more tokens to feed through the bus per unit time. Compare against a 4090 vs H100 head-to-head for a more granular split, or the 4090 vs A100 piece if you are migrating an Ampere fleet.

The kernel choice that determines your TFLOPS

What you launch matters as much as what you launch on. A representative vLLM startup for the bandwidth-bound case, with per-flag notes after the command:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

# --quantization fp8            halves bytes/token, doubles the bandwidth ceiling
# --kv-cache-dtype fp8          halves KV bytes per token
# --max-model-len 65536         leaves ~6 GB free for batch
# --max-num-seqs 32             forces batched decode -> tensor re-use
# --enable-chunked-prefill      caps p99 TTFT on long prompts
# --enable-prefix-caching       re-uses the system prompt across requests
# --gpu-memory-utilization 0.92 leaves headroom for cuBLAS workspace

The two FP8 flags are the headline. --quantization fp8 dispatches Marlin FP8 GEMM kernels that hit ~36 percent of dense FP8 peak in batched decode, versus the ~9 percent of FP16 dense peak an unquantised model manages at batch 1. The --max-num-seqs 32 flag is what makes batched tensor re-use feasible; without it, weights move from VRAM every token. --enable-prefix-caching helps because shared prompt prefixes deduplicate to a single set of KV blocks. See the full vLLM setup guide for the rest of the production flag set.
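
Once the server is up, a quick smoke test against its OpenAI-compatible endpoint. A minimal sketch in Python with requests; the model name and port match the launch command above:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with one word: ready"}],
        "max_tokens": 8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])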

For the maximum-quality path on a single 4090, the AWQ-INT4 70B kernel is the headline trick:

python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 8000

# --quantization awq_marlin     AWQ packed weights via the Marlin INT4 GEMM
# --kv-cache-dtype fp8          FP8 KV halves attention memory
# --max-model-len 16384         16k is the sustainable context target
# --max-num-seqs 4              a KV constraint, not a compute constraint
# --gpu-memory-utilization 0.95 pushed high because the weights are static

Marlin INT4 reaches roughly 62 percent of dense FP16 peak on prefill because it dequantises weights into FP16 registers on the fly, then runs the GEMM on the 4th-gen tensor cores at FP16 rates. The 70B-on-one-card configuration is unlocked entirely by AWQ Marlin plus FP8 KV; without either, the model falls off the card. There is a detailed walkthrough in the 70B INT4 deployment guide.
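
A rough Python picture of the groupwise dequantisation step, to show why group size matters. Illustrative only; the real Marlin kernel fuses this into the GEMM per tile, in registers, and dequant_int4 is a hypothetical helper:

import torch

def dequant_int4(q, scales, zeros, group_size=128):
    # q holds unpacked 4-bit values 0..15; scales and zeros are stored per
    # output row and per group of `group_size` input channels.
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (g - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    return w.reshape(out_f, in_f).half()       # FP16 operand for tensor cores

q = torch.randint(0, 16, (4096, 4096))
scales = torch.rand(4096, 32)                  # 4096 / 128 = 32 groups
zeros = torch.full((4096, 32), 8.0)
w16 = dequant_int4(q, scales, zeros)

The group_size=128 granularity here is the same one gotcha 2 below warns about.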

Production gotchas

  1. FP32 accumulator silently halves your FP16 throughput. Stock PyTorch nn.Linear uses FP32 accum. Switch to torch.compile or vLLM/FlashAttention paths to get the 330 TFLOPS FP16-accum number.
  2. Marlin requires AWQ checkpoints with group_size=128. Other group sizes fall back to slower kernels (~50 percent throughput drop).
  3. Sparsity is not free. A 2:4 pruned checkpoint loses 1-3 MMLU points unless you fine-tune to recover. Treat sparse TFLOPS as ceiling.
  4. Single-stream decode wastes 90 percent of your tensor cores. Batch your traffic with vLLM’s continuous batching or Triton’s dynamic batcher; do not run llama.cpp single-stream in production unless you must.
  5. L2 cache effects dominate for small models. Phi-3-mini and Qwen 2.5 0.5B can exceed the naive bandwidth ceiling because enough of the hot working set stays resident in the 72 MB L2 between accesses. Larger models (Llama 70B INT4 at 17 GB) see no such benefit.
  6. FP8 quality degradation is real on long contexts. E5M2 KV at 128k context can lose 1-2 perplexity points. Validate on your eval before production.
  7. Driver matters. CUDA 12.4+ and driver 550+ are required for the most recent FP8 kernels in vLLM 0.6.3+; older drivers silently fall back to FP16. A quick runtime check follows this list.
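
For gotcha 7, a sanity check using only calls that ship with PyTorch; the driver version itself still comes from nvidia-smi:

import torch

print(torch.version.cuda)                      # toolkit version: want 12.4+
print(torch.cuda.get_device_capability(0))     # (8, 9) on Ada / AD102
print(torch.cuda.get_device_name(0))           # e.g. NVIDIA GeForce RTX 4090
# Driver version: nvidia-smi --query-gpu=driver_version --format=csv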

Verdict and when to pick the 4090 24GB

Pick a 4090 24GB if:

  • Your model fits 24 GB at FP8 or AWQ INT4 (everything up to Llama 70B INT4, Mistral Small 3 24B INT4, Qwen 2.5 32B AWQ).
  • You serve 1-32 concurrent users, where the 4090 is within 10-50 percent of an H100 at one-eighth the cost.
  • You can use FP8 (Ada native) or AWQ Marlin kernels – this is where the per-pound TFLOPS lead translates into real product economics.
  • You want UK-hosted dedicated metal at a known monthly cost rather than per-second cloud billing surprises.

Skip the 4090 24GB if you need >24 GB VRAM (look at the 5090 32GB, A6000 Ada, or RTX 6000 Pro 96GB), if your batch size is consistently >64 (where H100 HBM3 pulls decisively ahead), or if you need NVLink for tensor parallelism (Ada drops it; consider the 3090 with NVLink for tightly coupled multi-GPU). For a tier map of the modern lineup see tier positioning 2026.

Bench-class throughput at consumer-class price

UK-hosted RTX 4090 24GB ready in minutes, vLLM and FP8 kernels pre-built. UK dedicated hosting.

Order the RTX 4090 24GB

See also: spec breakdown, FP8 tensor cores on Ada, GDDR6X bandwidth, tokens per watt, prefill vs decode benchmark, FP8 Llama deployment, all infrastructure posts.
