
RTX 4090 24GB GDDR6X 1008 GB/s Bandwidth Explained

A senior engineer's tour of the RTX 4090's 1008 GB/s GDDR6X bus, the 72 MB Ada L2 cache, the bandwidth-bound decode formula, and how it lands against Ampere, HBM2e, HBM3 and Blackwell GDDR7 on real LLM and diffusion workloads.

For autoregressive LLM decode the RTX 4090 24GB is bandwidth-bound, not compute-bound, which makes the 1008 GB/s GDDR6X bus the single most important number on the spec sheet. Order it on UK dedicated GPU hosting and the bus, plus Ada’s 72 MB L2 cache, will dictate how many tokens per second per stream you can extract for any given model. This piece walks through the bandwidth fundamentals, the per-token decode formula, the role of the L2 cache, how Ada differs from Ampere on memory hierarchy, and the worked numbers for every workload class that actually lands on a 4090 in production.


GDDR6X spec and the 384-bit bus

The 4090 ships with twelve 2 GB Micron GDDR6X chips on a 384-bit bus running at 21 Gbps per pin. GDDR6X is GDDR6 with PAM4 signalling: each symbol carries two bits instead of one, and the I/O voltage drops to 1.35 V from GDDR6’s 1.45 V. The result is 21 Gbps of effective per-pin throughput at half the symbol rate, with reduced power per bit transferred.

Parameter | Value | Context
Memory type | GDDR6X (Micron) | PAM4 signalling, 1.35 V
Capacity | 24 GB | 12 chips x 2 GB, single-sided (no clamshell)
Speed | 21 Gbps per pin | Up from 19.5 Gbps on the 3090
Bus width | 384-bit | 12 chips x 32-bit channels
Theoretical bandwidth | 1008 GB/s | 21 x 384 / 8
L2 cache | 72 MB | Up from 6 MB on the 3090
L1 / SMEM per SM | 128 KB | Same as Ampere
Memory voltage | 1.35 V | vs GDDR6 at 1.45 V
ECC | No (consumer) | RTX 6000 Ada has it

The headline 1008 GB/s is a theoretical peak; sustained read bandwidth measured under a hot LLM decode workload typically lands at 920-960 GB/s, with 940 GB/s a reasonable design figure for capacity planning. The remainder is consumed by refresh, write turnaround, and DMA descriptor overhead.
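As a sanity check, the peak figure falls straight out of the pin rate and bus width quoted above; a minimal back-of-envelope sketch using only the table's numbers:

# Theoretical peak from pin rate and bus width
pin_gbps = 21                      # GDDR6X effective rate per pin
bus_bits = 384                     # 12 chips x 32-bit channels
theoretical_gbs = pin_gbps * bus_bits / 8
print(theoretical_gbs)             # 1008.0 GB/s

# Derating to the sustained figure used for capacity planning
sustained_gbs = 940
print(f"{sustained_gbs / theoretical_gbs:.0%} of peak")   # ~93%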

Why bandwidth dominates LLM decode

During autoregressive decode the model produces one token per forward pass per active request. For each token the GPU must stream every weight (because the matmul reads them all once) and every cached KV value (because attention reads them all once) through the tensor cores. Compute per token is trivial: an 8B model needs about 16 GFLOPs of work per token, which a 4090 can chew through in 24 microseconds at full FP8 utilisation. But moving the 8 GB of FP8 weights across the 1008 GB/s GDDR6X bus takes 8 milliseconds. The compute fraction is approximately 0.3 percent; the rest is data movement.
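The per-token split, written out with the same figures (all numbers come from the text; this is arithmetic, not a benchmark):

# Per-token decode budget for an 8B model in FP8 on a 4090
flops_per_token = 2 * 8e9          # ~2 FLOPs per parameter per token = 16 GFLOPs
fp8_tflops     = 660e12            # dense FP8 peak
weight_bytes   = 8e9               # 8B parameters at 1 byte each
bus_bytes_s    = 1008e9            # theoretical GDDR6X bandwidth

compute_s = flops_per_token / fp8_tflops    # ~24 microseconds
stream_s  = weight_bytes / bus_bytes_s      # ~8 milliseconds
print(f"compute {compute_s*1e6:.0f} us, weight streaming {stream_s*1e3:.1f} ms")
print(f"compute fraction ~{compute_s / stream_s:.1%}")   # ~0.3%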

This is why FP8 doubles decode throughput even though the FP8 tensor cores are “only” 2x faster than FP16: the bottleneck is bandwidth, not maths. Halving bytes per parameter halves time-on-bus per token, doubling tokens per second. The same applies to KV cache quantisation. See FP8 tensor cores on Ada for the kernel side of that story.

For prefill the picture inverts. A long prompt arrives all at once and the matmul reuses each weight across many query positions, which keeps tensor cores fed. Prefill on a 4090 runs at 60-70 percent of dense FP8 peak; decode runs at 8-12 percent. This is why prefill vs decode benchmarks always show such different shapes for the same hardware.
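One way to see the split is arithmetic intensity against the card's balance point. A rough sketch, treating the number of query positions each weight serves as the reuse factor; the 2-FLOPs-per-parameter figure is the usual rule of thumb, not a measurement:

fp8_tflops    = 660e12
sustained_bps = 940e9
balance = fp8_tflops / sustained_bps      # ~700 FLOPs per byte moved

def flops_per_byte(reuse_positions, bytes_per_param=1.0):
    # ~2 FLOPs per parameter per query position; each parameter read once per pass
    return 2 * reuse_positions / bytes_per_param

print(flops_per_byte(1))      # decode: 2 FLOPs/byte, far below ~700 -> bandwidth-bound
print(flops_per_byte(8192))   # 8k-token prefill: ~16k FLOPs/byte    -> compute-bound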

The 72 MB L2 cache and Ada’s memory hierarchy

The defining architectural change from Ampere to Ada is the L2 cache. The RTX 3090 ships with 6 MB; the 4090 ships with 72 MB, a 12x jump. NVIDIA followed AMD’s Infinity Cache strategy here: when you cannot afford HBM3 in the consumer envelope, push more cache on-die so that bandwidth-sensitive workloads hit a much larger working set without going to GDDR.

Layer | 4090 size | Latency (cycles) | Bandwidth | Notes
Registers (per SM) | 256 KB | 1 | n/a | 65,536 32-bit regs / SM
L1 + SMEM (per SM) | 128 KB | ~25 | ~21 TB/s aggregate | Configurable split
L2 cache | 72 MB | ~200 | ~5 TB/s | 12x Ampere, the dominant change
GDDR6X | 24 GB | ~400 | 1008 GB/s | The wall for decode
PCIe Gen 4 x16 | system RAM | ~2000 | ~26 GB/s | Cold path only

The L2 effect dominates for small models. A 3.8B parameter model at FP8 occupies 3.8 GB, which does not fit in 72 MB of L2 in one piece, but FlashAttention 3’s tile reuse means that within a single forward pass the KV blocks for the active attention window can sit hot in L2. Phi-3 mini at FP8 measures 480 t/s on a 4090, well above the naive 940 / 3.8 ≈ 247 t/s bandwidth ceiling, because the attention path benefits from L2 reuse across query positions in the batch. The same trick lifts batched decode throughput on Llama 3.1 8B: at batch 32, weights are reused across the batch and the KV blocks for adjacent sequences stay resident in L2 together.

For large models (Llama 70B AWQ INT4 at 17 GB, Mixtral 8x7B AWQ at 25 GB) the L2 is too small to matter and decode falls back to the bandwidth ceiling. This is why the bandwidth wall hits 70B harder than 8B: not just because there is more weight to move, but because L2 cannot soften the blow.

The decode bandwidth formula

For a single-stream workload the binding inequality is:

tokens_per_second < sustained_bandwidth_GBs / model_bytes_GB

# Worked: Llama 70B AWQ INT4 (17 GB weights + KV streaming)
940 / 17 = 55.3 t/s naive ceiling
real measured: ~23 t/s
gap = KV streaming + dequant overhead + activation traffic

# Worked: Llama 8B FP8 (8 GB weights, FP8 KV)
940 / 8 = 117 t/s naive ceiling
real measured: 198 t/s
gap = L2 reuse, FA3 tile efficiency

# Worked: Phi-3 mini FP8 (3.8 GB)
940 / 3.8 = 247 t/s naive ceiling
real measured: 480 t/s
gap = L2 holds large fraction of model hot

The headline takeaway is that pushing weights from FP16 to FP8 or INT4 roughly doubles or quadruples the bandwidth ceiling, because you halve or quarter the bytes that must cross the bus per token. The second takeaway is that the naive ceiling is a starting point, not a finish line: real measured throughput differs by a factor of 0.4 to 2.0 depending on how much of the working set stays hot in L2 and how well the kernel reuses tiles.
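The same inequality in reusable form, using the 940 GB/s sustained figure; a sketch for quick sizing, with real throughput then landing somewhere in the 0.4-2.0x band described above:

SUSTAINED_GBS = 940

def decode_ceiling_tps(model_gb, sustained_gbs=SUSTAINED_GBS):
    # Upper bound on single-stream tokens/s if every weight byte crosses the bus once per token
    return sustained_gbs / model_gb

for name, gb in [("Llama 70B AWQ INT4", 17), ("Llama 8B FP8", 8), ("Phi-3 mini FP8", 3.8)]:
    print(f"{name}: {decode_ceiling_tps(gb):.1f} t/s naive ceiling")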

Batched decode and the amortisation effect

For batched decode the formula generalises: each weight is streamed once and reused across B sequences, so per-sequence bandwidth cost drops by 1/B until the KV cache becomes the dominant traffic. For Llama 3.1 8B FP8 at batch 32 the aggregate is ~1100 t/s, or 34 t/s per stream; the per-stream rate has fallen because the KV cache for 32 streams is now larger than the L2 and competes with weight streaming for HBM bandwidth.

Llama 3.1 8B FP8 | Aggregate t/s | Per-stream t/s | TTFT (8k prompt)
Batch 1 | 198 | 198 | 880 ms
Batch 8 | 880 | 110 | 200 ms (queue depth)
Batch 32 | 1100 | 34 | 530 ms
Batch 64 | 1140 | 18 | 880 ms

The aggregate plateau at batch 32-64 is the bandwidth wall: KV traffic plus weight streaming saturates 1008 GB/s. Pushing batch higher only worsens per-stream latency without lifting aggregate throughput. See concurrent users for sizing detail.
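The shape of that plateau can be sketched with the per-step traffic model behind the formula above: each decode step streams the weights once plus the KV cache for every active sequence. The weight figure is the article's 8 GB; the per-sequence KV read per step is an illustrative assumption (roughly 1 GB, about 16k tokens of FP8 KV on an 8B model), so the printout shows the shape of the wall rather than reproducing the measured table:

SUSTAINED_GBS = 940
WEIGHTS_GB    = 8.0     # Llama 3.1 8B FP8
KV_GB_PER_SEQ = 1.0     # assumed KV read per sequence per decode step; grows with context

def per_stream_gb(batch):
    # weights are amortised across the batch, KV traffic is not
    return WEIGHTS_GB / batch + KV_GB_PER_SEQ

def aggregate_tps_ceiling(batch):
    return batch * SUSTAINED_GBS / (WEIGHTS_GB + batch * KV_GB_PER_SEQ)

for b in (1, 8, 32, 64):
    print(f"batch {b:>2}: {per_stream_gb(b):.2f} GB/token per stream, "
          f"~{aggregate_tps_ceiling(b):.0f} t/s aggregate ceiling")
# the aggregate asymptote is SUSTAINED_GBS / KV_GB_PER_SEQ: KV traffic alone sets the wall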

Worked examples by model and quant

Model | Format | Weight bytes | BW ceiling | Real t/s | L2 effect
Llama 3.1 8B | FP16 | 16 GB | 59 t/s | 95 | +60% from L2
Llama 3.1 8B | FP8 | 8 GB | 117 t/s | 198 | +69% from L2
Llama 3.1 8B | AWQ INT4 | 4.5 GB | 209 t/s | 225 | Modest, near ceiling
Mistral 7B v0.3 | FP8 | 7.25 GB | 130 t/s | 215 | Sliding window helps
Mistral Nemo 12B | FP8 | 12.2 GB | 77 t/s | 145 | +88% from FA3 + L2
Llama 3.1 70B | AWQ INT4 | 17 GB | 55 t/s | 23 | None, model too large
Phi-3-mini 3.8B | FP8 | 3.8 GB | 247 t/s | 480 | +94% from full L2 hits
Qwen 2.5 7B | FP8 | 7 GB | 134 t/s | 210 | +57% from L2
Mixtral 8x7B | AWQ | 25 GB | 37 t/s | ~35 | None, sparse activation helps elsewhere

Two patterns matter. First, smaller models exceed their naive bandwidth ceiling significantly because of L2 reuse: for the smallest FP8 models a large share of the active layer’s weights and KV blocks is still resident in the 72 MB L2 when it is needed again, and for slightly larger models the hot subset of the working set fits. Second, the 70B model falls below its naive ceiling because the KV cache competes with weight streaming for the same bandwidth, and the model is far too large for L2 to soften the blow.
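The L2 effect column is just measured throughput over the naive ceiling, minus one; recomputing a few rows from the article's own numbers (small differences come from the rounded ceilings):

rows = [("Llama 3.1 8B FP16", 59, 95), ("Llama 3.1 8B FP8", 117, 198),
        ("Mistral Nemo 12B FP8", 77, 145), ("Phi-3-mini FP8", 247, 480),
        ("Llama 3.1 70B INT4", 55, 23)]
for name, ceiling_tps, measured_tps in rows:
    print(f"{name}: {measured_tps / ceiling_tps - 1:+.0%} vs naive ceiling")
# the 70B row comes out negative: KV streaming and dequant push it below the ceiling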

Compared to GDDR6, GDDR7, HBM2e and HBM3

GPU | VRAM | Type | Bandwidth | L2 | Decode regime
RTX 3090 24GB | 24 GB | GDDR6X 19.5 Gbps | 936 GB/s | 6 MB | BW-bound, no L2 lift
RTX 4090 24GB | 24 GB | GDDR6X 21 Gbps | 1008 GB/s | 72 MB | BW-bound, large L2 lift
RTX 5090 32GB | 32 GB | GDDR7 28 Gbps | 1792 GB/s | 96 MB | BW-bound, much higher ceiling
RTX 5060 Ti 16GB | 16 GB | GDDR7 28 Gbps | 448 GB/s | 32 MB | BW-bound, low ceiling
A100 40GB | 40 GB | HBM2e | 1555 GB/s | 40 MB | BW-bound, no FP8
A100 80GB | 80 GB | HBM2e | 2039 GB/s | 40 MB | BW-bound, no FP8
H100 SXM 80GB | 80 GB | HBM3 | 3350 GB/s | 50 MB | BW + compute mix
RTX 6000 Pro 96GB | 96 GB | GDDR7 ECC | 1792 GB/s | 128 MB | BW-bound, large model headroom

The 4090’s 1008 GB/s is modest next to HBM, but the 72 MB L2 is the largest in the chart aside from Blackwell’s 96-128 MB. For inference workloads with high temporal locality (small-batch decode of mid-size models) the L2 advantage offsets a lot of the raw bandwidth gap. An A100 80GB has twice the bandwidth but only 40 MB of L2 and no FP8 path; against the H100, single-user Llama 3.1 8B FP8 decode puts the 4090 within 13 percent (198 vs 225 t/s) despite it having roughly 30 percent of the H100’s memory bandwidth, because the L2 lift and the FP8 path together close most of the gap. See 4090 vs H100 80GB for the head-to-head and 4090 vs 3090 for the generational jump from the same VRAM tier.

Production gotchas

  1. The headline 1008 GB/s is theoretical peak. Sustained reads in production land at 920-960 GB/s. Use 940 GB/s as your sizing figure.
  2. Power capping kills bandwidth. Below ~350 W the GDDR6X clocks down, dropping sustained bandwidth to ~840 GB/s. Hold the card at ≥400 W for full bandwidth (see power draw efficiency).
  3. L2 hit rate is workload-dependent. Phi-3 mini hits 85+ percent L2 on the attention path; Llama 70B hits maybe 5 percent. Profile with ncu --metrics lts__t_sector_hit_rate.pct to know.
  4. FP16 KV at long context starves the bandwidth. A 4090 at 32k context with FP16 KV spends 30-40 percent of bandwidth on KV streaming. Switch to --kv-cache-dtype fp8 to halve that and recover decode throughput (a rough sizing sketch follows this list).
  5. Memory pads matter. 24/7 GDDR6X workloads can crack original Micron pads after 12-18 months. Production hosts repad with Honeywell PTM7950 to keep memory junction below 95 °C.
  6. nvidia-smi memory utilisation lies. The “Memory-Util” column reports the fraction of time the controller is active, not the fraction of bandwidth used. Use dcgmi dmon -e 1005 (DCGM DRAM active) for a truer read on how much of the bus is actually busy.
  7. Two streams on one card share bandwidth. Running two vLLM processes on a single 4090 cuts each one’s effective bandwidth roughly in half. Pin requests to a single vLLM instance with continuous batching instead.
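A rough sizing of gotcha 4, using Llama 3.1 8B's published attention shape (32 layers, 8 KV heads, head dim 128) and the article's 8 GB FP8 weight figure; a sketch of where the decode bandwidth goes, not a profile:

LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # Llama 3.1 8B attention shape
WEIGHTS_GB = 8.0                          # FP8 weights

def kv_share_of_traffic(context_tokens, kv_bytes_per_value):
    kv_per_position = 2 * LAYERS * KV_HEADS * HEAD_DIM * kv_bytes_per_value   # K and V
    kv_gb = kv_per_position * context_tokens / 1e9    # read once per decoded token
    return kv_gb / (kv_gb + WEIGHTS_GB)

print(f"32k context, FP16 KV: {kv_share_of_traffic(32_768, 2):.0%} of decode traffic")  # ~35%
print(f"32k context, FP8  KV: {kv_share_of_traffic(32_768, 1):.0%} of decode traffic")  # ~21%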

Verdict and when bandwidth is the wall

The 4090’s 1008 GB/s GDDR6X bus is the binding constraint for any LLM decode workload on the card. Compute, with 660 dense FP8 TFLOPS on tap, is rarely the limiting factor. Practically that means: optimise for the smallest format your eval allows (FP8 weights and FP8 KV are the default sweet spot), batch traffic with vLLM continuous batching to amortise weight streaming across sequences, and treat the L2 cache as the under-celebrated feature that lifts small-model throughput well above its naive ceiling. For a 12-engineer coding team running Llama 3.1 8B FP8, the practical envelope is 32 concurrent active streams at 1100 t/s aggregate, which is more than the team will sustain in working hours.

The 4090 is bandwidth-tier 4 (1 TB/s class). Tier 5 is GDDR7 (1.8 TB/s on the 5090, see the 5090 comparison). Tier 6 is HBM3 (3.3 TB/s on H100). If your workload runs into the bandwidth wall on a 4090 today and you cannot squeeze more out of FP8 or AWQ, the next economically sensible step is the 5090 32GB, not the H100; the 78 percent bandwidth jump and 8 GB extra VRAM cover most of the gap at a fraction of the rental cost. See the 4090 or 5090 decision piece for the trade-off.

1008 GB/s of decode bandwidth, hosted in the UK

Full 384-bit GDDR6X, 72 MB Ada L2, FP8 kernels pre-built. UK dedicated hosting.

Order the RTX 4090 24GB

See also: benchmark class, 4090 vs 3090, 4090 vs 5090, FP8 tensor cores, 8B LLM VRAM requirements, power draw efficiency, prefill vs decode.
