
RTX 4090 24GB GDDR6X 1008 GB/s Bandwidth Explained

A senior engineer's tour of the RTX 4090's 1008 GB/s GDDR6X bus, the 72 MB Ada L2 cache, the bandwidth-bound decode formula, and how it lands against Ampere, HBM2e, HBM3 and Blackwell GDDR7 on real LLM and diffusion workloads.

For autoregressive LLM decode the RTX 4090 24GB is bandwidth-bound, not compute-bound, which makes the 1008 GB/s GDDR6X bus the single most important number on the spec sheet. Order it on UK dedicated GPU hosting and the bus, plus Ada’s 72 MB L2 cache, will dictate how many tokens per second per stream you can extract for any given model. This piece walks through the bandwidth fundamentals, the per-token decode formula, the role of the L2 cache, how Ada differs from Ampere on memory hierarchy, and the worked numbers for every workload class that actually lands on a 4090 in production.


GDDR6X spec and the 384-bit bus

The 4090 ships with twelve 2 GB Micron GDDR6X chips on a 384-bit bus running at 21 Gbps per pin. GDDR6X is GDDR6 with PAM4 signalling: each symbol carries two bits instead of one, and the I/O voltage drops to 1.35 V from GDDR6’s 1.45 V. The result is 21 Gbps of effective per-pin throughput at half the symbol rate, with reduced power per bit transferred.

Parameter | Value | Context
Memory type | GDDR6X (Micron) | PAM4 signalling, 1.35 V
Capacity | 24 GB | 12 chips x 2 GB, single-sided (no clamshell)
Speed | 21 Gbps per pin | Up from 19.5 Gbps on the 3090
Bus width | 384-bit | 12 chips x 32-bit channels
Theoretical bandwidth | 1008 GB/s | 21 x 384 / 8
L2 cache | 72 MB | Up from 6 MB on the 3090
L1 / SMEM per SM | 128 KB | Same as Ampere
Memory voltage | 1.35 V | vs GDDR6 at 1.45 V
ECC | No (consumer) | RTX 6000 Ada has it

The headline 1008 GB/s is a theoretical peak; sustained read bandwidth measured under a hot LLM decode workload typically lands at 920-960 GB/s, with 940 GB/s a reasonable design figure for capacity planning. The remainder is consumed by refresh, write turnaround, and DMA descriptor overhead.
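As a sanity check, the peak figure falls straight out of the pin rate and bus width quoted above; a minimal back-of-envelope sketch using only the table's numbers:

# Theoretical peak from pin rate and bus width
pin_gbps = 21                      # GDDR6X effective rate per pin
bus_bits = 384                     # 12 chips x 32-bit channels
theoretical_gbs = pin_gbps * bus_bits / 8
print(theoretical_gbs)             # 1008.0 GB/s

# Derating to the sustained figure used for capacity planning
sustained_gbs = 940
print(f"{sustained_gbs / theoretical_gbs:.0%} of peak")   # ~93%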

Why bandwidth dominates LLM decode

During autoregressive decode the model produces one token per forward pass per active request. For each token the GPU must stream every weight (because the matmul reads them all once) and every cached KV value (because attention reads them all once) through the tensor cores. Compute per token is trivial: an 8B model needs about 16 GFLOPs of work per token, which a 4090 can chew through in 24 microseconds at full FP8 utilisation. But moving the 8 GB of FP8 weights across the 1008 GB/s GDDR6X bus takes 8 milliseconds. The compute fraction is approximately 0.3 percent; the rest is data movement.
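The per-token split, written out with the same figures (all numbers come from the text; this is arithmetic, not a benchmark):

# Per-token decode budget for an 8B model in FP8 on a 4090
flops_per_token = 2 * 8e9          # ~2 FLOPs per parameter per token = 16 GFLOPs
fp8_tflops     = 660e12            # dense FP8 peak
weight_bytes   = 8e9               # 8B parameters at 1 byte each
bus_bytes_s    = 1008e9            # theoretical GDDR6X bandwidth

compute_s = flops_per_token / fp8_tflops    # ~24 microseconds
stream_s  = weight_bytes / bus_bytes_s      # ~8 milliseconds
print(f"compute {compute_s*1e6:.0f} us, weight streaming {stream_s*1e3:.1f} ms")
print(f"compute fraction ~{compute_s / stream_s:.1%}")   # ~0.3%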

This is why FP8 doubles decode throughput even though the FP8 tensor cores are “only” 2x faster than FP16: the bottleneck is bandwidth, not maths. Halving bytes per parameter halves time-on-bus per token, doubling tokens per second. The same applies to KV cache quantisation. See FP8 tensor cores on Ada for the kernel side of that story.

For prefill the picture inverts. A long prompt arrives all at once and the matmul reuses each weight across many query positions, which keeps tensor cores fed. Prefill on a 4090 runs at 60-70 percent of dense FP8 peak; decode runs at 8-12 percent. This is why prefill vs decode benchmarks always show such different shapes for the same hardware.
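One way to see the split is arithmetic intensity against the card's balance point. A rough sketch, treating the number of query positions each weight serves as the reuse factor; the 2-FLOPs-per-parameter figure is the usual rule of thumb, not a measurement:

fp8_tflops    = 660e12
sustained_bps = 940e9
balance = fp8_tflops / sustained_bps      # ~700 FLOPs per byte moved

def flops_per_byte(reuse_positions, bytes_per_param=1.0):
    # ~2 FLOPs per parameter per query position; each parameter read once per pass
    return 2 * reuse_positions / bytes_per_param

print(flops_per_byte(1))      # decode: 2 FLOPs/byte, far below ~700 -> bandwidth-bound
print(flops_per_byte(8192))   # 8k-token prefill: ~16k FLOPs/byte    -> compute-bound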

The 72 MB L2 cache and Ada’s memory hierarchy

The defining architectural change from Ampere to Ada is the L2 cache. The RTX 3090 ships with 6 MB; the 4090 ships with 72 MB, a 12x jump. NVIDIA followed AMD’s Infinity Cache strategy here: when you cannot afford HBM3 in the consumer envelope, push more cache on-die so that bandwidth-sensitive workloads hit a much larger working set without going to GDDR.

Layer | 4090 size | Latency (cycles) | Bandwidth | Notes
Registers (per SM) | 256 KB | 1 | n/a | 65,536 32-bit regs / SM
L1 + SMEM (per SM) | 128 KB | ~25 | ~21 TB/s aggregate | Configurable split
L2 cache | 72 MB | ~200 | ~5 TB/s | 12x Ampere, the dominant change
GDDR6X | 24 GB | ~400 | 1008 GB/s | The wall for decode
PCIe Gen 4 x16 | system RAM | ~2000 | ~26 GB/s | Cold path only

The L2 effect dominates for small models. A 3.8B parameter model at FP8 occupies 3.8 GB, which does not fit in 72 MB of L2 in one piece, but FlashAttention 3’s tile reuse means that within a single forward pass the KV blocks for the active attention window can sit hot in L2. Phi-3 mini at FP8 measures 480 t/s on a 4090, well above the naive 940 / 3.8 ≈ 247 t/s bandwidth ceiling, because the attention path benefits from L2 reuse across query positions in the batch. The same trick lifts batched decode throughput on Llama 3.1 8B: at batch 32, weights are reused across the batch and the KV blocks for adjacent sequences stay resident in L2 together.

For large models (Llama 70B AWQ INT4 at 17 GB, Mixtral 8x7B AWQ at 25 GB) the L2 is too small to matter and decode falls back to the bandwidth ceiling. This is why the bandwidth wall hits 70B harder than 8B: not just because there is more weight to move, but because L2 cannot soften the blow.

The decode bandwidth formula

For a single-stream workload the binding inequality is:

tokens_per_second < sustained_bandwidth_GBs / model_bytes_GB

# Worked: Llama 70B AWQ INT4 (17 GB weights + KV streaming)
940 / 17 = 55.3 t/s naive ceiling
real measured: ~23 t/s
gap = KV streaming + dequant overhead + activation traffic

# Worked: Llama 8B FP8 (8 GB weights, FP8 KV)
940 / 8 = 117 t/s naive ceiling
real measured: 198 t/s
gap = L2 reuse, FA3 tile efficiency

# Worked: Phi-3 mini FP8 (3.8 GB)
940 / 3.8 = 247 t/s naive ceiling
real measured: 480 t/s
gap = L2 holds large fraction of model hot

The headline takeaway is that pushing weights from FP16 to FP8 or INT4 roughly doubles or quadruples the bandwidth ceiling, because you halve or quarter the bytes that must cross the bus per token. The second takeaway is that the naive ceiling is a starting point, not a finish line: real measured throughput differs by a factor of 0.4 to 2.0 depending on how much of the working set stays hot in L2 and how well the kernel reuses tiles.
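The same inequality in reusable form, using the 940 GB/s sustained figure; a sketch for quick sizing, with real throughput then landing somewhere in the 0.4-2.0x band described above:

SUSTAINED_GBS = 940

def decode_ceiling_tps(model_gb, sustained_gbs=SUSTAINED_GBS):
    # Upper bound on single-stream tokens/s if every weight byte crosses the bus once per token
    return sustained_gbs / model_gb

for name, gb in [("Llama 70B AWQ INT4", 17), ("Llama 8B FP8", 8), ("Phi-3 mini FP8", 3.8)]:
    print(f"{name}: {decode_ceiling_tps(gb):.1f} t/s naive ceiling")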

Batched decode and the amortisation effect

For batched decode the formula generalises: each weight is streamed once and reused across B sequences, so per-sequence bandwidth cost drops by 1/B until the KV cache becomes the dominant traffic. For Llama 3.1 8B FP8 at batch 32 the aggregate is ~1100 t/s, or 34 t/s per stream; the per-stream rate has fallen because the KV cache for 32 streams is now larger than the L2 and competes with weight streaming for HBM bandwidth.

Llama 3.1 8B FP8 | Aggregate t/s | Per-stream t/s | TTFT (8k prompt)
Batch 1 | 198 | 198 | 880 ms
Batch 8 | 880 | 110 | 200 ms (queue depth)
Batch 32 | 1100 | 34 | 530 ms
Batch 64 | 1140 | 18 | 880 ms

The aggregate plateau at batch 32-64 is the bandwidth wall: KV traffic plus weight streaming saturates 1008 GB/s. Pushing batch higher only worsens per-stream latency without lifting aggregate throughput. See concurrent users for sizing detail.
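The shape of that plateau can be sketched with the per-step traffic model behind the formula above: each decode step streams the weights once plus the KV cache for every active sequence. The weight figure is the article's 8 GB; the per-sequence KV read per step is an illustrative assumption (roughly 1 GB, about 16k tokens of FP8 KV on an 8B model), so the printout shows the shape of the wall rather than reproducing the measured table:

SUSTAINED_GBS = 940
WEIGHTS_GB    = 8.0     # Llama 3.1 8B FP8
KV_GB_PER_SEQ = 1.0     # assumed KV read per sequence per decode step; grows with context

def per_stream_gb(batch):
    # weights are amortised across the batch, KV traffic is not
    return WEIGHTS_GB / batch + KV_GB_PER_SEQ

def aggregate_tps_ceiling(batch):
    return batch * SUSTAINED_GBS / (WEIGHTS_GB + batch * KV_GB_PER_SEQ)

for b in (1, 8, 32, 64):
    print(f"batch {b:>2}: {per_stream_gb(b):.2f} GB/token per stream, "
          f"~{aggregate_tps_ceiling(b):.0f} t/s aggregate ceiling")
# the aggregate asymptote is SUSTAINED_GBS / KV_GB_PER_SEQ: KV traffic alone sets the wall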

Worked examples by model and quant

Model | Format | Weight bytes | BW ceiling | Real t/s | L2 effect
Llama 3.1 8B | FP16 | 16 GB | 59 t/s | 95 | +60% from L2
Llama 3.1 8B | FP8 | 8 GB | 117 t/s | 198 | +69% from L2
Llama 3.1 8B | AWQ INT4 | 4.5 GB | 209 t/s | 225 | Modest, near ceiling
Mistral 7B v0.3 | FP8 | 7.25 GB | 130 t/s | 215 | Sliding window helps
Mistral Nemo 12B | FP8 | 12.2 GB | 77 t/s | 145 | +88% from FA3 + L2
Llama 3.1 70B | AWQ INT4 | 17 GB | 55 t/s | 23 | None, model too large
Phi-3-mini 3.8B | FP8 | 3.8 GB | 247 t/s | 480 | +94% from full L2 hits
Qwen 2.5 7B | FP8 | 7 GB | 134 t/s | 210 | +57% from L2
Mixtral 8x7B | AWQ | 25 GB | 37 t/s | ~35 | None, sparse activation helps elsewhere

Two patterns matter. First, smaller models exceed their naive bandwidth ceiling significantly because of L2 reuse: for the smallest FP8 models a large share of the active layer’s weights and KV blocks is still resident in the 72 MB L2 when it is needed again, and for slightly larger models the hot subset of the working set fits. Second, the 70B model falls below its naive ceiling because the KV cache competes with weight streaming for the same bandwidth, and the model is far too large for L2 to soften the blow.
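The L2 effect column is just measured throughput over the naive ceiling, minus one; recomputing a few rows from the article's own numbers (small differences come from the rounded ceilings):

rows = [("Llama 3.1 8B FP16", 59, 95), ("Llama 3.1 8B FP8", 117, 198),
        ("Mistral Nemo 12B FP8", 77, 145), ("Phi-3-mini FP8", 247, 480),
        ("Llama 3.1 70B INT4", 55, 23)]
for name, ceiling_tps, measured_tps in rows:
    print(f"{name}: {measured_tps / ceiling_tps - 1:+.0%} vs naive ceiling")
# the 70B row comes out negative: KV streaming and dequant push it below the ceiling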

Compared to GDDR6, GDDR7, HBM2e and HBM3

GPU | VRAM | Type | Bandwidth | L2 | Decode regime
RTX 3090 24GB | 24 GB | GDDR6X 19.5 Gbps | 936 GB/s | 6 MB | BW-bound, no L2 lift
RTX 4090 24GB | 24 GB | GDDR6X 21 Gbps | 1008 GB/s | 72 MB | BW-bound, large L2 lift
RTX 5090 32GB | 32 GB | GDDR7 28 Gbps | 1792 GB/s | 96 MB | BW-bound, much higher ceiling
RTX 5060 Ti 16GB | 16 GB | GDDR7 28 Gbps | 448 GB/s | 32 MB | BW-bound, low ceiling
A100 40GB | 40 GB | HBM2e | 1555 GB/s | 40 MB | BW-bound, no FP8
A100 80GB | 80 GB | HBM2e | 2039 GB/s | 40 MB | BW-bound, no FP8
H100 SXM 80GB | 80 GB | HBM3 | 3350 GB/s | 50 MB | BW + compute mix
RTX 6000 Pro 96GB | 96 GB | GDDR7 ECC | 1792 GB/s | 128 MB | BW-bound, large model headroom

The 4090’s 1008 GB/s is modest next to HBM, but the 72 MB L2 is the largest in the chart aside from Blackwell’s 96-128 MB. For inference workloads with high temporal locality (small-batch decode of mid-size models) the L2 advantage offsets a lot of the raw bandwidth gap. An A100 80GB has twice the bandwidth but only 40 MB of L2 and no FP8 path; against the H100, single-user Llama 3.1 8B FP8 decode puts the 4090 within 13 percent (198 vs 225 t/s) despite it having roughly 30 percent of the H100’s memory bandwidth, because the L2 lift and the FP8 path together close most of the gap. See 4090 vs H100 80GB for the head-to-head and 4090 vs 3090 for the generational jump from the same VRAM tier.

Production gotchas

  1. The headline 1008 GB/s is theoretical peak. Sustained reads in production land at 920-960 GB/s. Use 940 GB/s as your sizing figure.
  2. Power capping kills bandwidth. Below ~350 W the GDDR6X clocks down, dropping sustained bandwidth to ~840 GB/s. Hold the card at ≥400 W for full bandwidth (see power draw efficiency).
  3. L2 hit rate is workload-dependent. Phi-3 mini hits 85+ percent L2 on the attention path; Llama 70B hits maybe 5 percent. Profile with ncu --metrics lts__t_sector_hit_rate.pct to know.
  4. FP16 KV at long context starves the bandwidth. A 4090 at 32k context with FP16 KV spends 30-40 percent of bandwidth on KV streaming. Switch to --kv-cache-dtype fp8 to halve that and recover decode throughput (a rough sizing sketch follows this list).
  5. Memory pads matter. 24/7 GDDR6X workloads can crack original Micron pads after 12-18 months. Production hosts repad with Honeywell PTM7950 to keep memory junction below 95 °C.
  6. nvidia-smi memory utilisation lies. The “Memory-Util” column reports the fraction of time the controller is active, not the fraction of bandwidth used. Use dcgmi dmon -e 1005 (DCGM DRAM active) for a truer read on how much of the bus is actually busy.
  7. Two streams on one card share bandwidth. Running two vLLM processes on a single 4090 cuts each one’s effective bandwidth roughly in half. Pin requests to a single vLLM instance with continuous batching instead.
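A rough sizing of gotcha 4, using Llama 3.1 8B's published attention shape (32 layers, 8 KV heads, head dim 128) and the article's 8 GB FP8 weight figure; a sketch of where the decode bandwidth goes, not a profile:

LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # Llama 3.1 8B attention shape
WEIGHTS_GB = 8.0                          # FP8 weights

def kv_share_of_traffic(context_tokens, kv_bytes_per_value):
    kv_per_position = 2 * LAYERS * KV_HEADS * HEAD_DIM * kv_bytes_per_value   # K and V
    kv_gb = kv_per_position * context_tokens / 1e9    # read once per decoded token
    return kv_gb / (kv_gb + WEIGHTS_GB)

print(f"32k context, FP16 KV: {kv_share_of_traffic(32_768, 2):.0%} of decode traffic")  # ~35%
print(f"32k context, FP8  KV: {kv_share_of_traffic(32_768, 1):.0%} of decode traffic")  # ~21%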

Verdict and when bandwidth is the wall

The 4090’s 1008 GB/s GDDR6X bus is the binding constraint for any LLM decode workload on the card. Compute, with 660 dense FP8 TFLOPS on tap, is rarely the limiting factor. Practically that means: optimise for the smallest format your eval allows (FP8 weights and FP8 KV are the default sweet spot), batch traffic with vLLM continuous batching to amortise weight streaming across sequences, and treat the L2 cache as the under-celebrated feature that lifts small-model throughput well above its naive ceiling. For a 12-engineer coding team running Llama 3.1 8B FP8, the practical envelope is 32 concurrent active streams at 1100 t/s aggregate, which is more than the team will sustain in working hours.

The 4090 is bandwidth-tier 4 (1 TB/s class). Tier 5 is GDDR7 (1.8 TB/s on the 5090, see the 5090 comparison). Tier 6 is HBM3 (3.3 TB/s on H100). If your workload runs into the bandwidth wall on a 4090 today and you cannot squeeze more out of FP8 or AWQ, the next economically sensible step is the 5090 32GB, not the H100; the 78 percent bandwidth jump and 8 GB extra VRAM cover most of the gap at a fraction of the rental cost. See the 4090 or 5090 decision piece for the trade-off.

1008 GB/s of decode bandwidth, hosted in the UK

Full 384-bit GDDR6X, 72 MB Ada L2, FP8 kernels pre-built. UK dedicated hosting.

Order the RTX 4090 24GB

See also: benchmark class, 4090 vs 3090, 4090 vs 5090, FP8 tensor cores, 8B LLM VRAM requirements, power draw efficiency, prefill vs decode.
