
RTX 4090 24GB Tokens per Watt: Energy Efficiency Benchmark

Measured tokens per joule across LLM workloads on the RTX 4090 24GB at 280-410 W observed draw, with full per-batch and per-quant tables, plus cross-comparison against the 5060 Ti, 5080, 3090, 5090, 6000 Pro and H100.

Energy is the recurring cost of inference. Capex is paid once and amortises away over the card's life; electricity bills arrive every month forever. The RTX 4090 24GB has a 450 W TDP but, in real LLM serving with vLLM, observed power sits between 280 W (single-stream decode) and 410 W (active prefill plus full-batch decode). This post measures tokens per joule (t/J) for several common models on the 4090, sweeps batch size, contrasts with the 5060 Ti, 5080, 3090, 5090, RTX 6000 Pro and H100, examines power-cap economics, and explains the underlying physics of why each card lands where it does. If you’re sizing a fleet on UK GPU hosting, this is the metric that drives operational cost.


Why tokens per joule is the right metric

Throughput tells you how many users a card can serve; tokens per joule tells you how many tokens you get per kilowatt-hour, which converts directly to British pounds. At UK industrial power around 0.18 GBP/kWh, a 400 W card running 24/7 burns roughly 51 GBP/month in raw electricity, before PUE. If you can squeeze 3.4 tokens per joule out of it, that is 12.2 million tokens per kWh, or roughly 68 million served tokens per pound at the wall. Multiply by your traffic to project the bill.
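
The conversion is simple enough to keep in a scratch script. A minimal sketch of the arithmetic above, assuming the same 0.18 GBP/kWh rate and a 3.4 t/J, 400 W operating point:

# Wall-power economics from tokens per joule (illustrative inputs)
tokens_per_joule = 3.4
price_per_kwh_gbp = 0.18
card_watts = 400

tokens_per_kwh = tokens_per_joule * 3_600_000                    # 1 kWh = 3.6 MJ -> ~12.2M tokens
tokens_per_gbp = tokens_per_kwh / price_per_kwh_gbp              # ~68M served tokens per pound at the wall
monthly_gbp = card_watts / 1000 * 24 * 30 * price_per_kwh_gbp    # ~52 GBP/month running 24/7
print(f"{tokens_per_kwh/1e6:.1f}M tok/kWh, {tokens_per_gbp/1e6:.0f}M tok/GBP, {monthly_gbp:.0f} GBP/month")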

The reason t/J varies so widely across batch and model isn’t that the silicon changes; it’s that the LLM decode loop is bandwidth-bound at small batch and compute-bound at large batch. Bandwidth-bound work pays a fixed power cost per VRAM read regardless of how many sequences share it; compute-bound work scales nearly linearly with FLOPs, but the GPU is no more efficient per FLOP at batch 64 than at batch 1. The sweet spot is the batch where you’ve amortised weight reads but haven’t yet hit thermal saturation.
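
To see why batching amortises weight reads, a back-of-envelope model helps (the sizes below are illustrative assumptions, not measurements): every decode step reads the full weights once regardless of batch, plus each active sequence's KV cache, so the bytes moved per generated token fall steeply as batch grows.

# Rough bytes read from VRAM per generated token vs batch size (illustrative sizes)
weight_bytes = 8e9          # ~8 GB of FP8 weights for an 8B-class model (assumption)
kv_bytes_per_seq = 0.1e9    # ~100 MB of KV cache touched per active sequence (assumption)

for batch in (1, 8, 32, 64):
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq  # one step yields `batch` tokens
    bytes_per_token = bytes_per_step / batch
    print(f"batch {batch:>2}: {bytes_per_token / 1e9:.2f} GB read per token")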

Why not tokens per dollar?

Tokens per dollar conflates capex amortisation, hosting margin and electricity. It is the right metric for a procurement decision, not for an operations decision. Once you’ve bought the card, you optimise for tokens per joule because energy is the only variable cost. We cover the procurement angle in the ROI analysis and 4090 vs cloud H100 pieces.

Methodology and instrumentation

All measurements use vLLM 0.6.4, PyTorch 2.5 and FlashAttention 2.6 on the standard test rig: Ryzen 9 7950X, 64 GB DDR5-5600, Ubuntu 24.04, driver 560.x, CUDA 12.6. Power is sampled via NVML at 100 ms cadence over a 60-second steady-state averaging window. Prompts are 256 input / 256 output tokens unless noted. The idle baseline (memory clocks parked, no kernels) measured 25 W; it is subtracted from the per-token energy figures only where explicitly noted.

# Power sampling helper: poll NVML every 100 ms over a 60 s steady-state window
import pynvml, time

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t0 = time.time()
while time.time() - t0 < 60:
    # nvmlDeviceGetPowerUsage returns milliwatts; convert to watts
    samples.append(pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0)
    time.sleep(0.1)

samples.sort()
print(f"avg={sum(samples)/len(samples):.0f}W p95={samples[int(len(samples) * 0.95)]:.0f}W")
pynvml.nvmlShutdown()

Single-stream tokens per joule

Single request, batch 1, no contention. Decode-phase only (prefill power amortised separately).

| Model | Quant | Decode t/s | Power | t/J | Million tok per kWh |
|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 195 | 280 W | 0.70 | 2.51 |
| Mistral 7B | FP8 | 215 | 275 W | 0.78 | 2.81 |
| Qwen 2.5 14B | AWQ | 135 | 305 W | 0.44 | 1.59 |
| Qwen 2.5 32B | AWQ | 65 | 325 W | 0.20 | 0.72 |
| Llama 3.1 70B | AWQ INT4 | 23 | 340 W | 0.068 | 0.24 |
| Phi-3 mini | FP8 | 480 | 270 W | 1.78 | 6.40 |

Single-stream is the worst case for t/J because the GPU spends most of each step waiting on VRAM. Phi-3 mini wins at 1.78 t/J because its weights are a fraction of the size of the larger models, so far fewer bytes cross the memory bus per generated token. Llama 70B INT4 at 0.068 t/J shows the cost of squeezing a 70B model onto a 24 GB card via aggressive quantisation: bandwidth pressure dominates.

Batch effect on efficiency

Decoder-only LLMs are bandwidth-bound at batch 1. Increasing batch dramatically improves t/J because the same weight read serves many sequences. This is the single most impactful operational lever you have.

| Batch | Aggregate t/s | Per-user t/s | Power | t/J |
|---|---|---|---|---|
| 1 | 198 | 198 | 280 W | 0.70 |
| 2 | 360 | 180 | 295 W | 1.22 |
| 4 | 620 | 155 | 325 W | 1.55 |
| 8 | 880 | 110 | 355 W | 2.45 |
| 16 | 1020 | 64 | 375 W | 2.95 |
| 32 | 1100 | 34 | 395 W | 3.40 |
| 64 | 1140 | 18 | 410 W | 3.45 |

Batch 32 is the practical sweet spot on Llama 3 8B FP8: per-user latency stays acceptable (34 t/s, more than human reading speed), aggregate throughput is within 4% of the saturation point at batch 64, and t/J is 3.40, almost 5x better than batch 1. Above batch 32, you trade per-user latency for marginal efficiency gain that you’ll pay for in P99 budget. Batch 64 is memory-bandwidth-bound; the GPU can’t read weights any faster.
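
For reference, a minimal vLLM launch matching the batch-32 FP8 row, shown with the offline Python API (the server CLI takes the same options as flags; the model path and memory fraction here are assumptions, not requirements):

# Sketch: Llama 3 8B in FP8 with FP8 KV cache, capped at 32 concurrent sequences
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",             # FP8 weights
    kv_cache_dtype="fp8",           # FP8 KV cache; see the gotchas below
    max_num_seqs=32,                # continuous-batching concurrency cap: the sweet spot above
    gpu_memory_utilization=0.92,    # assumption; tune for your KV-cache headroom
)
out = llm.generate(["Explain tokens per joule in one sentence."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)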

Other workloads

| Workload | Power (avg) | Notes |
|---|---|---|
| Idle (memory clocks parked) | 25 W | Baseline floor |
| Decode batch 1, Llama 3 8B FP8 | 280 W | Bandwidth-bound |
| Decode batch 32, Llama 3 8B FP8 | 395 W | Compute and bandwidth balanced |
| Prefill phase | 410 W | Compute-bound, briefly |
| SDXL 1024 generation | 430 W | UNet is compute-heavy |
| FLUX.1-dev generation | 440 W | Largest sustained draw |
| LoRA fine-tune, Llama 3 8B | 430 W | Optimiser + activation spikes |
| QLoRA fine-tune, Llama 3 70B | 390 W | NF4 unpacking is bandwidth-bound, lower compute |

Cross-GPU comparison

Best t/J achieved on Llama 3 8B FP8 batch 32, identical vLLM config:

| GPU | TDP | Best t/J | VRAM | Notes |
|---|---|---|---|---|
| RTX 5060 Ti 16GB | 180 W | 4.6 | 16 GB | Smaller card, bandwidth-balanced |
| RTX 6000 Pro | 300 W | 5.4 | 96 GB | Efficiency-tuned silicon |
| H100 80GB | 700 W | 5.0 | 80 GB | Datacentre, HBM3 |
| RTX 5080 16GB | 360 W | 3.8 | 16 GB | Blackwell consumer |
| RTX 4090 24GB | 450 W | 3.4 | 24 GB | Highest single-card capacity in class |
| RTX 5090 32GB | 575 W | 3.4 | 32 GB | More VRAM, similar efficiency |
| RTX 3090 24GB | 350 W | 3.3 | 24 GB | No native FP8 |

The 5060 Ti wins per joule by being smaller and more bandwidth-balanced; it has just enough silicon to amortise its small TDP. The 4090 wins per chassis: it serves about 3x the throughput of a 5060 Ti and supports 70B AWQ models the smaller card cannot host at all. The 6000 Pro is the king of efficiency-per-watt thanks to lower clocks and the same architectural improvements as the 5090, but it costs 4-5x more upfront. See the full 4090 vs 5090, 4090 vs 3090, and 4090 vs 5060 Ti decision guides.

Power capping strategies

Running nvidia-smi -pl 350 caps the 4090 at 350 W. We measured a 4-7% throughput drop and a 12% improvement in t/J. The Pareto-optimal cap depends on your binding constraint:

| Cap | Aggregate t/s (Llama 3 8B FP8, batch 32) | t/J | Use when |
|---|---|---|---|
| 450 W (stock) | 1100 | 3.40 | Throughput is everything |
| 400 W | 1078 | 3.55 | Default; balances both |
| 350 W | 1040 | 3.85 | Mains-constrained colos |
| 300 W | 945 | 4.12 | Aggressive efficiency tuning |
| 250 W | 790 | 4.36 | Heat-constrained or solar-batteried sites |

For colocated UK racks where mains is the binding constraint, capping at 350-380 W is a sensible default: it costs a few percent of throughput and narrows the efficiency gap to the 6000 Pro while keeping 4090 capex. Detail in the power draw and efficiency piece and the thermal performance writeup.
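
The same cap can be applied programmatically through NVML instead of shelling out to nvidia-smi, which is convenient for fleet tooling. A sketch (needs root, assumes GPU index 0, and clamps to whatever range the card reports):

# Sketch: apply a 350 W power cap via NVML (equivalent to `nvidia-smi -pl 350`)
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)   # supported range in milliwatts
target_mw = max(lo, min(hi, 350_000))
pynvml.nvmlDeviceSetPowerManagementLimit(h, target_mw)            # requires root privileges
print(f"cap set to {target_mw / 1000:.0f} W (card allows {lo / 1000:.0f}-{hi / 1000:.0f} W)")
pynvml.nvmlShutdown()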

Undervolting

Undervolting via nvidia-smi --lock-gpu-clocks plus a custom voltage curve in MSI Afterburner (Linux equivalent: nvidia-settings) extracts another 4-6% efficiency at 350 W. The 4090 silicon lottery means individual cards vary by roughly 30 mV; a curve that holds 2700 MHz at 1015 mV is achievable on most samples. The work is not zero-risk and we generally recommend power capping over undervolting in production.

Cost implications and named scenarios

At UK industrial pricing (0.18 GBP/kWh) the 4090 at 380 W draws about 9.1 kWh/day, roughly 1.64 GBP/day (around 50 GBP/month) in raw power. Serving Llama 3 8B FP8 at 1100 t/s aggregate, that is roughly 95 million tokens per day per GPU, call it 2.85 billion tokens per month per card. Compare to OpenAI list pricing in our 4090 vs OpenAI API cost piece.

Named scenario: a 50-engineer SaaS RAG product

One real customer, a B2B SaaS with 50 internal engineers and ~3,000 paying tenants, runs Qwen 14B AWQ on a single 4090 at 350 W cap. Steady-state aggregate is 720 t/s, t/J is 2.95, and they consume 2.6 billion tokens per month at a wall-power cost of 51 GBP, versus a quoted Anthropic API bill of low five figures for the same volume. Full cost-of-ownership math is in the ROI analysis and monthly hosting cost posts.

Production gotchas

  • NVML samples lag. A 100 ms NVML poll shows averaged power, not instantaneous spikes. Real peak draw can be 30 W higher than NVML reports; size your PSU for at least 550 W per 4090.
  • Idle floor isn’t zero. Even with vLLM idle, the card sits at 25 W. A fleet of 20 4090s burns 500 W (roughly 65 GBP/month at 0.18 GBP/kWh) when nobody is talking to it. Make sure scheduled burst capacity actually powers down.
  • Power cap doesn’t help thermal limits. If your chassis chokes airflow, capping power may not be enough; you’ll still throttle. Check VRAM junction temp under nvidia-smi -q -d TEMPERATURE.
  • FP8 KV cache halves bandwidth pressure. Always pair --quantization fp8 with --kv-cache-dtype fp8; the t/J table assumes both. Without FP8 KV, batch 32 t/J drops by 25%.
  • Continuous batching is non-negotiable. Without it, your card runs at batch 1 efficiency permanently. Validate with vllm.metrics (see the sketch after this list).
  • Don’t measure during prefill warmup. First 200 tokens of any batch include cold-cache effects that distort t/J.
  • UPS double-conversion adds 6-9% loss. Add this to your wall-power-to-card calculation; it’s invisible to NVML.
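
One quick way to do that validation, sketched under the assumption that the vLLM OpenAI-compatible server is exposing Prometheus metrics on localhost:8000 (gauge names can vary slightly between vLLM versions):

# Sketch: confirm continuous batching is engaging by scraping vLLM's /metrics endpoint
import urllib.request

metrics = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for line in metrics.splitlines():
    if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
        print(line)   # running should climb well above 1 under concurrent load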

Verdict

The 4090 is not the most efficient card per joule; that title goes to the 6000 Pro and 5060 Ti. But it is the most efficient card per chassis slot at its capacity tier: you get 24 GB and 1,100 t/s aggregate on Llama 8B in 1U, drawing 380 W steady-state. For most teams that’s the right axis to optimise. If your fleet is power-constrained rather than slot-constrained, consider hybrid pairings: the 4090 + 5060 Ti hybrid pattern lets the smaller card handle small models at higher t/J while reserving the 4090 for 14-70B work.

Optimise inference cost per token

Predictable UK power. No per-token API surprises.

Order the RTX 4090 24GB

See also: RTX 4090 power draw, monthly hosting cost, 4090 vs OpenAI cost, Llama 3 8B benchmark, ROI analysis, spec breakdown, FP8 tensor cores, 5060 Ti tokens per watt.
