
RTX 4090 24GB Tokens per Watt: Energy Efficiency Benchmark

Measured tokens per joule across LLM workloads on the RTX 4090 24GB at 280-410 W observed draw, with full per-batch and per-quant tables, plus cross-comparison against the 5060 Ti, 5080, 3090, 5090, 6000 Pro and H100.

Energy is the recurring cost of inference. Capex is paid once and amortises away over the card's life; electricity bills arrive every month forever. The RTX 4090 24GB has a 450 W TDP but, in real LLM serving with vLLM, observed power sits between 280 W (single-stream decode) and 410 W (active prefill plus full-batch decode). This post measures tokens per joule (t/J) for several common models on the 4090, sweeps batch size, contrasts with the 5060 Ti, 5080, 3090, 5090, RTX 6000 Pro and H100, examines power-cap economics, and explains the underlying physics of why each card lands where it does. If you’re sizing a fleet on UK GPU hosting, this is the metric that drives operational cost.


Why tokens per joule is the right metric

Throughput tells you how many users a card can serve; tokens per joule tells you how many tokens you get per kilowatt-hour, which converts directly to British pounds. At UK industrial power around 0.18 GBP/kWh, a 400 W card running 24/7 burns roughly 51 GBP/month in raw electricity, before PUE. If you can squeeze 3.4 tokens per joule out of it, that is 12.2 million tokens per kWh, or roughly 68 million served tokens per pound at the wall. Multiply by your traffic to project the bill.
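
The conversion is simple enough to keep in a scratch script. A minimal sketch of the arithmetic above, assuming the same 0.18 GBP/kWh rate and a 3.4 t/J, 400 W operating point:

# Wall-power economics from tokens per joule (illustrative inputs)
tokens_per_joule = 3.4
price_per_kwh_gbp = 0.18
card_watts = 400

tokens_per_kwh = tokens_per_joule * 3_600_000                    # 1 kWh = 3.6 MJ -> ~12.2M tokens
tokens_per_gbp = tokens_per_kwh / price_per_kwh_gbp              # ~68M served tokens per pound at the wall
monthly_gbp = card_watts / 1000 * 24 * 30 * price_per_kwh_gbp    # ~52 GBP/month running 24/7
print(f"{tokens_per_kwh/1e6:.1f}M tok/kWh, {tokens_per_gbp/1e6:.0f}M tok/GBP, {monthly_gbp:.0f} GBP/month")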

The reason t/J varies so widely across batch and model isn’t that the silicon changes; it’s that the LLM decode loop is bandwidth-bound at small batch and compute-bound at large batch. Bandwidth-bound work pays a fixed power cost per VRAM read regardless of how many sequences share it; compute-bound work scales nearly linearly with FLOPs, but the GPU is no more efficient per FLOP at batch 64 than at batch 1. The sweet spot is the batch where you’ve amortised weight reads but haven’t yet hit thermal saturation.
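
To see why batching amortises weight reads, a back-of-envelope model helps (the sizes below are illustrative assumptions, not measurements): every decode step reads the full weights once regardless of batch, plus each active sequence's KV cache, so the bytes moved per generated token fall steeply as batch grows.

# Rough bytes read from VRAM per generated token vs batch size (illustrative sizes)
weight_bytes = 8e9          # ~8 GB of FP8 weights for an 8B-class model (assumption)
kv_bytes_per_seq = 0.1e9    # ~100 MB of KV cache touched per active sequence (assumption)

for batch in (1, 8, 32, 64):
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq  # one step yields `batch` tokens
    bytes_per_token = bytes_per_step / batch
    print(f"batch {batch:>2}: {bytes_per_token / 1e9:.2f} GB read per token")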

Why not tokens per dollar?

Tokens per dollar conflates capex amortisation, hosting margin and electricity. It is the right metric for a procurement decision, not for an operations decision. Once you’ve bought the card, you optimise for tokens per joule because energy is the only variable cost. We cover the procurement angle in the ROI analysis and 4090 vs cloud H100 pieces.

Methodology and instrumentation

All measurements use vLLM 0.6.4, PyTorch 2.5 and FlashAttention 2.6 on the standard test rig: Ryzen 9 7950X, 64 GB DDR5-5600, Ubuntu 24.04, driver 560.x, CUDA 12.6. Power is sampled via NVML at 100 ms cadence over a 60-second steady-state averaging window. Prompts are 256 input / 256 output tokens unless noted. The idle baseline (memory clocks parked, no kernels) measured 25 W; it is subtracted from the per-token energy figures only where explicitly noted.

# Power sampling helper: poll NVML every 100 ms over a 60 s steady-state window
import pynvml, time

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t0 = time.time()
while time.time() - t0 < 60:
    # nvmlDeviceGetPowerUsage returns milliwatts; convert to watts
    samples.append(pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0)
    time.sleep(0.1)

samples.sort()
print(f"avg={sum(samples)/len(samples):.0f}W p95={samples[int(len(samples) * 0.95)]:.0f}W")
pynvml.nvmlShutdown()

Single-stream tokens per joule

Single request, batch 1, no contention. Decode-phase only (prefill power amortised separately).

| Model | Quant | Decode t/s | Power | t/J | Million tok per kWh |
|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 195 | 280 W | 0.70 | 2.51 |
| Mistral 7B | FP8 | 215 | 275 W | 0.78 | 2.81 |
| Qwen 2.5 14B | AWQ | 135 | 305 W | 0.44 | 1.59 |
| Qwen 2.5 32B | AWQ | 65 | 325 W | 0.20 | 0.72 |
| Llama 3.1 70B | AWQ INT4 | 23 | 340 W | 0.068 | 0.24 |
| Phi-3 mini | FP8 | 480 | 270 W | 1.78 | 6.40 |

Single-stream is the worst case for t/J because the GPU spends most of each step waiting on VRAM. Phi-3 mini wins at 1.78 t/J because its weights are a fraction of the size of the larger models, so far fewer bytes cross the memory bus per generated token. Llama 70B INT4 at 0.068 t/J shows the cost of squeezing a 70B model onto a 24 GB card via aggressive quantisation: bandwidth pressure dominates.

Batch effect on efficiency

Decoder-only LLMs are bandwidth-bound at batch 1. Increasing batch dramatically improves t/J because the same weight read serves many sequences. This is the single most impactful operational lever you have.

| Batch | Aggregate t/s | Per-user t/s | Power | t/J |
|---|---|---|---|---|
| 1 | 198 | 198 | 280 W | 0.70 |
| 2 | 360 | 180 | 295 W | 1.22 |
| 4 | 620 | 155 | 325 W | 1.55 |
| 8 | 880 | 110 | 355 W | 2.45 |
| 16 | 1020 | 64 | 375 W | 2.95 |
| 32 | 1100 | 34 | 395 W | 3.40 |
| 64 | 1140 | 18 | 410 W | 3.45 |

Batch 32 is the practical sweet spot on Llama 3 8B FP8: per-user latency stays acceptable (34 t/s, more than human reading speed), aggregate throughput is within 4% of the saturation point at batch 64, and t/J is 3.40, almost 5x better than batch 1. Above batch 32, you trade per-user latency for marginal efficiency gain that you’ll pay for in P99 budget. Batch 64 is memory-bandwidth-bound; the GPU can’t read weights any faster.
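
For reference, a minimal vLLM launch matching the batch-32 FP8 row, shown with the offline Python API (the server CLI takes the same options as flags; the model path and memory fraction here are assumptions, not requirements):

# Sketch: Llama 3 8B in FP8 with FP8 KV cache, capped at 32 concurrent sequences
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",             # FP8 weights
    kv_cache_dtype="fp8",           # FP8 KV cache; see the gotchas below
    max_num_seqs=32,                # continuous-batching concurrency cap: the sweet spot above
    gpu_memory_utilization=0.92,    # assumption; tune for your KV-cache headroom
)
out = llm.generate(["Explain tokens per joule in one sentence."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)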

Other workloads

| Workload | Power (avg) | Notes |
|---|---|---|
| Idle (memory clocks parked) | 25 W | Baseline floor |
| Decode batch 1, Llama 3 8B FP8 | 280 W | Bandwidth-bound |
| Decode batch 32, Llama 3 8B FP8 | 395 W | Compute and bandwidth balanced |
| Prefill phase | 410 W | Compute-bound, briefly |
| SDXL 1024 generation | 430 W | UNet is compute-heavy |
| FLUX.1-dev generation | 440 W | Largest sustained draw |
| LoRA fine-tune, Llama 3 8B | 430 W | Optimiser + activation spikes |
| QLoRA fine-tune, Llama 3 70B | 390 W | NF4 unpacking is bandwidth-bound, lower compute |

Cross-GPU comparison

Best t/J achieved on Llama 3 8B FP8 batch 32, identical vLLM config:

| GPU | TDP | Best t/J | VRAM | Notes |
|---|---|---|---|---|
| RTX 5060 Ti 16GB | 180 W | 4.6 | 16 GB | Smaller card, bandwidth-balanced |
| RTX 6000 Pro | 300 W | 5.4 | 96 GB | Efficiency-tuned silicon |
| H100 80GB | 700 W | 5.0 | 80 GB | Datacentre, HBM3 |
| RTX 5080 16GB | 360 W | 3.8 | 16 GB | Blackwell consumer |
| RTX 4090 24GB | 450 W | 3.4 | 24 GB | Highest single-card capacity in class |
| RTX 5090 32GB | 575 W | 3.4 | 32 GB | More VRAM, similar efficiency |
| RTX 3090 24GB | 350 W | 3.3 | 24 GB | No native FP8 |

The 5060 Ti wins per joule by being smaller and more bandwidth-balanced; it has just enough silicon to amortise its small TDP. The 4090 wins per chassis: it serves about 3x the throughput of a 5060 Ti and supports 70B AWQ models the smaller card cannot host at all. The 6000 Pro is the king of efficiency-per-watt thanks to lower clocks and the same architectural improvements as the 5090, but it costs 4-5x more upfront. See the full 4090 vs 5090, 4090 vs 3090, and 4090 vs 5060 Ti decision guides.

Power capping strategies

Running nvidia-smi -pl 350 caps the 4090 at 350 W. We measured a 4-7% throughput drop and a 12% improvement in t/J. The Pareto-optimal cap depends on your binding constraint:

| Cap | Aggregate t/s (Llama 3 8B FP8, batch 32) | t/J | Use when |
|---|---|---|---|
| 450 W (stock) | 1100 | 3.40 | Throughput is everything |
| 400 W | 1078 | 3.55 | Default; balances both |
| 350 W | 1040 | 3.85 | Mains-constrained colos |
| 300 W | 945 | 4.12 | Aggressive efficiency tuning |
| 250 W | 790 | 4.36 | Heat-constrained or solar-batteried sites |

For colocated UK racks where mains is the binding constraint, capping at 350-380 W is a sensible default: it costs a few percent of throughput and narrows the efficiency gap to the 6000 Pro while keeping 4090 capex. Detail in the power draw and efficiency piece and the thermal performance writeup.
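
The same cap can be applied programmatically through NVML instead of shelling out to nvidia-smi, which is convenient for fleet tooling. A sketch (needs root, assumes GPU index 0, and clamps to whatever range the card reports):

# Sketch: apply a 350 W power cap via NVML (equivalent to `nvidia-smi -pl 350`)
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)   # supported range in milliwatts
target_mw = max(lo, min(hi, 350_000))
pynvml.nvmlDeviceSetPowerManagementLimit(h, target_mw)            # requires root privileges
print(f"cap set to {target_mw / 1000:.0f} W (card allows {lo / 1000:.0f}-{hi / 1000:.0f} W)")
pynvml.nvmlShutdown()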

Undervolting

Undervolting via nvidia-smi --lock-gpu-clocks plus a custom voltage curve in MSI Afterburner (Linux equivalent: nvidia-settings) extracts another 4-6% efficiency at 350 W. The 4090 silicon lottery means individual cards vary by roughly 30 mV; a curve that holds 2700 MHz at 1015 mV is achievable on most samples. The work is not zero-risk and we generally recommend power capping over undervolting in production.

Cost implications and named scenarios

At UK industrial pricing (0.18 GBP/kWh) the 4090 at 380 W draws about 9.1 kWh/day, roughly 1.64 GBP/day (around 50 GBP/month) in raw power. Serving Llama 3 8B FP8 at 1100 t/s aggregate, that is roughly 95 million tokens per day per GPU, call it 2.85 billion tokens per month per card. Compare to OpenAI list pricing in our 4090 vs OpenAI API cost piece.

Named scenario: a 50-engineer SaaS RAG product

One real customer, a B2B SaaS with 50 internal engineers and ~3,000 paying tenants, runs Qwen 14B AWQ on a single 4090 at 350 W cap. Steady-state aggregate is 720 t/s, t/J is 2.95, and they consume 2.6 billion tokens per month at a wall-power cost of 51 GBP, versus a quoted Anthropic API bill of low five figures for the same volume. Full cost-of-ownership math is in the ROI analysis and monthly hosting cost posts.

Production gotchas

  • NVML samples lag. A 100 ms NVML poll shows averaged power, not instantaneous spikes. Real peak draw can be 30 W higher than NVML reports; size your PSU for at least 550 W per 4090.
  • Idle floor isn’t zero. Even with vLLM idle, the card sits at 25 W. A fleet of 20 4090s burns 500 W (roughly 65 GBP/month at 0.18 GBP/kWh) when nobody is talking to it. Make sure scheduled burst capacity actually powers down.
  • Power cap doesn’t help thermal limits. If your chassis chokes airflow, capping power may not be enough; you’ll still throttle. Check VRAM junction temp under nvidia-smi -q -d TEMPERATURE.
  • FP8 KV cache halves bandwidth pressure. Always pair --quantization fp8 with --kv-cache-dtype fp8; the t/J table assumes both. Without FP8 KV, batch 32 t/J drops by 25%.
  • Continuous batching is non-negotiable. Without it, your card runs at batch 1 efficiency permanently. Validate with vllm.metrics (see the sketch after this list).
  • Don’t measure during prefill warmup. First 200 tokens of any batch include cold-cache effects that distort t/J.
  • UPS double-conversion adds 6-9% loss. Add this to your wall-power-to-card calculation; it’s invisible to NVML.
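
One quick way to do that validation, sketched under the assumption that the vLLM OpenAI-compatible server is exposing Prometheus metrics on localhost:8000 (gauge names can vary slightly between vLLM versions):

# Sketch: confirm continuous batching is engaging by scraping vLLM's /metrics endpoint
import urllib.request

metrics = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for line in metrics.splitlines():
    if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
        print(line)   # running should climb well above 1 under concurrent load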

Verdict

The 4090 is not the most efficient card per joule; that title goes to the 6000 Pro and 5060 Ti. But it is the most efficient card per chassis slot at its capacity tier: you get 24 GB and 1,100 t/s aggregate on Llama 8B in 1U, drawing 380 W steady-state. For most teams that’s the right axis to optimise. If your fleet is power-constrained rather than slot-constrained, consider hybrid pairings: the 4090 + 5060 Ti hybrid pattern lets the smaller card handle small models at higher t/J while reserving the 4090 for 14-70B work.

Optimise inference cost per token

Predictable UK power. No per-token API surprises.

Order the RTX 4090 24GB

See also: RTX 4090 power draw, monthly hosting cost, 4090 vs OpenAI cost, Llama 3 8B benchmark, ROI analysis, spec breakdown, FP8 tensor cores, 5060 Ti tokens per watt.
