Energy is the recurring cost of inference. Capex amortises and disappears from the income statement; electricity bills arrive every month forever. The RTX 4090 24GB has a 450 W TDP but, in real LLM serving with vLLM, observed power sits between 280 W (single-stream decode) and 410 W (active prefill plus full-batch decode). This post measures tokens per joule (t/J) for several common models on the 4090, sweeps batch size, contrasts with the 5060 Ti, 5080, 3090, 5090, RTX 6000 Pro and H100, examines power-cap economics, and explains the underlying physics of why each card lands where it does. If you’re sizing a fleet on UK GPU hosting, this is the metric that drives operational cost.
Contents
- Why tokens per joule is the right metric
- Methodology and instrumentation
- Single-stream tokens per joule
- Batch effect on efficiency
- Cross-GPU comparison
- Power capping strategies
- Cost implications and named scenarios
- Production gotchas
Why tokens per joule is the right metric
Throughput tells you how many users a card can serve; tokens per joule tells you how many tokens you get per kilowatt-hour, which converts directly to British pounds. At UK industrial power around 0.18 GBP/kWh, a 400 W card running 24/7 burns roughly 51 GBP/month in raw electricity, before PUE. If you can squeeze 3.4 tokens per joule out of it, that is 12.2 million tokens per kWh, or about 68 million served tokens per pound at the wall. Multiply by your traffic to project the bill.
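The kWh-to-pounds conversion is mechanical, so here it is as a minimal helper (the 3.4 t/J and 0.18 GBP/kWh inputs are the worked example from this paragraph):

```python
def cost_per_million_tokens(tokens_per_joule: float, gbp_per_kwh: float) -> float:
    """Convert an efficiency figure and a power tariff into GBP per million tokens."""
    tokens_per_kwh = tokens_per_joule * 3.6e6  # 1 kWh = 3.6 MJ
    return gbp_per_kwh / tokens_per_kwh * 1e6

# Worked example from the text: 3.4 t/J at 0.18 GBP/kWh
print(f"{cost_per_million_tokens(3.4, 0.18):.4f} GBP per million tokens")
```

Multiply by your monthly token volume to project the raw power bill.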
The reason t/J varies so widely across batch and model isn't that the silicon changes; it's that the workload balance shifts. At small batch the decode loop is bound by weight reads from VRAM, a fixed energy cost per step regardless of how many sequences share it. At large batch those weight reads are amortised across the whole batch, and per-sequence KV-cache traffic and compute take over, scaling nearly linearly with tokens; the GPU is no more efficient per FLOP at batch 64 than at batch 1. The sweet spot is the batch where you've amortised weight reads but haven't yet hit thermal saturation.
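That amortisation argument can be sketched as a toy energy model. Everything here is illustrative: the byte count roughly matches an 8B FP8 model, and the energy-per-byte and per-token compute constants are round numbers picked to land near the measured curve, not measurements.

```python
WEIGHT_BYTES = 8e9          # ~8B-param model at FP8 (assumption)
J_PER_BYTE = 1.8e-10        # effective energy per VRAM byte read (fitted, not measured)
J_COMPUTE_PER_TOKEN = 0.25  # per-token compute floor (fitted, not measured)

def model_tokens_per_joule(batch: int) -> float:
    """One weight sweep per decode step is shared by the whole batch, so its
    energy amortises across `batch` tokens; the compute term does not."""
    weight_j_per_token = WEIGHT_BYTES * J_PER_BYTE / batch
    return 1.0 / (weight_j_per_token + J_COMPUTE_PER_TOKEN)

for b in (1, 4, 8, 32, 64):
    print(f"batch {b:2d}: {model_tokens_per_joule(b):.2f} t/J")
```

The shape, not the exact values, is the point: steep gains up to batch 8-16, then diminishing returns as the compute floor dominates.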
Why not tokens per dollar?
Tokens per dollar conflates capex amortisation, hosting margin and electricity. It is the right metric for a procurement decision, not for an operations decision. Once you’ve bought the card, you optimise for tokens per joule because energy is the only variable cost. We cover the procurement angle in the ROI analysis and 4090 vs cloud H100 pieces.
Methodology and instrumentation
All measurements: vLLM 0.6.4, PyTorch 2.5, FlashAttention 2.6, on the standard test rig: Ryzen 9 7950X, 64 GB DDR5-5600, Ubuntu 24.04, driver 560.x, CUDA 12.6. Power sampled via NVML at 100 ms cadence with a 60-second steady-state averaging window. Prompts are 256 input / 256 output tokens unless noted. Idle baseline (GDDR6X clocks parked, no kernels) measured at 25 W; this is subtracted from the per-token joule figures only when explicitly noted.
```python
# Power sampling helper: average and p95 card power over a 60 s window
import pynvml
import time

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
samples = []
t0 = time.time()
while time.time() - t0 < 60:
    # nvmlDeviceGetPowerUsage reports milliwatts
    samples.append(pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0)
    time.sleep(0.1)
samples.sort()
print(f"avg={sum(samples)/len(samples):.0f}W p95={samples[int(len(samples)*0.95)]:.0f}W")
pynvml.nvmlShutdown()
```
Single-stream tokens per joule
Single request, batch 1, no contention. Decode-phase only (prefill power amortised separately).
| Model | Quant | Decode t/s | Power | t/J | Million tok per kWh |
|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 195 | 280 W | 0.70 | 2.51 |
| Mistral 7B | FP8 | 215 | 275 W | 0.78 | 2.81 |
| Qwen 2.5 14B | AWQ | 135 | 305 W | 0.44 | 1.59 |
| Qwen 2.5 32B | AWQ | 65 | 325 W | 0.20 | 0.72 |
| Llama 3.1 70B | AWQ INT4 | 23 | 340 W | 0.068 | 0.24 |
| Phi-3 mini | FP8 | 480 | 270 W | 1.78 | 6.40 |
Single-stream is the worst case for t/J because the GPU spends most of each step waiting on VRAM. Phi-3 mini wins at 1.78 t/J because its weight footprint is a fraction of the larger models', so each decode step moves far fewer bytes per token. Llama 70B INT4 at 0.068 t/J shows the cost of squeezing a 70B model onto a 24 GB card via aggressive quantisation: bandwidth pressure dominates.
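The two derived columns are pure arithmetic on the measured (t/s, W) pairs; a quick spot-check on three rows of the table:

```python
# (decode t/s, average power in W) pairs from three rows of the table above
rows = {
    "Llama 3.1 8B FP8": (195, 280),
    "Phi-3 mini FP8": (480, 270),
    "Llama 3.1 70B AWQ INT4": (23, 340),
}
for name, (tps, watts) in rows.items():
    tpj = tps / watts        # tokens per joule = (tok/s) / (J/s)
    mtok_kwh = tpj * 3.6     # million tokens per kWh (3.6 MJ per kWh)
    print(f"{name}: {tpj:.2f} t/J, {mtok_kwh:.2f} M tok/kWh")
```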
Batch effect on efficiency
Decoder-only LLMs are bandwidth-bound at batch 1. Increasing batch dramatically improves t/J because the same weight read serves many sequences. This is the single most impactful operational lever you have.
| Batch | Aggregate t/s | Per-user t/s | Power | t/J |
|---|---|---|---|---|
| 1 | 198 | 198 | 280 W | 0.70 |
| 2 | 360 | 180 | 295 W | 1.22 |
| 4 | 620 | 155 | 325 W | 1.55 |
| 8 | 880 | 110 | 355 W | 2.45 |
| 16 | 1020 | 64 | 375 W | 2.95 |
| 32 | 1100 | 34 | 395 W | 3.40 |
| 64 | 1140 | 18 | 410 W | 3.45 |
Batch 32 is the practical sweet spot on Llama 3 8B FP8: per-user latency stays acceptable (34 t/s, more than human reading speed), aggregate throughput is within 4% of the saturation point at batch 64, and t/J is 3.40, almost 5x better than batch 1. Above batch 32, you trade per-user latency for marginal efficiency gain that you’ll pay for in P99 budget. Batch 64 is memory-bandwidth-bound; the GPU can’t read weights any faster.
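Two of the batch-32 claims above (per-user speed and the gap to saturation) fall straight out of the throughput column:

```python
# Aggregate decode throughput by batch size, from the table above
aggregate = {1: 198, 2: 360, 4: 620, 8: 880, 16: 1020, 32: 1100, 64: 1140}

per_user_32 = aggregate[32] / 32                       # per-stream speed at batch 32
gap_to_saturation = 1 - aggregate[32] / aggregate[64]  # shortfall vs batch 64

print(f"per-user at batch 32: {per_user_32:.1f} t/s")
print(f"gap to batch-64 saturation: {gap_to_saturation:.1%}")
```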
Other workloads
| Workload | Power (avg) | Notes |
|---|---|---|
| Idle (memory clocks parked) | 25 W | Baseline floor |
| Decode batch 1 Llama 3 8B FP8 | 280 W | Bandwidth-bound |
| Decode batch 32 Llama 3 8B FP8 | 395 W | Compute and bandwidth balanced |
| Prefill phase | 410 W | Compute-bound, briefly |
| SDXL 1024 generation | 430 W | UNet is compute-heavy |
| FLUX.1-dev generation | 440 W | Largest sustained draw |
| LoRA fine-tune Llama 3 8B | 430 W | Optimiser + activation spikes |
| QLoRA fine-tune Llama 3 70B | 390 W | NF4 unpacking is bandwidth-bound, lower compute |
Cross-GPU comparison
Best t/J achieved on Llama 3 8B FP8 batch 32, identical vLLM config:
| GPU | TDP | Best t/J | VRAM | Notes |
|---|---|---|---|---|
| RTX 6000 Pro | 300 W | 5.4 | 96 GB | Efficiency-tuned silicon |
| H100 80GB | 700 W | 5.0 | 80 GB | Datacentre, HBM3 |
| RTX 5060 Ti 16GB | 180 W | 4.6 | 16 GB | Smaller card, bandwidth-balanced |
| RTX 5080 16GB | 360 W | 3.8 | 16 GB | Blackwell consumer |
| RTX 4090 24GB | 450 W | 3.4 | 24 GB | Highest single-card capacity in class |
| RTX 5090 32GB | 575 W | 3.4 | 32 GB | More VRAM, similar efficiency |
| RTX 3090 24GB | 350 W | 3.3 | 24 GB | No native FP8 |
Among the consumer cards, the 5060 Ti wins per joule by being smaller and more bandwidth-balanced; it has just enough silicon to amortise its small TDP. The 4090 wins per chassis: it serves about 3x the throughput of a 5060 Ti and supports 70B AWQ models the smaller card cannot host at all. The 6000 Pro is the overall efficiency king thanks to lower clocks and the same architectural improvements as the 5090, but it costs 4-5x more upfront. See the full 4090 vs 5090, 4090 vs 3090, and 4090 vs 5060 Ti decision guides.
Power capping strategies
Setting `nvidia-smi -pl 350` (requires root) caps the 4090 at 350 W. At that cap we measured a 4-7% throughput drop and a 12% improvement in t/J. The Pareto-optimal cap depends on your binding constraint:
| Cap | Aggregate t/s (Llama 3 8B FP8 batch 32) | t/J | Use when |
|---|---|---|---|
| 450 W (stock) | 1100 | 3.40 | Throughput is everything |
| 400 W | 1078 | 3.55 | Default; balances both |
| 350 W | 1040 | 3.85 | Mains-constrained colos |
| 300 W | 945 | 4.12 | Aggressive efficiency tuning |
| 250 W | 790 | 4.36 | Heat-constrained or solar-batteried |
For colocated UK racks where mains is the binding constraint, capping at 350-380 W is a sensible default and effectively gives you a 6000 Pro-grade efficiency curve at 4090 capex. Detail in the power draw and efficiency piece and the thermal performance writeup.
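A cap sweep is easy to script. The sketch below only builds the `nvidia-smi` command lines (a dry run); wire the commented line to your own benchmark and the NVML sampling loop above to collect the throughput and t/J columns. The cap list and GPU index are assumptions for a single-4090 box, and `-pl` needs root.

```python
def power_cap_cmds(caps_w, gpu_index=0):
    """Build nvidia-smi power-limit invocations for a cap sweep (requires root)."""
    return [["nvidia-smi", "-i", str(gpu_index), "-pl", str(w)] for w in caps_w]

for cmd in power_cap_cmds([450, 400, 350, 300, 250]):
    print(" ".join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # then benchmark + sample NVML
```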
Undervolting
Undervolting via `nvidia-smi --lock-gpu-clocks` (alias `-lgc`) plus a custom voltage curve (MSI Afterburner on Windows; `nvidia-settings` clock offsets on Linux) extracts another 4-6% efficiency at 350 W. The 4090 silicon lottery means individual cards vary by roughly 30 mV; a curve that holds 2700 MHz at 1015 mV is achievable on most samples. The work is not zero-risk, and we generally recommend power capping over undervolting in production.
Cost implications and named scenarios
At UK industrial pricing the 4090 at 380 W draws about 9.1 kWh/day, roughly 1.64 GBP/day (49 GBP/month) in raw power. Serving Llama 3 8B FP8 at 1100 t/s aggregate, that is roughly 95 million tokens per day per GPU, call it 2.85 billion tokens per month per card. Compare to OpenAI list pricing in our 4090 vs OpenAI API cost piece.
Named scenario: a 50-engineer SaaS RAG product
One real customer, a B2B SaaS with 50 internal engineers and ~3,000 paying tenants, runs Qwen 14B AWQ on a single 4090 at a 350 W cap. Steady-state aggregate is 720 t/s at 2.95 t/J; run flat out, that is roughly 1.9 billion tokens per month at a wall-power cost of about 32 GBP, versus a quoted Anthropic API bill of low five figures for the same volume. Full cost-of-ownership math is in the ROI analysis and monthly hosting cost posts.
Production gotchas
- NVML samples lag. A 100 ms NVML poll shows averaged power, not instantaneous spikes. Real peak draw can be 30 W higher than NVML reports; size your PSU for at least 550 W per 4090.
- Idle floor isn't zero. Even with vLLM idle, the card sits at 25 W. A fleet of 20 4090s burns 500 W (about 65 GBP/month at 0.18 GBP/kWh) when nobody is talking to it. If your traffic is bursty, schedule idle cards to power down.
- Power cap doesn't help thermal limits. If your chassis chokes airflow, capping power may not be enough; you'll still throttle. Check VRAM junction temp with `nvidia-smi -q -d TEMPERATURE`.
- FP8 KV cache halves bandwidth pressure. Always pair `--quantization fp8` with `--kv-cache-dtype fp8`; the t/J table assumes both. Without FP8 KV, batch 32 t/J drops by 25%.
- Continuous batching is non-negotiable. Without it, your card runs at batch 1 efficiency permanently. Validate with `vllm.metrics`.
- Don't measure during prefill warmup. The first 200 tokens of any batch include cold-cache effects that distort t/J.
- UPS double-conversion adds 6-9% loss. Add this to your wall-power-to-card calculation; it’s invisible to NVML.
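Pulling the last gotcha together with PUE from earlier: card draw is not billed wall draw. A minimal sketch, assuming a 7% UPS double-conversion loss and a PUE of 1.2 (substitute your own site's figures):

```python
def wall_watts(card_w: float, ups_loss: float = 0.07, pue: float = 1.2) -> float:
    """Translate NVML-reported card draw into billed wall draw.

    ups_loss: fraction lost in UPS double conversion (assumed 7%).
    pue: facility power usage effectiveness (assumed 1.2).
    """
    return card_w / (1.0 - ups_loss) * pue

print(f"380 W at the card is about {wall_watts(380):.0f} W at the wall")
```

Under these assumed figures the overhead adds roughly 29% to the naive card-power bill.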
Verdict
The 4090 is not the most efficient card per joule; that title goes to the 6000 Pro and 5060 Ti. But it is the most efficient card per chassis slot at its capacity tier: you get 24 GB and 1,100 t/s aggregate on Llama 8B in 1U, drawing 380 W steady-state. For most teams that's the right axis to optimise. If your fleet is power-constrained rather than slot-constrained, consider hybrid pairings: the 4090 + 5060 Ti hybrid pattern lets the smaller card handle small models at higher t/J while reserving the 4090 for 14-70B work.
Optimise inference cost per token
Predictable UK power. No per-token API surprises.
Order the RTX 4090 24GB.
See also: RTX 4090 power draw, monthly hosting cost, 4090 vs OpenAI cost, Llama 3 8B benchmark, ROI analysis, spec breakdown, FP8 tensor cores, 5060 Ti tokens per watt.