Tokens per watt (tokens per second divided by watts, which works out to tokens per joule) is the best single metric for GPU energy efficiency. The RTX 5060 Ti 16GB on our hosting is surprisingly strong here, thanks to its 180 W TDP and Blackwell FP8 tensor cores.
Method
Llama 3.1 8B FP8 served with vLLM. Power is the NVML-reported draw, averaged over a 60-second steady-state benchmark. Tokens/sec is aggregate (the sum across concurrent sequences), and tokens/Joule is tokens/sec divided by average watts.
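As a rough illustration of this method, the sketch below polls a power reader for a fixed window, averages the samples, and divides aggregate throughput by that average. The helper names (`sample_power_watts`, `tokens_per_joule`, `nvml_reader`) and the 0.5-second polling interval are assumptions for illustration, not the harness used for the numbers here. The `pynvml` bindings are assumed to be installed for the NVML reader.

```python
import time
from statistics import mean


def sample_power_watts(read_milliwatts, duration_s=60.0, interval_s=0.5):
    """Poll a power reader for duration_s seconds; return samples in watts."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(read_milliwatts() / 1000.0)  # NVML reports milliwatts
        time.sleep(interval_s)
    return samples


def tokens_per_joule(total_tokens, elapsed_s, power_samples_w):
    """Aggregate tokens/sec divided by average watts (1 W = 1 J/s)."""
    tokens_per_sec = total_tokens / elapsed_s
    return tokens_per_sec / mean(power_samples_w)


def nvml_reader(gpu_index=0):
    """Return a zero-argument callable that reads GPU power via NVML."""
    import pynvml  # imported lazily so the helpers above work without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle)
```

In use, `sample_power_watts(nvml_reader())` would run alongside the benchmark, and the token count generated during the same window would be passed to `tokens_per_joule`.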
Per-Card Numbers
| GPU | TDP (W) | Observed Draw | Llama 3.1 8B FP8 t/s | tokens/Joule |
|---|---|---|---|---|
| RTX 4060 8GB | 115 | 102 W | Does not fit | – |
| RTX 4060 Ti 16GB | 165 | 138 W | 470 t/s | 3.4 |
| RTX 5060 Ti 16GB | 180 | 155 W | 720 t/s | 4.6 |
| RTX 5080 16GB | 360 | 305 W | 1,150 t/s | 3.8 |
| RTX 3090 24GB | 350 | 290 W | 950 t/s | 3.3 |
| RTX 5090 32GB | 575 | 485 W | 1,650 t/s | 3.4 |
| RTX 6000 Pro 96GB | 300 | 255 W | 1,380 t/s | 5.4 |
The 5060 Ti has the best t/J of the consumer cards tested, at 4.6. Only the RTX 6000 Pro beats it, at 5.4 t/J (a Blackwell part tuned for efficiency). For pure tokens-per-watt, the 5060 Ti is the value leader among consumer cards.
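The tokens/Joule column is just throughput divided by observed draw, so it can be recomputed from the table. The short sketch below does that and ranks the cards. The dictionary layout and function name are illustrative choices, and the RTX 4060 8GB is omitted because the model does not fit.

```python
# (aggregate tokens/sec, observed draw in watts), copied from the table above
cards = {
    "RTX 4060 Ti 16GB": (470, 138),
    "RTX 5060 Ti 16GB": (720, 155),
    "RTX 5080 16GB": (1150, 305),
    "RTX 3090 24GB": (950, 290),
    "RTX 5090 32GB": (1650, 485),
    "RTX 6000 Pro 96GB": (1380, 255),
}


def tokens_per_joule(tokens_per_sec, watts):
    """tokens/sec divided by watts (J/s) gives tokens per joule."""
    return tokens_per_sec / watts


# Sort card names from most to least energy-efficient.
ranked = sorted(cards, key=lambda name: tokens_per_joule(*cards[name]), reverse=True)

for name in ranked:
    print(f"{name}: {tokens_per_joule(*cards[name]):.1f} t/J")
```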
Batch vs Efficiency
Higher batch means more tokens per forward pass for only slightly more power draw. On the 5060 Ti:
- Batch 1: 112 t/s at 130 W = 0.86 t/J
- Batch 8: 510 t/s at 150 W = 3.4 t/J
- Batch 32: 720 t/s at 155 W = 4.6 t/J
Efficiency more than quintuples at high batch. If energy matters, consolidate concurrent users rather than running separate boxes.
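To make the batch effect concrete, the sketch below recomputes tokens per joule for the three batch sizes listed and converts each into energy per million tokens. The kWh-per-million-tokens figure is a derived convenience metric, not something measured separately, and the function names are illustrative.

```python
# (batch size, aggregate tokens/sec, average watts) from the list above
runs = [(1, 112, 130), (8, 510, 150), (32, 720, 155)]

JOULES_PER_KWH = 3.6e6


def tokens_per_joule(tokens_per_sec, watts):
    return tokens_per_sec / watts


def kwh_per_million_tokens(tokens_per_sec, watts):
    """Energy needed to generate one million tokens, in kilowatt-hours."""
    joules = 1_000_000 / tokens_per_joule(tokens_per_sec, watts)
    return joules / JOULES_PER_KWH


baseline = tokens_per_joule(runs[0][1], runs[0][2])  # batch 1

for batch, tps, watts in runs:
    tpj = tokens_per_joule(tps, watts)
    print(
        f"batch {batch:>2}: {tpj:.2f} t/J, "
        f"{tpj / baseline:.1f}x batch 1, "
        f"{kwh_per_million_tokens(tps, watts):.3f} kWh per 1M tokens"
    )
```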
Verdict
For green AI deployments or power-constrained racks, the 5060 Ti at batch 32+ is the best consumer card on tokens/J. Pair it with chunked prefill and FP8 KV cache to push further.
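A minimal sketch of what that pairing might look like with vLLM's Python API is shown below. It assumes the `max_num_seqs`, `enable_chunked_prefill`, and `kv_cache_dtype` engine arguments, whose names and defaults can differ between vLLM versions. The model path is a hypothetical placeholder, and `max_num_seqs` only caps concurrent sequences rather than forcing a batch of 32.

```python
def build_engine_kwargs(model_path, max_concurrent_seqs=32):
    """Assemble vLLM engine arguments for a throughput-oriented setup."""
    return {
        "model": model_path,                  # hypothetical placeholder path
        "max_num_seqs": max_concurrent_seqs,  # upper bound on sequences batched together
        "enable_chunked_prefill": True,       # split long prefills across steps
        "kv_cache_dtype": "fp8",              # store the KV cache in FP8
    }


def load_engine(model_path):
    """Create the engine; requires vLLM and a supported GPU."""
    from vllm import LLM  # imported lazily so the config helper runs anywhere

    return LLM(**build_engine_kwargs(model_path))


if __name__ == "__main__":
    print(build_engine_kwargs("path/to/llama-3.1-8b-fp8"))
```

Whether these settings improve tokens/J on a given workload is worth confirming with the same power-averaged benchmark used above.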
Energy-Efficient LLM Hosting
4.6 tokens/joule at 155 W observed draw (180 W TDP). UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: vs 3090, vs 5080, concurrent users, max throughput.