
Benchmarks

Tokens per Watt: Energy Efficiency

Benchmarking AI inference energy efficiency across GPU models measured in tokens per watt. Comparing power consumption against throughput to find the most cost-effective GPU for sustainable AI hosting.

Benchmarks April 16, 2026 2 min read admin

Benchmark Overview

As AI inference scales, energy consumption becomes a significant cost factor and a sustainability concern. A data centre running 100 GPUs at 400W each draws 40kW continuously: about 29,200 kWh per month, or roughly 8,800 GBP in electricity alone at the UK rate of approximately 0.30 GBP/kWh. We measured tokens-per-watt efficiency across GPU models and LLM sizes to identify the most energy-efficient configurations for production inference on dedicated GPU hosting.

Test Configuration

GPUs: RTX 5090 (450W TDP), RTX 6000 Pro (350W TDP), RTX 6000 Pro 96 GB (300W TDP), RTX 6000 Pro (700W TDP). Models: Llama 3 8B INT4 and Llama 3 70B INT4. Workload: sustained inference at 10 concurrent users via vLLM. Power was measured with nvidia-smi and averaged over a 30-minute sustained load. Energy efficiency = total tokens generated per second / average power draw in watts; because 1 W equals 1 J/s, this figure is equivalent to tokens per joule. See token benchmarks for raw throughput numbers.
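The measurement method can be sketched in Python. This is a minimal sketch, not the harness used for these results: it assumes a single GPU, one reading per second, and throughput reported separately by the load generator. The function names are hypothetical; only the nvidia-smi query flags are standard.

```python
import subprocess
import time
from statistics import mean


def parse_power_samples(csv_text: str) -> list[float]:
    """Parse watt readings from nvidia-smi output produced with
    --query-gpu=power.draw --format=csv,noheader,nounits (one value per line)."""
    samples = []
    for line in csv_text.splitlines():
        value = line.strip()
        if not value:
            continue
        try:
            samples.append(float(value))
        except ValueError:
            # Skip non-numeric entries such as "[N/A]" (power not reported).
            continue
    return samples


def read_power_draw_w(gpu_index: int = 0) -> float:
    """Take a single power reading, in watts, for one GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    samples = parse_power_samples(result.stdout)
    if not samples:
        raise RuntimeError("nvidia-smi returned no numeric power reading")
    return samples[0]


def average_power_w(duration_s: float, interval_s: float = 1.0,
                    gpu_index: int = 0) -> float:
    """Poll power draw every interval_s seconds for duration_s; return the mean."""
    readings = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        readings.append(read_power_draw_w(gpu_index))
        time.sleep(interval_s)
    return mean(readings)


def tokens_per_watt(tokens_per_second: float, avg_power_w: float) -> float:
    """Throughput divided by average power. Because 1 W = 1 J/s,
    the result is also tokens per joule."""
    return tokens_per_second / avg_power_w
```

While the vLLM load is running, `average_power_w(30 * 60)` approximates the 30-minute average; dividing measured throughput by it, for example `tokens_per_watt(890, 200)`, gives 4.45.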

Energy Efficiency: 8B INT4 Model

GPU	Throughput (tok/s)	Power Draw (W)	Tokens per Watt (tok/s per W)	Monthly Energy Cost (GBP, 24/7, UK)
RTX 5090 (450W TDP)	680	285	2.39	~62
RTX 6000 Pro (350W TDP)	750	225	3.33	~49
RTX 6000 Pro 96 GB (300W TDP)	890	200	4.45	~44
RTX 6000 Pro (700W TDP)	1,420	390	3.64	~85

Energy Efficiency: 70B INT4 Model

GPU	Throughput (tok/s)	Power Draw (W)	Tokens per Watt (tok/s per W)	Monthly Energy Cost (GBP, 24/7, UK)
RTX 5090 (450W TDP)	320	380	0.84	~83
RTX 6000 Pro (350W TDP)	380	310	1.23	~68
RTX 6000 Pro 96 GB (300W TDP)	450	265	1.70	~58
RTX 6000 Pro (700W TDP)	720	550	1.31	~120

Efficiency Rankings

The RTX 6000 Pro 96 GB (300W TDP) leads in energy efficiency for both model sizes, delivering the most tokens per watt (4.45 on 8B, 1.70 on 70B) at the lowest power draw of any card tested. The 700W-TDP RTX 6000 Pro delivers the highest absolute throughput and, despite drawing the most power, ranks second in efficiency (3.64 and 1.31), narrowly ahead of the 350W-TDP RTX 6000 Pro, which ranks third (3.33 and 1.23). The RTX 5090 is consistently the least efficient (2.39 and 0.84) due to its consumer power profile. See GPU comparisons for full specifications.
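The ranking can be checked directly against the two result tables. This short sketch recomputes tokens per watt from the published throughput and power figures and sorts each model's results; the two RTX 6000 Pro entries are labelled by the TDP given in the test configuration so they stay distinct.

```python
# Throughput (tok/s) and average power draw (W) from the two result tables.
RESULTS = {
    "8B INT4": {
        "RTX 5090 (450W TDP)": (680, 285),
        "RTX 6000 Pro (350W TDP)": (750, 225),
        "RTX 6000 Pro 96 GB (300W TDP)": (890, 200),
        "RTX 6000 Pro (700W TDP)": (1420, 390),
    },
    "70B INT4": {
        "RTX 5090 (450W TDP)": (320, 380),
        "RTX 6000 Pro (350W TDP)": (380, 310),
        "RTX 6000 Pro 96 GB (300W TDP)": (450, 265),
        "RTX 6000 Pro (700W TDP)": (720, 550),
    },
}


def rank_by_efficiency(rows: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Return (gpu, tokens_per_watt) pairs sorted from most to least efficient."""
    scored = [(gpu, tps / watts) for gpu, (tps, watts) in rows.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for model, rows in RESULTS.items():
    print(model)
    for rank, (gpu, tpw) in enumerate(rank_by_efficiency(rows), start=1):
        print(f"  {rank}. {gpu}: {tpw:.2f} tok/s per W")
```

For both models the order comes out as the 96 GB (300W TDP) card first, the 700W-TDP card second, the 350W-TDP card third and the RTX 5090 last.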

Cost-Efficiency Analysis

At UK electricity rates of approximately 0.30 GBP/kWh, monthly energy costs range from about 44 to 120 GBP per GPU running 24/7. For multi-GPU clusters of 4-8 GPUs, energy adds roughly 175-960 GBP per month (from four of the most efficient cards up to eight of the most power-hungry). This is 5-15% of the total hosting cost but becomes significant at scale. On managed dedicated servers, power costs are typically included in the monthly price, which simplifies budgeting.
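The monthly figures in both tables are consistent with about 730 hours per month at 0.30 GBP/kWh. The sketch below reproduces them under that assumption; it counts GPU board power only (as nvidia-smi reports it), so host CPU, fans and cooling overhead are excluded. The function name is hypothetical.

```python
HOURS_PER_MONTH = 730        # 24 h x 365 days / 12 months, rounded
PRICE_GBP_PER_KWH = 0.30     # approximate UK rate quoted above


def monthly_energy_cost_gbp(avg_power_w: float, gpus: int = 1,
                            price_per_kwh: float = PRICE_GBP_PER_KWH,
                            hours: float = HOURS_PER_MONTH) -> float:
    """Electricity cost of running `gpus` cards at `avg_power_w` each, 24/7.
    Counts GPU board power only (host CPU, fans and cooling are excluded)."""
    kwh = avg_power_w * gpus * hours / 1000
    return kwh * price_per_kwh


print(round(monthly_energy_cost_gbp(200), 2))        # 43.8   (lowest table row)
print(round(monthly_energy_cost_gbp(550), 2))        # 120.45 (highest table row)
print(round(monthly_energy_cost_gbp(200, gpus=4)))   # 175
print(round(monthly_energy_cost_gbp(550, gpus=8)))   # 964
```

The text rounds the upper cluster figure to 960 GBP (8 x ~120 GBP); the unrounded value is about 964 GBP.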

When comparing total cost per million tokens, the RTX 6000 Pro 96 GB (300W TDP) leads for energy-conscious deployments, while the 700W-TDP RTX 6000 Pro wins on time-efficiency, where faster completion matters more than energy cost. Configure efficiency-optimised deployments via the vLLM production guide.
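The trade-off can be made concrete by computing the energy share of cost per million tokens alongside the time needed to generate them. This is a sketch covering electricity only at the assumed 0.30 GBP/kWh; server rental, which dominates total cost, is not included, and the function names are hypothetical.

```python
JOULES_PER_KWH = 3_600_000
PRICE_GBP_PER_KWH = 0.30  # approximate UK rate used in this article


def energy_cost_per_million_tokens_gbp(tokens_per_second: float, avg_power_w: float,
                                       price_per_kwh: float = PRICE_GBP_PER_KWH) -> float:
    """Electricity cost of generating one million tokens (energy only)."""
    joules_per_token = avg_power_w / tokens_per_second  # W / (tok/s) = J per token
    kwh = joules_per_token * 1_000_000 / JOULES_PER_KWH
    return kwh * price_per_kwh


def seconds_per_million_tokens(tokens_per_second: float) -> float:
    """Wall-clock time to generate one million tokens at a steady rate."""
    return 1_000_000 / tokens_per_second


# 70B INT4 rows: RTX 6000 Pro 96 GB (300W TDP) vs RTX 6000 Pro (700W TDP)
for label, tps, watts in [("96 GB, 300W TDP", 450, 265), ("700W TDP", 720, 550)]:
    cost = energy_cost_per_million_tokens_gbp(tps, watts)
    minutes = seconds_per_million_tokens(tps) / 60
    print(f"{label}: {cost:.3f} GBP and {minutes:.0f} min per 1M tokens")
```

On the 70B results this gives about 0.049 GBP and 37 minutes per million tokens for the 96 GB card versus about 0.064 GBP and 23 minutes for the 700W-TDP card: cheaper energy on one side, faster completion on the other.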

Recommendations

For maximum energy efficiency, the RTX 6000 Pro 96 GB (300W TDP) delivers the best tokens per watt at both model sizes. For maximum throughput where energy cost is secondary, the 700W-TDP RTX 6000 Pro leads. Factor energy costs into GPU selection when running 24/7 inference, especially in multi-GPU configurations. Deploy on GigaGPU dedicated servers with private AI hosting, where power and cooling are managed. Visit the benchmarks section, LLM hosting guide, and infrastructure blog for comprehensive deployment planning.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.



admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

Ready to deploy your AI workload?

Dedicated GPU servers from our UK datacenter. NVMe storage, 1Gbps networking, full root access.


