
GPU Power During AI Inference by Model

Measuring GPU power consumption during AI inference across model sizes and GPU types. Wattage under load, idle power draw, and energy cost analysis for production LLM serving.

Benchmark Overview

GPU power consumption directly affects operating costs and thermal management requirements. A top-end RTX 6000 Pro carries a 700W TDP while an RTX 5090 is rated at 450W, yet the two deliver very different throughput per watt. We measured real-world power draw during AI inference across GPU models and LLM sizes on dedicated GPU hosting to quantify energy efficiency.

Test Configuration

GPUs: RTX 5090 (450W TDP), RTX 6000 Pro (350W TDP), RTX 6000 Pro 96 GB (300W TDP), RTX 6000 Pro (700W TDP). Models: Llama 3 8B INT4, Llama 3 70B INT4. Workload: continuous inference at 10 concurrent users via vLLM. Power measured via nvidia-smi at 1-second intervals, averaged over 10-minute sustained load.
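The measurement step can be sketched in a few lines. This is a minimal illustration: it assumes samples were logged with `nvidia-smi --query-gpu=power.draw --format=csv,noheader -l 1`, and the sample values below are made up for the example, not measured data.

```python
# Minimal sketch: average sustained power from nvidia-smi samples.
# Assumes lines in the form produced by:
#   nvidia-smi --query-gpu=power.draw --format=csv,noheader -l 1
# The sample values below are illustrative, not real measurements.

samples = ["378.42 W", "381.10 W", "379.55 W", "382.03 W"]

def average_power(lines):
    """Parse 'NNN.NN W' samples and return the mean draw in watts."""
    watts = [float(line.split()[0]) for line in lines]
    return sum(watts) / len(watts)

print(f"Average draw: {average_power(samples):.1f} W")
```

In the actual runs, one sample per second over ten minutes gives 600 such values per GPU.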

Power Draw During Inference

| GPU | Idle (W) | 8B INT4 Load (W) | 70B INT4 Load (W) | TDP Utilisation |
|---|---|---|---|---|
| RTX 5090 | 25 | 280 | 380 | 62-84% |
| RTX 6000 Pro | 30 | 220 | 310 | 63-89% |
| RTX 6000 Pro 96 GB | 35 | 195 | 265 | 65-88% |
| RTX 6000 Pro (700W) | 45 | 380 | 550 | 54-79% |
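The TDP utilisation column is simply measured load power divided by the rated TDP. A quick sanity check against the RTX 5090 row (450W TDP):

```python
# TDP utilisation = measured load power / rated TDP, as a percentage.
# Figures taken from the RTX 5090 row above (450W TDP).

def tdp_utilisation(load_w, tdp_w):
    return 100 * load_w / tdp_w

print(f"8B INT4:  {tdp_utilisation(280, 450):.0f}%")   # low end of the 62-84% range
print(f"70B INT4: {tdp_utilisation(380, 450):.0f}%")   # high end of the 62-84% range
```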

Throughput per Watt (70B INT4, 10 Users)

| GPU | Throughput (tok/s) | Power (W) | Tokens per Watt | Monthly Energy Cost (UK) |
|---|---|---|---|---|
| RTX 5090 | 320 | 380 | 0.84 | ~85 GBP |
| RTX 6000 Pro | 380 | 310 | 1.23 | ~70 GBP |
| RTX 6000 Pro 96 GB | 450 | 265 | 1.70 | ~60 GBP |
| RTX 6000 Pro (700W) | 720 | 550 | 1.31 | ~125 GBP |
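Tokens per watt is sustained throughput divided by average draw. Checking the RTX 6000 Pro 96 GB row from the table:

```python
# Tokens per watt = sustained throughput / average power draw.
# Figures from the RTX 6000 Pro 96 GB row (450 tok/s at 265W).

def tokens_per_watt(throughput_tok_s, power_w):
    return throughput_tok_s / power_w

print(f"{tokens_per_watt(450, 265):.2f} tok/W")
```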

Energy Efficiency Analysis

The RTX 6000 Pro 96 GB delivers the best energy efficiency at 1.70 tokens per watt for 70B inference. The 700W RTX 6000 Pro produces the highest absolute throughput (720 tok/s) but consumes more power per token than the 96 GB card. The RTX 5090 is the least efficient, as its consumer-grade power profile is optimised for peak gaming performance rather than sustained compute. See token speed benchmarks and GPU comparisons for throughput data.

Monthly energy costs at UK electricity rates (approximately 0.30 GBP/kWh) range from 60 to 125 GBP per GPU running 24/7 inference. For multi-GPU clusters, these costs multiply linearly.
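The monthly figures follow from draw × hours × tariff. A sketch assuming a 30-day month at 0.30 GBP/kWh, as stated above (the table values are rounded, so the exact arithmetic lands slightly lower):

```python
# Monthly energy cost for a GPU running 24/7 inference.
# Assumes a 30-day month and 0.30 GBP/kWh, per the text above;
# the table rounds these figures to the nearest ~5 GBP.

RATE_GBP_PER_KWH = 0.30
HOURS_PER_MONTH = 24 * 30

def monthly_cost_gbp(avg_watts):
    kwh = avg_watts / 1000 * HOURS_PER_MONTH
    return kwh * RATE_GBP_PER_KWH

for gpu, watts in [("RTX 5090", 380), ("RTX 6000 Pro 96 GB", 265)]:
    print(f"{gpu}: {monthly_cost_gbp(watts):.2f} GBP/month")
```

For a cluster, multiply by the GPU count; the cost scales linearly with sustained draw.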

Cooling Implications

Higher power draw means more heat. The 700W RTX 6000 Pro at 550W sustained requires enterprise-grade cooling. The RTX 6000 Pro 96 GB at 265W runs comfortably in a standard server chassis. For private AI hosting in colocation, power and cooling costs can exceed GPU rental costs. Managed dedicated servers include cooling in the monthly price, simplifying budgeting.

Recommendations

For energy-conscious deployments, the RTX 6000 Pro 96 GB offers the best tokens-per-watt efficiency. For maximum throughput regardless of power, the 700W RTX 6000 Pro leads. Factor energy costs into your total cost of ownership when comparing GPU options. Deploy on GigaGPU dedicated servers where power and cooling are included. See the benchmarks section, LLM hosting guide, and infrastructure blog for more analysis.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
