
GPU Power During AI Inference by Model

Measuring GPU power consumption during AI inference across model sizes and GPU types. Wattage under load, idle power draw, and energy cost analysis for production LLM serving.

Benchmark Overview

GPU power consumption directly affects operating costs and thermal management requirements. A top-end RTX 6000 Pro carries a 700W TDP while an RTX 5090 is rated at 450W, yet the two deliver very different throughput per watt. We measured real-world power draw during AI inference across GPU models and LLM sizes on dedicated GPU hosting to quantify energy efficiency.

Test Configuration

GPUs: RTX 5090 (450W TDP), RTX 6000 Pro (350W TDP), RTX 6000 Pro 96 GB (300W TDP), RTX 6000 Pro (700W TDP). Models: Llama 3 8B INT4, Llama 3 70B INT4. Workload: continuous inference at 10 concurrent users via vLLM. Power measured via nvidia-smi at 1-second intervals, averaged over 10-minute sustained load.
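The measurement step can be sketched in a few lines. This is a minimal illustration: it assumes samples were logged with `nvidia-smi --query-gpu=power.draw --format=csv,noheader -l 1`, and the sample values below are made up for the example, not measured data.

```python
# Minimal sketch: average sustained power from nvidia-smi samples.
# Assumes lines in the form produced by:
#   nvidia-smi --query-gpu=power.draw --format=csv,noheader -l 1
# The sample values below are illustrative, not real measurements.

samples = ["378.42 W", "381.10 W", "379.55 W", "382.03 W"]

def average_power(lines):
    """Parse 'NNN.NN W' samples and return the mean draw in watts."""
    watts = [float(line.split()[0]) for line in lines]
    return sum(watts) / len(watts)

print(f"Average draw: {average_power(samples):.1f} W")
```

In the actual runs, one sample per second over ten minutes gives 600 such values per GPU.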

Power Draw During Inference

| GPU | Idle (W) | 8B INT4 Load (W) | 70B INT4 Load (W) | TDP Utilisation |
|---|---|---|---|---|
| RTX 5090 | 25 | 280 | 380 | 62-84% |
| RTX 6000 Pro | 30 | 220 | 310 | 63-89% |
| RTX 6000 Pro 96 GB | 35 | 195 | 265 | 65-88% |
| RTX 6000 Pro (700W) | 45 | 380 | 550 | 54-79% |
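The TDP utilisation column is simply measured load power divided by the rated TDP. A quick sanity check against the RTX 5090 row (450W TDP):

```python
# TDP utilisation = measured load power / rated TDP, as a percentage.
# Figures taken from the RTX 5090 row above (450W TDP).

def tdp_utilisation(load_w, tdp_w):
    return 100 * load_w / tdp_w

print(f"8B INT4:  {tdp_utilisation(280, 450):.0f}%")   # low end of the 62-84% range
print(f"70B INT4: {tdp_utilisation(380, 450):.0f}%")   # high end of the 62-84% range
```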

Throughput per Watt (70B INT4, 10 Users)

| GPU | Throughput (tok/s) | Power (W) | Tokens per Watt | Monthly Energy Cost (UK) |
|---|---|---|---|---|
| RTX 5090 | 320 | 380 | 0.84 | ~85 GBP |
| RTX 6000 Pro | 380 | 310 | 1.23 | ~70 GBP |
| RTX 6000 Pro 96 GB | 450 | 265 | 1.70 | ~60 GBP |
| RTX 6000 Pro (700W) | 720 | 550 | 1.31 | ~125 GBP |
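Tokens per watt is sustained throughput divided by average draw. Checking the RTX 6000 Pro 96 GB row from the table:

```python
# Tokens per watt = sustained throughput / average power draw.
# Figures from the RTX 6000 Pro 96 GB row (450 tok/s at 265W).

def tokens_per_watt(throughput_tok_s, power_w):
    return throughput_tok_s / power_w

print(f"{tokens_per_watt(450, 265):.2f} tok/W")
```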

Energy Efficiency Analysis

The RTX 6000 Pro 96 GB delivers the best energy efficiency at 1.70 tokens per watt for 70B inference. The 700W RTX 6000 Pro produces the highest absolute throughput (720 tok/s) but consumes more power per token than the 96 GB card. The RTX 5090 is the least efficient, as its consumer-grade power profile is optimised for peak gaming performance rather than sustained compute. See token speed benchmarks and GPU comparisons for throughput data.

Monthly energy costs at UK electricity rates (approximately 0.30 GBP/kWh) range from 60 to 125 GBP per GPU running 24/7 inference. For multi-GPU clusters, these costs multiply linearly.
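The monthly figures follow from draw × hours × tariff. A sketch assuming a 30-day month at 0.30 GBP/kWh, as stated above (the table values are rounded, so the exact arithmetic lands slightly lower):

```python
# Monthly energy cost for a GPU running 24/7 inference.
# Assumes a 30-day month and 0.30 GBP/kWh, per the text above;
# the table rounds these figures to the nearest ~5 GBP.

RATE_GBP_PER_KWH = 0.30
HOURS_PER_MONTH = 24 * 30

def monthly_cost_gbp(avg_watts):
    kwh = avg_watts / 1000 * HOURS_PER_MONTH
    return kwh * RATE_GBP_PER_KWH

for gpu, watts in [("RTX 5090", 380), ("RTX 6000 Pro 96 GB", 265)]:
    print(f"{gpu}: {monthly_cost_gbp(watts):.2f} GBP/month")
```

For a cluster, multiply by the GPU count; the cost scales linearly with sustained draw.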

Cooling Implications

Higher power draw means more heat. The 700W RTX 6000 Pro at 550W sustained requires enterprise-grade cooling. The RTX 6000 Pro 96 GB at 265W runs comfortably in a standard server chassis. For private AI hosting in colocation, power and cooling costs can exceed GPU rental costs. Managed dedicated servers include cooling in the monthly price, simplifying budgeting.

Recommendations

For energy-conscious deployments, the RTX 6000 Pro 96 GB offers the best tokens-per-watt efficiency. For maximum throughput regardless of power, the 700W RTX 6000 Pro leads. Factor energy costs into your total cost of ownership when comparing GPU options. Deploy on GigaGPU dedicated servers where power and cooling are included. See the benchmarks section, LLM hosting guide, and infrastructure blog for more analysis.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
