
RTX 5090: Maximum LLM Throughput (Requests/sec)

Maximum LLM request throughput for the RTX 5090 — requests per second at batch sizes 1 to 64 across 7B to 70B models with vLLM, plus cost-per-request analysis.

RTX 5090 Throughput Overview

The RTX 5090 is the highest-throughput consumer GPU available for dedicated GPU hosting. Its 32 GB of GDDR7 VRAM and Blackwell architecture make it the only consumer card capable of running 70B-class models while simultaneously delivering top throughput on smaller models. We benchmarked maximum requests per second across batch sizes from 1 to 64.

All tests used vLLM continuous batching on GigaGPU bare-metal servers. Each request used a 128-token prompt and a 256-token completion. Throughput was measured as sustained completed requests per second over 60-second windows. For single-user speed, see the tokens per second benchmark.
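To make the methodology concrete, here is a minimal load-generator sketch along the lines of what such a measurement involves: an asyncio client keeping a fixed number of completion requests in flight against a vLLM OpenAI-compatible endpoint and counting completions over a measurement window. The endpoint URL, model name, and filler prompt below are illustrative placeholders, not the exact harness used for these numbers.

```python
# Minimal throughput probe: keep N requests in flight against a vLLM
# OpenAI-compatible server and report sustained completed requests/sec.
# URL, model name, and prompt are placeholders, not the exact benchmark harness.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
PROMPT = " ".join(["hello"] * 128)   # ~128-token filler prompt
CONCURRENCY = 16                     # number of in-flight requests ("batch size")
WINDOW_SECONDS = 60                  # measurement window

async def worker(session: aiohttp.ClientSession, completed: list) -> None:
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 256}
    while True:
        async with session.post(URL, json=payload) as resp:
            await resp.json()
        completed[0] += 1

async def main() -> None:
    completed = [0]
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(session, completed))
                 for _ in range(CONCURRENCY)]
        await asyncio.sleep(WINDOW_SECONDS)
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    print(f"{completed[0] / WINDOW_SECONDS:.2f} requests/sec sustained")

asyncio.run(main())
```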

Requests/sec — 7-8B Models

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| LLaMA 3 8B (INT4) | 0.52 | 1.90 | 3.45 | 5.80 | 8.20 | 9.80 |
| LLaMA 3 8B (FP16) | 0.37 | 1.30 | 2.35 | 3.95 | 5.60 | 6.70 |
| Mistral 7B (INT4) | 0.55 | 2.00 | 3.65 | 6.10 | 8.60 | 10.30 |
| Mistral 7B (FP16) | 0.40 | 1.40 | 2.50 | 4.20 | 5.90 | 7.10 |
| DeepSeek R1 Distill 7B (INT4) | 0.46 | 1.70 | 3.10 | 5.20 | 7.40 | 8.90 |
| Qwen 2.5 7B (INT4) | 0.50 | 1.85 | 3.35 | 5.65 | 8.00 | 9.60 |

The RTX 5090 peaks at 9.8-10.3 requests/sec with INT4 7B models at batch 64 — roughly 2.2x the RTX 3090’s throughput and 55 percent above the RTX 5080. At continuous saturation, that is over 26 million requests per month from a single card.
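The monthly figure follows directly from the sustained rate, assuming a 30-day month and continuous saturation:

```python
# Converting sustained throughput to monthly capacity
# (assumes a 30-day month and 100% saturation)
peak_rps = 10.3                           # Mistral 7B INT4 at batch 64, from the table above
seconds_per_month = 60 * 60 * 24 * 30     # 2,592,000
print(f"~{peak_rps * seconds_per_month / 1e6:.1f}M requests/month")  # ~26.7M
```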

Requests/sec — Larger Models

The 5090’s 32 GB VRAM opens the door to larger models that simply cannot run on 16-24 GB cards at useful batch sizes.

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| LLaMA 3 70B (INT4) | 0.06 | 0.20 | 0.35 | 0.55 | 0.72 | OOM |
| Mixtral 8x7B (INT4) | 0.10 | 0.34 | 0.58 | 0.90 | 1.20 | OOM |
| CodeLlama 34B (INT4) | 0.12 | 0.40 | 0.68 | 1.05 | 1.40 | 1.55 |

LLaMA 3 70B INT4 reaches 0.72 req/s at batch 32 before running out of memory at batch 64. That is still over 1.8 million requests per month — viable for moderate-traffic production APIs. For higher throughput with 70B models, multi-GPU tensor parallelism across two 5090 cards roughly doubles these figures.
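As a sketch of what that two-card deployment could look like with vLLM's offline API (the checkpoint name and memory setting below are illustrative, not the exact configuration we benchmarked):

```python
# Sketch of a two-GPU tensor-parallel deployment with vLLM's offline API.
# The model name is a hypothetical INT4 (AWQ) build of LLaMA 3 70B; swap in
# whichever quantised checkpoint you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # hypothetical INT4 checkpoint
    tensor_parallel_size=2,                      # shard weights across two RTX 5090s
    gpu_memory_utilization=0.90,
)
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```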

Throughput Scaling Curve

Throughput scales near-linearly from batch 1 to batch 16 on the RTX 5090, then begins to flatten as memory bandwidth becomes the bottleneck. The inflection point is around batch 32 for 7B models and batch 16 for 70B models. Beyond these points, adding more concurrent requests yields diminishing throughput gains while significantly increasing per-request latency.

For interactive applications, batch 8-16 offers the best balance — 3.45-5.80 req/s with LLaMA 3 8B INT4 at manageable 2-4 second end-to-end latency. For batch processing, pushing to batch 32-64 maximises aggregate throughput. See the batch size impact on tokens/sec for a detailed analysis of this scaling behaviour.
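Little's law (in-flight requests = throughput × latency) is a quick way to sanity-check that latency claim against the throughput table:

```python
# Little's law sanity check: average end-to-end latency at a given batch size
# is roughly batch / (req/s). Figures from the LLaMA 3 8B INT4 row above.
for batch, rps in [(8, 3.45), (16, 5.80), (32, 8.20), (64, 9.80)]:
    print(f"batch {batch:>2}: ~{batch / rps:.1f} s per request")
# batch  8: ~2.3 s | batch 16: ~2.8 s | batch 32: ~3.9 s | batch 64: ~6.5 s
```

Batch 8-16 lands in the 2-3 second range, while pushing to batch 64 nearly triples per-request latency for only a modest throughput gain.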

Cost per Million Requests

At approximately £250/month for an RTX 5090 dedicated server, the cost per million requests at peak throughput is competitive.

| Model | Peak req/s | Requests/Month | Cost per 1M Requests |
| Mistral 7B INT4 | 10.30 | ~26.6M | ~£9.40 |
| LLaMA 3 8B INT4 | 9.80 | ~25.3M | ~£9.90 |
| LLaMA 3 70B INT4 | 0.72 | ~1.86M | ~£134 |

For 7B models, the 5090’s cost per million requests is nearly identical to the 3090 despite the higher monthly cost — you pay more but get proportionally more throughput. For 70B models, the self-hosted cost of £134/M requests is still dramatically cheaper than API providers. Use the LLM cost calculator to model your specific workload, and see cost per million tokens for token-level pricing.
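For reference, the cost column can be reproduced with the same arithmetic, assuming the ~£250/month server price, a 30-day month, and continuous saturation:

```python
# Reproducing the cost-per-million-requests column: monthly server cost divided
# by monthly request capacity at peak sustained throughput (30-day month).
monthly_cost_gbp = 250
seconds_per_month = 60 * 60 * 24 * 30

for model, peak_rps in [("Mistral 7B INT4", 10.30),
                        ("LLaMA 3 8B INT4", 9.80),
                        ("LLaMA 3 70B INT4", 0.72)]:
    monthly_requests = peak_rps * seconds_per_month
    cost_per_million = monthly_cost_gbp / (monthly_requests / 1e6)
    print(f"{model}: ~£{cost_per_million:.2f} per 1M requests")
# Matches the table above to rounding: ~£9.40, ~£9.90, ~£134
```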

Conclusion

The RTX 5090 delivers the highest single-card LLM throughput available in consumer hardware — over 10 requests per second with 7B INT4 models and viable throughput even for 70B models. For high-volume API workloads, it processes over 26 million requests per month on a single dedicated server. Compare all GPU options in the RTX 3090 vs RTX 5090 throughput per dollar guide or browse the full Benchmarks category.
