
RTX 5090: Maximum LLM Throughput (Requests/sec)

Maximum LLM request throughput for the RTX 5090 — requests per second at batch sizes 1 to 64 across 7B to 70B models with vLLM, plus cost-per-request analysis.

RTX 5090 Throughput Overview

The RTX 5090 is the highest-throughput consumer GPU available for dedicated GPU hosting. Its 32 GB of GDDR7 VRAM and Blackwell architecture make it the only consumer card capable of running 70B-class models while simultaneously delivering top throughput on smaller models. We benchmarked maximum requests per second across batch sizes from 1 to 64.

All tests used vLLM continuous batching on GigaGPU bare-metal servers. Each request used a 128-token prompt and a 256-token completion. Throughput was measured as sustained completed requests per second over 60-second windows. For single-user speed, see the tokens per second benchmark.
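To make the methodology concrete, here is a minimal load-generator sketch along the lines of what such a measurement involves: an asyncio client keeping a fixed number of completion requests in flight against a vLLM OpenAI-compatible endpoint and counting completions over a measurement window. The endpoint URL, model name, and filler prompt below are illustrative placeholders, not the exact harness used for these numbers.

```python
# Minimal throughput probe: keep N requests in flight against a vLLM
# OpenAI-compatible server and report sustained completed requests/sec.
# URL, model name, and prompt are placeholders, not the exact benchmark harness.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
PROMPT = " ".join(["hello"] * 128)   # ~128-token filler prompt
CONCURRENCY = 16                     # number of in-flight requests ("batch size")
WINDOW_SECONDS = 60                  # measurement window

async def worker(session: aiohttp.ClientSession, completed: list) -> None:
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 256}
    while True:
        async with session.post(URL, json=payload) as resp:
            await resp.json()
        completed[0] += 1

async def main() -> None:
    completed = [0]
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(session, completed))
                 for _ in range(CONCURRENCY)]
        await asyncio.sleep(WINDOW_SECONDS)
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    print(f"{completed[0] / WINDOW_SECONDS:.2f} requests/sec sustained")

asyncio.run(main())
```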

Requests/sec — 7-8B Models

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| LLaMA 3 8B (INT4) | 0.52 | 1.90 | 3.45 | 5.80 | 8.20 | 9.80 |
| LLaMA 3 8B (FP16) | 0.37 | 1.30 | 2.35 | 3.95 | 5.60 | 6.70 |
| Mistral 7B (INT4) | 0.55 | 2.00 | 3.65 | 6.10 | 8.60 | 10.30 |
| Mistral 7B (FP16) | 0.40 | 1.40 | 2.50 | 4.20 | 5.90 | 7.10 |
| DeepSeek R1 Distill 7B (INT4) | 0.46 | 1.70 | 3.10 | 5.20 | 7.40 | 8.90 |
| Qwen 2.5 7B (INT4) | 0.50 | 1.85 | 3.35 | 5.65 | 8.00 | 9.60 |

The RTX 5090 peaks at 9.8-10.3 requests/sec with INT4 7B models at batch 64 — roughly 2.2x the RTX 3090’s throughput and 55 percent above the RTX 5080. At continuous saturation, that is over 26 million requests per month from a single card.
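The monthly figure follows directly from the sustained rate, assuming a 30-day month and continuous saturation:

```python
# Converting sustained throughput to monthly capacity
# (assumes a 30-day month and 100% saturation)
peak_rps = 10.3                           # Mistral 7B INT4 at batch 64, from the table above
seconds_per_month = 60 * 60 * 24 * 30     # 2,592,000
print(f"~{peak_rps * seconds_per_month / 1e6:.1f}M requests/month")  # ~26.7M
```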

Requests/sec — Larger Models

The 5090’s 32 GB VRAM opens the door to larger models that simply cannot run on 16-24 GB cards at useful batch sizes.

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| LLaMA 3 70B (INT4) | 0.06 | 0.20 | 0.35 | 0.55 | 0.72 | OOM |
| Mixtral 8x7B (INT4) | 0.10 | 0.34 | 0.58 | 0.90 | 1.20 | OOM |
| CodeLlama 34B (INT4) | 0.12 | 0.40 | 0.68 | 1.05 | 1.40 | 1.55 |

LLaMA 3 70B INT4 reaches 0.72 req/s at batch 32 before running out of memory at batch 64. That is still over 1.8 million requests per month — viable for moderate-traffic production APIs. For higher throughput with 70B models, multi-GPU tensor parallelism across two 5090 cards roughly doubles these figures.
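As a sketch of what that two-card deployment could look like with vLLM's offline API (the checkpoint name and memory setting below are illustrative, not the exact configuration we benchmarked):

```python
# Sketch of a two-GPU tensor-parallel deployment with vLLM's offline API.
# The model name is a hypothetical INT4 (AWQ) build of LLaMA 3 70B; swap in
# whichever quantised checkpoint you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # hypothetical INT4 checkpoint
    tensor_parallel_size=2,                      # shard weights across two RTX 5090s
    gpu_memory_utilization=0.90,
)
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```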

Throughput Scaling Curve

Throughput scales near-linearly from batch 1 to batch 16 on the RTX 5090, then begins to flatten as memory bandwidth becomes the bottleneck. The inflection point is around batch 32 for 7B models and batch 16 for 70B models. Beyond these points, adding more concurrent requests yields diminishing throughput gains while significantly increasing per-request latency.

For interactive applications, batch 8-16 offers the best balance — 3.45-5.80 req/s with LLaMA 3 8B INT4 at manageable 2-4 second end-to-end latency. For batch processing, pushing to batch 32-64 maximises aggregate throughput. See the batch size impact on tokens/sec for a detailed analysis of this scaling behaviour.
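Little's law (in-flight requests = throughput × latency) is a quick way to sanity-check that latency claim against the throughput table:

```python
# Little's law sanity check: average end-to-end latency at a given batch size
# is roughly batch / (req/s). Figures from the LLaMA 3 8B INT4 row above.
for batch, rps in [(8, 3.45), (16, 5.80), (32, 8.20), (64, 9.80)]:
    print(f"batch {batch:>2}: ~{batch / rps:.1f} s per request")
# batch  8: ~2.3 s | batch 16: ~2.8 s | batch 32: ~3.9 s | batch 64: ~6.5 s
```

Batch 8-16 lands in the 2-3 second range, while pushing to batch 64 nearly triples per-request latency for only a modest throughput gain.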

Cost per Million Requests

At approximately £250/month for an RTX 5090 dedicated server, the cost per million requests at peak throughput is competitive.

| Model | Peak req/s | Requests/Month | Cost per 1M Requests |
| Mistral 7B INT4 | 10.30 | ~26.6M | ~£9.40 |
| LLaMA 3 8B INT4 | 9.80 | ~25.3M | ~£9.90 |
| LLaMA 3 70B INT4 | 0.72 | ~1.86M | ~£134 |

For 7B models, the 5090’s cost per million requests is nearly identical to the 3090 despite the higher monthly cost — you pay more but get proportionally more throughput. For 70B models, the self-hosted cost of £134/M requests is still dramatically cheaper than API providers. Use the LLM cost calculator to model your specific workload, and see cost per million tokens for token-level pricing.
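For reference, the cost column can be reproduced with the same arithmetic, assuming the ~£250/month server price, a 30-day month, and continuous saturation:

```python
# Reproducing the cost-per-million-requests column: monthly server cost divided
# by monthly request capacity at peak sustained throughput (30-day month).
monthly_cost_gbp = 250
seconds_per_month = 60 * 60 * 24 * 30

for model, peak_rps in [("Mistral 7B INT4", 10.30),
                        ("LLaMA 3 8B INT4", 9.80),
                        ("LLaMA 3 70B INT4", 0.72)]:
    monthly_requests = peak_rps * seconds_per_month
    cost_per_million = monthly_cost_gbp / (monthly_requests / 1e6)
    print(f"{model}: ~£{cost_per_million:.2f} per 1M requests")
# Matches the table above to rounding: ~£9.40, ~£9.90, ~£134
```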

Conclusion

The RTX 5090 delivers the highest single-card LLM throughput available in consumer hardware — over 10 requests per second with 7B INT4 models and viable throughput even for 70B models. For high-volume API workloads, it processes over 26 million requests per month on a single dedicated server. Compare all GPU options in the RTX 3090 vs RTX 5090 throughput per dollar guide or browse the full Benchmarks category.
