RTX 5090 Throughput Overview
The RTX 5090 is the highest-throughput consumer GPU available for dedicated GPU hosting. Its 32 GB of GDDR7 VRAM and Blackwell architecture make it the only consumer card capable of running 70B-class models (at INT4 quantisation) while simultaneously delivering top throughput on smaller models. We benchmarked maximum requests per second across batch sizes from 1 to 64.
All tests used vLLM continuous batching on GigaGPU bare-metal servers. Each request contained a 128-token prompt with 256-token output. Throughput was measured as sustained completed requests per second over 60-second windows. For single-user speed, see the tokens per second benchmark.
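The "sustained requests per second over 60-second windows" metric can be computed from per-request completion timestamps. This is a minimal sketch of one way to do it; the function name and the choice to report the worst full window (so short bursts don't inflate the figure) are ours, not details of the benchmark harness:

```python
def sustained_rps(completion_times, window=60.0):
    """Sustained requests/sec: completed requests per fixed window,
    reporting the minimum over all full windows."""
    if not completion_times:
        return 0.0
    times = sorted(completion_times)
    start, end = times[0], times[-1]
    n_windows = int((end - start) // window)
    if n_windows == 0:
        # Run shorter than one window: fall back to a simple average
        return len(times) / max(end - start, 1e-9)
    counts = []
    for i in range(n_windows):
        lo = start + i * window
        hi = lo + window
        counts.append(sum(lo <= t < hi for t in times))
    return min(counts) / window
```

Feeding this the completion timestamps from any load generator pointed at the inference server yields a number directly comparable to the tables below.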
Requests/sec — 7-8B Models
| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
|---|---|---|---|---|---|---|
| LLaMA 3 8B (INT4) | 0.52 | 1.90 | 3.45 | 5.80 | 8.20 | 9.80 |
| LLaMA 3 8B (FP16) | 0.37 | 1.30 | 2.35 | 3.95 | 5.60 | 6.70 |
| Mistral 7B (INT4) | 0.55 | 2.00 | 3.65 | 6.10 | 8.60 | 10.30 |
| Mistral 7B (FP16) | 0.40 | 1.40 | 2.50 | 4.20 | 5.90 | 7.10 |
| DeepSeek R1 Distill 7B (INT4) | 0.46 | 1.70 | 3.10 | 5.20 | 7.40 | 8.90 |
| Qwen 2.5 7B (INT4) | 0.50 | 1.85 | 3.35 | 5.65 | 8.00 | 9.60 |
The RTX 5090 peaks at 9.8-10.3 requests/sec with INT4 7B models at batch 64 — roughly 2.2x the RTX 3090’s throughput and 55 percent above the RTX 5080. At continuous saturation, that is over 26 million requests per month from a single card.
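The monthly figure is simple arithmetic, assuming a 30-day month at continuous saturation with zero downtime:

```python
def monthly_requests(peak_rps: float, days: int = 30) -> float:
    """Requests per month at continuous saturation (86,400 s/day)."""
    return peak_rps * 86_400 * days

# Peak INT4 figure from the table above (Mistral 7B at batch 64)
print(f"{monthly_requests(10.30):,.0f}")
```

Any real deployment will land below this ceiling, since traffic is rarely flat enough to keep every batch slot full around the clock.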
Requests/sec — Larger Models
The 5090’s 32 GB VRAM opens the door to larger models that simply cannot run on 16-24 GB cards at useful batch sizes.
| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
|---|---|---|---|---|---|---|
| LLaMA 3 70B (INT4) | 0.06 | 0.20 | 0.35 | 0.55 | 0.72 | OOM |
| Mixtral 8x7B (INT4) | 0.10 | 0.34 | 0.58 | 0.90 | 1.20 | OOM |
| CodeLlama 34B (INT4) | 0.12 | 0.40 | 0.68 | 1.05 | 1.40 | 1.55 |
LLaMA 3 70B INT4 reaches 0.72 req/s at batch 32 before running out of memory at batch 64. That is still over 1.8 million requests per month — viable for moderate-traffic production APIs. For higher throughput with 70B models, multi-GPU tensor parallelism across two 5090 cards roughly doubles these figures.
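A two-card tensor-parallel launch looks roughly like the following. This is an illustrative config fragment, not the exact configuration we benchmarked; the checkpoint path is a placeholder, and the flag values are examples:

```shell
# Shard a 70B INT4 (AWQ) checkpoint across two RTX 5090s
vllm serve <path-to-70b-int4-awq-checkpoint> \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-num-seqs 64   # cap on concurrently scheduled sequences
```

Tensor parallelism splits each layer's weights across both cards, so activations cross the PCIe link every layer; throughput roughly doubles, but interconnect bandwidth keeps it short of a perfect 2x.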
Throughput Scaling Curve
Throughput scales near-linearly from batch 1 to batch 16 on the RTX 5090, then begins to flatten as memory bandwidth becomes the bottleneck. The inflection point is around batch 32 for 7B models and batch 16 for 70B models. Beyond these points, adding more concurrent requests yields diminishing throughput gains while significantly increasing per-request latency.
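The flattening is visible directly in the 8B INT4 numbers: dividing measured throughput by perfect linear scaling from batch 1 gives a quick efficiency check, a back-of-envelope calculation on the table's own figures:

```python
# Measured req/s for LLaMA 3 8B INT4, from the first table
batches = [1, 4, 8, 16, 32, 64]
rps = [0.52, 1.90, 3.45, 5.80, 8.20, 9.80]

for b, r in zip(batches, rps):
    ideal = rps[0] * b  # perfect linear scaling from batch 1
    print(f"batch {b:>2}: {r / ideal:.0%} of linear")
```

Efficiency holds in the 83-91 percent range through batch 8, drops to about 70 percent at batch 16, and falls below 30 percent at batch 64: the memory-bandwidth knee described above.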
For interactive applications, batch 8-16 offers the best balance — 3.45-5.80 req/s with LLaMA 3 8B INT4 at a manageable 2-4 seconds of end-to-end latency. For batch processing, pushing to batch 32-64 maximises aggregate throughput. See the batch size impact on tokens/sec for a detailed analysis of this scaling behaviour.
Cost per Million Requests
At approximately £250/month for an RTX 5090 dedicated server, the cost per million requests at peak throughput is competitive.
| Model | Peak req/s | Requests/Month | Cost per 1M Requests |
|---|---|---|---|
| Mistral 7B INT4 | 10.30 | ~26.6M | ~£9.40 |
| LLaMA 3 8B INT4 | 9.80 | ~25.4M | ~£9.90 |
| LLaMA 3 70B INT4 | 0.72 | ~1.86M | ~£134 |
For 7B models, the 5090’s cost per million requests is nearly identical to the 3090 despite the higher monthly cost — you pay more but get proportionally more throughput. For 70B models, the self-hosted cost of £134/M requests is still dramatically cheaper than API providers. Use the LLM cost calculator to model your specific workload, and see cost per million tokens for token-level pricing.
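The cost column can be rechecked under the same assumption of a 30-day month at continuous peak throughput; the helper name is ours:

```python
def cost_per_million_requests(monthly_cost_gbp: float, peak_rps: float,
                              days: int = 30) -> float:
    """GBP per 1M requests at continuous peak throughput."""
    monthly = peak_rps * 86_400 * days
    return monthly_cost_gbp * 1_000_000 / monthly

print(f"£{cost_per_million_requests(250, 0.72):.0f}")  # LLaMA 3 70B INT4 → £134
```

Halving utilisation doubles the effective cost per request, so these figures are a floor, not a typical operating cost.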
Conclusion
The RTX 5090 delivers the highest single-card LLM throughput available in consumer hardware — over 10 requests per second with 7B INT4 models and viable throughput even for 70B models. For high-volume API workloads, it processes over 26 million requests per month on a single dedicated server. Compare all GPU options in the RTX 3090 vs RTX 5090 throughput per dollar guide or browse the full Benchmarks category.