RTX 5080 Throughput Overview
The RTX 5080 brings Blackwell-generation memory bandwidth to dedicated GPU hosting at a mid-range price. For API-style workloads where maximum requests per second matters more than single-request latency, we benchmarked the 5080’s throughput ceiling across popular 7B-8B models at varying batch sizes.
Tests ran on GigaGPU bare-metal hardware using vLLM continuous batching. Each request used a 128-token prompt with 256-token output. Throughput was measured as sustained completed requests per second over 60-second windows. For single-user speed data, see the tokens per second benchmark.
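The windowed measurement described above can be reproduced from raw completion timestamps with a short helper. This is an illustrative sketch, not our actual harness (`sustained_rps` is a hypothetical name, and the real runs also discard warm-up windows):

```python
# Compute sustained requests/sec from a list of request completion
# timestamps (seconds), using fixed-length measurement windows.
def sustained_rps(completion_times, window=60.0):
    """Return one requests/sec figure per full `window`-second window."""
    if not completion_times:
        return []
    start, end = min(completion_times), max(completion_times)
    n_windows = int((end - start) // window)
    rates = []
    for i in range(n_windows):
        lo = start + i * window
        hi = lo + window
        # Count requests that completed inside this window.
        count = sum(1 for t in completion_times if lo <= t < hi)
        rates.append(count / window)
    return rates
```

Reporting per-window rates rather than a single average makes throughput dips (for example from preemption or KV-cache eviction) visible in the data.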
Requests/sec by Batch Size
| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
|---|---|---|---|---|---|---|
| LLaMA 3 8B (INT4) | 0.36 | 1.30 | 2.35 | 3.90 | 5.40 | 6.20 |
| LLaMA 3 8B (FP16) | 0.26 | 0.85 | 1.42 | 2.15 | 2.80 | 3.10 |
| Mistral 7B (INT4) | 0.38 | 1.38 | 2.50 | 4.15 | 5.70 | 6.60 |
| Mistral 7B (FP16) | 0.28 | 0.92 | 1.52 | 2.30 | 3.00 | 3.30 |
| DeepSeek R1 Distill 7B (INT4) | 0.32 | 1.15 | 2.10 | 3.50 | 4.85 | 5.60 |
| Qwen 2.5 7B (INT4) | 0.35 | 1.25 | 2.28 | 3.80 | 5.25 | 6.10 |
The RTX 5080 peaks at 6.2-6.6 requests/sec with INT4 7B models at batch 64 — roughly 40 percent higher than the RTX 3090’s peak throughput. That translates to approximately 400 requests per minute or over 17 million requests per month at continuous saturation.
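The monthly figure is straightforward arithmetic at continuous saturation, an idealised upper bound; real deployments need headroom for traffic spikes, restarts, and maintenance:

```python
# Idealised monthly capacity at continuous saturation.
def monthly_capacity(req_per_sec, days=30):
    return req_per_sec * 60 * 60 * 24 * days

print(f"{monthly_capacity(6.6):,.0f}")  # 17,107,200 requests/month at 6.6 req/s
```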
RTX 5080 vs RTX 3090 Throughput
The 5080 outperforms the RTX 3090 at every batch size despite having 8 GB less VRAM. The Blackwell architecture’s higher memory bandwidth is the primary driver: 960 GB/s of GDDR7 on the 5080 versus 936 GB/s theoretical on the 3090’s GDDR6X, with noticeably better sustained throughput in practice. At batch 16, the 5080 delivers 3.90 req/s to the 3090’s 2.60, a 50 percent advantage.
The VRAM gap only becomes visible at batch 64 with FP16 models, where the 3090’s 24 GB allows slightly more KV cache headroom. With INT4 models, the 5080’s 16 GB is sufficient for batch 64 without memory pressure. For the full cost-adjusted comparison, see RTX 3090 vs RTX 5080 throughput per dollar.
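The KV-cache arithmetic behind that headroom claim can be sketched. The constants below are LLaMA 3 8B’s published shape (32 layers, 8 grouped-query KV heads, head dimension 128) with an FP16 cache; the sketch ignores vLLM’s paged-attention block overhead and the weights themselves, so treat it as a lower bound:

```python
# Rough KV-cache footprint, assuming LLaMA 3 8B's architecture.
def kv_cache_gb(batch, tokens_per_req, layers=32, kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    # Each token stores a K and a V vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return batch * tokens_per_req * per_token / 1024**3

# Batch 64, 128-token prompt + 256-token output = 384 tokens/request:
print(kv_cache_gb(64, 384))  # 3.0 GB of cache
```

At roughly 3 GB of KV cache for batch 64, an INT4 quantised 8B model (about 5 GB of weights) fits comfortably in 16 GB, while FP16 weights (about 16 GB) do not leave the same margin.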
Latency at Peak Throughput
At batch 64, individual requests take 9-10 seconds end-to-end on the 5080 (compared to 12-14 seconds on the 3090). The 5080’s faster per-token generation means even at high batch sizes, per-request latency remains more manageable. At batch 8, latency stays under 3 seconds — a practical operating point for near-real-time APIs that need both throughput and responsiveness.
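The batch-64 latency figure follows from Little’s law: in steady state, requests in flight equal throughput times per-request latency, so latency is roughly batch size divided by completion rate. A minimal sanity check:

```python
# Little's law: N = throughput * latency, so latency ~= N / throughput
# at steady-state saturation (real latency is lower when batches
# are not always full).
def expected_latency(batch, req_per_sec):
    return batch / req_per_sec

print(round(expected_latency(64, 6.2), 1))  # ~10.3 s, matching the observed 9-10 s
```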
For interactive chatbot use cases, batch 4-8 is the sweet spot on the 5080, delivering 1.30-2.35 req/s with sub-3-second response times. For user-facing concurrency numbers, see the RTX 5080 concurrent users benchmark. Our batch size impact analysis dives deeper into this relationship.
Production Capacity Planning
For an API serving 100 requests per minute with a 3-second latency SLA, a single RTX 5080 at batch 8 (2.35 req/s = 141 req/min) provides comfortable headroom. For 500 requests per minute, you would need four cards at batch 8 behind a load balancer (roughly 564 req/min combined), or you could use two RTX 5090 cards. Use the LLM cost calculator to model your specific throughput requirements.
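The sizing logic above reduces to a ceiling division. A minimal sketch, assuming sustained per-card throughput at your chosen batch size (`cards_needed` and its `headroom` multiplier are illustrative, not part of the calculator):

```python
import math

# Cards required to meet a target request rate, given per-card
# sustained throughput; headroom > 1.0 adds margin for traffic spikes.
def cards_needed(target_req_per_min, per_card_req_per_sec, headroom=1.0):
    per_card_per_min = per_card_req_per_sec * 60
    return math.ceil(target_req_per_min * headroom / per_card_per_min)

print(cards_needed(100, 2.35))  # 1 card covers 100 req/min
print(cards_needed(500, 2.35))  # 4 cards at batch 8
```

Sizing against sustained rather than peak throughput, plus an explicit headroom factor, avoids running cards at saturation where latency SLAs are first to break.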
For detailed guidance on capacity planning across different application types, see our GPU capacity planning for AI SaaS guide. You can also explore the full Benchmarks category for model-specific throughput data.
Conclusion
The RTX 5080 delivers 6.2-6.6 requests per second peak throughput with INT4 7B models — enough for over 17 million requests per month on a single card. Its Blackwell architecture provides a 40 percent throughput advantage over the RTX 3090 while maintaining lower per-request latency at every batch size. For mid-range dedicated GPU hosting, it is the current throughput-per-pound leader.