
RTX 3090: Maximum LLM Throughput (Requests/sec)

Maximum LLM request throughput benchmarks for the RTX 3090 — requests per second at batch sizes from 1 to 64 across popular 7-8B models with vLLM continuous batching.

RTX 3090 Throughput Overview

If you are building an API on a dedicated GPU server, the metric that matters most is requests per second — how many completions your single RTX 3090 can process in a given time window. Higher batch sizes increase aggregate throughput but add per-request latency. We measured both dimensions to give you the data you need for capacity planning.

All tests ran on GigaGPU bare-metal hardware using vLLM with continuous batching. Each request contained a 128-token prompt with a 256-token output. We measured sustained throughput (requests completed per second) over a 60-second window at each concurrency level. For per-token speed, see the tokens per second benchmark.
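
For readers who want to reproduce something close to this setup, the sketch below shows how a single measurement could be taken with vLLM's offline API. The model identifier, prompt construction, and single generate() call are simplifying assumptions; the actual harness sustains load against a server over a 60-second window.

```python
# Minimal throughput sketch with vLLM's offline API. The model id and
# sampling settings are illustrative, not the exact benchmark harness.
import time

from vllm import LLM, SamplingParams

BATCH = 64                    # concurrent in-flight requests
PROMPT = "benchmark " * 128   # stand-in for a 128-token prompt
params = SamplingParams(max_tokens=256, ignore_eos=True)  # force full 256-token outputs

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # swap in an AWQ/GPTQ checkpoint for the INT4 rows

start = time.perf_counter()
llm.generate([PROMPT] * BATCH, params)  # continuous batching schedules all 64 requests together
elapsed = time.perf_counter() - start

print(f"{BATCH / elapsed:.2f} requests/sec at batch {BATCH}")
```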

Requests/sec by Batch Size

The table below shows sustained request throughput in requests per second at different effective batch sizes (concurrent in-flight requests).

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3 8B (INT4) | 0.24 | 0.85 | 1.52 | 2.60 | 3.80 | 4.50 |
| LLaMA 3 8B (FP16) | 0.17 | 0.58 | 0.98 | 1.55 | 2.10 | 2.40 |
| Mistral 7B (INT4) | 0.26 | 0.90 | 1.60 | 2.75 | 4.00 | 4.80 |
| Mistral 7B (FP16) | 0.18 | 0.62 | 1.05 | 1.65 | 2.25 | 2.55 |
| DeepSeek R1 Distill 7B (INT4) | 0.22 | 0.76 | 1.35 | 2.30 | 3.40 | 4.10 |
| Qwen 2.5 7B (INT4) | 0.24 | 0.82 | 1.48 | 2.50 | 3.70 | 4.40 |

Peak throughput on the RTX 3090 reaches 4.5-4.8 requests/sec with INT4 7B models at batch 64. That translates to roughly 270-290 requests per minute or over 380,000 requests per day. For the RTX 5090’s throughput numbers, which roughly double these figures, see the companion benchmark.

Throughput vs Latency Trade-Off

Throughput and latency move in opposite directions. At batch 1, each request completes in roughly 4 seconds end-to-end, but the GPU processes only 0.24 req/s. At batch 64, aggregate throughput rises to 4.5 req/s, but individual requests take 12-14 seconds end-to-end because they share GPU cycles with 63 other sequences.
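
This relationship follows from Little's law: average in-flight requests equal throughput times latency, so per-request latency is roughly batch size divided by aggregate requests per second. A quick check against the LLaMA 3 8B INT4 row reproduces the figures above:

```python
# Little's law sanity check: per-request latency ~= batch / aggregate req/s.
# Throughput numbers taken from the LLaMA 3 8B INT4 row in the table.
for batch, rps in [(1, 0.24), (8, 1.52), (64, 4.50)]:
    print(f"batch {batch:>2}: {rps:.2f} req/s -> ~{batch / rps:.1f}s per request")
# batch  1: 0.24 req/s -> ~4.2s per request
# batch  8: 1.52 req/s -> ~5.3s per request
# batch 64: 4.50 req/s -> ~14.2s per request
```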

The optimal operating point depends on your use case. Interactive chatbots need low latency, so batch 4-8 (0.85-1.52 req/s) provides a good balance. Batch processing jobs like document summarisation or data extraction can use batch 32-64 for maximum throughput with no latency concern. Our batch size impact analysis explores this trade-off in detail.
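
One way to operationalise this, sketched below with the LLaMA 3 8B INT4 numbers, is to pick the largest batch size whose estimated per-request latency stays under your latency target. The SLO values shown are illustrative, not recommendations.

```python
# Pick the largest batch size whose estimated per-request latency
# (batch / throughput, per Little's law) meets a latency SLO.
THROUGHPUT = {1: 0.24, 4: 0.85, 8: 1.52, 16: 2.60, 32: 3.80, 64: 4.50}  # req/s

def best_batch(latency_slo_s: float) -> int:
    """Largest batch size that keeps estimated latency under the SLO."""
    viable = [b for b, rps in THROUGHPUT.items() if b / rps <= latency_slo_s]
    return max(viable) if viable else 1

print(best_batch(6.0))   # interactive chat: batch 8 (~5.3s per request)
print(best_batch(15.0))  # offline batch jobs: batch 64 (~14.2s per request)
```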

Model Comparison

Mistral 7B consistently edges out LLaMA 3 8B by 5-8 percent in throughput across all batch sizes. Both models use grouped-query attention with eight KV heads, so the gap is mostly down to Mistral's smaller parameter count and vocabulary: fewer weights to stream from memory on each decode step and a smaller output projection. DeepSeek R1 Distill 7B runs slightly slower in our tests, but the difference is modest.

For a broader comparison including larger models, see our best GPU for LLM inference guide. If you are deciding between models for production deployment, also consider the DeepSeek concurrent throughput and Mistral 7B concurrent throughput benchmarks for model-specific scaling curves.

Cost per 1M Requests

At approximately £110/month for an RTX 3090 dedicated server, we can calculate the cost per million requests at peak batch throughput.

| Model | Peak req/s (Batch 64) | Requests/Month | Cost per 1M Requests |
| --- | --- | --- | --- |
| Mistral 7B INT4 | 4.80 | ~12.4M | ~£8.85 |
| LLaMA 3 8B INT4 | 4.50 | ~11.6M | ~£9.45 |
| Qwen 2.5 7B INT4 | 4.40 | ~11.4M | ~£9.65 |
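
The arithmetic is easy to reproduce. The sketch below assumes a 30-day month of sustained peak throughput; small differences from the table come down to rounding.

```python
# Reproduce the cost table: monthly requests at sustained peak throughput,
# then the £110/month server cost spread across them (30-day month assumed).
MONTHLY_COST_GBP = 110.0
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

for model, rps in [("Mistral 7B INT4", 4.80),
                   ("LLaMA 3 8B INT4", 4.50),
                   ("Qwen 2.5 7B INT4", 4.40)]:
    monthly = rps * SECONDS_PER_MONTH
    print(f"{model}: ~{monthly / 1e6:.1f}M req/month, "
          f"~£{MONTHLY_COST_GBP / (monthly / 1e6):.2f} per 1M requests")
```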

Under £10 per million requests is extremely competitive compared to API pricing from hosted providers. Use the LLM cost calculator to model costs for your specific workload, and check the cost per million tokens benchmark for token-level pricing comparisons.

Conclusion

The RTX 3090 delivers 4.5-4.8 requests per second at maximum batch throughput with INT4 7B models — enough for over 11 million requests per month on a single card. For batch-heavy workloads, it remains one of the most cost-efficient GPUs available. To compare value across cards, see the RTX 3090 vs RTX 5080 throughput per dollar analysis, or browse all results in the Benchmarks category.
