
RTX 3090: Maximum LLM Throughput (Requests/sec)

Maximum LLM request throughput benchmarks for the RTX 3090 — requests per second at batch sizes from 1 to 64 across popular 7-8B models with vLLM continuous batching.

RTX 3090 Throughput Overview

If you are building an API on a dedicated GPU server, the metric that matters most is requests per second — how many completions your single RTX 3090 can process in a given time window. Higher batch sizes increase aggregate throughput but add per-request latency. We measured both dimensions to give you the data you need for capacity planning.

All tests ran on GigaGPU bare-metal hardware using vLLM with continuous batching. Each request contained a 128-token prompt with a 256-token output. We measured sustained throughput (requests completed per second) over a 60-second window at each concurrency level. For per-token speed, see the tokens per second benchmark.
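
For readers who want to reproduce something close to this setup, the sketch below shows how a single measurement could be taken with vLLM's offline API. The model identifier, prompt construction, and single generate() call are simplifying assumptions; the actual harness sustains load against a server over a 60-second window.

```python
# Minimal throughput sketch with vLLM's offline API. The model id and
# sampling settings are illustrative, not the exact benchmark harness.
import time

from vllm import LLM, SamplingParams

BATCH = 64                    # concurrent in-flight requests
PROMPT = "benchmark " * 128   # stand-in for a 128-token prompt
params = SamplingParams(max_tokens=256, ignore_eos=True)  # force full 256-token outputs

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # swap in an AWQ/GPTQ checkpoint for the INT4 rows

start = time.perf_counter()
llm.generate([PROMPT] * BATCH, params)  # continuous batching schedules all 64 requests together
elapsed = time.perf_counter() - start

print(f"{BATCH / elapsed:.2f} requests/sec at batch {BATCH}")
```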

Requests/sec by Batch Size

The table below shows sustained request throughput in requests per second at different effective batch sizes (concurrent in-flight requests).

| Model (Quantisation) | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3 8B (INT4) | 0.24 | 0.85 | 1.52 | 2.60 | 3.80 | 4.50 |
| LLaMA 3 8B (FP16) | 0.17 | 0.58 | 0.98 | 1.55 | 2.10 | 2.40 |
| Mistral 7B (INT4) | 0.26 | 0.90 | 1.60 | 2.75 | 4.00 | 4.80 |
| Mistral 7B (FP16) | 0.18 | 0.62 | 1.05 | 1.65 | 2.25 | 2.55 |
| DeepSeek R1 Distill 7B (INT4) | 0.22 | 0.76 | 1.35 | 2.30 | 3.40 | 4.10 |
| Qwen 2.5 7B (INT4) | 0.24 | 0.82 | 1.48 | 2.50 | 3.70 | 4.40 |

Peak throughput on the RTX 3090 reaches 4.5-4.8 requests/sec with INT4 7B models at batch 64. That translates to roughly 270-290 requests per minute or over 380,000 requests per day. For the RTX 5090’s throughput numbers, which roughly double these figures, see the companion benchmark.

Throughput vs Latency Trade-Off

Throughput and latency move in opposite directions. At batch 1, each request completes in roughly 4 seconds end-to-end, but the GPU processes only 0.24 req/s. At batch 64, aggregate throughput rises to 4.5 req/s, but individual requests take 12-14 seconds end-to-end because they share GPU cycles with 63 other sequences.
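
This relationship follows from Little's law: average in-flight requests equal throughput times latency, so per-request latency is roughly batch size divided by aggregate requests per second. A quick check against the LLaMA 3 8B INT4 row reproduces the figures above:

```python
# Little's law sanity check: per-request latency ~= batch / aggregate req/s.
# Throughput numbers taken from the LLaMA 3 8B INT4 row in the table.
for batch, rps in [(1, 0.24), (8, 1.52), (64, 4.50)]:
    print(f"batch {batch:>2}: {rps:.2f} req/s -> ~{batch / rps:.1f}s per request")
# batch  1: 0.24 req/s -> ~4.2s per request
# batch  8: 1.52 req/s -> ~5.3s per request
# batch 64: 4.50 req/s -> ~14.2s per request
```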

The optimal operating point depends on your use case. Interactive chatbots need low latency, so batch 4-8 (0.85-1.52 req/s) provides a good balance. Batch processing jobs like document summarisation or data extraction can use batch 32-64 for maximum throughput with no latency concern. Our batch size impact analysis explores this trade-off in detail.
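
One way to operationalise this, sketched below with the LLaMA 3 8B INT4 numbers, is to pick the largest batch size whose estimated per-request latency stays under your latency target. The SLO values shown are illustrative, not recommendations.

```python
# Pick the largest batch size whose estimated per-request latency
# (batch / throughput, per Little's law) meets a latency SLO.
THROUGHPUT = {1: 0.24, 4: 0.85, 8: 1.52, 16: 2.60, 32: 3.80, 64: 4.50}  # req/s

def best_batch(latency_slo_s: float) -> int:
    """Largest batch size that keeps estimated latency under the SLO."""
    viable = [b for b, rps in THROUGHPUT.items() if b / rps <= latency_slo_s]
    return max(viable) if viable else 1

print(best_batch(6.0))   # interactive chat: batch 8 (~5.3s per request)
print(best_batch(15.0))  # offline batch jobs: batch 64 (~14.2s per request)
```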

Model Comparison

Mistral 7B consistently edges out LLaMA 3 8B by 5-8 percent in throughput across all batch sizes. Both models use grouped-query attention with eight KV heads, so the gap is mostly down to Mistral's smaller parameter count and vocabulary: fewer weights to stream from memory on each decode step and a smaller output projection. DeepSeek R1 Distill 7B runs slightly slower in our tests, but the difference is modest.

For a broader comparison including larger models, see our best GPU for LLM inference guide. If you are deciding between models for production deployment, also consider the DeepSeek concurrent throughput and Mistral 7B concurrent throughput benchmarks for model-specific scaling curves.

Cost per 1M Requests

At approximately £110/month for an RTX 3090 dedicated server, we can calculate the cost per million requests at peak batch throughput.

| Model | Peak req/s (Batch 64) | Requests/Month | Cost per 1M Requests |
| --- | --- | --- | --- |
| Mistral 7B INT4 | 4.80 | ~12.4M | ~£8.85 |
| LLaMA 3 8B INT4 | 4.50 | ~11.6M | ~£9.45 |
| Qwen 2.5 7B INT4 | 4.40 | ~11.4M | ~£9.65 |
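
The arithmetic is easy to reproduce. The sketch below assumes a 30-day month of sustained peak throughput; small differences from the table come down to rounding.

```python
# Reproduce the cost table: monthly requests at sustained peak throughput,
# then the £110/month server cost spread across them (30-day month assumed).
MONTHLY_COST_GBP = 110.0
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

for model, rps in [("Mistral 7B INT4", 4.80),
                   ("LLaMA 3 8B INT4", 4.50),
                   ("Qwen 2.5 7B INT4", 4.40)]:
    monthly = rps * SECONDS_PER_MONTH
    print(f"{model}: ~{monthly / 1e6:.1f}M req/month, "
          f"~£{MONTHLY_COST_GBP / (monthly / 1e6):.2f} per 1M requests")
```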

Under £10 per million requests is extremely competitive compared to API pricing from hosted providers. Use the LLM cost calculator to model costs for your specific workload, and check the cost per million tokens benchmark for token-level pricing comparisons.

Conclusion

The RTX 3090 delivers 4.5-4.8 requests per second at maximum batch throughput with INT4 7B models — enough for over 11 million requests per month on a single card. For batch-heavy workloads, it remains one of the most cost-efficient GPUs available. To compare value across cards, see the RTX 3090 vs RTX 5080 throughput per dollar analysis, or browse all results in the Benchmarks category.
