Table of Contents
Benchmark Methodology
This April 2026 benchmark update measures tokens-per-second throughput under realistic production conditions. All tests were run on GigaGPU dedicated servers using vLLM 0.8.x with continuous batching enabled. We tested at 1, 10, and 50 concurrent users with a standardised prompt length of 512 tokens and generation length of 256 tokens.
For the interactive benchmark tool with additional configurations, visit the tokens per second benchmark page. This article highlights the most important data points from the April 2026 refresh.
Single GPU Results
Total throughput in tokens/sec at 10 concurrent users via vLLM:
| Model | Quant | RTX 3090 | RTX 5090 | RTX 5090 | RTX 6000 Pro |
|---|---|---|---|---|---|
| LLaMA 3.1 8B | FP16 | 125 | 195 | 248 | 165 |
| LLaMA 3.1 70B | Q4 | 35 | 62 | 88 | 48 |
| Qwen 2.5 72B | Q4 | 32 | 58 | 82 | 45 |
| Mistral Large 2 | Q4 | N/A* | 42 | 65 | 38 |
| Gemma 2 27B | FP16 | 68 | 95 | 128 | 82 |
| Phi-3 14B | FP16 | 95 | 142 | 185 | 120 |
*Mistral Large 2 at Q4 requires 36 GB VRAM, exceeding the RTX 3090’s 24 GB.
Multi-GPU Results
Tensor parallel inference across multiple GPUs using multi-GPU setups:
| Model | Quant | 2x RTX 5090 | 4x RTX 5090 | RTX 6000 Pro 96 GB |
|---|---|---|---|---|
| LLaMA 3.1 70B | FP16 | 85 | 145 | 95 |
| LLaMA 3.1 70B | Q4 | 105 | 180 | 115 |
| DeepSeek V3 (active) | FP16 | 72 | 130 | 88 |
| Qwen 2.5 72B | FP16 | 78 | 135 | 90 |
Dual RTX 5090 setups deliver strong throughput for 70B models at a fraction of RTX 6000 Pro pricing. See the best GPU for LLM inference guide for cost-effectiveness analysis.
vLLM vs Ollama Throughput
At single-user workloads, Ollama approaches vLLM throughput. At concurrency, vLLM pulls ahead dramatically due to continuous batching:
| Model / GPU | Engine | 1 User | 10 Users | 50 Users |
|---|---|---|---|---|
| LLaMA 70B Q4 / RTX 5090 | vLLM | 38 tok/s | 62 tok/s | 58 tok/s |
| LLaMA 70B Q4 / RTX 5090 | Ollama | 32 tok/s | 32 tok/s | 32 tok/s |
Ollama processes one request at a time, so throughput stays flat regardless of concurrent users. vLLM batches requests to maximise GPU utilisation. For a detailed comparison, see the vLLM vs Ollama throughput analysis.
Quantisation Impact on Speed
Quantisation reduces VRAM usage and often increases throughput because smaller weights move faster through memory bandwidth. Testing LLaMA 3.1 70B on an RTX 5090:
| Precision | VRAM Used | Tokens/sec | Quality (MMLU) |
|---|---|---|---|
| FP16 | Requires 2 GPUs | 85 (2x 5090) | 82.0 |
| Q8 (GPTQ) | ~72 GB (2 GPUs) | 92 (2x 5090) | 81.5 |
| Q4 (AWQ) | ~38 GB | 62 (1x 5090) | 80.8 |
Quality loss from 4-bit quantisation is under 1.5% on MMLU while enabling single-GPU deployment. For most production workloads, Q4 is the sweet spot. See the quantised vs full precision quality analysis for detailed measurements.
Test These Benchmarks on Your Own Server
Get a dedicated GPU and run your specific model at your concurrency level. See real throughput numbers before committing.
Browse GPU ServersKey Takeaways
The RTX 5090 delivers 40-45% higher throughput than the RTX 5090 for LLM inference, making it the new single-GPU leader for models under 30 GB VRAM. For 70B models, dual RTX 5090s remain the best value configuration. vLLM is essential for any workload serving more than one concurrent user.
Use the cost per million tokens calculator to convert these throughput numbers into cost projections. For GPU selection guidance, review the cheapest GPU for AI inference analysis or the comprehensive GPU comparisons page.