
Token/sec Benchmark Update: April 2026

Updated April 2026 tokens-per-second benchmarks for open-source LLMs across NVIDIA GPUs. Covers LLaMA 3.1, DeepSeek V3, Qwen 2.5, and Mistral Large with vLLM and Ollama throughput data.

Benchmark Methodology

This April 2026 benchmark update measures tokens-per-second throughput under realistic production conditions. All tests were run on GigaGPU dedicated servers using vLLM 0.8.x with continuous batching enabled. We tested at 1, 10, and 50 concurrent users with a standardised prompt length of 512 tokens and generation length of 256 tokens.

For the interactive benchmark tool with additional configurations, visit the tokens per second benchmark page. This article highlights the most important data points from the April 2026 refresh.
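
For readers who want to reproduce a run of this shape, a minimal throughput probe against a vLLM OpenAI-compatible endpoint could look like the sketch below. The endpoint URL, model name, and short prompt are placeholders, not the exact harness behind these numbers.

```python
# Minimal throughput probe against a vLLM OpenAI-compatible endpoint.
# Endpoint, model name, and prompt are placeholders; adjust to your deployment.
import asyncio
import time

from openai import AsyncOpenAI


async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model
        prompt="Summarise the benefits of continuous batching.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    for users in (1, 10, 50):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request(client) for _ in range(users)))
        elapsed = time.perf_counter() - start
        print(f"{users} users: {sum(tokens) / elapsed:.1f} tok/s total")


if __name__ == "__main__":
    asyncio.run(main())
```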

Single GPU Results

Total throughput in tokens/sec at 10 concurrent users via vLLM:

| Model | Quant | RTX 3090 | RTX 4090 | RTX 5090 | RTX 6000 Pro |
| --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B | FP16 | 125 | 195 | 248 | 165 |
| LLaMA 3.1 70B | Q4 | 35 | 62 | 88 | 48 |
| Qwen 2.5 72B | Q4 | 32 | 58 | 82 | 45 |
| Mistral Large 2 | Q4 | N/A* | 42 | 65 | 38 |
| Gemma 2 27B | FP16 | 68 | 95 | 128 | 82 |
| Phi-3 14B | FP16 | 95 | 142 | 185 | 120 |

*Mistral Large 2 at Q4 requires 36 GB VRAM, exceeding the RTX 3090’s 24 GB.
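
A quick way to sanity-check whether a model fits a given card is the usual weights-only estimate sketched below. It ignores the KV cache and runtime overhead, which add several GB in practice, and the parameter counts and effective bit-widths are approximations.

```python
# Rough weights-only VRAM estimate; KV cache and runtime overhead are extra.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    # params * bits / 8 bits-per-byte, with the 1e9 factors cancelling out
    return params_billions * bits_per_weight / 8


print(weight_vram_gb(8, 16))    # LLaMA 3.1 8B at FP16 -> ~16 GB of weights
print(weight_vram_gb(8, 4.5))   # ~4-bit with group scales -> ~4.5 GB
print(weight_vram_gb(70, 4.5))  # 70B at ~4-bit -> ~39 GB, beyond a 24 GB card
```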

Multi-GPU Results

Tensor-parallel inference throughput in tokens/sec across multi-GPU setups:

| Model | Quant | 2x RTX 5090 | 4x RTX 5090 | RTX 6000 Pro 96 GB |
| --- | --- | --- | --- | --- |
| LLaMA 3.1 70B | FP16 | 85 | 145 | 95 |
| LLaMA 3.1 70B | Q4 | 105 | 180 | 115 |
| DeepSeek V3 (active) | FP16 | 72 | 130 | 88 |
| Qwen 2.5 72B | FP16 | 78 | 135 | 90 |

Dual RTX 5090 setups deliver strong throughput for 70B models at a fraction of RTX 6000 Pro pricing. See the best GPU for LLM inference guide for cost-effectiveness analysis.
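
As a sketch of how a tensor-parallel run is launched with vLLM's offline LLM API; the checkpoint name and sampling settings below are illustrative, not the exact benchmark configuration.

```python
# Sketch: splitting a 70B model across two GPUs with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint
    tensor_parallel_size=2,                     # shard weights across 2x RTX 5090
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```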

vLLM vs Ollama Throughput

At single-user workloads, Ollama approaches vLLM throughput. Under concurrent load, vLLM pulls ahead dramatically thanks to continuous batching:

| Model / GPU | Engine | 1 User | 10 Users | 50 Users |
| --- | --- | --- | --- | --- |
| LLaMA 70B Q4 / RTX 5090 | vLLM | 38 tok/s | 62 tok/s | 58 tok/s |
| LLaMA 70B Q4 / RTX 5090 | Ollama | 32 tok/s | 32 tok/s | 32 tok/s |

Ollama processes one request at a time, so throughput stays flat regardless of concurrent users. vLLM batches requests to maximise GPU utilisation. For a detailed comparison, see the vLLM vs Ollama throughput analysis.
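
Using the table's own numbers, the practical effect on a batch of waiting users is easy to see:

```python
# Illustrative arithmetic from the table above: time to finish ten
# simultaneous 256-token completions under each serving model.
USERS, GEN_TOKENS = 10, 256

ollama_seconds = USERS * GEN_TOKENS / 32  # sequential queue at a flat 32 tok/s
vllm_seconds = USERS * GEN_TOKENS / 62    # continuous batching at 62 tok/s aggregate

print(f"Ollama: ~{ollama_seconds:.0f} s to drain the queue")  # ~80 s
print(f"vLLM:   ~{vllm_seconds:.0f} s for the same batch")    # ~41 s
```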

Quantisation Impact on Speed

Quantisation reduces VRAM usage and often increases throughput, because fewer bytes per weight have to travel over the memory bus for each generated token. Testing LLaMA 3.1 70B on an RTX 5090:

| Precision | VRAM Used | Tokens/sec | Quality (MMLU) |
| --- | --- | --- | --- |
| FP16 | Requires 2 GPUs | 85 (2x 5090) | 82.0 |
| Q8 (GPTQ) | ~72 GB (2 GPUs) | 92 (2x 5090) | 81.5 |
| Q4 (AWQ) | ~38 GB | 62 (1x 5090) | 80.8 |

Quality loss from 4-bit quantisation is under 1.5% on MMLU while enabling single-GPU deployment. For most production workloads, Q4 is the sweet spot. See the quantised vs full precision quality analysis for detailed measurements.
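
Loading a 4-bit checkpoint for single-GPU serving is a small change in vLLM. The repository name below is an example AWQ build, not necessarily the one used in these tests.

```python
# Sketch: serving a 4-bit AWQ checkpoint on a single GPU with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # example AWQ repo
    quantization="awq",
    max_model_len=4096,  # cap context length to leave headroom for the KV cache
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```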

Key Takeaways

The RTX 5090 delivers 40-45% higher throughput than the RTX 4090 for LLM inference, making it the new single-GPU leader for models under 30 GB VRAM. For 70B models, dual RTX 5090s remain the best value configuration. vLLM is essential for any workload serving more than one concurrent user.

Use the cost per million tokens calculator to convert these throughput numbers into cost projections. For GPU selection guidance, review the cheapest GPU for AI inference analysis or the comprehensive GPU comparisons page.
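
As a rough illustration of that conversion, with a placeholder hourly price rather than a GigaGPU rate:

```python
# Hypothetical conversion of throughput into cost per million output tokens.
throughput_tok_s = 62  # LLaMA 70B Q4 on one RTX 5090 at 10 users (table above)
hourly_cost = 1.10     # placeholder server price per hour, in your currency

tokens_per_hour = throughput_tok_s * 3600
print(f"{hourly_cost / tokens_per_hour * 1e6:.2f} per million tokens")  # ~4.93
```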
