
Token/sec Benchmark Update: April 2026

Updated April 2026 tokens-per-second benchmarks for open-source LLMs across NVIDIA GPUs. Covers LLaMA 3.1, DeepSeek V3, Qwen 2.5, and Mistral Large with vLLM and Ollama throughput data.

Benchmark Methodology

This April 2026 benchmark update measures tokens-per-second throughput under realistic production conditions. All tests were run on GigaGPU dedicated servers using vLLM 0.8.x with continuous batching enabled. We tested at 1, 10, and 50 concurrent users with a standardised prompt length of 512 tokens and generation length of 256 tokens.

For the interactive benchmark tool with additional configurations, visit the tokens per second benchmark page. This article highlights the most important data points from the April 2026 refresh.
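
For readers who want to reproduce a run of this shape, a minimal throughput probe against a vLLM OpenAI-compatible endpoint could look like the sketch below. The endpoint URL, model name, and short prompt are placeholders, not the exact harness behind these numbers.

```python
# Minimal throughput probe against a vLLM OpenAI-compatible endpoint.
# Endpoint, model name, and prompt are placeholders; adjust to your deployment.
import asyncio
import time

from openai import AsyncOpenAI


async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model
        prompt="Summarise the benefits of continuous batching.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    for users in (1, 10, 50):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request(client) for _ in range(users)))
        elapsed = time.perf_counter() - start
        print(f"{users} users: {sum(tokens) / elapsed:.1f} tok/s total")


if __name__ == "__main__":
    asyncio.run(main())
```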

Single GPU Results

Total throughput in tokens/sec at 10 concurrent users via vLLM:

| Model | Quant | RTX 3090 | RTX 4090 | RTX 5090 | RTX 6000 Pro |
| --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B | FP16 | 125 | 195 | 248 | 165 |
| LLaMA 3.1 70B | Q4 | 35 | 62 | 88 | 48 |
| Qwen 2.5 72B | Q4 | 32 | 58 | 82 | 45 |
| Mistral Large 2 | Q4 | N/A* | 42 | 65 | 38 |
| Gemma 2 27B | FP16 | 68 | 95 | 128 | 82 |
| Phi-3 14B | FP16 | 95 | 142 | 185 | 120 |

*Mistral Large 2 at Q4 requires 36 GB VRAM, exceeding the RTX 3090’s 24 GB.
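
A quick way to sanity-check whether a model fits a given card is the usual weights-only estimate sketched below. It ignores the KV cache and runtime overhead, which add several GB in practice, and the parameter counts and effective bit-widths are approximations.

```python
# Rough weights-only VRAM estimate; KV cache and runtime overhead are extra.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    # params * bits / 8 bits-per-byte, with the 1e9 factors cancelling out
    return params_billions * bits_per_weight / 8


print(weight_vram_gb(8, 16))    # LLaMA 3.1 8B at FP16 -> ~16 GB of weights
print(weight_vram_gb(8, 4.5))   # ~4-bit with group scales -> ~4.5 GB
print(weight_vram_gb(70, 4.5))  # 70B at ~4-bit -> ~39 GB, beyond a 24 GB card
```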

Multi-GPU Results

Tensor-parallel inference throughput in tokens/sec across multi-GPU setups:

| Model | Quant | 2x RTX 5090 | 4x RTX 5090 | RTX 6000 Pro 96 GB |
| --- | --- | --- | --- | --- |
| LLaMA 3.1 70B | FP16 | 85 | 145 | 95 |
| LLaMA 3.1 70B | Q4 | 105 | 180 | 115 |
| DeepSeek V3 (active) | FP16 | 72 | 130 | 88 |
| Qwen 2.5 72B | FP16 | 78 | 135 | 90 |

Dual RTX 5090 setups deliver strong throughput for 70B models at a fraction of RTX 6000 Pro pricing. See the best GPU for LLM inference guide for cost-effectiveness analysis.
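
As a sketch of how a tensor-parallel run is launched with vLLM's offline LLM API; the checkpoint name and sampling settings below are illustrative, not the exact benchmark configuration.

```python
# Sketch: splitting a 70B model across two GPUs with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint
    tensor_parallel_size=2,                     # shard weights across 2x RTX 5090
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```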

vLLM vs Ollama Throughput

At single-user workloads, Ollama approaches vLLM throughput. Under concurrent load, vLLM pulls ahead dramatically thanks to continuous batching:

| Model / GPU | Engine | 1 User | 10 Users | 50 Users |
| --- | --- | --- | --- | --- |
| LLaMA 70B Q4 / RTX 5090 | vLLM | 38 tok/s | 62 tok/s | 58 tok/s |
| LLaMA 70B Q4 / RTX 5090 | Ollama | 32 tok/s | 32 tok/s | 32 tok/s |

Ollama processes one request at a time, so throughput stays flat regardless of concurrent users. vLLM batches requests to maximise GPU utilisation. For a detailed comparison, see the vLLM vs Ollama throughput analysis.
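
Using the table's own numbers, the practical effect on a batch of waiting users is easy to see:

```python
# Illustrative arithmetic from the table above: time to finish ten
# simultaneous 256-token completions under each serving model.
USERS, GEN_TOKENS = 10, 256

ollama_seconds = USERS * GEN_TOKENS / 32  # sequential queue at a flat 32 tok/s
vllm_seconds = USERS * GEN_TOKENS / 62    # continuous batching at 62 tok/s aggregate

print(f"Ollama: ~{ollama_seconds:.0f} s to drain the queue")  # ~80 s
print(f"vLLM:   ~{vllm_seconds:.0f} s for the same batch")    # ~41 s
```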

Quantisation Impact on Speed

Quantisation reduces VRAM usage and often increases throughput, because fewer bytes per weight have to travel over the memory bus for each generated token. Testing LLaMA 3.1 70B on an RTX 5090:

| Precision | VRAM Used | Tokens/sec | Quality (MMLU) |
| --- | --- | --- | --- |
| FP16 | Requires 2 GPUs | 85 (2x 5090) | 82.0 |
| Q8 (GPTQ) | ~72 GB (2 GPUs) | 92 (2x 5090) | 81.5 |
| Q4 (AWQ) | ~38 GB | 62 (1x 5090) | 80.8 |

Quality loss from 4-bit quantisation is under 1.5% on MMLU while enabling single-GPU deployment. For most production workloads, Q4 is the sweet spot. See the quantised vs full precision quality analysis for detailed measurements.
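
Loading a 4-bit checkpoint for single-GPU serving is a small change in vLLM. The repository name below is an example AWQ build, not necessarily the one used in these tests.

```python
# Sketch: serving a 4-bit AWQ checkpoint on a single GPU with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # example AWQ repo
    quantization="awq",
    max_model_len=4096,  # cap context length to leave headroom for the KV cache
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```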

Key Takeaways

The RTX 5090 delivers 40-45% higher throughput than the RTX 4090 for LLM inference, making it the new single-GPU leader for models under 30 GB VRAM. For 70B models, dual RTX 5090s remain the best value configuration. vLLM is essential for any workload serving more than one concurrent user.

Use the cost per million tokens calculator to convert these throughput numbers into cost projections. For GPU selection guidance, review the cheapest GPU for AI inference analysis or the comprehensive GPU comparisons page.
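
As a rough illustration of that conversion, with a placeholder hourly price rather than a GigaGPU rate:

```python
# Hypothetical conversion of throughput into cost per million output tokens.
throughput_tok_s = 62  # LLaMA 70B Q4 on one RTX 5090 at 10 users (table above)
hourly_cost = 1.10     # placeholder server price per hour, in your currency

tokens_per_hour = throughput_tok_s * 3600
print(f"{hourly_cost / tokens_per_hour * 1e6:.2f} per million tokens")  # ~4.93
```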
