Whisper Concurrency Overview
Audio transcription at scale — call centres, meeting recorders, podcast processing — requires knowing how many simultaneous audio streams your dedicated GPU server can process. A stream is “real-time” when the GPU transcribes audio faster than it arrives, measured by the real-time factor (RTF), the ratio of processing time to audio duration: an RTF below 1.0 means the GPU keeps up. We tested how many concurrent Whisper streams each GPU supports at RTF ≤ 1.0.
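The RTF definition above can be captured in a couple of lines. This is an illustrative helper, not code from the benchmark harness; the example numbers are the RTX 3090 figures reported later in this article.

```python
# Real-time factor (RTF): processing time divided by audio duration.
# Illustrative helper, not part of the benchmark harness.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the GPU transcribes faster than audio arrives."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def is_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    return real_time_factor(processing_seconds, audio_seconds) <= 1.0

# Example from this article: a 30 s clip processed in 5.8 s on an RTX 3090.
print(f"RTF = {real_time_factor(5.8, 30.0):.2f}")  # RTF = 0.19
```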
All tests ran on GigaGPU bare-metal servers using Faster-Whisper (CTranslate2 backend) for optimal throughput. Audio clips were 30 seconds of English speech at 16 kHz. Streams were added incrementally until the slowest stream exceeded RTF 1.0. For single-stream speed benchmarks, see the tokens per second benchmark hub.
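The ramp procedure — add streams until the slowest one exceeds RTF 1.0 — can be sketched as below. The per-stream RTF model is a toy linear fit to the RTX 3090 / Large v3 figures in this article, standing in for real GPU measurements; the actual harness times Faster-Whisper runs on hardware.

```python
# Sketch of the ramp procedure: add streams until the slowest stream
# exceeds RTF 1.0. simulated_stream_rtf() is a toy linear model fitted
# to this article's RTX 3090 / Large v3 numbers, not a real measurement.

def simulated_stream_rtf(n_streams: int, base_rtf: float = 0.19) -> float:
    # Assumes per-stream RTF grows roughly linearly with contention,
    # which matches the RTX 3090 latency table later in this article.
    return base_rtf * n_streams

def max_real_time_streams(base_rtf: float = 0.19, limit: float = 1.0) -> int:
    n = 0
    while simulated_stream_rtf(n + 1, base_rtf) <= limit:
        n += 1
    return n

print(max_real_time_streams())  # 5 streams under the RTX 3090 toy model
```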
Real-Time Streams by GPU
Maximum concurrent streams where every stream maintains RTF ≤ 1.0 (real-time processing).
| GPU | Whisper Large v3 | Whisper Medium | Whisper Small |
|---|---|---|---|
| RTX 3050 (6 GB) | 1 | 2 | 5 |
| RTX 4060 (8 GB) | 2 | 4 | 10 |
| RTX 4060 Ti (16 GB) | 3 | 6 | 14 |
| RTX 3090 (24 GB) | 5 | 10 | 22 |
| RTX 5080 (16 GB) | 7 | 14 | 30 |
| RTX 5090 (32 GB) | 11 | 20 | 45 |
The RTX 5090 handles 11 concurrent Large v3 streams or 45 Small streams in real time. The RTX 3090 manages 5 Large v3 streams — suitable for a small call centre or meeting recording service. The RTX 4060 at 2 Large v3 streams is limited to low-volume use.
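For capacity-planning scripts, the table above can be embedded as a lookup. The figures are copied directly from this article's measurements; the model keys mirror common Whisper model names, and the helper itself is illustrative.

```python
# Maximum real-time streams (RTF <= 1.0) per GPU, from this article's
# benchmark table. Model keys follow common Whisper naming.
MAX_STREAMS = {
    "RTX 3050":    {"large-v3": 1,  "medium": 2,  "small": 5},
    "RTX 4060":    {"large-v3": 2,  "medium": 4,  "small": 10},
    "RTX 4060 Ti": {"large-v3": 3,  "medium": 6,  "small": 14},
    "RTX 3090":    {"large-v3": 5,  "medium": 10, "small": 22},
    "RTX 5080":    {"large-v3": 7,  "medium": 14, "small": 30},
    "RTX 5090":    {"large-v3": 11, "medium": 20, "small": 45},
}

def max_streams(gpu: str, model: str) -> int:
    return MAX_STREAMS[gpu][model]

print(max_streams("RTX 3090", "large-v3"))  # 5
```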
Large v3 vs Medium vs Small
Whisper model size directly determines how many streams fit on a GPU. Large v3 (1.5B parameters, ~3 GB VRAM) delivers the highest transcription quality but processes audio at roughly half the speed of Medium (769M parameters, ~1.5 GB). Small (244M parameters, ~0.5 GB) is 4x faster than Large v3 with a moderate accuracy trade-off.
For most production transcription workloads, Whisper Medium offers the best balance — accuracy is within 2-3 percent of Large v3 on English while supporting double the concurrent streams. Use Large v3 only for multilingual or noisy audio where accuracy is critical. The voice agent latency benchmark shows how model choice affects end-to-end pipeline latency.
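The sizing guidance above reduces to a small decision rule. The table values come from this article (VRAM figures are approximate fp16 footprints; `relative_speed` is normalised to Large v3), and `pick_model` is a hypothetical helper encoding the heuristic, not a library API.

```python
# Model sizes and footprints from this article; VRAM is an approximate
# fp16 figure and relative_speed is normalised to Large v3 (= 1.0).
WHISPER_MODELS = {
    "large-v3": {"params_m": 1500, "vram_gb": 3.0, "relative_speed": 1.0},
    "medium":   {"params_m": 769,  "vram_gb": 1.5, "relative_speed": 2.0},
    "small":    {"params_m": 244,  "vram_gb": 0.5, "relative_speed": 4.0},
}

def pick_model(multilingual: bool, noisy_audio: bool) -> str:
    # Heuristic from this article: reserve Large v3 for multilingual or
    # noisy audio; Medium is the default for English production workloads.
    return "large-v3" if (multilingual or noisy_audio) else "medium"

print(pick_model(multilingual=False, noisy_audio=False))  # medium
```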
Per-Stream Latency Impact
As you add streams, per-stream processing latency increases even though all streams remain real-time. On the RTX 3090 with Whisper Large v3, a single stream processes 30 seconds of audio in 5.8 seconds (RTF 0.19). At 5 concurrent streams, the same 30-second clip takes 28 seconds (RTF 0.93) — still real-time but with much less margin.
| Streams (RTX 3090, Large v3) | RTF per Stream | Processing Time (30s clip) |
|---|---|---|
| 1 | 0.19 | 5.8 s |
| 2 | 0.35 | 10.5 s |
| 3 | 0.52 | 15.6 s |
| 5 | 0.93 | 27.9 s |
| 6 | 1.12 | 33.6 s (not real-time) |
Operating at 80 percent of the maximum stream count gives you headroom for audio spikes and prevents occasional timeouts. For the RTX 3090 with Large v3, that means targeting 4 streams rather than 5.
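The 80 percent rule is simple arithmetic; a minimal sketch, with the `headroom` default taken from this article's recommendation:

```python
import math

def recommended_streams(measured_max: int, headroom: float = 0.8) -> int:
    # Target 80% of measured capacity to absorb audio spikes
    # and avoid occasional timeouts.
    return max(1, math.floor(measured_max * headroom))

print(recommended_streams(5))   # 4 (RTX 3090, Large v3)
print(recommended_streams(11))  # 8 (RTX 5090, Large v3)
```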
Scaling Beyond One GPU
Whisper streams are independent and embarrassingly parallel, making horizontal scaling straightforward. Two RTX 3090 servers behind a load balancer handle 10 concurrent Large v3 streams. For large-scale transcription services, this is more cost-effective than a single RTX 5090: two 3090 servers cost roughly the same as one 5090 server, deliver nearly the same capacity (10 streams vs 11), and add hardware redundancy. See the 1 GPU vs 2 GPU scaling guide for details.
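Because streams are independent, the load balancer can be as simple as round-robin dispatch. A minimal sketch — the server names and `assign_streams` helper are illustrative, not a real API:

```python
from itertools import cycle

# Minimal round-robin dispatch across identical Whisper servers.
# Server names and assign_streams() are illustrative, not a real API.

def assign_streams(n_streams: int, servers: list[str]) -> dict[str, int]:
    counts = {s: 0 for s in servers}
    for server, _ in zip(cycle(servers), range(n_streams)):
        counts[server] += 1
    return counts

# Two RTX 3090 servers sharing 10 Large v3 streams: 5 each,
# matching the per-GPU real-time limit from this article.
print(assign_streams(10, ["gpu-a", "gpu-b"]))  # {'gpu-a': 5, 'gpu-b': 5}
```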
For batch transcription (not real-time), you can queue audio files and process them as fast as the GPU allows. A single RTX 3090 processes roughly 160 hours of audio per day with Large v3 at full utilisation. Use multi-GPU clusters to scale linearly for high-volume batch processing.
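The daily-throughput figure follows from the RTF definition: at full utilisation, one GPU-day transcribes 24 h divided by the effective RTF. The 0.15 value below is back-derived from this article's 160 h/day figure, not an independent measurement.

```python
def audio_hours_per_day(effective_rtf: float) -> float:
    # At full utilisation, one GPU-day transcribes 24 / RTF hours of audio.
    return 24.0 / effective_rtf

# Effective batch RTF of ~0.15 is back-derived from this article's
# 160 h/day figure for an RTX 3090 running Large v3 at full utilisation.
print(round(audio_hours_per_day(0.15)))  # 160
```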
Conclusion
Concurrent Whisper stream capacity ranges from 1-2 (budget GPUs with Large v3) to 45 (RTX 5090 with Whisper Small). For production transcription services, target 80 percent of the maximum stream count for reliability. The RTX 3090 at 5 Large v3 streams or 10 Medium streams is the cost-effective choice for most deployments. Browse additional audio and speech benchmarks in the Benchmarks category or explore all GPU options at GigaGPU.