Whisper Concurrency Overview
Audio transcription at scale — call centres, meeting recorders, podcast processing — requires knowing how many simultaneous audio streams your dedicated GPU server can process. A stream is “real-time” when the GPU transcribes audio faster than it arrives, measured by the real-time factor (RTF), the ratio of processing time to audio duration: an RTF below 1.0 means the GPU keeps up. We tested how many concurrent Whisper streams each GPU supports at RTF ≤ 1.0.
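The RTF definition above can be captured in a couple of lines. This is an illustrative helper, not code from the benchmark harness; the example numbers are the RTX 3090 figures reported later in this article.

```python
# Real-time factor (RTF): processing time divided by audio duration.
# Illustrative helper, not part of the benchmark harness.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the GPU transcribes faster than audio arrives."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def is_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    return real_time_factor(processing_seconds, audio_seconds) <= 1.0

# Example from this article: a 30 s clip processed in 5.8 s on an RTX 3090.
print(f"RTF = {real_time_factor(5.8, 30.0):.2f}")  # RTF = 0.19
```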
All tests ran on GigaGPU bare-metal servers using Faster-Whisper (CTranslate2 backend) for optimal throughput. Audio clips were 30 seconds of English speech at 16 kHz. Streams were added incrementally until the slowest stream exceeded RTF 1.0. For single-stream speed benchmarks, see the tokens per second benchmark hub.
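The ramp procedure — add streams until the slowest one exceeds RTF 1.0 — can be sketched as below. The per-stream RTF model is a toy linear fit to the RTX 3090 / Large v3 figures in this article, standing in for real GPU measurements; the actual harness times Faster-Whisper runs on hardware.

```python
# Sketch of the ramp procedure: add streams until the slowest stream
# exceeds RTF 1.0. simulated_stream_rtf() is a toy linear model fitted
# to this article's RTX 3090 / Large v3 numbers, not a real measurement.

def simulated_stream_rtf(n_streams: int, base_rtf: float = 0.19) -> float:
    # Assumes per-stream RTF grows roughly linearly with contention,
    # which matches the RTX 3090 latency table later in this article.
    return base_rtf * n_streams

def max_real_time_streams(base_rtf: float = 0.19, limit: float = 1.0) -> int:
    n = 0
    while simulated_stream_rtf(n + 1, base_rtf) <= limit:
        n += 1
    return n

print(max_real_time_streams())  # 5 streams under the RTX 3090 toy model
```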
Real-Time Streams by GPU
Maximum concurrent streams where every stream maintains RTF ≤ 1.0 (real-time processing).
| GPU | Whisper Large v3 | Whisper Medium | Whisper Small |
|---|---|---|---|
| RTX 3050 (6 GB) | 1 | 2 | 5 |
| RTX 4060 (8 GB) | 2 | 4 | 10 |
| RTX 4060 Ti (16 GB) | 3 | 6 | 14 |
| RTX 3090 (24 GB) | 5 | 10 | 22 |
| RTX 5080 (16 GB) | 7 | 14 | 30 |
| RTX 5090 (32 GB) | 11 | 20 | 45 |
The RTX 5090 handles 11 concurrent Large v3 streams or 45 Small streams in real time. The RTX 3090 manages 5 Large v3 streams — suitable for a small call centre or meeting recording service. The RTX 4060 at 2 Large v3 streams is limited to low-volume use.
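For capacity-planning scripts, the table above can be embedded as a lookup. The figures are copied directly from this article's measurements; the model keys mirror common Whisper model names, and the helper itself is illustrative.

```python
# Maximum real-time streams (RTF <= 1.0) per GPU, from this article's
# benchmark table. Model keys follow common Whisper naming.
MAX_STREAMS = {
    "RTX 3050":    {"large-v3": 1,  "medium": 2,  "small": 5},
    "RTX 4060":    {"large-v3": 2,  "medium": 4,  "small": 10},
    "RTX 4060 Ti": {"large-v3": 3,  "medium": 6,  "small": 14},
    "RTX 3090":    {"large-v3": 5,  "medium": 10, "small": 22},
    "RTX 5080":    {"large-v3": 7,  "medium": 14, "small": 30},
    "RTX 5090":    {"large-v3": 11, "medium": 20, "small": 45},
}

def max_streams(gpu: str, model: str) -> int:
    return MAX_STREAMS[gpu][model]

print(max_streams("RTX 3090", "large-v3"))  # 5
```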
Large v3 vs Medium vs Small
Whisper model size directly determines how many streams fit on a GPU. Large v3 (1.5B parameters, ~3 GB VRAM) delivers the highest transcription quality but processes audio at roughly half the speed of Medium (769M parameters, ~1.5 GB). Small (244M parameters, ~0.5 GB) is 4x faster than Large v3 with a moderate accuracy trade-off.
For most production transcription workloads, Whisper Medium offers the best balance — accuracy is within 2-3 percent of Large v3 on English while supporting double the concurrent streams. Use Large v3 only for multilingual or noisy audio where accuracy is critical. The voice agent latency benchmark shows how model choice affects end-to-end pipeline latency.
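The sizing guidance above reduces to a small decision rule. The table values come from this article (VRAM figures are approximate fp16 footprints; `relative_speed` is normalised to Large v3), and `pick_model` is a hypothetical helper encoding the heuristic, not a library API.

```python
# Model sizes and footprints from this article; VRAM is an approximate
# fp16 figure and relative_speed is normalised to Large v3 (= 1.0).
WHISPER_MODELS = {
    "large-v3": {"params_m": 1500, "vram_gb": 3.0, "relative_speed": 1.0},
    "medium":   {"params_m": 769,  "vram_gb": 1.5, "relative_speed": 2.0},
    "small":    {"params_m": 244,  "vram_gb": 0.5, "relative_speed": 4.0},
}

def pick_model(multilingual: bool, noisy_audio: bool) -> str:
    # Heuristic from this article: reserve Large v3 for multilingual or
    # noisy audio; Medium is the default for English production workloads.
    return "large-v3" if (multilingual or noisy_audio) else "medium"

print(pick_model(multilingual=False, noisy_audio=False))  # medium
```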
Per-Stream Latency Impact
As you add streams, per-stream processing latency increases even though all streams remain real-time. On the RTX 3090 with Whisper Large v3, a single stream processes 30 seconds of audio in 5.8 seconds (RTF 0.19). At 5 concurrent streams, the same 30-second clip takes 28 seconds (RTF 0.93) — still real-time but with much less margin.
| Streams (RTX 3090, Large v3) | RTF per Stream | Processing Time (30s clip) |
|---|---|---|
| 1 | 0.19 | 5.8 s |
| 2 | 0.35 | 10.5 s |
| 3 | 0.52 | 15.6 s |
| 5 | 0.93 | 27.9 s |
| 6 | 1.12 | 33.6 s (not real-time) |
Operating at 80 percent of the maximum stream count gives you headroom for audio spikes and prevents occasional timeouts. For the RTX 3090 with Large v3, that means targeting 4 streams rather than 5.
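The 80 percent rule is simple arithmetic; a minimal sketch, with the `headroom` default taken from this article's recommendation:

```python
import math

def recommended_streams(measured_max: int, headroom: float = 0.8) -> int:
    # Target 80% of measured capacity to absorb audio spikes
    # and avoid occasional timeouts.
    return max(1, math.floor(measured_max * headroom))

print(recommended_streams(5))   # 4 (RTX 3090, Large v3)
print(recommended_streams(11))  # 8 (RTX 5090, Large v3)
```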
Scaling Beyond One GPU
Whisper streams are independent and embarrassingly parallel, making horizontal scaling straightforward. Two RTX 3090 servers behind a load balancer handle 10 concurrent Large v3 streams. For large-scale transcription services, this is more cost-effective than a single RTX 5090: two 3090 servers cost roughly the same as one 5090 server, deliver nearly the same capacity (10 streams vs 11), and add hardware redundancy. See the 1 GPU vs 2 GPU scaling guide for details.
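Because streams are independent, the load balancer can be as simple as round-robin dispatch. A minimal sketch — the server names and `assign_streams` helper are illustrative, not a real API:

```python
from itertools import cycle

# Minimal round-robin dispatch across identical Whisper servers.
# Server names and assign_streams() are illustrative, not a real API.

def assign_streams(n_streams: int, servers: list[str]) -> dict[str, int]:
    counts = {s: 0 for s in servers}
    for server, _ in zip(cycle(servers), range(n_streams)):
        counts[server] += 1
    return counts

# Two RTX 3090 servers sharing 10 Large v3 streams: 5 each,
# matching the per-GPU real-time limit from this article.
print(assign_streams(10, ["gpu-a", "gpu-b"]))  # {'gpu-a': 5, 'gpu-b': 5}
```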
For batch transcription (not real-time), you can queue audio files and process them as fast as the GPU allows. A single RTX 3090 processes roughly 160 hours of audio per day with Large v3 at full utilisation. Use multi-GPU clusters to scale linearly for high-volume batch processing.
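The daily-throughput figure follows from the RTF definition: at full utilisation, one GPU-day transcribes 24 h divided by the effective RTF. The 0.15 value below is back-derived from this article's 160 h/day figure, not an independent measurement.

```python
def audio_hours_per_day(effective_rtf: float) -> float:
    # At full utilisation, one GPU-day transcribes 24 / RTF hours of audio.
    return 24.0 / effective_rtf

# Effective batch RTF of ~0.15 is back-derived from this article's
# 160 h/day figure for an RTX 3090 running Large v3 at full utilisation.
print(round(audio_hours_per_day(0.15)))  # 160
```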
Conclusion
Concurrent Whisper stream capacity ranges from 1-2 (budget GPUs with Large v3) to 45 (RTX 5090 with Whisper Small). For production transcription services, target 80 percent of the maximum stream count for reliability. The RTX 3090 at 5 Large v3 streams or 10 Medium streams is the cost-effective choice for most deployments. Browse additional audio and speech benchmarks in the Benchmarks category or explore all GPU options at GigaGPU.