
Whisper: How Many Audio Streams per GPU?

How many concurrent Whisper audio streams can each GPU handle in real time? Benchmarks for Whisper Large v3, Medium, and Small across six GPUs on dedicated servers.

Whisper Concurrency Overview

Audio transcription at scale — call centres, meeting recorders, podcast processing — requires knowing how many simultaneous audio streams your dedicated GPU server can process. A stream is “real-time” when the GPU transcribes audio faster than it arrives, measured by the real-time factor (RTF): an RTF below 1.0 means the GPU keeps up. We tested how many concurrent Whisper streams each GPU supports at RTF ≤ 1.0.
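The RTF calculation itself is simple enough to sketch. The numbers below use the single-stream RTX 3090 figure reported later in this post:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = transcription time / audio duration. Below 1.0 means real-time."""
    return processing_seconds / audio_seconds

# Single RTX 3090 stream with Large v3: 30 s of audio transcribed in 5.8 s.
rtf = real_time_factor(5.8, 30.0)
print(f"RTF = {rtf:.2f}  (real-time: {rtf <= 1.0})")  # RTF = 0.19
```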

All tests ran on GigaGPU bare-metal servers using Faster-Whisper (CTranslate2 backend) for optimal throughput. Audio clips were 30 seconds of English speech at 16 kHz. Streams were added incrementally until the slowest stream exceeded RTF 1.0. For single-stream speed benchmarks, see the tokens per second benchmark hub.
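The ramp-up procedure can be sketched as follows. The per-stream RTF values are the RTX 3090 / Large v3 measurements reported later in this post; the function is an illustrative harness, not our actual benchmark code:

```python
# Measured worst-stream RTF at each concurrency level (RTX 3090, Whisper Large v3).
MEASURED_RTF = {1: 0.19, 2: 0.35, 3: 0.52, 5: 0.93, 6: 1.12}

def max_real_time_streams(rtf_by_streams: dict[int, float], limit: float = 1.0) -> int:
    """Add streams until the slowest stream exceeds the RTF limit;
    return the highest concurrency level that stayed real-time."""
    best = 0
    for streams in sorted(rtf_by_streams):
        if rtf_by_streams[streams] <= limit:
            best = streams
        else:
            break
    return best

print(max_real_time_streams(MEASURED_RTF))  # 5 streams stay real-time; 6 does not
```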

Real-Time Streams by GPU

Maximum concurrent streams where every stream maintains RTF ≤ 1.0 (real-time processing).

| GPU | Whisper Large v3 | Whisper Medium | Whisper Small |
| --- | --- | --- | --- |
| RTX 3050 (6 GB) | 1 | 2 | 5 |
| RTX 4060 (8 GB) | 2 | 4 | 10 |
| RTX 4060 Ti (16 GB) | 3 | 6 | 14 |
| RTX 3090 (24 GB) | 5 | 10 | 22 |
| RTX 5080 (16 GB) | 7 | 14 | 30 |
| RTX 5090 (32 GB) | 11 | 20 | 45 |

The RTX 5090 handles 11 concurrent Large v3 streams or 45 Small streams in real time. The RTX 3090 manages 5 Large v3 streams — suitable for a small call centre or meeting recording service. The RTX 4060 at 2 Large v3 streams is limited to low-volume use.

Large v3 vs Medium vs Small

Whisper model size directly determines how many streams fit on a GPU. Large v3 (1.5B parameters, ~3 GB VRAM) delivers the highest transcription quality but processes audio at roughly half the speed of Medium (769M parameters, ~1.5 GB). Small (244M parameters, ~0.5 GB) is 4x faster than Large v3 with a moderate accuracy trade-off.

For most production transcription workloads, Whisper Medium offers the best balance — accuracy is within 2-3 percent of Large v3 on English while supporting double the concurrent streams. Use Large v3 only for multilingual or noisy audio where accuracy is critical. The voice agent latency benchmark shows how model choice affects end-to-end pipeline latency.

Per-Stream Latency Impact

As you add streams, per-stream processing latency increases even though all streams remain real-time. On the RTX 3090 with Whisper Large v3, a single stream processes 30 seconds of audio in 5.8 seconds (RTF 0.19). At 5 concurrent streams, the same 30-second clip takes 28 seconds (RTF 0.93) — still real-time but with much less margin.

| Streams (RTX 3090, Large v3) | RTF per Stream | Processing Time (30 s clip) |
| --- | --- | --- |
| 1 | 0.19 | 5.8 s |
| 2 | 0.35 | 10.5 s |
| 3 | 0.52 | 15.6 s |
| 5 | 0.93 | 27.9 s |
| 6 | 1.12 | 33.6 s (not real-time) |

Operating at 80 percent of the maximum stream count gives you headroom for audio spikes and prevents occasional timeouts. For the RTX 3090 with Large v3, that means targeting 4 streams rather than 5.
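The headroom calculation is a one-liner; the example values come from the stream-count table above:

```python
import math

def target_streams(max_streams: int, headroom: float = 0.8) -> int:
    """Run at a fraction of the measured maximum to absorb audio spikes."""
    return math.floor(max_streams * headroom)

print(target_streams(5))   # RTX 3090, Large v3 -> 4
print(target_streams(11))  # RTX 5090, Large v3 -> 8
```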

Scaling Beyond One GPU

Whisper streams are independent and embarrassingly parallel, making horizontal scaling straightforward. Two RTX 3090 servers behind a load balancer handle 10 concurrent Large v3 streams — nearly the capacity of a single RTX 5090 (11 streams) at roughly the same total cost, with the added resilience that one server can fail without taking the whole service down. See the 1 GPU vs 2 GPU scaling guide for details.
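A minimal sketch of that load-balancing idea — the server names and round-robin policy are hypothetical, and the per-server capacities are the RTX 3090 / Large v3 figure from the table above:

```python
from itertools import cycle

# Hypothetical fleet: two RTX 3090 servers, each good for 5 real-time Large v3 streams.
SERVERS = {"gpu-01": 5, "gpu-02": 5}

def assign_streams(n_streams: int, servers: dict[str, int]) -> dict[str, int]:
    """Round-robin incoming streams across servers, respecting each server's capacity."""
    load = {name: 0 for name in servers}
    ring = cycle(servers)
    assigned = 0
    while assigned < n_streams:
        name = next(ring)
        if load[name] < servers[name]:
            load[name] += 1
            assigned += 1
        elif all(load[s] >= servers[s] for s in servers):
            raise RuntimeError("fleet at capacity")
    return load

print(assign_streams(10, SERVERS))  # {'gpu-01': 5, 'gpu-02': 5}
```

A production balancer would also track per-stream RTF and drain connections on failure, but the capacity arithmetic is the same.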

For batch transcription (not real-time), you can queue audio files and process them as fast as the GPU allows. A single RTX 3090 processes roughly 160 hours of audio per day with Large v3 at full utilisation. Use multi-GPU clusters to scale linearly for high-volume batch processing.
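The batch figure follows from simple arithmetic. The 0.15 "effective RTF" below is back-calculated from the ~160 hours/day figure above (24 / 160 = 0.15), so treat it as illustrative rather than measured:

```python
def batch_hours_per_day(effective_rtf: float) -> float:
    """Hours of audio a fully utilised GPU transcribes in 24 hours of wall time."""
    return 24.0 / effective_rtf

# RTX 3090 with Large v3 at full batch utilisation (assumed effective RTF 0.15).
print(round(batch_hours_per_day(0.15)))  # 160
```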

Conclusion

Concurrent Whisper stream capacity ranges from 1-2 (budget GPUs with Large v3) to 45 (RTX 5090 with Whisper Small). For production transcription services, target 80 percent of the maximum stream count for reliability. The RTX 3090 at 5 Large v3 streams or 10 Medium streams is the cost-effective choice for most deployments. Browse additional audio and speech benchmarks in the Benchmarks category or explore all GPU options at GigaGPU.
