The RTX 4090 24GB is the strongest single-GPU choice for self-hosted speech recognition under 32 GB. With 1008 GB/s of GDDR6X bandwidth, native FP8 paths through Ada’s fourth-generation tensor cores and 16,384 CUDA cores, it pushes Whisper large-v3-turbo to roughly 80x real-time on a single stream and 175x real-time when batched at 16. This post records reproducible numbers from a stock UK dedicated GPU host, with full methodology, per-batch tables, alignment overhead and the operational gotchas that bite teams when they move from a notebook to a 24/7 transcription service. If you only read one section, jump to the methodology and per-batch tables; the raw numbers are what most readers need to size their fleet.
Contents
- Why measure RTF this way
- Methodology and test rig
- Single-stream RTF table
- Batched WhisperX scaling
- Precision, accuracy and language
- Memory, VRAM and pipeline co-location
- Capacity planning and concurrency
- Production gotchas
Why measure RTF this way
Real-time factor (RTF) divides audio duration by wall-clock processing time, so 80x RTF means one minute of audio transcribes in 0.75 seconds. ASR is unusual in being a hybrid encoder/decoder workload: the encoder is a convolutional-plus-transformer stack that processes 30-second mel-spectrogram windows in parallel, and the decoder is autoregressive, with beam search over a small token alphabet. The encoder is compute-bound and benefits enormously from low-precision tensor-core math and large batches; the decoder is bandwidth-bound and gates the smallest batches. The 4090’s roughly 330 TFLOPS of dense FP8 (660 with sparsity) and 1008 GB/s of VRAM bandwidth make it the rare consumer card where neither side is a meaningful bottleneck, which is why it consistently turns in the best per-card RTF below A100 class.
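To make the convention concrete, a throwaway helper (ours, not part of any library):

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """RTF as used throughout this post: audio duration / processing time."""
    return audio_seconds / wall_seconds

print(rtf(60, 0.75))   # 80.0 -> one minute of audio in 0.75 s
print(3600 / 80)       # 45.0 -> a one-hour clip in 45 s of wall-clock
```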
We test RTF rather than throughput-per-GPU-hour because telephony, meeting capture and live captioning all care about the time a single utterance takes. For backfill workloads the batched aggregate matters more, so we report both. Our approach closely follows the methodology used in the RTX 5060 Ti 16GB Whisper benchmark, so direct cross-card comparisons are honest.
Methodology and test rig
All numbers come from a single 4090 (Founders Edition, stock 450 W TDP) on the standard spec we host: Ryzen 9 7950X, 64 GB DDR5-5600, Samsung 990 Pro 2 TB Gen 4 NVMe, Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. The software stack is faster-whisper 1.1 (CTranslate2 backend), WhisperX 3.1, FlashAttention 2.6 and PyTorch 2.5. Audio is 16 kHz mono PCM, English-language podcast and call-centre clips ranging from 30 s to 60 min, with VAD enabled (Silero v4) and a 200 ms tail. Beam size is 5 throughout. Power is sampled via NVML at 100 ms intervals; observed draw during decode hovers between 360 and 410 W, with brief 430 W spikes during long-form decoding.
```python
# Single-stream Whisper Turbo INT8 launch
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe(
    "clip.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=200),
)
```
Each clip is run five times after a single warmup pass; the median is reported. We exclude model load time but include audio decode (libavcodec) and VAD chunking. Disk I/O is local NVMe; if your data sits in S3 or NFS, add 50-200 ms per clip for fetch latency.
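A minimal sketch of that timing harness, reusing `model` from the snippet above; faster-whisper yields segments lazily, so the generator has to be consumed for the decode to actually run (paths and run count are illustrative):

```python
import statistics
import time

def bench(path: str, runs: int = 5) -> float:
    """Median RTF over timed passes, after one untimed warmup."""
    def one_pass() -> float:
        start = time.perf_counter()
        segments, info = model.transcribe(
            path, beam_size=5, vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=200))
        for _ in segments:           # decoding happens lazily; consume it
            pass
        wall = time.perf_counter() - start
        return info.duration / wall  # RTF = audio seconds / wall-clock seconds

    one_pass()                       # warmup pass, excluded from the median
    return statistics.median(one_pass() for _ in range(runs))

print(f"median RTF: {bench('clip.wav'):.0f}x")
```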
What we did not change from defaults
We deliberately do not enable best-of sampling or temperature fallback because we want repeatable timings. Production deployments often turn these on for tail-quality robustness; expect a 5-10% RTF penalty.
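For reference, the difference in call terms, assuming the same `model` as above (the post's exact flags aren't shown, so treat this as illustrative): the benchmark pins a single temperature, while a production config keeps faster-whisper's fallback ladder.

```python
# Benchmark-style: deterministic, no temperature fallback
segments, info = model.transcribe("clip.wav", beam_size=5, temperature=0.0)

# Production-style: re-decode low-confidence windows at rising temperatures,
# with best-of sampling once temperature > 0; costs roughly 5-10% RTF
segments, info = model.transcribe(
    "clip.wav", beam_size=5, best_of=5,
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```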
Single-stream RTF
The headline single-stream numbers: batch size 1, beam 5, VAD enabled. RTF is audio duration divided by end-to-end wall-clock; warmup runs are excluded.
| Model | Precision | VRAM | RTF (1 min) | RTF (10 min) | RTF (60 min) |
|---|---|---|---|---|---|
| large-v3 | FP16 | 5.1 GB | 34x | 32x | 30x |
| large-v3 | INT8 | 2.9 GB | 62x | 60x | 57x |
| large-v3-turbo | FP16 | 2.4 GB | 64x | 62x | 58x |
| large-v3-turbo | INT8 | 1.7 GB | 82x | 80x | 76x |
| medium | FP16 | 2.3 GB | 56x | 55x | 52x |
| medium | INT8 | 1.3 GB | 74x | 72x | 68x |
| small | INT8 | 0.6 GB | 148x | 140x | 130x |
Turbo INT8 at 80x is the sweet spot for production telephony, podcasts and meeting capture. A one-hour interview transcribes in 45 seconds of wall-clock; a typical 4-second call-centre utterance in 50 ms, comfortably below human-perceptible latency. RTF declines slightly on longer clips because disk I/O overlaps with VAD and beam-search history grows; preloading audio fully and capping segment length at 30 s keeps the figure flat.
WhisperX alignment overhead
If you need word-level timestamps for video captioning or call-quality scoring, WhisperX adds a wav2vec2 forced-aligner pass. On a 4090 this adds roughly 40-60% to the transcription wall-clock at FP16, dropping effective single-stream RTF on Turbo INT8 from 80x to about 55x. Diarisation via pyannote-audio costs more again, landing the full pipeline (transcribe + align + diarise) at roughly 38x RTF. That still means a 60-minute meeting recording with speaker labels and word timestamps lands in 95 seconds of wall-clock.
Batched WhisperX scaling
For backfill jobs and bulk archives, WhisperX with batched inference scales further. We chunk audio into 30 s windows after VAD, batch them through the encoder, and run a shared decoder.
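A sketch of how the batched path is driven, assuming the WhisperX 3.x Python API (`load_model`, `transcribe(..., batch_size=...)`) plus the optional alignment pass from the previous section; the file name is illustrative, and the table below corresponds to sweeping `batch_size`.

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("episode.wav")   # 16 kHz mono float32 array

# Batched transcription: VAD-chunked ~30 s windows fed through the encoder
model = whisperx.load_model("large-v3-turbo", device, compute_type="int8")
result = model.transcribe(audio, batch_size=16)

# Optional wav2vec2 forced alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
```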
| Model | Batch | VRAM | Aggregate RTF | Hours per wall hour | p50 1-min clip |
|---|---|---|---|---|---|
| large-v3 | 4 | 6.1 GB | 78x | 78 | 0.95 s |
| large-v3 | 8 | 9.8 GB | 95x | 95 | 0.78 s |
| large-v3 | 16 | 16.4 GB | 118x | 118 | 0.62 s |
| large-v3-turbo | 4 | 3.2 GB | 120x | 120 | 0.58 s |
| large-v3-turbo | 8 | 5.6 GB | 150x | 150 | 0.45 s |
| large-v3-turbo | 16 | 9.4 GB | 175x | 175 | 0.36 s |
| large-v3-turbo | 32 | 17.2 GB | 178x | 178 | 0.34 s |
Batch 16 saturates the encoder; batch 32 adds 2% throughput at the cost of 80% more VRAM, and we recommend stopping at 16 in production. At batch 16 the 4090 transcribes 175 hours of audio in one wall-clock hour. A 5,000-hour backlog clears in just under 29 hours on a single card, handy when migrating a years-old podcast archive in a weekend.
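The backlog arithmetic generalises straightforwardly; a trivial helper of ours, with figures taken from the table above:

```python
def wall_hours(backlog_hours: float, aggregate_rtf: float, gpus: int = 1) -> float:
    """Wall-clock hours to clear a backlog at a given aggregate RTF per GPU."""
    return backlog_hours / (aggregate_rtf * gpus)

print(wall_hours(5_000, 175))       # ~28.6 h on one card at batch 16
print(wall_hours(50_000, 175, 3))   # ~95 h on three cards (podcast scenario below)
```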
Precision, accuracy and language
INT8 weights through CTranslate2 cost roughly 0.2-0.4 WER points on LibriSpeech test-clean compared with FP16. On noisy call-centre audio the gap widens to 0.6-0.9 WER, which is still inside the band of speaker variation. For most podcast, meeting and call-centre workloads the difference is invisible to humans. Where you need maximum fidelity (legal transcripts, medical dictation), keep FP16 on Turbo: the 4090 has plenty of headroom, you’ll still see ~62x RTF, and the WER is statistically indistinguishable from FP32.
Multilingual notes
Whisper Turbo is English-tuned but retains useful capability in roughly 50 languages. RTF is largely unaffected by language; WER varies sharply. For Mandarin or Arabic call-centre audio expect 12-18% WER from Turbo; large-v3 INT8 at 62x RTF brings that down to 9-12%. If you need multilingual at scale, large-v3 INT8 is the pragmatic default. For low-resource European languages (Welsh, Estonian, Basque) Turbo’s WER is poor enough that a fine-tune on roughly 10 hours of in-language audio is recommended; the LoRA path through Whisper-style heads is documented in the LoRA fine-tune guide.
Memory, VRAM and pipeline co-location
Even Turbo at batch 32 fits in 17.2 GB, leaving 6.8 GB free to run alignment (~600 MB), diarisation (pyannote, ~1.2 GB) and VAD (Silero, ~80 MB) on the same card. The full pipeline therefore lives on a single 4090, freeing you from multi-GPU orchestration and the price premium of an A100. For voice-assistant workloads where ASR sits alongside an LLM, see the voice assistant guide: a stack of Whisper Turbo + Llama 3 8B FP8 + XTTS v2 fits in roughly 17 GB resident, leaving 7 GB for KV cache.
Compare with the RTX 5060 Ti 16GB if your audio volume is modest and your batches stay small. The 5060 Ti reaches roughly 60x RTF on Turbo INT8 single-stream and 95x batched at 16, which is fine for many SaaS deployments but not enough for archive-scale work.
Capacity planning and concurrency
For live transcription at scale the relevant question is “how many concurrent callers can one 4090 hold?” With Whisper Turbo INT8 in streaming mode (continuous 5-second windows, VAD-trimmed), each call consumes roughly 2-3% of GPU compute on average, peaking at 8-10% during dense speech. We comfortably sustain 60-80 concurrent active callers per 4090 with sub-300 ms ASR latency end-to-end. The figures align with our broader concurrent users benchmark.
Named scenario: 1,200-seat contact-centre
A real deployment we sized: a 1,200-seat outbound sales centre wanted live transcription plus sentiment scoring on each call. With a ~22% concurrent active-call ratio (260 active calls at peak), a pair of 4090s in active-active configuration delivers the workload with one card spare for failover. Total monthly cost is well under what a per-minute API would charge for the same minutes, with full audio control retained inside their VLAN. Power per card averages 320 W during steady state because ASR is bursty and Silero VAD lets the GPU idle between utterances. See the monthly hosting cost page for a full TCO comparison.
Bulk transcription scenario: 50,000-hour podcast network
One customer migrated a back-catalogue of 50,000 hours of podcast audio to internal search. Running batched Turbo INT8 at batch 16 on three 4090s in parallel, the full corpus transcribed in roughly 96 hours of wall-clock and was indexed into Qdrant via BGE-large in another 26 hours. Total infra cost was a single month of three-card hosting plus storage; the equivalent API spend would have been roughly 30x.
Production gotchas
- Beam search vs greedy. Beam 5 costs ~30% over greedy decoding for marginal WER gains in clean audio. For high-volume podcast ingest, greedy is the right default; reserve beam search for legal and medical.
- VAD chunking can clip syllables. Set `min_silence_duration_ms=300` for English and 500 ms for tonal languages, or you’ll lose final consonants on backfill jobs. Spot-check WER before committing to a setting.
- Disk-bound on small files. If you’re running 5-second utterances from S3, NVMe staging is essential; the 4090 can transcribe faster than typical S3 GETs return.
- FP8 isn’t supported by faster-whisper yet. CTranslate2’s INT8 path is the fast lane on the 4090; FP8 wins for LLMs (see the FP8 tensor-cores write-up) but not for Whisper as of this benchmark.
- Watch power on long-form bulk jobs. 60-minute clips at batch 16 sustain 410 W. If your rack is power-capped, set `nvidia-smi -pl 380`; throughput drops by 5%, power by 15%. Detail in the power draw post.
- Model files on first run. CTranslate2 conversion of large-v3 takes 90-120 s on first launch; cache the converted directory under `HF_HOME` on persistent NVMe or your container restarts will be slow.
- Don’t share CUDA context with vLLM blindly. If you co-host an LLM, give vLLM `--gpu-memory-utilization 0.55` and let Whisper grab VRAM on demand; the alternative (both grabbing greedily) causes OOMs at peak. A minimal co-hosting sketch follows this list.
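Picking up that last point, a minimal co-hosting sketch under the 0.55 split, using vLLM's offline Python API rather than the server flag; the LLM name, file name and prompt are illustrative, and the Whisper side allocates on demand as in the earlier snippets.

```python
from faster_whisper import WhisperModel
from vllm import LLM, SamplingParams

# Cap vLLM's pre-allocation so Whisper, alignment and VAD keep VRAM headroom
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
          gpu_memory_utilization=0.55)

# faster-whisper allocates lazily per request, well inside the remaining VRAM
asr = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")

segments, info = asr.transcribe("utterance.wav", beam_size=5)
transcript = " ".join(s.text.strip() for s in segments)

out = llm.generate(["Summarise this call:\n" + transcript],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```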
Verdict: when to pick the 4090 for Whisper
If you need single-card transcription throughput between 80x and 175x real-time, the 4090 is the right tool. It outperforms the RTX 3090 by roughly 1.4x on Turbo (the 3090 lacks the L2 cache improvements that help encoder batches) and trails an H100 by only 1.6x at one-tenth the cost. For lower volumes the 5060 Ti 16GB is half the price and reaches 60x RTF on Turbo INT8 single-stream. For multi-language plus alignment plus diarisation co-located with an LLM, only the 4090 (or larger) gives you enough VRAM headroom on a single card.
Run Whisper Turbo at 175x real-time
Ship transcription product without API per-minute fees. UK dedicated hosting.
Order the RTX 4090 24GB
See also: RTX 4090 voice assistant stack, tokens per watt benchmark, prefill and decode benchmarks, 4090 spec breakdown, power and efficiency, vLLM setup, best GPU for Whisper.