Whisper is the de facto standard for open speech-to-text. On the RTX 5060 Ti 16GB in our hosting range, every model variant fits in VRAM, with excellent throughput.
Setup
- Backends: Faster-Whisper (CTranslate2 INT8), WhisperX, vanilla openai-whisper (minimal usage sketch after this list)
- Input: 16 kHz WAV, various lengths
- Metrics: RTF (processing time / audio length), lower is faster
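A minimal Faster-Whisper sketch matching this setup. The file name and beam size are placeholders, INT8 weights are selected via `compute_type`, and whether the `large-v3-turbo` alias resolves depends on your faster-whisper version:

```python
from faster_whisper import WhisperModel

# INT8 weights on the GPU, matching the benchmark configuration
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")

# transcribe() returns a lazy generator of segments plus metadata
segments, info = model.transcribe("meeting.wav", beam_size=5)  # placeholder file
print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:6.2f} -> {segment.end:6.2f}] {segment.text}")
```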
Model Variants
| Model | Params | FP16 VRAM | INT8 VRAM | WER (LibriSpeech) |
|---|---|---|---|---|
| tiny | 39M | 1.0 GB | 0.5 GB | 12.4% |
| base | 74M | 1.3 GB | 0.7 GB | 8.7% |
| small | 244M | 2.2 GB | 1.1 GB | 5.8% |
| medium | 769M | 4.8 GB | 2.5 GB | 4.2% |
| large-v3 | 1.55B | 6.0 GB | 3.1 GB | 3.0% |
| large-v3-turbo | 809M | 3.1 GB | 1.6 GB | 3.1% |
Real-Time Factor (Batch 1)
| Model | Faster-Whisper INT8 RTF | Throughput (audio-hours / wall-hour) |
|---|---|---|
| tiny | 0.008 | 125 |
| base | 0.012 | 83 |
| small | 0.022 | 45 |
| medium | 0.038 | 26 |
| large-v3 | 0.056 | 18 |
| large-v3-turbo | 0.018 | 55 |
Turbo is the new default: nearly large-v3 quality at small-class speed. At an RTF of 0.018, a 1-hour meeting transcribes in roughly 0.018 × 3600 ≈ 65 seconds.
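To reproduce the RTF numbers, time a full transcription and divide by the audio duration. A sketch assuming faster-whisper, with a placeholder input file; note that `transcribe()` is lazy, so the segment generator must be consumed before stopping the clock:

```python
import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")

start = time.perf_counter()
segments, info = model.transcribe("meeting.wav")  # placeholder file
text = "".join(s.text for s in segments)  # consume the lazy generator
wall = time.perf_counter() - start

rtf = wall / info.duration  # processing time / audio length, lower is faster
print(f"RTF: {rtf:.3f} ({info.duration / wall:.0f}x real-time)")
```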
Batched Throughput
WhisperX with batched inference, large-v3, 30-second chunks, batch 8:
- Aggregate throughput: ~100 audio-hours / wall-hour
- VRAM: ~7.5 GB
Batching is critical for bulk transcription workloads (podcast backlogs, call-centre archives).
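A batched WhisperX sketch under the benchmark settings (large-v3, batch 8). The input path is a placeholder, the float16 compute type is WhisperX's default rather than something stated above, and chunking into 30-second windows is handled internally:

```python
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

audio = whisperx.load_audio("podcast_episode.wav")  # placeholder file
result = model.transcribe(audio, batch_size=8)  # batched 30 s chunks

for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s] {seg['text']}")
```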
Recommendation
Default to large-v3-turbo via whisper-api. Use large-v3 for accuracy-critical domains (legal, medical). Use medium for budget deployments. Leave plenty of VRAM for a paired LLM summarising the transcripts (see voice assistant stack or webinar transcription).
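A sketch of that pairing, assuming the LLM sits behind a local OpenAI-compatible endpoint; the URL, API key, and model name below are placeholders, not part of the benchmark:

```python
from faster_whisper import WhisperModel
from openai import OpenAI

# 1. Transcribe with turbo in INT8 (leaves ~14 GB of the card free)
whisper = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")
segments, _ = whisper.transcribe("meeting.wav")  # placeholder file
transcript = " ".join(s.text for s in segments)

# 2. Summarise with a co-located LLM via an OpenAI-compatible server
#    (hypothetical endpoint, e.g. vLLM or llama.cpp serving on the same GPU)
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = llm.chat.completions.create(
    model="local-llm",  # placeholder model name
    messages=[{"role": "user", "content": f"Summarise this meeting:\n{transcript}"}],
)
print(response.choices[0].message.content)
```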
Whisper on Blackwell 16GB
55x real-time on Turbo, with headroom for the full stack. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: voice pipeline setup, podcast tools, Coqui TTS benchmark.