Whisper Turbo (large-v3-turbo) is a pruned and fine-tuned variant of Whisper large-v3 with the decoder cut from 32 layers to 4. It transcribes roughly 8x faster at nearly identical accuracy on most languages, and on our dedicated GPU hosting it is the default self-hosted transcription model in 2026.
VRAM
~1.6 GB at FP16. It runs on any dedicated card, including an RTX 3050 – compared to an LLM, Whisper is small, so VRAM is not the limiting factor.
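If you want to confirm the footprint on your own card, a rough sketch using the NVML bindings (pip install nvidia-ml-py) is below; exact numbers vary with driver and CTranslate2 version, and loading with an INT8 compute type roughly halves the weight memory.

```python
# Sketch: measure the model's VRAM footprint via NVML.
import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

used_before = pynvml.nvmlDeviceGetMemoryInfo(handle).used
model = WhisperModel("turbo", device="cuda", compute_type="float16")  # or "int8_float16"
used_after = pynvml.nvmlDeviceGetMemoryInfo(handle).used

print(f"model footprint: {(used_after - used_before) / 2**30:.2f} GiB")
```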
Deployment
faster-whisper is the recommended runtime – CTranslate2 backend, INT8 quantisation, batched inference:
pip install faster-whisper

from faster_whisper import WhisperModel

# "turbo" resolves to large-v3-turbo; compute_type="int8_float16" further reduces VRAM use
model = WhisperModel("turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
for s in segments:
    print(s.text)
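The snippet above decodes sequentially. Batched inference (faster-whisper 1.1 and later) wraps the same model in BatchedInferencePipeline; a minimal sketch, with batch_size chosen arbitrarily:

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("turbo", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# batch_size controls how many 30-second windows are decoded in parallel;
# larger values trade VRAM for throughput on long files.
segments, info = batched.transcribe("audio.wav", batch_size=16)
for s in segments:
    print(s.text)
```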
For an HTTP service, wrap in FastAPI or use a pre-built server like whisper-webservice.
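A minimal FastAPI sketch (the endpoint name and response shape are our choice, not a standard):

```python
import tempfile

from fastapi import FastAPI, File, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
model = WhisperModel("turbo", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Write the upload to a temporary file so the audio decoder can seek it.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, info = model.transcribe(tmp.name, beam_size=5)
        text = " ".join(s.text.strip() for s in segments)
    return {"language": info.language, "duration": info.duration, "text": text}
```

Run it with uvicorn main:app (assuming the code lives in main.py). Transcription is synchronous, so concurrent uploads are effectively handled one at a time per GPU; add a queue or more workers if you need higher throughput.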
Speed
Transcribing one hour of audio:
| Model | GPU | Wall-clock time |
|---|---|---|
| Whisper large-v3 | RTX 4060 Ti | ~8 minutes |
| Whisper Turbo | RTX 4060 Ti | ~1 minute |
| Whisper Turbo | RTX 5090 | ~25 seconds |
| Whisper Turbo (INT8) | RTX 5090 | ~15 seconds |
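Numbers like these are easy to reproduce; a quick timing sketch is below (audio_1h.wav is a placeholder). Note that transcribe returns a lazy generator, so the decode only runs while segments are consumed.

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("audio_1h.wav", beam_size=5)
text = "".join(s.text for s in segments)  # iterating the generator does the actual work
elapsed = time.perf_counter() - start

print(f"audio: {info.duration:.0f} s  wall: {elapsed:.1f} s  "
      f"speed-up: {info.duration / elapsed:.0f}x real time")
```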
Quality
Turbo matches large-v3 on English and the major European languages. On low-resource languages (Swahili, Burmese, Telugu) accuracy drops slightly. For production English workloads, pick Turbo over large-v3; for low-resource languages, test both on your own audio before committing.
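A quick way to run that test is to compare word error rate against a reference transcript; a sketch using the jiwer package (reference.txt and sample.wav are placeholders, and for a fair comparison you would normally normalise casing and punctuation first):

```python
from faster_whisper import WhisperModel
from jiwer import wer  # pip install jiwer

with open("reference.txt") as f:
    reference = f.read()  # ground-truth transcript for sample.wav

for name in ("turbo", "large-v3"):
    model = WhisperModel(name, device="cuda", compute_type="float16")
    segments, _ = model.transcribe("sample.wav", beam_size=5)
    hypothesis = " ".join(s.text.strip() for s in segments)
    print(f"{name}: WER {wer(reference, hypothesis):.3f}")
```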
Fast Self-Hosted Transcription
Whisper Turbo preconfigured on UK dedicated GPUs, any tier.
Browse GPU Servers
See Whisper + diarization for speaker separation.