A private speech-to-text API on the RTX 5060 Ti 16GB via UK dedicated GPU hosting runs Whisper large-v3-turbo at 55x real-time on a single Blackwell card – fast enough to handle 20+ concurrent live streams plus heavy batch transcription, with none of the per-minute bill or data-residency headaches of OpenAI’s hosted Whisper API.
Contents
- Capacity and real-time factor
- Features and models
- Endpoints and integration
- Cost vs OpenAI Whisper API
- Deployment notes
Capacity and real-time factor
Whisper large-v3-turbo is a four-decoder-layer distillation of large-v3: near-identical WER, roughly 8x faster decode. Quantised to INT8 via CTranslate2 (faster-whisper), a 5060 Ti transcribes audio at 55x real-time – one hour of speech in about 65 seconds of wall time.
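The real-time-factor arithmetic behind that claim is simple division; a quick check using the 55x figure from the benchmark below:

```python
def wall_time_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to transcribe audio at a given real-time factor."""
    return audio_seconds / rtf

# One hour of speech at 55x real-time:
print(round(wall_time_seconds(3600, 55)))  # 65 seconds
```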
| Model | Precision | VRAM | RTF | WER (en) |
|---|---|---|---|---|
| large-v3-turbo | INT8 | 1.6 GB | 55x | ~5.5% |
| large-v3 | FP16 | 3.1 GB | 14x | ~5.1% |
| large-v3 | INT8 | 1.8 GB | 22x | ~5.3% |
| medium | FP16 | 1.5 GB | 32x | ~6.8% |
| distil-large-v3 | FP16 | 1.5 GB | 35x | ~6.0% |

| Workload | Throughput | Daily capacity |
|---|---|---|
| Batch transcription | 55 audio-hours per wall-clock hour | 1,320 hours |
| Concurrent live streams | 20+ streams at 1x real-time | 480 stream-hours |
| Podcast back-catalogue | ~2,400 one-hour episodes/day | – |
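The daily-capacity figures in the table follow directly from the real-time factor and a 24-hour day:

```python
RTF = 55                  # real-time factor from the benchmark table
HOURS_PER_DAY = 24
STREAMS = 20              # concurrent live streams at 1x

batch_hours_per_day = RTF * HOURS_PER_DAY        # audio-hours transcribed per day
stream_hours_per_day = STREAMS * HOURS_PER_DAY   # live stream-hours per day

print(batch_hours_per_day)   # 1320
print(stream_hours_per_day)  # 480
```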
Features and models
- 99-language coverage and zero-shot translation to English.
- Word-level timestamps for captioning and karaoke-style UIs.
- VAD-based chunking via Silero to skip silence.
- Speaker diarisation via Pyannote 3.1 (adds ~2 GB VRAM).
- Custom vocabulary prompts for domain terms (drug names, ticker symbols, SKUs).
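Custom vocabulary works by seeding the decoder with an initial prompt containing your domain terms (faster-whisper exposes this as the `initial_prompt` argument to `transcribe`). A minimal sketch; the `build_vocab_prompt` helper is illustrative, not part of any library:

```python
def build_vocab_prompt(terms: list[str]) -> str:
    """Join domain terms into an initial prompt that biases Whisper's decoder.

    Whisper only attends to roughly the last 224 tokens of the prompt,
    so keep the list short and put the most important terms last.
    """
    return "Glossary: " + ", ".join(terms) + "."

prompt = build_vocab_prompt(["semaglutide", "NVDA", "SKU-4471"])
# Pass as initial_prompt=prompt to faster-whisper's model.transcribe(...)
```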
Endpoints and integration
faster-whisper-server or wyoming-faster-whisper exposes an OpenAI-compatible /v1/audio/transcriptions endpoint. Point existing OpenAI SDK code at your URL by changing base_url – no other client-side changes are needed. See our Whisper API setup.
```python
from openai import OpenAI

client = OpenAI(base_url="https://stt.example.com/v1", api_key="...")

with open("call.m4a", "rb") as f:
    r = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        response_format="verbose_json",       # full segment metadata
        timestamp_granularities=["word"],     # per-word timestamps
    )
```
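The verbose_json response carries per-word timing, which is enough to emit WebVTT captions client-side. A minimal sketch, assuming each entry in the words list has the `word`, `start`, and `end` (seconds) fields the OpenAI response format documents:

```python
def to_vtt(words: list[dict], max_words: int = 7) -> str:
    """Group word-level timestamps into WebVTT cues of up to max_words words."""
    def ts(t: float) -> str:
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    cues = ["WEBVTT", ""]
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append(f"{ts(chunk[0]['start'])} --> {ts(chunk[-1]['end'])}")
        cues.append(" ".join(w["word"] for w in chunk))
        cues.append("")
    return "\n".join(cues)

words = [{"word": "Hello", "start": 0.0, "end": 0.4},
         {"word": "world", "start": 0.4, "end": 0.9}]
print(to_vtt(words))
```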
Cost vs OpenAI Whisper API
| Volume | OpenAI Whisper ($0.006/min) | Self-hosted 5060 Ti |
|---|---|---|
| 10k hours/month | $3,600 (£2,830) | Fixed monthly |
| 50k hours/month | $18,000 (£14,150) | Fixed monthly |
| 150k hours/month | $54,000 (£42,400) | Fixed monthly |
One 5060 Ti handles 1,320 hours/day of batch transcription – around 40,000 hours/month at 100% utilisation. Break-even lands roughly at 3,000-4,000 audio hours/month depending on GBP/USD.
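The break-even band can be checked with simple arithmetic. The fixed monthly server cost below is a hypothetical placeholder, not a quoted rate:

```python
OPENAI_PER_MIN = 0.006   # $ per audio minute on OpenAI's Whisper API

def openai_monthly_cost(audio_hours: float) -> float:
    return audio_hours * 60 * OPENAI_PER_MIN

print(openai_monthly_cost(10_000))   # 3600.0 ($3,600/month)

# Break-even: audio hours at which OpenAI's bill equals a fixed monthly
# server cost. The $1,200/month figure is a placeholder assumption.
fixed_monthly_usd = 1_200
break_even_hours = fixed_monthly_usd / (60 * OPENAI_PER_MIN)
print(round(break_even_hours))       # 3333 hours/month
```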
Deployment notes
Co-host a lightweight diarisation model on the same card and pair with XTTS-v2 (RTF 0.1 – see voice pipeline setup) for a full duplex voice agent. Buffer uploaded audio to fast local NVMe, chunk into 30-second windows with 1-second overlap, and stream partial transcripts over websockets for live-captioning UIs.
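The 30-second-window, 1-second-overlap chunking above reduces to a pure boundary computation; a minimal sketch (the `chunk_bounds` name is illustrative):

```python
def chunk_bounds(duration_s: float, window_s: float = 30.0,
                 overlap_s: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio with fixed overlap."""
    step = window_s - overlap_s
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + window_s, duration_s)))
        start += step
    return bounds

print(chunk_bounds(75.0))
# [(0.0, 30.0), (29.0, 59.0), (58.0, 75.0)]
```

The 1-second overlap lets you deduplicate words at window joins before streaming partial transcripts to the client.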
Private Whisper API on Blackwell 16GB
55x real-time, OpenAI-compatible. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: TTS API, Coqui TTS benchmark, embedding server, startup MVP.