A self-hosted Whisper API on the RTX 5060 Ti 16GB at our hosting replaces OpenAI's metered Whisper API with a flat monthly cost.
Contents
- Option 1: speaches-ai/faster-whisper-server
- Option 2: Custom FastAPI
- OpenAI-compatible endpoints
- Performance
Option 1: speaches-ai/faster-whisper-server (Docker)
docker run --gpus all -p 8000:8000 \
  -e WHISPER__MODEL=large-v3-turbo \
  -e WHISPER__COMPUTE_TYPE=int8_float16 \
  fedirz/faster-whisper-server:latest-cuda
Ships OpenAI-compatible /v1/audio/transcriptions and /v1/audio/translations endpoints out of the box.
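To hit that endpoint without installing any client library, a stdlib-only Python sketch works too. The URL assumes the Docker server above on localhost:8000; `build_multipart` is our own helper, not part of any library.

```python
# Stdlib-only client for the OpenAI-compatible /v1/audio/transcriptions
# endpoint served by the container above. build_multipart is a
# hypothetical helper that assembles a single-file multipart body.
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body with one file field; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://localhost:8000/v1/audio/transcriptions") -> str:
    with open(path, "rb") as f:
        body, content_type = build_multipart("file", path, f.read())
    req = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

Handy for cron jobs and minimal containers where pulling in the OpenAI SDK is overkill.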
Option 2: Custom FastAPI
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import io

app = FastAPI()
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")

@app.post("/v1/audio/transcriptions")
async def transcribe(file: UploadFile):
    segments, info = model.transcribe(io.BytesIO(await file.read()), beam_size=5)
    return {
        "text": " ".join(s.text for s in segments),
        "language": info.language,
    }
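The handler above discards the per-segment timing that faster-whisper returns (each segment carries `.start`, `.end`, and `.text`). A minimal sketch turning that timing into SubRip subtitles, with segments represented as plain (start, end, text) tuples and `to_srt` being our own hypothetical helper:

```python
# Hypothetical helper: converts (start, end, text) tuples, as you would
# collect from faster-whisper's segments, into SubRip (SRT) format.
def to_srt(segments: list[tuple[float, float, str]]) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps look like 00:01:02,345 (milliseconds after a comma).
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = [
        f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n"
        for i, (start, end, text) in enumerate(segments, 1)
    ]
    return "\n".join(blocks)
```

Useful if the API should also serve `response_format=srt`, which the official OpenAI endpoint supports.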
OpenAI-Compatible Usage
from openai import OpenAI

client = OpenAI(api_key="none", base_url="http://localhost:8000/v1")

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="whisper-1", file=f)

print(result.text)
Any OpenAI SDK points at your local server just by changing base_url; no other code changes are needed.
Performance
- large-v3-turbo INT8: ~55x real-time on 5060 Ti
- 1-hour audio in ~65 seconds
- Memory usage: ~1.6 GB
- Concurrent transcriptions: batch 4-8 comfortably
For bulk workloads, use WhisperX with batched inference: roughly 100x real-time aggregate throughput at batch 8.
Whisper API on Blackwell 16GB
Self-hosted, 55x real-time. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Whisper benchmark, voice pipeline, webinar transcription, podcast tools.