What You’ll Build
In 30 minutes, you will have a production-ready transcription API that accepts audio files via HTTP, returns accurate transcripts with word-level timestamps and language detection, and handles concurrent requests with automatic queuing. Running Whisper large-v3 on a dedicated GPU server, your API transcribes one hour of audio in under three minutes at zero per-minute cost — whether you process 10 files or 10,000 per day.
Cloud transcription services charge $0.006-$0.024 per minute and impose rate limits that block batch workloads. At 1,000 hours of audio monthly, that is $6,000-$24,000 in API fees alone. Self-hosted Whisper on GPU hardware delivers identical or better accuracy with predictable monthly costs and no data leaving your infrastructure.
Architecture Overview
The API wraps Whisper behind a FastAPI service with an async task queue. Audio uploads land in a processing queue backed by Redis. Worker processes pull tasks and run them through Whisper on the GPU. Results are stored and returned via a polling endpoint or webhook callback. A pre-processing step normalises audio formats, resamples to 16kHz, and splits files exceeding 30 minutes into overlapping chunks for parallel processing.
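The queue contract can be sketched independently of Redis. In the sketch below, `pop` stands in for a blocking Redis `BLPOP`, and `transcribe_fn`/`store_fn` are placeholder hooks we name for illustration, not parts of the actual build:

```python
import json
from typing import Callable, Optional

def worker_loop(pop: Callable[[], Optional[str]],
                transcribe_fn: Callable[[str], dict],
                store_fn: Callable[[str, dict], None]) -> int:
    """Drain the queue: pop serialized tasks, transcribe, store results.

    Returns the number of tasks processed. Here `pop` returns None when
    the queue is empty; a Redis-backed pop would block instead.
    """
    processed = 0
    while (raw := pop()) is not None:
        task = json.loads(raw)                      # {"task_id": ..., "audio_path": ...}
        result = transcribe_fn(task["audio_path"])  # runs Whisper on the GPU
        store_fn(task["task_id"], result)           # persist for polling or webhook delivery
        processed += 1
    return processed
```

A production worker would wrap this loop in a supervised process and push failed tasks onto a retry queue rather than dropping them.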
The API layer exposes OpenAI-compatible transcription endpoints, so existing integrations that call the OpenAI Whisper API work with your self-hosted version by changing the base URL. Optional post-processing runs the transcript through an LLM for punctuation restoration, paragraph segmentation, and speaker label assignment.
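Because the route shape matches OpenAI's, switching an existing integration really is just a base-URL swap. A minimal sketch of the request a client would construct (the helper name is ours, not part of any library):

```python
def build_transcription_call(base_url: str,
                             response_format: str = "json") -> tuple[str, dict]:
    """Return the endpoint URL and form fields for a transcription request."""
    url = base_url.rstrip("/") + "/v1/audio/transcriptions"
    data = {"model": "whisper-large-v3", "response_format": response_format}
    return url, data

# Sending it with requests:
#   url, data = build_transcription_call("http://your-server:8000")
#   requests.post(url, data=data, files={"file": open("meeting.wav", "rb")})
```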
GPU Requirements
| Concurrency | Recommended GPU | VRAM | Realtime Factor |
|---|---|---|---|
| 1-2 streams | RTX 5090 | 32 GB | ~20x realtime |
| 3-5 streams | RTX 6000 Pro | 40 GB | ~30x realtime |
| 5+ streams | RTX 6000 Pro 96 GB | 96 GB | ~40x realtime |
Whisper large-v3 uses approximately 10 GB of VRAM. The remaining capacity handles concurrent requests and batch processing. For real-time streaming transcription, Whisper runs in chunked mode with 5-second segments. Check our self-hosted LLM guide for pairing Whisper with an LLM on the same GPU.
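The 30-minute split described in the architecture overview comes down to computing overlapping spans. A small sketch; the 30-second overlap is an illustrative assumption, since the guide does not fix an overlap length:

```python
def chunk_spans(duration_s: float,
                chunk_s: float = 1800.0,      # split files longer than 30 minutes
                overlap_s: float = 30.0) -> list[tuple[float, float]]:
    """Split an audio duration into overlapping (start, end) spans.

    Overlap lets adjacent chunks be transcribed in parallel without
    losing words at the boundaries; duplicates are merged afterwards.
    """
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return spans
```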
Step-by-Step Build
Deploy Whisper and FastAPI on your GPU server. Set up Redis for task queuing and build the transcription endpoints.
```python
from fastapi import FastAPI, UploadFile
from fastapi.concurrency import run_in_threadpool
import os
import tempfile

import torch
import whisper

app = FastAPI()
model = whisper.load_model("large-v3", device="cuda")

@app.post("/v1/audio/transcriptions")
async def transcribe(file: UploadFile, language: str | None = None,
                     response_format: str = "json"):
    # Persist the upload to disk so Whisper's ffmpeg loader can read it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        # model.transcribe is blocking; run it off the event loop
        result = await run_in_threadpool(
            model.transcribe,
            tmp_path,
            language=language,
            word_timestamps=True,
            fp16=torch.cuda.is_available(),
        )
    finally:
        os.unlink(tmp_path)  # don't leak temp files
    if response_format == "verbose_json":
        segments = result["segments"]
        return {
            "text": result["text"],
            "language": result["language"],
            "segments": segments,
            "duration": segments[-1]["end"] if segments else 0.0,
        }
    return {"text": result["text"]}

@app.get("/health")
async def health():
    return {"status": "ok", "model": "whisper-large-v3"}
```
Add authentication middleware and rate limiting for production. The OpenAI-compatible endpoint format means existing client libraries work without modification. See production setup for scaling and monitoring configuration.
Scaling and Monitoring
Monitor GPU utilisation and queue depth to track capacity. When average queue wait time exceeds your SLA, add a second worker process or scale to a larger GPU. For burst workloads, pre-process audio into chunks and distribute across multiple workers on the same GPU.
Track transcription accuracy by sampling outputs and comparing against manual transcripts. Whisper large-v3 achieves word error rates below 5% for clear English audio and below 10% for most supported languages. Log request latency, audio duration, and detected language for capacity planning.
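Accuracy sampling needs a word error rate metric; libraries such as jiwer provide one off the shelf, and a minimal word-level edit-distance version for spot checks looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row Levenshtein DP over words
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,            # deletion
                         row[j - 1] + 1,        # insertion
                         prev + (r != h))       # substitution or match
            prev = cur
    return row[-1] / max(len(ref), 1)
```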
Deploy Your Transcription API
A self-hosted transcription API eliminates per-minute billing while delivering Whisper-quality accuracy under your complete control. Serve internal teams, integrate with your products, or offer it as a service. Launch on GigaGPU dedicated GPU hosting and start transcribing at scale. Browse more API use cases and tutorials in our library.