Why Self-Host Whisper for Transcription
OpenAI’s Whisper is one of the most capable open-weight speech recognition models available, supporting 99 languages with near-human accuracy on many of them. Running Whisper on dedicated GPU hardware gives you unlimited transcription without per-minute API costs, full data privacy for sensitive audio, and the low latency needed for real-time applications. GigaGPU provides pre-configured Whisper hosting, but this guide walks through the full deployment so you understand every component.
Compared with API-based transcription services, self-hosting eliminates per-minute charges and gives you full control over the processing pipeline. For teams transcribing call centre recordings, meeting audio, podcast content, or medical dictation, self-hosted Whisper on a single GPU can process audio faster than real-time, meaning your transcription pipeline can keep up with live audio streams while also clearing backlogs of recorded content.
Whisper Model Selection and GPU Requirements
Whisper comes in multiple sizes. Larger models are more accurate but require more VRAM and process audio more slowly. For real-time applications, the speed-accuracy trade-off is critical.
| Model | Parameters | VRAM | Real-Time Factor (RTX 5090) | Best For |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 0.03x | Quick drafts, low-resource setups |
| base | 74M | ~1 GB | 0.05x | Acceptable quality at near-maximum speed |
| small | 244M | ~2 GB | 0.08x | Good quality-speed balance |
| medium | 769M | ~5 GB | 0.15x | High accuracy, still fast |
| large-v3 | 1.5B | ~10 GB | 0.25x | Maximum accuracy |
A real-time factor (RTF) below 1.0 means the model processes audio faster than real-time. An RTF of 0.25x means it transcribes a 60-second clip in about 15 seconds. For detailed benchmarks across GPU tiers, see our Whisper RTF by GPU comparison. Even the large-v3 model runs well under real-time on an RTX 5090, leaving headroom for concurrent requests.
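That arithmetic generalises; a quick sketch of what an RTF implies for processing time and concurrency headroom (the 0.25x figure is the large-v3 value from the table above):

```python
def processing_time(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock transcription time for a clip at a given real-time factor."""
    return audio_seconds * rtf

# large-v3 at RTF 0.25x: a 60-second clip takes about 15 seconds
print(processing_time(60, 0.25))  # 15.0

# Roughly 1 / RTF concurrent real-time streams fit on one GPU before falling behind
print(round(1 / 0.25))  # 4
```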
Installing Faster Whisper for GPU Inference
Faster Whisper is a CTranslate2-based reimplementation that runs up to 4x faster than the original OpenAI implementation while using less memory. It is the recommended engine for production deployments.
```bash
# Create the environment
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate

# Install Faster Whisper
pip install faster-whisper

# Test basic transcription
python3 -c "
from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device='cuda', compute_type='float16')
segments, info = model.transcribe('test_audio.wav', beam_size=5)
print(f'Detected language: {info.language} ({info.language_probability:.2f})')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
"
```
The first run downloads model weights (~3 GB for large-v3). Subsequent loads are fast thanks to NVMe storage on GigaGPU servers.
Building a Transcription API
Wrap Faster Whisper in a FastAPI service that accepts audio file uploads and returns transcriptions:
```python
# whisper_server.py
from fastapi import FastAPI, UploadFile, File, Query
from faster_whisper import WhisperModel
import tempfile
import os
import time

app = FastAPI(title="Whisper Transcription API")

# Load the model once at startup
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str = Query(None, description="ISO language code"),
    task: str = Query("transcribe", description="transcribe or translate")
):
    # Write the upload to a temporary file for the model to read
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    try:
        start = time.time()
        segments, info = model.transcribe(
            tmp_path,
            beam_size=5,
            language=language,
            task=task,
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500)
        )
        results = []
        full_text = []
        for segment in segments:
            results.append({
                "start": round(segment.start, 2),
                "end": round(segment.end, 2),
                "text": segment.text.strip()
            })
            full_text.append(segment.text.strip())
        elapsed = time.time() - start
        return {
            "text": " ".join(full_text),
            "segments": results,
            "language": info.language,
            "duration": round(info.duration, 2),
            "processing_time": round(elapsed, 2)
        }
    finally:
        os.unlink(tmp_path)
```
```bash
# Run the server
uvicorn whisper_server:app --host 0.0.0.0 --port 8000 --workers 1

# Test with curl (language is a query parameter, not a form field)
curl -X POST "http://localhost:8000/transcribe?language=en" \
  -F "file=@meeting_recording.wav"
```
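The same request from Python, as a minimal sketch using `requests` (the file name and server address are assumptions):

```python
import requests  # pip install requests

def transcribe_file(path: str, server: str = "http://localhost:8000",
                    language: str = "en") -> dict:
    """POST an audio file to the /transcribe endpoint and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{server}/transcribe",
            files={"file": (path, f, "audio/wav")},
            params={"language": language},  # language is a query parameter
        )
    resp.raise_for_status()
    return resp.json()

# result = transcribe_file("meeting_recording.wav")
# print(result["text"], result["processing_time"])
```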
Real-Time Streaming Transcription
For live audio streams, use a WebSocket endpoint that receives audio chunks and returns transcriptions progressively:
```python
# Add to whisper_server.py
from fastapi import WebSocket, WebSocketDisconnect
import numpy as np
import io
import soundfile as sf

@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
    await websocket.accept()
    buffer = np.array([], dtype=np.float32)
    try:
        while True:
            # Receive a WAV-encoded audio chunk (16 kHz, mono, float32)
            data = await websocket.receive_bytes()
            audio_chunk, sr = sf.read(io.BytesIO(data), dtype='float32')
            buffer = np.concatenate([buffer, audio_chunk])

            # Transcribe once the buffer holds 5 seconds of audio
            if len(buffer) >= sr * 5:
                segments, _ = model.transcribe(
                    buffer, beam_size=3, language="en",
                    vad_filter=True
                )
                text = " ".join(s.text.strip() for s in segments)
                await websocket.send_json({"text": text})
                buffer = np.array([], dtype=np.float32)
    except WebSocketDisconnect:
        # Client closed the connection; nothing to clean up
        pass
```
This approach buffers 5 seconds of audio before transcribing, providing a good balance between latency and accuracy. For lower latency, reduce the buffer size and use the small or medium model. If you need to select the most cost-effective GPU for your transcription workload, our cheapest GPU for AI inference guide covers the full range of options.
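Roughly, audio spoken just after a flush waits for the buffer to fill and then for transcription, so worst-case latency is about buffer + buffer × RTF. A quick sketch of that trade-off, using the RTF figures from the table above:

```python
def worst_case_latency(buffer_s: float, rtf: float) -> float:
    """Audio arriving just after a flush waits a full buffer window,
    then the transcription time for that window."""
    return buffer_s + buffer_s * rtf

# Compare models and buffer sizes (RTF values from the RTX 5090 table)
for name, rtf in [("small", 0.08), ("medium", 0.15), ("large-v3", 0.25)]:
    for buf in (2.0, 5.0):
        print(f"{name:9s} buffer={buf}s -> ~{worst_case_latency(buf, rtf):.2f}s")
```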
Production Configuration
Deploy the Whisper server as a systemd service with process management and logging:
```ini
# /etc/systemd/system/whisper.service
[Unit]
Description=Whisper Transcription Server
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/whisper-env/bin/uvicorn whisper_server:app \
    --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable whisper
sudo systemctl start whisper
```
Add Nginx as a reverse proxy for TLS termination, following the same pattern described in our production inference server guide. For GPU servers handling multiple workloads, our PyTorch hosting page covers the shared runtime environment. If you are running Whisper alongside other models on the same GPU, monitor VRAM usage carefully — Whisper large-v3 uses about 10 GB, leaving room for smaller models on a 24 GB card.
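A minimal Nginx site block for that reverse proxy, as a sketch (the server name and certificate paths are placeholders; the upgrade headers are needed so WebSocket connections to /stream survive the proxy):

```nginx
server {
    listen 443 ssl;
    server_name whisper.example.com;          # placeholder

    ssl_certificate     /etc/letsencrypt/live/whisper.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/whisper.example.com/privkey.pem;

    client_max_body_size 200m;                # allow large audio uploads

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        # WebSocket upgrade for the /stream endpoint
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;              # long transcriptions
    }
}
```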
Next Steps and Advanced Use Cases
With Whisper running on dedicated hardware, you can build sophisticated audio processing pipelines. Common extensions include:
- Voice agents: Combine Whisper transcription with an LLM-powered voice agent for conversational AI
- Meeting summarisation: Feed transcripts into a self-hosted LLM for automatic meeting notes
- Multi-model pipelines: Run Whisper alongside a chatbot or other speech models on the same server
- Batch processing: Process recorded audio archives at maximum throughput
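The batch-processing extension can be sketched as a small orchestrator around the model; here the `transcribe` callable is a stand-in for a wrapper around `model.transcribe` from the server above, and the directory layout is an assumption:

```python
from pathlib import Path
from typing import Callable

AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac"}

def transcribe_archive(audio_dir: str, out_dir: str,
                       transcribe: Callable[[str], str]) -> list:
    """Run `transcribe` over every audio file in audio_dir, write one .txt
    transcript per input file, and return the names of the files processed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    processed = []
    for audio in sorted(Path(audio_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue  # skip non-audio files
        text = transcribe(str(audio))
        (out / f"{audio.stem}.txt").write_text(text)
        processed.append(audio.name)
    return processed

# In production, `transcribe` would wrap model.transcribe, e.g.:
# transcribe_archive("/data/recordings", "/data/transcripts", whisper_transcribe)
```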
For benchmarking your deployment, the TTS and speech latency benchmarks provide reference numbers for the full speech processing pipeline. Explore the model guides category for deployment instructions for other models that complement Whisper.
Deploy Whisper on Dedicated GPU Hardware
GigaGPU provides GPU servers optimised for real-time speech processing. Pre-configured with CUDA and fast NVMe storage for instant model loading. Process unlimited audio with zero per-minute costs.
Browse GPU Servers