OpenAI Whisper is the de facto open-weights speech-to-text model and has been since 2022, but the production story has changed dramatically. The reference Python implementation is no longer the right thing to deploy; a CTranslate2 reimplementation called faster-whisper is roughly four times faster, uses half the VRAM thanks to INT8 quantisation, and runs cleanly on any modern NVIDIA GPU. This tutorial walks through a full production deployment of Whisper on a dedicated GPU server: drivers, Python environment, a clean transcription script, an HTTP API in FastAPI, a containerised build with Docker, TLS termination, authentication, rate limiting and a systemd unit. By the end you will have a service capable of transcribing roughly forty times realtime on an RTX 4090, at a unit cost well below the OpenAI Whisper API for any meaningful volume.
Prerequisites
Whisper inference is GPU-bound on the attention and matmul kernels and CPU-bound on audio decoding. The minimum sane configuration is a card with 4 GB of VRAM (enough for large-v3-turbo in INT8), but for production we recommend an RTX 3060 12 GB as the floor and an RTX 3090 or RTX 4090 24 GB for batch and concurrent workloads. The 24 GB cards let you hold large-v3 in FP16 with plenty of room for KV state across parallel transcription jobs. Our best GPU for LLM inference piece covers the wider tradeoffs; for Whisper specifically the RTX 3090 hits the sweet spot of VRAM and cost.
| Component | Required Version | Quick Install |
|---|---|---|
| OS | Ubuntu 22.04 or 24.04 LTS | Default on GigaGPU dedicated nodes |
| NVIDIA driver | 550 or newer | sudo apt install nvidia-driver-550-server |
| CUDA toolkit | 12.1+ (12.4 ideal) | Usually bundled with the driver metapackage |
| Python | 3.10, 3.11 or 3.12 | sudo apt install python3.11 python3.11-venv |
| ffmpeg | 4.4+ | sudo apt install ffmpeg |
| GPU VRAM | 4 GB minimum, 12 GB+ recommended | RTX 3060 / 3090 / 4090 |
Verify the driver and GPU are visible before going any further. If nvidia-smi errors out, fix that first — every subsequent step assumes a working CUDA stack. Our companion guide on how to install PyTorch on a GPU server walks through driver troubleshooting in detail.
nvidia-smi
# Expected: a populated table with driver version 550+ and your card name.
ffmpeg -version | head -1
# Expected: ffmpeg version 4.4 or newer
python3 --version
# Expected: Python 3.10, 3.11 or 3.12
Choose the Right Whisper Variant
“Whisper” today refers to a family of related projects rather than a single binary. The right pick depends on whether you need GPU throughput, CPU portability, or English-only speed. Here is the honest comparison.
| Variant | Engine | Best For | Speed (RTX 4090, large-v3) | Notes |
|---|---|---|---|---|
| openai-whisper | PyTorch reference | Compatibility, research | ~8x realtime | Slowest, most permissive feature set |
| faster-whisper | CTranslate2 | Production GPU serving | ~40x realtime (INT8) | Recommended default |
| whisper.cpp | GGML, C++ | CPU + low-VRAM GPU, edge | ~12x realtime (CUDA) | Smallest deps, single binary |
| distil-whisper | HF Transformers | English-only, latency-critical | ~60x realtime | 6x faster than large-v3, English only |
| WhisperX | faster-whisper + VAD + diarisation | Word-level timestamps, speakers | ~30x realtime | Adds pyannote diarisation |
For 90% of production deployments the answer is faster-whisper with the large-v3-turbo checkpoint. Turbo is a pruned, fine-tuned large-v3 with a four-layer decoder instead of thirty-two; it is roughly eight times faster than full large-v3 with negligible WER loss on most languages. We will use it as the default throughout this guide, and note where you might pick something else.
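Once faster-whisper is installed (next section), comparing checkpoints on your own audio is cheap, because the variant is just a string at load time. A minimal benchmarking sketch; the checkpoint names are the aliases faster-whisper resolves against the Hugging Face hub, and sample.wav is a placeholder clip:
import time
from faster_whisper import WhisperModel

for name in ["large-v3-turbo", "distil-large-v3", "large-v3"]:
    model = WhisperModel(name, device="cuda", compute_type="int8_float16")
    start = time.perf_counter()
    segments, info = model.transcribe("sample.wav", beam_size=5)
    text = " ".join(s.text for s in segments)   # iterating the generator runs the decode
    print(f"{name}: {time.perf_counter() - start:.1f}s, {len(text)} chars")
    del model                                   # release VRAM before loading the next one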
Install faster-whisper and Dependencies
Install everything inside a virtual environment. System-wide pip on Ubuntu 24.04 is a recipe for grief because of PEP 668’s externally-managed-environment guard. Always venv.
sudo apt update
sudo apt install -y python3.11 python3.11-venv python3-pip ffmpeg
python3.11 -m venv ~/venv-whisper
source ~/venv-whisper/bin/activate
pip install --upgrade pip wheel
pip install faster-whisper==1.0.3
# CTranslate2 uses cuBLAS and cuDNN at runtime. On Ubuntu 22.04+ the
# easiest install path is the pip-packaged NVIDIA libraries:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
The nvidia-cudnn-cu12 wheel ships the cuDNN 9.x runtime that current CTranslate2 4.x releases link against. If you skip it, the first transcription call fails with a cryptic libcudnn "cannot open shared object file" error. Make these libraries discoverable to the dynamic loader by adding them to LD_LIBRARY_PATH inside the venv's activate script:
echo 'export LD_LIBRARY_PATH=$(python -c "import os, nvidia.cudnn.lib, nvidia.cublas.lib; print(os.path.dirname(nvidia.cudnn.lib.__file__) + \":\" + os.path.dirname(nvidia.cublas.lib.__file__))"):$LD_LIBRARY_PATH' >> ~/venv-whisper/bin/activate
source ~/venv-whisper/bin/activate
Smoke-test the install. The first run downloads the model weights into ~/.cache/huggingface, which takes a minute or two on a 1 Gbps link. Subsequent loads come straight from the local cache and take under three seconds.
python -c "from faster_whisper import WhisperModel; m = WhisperModel('large-v3-turbo', device='cuda', compute_type='int8_float16'); print('OK')"
# Expected: a download progress bar, then 'OK'.
Build a Transcription Script
The minimum viable transcription script fits in under twenty lines. It takes an audio path on the command line, runs large-v3-turbo in mixed INT8/FP16, prints the detected language, and concatenates segment text. Voice activity detection (vad_filter=True) trims silence before transcription, which both speeds things up and prevents Whisper’s well-known hallucinations on long silent runs.
# transcribe.py
import sys
from faster_whisper import WhisperModel
model = WhisperModel(
model_size_or_path="large-v3-turbo",
device="cuda",
compute_type="int8_float16", # ~50% VRAM, <0.1 WER drop
)
audio_file = sys.argv[1]
segments, info = model.transcribe(
audio_file,
beam_size=5,
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500),
)
print(f"Language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
Run it on any common audio container: wav, mp3, m4a, opus and flac all decode through ffmpeg automatically. On an RTX 4090 a 60-second voice memo finishes in roughly 1.5 seconds end-to-end once the model is loaded.
python transcribe.py meeting.m4a
# Language: en (p=0.99)
# [ 0.00 -> 4.20] All right, let's get started with the planning review.
# [ 4.20 -> 9.85] First item is the inference cluster migration.
Useful knobs to know: beam_size trades a tiny amount of speed for accuracy (5 is a good default); compute_type options include int8, int8_float16, float16 and float32 in increasing VRAM order; language="en" skips language detection and saves around 100ms; word_timestamps=True emits per-word timing if you need karaoke-style captions.
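For example, pinning the language and asking for word-level timing looks like this; a short sketch that reuses the model object from transcribe.py, with interview.mp3 as a placeholder filename:
segments, info = model.transcribe(
    "interview.mp3",
    language="en",          # skip language detection
    beam_size=5,
    vad_filter=True,
    word_timestamps=True,   # populates segment.words
)
for seg in segments:
    for word in seg.words:
        print(f"{word.start:6.2f} {word.end:6.2f} {word.word}")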
Wrap as a FastAPI HTTP API
For anything beyond a one-off script you want an HTTP service: load the model once at process startup, accept multipart audio uploads, and return JSON. FastAPI plus uvicorn is the lowest-friction path. The model is heavy and not thread-safe in the way you probably hope, so run a single uvicorn worker and let the GPU be the natural concurrency limit.
# whisper_api.py
import os, tempfile, uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI, UploadFile, File, HTTPException, Header
from fastapi.responses import JSONResponse
from faster_whisper import WhisperModel
MAX_BYTES = 100 * 1024 * 1024 # 100 MB upload cap
ALLOWED_CT = {"audio/", "video/"} # accept anything ffmpeg will decode
API_KEY = os.environ.get("WHISPER_API_KEY", "")
model: WhisperModel | None = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global model
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")
yield
del model
app = FastAPI(title="Whisper Transcription API", lifespan=lifespan)
def _check_auth(x_api_key: str | None):
if API_KEY and x_api_key != API_KEY:
raise HTTPException(status_code=401, detail="invalid api key")
@app.post("/transcribe")
async def transcribe(
file: UploadFile = File(...),
language: str | None = None,
x_api_key: str | None = Header(default=None),
):
_check_auth(x_api_key)
    if not file.content_type or not any(file.content_type.startswith(p) for p in ALLOWED_CT):
        raise HTTPException(status_code=415, detail=f"unsupported content type {file.content_type}")
suffix = os.path.splitext(file.filename or "audio")[1] or ".bin"
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
try:
size = 0
while chunk := await file.read(1 << 20):
size += len(chunk)
if size > MAX_BYTES:
raise HTTPException(status_code=413, detail="payload too large")
tmp.write(chunk)
tmp.close()
segments, info = model.transcribe(
tmp.name, beam_size=5, vad_filter=True, language=language,
)
out_segments = [
{"start": s.start, "end": s.end, "text": s.text} for s in segments
]
return JSONResponse({
"id": str(uuid.uuid4()),
"language": info.language,
"language_probability": info.language_probability,
"duration": info.duration,
"text": " ".join(s["text"].strip() for s in out_segments),
"segments": out_segments,
})
finally:
os.unlink(tmp.name)
@app.get("/healthz")
def healthz():
return {"status": "ok", "model_loaded": model is not None}
Run it with a single worker: the model lives in GPU memory, and a second worker process would load a second copy, wasting VRAM at best and OOMing smaller cards. Concurrency comes from request queueing in front of the API, not from process forks.
pip install fastapi 'uvicorn[standard]' python-multipart
WHISPER_API_KEY=changeme uvicorn whisper_api:app \
--host 0.0.0.0 --port 8000 --workers 1
Smoke-test from another shell with a real audio clip. The response should arrive in roughly duration / 30 seconds for large-v3-turbo on an RTX 4090.
curl -X POST http://localhost:8000/transcribe \
-H "x-api-key: changeme" \
-F "file=@meeting.m4a" | jq '.text' | head -c 200
Containerise with Docker
For a reproducible deployment, ship the API as a CUDA-enabled container. The base image must include the CUDA runtime that matches your driver. We use nvidia/cuda:12.1.0-runtime-ubuntu22.04 as a small footprint base; switch to the devel tag only if you need to compile extensions in-image.
# Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip ffmpeg ca-certificates \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Expose the pip-installed cuDNN/cuBLAS libraries to CTranslate2 at runtime
# (path assumes Ubuntu 22.04's default Python 3.10 under /usr/local).
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib:/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib:${LD_LIBRARY_PATH}
# Pre-load model weights into the image so cold start is instant.
RUN python3 -c "from faster_whisper import WhisperModel; \
WhisperModel('large-v3-turbo', device='cpu', compute_type='int8')"
COPY whisper_api.py .
EXPOSE 8000
ENTRYPOINT ["uvicorn", "whisper_api:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
The matching requirements.txt is short:
faster-whisper==1.0.3
fastapi==0.115.*
uvicorn[standard]==0.30.*
python-multipart==0.0.9
nvidia-cublas-cu12
nvidia-cudnn-cu12==9.*
Build and run. The host needs nvidia-container-toolkit installed (add NVIDIA's container toolkit apt repository, apt install nvidia-container-toolkit, then sudo systemctl restart docker); without it the --gpus flag will fail with a runtime hook error.
docker build -t whisper-api:1.0 .
docker run --rm -d \
--gpus all \
-p 8000:8000 \
-e WHISPER_API_KEY=changeme \
--name whisper-api \
whisper-api:1.0
For long-running deployments use docker-compose so restart policies and env files live in one place:
# docker-compose.yml
services:
whisper:
image: whisper-api:1.0
restart: always
ports:
- "127.0.0.1:8000:8000"
environment:
WHISPER_API_KEY: ${WHISPER_API_KEY}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Production Hardening
The container above is functional but not yet something you would expose to the public internet. Six things you should add before going live.
1. nginx reverse proxy with TLS
Terminate TLS at nginx and forward to uvicorn over loopback. Let’s Encrypt via certbot is the path of least resistance. Sample server block:
# /etc/nginx/sites-available/whisper.conf
server {
listen 443 ssl http2;
server_name whisper.example.com;
ssl_certificate /etc/letsencrypt/live/whisper.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/whisper.example.com/privkey.pem;
client_max_body_size 100M; # match MAX_BYTES in the app
proxy_read_timeout 600s; # large files take time
location / {
proxy_pass http://127.0.0.1:8000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
sudo apt install -y nginx certbot python3-certbot-nginx
sudo certbot --nginx -d whisper.example.com
2. API key authentication
The FastAPI app already enforces the x-api-key header when WHISPER_API_KEY is set. Generate one with openssl rand -hex 32 and inject through the systemd unit or compose env file. For multiple consumers, swap the single static key for a small SQLite table of keys plus a lookup in the _check_auth helper.
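A minimal sketch of that swap, assuming an api_keys(key TEXT PRIMARY KEY) table in a SQLite file you populate out of band (the path is illustrative):
import sqlite3

KEYS_DB = "/etc/whisper/keys.db"  # illustrative location

def _check_auth(x_api_key: str | None):
    if not x_api_key:
        raise HTTPException(status_code=401, detail="missing api key")
    with sqlite3.connect(KEYS_DB) as conn:
        row = conn.execute(
            "SELECT 1 FROM api_keys WHERE key = ?", (x_api_key,)
        ).fetchone()
    if row is None:
        raise HTTPException(status_code=401, detail="invalid api key")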
3. Rate limiting with slowapi
pip install slowapi
# in whisper_api.py
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# without this handler a throttled request raises a 500 instead of a 429
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
@app.post("/transcribe")
@limiter.limit("30/minute")
async def transcribe(request: Request, ...):  # the endpoint must accept a Request argument
    ...
4. Request queueing for batch jobs
If transcribing files longer than five minutes, return a job id immediately and process in the background. A simple asyncio.Queue with one consumer task fits in fifty lines; for multi-host scale-out use Celery with Redis or RQ. Push results to S3 or a webhook on completion. This pattern mirrors what we recommended in the vLLM production setup guide for long generations.
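A sketch of the in-process version, extending whisper_api.py: an asyncio.Queue, one worker task started from the lifespan handler with asyncio.create_task(worker()), and an in-memory JOBS dict standing in for real result storage. The /jobs endpoints and the /var/lib/whisper path are illustrative, not part of the API above:
import asyncio

JOBS: dict[str, dict] = {}             # job_id -> {"status": ..., "text": ...}
queue: asyncio.Queue = asyncio.Queue()

def _transcribe_sync(path: str) -> str:
    segments, _ = model.transcribe(path, beam_size=5, vad_filter=True)
    return " ".join(s.text.strip() for s in segments)  # iterating drives the decode

async def worker():
    while True:
        job_id, path = await queue.get()
        try:
            # run the blocking transcription off the event loop
            JOBS[job_id] = {"status": "done", "text": await asyncio.to_thread(_transcribe_sync, path)}
        except Exception as exc:
            JOBS[job_id] = {"status": "error", "detail": str(exc)}
        finally:
            queue.task_done()

@app.post("/jobs")
async def submit(file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    dest = f"/var/lib/whisper/{job_id}" + (os.path.splitext(file.filename or "")[1] or ".bin")
    with open(dest, "wb") as out:
        out.write(await file.read())
    JOBS[job_id] = {"status": "queued"}
    await queue.put((job_id, dest))
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
async def job_status(job_id: str):
    return JOBS.get(job_id, {"status": "unknown"})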
5. GPU monitoring
Wire up Prometheus DCGM exporter and a Grafana dashboard so you can see VRAM headroom, temperature, and utilisation in realtime. Our walkthrough on how to monitor GPU usage on a dedicated server covers the full setup including alerting rules.
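Before wiring up the full exporter stack, you can read the same counters from Python with the nvidia-ml-py (pynvml) bindings; a quick sketch, assuming pip install nvidia-ml-py:
from pynvml import (nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo,
                    nvmlDeviceGetUtilizationRates, nvmlDeviceGetTemperature,
                    NVML_TEMPERATURE_GPU)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)           # first GPU
mem = nvmlDeviceGetMemoryInfo(handle)
util = nvmlDeviceGetUtilizationRates(handle)
temp = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
print(f"VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, "
      f"util {util.gpu}%, temp {temp}°C")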
6. systemd service unit
If you prefer running uvicorn directly under systemd (skipping Docker), here is a minimal hardened unit:
# /etc/systemd/system/whisper-api.service
[Unit]
Description=Whisper Transcription API
After=network.target
[Service]
Type=simple
User=whisper
Group=whisper
WorkingDirectory=/opt/whisper
EnvironmentFile=/etc/whisper/env
ExecStart=/opt/whisper/venv/bin/uvicorn whisper_api:app \
--host 127.0.0.1 --port 8000 --workers 1
Restart=on-failure
RestartSec=5
# ProtectHome blocks the whisper user's ~/.cache, so point the Hugging Face
# model cache at the writable state directory instead
Environment=HF_HOME=/var/lib/whisper
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/whisper
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now whisper-api
sudo journalctl -u whisper-api -f
Realtime and Streaming Considerations
Whisper was not designed to be a streaming model. Each forward pass operates on a 30-second mel-spectrogram, so the minimum latency floor is bounded by your buffer length plus inference time. For live captioning, the standard pattern is to chunk audio into rolling 30-second windows with a 5-second overlap, transcribe each window, and deduplicate the overlap region by string match on the suffix of the previous segment.
# pseudo-streaming sketch: rolling 30 s windows with a 5 s overlap.
# Assumes 16 kHz mono 16-bit PCM frames arriving over a WebSocket, plus a
# dedupe_overlap() helper that strips the already-sent prefix from new_text.
import numpy as np

SAMPLE_RATE = 16000
def seconds(n):                                  # PCM bytes in n seconds (16-bit mono)
    return n * SAMPLE_RATE * 2

buffer = bytearray()
last_text = ""
async for chunk in websocket.iter_bytes():       # ~100ms PCM frames
    buffer.extend(chunk)
    if len(buffer) >= seconds(30):
        # faster-whisper accepts a 16 kHz float32 waveform directly
        audio = np.frombuffer(bytes(buffer), dtype=np.int16).astype(np.float32) / 32768.0
        segs, _ = model.transcribe(audio, language="en")
        new_text = " ".join(s.text for s in segs)
        delta = dedupe_overlap(last_text, new_text)
        await websocket.send_text(delta)
        last_text = new_text
        buffer = buffer[seconds(25):]            # keep 5s of overlap
This produces serviceable live captions with around 5–8 seconds of latency. For sub-second latency you need a different runtime: look at whisper-streaming (academic, LocalAgreement-based), RealtimeSTT (production-friendly Python wrapper around faster-whisper with WebRTC VAD), or move to NVIDIA Riva for true streaming ASR. For most batch and asynchronous workloads — call recordings, podcast transcription, voicemail-to-text — the chunked approach above is more than adequate.
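The dedupe_overlap helper the sketch leans on is just the suffix matching described above. One simple way to write it, assuming the overlapping five seconds produce an exact string repeat (in practice you may want a fuzzier match):
def dedupe_overlap(last_text: str, new_text: str) -> str:
    """Return only the part of new_text that was not already emitted."""
    # longest prefix of new_text that is also a suffix of last_text
    for n in range(min(len(last_text), len(new_text)), 0, -1):
        if last_text.endswith(new_text[:n]):
            return new_text[n:]
    return new_text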
Cost and Scaling
The economic case for self-hosting Whisper is much stronger than for general LLM serving, because Whisper inference is short, GPU-saturating, and easy to batch. A single RTX 4090 24GB transcribes large-v3-turbo at roughly 40 times realtime under steady load. That is 144,000 audio-seconds per wall-clock hour, or 2,400 minutes of audio every hour the GPU is busy.
| GPU | Model | RTF (faster-whisper INT8) | Audio-min / hour | £ / audio-min |
|---|---|---|---|---|
| RTX 3060 12GB | large-v3-turbo | ~12x | 720 | ~£0.10 |
| RTX 3090 24GB | large-v3-turbo | ~28x | 1,680 | ~£0.07 |
| RTX 4090 24GB | large-v3-turbo | ~40x | 2,400 | ~£0.06 |
| RTX 4090 24GB | distil-large-v3 (English) | ~70x | 4,200 | ~£0.03 |
Cost figures use the published RTX 4090 monthly hosting cost divided across a fully-utilised month. The cheapest GPU for AI inference piece is also relevant if you are running a smaller workload that fits on a 12 GB card.
OpenAI’s hosted Whisper API charges $0.006 per audio-minute. At a £/USD rate around 0.79 that is roughly £0.005/minute, which undercuts self-hosting at low volume but never gets cheaper as your volume grows. Self-hosting wins at scale because the GPU is a fixed monthly cost regardless of throughput. The GPU vs API breakeven analysis works through the maths in detail, but the rough crossover for Whisper specifically is about 12 hours of daily audio (≈360 hours/month) on a 4090. Beyond that you save money self-hosting; below it the API is cheaper.
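If you want to sanity-check that crossover against your own pricing, the arithmetic is two lines; the monthly GPU cost below is a placeholder, not a quote:
API_PRICE_PER_MIN_GBP = 0.006 * 0.79   # OpenAI Whisper API price converted to GBP
MONTHLY_GPU_COST_GBP = 100.0           # placeholder: substitute your actual hosting bill

breakeven_minutes = MONTHLY_GPU_COST_GBP / API_PRICE_PER_MIN_GBP
print(f"break-even: {breakeven_minutes:,.0f} audio-minutes/month, "
      f"{breakeven_minutes / 60 / 30:.1f} hours of audio per day")
# A £100/month placeholder lands around 21,000 minutes, i.e. roughly 12 hours of audio a day.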
Critically, the hosted API also imposes a 25 MB per-file upload limit and per-account rate caps, which become operational constraints for podcast and call-centre workflows. A self-hosted endpoint has neither, and the data never leaves your infrastructure, which matters for any team subject to GDPR processing constraints. See our cost per 1M tokens GPU vs OpenAI piece for the equivalent comparison on text generation.
Wrap-up
Whisper on a dedicated GPU server is one of the highest-leverage self-hosted AI deployments you can ship in 2026. The model is mature, the runtime story is settled (faster-whisper), the integration surface is small (FastAPI plus an audio decoder), and the economics tip in your favour with very modest volume. Stand up the API, put nginx and an API key in front of it, monitor the GPU, and you have an asset that will outlast several model generations with no code changes — when large-v4 drops you swap a string and restart.
If you are also serving an LLM from the same box, our self-host LLM guide shows how to share the GPU between Whisper and a vLLM server (TLDR: run them as separate processes and cap the vLLM server's memory fraction so Whisper fits in the spare VRAM). For maximum throughput on a single card the RTX 4090 spec breakdown and concurrent users articles cover the headroom maths.
Ready to deploy? Provision a dedicated GPU server and follow this guide end to end. Spin up a GigaGPU dedicated GPU server — RTX 3060, 3090 and 4090 options are available with full root, NVIDIA drivers preinstalled, and London-based UK hosting.