OpenAI Whisper is the de facto open-weights speech-to-text model and has been since 2022, but the production story has changed dramatically. The reference Python implementation is no longer the right thing to deploy; a CTranslate2 reimplementation called faster-whisper is roughly four times faster, uses half the VRAM thanks to INT8 quantisation, and runs cleanly on any modern NVIDIA GPU. This tutorial walks through a full production deployment of Whisper on a dedicated GPU server: drivers, Python environment, a clean transcription script, an HTTP API in FastAPI, a containerised build with Docker, TLS termination, authentication, rate limiting and a systemd unit. By the end you will have a service capable of transcribing roughly forty times realtime on an RTX 4090, at a unit cost well below the OpenAI Whisper API for any meaningful volume.
Prerequisites
Whisper inference is GPU-bound on the attention and matmul kernels and CPU-bound on audio decoding. The minimum sane configuration is a card with 4 GB of VRAM (enough for large-v3-turbo in INT8), but for production we recommend an RTX 3060 12 GB as the floor and an RTX 3090 or RTX 4090 24 GB for batch and concurrent workloads. The 24 GB cards let you hold large-v3 in FP16 with plenty of room for KV state across parallel transcription jobs. Our best GPU for LLM inference piece covers the wider tradeoffs; for Whisper specifically the RTX 3090 hits the sweet spot of VRAM and cost.
| Component | Required Version | Quick Install |
|---|---|---|
| OS | Ubuntu 22.04 or 24.04 LTS | Default on GigaGPU dedicated nodes |
| NVIDIA driver | 550 or newer | sudo apt install nvidia-driver-550-server |
| CUDA toolkit | 12.1+ (12.4 ideal) | Usually bundled with the driver metapackage |
| Python | 3.10, 3.11 or 3.12 | sudo apt install python3.11 python3.11-venv |
| ffmpeg | 4.4+ | sudo apt install ffmpeg |
| GPU VRAM | 4 GB minimum, 12 GB+ recommended | RTX 3060 / 3090 / 4090 |
Verify the driver and GPU are visible before going any further. If nvidia-smi errors out, fix that first — every subsequent step assumes a working CUDA stack. Our companion guide on how to install PyTorch on a GPU server walks through driver troubleshooting in detail.
nvidia-smi
# Expected: a populated table with driver version 550+ and your card name.
ffmpeg -version | head -1
# Expected: ffmpeg version 4.4 or newer
python3 --version
# Expected: Python 3.10, 3.11 or 3.12
Choose the Right Whisper Variant
“Whisper” today refers to a family of related projects rather than a single binary. The right pick depends on whether you need GPU throughput, CPU portability, or English-only speed. Here is the honest comparison.
| Variant | Engine | Best For | Speed (RTX 4090, large-v3) | Notes |
|---|---|---|---|---|
| openai-whisper | PyTorch reference | Compatibility, research | ~8x realtime | Slowest, most permissive feature set |
| faster-whisper | CTranslate2 | Production GPU serving | ~40x realtime (INT8) | Recommended default |
| whisper.cpp | GGML, C++ | CPU + low-VRAM GPU, edge | ~12x realtime (CUDA) | Smallest deps, single binary |
| distil-whisper | HF Transformers | English-only, latency-critical | ~60x realtime | 6x faster than large-v3, English only |
| WhisperX | faster-whisper + VAD + diarisation | Word-level timestamps, speakers | ~30x realtime | Adds pyannote diarisation |
For 90% of production deployments the answer is faster-whisper with the large-v3-turbo checkpoint. Turbo is a pruned, fine-tuned large-v3 with a four-layer decoder instead of thirty-two; it is roughly eight times faster than full large-v3 with negligible WER loss on most languages. We will use it as the default throughout this guide, and note where you might pick something else.
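Once faster-whisper is installed (next section), comparing checkpoints on your own audio is cheap, because the variant is just a string at load time. A minimal benchmarking sketch; the checkpoint names are the aliases faster-whisper resolves against the Hugging Face hub, and sample.wav is a placeholder clip:
import time
from faster_whisper import WhisperModel

for name in ["large-v3-turbo", "distil-large-v3", "large-v3"]:
    model = WhisperModel(name, device="cuda", compute_type="int8_float16")
    start = time.perf_counter()
    segments, info = model.transcribe("sample.wav", beam_size=5)
    text = " ".join(s.text for s in segments)   # iterating the generator runs the decode
    print(f"{name}: {time.perf_counter() - start:.1f}s, {len(text)} chars")
    del model                                   # release VRAM before loading the next one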
Install faster-whisper and Dependencies
Install everything inside a virtual environment. System-wide pip on Ubuntu 24.04 is a recipe for grief because of PEP 668’s externally-managed-environment guard. Always venv.
sudo apt update
sudo apt install -y python3.11 python3.11-venv python3-pip ffmpeg
python3.11 -m venv ~/venv-whisper
source ~/venv-whisper/bin/activate
pip install --upgrade pip wheel
pip install faster-whisper==1.0.3
# CTranslate2 uses cuBLAS and cuDNN at runtime. On Ubuntu 22.04+ the
# easiest install path is the pip-packaged NVIDIA libraries:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
The nvidia-cudnn-cu12 wheel ships the cuDNN 9.x runtime that current CTranslate2 4.x releases link against. If you skip it, the first transcription call fails with a cryptic libcudnn "cannot open shared object file" error. Make these libraries discoverable to the dynamic loader by adding them to LD_LIBRARY_PATH inside the venv's activate script:
echo 'export LD_LIBRARY_PATH=$(python -c "import os, nvidia.cudnn.lib, nvidia.cublas.lib; print(os.path.dirname(nvidia.cudnn.lib.__file__) + \":\" + os.path.dirname(nvidia.cublas.lib.__file__))"):$LD_LIBRARY_PATH' >> ~/venv-whisper/bin/activate
source ~/venv-whisper/bin/activate
Smoke-test the install. The first run downloads the model weights into ~/.cache/huggingface, which takes a minute or two on a 1 Gbps link. Subsequent loads come straight from the local cache and take under three seconds.
python -c "from faster_whisper import WhisperModel; m = WhisperModel('large-v3-turbo', device='cuda', compute_type='int8_float16'); print('OK')"
# Expected: a download progress bar, then 'OK'.
Build a Transcription Script
The minimum viable transcription script fits in under twenty lines. It takes an audio path on the command line, runs large-v3-turbo in mixed INT8/FP16, prints the detected language, and concatenates segment text. Voice activity detection (vad_filter=True) trims silence before transcription, which both speeds things up and prevents Whisper’s well-known hallucinations on long silent runs.
# transcribe.py
import sys
from faster_whisper import WhisperModel
model = WhisperModel(
model_size_or_path="large-v3-turbo",
device="cuda",
compute_type="int8_float16", # ~50% VRAM, <0.1 WER drop
)
audio_file = sys.argv[1]
segments, info = model.transcribe(
audio_file,
beam_size=5,
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500),
)
print(f"Language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
Run it on any common audio container: wav, mp3, m4a, opus and flac all decode through ffmpeg automatically. On an RTX 4090 a 60-second voice memo finishes in roughly 1.5 seconds end-to-end once the model is loaded.
python transcribe.py meeting.m4a
# Language: en (p=0.99)
# [ 0.00 -> 4.20] All right, let's get started with the planning review.
# [ 4.20 -> 9.85] First item is the inference cluster migration.
Useful knobs to know: beam_size trades a tiny amount of speed for accuracy (5 is a good default); compute_type options include int8, int8_float16, float16 and float32 in increasing VRAM order; language="en" skips language detection and saves around 100ms; word_timestamps=True emits per-word timing if you need karaoke-style captions.
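For example, pinning the language and asking for word-level timing looks like this; a short sketch that reuses the model object from transcribe.py, with interview.mp3 as a placeholder filename:
segments, info = model.transcribe(
    "interview.mp3",
    language="en",          # skip language detection
    beam_size=5,
    vad_filter=True,
    word_timestamps=True,   # populates segment.words
)
for seg in segments:
    for word in seg.words:
        print(f"{word.start:6.2f} {word.end:6.2f} {word.word}")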
Wrap as a FastAPI HTTP API
For anything beyond a one-off script you want an HTTP service: load the model once at process startup, accept multipart audio uploads, and return JSON. FastAPI plus uvicorn is the lowest-friction path. The model is heavy and not thread-safe in the way you probably hope, so run a single uvicorn worker and let the GPU be the natural concurrency limit.
# whisper_api.py
import os, tempfile, uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI, UploadFile, File, HTTPException, Header
from fastapi.responses import JSONResponse
from faster_whisper import WhisperModel
MAX_BYTES = 100 * 1024 * 1024 # 100 MB upload cap
ALLOWED_CT = {"audio/", "video/"} # accept anything ffmpeg will decode
API_KEY = os.environ.get("WHISPER_API_KEY", "")
model: WhisperModel | None = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global model
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")
yield
del model
app = FastAPI(title="Whisper Transcription API", lifespan=lifespan)
def _check_auth(x_api_key: str | None):
if API_KEY and x_api_key != API_KEY:
raise HTTPException(status_code=401, detail="invalid api key")
@app.post("/transcribe")
async def transcribe(
file: UploadFile = File(...),
language: str | None = None,
x_api_key: str | None = Header(default=None),
):
_check_auth(x_api_key)
    if not file.content_type or not any(file.content_type.startswith(p) for p in ALLOWED_CT):
        raise HTTPException(status_code=415, detail=f"unsupported content type {file.content_type}")
suffix = os.path.splitext(file.filename or "audio")[1] or ".bin"
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
try:
size = 0
while chunk := await file.read(1 << 20):
size += len(chunk)
if size > MAX_BYTES:
raise HTTPException(status_code=413, detail="payload too large")
tmp.write(chunk)
tmp.close()
segments, info = model.transcribe(
tmp.name, beam_size=5, vad_filter=True, language=language,
)
out_segments = [
{"start": s.start, "end": s.end, "text": s.text} for s in segments
]
return JSONResponse({
"id": str(uuid.uuid4()),
"language": info.language,
"language_probability": info.language_probability,
"duration": info.duration,
"text": " ".join(s["text"].strip() for s in out_segments),
"segments": out_segments,
})
finally:
os.unlink(tmp.name)
@app.get("/healthz")
def healthz():
return {"status": "ok", "model_loaded": model is not None}
Run it with a single worker: the model lives in GPU memory, and a second worker process would load a second copy, wasting VRAM at best and OOMing smaller cards. Concurrency comes from request queueing in front of the API, not from process forks.
pip install fastapi 'uvicorn[standard]' python-multipart
WHISPER_API_KEY=changeme uvicorn whisper_api:app \
--host 0.0.0.0 --port 8000 --workers 1
Smoke-test from another shell with a real audio clip. The response should arrive in roughly duration / 30 seconds for large-v3-turbo on an RTX 4090.
curl -X POST http://localhost:8000/transcribe \
-H "x-api-key: changeme" \
-F "file=@meeting.m4a" | jq '.text' | head -c 200
Containerise with Docker
For a reproducible deployment, ship the API as a CUDA-enabled container. The base image must include the CUDA runtime that matches your driver. We use nvidia/cuda:12.1.0-runtime-ubuntu22.04 as a small footprint base; switch to the devel tag only if you need to compile extensions in-image.
# Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip ffmpeg ca-certificates \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Expose the pip-installed cuDNN/cuBLAS libraries to CTranslate2 at runtime
# (path assumes Ubuntu 22.04's default Python 3.10 under /usr/local).
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib:/usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib:${LD_LIBRARY_PATH}
# Pre-load model weights into the image so cold start is instant.
RUN python3 -c "from faster_whisper import WhisperModel; \
WhisperModel('large-v3-turbo', device='cpu', compute_type='int8')"
COPY whisper_api.py .
EXPOSE 8000
ENTRYPOINT ["uvicorn", "whisper_api:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
The matching requirements.txt is short:
faster-whisper==1.0.3
fastapi==0.115.*
uvicorn[standard]==0.30.*
python-multipart==0.0.9
nvidia-cublas-cu12
nvidia-cudnn-cu12==9.*
Build and run. The host needs nvidia-container-toolkit installed (add NVIDIA's container toolkit apt repository, apt install nvidia-container-toolkit, then sudo systemctl restart docker); without it the --gpus flag will fail with a runtime hook error.
docker build -t whisper-api:1.0 .
docker run --rm -d \
--gpus all \
-p 8000:8000 \
-e WHISPER_API_KEY=changeme \
--name whisper-api \
whisper-api:1.0
For long-running deployments use docker-compose so restart policies and env files live in one place:
# docker-compose.yml
services:
whisper:
image: whisper-api:1.0
restart: always
ports:
- "127.0.0.1:8000:8000"
environment:
WHISPER_API_KEY: ${WHISPER_API_KEY}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Production Hardening
The container above is functional but not yet something you would expose to the public internet. Six things you should add before going live.
1. nginx reverse proxy with TLS
Terminate TLS at nginx and forward to uvicorn over loopback. Let’s Encrypt via certbot is the path of least resistance. Sample server block:
# /etc/nginx/sites-available/whisper.conf
server {
listen 443 ssl http2;
server_name whisper.example.com;
ssl_certificate /etc/letsencrypt/live/whisper.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/whisper.example.com/privkey.pem;
client_max_body_size 100M; # match MAX_BYTES in the app
proxy_read_timeout 600s; # large files take time
location / {
proxy_pass http://127.0.0.1:8000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
sudo apt install -y nginx certbot python3-certbot-nginx
sudo certbot --nginx -d whisper.example.com
2. API key authentication
The FastAPI app already enforces the x-api-key header when WHISPER_API_KEY is set. Generate one with openssl rand -hex 32 and inject through the systemd unit or compose env file. For multiple consumers, swap the single static key for a small SQLite table of keys plus a lookup in the _check_auth helper.
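A minimal sketch of that swap, assuming an api_keys(key TEXT PRIMARY KEY) table in a SQLite file you populate out of band (the path is illustrative):
import sqlite3

KEYS_DB = "/etc/whisper/keys.db"  # illustrative location

def _check_auth(x_api_key: str | None):
    if not x_api_key:
        raise HTTPException(status_code=401, detail="missing api key")
    with sqlite3.connect(KEYS_DB) as conn:
        row = conn.execute(
            "SELECT 1 FROM api_keys WHERE key = ?", (x_api_key,)
        ).fetchone()
    if row is None:
        raise HTTPException(status_code=401, detail="invalid api key")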
3. Rate limiting with slowapi
pip install slowapi
# in whisper_api.py
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# without this handler a throttled request raises a 500 instead of a 429
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
@app.post("/transcribe")
@limiter.limit("30/minute")
async def transcribe(request: Request, ...):  # the endpoint must accept a Request argument
    ...
4. Request queueing for batch jobs
If transcribing files longer than five minutes, return a job id immediately and process in the background. A simple asyncio.Queue with one consumer task fits in fifty lines; for multi-host scale-out use Celery with Redis or RQ. Push results to S3 or a webhook on completion. This pattern mirrors what we recommended in the vLLM production setup guide for long generations.
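A sketch of the in-process version, extending whisper_api.py: an asyncio.Queue, one worker task started from the lifespan handler with asyncio.create_task(worker()), and an in-memory JOBS dict standing in for real result storage. The /jobs endpoints and the /var/lib/whisper path are illustrative, not part of the API above:
import asyncio

JOBS: dict[str, dict] = {}             # job_id -> {"status": ..., "text": ...}
queue: asyncio.Queue = asyncio.Queue()

def _transcribe_sync(path: str) -> str:
    segments, _ = model.transcribe(path, beam_size=5, vad_filter=True)
    return " ".join(s.text.strip() for s in segments)  # iterating drives the decode

async def worker():
    while True:
        job_id, path = await queue.get()
        try:
            # run the blocking transcription off the event loop
            JOBS[job_id] = {"status": "done", "text": await asyncio.to_thread(_transcribe_sync, path)}
        except Exception as exc:
            JOBS[job_id] = {"status": "error", "detail": str(exc)}
        finally:
            queue.task_done()

@app.post("/jobs")
async def submit(file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    dest = f"/var/lib/whisper/{job_id}" + (os.path.splitext(file.filename or "")[1] or ".bin")
    with open(dest, "wb") as out:
        out.write(await file.read())
    JOBS[job_id] = {"status": "queued"}
    await queue.put((job_id, dest))
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
async def job_status(job_id: str):
    return JOBS.get(job_id, {"status": "unknown"})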
5. GPU monitoring
Wire up Prometheus DCGM exporter and a Grafana dashboard so you can see VRAM headroom, temperature, and utilisation in realtime. Our walkthrough on how to monitor GPU usage on a dedicated server covers the full setup including alerting rules.
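Before wiring up the full exporter stack, you can read the same counters from Python with the nvidia-ml-py (pynvml) bindings; a quick sketch, assuming pip install nvidia-ml-py:
from pynvml import (nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo,
                    nvmlDeviceGetUtilizationRates, nvmlDeviceGetTemperature,
                    NVML_TEMPERATURE_GPU)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)           # first GPU
mem = nvmlDeviceGetMemoryInfo(handle)
util = nvmlDeviceGetUtilizationRates(handle)
temp = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
print(f"VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, "
      f"util {util.gpu}%, temp {temp}°C")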
6. systemd service unit
If you prefer running uvicorn directly under systemd (skipping Docker), here is a minimal hardened unit:
# /etc/systemd/system/whisper-api.service
[Unit]
Description=Whisper Transcription API
After=network.target
[Service]
Type=simple
User=whisper
Group=whisper
WorkingDirectory=/opt/whisper
EnvironmentFile=/etc/whisper/env
ExecStart=/opt/whisper/venv/bin/uvicorn whisper_api:app \
--host 127.0.0.1 --port 8000 --workers 1
Restart=on-failure
RestartSec=5
# ProtectHome blocks the whisper user's ~/.cache, so point the Hugging Face
# model cache at the writable state directory instead
Environment=HF_HOME=/var/lib/whisper
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/whisper
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now whisper-api
sudo journalctl -u whisper-api -f
Realtime and Streaming Considerations
Whisper was not designed to be a streaming model. Each forward pass operates on a 30-second mel-spectrogram, so the minimum latency floor is bounded by your buffer length plus inference time. For live captioning, the standard pattern is to chunk audio into rolling 30-second windows with a 5-second overlap, transcribe each window, and deduplicate the overlap region by string match on the suffix of the previous segment.
# pseudo-streaming sketch: rolling 30 s windows with a 5 s overlap.
# Assumes 16 kHz mono 16-bit PCM frames arriving over a WebSocket, plus a
# dedupe_overlap() helper that strips the already-sent prefix from new_text.
import numpy as np

SAMPLE_RATE = 16000
def seconds(n):                                  # PCM bytes in n seconds (16-bit mono)
    return n * SAMPLE_RATE * 2

buffer = bytearray()
last_text = ""
async for chunk in websocket.iter_bytes():       # ~100ms PCM frames
    buffer.extend(chunk)
    if len(buffer) >= seconds(30):
        # faster-whisper accepts a 16 kHz float32 waveform directly
        audio = np.frombuffer(bytes(buffer), dtype=np.int16).astype(np.float32) / 32768.0
        segs, _ = model.transcribe(audio, language="en")
        new_text = " ".join(s.text for s in segs)
        delta = dedupe_overlap(last_text, new_text)
        await websocket.send_text(delta)
        last_text = new_text
        buffer = buffer[seconds(25):]            # keep 5s of overlap
This produces serviceable live captions with around 5–8 seconds of latency. For sub-second latency you need a different runtime: look at whisper-streaming (academic, LocalAgreement-based), RealtimeSTT (production-friendly Python wrapper around faster-whisper with WebRTC VAD), or move to NVIDIA Riva for true streaming ASR. For most batch and asynchronous workloads — call recordings, podcast transcription, voicemail-to-text — the chunked approach above is more than adequate.
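The dedupe_overlap helper the sketch leans on is just the suffix matching described above. One simple way to write it, assuming the overlapping five seconds produce an exact string repeat (in practice you may want a fuzzier match):
def dedupe_overlap(last_text: str, new_text: str) -> str:
    """Return only the part of new_text that was not already emitted."""
    # longest prefix of new_text that is also a suffix of last_text
    for n in range(min(len(last_text), len(new_text)), 0, -1):
        if last_text.endswith(new_text[:n]):
            return new_text[n:]
    return new_text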
Cost and Scaling
The economic case for self-hosting Whisper is much stronger than for general LLM serving, because Whisper inference is short, GPU-saturating, and easy to batch. A single RTX 4090 24GB transcribes large-v3-turbo at roughly 40 times realtime under steady load. That is 144,000 audio-seconds per wall-clock hour, or 2,400 minutes of audio every hour the GPU is busy.
| GPU | Model | RTF (faster-whisper INT8) | Audio-min / hour | £ / audio-min |
|---|---|---|---|---|
| RTX 3060 12GB | large-v3-turbo | ~12x | 720 | ~£0.10 |
| RTX 3090 24GB | large-v3-turbo | ~28x | 1,680 | ~£0.07 |
| RTX 4090 24GB | large-v3-turbo | ~40x | 2,400 | ~£0.06 |
| RTX 4090 24GB | distil-large-v3 (English) | ~70x | 4,200 | ~£0.03 |
Cost figures use the published RTX 4090 monthly hosting cost divided across a fully-utilised month. The cheapest GPU for AI inference piece is also relevant if you are running a smaller workload that fits on a 12 GB card.
OpenAI’s hosted Whisper API charges $0.006 per audio-minute. At a £/USD rate around 0.79 that is roughly £0.005/minute, which undercuts self-hosting at low volume but never gets cheaper as your volume grows. Self-hosting wins at scale because the GPU is a fixed monthly cost regardless of throughput. The GPU vs API breakeven analysis works through the maths in detail, but the rough crossover for Whisper specifically is about 12 hours of daily audio (≈360 hours/month) on a 4090. Beyond that you save money self-hosting; below it the API is cheaper.
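If you want to sanity-check that crossover against your own pricing, the arithmetic is two lines; the monthly GPU cost below is a placeholder, not a quote:
API_PRICE_PER_MIN_GBP = 0.006 * 0.79   # OpenAI Whisper API price converted to GBP
MONTHLY_GPU_COST_GBP = 100.0           # placeholder: substitute your actual hosting bill

breakeven_minutes = MONTHLY_GPU_COST_GBP / API_PRICE_PER_MIN_GBP
print(f"break-even: {breakeven_minutes:,.0f} audio-minutes/month, "
      f"{breakeven_minutes / 60 / 30:.1f} hours of audio per day")
# A £100/month placeholder lands around 21,000 minutes, i.e. roughly 12 hours of audio a day.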
Critically, the hosted API also imposes a 25 MB per-file upload limit and per-account rate caps, which become operational constraints for podcast and call-centre workflows. A self-hosted endpoint has neither, and the data never leaves your infrastructure, which matters for any team subject to GDPR processing constraints. See our cost per 1M tokens GPU vs OpenAI piece for the equivalent comparison on text generation.
Wrap-up
Whisper on a dedicated GPU server is one of the highest-leverage self-hosted AI deployments you can ship in 2026. The model is mature, the runtime story is settled (faster-whisper), the integration surface is small (FastAPI plus an audio decoder), and the economics tip in your favour with very modest volume. Stand up the API, put nginx and an API key in front of it, monitor the GPU, and you have an asset that will outlast several model generations with no code changes — when large-v4 drops you swap a string and restart.
If you are also serving an LLM from the same box, our self-host LLM guide shows how to share the GPU between Whisper and a vLLM server (TLDR: run them as separate processes and cap the vLLM server's memory fraction so Whisper fits in the spare VRAM). For maximum throughput on a single card the RTX 4090 spec breakdown and concurrent users articles cover the headroom maths.
Ready to deploy? Provision a dedicated GPU server and follow this guide end to end. Spin up a GigaGPU dedicated GPU server — RTX 3060, 3090 and 4090 options are available with full root, NVIDIA drivers preinstalled, and London-based UK hosting.