
How to Deploy Whisper for Real-Time Transcription on a GPU Server

Step-by-step guide to deploying OpenAI Whisper on a dedicated GPU server for real-time transcription. Covers model selection, Faster Whisper setup, streaming configuration, and API deployment.

Why Self-Host Whisper for Transcription

OpenAI’s Whisper is the most capable open-weight speech recognition model available, supporting 99 languages with near-human accuracy. Running Whisper on dedicated GPU hardware gives you unlimited transcription without per-minute API costs, full data privacy for sensitive audio, and the low latency needed for real-time applications. GigaGPU provides pre-configured Whisper hosting, but this guide walks through the full deployment so you understand every component.

Compared with API-based transcription services, self-hosting eliminates per-minute charges and gives you full control over the processing pipeline. For teams transcribing call centre recordings, meeting audio, podcast content, or medical dictation, self-hosted Whisper on a single GPU can process audio faster than real-time, meaning your transcription pipeline can keep up with live audio streams while also clearing backlogs of recorded content.

Whisper Model Selection and GPU Requirements

Whisper comes in multiple sizes. Larger models are more accurate but require more VRAM and process audio more slowly. For real-time applications, the speed-accuracy trade-off is critical.

Model      Parameters   VRAM     RTF (RTX 5090)   Best For
tiny       39M          ~1 GB    0.03x            Quick drafts, low-resource setups
base       74M          ~1 GB    0.05x            Acceptable quality, maximum speed
small      244M         ~2 GB    0.08x            Good quality-speed balance
medium     769M         ~5 GB    0.15x            High accuracy, still fast
large-v3   1.5B         ~10 GB   0.25x            Maximum accuracy

A real-time factor (RTF) below 1.0 means the model processes audio faster than real-time. An RTF of 0.25x means it transcribes a 60-second clip in about 15 seconds. For detailed benchmarks across GPU tiers, see our Whisper RTF by GPU comparison. Even the large-v3 model runs well under real-time on an RTX 5090, leaving headroom for concurrent requests.
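The arithmetic generalises: processing time is clip length times RTF, and the number of live streams a single model instance can keep up with is roughly 1/RTF. A quick sketch, using the illustrative figures from the table above:

```python
def processing_time(clip_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to transcribe a clip at a given real-time factor."""
    return clip_seconds * rtf

def max_realtime_streams(rtf: float) -> int:
    """Rough upper bound on concurrent live streams one model instance can sustain."""
    return int(1 / rtf)

# large-v3 at RTF 0.25x: a 60 s clip takes ~15 s, and ~4 live streams fit
print(processing_time(60, 0.25))   # 15.0
print(max_realtime_streams(0.25))  # 4
```

These are back-of-envelope bounds; real concurrency also depends on batching, VRAM headroom, and request overlap.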

Installing Faster Whisper for GPU Inference

Faster Whisper is a CTranslate2-based reimplementation that runs up to 4x faster than the original OpenAI implementation while using less memory. It is the recommended engine for production deployments.

# Create environment
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate

# Install Faster Whisper
pip install faster-whisper

# Test basic transcription
python3 -c "
from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device='cuda', compute_type='float16')
segments, info = model.transcribe('test_audio.wav', beam_size=5)

print(f'Detected language: {info.language} ({info.language_probability:.2f})')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
"

The first run downloads model weights (~3 GB for large-v3). Subsequent loads are fast thanks to NVMe storage on GigaGPU servers.

Building a Transcription API

Wrap Faster Whisper in a FastAPI service that accepts audio file uploads and returns transcriptions:

# whisper_server.py
from fastapi import FastAPI, UploadFile, File, Query
from faster_whisper import WhisperModel
import tempfile
import os
import time

app = FastAPI(title="Whisper Transcription API")

# Load model at startup
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str | None = Query(None, description="ISO language code"),
    task: str = Query("transcribe", description="transcribe or translate")
):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        start = time.time()
        segments, info = model.transcribe(
            tmp_path,
            beam_size=5,
            language=language,
            task=task,
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500)
        )

        results = []
        full_text = []
        for segment in segments:
            results.append({
                "start": round(segment.start, 2),
                "end": round(segment.end, 2),
                "text": segment.text.strip()
            })
            full_text.append(segment.text.strip())

        elapsed = time.time() - start
        return {
            "text": " ".join(full_text),
            "segments": results,
            "language": info.language,
            "duration": round(info.duration, 2),
            "processing_time": round(elapsed, 2)
        }
    finally:
        os.unlink(tmp_path)

# Run the server
uvicorn whisper_server:app --host 0.0.0.0 --port 8000 --workers 1

# Test with curl — language is a query parameter, the audio file a form field
curl -X POST "http://localhost:8000/transcribe?language=en" \
  -F "file=@meeting_recording.wav"

Real-Time Streaming Transcription

For live audio streams, use a WebSocket endpoint that receives audio chunks and returns transcriptions progressively:

# Add to whisper_server.py
from fastapi import WebSocket, WebSocketDisconnect
import numpy as np
import io
import soundfile as sf

@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
    await websocket.accept()
    buffer = np.array([], dtype=np.float32)

    try:
        while True:
            # Each message must be a complete WAV-encoded chunk (16kHz, mono);
            # soundfile needs the container header to decode the bytes
            data = await websocket.receive_bytes()
            audio_chunk, sr = sf.read(io.BytesIO(data), dtype='float32')
            buffer = np.concatenate([buffer, audio_chunk])

            # Transcribe once the buffer holds 5 seconds of audio
            if len(buffer) >= sr * 5:
                segments, _ = model.transcribe(
                    buffer, beam_size=3, language="en",
                    vad_filter=True
                )
                text = " ".join(s.text.strip() for s in segments)
                await websocket.send_json({"text": text})
                buffer = np.array([], dtype=np.float32)
    except WebSocketDisconnect:
        pass

This approach buffers 5 seconds of audio before transcribing, providing a good balance between latency and accuracy. For lower latency, reduce the buffer size and use the small or medium model. Note that model.transcribe is a blocking call: while one buffer is being transcribed, the event loop stalls, so offload it to a thread pool (for example via run_in_executor) if you need to serve several streams concurrently. If you need to select the most cost-effective GPU for your transcription workload, our cheapest GPU for AI inference guide covers the full range of options.
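The worst-case delay a user sees is roughly the buffer length plus the time to transcribe it (buffer times RTF). A quick sketch of that trade-off, using the illustrative RTF figures from the table above:

```python
def worst_case_latency(buffer_seconds: float, rtf: float) -> float:
    """Approximate worst-case delay: audio waits for the buffer to fill,
    then takes buffer_seconds * rtf to transcribe."""
    return buffer_seconds + buffer_seconds * rtf

# 5 s buffer with large-v3 (RTF ~0.25x)
print(worst_case_latency(5, 0.25))  # 6.25
# 2 s buffer with the small model (RTF ~0.08x) cuts latency to ~2.2 s
print(worst_case_latency(2, 0.08))
```

This ignores network and decoding overhead, but it shows why shrinking the buffer matters more than switching models: the buffer dominates the total.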

Production Configuration

Deploy the Whisper server as a systemd service with process management and logging:

# /etc/systemd/system/whisper.service
[Unit]
Description=Whisper Transcription Server
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/whisper-env/bin/uvicorn whisper_server:app \
  --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable whisper
sudo systemctl start whisper

Add Nginx as a reverse proxy for TLS termination, following the same pattern described in our production inference server guide. For GPU servers handling multiple workloads, our PyTorch hosting page covers the shared runtime environment. If you are running Whisper alongside other models on the same GPU, monitor VRAM usage carefully — Whisper large-v3 uses about 10 GB, leaving room for smaller models on a 24 GB card.
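As a sketch of that Nginx setup (the hostname and certificate paths below are placeholders): the /stream WebSocket endpoint needs explicit Upgrade headers, or the proxy will silently break it.

```nginx
server {
    listen 443 ssl;
    server_name transcribe.example.com;   # placeholder hostname
    ssl_certificate     /etc/letsencrypt/live/transcribe.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/transcribe.example.com/privkey.pem;

    # Plain HTTP endpoints (/transcribe)
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        client_max_body_size 200m;   # allow large audio uploads
    }

    # WebSocket endpoint needs the Upgrade/Connection headers
    location /stream {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;    # keep long-lived streams open
    }
}
```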

Next Steps and Advanced Use Cases

With Whisper running on dedicated hardware, you can build sophisticated audio processing pipelines. Common extensions include:

  • Voice agents: Combine Whisper transcription with an LLM-powered voice agent for conversational AI
  • Meeting summarisation: Feed transcripts into a self-hosted LLM for automatic meeting notes
  • Multi-model pipelines: Run Whisper alongside a chatbot or other speech models on the same server
  • Batch processing: Process recorded audio archives at maximum throughput

For benchmarking your deployment, the TTS and speech latency benchmarks provide reference numbers for the full speech processing pipeline. Explore the model guides category for deployment instructions for other models that complement Whisper.

Deploy Whisper on Dedicated GPU Hardware

GigaGPU provides GPU servers optimised for real-time speech processing. Pre-configured with CUDA and fast NVMe storage for instant model loading. Process unlimited audio with zero per-minute costs.

Browse GPU Servers
