Why Self-Host Whisper for Transcription
OpenAI’s Whisper is one of the most capable open-weight speech recognition models available, supporting 99 languages with near-human accuracy on many of them. Running Whisper on dedicated GPU hardware gives you unlimited transcription without per-minute API costs, full data privacy for sensitive audio, and the low latency needed for real-time applications. GigaGPU provides pre-configured Whisper hosting, but this guide walks through the full deployment so you understand every component.
Compared with API-based transcription services, self-hosting eliminates per-minute charges and gives you full control over the processing pipeline. For teams transcribing call centre recordings, meeting audio, podcast content, or medical dictation, self-hosted Whisper on a single GPU can process audio faster than real-time, meaning your transcription pipeline can keep up with live audio streams while also clearing backlogs of recorded content.
Whisper Model Selection and GPU Requirements
Whisper comes in multiple sizes. Larger models are more accurate but require more VRAM and process audio more slowly. For real-time applications, the speed-accuracy trade-off is critical.
| Model | Parameters | VRAM | Real-Time Factor (RTX 5090) | Best For |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 0.03x | Quick drafts, low-resource setups |
| base | 74M | ~1 GB | 0.05x | Acceptable quality at near-maximum speed |
| small | 244M | ~2 GB | 0.08x | Good quality-speed balance |
| medium | 769M | ~5 GB | 0.15x | High accuracy, still fast |
| large-v3 | 1.5B | ~10 GB | 0.25x | Maximum accuracy |
A real-time factor (RTF) below 1.0 means the model processes audio faster than real-time. An RTF of 0.25x means it transcribes a 60-second clip in about 15 seconds. For detailed benchmarks across GPU tiers, see our Whisper RTF by GPU comparison. Even the large-v3 model runs well under real-time on an RTX 5090, leaving headroom for concurrent requests.
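That arithmetic generalises; a quick sketch of what an RTF implies for processing time and concurrency headroom (the 0.25x figure is the large-v3 value from the table above):

```python
def processing_time(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock transcription time for a clip at a given real-time factor."""
    return audio_seconds * rtf

# large-v3 at RTF 0.25x: a 60-second clip takes about 15 seconds
print(processing_time(60, 0.25))  # 15.0

# Roughly 1 / RTF concurrent real-time streams fit on one GPU before falling behind
print(round(1 / 0.25))  # 4
```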
Installing Faster Whisper for GPU Inference
Faster Whisper is a CTranslate2-based reimplementation that runs up to 4x faster than the original OpenAI implementation while using less memory. It is the recommended engine for production deployments.
```bash
# Create the environment
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate

# Install Faster Whisper
pip install faster-whisper

# Test basic transcription
python3 -c "
from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device='cuda', compute_type='float16')
segments, info = model.transcribe('test_audio.wav', beam_size=5)
print(f'Detected language: {info.language} ({info.language_probability:.2f})')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
"
```
The first run downloads model weights (~3 GB for large-v3). Subsequent loads are fast thanks to NVMe storage on GigaGPU servers.
Building a Transcription API
Wrap Faster Whisper in a FastAPI service that accepts audio file uploads and returns transcriptions:
```python
# whisper_server.py
from fastapi import FastAPI, UploadFile, File, Query
from faster_whisper import WhisperModel
import tempfile
import os
import time

app = FastAPI(title="Whisper Transcription API")

# Load the model once at startup
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str = Query(None, description="ISO language code"),
    task: str = Query("transcribe", description="transcribe or translate")
):
    # Write the upload to a temporary file for the model to read
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    try:
        start = time.time()
        segments, info = model.transcribe(
            tmp_path,
            beam_size=5,
            language=language,
            task=task,
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500)
        )
        results = []
        full_text = []
        for segment in segments:
            results.append({
                "start": round(segment.start, 2),
                "end": round(segment.end, 2),
                "text": segment.text.strip()
            })
            full_text.append(segment.text.strip())
        elapsed = time.time() - start
        return {
            "text": " ".join(full_text),
            "segments": results,
            "language": info.language,
            "duration": round(info.duration, 2),
            "processing_time": round(elapsed, 2)
        }
    finally:
        os.unlink(tmp_path)
```
```bash
# Run the server
uvicorn whisper_server:app --host 0.0.0.0 --port 8000 --workers 1

# Test with curl (language is a query parameter, not a form field)
curl -X POST "http://localhost:8000/transcribe?language=en" \
  -F "file=@meeting_recording.wav"
```
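The same request from Python, as a minimal sketch using `requests` (the file name and server address are assumptions):

```python
import requests  # pip install requests

def transcribe_file(path: str, server: str = "http://localhost:8000",
                    language: str = "en") -> dict:
    """POST an audio file to the /transcribe endpoint and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{server}/transcribe",
            files={"file": (path, f, "audio/wav")},
            params={"language": language},  # language is a query parameter
        )
    resp.raise_for_status()
    return resp.json()

# result = transcribe_file("meeting_recording.wav")
# print(result["text"], result["processing_time"])
```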
Real-Time Streaming Transcription
For live audio streams, use a WebSocket endpoint that receives audio chunks and returns transcriptions progressively:
```python
# Add to whisper_server.py
from fastapi import WebSocket, WebSocketDisconnect
import numpy as np
import io
import soundfile as sf

@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
    await websocket.accept()
    buffer = np.array([], dtype=np.float32)
    try:
        while True:
            # Receive a WAV-encoded audio chunk (16 kHz, mono, float32)
            data = await websocket.receive_bytes()
            audio_chunk, sr = sf.read(io.BytesIO(data), dtype='float32')
            buffer = np.concatenate([buffer, audio_chunk])

            # Transcribe once the buffer holds 5 seconds of audio
            if len(buffer) >= sr * 5:
                segments, _ = model.transcribe(
                    buffer, beam_size=3, language="en",
                    vad_filter=True
                )
                text = " ".join(s.text.strip() for s in segments)
                await websocket.send_json({"text": text})
                buffer = np.array([], dtype=np.float32)
    except WebSocketDisconnect:
        # Client closed the connection; nothing to clean up
        pass
```
This approach buffers 5 seconds of audio before transcribing, providing a good balance between latency and accuracy. For lower latency, reduce the buffer size and use the small or medium model. If you need to select the most cost-effective GPU for your transcription workload, our cheapest GPU for AI inference guide covers the full range of options.
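Roughly, audio spoken just after a flush waits for the buffer to fill and then for transcription, so worst-case latency is about buffer + buffer × RTF. A quick sketch of that trade-off, using the RTF figures from the table above:

```python
def worst_case_latency(buffer_s: float, rtf: float) -> float:
    """Audio arriving just after a flush waits a full buffer window,
    then the transcription time for that window."""
    return buffer_s + buffer_s * rtf

# Compare models and buffer sizes (RTF values from the RTX 5090 table)
for name, rtf in [("small", 0.08), ("medium", 0.15), ("large-v3", 0.25)]:
    for buf in (2.0, 5.0):
        print(f"{name:9s} buffer={buf}s -> ~{worst_case_latency(buf, rtf):.2f}s")
```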
Production Configuration
Deploy the Whisper server as a systemd service with process management and logging:
```ini
# /etc/systemd/system/whisper.service
[Unit]
Description=Whisper Transcription Server
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/whisper-env/bin/uvicorn whisper_server:app \
    --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable whisper
sudo systemctl start whisper
```
Add Nginx as a reverse proxy for TLS termination, following the same pattern described in our production inference server guide. For GPU servers handling multiple workloads, our PyTorch hosting page covers the shared runtime environment. If you are running Whisper alongside other models on the same GPU, monitor VRAM usage carefully — Whisper large-v3 uses about 10 GB, leaving room for smaller models on a 24 GB card.
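A minimal Nginx site block for that reverse proxy, as a sketch (the server name and certificate paths are placeholders; the upgrade headers are needed so WebSocket connections to /stream survive the proxy):

```nginx
server {
    listen 443 ssl;
    server_name whisper.example.com;          # placeholder

    ssl_certificate     /etc/letsencrypt/live/whisper.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/whisper.example.com/privkey.pem;

    client_max_body_size 200m;                # allow large audio uploads

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        # WebSocket upgrade for the /stream endpoint
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;              # long transcriptions
    }
}
```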
Next Steps and Advanced Use Cases
With Whisper running on dedicated hardware, you can build sophisticated audio processing pipelines. Common extensions include:
- Voice agents: Combine Whisper transcription with an LLM-powered voice agent for conversational AI
- Meeting summarisation: Feed transcripts into a self-hosted LLM for automatic meeting notes
- Multi-model pipelines: Run Whisper alongside a chatbot or other speech models on the same server
- Batch processing: Process recorded audio archives at maximum throughput
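The batch-processing extension can be sketched as a small orchestrator around the model; here the `transcribe` callable is a stand-in for a wrapper around `model.transcribe` from the server above, and the directory layout is an assumption:

```python
from pathlib import Path
from typing import Callable

AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac"}

def transcribe_archive(audio_dir: str, out_dir: str,
                       transcribe: Callable[[str], str]) -> list:
    """Run `transcribe` over every audio file in audio_dir, write one .txt
    transcript per input file, and return the names of the files processed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    processed = []
    for audio in sorted(Path(audio_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue  # skip non-audio files
        text = transcribe(str(audio))
        (out / f"{audio.stem}.txt").write_text(text)
        processed.append(audio.name)
    return processed

# In production, `transcribe` would wrap model.transcribe, e.g.:
# transcribe_archive("/data/recordings", "/data/transcripts", whisper_transcribe)
```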
For benchmarking your deployment, the TTS and speech latency benchmarks provide reference numbers for the full speech processing pipeline. Explore the model guides category for deployment instructions for other models that complement Whisper.
Deploy Whisper on Dedicated GPU Hardware
GigaGPU provides GPU servers optimised for real-time speech processing. Pre-configured with CUDA and fast NVMe storage for instant model loading. Process unlimited audio with zero per-minute costs.
Browse GPU Servers