
Whisper Slow on GPU: Speed Optimization

Fix slow Whisper transcription on GPU servers. Covers FP16 inference, batched decoding, faster-whisper CTranslate2, model compilation, chunked processing, and GPU utilisation improvements.

Whisper Is Running Slower Than Expected on Your GPU

You have a capable GPU, but Whisper transcribes a 60-minute audio file in 15 minutes or more. GPU utilisation in nvidia-smi sits well below 50 percent during transcription, and throughput hovers around 4x real-time when you expected much better. On a properly optimised dedicated GPU server, Whisper large-v3 should achieve 5-10x real-time speed. Here is how to get there.
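These multipliers are just the ratio of audio length to processing time; a tiny helper (plain Python, hypothetical function name) makes the relationship explicit:

```python
def realtime_speed(audio_minutes: float, transcribe_minutes: float) -> float:
    """Speed relative to real time: values above 1 mean faster than playback."""
    return audio_minutes / transcribe_minutes

# 60 minutes of audio transcribed in 15 minutes:
print(f"{realtime_speed(60, 15):.0f}x real-time")  # 4x real-time
```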

Enable FP16 Inference

The stock OpenAI Whisper implementation falls back to FP32 whenever the model ends up on CPU or fp16 is disabled. Half precision roughly doubles throughput, so confirm it is actually in effect:

import torch
import whisper

# load_model always loads FP32 weights on the chosen device
model = whisper.load_model("large-v3", device="cuda")

# fp16=True runs decoding in half precision (the default when the model
# is on GPU); fp16=False forces slow FP32 inference
result = model.transcribe("audio.wav", fp16=True)

# Verify the GPU is actually being used
print(f"Model device: {next(model.parameters()).device}")
print(f"CUDA available: {torch.cuda.is_available()}")

If the model device prints cpu, Whisper is not using your GPU at all — the most common cause of unexpectedly slow transcription.

Switch to Faster-Whisper (CTranslate2)

The faster-whisper library uses CTranslate2, a C++ inference engine that runs 4-6x faster than the original Python implementation:

pip install faster-whisper

from faster_whisper import WhisperModel

# INT8 quantisation: fastest option, minimal quality loss
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# FP16: best speed/quality balance
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

# Benchmark comparison (60-min audio, RTX 5090):
# OpenAI Whisper FP32:    ~15 minutes
# OpenAI Whisper FP16:    ~8 minutes
# faster-whisper FP16:    ~2.5 minutes
# faster-whisper INT8:    ~1.8 minutes

For production deployments, faster-whisper should be the default choice. It produces near-identical output to the original with dramatically lower latency.
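Expressed as speedups over the FP32 baseline, the benchmark timings in the comment block above work out as plain arithmetic:

```python
baseline_min = 15.0  # OpenAI Whisper FP32 on the 60-minute test file
timings_min = {
    "OpenAI Whisper FP16": 8.0,
    "faster-whisper FP16": 2.5,
    "faster-whisper INT8": 1.8,
}

# Speedup = baseline time / optimised time
speedups = {name: baseline_min / t for name, t in timings_min.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x faster than FP32")
```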

Batched Audio Processing

Process audio in batches to maximise GPU utilisation, both within a single file and across a queue of files:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Batched inference processes multiple segments simultaneously
segments, info = batched.transcribe("long_audio.wav", batch_size=16)

# For multiple files, reuse one model instance and process sequentially;
# the batched pipeline already keeps the GPU busy within each file
import os

audio_files = [f for f in os.listdir("audio/") if f.endswith(".wav")]

def transcribe_file(filepath):
    segments, _ = batched.transcribe(filepath, batch_size=16)
    return " ".join(s.text for s in segments)

for audio_file in audio_files:
    result = transcribe_file(os.path.join("audio", audio_file))
    print(f"{audio_file}: {result[:100]}...")
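If per-file CPU work (audio decoding, resampling) leaves the GPU idle between files, a small thread pool can overlap it with inference. A sketch with a stand-in `fake_transcribe` function; swap in the real `model.transcribe` call:

```python
import concurrent.futures

def fake_transcribe(path: str) -> str:
    # Stand-in for the real model.transcribe(path) call
    return f"transcript of {path}"

files = ["audio/a.wav", "audio/b.wav", "audio/c.wav"]

# Two workers are usually enough: the GPU serialises the heavy work,
# so extra threads only help hide CPU-side decoding
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(zip(files, pool.map(fake_transcribe, files)))

print(results["audio/a.wav"])  # transcript of audio/a.wav
```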

Chunked Processing for Long Audio

Whisper processes audio in 30-second windows internally. For very long files, explicit chunking keeps memory use bounded, and a small overlap between chunks mitigates accuracy loss at chunk boundaries:

from pydub import AudioSegment
import tempfile, os

def chunk_and_transcribe(audio_path, model, chunk_ms=300000, overlap_ms=5000):
    audio = AudioSegment.from_file(audio_path)
    results = []
    start = 0
    while start < len(audio):
        end = min(start + chunk_ms, len(audio))
        chunk = audio[start:end]
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            chunk.export(f.name, format="wav")
            segments, _ = model.transcribe(f.name)
            text = " ".join([s.text for s in segments])
            results.append(text)
            os.unlink(f.name)
        start += chunk_ms - overlap_ms  # Overlap avoids cut words; overlapped text can repeat and may need de-duplication
    return " ".join(results)
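The boundary arithmetic in the loop above is worth isolating so it can be tested without audio files; a sketch using the same 5-minute chunks and 5-second overlap:

```python
def chunk_bounds(total_ms: int, chunk_ms: int = 300_000, overlap_ms: int = 5_000):
    """Yield (start, end) windows; consecutive windows share overlap_ms."""
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        yield start, end
        start += chunk_ms - overlap_ms

# A 10-minute (600,000 ms) file:
print(list(chunk_bounds(600_000)))
# [(0, 300000), (295000, 595000), (590000, 600000)]
```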

Verifying GPU Utilisation Is Maximised

After applying optimisations, confirm the GPU is fully engaged:

# Watch GPU utilisation during transcription
watch -n 0.5 nvidia-smi

# Target metrics:
# GPU-Util: 85-100% during active transcription
# Memory:   model size + working memory (large-v3 ~10 GB)
# Power:    near TDP indicates full compute usage

# If GPU-Util stays low, check:
# 1. Audio preprocessing is not the bottleneck (decode on CPU first)
# 2. Disk I/O is not starving the pipeline (use NVMe, not HDD)
# 3. CPU is not a bottleneck (check with htop)
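For scripted monitoring rather than watching the dashboard, nvidia-smi's CSV query mode is easier to parse. A minimal sketch: the --query-gpu flags are standard nvidia-smi options, and the parser is demonstrated against sample output:

```python
import subprocess

def parse_utilisation(csv_text: str) -> list[int]:
    """Parse 'csv,noheader,nounits' output: one integer per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def gpu_utilisation() -> list[int]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilisation(out)

# Parsing sample two-GPU output:
print(parse_utilisation("97\n12\n"))  # [97, 12]
```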

On a properly configured GPU server, faster-whisper with INT8 on an RTX 5090 achieves roughly 30x real-time speed: a 60-minute recording transcribes in under two minutes. Check the benchmarks section for GPU-specific numbers. The tutorials cover pipeline deployment, our PyTorch guide covers environment setup, and the PyTorch hosting page lists compatible hardware.

Fast Whisper Transcription

GigaGPU dedicated servers deliver 30x real-time Whisper transcription. Deploy faster-whisper on high-bandwidth GPU hardware.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
