Whisper Is Running Slower Than Expected on Your GPU
You have a capable GPU, but Whisper transcribes a 60-minute audio file in 15 minutes or more. The GPU utilisation in nvidia-smi sits well below 50 percent during transcription, and throughput hovers around 4x real-time at best, when a properly optimised dedicated GPU server running Whisper large-v3 should achieve 5-10x real-time or better. Here is how to get there.
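Before optimising, it helps to measure throughput the same way throughout: audio seconds transcribed per wall-clock second. A minimal sketch (the helper name and the idea of passing your transcription call as transcribe_fn are illustrative, not part of any Whisper API):

```python
import time
import wave

def transcription_speed(wav_path, transcribe_fn):
    """Speed as a multiple of real time: audio seconds per wall-clock second.
    4.0 means a 60-minute file finishes in 15 minutes."""
    with wave.open(wav_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()
    start = time.perf_counter()
    transcribe_fn(wav_path)
    return audio_seconds / (time.perf_counter() - start)
```

Run it once before and once after each change below; the ratio tells you exactly what each optimisation bought you.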
Enable FP16 Inference
The reference OpenAI Whisper implementation loads model weights in FP32. The transcribe call defaults to fp16=True, but falls back to FP32 (with a warning) whenever the model sits on the CPU, so make both the device and the precision explicit. Half precision on the GPU nearly doubles throughput over a full-FP32 run:
import whisper
import torch

# Without an explicit device, the model lands on the CPU whenever CUDA
# is unavailable or misdetected, and transcription crawls
model = whisper.load_model("large-v3")

# Explicit CUDA device plus FP16 inference
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe("audio.wav", fp16=True)

# Verify the GPU is actually being used
print(f"Model device: {next(model.parameters()).device}")
print(f"CUDA available: {torch.cuda.is_available()}")
If the model device prints cpu, Whisper is not using your GPU at all — the most common cause of unexpectedly slow transcription.
Switch to Faster-Whisper (CTranslate2)
The faster-whisper library uses CTranslate2, a C++ inference engine that runs 4-6x faster than the original Python implementation:
pip install faster-whisper
from faster_whisper import WhisperModel
# INT8 quantisation: fastest option, minimal quality loss
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# FP16: best speed/quality balance
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
# Benchmark comparison (60-min audio, RTX 5090):
# OpenAI Whisper FP32: ~15 minutes
# OpenAI Whisper FP16: ~8 minutes
# faster-whisper FP16: ~2.5 minutes
# faster-whisper INT8: ~1.8 minutes
For production deployments, faster-whisper should be the default choice. It produces near-identical output to the original implementation at dramatically lower latency.
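To reproduce numbers like the ones above on your own hardware, a small timing harness is enough; benchmark is a hypothetical helper, and whatever callable you pass it stands in for the model variant under test:

```python
import time

def benchmark(label, transcribe_fn, *args):
    """Time one transcription call and report wall-clock seconds."""
    start = time.perf_counter()
    transcribe_fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed
```

One caveat specific to faster-whisper: transcribe returns a lazy generator, so wrap the call in something that consumes it, e.g. `benchmark("INT8", lambda p: list(model.transcribe(p)[0]), "audio.wav")`, or you will time almost nothing.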
Batched Audio Processing
Keeping the GPU saturated takes two forms: batched inference runs multiple 30-second segments of one file through the model simultaneously, and a worker pool keeps a queue of files flowing:
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)
# Batched inference processes multiple segments simultaneously
segments, info = batched.transcribe("long_audio.wav", batch_size=16)
# For multiple files, a thread pool keeps the model fed
# (pass num_workers > 1 to WhisperModel to allow parallel transcriptions)
import concurrent.futures
import os

audio_files = [f for f in os.listdir("audio/") if f.endswith(".wav")]

def transcribe_file(filepath):
    segments, _ = model.transcribe(filepath)
    return " ".join(s.text for s in segments)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    paths = [f"audio/{name}" for name in audio_files]
    for name, text in zip(audio_files, pool.map(transcribe_file, paths)):
        print(f"{name}: {text[:100]}...")
Chunked Processing for Long Audio
Whisper processes audio in 30-second windows internally. For very long files, explicit chunking with overlap prevents accuracy degradation at segment boundaries:
from pydub import AudioSegment
import tempfile, os

def chunk_and_transcribe(audio_path, model, chunk_ms=300000, overlap_ms=5000):
    audio = AudioSegment.from_file(audio_path)
    results = []
    start = 0
    while start < len(audio):
        end = min(start + chunk_ms, len(audio))
        chunk = audio[start:end]
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            chunk.export(f.name, format="wav")
            segments, _ = model.transcribe(f.name)
            text = " ".join(s.text for s in segments)
            results.append(text)
        os.unlink(f.name)
        # Overlap avoids cutting words at chunk boundaries; note the naive
        # join below repeats the overlapping seconds, so deduplicate if needed
        start += chunk_ms - overlap_ms
    return " ".join(results)
Verifying GPU Utilisation Is Maximised
After applying optimisations, confirm the GPU is fully engaged:
# Watch GPU utilisation during transcription
watch -n 0.5 nvidia-smi
# Target metrics:
# GPU-Util: 85-100% during active transcription
# Memory: model size + working memory (large-v3 ~10 GB)
# Power: near TDP indicates full compute usage
# If GPU-Util stays low, check:
# 1. Audio preprocessing is not the bottleneck (decode on CPU first)
# 2. Disk I/O is not starving the pipeline (use NVMe, not HDD)
# 3. CPU is not a bottleneck (check with htop)
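The same check can be scripted rather than eyeballed in watch. This sketch assumes nvidia-smi is on PATH; the helper names are ours:

```python
import subprocess

def parse_utilisation(csv_output):
    """Parse the noheader/nounits CSV output of nvidia-smi into per-GPU percentages."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

def gpu_utilisation():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilisation(out)
```

Poll this from a background thread during a long transcription run and log a warning if utilisation stays below 50 percent; that catches regressions long before anyone notices slow jobs.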
On a properly configured GPU server, faster-whisper with INT8 on an RTX 5090 achieves 30x real-time factor — a 60-minute recording transcribes in under two minutes. Check the benchmarks section for GPU-specific numbers. The tutorials cover pipeline deployment, our PyTorch guide covers environment setup, and the PyTorch hosting page lists compatible hardware.