Whisper Is Running Slower Than Expected on Your GPU
You have a capable GPU, but Whisper transcribes a 60-minute audio file in 15 minutes or more. The GPU utilisation in nvidia-smi sits well below 50 percent during transcription, and throughput hovers around 4x real-time at best, when a properly optimised dedicated GPU server running Whisper large-v3 should achieve 5-10x real-time or better. Here is how to get there.
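Before optimising, it helps to measure throughput the same way throughout: audio seconds transcribed per wall-clock second. A minimal sketch (the helper name and the idea of passing your transcription call as transcribe_fn are illustrative, not part of any Whisper API):

```python
import time
import wave

def transcription_speed(wav_path, transcribe_fn):
    """Speed as a multiple of real time: audio seconds per wall-clock second.
    4.0 means a 60-minute file finishes in 15 minutes."""
    with wave.open(wav_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()
    start = time.perf_counter()
    transcribe_fn(wav_path)
    return audio_seconds / (time.perf_counter() - start)
```

Run it once before and once after each change below; the ratio tells you exactly what each optimisation bought you.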
Enable FP16 Inference
The reference OpenAI Whisper implementation loads model weights in FP32. The transcribe call defaults to fp16=True, but falls back to FP32 (with a warning) whenever the model sits on the CPU, so make both the device and the precision explicit. Half precision on the GPU nearly doubles throughput over a full-FP32 run:
import whisper
import torch

# Without an explicit device, the model lands on the CPU whenever CUDA
# is unavailable or misdetected, and transcription crawls
model = whisper.load_model("large-v3")

# Explicit CUDA device plus FP16 inference
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe("audio.wav", fp16=True)

# Verify the GPU is actually being used
print(f"Model device: {next(model.parameters()).device}")
print(f"CUDA available: {torch.cuda.is_available()}")
If the model device prints cpu, Whisper is not using your GPU at all — the most common cause of unexpectedly slow transcription.
Switch to Faster-Whisper (CTranslate2)
The faster-whisper library uses CTranslate2, a C++ inference engine that runs 4-6x faster than the original Python implementation:
pip install faster-whisper
from faster_whisper import WhisperModel
# INT8 quantisation: fastest option, minimal quality loss
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# FP16: best speed/quality balance
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
# Benchmark comparison (60-min audio, RTX 5090):
# OpenAI Whisper FP32: ~15 minutes
# OpenAI Whisper FP16: ~8 minutes
# faster-whisper FP16: ~2.5 minutes
# faster-whisper INT8: ~1.8 minutes
For production deployments, faster-whisper should be the default choice. It produces near-identical output to the original implementation at dramatically lower latency.
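To reproduce numbers like the ones above on your own hardware, a small timing harness is enough; benchmark is a hypothetical helper, and whatever callable you pass it stands in for the model variant under test:

```python
import time

def benchmark(label, transcribe_fn, *args):
    """Time one transcription call and report wall-clock seconds."""
    start = time.perf_counter()
    transcribe_fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed
```

One caveat specific to faster-whisper: transcribe returns a lazy generator, so wrap the call in something that consumes it, e.g. `benchmark("INT8", lambda p: list(model.transcribe(p)[0]), "audio.wav")`, or you will time almost nothing.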
Batched Audio Processing
Keeping the GPU saturated takes two forms: batched inference runs multiple 30-second segments of one file through the model simultaneously, and a worker pool keeps a queue of files flowing:
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)
# Batched inference processes multiple segments simultaneously
segments, info = batched.transcribe("long_audio.wav", batch_size=16)
# For multiple files, a thread pool keeps the model fed
# (pass num_workers > 1 to WhisperModel to allow parallel transcriptions)
import concurrent.futures
import os

audio_files = [f for f in os.listdir("audio/") if f.endswith(".wav")]

def transcribe_file(filepath):
    segments, _ = model.transcribe(filepath)
    return " ".join(s.text for s in segments)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    paths = [f"audio/{name}" for name in audio_files]
    for name, text in zip(audio_files, pool.map(transcribe_file, paths)):
        print(f"{name}: {text[:100]}...")
Chunked Processing for Long Audio
Whisper processes audio in 30-second windows internally. For very long files, explicit chunking with overlap prevents accuracy degradation at segment boundaries:
from pydub import AudioSegment
import tempfile, os

def chunk_and_transcribe(audio_path, model, chunk_ms=300000, overlap_ms=5000):
    audio = AudioSegment.from_file(audio_path)
    results = []
    start = 0
    while start < len(audio):
        end = min(start + chunk_ms, len(audio))
        chunk = audio[start:end]
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            chunk.export(f.name, format="wav")
            segments, _ = model.transcribe(f.name)
            text = " ".join(s.text for s in segments)
            results.append(text)
        os.unlink(f.name)
        # Overlap avoids cutting words at chunk boundaries; note the naive
        # join below repeats the overlapping seconds, so deduplicate if needed
        start += chunk_ms - overlap_ms
    return " ".join(results)
Verifying GPU Utilisation Is Maximised
After applying optimisations, confirm the GPU is fully engaged:
# Watch GPU utilisation during transcription
watch -n 0.5 nvidia-smi
# Target metrics:
# GPU-Util: 85-100% during active transcription
# Memory: model size + working memory (large-v3 ~10 GB)
# Power: near TDP indicates full compute usage
# If GPU-Util stays low, check:
# 1. Audio preprocessing is not the bottleneck (decode on CPU first)
# 2. Disk I/O is not starving the pipeline (use NVMe, not HDD)
# 3. CPU is not a bottleneck (check with htop)
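The same check can be scripted rather than eyeballed in watch. This sketch assumes nvidia-smi is on PATH; the helper names are ours:

```python
import subprocess

def parse_utilisation(csv_output):
    """Parse the noheader/nounits CSV output of nvidia-smi into per-GPU percentages."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

def gpu_utilisation():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilisation(out)
```

Poll this from a background thread during a long transcription run and log a warning if utilisation stays below 50 percent; that catches regressions long before anyone notices slow jobs.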
On a properly configured GPU server, faster-whisper with INT8 on an RTX 5090 achieves 30x real-time factor — a 60-minute recording transcribes in under two minutes. Check the benchmarks section for GPU-specific numbers. The tutorials cover pipeline deployment, our PyTorch guide covers environment setup, and the PyTorch hosting page lists compatible hardware.