Whisper Is Producing Inaccurate Transcriptions
Your Whisper deployment returns transcriptions riddled with errors — wrong words, hallucinated sentences during silence, mangled proper nouns, or entire phrases missing from the output. A word error rate above 10 percent on clean speech means something in the pipeline is misconfigured. On Whisper GPU hosting setups, most accuracy problems come from model selection, audio quality, or decoding settings — all of which are fixable.
Choose the Right Model Size
Model size is the single largest factor in transcription quality:
# Model sizes and approximate WER on English (LibriSpeech test-clean)
# tiny: 7.6% WER — 39M params, ~1 GB VRAM
# base: 5.0% WER — 74M params, ~1 GB VRAM
# small: 3.4% WER — 244M params, ~2 GB VRAM
# medium: 2.9% WER — 769M params, ~5 GB VRAM
# large-v3: 2.0% WER — 1.55B params, ~10 GB VRAM
import whisper
model = whisper.load_model("large-v3", device="cuda")
# For non-English or accented speech, large-v3 is strongly recommended
# Smaller models degrade much faster on non-English content
If you are using tiny or base for production transcription and seeing errors, upgrade to large-v3. The VRAM cost is modest — it fits comfortably on any dedicated GPU server.
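If you want to automate this choice across heterogeneous servers, the size-versus-VRAM table above can be turned into a selection ladder. A minimal sketch — the `pick_model` helper and the 2 GB activation headroom are illustrative assumptions; the VRAM figures come from the table above:

```python
# Model names match openai-whisper IDs; VRAM figures mirror the table above.
# Ordered from least to most accurate, so the last fitting entry wins.
MODEL_LADDER = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10)]

def pick_model(free_vram_gb, headroom_gb=2):
    """Return the most accurate model that fits in free VRAM,
    leaving headroom for activations and batch buffers (assumed 2 GB)."""
    choice = "tiny"  # fallback if nothing larger fits
    for name, vram in MODEL_LADDER:
        if vram + headroom_gb <= free_vram_gb:
            choice = name
    return choice
```

On a 24 GB card this selects large-v3; on a 4 GB card it falls back to small.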
Audio Preprocessing for Better Results
Whisper expects 16 kHz mono audio. Feeding it other formats forces internal resampling that can degrade quality:
# Proper preprocessing with ffmpeg before feeding to Whisper
ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le clean_audio.wav
# Noise reduction with sox (install: apt install sox)
# First build a noise profile from a noise-only sample, then apply it
sox noise_only.wav -n noiseprof noise_profile.txt
sox noisy.wav cleaned.wav noisered noise_profile.txt 0.21
# Volume normalisation prevents clipping and quiet segments
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 normalised.wav
# Remove silence padding that causes hallucinations
ffmpeg -i input.wav -af silenceremove=stop_periods=-1:stop_threshold=-40dB trimmed.wav
Whisper hallucinates repeated phrases when fed long silent sections. Trimming silence before transcription largely eliminates this failure mode.
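The shell commands above can be driven from Python so preprocessing always runs before transcription. A minimal sketch using the standard library's `subprocess` — the `build_preprocess_cmd`/`preprocess` names, and the choice to fold resampling, downmixing, and silence removal into a single ffmpeg pass, are assumptions:

```python
import subprocess

def build_preprocess_cmd(src, dst):
    """Build the ffmpeg argv mirroring the shell commands above:
    16 kHz mono 16-bit PCM with silence trimmed."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000",          # resample to the 16 kHz Whisper expects
        "-ac", "1",              # downmix to mono
        "-af", "silenceremove=stop_periods=-1:stop_threshold=-40dB",
        "-c:a", "pcm_s16le",     # 16-bit PCM WAV
        dst,
    ]

def preprocess(src, dst):
    """Run the preprocessing pass, raising on ffmpeg failure."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)
```

Keeping the argv construction separate from execution makes the command easy to log and unit-test without invoking ffmpeg.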
Tune Decoding Parameters
Default decoding settings work for general audio but leave accuracy on the table for specific use cases:
result = model.transcribe(
    "audio.wav",
    language="en",                     # Set explicitly; auto-detect can misidentify
    temperature=0,                     # Greedy/beam decoding = most consistent output
    beam_size=5,                       # Beam search improves accuracy by ~0.5% WER
    best_of=5,                         # Only used if decoding falls back to sampling
    condition_on_previous_text=False,  # Prevents error propagation
    compression_ratio_threshold=2.4,   # Reject hallucinated segments
    no_speech_threshold=0.6,           # Filter silence more aggressively
    word_timestamps=True,              # Enable for alignment verification
)
Setting condition_on_previous_text=False prevents a single misrecognised word from corrupting all subsequent segments. This alone can fix cascading errors.
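One refinement worth knowing: `transcribe()` also accepts a tuple of temperatures and retries a segment at the next temperature whenever it fails the compression-ratio or log-probability checks. A sketch of how that might look — `transcribe_robust` and the exact fallback schedule are illustrative; `logprob_threshold` is the openai-whisper parameter name:

```python
# Fallback schedule: greedy first, then progressively higher sampling
# temperatures if a segment fails the quality checks.
FALLBACK_TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

def transcribe_robust(model, path):
    """Transcribe with a temperature-fallback ladder instead of a fixed 0."""
    return model.transcribe(
        path,
        language="en",
        temperature=FALLBACK_TEMPERATURES,  # retry ladder, not a single value
        condition_on_previous_text=False,   # keep segments independent
        compression_ratio_threshold=2.4,    # trigger fallback on gibberish
        logprob_threshold=-1.0,             # trigger fallback on low confidence
        no_speech_threshold=0.6,
    )
```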
Post-Processing Corrections
Apply domain-specific corrections after transcription to catch systematic errors:
import re

def post_process_transcript(text, domain_terms=None):
    # Normalise casual contractions Whisper often emits verbatim
    replacements = {
        "gonna": "going to",
        "wanna": "want to",
    }
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    # Domain-specific term correction: enforce canonical spelling and casing
    if domain_terms:
        for correct_term in domain_terms:
            pattern = re.compile(re.escape(correct_term), re.IGNORECASE)
            text = pattern.sub(correct_term, text)
    # Collapse hallucination artifacts (a phrase repeated three or more times)
    text = re.sub(r'(.{20,}?)\1{2,}', r'\1', text)
    return text

# Usage with medical terms
medical_terms = ["paracetamol", "ibuprofen", "amoxicillin"]
cleaned = post_process_transcript(result["text"], medical_terms)
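To sanity-check the repeated-phrase collapse, you can feed the regex a synthetic hallucination — the example phrase below is illustrative, though "Thanks for watching" is a commonly reported Whisper artifact:

```python
import re

# The same 20+ character phrase emitted three times, as Whisper does
# when it hallucinates over silence.
hallucinated = "Thanks for watching! " * 3

# Same pattern as in post_process_transcript above: a group of 20+ chars
# followed by two or more verbatim repeats collapses to a single copy.
collapsed = re.sub(r'(.{20,}?)\1{2,}', r'\1', hallucinated)
# collapsed is now a single "Thanks for watching! "
```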
Measuring Accuracy Improvements
Track word error rate across your test set to verify each change helps:
pip install jiwer
from jiwer import wer
reference = "the actual spoken text goes here"
hypothesis = result["text"]
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")
Build a reference set of 50-100 manually transcribed audio clips representative of your production data, and test every pipeline change against it. For large-scale transcription on your GPU server, the tutorials section covers deployment patterns. See the benchmarks for per-GPU throughput at each model size, our PyTorch guide for environment setup, and the PyTorch hosting page for compatible hardware.
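The per-clip loop above can be wrapped into a small regression harness. A sketch — the `evaluate` helper and the mapping of clip paths to reference texts are assumptions; in practice `score_fn` would be `jiwer.wer` and `transcribe_fn` a wrapper around `model.transcribe`:

```python
def evaluate(references, transcribe_fn, score_fn):
    """Score every clip in the reference set.

    references: dict mapping audio path -> ground-truth transcript
    transcribe_fn: path -> hypothesis text
    score_fn: (reference, hypothesis) -> error rate (jiwer.wer in practice)
    Returns (mean error rate, per-clip scores) so regressions on
    individual clips are visible even when the mean improves.
    """
    scores = {path: score_fn(ref, transcribe_fn(path))
              for path, ref in references.items()}
    return sum(scores.values()) / len(scores), scores
```

Run it before and after each pipeline change and compare both the mean and the per-clip breakdown.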
GPU Servers for Whisper
Run Whisper large-v3 with headroom to spare. GigaGPU servers come preconfigured for audio AI workloads.
Browse GPU Servers