
Whisper Accuracy Issues: Improvement Guide

Improve Whisper transcription accuracy by selecting the right model size, preprocessing audio, tuning decoding parameters, handling accents, and applying post-processing corrections on GPU servers.

Whisper Is Producing Inaccurate Transcriptions

Your Whisper deployment returns transcriptions riddled with errors — wrong words, hallucinated sentences during silence, mangled proper nouns, or entire phrases missing from the output. A word error rate above 10 percent on clean speech means something in the pipeline is misconfigured. On Whisper GPU hosting setups, most accuracy problems come from model selection, audio quality, or decoding settings — all of which are fixable.

Choose the Right Model Size

Model size is the single largest factor in transcription quality:

# Model sizes and approximate WER on English (LibriSpeech test-clean)
# tiny:    7.6% WER  — 39M params, ~1 GB VRAM
# base:    5.0% WER  — 74M params, ~1 GB VRAM
# small:   3.4% WER  — 244M params, ~2 GB VRAM
# medium:  2.9% WER  — 769M params, ~5 GB VRAM
# large-v3: 2.0% WER — 1.55B params, ~10 GB VRAM

import whisper
model = whisper.load_model("large-v3", device="cuda")

# For non-English or accented speech, large-v3 is strongly recommended
# Smaller models degrade much faster on non-English content

If you are using tiny or base for production transcription and seeing errors, upgrade to large-v3. The VRAM cost is modest — it fits comfortably on any dedicated GPU server.
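If you serve mixed hardware, model choice can be automated from available VRAM. A minimal sketch based on the approximate requirements in the table above; the `pick_model` helper and its thresholds are illustrative assumptions, not official figures:

```python
def pick_model(vram_gb: float) -> str:
    """Return the largest Whisper model that fits in vram_gb of GPU memory."""
    # (model name, approximate VRAM needed in GB), largest first.
    # tiny and base both fit in ~1 GB, so base is preferred at that size.
    requirements = [
        ("large-v3", 10),
        ("medium", 5),
        ("small", 2),
        ("base", 1),
    ]
    for name, needed in requirements:
        if vram_gb >= needed:
            return name
    return "tiny"  # fallback for very constrained cards
```

Pass the result straight to whisper.load_model, e.g. whisper.load_model(pick_model(12), device="cuda") would select large-v3.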

Audio Preprocessing for Better Results

Whisper expects 16 kHz mono audio. Feeding it other formats forces internal resampling that can degrade quality:

# Proper preprocessing with ffmpeg before feeding to Whisper
ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le clean_audio.wav

# Noise reduction with sox (install: apt install sox)
# First build a noise profile from a noise-only section (here, the first 0.5 s),
# then apply it to the full recording
sox noisy.wav -n trim 0 0.5 noiseprof noise_profile.txt
sox noisy.wav cleaned.wav noisered noise_profile.txt 0.21

# Volume normalisation prevents clipping and quiet segments
ffmpeg -i input.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 normalised.wav

# Remove silence padding that causes hallucinations
ffmpeg -i input.wav -af silenceremove=stop_periods=-1:stop_threshold=-40dB trimmed.wav

Whisper hallucinates repeated phrases when fed long silent sections. Trimming silence before transcription largely eliminates this failure mode.
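The ffmpeg steps above can be combined into a single pass and driven from Python. A sketch under the same filter settings; `preprocess_cmd` is a hypothetical helper, and the values are starting points rather than tuned constants:

```python
import subprocess

def preprocess_cmd(src: str, dst: str) -> list:
    """Build one ffmpeg command covering resampling, loudness
    normalisation, and silence removal."""
    filters = ",".join([
        "loudnorm=I=-16:TP=-1.5:LRA=11",                       # normalise loudness
        "silenceremove=stop_periods=-1:stop_threshold=-40dB",  # trim silence
    ])
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000", "-ac", "1",  # 16 kHz mono, as Whisper expects
        "-af", filters,
        dst,
    ]

def preprocess(src: str, dst: str) -> None:
    subprocess.run(preprocess_cmd(src, dst), check=True)
```

Running the filters in one pass avoids writing intermediate WAV files between each step.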

Tune Decoding Parameters

Default decoding settings work for general audio but leave accuracy on the table for specific use cases:

result = model.transcribe("audio.wav",
    language="en",              # Set explicitly; auto-detect can misidentify
    temperature=0,              # Greedy decoding = most consistent output
    beam_size=5,                # Beam search improves accuracy by ~0.5% WER
    best_of=5,                  # Candidates sampled if decoding falls back to temperature > 0
    condition_on_previous_text=False,  # Prevents error propagation
    compression_ratio_threshold=2.4,   # Reject hallucinated segments
    no_speech_threshold=0.6,           # Filter silence more aggressively
    word_timestamps=True               # Enable for alignment verification
)

Setting condition_on_previous_text=False prevents a single misrecognised word from corrupting all subsequent segments. This alone can fix cascading errors.
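Internally, transcribe() also retries segments at higher temperatures when the compression-ratio or average log-probability checks fail. The retry logic can be sketched as a standalone loop; `decode_fn` stands in for a call into the model, and the thresholds mirror the defaults shown above:

```python
def decode_with_fallback(decode_fn,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Try greedy decoding first; re-decode at progressively higher
    temperatures when the output looks degenerate."""
    result = None
    for t in temperatures:
        result = decode_fn(t)
        ok = (result["compression_ratio"] <= compression_ratio_threshold
              and result["avg_logprob"] >= logprob_threshold)
        if ok:
            return result
    return result  # last attempt, even if it failed the checks
```

This is why a repeated-phrase hallucination (high compression ratio) usually triggers a retry rather than appearing in the final output.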

Post-Processing Corrections

Apply domain-specific corrections after transcription to catch systematic errors:

import re

def post_process_transcript(text, domain_terms=None):
    # Fix common Whisper mistakes
    replacements = {
        "gonna": "going to",
        "wanna": "want to",
    }
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)

    # Domain-specific term correction
    if domain_terms:
        for correct_term in domain_terms:
            pattern = re.compile(re.escape(correct_term), re.IGNORECASE)
            text = pattern.sub(correct_term, text)

    # Remove hallucination artifacts (repeated phrases)
    text = re.sub(r'(.{20,}?)\1{2,}', r'\1', text)

    return text

# Usage with medical terms
medical_terms = ["paracetamol", "ibuprofen", "amoxicillin"]
cleaned = post_process_transcript(result["text"], medical_terms)

Measuring Accuracy Improvements

Track word error rate across your test set to verify each change helps:

pip install jiwer

from jiwer import wer
reference = "the actual spoken text goes here"
hypothesis = result["text"]
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")

Build a reference set of 50-100 manually transcribed audio clips representative of your production data, and test every pipeline change against it. For large-scale transcription on your GPU server, the tutorials section covers deployment patterns, the benchmarks list per-GPU throughput at each model size, the PyTorch guide covers environment setup, and the PyTorch hosting page lists compatible hardware.
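To sanity-check scores on the reference set without extra dependencies, WER can also be computed directly as word-level edit distance over the reference length; jiwer implements the same metric with normalisation options. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

Lowercase and strip punctuation from both sides before scoring, so WER reflects recognition errors rather than formatting differences.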

GPU Servers for Whisper

Run Whisper large-v3 with headroom to spare. GigaGPU servers come preconfigured for audio AI workloads.

Browse GPU Servers
