
Whisper Timestamp Errors: Fix Guide

Fix Whisper word-level and segment-level timestamp alignment errors including drifting timestamps, overlapping segments, missing alignment, and incorrect word boundaries on GPU servers.

Whisper Timestamps Are Wrong or Misaligned

Your Whisper transcription text is accurate, but the timestamps are off — words appear several seconds before or after they were actually spoken, segments overlap each other, or word-level timestamps pile up at segment boundaries instead of distributing evenly. If you are building subtitles, karaoke-style highlighting, or audio search, broken timestamps render the output useless. These alignment problems have specific causes and fixes on your GPU server.

Fix 1: Segment Timestamp Drift

Timestamps drift when Whisper’s 30-second processing windows accumulate rounding errors across a long file:

# Problem: timestamps become progressively wrong after minute 5
# Root cause: default chunking does not compensate for silence trimming

# Fix: use word_timestamps for precise alignment
import whisper
model = whisper.load_model("large-v3", device="cuda")

result = model.transcribe("long_audio.wav",
    word_timestamps=True,       # Enable word-level timing
    condition_on_previous_text=False,  # Prevent error propagation
    prepend_punctuations="\"'([{-",
    append_punctuations="\"'.。,，?!:;)]}-"
)

for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
    if "words" in segment:
        for word in segment["words"]:
            print(f"  {word['start']:.3f}-{word['end']:.3f}: {word['word']}")
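Before applying any fix, it helps to locate where drift starts. A minimal sketch (helper name and thresholds are illustrative, not part of the Whisper API): large jumps or negative gaps between consecutive segments usually mark the point where alignment went wrong.

```python
def find_timestamp_anomalies(segments, max_gap=5.0):
    """Flag suspicious transitions between consecutive segments.

    A gap much larger than a natural pause, or a negative gap
    (overlap), usually marks where drift began.
    """
    anomalies = []
    for prev, cur in zip(segments, segments[1:]):
        gap = cur["start"] - prev["end"]
        if gap > max_gap:
            anomalies.append(("gap", prev["end"], cur["start"]))
        elif gap < 0:
            anomalies.append(("overlap", prev["end"], cur["start"]))
    return anomalies

segs = [
    {"start": 0.0, "end": 2.5},
    {"start": 2.6, "end": 5.0},
    {"start": 12.0, "end": 14.0},  # 7-second jump: likely drift point
    {"start": 13.5, "end": 16.0},  # overlaps the previous segment
]
print(find_timestamp_anomalies(segs))
# -> [('gap', 5.0, 12.0), ('overlap', 14.0, 13.5)]
```

Run this on `result["segments"]` and listen to the audio around each flagged timestamp to confirm whether the jump is real silence or drift.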

Fix 2: Overlapping Segment Boundaries

Consecutive segments may have overlapping time ranges, causing subtitles to stack on screen:

def fix_overlapping_segments(segments, min_gap=0.05):
    """Ensure no two segments overlap by trimming previous end times."""
    fixed = []
    for seg in segments:
        s = dict(seg)
        if fixed and s["start"] < fixed[-1]["end"]:
            # Trim the previous segment, but never past its own start
            fixed[-1]["end"] = max(fixed[-1]["start"], s["start"] - min_gap)
            # Alternative: push the current segment forward instead
            # (do this INSTEAD of the trim above, not after it):
            # s["start"] = fixed[-1]["end"] + min_gap
        fixed.append(s)
    return fixed

segments = result["segments"]
clean_segments = fix_overlapping_segments(segments)

# Convert to SRT format for subtitle use
def to_srt(segments):
    lines = []
    for i, s in enumerate(segments, 1):
        start = format_timestamp(s["start"])
        end = format_timestamp(s["end"])
        lines.append(f"{i}\n{start} --> {end}\n{s['text'].strip()}\n")
    return "\n".join(lines)

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}".replace(".", ",")

Fix 3: Inaccurate Word-Level Timestamps

Whisper’s native word timestamps use cross-attention weights, which are approximate. For precise word alignment, use whisper-timestamped or stable-ts:

# Option A: stable-ts (most accurate word alignment)
pip install stable-ts

import stable_whisper
model = stable_whisper.load_model("large-v3", device="cuda")
result = model.transcribe("audio.wav")

# Refine timestamps using the audio signal
result.adjust_by_silence("audio.wav")   # Snap boundaries to silence
model.refine("audio.wav", result)       # Re-score word timings against the audio

# Export with word-level timing
result.to_srt_vtt("output.srt", word_level=True)

# Option B: whisperx (forced alignment with phoneme model)
pip install whisperx

import whisperx
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)

# Align with phoneme-level model
align_model, metadata = whisperx.load_align_model(language_code="en", device="cuda")
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device="cuda")
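The aligned result contains segments whose `"words"` lists carry per-word timings; words the phoneme model cannot align (numerals are a common case) may come back without timestamps. A small sketch for flattening that structure — the helper name is ours, and the demo dict only mimics the typical output shape:

```python
def flatten_words(aligned_segments):
    """Flatten whisperx-style aligned segments into (word, start, end) tuples.

    Words without timestamps (unalignable tokens) are skipped rather
    than given guessed timings.
    """
    out = []
    for seg in aligned_segments:
        for w in seg.get("words", []):
            if "start" in w and "end" in w:
                out.append((w["word"], w["start"], w["end"]))
    return out

demo = [{"words": [{"word": "hello", "start": 0.12, "end": 0.4},
                   {"word": "2024"},  # unaligned: no timestamps
                   {"word": "world", "start": 0.55, "end": 0.9}]}]
print(flatten_words(demo))
# -> [('hello', 0.12, 0.4), ('world', 0.55, 0.9)]
```

Skipping unaligned words keeps downstream subtitle timing honest; interpolating their timestamps is possible but reintroduces the approximation you just removed.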

Fix 4: Timestamps Attached to Silence

Whisper sometimes assigns timestamps to non-speech segments. Apply Voice Activity Detection to filter these:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Enable VAD filtering — silero-vad runs on CPU alongside Whisper
segments, _ = model.transcribe("audio.wav",
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,            # Speech probability threshold
        "min_speech_duration_ms": 250,  # Ignore speech shorter than this
        "min_silence_duration_ms": 500, # Silence shorter than this won't split speech
        "speech_pad_ms": 200            # Pad detected speech regions
    }
)
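Note that faster-whisper returns `segments` as a lazy generator of `Segment` objects, so transcription only runs as you iterate. A minimal sketch for converting them into the plain dicts the earlier helpers expect — the `Segment` dataclass here is a stub standing in for faster-whisper's own, shown only so the example is self-contained:

```python
from dataclasses import dataclass

@dataclass
class Segment:  # stub mirroring faster_whisper's Segment fields
    start: float
    end: float
    text: str

def segments_to_dicts(segments):
    """Materialize lazy Segment objects into plain dicts."""
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

demo = [Segment(0.0, 2.1, " Hello"), Segment(2.3, 4.8, " world")]
print(segments_to_dicts(demo))
# -> [{'start': 0.0, 'end': 2.1, 'text': ' Hello'}, {'start': 2.3, 'end': 4.8, 'text': ' world'}]
```

Materializing once also means you can iterate the results multiple times, which a generator does not allow.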

Verifying Timestamp Quality

Visually spot-check timestamps against the audio waveform:

# Quick verification: render the subtitles over a blank video track
# (the subtitles filter needs a video stream, and PCM audio cannot be
# stream-copied into MP4, so re-encode to AAC)
ffmpeg -f lavfi -i color=c=black:s=1280x720:r=25 -i audio.wav \
  -vf "subtitles=output.srt:force_style='FontSize=24'" \
  -c:a aac -shortest preview.mp4

# Programmatic check: ensure monotonic, non-overlapping timestamps
prev_end = 0
for seg in result["segments"]:
    assert seg["start"] >= prev_end, f"Overlap at {seg['start']}"
    assert seg["end"] > seg["start"], f"Zero-length at {seg['start']}"
    prev_end = seg["end"]
print("All timestamps valid")
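The same idea extends one level down: every word's timing should fall inside its parent segment. A sketch under the same segment-dict assumptions (function name is ours; the small tolerance absorbs rounding):

```python
def validate_word_timing(segments, tol=0.01):
    """Return words whose timing falls outside their parent segment."""
    errors = []
    for seg in segments:
        for w in seg.get("words", []):
            inside = (seg["start"] - tol <= w["start"]
                      <= w["end"] <= seg["end"] + tol)
            if not inside:
                errors.append(w)
    return errors

segs = [{"start": 0.0, "end": 2.0,
         "words": [{"word": "hi", "start": 0.1, "end": 0.4},
                   {"word": "there", "start": 0.5, "end": 2.5}]}]  # spills past segment end
print(validate_word_timing(segs))
# -> [{'word': 'there', 'start': 0.5, 'end': 2.5}]
```

A non-empty result usually means the word-level pass (Fix 3) is needed, since Whisper's attention-based word timings are the usual culprit.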

For production subtitle generation on your GPU server, stable-ts with refinement produces broadcast-quality alignment. The Whisper hosting platform supports all these libraries. Check the tutorials section for deployment patterns, benchmarks for speed comparisons, and the PyTorch guide for environment configuration. The PyTorch hosting page lists compatible GPU hardware.

Precision Transcription on GPU

Run Whisper with word-level alignment on GigaGPU dedicated servers. Fast hardware, accurate timestamps.

Browse GPU Servers
