Whisper Timestamps Are Wrong or Misaligned
Your Whisper transcription text is accurate, but the timestamps are off — words appear several seconds before or after they were actually spoken, segments overlap each other, or word-level timestamps pile up at segment boundaries instead of distributing evenly. If you are building subtitles, karaoke-style highlighting, or audio search, broken timestamps render the output useless. These alignment problems have specific causes and fixes on your GPU server.
Fix 1: Segment Timestamp Drift
Timestamps drift when Whisper’s 30-second processing windows accumulate rounding errors across a long file:
# Problem: timestamps become progressively wrong after minute 5
# Root cause: default chunking does not compensate for silence trimming
# Fix: use word_timestamps for precise alignment
import whisper
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe(
    "long_audio.wav",
    word_timestamps=True,              # Enable word-level timing
    condition_on_previous_text=False,  # Prevent error propagation across windows
    prepend_punctuations="\"'([{-",
    append_punctuations="\"'.。,，?!:;)]}-",
)
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
    if "words" in segment:
        for word in segment["words"]:
            print(f"  {word['start']:.3f}-{word['end']:.3f}: {word['word']}")
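For karaoke-style highlighting or audio search, you usually want one flat word list rather than words nested inside segments. A minimal sketch, assuming the segment/word dict structure Whisper produces with `word_timestamps=True` (the `flatten_words` helper name is ours, not Whisper's):

```python
def flatten_words(result):
    """Collect every word across all segments into one flat
    (start, end, word) list for highlighting or search indexing."""
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append((w["start"], w["end"], w["word"].strip()))
    return words

# Demo on a synthetic result shaped like Whisper's output:
demo = {"segments": [
    {"start": 0.0, "end": 2.0, "text": " hello world",
     "words": [{"start": 0.1, "end": 0.6, "word": " hello"},
               {"start": 0.7, "end": 1.2, "word": " world"}]},
]}
print(flatten_words(demo))  # [(0.1, 0.6, 'hello'), (0.7, 1.2, 'world')]
```

Note that Whisper prefixes words with a leading space; the `.strip()` removes it so the tokens are clean for indexing.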
Fix 2: Overlapping Segment Boundaries
Consecutive segments may have overlapping time ranges, causing subtitles to stack on screen:
def fix_overlapping_segments(segments, min_gap=0.05):
    """Ensure no two segments overlap by adjusting end times."""
    fixed = []
    for seg in segments:
        s = dict(seg)
        if fixed and s["start"] < fixed[-1]["end"]:
            # Option A: trim the previous segment (never past its own start)
            fixed[-1]["end"] = max(fixed[-1]["start"], s["start"] - min_gap)
            # Option B: push the current segment forward instead
            # (comment out Option A and uncomment the line below)
            # s["start"] = fixed[-1]["end"] + min_gap
        fixed.append(s)
    return fixed
segments = result["segments"]
clean_segments = fix_overlapping_segments(segments)
# Convert to SRT format for subtitle use
def to_srt(segments):
    lines = []
    for i, s in enumerate(segments, 1):
        start = format_timestamp(s["start"])
        end = format_timestamp(s["end"])
        lines.append(f"{i}\n{start} --> {end}\n{s['text'].strip()}\n")
    return "\n".join(lines)

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}".replace(".", ",")
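SRT is strict about its `HH:MM:SS,mmm` format (comma, not dot, before the milliseconds), and a formatting slip silently breaks many players. A quick standalone sanity check of the formatter (the helper is repeated here so the snippet runs on its own):

```python
def format_timestamp(seconds):
    # Same conversion as the helper above, repeated for a standalone check
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}".replace(".", ",")

# SRT uses a comma as the millisecond separator:
print(format_timestamp(0.5))      # 00:00:00,500
print(format_timestamp(3661.25))  # 01:01:01,250
```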
Fix 3: Inaccurate Word-Level Timestamps
Whisper’s native word timestamps use cross-attention weights, which are approximate. For precise word alignment, use whisper-timestamped or stable-ts:
# Option A: stable-ts (most accurate word alignment)
pip install stable-ts

import stable_whisper

model = stable_whisper.load_model("large-v3", device="cuda")
result = model.transcribe("audio.wav")
# Refine timestamps using audio signal analysis
result.adjust_by_silence("audio.wav")  # Snap word boundaries to detected silence
model.refine("audio.wav", result)      # Re-score word timings against the audio
# Export with word-level timing
result.to_srt_vtt("output.srt", word_level=True)
# Option B: whisperx (forced alignment with phoneme model)
pip install whisperx
import whisperx
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)
# Align with phoneme-level model
align_model, metadata = whisperx.load_align_model(language_code="en", device="cuda")
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device="cuda")
Fix 4: Timestamps Attached to Silence
Whisper sometimes assigns timestamps to non-speech segments. Apply Voice Activity Detection to filter these:
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Enable VAD filtering — silero-vad runs on CPU alongside Whisper
segments, _ = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,               # Speech probability threshold
        "min_speech_duration_ms": 250,  # Drop speech bursts shorter than this
        "min_silence_duration_ms": 500, # Silence must last this long to split segments
        "speech_pad_ms": 200,           # Pad detected speech regions
    },
)
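One practical difference: faster-whisper yields segment objects with `.start`/`.end`/`.text` attributes (lazily, from a generator) rather than the dicts openai-whisper returns, so the dict-based helpers earlier in this article need a small adapter. A sketch, demonstrated with a stand-in namedtuple since the real objects come from the transcription call:

```python
from collections import namedtuple

def segments_to_dicts(segments):
    """Materialize an iterable of segment objects (with .start/.end/.text
    attributes, as faster-whisper yields) into plain dicts compatible
    with the dict-based helpers above."""
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

# Demo with a stand-in type; faster-whisper's Segment exposes
# the same attribute names:
Segment = namedtuple("Segment", ["start", "end", "text"])
demo = [Segment(0.0, 2.5, " Hello"), Segment(2.5, 4.0, " world")]
print(segments_to_dicts(demo))
```

Materializing also matters because the generator is consumed once: iterate it twice and the second pass sees nothing.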
Verifying Timestamp Quality
Visually spot-check timestamps against the audio waveform:
# Quick verification: play audio with subtitle overlay
ffmpeg -i audio.wav -vf "subtitles=output.srt:force_style='FontSize=24'" \
-c:a copy preview.mp4
# Programmatic check: ensure monotonic, non-overlapping timestamps
prev_end = 0
for seg in result["segments"]:
    assert seg["start"] >= prev_end, f"Overlap at {seg['start']}"
    assert seg["end"] > seg["start"], f"Zero-length at {seg['start']}"
    prev_end = seg["end"]
print("All timestamps valid")
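Assertions stop at the first bad segment, which is awkward when auditing a long transcript. A small variant (our own helper, not part of any Whisper library) that collects every problem instead of failing fast:

```python
def find_timestamp_issues(segments):
    """Return a list of human-readable timestamp problems so a whole
    transcript can be audited in one pass instead of failing on the first."""
    issues = []
    prev_end = 0.0
    for i, seg in enumerate(segments):
        if seg["start"] < prev_end:
            issues.append(
                f"segment {i}: overlaps previous "
                f"(starts {seg['start']:.2f} < {prev_end:.2f})"
            )
        if seg["end"] <= seg["start"]:
            issues.append(f"segment {i}: zero or negative length at {seg['start']:.2f}")
        prev_end = seg["end"]
    return issues

# Demo: the second segment both overlaps and has zero length
demo = [{"start": 0.0, "end": 2.0}, {"start": 1.5, "end": 1.5}]
for issue in find_timestamp_issues(demo):
    print(issue)
```

An empty return list means the transcript passes the same checks as the assertion loop above.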
For production subtitle generation on your GPU server, stable-ts with refinement produces broadcast-quality alignment. The Whisper hosting platform supports all these libraries. Check the tutorials section for deployment patterns, benchmarks for speed comparisons, and the PyTorch guide for environment configuration. The PyTorch hosting page lists compatible GPU hardware.
Precision Transcription on GPU
Run Whisper with word-level alignment on GigaGPU dedicated servers. Fast hardware, accurate timestamps.
Browse GPU Servers