Raw Whisper transcription is a monologue-style stream of text. For meetings, calls, or interviews you need to know who said what. Pyannote handles speaker diarization. On dedicated GPU hosting the combined pipeline produces speaker-labelled transcripts reliably.
Stack
- faster-whisper for transcription with per-segment timestamps
- pyannote.audio for speaker diarization
- A merger that assigns segments to speakers based on overlapping timestamps
The whisperX project combines these and is the practical default.
Pipeline
- Transcribe audio with faster-whisper, keeping word-level timestamps
- Diarize audio with Pyannote – produces (start, end, speaker_id) intervals
- Assign each transcribed word to the speaker whose interval contains its timestamp
- Collapse consecutive same-speaker words into turns
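Steps 3 and 4 are plain interval arithmetic. A minimal sketch of the merge logic (hypothetical data shapes for illustration, not the whisperX internals): each word is given to the diarization interval it overlaps most, then consecutive same-speaker words are joined into turns.

```python
def assign_speakers(words, turns):
    """Assign each (start, end, text) word to the (start, end, speaker)
    diarization interval with the greatest temporal overlap."""
    labelled = []
    for w_start, w_end, text in words:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            # Overlap between the word span and the speaker interval;
            # negative means they don't intersect at all.
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((best, text))
    return labelled

def collapse_turns(labelled):
    """Merge consecutive same-speaker words into turns."""
    merged = []
    for speaker, text in labelled:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "there."), (1.2, 1.6, "Hi!")]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
print(collapse_turns(assign_speakers(words, turns)))
# [('SPEAKER_00', 'Hello there.'), ('SPEAKER_01', 'Hi!')]
```

Words that overlap no interval (e.g. spoken during a diarization gap) fall back to "UNKNOWN" here; whisperX handles that case similarly by leaving the speaker field unset.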
Code
```python
import os

import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.wav")

# 1. Transcribe
model = whisperx.load_model("large-v3-turbo", device=device)
result = model.transcribe(audio, batch_size=16)

# 2. Align to get word-level timestamps
align_model, meta = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, meta, audio, device=device)

# 3. Diarize (Pyannote models require accepting the licence and a HuggingFace token)
diarize = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
segments = diarize(audio)

# 4. Merge: attach speaker labels to segments and words
result = whisperx.assign_word_speakers(segments, result)

for seg in result["segments"]:
    # Segments with no overlapping diarization interval have no speaker key
    print(f"{seg.get('speaker', 'UNKNOWN')}: {seg['text'].strip()}")
```
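The labelled segments are easy to render into a readable transcript. A small sketch, assuming the whisperX-style segment dicts above (`start`, `speaker`, `text`; the sample data is invented):

```python
def to_transcript(segments):
    """Render segment dicts as '[MM:SS] speaker: text' lines,
    merging consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # may be unset during silence/music
        if turns and turns[-1][0] == speaker:
            turns[-1][2] += " " + seg["text"].strip()
        else:
            turns.append([speaker, seg["start"], seg["text"].strip()])
    return "\n".join(
        f"[{int(start) // 60:02d}:{int(start) % 60:02d}] {spk}: {text}"
        for spk, start, text in turns
    )

segments = [
    {"start": 0.0, "speaker": "SPEAKER_00", "text": " Morning all."},
    {"start": 2.3, "speaker": "SPEAKER_00", "text": " Let's begin."},
    {"start": 4.5, "speaker": "SPEAKER_01", "text": " Sure."},
]
print(to_transcript(segments))
# [00:00] SPEAKER_00: Morning all. Let's begin.
# [00:04] SPEAKER_01: Sure.
```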
Quality Tips
- Pyannote needs clean audio – noise and overlapping speakers degrade diarization
- Set the expected speaker count if you know it (min_speakers=2, max_speakers=4)
- For phone calls, two-speaker diarization is usually accurate; large conference calls are harder
- Diarization runs on GPU for speed but CPU is fine for non-real-time batch jobs
Speaker-Labelled Transcription Hosting
Whisper + Pyannote on UK dedicated GPUs with HuggingFace tokens configured.