
Whisper + Pyannote Diarization on a GPU

Transcription tells you what was said. Diarization tells you who said it. Combining the two on a dedicated GPU gives you full speaker-labelled transcripts.

Raw Whisper transcription is a monologue-style stream of text. For meetings, calls, or interviews you need to know who said what. Pyannote handles speaker diarization. On dedicated GPU hosting the combined pipeline produces speaker-labelled transcripts reliably.

Stack

  • faster-whisper for transcription with per-segment timestamps
  • pyannote.audio for speaker diarization
  • A merger that assigns segments to speakers based on overlapping timestamps

The whisperX project combines these and is the practical default.
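A minimal install sketch, assuming a CUDA-capable environment; exact package versions and the token value are placeholders, and the gated pyannote models require accepting their licenses on huggingface.co first:

```shell
# Install whisperX (pulls in faster-whisper and pyannote.audio)
pip install whisperx

# Accept the pyannote model licenses on huggingface.co, then expose a token
export HF_TOKEN=hf_your_token_here   # placeholder - use your own token
```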

Pipeline

  1. Transcribe audio with faster-whisper, keeping word-level timestamps
  2. Diarize audio with Pyannote – produces (start, end, speaker_id) intervals
  3. Assign each transcribed word to the speaker whose interval contains its timestamp
  4. Collapse consecutive same-speaker words into turns
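Steps 3 and 4 can be sketched in a few lines of plain Python. whisperX implements this merge internally; the data shapes and sample values below are illustrative, assigning each word to the speaker whose interval overlaps it the most:

```python
# Sketch of steps 3-4: assign words to speakers by timestamp overlap,
# then collapse consecutive same-speaker words into turns.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_and_collapse(words, diar_segments):
    """words: [(start, end, text)]; diar_segments: [(start, end, speaker_id)].
    Returns a list of (speaker_id, text) turns."""
    turns = []
    for w_start, w_end, text in words:
        # Pick the speaker whose diarization interval overlaps this word most
        speaker = max(diar_segments,
                      key=lambda d: overlap(w_start, w_end, d[0], d[1]))[2]
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there"), (1.2, 1.6, "hi")]
diar = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
print(assign_and_collapse(words, diar))
# [('SPEAKER_00', 'hello there'), ('SPEAKER_01', 'hi')]
```

Maximal overlap (rather than strict containment) handles words that straddle a speaker change.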

Code

import os

import whisperx

HF_TOKEN = os.environ["HF_TOKEN"]  # required for the gated pyannote models

# 1. Transcribe with batched faster-whisper
audio = whisperx.load_audio("meeting.wav")
model = whisperx.load_model("large-v3-turbo", device="cuda")
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript to get word-level timestamps
align_model, meta = whisperx.load_align_model(language_code=result["language"], device="cuda")
result = whisperx.align(result["segments"], align_model, meta, audio, device="cuda")

# 3. Diarize and assign each word to a speaker
diarize = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device="cuda")
segments = diarize(audio)
result = whisperx.assign_word_speakers(segments, result)

# 4. Print the speaker-labelled transcript
for seg in result["segments"]:
    print(f"{seg['speaker']}: {seg['text']}")
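For meeting transcripts it often helps to print timestamps alongside the speaker labels. A small formatting sketch over whisperX-style segment dicts; the sample `result` below is made up, and note that segments diarization didn't cover may lack a `speaker` key:

```python
# Format whisperX-style segments as "[HH:MM:SS] SPEAKER: text" lines.
def fmt_ts(seconds):
    """Render a float second offset as HH:MM:SS."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

# Illustrative sample data, not real pipeline output
result = {"segments": [
    {"start": 0.0, "end": 4.2, "speaker": "SPEAKER_00", "text": "Morning, everyone."},
    {"start": 4.5, "end": 9.8, "text": "Let's get started."},  # no speaker assigned
]}

for seg in result["segments"]:
    spk = seg.get("speaker", "UNKNOWN")  # fall back when diarization missed a segment
    print(f"[{fmt_ts(seg['start'])}] {spk}: {seg['text']}")
```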

Quality Tips

  • Pyannote needs clean audio – noise and overlapping speakers degrade diarization
  • Set the expected speaker count if you know it (min_speakers=2, max_speakers=4)
  • For phone calls, two-speaker diarization is usually accurate; large conference calls are harder
  • Diarization runs on GPU for speed but CPU is fine for non-real-time batch jobs

Speaker-Labelled Transcription Hosting

Whisper + Pyannote on UK dedicated GPUs with HuggingFace tokens configured.

Browse GPU Servers

See also: Whisper Turbo.
