A voice activity detector is the quiet workhorse of any real-time audio pipeline. Silero VAD runs in milliseconds, decides whether each audio chunk contains speech, and is how you avoid transcribing endless silence on dedicated GPU hosting.
Why VAD
Without VAD, an always-on microphone feeds silence, background noise, and speech alike to Whisper. Whisper hallucinates on silence. VAD gates the transcriber – only speech-containing chunks go through. Benefits:
- No hallucinated transcripts during silence
- Lower compute cost (Whisper only runs on real speech)
- Accurate speech/silence boundaries for downstream tasks
Deployment
```python
import torch

model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad",
    model="silero_vad",
    force_reload=False,
)
# utils is a tuple; read_audio must be unpacked too, since it is used below
(get_speech_timestamps, _, read_audio, _, _) = utils

audio = read_audio("input.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(audio, model, threshold=0.5)
```
Silero VAD is tiny (~2 MB) and runs on CPU or GPU; a GPU gives a minor speed-up but is not required.
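By default the timestamps are sample offsets into the audio, so a small helper is handy for turning them into seconds and slicing out the speech for Whisper. A minimal sketch, assuming the audio is an indexable buffer and using fabricated timestamps (the `extract_speech` name is hypothetical, not part of the Silero API):

```python
def extract_speech(audio, timestamps, sample_rate=16000):
    """Return (segment, start_sec, end_sec) tuples for each speech span.

    `timestamps` stands in for get_speech_timestamps output: a list of
    {"start": ..., "end": ...} dicts holding sample offsets.
    """
    segments = []
    for ts in timestamps:
        start, end = ts["start"], ts["end"]
        segments.append((audio[start:end], start / sample_rate, end / sample_rate))
    return segments

# Fabricated timestamps over a dummy 3-second, 16 kHz buffer
audio = [0.0] * 48000  # placeholder for a real waveform
spans = extract_speech(audio, [{"start": 8000, "end": 24000}])
# spans[0] covers 0.5 s to 1.5 s: one second of detected speech
```

Each segment can then be passed to Whisper individually, which keeps the transcriber working only on audio the VAD has vouched for.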
Streaming
For real-time audio, feed Silero VAD 512-sample chunks (32 ms at 16 kHz) as they arrive. Accumulate chunks flagged as speech; once the speech run is followed by roughly a second of trailing silence, hand the buffer to Whisper:
```python
buffer = []
silence_count = 0

while True:
    chunk = read_from_mic()  # 512 samples at 16 kHz
    is_speech = vad(chunk) > 0.5
    if is_speech:
        buffer.append(chunk)
        silence_count = 0  # reset the silence run on every speech chunk
    else:
        silence_count += 1
        if buffer and silence_count > 30:  # ~1 second of silence
            transcribe(buffer)
            buffer.clear()
```
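The loop above is easier to test if the buffering logic is factored out of the I/O. A sketch of that idea as a small state machine; the class name is hypothetical, and the speech probability is passed in rather than computed, so any VAD (or a fake one in tests) can drive it:

```python
class UtteranceSegmenter:
    """Buffers speech chunks and emits a full utterance after a silence run.

    threshold / max_silence_chunks mirror the 0.5 and 30-chunk values
    used in the streaming loop above.
    """

    def __init__(self, threshold=0.5, max_silence_chunks=30):
        self.threshold = threshold
        self.max_silence_chunks = max_silence_chunks
        self.buffer = []
        self.silence_count = 0

    def feed(self, chunk, prob):
        """Feed one chunk with its speech probability.

        Returns a finished utterance (list of chunks) once enough trailing
        silence has accumulated, otherwise None.
        """
        if prob > self.threshold:
            self.buffer.append(chunk)
            self.silence_count = 0
            return None
        self.silence_count += 1
        if self.buffer and self.silence_count > self.max_silence_chunks:
            utterance, self.buffer = self.buffer, []
            self.silence_count = 0
            return utterance
        return None
```

The mic loop then reduces to `utterance = seg.feed(chunk, vad(chunk))` followed by `transcribe(utterance)` whenever the result is not None.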
Threshold
Default threshold 0.5 works for clean audio. In noisy environments raise to 0.6-0.7 to reduce false positives. In quiet audio with soft speakers drop to 0.3-0.4 to catch quiet speech.
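A quick way to sanity-check a threshold is to see what fraction of chunks it flags as speech. This sketch uses fabricated per-chunk probabilities (in practice they would come from calling the Silero model on each 512-sample chunk); `speech_ratio` is a hypothetical helper, not part of the library:

```python
def speech_ratio(probs, threshold):
    """Fraction of chunks flagged as speech at a given threshold."""
    flagged = sum(1 for p in probs if p > threshold)
    return flagged / len(probs)

# Fabricated probabilities: a soft speaker hovering around 0.4
# over a quiet noise floor near 0.05-0.1
probs = [0.05, 0.1, 0.35, 0.4, 0.45, 0.4, 0.1, 0.05]
```

At the default 0.5 this stream yields a speech ratio of 0.0 (every soft-spoken chunk is dropped), while 0.3 recovers the middle four chunks, which is exactly the situation where lowering the threshold pays off.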
Real-Time Audio Pipeline Hosting
UK dedicated GPUs with Silero VAD and Whisper preconfigured for streaming.
Browse GPU Servers
See Whisper Turbo and Whisper + diarization.