Your AI Model Rejects or Mangles Audio Input
You feed an audio file to Whisper and get garbled transcription, or your TTS voice cloning reference produces distorted output. The problem is rarely the model — it is the input format. Whisper expects 16 kHz mono audio (it converts to float32 internally, so 16-bit PCM WAV is the practical target). Coqui XTTS wants 22050 Hz mono. Most raw recordings arrive as 44.1 kHz stereo MP3 or compressed AAC from smartphones. The format mismatch degrades accuracy, introduces artefacts, or triggers outright errors on your GPU server.
Converting Audio for Whisper
Whisper’s expected input format and the FFmpeg commands to produce it:
# Whisper native format: 16 kHz, mono, 16-bit PCM WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le whisper_ready.wav
# From video files (strip video, keep audio)
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le whisper_ready.wav
# From stereo podcast (mix to mono properly)
ffmpeg -i stereo_podcast.wav -ar 16000 -ac 1 \
-af "pan=mono|c0=0.5*c0+0.5*c1" whisper_ready.wav
# From phone recording (often 8kHz AMR — upsample)
ffmpeg -i phone_call.amr -ar 16000 -ac 1 -c:a pcm_s16le whisper_ready.wav
# Verify the output format
ffprobe -v quiet -show_format -show_streams whisper_ready.wav 2>&1 | \
grep -E "sample_rate|channels|codec_name|bits_per_sample"
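If you would rather verify converted files from Python before queuing a batch, the stdlib wave module can check the same three properties the ffprobe command reports. A small sketch — is_whisper_ready is our own helper name, not a Whisper API:

```python
import wave

def is_whisper_ready(path_or_file):
    """Return True if a WAV file is 16 kHz, mono, 16-bit PCM —
    the format the conversion commands above produce."""
    with wave.open(path_or_file, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes == 16-bit samples
```

This only reads the WAV header, so it is cheap enough to run across an entire dataset before a long transcription job.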
Converting Audio for TTS Voice Cloning
TTS models need clean reference audio at specific sample rates:
# Coqui XTTS v2: 22050 Hz mono (or 24000 Hz depending on version)
ffmpeg -i reference.mp3 -ar 22050 -ac 1 -c:a pcm_s16le tts_reference.wav
# Trim to optimal 6-15 second clip
ffmpeg -i long_recording.wav -ss 00:01:23 -t 12 \
-ar 22050 -ac 1 -c:a pcm_s16le reference_clip.wav
# Remove background noise before using as reference
ffmpeg -i noisy_reference.wav -ar 22050 -ac 1 \
-af "highpass=f=80,lowpass=f=8000,afftdn=nf=-25" clean_reference.wav
# Normalise volume for consistent cloning quality
ffmpeg -i reference.wav -ar 22050 -ac 1 \
-af "loudnorm=I=-16:TP=-1.5:LRA=11" normalised_reference.wav
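Before handing a clip to the cloning pipeline, it is worth validating the sample rate and the 6-15 second duration window programmatically. A hedged sketch using only the stdlib wave module — check_tts_reference is our own helper name, and the default rate assumes XTTS v2:

```python
import wave

def check_tts_reference(path_or_file, rate=22050, min_s=6.0, max_s=15.0):
    """Validate a voice-cloning reference clip. Returns a list of
    problems; an empty list means the clip looks usable."""
    problems = []
    with wave.open(path_or_file, "rb") as w:
        if w.getframerate() != rate:
            problems.append(f"sample rate {w.getframerate()}, expected {rate}")
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        duration = w.getnframes() / w.getframerate()
        if not min_s <= duration <= max_s:
            problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s")
    return problems
```

Returning a list of findings rather than a bare boolean makes batch reports easy: log the problems per file and only re-convert the failures.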
Batch Conversion for Large Datasets
When processing hundreds or thousands of audio files for training or transcription:
#!/bin/bash
# Convert all MP3 files in a directory to Whisper format
INPUT_DIR="/data/raw_audio"
OUTPUT_DIR="/data/whisper_ready"
mkdir -p "$OUTPUT_DIR"
for f in "$INPUT_DIR"/*.mp3; do
    basename=$(basename "$f" .mp3)
    ffmpeg -i "$f" -ar 16000 -ac 1 -c:a pcm_s16le \
        "$OUTPUT_DIR/${basename}.wav" -y -loglevel error
done
echo "Converted $(ls "$OUTPUT_DIR"/*.wav | wc -l) files"
# Parallel conversion using GNU parallel (much faster)
find "$INPUT_DIR" -name "*.mp3" | parallel -j "$(nproc)" \
    ffmpeg -i {} -ar 16000 -ac 1 -c:a pcm_s16le \
    "$OUTPUT_DIR"/{/.}.wav -y -loglevel error
# Python batch conversion with progress tracking
import subprocess, os, glob
from tqdm import tqdm

files = glob.glob("/data/raw_audio/*.mp3")
os.makedirs("/data/whisper_ready", exist_ok=True)
for f in tqdm(files, desc="Converting"):
    out = f"/data/whisper_ready/{os.path.basename(f).replace('.mp3', '.wav')}"
    subprocess.run(["ffmpeg", "-i", f, "-ar", "16000", "-ac", "1",
                    "-c:a", "pcm_s16le", out, "-y", "-loglevel", "error"],
                   check=True)
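The sequential Python loop leaves all but one core idle. A thread pool is enough to parallelise it, since each worker just waits on an ffmpeg subprocess. A sketch under our own names — convert_one and convert_all are hypothetical helpers, not a library API:

```python
import concurrent.futures
import os
import subprocess

def convert_one(src, out_dir, rate=16000):
    """Convert one file to mono PCM WAV via ffmpeg; return (src, exit code)."""
    base = os.path.splitext(os.path.basename(src))[0]
    out = os.path.join(out_dir, base + ".wav")
    proc = subprocess.run(
        ["ffmpeg", "-i", src, "-ar", str(rate), "-ac", "1",
         "-c:a", "pcm_s16le", out, "-y", "-loglevel", "error"])
    return src, proc.returncode

def convert_all(files, out_dir, workers=None, worker=convert_one):
    """Run conversions concurrently; return the list of failed source paths.
    Threads (not processes) suffice: the work happens in ffmpeg itself."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, f, out_dir) for f in files]
        for fut in concurrent.futures.as_completed(futures):
            src, code = fut.result()
            if code != 0:
                failed.append(src)
    return failed
```

Collecting failures instead of raising on the first one matters at dataset scale: a single corrupt MP3 should not abort a ten-thousand-file run.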
Noise Reduction Before AI Processing
Clean audio before feeding it to the model rather than hoping the model handles noise:
# Multi-stage noise reduction pipeline
ffmpeg -i raw.wav -af "\
highpass=f=80,\
lowpass=f=12000,\
afftdn=nf=-20:nt=w,\
loudnorm=I=-16:TP=-1.5:LRA=11\
" -ar 16000 -ac 1 cleaned.wav
# Filter explanations:
# highpass=f=80 — Remove rumble below 80 Hz
# lowpass=f=12000 — Remove hiss above 12 kHz
# afftdn=nf=-20 — FFT-based noise reduction (adaptive)
# loudnorm — EBU R128 loudness normalisation
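When these chains are built from Python batch scripts, composing the -af string from named parameters keeps the stages readable and tweakable. A small sketch — denoise_chain and its parameter names are our own; the filter options themselves are the standard FFmpeg ones shown above:

```python
def denoise_chain(highpass=80, lowpass=12000, noise_floor=-20,
                  loudness=-16, true_peak=-1.5, lra=11):
    """Compose the multi-stage -af filter string shown above."""
    return ",".join([
        f"highpass=f={highpass}",           # remove rumble below cutoff
        f"lowpass=f={lowpass}",             # remove hiss above cutoff
        f"afftdn=nf={noise_floor}:nt=w",    # FFT denoise, white-noise model
        f"loudnorm=I={loudness}:TP={true_peak}:LRA={lra}",  # EBU R128
    ])
```

The resulting string drops straight into a subprocess call as the value after "-af", so one function serves every pipeline variant in this section.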
# For very noisy audio, let afftdn measure a noise profile from a
# noise-only section instead of estimating it adaptively. afftdn's
# sample_noise (sn) command is driven by asendcmd; this assumes the
# first second of the file contains only background noise:
ffmpeg -i noisy.wav -af \
"asendcmd=c='0.0 afftdn sn start; 1.0 afftdn sn stop',afftdn=nr=20:nt=w" \
cleaned.wav
Quick Reference: AI Audio Format Requirements
# Model             Sample rate   Channels   Format    Notes
# Whisper (all)     16000 Hz      Mono       PCM S16   Required by design
# faster-whisper    16000 Hz      Mono       PCM S16   Same as Whisper
# Coqui XTTS v2     22050 Hz      Mono       PCM S16   For reference audio
# Bark              24000 Hz      Mono       Float32   Output format
# Tortoise TTS      24000 Hz      Mono       Float32   Voice cloning ref
# AudioLDM          16000 Hz      Mono       Float32   Audio generation

# Universal conversion template:
# ffmpeg -i INPUT -ar RATE -ac 1 -c:a pcm_s16le OUTPUT.wav
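The table and the template can be folded into one lookup so scripts never hard-code a rate. A hedged sketch — TARGETS and ffmpeg_args are our own names, the rates mirror the table above (verify them against each model's docs), and pcm_f32le is ffmpeg's 32-bit float codec for the models listed as Float32:

```python
# Per-model targets: (sample rate, ffmpeg PCM codec). Assumed defaults,
# not authoritative — check each model's documentation before relying on them.
TARGETS = {
    "whisper":        (16000, "pcm_s16le"),
    "faster-whisper": (16000, "pcm_s16le"),
    "xtts_v2":        (22050, "pcm_s16le"),
    "bark":           (24000, "pcm_f32le"),
    "tortoise":       (24000, "pcm_f32le"),
    "audioldm":       (16000, "pcm_f32le"),
}

def ffmpeg_args(model, src, dst):
    """Build the universal-template argv for a given model."""
    rate, codec = TARGETS[model]
    return ["ffmpeg", "-i", src, "-ar", str(rate), "-ac", "1",
            "-c:a", codec, dst]
```

Passing the list straight to subprocess.run avoids shell-quoting bugs with filenames containing spaces, which the string-based template cannot.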
Proper format conversion is the foundation of reliable audio AI pipelines on your GPU server. The Whisper hosting and Coqui TTS hosting pages cover model-specific deployment. Check the tutorials section for pipeline guides, the benchmarks for processing throughput, and our PyTorch guide for environment setup. The infrastructure section has storage and server configuration advice.
Audio AI on Dedicated GPUs
Process thousands of audio files per hour. GigaGPU servers pair fast NVMe storage with GPU compute for audio pipelines.
Browse GPU Servers