Your TTS Output Has Crackling, Pops, or Distortion
The synthesised speech from your TTS model contains audible crackling between words, periodic pops at sentence boundaries, metallic distortion on certain vowels, or a persistent high-frequency whine behind the voice. These artefacts make the output unusable for production applications. The text-to-speech model itself may be generating clean mel spectrograms, but something in the pipeline between synthesis and final WAV output is introducing corruption on your GPU server.
Fix 1: Sample Rate Mismatches
Sample rate mismatches are the most common cause of crackling. If any component in the pipeline expects a different sample rate than it receives, the waveform is played back or resampled incorrectly, producing pitch shifts and distortion:
# Check what sample rate the model outputs
from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
print(f"Model sample rate: {tts.synthesizer.output_sample_rate}")
# XTTS v2 outputs at 24000 Hz
# Wrong: saving at mismatched sample rate
import soundfile as sf
wav = tts.tts("Hello world", speaker_wav="ref.wav", language="en")
sf.write("output.wav", wav, samplerate=44100) # WRONG — causes pitch shift + artefacts
# Correct: match the model's native sample rate
sf.write("output.wav", wav, samplerate=24000) # CORRECT
# If you need 44100 Hz output, resample properly after synthesis
import librosa
wav_resampled = librosa.resample(wav, orig_sr=24000, target_sr=44100)
sf.write("output_44k.wav", wav_resampled, samplerate=44100)
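A quick sanity check is to read the header of the file you just wrote and confirm it matches the model's native rate. If soundfile is already loaded, `sf.info(path).samplerate` does this; the sketch below uses only the standard-library `wave` module, with a synthetic tone standing in for real TTS output:

```python
import math
import struct
import wave

def wav_sample_rate(path):
    """Read the sample rate recorded in a WAV file's header (stdlib only)."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Demo: write a 100 ms 440 Hz tone at 24 kHz, then verify the header
sr = 24000
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit PCM
    w.setframerate(sr)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * 440 * n / sr)))
        for n in range(sr // 10)
    )
    w.writeframes(frames)

print(wav_sample_rate("tone.wav"))  # 24000
```

If the header rate disagrees with the model's output rate, every downstream player will resample or mis-play the audio.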
Fix 2: Audio Clipping and Amplitude Overflow
TTS models sometimes output samples that exceed the [-1.0, 1.0] range, causing harsh clipping distortion:
import numpy as np
wav = tts.tts("Test sentence with emphasis!", speaker_wav="ref.wav", language="en")
wav = np.array(wav)
# Check for clipping
peak = np.max(np.abs(wav))
print(f"Peak amplitude: {peak:.4f}")
if peak > 1.0:
    print(f"CLIPPING DETECTED — {np.sum(np.abs(wav) > 1.0)} samples clipped")
# Fix: normalise to safe amplitude
def safe_normalize(audio, target_peak=0.9):
    """Scale audio so its peak amplitude sits at target_peak."""
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (target_peak / peak)
    return audio
wav_clean = safe_normalize(wav)
# Alternative: soft-clip instead of hard-clip (preserves dynamics)
def soft_clip(audio, threshold=0.9):
    """Compress samples smoothly towards the threshold with tanh."""
    return np.tanh(audio / threshold) * threshold
wav_soft = soft_clip(wav)
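Clipping also bites when converting float output to 16-bit PCM: a sample above 1.0 multiplied by 32767 overflows int16 and wraps to the opposite sign, which sounds far harsher than a hard clip. A minimal conversion helper (a hypothetical utility, not part of any TTS API) that clips before casting:

```python
import numpy as np

def float_to_int16(audio):
    """Convert float audio in [-1, 1] to int16, clipping first to avoid wrap-around."""
    audio = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    return (audio * 32767.0).astype(np.int16)

hot = np.array([1.2, -0.5, 0.0], dtype=np.float32)
print(float_to_int16(hot))  # values: 32767, -16383, 0
```

Without the clip, the 1.2 sample would wrap to a large negative int16 value, producing an audible crack at that instant.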
Fix 3: Pops at Sentence and Segment Boundaries
When TTS generates long text in segments, discontinuities at segment boundaries produce audible pops:
import numpy as np
def crossfade_segments(segments, sr=24000, fade_ms=15):
    """Join audio segments with crossfade to eliminate pops."""
    fade_samples = int(sr * fade_ms / 1000)
    if fade_samples < 2:
        return np.concatenate(segments)
    result = np.array(segments[0], dtype=np.float32)
    for seg in segments[1:]:
        seg = np.array(seg, dtype=np.float32)
        # Create fade curves
        fade_out = np.linspace(1.0, 0.0, fade_samples)
        fade_in = np.linspace(0.0, 1.0, fade_samples)
        # Apply crossfade
        result[-fade_samples:] *= fade_out
        seg[:fade_samples] *= fade_in
        # Overlap-add
        overlap = result[-fade_samples:] + seg[:fade_samples]
        result = np.concatenate([result[:-fade_samples], overlap, seg[fade_samples:]])
    return result
# Generate sentences separately, then crossfade
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
wavs = [tts.tts(s, speaker_wav="ref.wav", language="en") for s in sentences]
final = crossfade_segments(wavs)
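The linear fades above sum to unity in amplitude, which can leave a slight loudness dip mid-fade on uncorrelated material. An equal-power crossfade (a standard DSP alternative, not specific to any TTS library) uses sine/cosine curves so the summed power stays constant:

```python
import numpy as np

def equal_power_fades(fade_samples):
    """Sine/cosine fade curves whose squared sum is 1 at every point."""
    t = np.linspace(0.0, np.pi / 2, fade_samples)
    fade_out = np.cos(t)
    fade_in = np.sin(t)
    return fade_out, fade_in

fo, fi = equal_power_fades(360)  # 15 ms at 24 kHz
print(np.allclose(fo**2 + fi**2, 1.0))  # True — power conserved across the fade
```

Swapping these curves into `crossfade_segments` in place of the `np.linspace` fades is a drop-in change.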
Fix 4: GPU Precision-Related Artefacts
FP16 inference can introduce subtle numerical artefacts in vocoder output, especially at low amplitudes:
# If using manual model loading, check precision
import torch
# Problem: FP16 vocoder introduces quantisation noise
model = model.half().cuda() # May cause artifacts in audio output
# Fix: keep the vocoder in FP32 even when the acoustic model uses FP16
# Most TTS frameworks handle this automatically, but if using custom code:
acoustic_model = acoustic_model.half().cuda() # FP16 is fine here
vocoder = vocoder.float().cuda() # Keep vocoder in FP32
# In Coqui TTS, this is handled internally but you can verify:
# The synthesizer.vocoder_model should be in float32
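In custom loading code you can verify the FP16/FP32 split by inspecting parameter dtypes directly. A sketch with toy `nn` modules standing in for the real acoustic model and vocoder (the layer shapes are illustrative only):

```python
import torch
import torch.nn as nn

def param_dtypes(model):
    """Return the set of parameter dtypes present in a model."""
    return {p.dtype for p in model.parameters()}

# Toy stand-ins for an acoustic model and a vocoder
acoustic = nn.Linear(80, 80).half()    # FP16 acoustic model
vocoder = nn.Conv1d(80, 1, 7).float()  # vocoder kept in FP32

print(param_dtypes(acoustic))  # {torch.float16}
print(param_dtypes(vocoder))   # {torch.float32}
```

If the vocoder set contains `torch.float16`, a stray `.half()` has been applied somewhere upstream.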
Fix 5: Streaming Buffer Underruns
For real-time TTS streaming, buffer underruns cause gaps and clicks:
# Ensure adequate buffer size for streaming
import pyaudio
CHUNK_SIZE = 4096 # Increase if hearing clicks (try 8192)
SAMPLE_RATE = 24000
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32,
                channels=1,
                rate=SAMPLE_RATE,
                output=True,
                frames_per_buffer=CHUNK_SIZE)
# Pre-buffer before starting playback
prebuffer_chunks = 3
buffer = []
for chunk in tts_streaming_generator():
    buffer.append(chunk)
    if len(buffer) >= prebuffer_chunks:
        stream.write(buffer.pop(0).tobytes())
# Flush whatever remains once the generator is exhausted
for chunk in buffer:
    stream.write(chunk.tobytes())
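Pre-buffering trades latency for safety: each buffered chunk delays the start of playback by chunk_size / sample_rate seconds. A quick calculation for the settings above:

```python
def stream_latency_ms(chunk_size, sample_rate, prebuffer_chunks):
    """Added startup latency from pre-buffering, in milliseconds."""
    return 1000.0 * chunk_size * prebuffer_chunks / sample_rate

# 3 chunks of 4096 samples at 24 kHz
print(stream_latency_ms(4096, 24000, 3))  # 512.0 ms
```

Half a second of startup delay is usually acceptable for read-aloud applications; for conversational agents you may need smaller chunks and fewer prebuffered ones, at a higher risk of underruns.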
Diagnosing Artefact Source
Isolate whether artefacts come from the model or the output pipeline:
# Step 1: Save raw model output without any processing
import numpy as np
raw_wav = np.asarray(tts.tts("Test", speaker_wav="ref.wav", language="en"))
np.save("raw_output.npy", raw_wav)
# Step 2: Inspect the waveform for anomalies
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 3))
plt.plot(raw_wav[:48000]) # First 2 seconds at 24 kHz
plt.title("Raw TTS waveform")
plt.savefig("waveform.png")
# Step 3: Check the spectrum for unexpected frequencies
from scipy import signal
f, Pxx = signal.welch(raw_wav, fs=24000, nperseg=2048)
plt.figure(figsize=(15, 3)) # New figure so the waveform plot is not overlaid
plt.semilogy(f, Pxx)
plt.xlabel("Frequency (Hz)")
plt.savefig("spectrum.png")
# Spikes above 10 kHz often indicate aliasing artefacts
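Boundary pops also show up as abnormally large sample-to-sample jumps, so a first-difference scan (a simple heuristic, not from any library) can locate them automatically without listening through the whole file:

```python
import numpy as np

def find_clicks(wav, threshold=0.5):
    """Indices where consecutive samples jump by more than `threshold` —
    a crude detector for pops and segment-boundary discontinuities."""
    wav = np.asarray(wav, dtype=np.float32)
    jumps = np.abs(np.diff(wav))
    return np.flatnonzero(jumps > threshold)

# Synthetic example: a smooth 220 Hz tone with one injected discontinuity
sr = 24000
t = np.arange(sr) / sr
wav = 0.3 * np.sin(2 * np.pi * 220 * t)
wav[12000] += 0.9  # artificial pop
print(find_clicks(wav))  # [11999 12000]
```

Run this on the raw model output and on the final pipeline output: clicks present only in the latter point to a post-processing bug rather than the model.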
For production TTS on your GPU server, apply normalisation and crossfading as standard pipeline stages. The Coqui TTS hosting page has ready configurations. Check the tutorials section for pipeline architecture, benchmarks for GPU throughput, and the PyTorch guide for environment setup. The Whisper hosting page covers the complementary speech-to-text side.
Clean TTS on Dedicated GPUs
Eliminate audio artefacts with properly configured GPU servers. GigaGPU hardware delivers glitch-free synthesis.
Browse GPU Servers