
TTS Audio Artifacts: Fix Crackling/Distortion

Fix crackling, popping, distortion, and other audio artifacts in GPU-accelerated TTS output. Covers sample rate mismatches, buffer underruns, vocoder issues, and post-processing corrections.

Your TTS Output Has Crackling, Pops, or Distortion

The synthesised speech from your TTS model contains audible crackling between words, periodic pops at sentence boundaries, metallic distortion on certain vowels, or a persistent high-frequency whine behind the voice. These artefacts make the output unusable for production applications. The text-to-speech model itself may be generating clean mel spectrograms, but something in the pipeline between synthesis and final WAV output is introducing corruption on your GPU server.

Fix 1: Sample Rate Mismatches

A sample rate mismatch is the most common cause of crackling. If any component in the pipeline expects a different sample rate than it receives, the waveform is distorted:

# Check what sample rate the model outputs
from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
print(f"Model sample rate: {tts.synthesizer.output_sample_rate}")
# XTTS v2 outputs at 24000 Hz

# Wrong: saving at mismatched sample rate
import soundfile as sf
wav = tts.tts("Hello world", speaker_wav="ref.wav", language="en")
sf.write("output.wav", wav, samplerate=44100)  # WRONG — causes pitch shift + artifacts

# Correct: match the model's native sample rate
sf.write("output.wav", wav, samplerate=24000)  # CORRECT

# If you need 44100 Hz output, resample properly after synthesis
import librosa
wav_resampled = librosa.resample(wav, orig_sr=24000, target_sr=44100)
sf.write("output_44k.wav", wav_resampled, samplerate=44100)
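If librosa is not available, a numpy-only linear-interpolation resampler works as a quick fallback. This helper is a sketch, not part of any TTS API, and linear interpolation can alias on high-frequency content where librosa's band-limited resampling would not:

```python
import numpy as np

def resample_linear(wav, orig_sr, target_sr):
    """Resample 1-D audio by linear interpolation (quick fallback;
    band-limited resampling, e.g. librosa, avoids aliasing)."""
    wav = np.asarray(wav, dtype=np.float32)
    duration = len(wav) / orig_sr
    n_out = int(round(duration * target_sr))
    # Output sample positions expressed in input-sample units
    x_out = np.linspace(0, len(wav) - 1, n_out)
    x_in = np.arange(len(wav))
    return np.interp(x_out, x_in, wav).astype(np.float32)

# 1 second of a 440 Hz tone at 24 kHz becomes 1 second at 44.1 kHz
wav_24k = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
wav_44k = resample_linear(wav_24k, 24000, 44100)
print(len(wav_44k))  # 44100
```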

Fix 2: Audio Clipping and Amplitude Overflow

TTS models sometimes output samples that exceed the [-1.0, 1.0] range, causing harsh clipping distortion:

import numpy as np

wav = tts.tts("Test sentence with emphasis!", speaker_wav="ref.wav", language="en")
wav = np.array(wav)

# Check for clipping
peak = np.max(np.abs(wav))
print(f"Peak amplitude: {peak:.4f}")
if peak > 1.0:
    print(f"CLIPPING DETECTED — {np.sum(np.abs(wav) > 1.0)} samples clipped")

# Fix: normalise to safe amplitude
def safe_normalize(audio, target_peak=0.9):
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (target_peak / peak)
    return audio

wav_clean = safe_normalize(wav)

# Alternative: soft-clip instead of hard-clip (preserves dynamics)
def soft_clip(audio, threshold=0.9):
    return np.tanh(audio / threshold) * threshold

wav_soft = soft_clip(wav)

Fix 3: Pops at Sentence and Segment Boundaries

When TTS generates long text in segments, discontinuities at segment boundaries produce audible pops:

import numpy as np

def crossfade_segments(segments, sr=24000, fade_ms=15):
    """Join audio segments with a short crossfade to eliminate boundary pops."""
    fade_samples = int(sr * fade_ms / 1000)
    result = np.array(segments[0], dtype=np.float32)
    for seg in segments[1:]:
        seg = np.array(seg, dtype=np.float32)
        # Guard against segments shorter than the fade window
        n = min(fade_samples, len(result), len(seg))
        if n < 2:
            result = np.concatenate([result, seg])
            continue
        # Create fade curves
        fade_out = np.linspace(1.0, 0.0, n)
        fade_in = np.linspace(0.0, 1.0, n)
        # Apply crossfade
        result[-n:] *= fade_out
        seg[:n] *= fade_in
        # Overlap-add
        overlap = result[-n:] + seg[:n]
        result = np.concatenate([result[:-n], overlap, seg[n:]])
    return result

# Generate sentences separately, then crossfade
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
wavs = [tts.tts(s, speaker_wav="ref.wav", language="en") for s in sentences]
final = crossfade_segments(wavs)
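If crossfading compresses pacing more than you want, a simpler alternative is to ramp each segment's edges to zero and join them with a short silence. A pure-numpy sketch (the `gap_ms` and `edge_fade_ms` values are tuning assumptions, not standard settings):

```python
import numpy as np

def join_with_silence(segments, sr=24000, gap_ms=120, edge_fade_ms=5):
    """Join segments with a short silence; tiny edge fades remove the
    step discontinuity that causes a click at each boundary."""
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    fade_n = int(sr * edge_fade_ms / 1000)
    out = []
    for i, seg in enumerate(segments):
        seg = np.array(seg, dtype=np.float32)
        if fade_n > 0 and len(seg) > 2 * fade_n:
            seg[:fade_n] *= np.linspace(0.0, 1.0, fade_n)
            seg[-fade_n:] *= np.linspace(1.0, 0.0, fade_n)
        out.append(seg)
        if i < len(segments) - 1:
            out.append(gap)
    return np.concatenate(out)

# Three 100 ms segments joined with 120 ms gaps
segs = [np.ones(2400, dtype=np.float32)] * 3
joined = join_with_silence(segs)
```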

Fix 4: GPU Precision-Related Artefacts

FP16 inference can introduce subtle numerical artefacts in vocoder output, especially at low amplitudes:

# If using manual model loading, check precision
import torch

# Problem: FP16 vocoder introduces quantisation noise
model = model.half().cuda()  # May cause artifacts in audio output

# Fix: keep the vocoder in FP32 even when the acoustic model uses FP16
# Most TTS frameworks handle this automatically, but if using custom code:
acoustic_model = acoustic_model.half().cuda()  # FP16 is fine here
vocoder = vocoder.float().cuda()  # Keep vocoder in FP32

# In Coqui TTS, this is handled internally but you can verify:
# The synthesizer.vocoder_model should be in float32
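To see why FP16 matters at low amplitudes, you can measure the quantisation noise of a float16 round-trip directly. This numpy-only illustration is independent of any TTS framework; the -40 dBFS level is chosen to show where float16's ~11-bit effective precision starts to bite:

```python
import numpy as np

# Quiet 440 Hz tone at roughly -40 dBFS
sr = 24000
t = np.arange(sr) / sr
quiet = (0.01 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# Round-trip through float16, as an FP16 vocoder output path would
roundtrip = quiet.astype(np.float16).astype(np.float32)
noise = roundtrip - quiet
snr_db = 10 * np.log10(np.sum(quiet**2) / np.sum(noise**2))
print(f"FP16 round-trip SNR: {snr_db:.1f} dB")
```

The same signal stored in float32 would be effectively noise-free; the measurable SNR floor here is the quantisation noise that can become audible as hiss or graininess in quiet passages.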

Fix 5: Streaming Buffer Underruns

For real-time TTS streaming, buffer underruns cause gaps and clicks:

# Ensure adequate buffer size for streaming
import pyaudio

CHUNK_SIZE = 4096  # Increase if hearing clicks (try 8192)
SAMPLE_RATE = 24000

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32,
                channels=1,
                rate=SAMPLE_RATE,
                output=True,
                frames_per_buffer=CHUNK_SIZE)

# Pre-buffer before starting playback to absorb generation jitter
prebuffer_chunks = 3
buffer = []
for chunk in tts_streaming_generator():
    buffer.append(chunk)
    if len(buffer) >= prebuffer_chunks:
        stream.write(buffer.pop(0).tobytes())
# Flush whatever remains after the generator finishes
for chunk in buffer:
    stream.write(chunk.tobytes())
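Playback is smoothest when every `stream.write` matches `frames_per_buffer`, but TTS generators typically yield variable-length chunks. A small re-blocking helper bridges the two (a sketch; `tts_streaming_generator` is whatever your model provides):

```python
import numpy as np

def reblock(chunks, frame_size=4096):
    """Regroup variable-length float32 chunks into fixed-size frames,
    zero-padding the final partial frame."""
    pending = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        pending = np.concatenate([pending, np.asarray(chunk, dtype=np.float32)])
        while len(pending) >= frame_size:
            yield pending[:frame_size]
            pending = pending[frame_size:]
    if len(pending):
        yield np.pad(pending, (0, frame_size - len(pending)))

# Uneven chunks come out as uniform 4096-sample frames
frames = list(reblock([np.ones(1000), np.ones(5000), np.ones(300)]))
print([len(f) for f in frames])  # [4096, 4096]
```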

Diagnosing Artefact Source

Isolate whether artefacts come from the model or the output pipeline:

# Step 1: Save raw model output without any processing
import numpy as np
raw_wav = np.array(tts.tts("Test", speaker_wav="ref.wav", language="en"))
np.save("raw_output.npy", raw_wav)

# Step 2: Inspect the waveform for anomalies
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 3))
plt.plot(raw_wav[:48000])  # First 2 seconds at 24 kHz
plt.title("Raw TTS waveform")
plt.savefig("waveform.png")

# Step 3: Check the spectrum for unexpected frequencies
from scipy import signal
f, Pxx = signal.welch(raw_wav, fs=24000, nperseg=2048)
plt.figure(figsize=(15, 3))
plt.semilogy(f, Pxx)
plt.xlabel("Frequency (Hz)")
plt.savefig("spectrum.png")
# Spikes above 10 kHz often indicate aliasing artefacts
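A useful fourth step (an addition to the checklist above, not part of Coqui's tooling): pops show up as outsized sample-to-sample jumps, so a first-difference scan localises them for you:

```python
import numpy as np

def find_clicks(wav, threshold=0.5):
    """Return sample indices where the waveform jumps by more than
    `threshold` between consecutive samples — candidate pop locations."""
    wav = np.asarray(wav, dtype=np.float32)
    jumps = np.abs(np.diff(wav))
    return np.flatnonzero(jumps > threshold)

# Synthetic example: a silent signal with one injected discontinuity
wav = np.zeros(24000, dtype=np.float32)
wav[12000] = 0.9  # single-sample spike = audible click
print(find_clicks(wav))  # [11999 12000]
```

Divide the returned indices by the sample rate to get timestamps, then check whether they line up with your segment boundaries (pointing at Fix 3) or occur mid-word (pointing at the vocoder or precision).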

For production TTS on your GPU server, apply normalisation and crossfading as standard pipeline stages. The Coqui TTS hosting page has ready configurations. Check the tutorials section for pipeline architecture, benchmarks for GPU throughput, and the PyTorch guide for environment setup. The Whisper hosting page covers the complementary speech-to-text side.

Clean TTS on Dedicated GPUs

Eliminate audio artefacts with properly configured GPU servers. GigaGPU hardware delivers glitch-free synthesis.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
