Fish Speech v1.5 is a text-to-speech model from Fish Audio with zero-shot voice cloning: given 10-30 seconds of reference audio, it synthesises new speech in that voice. On our dedicated GPU hosting it fits an 8 GB card comfortably.
VRAM
Roughly 4-6 GB at FP16, so it runs on any card from the RTX 4060 (8 GB) up.
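To check a given machine against that figure, a minimal sketch along these lines can read the total memory of each GPU from `nvidia-smi`. The 6 GB threshold is simply the upper end of the estimate above, treated as 6 x 1024 MiB; the function names are illustrative and not part of Fish Speech.

```python
import subprocess

# Upper end of the ~4-6 GB estimate, expressed in MiB (assumption: GB read as GiB).
REQUIRED_MIB = 6 * 1024


def parse_total_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_memory_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_total_memory_mib(result.stdout)


def has_headroom(total_mib: int, required_mib: int = REQUIRED_MIB) -> bool:
    """Return True if the card's total memory covers the estimated requirement."""
    return total_mib >= required_mib
```

Calling `query_total_memory_mib()` on the host and passing each value to `has_headroom` gives a quick yes/no per card. It compares total memory only, so anything else already running on the GPU has to be accounted for separately.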
Deployment
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
pip install -e .
python tools/api_server.py \
    --llama-checkpoint-path checkpoints/fish-speech-1.5 \
    --decoder-checkpoint-path checkpoints/fish-speech-1.5/decoder.pth \
    --decoder-config-name firefly_gan_vq
The API exposes an HTTP endpoint that accepts reference audio plus target text.
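As a rough illustration of what a client might look like, the sketch below builds a JSON POST request using only the Python standard library. The URL (`http://127.0.0.1:8080/v1/tts`), the JSON body format, and the field names (`text`, `references`, `audio`, `format`) are assumptions that this article does not confirm; check them against the API server's request schema in your checkout before relying on them.

```python
import base64
import json
import urllib.request

# Assumed host, port, and route: confirm against your running server.
API_URL = "http://127.0.0.1:8080/v1/tts"


def build_tts_request(
    reference_audio: bytes,
    reference_text: str,
    target_text: str,
    url: str = API_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST carrying reference audio and target text."""
    payload = {
        # All field names below are assumptions, not confirmed by this article.
        "text": target_text,
        "references": [
            {
                # Raw audio bytes are base64-encoded so they survive JSON.
                "audio": base64.b64encode(reference_audio).decode("ascii"),
                # Transcript of what is said in the reference clip.
                "text": reference_text,
            }
        ],
        "format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesise(request: urllib.request.Request, timeout: float = 120.0) -> bytes:
    """Send the request and return the raw response body (the audio bytes)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

In use, you would read the reference clip with `open(path, "rb").read()`, pass it to `build_tts_request` with its transcript and the target text, then write the bytes returned by `synthesise` to an output file. Keeping request construction separate from sending makes the payload easy to inspect without a running server.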
Cloning Workflow
- Record 10-30 seconds of the target speaker reading varied text
- POST the reference audio and the target text to the Fish Speech API
- Receive synthesised audio in the cloned voice
Quality improves with cleaner reference audio. Room echo, background noise, and short references (under 10 seconds) all degrade cloning fidelity.
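The length guideline is easy to check before uploading. The sketch below, assuming the reference is an uncompressed PCM WAV file, uses the standard-library `wave` module to measure the clip and compare it with the 10-30 second range. The function names and return labels are illustrative, and it says nothing about echo or background noise.

```python
import wave

# Bounds taken from the 10-30 second guideline for reference audio.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path: str) -> str:
    """Classify a reference clip by length only (no noise or echo analysis)."""
    duration = wav_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too short"
    if duration > MAX_SECONDS:
        return "longer than needed"
    return "ok"
```

A clip flagged as "too short" is worth re-recording before it is sent for cloning; one flagged as "longer than needed" can be trimmed to its cleanest stretch.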
Ethics
Voice cloning technology is easily abused. For UK-facing products, get documented consent from anyone whose voice you clone. Do not synthesise voices of public figures or deceased persons without permission. Add watermarking to synthetic audio where possible. These are not legal requirements in every jurisdiction but represent basic professional conduct.
Self-Hosted Voice Cloning
Fish Speech on UK dedicated GPUs with clear operational logging.
Browse GPU Servers
See also: RVC voice cloning and Parler-TTS.