Kokoro TTS Overview
Kokoro is a lightweight, high-quality text-to-speech model designed for low-latency inference. With under 100M parameters, it is one of the most efficient TTS models available for self-hosting on a dedicated GPU server. Kokoro TTS hosting is accessible on even the most budget-friendly GPUs, making it ideal for production deployments where latency and cost matter.
VRAM Requirements by Precision
| Precision | Model Weights | Generation Overhead | Total VRAM |
|---|---|---|---|
| FP32 | ~0.4 GB | ~0.3 GB | ~0.7 GB |
| FP16 / BF16 | ~0.2 GB | ~0.2 GB | ~0.4 GB |
| INT8 | ~0.1 GB | ~0.2 GB | ~0.3 GB |
Kokoro uses under 0.5 GB at FP16, making it the lightest TTS model in common use. This means it can run alongside virtually any other model without adding meaningful VRAM pressure. For context, Bark TTS uses 12-15x more VRAM at the same precision.
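The table's totals are simple sums of weights plus overhead. As a back-of-envelope sketch (using the approximate figures above, not measured values), the arithmetic looks like this:

```python
# Approximate Kokoro VRAM figures from the table above (GB).
# precision: (model weights, generation overhead)
KOKORO_VRAM_GB = {
    "fp32": (0.4, 0.3),
    "fp16": (0.2, 0.2),
    "int8": (0.1, 0.2),
}

BARK_FP16_GB = 6.0  # approximate Bark FP16 footprint, from the comparison table below


def total_vram_gb(precision: str) -> float:
    """Total VRAM = model weights + generation overhead."""
    weights, overhead = KOKORO_VRAM_GB[precision]
    return round(weights + overhead, 1)


if __name__ == "__main__":
    print(total_vram_gb("fp16"))                        # 0.4 GB
    print(round(BARK_FP16_GB / total_vram_gb("fp16")))  # Bark needs ~15x more
```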
Latency and Throughput Scaling
| GPU | Precision | Latency (10s clip) | Real-Time Factor |
|---|---|---|---|
| RTX 3050 | FP16 | 0.8s | 12.5x |
| RTX 4060 | FP16 | 0.5s | 20x |
| RTX 4060 Ti | FP16 | 0.4s | 25x |
| RTX 3090 | FP16 | 0.3s | 33x |
Kokoro generates speech at 12-33x real-time across these cards, exceeding 20x on mid-range GPUs like the RTX 4060, which makes it suitable for streaming synthesis and real-time voice applications. Check the TTS latency benchmarks for current data.
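The real-time factor in the table is simply audio duration divided by generation latency. A minimal sketch of that calculation, verified against the RTX 4060 row:

```python
def real_time_factor(clip_seconds: float, latency_seconds: float) -> float:
    """RTF = audio duration / time to generate it; >1 means faster than real time."""
    return clip_seconds / latency_seconds


# RTX 4060 row from the table: 10s clip generated in 0.5s
print(real_time_factor(10.0, 0.5))  # 20.0
```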
GPU Recommendations
| GPU | VRAM | Kokoro Capability | Best Use Case |
|---|---|---|---|
| RTX 3050 | 6 GB | FP16, 5.5 GB free for other models | Budget TTS + small LLM |
| RTX 4060 | 8 GB | FP16, 7.5 GB free | TTS + 7B LLM pipeline |
| RTX 4060 Ti | 16 GB | FP16, 15.5 GB free | TTS + larger LLM |
| RTX 3090 | 24 GB | FP16, 23.5 GB free | Multi-model pipelines |
Kokoro is so lightweight that GPU selection should be based on whatever other models you plan to co-host, not on Kokoro itself.
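The "free VRAM" column follows from a conservative 0.5 GB reservation for Kokoro at FP16. A small budgeting helper along those lines (the 0.5 GB figure comes from this guide; everything else is just subtraction):

```python
def free_vram_gb(gpu_vram_gb: float, kokoro_reservation_gb: float = 0.5) -> float:
    """VRAM left for co-hosted models after reserving 0.5 GB for Kokoro at FP16."""
    return gpu_vram_gb - kokoro_reservation_gb


# Matches the table rows above:
print(free_vram_gb(8.0))   # RTX 4060    -> 7.5
print(free_vram_gb(24.0))  # RTX 3090    -> 23.5
```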
Comparison with Bark and XTTS-v2
Kokoro, Bark, and XTTS-v2 represent three different TTS design philosophies:
| Model | FP16 VRAM | Speed (RTF) | Voice Cloning | Non-Speech Audio |
|---|---|---|---|---|
| Kokoro | ~0.4 GB | 20-33x | No | No |
| XTTS-v2 | ~2-4 GB | 3-8x | Yes | No |
| Bark | ~6 GB | 0.8-1.5x | Limited | Yes |
Choose Kokoro for maximum speed and minimum resource usage. Choose XTTS-v2 for voice cloning. Choose Bark for creative audio generation including music and sound effects.
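The selection guidance above can be encoded as a simple decision rule (a sketch of this guide's recommendations, not an official tool):

```python
def pick_tts(need_voice_cloning: bool, need_sound_effects: bool) -> str:
    """Decision rule from the comparison above: Bark for creative audio,
    XTTS-v2 for cloning, Kokoro for everything speed- or cost-sensitive."""
    if need_sound_effects:
        return "bark"
    if need_voice_cloning:
        return "xtts-v2"
    return "kokoro"


print(pick_tts(need_voice_cloning=False, need_sound_effects=False))  # kokoro
```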
Deployment Recommendations
Kokoro is ideal for latency-sensitive applications like conversational AI, real-time assistants, and high-throughput batch TTS. Deploy it alongside an LLM for end-to-end text-to-speech pipelines. On a single RTX 4060, you can run Kokoro plus a quantised 7B LLM plus Whisper for a complete voice assistant stack.
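A rough VRAM budget for that voice-assistant stack on an RTX 4060. Kokoro's 0.5 GB reservation comes from this guide; the 4-bit 7B LLM (~4.5 GB) and Whisper (~1 GB) figures are illustrative assumptions, not benchmarks:

```python
# Hypothetical VRAM budget for a Kokoro + LLM + Whisper stack (GB).
STACK_GB = {
    "kokoro_fp16": 0.5,    # from this guide
    "llm_7b_4bit": 4.5,    # assumption: typical 4-bit quantised 7B footprint
    "whisper": 1.0,        # assumption: small Whisper variant
}


def stack_fits(gpu_vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """Check the stack fits, leaving headroom for activations and CUDA context."""
    return sum(STACK_GB.values()) + headroom_gb <= gpu_vram_gb


print(stack_fits(8.0))  # RTX 4060: 6.0 GB stack + 1 GB headroom fits in 8 GB
```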
Use the GPU comparisons tool to evaluate hardware options. Estimate costs with the cost calculator. Browse all TTS guides in the model guides section.
Host Kokoro TTS on Dedicated GPUs
Run ultra-fast text-to-speech with Kokoro on dedicated GPU servers. Co-host with LLMs and speech models on a single card.
Browse GPU Servers