Quick Verdict
A TTS API that buckles under concurrent load is worse than no API at all. Coqui TTS handles 18.1 requests per second versus Bark’s 7.1 — a 2.5x throughput advantage that means Coqui serves the same traffic volume with 60% fewer GPU instances. On a dedicated GPU server, Coqui is the production-grade choice for TTS API serving.
Bark’s autoregressive architecture generates more expressive audio, but its token-by-token decoding imposes a hard ceiling on throughput. For APIs where reliability and capacity matter more than vocal expressiveness, Coqui wins decisively.
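The instance-count arithmetic behind that verdict can be checked directly. A minimal sketch using the benchmark throughput figures below; the 100 req/s target load is a hypothetical example, not a figure from our tests:

```python
import math

def instances_needed(target_rps: float, model_rps: float, headroom: float = 1.0) -> int:
    """GPU instances required to serve target_rps, given each instance
    sustains model_rps (headroom > 1 reserves spare capacity)."""
    return math.ceil(target_rps * headroom / model_rps)

target = 100.0                            # hypothetical sustained API load, req/s
coqui = instances_needed(target, 18.1)    # measured Coqui throughput per RTX 3090
bark = instances_needed(target, 7.1)      # measured Bark throughput per RTX 3090
print(coqui, bark, f"{1 - coqui / bark:.0%} fewer instances")
```

At this load Coqui needs 6 instances to Bark’s 15, which is where the 60% saving comes from; at other traffic levels the ceiling rounding shifts the exact percentage slightly.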
Full data below. More at the GPU comparisons hub.
Specs Comparison
Coqui’s XTTS-v2 architecture separates speech encoding from generation, allowing more efficient parallel processing. Bark’s fully autoregressive design processes every audio token sequentially.
| Specification | Coqui TTS | Bark TTS |
|---|---|---|
| Parameters | ~80M (XTTS-v2) | ~350M |
| Architecture | GPT + Decoder | GPT-style autoregressive |
| Context Length | 24s audio | 15s audio |
| VRAM (FP16) | 2.5 GB | 4 GB |
| VRAM (INT4) | N/A | N/A |
| Licence | MPL 2.0 | MIT |
Guides: Coqui TTS VRAM requirements and Bark TTS VRAM requirements.
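The throughput gap has an architectural root: a fully autoregressive model must emit audio tokens one at a time, while a split encoder/decoder can process many per step. A back-of-envelope sketch of that effect — every token rate, step time, and parallel width here is hypothetical, chosen only to illustrate the scaling:

```python
def autoregressive_time_ms(n_tokens: int, step_ms: float) -> float:
    """Fully sequential decoding: each token waits for the previous one."""
    return n_tokens * step_ms

def parallel_decode_time_ms(n_tokens: int, step_ms: float, width: int) -> float:
    """A decoder that emits `width` tokens per step (ceil division)."""
    return -(-n_tokens // width) * step_ms

tokens = 860  # e.g. 10 s of audio at a hypothetical ~86 tokens/s codec rate
print(autoregressive_time_ms(tokens, 5.0))       # 4300.0 ms, serial
print(parallel_decode_time_ms(tokens, 5.0, 16))  # 270.0 ms, 16-wide
```

Even with identical per-step cost, the serial decoder’s latency grows linearly with audio length, which is what caps Bark’s requests-per-second under concurrent load.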
API Throughput Benchmark
Tested on an NVIDIA RTX 3090 under sustained concurrent API load. See our benchmark tool.
| Model (FP16) | Requests/sec | p50 Latency (ms) | p99 Latency (ms) | VRAM Used |
|---|---|---|---|---|
| Coqui TTS | 18.1 | 127 | 352 | 2.5 GB |
| Bark TTS | 7.1 | 105 | 399 | 4 GB |
Bark’s slightly lower p50 (105 ms versus 127 ms) reflects faster initialisation for individual requests, but its p99 (399 ms) is worse than Coqui’s (352 ms) and its total throughput is 2.5x lower. Under load, Coqui maintains more consistent latency. See our best GPU for LLM inference guide.
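A minimal sketch of how requests/sec, p50, and p99 figures like these are gathered. The synthesis call is a stub with arbitrary sleep times; a real measurement would wrap the TTS endpoint under test:

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def synthesize(text: str) -> bytes:
    """Stub TTS request — replace with a real call to the API under test."""
    time.sleep(random.uniform(0.005, 0.02))  # simulated service time
    return b""

def load_test(n_requests: int = 200, concurrency: int = 16) -> dict:
    latencies = []

    def timed(i: int) -> None:
        t0 = time.perf_counter()
        synthesize(f"request {i}")
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    t_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(n_requests)))
    elapsed = time.perf_counter() - t_start
    return {
        "rps": n_requests / elapsed,
        "p50_ms": statistics.median(latencies),
        "p99_ms": statistics.quantiles(latencies, n=100)[98],
    }

print(load_test())
```

Sustained concurrency matters: a model can post a good p50 on isolated requests (as Bark does) yet still fall behind on throughput and p99 once requests queue behind one another.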
See also: Coqui TTS vs Bark TTS for Chatbot / Conversational AI for a related comparison.
See also: Coqui TTS vs Kokoro TTS for API Serving (Throughput) for a related comparison.
Cost Analysis
Coqui’s 2.5x throughput advantage translates directly into 2.5x fewer GPU instances needed for the same API traffic volume.
| Cost Factor | Coqui TTS | Bark TTS |
|---|---|---|
| GPU Required | RTX 3090 (24 GB) | RTX 3090 (24 GB) |
| VRAM Used | 2.5 GB | 4 GB |
| Real-time Factor | 5.7x | 9.1x |
| Cost/hr Audio Processed | £0.13 | £0.15 |
See our cost calculator.
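One way a cost-per-audio-hour figure can be derived, assuming the real-time factor means hours of audio produced per GPU-hour and using a hypothetical £0.74/hr RTX 3090 rental price — both are assumptions for illustration, not figures from this article:

```python
def cost_per_audio_hour(gpu_price_per_hour: float, realtime_factor: float) -> float:
    """£ spent per hour of audio produced, if the model generates
    `realtime_factor` hours of audio per GPU-hour."""
    return gpu_price_per_hour / realtime_factor

# Hypothetical rental price; plug in your own provider's rate.
print(round(cost_per_audio_hour(0.74, 5.7), 2))
```

Under those assumptions the Coqui row works out to roughly £0.13 per audio-hour; your actual figure depends on the rental price you pay and the real-time factor you measure on your own workload.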
Recommendation
Choose Coqui TTS for production TTS APIs. Its 2.5x higher throughput, tighter tail latency, and lower VRAM footprint make it the clear choice for any endpoint that needs to serve concurrent users reliably.
Choose Bark TTS only for niche APIs where expressive audio features (laughter, emotion, music interjections) are a core product requirement and throughput is secondary.
Serve on dedicated GPU servers for consistent TTS API performance.
Deploy the Winner
Run Coqui TTS or Bark TTS on bare-metal GPU servers with full root access, no shared resources, and no token limits.
Browse GPU Servers