GPU Server for 100 Concurrent Voice Agent Users: Sizing Guide
Hardware recommendations for running a real-time STT + TTS pipeline with 100 simultaneous users on dedicated GPU servers.
Call Centre Scale Without Call Centre Pricing
Running 100 concurrent voice agents through API providers costs £4,500-£12,000/month. What most teams do not realise is that a pair of RTX 5080 GPUs at £218/month total handles the same workload with better latency, because the voice data never leaves your data centre. That is a 95-98% cost reduction, with improved privacy as a bonus.
Recommended Configurations
| GPU | VRAM | Monthly Cost | Recommended Models | Notes |
|---|---|---|---|---|
| RTX 5080 | 16 GB | £109/mo | Whisper + XTTS concurrent | Low-latency voice pipeline |
| RTX 5090 | 32 GB | £179/mo | Full pipeline: STT + LLM + TTS | All-in-one voice agent |
Scaling Voice Pipelines to 100 Users
Each pipeline stage needs its own VRAM allocation: Whisper Large (~3 GB), LLM (4-8 GB), TTS (2-4 GB). At 100 concurrent sessions, you are managing roughly 30-40 active GPU inference tasks at any given moment, because voice conversations naturally stagger speaking and listening phases.
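The per-stage estimates above can be turned into a quick budgeting check. This is a minimal sketch using the figures from the text; the stage names, headroom value, and `fits_on_gpu` helper are illustrative assumptions, not a real sizing tool.

```python
# Rough VRAM budgeting sketch for the pipeline-split layout.
# Per-stage figures are the estimates from the text; adjust for your models.

STAGE_VRAM_GB = {
    "whisper_large_stt": 3.0,  # Whisper Large weights + activations
    "llm": 8.0,                # upper end of the 4-8 GB estimate
    "xtts_tts": 4.0,           # upper end of the 2-4 GB estimate
}

def fits_on_gpu(stages, gpu_vram_gb, headroom_gb=2.0):
    """Check whether a set of stages fits on one GPU, leaving headroom
    for KV caches, batching buffers, and CUDA context overhead."""
    used = sum(STAGE_VRAM_GB[s] for s in stages)
    return used + headroom_gb <= gpu_vram_gb, used

# Pipeline split across two 16 GB RTX 5080s:
ok_a, used_a = fits_on_gpu(["whisper_large_stt", "llm"], 16)
ok_b, used_b = fits_on_gpu(["xtts_tts"], 16)
print(f"GPU A (STT+LLM): {used_a} GB used, fits={ok_a}")
print(f"GPU B (TTS):     {used_b} GB used, fits={ok_b}")
```

Even at the upper ends of the estimates, both halves of the split fit on 16 GB cards with headroom to spare.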
Two GPUs is the minimum production configuration at this scale. Dedicate one to STT+LLM and the other to TTS, or split sessions evenly with session affinity. Either architecture maintains the critical sub-500ms latency threshold.
Architectural Decisions at 100 Users
- Pipeline splitting vs session splitting: Splitting by pipeline stage (STT on GPU 1, TTS on GPU 2) gives optimal VRAM usage. Splitting by session (users 1-50 on GPU 1, 51-100 on GPU 2) gives simpler routing. Both work; choose based on your ops team’s comfort.
- Health monitoring: At 100 users, a GPU failure impacts enough people to warrant automated failover. Run health checks every 5 seconds and route traffic to the surviving node within 10 seconds.
- Audio codec optimisation: Use Opus codec for network transport. It cuts bandwidth by 80% compared to raw PCM without meaningful quality loss, reducing CPU overhead for audio handling.
- Load shedding strategy: Define what happens at 110% capacity. Queueing with estimated wait times is better than dropping calls or degrading quality silently.
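The health-monitoring guideline above (probe every 5 seconds, fail over within 10) amounts to treating two consecutive missed checks as node death. Here is a minimal sketch under that assumption; `GpuNode`, `check_once`, and the `route_away` callback are hypothetical names, and `probe` is a placeholder for whatever real check you use (e.g. hitting the node's inference endpoint).

```python
import time

CHECK_INTERVAL_S = 5     # probe every 5 seconds, per the guideline above
MAX_MISSED_CHECKS = 2    # two consecutive misses ~= failover within 10 s

class GpuNode:
    def __init__(self, name, probe):
        self.name = name
        self.probe = probe   # callable returning True if the node responds
        self.missed = 0
        self.healthy = True

def check_once(nodes, route_away):
    """One health-check pass: count consecutive misses, fail over dead nodes."""
    for node in nodes:
        node.missed = 0 if node.probe() else node.missed + 1
        if node.healthy and node.missed >= MAX_MISSED_CHECKS:
            node.healthy = False
            route_away(node)  # drain this node's sessions to survivors

def run_health_loop(nodes, route_away):
    while True:
        check_once(nodes, route_away)
        time.sleep(CHECK_INTERVAL_S)
```

Keeping the single-pass logic separate from the loop also makes the failover behaviour easy to test without waiting out real intervals.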
Scaling Beyond 100
A multi-GPU setup with 2-3 nodes is the right approach at 100 users. Use load balancing with session affinity to ensure consistent conversation quality. As you grow toward 250 users, add nodes linearly — each additional RTX 5080 at £109/month supports roughly 40-50 more concurrent sessions.
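Session affinity just means a given conversation always lands on the same node, so its audio state never hops between GPUs mid-call. A minimal sketch, assuming a static node pool and hash-based routing (the node names and `node_for_session` helper are illustrative):

```python
import hashlib

NODES = ["gpu-node-1", "gpu-node-2", "gpu-node-3"]  # hypothetical node pool

def node_for_session(session_id: str, nodes=NODES) -> str:
    """Stable session affinity: the same session ID always maps to the
    same node, keeping a conversation's state on one GPU."""
    digest = hashlib.sha256(session_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note that plain modulo hashing reshuffles most sessions when the pool size changes; if you add nodes frequently, consistent hashing limits that churn to the sessions that actually need to move.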
GigaGPU supports seamless multi-server deployments. Architect for horizontal scaling from the start and you will never need to re-platform.
Annual Savings at 100 Users
API costs for 100 concurrent voice agents: £4,500-£12,000/month. Dedicated GPU cost: £109-£218/month for 1-2 nodes. Against the two-node £218/month configuration (the minimum recommended at this scale), annual savings come to £51,384-£141,384. At this scale, the cost of not self-hosting is itself a significant line item on your P&L.
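The savings range above is straightforward to reproduce from the figures in this guide, taking the two-node dedicated cost as the baseline:

```python
# Reproducing the annual-savings range (all figures in GBP, from the text).
api_monthly_low, api_monthly_high = 4_500, 12_000
dedicated_monthly = 218  # two RTX 5080 nodes, the minimum at 100 users

savings_low = (api_monthly_low - dedicated_monthly) * 12
savings_high = (api_monthly_high - dedicated_monthly) * 12
print(savings_low, savings_high)  # 51384 141384
```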
Deploy Production Voice Infrastructure
100 concurrent voice agents on dedicated GPUs. Flat monthly pricing starting at £109/month, no per-minute billing, complete data privacy.