Bandwidth in AI Inference
Network bandwidth is rarely the bottleneck in AI inference, but underestimating it can cause latency spikes for end users and slow model downloads during deployment. On a dedicated GPU server, bandwidth affects two phases: the initial setup (downloading model weights from Hugging Face or other registries) and ongoing inference (serving API responses to clients). Understanding bandwidth needs ensures a smooth deployment.
Bandwidth by Workload Type
| Workload | Payload Size | Bandwidth per Request | Notes |
|---|---|---|---|
| LLM chat (500 tokens out) | ~2-4 KB | Negligible | Streamed token by token |
| LLM batch (10K tokens out) | ~40-80 KB | Negligible | Still text-only |
| Image generation (1024×1024 PNG) | ~1-3 MB | ~1-3 MB | Single image response |
| Image generation (batch of 4) | ~4-12 MB | ~4-12 MB | Multiple images per request |
| Speech-to-text (upload 1h audio) | ~60-120 MB input | ~60-120 MB | Upload-heavy workload |
| TTS (10s audio output) | ~300-600 KB | ~300-600 KB | WAV or compressed output |
| Video generation (5s clip) | ~5-20 MB | ~5-20 MB | Depends on resolution/codec |
LLM text inference uses almost no bandwidth. Image and video generation are more bandwidth-intensive but still modest by server standards. Speech-to-text (Whisper) workloads are upload-heavy because raw audio files can be large.
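The "negligible" entries above follow from a back-of-envelope calculation. A minimal sketch, assuming illustrative figures of ~4 bytes per token of UTF-8 text and ~50 tokens/sec generation speed (neither figure is from the table; real values vary by tokenizer and model):

```python
# Back-of-envelope bitrate for a streamed LLM chat response.
# BYTES_PER_TOKEN and TOKENS_PER_SEC are assumed, illustrative values.
BYTES_PER_TOKEN = 4
TOKENS_PER_SEC = 50

def stream_kbps(tokens_per_sec: float = TOKENS_PER_SEC,
                bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Sustained bitrate of one active token stream, in kilobits/sec."""
    return tokens_per_sec * bytes_per_token * 8 / 1_000

def response_kb(tokens: int, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Total payload of a completed text response, in kilobytes."""
    return tokens * bytes_per_token / 1_000

print(stream_kbps())      # 1.6 kbps per active chat stream
print(response_kb(500))   # 2.0 KB for a 500-token reply
```

At roughly 1.6 kbps per stream, even hundreds of concurrent chats consume less bandwidth than a single image-generation response, which is why text workloads barely register in the tables below.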
Scaling with Concurrent Users
| Concurrent Users | LLM Chat Bandwidth | Image Gen Bandwidth | Whisper Bandwidth |
|---|---|---|---|
| 1 | < 1 Mbps | ~1-5 Mbps | ~5-10 Mbps |
| 10 | < 5 Mbps | ~10-50 Mbps | ~50-100 Mbps |
| 50 | < 20 Mbps | ~50-250 Mbps | ~250-500 Mbps |
| 100 | < 40 Mbps | ~100-500 Mbps | ~500 Mbps-1 Gbps |
For most single-GPU deployments serving 1-10 concurrent users, a 1 Gbps connection is more than sufficient; GPU compute becomes the bottleneck long before bandwidth does. For high-volume image or video serving, consider offloading generated assets to a CDN or object storage rather than serving them directly from the GPU server.
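Aggregate bandwidth at a given concurrency level can be estimated from per-user request patterns. A sketch under an assumed (not measured) usage pattern of one ~3 MB image every 15 seconds per user:

```python
# Rough sustained egress for N users issuing periodic requests.
# The per-user pattern (payload size, request interval) is an assumption
# chosen for illustration; real traffic is burstier.
def aggregate_mbps(concurrent_users: int,
                   payload_mb: float,
                   seconds_between_requests: float) -> float:
    """Sustained bandwidth in Mbps for N users with periodic requests."""
    per_user_mbps = payload_mb * 8 / seconds_between_requests
    return concurrent_users * per_user_mbps

# 50 image-generation users, ~3 MB per image, one request every 15 s:
print(aggregate_mbps(50, 3, 15))   # 80.0 Mbps, within the table's ~50-250 Mbps band
```

Because requests arrive in bursts rather than evenly spaced, peak bandwidth can briefly exceed these sustained averages, which is one reason the table ranges are wide.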
Model Download Bandwidth
The initial model download is often the most bandwidth-intensive event. Large models require significant download time on slow connections:
| Model Size | 100 Mbps | 1 Gbps | 10 Gbps |
|---|---|---|---|
| 7B FP16 (~14 GB) | ~19 min | ~2 min | ~11 sec |
| 7B GGUF Q4 (~5 GB) | ~7 min | ~40 sec | ~4 sec |
| 70B FP16 (~140 GB) | ~3.1 hours | ~19 min | ~2 min |
| Flux.1 full (~34 GB) | ~45 min | ~5 min | ~27 sec |
A 1 Gbps connection makes even 70B model downloads practical within 20 minutes. For rapid model swapping and experimentation, 10 Gbps connectivity is valuable. See our storage requirements guide for model file sizes.
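The download times above follow directly from size × 8 / link speed. A quick sketch (real downloads may be slower in practice if the remote registry limits per-connection throughput, so treat these as best-case figures):

```python
# Download time for a model file: bits to transfer / link speed.
# Sizes in GB (decimal, 10^9 bytes); link speeds in Gbps.
def download_seconds(size_gb: float, link_gbps: float) -> float:
    """Best-case transfer time in seconds, ignoring protocol overhead."""
    return size_gb * 8 / link_gbps

# 70B FP16 (~140 GB) on a 1 Gbps link:
print(download_seconds(140, 1) / 60)   # ~18.7 minutes, the table's "~19 min"
# 7B FP16 (~14 GB) on a 10 Gbps link:
print(download_seconds(14, 10))        # 11.2 seconds
```

Multi-connection downloaders can get close to these theoretical figures on a fast link; a single HTTP connection often cannot.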
Sizing Recommendations
| Use Case | Minimum Bandwidth | Recommended |
|---|---|---|
| Single-user development | 100 Mbps | 1 Gbps |
| Production LLM API (1-10 users) | 100 Mbps | 1 Gbps |
| Production image API (1-10 users) | 1 Gbps | 1 Gbps |
| Multi-model production (10-50 users) | 1 Gbps | 10 Gbps |
| High-volume Whisper processing | 1 Gbps | 10 Gbps |
1 Gbps is the standard recommendation for dedicated GPU servers running AI inference. It covers all common workloads at typical concurrency levels and makes model downloads fast.
Next Steps
Bandwidth is typically the least constrained resource in AI inference hosting. For the resources that matter more, see our GPU memory vs system RAM guide, RAM requirements guide, and CPU requirements guide. Compare GPU options with the GPU comparisons tool. Browse all infrastructure guides in the AI hosting and infrastructure section.
Dedicated GPU Servers with Fast Connectivity
GigaGPU dedicated servers include 1 Gbps+ network connectivity optimised for AI inference workloads. UK data centre hosting with low-latency routing.
Browse GPU Servers