Why Networking Matters for AI Servers
A powerful GPU is only part of the equation. If your network configuration introduces bottlenecks, your AI inference pipeline suffers — regardless of how fast your hardware is. On dedicated GPU servers, you have full control over networking, which means you can tune every layer from TCP settings to reverse proxy configuration. This guide covers practical networking for AI inference workloads served from dedicated hardware.
Whether you’re running a private LLM endpoint, serving a vision model, or hosting an AI API for production traffic, understanding bandwidth, latency, and endpoint architecture directly impacts your user experience and throughput. For broader infrastructure guidance, see our AI hosting and infrastructure articles.
Bandwidth Requirements for AI Workloads
AI inference traffic is typically asymmetric. For text workloads, requests are small (a few kilobytes of prompt text) and responses are streamed token by token; vision and speech workloads are input-heavy, while image generation is output-heavy. In every case, bandwidth requirements are far lower than most teams expect.
| Workload Type | Request Size | Response Size | Bandwidth per Request |
|---|---|---|---|
| LLM chat (500-token reply) | 1-5 KB | 2-4 KB | <10 KB total |
| LLM long-form (2000 tokens) | 2-10 KB | 8-15 KB | <25 KB total |
| Vision model (image input) | 500 KB-2 MB | 1-5 KB | <2 MB total |
| Speech-to-text (audio input) | 1-10 MB | 1-5 KB | <10 MB total |
| Image generation (output) | 1-5 KB | 2-8 MB | <8 MB total |
At 1Gbps (~125 MB/s) of dedicated bandwidth, a server can in principle sustain approximately 12,500 concurrent LLM chat streams (assuming ~10 KB/s per stream) or around 125 simultaneous image generation downloads (at ~1 MB/s each). In practice, GPU compute is the bottleneck long before bandwidth. For workloads involving image or audio inputs, see our vision model hosting and speech model hosting pages for hardware recommendations.
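The concurrency figures above are a back-of-envelope calculation, not a benchmark. A minimal sketch of the arithmetic, with the per-stream rates as stated assumptions:

```python
# Back-of-envelope estimate: how many concurrent streams a given uplink
# can sustain. Per-stream rates are illustrative assumptions, not measurements.

def max_concurrent_streams(link_gbps: float, per_stream_kbps: float) -> int:
    """Streams a link can carry if each consumes a steady bitrate."""
    link_kbps = link_gbps * 1_000_000  # 1 Gbps = 1,000,000 kbps (decimal units)
    return int(link_kbps // per_stream_kbps)

# ~10 KB/s per LLM chat stream ≈ 80 kbps
llm_streams = max_concurrent_streams(1.0, 80)
# ~1 MB/s per image-generation download ≈ 8,000 kbps
image_streams = max_concurrent_streams(1.0, 8_000)

print(llm_streams, image_streams)  # 12500 125
```

Either way the numbers dwarf what a single GPU can serve, which is why bandwidth is rarely the limiting factor for text inference.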
Latency Optimisation
For AI inference, latency breaks down into four components:
| Latency Component | Typical Range | How to Reduce |
|---|---|---|
| Network round-trip (client to server) | 5-50ms (same region) | Choose server location near users; UK datacenters for European traffic |
| Processing overhead (reverse proxy, auth) | 1-5ms | Minimise middleware; use an efficient proxy such as Nginx or Caddy |
| GPU inference (time to first token) | 50-500ms | Use optimised serving frameworks; keep models loaded in VRAM |
| Token generation (streaming) | 20-80ms per token | Use faster GPUs; optimise batch sizes; use continuous batching |
The largest gains come from keeping models loaded in VRAM (eliminating cold starts) and using optimised inference frameworks. vLLM hosting with continuous batching and PagedAttention delivers significantly better throughput than naive serving approaches. Check our tokens per second benchmarks for measured performance across different GPUs and frameworks.
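The components above can be summed into a rough end-to-end budget. A sketch using illustrative midpoints from the table (assumptions, not measurements):

```python
# Rough end-to-end latency budget for a streamed LLM reply, using
# illustrative midpoints of the component ranges in the table above.

def total_latency_ms(network_rtt=25, proxy_overhead=3,
                     ttft=250, per_token=50, tokens=500):
    """Milliseconds until the last token arrives at the client."""
    # First token: one round trip + proxy overhead + time to first token.
    first_token = network_rtt + proxy_overhead + ttft
    # Remaining tokens stream back-to-back at the generation rate.
    return first_token + per_token * (tokens - 1)

print(total_latency_ms())  # 25228 ms, ~25.2 s for 500 tokens at 50 ms/token
```

Note that token generation dominates the total: shaving network round-trips matters most for time to first token, while per-token speed governs the overall wait.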
Network-level optimisations that make a measurable difference:
- TCP tuning — increase socket buffer sizes and enable TCP BBR congestion control for better throughput
- HTTP/2 or HTTP/3 — reduce connection overhead for clients making repeated requests
- Connection pooling — reuse connections at the reverse proxy level to avoid TCP handshake latency
- Keep-alive — maintain persistent connections for streaming responses
- Server-sent events (SSE) — use streaming responses for LLM output rather than waiting for full completion
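The TCP-level items above translate into a handful of kernel settings on Linux. A hedged starting point (values are common tuning baselines, not universal recommendations; drop into a file under `/etc/sysctl.d/` and apply with `sysctl --system`):

```ini
# Enable BBR congestion control (requires the tcp_bbr kernel module).
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Larger socket buffers for high-bandwidth streaming (min, default, max in bytes).
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Verify the active congestion control with `sysctl net.ipv4.tcp_congestion_control` after applying.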
Low-Latency GPU Servers in the UK
1Gbps dedicated bandwidth, bare-metal hardware, full root access. Optimise your AI network stack from the ground up.
Browse GPU Servers
API Endpoint Architecture
A production AI inference endpoint typically has three layers between the client and the GPU:
1. Reverse proxy (Nginx / Caddy)
- TLS termination with Let’s Encrypt certificates
- Rate limiting to prevent abuse and manage concurrency
- Request buffering and timeout configuration
- Access logging for audit trails
2. Inference server (vLLM / TGI / Ollama)
- OpenAI-compatible API endpoint for drop-in integration
- Continuous batching for optimal GPU utilisation
- Health check endpoints for monitoring and load balancing
- With Ollama, you get built-in model management alongside the API
3. Authentication layer
- API key validation at the proxy level (avoids hitting the inference server for invalid requests)
- Optional JWT or OAuth for more complex access control
- IP allowlisting for internal services
For teams building customer-facing AI products, this architecture mirrors what commercial API providers use — but running on your own private infrastructure with full data control.
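A minimal Nginx sketch of layers 1 and 3 in front of a local inference server. The hostname, upstream port, and API key are placeholders, and the single hard-coded key is a simplification; real deployments typically use a `map` of keys or an auth subrequest:

```nginx
# Rate limit: 10 requests/second per client IP, with a small burst allowance.
limit_req_zone $binary_remote_addr zone=ai_api:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name api.example.com;              # placeholder hostname
    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    location /v1/ {
        limit_req zone=ai_api burst=20 nodelay;

        # Reject invalid API keys at the proxy, before the GPU is touched.
        if ($http_authorization != "Bearer CHANGE_ME") {
            return 401;
        }

        proxy_pass http://127.0.0.1:8000;     # vLLM's default port
        proxy_read_timeout 120s;              # allow long-form generation
        proxy_buffering off;                  # required for SSE streaming
        proxy_set_header Host $host;
    }
}
```

Rejecting bad credentials at the proxy keeps invalid traffic off the inference server entirely, which matters when every accepted request occupies GPU time.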
Load Balancing for Inference
When a single GPU server can’t handle your traffic, you need to distribute requests across multiple servers. AI inference has unique load balancing requirements compared to traditional web traffic:
- Long-lived connections — streaming responses can last several seconds; naive round-robin causes uneven distribution
- Variable request cost — a 50-token prompt and a 4000-token prompt require vastly different GPU time
- Model-specific routing — different servers may host different models or model versions
Effective strategies for AI load balancing:
| Strategy | Best For | Implementation |
|---|---|---|
| Least connections | Similar request sizes | Nginx upstream with least_conn directive |
| Weighted routing | Mixed GPU hardware | Assign weights based on GPU performance (e.g., RTX 5090 gets 1.5x weight vs RTX 3090) |
| Model-based routing | Multi-model deployments | Route by path or header to specific backend servers |
| Queue-based | Bursty traffic | Message queue (Redis, RabbitMQ) feeding worker processes |
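The first two strategies combine naturally in a single Nginx upstream block. A sketch assuming two mixed-hardware backends plus a standby (addresses and weights are placeholders to tune against your own benchmarks):

```nginx
upstream inference_backends {
    least_conn;                       # route to the backend with fewest active streams
    server 10.0.0.10:8000 weight=3;   # e.g., RTX 5090 host (1.5x the 3090's weight)
    server 10.0.0.11:8000 weight=2;   # e.g., RTX 3090 host
    server 10.0.0.12:8000 backup;     # only used if the primaries are down
}

server {
    listen 443 ssl;
    location /v1/ {
        proxy_pass http://inference_backends;
        proxy_next_upstream error timeout;   # retry another backend on failure
    }
}
```

`least_conn` suits streaming inference because active connection count is a reasonable proxy for in-flight GPU work, unlike plain round-robin.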
For multi-GPU setups where a single large model is split across cards, multi-GPU clusters handle the inter-GPU communication internally — load balancing applies at the request level above this. To understand which GPU hardware best fits your throughput needs, see our guide on the best GPU for LLM inference.
Monitoring & Troubleshooting
Network issues in AI inference pipelines often masquerade as model problems. Key metrics to monitor:
- Time to first token (TTFT) — measures combined network and GPU latency; spikes indicate queuing or network issues
- Tokens per second (TPS) — if this drops while GPU utilisation is low, the bottleneck is network or serving overhead
- Request queue depth — growing queues mean your GPU can’t keep up with incoming requests
- Connection errors / timeouts — often caused by aggressive proxy timeouts on long inference requests
- Bandwidth utilisation — rarely the bottleneck for text inference, but worth monitoring for vision or audio workloads
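TTFT and TPS fall straight out of per-token arrival timestamps, whatever serving framework you use. A minimal, framework-agnostic helper (the sample timestamps are illustrative):

```python
# Compute time-to-first-token (TTFT) and tokens-per-second (TPS) from
# the request send time and the arrival timestamp of each streamed token.

def ttft_and_tps(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_sent
    duration = token_times[-1] - token_times[0]
    # TPS over the streaming phase; guard against single-token replies.
    tps = (len(token_times) - 1) / duration if duration > 0 else 0.0
    return ttft, tps

# 5 tokens: first arrives at t=0.3 s, then one every 50 ms.
ttft, tps = ttft_and_tps(0.0, [0.3, 0.35, 0.4, 0.45, 0.5])
print(ttft, tps)  # 0.3 s TTFT, 20.0 tokens/s
```

Tracking these two numbers separately is what lets you distinguish a queuing/network problem (TTFT spikes) from a generation problem (TPS drops).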
Common pitfalls and fixes:
- Proxy timeout too low — set Nginx proxy_read_timeout to at least 120s for long-form generation
- Buffer size too small — increase proxy_buffer_size for large prompt inputs
- Missing streaming support — ensure proxy_buffering is off for SSE streaming endpoints
- DNS resolution delays — use static IPs or local DNS caching for backend servers
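The first three fixes above map directly onto Nginx directives inside the proxied location. The backend address is a placeholder and the values are starting points to tune, not universal settings:

```nginx
location /v1/ {
    proxy_pass http://127.0.0.1:8000;   # placeholder inference backend

    proxy_read_timeout 120s;        # don't kill long-form generations mid-stream
    proxy_buffer_size 64k;          # headroom for large upstream response headers
    client_max_body_size 10m;       # accept large prompt/audio/image inputs
    proxy_buffering off;            # flush SSE tokens to the client immediately
}
```

With `proxy_buffering off`, each generated token is forwarded as soon as the backend emits it, which is what makes streamed output feel responsive.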
To estimate how your inference costs scale with traffic, our cost per million tokens calculator provides concrete numbers alongside a breakdown of self-hosting economics.
Configuration Recommendations
Based on common deployment patterns, here are our networking recommendations by workload:
Low-traffic API (under 10 req/sec):
- Single server with Nginx reverse proxy
- Standard 1Gbps connection is more than sufficient
- Focus on TLS, rate limiting, and proper timeout configuration
Medium-traffic API (10-100 req/sec):
- Single powerful GPU (e.g., RTX 5090) with vLLM continuous batching
- HTTP/2 enabled, connection pooling, TCP BBR
- Monitor queue depth and scale to a second server when TTFT exceeds your SLA
High-traffic API (100+ req/sec):
- Multiple GPU servers behind a load balancer
- Least-connections routing with health checks
- Consider queue-based architecture for traffic bursts
- Our self-host LLM guide covers multi-server deployment in detail
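The queue-based pattern from the high-traffic tier can be sketched with Python's standard-library queue standing in for Redis or RabbitMQ; the backend names and request shapes are illustrative assumptions, and the real HTTP call to an inference server is stubbed out:

```python
# Queue-based inference dispatch: requests land in a queue, and a pool of
# workers (one per GPU backend) drains it, absorbing traffic bursts.
# queue.Queue stands in for Redis/RabbitMQ in this single-process sketch.
import queue
import threading

requests_q: queue.Queue = queue.Queue()
results: dict[str, str] = {}

def worker(backend: str) -> None:
    while True:
        req_id, prompt = requests_q.get()
        if req_id is None:           # sentinel: shut this worker down
            break
        # Placeholder for the real HTTP call to the inference backend.
        results[req_id] = f"{backend} handled: {prompt}"
        requests_q.task_done()

# One worker thread per (hypothetical) GPU backend.
threads = [threading.Thread(target=worker, args=(b,))
           for b in ("gpu-node-1", "gpu-node-2")]
for t in threads:
    t.start()

for i in range(4):                   # a burst of requests queues up safely
    requests_q.put((f"req-{i}", f"prompt {i}"))
requests_q.join()                    # block until the burst is fully drained

for _ in threads:                    # one sentinel per worker
    requests_q.put((None, None))
for t in threads:
    t.join()

print(len(results))  # 4
```

The key property is that a burst larger than your GPU capacity simply lengthens the queue rather than overloading backends or dropping connections; a broker like Redis adds persistence and multi-host distribution on top of the same shape.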
Getting your networking right from the start saves significant debugging time later. Start with a dedicated GPU server where you control every layer of the stack, then scale horizontally as your traffic demands grow.