Most production AI inference servers fail in roughly the same ways. The model loads fine, the first 100 requests work, and then on day three something quietly goes wrong — VRAM fills up, a driver mismatch surfaces, the queue depths blow out under real traffic. This guide is the playbook we hand to teams launching their first dedicated GPU server, refined from several hundred deployments.
A production inference server is not a model + an HTTP framework. It’s a model + an inference engine + an HTTP frontend + auth + metrics + queueing + a degradation strategy + a deployment harness. Skipping any one of those will bite you within a month. We’ll go through each in turn.
1. Picking the GPU and chassis
Three questions, in order:
- What models will I serve? The biggest one sets the VRAM floor. Don’t pick the GPU for today’s model; pick it for the model you’ll want in 6 months.
- What’s my latency budget? Single-stream <100 ms TTFT means you want a Blackwell card (5080, 5090, 6000 Pro). Aggregate throughput at high concurrency is a different optimisation.
- How predictable is my traffic? Steady traffic favours dedicated bare-metal. Spiky traffic favours either oversized hardware or autoscaling fronting cloud GPUs.
Common combinations that work:
| Workload | GPU | Why |
|---|---|---|
| 7B chatbot, <50 concurrent users | RTX 3090 24 GB | Cheapest 24 GB. Plenty for FP16 7B + KV cache. |
| 7B/8B chatbot, customer-facing | RTX 5090 32 GB | Highest throughput per card; FP8 hardware. |
| 13B–14B chatbot | RTX 5090 32 GB | FP16 fits. INT4 leaves room for two models. |
| Single-card 70B | RTX 6000 Pro 96 GB | The only single card that runs a 70B at FP8. |
| Whisper + LLM voice agent | RTX 5090 32 GB | Both models hot-loaded with headroom. |
| Embeddings only | RTX 3060 12 GB | Don’t over-buy. ~50K embeds/sec. |
Don’t run inference on consumer desktops or workstations dragged into a closet. The thermal envelope, power redundancy and 24/7 uptime requirements of production traffic break that setup within weeks.
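Before ordering, sanity-check the VRAM floor with back-of-envelope arithmetic. The figures below are illustrative round numbers, not a substitute for profiling your actual engine, context length and concurrency:

```bash
# Weight footprint = params x bytes per param
python3 -c 'print(f"7B  at FP16: {7e9  * 2 / 2**30:.0f} GiB")'   # ~13 GiB
python3 -c 'print(f"70B at FP8:  {70e9 * 1 / 2**30:.0f} GiB")'   # ~65 GiB
# Then add the KV cache (grows with context length x concurrent sequences)
# and a few GiB of engine overhead before calling a card "big enough".
```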
2. OS, driver and CUDA stack
Pin everything. Ubuntu 22.04 LTS, NVIDIA driver pinned to a specific version, CUDA toolkit pinned to a specific version, vLLM/TGI pinned to a specific tag. The number of incidents we’ve seen caused by an unattended-upgrades package bumping the NVIDIA driver in the middle of the night is non-zero.
```bash
sudo apt-mark hold nvidia-driver-* cuda-toolkit-* libcudnn*

# Versions we pin (as of mid-2026):
#   nvidia-driver-555.x (Blackwell)
#   CUDA 12.4
#   cuDNN 9.1
#   vLLM 0.6.3

cat /sys/module/nvidia/version   # verify the loaded driver
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
```
If you’re on Blackwell hardware (5080/5090/6000 Pro), the driver lower bound is 555.x. Older drivers will load but TensorRT-LLM kernels and FP4 paths won’t be available.
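If unattended-upgrades stays enabled for security patches, blacklisting the GPU stack is a useful complement to the `apt-mark hold` above. A sketch of the relevant stanza, assuming the stock Ubuntu config file location:

```
# /etc/apt/apt.conf.d/50unattended-upgrades
# Security updates keep flowing; the GPU stack is never touched unattended.
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "cuda-";
    "libcudnn";
};
```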
3. Inference engine: vLLM, TGI, Triton or Ollama?
What works
- vLLM — the default. Continuous batching, PagedAttention, OpenAI-compatible API, AWQ/GPTQ/FP8 support, prefix caching. Active development, first-class on Blackwell.
- Text Generation Inference (TGI) — Hugging Face’s engine. Slightly stricter quantisation support, excellent multi-GPU. Production-tested at scale.
- Triton + TensorRT-LLM — NVIDIA’s stack. Highest single-card throughput on FP4/FP8. More integration work; biggest payoff on Hopper/Blackwell.
- Ollama — easiest to get running. Multi-model multiplexing on one port. Underneath is llama.cpp; fine for low-throughput / dev work.
Where it breaks
- vLLM has aggressive memory tuning — naive defaults will OOM under load. Tune `--gpu-memory-utilization` and `--max-num-seqs`.
- TGI lags vLLM by 2–4 weeks on quantisation support for new models.
- Triton + TensorRT-LLM is genuinely complex — engine compilation step alone is a learning curve.
- Ollama is not a production engine. No tracing, weak metrics, no rate limiting. Don’t put it in front of paying users.
For 90% of new deployments: start with vLLM. Move to TGI if you hit a vLLM bug; move to Triton/TensorRT-LLM only when single-card throughput is the bottleneck and you have ops capacity to invest.
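A representative launch line for that default choice, assuming the pinned vLLM 0.6.x and a 7B model on a 32 GB card; the model name and limits are placeholders to tune under load:

```bash
# --gpu-memory-utilization 0.90: headroom beats the benchmark-tuned 0.95 (see section 8)
# --max-num-seqs: caps concurrently scheduled sequences; raise only after load-testing
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128 \
  --max-model-len 8192
```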
4. API surface, auth and rate limiting
Use the OpenAI shape. Even if you don’t use OpenAI today, the SDK ecosystem (Python, Node, Go, Rust, Ruby) is overwhelmingly built around it. vLLM and TGI both expose /v1/chat/completions, /v1/completions, /v1/embeddings out of the box — see our API hosting page.
For auth and rate limiting, do not rely on vLLM’s --api-key flag for a public endpoint. It’s a single static token. Front the engine with one of:
- LiteLLM — lightweight router. Per-key rate limits, model multiplexing, retries, streaming pass-through. Our default recommendation.
- Caddy / nginx — TLS, mTLS, IP allow-list. Pair with LiteLLM for auth.
- Cloudflare Access — for internal tools, drops the auth surface entirely. Sign in with your IDP, hit the endpoint.
```yaml
# litellm-config.yaml
model_list:
  - model_name: chat-fast
    litellm_params:
      model: openai/qwen2.5-7b
      api_base: http://127.0.0.1:8000/v1
      api_key: vllm-internal
  - model_name: chat-strong
    litellm_params:
      model: openai/qwen2.5-32b
      api_base: http://127.0.0.1:8001/v1
      api_key: vllm-internal

router_settings:
  routing_strategy: latency-based-routing
  fallbacks: [{"chat-strong": ["chat-fast"]}]

litellm_settings:
  drop_params: true
  set_verbose: false

general_settings:
  master_key: sk-master-key-rotates-monthly
  database_url: "postgres://..."  # for per-key tracking
```
5. Observability: metrics, logs and tracing
Three signals you must export from day one:
- vLLM Prometheus metrics — `--enable-metrics` exposes `/metrics`. Scrape with Prometheus, dashboard with Grafana. Key alerts: `vllm:gpu_cache_usage_perc > 0.95`, `vllm:num_requests_waiting > 100`, `vllm:time_to_first_token_seconds` (p99) > 2.
- Structured JSON request logs — request ID, user/key ID, model, input tokens, output tokens, latency, completion reason. Ship to your SIEM.
- OpenTelemetry traces — span the full request through LiteLLM → vLLM → response. Required for diagnosing tail-latency issues.
NVIDIA’s DCGM exporter on top gives you GPU-level telemetry — power draw, thermal, ECC errors, throttle reasons. Worth running on every box.
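The three key alerts above, written out as a Prometheus rules file sketch. The thresholds are the ones quoted; the TTFT expression assumes the metric is exported as a histogram, so check the bucket name against your vLLM version:

```yaml
# vllm-alerts.yaml
groups:
  - name: vllm
    rules:
      - alert: KVCacheNearFull
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 2m
      - alert: QueueBackingUp
        expr: vllm:num_requests_waiting > 100
        for: 2m
      - alert: SlowTTFT
        expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
        for: 5m
```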
6. Failover and graceful degradation
You will eventually hit one of these failure modes:
- The big model (32B / 70B) is OOM-ing under unexpected load
- The driver crashed and CUDA is unrecoverable
- The network to your storage tier is having a bad afternoon
The right response, in priority order:
- Drain gracefully. SIGTERM should let in-flight requests finish (30 s budget) before exiting.
- Fall back to a smaller, hotter model. LiteLLM’s `fallbacks` config does this transparently.
- Drop to a cached response or refuse politely. Better than 500-ing.
- Restart automatically. `systemd` with `Restart=on-failure`, `RestartSec=5`, `StartLimitBurst=5`.
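Put together, a unit file sketch covering the drain budget and restart policy above; the paths and serve flags are placeholders:

```ini
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM inference server
After=network-online.target
StartLimitBurst=5
StartLimitIntervalSec=300

[Service]
ExecStart=/usr/local/bin/vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
Restart=on-failure
RestartSec=5
# Honour the 30 s drain budget before systemd escalates to SIGKILL.
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```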
7. Cost control
The two questions every CFO asks within a month: "why is this bill higher than expected?" and "how does it compare to OpenAI?". Get ahead of both:
- Track cost per million tokens on your dashboard — see our cost-per-million-tokens breakdown for the formula.
- Set per-key budgets in LiteLLM. A misbehaving consumer should hit a wall, not your invoice.
- Run a break-even calculator against your token volume. Below ~50M tokens/month, hosted APIs are usually cheaper.
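The break-even arithmetic itself is one line. The numbers here are illustrative; plug in your own server cost and the hosted rate for a comparable model:

```bash
# break-even volume = monthly server cost / hosted price per 1M tokens
# e.g. a $350/mo dedicated box vs a hosted API at $7.00 per 1M tokens:
python3 -c 'print(f"{350 / 7.00:.0f}M tokens/month to break even")'   # 50M
```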
8. The eight mistakes we see every month
- Treating vLLM defaults as production-ready. They are tuned for benchmarks. `--gpu-memory-utilization 0.95` will OOM under real load. Lower it to 0.90 and tune up.
- Never load-testing. Use Locust or k6 with realistic prompt distributions; there’s a minimal Locust sketch after this list. Most teams discover their server collapses at 30 concurrent users only after launch.
- Putting Ollama in front of paying customers. Ollama is great for development. It is not a production inference engine.
- Forgetting to pin the model commit SHA. Hugging Face hub tags can move. We’ve had two incidents this year caused by a quietly-updated checkpoint changing tokeniser behaviour.
- Running on consumer hardware in a closet. The first thermal event is the last day your model serves traffic. Use a real datacenter chassis or a real datacenter.
- No fallback model. When the 70B has a bad minute, having no plan B is a bad afternoon. Keep a 7B warm.
- Per-token pricing comparisons that ignore embedding traffic. Embeddings are 10–100× cheaper per call than LLM inference but can dominate by volume. Track them separately.
- Auth as an afterthought. A leaked vLLM `--api-key` on a public IP is an expensive lesson. mTLS or per-user auth from day one.
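The Locust sketch promised above. The endpoint, key, model name and prompts are placeholders; the point is sampling from a prompt mix that matches production, not hammering one cached prompt:

```python
# locustfile.py
import random
from locust import HttpUser, task, between

PROMPTS = [
    "Summarise this support ticket: ...",
    "Write a SQL query that ...",
    "Translate to French: ...",
]

class ChatUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    @task
    def chat(self):
        # Hits the OpenAI-shaped endpoint through the LiteLLM proxy.
        self.client.post(
            "/v1/chat/completions",
            headers={"Authorization": "Bearer sk-test-key"},
            json={
                "model": "chat-fast",
                "messages": [{"role": "user", "content": random.choice(PROMPTS)}],
                "max_tokens": 256,
            },
        )
```

Run it with `locust -f locustfile.py --host https://your-endpoint` and watch the queue-depth and TTFT metrics from section 5 while you ramp users.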
Bottom line
Build the boring infrastructure first — pinned versions, structured logs, Prometheus, LiteLLM-fronted auth, systemd-managed processes — and then run a model. Most teams reverse the order, ship a cool demo, and spend the next quarter retrofitting. The dedicated server doesn’t care which order you choose; the on-call engineer does.
If you want the shorter version of this guide for a specific stack, see our self-host LLM guide and the vLLM production setup walkthrough.