
How to Build a Production AI Inference Server: Hardware, Software, and the 8 Mistakes Everyone Makes

A practical, opinionated playbook for building a production-grade AI inference server — from picking the GPU to wiring up auth, observability and graceful failover. The version we wish someone had handed us in 2023.

Most production AI inference servers fail in roughly the same ways. The model loads fine, the first 100 requests work, and then on day three something quietly goes wrong — VRAM fills up, a driver mismatch surfaces, the queue depths blow out under real traffic. This guide is the playbook we hand to teams launching their first dedicated GPU server, refined from several hundred deployments.

TL;DR

A production inference server is not a model + an HTTP framework. It’s a model + an inference engine + an HTTP frontend + auth + metrics + queueing + a degradation strategy + a deployment harness. Skipping any one of those will bite you within a month. We’ll go through each in turn.

1. Picking the GPU and chassis

Three questions, in order:

  1. What models will I serve? The biggest one sets the VRAM floor. Don’t pick the GPU for today’s model; pick it for the model you’ll want in 6 months.
  2. What’s my latency budget? Single-stream TTFT (time to first token) under 100 ms points you at a Blackwell card (5080, 5090, 6000 Pro). Aggregate throughput at high concurrency is a different optimisation.
  3. How predictable is my traffic? Steady traffic favours dedicated bare-metal. Spiky traffic favours either oversized hardware or autoscaling fronting cloud GPUs.
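
A rough rule of thumb for that VRAM floor (weights only; KV cache, activations and engine overhead come on top):

# weights_GB ≈ params_in_billions × bytes_per_param
#   FP16 = 2 bytes, FP8/INT8 = 1 byte, INT4 = 0.5 bytes
# e.g. 7B FP16 ≈ 14 GB; 14B FP16 ≈ 28 GB; 70B FP8 ≈ 70 GB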

Common combinations that work:

| Workload | GPU | Why |
| --- | --- | --- |
| 7B chatbot, <50 concurrent users | RTX 3090 24 GB | Cheapest 24 GB. Plenty for FP16 7B + KV cache. |
| 7B/8B chatbot, customer-facing | RTX 5090 32 GB | Highest throughput per card; FP8 hardware. |
| 13B–14B chatbot | RTX 5090 32 GB | FP16 fits. INT4 leaves room for two models. |
| Single-card 70B | RTX 6000 Pro 96 GB | The only single-card 70B FP8 deployment. |
| Whisper + LLM voice agent | RTX 5090 32 GB | Both models hot-loaded with headroom. |
| Embeddings only | RTX 3060 12 GB | Don’t over-buy. ~50K embeds/sec. |

Don’t run inference on consumer desktops or workstations dragged into a closet. The thermal envelope, power redundancy and 24/7 uptime requirements of production traffic break that setup within weeks.

2. OS, driver and CUDA stack

Pin everything. Ubuntu 22.04 LTS, NVIDIA driver pinned to a specific version, CUDA toolkit pinned to a specific version, vLLM/TGI pinned to a specific tag. The number of incidents we’ve seen caused by an unattended-upgrades package bumping the NVIDIA driver in the middle of the night is non-zero.

sudo apt-mark hold nvidia-driver-* cuda-toolkit-* libcudnn*

# Versions we pin (as of mid-2026):
# nvidia-driver-555.x  (Blackwell)
# CUDA 12.4
# cuDNN 9.1
# vLLM 0.6.3

cat /sys/module/nvidia/version    # verify
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
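apt-mark showhold                 # confirm the holds actually stuck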

If you’re on Blackwell hardware (5080/5090/6000 Pro), the driver lower bound is 555.x. Older drivers will load but TensorRT-LLM kernels and FP4 paths won’t be available.

3. Inference engine: vLLM, TGI, Triton or Ollama?

What works

  • vLLM — the default. Continuous batching, PagedAttention, OpenAI-compatible API, AWQ/GPTQ/FP8 support, prefix caching. Active development, first-class on Blackwell.
  • Text Generation Inference (TGI) — Hugging Face’s engine. Slightly narrower quantisation support, excellent multi-GPU. Production-tested at scale.
  • Triton + TensorRT-LLM — NVIDIA’s stack. Highest single-card throughput on FP4/FP8. More integration work; biggest payoff on Hopper/Blackwell.
  • Ollama — easiest to get running. Multi-model multiplexing on one port. Underneath is llama.cpp; fine for low-throughput / dev work.

Where it breaks

  • vLLM has aggressive memory tuning — naive defaults will OOM under load. Tune --gpu-memory-utilization and --max-num-seqs.
  • TGI lags vLLM by 2–4 weeks on quantisation support for new models.
  • Triton + TensorRT-LLM is genuinely complex — engine compilation step alone is a learning curve.
  • Ollama is not a production engine. No tracing, weak metrics, no rate limiting. Don’t put it in front of paying users.

For 90% of new deployments: start with vLLM. Move to TGI if you hit a vLLM bug; move to Triton/TensorRT-LLM only when single-card throughput is the bottleneck and you have ops capacity to invest.
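
A starting point we’d reach for on a 32 GB card (model and limits are illustrative, not prescriptive; load-test before trusting them):

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --api-key vllm-internal \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128 \
  --max-model-len 8192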

4. API surface, auth and rate limiting

Use the OpenAI shape. Even if you don’t use OpenAI today, the SDK ecosystem (Python, Node, Go, Rust, Ruby) is overwhelmingly built around it. vLLM and TGI both expose /v1/chat/completions, /v1/completions, /v1/embeddings out of the box — see our API hosting page.

For auth and rate limiting, do not rely on vLLM’s --api-key flag for a public endpoint. It’s a single static token. Front the engine with one of:

  • LiteLLM — lightweight router. Per-key rate limits, model multiplexing, retries, streaming pass-through. Our default recommendation.
  • Caddy / nginx — TLS, mTLS, IP allow-list. Pair with LiteLLM for auth.
  • Cloudflare Access — for internal tools, it takes the auth surface off your box entirely. Sign in with your IdP, hit the endpoint.

A minimal LiteLLM config fronting two vLLM backends looks like this:

# litellm-config.yaml
model_list:
  - model_name: chat-fast
    litellm_params:
      model: openai/qwen2.5-7b
      api_base: http://127.0.0.1:8000/v1
      api_key: vllm-internal
  - model_name: chat-strong
    litellm_params:
      model: openai/qwen2.5-32b
      api_base: http://127.0.0.1:8001/v1
      api_key: vllm-internal

router_settings:
  routing_strategy: latency-based-routing
  fallbacks: [{"chat-strong": ["chat-fast"]}]

litellm_settings:
  drop_params: true
  set_verbose: false

general_settings:
  master_key: sk-master-key-rotates-monthly
  database_url: "postgres://..."   # for per-key tracking
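
Run it with the proxy CLI; the port is whatever sits behind your TLS terminator:

litellm --config litellm-config.yaml --port 4000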

5. Observability: metrics, logs and tracing

Three signals you must export from day one:

  1. vLLM Prometheus metrics — --enable-metrics exposes /metrics. Scrape with Prometheus, dashboard with Grafana. Key alerts: vllm:gpu_cache_usage_perc > 0.95, vllm:num_requests_waiting > 100, vllm:time_to_first_token_seconds (p99) > 2 s. Starter alert rules are sketched after this list.
  2. Structured JSON request logs — request ID, user/key ID, model, input tokens, output tokens, latency, completion reason. Ship to your SIEM.
  3. OpenTelemetry traces — span the full request through LiteLLM → vLLM → response. Required for diagnosing tail-latency issues.
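
Wired into Prometheus, the alerts above look like this (rule names and windows are our choices, not canon):

# vllm-alerts.yaml
groups:
  - name: vllm
    rules:
      - alert: KVCacheNearFull
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 2m
      - alert: QueueBacklog
        expr: vllm:num_requests_waiting > 100
        for: 2m
      - alert: SlowFirstToken
        expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
        for: 5m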

NVIDIA’s DCGM exporter on top gives you GPU-level telemetry — power draw, thermal, ECC errors, throttle reasons. Worth running on every box.

6. Failover and graceful degradation

You will eventually hit one of these failure modes:

  • The big model (32B / 70B) is OOM-ing under unexpected load
  • The driver crashed and CUDA is unrecoverable
  • The network to your storage tier is having a bad afternoon

The right response, in priority order:

  1. Drain gracefully. SIGTERM should let in-flight requests finish (30 s budget) before exiting.
  2. Fall back to a smaller, hotter model. LiteLLM’s fallbacks config does this transparently.
  3. Drop to a cached response or refuse politely. Better than 500-ing.
  4. Restart automatically. systemd with Restart=on-failure, RestartSec=5, StartLimitBurst=5.
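
A minimal unit covering points 1 and 4 (paths, model and user are placeholders; adapt ExecStart to your install):

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM inference server
After=network-online.target
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
User=vllm
ExecStart=/usr/bin/env vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
Restart=on-failure
RestartSec=5
TimeoutStopSec=30   # the 30 s drain budget: SIGTERM first, SIGKILL after

[Install]
WantedBy=multi-user.target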

7. Cost control

The two questions every CFO asks within a month: "why is this bill higher than expected?" and "how does it compare to OpenAI?". Get ahead of both:

  • Track cost per million tokens on your dashboard — see our cost-per-million-tokens breakdown for the formula.
  • Set per-key budgets in LiteLLM. A misbehaving consumer should hit a wall, not your invoice.
  • Run a break-even calculator against your token volume. Below ~50M tokens/month, hosted APIs are usually cheaper.
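
The break-even arithmetic itself is one line; the numbers below are purely illustrative, not anyone’s actual prices:

# break_even_tokens_per_month = monthly_server_cost / hosted_price_per_1M_tokens
# e.g. £500/month server vs a £10 blended hosted rate per 1M tokens:
#   500 / 10 = 50M tokens/month; below that, hosted usually wins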

8. The eight mistakes we see every month

  1. Treating vLLM defaults as production-ready. They are tuned for benchmarks. --gpu-memory-utilization 0.95 will OOM under real load. Lower it to 0.90 and tune up.
  2. Never load-testing. Use Locust or k6 with realistic prompt distributions. Most teams discover their server collapses at 30 concurrent users only after launch.
  3. Putting Ollama in front of paying customers. Ollama is great for development. It is not a production inference engine.
  4. Forgetting to pin the model commit SHA. Hugging Face Hub tags can move. We’ve had two incidents this year caused by a quietly-updated checkpoint changing tokeniser behaviour. See the pinning sketch after this list.
  5. Running on consumer hardware in a closet. The first thermal event is the last day your model serves traffic. Use a real datacenter chassis or a real datacenter.
  6. No fallback model. When the 70B has a bad minute, having no plan B is a bad afternoon. Keep a 7B warm.
  7. Per-token pricing comparisons that ignore embedding traffic. Embeddings are 10–100× cheaper per call than LLM inference but can dominate by volume. Track them separately.
  8. Auth as an afterthought. A leaked vLLM --api-key on a public IP is an expensive lesson. mTLS or per-user auth from day one.
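
For mistake 4, pinning is one flag in vLLM (the SHA below is a placeholder; use the commit id from the model’s Hugging Face page):

# Serve an exact checkpoint, not the moving default branch:
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --revision a1b2c3d4   # placeholder commit SHA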

Bottom line

Build the boring infrastructure first — pinned versions, structured logs, Prometheus, LiteLLM-fronted auth, systemd-managed processes — and then run a model. Most teams reverse the order, ship a cool demo, and spend the next quarter retrofitting. The dedicated server doesn’t care which order you choose; the on-call engineer does.

If you want the shorter version of this guide for a specific stack, see our self-host LLM guide and the vLLM production setup walkthrough.
