AI Hosting & Infrastructure

Request Timeout Tuning on an Inference Server

Four timeout layers sit between your client and the GPU. Getting any one wrong causes mysterious cancellations. Here is the full map.

A simple LLM API call on dedicated GPU hosting traverses at least four timeout boundaries: client, reverse proxy, application, inference engine. Tune them wrong and long generations drop halfway through with cryptic errors. Here is how to set each layer.


Four Layers

Client HTTP timeout: set on the caller side. The OpenAI Python SDK defaults to 10 minutes; LangChain defaults are shorter. Set it explicitly instead of relying on library defaults.

Reverse proxy: nginx's proxy_read_timeout defaults to 60 seconds, which kills almost any non-trivial LLM generation. For LLM requests it must be raised well above the default.

Application layer: FastAPI, Flask, or your API gateway. This layer usually inherits OS or process defaults, so set a cap explicitly.
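A minimal sketch of an explicit app-level cap in an async handler, using asyncio.wait_for from the standard library (the inference call here is a stand-in, not a real engine API):

```python
import asyncio

NON_STREAMING_CAP = 600  # seconds, the recommended non-streaming cap

async def run_inference(prompt: str) -> str:
    # Stand-in for the real call into the inference engine.
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"

async def handle_request(prompt: str) -> str:
    # Enforce the application-layer timeout explicitly instead of
    # inheriting whatever the process or OS defaults happen to be.
    try:
        return await asyncio.wait_for(run_inference(prompt), timeout=NON_STREAMING_CAP)
    except asyncio.TimeoutError:
        return "error: generation exceeded the 600s cap"

print(asyncio.run(handle_request("hello")))
```

The same pattern fits inside a FastAPI endpoint; for streaming responses, skip the wrapper entirely so the proxy and client layers govern the lifetime.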

Inference engine: vLLM, TGI, and Ollama may have internal request timeouts or generation-time caps.

Recommended Values

Layer                       Non-streaming    Streaming
Client                      600s             600s + SSE keep-alive
nginx proxy_read_timeout    600s             3600s
nginx proxy_buffering       on               off
App layer (FastAPI)         600s             no cap for streaming
vLLM                        no engine cap    no engine cap

Example nginx config for streaming LLM:

location /v1/ {
  proxy_pass http://vllm:8000;
  proxy_http_version 1.1;          # chunked transfer requires HTTP/1.1
  proxy_set_header Connection '';  # clear the default "close" header
  proxy_buffering off;             # forward tokens as they arrive
  proxy_cache off;
  proxy_read_timeout 3600s;        # allow hour-long generations
  proxy_send_timeout 3600s;
  chunked_transfer_encoding on;
}

Streaming Specifics

Three things break streaming if not set:

  • proxy_buffering off – or nginx holds tokens until the buffer fills
  • proxy_http_version 1.1 – required for chunked transfer
  • Client must handle SSE keep-alives (the ": keep-alive" comment lines vLLM sends) without treating them as data
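To illustrate the last point, a minimal SSE line filter that skips comment/keep-alive lines (names are illustrative; real clients usually get this handling from their SDK):

```python
def iter_sse_data(lines):
    """Yield 'data:' payloads from an SSE stream, ignoring comments.

    Per the SSE spec, lines starting with ':' are comments; servers
    like vLLM use them as keep-alive heartbeats. Treating them as
    data corrupts the reassembled output.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith(":"):
            continue  # blank separators and ": keep-alive" comments
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI-style end-of-stream marker
                return
            yield payload

stream = [
    ": keep-alive",
    'data: {"token": "Hel"}',
    "",
    'data: {"token": "lo"}',
    "data: [DONE]",
]
print(list(iter_sse_data(stream)))
```

Run against the sample stream above, this yields only the two token payloads; the heartbeat and the [DONE] sentinel are filtered out.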

Cloudflare proxies add another timeout layer: proxied HTTP requests are cut off after 100 seconds by default (surfacing as error 524). Use Cloudflare Tunnel or WebSockets for long-running streams. See Ollama behind Cloudflare Tunnel.

Preconfigured nginx for vLLM Streaming

We ship reverse-proxy configs tested for long streaming LLM requests on UK dedicated hosting.

Browse GPU Servers

See nginx config for OpenAI API and vLLM behind nginx.


