AI Hosting & Infrastructure

Request Timeout Tuning on an Inference Server

Four timeout layers sit between your client and the GPU. Getting any one wrong causes mysterious cancellations. Here is the full map.

A simple LLM API call on dedicated GPU hosting traverses at least four timeout boundaries: client, reverse proxy, application, inference engine. Tune them wrong and long generations drop halfway through with cryptic errors. Here is how to set each layer.


Four Layers

Client HTTP timeout: set on the caller side. The OpenAI Python SDK defaults to 10 minutes; LangChain defaults are shorter. Set it explicitly instead of relying on library defaults.

Reverse proxy: nginx's proxy_read_timeout defaults to 60 seconds, which kills almost any non-trivial LLM generation. For LLM requests it must be raised well above the default.

Application layer: FastAPI, Flask, or your API gateway. This layer usually inherits OS or process defaults, so set a cap explicitly.
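A minimal sketch of an explicit app-level cap in an async handler, using asyncio.wait_for from the standard library (the inference call here is a stand-in, not a real engine API):

```python
import asyncio

NON_STREAMING_CAP = 600  # seconds, the recommended non-streaming cap

async def run_inference(prompt: str) -> str:
    # Stand-in for the real call into the inference engine.
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"

async def handle_request(prompt: str) -> str:
    # Enforce the application-layer timeout explicitly instead of
    # inheriting whatever the process or OS defaults happen to be.
    try:
        return await asyncio.wait_for(run_inference(prompt), timeout=NON_STREAMING_CAP)
    except asyncio.TimeoutError:
        return "error: generation exceeded the 600s cap"

print(asyncio.run(handle_request("hello")))
```

The same pattern fits inside a FastAPI endpoint; for streaming responses, skip the wrapper entirely so the proxy and client layers govern the lifetime.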

Inference engine: vLLM, TGI, and Ollama may have internal request timeouts or generation-time caps.

Recommended Values

Layer                       Non-streaming    Streaming
Client                      600s             600s + SSE keep-alive
nginx proxy_read_timeout    600s             3600s
nginx proxy_buffering       on               off
App layer (FastAPI)         600s             no cap for streaming
vLLM                        no engine cap    no engine cap

Example nginx config for streaming LLM:

location /v1/ {
  proxy_pass http://vllm:8000;
  proxy_http_version 1.1;          # chunked transfer requires HTTP/1.1
  proxy_set_header Connection '';  # clear the default "close" header
  proxy_buffering off;             # forward tokens as they arrive
  proxy_cache off;
  proxy_read_timeout 3600s;        # allow hour-long generations
  proxy_send_timeout 3600s;
  chunked_transfer_encoding on;
}

Streaming Specifics

Three things break streaming if not set:

  • proxy_buffering off – or nginx holds tokens until the buffer fills
  • proxy_http_version 1.1 – required for chunked transfer
  • Client must handle SSE keep-alives (the ": keep-alive" comment lines vLLM sends) without treating them as data
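To illustrate the last point, a minimal SSE line filter that skips comment/keep-alive lines (names are illustrative; real clients usually get this handling from their SDK):

```python
def iter_sse_data(lines):
    """Yield 'data:' payloads from an SSE stream, ignoring comments.

    Per the SSE spec, lines starting with ':' are comments; servers
    like vLLM use them as keep-alive heartbeats. Treating them as
    data corrupts the reassembled output.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith(":"):
            continue  # blank separators and ": keep-alive" comments
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI-style end-of-stream marker
                return
            yield payload

stream = [
    ": keep-alive",
    'data: {"token": "Hel"}',
    "",
    'data: {"token": "lo"}',
    "data: [DONE]",
]
print(list(iter_sse_data(stream)))
```

Run against the sample stream above, this yields only the two token payloads; the heartbeat and the [DONE] sentinel are filtered out.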

Cloudflare proxies add another timeout layer: proxied HTTP requests are cut off after 100 seconds by default (surfacing as error 524). Use Cloudflare Tunnel or WebSockets for long-running streams. See Ollama behind Cloudflare Tunnel.

Preconfigured nginx for vLLM Streaming

We ship reverse-proxy configs tested for long streaming LLM requests on UK dedicated hosting.

Browse GPU Servers

See nginx config for OpenAI API and vLLM behind nginx.


