A simple LLM API call on dedicated GPU hosting traverses at least four timeout boundaries: client, reverse proxy, application, inference engine. Tune them wrong and long generations drop halfway through with cryptic errors. Here is how to set each layer.
Four Layers
Client HTTP timeout: on the caller side. The OpenAI Python SDK defaults to 10 minutes; LangChain's default is shorter. Set it explicitly rather than relying on library defaults.
Reverse proxy: nginx proxy_read_timeout default is 60 seconds. For LLM requests this must be much higher.
Application layer: FastAPI, Flask, or your API gateway. Usually inherits OS or process defaults.
Inference engine: vLLM, TGI, Ollama may have internal request timeouts or generation time caps.
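Because the effective timeout is whichever layer gives up first, it helps to write the budget down in one place. A minimal sketch (layer names and the 600 s budget are illustrative, not from any specific deployment) that flags any outer layer whose timeout is shorter than an inner layer's:

```python
# Ordered outermost -> innermost; value in seconds, None = no cap.
LAYERS = [
    ("client", 600),
    ("nginx proxy_read_timeout", 60),   # nginx default -- too low for LLMs
    ("app (FastAPI)", 600),
    ("vLLM engine", None),
]

def misconfigured(layers):
    """Return names of layers whose timeout undercuts a layer behind them."""
    bad = []
    for i, (name, t) in enumerate(layers):
        if t is None:
            continue
        inner = [v for _, v in layers[i + 1:] if v is not None]
        if inner and t < max(inner):
            bad.append(name)
    return bad

print(misconfigured(LAYERS))  # ['nginx proxy_read_timeout']
```

With the nginx default of 60 s left in place, the check flags it immediately; raise it to 600 s and the list comes back empty.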
Recommended Values
| Layer | Non-streaming | Streaming |
|---|---|---|
| Client | 600s | 600s + SSE keep-alive |
| nginx proxy_read_timeout | 600s | 3600s |
| nginx proxy_buffering | on | off |
| App layer (FastAPI) | 600s | no cap for streaming |
| vLLM | no engine cap | no engine cap |
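On the client side, the 600 s value from the table has to be passed explicitly. A stdlib sketch of setting the socket read timeout (host, port, and the 600 s budget are illustrative; with the OpenAI Python SDK the equivalent is the `timeout=` argument when constructing the client):

```python
import http.client

TIMEOUT_S = 600  # non-streaming budget from the table above

# The timeout applies to connect and to each socket read, so a slow
# generation fails only if no bytes arrive for TIMEOUT_S seconds.
conn = http.client.HTTPConnection("vllm.internal", 8000, timeout=TIMEOUT_S)
print(conn.timeout)  # 600
```

Constructing the connection does not open a socket yet, so the timeout can be verified before the first request is sent.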
Example nginx config for streaming LLM:

```nginx
location /v1/ {
    proxy_pass http://vllm:8000;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    chunked_transfer_encoding on;
}
```
Streaming Specifics
Three things break streaming if not set:

- `proxy_buffering off` – or nginx holds tokens until the buffer fills
- `proxy_http_version 1.1` – required for chunked transfer
- Client must handle SSE keep-alive (the `: keepalive` comments vLLM sends)
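Per the SSE format, lines beginning with `:` are comments and must be skipped, not parsed as data. A minimal sketch of a client-side line filter (the payload strings are made up for illustration):

```python
def iter_sse_events(lines):
    """Yield data payloads, ignoring ':' comment lines used as keep-alives."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # SSE comment / keep-alive -- not an event
        if line.startswith("data:"):
            yield line[len("data:"):].strip()

stream = [": keepalive", 'data: {"token": "Hi"}', ": keepalive", "data: [DONE]"]
print(list(iter_sse_events(stream)))  # ['{"token": "Hi"}', '[DONE]']
```

A client that treats the keep-alive comments as malformed JSON will crash mid-stream even though every timeout is set correctly.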
Cloudflare proxies add an additional timeout layer. Use Cloudflare Tunnel or WebSockets for long-running streams. See Ollama behind Cloudflare Tunnel.
Preconfigured nginx for vLLM Streaming
We ship reverse-proxy configs tested for long streaming LLM requests on UK dedicated hosting.
Browse GPU Servers