For interactive AI features (chat, voice agents), streaming responses are essential UX. Users see tokens as they are generated, so perceived latency drops dramatically compared with waiting for the full response. The implementation has a few specific gotchas worth knowing.
vLLM exposes OpenAI-compatible SSE streaming. Server: nginx with proxy_buffering off and long timeouts. Client: the OpenAI SDK with stream=True handles SSE parsing. Errors: handle disconnects, partial responses, and retry semantics. Track streaming-specific metrics: time-to-first-token and time-per-output-token.
Why streaming
- Perceived latency: users see tokens as they are generated, which beats staring at a spinner until the full response completes
- Cancellable: the client can stop mid-generation if needed (see the sketch after this list)
- Memory-friendly: neither server nor client has to buffer the entire response
- Cost visibility: partial output is visible before the full cost is incurred
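On the cancellation point, a minimal sketch assuming the OpenAI Python SDK pointed at the vLLM endpoint used in the Client section below; the time-based deadline is only a stand-in for a real cancel signal (a UI stop button, an asyncio event). Closing the stream drops the HTTP connection, which is what vLLM detects in order to cancel generation.

import time
from openai import OpenAI

client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a long story."}],
    stream=True,
)

deadline = time.monotonic() + 2.0  # stand-in for a real cancellation signal
for chunk in stream:
    if time.monotonic() > deadline:
        stream.close()  # close the connection; vLLM sees the disconnect and cancels
        break
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)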
Server
vLLM's OpenAI-compatible server streams via SSE out of the box when the request sets stream=true; the part that usually breaks streaming is the reverse proxy in front of it. Critical nginx config:
location /v1/chat/completions {
    proxy_pass http://vllm_backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    # Streaming critical: disable buffering so each SSE chunk is forwarded immediately
    proxy_buffering off;
    proxy_cache off;

    # Long generations can run for minutes; keep timeouts above the longest expected request
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
    chunked_transfer_encoding on;
}
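A quick way to verify the proxy is not buffering is to read the raw SSE stream through it. A sketch using the requests library (an assumption, not part of the stack above), with http://your-proxy standing in for whatever hostname fronts vLLM; if tokens trickle out one data: line at a time and the stream ends with data: [DONE], buffering is off.

import json
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Count to 20."}],
    "stream": True,
}

# stream=True keeps the connection open and yields lines as they arrive
with requests.post("http://your-proxy/v1/chat/completions",
                   json=payload, stream=True, timeout=600) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue  # SSE events are separated by blank lines
        line = raw.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        content = json.loads(data)["choices"][0]["delta"].get("content")
        if content:
            print(content, end="", flush=True)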
Client
from openai import OpenAI

client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)

# The SDK parses the SSE stream; each chunk carries a delta with newly generated text
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
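Time-to-first-token and time-per-output-token can be measured right in this loop by timestamping chunks. A minimal sketch reusing the client above; it counts chunks rather than tokens, which is a reasonable approximation when each delta carries roughly one token (an assumption, not something vLLM guarantees).

import time

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)

for chunk in stream:
    if not (chunk.choices and chunk.choices[0].delta.content):
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time-to-first-token ends here
    chunks += 1

end = time.perf_counter()
if first_token_at is not None:
    ttft = first_token_at - start
    tpot = (end - first_token_at) / max(chunks - 1, 1)  # avg time per subsequent chunk
    print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/chunk over {chunks} chunks")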
Errors
- Client disconnect: vLLM detects + cancels generation (saves resources)
- Mid-stream error: partial response returned + error code in final chunk
- Network blip: client should buffer last-N tokens for graceful resumption (rarely worth implementing)
- Timeout: the client timeout should exceed the server timeout (a client-side sketch follows this list)
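A client-side sketch under those rules, assuming the OpenAI Python SDK: the client timeout is set above the 600 s proxy timeout, partial output is accumulated as it arrives, and retry decisions are made after inspecting what was already received. openai.APIError is the SDK's base exception; the exact exception raised mid-stream can vary by SDK version, so broaden the except clauses as needed.

import openai
from openai import OpenAI

# Client timeout exceeds the proxy's 600 s read timeout
client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy", timeout=650)

partial = []
try:
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "..."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            partial.append(chunk.choices[0].delta.content)
except openai.APITimeoutError:
    # Timed out before or during generation; whatever arrived is in `partial`
    pass
except openai.APIError as exc:
    # Mid-stream failure: keep the partial text, log the error, then decide whether to retry
    print(f"stream failed after {len(partial)} chunks: {exc}")

text_so_far = "".join(partial)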
Verdict
For interactive AI UX, streaming is non-optional. Implement OpenAI-compatible SSE, configure nginx correctly, and handle client disconnects and partial responses. Track time-to-first-token and time-per-output-token separately from full-request latency. This is the standard pattern for production AI in 2026.
Bottom line
Streaming via the OpenAI-compatible API + nginx with buffering off + the OpenAI SDK client. See the Server section above for the nginx config.