Streaming responses over SSE (Server-Sent Events) are what make a chatbot feel alive. The OpenAI spec uses text/event-stream responses with data: ... lines, and vLLM speaks the same protocol. Production deployments still hit pitfalls, though, and they are usually in the reverse proxy.
vLLM streams SSE out of the box. The pitfalls: nginx buffers proxied responses by default, HTTP/2 can fragment frames, and Cloudflare can buffer or cache. Disable buffering at every hop and set the X-Accel-Buffering: no response header. Test with curl before assuming the network is fine.
How OpenAI streaming works
The client sends {"stream": true} in the request body. The server responds with Content-Type: text/event-stream and emits tokens incrementally:
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":" world"}}]}
data: [DONE]
Each data: line is one chunk; chunks (SSE events) are separated by blank lines, and the stream ends with the literal data: [DONE] sentinel.
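To make the framing concrete, here is a minimal sketch that parses the raw SSE lines with no SDK at all. It assumes the requests library and a local vLLM server on port 8000 serving a model named mistral-7b; adjust both for your deployment.

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistral-7b",
        "messages": [{"role": "user", "content": "hi"}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the response body
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank event separators
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel defined by the OpenAI spec
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)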
vLLM SSE setup
vLLM streams SSE by default when stream=true. No special flags. Verify with curl:
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-7b","messages":[{"role":"user","content":"hi"}],"stream":true}'
# Expect chunks to arrive one at a time, not all at once; -N disables curl's own output buffering
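If eyeballing the output is not conclusive, a small sketch that timestamps each chunk makes buffering obvious: a healthy stream shows gaps of tens of milliseconds between chunks, while a buffered one dumps everything with near-zero gaps at the end. Same assumed endpoint and model as above.

import time
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "mistral-7b",
          "messages": [{"role": "user", "content": "count to 20"}],
          "stream": True},
    stream=True,
)

last = time.monotonic()
for line in resp.iter_lines():
    if line:
        now = time.monotonic()
        # Print inter-chunk gap in milliseconds plus a prefix of the chunk.
        print(f"+{(now - last) * 1000:6.1f} ms  {line[:60]!r}")
        last = now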
Reverse proxy buffering
Three places buffering kills streaming:
- nginx: set proxy_buffering off; in the server or location block; otherwise nginx waits for the complete response before forwarding anything (a config sketch follows below).
- Caddy: streams by default, but check request_buffers off and response_buffers off if the config has been customised.
- Cloudflare: streams SSE by default, but caching can interfere. Set Cache-Control: no-cache on the response.
Also add X-Accel-Buffering: no to vLLM's response headers; nginx and some other proxies treat it as a per-response hint to disable buffering.
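For nginx specifically, a location block along these lines disables the buffering described above. This is a sketch, not a complete config: the upstream address, path, and timeout are assumptions to adapt to your deployment.

location /v1/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;          # needed for chunked streaming to the upstream
    proxy_set_header Connection "";  # keep the upstream connection open
    proxy_buffering off;             # forward upstream bytes to the client immediately
    proxy_cache off;                 # never cache event streams
    proxy_read_timeout 300s;         # long generations exceed the 60s default
}

With proxy_buffering off globally, the X-Accel-Buffering header is redundant for nginx, but it is a cheap belt-and-braces hint for any other hop that understands it.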
Client patterns
Python with the OpenAI SDK:
from openai import OpenAI

client = OpenAI(base_url="http://your-server:8000/v1", api_key="sk-int")

stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # role-only and final chunks carry no content
        print(delta, end="", flush=True)
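For async applications, the SDK's AsyncOpenAI client streams the same way. A sketch, assuming the same server and model name as above:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://your-server:8000/v1", api_key="sk-int")

async def main() -> None:
    stream = await client.chat.completions.create(
        model="mistral-7b",
        messages=[{"role": "user", "content": "hi"}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)

asyncio.run(main())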
Verdict
SSE streaming on self-hosted vLLM is a 5-line change. The pitfalls are all in the proxy layer. Test with curl before assuming the API is broken.
Bottom line
Always verify streaming with raw curl before integrating clients. If chunks arrive in one batch, your proxy is buffering. See the OpenAI-compatible API guide.