
Self-Hosted OpenAI-Compatible Streaming: SSE, WebSocket, and the Pitfalls

Server-Sent Events streaming on a self-hosted vLLM endpoint, with the buffering, reverse-proxy, and CORS gotchas that bite teams in production.

Streaming responses over Server-Sent Events (SSE) are what make a chatbot feel alive. The OpenAI spec uses Content-Type: text/event-stream with data: ... lines. vLLM speaks the same protocol, but production deployments hit pitfalls, usually in the reverse proxy.

TL;DR

vLLM streams SSE out of the box. The pitfalls are nginx buffering (on by default), HTTP/2 fragmentation, and Cloudflare caching. Disable buffering at every hop and add an X-Accel-Buffering: no header. Test with curl before assuming the network is fine.

How OpenAI streaming works

Client sends {"stream": true} in the request body. Server responds with Content-Type: text/event-stream and emits incrementally:

data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: [DONE]

Each data: line is a chunk; chunks are separated by blank lines.
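
For reference, here is a minimal sketch of consuming that stream without an SDK, using the httpx library (the URL and model name are placeholders; adapt them to your deployment):

import json
import httpx

payload = {
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": True,
}

# Open the response as a stream and handle each "data:" line as it arrives
with httpx.stream(
    "POST",
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue  # skip the blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {}).get("content")
            if delta:
                print(delta, end="", flush=True)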

vLLM SSE setup

vLLM streams SSE by default when stream=true; no special flags are needed. Verify with curl (-N disables curl's own output buffering):

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-7b","messages":[{"role":"user","content":"hi"}],"stream":true}'

# Expect to see chunks streaming as they arrive, not all at once

Reverse proxy buffering

Three places buffering kills streaming:

  1. nginx: set proxy_buffering off; in the server (or location) block; otherwise nginx buffers the upstream response instead of forwarding chunks as they arrive. See the config sketch below.
  2. Caddy: streams by default; only check the request_buffers and response_buffers settings if you have customised the reverse_proxy block.
  3. Cloudflare: streams SSE by default but caching can interfere. Set Cache-Control: no-cache on the response.

Also add X-Accel-Buffering: no to the response headers vLLM sends; nginx and some other proxies honour this hint and skip buffering for that response.
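
Putting the nginx pieces together, a minimal SSE-friendly location block might look like the sketch below (assuming vLLM listens on 127.0.0.1:8000; adjust paths and timeouts to your setup):

# Inside your existing server block (sketch, not a drop-in config)
location /v1/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_buffering off;             # forward chunks as they arrive instead of buffering the body
    proxy_cache off;                 # never cache streamed completions
    proxy_http_version 1.1;          # enable keep-alive to the upstream
    proxy_set_header Connection "";  # strip the default "close" so the connection stays open
    proxy_read_timeout 300s;         # long generations can pause between tokens
}

If you cannot touch the nginx config, having the upstream emit X-Accel-Buffering: no achieves the same effect for proxies that honour the hint.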

Client patterns

Python with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(base_url="http://your-server:8000/v1", api_key="sk-int")

stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # content is None for the initial role-only chunk and the final finish chunk
        print(delta, end="", flush=True)

Verdict

SSE streaming on self-hosted vLLM is a 5-line change. The pitfalls are all in the proxy layer. Test with curl before assuming the API is broken.
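
One way to make that check concrete: a small diagnostic sketch (using the same placeholder client as above) that timestamps each chunk, so you can see whether chunks are spread across the generation or arrive in one burst at the end:

import time
from openai import OpenAI

client = OpenAI(base_url="http://your-server:8000/v1", api_key="sk-int")

start = time.monotonic()
stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    stream=True,
)

# Record when each chunk arrives relative to the request start
arrivals = [time.monotonic() - start for _ in stream]

if arrivals:
    print(f"{len(arrivals)} chunks; first after {arrivals[0]:.2f}s, last after {arrivals[-1]:.2f}s")
# Healthy streaming: arrivals are spread across the generation time.
# Buffered somewhere: nearly every chunk lands within milliseconds of the last one.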

Bottom line

Always verify streaming with raw curl before integrating clients. If chunks arrive in one batch, your proxy is buffering. See OpenAI-compatible API guide.
