
Streaming Response Handling

SSE streaming for LLM responses: client patterns, server config, and error handling, with a reference implementation.

For interactive AI features (chat, voice agents), streaming responses are essential UX. Users see tokens as they're generated, so perceived latency drops dramatically compared with waiting for the full response. The implementation has specific gotchas worth knowing.

TL;DR

vLLM exposes OpenAI-compatible SSE streaming. Server: nginx proxy_buffering off; long timeouts. Client: OpenAI SDK stream=True handles SSE parsing. Errors: handle disconnect, partial responses, retry semantics. Track streaming-specific metrics: time-to-first-token, time-per-output-token.

Why streaming

  • Perceived latency: users see tokens as they're generated; engagement is higher than with a blank wait
  • Cancellable: the client can stop mid-generation if needed (sketch after this list)
  • Memory-friendly: no need to buffer the entire response on the server or the client
  • Cost visibility: partial output is visible before the full cost is incurred
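
A minimal cancellation sketch, assuming the OpenAI Python SDK (v1+): breaking out of the loop and calling Stream.close() drops the HTTP connection, which vLLM detects as a disconnect. The threading.Event is a stand-in for whatever cancel signal your app actually has.

import threading

from openai import OpenAI

cancel = threading.Event()  # set from another thread (UI stop button, request abort, etc.)

client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
for chunk in stream:
    if cancel.is_set():
        stream.close()  # drops the connection; vLLM sees the disconnect and cancels
        break
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)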

Server

vLLM streaming is OpenAI-compatible by default. Critical nginx config:

location /v1/chat/completions {
  proxy_pass http://vllm_backend;
  proxy_http_version 1.1;
  proxy_set_header Connection "";

  # Streaming critical: disable buffering
  proxy_buffering off;
  proxy_cache off;
  proxy_read_timeout 600s;
  proxy_send_timeout 600s;
  chunked_transfer_encoding on;
}
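
To see what the proxy must pass through untouched, here is a minimal sketch of consuming the raw SSE wire format with requests instead of the SDK. The URL is a placeholder; the data: framing and [DONE] sentinel are the standard OpenAI-compatible stream format.

import json

import requests

resp = requests.post(
    "http://your-vllm:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "..."}],
        "stream": True,
    },
    stream=True,       # don't buffer the body client-side
    timeout=(5, 600),  # (connect, read); read matches proxy_read_timeout
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip blank keep-alive lines between events
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)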

Client

from openai import OpenAI

# The SDK requires an api_key; vLLM only checks it when started with --api-key
client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "..."}],
    stream=True,  # returns an iterator of chunks instead of one response
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
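
The same loop yields the streaming metrics from the TL;DR. A minimal sketch for client-side time-to-first-token and time-per-output-token; note it times chunks, which approximates tokens when the server emits one token per chunk.

import time

from openai import OpenAI

client = OpenAI(base_url="http://your-vllm:8000/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
first = None
n_chunks = 0
for chunk in stream:
    if chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()  # time-to-first-token measured here
        n_chunks += 1
end = time.perf_counter()

if first is not None:
    print(f"TTFT: {first - start:.3f}s")
    if n_chunks > 1:
        # per-chunk decode time; ~per-token when the server streams one token per chunk
        print(f"TPOT: {(end - first) / (n_chunks - 1) * 1000:.1f}ms")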

Errors

  • Client disconnect: vLLM detects the dropped connection and cancels generation (saves GPU time)
  • Mid-stream error: a partial response is returned, with an error code in the final chunk
  • Network blip: the client could buffer the last N tokens for graceful resumption (rarely worth implementing)
  • Timeout: the client timeout should exceed the server timeout, so the server's error reaches the client instead of a silent abort (handling sketch below)
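
A minimal handling sketch using the OpenAI Python SDK's exception types; the retry policy shown is an assumption, adapt it to your app.

import openai
from openai import OpenAI

client = OpenAI(
    base_url="http://your-vllm:8000/v1",
    api_key="dummy",
    timeout=620.0,  # exceeds nginx's 600s proxy_read_timeout
)

parts = []
try:
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "..."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
except openai.APITimeoutError:
    pass  # nothing streamed yet: retrying the whole request is safe
except (openai.APIConnectionError, openai.APIStatusError):
    pass  # mid-stream failure: parts holds the partial response

text = "".join(parts)  # full or partial output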

Verdict

For interactive AI UX, streaming is non-optional. Implement OpenAI-compatible SSE; configure nginx correctly; handle client disconnects + partial responses. Track time-to-first-token + time-per-output-token metrics separately from full-request latency. Standard pattern in 2026 production AI.

Bottom line

Streaming = vLLM's OpenAI-compatible SSE + the nginx config above + an OpenAI client with stream=True. Get the nginx config right first.
