You will build a customer support chatbot that streams responses token-by-token (like ChatGPT), retrieves answers from your knowledge base via RAG, maintains conversation history, and serves a clean web interface. The end result: customers type a question, see the answer appear word-by-word in under 500ms to first token, with source links to your documentation. Everything runs on your GPU server — no per-token API fees, no data shared with third parties. Here is the full stack on dedicated GPU infrastructure.
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| LLM | LLaMA 3.1 8B via vLLM | Response generation with streaming |
| Retrieval | ChromaDB + BGE embeddings | Knowledge base search |
| Backend | FastAPI + SSE | API with server-sent events |
| Memory | Redis | Conversation history per session |
| Frontend | HTML + JavaScript | Chat interface |
Streaming Backend
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import json

app = FastAPI()
# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def stream_chat_response(question: str, context: str,
                               history: list, session_id: str):
    messages = [
        {"role": "system", "content": f"You are a helpful support assistant. "
                                      f"Answer based on this context:\n{context}\n"
                                      f"If unsure, say you'll escalate to a human agent."}
    ] + history + [{"role": "user", "content": question}]
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages, stream=True, max_tokens=500, temperature=0.3
    )
    answer_parts = []
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            answer_parts.append(token)
            yield f"data: {json.dumps({'token': token})}\n\n"
    # Persist both sides of the turn once the full answer has streamed
    save_turn(session_id, "user", question)
    save_turn(session_id, "assistant", "".join(answer_parts))
    yield "data: [DONE]\n\n"

@app.post("/chat")
async def chat(request: dict):
    question = request["message"]
    session_id = request.get("session_id", "default")
    context = retrieve_context(question)   # ChromaDB lookup
    history = get_history(session_id)      # Redis lookup
    return StreamingResponse(
        stream_chat_response(question, context, history, session_id),
        media_type="text/event-stream"
    )
```
The vLLM server natively supports streaming. Tokens flow from GPU to browser with minimal buffering. ChromaDB provides the RAG retrieval layer.
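To make the wire format concrete, here is a small illustrative helper (not part of the server code) that reassembles the `data: {"token": ...}` events the endpoint emits; the frontend does the equivalent in JavaScript:

```python
import json

def parse_sse_tokens(raw_stream: str) -> str:
    """Reassemble streamed tokens from the SSE wire format
    emitted by the /chat endpoint (illustrative helper)."""
    tokens = []
    for line in raw_stream.splitlines():
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        tokens.append(json.loads(payload)["token"])
    return "".join(tokens)

raw = (
    'data: {"token": "Hello"}\n\n'
    'data: {"token": ", "}\n\n'
    'data: {"token": "world"}\n\n'
    "data: [DONE]\n\n"
)
print(parse_sse_tokens(raw))  # Hello, world
```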
Conversation Memory
```python
import redis, json

# decode_responses=True returns str rather than bytes from Redis
r = redis.Redis(decode_responses=True)

def get_history(session_id: str, max_turns: int = 10) -> list:
    # Last N turns = last 2N messages (one user + one assistant per turn)
    raw = r.lrange(f"chat:{session_id}", -max_turns * 2, -1)
    return [json.loads(msg) for msg in raw]

def save_turn(session_id: str, role: str, content: str):
    r.rpush(f"chat:{session_id}", json.dumps({"role": role, "content": content}))
    r.expire(f"chat:{session_id}", 3600)  # 1-hour session TTL
Redis stores the last 10 conversation turns per session with a one-hour expiry. This gives the LLM enough context to handle follow-up questions without consuming excessive prompt tokens.
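Ten turns of long answers can still inflate the prompt. One simple safeguard is a character-budget trim applied before the history is sent to the model (an illustrative sketch; the 4,000-character budget is an assumption, roughly 1,000 tokens at the common 4-chars-per-token rule of thumb):

```python
def trim_history(history: list, max_chars: int = 4000) -> list:
    """Keep the most recent messages whose combined content fits a
    rough character budget; older messages are dropped first."""
    trimmed, used = [], 0
    for msg in reversed(history):        # walk newest-first
        used += len(msg["content"])
        if used > max_chars:
            break
        trimmed.append(msg)
    return list(reversed(trimmed))       # restore chronological order

history = [
    {"role": "user", "content": "a" * 3000},
    {"role": "assistant", "content": "b" * 1500},
    {"role": "user", "content": "c" * 500},
]
print([len(m["content"]) for m in trim_history(history)])  # [1500, 500]
```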
RAG Integration
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = Chroma(persist_directory="/data/knowledge_base",
                     embedding_function=embeddings)

def retrieve_context(question: str, k: int = 3) -> str:
    docs = vectorstore.similarity_search(question, k=k)
    context_parts = []
    for doc in docs:
        source = doc.metadata.get("source", "Unknown")
        context_parts.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n".join(context_parts)
```
The retriever uses LangChain with ChromaDB. For higher retrieval quality, consider Qdrant or add a reranking step.
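The snippet above only queries an existing store; your documentation must first be split into chunks and embedded. LangChain's `RecursiveCharacterTextSplitter` is the usual choice, but the core idea is an overlapping sliding window, sketched here (the 500/50 sizes are illustrative assumptions):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split a document into overlapping character windows so each
    chunk fits the embedding model while keeping continuity across
    chunk boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 1200
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # 3 [500, 500, 300]
```

Each chunk would then be added to the vector store (e.g. via `vectorstore.add_texts`) with a `source` metadata field so `retrieve_context` can cite it.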
Web Chat Interface
The frontend streams tokens with the Fetch API: when the user presses Enter, it POSTs the message to `/chat` and reads the SSE-formatted response body incrementally through a `ReadableStream` reader (the browser's built-in `EventSource` only supports GET requests, so it cannot drive this POST endpoint). Tokens append to the chat bubble as they arrive, creating the real-time typing effect. Include a “Sources” section below each response showing which documents were used, and an “Escalate to human” button that creates a support ticket with the full conversation history.
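Because `retrieve_context` tags each chunk with a `[Source: …]` header, the “Sources” panel can be populated by parsing those markers back out of the context string (a sketch; in practice you might return the metadata alongside the stream instead):

```python
import re

def extract_sources(context: str) -> list:
    """Pull unique document names from the [Source: ...] markers
    produced by retrieve_context, preserving first-seen order."""
    seen, sources = set(), []
    for name in re.findall(r"\[Source: (.+?)\]", context):
        if name not in seen:
            seen.add(name)
            sources.append(name)
    return sources

context = ("[Source: faq.md]\nRefunds take 5 days.\n\n"
           "[Source: billing.md]\nInvoices are issued monthly.\n\n"
           "[Source: faq.md]\nMore FAQ text.")
print(extract_sources(context))  # ['faq.md', 'billing.md']
```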
Production Deployment
For production: add rate limiting per session to prevent abuse; implement input sanitisation to block prompt injection attempts; monitor response quality with automated sampling; set up fallback responses when the model is unavailable; and log conversations for quality review (with appropriate data protection measures). Scale to handle more concurrent users by deploying multiple LLM instances behind a load balancer. See chatbot hosting for infrastructure sizing, industry examples for deployment patterns, and more tutorials for extending this pipeline.
Chatbot GPU Servers
Dedicated GPU servers for streaming chatbot deployments. Low latency, high throughput, UK-hosted with full data control.
Browse GPU Servers