
Streaming Chatbot with LLaMA and RAG

Build a production streaming chatbot combining LLaMA with RAG retrieval, server-sent events, conversation memory, and a web frontend on dedicated GPU infrastructure.

You will build a customer support chatbot that streams responses token-by-token (like ChatGPT), retrieves answers from your knowledge base via RAG, maintains conversation history, and serves a clean web interface. The end result: a customer types a question and sees the answer appear word-by-word, with under 500 ms to first token and source links to your documentation. Everything runs on your GPU server — no per-token API fees, no data shared with third parties. Here is the full stack.

Technology Stack

| Layer     | Technology                | Purpose                             |
|-----------|---------------------------|-------------------------------------|
| LLM       | LLaMA 3.1 8B via vLLM     | Response generation with streaming  |
| Retrieval | ChromaDB + BGE embeddings | Knowledge base search               |
| Backend   | FastAPI + SSE             | API with server-sent events         |
| Memory    | Redis                     | Conversation history per session    |
| Frontend  | HTML + JavaScript         | Chat interface                      |

Streaming Backend

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import json

app = FastAPI()
# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed.
# AsyncOpenAI so the streaming loop doesn't block FastAPI's event loop.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def stream_chat_response(question: str, context: str,
                               history: list, session_id: str):
    messages = [
        {"role": "system", "content": f"You are a helpful support assistant. "
         f"Answer based on this context:\n{context}\n"
         f"If unsure, say you'll escalate to a human agent."}
    ] + history + [{"role": "user", "content": question}]

    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages, stream=True, max_tokens=500, temperature=0.3
    )
    answer = ""
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            answer += token
            yield f"data: {json.dumps({'token': token})}\n\n"
    # Persist the completed exchange so follow-up questions have context
    save_turn(session_id, "user", question)
    save_turn(session_id, "assistant", answer)
    yield "data: [DONE]\n\n"

@app.post("/chat")
async def chat(request: dict):
    question = request["message"]
    session_id = request.get("session_id", "default")

    # Retrieve relevant context from knowledge base
    context = retrieve_context(question)  # ChromaDB lookup
    history = get_history(session_id)     # Redis lookup

    return StreamingResponse(
        stream_chat_response(question, context, history, session_id),
        media_type="text/event-stream"
    )

The vLLM server natively supports streaming. Tokens flow from GPU to browser with minimal buffering. ChromaDB provides the RAG retrieval layer.
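On the consuming side, each SSE frame is a `data: {...}` line followed by a blank line. A small helper like this (a sketch assuming the exact framing emitted above; `parse_sse_tokens` is a hypothetical name, not part of any library) is handy for client scripts and tests:

```python
import json

def parse_sse_tokens(raw: str) -> list[str]:
    """Extract token strings from a raw SSE stream produced by /chat."""
    tokens = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if not frame.startswith("data: "):
            continue  # ignore comments and empty frames
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        tokens.append(json.loads(payload)["token"])
    return tokens
```

Joining the returned tokens reconstructs the full answer, which is useful when asserting on response quality in integration tests.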

Conversation Memory

import redis, json

# decode_responses=True returns str instead of bytes, so json.loads works directly
r = redis.Redis(decode_responses=True)

def get_history(session_id: str, max_turns: int = 10) -> list:
    # Each turn is two messages (user + assistant), hence max_turns * 2
    raw = r.lrange(f"chat:{session_id}", -max_turns * 2, -1)
    return [json.loads(msg) for msg in raw]

def save_turn(session_id: str, role: str, content: str):
    r.rpush(f"chat:{session_id}", json.dumps({"role": role, "content": content}))
    r.expire(f"chat:{session_id}", 3600)  # 1-hour session TTL

Redis stores the last 10 conversation turns per session with a one-hour expiry. This gives the LLM enough context to handle follow-up questions without consuming excessive prompt tokens.
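If individual messages are long, a turn cap alone can still overflow the prompt budget. A rough character-based trim (a hypothetical `trim_history` helper, using characters as a cheap proxy for tokens) can cap what gets sent to the model:

```python
def trim_history(history: list[dict], max_chars: int = 4000) -> list[dict]:
    """Keep the most recent messages whose combined length fits the budget."""
    trimmed, used = [], 0
    for msg in reversed(history):  # walk newest-first
        used += len(msg["content"])
        if used > max_chars:
            break  # adding this message would blow the budget
        trimmed.append(msg)
    return list(reversed(trimmed))  # restore chronological order
```

Dropping whole messages from the oldest end keeps the remaining history coherent, which matters more to the LLM than squeezing in a truncated fragment.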

RAG Integration

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = Chroma(persist_directory="/data/knowledge_base", embedding_function=embeddings)

def retrieve_context(question: str, k: int = 3) -> str:
    docs = vectorstore.similarity_search(question, k=k)
    context_parts = []
    for doc in docs:
        source = doc.metadata.get("source", "Unknown")
        context_parts.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n".join(context_parts)

The retriever uses LangChain with ChromaDB. For higher retrieval quality, consider Qdrant or add a reranking step.
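The reranking step mentioned above can be sketched generically: over-fetch candidates (say k=10 instead of 3), score each against the question with a stronger model, and keep the best few. Here `score_fn` stands in for a cross-encoder (for example one loaded via sentence-transformers' CrossEncoder class — an assumption, not part of the pipeline above):

```python
from typing import Callable

def rerank(question: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Re-order an over-fetched candidate list by relevance score, keep top_k.

    The cheap vector search recalls candidates; the slower, more accurate
    scorer decides which ones actually go into the prompt.
    """
    return sorted(docs, key=lambda d: score_fn(question, d), reverse=True)[:top_k]
```

Because `score_fn` is injected, the same function works with any scorer, from a simple lexical-overlap heuristic to a full cross-encoder.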

Web Chat Interface

The frontend streams the response with the fetch API and a ReadableStream reader (the native EventSource object only supports GET, so it cannot carry the POST body this endpoint expects). When the user presses Enter, the page POSTs the message to /chat and reads SSE frames from the response body, appending each token to the chat bubble as it arrives to create the real-time typing effect. Include a “Sources” section below each response showing which documents were used, and an “Escalate to human” button that creates a support ticket with the full conversation history.

Production Deployment

For production: add rate limiting per session to prevent abuse; implement input sanitisation to block prompt injection attempts; monitor response quality with automated sampling; set up fallback responses when the model is unavailable; and log conversations for quality review (with appropriate data protection measures). Scale to handle more concurrent users by deploying multiple LLM instances behind a load balancer. See chatbot hosting for infrastructure sizing, industry examples for deployment patterns, and more tutorials for extending this pipeline.
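The per-session rate limiting above can be as simple as an in-process sliding window (a sketch with a hypothetical `SessionRateLimiter` class; a multi-worker deployment would keep the counters in Redis instead):

```python
import time

class SessionRateLimiter:
    """Allow at most `rate` requests per `per` seconds for each session."""

    def __init__(self, rate: int = 10, per: float = 60.0):
        self.rate, self.per = rate, per
        self.buckets: dict[str, list[float]] = {}  # session_id -> timestamps

    def allow(self, session_id: str) -> bool:
        now = time.monotonic()
        window = self.buckets.setdefault(session_id, [])
        # Drop timestamps that have aged out of the window
        window[:] = [t for t in window if now - t < self.per]
        if len(window) >= self.rate:
            return False  # over the limit; caller should return HTTP 429
        window.append(now)
        return True
```

In the /chat handler, check `limiter.allow(session_id)` before doing any retrieval or generation, so abusive sessions never reach the GPU.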

Chatbot GPU Servers

Dedicated GPU servers for streaming chatbot deployments. Low latency, high throughput, UK-hosted with full data control.

Browse GPU Servers
