Every Message Grows the Context Window
Your chatbot works perfectly for the first few exchanges. By turn fifteen, responses slow to a crawl, context gets silently truncated, and the model forgets what was discussed three messages ago. Multi-turn conversations accumulate tokens with every exchange, and without explicit memory management your GPU server either runs out of KV cache or the model loses coherence. The context window is finite — you need a strategy for what stays and what goes.
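To make the growth concrete, here is a rough sketch of how a naive loop that resends the full history scales. The 150-tokens-per-exchange figure is an illustrative assumption, not a measurement:

```python
# Rough sketch of token growth when the full history is resent each turn.
# The 150-tokens-per-exchange figure is an illustrative assumption.
TOKENS_PER_EXCHANGE = 150

def prompt_size_at_turn(turn, per_exchange=TOKENS_PER_EXCHANGE):
    """Prompt tokens at a given turn when all prior exchanges are resent."""
    return turn * per_exchange

def total_tokens_processed(turns, per_exchange=TOKENS_PER_EXCHANGE):
    """Total prompt tokens the server processes over a whole session."""
    return sum(prompt_size_at_turn(t, per_exchange) for t in range(1, turns + 1))

print(prompt_size_at_turn(15))     # 2250 tokens in the prompt by turn 15
print(total_tokens_processed(15))  # 18000 tokens processed in total
```

Linear prompt growth means quadratic total work across the session, which is why both latency and KV cache pressure climb together.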
Context Window Token Budgeting
Allocate your context window into fixed segments to prevent overflow:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct")

MAX_CONTEXT = 8192
SYSTEM_BUDGET = 500        # System prompt
COMPLETION_BUDGET = 1024   # Reserved for model response
HISTORY_BUDGET = MAX_CONTEXT - SYSTEM_BUDGET - COMPLETION_BUDGET
# = 6668 tokens for conversation history

def count_tokens(text):
    return len(tokenizer.encode(text))

def build_prompt(system_msg, messages, max_history=HISTORY_BUDGET):
    """Build a prompt that fits within the context budget."""
    history_tokens = 0
    included = []
    # Walk newest-to-oldest so the most recent messages are kept
    for msg in reversed(messages):
        msg_tokens = count_tokens(msg["content"]) + 4  # +4 for role formatting overhead
        if history_tokens + msg_tokens > max_history:
            break
        included.insert(0, msg)
        history_tokens += msg_tokens
    return [{"role": "system", "content": system_msg}] + included
```
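A quick way to sanity-check the budgeting loop is to run it with a cheap stand-in tokenizer (a plain whitespace split, so no model download is needed). The tiny 50-token budget and the message contents are made up for the demo:

```python
# Same budgeting loop as build_prompt, with a whitespace-split stand-in
# for count_tokens so the example runs without the HF tokenizer.
def count_tokens(text):
    return len(text.split())

def build_prompt(system_msg, messages, max_history=50):
    history_tokens = 0
    included = []
    for msg in reversed(messages):  # newest first
        msg_tokens = count_tokens(msg["content"]) + 4
        if history_tokens + msg_tokens > max_history:
            break
        included.insert(0, msg)
        history_tokens += msg_tokens
    return [{"role": "system", "content": system_msg}] + included

messages = [{"role": "user", "content": f"message number {i} " + "word " * 20}
            for i in range(10)]
prompt = build_prompt("You are a helpful assistant.", messages)
# Each message costs 27 budget tokens, so only the newest one fits in 50
print(len(prompt))                # 2: the system message plus the latest turn
print(prompt[1] is messages[-1])  # True
```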
Message Truncation Strategies
When history exceeds the budget, choose a truncation approach that preserves coherence:
```python
# Strategy 1: Sliding window — drop oldest messages
def sliding_window(messages, max_tokens):
    total = 0
    result = []
    for msg in reversed(messages):
        tokens = count_tokens(msg["content"]) + 4
        if total + tokens > max_tokens:
            break
        result.insert(0, msg)
        total += tokens
    return result

# Strategy 2: Keep first + recent — preserve opening context
def first_plus_recent(messages, max_tokens):
    first_msg = messages[0]
    first_tokens = count_tokens(first_msg["content"]) + 4
    remaining = max_tokens - first_tokens
    recent = sliding_window(messages[1:], remaining)
    return [first_msg] + recent

# Helper: total token cost of a message list
def count_all(messages):
    return sum(count_tokens(m["content"]) + 4 for m in messages)

# Strategy 3: Summarise and compress old messages
def summarise_and_keep(messages, max_tokens, summariser):
    if count_all(messages) <= max_tokens:
        return messages
    # Summarise the older half, keep the recent half verbatim
    midpoint = len(messages) // 2
    old = messages[:midpoint]
    recent = messages[midpoint:]
    summary = summariser(old)
    return [{"role": "system",
             "content": f"Prior context: {summary}"}] + recent
```
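As a quick check that the first-plus-recent strategy really pins the opening message, the sketch below reuses the same functions with a whitespace-split token counter. The 45-token budget and message contents are invented for the example:

```python
# Whitespace-split stand-in for count_tokens; budgets are illustrative.
def count_tokens(text):
    return len(text.split())

def sliding_window(messages, max_tokens):
    total, result = 0, []
    for msg in reversed(messages):
        tokens = count_tokens(msg["content"]) + 4
        if total + tokens > max_tokens:
            break
        result.insert(0, msg)
        total += tokens
    return result

def first_plus_recent(messages, max_tokens):
    first_msg = messages[0]
    remaining = max_tokens - (count_tokens(first_msg["content"]) + 4)
    return [first_msg] + sliding_window(messages[1:], remaining)

msgs = [{"role": "user", "content": "filler " * 10} for _ in range(8)]
msgs[0] = {"role": "user", "content": "My name is Ada and I prefer metric units"}
kept = first_plus_recent(msgs, 45)
print(kept[0]["content"])  # the opening message survives truncation
print(len(kept))           # 3: first message plus the two newest that fit
```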
Rolling Conversation Summaries
Use the LLM itself to compress old context into a summary that fits a fraction of the original tokens:
```python
import requests

def generate_summary(messages, model="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    conversation_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages)
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": model,
            "messages": [{
                "role": "user",
                "content": "Summarise this conversation in under 200 words."
                           " Preserve key facts, decisions, and user"
                           f" preferences:\n\n{conversation_text}"
            }],
            "temperature": 0.1,  # low temperature keeps summaries faithful
            "max_tokens": 300
        })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Trigger summarisation when history exceeds 70% of the budget
class ConversationManager:
    def __init__(self, max_history_tokens=6000):
        self.messages = []
        self.summary = ""
        self.max_tokens = max_history_tokens
        self.threshold = int(max_history_tokens * 0.7)

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if count_all(self.messages) > self.threshold:
            midpoint = len(self.messages) // 2
            old = self.messages[:midpoint]
            if self.summary:
                # Fold the previous summary in so earlier context isn't lost
                old = [{"role": "system",
                        "content": f"Prior summary: {self.summary}"}] + old
            self.summary = generate_summary(old)
            self.messages = self.messages[midpoint:]

    def get_prompt(self, system_msg):
        if self.summary:
            msgs = [{"role": "system",
                     "content": f"{system_msg}\n\nContext: {self.summary}"}]
        else:
            msgs = [{"role": "system", "content": system_msg}]
        return msgs + self.messages
```
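To exercise the 70% trigger without a running inference server, the variant below injects a stub summariser and a whitespace-based token count. The tiny 100-token budget is deliberate so the threshold fires quickly; both stubs are assumptions for the demo:

```python
# Miniature ConversationManager with an injected summariser stub and a
# whitespace-based count_all, so the 70% trigger can be tested offline.
def count_all(messages):
    return sum(len(m["content"].split()) + 4 for m in messages)

class DemoConversationManager:
    def __init__(self, summariser, max_history_tokens=100):
        self.messages, self.summary = [], ""
        self.threshold = int(max_history_tokens * 0.7)
        self.summariser = summariser

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if count_all(self.messages) > self.threshold:
            midpoint = len(self.messages) // 2
            self.summary = self.summariser(self.messages[:midpoint])
            self.messages = self.messages[midpoint:]

def stub_summariser(messages):  # stands in for the LLM call
    return f"summary of {len(messages)} messages"

mgr = DemoConversationManager(stub_summariser)
for _ in range(10):
    mgr.add_message("user", "token " * 10)  # 14 budget tokens per message
print(mgr.summary)        # set once history crossed the 70-token threshold
print(len(mgr.messages))  # only the recent half is kept verbatim
```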
Session Storage and Persistence
Store conversation state externally so sessions survive server restarts:
```python
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_session(session_id, messages, summary="", ttl=86400):
    data = json.dumps({"messages": messages, "summary": summary,
                       "updated": time.time()})
    r.setex(f"chat:{session_id}", ttl, data)  # TTL auto-expires idle sessions

def load_session(session_id):
    data = r.get(f"chat:{session_id}")
    if not data:
        return [], ""
    parsed = json.loads(data)
    return parsed["messages"], parsed.get("summary", "")

# Belt-and-braces sweep for stale sessions (setex TTL handles normal expiry)
def cleanup_old_sessions(max_age=604800):
    for key in r.scan_iter("chat:*"):
        raw = r.get(key)
        if raw is None:  # key expired between scan and get
            continue
        data = json.loads(raw)
        if time.time() - data["updated"] > max_age:
            r.delete(key)
```
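Because the payload is plain JSON, the round trip is easy to verify with an in-memory dict standing in for Redis (the `store` dict and the session ids are demo assumptions):

```python
import json
import time

store = {}  # in-memory stand-in for Redis

def save_session(session_id, messages, summary=""):
    store[f"chat:{session_id}"] = json.dumps(
        {"messages": messages, "summary": summary, "updated": time.time()})

def load_session(session_id):
    data = store.get(f"chat:{session_id}")
    if not data:
        return [], ""
    parsed = json.loads(data)
    return parsed["messages"], parsed.get("summary", "")

save_session("demo", [{"role": "user", "content": "hi"}], summary="greeting")
msgs, summary = load_session("demo")
print(msgs, summary)            # the payload survives the round trip
print(load_session("missing"))  # ([], '') for unknown sessions
```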
KV Cache Reuse for Multi-Turn
On vLLM, prefix caching avoids recomputing the KV cache for shared conversation history:
```shell
# Enable prefix caching in vLLM (reuses KV cache across turns)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --max-model-len 8192
```

With prefix caching enabled, sending the same conversation prefix on subsequent turns reuses the cached KV entries — only new tokens require computation. This reduces TTFT from seconds to milliseconds on long conversations.
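A back-of-envelope sketch of what that saves, assuming an illustrative 150 new prompt tokens per turn and an exact-match prefix on every request:

```python
# Prompt tokens needing fresh KV computation at a given turn, with and
# without prefix caching. 150 tokens/turn is an illustrative assumption.
TOKENS_PER_TURN = 150

def tokens_computed(turn, cached):
    full_prompt = turn * TOKENS_PER_TURN
    # With caching, only the newly appended turn needs prefill compute
    return TOKENS_PER_TURN if cached else full_prompt

print(tokens_computed(20, cached=False))  # 3000 tokens recomputed per request
print(tokens_computed(20, cached=True))   # 150 tokens, only the new turn
```

Note that the saving only applies while the prefix stays byte-identical: truncation strategies that rewrite the front of the history (rolling summaries, dropped messages) invalidate the cached prefix for that turn.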
Multi-turn memory management is essential for production chatbots on your GPU server. Use vLLM prefix caching for latency wins and Ollama for simpler deployments. The vLLM production guide covers cache configuration. See the infrastructure section for storage setup, tutorials for implementation details, and benchmarks for throughput under multi-turn load.