LLM Multi-Turn Memory Management

Manage multi-turn conversation memory for self-hosted LLMs. Covers context window budgeting, message truncation strategies, summarisation, KV cache reuse, and session storage on GPU servers.

Every Message Grows the Context Window

Your chatbot works perfectly for the first few exchanges. By turn fifteen, responses slow to a crawl, context gets silently truncated, and the model forgets what was discussed three messages ago. Multi-turn conversations accumulate tokens with every exchange, and without explicit memory management your GPU server either runs out of KV cache or the model loses coherence. The context window is finite — you need a strategy for what stays and what goes.

Context Window Token Budgeting

Allocate your context window into fixed segments to prevent overflow:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct")

MAX_CONTEXT = 8192
SYSTEM_BUDGET = 500       # System prompt
COMPLETION_BUDGET = 1024  # Reserved for model response
HISTORY_BUDGET = MAX_CONTEXT - SYSTEM_BUDGET - COMPLETION_BUDGET
# = 6668 tokens for conversation history

def count_tokens(text):
    return len(tokenizer.encode(text))

def count_all(messages):
    """Total tokens across a message list, including per-message overhead."""
    return sum(count_tokens(m["content"]) + 4 for m in messages)

def build_prompt(system_msg, messages, max_history=HISTORY_BUDGET):
    """Build prompt that fits within context budget."""
    history_tokens = 0
    included = []

    # Always include most recent messages first
    for msg in reversed(messages):
        # The +4 approximates per-message chat-template overhead (role headers etc.)
        msg_tokens = count_tokens(msg["content"]) + 4
        if history_tokens + msg_tokens > max_history:
            break
        included.insert(0, msg)
        history_tokens += msg_tokens

    return [{"role": "system", "content": system_msg}] + included

Message Truncation Strategies

When history exceeds the budget, choose a truncation approach that preserves coherence:

# Strategy 1: Sliding window — drop oldest messages
def sliding_window(messages, max_tokens):
    total = 0
    result = []
    for msg in reversed(messages):
        tokens = count_tokens(msg["content"]) + 4
        if total + tokens > max_tokens:
            break
        result.insert(0, msg)
        total += tokens
    return result

# Strategy 2: Keep first + last — preserve opening context
def first_plus_recent(messages, max_tokens):
    first_msg = messages[0]
    first_tokens = count_tokens(first_msg["content"]) + 4
    remaining = max_tokens - first_tokens
    recent = sliding_window(messages[1:], remaining)
    return [first_msg] + recent

# Strategy 3: Summarise and compress old messages
def summarise_and_keep(messages, max_tokens, summariser):
    if count_all(messages) <= max_tokens:
        return messages
    # Summarise older half, keep recent half verbatim
    midpoint = len(messages) // 2
    old = messages[:midpoint]
    recent = messages[midpoint:]
    summary = summariser(old)
    return [{"role": "system", "content": f"Prior context: {summary}"}
            ] + recent
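
As a rough guide, sliding_window suits open-ended chat, first_plus_recent preserves a task brief or persona set in the opening message, and summarise_and_keep trades one extra LLM call for longer effective memory. A usage sketch, reusing the illustrative history list from above (the summariser can be any callable; generate_summary is defined in the next section):

trimmed = sliding_window(history, HISTORY_BUDGET)        # drop oldest first
anchored = first_plus_recent(history, HISTORY_BUDGET)    # keep opening + recent
compressed = summarise_and_keep(history, HISTORY_BUDGET,
                                summariser=generate_summary)  # compress older half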

Rolling Conversation Summaries

Use the LLM itself to compress old context into a summary that fits in a fraction of the original token count:

import requests

def generate_summary(messages, model="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    conversation_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages)

    resp = requests.post("http://localhost:8000/v1/chat/completions",
        json={
            "model": model,
            "messages": [{
                "role": "user",
                "content": f"Summarise this conversation in under 200 words."
                           f" Preserve key facts, decisions, and user "
                           f"preferences:\n\n{conversation_text}"
            }],
            "temperature": 0.1,
            "max_tokens": 300
        })
    return resp.json()["choices"][0]["message"]["content"]

# Trigger summarisation when history exceeds 70% of budget
class ConversationManager:
    def __init__(self, max_history_tokens=6000):
        self.messages = []
        self.summary = ""
        self.max_tokens = max_history_tokens
        self.threshold = int(max_history_tokens * 0.7)

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if count_all(self.messages) > self.threshold:
            midpoint = len(self.messages) // 2
            old = self.messages[:midpoint]
            # Roll the previous summary into the new one so earlier facts survive
            if self.summary:
                old = [{"role": "system",
                        "content": f"Earlier summary: {self.summary}"}] + old
            self.summary = generate_summary(old)
            self.messages = self.messages[midpoint:]

    def get_prompt(self, system_msg):
        msgs = []
        if self.summary:
            msgs.append({"role": "system",
                         "content": f"{system_msg}\n\nContext: {self.summary}"})
        else:
            msgs.append({"role": "system", "content": system_msg})
        return msgs + self.messages
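
A per-turn helper around the manager might look like the sketch below; it assumes the same vLLM OpenAI-compatible endpoint on localhost:8000 that generate_summary uses above:

def chat_turn(manager, user_input, system_msg="You are a helpful assistant."):
    """Run one conversation turn through the manager and the local endpoint."""
    manager.add_message("user", user_input)
    resp = requests.post("http://localhost:8000/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": manager.get_prompt(system_msg),
            "max_tokens": 1024
        })
    reply = resp.json()["choices"][0]["message"]["content"]
    manager.add_message("assistant", reply)
    return reply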

Session Storage and Persistence

Store conversation state externally so sessions survive server restarts:

import json, redis, time

r = redis.Redis(host="localhost", port=6379, db=0)

def save_session(session_id, messages, summary="", ttl=86400):
    data = json.dumps({"messages": messages, "summary": summary,
                       "updated": time.time()})
    r.setex(f"chat:{session_id}", ttl, data)

def load_session(session_id):
    data = r.get(f"chat:{session_id}")
    if not data:
        return [], ""
    parsed = json.loads(data)
    return parsed["messages"], parsed.get("summary", "")

# Clean up expired sessions
def cleanup_old_sessions(max_age=604800):
    for key in r.scan_iter("chat:*"):
        raw = r.get(key)
        if raw is None:
            continue  # key expired between scan and get
        data = json.loads(raw)
        if time.time() - data["updated"] > max_age:
            r.delete(key)
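
To tie the two pieces together, restore the stored state into a ConversationManager at the start of each request and write it back afterwards. A sketch, assuming session_id comes from your web framework and chat_turn is the helper defined earlier:

def handle_turn(session_id, user_input, system_msg):
    # Restore per-session state from Redis before calling the model
    messages, summary = load_session(session_id)
    manager = ConversationManager()
    manager.messages, manager.summary = messages, summary

    reply = chat_turn(manager, user_input, system_msg)

    # Persist the updated state so the next request can resume the session
    save_session(session_id, manager.messages, manager.summary)
    return reply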

KV Cache Reuse for Multi-Turn

On vLLM, prefix caching avoids recomputing the KV cache for shared conversation history:

# Enable prefix caching in vLLM (reuses KV cache across turns)
# vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
#   --enable-prefix-caching \
#   --max-model-len 8192

# With prefix caching enabled, sending the same conversation prefix
# on subsequent turns reuses the cached KV entries — only new tokens
# require computation. This reduces TTFT from seconds to milliseconds
# on long conversations.
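
Prefix caching only pays off when the prompt prefix is identical between turns, so keep the system prompt and earlier messages unchanged and append new ones at the end. Note that truncating or summarising history rewrites the prefix and forfeits those cache hits, so apply them only when the budget forces it. A sketch of a turn loop that preserves the prefix (message contents are illustrative):

# Resend the identical prefix each turn; vLLM recomputes KV only for new tokens
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Which GPU fits a 7B model?", "And a 70B model?"]:
    messages.append({"role": "user", "content": user_input})
    resp = requests.post("http://localhost:8000/v1/chat/completions",
        json={"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
              "messages": messages, "max_tokens": 512})
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})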

Multi-turn memory management is essential for production chatbots on your GPU server. Use vLLM prefix caching for latency wins and Ollama for simpler deployments. The vLLM production guide covers cache configuration. See the infrastructure section for storage setup, tutorials for implementation details, and benchmarks for throughput under multi-turn load.
