
Ollama Context Length Configuration

Configure Ollama context length for different use cases. Covers num_ctx parameter, VRAM impact calculations, per-request overrides, Modelfile defaults, and long-context model selection on GPU servers.

Why Context Length Directly Affects Your GPU Bill

Context length determines how much text your model can process in a single request. A 4096-token context handles short conversations. A 128K context can process entire documents. The catch: every doubling of context length roughly doubles the KV cache memory, eating into the VRAM available for model weights on your GPU server. Getting this number right means the difference between a model that fits comfortably in VRAM and one that spills to CPU, destroying performance.
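The doubling claim is easy to sanity-check with shell arithmetic. This is a hedged sketch: the ~256 KiB-of-KV-cache-per-token figure is an assumed value for an 8B model (it matches the VRAM table later in this guide), not something Ollama reports directly.

```shell
# Illustrative only: assumes ~262144 bytes (256 KiB) of KV cache per token
for ctx in 4096 8192 16384; do
  echo "$ctx tokens -> $(( ctx * 262144 / 1048576 )) MB KV cache"
done
# each doubling of context doubles the cache: 1024, 2048, 4096 MB
```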

Check Your Current Context Setting

# See the default context length for a model
ollama show llama3.1:8b --modelfile | grep num_ctx

# Check what Ollama actually allocates at runtime
# Start a request and look at server logs
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep "context"

# The non-streaming API response includes the context token ids;
# the array length is the number of tokens currently in context
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "stream": false
}' | jq '.context | length'

Many models default to 2048 or 4096 tokens in Ollama, even when the architecture supports much more: Llama 3.1 supports 128K tokens natively, but Ollama may not allocate that by default.

Set Context Length Per Request

Override context length on individual API calls without changing the model’s default:

# Short context for simple Q&A (saves VRAM)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is GPU hosting?",
  "options": {
    "num_ctx": 2048
  }
}'

# Long context for document analysis
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this document: ...",
  "options": {
    "num_ctx": 32768
  }
}'

Per-request overrides let you optimise VRAM usage dynamically. Short queries use minimal memory, freeing capacity for concurrent requests.
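To put numbers on that, here is a hedged back-of-envelope using the same assumed ~256 KiB-per-token KV cache figure for Llama 3.1 8B that the VRAM table below is built on:

```shell
# MB of KV cache freed by dropping num_ctx from 32768 to 2048 (estimate)
echo $(( (32768 - 2048) * 262144 / 1048576 ))   # prints 7680, i.e. ~7.5 GB
```

That is several short-context requests' worth of concurrent headroom on a 24 GB card.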

Set a Permanent Default via Modelfile

Create a model variant with your preferred context length baked in:

# Create a long-context variant
cat > Modelfile-32k << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 32768
PARAMETER num_batch 512
EOF

ollama create llama3.1-32k -f Modelfile-32k

# Create a short-context variant for chat
cat > Modelfile-4k << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
PARAMETER num_batch 256
EOF

ollama create llama3.1-chat -f Modelfile-4k

Maintain separate model aliases for different context needs rather than constantly overriding at request time.

VRAM Impact by Context Length

The KV cache size scales with context length, the number of layers, and the number and dimension of attention heads:

# Approximate KV cache VRAM for Llama 3.1 8B:
# 2048 tokens:  ~0.5 GB
# 4096 tokens:  ~1.0 GB
# 8192 tokens:  ~2.0 GB
# 16384 tokens: ~4.0 GB
# 32768 tokens: ~8.0 GB
# 65536 tokens: ~16.0 GB

# Model weights (Q4_K_M): ~4.5 GB
# Total VRAM at 32K context: ~12.5 GB (fits in 16 GB card)
# Total VRAM at 64K context: ~20.5 GB (needs 24 GB card)

# Check actual usage during inference
nvidia-smi --query-gpu=memory.used --format=csv -l 1

On a 24 GB GPU, a Q4 8B model comfortably runs 32K context. For 128K context, you need either a 48+ GB GPU or vLLM's PagedAttention which manages KV cache memory more efficiently.
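The figures in the comment block above can be reproduced from the model's geometry. This is a hedged sketch: the layer count, KV-head count, and head dimension are the published Llama 3.1 8B values, and `bytes_per_elem=4` is the assumption that reproduces the table; a 2-byte f16 KV cache would halve every figure.

```shell
# KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens
kv_cache_mb() {
  tokens=$1
  layers=32        # Llama 3.1 8B transformer layers
  kv_heads=8       # grouped-query attention KV heads
  head_dim=128     # dimension per attention head
  bytes_per_elem=4 # assumption that matches the table above; f16 would be 2
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1048576 ))
}

kv_cache_mb 32768   # prints 8192 (8 GB), matching the table
```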

Optimisation Strategies

Enable flash attention to reduce the memory overhead of long contexts:

# Flash attention significantly reduces KV cache memory at long contexts
OLLAMA_FLASH_ATTENTION=1 ollama serve

# Combine with num_batch for throughput tuning
# Higher num_batch = faster prompt processing, more VRAM
# Lower num_batch = slower processing, less VRAM peak usage
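Flash attention also unlocks KV cache quantisation via the `OLLAMA_KV_CACHE_TYPE` environment variable (supported values include `f16`, `q8_0`, and `q4_0`). A sketch of a long-context server environment; the `q8_0` choice here is a suggestion, not a benchmark result:

```shell
# Set before starting ollama serve (e.g. in a systemd override file)
export OLLAMA_FLASH_ATTENTION=1   # required for KV cache quantisation
export OLLAMA_KV_CACHE_TYPE=q8_0  # roughly halves KV cache memory vs f16
```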

For Ollama hosting in production, set context length to the minimum your application genuinely requires: shorter contexts use less VRAM and leave headroom for concurrent requests. Our benchmarks show throughput at different context lengths, and the vLLM production guide covers workloads that need very long context with better memory efficiency. The CUDA guide and tutorials section cover GPU configuration, and the LLM hosting blog compares context handling across serving frameworks.

High-VRAM GPUs for Long Context

GigaGPU's RTX 6000 Pro 96 GB servers handle 128K+ context windows with room to spare.
