For long-context LLM workloads, attention cost grows quadratically with context length unless it is optimised. Several mask-based optimisations cut this cost: sliding window, sparse, and hybrid local-global attention. The right choice depends on the workload: some cost quality, while others are essentially free.
Three optimisation patterns: sliding window attention (each token attends to the most recent N tokens; Mistral 7B uses 4096), sparse / hybrid attention (local plus global attention; Longformer, Gemma 2), and recurrent / state-space models (Mamba; not strictly attention, but aimed at the same goal). For most workloads, sliding window is the right default and is supported natively in many models.
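To make the scaling concrete, here is a back-of-the-envelope sketch counting query-key pairs for causal full attention versus a sliding window. The 32k context and 4k window are illustrative numbers, not taken from any specific deployment.

```python
# Rough count of query-key pairs scored in one forward pass.
# Causal full attention: token i attends to i+1 tokens -> ~N^2/2 pairs total.
# Sliding window: token i attends to at most W tokens -> ~N*W pairs total.

def full_attention_pairs(n: int) -> int:
    return n * (n + 1) // 2

def sliding_window_pairs(n: int, window: int) -> int:
    return sum(min(i + 1, window) for i in range(n))

n, w = 32_768, 4_096  # illustrative: 32k context, 4k window
full = full_attention_pairs(n)
swa = sliding_window_pairs(n, w)
print(f"full attention : {full:,} pairs")
print(f"sliding window : {swa:,} pairs ({full / swa:.1f}x fewer)")
```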
Patterns
- Sliding window attention: each token attends to the last N tokens (e.g., 4096). Cost is linear in total context length. Mistral 7B native; Gemma 2 partial. See the mask sketch after this list.
- Sparse attention: each token attends to a subset of positions (typically O(√N)). Lower quality on long-range dependencies.
- Hybrid local-global: most tokens attend locally; some attend globally. Longformer pattern.
- State-space models (Mamba, RWKV): linear-time alternative to attention; different architecture. Quality competitive in some benchmarks; not strictly attention.
- Linear attention (Performer, Linformer): linearisation of the attention computation; the quality drop is typically real.
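A minimal NumPy sketch of how these masks differ. The window size and the choice of global positions are illustrative assumptions; real implementations (e.g., Mistral's or Longformer's) handle this inside fused attention kernels rather than with explicit dense masks.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Full causal attention: token i may attend to every token j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Causal sliding window: token i attends only to j in (i - window, i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def hybrid_local_global_mask(n: int, window: int, global_idx) -> np.ndarray:
    """Longformer-style hybrid: local window everywhere, plus a few global
    positions that everything attends to and that attend to everything."""
    mask = sliding_window_mask(n, window)
    mask[:, global_idx] = True    # every token can see the global positions
    mask[global_idx, :] = True    # global positions see everything
    return mask & causal_mask(n)  # keep the pattern causal

n, w = 16, 4
print(sliding_window_mask(n, w).sum(), "allowed pairs (sliding window)")
print(hybrid_local_global_mask(n, w, [0]).sum(), "allowed pairs (hybrid)")
```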
Model support
- Mistral 7B: sliding window attention, 4096 tokens (in v0.1; v0.2 and later dropped it in favour of full attention)
- Mistral Small 3: full attention
- Llama 3.x: full attention with RoPE position scaling for long context
- Gemma 2: hybrid local-global attention
- Qwen 2.5: full attention with various RoPE scaling
- Mamba (Codestral Mamba): state-space; alternative architecture
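To check what a given checkpoint actually ships with, the model config usually exposes these settings. The sketch below assumes the Hugging Face transformers library and access to the listed repos; field names such as sliding_window and rope_scaling vary by model family, so missing attributes are treated as "not used".

```python
from transformers import AutoConfig

# Inspect attention-related settings without downloading weights.
for name in ["mistralai/Mistral-7B-v0.1", "meta-llama/Meta-Llama-3.1-8B"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name)
    print("  max positions :", getattr(cfg, "max_position_embeddings", None))
    print("  sliding window:", getattr(cfg, "sliding_window", None))
    print("  rope scaling  :", getattr(cfg, "rope_scaling", None))
```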
Trade-offs
- Sliding window: quality essentially the same as full attention for most tasks, since stacked layers extend the effective receptive field well beyond the window (see the sketch after this list); long-range dependencies still degrade slightly
- Sparse / hybrid: more aggressive compression; the quality drop on long-range tasks is measurable
- State-space models: linear time; quality is competitive, but the architecture is less mature
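One reason sliding window holds up better than you might expect: information still hops across layers, so the effective receptive field grows with depth. A back-of-the-envelope sketch, with an illustrative layer count:

```python
# With a window of W tokens per layer, information can propagate roughly one
# extra window of distance per layer, so after L layers the theoretical
# receptive field is about W * L tokens (ignoring attenuation per hop).
window = 4_096   # illustrative sliding window size
layers = 32      # illustrative transformer depth
print(f"theoretical receptive field ~ {window * layers:,} tokens")
# In practice the usable range is smaller: each hop dilutes the signal,
# which is why long-range dependencies still degrade slightly.
```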
Verdict
For long-context production deployments, choose models with an attention mechanism matched to the workload: Mistral 7B v0.1 (sliding window) for cost-anchored long context; Llama 3.x or Qwen 2.5 (full attention with RoPE scaling) for premium quality at long context. Don't mix architectures within a deployment without measuring quality; the attention pattern affects subtle behaviours.
Bottom line
Sliding window for cost; full attention for quality. See long-context VRAM.