
Attention Mask Optimisation

Sliding window, sparse attention, and mask-based optimisations for long-context LLM serving. The patterns and the trade-offs.

For long-context LLM workloads, attention cost is quadratic in context length unless optimised. Several mask-based optimisations cut this: sliding window, sparse, and hybrid local-global attention. The right choice depends on the workload: some compromise quality on long-range tasks, while others are essentially free.

TL;DR

Three optimisation patterns: sliding window attention (each token attends to the most recent N tokens; Mistral 7B v0.1 uses 4096), sparse / hybrid attention (local plus global tokens; Longformer, Gemma 2), and recurrent / state-space models (Mamba; not strictly attention, but the same goal). For most workloads, sliding window is the right default, and it is supported natively in several models.
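To make the quadratic-versus-linear claim concrete, here is a back-of-the-envelope sketch (illustrative arithmetic only; `score_entries` is a made-up helper, and real kernels such as FlashAttention never materialise the full score matrix, though compute still scales the same way):

```python
def score_entries(seq_len: int, window: int | None = None) -> int:
    """Causal query-key score entries per head: full attention vs. sliding window."""
    if window is None:
        # Full causal attention: token i attends to i + 1 positions.
        return seq_len * (seq_len + 1) // 2
    # Sliding window: each token attends to at most `window` positions.
    return sum(min(i + 1, window) for i in range(seq_len))

seq_len, window = 32_768, 4_096
full = score_entries(seq_len)             # ~537M entries, grows quadratically
sliding = score_entries(seq_len, window)  # ~126M entries, grows linearly past the window
print(f"full attention:     {full:,}")
print(f"{window}-token window: {sliding:,}  ({full / sliding:.1f}x fewer)")
```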

Patterns

  • Sliding window attention: each token attends to the last N tokens (e.g., 4096). Cost is linear in total context length. Native in Mistral 7B v0.1; partial in Gemma 2. See the mask sketch after this list.
  • Sparse attention: each token attends to a subset (typically O(sqrt(N))). Lower quality on long-range dependencies.
  • Hybrid local-global: most tokens attend locally; some attend globally. Longformer pattern.
  • State-space models (Mamba, RWKV): linear-time alternative to attention; different architecture. Quality competitive in some benchmarks; not strictly attention.
  • Linear attention (Performer, Linformer): approximates attention in linear time; the quality drop is typically real.
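As a minimal sketch of the sliding-window and local-global patterns above (PyTorch; function names are illustrative, and this builds a dense boolean mask for clarity rather than the block-sparse kernels a real implementation would use):

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask, True where a query may attend to a key (causal + local window)."""
    idx = torch.arange(seq_len)
    causal = idx[:, None] >= idx[None, :]           # no attending to the future
    local = (idx[:, None] - idx[None, :]) < window  # only the last `window` tokens
    return causal & local

def local_global_mask(seq_len: int, window: int, global_idx: list[int]) -> torch.Tensor:
    """Longformer-style pattern: local window plus a few globally-attending tokens."""
    idx = torch.arange(seq_len)
    causal = idx[:, None] >= idx[None, :]
    is_global = torch.zeros(seq_len, dtype=torch.bool)
    is_global[global_idx] = True
    # Global tokens see all earlier tokens, and every token sees the global ones.
    return sliding_window_mask(seq_len, window) | (causal & (is_global[:, None] | is_global[None, :]))

# Toy usage: batch 1, 1 head, 16 tokens, head dim 8, window of 4, token 0 global.
q = k = v = torch.randn(1, 1, 16, 8)
mask = local_global_mask(16, window=4, global_idx=[0])
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # mask broadcasts over batch/heads
print(out.shape)  # torch.Size([1, 1, 16, 8])
```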

Model support

  • Mistral 7B: sliding window attention, 4096 tokens (in v0.1; v0.3 dropped it for full attention). See the config check after this list.
  • Mistral Small 3: full attention
  • Llama 3.x: full attention with RoPE position scaling for long context
  • Gemma 2: hybrid local-global attention
  • Qwen 2.5: full attention with various RoPE scaling
  • Mamba (Codestral Mamba): state-space; alternative architecture
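The quickest way to see what a given checkpoint declares is its config (a sketch assuming the Hugging Face `transformers` package; the model IDs are examples, some repos are gated, and not every model family exposes a `sliding_window` field, hence the `getattr` default):

```python
from transformers import AutoConfig

for model_id in ["mistralai/Mistral-7B-v0.1", "meta-llama/Meta-Llama-3-8B"]:
    cfg = AutoConfig.from_pretrained(model_id)  # gated repos need a Hub token
    window = getattr(cfg, "sliding_window", None)          # 4096 for Mistral-7B-v0.1
    max_pos = getattr(cfg, "max_position_embeddings", None)
    print(f"{model_id}: sliding_window={window}, max_position_embeddings={max_pos}")
```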

Trade-offs

  • Sliding window: quality is essentially the same as full attention for most tasks; long-range dependencies degrade slightly (see the KV-cache sizing sketch after this list)
  • Sparse / hybrid: more aggressive compression; the quality drop on long-range tasks is measurable
  • State-space models: linear time; quality is competitive, but the architecture and tooling are less mature
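The serving-side benefit of a sliding window is mostly KV-cache memory: the cache stops growing once the window is full. A rough sizing sketch (parameters approximate Mistral 7B with GQA: 32 layers, 8 KV heads, head dim 128, FP16; real servers such as vLLM add paging overhead, so treat these as estimates):

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

ctx, window = 32_768, 4_096
full = kv_cache_bytes(ctx)                 # full attention: cache holds every token
capped = kv_cache_bytes(min(ctx, window))  # sliding window: cache capped at the window
print(f"full attention : {full / 2**30:.1f} GiB per 32k sequence")    # ~4.0 GiB
print(f"4096 window    : {capped / 2**30:.1f} GiB per 32k sequence")  # ~0.5 GiB
```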

Verdict

For long-context production deployments, choose models with an attention mechanism that matches the workload. Mistral 7B v0.1 (sliding window) suits cost-sensitive long-context serving; Llama 3.x or Qwen 2.5 (full attention with RoPE scaling) suit premium quality at long context. Don't mix architectures within a deployment without measuring quality: the attention pattern affects subtle behaviours.

Bottom line

Sliding window for cost; full attention for quality. See the long-context VRAM guide for memory sizing.
