
CodeLlama VRAM Requirements (7B, 13B, 34B)

Complete CodeLlama VRAM requirements for 7B, 13B, and 34B across all variants (Base, Instruct, Python). FP32, FP16, INT8, INT4 tables and GPU picks.

CodeLlama VRAM Requirements Overview

Meta’s CodeLlama is a code-specialized version of LLaMA 2, available in 7B, 13B, and 34B parameter sizes with Base, Instruct, and Python variants. Thanks to long-context fine-tuning, it supports inference on up to 100K tokens of context. This guide covers VRAM requirements for every variant to help you pick the right dedicated GPU server for code model hosting.

CodeLlama uses the same architecture as LLaMA 2 with additional training on code datasets. All sizes were trained on 16K-token sequences and extrapolate to longer contexts via a retuned RoPE base. Code infilling (fill-in-the-middle) is supported by the 7B and 13B Base and Instruct models, but not by the 34B model or the Python variants.

Complete VRAM Table (All Models)

| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| CodeLlama 7B | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 7B Instruct | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 7B Python | 6.7B | ~27 GB | ~13.5 GB | ~7 GB | ~4.5 GB |
| CodeLlama 13B | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 13B Instruct | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 13B Python | 13B | ~52 GB | ~26 GB | ~13 GB | ~8 GB |
| CodeLlama 34B | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |
| CodeLlama 34B Instruct | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |
| CodeLlama 34B Python | 33.7B | ~135 GB | ~67.5 GB | ~34 GB | ~20 GB |

The Base, Instruct, and Python variants at each size have identical VRAM requirements since they share the same architecture and parameter count. The difference is in training data and fine-tuning. For newer code models, also see our Qwen VRAM requirements page (Qwen2.5-Coder) and DeepSeek VRAM requirements (DeepSeek Coder).
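The table figures follow directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic (note the INT4 column sits above the raw figure because group-wise quantization stores extra scale factors, and a running server needs additional headroom for activations and the KV cache):

```python
# Weight memory in decimal GB: parameters x bits per parameter / 8.
# Matches the FP32/FP16/INT8 columns of the table above; real deployments
# need extra headroom beyond this raw figure.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8

print(weight_gb(6.7, 16))   # CodeLlama 7B at FP16  -> 13.4
print(weight_gb(33.7, 32))  # CodeLlama 34B at FP32 -> 134.8
```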

Which GPU Do You Need?

| GPU | VRAM | Best CodeLlama | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | 7B | 4-bit / INT8 | IDE copilot |
| RTX 4060 | 8 GB | 7B / 13B | INT8 / 4-bit | Dev / personal |
| RTX 4060 Ti | 16 GB | 7B / 13B | FP16 / INT8 | Code API |
| RTX 3090 | 24 GB | 13B / 34B | INT8 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | 34B | INT8 | Best quality |

Note that 13B at FP16 (~26 GB) does not fit on a single 24 GB card, and 34B at FP16 (~67.5 GB) exceeds even 48 GB across two cards, so the dual-3090 setup runs 34B at INT8.

Context Length Impact on VRAM

Code completion tasks often need long context to include relevant files. Here is the KV cache impact:

| Context | 7B KV Cache | 13B KV Cache | 34B KV Cache |
|---|---|---|---|
| 4,096 | ~0.5 GB | ~1 GB | ~2 GB |
| 8,192 | ~1 GB | ~2 GB | ~4 GB |
| 16,384 | ~2 GB | ~4 GB | ~8 GB |
| 32,768 | ~4 GB | ~8 GB | ~16 GB |
| 100,000 | ~12 GB | ~24 GB | ~48 GB |

At 16K context (the native training length), the KV cache is manageable. Extending the context toward 100K requires substantial additional VRAM, especially for the 34B variant.
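The underlying formula is: KV cache = 2 (K and V) x layers x KV heads x head dim x sequence length x bytes per element. A sketch using the 7B model's LLaMA 2 dimensions (32 layers, 32 KV heads, head dim 128, batch size 1); note that at FP16 this gives figures about 4x the table above, whose numbers are consistent with a cache quantized to roughly 4 bits per element:

```python
# KV-cache size: 2 (K and V) x n_layers x n_kv_heads x head_dim
#                x seq_len x bytes_per_element, batch size 1 assumed.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_element: float = 2.0) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element / 1024**3

print(kv_cache_gb(32, 32, 128, 16384))       # FP16 cache at 16K tokens -> 8.0
print(kv_cache_gb(32, 32, 128, 16384, 0.5))  # same cache at 4-bit      -> 2.0
```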

Batch Size Impact on VRAM

For IDE-style code completion, you typically serve one user at a time. For team code APIs, batching matters:

| Model (4-bit, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| CodeLlama 7B | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| CodeLlama 13B | ~9 GB | ~13 GB | ~17 GB | ~25 GB |
| CodeLlama 34B | ~22 GB | ~30 GB | ~38 GB | ~54 GB |

For a small development team (4-8 developers), CodeLlama 13B at 4-bit on an RTX 3090 handles concurrent requests well.
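The batch figures scale roughly linearly: total memory is approximately the quantized weights plus a per-sequence KV/activation cost times the batch size. A sketch with illustrative constants (the ~8 GB of 4-bit weights and ~1 GB per 4K-token sequence are assumptions chosen to approximate the 13B row, not measured values):

```python
# Serving memory ~= quantized weights + per-sequence cost x batch size.
# Constants are illustrative assumptions, not measurements.
def serving_vram_gb(weights_gb: float, per_seq_gb: float, batch: int) -> float:
    return weights_gb + per_seq_gb * batch

# Approximate CodeLlama 13B (4-bit, 4K context) at several batch sizes
for batch in (1, 4, 8, 16):
    print(batch, serving_vram_gb(8.0, 1.0, batch))
```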

Practical Deployment Recommendations

  • Personal IDE copilot: CodeLlama 7B on RTX 4060 (INT8 or 4-bit). Fast code completion at 25-30 tok/s. Use the Instruct variant for chat, Python variant for Python-specific tasks.
  • Small team code API: CodeLlama 13B on RTX 3090 (INT8). Better code quality, handles 4+ concurrent developers.
  • Best code quality: CodeLlama 34B on RTX 3090 (4-bit) or 2x RTX 3090 (INT8; FP16 at ~67.5 GB does not fit in 48 GB). The strongest code generation in the CodeLlama family.
  • Newer alternatives: Consider Qwen2.5-Coder or DeepSeek Coder V2 for potentially better performance on recent benchmarks. See our code model hosting page for comparisons.

For cost comparisons, see our cost per 1M tokens: GPU vs API analysis and the LLM cost calculator.

Quick Setup Commands

Ollama (IDE Integration)

# Install and run for code completion
curl -fsSL https://ollama.com/install.sh | sh
ollama run codellama:7b-code      # Code completion
ollama run codellama:7b-instruct  # Chat about code
ollama run codellama:13b-instruct # Better quality

vLLM (Team API)

# CodeLlama 13B AWQ (4-bit) API server
# (the official codellama/CodeLlama-13b-Instruct-hf repo is unquantized FP16;
#  use a pre-quantized AWQ repo with --quantization awq)
pip install vllm
vllm serve TheBloke/CodeLlama-13B-Instruct-AWQ \
  --quantization awq --max-model-len 16384

# CodeLlama 34B AWQ (4-bit) on RTX 3090
vllm serve TheBloke/CodeLlama-34B-Instruct-AWQ \
  --quantization awq --max-model-len 8192

Continue.dev Integration

# After starting Ollama, configure Continue.dev in VS Code:
# Set model to codellama:7b-code for autocomplete
# Set model to codellama:13b-instruct for chat
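As a concrete example, a minimal `~/.continue/config.json` along those lines might look like the following. This assumes Continue's JSON config format; newer Continue releases use a YAML config instead, and exact field names may vary between versions:

```json
{
  "models": [
    {
      "title": "CodeLlama 13B Instruct",
      "provider": "ollama",
      "model": "codellama:13b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama 7B Code",
    "provider": "ollama",
    "model": "codellama:7b-code"
  }
}
```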

For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with other code models on our best GPU for LLM inference page and use the benchmark tool for speed data.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
