
Can RTX 3090 Run CodeLlama 34B?

The RTX 3090 can run CodeLlama 34B in INT4 quantisation with its 24GB VRAM. Below are the VRAM breakdown, coding benchmarks, and a setup guide.

Yes, the RTX 3090 can run CodeLlama 34B in INT4 quantisation. With 24GB GDDR6X VRAM, the RTX 3090 fits this large coding model when quantised to 4-bit precision, making it a viable option for code model hosting. FP16 and INT8 require more VRAM than a single 3090 provides, but INT4 delivers surprisingly good code generation quality.

The Short Answer

YES in INT4 quantisation with 4K-8K context. NO in FP16 or INT8.

CodeLlama 34B has 33.7 billion parameters. In FP16, the model weights need approximately 67GB of VRAM, well beyond the RTX 3090’s 24GB. In INT8, it drops to about 34GB, still too large. In INT4 (GPTQ or AWQ), the model compresses to roughly 18-19GB for weights, leaving 5-6GB for KV cache and overhead.
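The arithmetic behind these figures is easy to sketch: weights-only VRAM is roughly parameter count times bytes per parameter. A minimal Python sketch (the gap between the raw INT4 number and the ~18-19GB cited above comes largely from quantisation metadata such as scales and zero-points, plus framework overhead):

```python
# Back-of-envelope VRAM estimate for model weights at a given precision.
# These are weights-only figures; real usage adds quantisation metadata,
# the CUDA context, and framework overhead.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only VRAM in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

PARAMS_B = 33.7  # CodeLlama 34B parameter count

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    # FP16 lands near 67GB, INT8 near 34GB, INT4 near 17GB before overhead
    print(f"{name}: ~{weight_vram_gb(PARAMS_B, bits):.1f} GB")
```

The same function makes it obvious why only 4-bit fits: even before KV cache, FP16 and INT8 both exceed 24GB on their own.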

For code generation specifically, INT4 quantisation preserves the model’s ability to produce syntactically correct, well-structured code. The quality degradation is more noticeable in natural language tasks than in code output, making CodeLlama 34B in INT4 a practical choice for development workflows.

VRAM Analysis

| Quantisation | Model VRAM | KV Cache (4K ctx) | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| FP16 | ~67GB | ~3.5GB | ~70.5GB | No |
| INT8 | ~34GB | ~3.5GB | ~37.5GB | No |
| INT4 (GPTQ) | ~19GB | ~3.5GB | ~22.5GB | Fits |
| INT4 (AWQ) | ~18GB | ~3.5GB | ~21.5GB | Fits |
| Q4_K_M (GGUF) | ~18.5GB | ~3.5GB | ~22GB | Fits |

With AWQ quantisation at 4K context, total VRAM usage is around 21.5GB, leaving about 2.5GB of breathing room. You can push to 8K context (CodeLlama supports up to 16K) but VRAM will be tight at around 23.5GB. For fill-in-the-middle (FIM) code completion, 4K context is typically sufficient. See our CodeLlama VRAM requirements guide for all configurations.

Performance Benchmarks

| GPU | Model | Quantisation | Tokens/sec | Context |
|---|---|---|---|---|
| RTX 3090 (24GB) | CodeLlama 34B | Q4_K_M | ~14 tok/s | 4096 |
| RTX 3090 (24GB) | CodeLlama 34B | AWQ | ~16 tok/s | 4096 |
| RTX 5090 (32GB) | CodeLlama 34B | Q4_K_M | ~28 tok/s | 8192 |
| RTX 3090 (24GB) | CodeLlama 7B | FP16 | ~45 tok/s | 16384 |

At 14-16 tok/s, CodeLlama 34B generates code at a comfortable reading pace on the RTX 3090. For code completion tasks where you typically generate 10-50 tokens, response times are under 3 seconds. Function generation and longer code blocks take 5-15 seconds depending on length. Check full benchmarks on our tokens per second benchmark page.
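Those response-time claims follow directly from the throughput numbers. A quick sketch, assuming generation is decode-bound and ignoring prompt-processing (prefill) time:

```python
# Rough response-time estimate from measured decode throughput.
# Ignores prefill, so treat results as a lower bound.

def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    return tokens / tok_per_sec

# AWQ on the RTX 3090 at ~16 tok/s:
print(generation_seconds(50, 16))   # a 50-token completion: just over 3 seconds
print(generation_seconds(200, 16))  # a longer code block: around 12-13 seconds
```

This is why 34B at 4-bit remains workable for interactive coding: typical completions are short enough that throughput, not model size, dominates the feel of the tool.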

Setup Guide

Deploy CodeLlama 34B with Ollama for the simplest setup:

# Ollama: CodeLlama 34B in Q4_K_M
ollama run codellama:34b-instruct-q4_K_M

# For code completion (fill-in-middle) mode
ollama run codellama:34b-code-q4_K_M
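Once the model is pulled, Ollama also exposes an HTTP API on its default port 11434, which is handy for editor integrations. A minimal stdlib-only sketch against the `/api/generate` endpoint (the model tag must match the one pulled above):

```python
# Query a local Ollama server over its HTTP API (default port 11434).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "codellama:34b-instruct-q4_K_M") -> dict:
    # stream=False returns the whole completion as a single JSON response
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the Ollama server to be running):
# print(generate("Write a Python function that reverses a linked list."))
```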

For production serving with an OpenAI-compatible API:

# vLLM with AWQ quantisation
pip install vllm
vllm serve TheBloke/CodeLlama-34B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000

Set --gpu-memory-utilization 0.95 to maximise the available VRAM for this tight fit. The AWQ format is recommended for vLLM as it provides the best INT4 inference speed on Ada and Ampere architectures.
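With the server up, any OpenAI-compatible client can talk to it. A stdlib-only sketch of a request to the `/v1/chat/completions` endpoint (the official OpenAI SDK pointed at `http://localhost:8000/v1` works the same way):

```python
# Call vLLM's OpenAI-compatible chat endpoint using only the stdlib.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(user_msg: str) -> dict:
    return {
        "model": "TheBloke/CodeLlama-34B-Instruct-AWQ",
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 256,
        "temperature": 0.2,  # low temperature suits code generation
    }

def chat(user_msg: str) -> str:
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the vLLM server to be running):
# print(chat("Write a SQL query that finds duplicate email addresses."))
```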

If you need CodeLlama 34B with more context or higher precision, the RTX 5090 with 32GB handles it in INT4 with 8K+ context at nearly double the throughput. For the smaller but still capable CodeLlama 13B, the RTX 3090 runs it in INT8 with excellent speed and longer context.

Consider also whether newer code models like DeepSeek Coder or Qwen Coder might serve your needs at smaller sizes. The RTX 3090 runs LLaMA 3 8B in FP16 which also handles code tasks well. For other workloads, check the Mixtral 8x7B analysis or Qwen 72B analysis. Browse all dedicated GPU servers or read the best GPU for LLM inference guide.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
