
CodeLlama 34B Tokens/sec by GPU

Benchmark data for CodeLlama 34B inference speed across GPUs, with INT4 and INT8 quantisation results and a cost analysis for dedicated GPU hosting.

CodeLlama 34B Benchmark Overview

CodeLlama 34B is Meta’s specialised code generation model, fine-tuned from LLaMA 2 for programming tasks including code completion, infilling, and instruction following. At 34 billion parameters, it requires roughly 68 GB of VRAM at FP16, making quantisation mandatory for single-GPU dedicated server deployments. We benchmark inference speed across six GPUs to find the optimal setup.

Tests used vLLM on GigaGPU bare-metal servers with a 512-token code input and 256-token completion output. At INT4, CodeLlama 34B requires approximately 17 GB of VRAM. See our tokens per second benchmark hub for methodology details.
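The 68 GB and 17 GB figures above follow from a simple bytes-per-parameter estimate. A quick sketch of the arithmetic (weights only — KV cache, activations, and framework overhead are not counted, so treat these as lower bounds):

```python
# Back-of-envelope VRAM estimate for model weights at a given precision.
# Weights only: KV cache, activations, and runtime overhead are ignored,
# so real usage is a few GB higher than this lower bound.

def weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate VRAM for weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(f"FP16: {weight_vram_gb(34, 16):.0f} GB")  # ~68 GB
print(f"INT8: {weight_vram_gb(34, 8):.0f} GB")   # ~34 GB
print(f"INT4: {weight_vram_gb(34, 4):.0f} GB")   # ~17 GB
```

The same function gives a first-pass feasibility check for any model size and quantisation level before renting hardware.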

Tokens/sec Results by GPU

| GPU | VRAM | CodeLlama 34B INT4 (tok/s) | Notes |
| --- | --- | --- | --- |
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Tight; needs offloading |
| RTX 3090 | 24 GB | 15 tok/s | INT4 fits with headroom |
| RTX 5080 | 16 GB | N/A | Needs offloading at INT4 |
| RTX 5090 | 32 GB | 30 tok/s | Comfortable fit |

CodeLlama 34B at INT4 (approximately 17 GB) fits well on the RTX 3090 and RTX 5090. The RTX 5090 at 30 tok/s provides responsive code completions suitable for IDE integration.
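To translate these decode speeds into perceived latency for the 256-token completions used in the tests, simple division gives a rough figure (this ignores prefill time for the 512-token prompt, so real end-to-end latency is somewhat higher):

```python
# Rough time to generate a fixed-length completion at a given decode speed.
# Prefill time for the 512-token input prompt is ignored.

def completion_seconds(output_tokens: int, tok_per_sec: float) -> float:
    return output_tokens / tok_per_sec

for gpu, speed in [("RTX 3090", 15), ("RTX 5090", 30)]:
    print(f"{gpu}: {completion_seconds(256, speed):.1f} s per 256-token completion")
```

At 30 tok/s, a full 256-token completion takes about 8.5 s of decode time; typical inline IDE completions are much shorter and land proportionally faster.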

Quantisation Comparison

INT8 requires roughly 34 GB, limiting it to the RTX 5090 among single consumer GPUs. For detailed quantisation trade-offs, see our quantisation speed comparison.

| GPU | INT8 (tok/s) | INT4 (tok/s) |
| --- | --- | --- |
| RTX 3090 (24 GB) | N/A | 15 tok/s |
| RTX 5090 (32 GB) | 22 tok/s | 30 tok/s |
| 2x RTX 3090 (48 GB) | 20 tok/s | 28 tok/s |

For code generation tasks where precision matters, INT8 on the RTX 5090 at 22 tok/s is a strong choice. INT4 remains the speed champion, and quality degradation on code tasks is generally minimal.

Cost Efficiency Analysis

| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
| --- | --- | --- | --- |
| RTX 3090 | 15 | ~£110 | 0.14 |
| RTX 5090 | 30 | ~£250 | 0.12 |
| 2x RTX 3090 | 28 | ~£210 | 0.13 |
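The tok/s-per-pound column is simply throughput divided by monthly cost; reproducing it with the table's own figures:

```python
# Reproduce the cost-efficiency column: INT4 tok/s divided by monthly cost.
configs = {
    "RTX 3090": (15, 110),       # (tok/s, approx. £/month)
    "RTX 5090": (30, 250),
    "2x RTX 3090": (28, 210),
}

for name, (tok_s, monthly_gbp) in configs.items():
    print(f"{name}: {tok_s / monthly_gbp:.2f} tok/s per £/month")
```

The same one-liner makes it easy to re-rank configurations as hosting prices change.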

The single RTX 3090 provides the best cost efficiency. To pick the best GPU for CodeLlama, weigh your latency requirements against your budget.

GPU Recommendations

  • Budget: RTX 3090 — 15 tok/s at INT4 is workable for batch code generation and CI/CD integration.
  • Recommended: RTX 5090 — 30 tok/s delivers responsive completions for real-time IDE plugins.
  • Multi-GPU: 2x RTX 3090 — 28 tok/s with INT8 option for higher code accuracy.
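For the batch and CI/CD scenario, sustained throughput matters more than per-request latency. A rough single-stream ceiling, assuming the decode speeds above hold continuously (vLLM's continuous batching would push aggregate throughput well beyond this):

```python
# Sustained single-stream throughput ceiling over a day.
# Assumes the benchmark decode speed holds continuously; concurrent
# requests with batching would raise aggregate throughput considerably.

def tokens_per_day(tok_per_sec: float, utilisation: float = 1.0) -> int:
    return int(tok_per_sec * 86_400 * utilisation)

print(f"RTX 3090: {tokens_per_day(15):,} tokens/day")
print(f"RTX 5090: {tokens_per_day(30):,} tokens/day")
```

Even the budget RTX 3090 sustains over a million generated tokens per day at 15 tok/s, which is ample for most nightly-batch code generation pipelines.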

For smaller code models, check the Phi-3 Mini benchmark (Phi-3 includes strong coding capabilities). For other large models, see the LLaMA 3 70B benchmark. Browse all results in the Benchmarks category.

Conclusion

CodeLlama 34B is the go-to model for production code generation that demands more capability than 7-13B models can offer. With INT4 quantisation, it runs well on the RTX 3090 and excels on the RTX 5090. For teams needing fast, accurate code completions, a dedicated GPU server with the right hardware is a worthwhile investment.

Deploy CodeLlama 34B on Dedicated Servers

High-VRAM GPU servers built for code generation workloads. Full root access, NVMe, and UK hosting.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
