
ChromaDB + LLM VRAM Requirements for RAG

VRAM breakdown for running ChromaDB-based RAG pipelines with various LLMs. Covers embedding model overhead, LLM VRAM, total pipeline requirements, and GPU recommendations.

RAG Pipeline VRAM Components

A retrieval-augmented generation (RAG) pipeline using ChromaDB has three VRAM-consuming components: the embedding model that encodes documents and queries, ChromaDB’s index (typically stored in system RAM, not VRAM), and the LLM that generates answers from retrieved context. When self-hosting on a dedicated GPU server, the embedding model and LLM both compete for GPU memory.

ChromaDB itself is a vector database that runs primarily in system RAM and on disk, consuming minimal VRAM. The VRAM pressure comes from the models that feed into and consume from ChromaDB.

Embedding Model VRAM

| Embedding Model | Parameters | FP16 VRAM | INT8 VRAM |
|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | ~0.1 GB | ~0.05 GB |
| BGE-base-en-v1.5 | 110M | ~0.3 GB | ~0.15 GB |
| BGE-large-en-v1.5 | 335M | ~0.7 GB | ~0.4 GB |
| E5-large-v2 | 335M | ~0.7 GB | ~0.4 GB |
| Nomic-embed-text-v1.5 | 137M | ~0.3 GB | ~0.2 GB |

Embedding models are lightweight, typically using under 1 GB of VRAM. A widely used default, all-MiniLM-L6-v2, needs only ~0.1 GB at FP16. In practice, the LLM dominates the VRAM budget of a RAG pipeline.
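The table's figures can be approximated from parameter counts alone: weight memory is parameters times bytes per parameter (FP16 = 2 bytes, INT8 = 1 byte), plus a small runtime overhead. A rough sketch, where the helper name and the params-in-millions interface are our own:

```python
def embedding_vram_gb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Weight-only VRAM estimate: parameters x bytes per parameter.
    FP16 = 2 bytes, INT8 = 1 byte. Ignores activation/workspace overhead,
    which is why small models land slightly above this figure in practice."""
    return params_millions * 1e6 * bytes_per_param / 1e9

# BGE-large-en-v1.5: 335M params at FP16 -> ~0.67 GB of weights,
# matching the table's ~0.7 GB once overhead is included.
print(round(embedding_vram_gb(335), 2))
# all-MiniLM-L6-v2: 22M params -> ~0.04 GB of weights; the table's
# ~0.1 GB includes runtime overhead.
print(round(embedding_vram_gb(22), 3))
```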

Total Pipeline VRAM by LLM

| LLM | LLM VRAM (4K ctx) | Embedding VRAM | Total Pipeline |
|---|---|---|---|
| Phi-3 Mini (3.8B, FP16) | ~8 GB | ~0.3 GB | ~8.3 GB |
| LLaMA 3 8B (AWQ 4-bit) | ~7 GB | ~0.3 GB | ~7.3 GB |
| LLaMA 3 8B (FP16) | ~18 GB | ~0.3 GB | ~18.3 GB |
| Mistral 7B (AWQ 4-bit) | ~6 GB | ~0.3 GB | ~6.3 GB |
| Qwen 2.5 14B (AWQ 4-bit) | ~11 GB | ~0.3 GB | ~11.3 GB |
| DeepSeek 16B (INT8) | ~18 GB | ~0.3 GB | ~18.3 GB |

A typical RAG pipeline with a quantised 7-8B LLM and a standard embedding model uses 6-8 GB total, fitting comfortably on an 8 GB GPU. FP16 deployments of larger models require 16-24 GB cards.
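A quick way to sanity-check a planned pipeline against a card's capacity is to sum the per-model figures from the table. The helper below is a hypothetical sketch; the 0.3 GB embedding default follows the BGE-base figure used above.

```python
def pipeline_fits(llm_gb: float, gpu_gb: float,
                  embed_gb: float = 0.3, reranker_gb: float = 0.0):
    """Sum per-model VRAM and check it against the card's capacity.
    The 0.3 GB embedding default matches a BGE-base-class model;
    pass reranker_gb only when the card has spare capacity."""
    total = llm_gb + embed_gb + reranker_gb
    return round(total, 1), total <= gpu_gb

# Mistral 7B AWQ 4-bit (~6 GB) plus a BGE-base embedder on an 8 GB card:
print(pipeline_fits(6.0, 8.0))
# LLaMA 3 8B FP16 (~18 GB) does not fit on a 16 GB card:
print(pipeline_fits(18.0, 16.0))
```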

GPU Recommendations

| GPU | VRAM | RAG Pipeline Capability |
|---|---|---|
| RTX 4060 | 8 GB | 7B LLM (INT4) + embeddings |
| RTX 4060 Ti | 16 GB | 7-8B LLM (FP16) + embeddings |
| RTX 3090 | 24 GB | 14B LLM (INT4) or 8B (FP16) + embeddings + reranker |

For most RAG deployments, the RTX 4060 at 8 GB provides an excellent cost-to-performance ratio with quantised 7B models. Add a reranker model (another 0.3-0.7 GB) on 16 GB+ cards for improved retrieval quality.

Context Length Impact on VRAM

RAG pipelines inject retrieved documents into the LLM context, so effective context length is critical. Longer context means more retrieved passages but also more VRAM for the KV cache. A typical RAG query might use 2-4K tokens of retrieved context plus the user query.

| Effective Context | Additional KV Cache (LLaMA 3 8B FP16) | Total Pipeline (AWQ 4-bit) |
|---|---|---|
| 2K tokens | ~1 GB | ~7.6 GB |
| 4K tokens | ~2 GB | ~8.6 GB |
| 8K tokens | ~4 GB | ~10.6 GB |
| 16K tokens | ~8 GB | ~14.6 GB |

Keep retrieved context under 4K tokens on 8 GB GPUs to leave headroom for generation. On 16 GB+ cards, you can retrieve up to 8K tokens of context. See our LLaMA 3 VRAM requirements for detailed KV cache analysis.
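The KV-cache column can be reproduced from the standard formula: 2 tensors (K and V) x layers x attention heads x head dimension x bytes per element x tokens. The defaults below (32 layers, 32 heads, head dim 128, FP16) are the full-head estimate that matches the table; note that LLaMA 3 8B's grouped-query attention uses only 8 KV heads, which would cut the cache roughly 4x, so treat these figures as conservative upper bounds.

```python
def kv_cache_gb(n_tokens: int, n_layers: int = 32, n_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-token KV cache: 2 (K and V) x layers x heads x head_dim x bytes.
    Defaults are the full-head FP16 estimate for a LLaMA-3-8B-shaped model;
    with grouped-query attention, set n_heads to the KV-head count instead."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * n_tokens / 2**30

print(kv_cache_gb(4096))   # ~2 GiB, the 4K-token row
print(kv_cache_gb(16384))  # ~8 GiB, the 16K-token row
```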

Deployment Recommendations

For a production RAG stack, use ChromaDB for vector storage (runs in system RAM), a BGE or Nomic embedding model on GPU, and a quantised 7B LLM for generation. This entire stack runs on a single RTX 4060. For higher quality, use a 14B model on an RTX 3090 with a reranker.
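Before generation, retrieved passages must be trimmed to the context budget discussed above, or the KV cache will overrun an 8 GB card. A hypothetical sketch of that assembly step; the function name, prompt template, and the roughly-4-characters-per-token heuristic are all our own choices, not part of any library API:

```python
def build_rag_prompt(question: str, passages: list[str],
                     max_context_tokens: int = 4096) -> str:
    """Greedily pack retrieved passages until the token budget is spent.
    Uses a crude ~4-characters-per-token estimate for English text;
    swap in a real tokenizer for production use."""
    budget_chars = max_context_tokens * 4
    context, used = [], 0
    for p in passages:
        if used + len(p) > budget_chars:
            break  # stop before overflowing the KV-cache budget
        context.append(p)
        used += len(p)
    return ("Context:\n" + "\n---\n".join(context)
            + f"\n\nQuestion: {question}\nAnswer:")
```

The same cap works whether the passages come from ChromaDB's `query` results or any other retriever; only the budget changes between 8 GB and 16 GB cards.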

Read the self-host LLM guide for full deployment instructions. Use the LLM cost calculator to estimate per-query costs. Browse all deployment guides in the model guides section.

Deploy RAG Pipelines on Dedicated GPUs

Run ChromaDB + LLM retrieval-augmented generation on dedicated GPU servers. Full root access for custom pipeline configurations.

Browse GPU Servers
