You will build a RAG system that ingests 10,000 company documents, stores embeddings in ChromaDB, and answers employee questions with source citations — running entirely on your own GPU server. No data leaves your infrastructure. The end result: a query like “What is our parental leave policy?” returns the exact policy text with a page reference, not a hallucinated guess. Here is the complete pipeline from document ingestion to cited answers on dedicated GPU infrastructure.
## Pipeline Architecture
| Component | Tool | Role | Resource |
|---|---|---|---|
| Vector store | ChromaDB | Store and retrieve document embeddings | CPU + 8GB RAM |
| Embedding model | BGE-large-en-v1.5 | Convert text to 1024-dim vectors | ~2GB VRAM |
| LLM | LLaMA 3.1 8B (Q4) | Generate answers from retrieved context | ~6GB VRAM |
| Orchestrator | LangChain | Chain retrieval and generation | CPU |
| Document loader | LangChain loaders | Parse PDF, DOCX, HTML | CPU |
Total VRAM requirement: approximately 8GB. A single GPU with 24GB VRAM runs this pipeline comfortably with room for larger models later.
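The RAM budget for ChromaDB checks out with a quick estimate. The per-document page and chunk counts below are illustrative assumptions, not measurements:

```python
# Rough storage estimate for the embedding index (assumed corpus shape).
docs = 10_000
pages_per_doc = 10        # assumption: average pages per company document
chunks_per_page = 3       # assumption: ~3 chunks per page at 512-char chunks
dim = 1024                # BGE-large-en-v1.5 embedding dimension
bytes_per_float = 4       # float32

chunks = docs * pages_per_doc * chunks_per_page
embedding_bytes = chunks * dim * bytes_per_float
print(f"{chunks:,} chunks ≈ {embedding_bytes / 1024**3:.1f} GiB of raw embeddings")
```

Around 300,000 chunks is roughly 1.1 GiB of raw float32 vectors; ChromaDB's index structures and chunk metadata add overhead on top, so the 8GB RAM allocation leaves comfortable headroom.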
## Environment Setup
```shell
# Install dependencies
pip install langchain langchain-community chromadb sentence-transformers vllm
pip install pypdf python-docx unstructured
```
```shell
# Start vLLM serving the LLM.
# Note: --quantization gptq requires a GPTQ-quantized checkpoint; the base
# Instruct weights are unquantized. Point --model at a Q4 GPTQ build of
# Llama 3.1 8B, or drop the flag to serve fp16 (~16GB VRAM instead of ~6GB).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization gptq \
  --max-model-len 8192 \
  --port 8000
```
The vLLM server provides an OpenAI-compatible API. LangChain connects to it as if it were the OpenAI API, making the code portable.
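Before wiring up LangChain, it is worth smoke-testing the server directly. A stdlib-only client sketch; `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM:

```python
import json
import urllib.request

def build_chat_request(question: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Build an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.1,
    }

def ask(question: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the OpenAI-compatible endpoint and return the text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If this returns a completion, the server is healthy and the LangChain wiring below should connect without issue.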
## Document Ingestion Pipeline
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load documents
loader = DirectoryLoader("/data/company_docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Create embeddings and store in ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = Chroma.from_documents(
    chunks, embeddings, persist_directory="/data/chromadb"
)

print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
```
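The splitter's behaviour can be sketched in plain Python. This fixed-window version ignores the real splitter's preference for paragraph and sentence boundaries; it only illustrates how `chunk_size` and `chunk_overlap` interact:

```python
def sliding_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Simplified fixed-window chunking: each chunk repeats the last
    `overlap` characters of its predecessor so context isn't cut mid-thought.
    RecursiveCharacterTextSplitter prefers natural boundaries before
    falling back to behaviour like this."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which keeps retrieval from returning truncated facts.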
The ChromaDB vector store persists to disk at the `persist_directory` path. Subsequent runs can load the existing index rather than re-embedding everything: instantiate `Chroma(persist_directory="/data/chromadb", embedding_function=embeddings)` instead of calling `from_documents` again.
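At query time, the question is embedded with the same BGE model and chunks are ranked by vector similarity. A toy, stdlib-only illustration of that ranking step (a real store uses an approximate nearest-neighbour index, not a linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

This is exactly what `k=4` controls in the retriever configured below: how many of the highest-scoring chunks get handed to the LLM as context.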
## Retrieval and Generation Chain
```python
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Point LangChain's OpenAI client at the local vLLM server
llm = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.1,
)

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Answer the question based only on the provided context.
If the context does not contain the answer, say "I don't have information about that."
Cite the source document for each fact.

Context: {context}

Question: {question}

Answer:""",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True,
)
```
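Under the hood, the `stuff` chain concatenates the retrieved chunks into the prompt's `{context}` slot. A simplified sketch; the source-tagging format here is an assumption to show how citation becomes possible, not LangChain's exact default behaviour:

```python
def stuff_prompt(template: str, docs: list[dict], question: str) -> str:
    """Mimic the 'stuff' chain type: join every retrieved chunk into one
    context string, tagging each with its source so the model can cite it."""
    context = "\n\n".join(
        f"[{d['source']}, p.{d['page']}]\n{d['text']}" for d in docs
    )
    return template.format(context=context, question=question)
```

Because all retrieved chunks must fit in one prompt, `k=4` chunks of ~512 characters sits comfortably inside the 8192-token context window configured for vLLM above.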
## API Endpoint
Wrap the chain in a FastAPI endpoint for team access:
```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/ask")
async def ask_question(query: dict):
    result = qa_chain.invoke({"query": query["question"]})
    sources = [{"page": doc.metadata.get("page", "N/A"),
                "source": doc.metadata.get("source", "Unknown")}
               for doc in result["source_documents"]]
    return {"answer": result["result"], "sources": sources}
```
Deploy behind authentication as covered in the private hosting guide. Teams can also build a frontend using chatbot frameworks for a conversational interface.
## Performance Optimisation
For production workloads:

- Retrieve more candidates (raise the retriever's `k`) and add a reranker to improve retrieval quality.
- Tune chunk size: 512 characters, as configured above, works well for policy documents, but technical documentation may need 1024.
- Cache frequent queries to avoid repeated GPU inference.
- Monitor retrieval relevance scores to detect knowledge-base gaps.

Explore LlamaIndex as an alternative orchestrator or Qdrant for higher-performance vector search. Scale to larger models on more powerful GPU servers as query volume grows. See RAG hosting for infrastructure recommendations and more tutorials for related pipelines.
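Query caching can start as simply as an in-memory dict keyed on the normalised question. A sketch, not production-ready: there is no eviction, TTL, or size bound here:

```python
import hashlib

class QueryCache:
    """Tiny in-memory answer cache keyed on normalised question text.
    Production use would add a TTL and an LRU bound (or lean on
    functools.lru_cache / an external store like Redis)."""

    def __init__(self):
        self._store: dict[str, dict] = {}

    def _key(self, question: str) -> str:
        # Normalise whitespace and case so trivially-different phrasings hit.
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str):
        return self._store.get(self._key(question))

    def put(self, question: str, answer: dict) -> None:
        self._store[self._key(question)] = answer
```

A cache check before `qa_chain.invoke` in the `/ask` handler turns repeated questions into dictionary lookups instead of GPU inference.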