You will build a RAG system that ingests 10,000 company documents, stores embeddings in ChromaDB, and answers employee questions with source citations — running entirely on your own GPU server. No data leaves your infrastructure. The end result: a query like “What is our parental leave policy?” returns the exact policy text with a page reference, not a hallucinated guess. Here is the complete pipeline from document ingestion to cited answers on dedicated GPU infrastructure.
## Pipeline Architecture
| Component | Tool | Role | Resource |
|---|---|---|---|
| Vector store | ChromaDB | Store and retrieve document embeddings | CPU + 8GB RAM |
| Embedding model | BGE-large-en-v1.5 | Convert text to 1024-dim vectors | ~2GB VRAM |
| LLM | LLaMA 3.1 8B (Q4) | Generate answers from retrieved context | ~6GB VRAM |
| Orchestrator | LangChain | Chain retrieval and generation | CPU |
| Document loader | LangChain loaders | Parse PDF, DOCX, HTML | CPU |
Total VRAM requirement: approximately 8GB. A single GPU with 24GB VRAM runs this pipeline comfortably with room for larger models later.
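The RAM budget for ChromaDB checks out with a quick estimate. The per-document page and chunk counts below are illustrative assumptions, not measurements:

```python
# Rough storage estimate for the embedding index (assumed corpus shape).
docs = 10_000
pages_per_doc = 10        # assumption: average pages per company document
chunks_per_page = 3       # assumption: ~3 chunks per page at 512-char chunks
dim = 1024                # BGE-large-en-v1.5 embedding dimension
bytes_per_float = 4       # float32

chunks = docs * pages_per_doc * chunks_per_page
embedding_bytes = chunks * dim * bytes_per_float
print(f"{chunks:,} chunks ≈ {embedding_bytes / 1024**3:.1f} GiB of raw embeddings")
```

Around 300,000 chunks is roughly 1.1 GiB of raw float32 vectors; ChromaDB's index structures and chunk metadata add overhead on top, so the 8GB RAM allocation leaves comfortable headroom.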
## Environment Setup
```shell
# Install dependencies
pip install langchain langchain-community chromadb sentence-transformers vllm
pip install pypdf python-docx unstructured
```
```shell
# Start vLLM serving the LLM.
# Note: --quantization gptq requires a GPTQ-quantized checkpoint; the base
# Instruct weights are unquantized. Point --model at a Q4 GPTQ build of
# Llama 3.1 8B, or drop the flag to serve fp16 (~16GB VRAM instead of ~6GB).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization gptq \
  --max-model-len 8192 \
  --port 8000
```
The vLLM server provides an OpenAI-compatible API. LangChain connects to it as if it were the OpenAI API, making the code portable.
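Before wiring up LangChain, it is worth smoke-testing the server directly. A stdlib-only client sketch; `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM:

```python
import json
import urllib.request

def build_chat_request(question: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Build an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.1,
    }

def ask(question: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the OpenAI-compatible endpoint and return the text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If this returns a completion, the server is healthy and the LangChain wiring below should connect without issue.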
## Document Ingestion Pipeline
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load documents
loader = DirectoryLoader("/data/company_docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Create embeddings and store in ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = Chroma.from_documents(
    chunks, embeddings, persist_directory="/data/chromadb"
)

print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
```
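The splitter's behaviour can be sketched in plain Python. This fixed-window version ignores the real splitter's preference for paragraph and sentence boundaries; it only illustrates how `chunk_size` and `chunk_overlap` interact:

```python
def sliding_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Simplified fixed-window chunking: each chunk repeats the last
    `overlap` characters of its predecessor so context isn't cut mid-thought.
    RecursiveCharacterTextSplitter prefers natural boundaries before
    falling back to behaviour like this."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which keeps retrieval from returning truncated facts.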
The ChromaDB vector store persists to disk at the `persist_directory` path. Subsequent runs can load the existing index rather than re-embedding everything: instantiate `Chroma(persist_directory="/data/chromadb", embedding_function=embeddings)` instead of calling `from_documents` again.
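At query time, the question is embedded with the same BGE model and chunks are ranked by vector similarity. A toy, stdlib-only illustration of that ranking step (a real store uses an approximate nearest-neighbour index, not a linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

This is exactly what `k=4` controls in the retriever configured below: how many of the highest-scoring chunks get handed to the LLM as context.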
## Retrieval and Generation Chain
```python
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Point LangChain's OpenAI client at the local vLLM server
llm = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.1,
)

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Answer the question based only on the provided context.
If the context does not contain the answer, say "I don't have information about that."
Cite the source document for each fact.

Context: {context}

Question: {question}

Answer:""",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True,
)
```
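Under the hood, the `stuff` chain concatenates the retrieved chunks into the prompt's `{context}` slot. A simplified sketch; the source-tagging format here is an assumption to show how citation becomes possible, not LangChain's exact default behaviour:

```python
def stuff_prompt(template: str, docs: list[dict], question: str) -> str:
    """Mimic the 'stuff' chain type: join every retrieved chunk into one
    context string, tagging each with its source so the model can cite it."""
    context = "\n\n".join(
        f"[{d['source']}, p.{d['page']}]\n{d['text']}" for d in docs
    )
    return template.format(context=context, question=question)
```

Because all retrieved chunks must fit in one prompt, `k=4` chunks of ~512 characters sits comfortably inside the 8192-token context window configured for vLLM above.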
## API Endpoint
Wrap the chain in a FastAPI endpoint for team access:
```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/ask")
async def ask_question(query: dict):
    result = qa_chain.invoke({"query": query["question"]})
    sources = [{"page": doc.metadata.get("page", "N/A"),
                "source": doc.metadata.get("source", "Unknown")}
               for doc in result["source_documents"]]
    return {"answer": result["result"], "sources": sources}
```
Deploy behind authentication as covered in the private hosting guide. Teams can also build a frontend using chatbot frameworks for a conversational interface.
## Performance Optimisation
For production workloads:

- Retrieve more candidates (raise the retriever's `k`) and add a reranker to improve retrieval quality.
- Tune chunk size: 512 characters, as configured above, works well for policy documents, but technical documentation may need 1024.
- Cache frequent queries to avoid repeated GPU inference.
- Monitor retrieval relevance scores to detect knowledge-base gaps.

Explore LlamaIndex as an alternative orchestrator or Qdrant for higher-performance vector search. Scale to larger models on more powerful GPU servers as query volume grows. See RAG hosting for infrastructure recommendations and more tutorials for related pipelines.
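Query caching can start as simply as an in-memory dict keyed on the normalised question. A sketch, not production-ready: there is no eviction, TTL, or size bound here:

```python
import hashlib

class QueryCache:
    """Tiny in-memory answer cache keyed on normalised question text.
    Production use would add a TTL and an LRU bound (or lean on
    functools.lru_cache / an external store like Redis)."""

    def __init__(self):
        self._store: dict[str, dict] = {}

    def _key(self, question: str) -> str:
        # Normalise whitespace and case so trivially-different phrasings hit.
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str):
        return self._store.get(self._key(question))

    def put(self, question: str, answer: dict) -> None:
        self._store[self._key(question)] = answer
```

A cache check before `qa_chain.invoke` in the `/ask` handler turns repeated questions into dictionary lookups instead of GPU inference.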