You will build a Retrieval-Augmented Generation pipeline using LlamaIndex backed by a self-hosted LLM and embedding model on your own GPU server. By the end, you will have a working system that ingests documents, indexes them locally, and answers questions using your private model — no external API calls required.
Architecture Overview
The RAG pipeline has three components: a document ingestion layer that chunks and embeds text, a vector store that indexes embeddings for retrieval, and an LLM that generates answers from retrieved context. With self-hosted models, all three run on your hardware.
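Before wiring in real models, the three stages can be sketched with toy stand-ins — a set-of-words "embedding" replaces the learned vector model, and a plain list replaces the vector store. This is purely illustrative of the chunk → embed → retrieve flow, not how LlamaIndex implements it:

```python
# Toy sketch of the three RAG stages. A set of lowercase words stands in for
# a real embedding vector; a list of (chunk, embedding) pairs for the store.
def chunk(text):
    """Ingestion: split raw text into sentence-level chunks."""
    return [s.strip() for s in text.split(". ") if s.strip()]

def embed(text):
    """'Embedding': the set of lowercase words (a crude similarity proxy)."""
    return set(text.lower().split())

def retrieve(query, index, top_k=1):
    """Vector store: rank chunks by word overlap with the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: -len(q & pair[1]))
    return [text for text, _ in ranked[:top_k]]

corpus = "vLLM serves the chat model on port 8000. The embedding model runs on port 8001."
index = [(c, embed(c)) for c in chunk(corpus)]
context = retrieve("embedding model port", index)
# The retrieved context plus the question would then be sent to the LLM.
```

In the real pipeline each of these toy functions is replaced by a GPU-backed component, but the data flow is identical.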
| Component | Self-Hosted Option | Server |
|---|---|---|
| LLM | LLaMA 3.1 via vLLM | Port 8000 |
| Embeddings | BGE-base via vLLM | Port 8001 |
| Vector Store | FAISS / ChromaDB | Local disk |
| Framework | LlamaIndex | Python process |
Installation and Model Configuration
Install LlamaIndex with the OpenAI-compatible integration. LlamaIndex connects to vLLM through its OpenAI adapter — no special connector needed.
```bash
pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai
```
```python
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# Configure LLM
Settings.llm = OpenAILike(
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=512,
    temperature=0.7,
    is_chat_model=True,
)

# Configure embeddings
Settings.embed_model = OpenAIEmbedding(
    api_base="http://localhost:8001/v1",
    api_key="not-needed",
    model_name="BAAI/bge-base-en-v1.5",
)
```
For the vLLM server setup, follow the production deployment guide. Run two vLLM instances — one for the chat model, one for embeddings — or use a single instance with the chat model and a local embedding library.
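As a sketch, the two servers might be launched as below. Flag names vary across vLLM releases — recent versions need `--task embed` for embedding models — so check `vllm serve --help` for your installed version. The GPU assignment via `CUDA_VISIBLE_DEVICES` assumes a two-GPU machine:

```shell
# Chat model on port 8000 (assumes the model fits on GPU 0)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Embedding model on port 8001, pinned to a second GPU
CUDA_VISIBLE_DEVICES=1 vllm serve BAAI/bge-base-en-v1.5 --port 8001 --task embed
```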
Document Ingestion
LlamaIndex handles document loading, chunking, and indexing. Load documents from files, directories, or custom sources.
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents
documents = SimpleDirectoryReader("./docs").load_data()
print(f"Loaded {len(documents)} documents")

# Build index (embeds all chunks automatically)
index = VectorStoreIndex.from_documents(documents)

# Persist to disk for reuse
index.storage_context.persist(persist_dir="./storage")

# Load from disk on subsequent runs
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```
The index creation step sends every document chunk through your self-hosted embedding model. For large document sets, this is where GPU acceleration pays off — embedding thousands of chunks takes seconds instead of minutes.
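To gauge that embedding workload up front, a rough chunk-count estimate helps. The 1024-token chunk size and 20-token overlap below mirror LlamaIndex's common defaults, but verify against `Settings.chunk_size` in your installed version:

```python
import math

def estimate_chunks(total_tokens: int, chunk_size: int = 1024,
                    chunk_overlap: int = 20) -> int:
    """Rough count of chunks (and hence embedding calls) for a corpus.

    Each chunk after the first advances by (chunk_size - chunk_overlap) tokens.
    """
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return math.ceil((total_tokens - chunk_overlap) / stride)

# e.g. a 5-million-token corpus:
print(estimate_chunks(5_000_000))
```

Multiplying the chunk count by your embedding server's measured throughput gives a fair estimate of total indexing time.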
Query Engine
The query engine retrieves relevant chunks and passes them to the LLM with the user’s question.
```python
# Basic (non-streaming) query engine
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("What are the VRAM requirements for LLaMA 3.1?")
print(response)

# Access source nodes
for node in response.source_nodes:
    print(f"Source: {node.metadata.get('file_name', 'unknown')}")
    print(f"Score: {node.score:.4f}")

# Streaming requires its own engine with streaming=True
streaming_engine = index.as_query_engine(similarity_top_k=3, streaming=True)
streaming_response = streaming_engine.query("Explain the deployment architecture.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
```
For advanced retrieval strategies like hybrid search or reranking, see the vector store comparison guide. For an alternative framework approach, check the LangChain with vLLM guide.
Advanced RAG Patterns
LlamaIndex supports sophisticated RAG techniques that improve answer quality beyond basic retrieval.
```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create specialised indices for different document types
api_index = VectorStoreIndex.from_documents(api_docs)
tutorial_index = VectorStoreIndex.from_documents(tutorial_docs)

# Define tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=api_index.as_query_engine(),
        metadata=ToolMetadata(name="api_docs", description="API reference documentation"),
    ),
    QueryEngineTool(
        query_engine=tutorial_index.as_query_engine(),
        metadata=ToolMetadata(name="tutorials", description="Tutorial and how-to guides"),
    ),
]

# Sub-question engine decomposes complex queries
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
response = engine.query("How do I set up the API and what are the best practices?")
```
Production Deployment
For production, wrap the query engine in a FastAPI server and add caching, authentication, and logging.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# Assumes query_engine was created with streaming=True so response_gen exists.
@app.post("/query")
async def query(question: str):
    # A plain (sync) generator: StreamingResponse iterates it in a threadpool,
    # so the blocking LlamaIndex call does not stall the event loop.
    def stream():
        response = query_engine.query(question)
        for text in response.response_gen:
            yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(stream(), media_type="text/event-stream")
```
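Assuming the app lives in a module named `app.py` and uvicorn serves it on port 8080 (both hypothetical names here), the endpoint can be exercised with curl. Because the handler declares a bare `str` parameter, FastAPI reads `question` from the query string; `-N` disables curl's buffering so SSE chunks appear as they arrive:

```shell
uvicorn app:app --port 8080 &
curl -N -X POST "http://localhost:8080/query?question=What+is+the+deployment+architecture"
```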
Monitor the pipeline with Prometheus and Grafana to track query latency, retrieval quality, and GPU utilisation. The LlamaIndex hosting page covers infrastructure requirements, and the self-hosting guide has base deployment patterns. Browse more examples in our tutorials section.
Build RAG Pipelines on Dedicated GPUs
Run LlamaIndex with self-hosted models on bare-metal GPU servers. Private data, local inference, zero API fees.
Browse GPU Servers