
LlamaIndex with Self-Hosted Models: RAG Setup

Complete guide to building a RAG pipeline with LlamaIndex and self-hosted models via vLLM, covering document ingestion, vector indexing, query engines, and production deployment on GPU servers.

You will build a Retrieval-Augmented Generation pipeline using LlamaIndex backed by a self-hosted LLM and embedding model on your own GPU server. By the end, you will have a working system that ingests documents, indexes them locally, and answers questions using your private model — no external API calls required.

Architecture Overview

The RAG pipeline has three components: a document ingestion layer that chunks and embeds text, a vector store that indexes embeddings for retrieval, and an LLM that generates answers from retrieved context. With self-hosted models, all three run on your hardware.

| Component    | Self-Hosted Option | Server         |
|--------------|--------------------|----------------|
| LLM          | LLaMA 3.1 via vLLM | Port 8000      |
| Embeddings   | BGE-base via vLLM  | Port 8001      |
| Vector Store | FAISS / ChromaDB   | Local disk     |
| Framework    | LlamaIndex         | Python process |

Installation and Model Configuration

Install LlamaIndex with the OpenAI-compatible integration. LlamaIndex connects to vLLM through its OpenAI adapter — no special connector needed.

pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like

from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import Settings

# Configure the LLM (vLLM chat server on port 8000)
Settings.llm = OpenAILike(
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=512,
    temperature=0.7,
    is_chat_model=True
)

# Configure embeddings (vLLM embedding server on port 8001).
# OpenAILikeEmbedding skips the OpenAI model-name validation that
# plain OpenAIEmbedding applies, which would reject a BGE model name.
Settings.embed_model = OpenAILikeEmbedding(
    api_base="http://localhost:8001/v1",
    api_key="not-needed",
    model_name="BAAI/bge-base-en-v1.5"
)

For the vLLM server setup, follow the production deployment guide. Run two vLLM instances — one for the chat model, one for embeddings — or use a single instance with the chat model and a local embedding library.
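A typical two-instance launch might look like the following. The model names and ports match the table above; the `--task embed` flag assumes a recent vLLM release with embedding support (older versions used `--task embedding`), and `--max-model-len` is an illustrative value:

```shell
# Chat model on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --max-model-len 8192

# Embedding model on port 8001
vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8001
```

Each instance needs its own GPU memory budget, so on a single card you may need to cap `--gpu-memory-utilization` on both processes.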

Document Ingestion

LlamaIndex handles document loading, chunking, and indexing. Load documents from files, directories, or custom sources.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents
documents = SimpleDirectoryReader("./docs").load_data()
print(f"Loaded {len(documents)} documents")

# Build index (embeds all chunks automatically)
index = VectorStoreIndex.from_documents(documents)

# Persist to disk for reuse
index.storage_context.persist(persist_dir="./storage")

# Load from disk on subsequent runs
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

The index creation step sends every document chunk through your self-hosted embedding model. For large document sets, this is where GPU acceleration pays off: embedding thousands of chunks takes seconds instead of minutes.

Query Engine

The query engine retrieves relevant chunks and passes them to the LLM with the user’s question.

# Basic query engine (non-streaming)
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("What are the VRAM requirements for LLaMA 3.1?")
print(response)

# Access source nodes
for node in response.source_nodes:
    print(f"Source: {node.metadata.get('file_name', 'unknown')}")
    print(f"Score: {node.score:.4f}")

# Streaming query engine (returns a generator instead of a full response)
streaming_engine = index.as_query_engine(similarity_top_k=3, streaming=True)
streaming_response = streaming_engine.query("Explain the deployment architecture.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
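The `similarity_top_k` retrieval step is nearest-neighbour search over embedding vectors. A minimal cosine-similarity sketch makes the scoring concrete (toy 2-D vectors stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3):
    """Rank chunk vectors by cosine similarity to the query vector."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.0], chunks, k=2))  # chunks 0 and 1 rank highest
```

FAISS and ChromaDB do the same ranking with approximate-nearest-neighbour indexes so it stays fast at millions of chunks; the `node.score` values printed above are exactly these similarity scores.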

For advanced retrieval strategies like hybrid search or reranking, see the vector store comparison guide. For an alternative framework approach, check the LangChain with vLLM guide.

Advanced RAG Patterns

LlamaIndex supports sophisticated RAG techniques that improve answer quality beyond basic retrieval.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create specialised indices for different document types
# (the ./docs/api and ./docs/tutorials paths are illustrative)
api_docs = SimpleDirectoryReader("./docs/api").load_data()
tutorial_docs = SimpleDirectoryReader("./docs/tutorials").load_data()
api_index = VectorStoreIndex.from_documents(api_docs)
tutorial_index = VectorStoreIndex.from_documents(tutorial_docs)

# Define tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=api_index.as_query_engine(),
        metadata=ToolMetadata(name="api_docs", description="API reference documentation")
    ),
    QueryEngineTool(
        query_engine=tutorial_index.as_query_engine(),
        metadata=ToolMetadata(name="tutorials", description="Tutorial and how-to guides")
    ),
]

# Sub-question engine decomposes complex queries
engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)

response = engine.query("How do I set up the API and what are the best practices?")
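Conceptually, the sub-question engine runs a decompose, route, answer, synthesize loop. The control flow can be sketched with stubbed steps; in the real engine both the decomposition and the final synthesis are LLM-driven, not hard-coded as they are here:

```python
def sub_question_query(question: str, tools: dict) -> str:
    """Decompose -> route -> answer -> synthesize, with stubbed steps."""
    # 1. Decompose the question (LLM-driven in LlamaIndex; fixed here)
    sub_questions = [
        ("api_docs", "How do I set up the API?"),
        ("tutorials", "What are the best practices?"),
    ]
    # 2. Answer each sub-question with the matching tool's query engine
    answers = [tools[name](q) for name, q in sub_questions]
    # 3. Synthesize a final answer (also LLM-driven in practice)
    return " ".join(answers)

tools = {
    "api_docs": lambda q: "Install the client and set the base URL.",
    "tutorials": lambda q: "Pin versions and monitor latency.",
}
print(sub_question_query("How do I set up the API and what are the best practices?", tools))
```

The tool descriptions you pass in `ToolMetadata` are what the LLM reads when routing each sub-question, so keep them specific.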

Production Deployment

For production, wrap the query engine in a FastAPI server and add caching, authentication, and logging.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# query_engine must have been created with streaming=True
@app.post("/query")
async def query(question: str):
    def stream():
        # Plain (sync) generator: StreamingResponse iterates it in a
        # threadpool, so the blocking query doesn't stall the event loop
        response = query_engine.query(question)
        for text in response.response_gen:
            yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(stream(), media_type="text/event-stream")
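Assuming the app lives in `app.py` and runs on port 8080 (illustrative, chosen to avoid the vLLM ports), the endpoint can be exercised with curl. Because `question` is a plain function parameter, FastAPI reads it from the query string:

```shell
# Start the API server on a port that doesn't clash with vLLM
uvicorn app:app --host 0.0.0.0 --port 8080

# Stream an answer; -N disables curl's output buffering for SSE
curl -N -X POST "http://localhost:8080/query?question=What%20are%20the%20VRAM%20requirements"
```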

Monitor the pipeline with Prometheus and Grafana to track query latency, retrieval quality, and GPU utilisation. The LlamaIndex hosting page covers infrastructure requirements, and the self-hosting guide has base deployment patterns. Browse more examples in our tutorials section.

Build RAG Pipelines on Dedicated GPUs

Run LlamaIndex with self-hosted models on bare-metal GPU servers. Private data, local inference, zero API fees.

Browse GPU Servers
