LlamaIndex is another popular RAG framework, with a tighter focus on indexing and retrieval than LangChain. Point it at your self-hosted vLLM on the RTX 5060 Ti 16GB via our hosting:
Install
```bash
uv pip install llama-index llama-index-llms-openai-like llama-index-embeddings-huggingface
```
Ingest and Index
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Route LLM calls to the local vLLM OpenAI-compatible endpoint
Settings.llm = OpenAILike(
    model="meta-llama/Llama-3.1-8B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="none",
    is_chat_model=True,
)

# Embeddings are computed locally on the GPU
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

docs = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist("./storage")
```
Query
```python
from llama_index.core import load_index_from_storage, StorageContext

# Reload the persisted index rather than re-embedding the documents
storage = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage)

engine = index.as_query_engine(similarity_top_k=4)
print(engine.query("Summarise our Q3 strategy."))
```
Advanced
- Reranking: add a BGE reranker as a node post-processor for a meaningful retrieval-quality uplift
- Hybrid search: combine BM25 with vector retrieval; usually better than pure vector
- Sub-query decomposition: LlamaIndex can split complex questions into sub-questions and answer each separately
- Agent pattern: tool-using agents via `FunctionCallingAgentWorker`
- Streaming: build the engine with `index.as_query_engine(streaming=True)` and iterate the response's `response_gen` for token-by-token output
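The reranking bullet above can be sketched as follows. This is a minimal sketch, assuming the `index` from the previous section, a `sentence-transformers` install, and `BAAI/bge-reranker-base` as one common reranker choice (not a requirement):

```python
from llama_index.core.postprocessor import SentenceTransformerRerank

# Cross-encoder reranker: scores each retrieved node against the query
# and keeps only the top_n best before they reach the LLM.
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=3,
)

engine = index.as_query_engine(
    similarity_top_k=10,             # over-retrieve candidates...
    node_postprocessors=[reranker],  # ...then rerank down to 3
)
print(engine.query("Summarise our Q3 strategy."))
```

The pattern is deliberately two-stage: cheap vector search casts a wide net, and the slower but more accurate cross-encoder trims it, so you pay reranking cost on only ten candidates per query.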
LlamaIndex and LangChain cover similar ground; LlamaIndex often wins on RAG clarity, LangChain on agent tooling. Pick whichever your team finds friendlier.
LlamaIndex + Self-Hosted LLM
Fast RAG framework on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: LangChain, RAG stack, embedding server, reranker server.