You will connect LangChain to a self-hosted vLLM instance so that chains, agents, and RAG pipelines run entirely on your own GPU server. By the end of this guide, you will have working examples for chat models, retrieval chains, and tool-calling agents — all with zero external API dependencies.
Configuration
LangChain connects to vLLM through the ChatOpenAI class, leveraging vLLM’s OpenAI-compatible API. Install the required packages and point at your vLLM server.
pip install langchain langchain-openai langchain-community
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM ignores the key unless the server was started with --api-key
    model="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.7,
    max_tokens=512,
    streaming=True,
)
This llm object works with every LangChain chain, agent, and retriever. No special adapter required. For vLLM server setup, see the production deployment guide.
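Under the hood, ChatOpenAI simply POSTs OpenAI-format JSON to /v1/chat/completions, which is exactly what vLLM accepts. The sketch below builds that request body by hand to show the wire format; the helper function name is illustrative, not part of any library.

```python
import json

def chat_request_body(model, messages, temperature=0.7, max_tokens=512, stream=True):
    """Build the JSON body that an OpenAI-compatible client POSTs to
    /v1/chat/completions (the wire format vLLM accepts)."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    })

body = chat_request_body(
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "ping"}],
)
print(json.loads(body)["model"])  # → meta-llama/Llama-3.1-8B-Instruct
```

Because the payload is identical to OpenAI's, any OpenAI-compatible client or proxy can sit between LangChain and vLLM without translation.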
Basic Chains
Build prompt-response chains using LangChain Expression Language (LCEL). The chain composes a prompt template with the self-hosted model.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a GPU infrastructure expert. Be concise."),
    ("user", "{question}"),
])
chain = prompt | llm | StrOutputParser()
# Invoke
result = chain.invoke({"question": "When should I use tensor parallelism?"})
print(result)
# Stream
for chunk in chain.stream({"question": "Explain KV-cache optimisation."}):
    print(chunk, end="", flush=True)
Chains compose seamlessly. Add output parsers, additional processing steps, or branch into parallel chains — the vLLM backend handles each call identically to an OpenAI endpoint.
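The pipe operator is just left-to-right function composition over Runnables. The following plain-Python sketch (no LangChain required, and not LangChain's actual implementation) shows the idea behind prompt | llm | StrOutputParser():

```python
class Pipe:
    """Minimal sketch of LCEL-style composition: each stage wraps a callable,
    and `|` chains them left to right (illustrative only)."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        other_fn = other.fn if isinstance(other, Pipe) else other
        return Pipe(lambda x: other_fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# Stand-ins for prompt | llm | StrOutputParser()
prompt = Pipe(lambda d: f"Q: {d['question']}")
fake_llm = Pipe(lambda p: {"content": p.upper()})
parser = Pipe(lambda msg: msg["content"])

chain = prompt | fake_llm | parser
print(chain.invoke({"question": "hi"}))  # → Q: HI
```

Each stage only needs to accept the previous stage's output, which is why you can splice in extra processing steps or swap the model without touching the rest of the chain.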
RAG Pipeline
Retrieval-Augmented Generation combines document retrieval with LLM generation. Use a local vector store and the self-hosted model for a fully private pipeline. Note that a vLLM instance serves one model per process, so the embedding model needs its own vLLM server alongside the chat model — assumed here to run on port 8001.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough
embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8001/v1",  # separate vLLM instance serving the embedding model
    api_key="not-needed",
    model="BAAI/bge-base-en-v1.5",
)
vectorstore = FAISS.from_texts(
    ["vLLM uses PagedAttention for efficient KV-cache management.",
     "Tensor parallelism splits model layers across multiple GPUs.",
     "Continuous batching improves throughput by dynamically grouping requests."],
    embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context.\n\nContext: {context}"),
    ("user", "{question}"),
])
def format_docs(docs):
    """Join retrieved Documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
result = rag_chain.invoke("How does vLLM manage memory?")
print(result)
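What as_retriever(search_kwargs={"k": 2}) does is rank stored vectors by similarity to the query embedding and return the top two. A minimal dependency-free sketch of top-k retrieval by cosine similarity (toy 2-D vectors, not FAISS's actual index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors most similar to the query."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(top_k((1.0, 0.05), docs, k=2))  # → [0, 1]
```

FAISS accelerates exactly this ranking with optimized index structures, which is why it scales to millions of vectors where a linear scan would not.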
For a more complete RAG setup with document loading and chunking, see the LlamaIndex RAG guide. For vector store comparisons, check ChromaDB vs FAISS vs Qdrant.
Tool-Calling Agents
LangChain agents use the model’s function-calling capability to decide which tools to invoke. With vLLM serving a tool-capable model, agents run entirely on your hardware.
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
@tool
def get_gpu_memory(gpu_id: int) -> str:
    """Get current VRAM usage for a specific GPU."""
    import subprocess
    result = subprocess.run(
        ["nvidia-smi", "--id=" + str(gpu_id),
         "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()
agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a GPU server management assistant."),
    ("user", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, [get_gpu_memory], agent_prompt)
executor = AgentExecutor(agent=agent, tools=[get_gpu_memory])
result = executor.invoke({"input": "How much VRAM is GPU 0 using?"})
print(result["output"])
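Conceptually, the executor runs a loop: the model either requests a tool call or emits a final answer, and each tool result is appended to the scratchpad before the next model turn. A simplified sketch with a stubbed model (illustrative only, not LangChain's internals):

```python
def run_agent(model, tools, user_input, max_steps=5):
    """Minimal tool-calling loop: the model either requests a tool call
    or returns a final answer (sketch, not LangChain's implementation)."""
    scratchpad = []
    for _ in range(max_steps):
        step = model(user_input, scratchpad)
        if step["type"] == "final":
            return step["content"]
        # Model requested a tool: execute it and record the observation.
        result = tools[step["tool"]](**step["args"])
        scratchpad.append((step["tool"], step["args"], result))
    raise RuntimeError("agent exceeded max_steps")

# Stubbed model: call the tool once, then answer from the observation.
def stub_model(user_input, scratchpad):
    if not scratchpad:
        return {"type": "tool", "tool": "get_gpu_memory", "args": {"gpu_id": 0}}
    return {"type": "final", "content": f"GPU 0: {scratchpad[-1][2]}"}

tools = {"get_gpu_memory": lambda gpu_id: "11000 MiB, 24576 MiB"}
print(run_agent(stub_model, tools, "How much VRAM is GPU 0 using?"))
# → GPU 0: 11000 MiB, 24576 MiB
```

The max_steps cap mirrors AgentExecutor's iteration limit, which prevents a confused model from looping on tool calls forever.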
Streaming Integration
LangChain’s streaming works end-to-end with vLLM. Pair with a FastAPI server to expose streaming chains as HTTP endpoints.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
@app.post("/chat")
async def chat_endpoint(question: str):
    async def generate():
        async for chunk in chain.astream({"question": question}):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
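On the client side, consumers read the text/event-stream response line by line, collecting each data: payload until the [DONE] sentinel. A minimal parser for that framing (the helper name is illustrative):

```python
def parse_sse(lines):
    """Yield payloads from 'data: ...' SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload

stream = ["data: Hello", "", "data:  world", "data: [DONE]", "data: ignored"]
print("".join(parse_sse(stream)))  # → Hello world
```

Real clients would read lines from an HTTP response (e.g. with httpx or the browser's EventSource API) rather than a list, but the framing logic is the same.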
For Ollama as the backend instead of vLLM, see the LangChain with Ollama guide. The LangChain hosting page covers infrastructure options, and our tutorials section has additional integration patterns. For choosing between frameworks, see vLLM vs Ollama.
Run LangChain on Dedicated GPUs
Deploy LangChain with vLLM on bare-metal GPU servers. Private inference, zero API fees, full control over your AI stack.
Browse GPU Servers