
LangChain with Self-Hosted vLLM

Complete guide to integrating LangChain with a self-hosted vLLM instance, covering ChatOpenAI configuration, chains, RAG pipelines, agents, and streaming on dedicated GPU servers.

You will connect LangChain to a self-hosted vLLM instance so that chains, agents, and RAG pipelines run entirely on your own GPU server. By the end of this guide, you will have working examples for chat models, retrieval chains, and tool-calling agents — all with zero external API dependencies.

Configuration

LangChain connects to vLLM through the ChatOpenAI class, leveraging vLLM’s OpenAI-compatible API. Install the required packages and point the client at your vLLM server.

pip install langchain langchain-openai langchain-community

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.7,
    max_tokens=512,
    streaming=True
)

This llm object works with every LangChain chain, agent, and retriever. No special adapter required. For vLLM server setup, see the production deployment guide.
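Before wiring up chains, it is worth confirming the server is reachable. vLLM’s OpenAI-compatible API exposes a model listing you can query directly:

```shell
# Sanity check: list the model(s) the vLLM server is serving
curl http://localhost:8000/v1/models
```

The response should include the model ID you pass to ChatOpenAI above.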

Basic Chains

Build prompt-response chains using LangChain Expression Language (LCEL). The chain composes a prompt template with the self-hosted model.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a GPU infrastructure expert. Be concise."),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

# Invoke
result = chain.invoke({"question": "When should I use tensor parallelism?"})
print(result)

# Stream
for chunk in chain.stream({"question": "Explain KV-cache optimisation."}):
    print(chunk, end="", flush=True)

Chains compose seamlessly. Add output parsers, additional processing steps, or branch into parallel chains — the vLLM backend handles each call identically to an OpenAI endpoint.
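The | operator is what makes this composition work: each Runnable’s output becomes the next step’s input. As a rough illustration of the idea in plain Python (this is not LangChain’s actual implementation, just a sketch of the pattern):

```python
class Step:
    """Minimal pipe-composition sketch: `a | b` builds a new Step that
    feeds a's output into b, mirroring how LCEL chains prompt | llm | parser."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose: run self first, then pass the result to the next step
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# Two toy "steps" composed with the pipe operator
chain = Step(str.strip) | Step(str.upper)
```

Calling `chain.invoke("  hi ")` runs both steps in order and returns "HI".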

RAG Pipeline

Retrieval-Augmented Generation combines document retrieval with LLM generation. Use a local vector store and the self-hosted model for a fully private pipeline.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough

# Embeddings need their own vLLM instance serving an embedding model,
# e.g.: vllm serve BAAI/bge-base-en-v1.5 --port 8001
embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8001/v1",
    api_key="not-needed",
    model="BAAI/bge-base-en-v1.5",
    check_embedding_ctx_length=False  # skip OpenAI-specific token checks for non-OpenAI models
)

vectorstore = FAISS.from_texts(
    ["vLLM uses PagedAttention for efficient KV-cache management.",
     "Tensor parallelism splits model layers across multiple GPUs.",
     "Continuous batching improves throughput by dynamically grouping requests."],
    embeddings
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context.\n\nContext: {context}"),
    ("user", "{question}")
])

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt | llm | StrOutputParser()
)

result = rag_chain.invoke("How does vLLM manage memory?")
print(result)
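One refinement: the {context} slot above receives the retriever’s raw Document list, which the prompt template stringifies wholesale. Joining just the page_content fields usually produces a cleaner prompt. A minimal sketch (format_docs is a helper you would pipe after the retriever; SimpleNamespace stands in for langchain’s Document here):

```python
from types import SimpleNamespace  # stand-in for langchain_core.documents.Document in this sketch

def format_docs(docs):
    """Join retrieved documents' text into one plain-text context block."""
    return "\n\n".join(doc.page_content for doc in docs)
```

With this helper, the chain’s first step becomes {"context": retriever | format_docs, "question": RunnablePassthrough()} — LCEL coerces the plain function into a Runnable.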

For a more complete RAG setup with document loading and chunking, see the LlamaIndex RAG guide. For vector store comparisons, check ChromaDB vs FAISS vs Qdrant.

Tool-Calling Agents

LangChain agents use the model’s function-calling capability to decide which tools to invoke. With vLLM serving a tool-capable model, agents run entirely on your hardware.
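Note that tool calling must be enabled when launching vLLM. The exact flags depend on your vLLM version and model family, so treat this as a sketch and verify against vllm serve --help:

```shell
# Hypothetical launch flags for tool calling with Llama 3.1 --
# confirm the flag names for your vLLM version before relying on them
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8000
```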

from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor

@tool
def get_gpu_memory(gpu_id: int) -> str:
    """Get current VRAM usage for a specific GPU."""
    import subprocess
    result = subprocess.run(
        ["nvidia-smi", "--id=" + str(gpu_id), "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"], capture_output=True, text=True
    )
    return result.stdout.strip()

agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a GPU server management assistant."),
    ("user", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])

agent = create_tool_calling_agent(llm, [get_gpu_memory], agent_prompt)
executor = AgentExecutor(agent=agent, tools=[get_gpu_memory])

result = executor.invoke({"input": "How much VRAM is GPU 0 using?"})
print(result["output"])
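get_gpu_memory returns the raw CSV line from nvidia-smi. Parsing it into structured fields before returning it can help the model reason about the numbers more reliably. A sketch (parse_gpu_memory is a hypothetical helper, assuming the "memory.used, memory.total" query format used above):

```python
def parse_gpu_memory(csv_line: str) -> dict:
    """Parse one 'memory.used, memory.total' CSV line from nvidia-smi,
    e.g. '1024 MiB, 24576 MiB' -> {'used_mib': 1024, 'total_mib': 24576}."""
    used, total = (field.strip() for field in csv_line.split(","))
    return {
        "used_mib": int(used.split()[0]),   # numeric part of '1024 MiB'
        "total_mib": int(total.split()[0]),  # numeric part of '24576 MiB'
    }
```

The tool could then return a formatted string such as f"{parsed['used_mib']} of {parsed['total_mib']} MiB used".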

Streaming Integration

LangChain’s streaming works end-to-end with vLLM. Pair with a FastAPI server to expose streaming chains as HTTP endpoints.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat_endpoint(question: str):  # bare str arrives as a query parameter; use a Pydantic model for a JSON body
    async def generate():
        async for chunk in chain.astream({"question": question}):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
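On the client side, the event stream is reassembled by reading "data:" lines until the [DONE] sentinel emitted above. A minimal sketch of that framing (parse_sse_lines is a hypothetical helper; a real client would read the HTTP response incrementally rather than from a list):

```python
def parse_sse_lines(lines):
    """Reassemble streamed text from Server-Sent-Event 'data: ...' lines,
    stopping at the [DONE] sentinel used by the /chat endpoint above."""
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, other event fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunks.append(payload)
    return "".join(chunks)
```

With httpx or requests, you would feed the response's line iterator straight into this function.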

For Ollama as the backend instead of vLLM, see the LangChain with Ollama guide. The LangChain hosting page covers infrastructure options, and our tutorials section has additional integration patterns. For choosing between frameworks, see vLLM vs Ollama.

Run LangChain on Dedicated GPUs

Deploy LangChain with vLLM on bare-metal GPU servers. Private inference, zero API fees, full control over your AI stack.

Browse GPU Servers
