
LangChain with Ollama: Local LLM Integration

Step-by-step guide to integrating LangChain with Ollama for local LLM inference covering model setup, chains, RAG pipelines, embeddings, and deployment on dedicated GPU servers.

You will connect LangChain to Ollama so that chains, RAG pipelines, and agents run against locally-served models with a single command. By the end of this guide, you will have Ollama serving a model on your GPU server and LangChain consuming it for chat, retrieval, and structured output tasks.

Ollama Setup

Ollama packages model weights, tokenisers, and serving infrastructure into a single binary. Pull a model and it is ready to serve immediately.

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.1:8b

# Verify it is running
curl http://localhost:11434/api/tags

Ollama automatically detects your GPU and loads the model into VRAM. For comparing Ollama with vLLM, see vLLM vs Ollama. For vLLM-specific LangChain integration, check the LangChain with vLLM guide.
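If you want to check the endpoint from Python rather than curl, the /api/tags response is JSON with a models list. A minimal sketch, using an illustrative payload in place of a live server (field values here are made up):

```python
import json

# Illustrative /api/tags payload; a live server at
# http://localhost:11434/api/tags returns this general shape.
sample = '{"models": [{"name": "llama3.1:8b", "size": 4920753328}]}'

data = json.loads(sample)
names = [m["name"] for m in data["models"]]
print(names)  # ['llama3.1:8b']
```

Against a running server you would fetch the same JSON over HTTP; if your model appears in the list, Ollama is serving correctly.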

LangChain Connection

LangChain provides a dedicated Ollama integration and also supports connecting via the OpenAI compatibility layer. Both approaches work; the dedicated class offers Ollama-specific features.

pip install langchain langchain-ollama langchain-community faiss-cpu

# Option 1: Dedicated Ollama class
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.7,
    base_url="http://localhost:11434"
)

# Option 2: OpenAI compatibility
from langchain_openai import ChatOpenAI

llm_openai = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3.1:8b"
)

The dedicated class supports Ollama-specific parameters like num_ctx for context window and num_gpu for GPU layer offloading. Use it when you need fine-grained control over Ollama’s serving behaviour.
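Choosing num_ctx has a direct VRAM cost, because KV-cache size grows linearly with context length. A rough back-of-envelope estimate, assuming llama3.1:8b-like dimensions (32 layers, 8 KV heads, head dim 128, fp16) rather than exact figures:

```python
def kv_cache_bytes(num_ctx: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Rough KV-cache size for a single sequence at the given context length."""
    # Factor of 2 covers the separate K and V tensors held per layer.
    return 2 * layers * kv_heads * head_dim * num_ctx * dtype_bytes

print(kv_cache_bytes(8192) / 2**30)  # ~1.0 GiB for an 8k context
```

Doubling num_ctx doubles this figure, so budget VRAM for weights plus KV-cache before raising the context window.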

Building Chains

LangChain Expression Language (LCEL) chains work with Ollama exactly as they do with any other LLM backend.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a DevOps engineer specialising in GPU infrastructure."),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

# Single invocation
result = chain.invoke({"question": "How do I monitor GPU temperature in a data centre?"})
print(result)

# Streaming
for chunk in chain.stream({"question": "Explain container GPU passthrough."}):
    print(chunk, end="", flush=True)

RAG with Ollama Embeddings

Ollama serves embedding models alongside chat models. Run both on the same GPU for a fully local RAG pipeline with no external dependencies.

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough

# Pull an embedding model
# ollama pull nomic-embed-text

embeddings = OllamaEmbeddings(model="nomic-embed-text")

docs = [
    "Ollama serves models with automatic GPU detection and VRAM management.",
    "Docker GPU passthrough requires the NVIDIA Container Toolkit.",
    "KV-cache size scales linearly with context length and batch size."
]

vectorstore = FAISS.from_texts(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on this context only:\n{context}"),
    ("user", "{question}")
])

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt | llm | StrOutputParser()
)

answer = rag_chain.invoke("How does Ollama handle GPU memory?")
print(answer)
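Under the hood, the retriever is doing nearest-neighbour search: it ranks document embeddings by similarity to the query embedding and returns the top k. A stripped-down sketch of that ranking with toy three-dimensional vectors standing in for real nomic-embed-text embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; real vectors from nomic-embed-text have 768 dimensions.
doc_vecs = {
    "gpu_doc":    [0.9, 0.1, 0.0],
    "docker_doc": [0.1, 0.9, 0.0],
    "cache_doc":  [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Equivalent of search_kwargs={"k": 2}: keep the two closest documents.
top2 = sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True)[:2]
print(top2)  # ['gpu_doc', 'docker_doc']
```

FAISS does the same ranking with optimised index structures instead of a brute-force loop, which is what makes it practical at scale.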

For a more advanced RAG setup with document chunking and reranking, see the LlamaIndex RAG guide. For vector store options, check ChromaDB vs FAISS vs Qdrant.

Structured Output

Extract structured data from unstructured text by pairing LangChain's JSON output parser with format instructions in the prompt. For stricter guarantees you can also enable Ollama's JSON mode by passing format="json" to ChatOllama, which forces the model to emit valid JSON.

from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field

class ServerSpec(BaseModel):
    gpu_model: str = Field(description="GPU model name")
    vram_gb: int = Field(description="VRAM in gigabytes")
    suitable_for: str = Field(description="Recommended workload type")

parser = JsonOutputParser(pydantic_object=ServerSpec)

structured_prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract server specs from the text. {format_instructions}"),
    ("user", "{text}")
])

structured_chain = structured_prompt | llm | parser

result = structured_chain.invoke({
    "text": "The server has an RTX 5090 with 24GB VRAM, ideal for LLM inference.",
    "format_instructions": parser.get_format_instructions()
})
print(result)
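The parser's job is essentially to pull JSON out of the model's reply and check it against the schema. A stripped-down illustration of that step using only the standard library and a sample reply (the reply text here is invented for the example):

```python
import json

# A reply the model might produce after following the format instructions.
reply = '{"gpu_model": "RTX 4090", "vram_gb": 24, "suitable_for": "LLM inference"}'

spec = json.loads(reply)

# Minimal schema check mirroring the ServerSpec fields and types.
required = {"gpu_model": str, "vram_gb": int, "suitable_for": str}
assert all(isinstance(spec[k], t) for k, t in required.items())
print(spec["gpu_model"], spec["vram_gb"])
```

JsonOutputParser additionally handles replies where the JSON is wrapped in surrounding prose or code fences, which local models produce fairly often.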

Production Considerations

Ollama is excellent for development and single-server deployments. For high-concurrency production with continuous batching, consider vLLM instead. Key differences that matter at scale:

  • Ollama handles only limited request concurrency (tunable via OLLAMA_NUM_PARALLEL); vLLM's continuous batching sustains far higher throughput under concurrent load.
  • Ollama’s model management (pull, run, delete) is simpler for teams managing multiple models.
  • Both expose OpenAI-compatible APIs, so switching backends requires only a URL change in LangChain.
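To see why batching matters at scale, consider a toy throughput model. The numbers below are illustrative assumptions, not benchmarks; real gains depend on model size, hardware, and request mix:

```python
# Assume 10 concurrent requests, each needing 2 s of GPU time in isolation.
requests, latency_s = 10, 2.0

# Sequential serving: requests queue one behind another.
sequential_total = requests * latency_s

# Continuous batching overlaps requests on the GPU; assume a 4x effective
# speed-up here purely for illustration.
batched_total = sequential_total / 4

print(sequential_total, batched_total)  # 20.0 vs 5.0 seconds
```

The gap widens as concurrency grows, which is why vLLM is the usual recommendation once many users share one endpoint.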

For monitoring, add Prometheus and Grafana to track GPU utilisation and response times. The self-hosting guide covers infrastructure planning, and our tutorials section has more integration patterns.

Run Ollama with LangChain on Dedicated GPUs

Deploy Ollama on bare-metal GPU servers for local LLM inference. No API fees, no data leaving your network.

Browse GPU Servers
