
Ollama Custom Model Import via Modelfile

Import custom GGUF models into Ollama using Modelfiles. Covers weight conversion, parameter tuning, system prompts, template configuration, and creating shareable model packages on GPU servers.

When You Need a Model Ollama Doesn’t Have

The Ollama library covers popular models, but your project may need a fine-tuned variant, a community GGUF from Hugging Face, or a model you quantized yourself. Rather than switching to a different inference engine, you can import any GGUF-format model directly into Ollama using a Modelfile. The result behaves exactly like a built-in model, with full GPU acceleration on your dedicated server.

Obtain Your GGUF Model Weights

Ollama uses the GGUF format internally. If your model is in a different format, convert it first:

# Option 1: Download a pre-made GGUF from Hugging Face
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Option 2: Convert from safetensors/PyTorch using llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
# convert_hf_to_gguf.py only emits f32/f16/bf16/q8_0; K-quants need a second pass
python convert_hf_to_gguf.py /path/to/your/model --outtype f16 --outfile my-model-f16.gguf
# Build llama.cpp, then quantize the f16 file down to Q4_K_M
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize my-model-f16.gguf my-model.Q4_K_M.gguf Q4_K_M

The Q4_K_M quantization offers a strong balance between quality and VRAM usage. For quality-sensitive tasks, use Q5_K_M or Q8_0.
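As a rough rule of thumb, you can estimate the weight-file size (and hence the minimum VRAM for the weights alone) from parameter count times bits per weight. The bits-per-weight figures below are approximate community values, and this sketch ignores KV-cache and activation overhead:

```shell
# Rough weight-file size: params * bits-per-weight / 8 bytes
# Approximate bpw: Q4_K_M ~4.85, Q5_K_M ~5.69, Q8_0 = 8.5 (stored as hundredths)
params=7000000000
for entry in "Q4_K_M 485" "Q5_K_M 569" "Q8_0 850"; do
  set -- $entry
  gib=$(( params * $2 / 100 / 8 / 1024 / 1024 / 1024 ))
  echo "$1 approx ${gib} GiB"
done
```

For a 7B model this lands in the 3–7 GiB range depending on quantization, which is why Q4_K_M fits comfortably on 8 GB cards while Q8_0 leaves little headroom for context.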

Write the Modelfile

A Modelfile is Ollama’s equivalent of a Dockerfile. It defines the base weights, parameters, and prompt template:

# Modelfile for a custom Mistral fine-tune
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Set inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
PARAMETER stop "</s>"
PARAMETER stop "[INST]"

# Define the system prompt
SYSTEM """You are a technical support assistant specializing in GPU server administration. Provide concise, accurate answers with command-line examples when relevant."""

# Set the chat template
TEMPLATE """[INST] {{ if .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]"""

The FROM directive points to your local GGUF file. All other directives are optional but significantly improve the model’s behaviour for your specific use case.
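Before writing a template from scratch, it can help to see how an already-installed model is configured. Ollama can print the effective Modelfile of any local model, which makes a good starting point (the model name here is an example and must already be pulled):

```shell
# Dump the Modelfile of an installed model to use as a template reference
ollama show llama3.1:8b --modelfile
```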

Create and Test the Model

# Build the model from the Modelfile
ollama create my-mistral-support -f Modelfile

# Verify it appears in the model list
ollama list

# Test interactively
ollama run my-mistral-support "How do I check GPU utilization on Linux?"

# Test via the API
curl http://localhost:11434/api/generate -d '{
  "model": "my-mistral-support",
  "prompt": "Explain VRAM allocation for LLMs",
  "stream": false
}'
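With stream set to false, the generated text comes back in the JSON `response` field. If you have jq installed, you can extract just the text (a convenience sketch, not part of Ollama itself):

```shell
# Pipe the non-streaming response through jq to get plain text
curl -s http://localhost:11434/api/generate -d '{
  "model": "my-mistral-support",
  "prompt": "Explain VRAM allocation for LLMs",
  "stream": false
}' | jq -r '.response'
```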

If the model fails to load, check that the GGUF file is not corrupted and that your GPU has sufficient VRAM. Our CUDA setup guide covers driver prerequisites.
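A quick way to rule out a truncated or corrupted download is to check the magic bytes: every valid GGUF file begins with the ASCII string GGUF. The filename here is illustrative:

```shell
# Every valid GGUF file starts with the 4-byte magic "GGUF"
f=mistral-7b-instruct-v0.2.Q4_K_M.gguf
if [ "$(head -c 4 "$f")" = "GGUF" ]; then
  echo "magic OK"
else
  echo "not a GGUF file (truncated or corrupt download?)"
fi
```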

Advanced Modelfile Techniques

Build on top of existing Ollama models rather than raw GGUF files to inherit their optimised settings:

# Extend an existing Ollama model with a custom system prompt
FROM llama3.1:8b

SYSTEM """You are a code reviewer. Analyse code for bugs, security issues, and performance problems. Always format your response with sections: Issues Found, Severity, and Suggested Fix."""

PARAMETER temperature 0.3
PARAMETER num_ctx 16384

This approach is ideal for creating task-specific variants without re-downloading or converting weights.
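Because a Modelfile is plain text, stamping out several task-specific variants from one base is easy to script. A minimal sketch, with illustrative model name and prompts (the commented ollama create loop at the end assumes the daemon is running):

```shell
# Generate one Modelfile per task from a common base model
base="llama3.1:8b"
for task in "code reviewer" "release-notes summariser"; do
  name=$(echo "$task" | tr ' ' '-')
  cat > "Modelfile.$name" <<EOF
FROM $base
SYSTEM """You are a $task."""
PARAMETER temperature 0.3
EOF
done
ls Modelfile.*
# Then register each variant:
# for f in Modelfile.*; do ollama create "my-${f#Modelfile.}" -f "$f"; done
```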

Production Deployment

For serving custom models in production on your GPU server:

# Pre-load the model into VRAM; keep_alive -1 keeps it resident indefinitely
# (ollama pull only fetches registry models; locally created models are loaded, not pulled)
curl http://localhost:11434/api/generate -d '{"model": "my-mistral-support", "keep_alive": -1}'

# Add to systemd service with pre-warming
# /etc/systemd/system/ollama-warmup.service
[Unit]
Description=Pre-warm Ollama model
After=ollama.service
Requires=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama run my-mistral-support ""
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
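After saving the unit file, reload systemd and enable the unit so the model is warmed on every boot (standard systemctl workflow; requires root):

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-warmup.service
# Confirm the model is resident: ollama ps lists currently loaded models
ollama ps
```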

Custom models work seamlessly with Ollama’s API, making them drop-in replacements for any application already using Ollama hosting. For workloads requiring OpenAI-compatible APIs, consider vLLM as described in the production setup guide. The tutorials section covers PyTorch model conversion workflows, and the LLM hosting blog has guides for other serving frameworks.

GPU Servers for Custom Models

GigaGPU dedicated servers with high-VRAM NVIDIA GPUs — import and serve your own fine-tuned models with Ollama.

Browse GPU Servers
