
OpenAI SDK with Self-Hosted Models: Python Guide

Complete guide to using the official OpenAI Python SDK with self-hosted models via vLLM and Ollama, covering chat completions, streaming, function calling, and async patterns on GPU servers.

You will use the official OpenAI Python SDK to interact with self-hosted models running on your own GPU server. By the end of this guide, you will have working code for chat completions, streaming, function calling, and async batch processing — all against a local model with zero API fees.

Installation and Configuration

Install the OpenAI SDK and configure it to point at your self-hosted inference engine. Both vLLM and Ollama expose OpenAI-compatible endpoints.

pip install openai

# For vLLM backend
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# For Ollama backend
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

The SDK is identical in both cases. Only the base_url and model name change. For details on setting up the inference backend, see our vLLM production guide.
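If you switch between backends often, it can help to keep the connection presets in one place. A minimal sketch (the backend names and this helper are our own convention, not part of the SDK; the ports match the examples above):

```python
# Connection presets for the two backends shown above.
BACKENDS = {
    "vllm":   {"base_url": "http://localhost:8000/v1",  "api_key": "not-needed"},
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
}

def client_kwargs(backend: str) -> dict:
    """Return the OpenAI() constructor arguments for a named backend."""
    try:
        return BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
```

Then `client = OpenAI(**client_kwargs("vllm"))` builds the right client, and adding a remote server is a one-line change to the dict.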

Chat Completions

The chat completions endpoint is the primary interface for conversational models. The request format is identical to OpenAI’s hosted API.

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is KV-cache in transformer inference?"}
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

The response object matches the OpenAI schema exactly, including usage statistics. Any code that parses OpenAI responses works without modification.
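For multi-turn chat, you are responsible for accumulating the history yourself: append each assistant reply to the messages list before the next request. A small helper class sketch (this `Conversation` wrapper is our own, not part of the SDK):

```python
class Conversation:
    """Minimal chat-history manager: keeps the messages list growing across turns."""

    def __init__(self, system: str):
        self.messages = [{"role": "system", "content": system}]

    def ask(self, client, model: str, content: str) -> str:
        # Add the user turn, call the model, then record the assistant turn.
        self.messages.append({"role": "user", "content": content})
        reply = client.chat.completions.create(
            model=model, messages=self.messages
        ).choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Because the full history is resent on every call, long conversations will eventually hit the model's context window; truncate or summarise old turns when that matters.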

Streaming Responses

Token-by-token streaming reduces perceived latency for interactive applications. The SDK handles streaming identically for self-hosted and OpenAI-hosted models.

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism step by step."}],
    max_tokens=1024,
    stream=True
)

collected_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
        collected_text += delta.content

print(f"\n\nTotal characters: {len(collected_text)}")

For building a streaming frontend, pair this with a FastAPI server that proxies the stream to the browser via Server-Sent Events.
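The core of such a proxy is just reshaping the SDK's chunk objects into SSE frames. A framework-agnostic sketch (the `data: [DONE]` sentinel mirrors OpenAI's own convention; content containing newlines would need escaping in real SSE):

```python
def sse_format(stream):
    """Yield an OpenAI chat stream as Server-Sent Events frames."""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield f"data: {delta.content}\n\n"
    # Signal end-of-stream so the browser can close the EventSource.
    yield "data: [DONE]\n\n"
```

In FastAPI you would pass this generator to a `StreamingResponse` with `media_type="text/event-stream"`.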

Function Calling

Models with tool-use training (Llama 3.1, Mistral, Qwen 2.5) support OpenAI-format function calling through vLLM. Note that the vLLM server must be launched with tool calling enabled (typically the --enable-auto-tool-choice flag plus a model-appropriate --tool-call-parser). Define tools and the model decides when to invoke them.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_gpu_stats",
            "description": "Get current GPU utilisation and memory usage",
            "parameters": {
                "type": "object",
                "properties": {
                    "gpu_id": {"type": "integer", "description": "GPU device index"}
                },
                "required": ["gpu_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Check the stats for GPU 0."}],
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message
if message.tool_calls:  # the model may answer in plain text instead
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
else:
    print(message.content)
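The model only names the function; executing it is your job. `function.arguments` arrives as a JSON string, so parse it and dispatch to a local Python function. A sketch (the `get_gpu_stats` body and its return values are hypothetical placeholders):

```python
import json

def get_gpu_stats(gpu_id: int) -> dict:
    # Hypothetical implementation; in practice query nvidia-smi or NVML here.
    return {"gpu_id": gpu_id, "utilisation_pct": 42, "memory_used_mb": 8192}

# Registry mapping tool names the model may emit to local callables.
AVAILABLE_TOOLS = {"get_gpu_stats": get_gpu_stats}

def run_tool_call(name: str, arguments_json: str) -> str:
    """Parse the model's JSON arguments and invoke the matching function."""
    args = json.loads(arguments_json)
    result = AVAILABLE_TOOLS[name](**args)
    return json.dumps(result)
```

To complete the loop, append the result as a `{"role": "tool", "tool_call_id": ..., "content": ...}` message and call the model again so it can phrase a final answer.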

For multi-tool agent workflows, combine this with LangChain or build custom chains. See the API compatibility guide for gotchas around function calling with different models.

Async and Batch Processing

The SDK provides an async client for high-concurrency applications. Use it when processing multiple prompts in parallel.

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def process_prompt(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256
    )
    return response.choices[0].message.content

async def batch_process(prompts: list[str]) -> list[str]:
    tasks = [process_prompt(p) for p in prompts]
    return await asyncio.gather(*tasks)

prompts = ["Summarise CUDA cores.", "Explain VRAM vs RAM.", "Define tensor parallelism."]
results = asyncio.run(batch_process(prompts))
for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}\nA: {result}\n")
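asyncio.gather fires every request at once, which is fine for a handful of prompts but can flood the server queue for large lists. A sketch that caps in-flight requests with a semaphore (the limit of 8 is an arbitrary starting point; tune it to your GPU and backend):

```python
import asyncio

async def bounded_gather(coros, limit: int = 8):
    """Run awaitables concurrently, but with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(run(c) for c in coros))
```

Drop this in place of the `asyncio.gather` call in `batch_process` to keep concurrency bounded.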

vLLM’s continuous batching engine processes these concurrent requests efficiently on the GPU. For queue-based batch processing at larger scale, see the Redis queue guide.

Error Handling and Best Practices

Handle connection failures and model errors gracefully. Self-hosted endpoints can be temporarily unavailable during model reloads or GPU memory issues.

from openai import APIConnectionError, APIStatusError
import time

def generate_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256, timeout=30.0
            )
        except APIConnectionError:
            # Server unreachable, e.g. during a restart: back off exponentially.
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
        except APIStatusError as e:
            # 503 usually means the model is loading or the queue is full.
            if e.status_code != 503 or attempt == max_retries - 1:
                raise
            time.sleep(5)

For production deployments, add Prometheus monitoring to track error rates. The self-hosting guide covers reliability patterns in detail. Browse more integration examples in our tutorials section.

Run Self-Hosted Models with the OpenAI SDK

Deploy models on dedicated GPU servers and use the same OpenAI SDK you already know. Zero per-token fees, full data control.

Browse GPU Servers
