You will use the official OpenAI Python SDK to interact with self-hosted models running on your own GPU server. By the end of this guide, you will have working code for chat completions, streaming, function calling, and async batch processing — all against a local model with zero API fees.
Installation and Configuration
Install the OpenAI SDK and configure it to point at your self-hosted inference engine. Both vLLM and Ollama expose OpenAI-compatible endpoints.
pip install openai
# For vLLM backend
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# For Ollama backend
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)
The SDK is identical in both cases. Only the base_url and model name change. For details on setting up the inference backend, see our vLLM production guide.
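If you switch between backends often, those endpoint details can live in one place. A minimal sketch of that idea — the backend names, the `BACKENDS` table, and the `LLM_BACKEND` environment variable are this guide's own convention, not part of the SDK:

```python
import os

# Default endpoints for the two backends covered in this guide
BACKENDS = {
    "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "not-needed"},
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
}

def backend_config(name: str) -> dict:
    """Return the base_url/api_key pair for a known backend."""
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None

def make_client(name: str = None):
    """Build an OpenAI client for the chosen (or LLM_BACKEND-selected) backend."""
    from openai import OpenAI  # deferred so the config helpers have no dependencies
    return OpenAI(**backend_config(name or os.environ.get("LLM_BACKEND", "vllm")))
```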
Chat Completions
The chat completions endpoint is the primary interface for conversational models. The request format is identical to OpenAI’s hosted API.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is KV-cache in transformer inference?"}
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
The response object matches the OpenAI schema exactly, including usage statistics. Any code that parses OpenAI responses works without modification.
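Because usage statistics come back in the standard shape, simple throughput logging needs no extra instrumentation. A small sketch — the helper name and report format are illustrative, not part of the SDK:

```python
def throughput_summary(prompt_tokens: int, completion_tokens: int, elapsed_s: float) -> str:
    """Format a one-line generation report from OpenAI-style usage numbers."""
    total = prompt_tokens + completion_tokens
    tps = completion_tokens / elapsed_s if elapsed_s > 0 else 0.0
    return (f"{total} tokens ({prompt_tokens} prompt + {completion_tokens} completion), "
            f"{tps:.1f} tok/s generated")

# Usage with a timed request:
# start = time.perf_counter()
# response = client.chat.completions.create(...)
# print(throughput_summary(response.usage.prompt_tokens,
#                          response.usage.completion_tokens,
#                          time.perf_counter() - start))
```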
Streaming Responses
Token-by-token streaming reduces perceived latency for interactive applications. The SDK handles streaming identically for self-hosted and OpenAI-hosted models.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism step by step."}],
    max_tokens=1024,
    stream=True
)
collected_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
        collected_text += delta.content
print(f"\n\nTotal characters: {len(collected_text)}")
For building a streaming frontend, pair this with a FastAPI server that proxies the stream to the browser via Server-Sent Events.
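The SSE wire format itself is simple: each event is one or more `data:` lines followed by a blank line. A minimal framing helper that a streaming endpoint's generator could yield — the helper and the `[DONE]` sentinel follow OpenAI's streaming convention, but the function itself is this guide's sketch, not a FastAPI API:

```python
def sse_frame(text: str) -> str:
    """Wrap a text chunk as a Server-Sent Events frame.

    Multi-line chunks get one 'data:' line each, per the SSE format.
    """
    return "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"

SSE_DONE = "data: [DONE]\n\n"  # conventional end-of-stream marker

# Inside a FastAPI route, yield sse_frame(delta.content) for each chunk,
# then SSE_DONE, wrapped in a StreamingResponse with
# media_type="text/event-stream".
```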
Function Calling
Models with tool-use training (Llama 3.1, Mistral, Qwen 2.5) support OpenAI-format function calling through vLLM. Note that the vLLM server must be launched with --enable-auto-tool-choice and a matching --tool-call-parser (for example, llama3_json for Llama 3.1). Define tools and the model decides when to invoke them.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_gpu_stats",
            "description": "Get current GPU utilisation and memory usage",
            "parameters": {
                "type": "object",
                "properties": {
                    "gpu_id": {"type": "integer", "description": "GPU device index"}
                },
                "required": ["gpu_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Check the stats for GPU 0."}],
    tools=tools,
    tool_choice="auto"
)
message = response.choices[0].message
if message.tool_calls:  # the model may also answer in plain text
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
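The arguments arrive as a JSON string, so completing the loop means parsing them, running the matching function, and sending the result back in a "tool" role message. A sketch of that dispatch step — the `get_gpu_stats` implementation here is a stand-in with invented values, and the registry pattern is this guide's convention:

```python
import json

def get_gpu_stats(gpu_id: int) -> dict:
    """Stand-in implementation; a real version might query nvidia-smi or NVML."""
    return {"gpu_id": gpu_id, "utilisation_pct": 87, "memory_used_gb": 18.4}

TOOL_REGISTRY = {"get_gpu_stats": get_gpu_stats}

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Parse the model's JSON arguments, run the matching function, return a JSON result."""
    args = json.loads(arguments_json)
    return json.dumps(TOOL_REGISTRY[name](**args))

# Feed the result back so the model can produce a final answer:
# messages.append(response.choices[0].message)
# messages.append({
#     "role": "tool",
#     "tool_call_id": tool_call.id,
#     "content": execute_tool_call(tool_call.function.name,
#                                  tool_call.function.arguments),
# })
# final = client.chat.completions.create(model=..., messages=messages, tools=tools)
```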
For multi-tool agent workflows, combine this with LangChain or build custom chains. See the API compatibility guide for gotchas around function calling with different models.
Async and Batch Processing
The SDK provides an async client for high-concurrency applications. Use it when processing multiple prompts in parallel.
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def process_prompt(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256
    )
    return response.choices[0].message.content

async def batch_process(prompts: list[str]) -> list[str]:
    tasks = [process_prompt(p) for p in prompts]
    return await asyncio.gather(*tasks)
prompts = ["Summarise CUDA cores.", "Explain VRAM vs RAM.", "Define tensor parallelism."]
results = asyncio.run(batch_process(prompts))
for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}\nA: {result}\n")
vLLM’s continuous batching engine processes these concurrent requests efficiently on the GPU. For queue-based batch processing at larger scale, see the Redis queue guide.
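Note that asyncio.gather fires every request at once; for large prompt lists it is often better to cap the number of in-flight requests so the server's queue stays bounded. A sketch using asyncio.Semaphore — the helper and the default limit of 8 are this guide's suggestion, to be tuned to your hardware:

```python
import asyncio

async def gather_limited(coros, limit: int = 8):
    """Await coroutines with at most `limit` in flight, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:  # coroutines only start once they are awaited here
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# Drop-in replacement for the gather call above:
# results = asyncio.run(gather_limited(
#     [process_prompt(p) for p in prompts], limit=8))
```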
Error Handling and Best Practices
Handle connection failures and model errors gracefully. Self-hosted endpoints can be temporarily unavailable during model reloads or GPU memory issues.
from openai import APIConnectionError, APIStatusError
import time
def generate_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
                timeout=30.0
            )
        except APIConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
        except APIStatusError as e:
            # 503 usually means the server is reloading a model; wait and retry
            if e.status_code != 503 or attempt == max_retries - 1:
                raise
            time.sleep(5)
For production deployments, add Prometheus monitoring to track error rates. The self-hosting guide covers reliability patterns in detail. Browse more integration examples in our tutorials section.
Run Self-Hosted Models with the OpenAI SDK
Deploy models on dedicated GPU servers and use the same OpenAI SDK you already know. Zero per-token fees, full data control.
Browse GPU Servers