
Self-Hosted OpenAI-Compatible API Guide

Run your own drop-in replacement for the OpenAI API on a dedicated GPU and point existing OpenAI SDK code at it without rewriting.

Almost every LLM client library speaks the OpenAI API. Self-hosting an OpenAI-compatible endpoint on dedicated GPU hosting means existing code keeps working: you change only the base URL and API key. Here is how to stand one up.

Compatible Engines

These all expose OpenAI-compatible endpoints out of the box:

  • vLLM: /v1/chat/completions, /v1/completions, /v1/embeddings
  • Ollama: same three endpoints, easier setup
  • LiteLLM proxy: aggregates many backends behind one OpenAI API
  • TGI: OpenAI messages endpoint in recent versions
  • SGLang: OpenAI-compatible mode

Launching

vLLM is the most common production choice:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --api-key sk-your-secret-key \
  --served-model-name gpt-4o-mini

The --served-model-name flag aliases the model to whatever name your client code expects. If your code requests model="gpt-4o-mini", serve the local model under that alias and no rewrite is needed.
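You can confirm the alias took effect by listing the served models. A minimal stdlib sketch, assuming the vLLM server launched above is reachable at localhost:8000 with the same API key; with the alias in place, the response should list gpt-4o-mini rather than the Hugging Face model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # the vLLM server launched above
API_KEY = "sk-your-secret-key"

def model_ids(payload: dict) -> list:
    """Pull the model IDs out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_served_models() -> list:
    """Ask the server which model names it is serving."""
    req = urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return model_ids(json.load(resp))
```

Calling list_served_models() against the server above should return ["gpt-4o-mini"].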

Client Code

Python example:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-server.gigagpu.com/v1",
    api_key="sk-your-secret-key",
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

The same goes for Node, Go, and Ruby: the official OpenAI SDKs all accept a base URL parameter. Existing tooling (LangChain, Instructor, LlamaIndex) works without changes.

Gotchas

Three endpoint gotchas catch most teams:

  • Streaming: vLLM supports SSE streaming, but behind some reverse proxies you must disable response buffering so chunks are flushed as they arrive (in nginx, proxy_buffering off). See the nginx guide linked below.
  • Function calling: vLLM supports tool use but format quirks vary by model. Test with your target model specifically.
  • Embedding endpoints: not every engine implements /v1/embeddings. If your app uses both chat and embeddings, either run two engines or use a proxy like LiteLLM.
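The streaming gotcha is easiest to check from client code. A sketch assuming the vLLM server from the launch section at localhost:8000; the network call is guarded behind a RUN_STREAM_DEMO environment variable (a convention used here, not part of any SDK) so the helper can be exercised on its own:

```python
import os

def join_deltas(chunks) -> str:
    """Concatenate the content deltas from a chat-completions stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

if os.environ.get("RUN_STREAM_DEMO"):
    # Requires the openai package and a running server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="sk-your-secret-key",
    )
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    print(join_deltas(stream))
```

If chunks arrive all at once instead of incrementally, suspect proxy buffering rather than the engine.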

Feature support at a glance:

  Feature                vLLM                    Ollama
  Chat completions       Yes                     Yes
  Completions (legacy)   Yes                     Yes
  Embeddings             Yes                     Yes
  Function calling       Yes (model-dependent)   Yes (model-dependent)
  Streaming              Yes                     Yes
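When the engine does expose /v1/embeddings, the call shape matches OpenAI's. A hedged sketch, again guarded behind an environment variable; "your-embedding-model" is a placeholder for whatever served name your engine uses, and the cosine helper is plain stdlib:

```python
import math
import os

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

if os.environ.get("RUN_EMBED_DEMO"):
    # Requires the openai package and an engine serving /v1/embeddings.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="sk-your-secret-key",
    )
    resp = client.embeddings.create(
        model="your-embedding-model",   # placeholder served name
        input=["first sentence", "second sentence"],
    )
    vecs = [d.embedding for d in resp.data]
    print(cosine(vecs[0], vecs[1]))
```

If your chat engine lacks embeddings, the same client code can point at a second engine or a LiteLLM proxy without changing anything but the base URL.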

Drop-In OpenAI Replacement on Your Own Server

We deploy vLLM or Ollama pre-configured on UK dedicated hosting.

Browse GPU Servers

See vLLM behind nginx with auth and Ollama behind Cloudflare Tunnel for production deployment patterns.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, and 1Gbps networking in a UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
