
Self-Hosted OpenAI-Compatible API Guide

Run your own drop-in replacement for the OpenAI API on a dedicated GPU and point existing OpenAI SDK code at it without rewriting.

Almost every LLM client library speaks the OpenAI API. Self-hosting an OpenAI-compatible endpoint on dedicated GPU hosting means existing code keeps working: you change only the base URL and API key. Here is how to stand one up.

Compatible Engines

These all expose OpenAI-compatible endpoints out of the box:

  • vLLM: /v1/chat/completions, /v1/completions, /v1/embeddings
  • Ollama: same three endpoints, easier setup
  • LiteLLM proxy: aggregates many backends behind one OpenAI API
  • TGI: OpenAI messages endpoint in recent versions
  • SGLang: OpenAI-compatible mode

Launching

vLLM is the most common production choice:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --api-key sk-your-secret-key \
  --served-model-name gpt-4o-mini

The --served-model-name flag aliases the model to whatever name your client code expects. If your code requests model="gpt-4o-mini", serve the local model under that alias and no rewrite is needed.
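You can confirm the alias took effect by listing the served models. A minimal stdlib sketch, assuming the vLLM server launched above is reachable at localhost:8000 with the same API key; with the alias in place, the response should list gpt-4o-mini rather than the Hugging Face model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # the vLLM server launched above
API_KEY = "sk-your-secret-key"

def model_ids(payload: dict) -> list:
    """Pull the model IDs out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_served_models() -> list:
    """Ask the server which model names it is serving."""
    req = urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return model_ids(json.load(resp))
```

Calling list_served_models() against the server above should return ["gpt-4o-mini"].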

Client Code

Python example:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-server.gigagpu.com/v1",
    api_key="sk-your-secret-key",
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

The same goes for Node, Go, and Ruby: the official OpenAI SDKs all accept a base URL parameter. Existing tooling (LangChain, Instructor, LlamaIndex) works without changes.

Gotchas

Three endpoint gotchas catch most teams:

  • Streaming: vLLM supports SSE streaming, but behind some reverse proxies you must disable response buffering so chunks are flushed as they arrive (in nginx, proxy_buffering off). See the nginx guide linked below.
  • Function calling: vLLM supports tool use but format quirks vary by model. Test with your target model specifically.
  • Embedding endpoints: not every engine implements /v1/embeddings. If your app uses both chat and embeddings, either run two engines or use a proxy like LiteLLM.
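The streaming gotcha is easiest to check from client code. A sketch assuming the vLLM server from the launch section at localhost:8000; the network call is guarded behind a RUN_STREAM_DEMO environment variable (a convention used here, not part of any SDK) so the helper can be exercised on its own:

```python
import os

def join_deltas(chunks) -> str:
    """Concatenate the content deltas from a chat-completions stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

if os.environ.get("RUN_STREAM_DEMO"):
    # Requires the openai package and a running server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="sk-your-secret-key",
    )
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    print(join_deltas(stream))
```

If chunks arrive all at once instead of incrementally, suspect proxy buffering rather than the engine.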

Feature support at a glance:

  Feature                vLLM                    Ollama
  Chat completions       Yes                     Yes
  Completions (legacy)   Yes                     Yes
  Embeddings             Yes                     Yes
  Function calling       Yes (model-dependent)   Yes (model-dependent)
  Streaming              Yes                     Yes
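When the engine does expose /v1/embeddings, the call shape matches OpenAI's. A hedged sketch, again guarded behind an environment variable; "your-embedding-model" is a placeholder for whatever served name your engine uses, and the cosine helper is plain stdlib:

```python
import math
import os

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

if os.environ.get("RUN_EMBED_DEMO"):
    # Requires the openai package and an engine serving /v1/embeddings.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="sk-your-secret-key",
    )
    resp = client.embeddings.create(
        model="your-embedding-model",   # placeholder served name
        input=["first sentence", "second sentence"],
    )
    vecs = [d.embedding for d in resp.data]
    print(cosine(vecs[0], vecs[1]))
```

If your chat engine lacks embeddings, the same client code can point at a second engine or a LiteLLM proxy without changing anything but the base URL.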

Drop-In OpenAI Replacement on Your Own Server

We deploy vLLM or Ollama pre-configured on UK dedicated hosting.

Browse GPU Servers

See vLLM behind nginx with auth and Ollama behind Cloudflare Tunnel for production deployment patterns.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, and 1Gbps networking in a UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
