Almost every LLM client library speaks the OpenAI API. Self-hosting an OpenAI-compatible endpoint on dedicated GPU hosting means existing code keeps working – you just change the base URL and API key. Here is how to stand it up.
Compatible Engines
These all expose OpenAI-compatible endpoints out of the box:
- vLLM: v1/chat/completions, v1/completions, v1/embeddings
- Ollama: the same three endpoints, with easier setup
- LiteLLM proxy: aggregates many backends behind one OpenAI API
- TGI: OpenAI messages endpoint in recent versions
- SGLang: OpenAI-compatible mode
Launching
vLLM is the most common production choice:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --api-key sk-your-secret-key \
  --served-model-name gpt-4o-mini
```
The `--served-model-name` flag lets you alias the model to whatever name your client code expects. If your code requests model="gpt-4o-mini", serve the model under that alias and no client changes are needed.
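You can confirm which names the server will accept by querying /v1/models, which every OpenAI-compatible engine exposes. A minimal stdlib sketch, assuming the placeholder host and key from the examples in this article:

```python
import json
import urllib.request

def extract_model_ids(payload: dict) -> list:
    # The /v1/models response wraps models in a "data" list of {"id": ...}
    # objects; the "id" values are the names accepted in "model" fields.
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str, api_key: str) -> list:
    # GET {base_url}/models with the bearer token the server was started with.
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_model_ids(json.load(resp))

if __name__ == "__main__":
    # Placeholder endpoint; substitute your own server and key.
    print(list_models("https://your-server.gigagpu.com/v1", "sk-your-secret-key"))
```

If the alias took effect, "gpt-4o-mini" appears in the returned list.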
Client Code
Python example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-server.gigagpu.com/v1",
    api_key="sk-your-secret-key",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```
The same pattern works in Node, Go, and Ruby: each official OpenAI SDK accepts a base URL parameter. Tooling built on the SDK (LangChain, Instructor, LlamaIndex) works without changes.
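Streaming works through the same client by passing stream=True, which switches the response to server-sent events so tokens arrive incrementally. A sketch, reusing the placeholder endpoint and key from above:

```python
def collect_deltas(chunks) -> str:
    # Helper: assemble the full reply from streamed chunks. Each chunk's
    # choices[0].delta.content holds an incremental text piece, or None
    # for role-only and final chunks, which are skipped.
    return "".join(
        c.choices[0].delta.content
        for c in chunks
        if c.choices[0].delta.content
    )

if __name__ == "__main__":
    # Needs the `openai` package and a reachable server; URL and key are
    # the placeholders from the example above.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-server.gigagpu.com/v1",
        api_key="sk-your-secret-key",
    )
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    for chunk in stream:
        piece = chunk.choices[0].delta.content
        if piece:
            print(piece, end="", flush=True)  # print tokens as they arrive
```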
Gotchas
Three endpoint gotchas catch most teams:
- Streaming: vLLM supports SSE streaming, but some reverse proxies buffer SSE responses until the request completes; in nginx, set `proxy_buffering off;` for the streaming location.
- Function calling: vLLM supports tool use but format quirks vary by model. Test with your target model specifically.
- Embedding endpoints: not every engine implements /v1/embeddings. If your app uses both chat and embeddings, either run two engines or put a proxy like LiteLLM in front.
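For the function-calling gotcha, it helps to see the request shape you are testing. A sketch of an OpenAI-style tools payload; the `get_weather` tool is hypothetical and only for illustration, and whether the model returns well-formed tool_calls is model-dependent:

```python
import json

def build_tool_request(model: str, user_msg: str) -> dict:
    # OpenAI-style tool definition: the server forwards this JSON Schema to
    # the model, which may respond with a tool_calls entry instead of text.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for illustration
                    "description": "Look up current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

if __name__ == "__main__":
    # POST this body to /v1/chat/completions, then inspect whether the
    # response message carries tool_calls with valid JSON arguments.
    print(json.dumps(build_tool_request("gpt-4o-mini", "Weather in Leeds?"), indent=2))
```

Send this against your target model and check that the arguments parse as JSON before wiring it into production.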
| Feature | vLLM | Ollama |
|---|---|---|
| Chat completions | Yes | Yes |
| Completions (legacy) | Yes | Yes |
| Embeddings | Yes | Yes |
| Function calling | Yes (model-dependent) | Yes (model-dependent) |
| Streaming | Yes | Yes |
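If your engine does implement /v1/embeddings, the call mirrors chat completions. A stdlib sketch, again assuming the placeholder endpoint and key from earlier:

```python
import json
import urllib.request

def extract_embeddings(payload: dict) -> list:
    # /v1/embeddings responses carry vectors under data[i].embedding,
    # ordered to match the input list.
    return [item["embedding"] for item in payload["data"]]

def embed(base_url: str, api_key: str, model: str, texts: list) -> list:
    # POST {"model": ..., "input": [...]} to /v1/embeddings.
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return extract_embeddings(json.load(resp))

if __name__ == "__main__":
    # Placeholder endpoint and model name; substitute your own.
    vectors = embed(
        "https://your-server.gigagpu.com/v1",
        "sk-your-secret-key",
        "gpt-4o-mini",
        ["Hello", "world"],
    )
    print(len(vectors), len(vectors[0]))
```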
Drop-In OpenAI Replacement on Your Own Server
We deploy vLLM or Ollama pre-configured on UK dedicated hosting.
Browse GPU Servers. See vLLM behind nginx with auth and Ollama behind Cloudflare Tunnel for production deployment patterns.