You will configure vLLM to serve an OpenAI-compatible API so that any application built against the OpenAI SDK works without code changes. By the end of this guide, your existing OpenAI integrations will point at a self-hosted model on your own dedicated GPU server with zero per-token fees.
## Supported Endpoints
vLLM implements the core OpenAI API endpoints that production applications actually use. The compatibility layer handles request parsing, response formatting, and streaming identically to the OpenAI service.
| Endpoint | OpenAI | vLLM | Notes |
|---|---|---|---|
| /v1/chat/completions | Yes | Yes | Full streaming support |
| /v1/completions | Yes | Yes | Legacy completions API |
| /v1/models | Yes | Yes | Lists loaded models |
| /v1/embeddings | Yes | Yes | With embedding models |
| /v1/audio | Yes | No | Not implemented |
| /v1/images | Yes | No | Not implemented |
## Starting vLLM with OpenAI Compatibility
Launch vLLM with the OpenAI-compatible server. This is the default mode — no special flags required beyond the model name.
```bash
# Start vLLM with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Verify the server is running
curl http://localhost:8000/v1/models
```
The server responds on the same path structure as the OpenAI API. Any client that can hit https://api.openai.com/v1/ can hit http://your-server:8000/v1/ instead. For a full production setup, see our vLLM production guide.
## Configuring the OpenAI SDK
The official OpenAI Python SDK accepts a `base_url` parameter. Point it at your vLLM instance and use any string as the API key: vLLM does not enforce authentication by default (pass `--api-key` at launch if you want it to).
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpu-server:8000/v1",
    api_key="not-needed"
)

# Chat completions -- identical to OpenAI usage
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GPU memory hierarchy in two sentences."}
    ],
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)

# Streaming -- also identical
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about CUDA cores."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
For the Node.js equivalent, see the Node.js SDK guide. For the full Python walkthrough, check the Python SDK guide.
## Migrating from OpenAI to Self-Hosted
The migration path for existing applications is deliberately simple. Change two variables and your code runs against your own hardware:
```python
# Before (OpenAI hosted)
client = OpenAI()  # Uses OPENAI_API_KEY and api.openai.com

# After (self-hosted vLLM)
client = OpenAI(
    base_url="http://your-gpu-server:8000/v1",
    api_key="not-needed"
)

# Or use environment variables for zero code changes
# export OPENAI_BASE_URL=http://your-gpu-server:8000/v1
# export OPENAI_API_KEY=not-needed
```
LangChain and LlamaIndex support the same base URL override, so the migration stays seamless across your entire stack.
## Compatibility Gotchas
vLLM’s compatibility is excellent but not absolute. Watch for these differences when migrating:
- Model names must match exactly. OpenAI uses `gpt-4`; vLLM uses the Hugging Face model ID, like `meta-llama/Llama-3.1-8B-Instruct`.
- Function calling works but depends on the model. Models trained for tool use (Llama 3.1, Mistral) handle it well. Others may produce malformed JSON.
- Token limits differ. OpenAI models have fixed context windows. Self-hosted models have limits set by `--max-model-len`.
- Rate limiting is your responsibility. vLLM does not enforce rate limits. Add them at the API gateway layer.
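One way to soften the model-name mismatch during migration is a small alias table at the client boundary, so application code can keep passing OpenAI-style names. A sketch (the alias mapping is hypothetical and entirely yours to define):

```python
# Hypothetical aliases: map the OpenAI model names an application already
# uses onto the Hugging Face ID that vLLM actually serves.
MODEL_ALIASES = {
    "gpt-4": "meta-llama/Llama-3.1-8B-Instruct",
    "gpt-3.5-turbo": "meta-llama/Llama-3.1-8B-Instruct",
}

def resolve_model(name: str) -> str:
    """Translate an aliased name; pass native model IDs through unchanged."""
    return MODEL_ALIASES.get(name, name)
```

Call resolve_model wherever the request body is built, and both legacy names and native vLLM IDs keep working.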
## When Self-Hosting Makes Sense
Self-hosting via vLLM’s OpenAI-compatible API pays off when you process enough tokens to offset server costs, need data to stay on your own infrastructure, or require custom models. The self-hosting guide covers the cost calculation in detail. For choosing between inference engines, see vLLM vs Ollama. Browse available configurations on our tutorials page.
## Replace OpenAI with Self-Hosted Inference
Run vLLM on dedicated GPU servers with full OpenAI API compatibility. Zero per-token fees, complete data control, same SDK.
Browse GPU Servers