
OpenAI API Compatibility: vLLM as Drop-In Replacement

How to use vLLM as a drop-in replacement for the OpenAI API, covering endpoint compatibility, SDK configuration, chat completions, embeddings, and migration from OpenAI to self-hosted inference.

You will configure vLLM to serve an OpenAI-compatible API so that any application built against the OpenAI SDK works without code changes. By the end of this guide, your existing OpenAI integrations will point at a self-hosted model on your own dedicated GPU server with zero per-token fees.

Supported Endpoints

vLLM implements the core OpenAI API endpoints that production applications actually use. The compatibility layer handles request parsing, response formatting, and streaming identically to the OpenAI service.

Endpoint               OpenAI   vLLM   Notes
/v1/chat/completions   Yes      Yes    Full streaming support
/v1/completions        Yes      Yes    Legacy completions API
/v1/models             Yes      Yes    Lists loaded models
/v1/embeddings         Yes      Yes    With embedding models
/v1/audio              Yes      No     Not applicable
/v1/images             Yes      No     Not applicable

Starting vLLM with OpenAI Compatibility

Launch vLLM with the OpenAI-compatible server. This is the default mode — no special flags required beyond the model name.

# Start vLLM with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# Verify the server is running
curl http://localhost:8000/v1/models

The server responds on the same path structure as the OpenAI API. Any client that can hit https://api.openai.com/v1/ can hit http://your-server:8000/v1/ instead. For a full production setup, see our vLLM production guide.
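That parity comes down to the wire format: the path and JSON body are the same in both cases, and only the base URL changes. A minimal sketch that builds such a request by hand, without the SDK (the server name is illustrative):

```python
import json

def chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build the URL, body, and headers for a chat completion request.

    The same function works for api.openai.com and for vLLM, because
    the path and payload shape are identical.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer not-needed",  # vLLM ignores this by default
    }
    return url, body, headers

url, body, headers = chat_request(
    "http://your-server:8000/v1",
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
```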

Configuring the OpenAI SDK

The official OpenAI Python SDK accepts a base_url parameter. Point it at your vLLM instance and use any string as the API key — vLLM does not enforce authentication by default.

from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpu-server:8000/v1",
    api_key="not-needed"
)

# Chat completions -- identical to OpenAI usage
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GPU memory hierarchy in two sentences."}
    ],
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)

# Streaming -- also identical
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about CUDA cores."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

For the Node.js equivalent, see the Node.js SDK guide. For the full Python walkthrough, check the Python SDK guide.

Migrating from OpenAI to Self-Hosted

The migration path for existing applications is deliberately simple. Change two variables and your code runs against your own hardware:

# Before (OpenAI hosted)
client = OpenAI()  # Uses OPENAI_API_KEY and api.openai.com

# After (self-hosted vLLM)
client = OpenAI(
    base_url="http://your-gpu-server:8000/v1",
    api_key="not-needed"
)

# Or use environment variables for zero code changes
# export OPENAI_BASE_URL=http://your-gpu-server:8000/v1
# export OPENAI_API_KEY=not-needed

Applications using LangChain or LlamaIndex also support this base URL override, making the migration seamless across your entire stack.

Compatibility Gotchas

vLLM’s compatibility is excellent but not absolute. Watch for these differences when migrating:

  • Model names must match exactly. OpenAI uses gpt-4; vLLM uses the Hugging Face model ID like meta-llama/Llama-3.1-8B-Instruct.
  • Function calling works but depends on the model. Models trained for tool use (Llama 3.1, Mistral) handle it well. Others may produce malformed JSON.
  • Token limits differ. OpenAI models have fixed context windows. Self-hosted models have limits set by --max-model-len.
  • Rate limiting is your responsibility. vLLM does not enforce rate limits. Add them at the API gateway layer.
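On that last point, a minimal client-side token bucket illustrates the idea; in production you would typically enforce limits at the gateway (nginx, Kong, or similar) rather than in application code. All names here are illustrative:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter sketch."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # ~5 requests/s, bursts of 10
```

Each incoming request would call bucket.allow() and be rejected with HTTP 429 when it returns False.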

When Self-Hosting Makes Sense

Self-hosting via vLLM’s OpenAI-compatible API pays off when you process enough tokens to offset server costs, need data to stay on your own infrastructure, or require custom models. The self-hosting guide covers the cost calculation in detail. For choosing between inference engines, see vLLM vs Ollama. Browse available configurations on our tutorials page.

Replace OpenAI with Self-Hosted Inference

Run vLLM on dedicated GPU servers with full OpenAI API compatibility. Zero per-token fees, complete data control, same SDK.

Browse GPU Servers
