
Build Chat Completion API (OpenAI-Compatible) on GPU

Build an OpenAI-compatible chat completion API on a dedicated GPU server. Serve multi-turn conversations with streaming, function calling, and system prompts using open-source models — a drop-in replacement for the OpenAI API at zero per-token cost.

What You’ll Build

In 20 minutes, you will have an OpenAI-compatible chat completion API that handles multi-turn conversations with streaming responses, system prompts, function calling, and JSON mode. Running open-source models through vLLM on a dedicated GPU server, your API is a drop-in replacement for the OpenAI API — existing applications work by changing one line: the base URL. Zero per-token cost, no rate limits, complete data privacy.

OpenAI API costs accumulate fast at scale. A product serving 10,000 users making 20 requests daily at 1,000 tokens each runs $2,000-$6,000 per month on GPT-4-class models. Self-hosted OpenAI-compatible endpoints with 70B open-source models deliver comparable quality for a fixed monthly GPU cost, and every conversation stays on your infrastructure.

Architecture Overview

vLLM serves an OpenAI-compatible API out of the box. The server exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints that match the OpenAI specification. Streaming uses server-sent events identical to the OpenAI format. Function calling works through the model’s tool-use training, and JSON mode constrains output to valid JSON structures.
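As a sketch of how JSON mode looks from the client side (the model name and prompt are illustrative, and the server is assumed to be vLLM on localhost:8000), the only change from a normal chat request is the response_format field:

```python
# JSON mode: vLLM honours OpenAI's response_format field and constrains
# decoding to valid JSON. Build the request kwargs once, reuse per call.
def build_json_request(model: str, user_prompt: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Reply with a JSON object only."},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "max_tokens": 200,
    }

kwargs = build_json_request(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    "Extract the city and country from: 'I live in Lyon, France.'",
)
print(kwargs["response_format"])  # {'type': 'json_object'}

# With a live server, pass the kwargs straight to the OpenAI SDK:
#   from openai import OpenAI
#   import json
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")
#   reply = client.chat.completions.create(**kwargs)
#   data = json.loads(reply.choices[0].message.content)
```

Because the output is guaranteed to parse, the json.loads call at the end never needs a retry loop for malformed JSON.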

The API layer adds authentication, rate limiting, and usage tracking on top of vLLM. A reverse proxy handles TLS termination and load balancing across multiple model instances if needed. The same server can host multiple models simultaneously, routing requests based on the model parameter in each request.
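The gatekeeping the API layer adds can be as small as an API-key check plus a per-key token bucket. A minimal stdlib sketch (key names and limits are illustrative; in production this logic would live in the reverse proxy or an API middleware, not a script):

```python
import time

API_KEYS = {"sk-team-alpha": 60}  # key -> allowed requests per minute (illustrative)

class TokenBucket:
    """Continuously refilling rate limiter; one bucket per API key."""
    def __init__(self, rate_per_min: float):
        self.capacity = rate_per_min
        self.tokens = rate_per_min
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {key: TokenBucket(rpm) for key, rpm in API_KEYS.items()}

def check_request(api_key: str) -> int:
    """Return an HTTP status: 401 for an unknown key, 429 over limit, 200 OK."""
    if api_key not in buckets:
        return 401
    return 200 if buckets[api_key].allow() else 429

print(check_request("sk-team-alpha"))  # 200
print(check_request("bad-key"))        # 401
```

Usage tracking follows the same pattern: increment a per-key counter with the usage.total_tokens field vLLM returns in each response.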

GPU Requirements

| Model Size | Recommended GPU | VRAM | Tokens/sec |
|---|---|---|---|
| 8B (Llama 3 8B) | RTX 5090 | 24 GB | ~120 tok/s |
| 70B (Llama 3 70B) | RTX 6000 Pro | 40 GB | ~40 tok/s |
| 70B + concurrent users | RTX 6000 Pro 96 GB | 80 GB | ~60 tok/s |

vLLM uses PagedAttention for efficient VRAM management, handling dozens of concurrent requests on a single GPU. The 8B models serve fast responses ideal for chatbots and simple tasks. The 70B models match GPT-4-class reasoning for complex tasks. See our self-hosted LLM guide for model selection.
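PagedAttention matters because under concurrency the KV cache, not the model weights, dominates VRAM. A rough per-token estimate for Llama 3 8B, using its published architecture numbers (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    # 2x for the K and V tensors stored at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8,
                                     head_dim=128, dtype_bytes=2)
print(per_token)                       # 131072 bytes = 128 KiB per token
# 50 concurrent chats at 4,096 tokens of context each:
print(50 * 4096 * per_token / 2**30)   # 25.0 GiB of KV cache alone
```

Naive contiguous allocation would reserve that worst-case block per request up front; PagedAttention allocates it in small pages on demand, which is what lets dozens of requests share one GPU.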

Step-by-Step Build

Deploy vLLM on your GPU server — it provides the OpenAI-compatible API with no additional code needed. Add authentication and monitoring on top.

# Launch vLLM with OpenAI-compatible API
# Single command — no application code required
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json

# Test with curl — identical to OpenAI API format
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing briefly."}
    ],
    "stream": true,
    "max_tokens": 500
  }'

# Python client — works with OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    # delta.content is None on role and stop chunks, so guard before printing
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Existing applications using the OpenAI Python SDK, LangChain, LlamaIndex, or any OpenAI-compatible client need only a base URL change. See vLLM production setup for adding Nginx reverse proxy, TLS, and API key authentication.

Multi-Model Hosting

Serve multiple models from the same GPU server for different use cases: a fast 8B model for simple chat, a 70B model for complex reasoning, and a code-specialised model for programming tasks. A single vLLM instance serves one model, so run one instance per model on separate ports and route requests at the proxy layer based on the model parameter in each request; clients then select the right model per task.
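A sketch of the routing table such a proxy layer needs, assuming one vLLM instance per model on its own port (the ports and model names here are illustrative):

```python
# Map the request's "model" field to the backend instance serving it.
MODEL_BACKENDS = {
    "meta-llama/Meta-Llama-3-8B-Instruct":  "http://127.0.0.1:8001/v1",
    "meta-llama/Meta-Llama-3-70B-Instruct": "http://127.0.0.1:8002/v1",
    "codellama/CodeLlama-34b-Instruct-hf":  "http://127.0.0.1:8003/v1",
}

def route(request_body: dict) -> str:
    """Pick the upstream base URL for a chat completion request."""
    model = request_body.get("model")
    try:
        return MODEL_BACKENDS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}")

print(route({"model": "meta-llama/Meta-Llama-3-8B-Instruct"}))
```

The same mapping can be expressed as Nginx location rules or a few lines of middleware; the key design point is that routing keys off the model field, so clients never need to know which port serves which model.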

Build an AI chatbot frontend that connects to your OpenAI-compatible endpoint. Add conversation memory, user management, and tool integrations on top of the base chat API. The OpenAI-compatible format means the entire ecosystem of chat UI libraries and frameworks works out of the box with your self-hosted backend.
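Conversation memory on top of a stateless chat API is just the messages list resent each turn, trimmed so it fits the context window. A minimal sketch (the 4-characters-per-token estimate is a crude heuristic, not a real tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):          # walk newest-first
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.insert(0, msg)
        budget -= cost
    return system + kept

history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 40},        # ~10 tokens
]
print(len(trim_history(history, max_tokens=120)))  # 3: system + last exchange
```

Dropping oldest turns first while pinning the system prompt is the simplest policy; production chatbots often add summarisation of the evicted turns instead of discarding them outright.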

Deploy Your Chat API

An OpenAI-compatible chat API on your own GPU is the foundation for every AI-powered product — chatbots, assistants, agents, and automation pipelines all start with a reliable completion endpoint. Launch on GigaGPU dedicated GPU hosting and replace your OpenAI dependency today. Browse more API use cases in our library.
