Most OpenAI-SDK code works unchanged against any backend that speaks the same HTTP protocol – vLLM, TGI, TensorRT-LLM and others all implement the relevant routes. An RTX 5060 Ti 16GB running vLLM via our dedicated GPU hosting is a drop-in replacement for api.openai.com on Llama-, Mistral- or Qwen-class workloads. This guide covers the swap, the gotchas, and the cost math.
Deploy vLLM
Run on the 5060 Ti with an FP8-native model and alias it to the OpenAI model name your code expects:
python -m vllm.entrypoints.openai.api_server \
--model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8 \
--served-model-name gpt-4o-mini \
--host 0.0.0.0 --port 8000 \
--api-key sk-your-secret \
--max-model-len 8192
Expected throughput on Blackwell 16 GB: ~112 tokens/s single-stream, ~1,800 t/s aggregate across 100 concurrent users. See the Llama 3 8B benchmark.
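Once the server is up, a quick request to /v1/models confirms the alias is visible to clients. A stdlib-only sketch – the URL and key are placeholders for your own deployment:

```python
import json
import urllib.request

def list_models(base_url: str, api_key: str) -> list:
    """Query /v1/models to confirm the served alias is live."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return [m["id"] for m in json.load(resp)["data"]]

# Against the server started above, the result should include "gpt-4o-mini":
# list_models("https://llm.your-domain.com/v1", "sk-your-secret")
```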
The Client Swap
Change two environment variables. Most SDKs read these automatically (the current openai Python SDK reads OPENAI_BASE_URL; older SDKs and some frameworks read OPENAI_API_BASE):
OPENAI_BASE_URL=https://llm.your-domain.com/v1
OPENAI_API_KEY=sk-your-secret
Or in Python explicitly:
from openai import OpenAI
client = OpenAI(
base_url="https://llm.your-domain.com/v1",
api_key="sk-your-secret",
)
resp = client.chat.completions.create(
model="gpt-4o-mini", # still the alias
messages=[{"role":"user","content":"hello"}],
)
print(resp.choices[0].message.content)
LangChain, LlamaIndex, Instructor, Marvin, DSPy, Haystack and Semantic Kernel all accept either the env vars or a base_url argument – no code changes beyond configuration.
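Streaming also carries over unchanged: with `"stream": true` the server sends the same server-sent-events wire format the SDK already parses. A stdlib-only sketch of what the SDK does for you, useful when debugging a proxy or load balancer in between:

```python
import json
import urllib.request

def stream_chat(base_url, api_key, model, messages):
    """POST a streaming chat completion and yield content deltas as they arrive."""
    payload = json.dumps(
        {"model": model, "messages": messages, "stream": True}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the SSE stream arrives line by line
            line = raw.decode("utf-8").strip()
            if not line.startswith("data: "):
                continue  # skip blank separator lines
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            if delta.get("content"):
                yield delta["content"]
```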
Compatibility Matrix
| OpenAI feature | vLLM on 5060 Ti | Notes |
|---|---|---|
| Chat completions | Full | Streaming supported |
| Text completions (legacy) | Full | /v1/completions |
| Embeddings | Full | Load a separate embedding model, e.g. BGE-M3, served on /v1/embeddings |
| Function / tool calling | Partial | Model-dependent; Llama 3.1 and Hermes work well via vLLM parser |
| JSON / structured output | Full | vLLM guided decoding with outlines or xgrammar |
| Vision inputs | Model-dependent | Needs a VLM such as LLaVA; will not work with a text-only model |
| Audio (Whisper) | Separate endpoint | Whisper.cpp or faster-whisper served independently |
| Assistants API (v2) | Not implemented | Use LangChain agents or OpenAI Agents SDK instead |
| Batch API | Not implemented | Roll your own queue |
| Files API | Not implemented | Upload to your own object store |
| Fine-tuning API | Not implemented | Use Unsloth/TRL locally |
| Moderation | Not implemented | Load a classifier model separately |
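The structured-output row is worth an example. vLLM's guided decoding accepts a JSON Schema alongside the standard chat payload; the sketch below uses the `guided_json` request field, a vLLM extension that is not part of the OpenAI spec (the exact field name can vary across vLLM versions – check your version's docs):

```python
import json
import urllib.request

def chat_json(base_url, api_key, model, prompt, schema):
    """Request a schema-constrained JSON reply via vLLM guided decoding."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "guided_json": schema,  # vLLM extension; sent via extra_body in the SDK
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        content = json.load(resp)["choices"][0]["message"]["content"]
    # With guided decoding active, the content is guaranteed parseable JSON.
    return json.loads(content)
```

With the official SDK the same field is passed as `extra_body={"guided_json": schema}` on `client.chat.completions.create`.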
Gotchas
- Model name – if your code hardcodes `gpt-4o-mini`, alias it with `--served-model-name`; otherwise the server returns a model-not-found error
- Context length – Llama 3.1 8B supports 128k context, but vLLM’s KV cache on 16 GB VRAM is finite; set `--max-model-len 8192` or `16384` to leave headroom
- Token counting – OpenAI’s `tiktoken` for GPT-4 gives different counts than Llama’s tokenizer; don’t compare bills like-for-like
- Rate limits – no hidden tier or TPM limits on your own box; add your own via nginx `limit_req` if needed
- Safety filtering – OpenAI adds a moderation layer; self-hosted does not, so add Llama Guard or a classifier in front
- Embeddings model – serve BGE-M3 or E5-large on a second port and set `--served-model-name text-embedding-3-small` for a drop-in replacement
Cost Comparison
| Scenario | OpenAI gpt-4o-mini | Self-hosted Llama 3.1 8B FP8 on 5060 Ti |
|---|---|---|
| 1 M tokens/day in + 1 M tokens/day out | ~£9/day | Flat monthly hosting |
| 10 M tokens/day | ~£90/day | Same flat fee |
| 100 M tokens/day | ~£900/day | Same flat fee (approaches card capacity) |
| Cold-start latency | 0 (shared) | 0 (always warm on dedicated card) |
| Data residency | US-default | UK dedicated |
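The per-day figures in the table scale linearly, so the break-even point is simple arithmetic. A sketch using the table’s ~£9/day per (1M in + 1M out) rate; the £270/month flat fee below is a hypothetical placeholder – substitute your actual hosting price:

```python
# Rate read off the table above: ~£9/day per (1M in + 1M out) tokens/day.
API_GBP_PER_UNIT = 9.0  # £ per day, per unit of (1M in + 1M out) daily traffic

def api_cost_per_day(units: float) -> float:
    """Daily API spend for `units` of (1M in + 1M out) tokens/day."""
    return units * API_GBP_PER_UNIT

def break_even_units(flat_monthly_gbp: float) -> float:
    """Daily traffic above which a flat monthly fee beats per-token pricing."""
    return flat_monthly_gbp / (API_GBP_PER_UNIT * 30)

# api_cost_per_day(10)    → 90.0  (£90/day, matching the 10M row)
# break_even_units(270)   → 1.0   (hypothetical £270/month flat fee pays for
#                                  itself at 1M in + 1M out per day)
```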
Self-Hosted OpenAI-Compatible API
Drop-in replacement on Blackwell 16 GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: vLLM setup, FP8 deployment, Llama 3 8B benchmark, Docker CUDA setup, first-day checklist.