What You’ll Build
In 30 minutes, you will have a production code completion API that serves inline suggestions, function generation, code explanation, and refactoring assistance across 20+ programming languages. By running models like DeepSeek-Coder or CodeLlama through vLLM on a dedicated GPU server, the API delivers sub-200ms completions for a team of 50+ developers at zero per-seat cost, and your proprietary codebase never leaves your infrastructure.
GitHub Copilot costs $19-$39 per user per month, and enterprise plans send code context to external servers. For a 100-developer team, that is $23,000-$47,000 annually with no control over where your code is processed. Self-hosted code completion on open-source models delivers comparable suggestion quality with complete code privacy and no per-developer licensing.
Architecture Overview
The API serves code completions through an OpenAI-compatible endpoint, so editor extensions like Continue.dev and Cody work by changing the API base URL. vLLM handles efficient batching of concurrent requests from multiple developers, using PagedAttention to maintain per-user context without wasting VRAM. The model receives code context (surrounding lines, file imports, project structure hints) and returns completion suggestions.
Three endpoint patterns serve different IDE features: fill-in-the-middle (FIM) for inline suggestions, chat completions for code explanation and generation, and edit completions for refactoring. The API layer adds authentication per developer, usage tracking, and optional caching of common completions to reduce GPU load during peak hours.
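The optional caching layer can be sketched as a small in-memory LRU cache keyed by a hash of the completion request. This is an illustrative sketch, not part of vLLM; the class name `CompletionCache` is hypothetical, and a production deployment might use Redis with a short TTL so hot prompts are shared across API workers.

```python
import hashlib
from collections import OrderedDict

class CompletionCache:
    """LRU cache for completions, keyed by a hash of (prefix, suffix, language).

    Illustrative sketch: check the cache before calling the model, and store
    the model's response on a miss to reduce GPU load during peak hours.
    """
    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def _key(prefix, suffix, language):
        # \x00 separators prevent collisions between field boundaries
        raw = f"{language}\x00{prefix}\x00{suffix}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, prefix, suffix, language):
        key = self._key(prefix, suffix, language)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, prefix, suffix, language, completion):
        key = self._key(prefix, suffix, language)
        self._store[key] = completion
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

On each request, the API layer tries `cache.get(...)` first and only forwards a miss to the GPU backend; identical boilerplate completions (imports, common loop headers) then cost nothing after the first hit.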
GPU Requirements
| Team Size | Recommended GPU | VRAM | Latency (p95) |
|---|---|---|---|
| Up to 20 devs | RTX 5090 | 32 GB | ~150ms |
| 20 – 80 devs | RTX 6000 Pro | 40 GB | ~120ms |
| 80+ devs | RTX 6000 Pro 96 GB | 96 GB | ~100ms |
Code completion models like DeepSeek-Coder-33B hit the sweet spot between quality and speed. The 6.7B variant runs on smaller GPUs with excellent latency for inline suggestions. Larger models improve complex generation tasks like writing entire functions from docstrings. See our self-hosted LLM guide for code model benchmarks.
Step-by-Step Build
Deploy a code model through vLLM and configure it for fill-in-the-middle completions alongside standard chat completions.
```shell
# Launch vLLM with a code-specialised model
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90
```
```python
# Fill-in-the-middle completion for inline suggestions
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")

def get_completion(prefix, suffix, language="python"):
    # DeepSeek-Coder's FIM special tokens use fullwidth bars (｜), not ASCII
    # pipes. Note that FIM was trained into the base checkpoints, so a -base
    # variant generally gives better inline results than -instruct.
    response = client.completions.create(
        model="deepseek-ai/deepseek-coder-33b-instruct",
        prompt=f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>",
        max_tokens=256,
        temperature=0.2,
        stop=["\n\n", "<｜fim▁end｜>"],
    )
    return response.choices[0].text
```
```python
# Code generation via chat
def generate_code(instruction, context=""):
    response = client.chat.completions.create(
        model="deepseek-ai/deepseek-coder-33b-instruct",
        messages=[
            {"role": "system", "content": "You are a senior developer."},
            {"role": "user", "content": f"Context:\n{context}\n\n{instruction}"},
        ],
        max_tokens=1024,
        temperature=0.3,
    )
    return response.choices[0].message.content
```
Connect the API to your IDE using Continue.dev (VS Code) or any OpenAI-compatible extension. Configure the extension to point at your GPU server’s URL with the appropriate model name. For repository-aware completions, build a retrieval layer that fetches relevant code snippets from your codebase. See production setup for latency optimisation.
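A minimal Continue.dev configuration might look like the fragment below. Field names follow Continue's `config.json` schema at the time of writing (check their docs, as the schema evolves), and `your-gpu-server` is a placeholder for your actual host.

```json
{
  "models": [
    {
      "title": "DeepSeek Coder (self-hosted)",
      "provider": "openai",
      "model": "deepseek-ai/deepseek-coder-33b-instruct",
      "apiBase": "http://your-gpu-server:8000/v1",
      "apiKey": "any"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder FIM",
    "provider": "openai",
    "model": "deepseek-ai/deepseek-coder-33b-instruct",
    "apiBase": "http://your-gpu-server:8000/v1",
    "apiKey": "any"
  }
}
```

The `provider: "openai"` setting is what lets the extension talk to vLLM's OpenAI-compatible endpoint; only the base URL and model name differ from a stock OpenAI setup.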
IDE Integration and Workflow
Code completion works best when developers barely notice the infrastructure. Configure your IDE extension to trigger completions on pause (500ms debounce), accept with Tab, and dismiss with Escape. The low latency of a self-hosted model means suggestions appear as fast as or faster than cloud-based alternatives.
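The 500ms debounce described above is normally implemented inside the editor extension, but the logic can be sketched in Python (illustrative only; the `Debouncer` class is hypothetical):

```python
import threading

class Debouncer:
    """Run a callback only after `delay` seconds of inactivity.

    Each call cancels the pending timer, so a burst of keystrokes results
    in a single completion request once the developer pauses typing.
    """
    def __init__(self, delay=0.5):
        self.delay = delay
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self, callback, *args):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a new keystroke arrived: reset the clock
            self._timer = threading.Timer(self.delay, callback, args)
            self._timer.daemon = True
            self._timer.start()
```

Without debouncing, every keystroke would fire a GPU request; with it, only the final cursor position after a pause reaches the completion endpoint.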
Track acceptance rates per developer and per language to gauge model effectiveness. If Python suggestions are accepted 40% of the time but Go suggestions only 15%, consider deploying a specialised Go model or adjusting prompt templates. Build an AI chatbot alongside the completion API for longer code discussions and architecture questions.
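Acceptance tracking needs only a pair of counters per (developer, language) key. A minimal sketch (the `AcceptanceTracker` class is illustrative, not part of any library):

```python
from collections import defaultdict

class AcceptanceTracker:
    """Track suggestion acceptance rates per developer and per language."""
    def __init__(self):
        self._shown = defaultdict(int)
        self._accepted = defaultdict(int)

    def record(self, developer, language, accepted):
        key = (developer, language)
        self._shown[key] += 1
        if accepted:
            self._accepted[key] += 1

    def rate_by_language(self, language):
        shown = sum(n for (_, lang), n in self._shown.items() if lang == language)
        accepted = sum(n for (_, lang), n in self._accepted.items() if lang == language)
        return accepted / shown if shown else 0.0
```

Feed `record(...)` from the extension's accept/dismiss events and review `rate_by_language(...)` weekly; a persistently low rate for one language is the signal to swap in a specialised model or adjust the prompt template.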
Deploy Your Code Completion API
A self-hosted code completion API accelerates your entire engineering team without per-seat licensing or code leaving your network. Power IDE extensions, code review tools, and documentation generators from a single GPU endpoint. Launch on GigaGPU dedicated GPU hosting and boost developer productivity. Browse more API use cases and tutorials in our library.