
Migrate from OpenAI to Self-Hosted: Function Calling Guide

Replace OpenAI's function calling with self-hosted alternatives using structured output models, maintaining tool-use capabilities while eliminating API dependency.

Your Agentic Workflow Shouldn’t Depend on Someone Else’s Uptime

At 3:47 AM on a Tuesday, your on-call engineer got paged. The AI agent that routes customer refund requests — the one that uses OpenAI function calling to query your database, check inventory, and trigger refunds through your internal API — had stopped making tool calls. OpenAI’s function calling endpoint was returning malformed JSON. Not an outage, technically. Just broken enough to halt your entire automated workflow while the model hallucinated pseudo-JSON that your parser choked on. For three hours, refund requests piled up in a dead queue.

Function calling is the most integration-critical feature in any AI pipeline. When it fails, your entire agentic system fails. Self-hosting your function-calling model on a dedicated GPU eliminates this single point of failure. Here’s the migration path for teams that have built tool-use systems on OpenAI.

Understanding What You’re Replacing

OpenAI’s function calling works by fine-tuning GPT models to emit structured JSON matching a provided schema. The good news: several open-source models now match or exceed this capability, often with more reliable structured output.

| Capability | OpenAI Implementation | Self-Hosted Equivalent |
|---|---|---|
| Function definitions | tools parameter in API | Same schema, processed by model |
| JSON mode | response_format: json_object | vLLM guided decoding / Outlines |
| Parallel function calls | GPT-4o returns multiple calls | Supported by Hermes 2 Pro and Llama 3.1 |
| Streaming tool calls | Supported via SSE | Supported in vLLM |
| Forced function call | tool_choice: required | Constrained decoding guarantees valid output |

The key models to evaluate for function calling are NousResearch/Hermes-2-Pro-Llama-3-8B for lightweight deployments, meta-llama/Llama-3.1-70B-Instruct for production-grade reliability, and Qwen/Qwen2.5-72B-Instruct for complex multi-tool orchestration. All three handle the OpenAI-style tool-use format when served through vLLM.
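To make the equivalence concrete, here is a minimal sketch of an OpenAI-style tool definition. The function and its fields are hypothetical, but the same dict can be sent unchanged to api.openai.com or to a vLLM endpoint serving one of the models above:

# A hypothetical tool definition in the OpenAI tools schema.
# vLLM's OpenAI-compatible server accepts the same structure,
# so existing definitions migrate without modification.
refund_tool = {
    "type": "function",
    "function": {
        "name": "issue_refund",
        "description": "Issue a refund for a given order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "Internal order ID"},
                "amount": {"type": "number", "description": "Refund amount in GBP"},
                "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
            },
            "required": ["order_id", "amount"],
        },
    },
}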

Step-by-Step Migration

Step 1: Inventory your tool definitions. Export every function definition from your OpenAI integration. Count the total number of tools, their parameter complexity, and whether you use parallel tool calls. This determines which model you need.
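A short inventory script makes this concrete. The sketch below assumes you have exported your definitions to a JSON file (the path is illustrative) in the standard OpenAI schema:

import json

# Illustrative path: wherever your exported OpenAI tool definitions live.
with open("tool_definitions.json") as f:
    tools = json.load(f)

print(f"Total tools: {len(tools)}")
for tool in tools:
    fn = tool["function"]
    params = fn.get("parameters", {}).get("properties", {})
    required = fn.get("parameters", {}).get("required", [])
    print(f"  {fn['name']}: {len(params)} params, {len(required)} required")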

Step 2: Provision and deploy. Spin up a GigaGPU dedicated server. For function calling with a 70B model, an RTX 6000 Pro 96 GB is the standard choice. Deploy with vLLM using the --enable-auto-tool-choice flag:

# FP8 quantization lets the 70B weights fit on a single 96 GB card (FP16 needs ~140 GB).
# llama3_json matches Llama 3.1's native tool-call format; the hermes parser
# expects Hermes/Qwen-style <tool_call> tags and will not parse Llama 3.1 output.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --quantization fp8 \
  --max-model-len 8192 \
  --port 8000

Step 3: Test tool-use fidelity. The most critical phase. Run your full suite of function-calling scenarios against the self-hosted model. Measure three things: schema adherence rate (should be >99%), correct function selection rate, and argument extraction accuracy.
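A minimal test harness might look like the sketch below. It assumes your scenarios are stored as dicts holding the conversation, the expected function, and the expected arguments (all names illustrative), and uses jsonschema to check adherence:

import json
from jsonschema import ValidationError, validate
from openai import OpenAI

# The standard OpenAI client, pointed at the self-hosted vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def run_fidelity_suite(test_cases, tools, schemas):
    """test_cases: dicts with 'messages', 'expected_fn', 'expected_args'."""
    adherent = selected = extracted = 0
    for case in test_cases:
        response = client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",
            messages=case["messages"],
            tools=tools,
        )
        calls = response.choices[0].message.tool_calls or []
        if not calls:
            continue  # model failed to call any tool
        call = calls[0]
        try:
            args = json.loads(call.function.arguments)
            validate(args, schemas[call.function.name])
            adherent += 1
        except (json.JSONDecodeError, KeyError, ValidationError):
            continue  # almost-valid JSON, hallucinated name, or schema violation
        if call.function.name == case["expected_fn"]:
            selected += 1
            if args == case["expected_args"]:
                extracted += 1
    n = len(test_cases)
    print(f"Schema adherence: {adherent / n:.1%} (target: >99%)")
    print(f"Function selection: {selected / n:.1%}")
    print(f"Argument extraction: {extracted / n:.1%}")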

Step 4: Add constrained decoding as a safety net. vLLM supports guided decoding through Outlines, which forces the model to emit valid JSON matching your schema. This eliminates the class of failures where the model produces almost-valid JSON:

from openai import OpenAI

# Same client library as before; only the base_url changes.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=messages,
    tools=tool_definitions,
    # vLLM extension: constrain generation to JSON matching the schema.
    extra_body={"guided_json": your_json_schema},
)

Step 5: Run shadow traffic. Mirror your production function-calling requests to both OpenAI and your self-hosted endpoint for 72 hours. Compare tool selection accuracy, argument correctness, and end-to-end latency.
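One way to wire this up, assuming both endpoints speak the OpenAI API (the model names and URLs below are illustrative):

import json
import time
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
local_client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def shadow_compare(messages, tools):
    """Send one production request to both endpoints and record the diff."""
    results = {}
    for name, client, model in [
        ("openai", openai_client, "gpt-4o"),
        ("self_hosted", local_client, "meta-llama/Llama-3.1-70B-Instruct"),
    ]:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        calls = resp.choices[0].message.tool_calls or []
        results[name] = {
            "latency_ms": (time.perf_counter() - start) * 1000,
            "tools_called": [c.function.name for c in calls],
            "arguments": [json.loads(c.function.arguments) for c in calls],
        }
    return results  # persist both sides and diff after the 72-hour window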

Handling the Edge Cases

Function calling migration has specific gotchas that don’t apply to simpler chat migrations:

  • Nested tool calls: If your agent chains multiple function calls (call function A, use result to call function B), test the full chain — not just individual calls.
  • Error recovery: When a tool returns an error, does the model gracefully retry or select an alternative? Test failure paths explicitly.
  • System prompt format: Some models expect tool definitions in the system message rather than as a separate parameter. vLLM handles this translation, but verify with your specific model.
  • Token budget: Tool definitions consume context window tokens. With 10+ complex tools, you may need models with 32K+ context. Llama 3.1 supports 128K natively.

For teams using frameworks like LangChain or CrewAI, the migration is simpler — these frameworks abstract the tool-calling mechanism, so switching the underlying model is often a configuration change. Ollama-hosted models work with LangChain’s tool-calling interface directly.
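For example, with LangChain's Ollama integration the switch is roughly this much code. The order-status tool is a hypothetical stand-in for your real tools:

from langchain_core.tools import tool
from langchain_ollama import ChatOllama

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order."""
    return "shipped"  # stand-in for your real lookup

# The only migration change: point at a locally served model.
llm = ChatOllama(model="llama3.1:70b")
llm_with_tools = llm.bind_tools([get_order_status])

response = llm_with_tools.invoke("Where is order 12345?")
print(response.tool_calls)  # e.g. [{'name': 'get_order_status', 'args': {...}, ...}]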

Cost and Reliability Comparison

| Metric | OpenAI Function Calling | Self-Hosted (Llama 3.1 70B) |
|---|---|---|
| Cost per 1K tool-use requests | ~$0.50-2.00 | ~$0 (server cost only) |
| JSON schema adherence | ~98-99% | ~99.5% (with constrained decoding) |
| Availability | ~99.5% (third-party dependent) | ~99.9% (hardware-only dependency) |
| Latency (first tool call) | ~600-900ms | ~200-400ms |
| Model version stability | Can change without notice | You control updates |

The reliability improvement alone justifies the migration for production agentic systems. Use the GPU vs API cost comparison to model your specific volume.

Securing Your Agentic Infrastructure

Self-hosted function calling isn’t just about cost — it’s about control. Your tool definitions, conversation histories, and the data flowing through your function arguments never leave your infrastructure. For teams in regulated industries, this is often the deciding factor.

Read the companion guides for migrating chatbot APIs and embeddings pipelines to complete your OpenAI exit. The OpenAI alternative page provides a full feature comparison, while our self-host LLM guide covers infrastructure fundamentals. For cost modelling, the LLM cost calculator accounts for tool-use workloads, and the breakeven analysis shows the full economics.

Build Agentic AI on Infrastructure You Own

Function calling belongs on dedicated hardware — zero rate limits, deterministic latency, and no third-party outage risk. GigaGPU delivers the GPU power your agents need.

Browse GPU Servers
