Your Agentic Workflow Shouldn’t Depend on Someone Else’s Uptime
At 3:47 AM on a Tuesday, your on-call engineer got paged. The AI agent that routes customer refund requests — the one that uses OpenAI function calling to query your database, check inventory, and trigger refunds through your internal API — had stopped making tool calls. OpenAI’s function calling endpoint was returning malformed JSON. Not an outage, technically. Just broken enough to halt your entire automated workflow while the model hallucinated pseudo-JSON that your parser choked on. For three hours, refund requests piled up in a dead queue.
Function calling is the most integration-critical feature in any AI pipeline. When it fails, your entire agentic system fails. Self-hosting your function-calling model on a dedicated GPU eliminates this single point of failure. Here’s the migration path for teams that have built tool-use systems on OpenAI.
Understanding What You’re Replacing
OpenAI’s function calling works by fine-tuning GPT models to emit structured JSON matching a provided schema. The good news: several open-source models now match or exceed this capability, often with more reliable structured output.
| Capability | OpenAI Implementation | Self-Hosted Equivalent |
|---|---|---|
| Function definitions | tools parameter in API | Same schema, processed by model |
| JSON mode | response_format: json_object | vLLM guided decoding / Outlines |
| Parallel function calls | GPT-4o returns multiple calls | Hermes 2 Pro, Llama 3.1 support this |
| Streaming tool calls | Supported via SSE | Supported in vLLM |
| Forced function call | tool_choice: required | Constrained decoding guarantees valid output |
The key models to evaluate for function calling are NousResearch/Hermes-2-Pro-Llama-3-8B for lightweight deployments, meta-llama/Llama-3.1-70B-Instruct for production-grade reliability, and Qwen/Qwen2.5-72B-Instruct for complex multi-tool orchestration. All three handle the OpenAI-style tool-use format when served through vLLM.
Step-by-Step Migration
Step 1: Inventory your tool definitions. Export every function definition from your OpenAI integration. Count the total number of tools, their parameter complexity, and whether you use parallel tool calls. This determines which model you need.
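A minimal sketch of that inventory step, assuming your exported definitions follow the OpenAI `tools` array format (the `inventory_tools` helper and the sample refund tool are illustrative, not part of any library):

```python
def inventory_tools(tool_definitions):
    """Summarize an exported OpenAI-style `tools` array: tool count,
    deepest parameter nesting, and total required fields."""
    def depth(schema, d=1):
        props = schema.get("properties", {})
        return max([depth(v, d + 1) for v in props.values()
                    if isinstance(v, dict)], default=d)

    return {
        "tool_count": len(tool_definitions),
        "max_param_depth": max(
            (depth(t["function"]["parameters"]) for t in tool_definitions),
            default=0),
        "required_fields": sum(
            len(t["function"]["parameters"].get("required", []))
            for t in tool_definitions),
    }

# Hypothetical exported definition for the refund workflow
tools = [{
    "type": "function",
    "function": {
        "name": "issue_refund",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number"},
            },
            "required": ["order_id", "amount"],
        },
    },
}]
print(inventory_tools(tools))
# → {'tool_count': 1, 'max_param_depth': 2, 'required_fields': 2}
```

Deeply nested parameters and large tool counts push you toward the 70B-class models; a handful of flat schemas is well within reach of an 8B model.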
Step 2: Provision and deploy. Spin up a GigaGPU dedicated server. For function calling with a 70B model, an RTX 6000 Pro 96 GB is the standard choice (a 70B model fits in 96 GB with FP8 or AWQ quantization; full FP16 weights need roughly 140 GB). Deploy with vLLM using the --enable-auto-tool-choice flag:
```shell
# Note: the `hermes` parser is for Hermes-2-Pro models; Llama 3.1 pairs
# with the `llama3_json` parser and its matching chat template.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template examples/tool_chat_template_llama3.1_json.jinja \
    --max-model-len 8192 \
    --port 8000
```
Step 3: Test tool-use fidelity. The most critical phase. Run your full suite of function-calling scenarios against the self-hosted model. Measure three things: schema adherence rate (should be >99%), correct function selection rate, and argument extraction accuracy.
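Schema adherence can be measured with a simple harness over logged outputs. This sketch uses a minimal structural check against the declared types (in production you would use a full validator such as the `jsonschema` package); the sample outputs are hypothetical:

```python
import json

def adheres(schema, args_json):
    """Parse the model's argument string and verify required keys
    exist with the declared JSON types."""
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    try:
        args = json.loads(args_json)
    except json.JSONDecodeError:
        return False
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            return False
        expected = type_map.get(props.get(key, {}).get("type", ""), object)
        if not isinstance(args[key], expected):
            return False
    return True

schema = {"type": "object",
          "properties": {"order_id": {"type": "string"},
                         "amount": {"type": "number"}},
          "required": ["order_id", "amount"]}

# Simulated tool-call argument strings from a test run
outputs = ['{"order_id": "A-1", "amount": 12.5}',  # valid
           '{"order_id": "A-2"}',                   # missing required field
           '{order_id: "A-3", amount: 4}']          # pseudo-JSON
rate = sum(adheres(schema, o) for o in outputs) / len(outputs)
print(f"schema adherence: {rate:.0%}")  # → schema adherence: 33%
```

Run the same harness across your full scenario suite; anything below the 99% bar is a signal to enable constrained decoding (next step) rather than to abandon the model.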
Step 4: Add constrained decoding as a safety net. vLLM supports guided decoding through Outlines, which forces the model to emit valid JSON matching your schema. This eliminates the class of failures where the model produces almost-valid JSON:
```python
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted vLLM endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama-70b",  # the name set via --served-model-name
    messages=messages,
    tools=tool_definitions,
    # vLLM extension: constrain output to valid JSON matching your schema
    extra_body={"guided_json": your_json_schema},
)
```
Step 5: Run shadow traffic. Mirror your production function-calling requests to both OpenAI and your self-hosted endpoint for 72 hours. Compare tool selection accuracy, argument correctness, and end-to-end latency.
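The comparison side of shadow traffic can be a small pure function over mirrored log entries. A sketch, with hypothetical logged tool calls (one from each endpoint for the same user request):

```python
import json

def compare_calls(reference_call, candidate_call):
    """Did both endpoints pick the same function, and do the
    extracted arguments match (ignoring key order)?"""
    same_function = reference_call["name"] == candidate_call["name"]
    same_args = (json.loads(reference_call["arguments"])
                 == json.loads(candidate_call["arguments"]))
    return same_function, same_args

# Mirrored log entries: OpenAI response vs. self-hosted response
ref = {"name": "issue_refund",
       "arguments": '{"order_id": "A-1", "amount": 12.5}'}
cand = {"name": "issue_refund",
        "arguments": '{"amount": 12.5, "order_id": "A-1"}'}
print(compare_calls(ref, cand))  # → (True, True)
```

Aggregate these booleans over the 72-hour window to get the tool-selection and argument-correctness rates; latency you can take straight from your request logs.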
Handling the Edge Cases
Function calling migration has specific gotchas that don’t apply to simpler chat migrations:
- Nested tool calls: If your agent chains multiple function calls (call function A, use result to call function B), test the full chain — not just individual calls.
- Error recovery: When a tool returns an error, does the model gracefully retry or select an alternative? Test failure paths explicitly.
- System prompt format: Some models expect tool definitions in the system message rather than as a separate parameter. vLLM handles this translation, but verify with your specific model.
- Token budget: Tool definitions consume context window tokens. With 10+ complex tools, you may need models with 32K+ context. Llama 3.1 supports 128K natively.
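For the token-budget point, a rough estimate is enough to decide whether you need a long-context model. This sketch uses a crude chars-per-token heuristic (an assumption, not a tokenizer; for exact counts, run the serialized definitions through your model's actual tokenizer):

```python
import json

def estimate_tool_tokens(tool_definitions, chars_per_token=4):
    """Approximate context cost of a tools array: serialized length
    divided by an average chars-per-token heuristic."""
    serialized = json.dumps(tool_definitions, separators=(",", ":"))
    return len(serialized) // chars_per_token

# Twelve hypothetical single-parameter tools
tools = [{"type": "function",
          "function": {"name": f"tool_{i}",
                       "parameters": {"type": "object",
                                      "properties": {"q": {"type": "string"}},
                                      "required": ["q"]}}}
         for i in range(12)]
budget = estimate_tool_tokens(tools)
print(f"~{budget} tokens of context consumed by 12 tool definitions")
```

Remember this cost is paid on every request; if the estimate plus your conversation history crowds the 8K window from the deployment step above, raise --max-model-len or move to a longer-context model.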
For teams using frameworks like LangChain or CrewAI, the migration is simpler — these frameworks abstract the tool-calling mechanism, so switching the underlying model is often a configuration change. Ollama-hosted models work with LangChain’s tool-calling interface directly.
Cost and Reliability Comparison
| Metric | OpenAI Function Calling | Self-Hosted (Llama 3.1 70B) |
|---|---|---|
| Cost per 1K tool-use requests | ~$0.50-2.00 | ~$0 marginal (flat server cost) |
| JSON schema adherence | ~98-99% | ~99.5% (with constrained decoding) |
| Availability | ~99.5% (third-party dependent) | ~99.9% (hardware-only dependency) |
| Latency (first tool call) | ~600-900ms | ~200-400ms |
| Model version stability | Can change without notice | You control updates |
The reliability improvement alone justifies the migration for production agentic systems. Use the GPU vs API cost comparison to model your specific volume.
Securing Your Agentic Infrastructure
Self-hosted function calling isn’t just about cost — it’s about control. Your tool definitions, conversation histories, and the data flowing through your function arguments never leave your infrastructure. For teams in regulated industries, this is often the deciding factor.
Read the companion guides for migrating chatbot APIs and embeddings pipelines to complete your OpenAI exit. The OpenAI alternative page provides a full feature comparison, while our self-host LLM guide covers infrastructure fundamentals. For cost modelling, the LLM cost calculator accounts for tool-use workloads, and the breakeven analysis shows the full economics.
Build Agentic AI on Infrastructure You Own
Function calling belongs on dedicated hardware — zero rate limits, deterministic latency, and no third-party outage risk. GigaGPU delivers the GPU power your agents need.
Browse GPU Servers

Filed under: Tutorials