Llama 3.3 was trained with function-calling support, but its native format differs from OpenAI's. On our dedicated GPU hosting, getting reliable tool use requires both the right vLLM flags and correctly formatted messages.
vLLM Flags
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-auto-tool-choice \
  --tool-call-parser llama3_json \
--chat-template /path/to/llama3_chat_template.jinja
The --tool-call-parser llama3_json flag tells vLLM how to extract tool calls from Llama's native JSON output. Without it, tool calls arrive as free text in the message content rather than in the tool_calls field.
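If the parser flag is missing (or parsing fails), the raw content typically contains Llama's native {"name": ..., "parameters": ...} JSON object. A minimal fallback sketch, assuming that format; the function name here is ours, not part of vLLM:

```python
import json

def parse_native_tool_call(text: str):
    """Fallback parser for Llama 3.3's native tool-call JSON.

    Assumes the message content is a bare {"name": ..., "parameters": ...}
    object (what you see without --tool-call-parser).
    Returns (name, args) or None if the text isn't a tool call.
    """
    try:
        call = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "name" in call:
        return call["name"], call.get("parameters", {})
    return None

raw = '{"name": "get_weather", "parameters": {"location": "London"}}'
name, args = parse_native_tool_call(raw)
```

Treat this as a safety net only; with the flag set, vLLM's own parser should handle extraction.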
Request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the name vLLM serves
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }],
    tool_choice="auto"
)
Parsing
Llama 3.3 emits tool calls as JSON, and with the parser flag set, vLLM converts them into the standard OpenAI structure:
import json

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments)
        result = execute(name, **args)  # execute() is your own dispatch function
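After running the tool, the result goes back to the model as a role "tool" message referencing the call ID, followed by a second completions call. A sketch of building that follow-up messages list (the helper name and dict shapes for tool_calls/results are our own; only the message structure follows the OpenAI format):

```python
import json

def build_tool_followup(messages, tool_calls, results):
    """Append the assistant's tool calls and their results to the
    conversation, ready to send back via chat.completions.create.

    tool_calls: list of dicts like {"id": ..., "name": ..., "arguments": ...}
    results:    dict mapping call id -> tool output (JSON-serialisable)
    """
    followup = list(messages)
    # Echo the assistant turn that requested the tool calls.
    followup.append({
        "role": "assistant",
        "tool_calls": [{
            "id": tc["id"],
            "type": "function",
            "function": {"name": tc["name"], "arguments": tc["arguments"]},
        } for tc in tool_calls],
    })
    # One "tool" message per call, matched by tool_call_id.
    for tc in tool_calls:
        followup.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(results[tc["id"]]),
        })
    return followup

msgs = build_tool_followup(
    [{"role": "user", "content": "What's the weather in London?"}],
    [{"id": "call_1", "name": "get_weather",
      "arguments": '{"location": "London"}'}],
    {"call_1": {"temp_c": 14, "conditions": "cloudy"}},
)
```

Pass the returned list as messages in a second create() call to get the model's natural-language answer.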
Reliability
- Describe tools precisely – Llama follows descriptions closely
- Keep tool schemas small – 5-10 tools per call is the sweet spot
- Include an explicit example in the system prompt for rare tools
- At 70B you get near-OpenAI-level reliability. At 8B, function calling is noticeably weaker – keep tools simple
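For the system-prompt tip above, a sketch of what an explicit example might look like. The tool name, wording, and call format shown to the model are illustrative, not a required template:

```python
def system_prompt_with_example() -> str:
    # Illustrative only: showing the model one concrete call for a
    # rarely-used tool tends to anchor its output format.
    return (
        "You can call tools. For the rare tool `archive_lookup`, "
        "call it like this example:\n"
        '{"name": "archive_lookup", "parameters": '
        '{"query": "2019 pricing", "limit": 5}}\n'
        "Only call it when the user asks about historical records."
    )

prompt = system_prompt_with_example()
```

Put this string in a {"role": "system", ...} message at the start of the conversation.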
Self-Hosted Function Calling
Llama 3.3 with vLLM tool parsing preconfigured on UK dedicated GPUs.
Browse GPU Servers