
Function Calling with Llama 3.3 – Complete Guide

Llama 3.3 supports structured tool use. Getting reliable function calls on a self-hosted deployment takes the right inference config and prompt format.

Llama 3.3 was trained with function-calling support, but its implementation differs from OpenAI's format. On our dedicated GPU hosting, getting reliable tool use requires both the right vLLM flags and correct message formatting.


vLLM Flags

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template /path/to/llama3_chat_template.jinja

--tool-call-parser llama3_json teaches vLLM how to extract tool calls from Llama's native JSON output, and --enable-auto-tool-choice lets the model decide when to invoke a tool. Without the parser, tool calls come through as free text in the message content.
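If you can't enable the parser, one fallback is to detect the JSON yourself in message.content. A rough sketch, assuming the native tool-call shape is a JSON object of the form {"name": ..., "parameters": ...} (the shape the llama3_json parser targets); treat it as illustrative, not exhaustive:

```python
import json

def extract_tool_call(content: str):
    """Try to interpret free-text model output as a JSON tool call.

    Returns (name, arguments) if the content parses as a
    {"name": ..., "parameters": ...} object, else None.
    """
    try:
        obj = json.loads(content.strip())
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON at all -> ordinary text reply
    if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
        return obj["name"], obj["parameters"]
    return None
```

This is a stopgap; the server-side parser also handles streaming and edge cases, so prefer the flag when you can.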

Request

from openai import OpenAI

# Point the client at the vLLM server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
  model="meta-llama/Llama-3.3-70B-Instruct",  # must match the served model name
  messages=[{"role": "user", "content": "What's the weather in London?"}],
  tools=[{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string"},
          "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
      }
    }
  }],
  tool_choice="auto"
)

Parsing

Llama 3.3 emits tool calls in JSON. vLLM parses them into the standard OpenAI structure:

import json

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments)  # arguments arrive as a JSON string
        result = execute(name, **args)  # execute() is your own dispatch function
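After executing the call, the result goes back to the model as a "tool" message so it can compose a final answer. A minimal sketch of the round trip, following the standard OpenAI chat format (tc, result, and the original messages list are assumed from the snippets above):

```python
import json

def tool_result_message(tc, result):
    """Wrap a tool's return value as an OpenAI-style 'tool' message,
    linked to the originating call via tool_call_id."""
    return {
        "role": "tool",
        "tool_call_id": tc.id,
        "content": json.dumps(result),
    }

# The assistant message containing tool_calls must precede the tool result:
# messages.append(response.choices[0].message)
# messages.append(tool_result_message(tc, result))
# followup = client.chat.completions.create(
#     model="meta-llama/Llama-3.3-70B-Instruct", messages=messages, tools=tools)
```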

Reliability

  • Describe tools precisely – Llama follows descriptions closely
  • Keep tool schemas small – 5-10 tools per call is the sweet spot
  • Include an explicit example in the system prompt for rare tools
  • At 70B you get near-OpenAI-level reliability. On smaller Llama models (e.g. Llama 3.1 8B), function calling is noticeably weaker – keep tools simple
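For the third point, a single worked call in the system prompt is usually enough to anchor a rarely-used tool. A sketch (the convert_units tool and wording are hypothetical, chosen for illustration):

```python
# Hypothetical: nudge the model toward a rarely-used tool by showing
# one worked invocation in the system prompt.
SYSTEM_PROMPT = (
    "You can call tools. For unit-conversion requests, use convert_units.\n"
    "Example: 'How many miles is 10 km?' -> "
    'convert_units(value=10, from_unit="km", to_unit="mi")'
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How many miles is 10 km?"},
]
```

Keep such examples short; a one-line demonstration tends to work as well as a long schema walkthrough and costs far fewer prompt tokens.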

Self-Hosted Function Calling

Llama 3.3 with vLLM tool parsing preconfigured on UK dedicated GPUs.

Browse GPU Servers

See tool use with Qwen Coder.


