
AutoGen Self-Hosted LLM Agent

Microsoft's AutoGen orchestrates multi-agent workflows. Pointed at a self-hosted LLM, it delivers production agent pipelines without per-token fees.

AutoGen is Microsoft's framework for building multi-agent systems: a user agent, an executor agent, and a critic agent coordinating to solve tasks. By default it targets the OpenAI API, but it works equally well with any OpenAI-compatible endpoint. On our dedicated GPU hosting you can run full agent workflows against your own LLM with no per-token cost.


Setup

pip install autogen-agentchat "autogen-ext[openai]"

Run vLLM locally with your chosen model. See self-hosted OpenAI-compatible API.
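Before wiring AutoGen up, it is worth confirming the endpoint responds. A minimal sketch using the openai client installed above; the port (8000) and served model name (llama-3.3-70b) are assumptions matching the config in the next section:

```python
# Quick smoke test against the local vLLM server (assumed at :8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the models the server exposes; the id must match the name
# you later pass to AutoGen's model client.
for m in client.models.list():
    print(m.id)

resp = client.chat.completions.create(
    model="llama-3.3-70b",  # assumed served model name
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```

If the model id printed here differs from what you expect, use that exact id in the AutoGen config below.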

Config

from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="llama-3.3-70b",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model_info={
        "vision": False,
        "function_calling": True,
        "json_output": True,
        "family": "unknown",
    },
)

The model_info dict tells AutoGen what the model supports. Function calling and JSON output need the underlying LLM to actually handle them; Llama 3.3 and Qwen 2.5 both do well here.
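Function calling is easiest to verify with a plain Python tool. A minimal sketch reusing the model_client from the config above; the get_weather function and its canned reply are hypothetical, purely for illustration:

```python
from autogen_agentchat.agents import AssistantAgent


async def get_weather(city: str) -> str:
    """Return a canned weather report (hypothetical tool for illustration)."""
    return f"It is 18C and cloudy in {city}."


# AssistantAgent wraps plain Python functions as tools; the model must
# genuinely support function calling for this to work, which is why
# model_info declares it above.
weather_agent = AssistantAgent(
    "weather",
    model_client=model_client,
    tools=[get_weather],
)
```

If the agent answers weather questions with the canned reply, tool calls are round-tripping through your self-hosted model correctly.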

Example

from autogen_agentchat.agents import AssistantAgent, CodeExecutorAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.code_executors.local import LocalCommandLineCodeExecutor

assistant = AssistantAgent("assistant", model_client=model_client)
executor = CodeExecutorAgent("executor", code_executor=LocalCommandLineCodeExecutor())

team = RoundRobinGroupChat([assistant, executor], termination_condition=...)
await team.run(task="Analyse sales.csv and produce a summary")
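Putting the pieces together, a complete script looks roughly like this. The termination condition shown (stop on the word "TERMINATE" or after 20 messages) is one reasonable choice, not the only one:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent, CodeExecutorAgent
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.code_executors.local import LocalCommandLineCodeExecutor
from autogen_ext.models.openai import OpenAIChatCompletionClient


async def main() -> None:
    model_client = OpenAIChatCompletionClient(
        model="llama-3.3-70b",
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
        model_info={"vision": False, "function_calling": True,
                    "json_output": True, "family": "unknown"},
    )
    assistant = AssistantAgent("assistant", model_client=model_client)
    executor = CodeExecutorAgent(
        "executor", code_executor=LocalCommandLineCodeExecutor())

    # Stop when the assistant says TERMINATE, or after 20 messages as a safety net.
    termination = TextMentionTermination("TERMINATE") | MaxMessageTermination(20)

    team = RoundRobinGroupChat([assistant, executor],
                               termination_condition=termination)
    result = await team.run(task="Analyse sales.csv and produce a summary")
    print(result.messages[-1].content)


asyncio.run(main())
```

The executor runs model-generated code on the host shell, so run this in a container or sandbox rather than directly on a production box.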

Models

| Agent Type          | Recommended Self-Hosted Model |
|---------------------|-------------------------------|
| General assistant   | Llama 3.3 70B                 |
| Code executor agent | Qwen Coder 32B                |
| Reasoning agent     | R1 Distill 32B                |
| Low-latency router  | Llama 3 8B                    |
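If you serve more than one of these models (for example, two vLLM instances on separate ports), each agent can get its own client. A sketch under that assumption; the ports and served model names here are placeholders, not fixed values:

```python
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Shared capability declaration for both clients.
info = {"vision": False, "function_calling": True,
        "json_output": True, "family": "unknown"}

# General assistant model (assumed vLLM instance on port 8000).
general_client = OpenAIChatCompletionClient(
    model="llama-3.3-70b", base_url="http://localhost:8000/v1",
    api_key="not-needed", model_info=info)

# Coding model for the executor's partner agent (assumed port 8001).
coder_client = OpenAIChatCompletionClient(
    model="qwen-coder-32b", base_url="http://localhost:8001/v1",
    api_key="not-needed", model_info=info)
```

Pass the appropriate client to each AssistantAgent so every role runs on the model best suited to it.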

Self-Hosted Multi-Agent Hosting

AutoGen on UK dedicated GPUs with a strong LLM behind it.

Browse GPU Servers

See CrewAI and LangGraph.
