
Self-Hosted Alternative to the OpenAI Assistants API

OpenAI's Assistants API bundles retrieval, code execution, and function calling behind one endpoint. Rebuilding that on a dedicated GPU is straightforward.

OpenAI has deprecated the Assistants API in favour of the Agents SDK, leaving teams that built on it looking for alternatives. On a dedicated GPU server you can assemble an equivalent stack from open-source parts.


Components

  • LLM: self-hosted Llama 3.3 70B or Qwen 2.5 72B with function calling
  • File search: embedder + vector DB (Qdrant)
  • Code interpreter: sandboxed Python execution (E2B, Docker)
  • Thread storage: PostgreSQL or Redis
  • Orchestrator: LangGraph or a small FastAPI service
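Tool use needs a tool schema the model can see. A sketch of what that might look like in the OpenAI function-calling format, which Llama 3.3 and Qwen 2.5 both support; the tool names and parameters here are illustrative:

```python
# Illustrative tool definitions in the OpenAI function-calling format.
# The orchestrator passes this list to the model on every chat call.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "file_search",
            "description": "Search the files uploaded to this thread.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "code_interpreter",
            "description": "Run Python code in a sandbox and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]
```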

Assembly

A minimal FastAPI service that ties them together; load_thread, save_thread, execute_tool, and the llm client are assumed helpers:

from fastapi import FastAPI

app = FastAPI()

MAX_TOOL_ROUNDS = 8  # cap the tool loop so a misbehaving model can't spin forever

@app.post("/threads/{thread_id}/messages")
async def send_message(thread_id: str, body: dict):
    # Load the thread's message history from persistent storage
    history = await load_thread(thread_id)
    history.append({"role": "user", "content": body["content"]})

    for _ in range(MAX_TOOL_ROUNDS):
        response = await llm.chat(history, tools=TOOLS)
        if response.tool_calls:
            # Execute each requested tool and feed the result back to the model
            for tc in response.tool_calls:
                result = await execute_tool(tc)
                history.append({"role": "tool", "tool_call_id": tc.id, "content": result})
        else:
            # No further tool calls: the model produced its final answer
            history.append({"role": "assistant", "content": response.content})
            break

    await save_thread(thread_id, history)
    return {"message": response.content}

Thread Persistence

Each thread has its own message history and, optionally, its own vector store so file search can be scoped to that thread. A good pairing is PostgreSQL with a JSONB column for message history and Qdrant with per-thread collections for file search.
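A minimal sketch of the load_thread/save_thread helpers the service relies on. Here they are backed by an in-memory dict so the snippet runs standalone; the commented SQL shows the equivalent asyncpg calls against a JSONB column (the threads table and pool are assumptions):

```python
import json

# PostgreSQL schema (one JSONB history per thread):
#   CREATE TABLE threads (
#       id      TEXT PRIMARY KEY,
#       history JSONB NOT NULL DEFAULT '[]'
#   );

_DB: dict[str, str] = {}  # in-memory stand-in for the threads table

async def load_thread(thread_id: str) -> list[dict]:
    # asyncpg equivalent:
    #   raw = await pool.fetchval(
    #       "SELECT history::text FROM threads WHERE id = $1", thread_id)
    raw = _DB.get(thread_id, "[]")
    return json.loads(raw)

async def save_thread(thread_id: str, history: list[dict]) -> None:
    # asyncpg equivalent (upsert so new threads are created on first write):
    #   await pool.execute(
    #       "INSERT INTO threads (id, history) VALUES ($1, $2::jsonb) "
    #       "ON CONFLICT (id) DO UPDATE SET history = EXCLUDED.history",
    #       thread_id, json.dumps(history))
    _DB[thread_id] = json.dumps(history)
```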

Tool Sandbox

For code execution, run each tool call in an isolated container. E2B provides a hosted option; for a fully self-hosted setup, use Docker with resource limits and network isolation.
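One way to do the Docker route is a throwaway container per tool call. A minimal sketch, assuming Docker is installed on the host; the image name, limits, and function names are illustrative:

```python
import subprocess

def sandbox_command(code: str) -> list[str]:
    """Build a docker run invocation for untrusted Python: no network,
    capped memory/CPU/PIDs, read-only root filesystem, removed on exit."""
    return [
        "docker", "run", "--rm",
        "--network", "none",      # block all network access
        "--memory", "512m",       # memory cap
        "--cpus", "1",            # CPU cap
        "--pids-limit", "128",    # guard against fork bombs
        "--read-only",            # read-only root filesystem
        "python:3.12-slim",       # illustrative base image
        "python", "-c", code,
    ]

def run_sandboxed(code: str, timeout: float = 30.0) -> str:
    # Kill the container if it outlives the timeout; return stdout on
    # success, stderr on a non-zero exit.
    proc = subprocess.run(sandbox_command(code), capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout if proc.returncode == 0 else proc.stderr
```

For heavier isolation you can swap the runtime for gVisor (`--runtime=runsc`) without changing the calling code.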

Self-Hosted Assistants API Equivalent

Pre-built on UK dedicated GPUs with all components preconfigured.

Browse GPU Servers

