The OpenAI Assistants API bundled retrieval (file search), a code interpreter, and tool use behind a single abstraction. Its deprecation (in favour of the Agents SDK) left teams looking for alternatives. On our dedicated GPU hosting you can assemble an equivalent stack from open components.
Components
- LLM: self-hosted Llama 3.3 70B or Qwen 2.5 72B with function calling
- File search: embedder + vector DB (Qdrant)
- Code interpreter: sandboxed Python execution (E2B, Docker)
- Thread storage: PostgreSQL or Redis
- Orchestrator: LangGraph or a small FastAPI service
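To wire the LLM to the other components, the model needs tool definitions. A minimal sketch, assuming OpenAI-style function schemas (the format vLLM and similar servers accept for Llama 3.3 / Qwen 2.5 function calling); the tool names and parameters are illustrative:

```python
# OpenAI-style function schemas. These describe the two core tools to
# the model; the orchestrator executes whichever the model calls.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "file_search",
            "description": "Semantic search over files attached to this thread.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "top_k": {"type": "integer", "description": "Results to return"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "code_interpreter",
            "description": "Run Python code in a sandbox and return its stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]
```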
Assembly
A minimal FastAPI service that ties them together (`llm`, `load_thread`, `save_thread`, `execute_tool`, and `TOOLS` are stand-ins for your own clients and helpers):

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/threads/{thread_id}/messages")
async def send_message(thread_id: str, body: dict):
    # Load thread history and append the new user message
    history = await load_thread(thread_id)
    history.append({"role": "user", "content": body["content"]})
    while True:
        response = await llm.chat(history, tools=TOOLS)
        if response.tool_calls:
            # Record the assistant turn that requested the tools,
            # then execute each call and feed the results back
            history.append({"role": "assistant", "tool_calls": response.tool_calls})
            for tc in response.tool_calls:
                result = await execute_tool(tc)
                history.append({"role": "tool", "tool_call_id": tc.id, "content": result})
        else:
            history.append({"role": "assistant", "content": response.content})
            break
    await save_thread(thread_id, history)
    return {"message": response.content}
```
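The tool dispatcher can be a simple registry that maps tool names to handlers. A sketch, assuming the tool-call object follows the OpenAI shape (`tc.function.name`, `tc.function.arguments` as a JSON string); the `file_search` body is a placeholder to be replaced with a real embedder + Qdrant query:

```python
import json

# Registry mapping tool names to async handlers; register real
# implementations (Qdrant search, Docker runner) at startup.
TOOL_HANDLERS = {}

def tool(name):
    def register(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return register

@tool("file_search")
async def file_search(query: str, top_k: int = 5) -> str:
    # Placeholder: embed `query` and search the thread's Qdrant collection.
    return json.dumps({"query": query, "hits": []})

async def execute_tool(tc) -> str:
    """Look up the handler for one tool call and run it with parsed args."""
    handler = TOOL_HANDLERS.get(tc.function.name)
    if handler is None:
        return f"unknown tool: {tc.function.name}"
    args = json.loads(tc.function.arguments)
    return await handler(**args)
```

Returning a string for unknown tools (rather than raising) lets the model see the error and recover in the next turn.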
Thread Persistence
Each thread has its own message history and, potentially, its own vector store (so file search is scoped to that thread). A good fit is PostgreSQL with a JSONB column for message history and Qdrant with per-thread collections for file search.
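The storage layer is small. A sketch of `load_thread`/`save_thread` using stdlib `sqlite3` so the example is self-contained; with PostgreSQL you would store the messages in a JSONB column via asyncpg or psycopg instead, and make both functions async as the service above expects:

```python
import json
import sqlite3

# One row per thread; the full message list is serialised as JSON.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS threads (id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
)

def load_thread(thread_id: str) -> list[dict]:
    row = conn.execute(
        "SELECT messages FROM threads WHERE id = ?", (thread_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []

def save_thread(thread_id: str, history: list[dict]) -> None:
    # Upsert: insert a new thread or overwrite the existing history
    conn.execute(
        "INSERT INTO threads (id, messages) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET messages = excluded.messages",
        (thread_id, json.dumps(history)),
    )
    conn.commit()
```

Storing the whole history as one JSON document keeps reads and writes to a single row, which matches the read-modify-write pattern of the message endpoint.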
Tool Sandbox
For code execution, run each tool call in an isolated container. E2B provides a hosted option; for fully self-hosted use Docker with resource limits and network isolation.
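For the self-hosted Docker route, the isolation can be expressed as flags on `docker run`. A sketch; the image name, limits, and timeout are illustrative defaults, not tuned recommendations:

```python
import subprocess

def sandbox_cmd(code: str, timeout_s: int = 10) -> list[str]:
    """Build a `docker run` command with resource limits and no network."""
    return [
        "docker", "run", "--rm",
        "--network", "none",      # no outbound network access
        "--memory", "512m",       # cap RAM
        "--cpus", "1",            # cap CPU
        "--pids-limit", "128",    # stop fork bombs
        "--read-only",            # immutable root filesystem
        "python:3.12-slim",
        "timeout", str(timeout_s), "python", "-c", code,
    ]

def run_sandboxed(code: str) -> str:
    # Returns stdout on success, stderr on failure, so the model
    # always gets something useful back from the tool call.
    proc = subprocess.run(sandbox_cmd(code), capture_output=True, text=True)
    return proc.stdout if proc.returncode == 0 else proc.stderr
```

The in-container `timeout` bounds runtime even if the host-side call hangs; for stricter isolation, gVisor (`--runtime runsc`) or a dedicated VM per call are the usual next steps.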
Self-Hosted Assistants API Equivalent
Pre-built on UK dedicated GPUs with all components preconfigured.
Browse GPU Servers

See OpenAI-compatible API and LangGraph.