Adding AI to an existing Rails, Django, Node or PHP application rarely means rewriting the app – the cleanest pattern is an AI sidecar: a small, narrow service running on dedicated GPU hardware that your main app calls over HTTP. An RTX 5060 Ti 16GB from our dedicated GPU hosting makes an excellent sidecar for EC2, Heroku, Render, Vercel or Fly apps that need summarisation, embeddings, classification or moderation without the cost of a hyperscaler API.
Contents
- The sidecar pattern
- What the sidecar does
- Secure connection options
- Latency budget
- Integration code
- Scaling the sidecar
The Sidecar Pattern
Your main app handles CRUD, auth, payments, UI. The sidecar exposes a narrow internal API – /summarise, /embed, /classify, /transcribe. The main app calls it with HTTP POST; the sidecar returns JSON. Two benefits: the main app stays stateless and CPU-bound, and the GPU workload scales independently on dedicated hardware.
What the Sidecar Does
| Endpoint | Model | Typical latency | VRAM |
|---|---|---|---|
| /summarise | Mistral 7B FP8 or Llama 3.1 8B FP8 | 400-1200 ms for 200-word summary | ~10 GB |
| /embed | BGE-M3 or E5-large | 10-30 ms per doc | ~2 GB |
| /classify | DistilBERT or Llama 3 with few-shot | 30-150 ms | 1-8 GB |
| /moderate | Llama Guard 2 or BERT toxicity | 30-200 ms | 1-8 GB |
| /transcribe | faster-whisper large-v3 | ~0.1x audio duration (about 10x faster than real time) | ~5 GB |
| /ocr | TrOCR or PaddleOCR | 150-400 ms per page | ~2 GB |
Most of these coexist comfortably within 16 GB – an LLM (~10 GB) plus an embedder (~2 GB) plus a small classifier still leaves headroom.
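As a back-of-envelope check before deciding what to co-host, the table's footprints can simply be summed. A minimal sketch using the rough figures above (not measured numbers):

```python
# Rough VRAM budget check for co-hosting models on one 16 GB card.
# Figures are the approximate per-model footprints from the table above.
budget_gb = {
    "llm_fp8_8b": 10.0,   # Llama 3.1 8B FP8 weights + KV cache
    "embedder": 2.0,      # BGE-M3
    "classifier": 1.5,    # DistilBERT-class model
}

total = sum(budget_gb.values())
headroom = 16.0 - total
print(f"allocated: {total:.1f} GB, headroom: {headroom:.1f} GB")
```

Leave a gigabyte or two spare – CUDA context and KV-cache growth under load will eat into whatever the raw weight sizes suggest.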
Secure Connection Options
| Option | Setup | Latency overhead | Notes |
|---|---|---|---|
| HTTPS + API key | nginx + TLS cert | ~5-15 ms extra | Simplest; rate-limit by IP or key |
| WireGuard VPN | Peer between app and sidecar | ~1-3 ms | Encrypted, private, low-latency |
| Tailscale / Headscale | Managed mesh VPN | ~2-5 ms | No firewall config, auto-renewed |
| mTLS | Client certs | ~5-15 ms | Strong auth without API key secrets |
| Public nginx + Cloudflare | Proxy via CF | ~20-40 ms | DDoS protection, WAF |
For a UK-hosted sidecar serving a UK-region cloud app, WireGuard or Tailscale add almost no latency. For cross-region (US-east app -> UK sidecar), expect ~80-100 ms of network latency on top of the GPU work – consider a regional sidecar pair.
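A minimal WireGuard peering on the sidecar looks roughly like this – a sketch assuming a 10.8.0.0/24 tunnel subnet, with placeholder keys and addresses you would replace with your own:

```ini
# /etc/wireguard/wg0.conf on the sidecar (keys and addresses are placeholders)
[Interface]
Address = 10.8.0.2/24
ListenPort = 51820
PrivateKey = <sidecar-private-key>

[Peer]
# The main application server
PublicKey = <app-server-public-key>
AllowedIPs = 10.8.0.1/32
```

The main app then calls the sidecar over the tunnel address (e.g. http://10.8.0.2:8080/summarise); since the tunnel is already encrypted, TLS inside it is optional.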
Latency Budget
| Component | Budget (UK-to-UK) | Budget (US-to-UK) |
|---|---|---|
| Main app -> sidecar TCP/TLS | 5-15 ms | 80-120 ms |
| Request parsing | 1-3 ms | 1-3 ms |
| Model prefill (200 tokens) | 30-80 ms | 30-80 ms |
| Model decode (200 output tokens @ ~112 t/s) | ~1,800 ms | ~1,800 ms |
| Response JSON encode + return | 5-15 ms | 80-120 ms |
| Total for 200-token summary | ~1.9 s | ~2.1 s |
| Embedding-only call | ~40 ms | ~200 ms |
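The table's totals fall out of simple arithmetic, and it shows why decode time dominates: shaving network latency matters far less than raising tokens/second. A quick check using the figures above:

```python
# Reproduce the UK-to-UK budget: decode time dwarfs everything else.
tokens_out = 200
decode_tps = 112                  # tokens/second from the table above

decode_ms = tokens_out / decode_tps * 1000
overhead_ms = 15 + 3 + 80 + 15    # network + parse + prefill + encode (upper bounds)

total_s = (decode_ms + overhead_ms) / 1000
print(f"decode: {decode_ms:.0f} ms, total: ~{total_s:.1f} s")
```

Even taking the upper bound of every non-decode row, decode is still ~94% of the total – which is why embedding-only calls, with no decode loop, come back in tens of milliseconds.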
Integration Code
FastAPI sidecar (Python) + a Rails client:
# sidecar.py (on the 5060 Ti)
import os
from fastapi import FastAPI, Header, HTTPException
from openai import OpenAI

app = FastAPI()
# vLLM serves an OpenAI-compatible API locally on the GPU box
llm = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="x")
API_KEY = os.environ["SIDECAR_API_KEY"]  # never hard-code the shared secret

@app.post("/summarise")
def summarise(body: dict, x_api_key: str = Header(None)):
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401)
    r = llm.chat.completions.create(
        model="llama-3.1-8b", max_tokens=200,
        messages=[{"role": "user", "content": f"Summarise:\n{body['text']}"}],
    )
    return {"summary": r.choices[0].message.content}
# Rails client
require "net/http"
require "json"

uri = URI("https://ai-sidecar.internal/summarise")
res = Net::HTTP.post(uri, { text: content }.to_json,
  "Content-Type" => "application/json",
  "X-API-Key" => Rails.application.credentials.sidecar_key)
JSON.parse(res.body)["summary"]
Scaling the Sidecar
- One 5060 Ti comfortably handles a few hundred requests per minute for summarisation, thousands for embeddings
- Add a second 5060 Ti on a sibling server when P95 creeps – round-robin via nginx or HAProxy
- Cache aggressively – embeddings for repeat documents, summaries keyed by content hash
- Queue long jobs (transcribe, long summaries) via Redis or RabbitMQ so synchronous requests stay snappy
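The caching bullet above can be sketched as follows – `summarise_cached` and `fake_model` are illustrative stand-ins, and the in-memory dict stands in for Redis or Rails.cache in production:

```python
# Content-hash cache sketch: identical documents never hit the GPU twice.
import hashlib

_cache: dict[str, str] = {}

def summarise_cached(text: str, summarise_fn) -> str:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = summarise_fn(text)  # only cache misses reach the model
    return _cache[key]

# Usage: the second call returns instantly from cache
calls = []
def fake_model(text):
    calls.append(text)
    return text[:20]

summarise_cached("long document...", fake_model)
summarise_cached("long document...", fake_model)
print(len(calls))  # the model ran once
```

Keying on a hash of the content (rather than a record ID) means edits naturally invalidate the cache, while re-saves of unchanged text stay free.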
AI Sidecar Hosting
Bolt AI onto your existing app. UK dedicated Blackwell 16 GB.
Order the RTX 5060 Ti 16GB
See also: vLLM setup, FP8 deployment, Llama 3 8B benchmark, Docker CUDA setup, first-day checklist.