AI deployments accumulate organisational knowledge that evaporates when the original engineer leaves. Minimum viable handoff documentation prevents the "nobody knows how this works" failure mode. Write it as you build the system rather than at handoff time.
Required docs: architecture diagram, service map, config reference (env vars, secrets, feature flags), a runbook per recurring incident class, deployment and rollback procedure, eval harness guide, decision log (why we picked these models and configs), and cost overview. Keep them in the repo as Markdown and review them quarterly.
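One way to make the quarterly review checkable is to stamp each doc with a review date and flag anything older than a quarter. The sketch below assumes a hypothetical convention where each Markdown file contains a line of the form `Last reviewed: YYYY-MM-DD`; the stamp format, the function name, and the 92-day threshold are illustrative choices, not part of any standard.

```python
import re
from datetime import date, timedelta

# Hypothetical convention: each doc carries a line like "Last reviewed: 2025-01-15".
REVIEW_LINE = re.compile(r"^Last reviewed:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE)


def is_stale(markdown_text, today, max_age_days=92):
    """Return True if the doc has no review stamp or the stamp is older than max_age_days."""
    match = REVIEW_LINE.search(markdown_text)
    if match is None:
        # A doc that was never stamped counts as overdue for review.
        return True
    reviewed = date.fromisoformat(match.group(1))
    return today - reviewed > timedelta(days=max_age_days)
```

Running this over every file in the docs directory from a scheduled CI job would turn "review quarterly" from an intention into a failing check.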
Required docs
- Architecture diagram: services, GPUs, vector store, observability, where data flows
- Service map: every component, its purpose, how to access
- Config reference: env vars, feature flags, secrets, config file locations
- Operational runbooks: per recurring incident class
- Deployment procedure: how to deploy a new model / prompt / RAG change, and how to roll it back
- Eval harness guide: how to run, where results go, how to interpret
- Decision log: why we chose Llama vs Mistral, why this prompt structure, etc.
- Cost overview: monthly costs, where they go, who pays
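The list above can be enforced with a small presence check. This is a minimal sketch: the file and directory names are hypothetical and should be adapted to the repo's actual layout.

```python
from pathlib import Path

# Hypothetical file names under the docs directory; adjust to your repo.
REQUIRED_DOCS = {
    "architecture.md": "Architecture diagram",
    "service-map.md": "Service map",
    "config-reference.md": "Config reference",
    "runbooks": "Operational runbooks",  # a directory, one file per incident class
    "deployment.md": "Deployment procedure",
    "eval-harness.md": "Eval harness guide",
    "decision-log.md": "Decision log",
    "cost-overview.md": "Cost overview",
}


def missing_docs(docs_dir):
    """Return the names of required docs that do not exist under docs_dir."""
    root = Path(docs_dir)
    return [name for path, name in REQUIRED_DOCS.items() if not (root / path).exists()]
```

Calling `missing_docs("docs")` in CI and failing the build on a non-empty result keeps the set complete as the deployment evolves. It only checks that the files exist, not that their content is useful.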
Runbooks
Per recurring incident, document:
- Symptoms (what alerts fire / what users report)
- Diagnosis steps (which dashboards, which logs)
- Mitigation (route traffic, restart, scale)
- Recovery verification
- When to escalate
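A runbook that is missing one of these parts tends to go unnoticed until the incident happens. The sketch below checks a runbook's Markdown for the five parts, assuming (hypothetically) that each one is written as a heading with exactly the names used here.

```python
import re

# Assumed heading names, taken from the list above; rename to match your template.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Diagnosis steps",
    "Mitigation",
    "Recovery verification",
    "When to escalate",
]

HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text):
    """Return the required section names that have no matching Markdown heading."""
    headings = {m.group(1).lower() for m in HEADING.finditer(markdown_text)}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

The match is case-insensitive but otherwise exact, so a heading like "Diagnosis" alone would be reported as missing "Diagnosis steps". That strictness is a design choice; loosen it if your runbooks use varied titles.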
Decisions
The decision log is the highest-leverage doc. For each decision, capture:
- What decision was made
- What alternatives were considered
- Why this option won
- What would change the decision in future
Examples: "Picked Mistral 7B over Llama 3.1 8B for English production because faster TTFT (time to first token) outweighed the slight quality difference for our chatbot use case." "Picked a single 4090 over 2× 5060 Ti because operational simplicity outweighed the cost saving for our team size."
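To keep entries consistent, the four fields can be captured in a small record that renders to Markdown. This is an illustrative sketch: the class name, field names, and output layout are assumptions. The example entry reuses the Mistral decision, but its date and its "what would change it" condition are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Decision:
    title: str
    decision: str       # what decision was made
    alternatives: list  # what alternatives were considered
    rationale: str      # why this option won
    revisit_when: str   # what would change the decision in future
    decided_on: date = field(default_factory=date.today)

    def to_markdown(self):
        """Render the entry as a Markdown section for the decision log."""
        alts = "\n".join(f"- {a}" for a in self.alternatives)
        return (
            f"## {self.decided_on.isoformat()}: {self.title}\n\n"
            f"**Decision:** {self.decision}\n\n"
            f"**Alternatives considered:**\n{alts}\n\n"
            f"**Why this option won:** {self.rationale}\n\n"
            f"**What would change it:** {self.revisit_when}\n"
        )


# Example entry; the date and revisit condition are hypothetical.
example = Decision(
    title="Mistral 7B over Llama 3.1 8B",
    decision="Serve Mistral 7B for English production traffic.",
    alternatives=["Llama 3.1 8B"],
    rationale="Faster TTFT outweighed the slight quality difference for the chatbot.",
    revisit_when="(hypothetical) The quality gap becomes visible in user feedback.",
    decided_on=date(2025, 1, 15),
)
```

Appending `example.to_markdown()` to the decision log file gives every entry the same shape, which makes the log easier to scan when someone questions a past choice.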
Verdict
Handoff documentation is the cheapest insurance against engineer turnover. Build it as you go, because retrofitting is harder than writing it the first time. The decision log is the highest-leverage piece: it keeps future engineers from re-litigating settled questions.
Bottom line
Write docs as you go, the decision log especially. See the deployment checklist.