Deploying AI model updates manually means downtime, human error, and inconsistent environments. A CI/CD pipeline automates the path from model checkpoint to production inference endpoint on your GPU server, running validation tests before any deployment touches live traffic.
## Pipeline Architecture
An AI model deployment pipeline differs from standard application CI/CD in one critical way: the artefact is a multi-gigabyte model file, not a compiled binary. The pipeline must handle large file storage, GPU-dependent tests, and inference validation alongside the usual build and deploy stages.
| Stage | Purpose | GPU Required |
|---|---|---|
| Trigger | New model pushed to registry | No |
| Validate | Run inference tests on sample inputs | Yes |
| Build | Package model + serving code into Docker image | No |
| Staging | Deploy to staging, run integration tests | Yes |
| Deploy | Rolling update to production | Yes |
| Verify | Smoke tests against live endpoint | No |
## GitHub Actions Pipeline
Define the pipeline in GitHub Actions. The self-hosted runner runs on your GPU server so validation and staging tests have access to the hardware.
```yaml
# .github/workflows/deploy-model.yml
name: Deploy AI Model

on:
  push:
    branches: [main]
    paths:
      - "models/**"
      - "serving/**"

jobs:
  validate:
    runs-on: self-hosted  # GPU runner
    steps:
      - uses: actions/checkout@v4

      - name: Validate model loads
        run: |
          python -c "
          from vllm import LLM
          llm = LLM(model='./models/current', gpu_memory_utilization=0.5)
          output = llm.generate(['Hello, test prompt'], sampling_params=None)
          assert len(output) > 0, 'Model failed to generate'
          print('Model validation passed')
          "

      - name: Run inference benchmarks
        run: |
          python scripts/benchmark.py \
            --model ./models/current \
            --prompts tests/benchmark_prompts.jsonl \
            --min-tokens-per-sec 50 \
            --max-latency-p99 2.0

  build:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build serving image
        run: |
          docker build -t ai-inference:${{ github.sha }} \
            -f serving/Dockerfile .

      - name: Push to registry
        run: |
          docker tag ai-inference:${{ github.sha }} \
            registry.yourdomain.com/ai-inference:${{ github.sha }}
          docker push registry.yourdomain.com/ai-inference:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: self-hosted
    steps:
      - name: Deploy to staging
        env:
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker compose -f docker-compose.staging.yml up -d \
            --pull always

      - name: Run integration tests
        run: |
          python tests/integration.py \
            --endpoint http://localhost:8100/v1 \
            --test-suite full

  deploy-production:
    needs: deploy-staging
    runs-on: self-hosted
    environment: production
    steps:
      - name: Rolling update
        env:
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker compose -f docker-compose.prod.yml up -d \
            --pull always --no-deps vllm-server

      - name: Smoke test
        run: |
          curl -s http://localhost:8000/v1/models | jq .
          python tests/smoke.py --endpoint http://localhost:8000/v1
```
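The compose files referenced above are not shown in this article. A minimal sketch of what `docker-compose.prod.yml` might look like, assuming the registry and service name used in the workflow; it reads `IMAGE_TAG` from the environment, which is what makes the deploy and rollback commands work:

```yaml
# docker-compose.prod.yml -- illustrative sketch; adjust volumes,
# ports, and GPU reservations for your server
services:
  vllm-server:
    image: registry.yourdomain.com/ai-inference:${IMAGE_TAG}
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - /srv/models:/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```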
## Model Validation Tests
Validation ensures a new model meets quality thresholds before it reaches production. Test against a fixed set of prompts and compare outputs to expected results.
```python
# tests/validate_model.py
import json
import sys

import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"

with open("tests/validation_cases.json") as fh:
    TEST_CASES = json.load(fh)

def validate():
    client = httpx.Client(timeout=30)
    failures = []
    for case in TEST_CASES:
        resp = client.post(ENDPOINT, json={
            "model": "current",
            "messages": case["messages"],
            "max_tokens": case.get("max_tokens", 256),
            "temperature": 0,  # deterministic for testing
        })
        resp.raise_for_status()
        result = resp.json()
        content = result["choices"][0]["message"]["content"]
        # Check that required keywords appear in the output
        for keyword in case.get("expected_keywords", []):
            if keyword.lower() not in content.lower():
                failures.append({
                    "case": case["name"],
                    "missing_keyword": keyword,
                    "output_preview": content[:200],
                })
    if failures:
        print(f"FAILED: {len(failures)} validation cases")
        for f in failures:
            print(f"  - {f['case']}: missing '{f['missing_keyword']}'")
        sys.exit(1)
    print(f"PASSED: {len(TEST_CASES)} validation cases")

if __name__ == "__main__":
    validate()
```
Store validation cases alongside the model code. Whenever the model changes, whether through fine-tuning in your PyTorch environment or a switch to a new self-hosted base model, the tests catch regressions before deployment.
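The shape of `tests/validation_cases.json` follows from how the script reads it; the case names, prompts, and keywords below are illustrative only:

```json
[
  {
    "name": "factual_recall",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "expected_keywords": ["Paris"],
    "max_tokens": 64
  },
  {
    "name": "instruction_following",
    "messages": [
      {"role": "user", "content": "List three uses for a GPU server."}
    ],
    "expected_keywords": ["inference"]
  }
]
```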
## Serving Dockerfile
Package the inference server and dependencies into a reproducible Docker image.
```dockerfile
# serving/Dockerfile
FROM vllm/vllm-openai:latest

COPY serving/config.yaml /app/config.yaml
COPY serving/entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh

ENV MODEL_PATH=/models/current
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.9

HEALTHCHECK --interval=30s --timeout=10s \
  CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["/app/entrypoint.sh"]
```

The entrypoint script reads the environment variables set in the image:

```bash
#!/bin/bash
# serving/entrypoint.sh
exec python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL_PATH" \
  --max-model-len "$MAX_MODEL_LEN" \
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
  --host 0.0.0.0 \
  --port 8000
```
## Rollback Strategy
Every deployment should be reversible. Tag Docker images with the git commit SHA so rolling back is a single command pointing to the previous image.
```bash
# Roll back to the previous version. docker-compose.prod.yml reads
# IMAGE_TAG from the environment, so pointing it at the previous
# commit SHA and re-running compose is the whole rollback.
export IMAGE_TAG="abc123def"  # previous commit SHA
docker compose -f docker-compose.prod.yml up -d \
  --no-deps vllm-server
```

```bash
# Automated rollback on smoke test failure. PREVIOUS_TAG must hold the
# last known-good commit SHA, recorded before the new image is deployed.
deploy_and_verify() {
  docker compose -f docker-compose.prod.yml up -d --no-deps vllm-server
  sleep 10  # wait for the model to load; increase for large models
  if ! python tests/smoke.py --endpoint http://localhost:8000/v1; then
    echo "Smoke test failed, rolling back"
    export IMAGE_TAG="$PREVIOUS_TAG"
    docker compose -f docker-compose.prod.yml up -d --no-deps vllm-server
    exit 1
  fi
}
```
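The `tests/smoke.py` script invoked here and in the workflow is not shown in this article. A minimal sketch, assuming an OpenAI-compatible chat completions endpoint; the prompt and model name are placeholders:

```python
# tests/smoke.py -- minimal sketch of the pipeline's smoke test,
# assuming an OpenAI-compatible chat completions endpoint.
import argparse
import json
import sys
import urllib.request

def response_ok(payload: dict) -> bool:
    """Structural check on an OpenAI-style chat completion response."""
    try:
        return bool(payload["choices"][0]["message"]["content"].strip())
    except (KeyError, IndexError, TypeError, AttributeError):
        return False

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", required=True)  # e.g. http://localhost:8000/v1
    args = parser.parse_args()
    body = {
        "model": "current",
        "messages": [{"role": "user", "content": "Reply with the word OK."}],
        "max_tokens": 8,
    }
    req = urllib.request.Request(
        f"{args.endpoint}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            payload = json.load(resp)
    except OSError as exc:  # connection refused, timeout, HTTP error
        print(f"Smoke test failed: {exc}")
        return 1
    if not response_ok(payload):
        print("Smoke test failed: malformed or empty completion")
        return 1
    print("Smoke test passed")
    return 0

# Guard on argv so the module can also be imported by other tests.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

Exiting non-zero on any failure is what lets both the workflow step and the `deploy_and_verify` function above treat the smoke test as a gate.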
For zero-downtime updates, combine rollback with blue-green deployment, and track which model version is active in your model versioning system.
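One common shape for blue-green on a single server is two inference containers behind a reverse proxy. This nginx fragment is an illustrative sketch; the ports and upstream name are placeholders:

```nginx
# /etc/nginx/conf.d/inference.conf -- illustrative blue-green switch.
# "blue" runs the current model on 8001, "green" the candidate on 8002.
# Promote green by swapping the commented line and reloading nginx.
upstream inference_active {
    server 127.0.0.1:8001;    # blue (currently live)
    # server 127.0.0.1:8002;  # green (candidate)
}

server {
    listen 8000;
    location / {
        proxy_pass http://inference_active;
        proxy_read_timeout 300s;  # allow long generations
    }
}
```

Note that during the cutover both models must fit in GPU memory at once, or the green instance has to start only after blue has drained.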
## Monitoring and Alerts
Wire the pipeline into your observability stack. Record deployment events as Grafana annotations so you can correlate model changes with shifts in the inference metrics Prometheus collects. Log pipeline execution details to the ELK stack.
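Pushing a deployment annotation is a single HTTP call to Grafana's annotations API. A sketch, where `GRAFANA_URL` and `GRAFANA_API_TOKEN` are placeholder names for your own settings:

```python
# Push a deployment annotation to Grafana so model rollouts appear
# on inference dashboards. Env var names here are assumptions.
import json
import os
import time
import urllib.request

def build_annotation(sha: str, model_name: str) -> dict:
    """Payload for Grafana's POST /api/annotations endpoint."""
    return {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "tags": ["deployment", "ai-model", model_name],
        "text": f"Deployed model build {sha[:12]}",
    }

def post_annotation(sha: str, model_name: str = "vllm-server") -> None:
    req = urllib.request.Request(
        os.environ.get("GRAFANA_URL", "http://localhost:3000") + "/api/annotations",
        data=json.dumps(build_annotation(sha, model_name)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["GRAFANA_API_TOKEN"],
        },
    )
    urllib.request.urlopen(req, timeout=10)

# Called from the deploy job, e.g.:
#   post_annotation(os.environ["GITHUB_SHA"])
```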
For inference servers built with FastAPI, expose a /version endpoint that returns the current model hash and deployment timestamp. Deliver deployment notifications via webhooks to Slack or your incident management tool. The tutorials section covers more production patterns for vLLM deployments.