
CI/CD for AI Models: Automated Pipeline

Step-by-step guide to building CI/CD pipelines for AI model deployment covering automated testing, model validation, Docker image builds, rolling updates, and deployment to GPU servers.

Deploying AI model updates manually means downtime, human error, and inconsistent environments. A CI/CD pipeline automates the path from model checkpoint to production inference endpoint on your GPU server, running validation tests before any deployment touches live traffic.

Pipeline Architecture

An AI model deployment pipeline differs from standard application CI/CD in one critical way: the artefact is a multi-gigabyte model file, not a compiled binary. The pipeline must handle large file storage, GPU-dependent tests, and inference validation alongside the usual build and deploy stages.

Stage    | Purpose                                        | GPU Required
---------|------------------------------------------------|-------------
Trigger  | New model pushed to registry                   | No
Validate | Run inference tests on sample inputs           | Yes
Build    | Package model + serving code into Docker image | No
Stage    | Deploy to staging, run integration tests       | Yes
Deploy   | Rolling update to production                   | Yes
Verify   | Smoke tests against live endpoint              | No

GitHub Actions Pipeline

Define the pipeline in GitHub Actions. The self-hosted runner runs on your GPU server so validation and staging tests have access to the hardware.

# .github/workflows/deploy-model.yml
name: Deploy AI Model
on:
  push:
    paths:
      - "models/**"
      - "serving/**"
    branches: [main]

jobs:
  validate:
    runs-on: self-hosted  # GPU runner
    steps:
      - uses: actions/checkout@v4

      - name: Validate model loads
        run: |
          python -c "
          from vllm import LLM
          llm = LLM(model='./models/current', gpu_memory_utilization=0.5)
          output = llm.generate(['Hello, test prompt'], sampling_params=None)
          assert len(output) > 0, 'Model failed to generate'
          print('Model validation passed')
          "

      - name: Run inference benchmarks
        run: |
          python scripts/benchmark.py \
            --model ./models/current \
            --prompts tests/benchmark_prompts.jsonl \
            --min-tokens-per-sec 50 \
            --max-latency-p99 2.0

  build:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build serving image
        run: |
          docker build -t ai-inference:${{ github.sha }} \
            -f serving/Dockerfile .

      - name: Push to registry
        run: |
          docker tag ai-inference:${{ github.sha }} \
            registry.yourdomain.com/ai-inference:${{ github.sha }}
          docker push registry.yourdomain.com/ai-inference:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: self-hosted
    steps:
      - name: Deploy to staging
        run: |
          docker compose -f docker-compose.staging.yml up -d \
            --pull always
        env:
          IMAGE_TAG: ${{ github.sha }}

      - name: Run integration tests
        run: |
          python tests/integration.py \
            --endpoint http://localhost:8100/v1 \
            --test-suite full

  deploy-production:
    needs: deploy-staging
    runs-on: self-hosted
    environment: production
    steps:
      - name: Rolling update
        run: |
          docker compose -f docker-compose.prod.yml up -d \
            --pull always --no-deps vllm-server
        env:
          IMAGE_TAG: ${{ github.sha }}

      - name: Smoke test
        run: |
          curl -s http://localhost:8000/v1/models | jq .
          python tests/smoke.py --endpoint http://localhost:8000/v1
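The validate job above invokes scripts/benchmark.py, which the workflow assumes exists but never shows. A minimal sketch of what it might contain — the flag names mirror the workflow, while the internals (vLLM offline inference timed per prompt, nearest-rank p99) are assumptions:

```python
# scripts/benchmark.py -- hypothetical sketch of the benchmark script the
# workflow calls; flag names match the workflow, internals are assumed.
import argparse
import json
import math
import time

def p99(latencies):
    """99th-percentile latency (seconds) via the nearest-rank method."""
    ordered = sorted(latencies)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def tokens_per_sec(total_tokens, elapsed):
    """Aggregate generation throughput over the whole run."""
    return total_tokens / elapsed if elapsed > 0 else 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompts", required=True)
    parser.add_argument("--min-tokens-per-sec", type=float, default=50.0)
    parser.add_argument("--max-latency-p99", type=float, default=2.0)
    args = parser.parse_args()

    from vllm import LLM, SamplingParams  # deferred: GPU-only dependency
    llm = LLM(model=args.model, gpu_memory_utilization=0.5)
    params = SamplingParams(max_tokens=128, temperature=0)

    latencies, total_tokens = [], 0
    with open(args.prompts) as f:
        for line in f:
            prompt = json.loads(line)["prompt"]
            start = time.perf_counter()
            out = llm.generate([prompt], params)
            latencies.append(time.perf_counter() - start)
            total_tokens += len(out[0].outputs[0].token_ids)

    throughput = tokens_per_sec(total_tokens, sum(latencies))
    print(f"tokens/sec={throughput:.1f} p99={p99(latencies):.2f}s")
    # Non-zero exit fails the CI step when thresholds are not met
    if throughput < args.min_tokens_per_sec or p99(latencies) > args.max_latency_p99:
        raise SystemExit(1)

if __name__ == "__main__":
    main()
```

A non-zero exit is all the pipeline needs: the validate job fails and the build never starts.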

Model Validation Tests

Validation ensures a new model meets quality thresholds before it reaches production. Test against a fixed set of prompts and compare outputs to expected results.

# tests/validate_model.py
import json
import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"
with open("tests/validation_cases.json") as f:
    TEST_CASES = json.load(f)

def validate():
    client = httpx.Client(timeout=30)
    failures = []

    for case in TEST_CASES:
        resp = client.post(ENDPOINT, json={
            "model": "current",
            "messages": case["messages"],
            "max_tokens": case.get("max_tokens", 256),
            "temperature": 0  # Deterministic for testing
        })
        result = resp.json()
        content = result["choices"][0]["message"]["content"]

        # Check required keywords appear in output
        for keyword in case.get("expected_keywords", []):
            if keyword.lower() not in content.lower():
                failures.append({
                    "case": case["name"],
                    "missing_keyword": keyword,
                    "output_preview": content[:200]
                })

    if failures:
        print(f"FAILED: {len(failures)} validation cases")
        for f in failures:
            print(f"  - {f['case']}: missing '{f['missing_keyword']}'")
        raise SystemExit(1)

    print(f"PASSED: {len(TEST_CASES)} validation cases")

if __name__ == "__main__":
    validate()

Store validation cases alongside the model code. When the model changes — whether through fine-tuning or a switch to a new base model — the tests catch regressions before deployment.
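The validation script reads tests/validation_cases.json, whose schema is implied but never shown. A hypothetical pair of cases with the fields the script uses (name, messages, expected_keywords, and optional max_tokens) — the cases themselves are illustrative:

```python
# Generate an example tests/validation_cases.json in the shape
# validate_model.py expects; the cases are illustrative placeholders.
import json
import os

EXAMPLE_CASES = [
    {
        "name": "capital_city_fact",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "expected_keywords": ["Paris"],
        "max_tokens": 64,
    },
    {
        "name": "json_formatting",
        "messages": [{"role": "user", "content": "Return {\"status\": \"ok\"} as JSON."}],
        "expected_keywords": ["status", "ok"],
        # max_tokens omitted: the script falls back to its 256 default
    },
]

os.makedirs("tests", exist_ok=True)
with open("tests/validation_cases.json", "w") as f:
    json.dump(EXAMPLE_CASES, f, indent=2)
```

Keeping the file in the repository means a pull request that changes the model can also change its acceptance criteria, and both are reviewed together.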

Serving Dockerfile

Package the inference server and dependencies into a reproducible Docker image.

# serving/Dockerfile
# Pin a specific vLLM tag in production for reproducible builds
FROM vllm/vllm-openai:latest

COPY serving/config.yaml /app/config.yaml
COPY serving/entrypoint.sh /app/entrypoint.sh

RUN chmod +x /app/entrypoint.sh

ENV MODEL_PATH=/models/current
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.9

HEALTHCHECK --interval=30s --timeout=10s \
  CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["/app/entrypoint.sh"]

#!/bin/bash
# serving/entrypoint.sh
exec python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL_PATH" \
  --max-model-len "$MAX_MODEL_LEN" \
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
  --host 0.0.0.0 \
  --port 8000

Rollback Strategy

Every deployment should be reversible. Tag Docker images with the git commit SHA so rolling back is a single command pointing to the previous image.

# Rollback to previous version
export IMAGE_TAG="abc123def"  # Previous known-good commit SHA
docker compose -f docker-compose.prod.yml up -d \
  --no-deps vllm-server
# IMAGE_TAG is read from the environment by compose

# Automated rollback on smoke test failure
deploy_and_verify() {
  # PREVIOUS_TAG must hold the last known-good image tag before calling
  docker compose -f docker-compose.prod.yml up -d --no-deps vllm-server

  sleep 10  # Wait for the model to load; increase for large models

  if ! python tests/smoke.py --endpoint http://localhost:8000/v1; then
    echo "Smoke test failed, rolling back"
    export IMAGE_TAG="$PREVIOUS_TAG"
    docker compose -f docker-compose.prod.yml up -d --no-deps vllm-server
    exit 1
  fi
}
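tests/smoke.py appears in both the workflow and the rollback function but is never shown. One possible sketch, assuming the OpenAI-compatible API that vLLM exposes; the response checks are split into pure functions so they can be exercised without a live server:

```python
# tests/smoke.py -- hypothetical smoke test: confirm the endpoint lists a
# model and answers a trivial chat completion.
import argparse

def models_ok(payload):
    """True if the /v1/models response lists at least one model."""
    return bool(payload.get("data"))

def completion_ok(payload):
    """True if a chat completion returned non-empty content."""
    choices = payload.get("choices", [])
    return bool(choices) and bool(choices[0]["message"]["content"].strip())

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", required=True)
    args = parser.parse_args()

    import httpx  # deferred so the checks above import without httpx
    client = httpx.Client(timeout=30)

    assert models_ok(client.get(f"{args.endpoint}/models").json()), \
        "no models listed"
    resp = client.post(f"{args.endpoint}/chat/completions", json={
        "model": "current",
        "messages": [{"role": "user", "content": "Say OK"}],
        "max_tokens": 8,
    })
    assert completion_ok(resp.json()), "empty completion"
    print("Smoke test passed")

if __name__ == "__main__":
    main()
```

A failed assertion exits non-zero, which is what both the workflow's smoke step and deploy_and_verify rely on to trigger a rollback.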

For zero-downtime updates, combine rollback with blue-green deployment: run the new version alongside the old one and switch traffic only after it passes smoke tests. Track which model version is active in your model versioning system so the previous known-good tag is always available for rollback.

Monitoring and Alerts

Wire the pipeline into your observability stack. Push deployment events to Prometheus and Grafana as annotations so you can correlate model changes with inference performance shifts. Log pipeline execution details to the ELK stack.

For inference servers built with FastAPI, expose a /version endpoint that returns the current model hash and deployment timestamp. Deliver deployment notifications via webhooks to Slack or your incident management tool.
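A hedged sketch of that /version endpoint, assuming the fingerprint is taken over the model's config.json rather than the multi-gigabyte weights, and that MODEL_PATH matches the environment set in the Dockerfile — the file layout is illustrative:

```python
# Hypothetical /version route for a FastAPI inference server: reports a
# model fingerprint and deployment timestamp so monitoring can correlate
# deployments with metric shifts. Hashing config.json keeps it cheap.
import hashlib
import os
from datetime import datetime, timezone

try:
    from fastapi import FastAPI
except ImportError:  # keep the hashing helper importable without FastAPI
    FastAPI = None

MODEL_PATH = os.environ.get("MODEL_PATH", "/models/current")
DEPLOYED_AT = datetime.now(timezone.utc).isoformat()  # set at process start

def file_sha256(path: str) -> str:
    """SHA-256 of a file, streamed in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if FastAPI is not None:
    app = FastAPI()

    @app.get("/version")
    def version():
        return {
            "model_hash": file_sha256(os.path.join(MODEL_PATH, "config.json")),
            "deployed_at": DEPLOYED_AT,
        }
```

Grafana annotations can then be keyed on model_hash, so a latency shift on a dashboard lines up with the exact deployment that caused it.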
