Base Image Architecture

Overview

To optimize Docker image sizes and build times, we use a layered base image architecture:

python:3.12-slim (150MB)
    └─> base-runtime (300MB) - Core deps for ALL services
            └─> base-ml (1.2GB) - base-runtime + ML deps (sentence-transformers, PyTorch, etc.)
                    ├─> svc-ocr (1.25GB = base-ml + 50MB app)
                    ├─> svc-rag-indexer (1.25GB = base-ml + 50MB app)
                    └─> svc-rag-retriever (1.25GB = base-ml + 50MB app)

Benefits

1. Build ML Dependencies Once

  • Heavy ML libraries (PyTorch, transformers, sentence-transformers) are built once in base-ml
  • All ML services reuse the same base image
  • No need to rebuild 1GB+ of dependencies for each service

2. Faster Builds

  • Before: Each ML service took 10-15 minutes to build
  • After: ML services build in 1-2 minutes (only app code + small deps)

3. Faster Pushes

  • Before: Pushing 1.3GB per service = 3.9GB total for 3 ML services
  • After: Push base-ml once (1.2GB) + 3 small app layers (50MB each) = 1.35GB total
  • Savings: 65% reduction in push time
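
The savings figure can be sanity-checked with quick shell arithmetic (sizes in MB, taken from the numbers above):

```shell
# Before: 3 ML services, each pushing a full ~1.3GB image
before=$((3 * 1300))             # 3900 MB
# After: base-ml pushed once, plus a ~50MB app layer per service
after=$((1200 + 3 * 50))         # 1350 MB
# Integer percentage saved
echo "$(( (before - after) * 100 / before ))% saved"   # → 65% saved
```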

4. Layer Caching

  • Docker reuses base-ml layers across all ML services
  • Only the small application layer (~50MB) needs to be pushed/pulled
  • Faster deployments and rollbacks

5. Easy Updates

  • Update ML library versions in one place (base-ml)
  • Rebuild base-ml once, then rebuild all ML services quickly
  • Consistent ML library versions across all services

Image Sizes

| Image Type | Size | Contents |
|------------|------|----------|
| base-runtime | ~300MB | FastAPI, uvicorn, database drivers, Redis, NATS, MinIO, Qdrant, etc. |
| base-ml | ~1.2GB | base-runtime + sentence-transformers, PyTorch, transformers, numpy, scikit-learn, spacy, nltk |
| ML Service | ~1.25GB | base-ml + service-specific deps (faiss, tiktoken, etc.) + app code (~50MB) |
| Non-ML Service | ~350MB | python:3.12-slim + base deps + service deps + app code |

Architecture

Base Images

1. base-runtime

  • Location: infra/docker/base-runtime.Dockerfile
  • Registry: gitea.harkon.co.uk/harkon/base-runtime:v1.0.1
  • Contents: Core dependencies for ALL services
    • FastAPI, uvicorn, pydantic
    • Database drivers (asyncpg, psycopg2, neo4j, redis)
    • Object storage (minio)
    • Vector DB (qdrant-client)
    • Event bus (nats-py)
    • Secrets (hvac)
    • Monitoring (prometheus-client)
    • HTTP client (httpx)
    • Utilities (ulid-py, python-dateutil, orjson)
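
A minimal sketch of what infra/docker/base-runtime.Dockerfile might look like (the non-root appuser convention and the exact RUN steps are assumptions; only the file paths are from this repo):

```dockerfile
FROM python:3.12-slim

# Non-root user shared by all derived images (assumed convention)
RUN useradd --create-home appuser

# Install the core dependencies used by every service
COPY libs/requirements-base.txt /tmp/requirements-base.txt
RUN pip install --no-cache-dir -r /tmp/requirements-base.txt \
    && rm /tmp/requirements-base.txt

USER appuser
WORKDIR /app
```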

2. base-ml

  • Location: infra/docker/base-ml.Dockerfile
  • Registry: gitea.harkon.co.uk/harkon/base-ml:v1.0.1
  • Contents: base-runtime + ML dependencies
    • sentence-transformers (includes PyTorch)
    • transformers
    • scikit-learn
    • numpy
    • spacy
    • nltk
    • fuzzywuzzy
    • python-Levenshtein
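
Since base-ml is base-runtime plus the ML stack, its Dockerfile can be sketched along these lines (the ARG names mirror the service pattern shown below; treat the details as assumptions):

```dockerfile
ARG REGISTRY=gitea.harkon.co.uk
ARG OWNER=harkon
ARG BASE_VERSION=v1.0.1
FROM ${REGISTRY}/${OWNER}/base-runtime:${BASE_VERSION}

USER root
# Layer the heavy ML dependencies on top of the runtime base
COPY libs/requirements-ml.txt /tmp/requirements-ml.txt
RUN pip install --no-cache-dir -r /tmp/requirements-ml.txt \
    && rm /tmp/requirements-ml.txt
USER appuser
```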

Service Images

ML Services (use base-ml)

  1. svc-ocr - OCR and document AI

    • Additional deps: pytesseract, PyMuPDF, pdf2image, Pillow, opencv-python-headless, torchvision
    • System deps: tesseract-ocr, poppler-utils
  2. svc-rag-indexer - Document indexing and embedding

    • Additional deps: tiktoken, beautifulsoup4, faiss-cpu, python-docx, python-pptx, openpyxl, sparse-dot-topn
  3. svc-rag-retriever - Semantic search and retrieval

    • Additional deps: rank-bm25, faiss-cpu, sparse-dot-topn

Non-ML Services (use python:3.12-slim directly)

  • All other services (svc-ingestion, svc-extract, svc-kg, svc-forms, etc.)
  • Build from scratch with base requirements + service-specific deps

Build Process

Step 1: Build Base Images (One Time)

IMPORTANT: Build base-ml on the remote server to avoid pushing 1.2GB+ over the network!

# Build base-ml on remote server (fast push to Gitea on same network)
./scripts/remote-build-base-ml.sh deploy@141.136.35.199 /home/deploy/ai-tax-agent gitea.harkon.co.uk v1.0.1 harkon

# Or use defaults (deploy user, /home/deploy/ai-tax-agent)
./scripts/remote-build-base-ml.sh

This will:

  1. Sync code to remote server
  2. Build base-ml on remote (~1.2GB, 10-15 min)
  3. Push to Gitea from remote (fast, same network)

Why build base-ml remotely?

  • Faster push to Gitea (same datacenter/network)
  • Saves local network bandwidth
  • Image is cached on remote server for faster service builds
  • Only need to do this once

Time: 10-15 minutes (one time only)

Alternative: build both base images locally

# Build both base images locally
./scripts/build-base-images.sh gitea.harkon.co.uk v1.0.1 harkon

This builds:

  • gitea.harkon.co.uk/harkon/base-runtime:v1.0.1 (~300MB)
  • gitea.harkon.co.uk/harkon/base-ml:v1.0.1 (~1.2GB)

Note: Pushing the 1.2GB base-ml image from a local machine is slow and may fail due to network issues.

Step 2: Build Service Images

# Build and push all services
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 harkon

ML services will:

  1. Pull base-ml:v1.0.1 from registry (if not cached)
  2. Install service-specific deps (~10-20 packages)
  3. Copy application code
  4. Build final image (~1.25GB)

Time per ML service: 1-2 minutes (vs 10-15 minutes before)

Step 3: Update Base Images (When Needed)

When you need to update ML library versions:

# 1. Update libs/requirements-ml.txt
vim libs/requirements-ml.txt

# 2. Rebuild base-ml with new version
./scripts/build-base-images.sh gitea.harkon.co.uk v1.0.2 harkon

# 3. Update service Dockerfiles to use new base version
# Change: ARG BASE_VERSION=v1.0.2

# 4. Rebuild ML services
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.2 harkon
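
Step 3 above edits BASE_VERSION in each service Dockerfile by hand; a one-liner can do the bump across services (the apps/svc_*/Dockerfile glob is an assumption about the repo layout):

```shell
# Bump the default base image version in every service Dockerfile
sed -i 's/^ARG BASE_VERSION=v1\.0\.1$/ARG BASE_VERSION=v1.0.2/' apps/svc_*/Dockerfile

# Verify nothing still references the old tag
grep -rn 'BASE_VERSION=v1\.0\.1' apps/ || echo "all bumped"
```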

Requirements Files

libs/requirements-base.txt

Core dependencies for ALL services (included in base-runtime and base-ml)

libs/requirements-ml.txt

ML dependencies (included in base-ml only)

apps/svc_*/requirements.txt

Service-specific dependencies:

  • ML services: Only additional deps NOT in base-ml (e.g., faiss-cpu, tiktoken)
  • Non-ML services: Service-specific deps (e.g., aiofiles, openai, anthropic)
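
For example, a requirements.txt for svc-rag-retriever would carry only the deltas listed earlier, since PyTorch, numpy, and friends already live in base-ml (versions omitted here; pin them in the real file):

```text
# apps/svc_rag_retriever/requirements.txt
# Only what base-ml does NOT already provide
rank-bm25
faiss-cpu
sparse-dot-topn
```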

Dockerfile Templates

ML Service Dockerfile Pattern

# Use pre-built ML base image
ARG REGISTRY=gitea.harkon.co.uk
ARG OWNER=harkon
ARG BASE_VERSION=v1.0.1
FROM ${REGISTRY}/${OWNER}/base-ml:${BASE_VERSION}

USER root
WORKDIR /app

# Install service-specific deps (minimal)
COPY apps/SERVICE_NAME/requirements.txt /tmp/service-requirements.txt
RUN pip install --no-cache-dir -r /tmp/service-requirements.txt

# Copy app code
COPY libs/ ./libs/
COPY apps/SERVICE_NAME/ ./apps/SERVICE_NAME/

RUN chown -R appuser:appuser /app
USER appuser

# Health check, expose, CMD...

Non-ML Service Dockerfile Pattern

# Multi-stage build from scratch
FROM python:3.12-slim AS builder

# Install build deps
RUN apt-get update && apt-get install -y build-essential curl && rm -rf /var/lib/apt/lists/*

# Create venv and install deps
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/SERVICE_NAME/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

# Production stage
FROM python:3.12-slim
# ... copy venv, app code, etc.
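
The elided production stage typically just copies the venv and app code, then drops privileges. A hedged completion follows (the port, module path, and appuser are placeholders, not taken from the repo):

```dockerfile
# Production stage (sketch)
FROM python:3.12-slim

ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app

# Reuse the dependencies built in the builder stage
COPY --from=builder /opt/venv /opt/venv
COPY libs/ ./libs/
COPY apps/SERVICE_NAME/ ./apps/SERVICE_NAME/

RUN useradd --create-home appuser && chown -R appuser:appuser /app
USER appuser

# Placeholder entrypoint; real services set their own module and port
CMD ["uvicorn", "apps.SERVICE_NAME.main:app", "--host", "0.0.0.0", "--port", "8000"]
```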

Comparison: Before vs After

Before (Monolithic Approach)

Each ML service:
- Build time: 10-15 minutes
- Image size: 1.6GB
- Push time: 5-10 minutes
- Total for 3 services: 30-45 min build + 15-30 min push = 45-75 minutes

After (Base Image Approach)

Base-ml (one time):
- Build time: 10-15 minutes
- Image size: 1.2GB
- Push time: 5-10 minutes

Each ML service:
- Build time: 1-2 minutes
- Image size: 1.25GB (but only 50MB new layers)
- Push time: 30-60 seconds (only new layers)
- Total for 3 services: 3-6 min build + 2-3 min push = 5-9 minutes

Total time savings: 40-66 minutes, roughly 88% faster (once base-ml exists)

Best Practices

  1. Version base images: Always tag with version (e.g., v1.0.1, v1.0.2)
  2. Update base images infrequently: Only when ML library versions need updating
  3. Keep service requirements minimal: Only add deps NOT in base-ml
  4. Use build args: Make registry/owner/version configurable
  5. Test base images: Ensure health checks pass before building services
  6. Document changes: Update this file when modifying base images

Troubleshooting

Issue: Service can't find ML library

Cause: The library was removed from the service's requirements but is not present in base-ml.
Solution: Add the library to libs/requirements-ml.txt and rebuild base-ml.

Issue: Base image not found

Cause: The base image was not pushed to the registry, or the wrong version is referenced.
Solution: Run ./scripts/build-base-images.sh first.

Issue: Service image too large

Cause: Dependencies duplicated between the service requirements and the base image.
Solution: Remove deps already in base-ml from the service's requirements.txt.
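
The duplicate-dependency case can be caught mechanically by comparing package names (bash; the file paths are illustrations):

```shell
# List packages that appear both in the ML base and in a service's requirements
# (strip comments and version pins before comparing)
names() { grep -v '^\s*#' "$1" | cut -d'=' -f1 | sed 's/[<>~!].*//' | sort -u; }

comm -12 <(names libs/requirements-ml.txt) \
         <(names apps/svc_rag_indexer/requirements.txt)
```

Any package printed here can be deleted from the service's requirements.txt.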

Future Improvements

  1. base-runtime for non-ML services: Use base-runtime instead of building from scratch
  2. Multi-arch builds: Support ARM64 for Apple Silicon
  3. Automated base image updates: CI/CD pipeline to rebuild base images on dependency updates
  4. Layer analysis: Tools to analyze and optimize layer sizes