# Base Image Architecture
## Overview
To optimize Docker image sizes and build times, we use a **layered base image architecture**:
```
python:3.12-slim (150MB)
├─> base-runtime (300MB) - Core deps for ALL services
└─> base-ml (1.2GB) - ML deps (sentence-transformers, PyTorch, etc.)
    ├─> svc-ocr (1.25GB = base-ml + 50MB app)
    ├─> svc-rag-indexer (1.25GB = base-ml + 50MB app)
    └─> svc-rag-retriever (1.25GB = base-ml + 50MB app)
```
## Benefits
### 1. **Build ML Dependencies Once**
- Heavy ML libraries (PyTorch, transformers, sentence-transformers) are built once in `base-ml`
- All ML services reuse the same base image
- No need to rebuild 1GB+ of dependencies for each service
### 2. **Faster Builds**
- **Before**: Each ML service took 10-15 minutes to build
- **After**: ML services build in 1-2 minutes (only app code + small deps)
### 3. **Faster Pushes**
- **Before**: Pushing 1.3GB per service = 3.9GB total for 3 ML services
- **After**: Push base-ml once (1.2GB) + 3 small app layers (50MB each) = 1.35GB total
- **Savings**: ~65% less data to push (1.35GB vs 3.9GB), with a corresponding cut in push time
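The savings figure follows directly from the sizes above; a quick sanity-check of the arithmetic (sizes in MB, taken from this document):

```shell
# Sanity-check the push-size arithmetic (sizes in MB, from the doc above).
before=$((3 * 1300))          # before: 3 monolithic ML images at ~1.3GB each
after=$((1200 + 3 * 50))      # after: base-ml once + 3 app layers of ~50MB each
echo "before=${before}MB after=${after}MB"
awk -v b="$before" -v a="$after" 'BEGIN { printf "reduction=%.0f%%\n", (b - a) / b * 100 }'
```

This prints `before=3900MB after=1350MB` and `reduction=65%`, matching the savings claimed above.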
### 4. **Layer Caching**
- Docker reuses base-ml layers across all ML services
- Only the small application layer (~50MB) needs to be pushed/pulled
- Faster deployments and rollbacks
### 5. **Easy Updates**
- Update ML library versions in one place (`base-ml`)
- Rebuild base-ml once, then rebuild all ML services quickly
- Consistent ML library versions across all services
## Image Sizes
| Image Type | Size | Contents |
| ------------------ | ------- | --------------------------------------------------------------------------------------------- |
| **base-runtime**   | ~300MB  | FastAPI, uvicorn, database drivers, and client libraries for Redis, NATS, MinIO, Qdrant, etc. |
| **base-ml** | ~1.2GB | base-runtime + sentence-transformers, PyTorch, transformers, numpy, scikit-learn, spacy, nltk |
| **ML Service** | ~1.25GB | base-ml + service-specific deps (faiss, tiktoken, etc.) + app code (~50MB) |
| **Non-ML Service** | ~350MB | python:3.12-slim + base deps + service deps + app code |
## Architecture
### Base Images
#### 1. base-runtime
- **Location**: `infra/docker/base-runtime.Dockerfile`
- **Registry**: `gitea.harkon.co.uk/harkon/base-runtime:v1.0.1`
- **Contents**: Core dependencies for ALL services
- FastAPI, uvicorn, pydantic
- Database drivers (asyncpg, psycopg2, neo4j, redis)
- Object storage (minio)
- Vector DB (qdrant-client)
- Event bus (nats-py)
- Secrets (hvac)
- Monitoring (prometheus-client)
- HTTP client (httpx)
- Utilities (ulid-py, python-dateutil, orjson)
#### 2. base-ml
- **Location**: `infra/docker/base-ml.Dockerfile`
- **Registry**: `gitea.harkon.co.uk/harkon/base-ml:v1.0.1`
- **Contents**: base-runtime + ML dependencies
- sentence-transformers (includes PyTorch)
- transformers
- scikit-learn
- numpy
- spacy
- nltk
- fuzzywuzzy
- python-Levenshtein
### Service Images
#### ML Services (use base-ml)
1. **svc-ocr** - OCR and document AI
- Additional deps: pytesseract, PyMuPDF, pdf2image, Pillow, opencv-python-headless, torchvision
- System deps: tesseract-ocr, poppler-utils
2. **svc-rag-indexer** - Document indexing and embedding
- Additional deps: tiktoken, beautifulsoup4, faiss-cpu, python-docx, python-pptx, openpyxl, sparse-dot-topn
3. **svc-rag-retriever** - Semantic search and retrieval
- Additional deps: rank-bm25, faiss-cpu, sparse-dot-topn
#### Non-ML Services (use python:3.12-slim directly)
- All other services (svc-ingestion, svc-extract, svc-kg, svc-forms, etc.)
- Build from scratch with base requirements + service-specific deps
## Build Process
### Step 1: Build Base Images (One Time)
**IMPORTANT**: Build `base-ml` on the remote server to avoid pushing 1.2GB+ over the network!
#### Option A: Build base-ml on Remote Server (Recommended)
```bash
# Build base-ml on remote server (fast push to Gitea on same network)
./scripts/remote-build-base-ml.sh deploy@141.136.35.199 /home/deploy/ai-tax-agent gitea.harkon.co.uk v1.0.1 harkon
# Or use defaults (deploy user, /home/deploy/ai-tax-agent)
./scripts/remote-build-base-ml.sh
```
This will:
1. Sync code to remote server
2. Build `base-ml` on remote (~1.2GB, 10-15 min)
3. Push to Gitea from remote (fast, same network)
**Why build base-ml remotely?**
- ✅ Faster push to Gitea (same datacenter/network)
- ✅ Saves local network bandwidth
- ✅ Image is cached on remote server for faster service builds
- ✅ Only need to do this once
**Time**: 10-15 minutes (one time only)
#### Option B: Build Locally (Not Recommended for base-ml)
```bash
# Build both base images locally
./scripts/build-base-images.sh gitea.harkon.co.uk v1.0.1 harkon
```
This builds:
- `gitea.harkon.co.uk/harkon/base-runtime:v1.0.1` (~300MB)
- `gitea.harkon.co.uk/harkon/base-ml:v1.0.1` (~1.2GB)
**Note**: Pushing 1.2GB base-ml from local machine is slow and may fail due to network issues.
### Step 2: Build Service Images
```bash
# Build and push all services
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 harkon
```
ML services will:
1. Pull `base-ml:v1.0.1` from registry (if not cached)
2. Install service-specific deps (~10-20 packages)
3. Copy application code
4. Build final image (~1.25GB)
**Time per ML service**: 1-2 minutes (vs 10-15 minutes before)
### Step 3: Update Base Images (When Needed)
When you need to update ML library versions:
```bash
# 1. Update libs/requirements-ml.txt
vim libs/requirements-ml.txt
# 2. Rebuild base-ml with new version
./scripts/build-base-images.sh gitea.harkon.co.uk v1.0.2 harkon
# 3. Update service Dockerfiles to use new base version
# Change: ARG BASE_VERSION=v1.0.2
# 4. Rebuild ML services
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.2 harkon
```
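The Dockerfile edits in step 3 can be scripted rather than done by hand. A minimal sketch (the helper name is hypothetical; it assumes each ML service Dockerfile declares a single `ARG BASE_VERSION=...` line, as in the ML Dockerfile pattern in this document):

```shell
# Hypothetical helper: rewrite the ARG BASE_VERSION line in each given
# Dockerfile. Assumes exactly one `ARG BASE_VERSION=...` declaration per file.
bump_base_version() {
  local new_version="$1"; shift
  local f
  for f in "$@"; do
    sed -i.bak -E "s|^ARG BASE_VERSION=.*$|ARG BASE_VERSION=${new_version}|" "$f" \
      && rm -f "${f}.bak"
  done
}

# Example (paths illustrative):
#   bump_base_version v1.0.2 apps/svc-ocr/Dockerfile apps/svc-rag-indexer/Dockerfile
```

`sed -i.bak` keeps the edit portable across GNU and BSD sed; the backup file is removed once the substitution succeeds.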
## Requirements Files
### libs/requirements-base.txt
Core dependencies for ALL services (included in base-runtime and base-ml)
### libs/requirements-ml.txt
ML dependencies (included in base-ml only)
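For illustration, such a file might look like the following. The package names come from the base-ml contents listed above; this is a sketch, not the project's actual file, and in practice each entry would carry a version pin:

```
sentence-transformers
transformers
scikit-learn
numpy
spacy
nltk
fuzzywuzzy
python-Levenshtein
```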
### apps/svc\_\*/requirements.txt
Service-specific dependencies:
- **ML services**: Only additional deps NOT in base-ml (e.g., faiss-cpu, tiktoken)
- **Non-ML services**: Service-specific deps (e.g., aiofiles, openai, anthropic)
## Dockerfile Templates
### ML Service Dockerfile Pattern
```dockerfile
# Use pre-built ML base image
ARG REGISTRY=gitea.harkon.co.uk
ARG OWNER=harkon
ARG BASE_VERSION=v1.0.1
FROM ${REGISTRY}/${OWNER}/base-ml:${BASE_VERSION}
USER root
WORKDIR /app
# Install service-specific deps (minimal)
COPY apps/SERVICE_NAME/requirements.txt /tmp/service-requirements.txt
RUN pip install --no-cache-dir -r /tmp/service-requirements.txt
# Copy app code
COPY libs/ ./libs/
COPY apps/SERVICE_NAME/ ./apps/SERVICE_NAME/
RUN chown -R appuser:appuser /app
USER appuser
# Health check, expose, CMD...
```
### Non-ML Service Dockerfile Pattern
```dockerfile
# Multi-stage build from scratch
FROM python:3.12-slim AS builder
# Install build deps
RUN apt-get update && apt-get install -y build-essential curl && rm -rf /var/lib/apt/lists/*
# Create venv and install deps
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/SERVICE_NAME/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
# Production stage
FROM python:3.12-slim
# ... copy venv, app code, etc.
```
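The elided production stage could look like the following — a sketch assuming the same non-root `appuser` convention as the ML pattern above, with health check and CMD omitted as in both templates:

```dockerfile
# Production stage (illustrative completion of the pattern above)
FROM python:3.12-slim
WORKDIR /app
# Reuse the venv built in the builder stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY libs/ ./libs/
COPY apps/SERVICE_NAME/ ./apps/SERVICE_NAME/
# Run as a non-root user, mirroring the ML pattern
RUN useradd --create-home appuser && chown -R appuser:appuser /app
USER appuser
# Health check, expose, CMD...
```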
## Comparison: Before vs After
### Before (Monolithic Approach)
```
Each ML service:
- Build time: 10-15 minutes
- Image size: 1.6GB
- Push time: 5-10 minutes
- Total for 3 services: 30-45 min build + 15-30 min push = 45-75 minutes
```
### After (Base Image Approach)
```
Base-ml (one time):
- Build time: 10-15 minutes
- Image size: 1.2GB
- Push time: 5-10 minutes
Each ML service:
- Build time: 1-2 minutes
- Image size: 1.25GB (but only 50MB new layers)
- Push time: 30-60 seconds (only new layers)
- Total for 3 services: 3-6 min build + 2-3 min push = 5-9 minutes
Total time savings: 40-66 minutes (89% faster!)
```
## Best Practices
1. **Version base images**: Always tag with version (e.g., v1.0.1, v1.0.2)
2. **Update base images infrequently**: Only when ML library versions need updating
3. **Keep service requirements minimal**: Only add deps NOT in base-ml
4. **Use build args**: Make registry/owner/version configurable
5. **Test base images**: Ensure health checks pass before building services
6. **Document changes**: Update this file when modifying base images
## Troubleshooting
### Issue: Service can't find ML library
**Cause**: The library was removed from the service's requirements on the assumption it ships in base-ml, but it is not actually present there
**Solution**: Add library to `libs/requirements-ml.txt` and rebuild base-ml
### Issue: Base image not found
**Cause**: Base image not pushed to registry or wrong version
**Solution**: Run `./scripts/build-base-images.sh` first
### Issue: Service image too large
**Cause**: Duplicate dependencies in service requirements
**Solution**: Remove deps already in base-ml from service requirements.txt
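Finding those duplicates can be automated. A hedged sketch (the helper name is hypothetical; it assumes plain `name` or `name==version` lines, one per line, with no `-r` includes or comment lines):

```shell
# Hypothetical helper: print package names that appear in both the base-ml
# requirements and a service's requirements (candidates for removal).
find_duplicate_deps() {
  local base="$1" service="$2"
  comm -12 \
    <(sed -E 's/[=<>!~;[].*//' "$base"    | tr '[:upper:]' '[:lower:]' | sort -u) \
    <(sed -E 's/[=<>!~;[].*//' "$service" | tr '[:upper:]' '[:lower:]' | sort -u)
}

# Example (paths illustrative):
#   find_duplicate_deps libs/requirements-ml.txt apps/svc-rag-indexer/requirements.txt
```

The `sed` strips version specifiers, `tr` lowercases names (pip treats package names case-insensitively), and `comm -12` prints only lines common to both sorted lists.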
## Future Improvements
1. **base-runtime for non-ML services**: Use base-runtime instead of building from scratch
2. **Multi-arch builds**: Support ARM64 for Apple Silicon
3. **Automated base image updates**: CI/CD pipeline to rebuild base images on dependency updates
4. **Layer analysis**: Tools to analyze and optimize layer sizes