ai-tax-agent/docs/IMAGE_SIZE_OPTIMIZATION.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00

Docker Image Size Optimization

Problem Identified

Initial Docker images were 1.6GB each, which is unacceptably large for microservices.

Root Causes

  1. Heavy ML dependencies in all services - sentence-transformers (~2GB with PyTorch) was included in base requirements
  2. Development dependencies in production - pytest, mypy, black, ruff, etc. were being installed in Docker images
  3. Unnecessary dependencies - Many services don't need ML but were getting all ML libraries
  4. Redundant dependencies - Multiple overlapping packages (transformers + sentence-transformers both include PyTorch)

Solution

1. Split Requirements Files

Before: Single libs/requirements.txt with everything (97 lines)

After: Modular requirements:

  • libs/requirements-base.txt - Core dependencies (~30 packages, ~200MB)
  • libs/requirements-ml.txt - ML dependencies (only for 3 services, ~2GB)
  • libs/requirements-pdf.txt - PDF processing (only for services that need it)
  • libs/requirements-rdf.txt - RDF/semantic web (only for KG service)
  • libs/requirements-dev.txt - Development only (NOT in Docker)

2. Service-Specific Optimization

Services WITHOUT ML (11 services) - ~300MB each

  • svc-ingestion
  • svc-extract
  • svc-forms
  • svc-hmrc
  • svc-rpa
  • svc-normalize-map
  • svc-reason
  • svc-firm-connectors
  • svc-coverage
  • svc-kg
  • ui-review

Dockerfile pattern:

COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

Services WITH ML (3 services) - ~1.2GB each

  • svc-ocr (needs transformers for document AI)
  • svc-rag-indexer (needs sentence-transformers for embeddings)
  • svc-rag-retriever (needs sentence-transformers for retrieval)

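Dockerfile pattern (identical to the non-ML pattern; the ML packages come in through each service's own requirements.txt):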

COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

3. Additional Optimizations

Removed from Base Requirements

  • sentence-transformers - Only 3 services need it
  • transformers - Only 3 services need it
  • spacy - Only 2 services need it
  • nltk - Only 2 services need it
  • scikit-learn - Not needed by most services
  • numpy - Only needed by ML services
  • aiokafka - Using NATS instead
  • boto3/botocore - Not needed
  • asyncio-mqtt - Not used
  • ipaddress - Part of the Python standard library since 3.3, so the PyPI backport is unnecessary
  • All OpenTelemetry packages - Moved to dev
  • All testing packages - Moved to dev
  • All code quality tools - Moved to dev
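
To keep removed packages from creeping back into the base file, a small check could run in CI. The following is a minimal sketch, not part of the project: the names `BANNED_IN_BASE`, `package_name`, and `find_banned` are hypothetical, and the banned set is only an illustrative subset of the list above.

```python
import re
from typing import Optional

# Illustrative subset of the packages listed above as removed from base.
# This set is an assumption for the example, not the project's actual policy.
BANNED_IN_BASE = {
    "sentence-transformers", "transformers", "spacy", "nltk",
    "scikit-learn", "numpy", "aiokafka", "boto3", "botocore",
    "asyncio-mqtt", "ipaddress", "pytest", "mypy", "black", "ruff",
}


def package_name(line: str) -> Optional[str]:
    """Return the normalized package name on a requirements line, or None."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or line.startswith("-"):  # skip blanks and options such as -r
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if match is None:
        return None
    # Normalize as package indexes do: lowercase, runs of -, _, . become "-"
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def find_banned(requirements_text: str, banned=BANNED_IN_BASE) -> list:
    """List banned packages that appear in the given requirements text."""
    names = (package_name(line) for line in requirements_text.splitlines())
    return sorted({name for name in names if name in banned})
```

A CI step might read libs/requirements-base.txt, pass its contents to `find_banned`, and fail the build if the returned list is non-empty.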

Optimized in Service Requirements

  • opencv-python → opencv-python-headless (smaller, no GUI)
  • langchain → tiktoken (just the tokenizer, not the whole framework)
  • Removed presidio (PII detection) - can be added later if needed
  • Removed layoutparser - using transformers directly
  • Removed cohere - using OpenAI/Anthropic only

4. Expected Results

| Service Type | Before | After | Savings |
|---|---|---|---|
| Non-ML services (11) | 1.6GB | ~300MB | 81% reduction |
| ML services (3) | 1.6GB | ~1.2GB | 25% reduction |
| Total (14 services) | 22.4GB | 6.9GB | 69% reduction |
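
The totals above follow directly from the per-image figures. As a worked check, the sketch below recomputes them; the `fleet` dictionary and helper names are hypothetical, and the inputs are the document's own estimates rather than measured sizes.

```python
# Per-image size estimates in GB, taken from the figures above.
fleet = {
    "non-ML": {"count": 11, "before_gb": 1.6, "after_gb": 0.3},
    "ML": {"count": 3, "before_gb": 1.6, "after_gb": 1.2},
}


def totals(groups):
    """Return (total_before_gb, total_after_gb) across all service groups."""
    before = sum(g["count"] * g["before_gb"] for g in groups.values())
    after = sum(g["count"] * g["after_gb"] for g in groups.values())
    return before, after


def reduction_pct(before, after):
    """Percentage saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


before, after = totals(fleet)
print(f"total: {before:.1f}GB -> {after:.1f}GB "
      f"({reduction_pct(before, after):.0f}% reduction)")
```

With these inputs the totals come to 14 × 1.6 = 22.4GB before and 11 × 0.3 + 3 × 1.2 = 6.9GB after, which is where the 69% figure comes from.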

Implementation Checklist

Phase 1: Requirements Files

  • Create libs/requirements-base.txt
  • Create libs/requirements-ml.txt
  • Create libs/requirements-pdf.txt
  • Create libs/requirements-rdf.txt
  • Create libs/requirements-dev.txt
  • Update libs/requirements.txt to point to base

Phase 2: Service Requirements

  • Optimize svc_ingestion/requirements.txt
  • Optimize svc_extract/requirements.txt
  • Optimize svc_ocr/requirements.txt
  • Optimize svc_rag_retriever/requirements.txt
  • Optimize svc_rag_indexer/requirements.txt

Phase 3: Dockerfiles 🟡

  • Update svc_ingestion/Dockerfile
  • Update svc_extract/Dockerfile
  • Update svc_kg/Dockerfile
  • Update svc_rag_retriever/Dockerfile
  • Update svc_rag_indexer/Dockerfile
  • Update svc_forms/Dockerfile
  • Update svc_hmrc/Dockerfile
  • Update svc_ocr/Dockerfile
  • Update svc_rpa/Dockerfile
  • Update svc_normalize_map/Dockerfile
  • Update svc_reason/Dockerfile
  • Update svc_firm_connectors/Dockerfile
  • Update svc_coverage/Dockerfile
  • Update ui_review/Dockerfile

Phase 4: Rebuild & Test

  • Clean old images: docker system prune -a
  • Rebuild all images
  • Verify image sizes: docker images | grep gitea.harkon.co.uk
  • Test services locally
  • Push to registry

Dockerfile Template

For Non-ML Services (Most Services)

# Multi-stage build for svc_xxx
FROM python:3.12-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy requirements and install dependencies
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

# Production stage
FROM python:3.12-slim

# Install runtime dependencies (curl is used by the health check) and create the non-root user
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r appuser \
    && useradd -r -g appuser appuser

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY libs/ ./libs/
COPY apps/svc_xxx/ ./apps/svc_xxx/

# Give the non-root user (created above) ownership of the application directory
RUN chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "-m", "uvicorn", "apps.svc_xxx.main:app", "--host", "0.0.0.0", "--port", "8000"]

For ML Services (OCR, RAG Indexer, RAG Retriever)

Use the same template. The ML dependencies are listed in each service's own requirements.txt, so the Dockerfile itself does not change.

Verification Commands

# Check image sizes (--format avoids miscounting the multi-word CREATED column)
docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | grep gitea.harkon.co.uk

# Check what's installed in an image
docker run --rm gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0 pip list

# Compare sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep gitea

# Check layer sizes
docker history gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0
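
To turn the size check into an automated gate, the output of `docker images --format '{{.Repository}}:{{.Tag}}\t{{.Size}}'` could be parsed and compared against a budget. This is a rough sketch under that assumption: `parse_size` and `over_budget` are hypothetical helpers, and the budget value is whatever threshold the team chooses.

```python
import re

# Decimal multipliers, matching the units Docker prints in its SIZE column.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> int:
    """Convert a size string such as '298MB' or '1.21GB' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return round(float(match.group(1)) * _UNITS[match.group(2).upper()])


def over_budget(listing: str, budget_bytes: int) -> list:
    """Return (image, size_bytes) for each listed image above the budget.

    `listing` is expected to hold one 'repository:tag<TAB>size' pair per line.
    """
    offenders = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        image, size = line.rsplit("\t", 1)
        size_bytes = parse_size(size)
        if size_bytes > budget_bytes:
            offenders.append((image, size_bytes))
    return offenders
```

For example, calling `over_budget(listing, parse_size("350MB"))` on the non-ML images would flag any service that has drifted back above the ~300MB target.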

Next Steps

  1. Update all Dockerfiles to use requirements-base.txt
  2. Clean Docker cache: docker system prune -a --volumes
  3. Rebuild images: ./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 blue
  4. Verify sizes: Should see ~300MB for most services, ~1.2GB for ML services
  5. Update deployment: Change version to v1.0.1 in production compose files

Benefits

  1. Faster builds - Less to download and install
  2. Faster deployments - Smaller images to push/pull
  3. Lower storage costs - 69% reduction in total storage
  4. Faster startup - Less to load into memory
  5. Better security - Fewer dependencies = smaller attack surface
  6. Easier maintenance - Clear separation of concerns

Notes

  • Development dependencies are now in libs/requirements-dev.txt - install locally with pip install -r libs/requirements-dev.txt
  • ML services still need PyTorch, but we're using CPU-only versions where possible
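  • Consider using python:3.12-alpine for even smaller images, but note that Alpine uses musl libc, so packages without musl-compatible wheels must be compiled from source and need extra build dependencies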
  • Monitor for any missing dependencies after deployment
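
One way to catch missing dependencies before deployment is an import smoke test run inside each freshly built image. The sketch below is an assumption about how that could look; `missing_modules` is a hypothetical helper, and the module list would need to be filled in per service.

```python
import importlib


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            missing.append(name)
    return missing
```

Running this inside the container with the top-level modules a service imports (for example via `docker run --rm <image> python smoke_check.py`, where the script name is hypothetical) and failing when the result is non-empty would surface a trimmed-away dependency at build time rather than in production.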