# Docker Image Size Optimization

## Problem Identified

Initial Docker images were **1.6GB** each, which is unacceptably large for microservices.

### Root Causes

1. **Heavy ML dependencies in all services** - `sentence-transformers` (~2GB with PyTorch) was included in base requirements
2. **Development dependencies in production** - pytest, mypy, black, ruff, etc. were being installed in Docker images
3. **Unnecessary dependencies** - Many services don't need ML but were getting all ML libraries
4. **Redundant dependencies** - Multiple overlapping packages (transformers + sentence-transformers both include PyTorch)

## Solution

### 1. Split Requirements Files

**Before:** Single `libs/requirements.txt` with everything (97 lines)

**After:** Modular requirements:

- `libs/requirements-base.txt` - Core dependencies (~30 packages, **~200MB**)
- `libs/requirements-ml.txt` - ML dependencies (only for 3 services, **~2GB**)
- `libs/requirements-pdf.txt` - PDF processing (only for services that need it)
- `libs/requirements-rdf.txt` - RDF/semantic web (only for KG service)
- `libs/requirements-dev.txt` - Development only (NOT in Docker)

### 2. Service-Specific Optimization

#### Services WITHOUT ML (11 services) - **~300MB each**

- svc-ingestion
- svc-extract
- svc-forms
- svc-hmrc
- svc-rpa
- svc-normalize-map
- svc-reason
- svc-firm-connectors
- svc-coverage
- svc-kg
- ui-review

**Dockerfile pattern:**

```dockerfile
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```

#### Services WITH ML (3 services) - **~1.2GB each**

- svc-ocr (needs transformers for document AI)
- svc-rag-indexer (needs sentence-transformers for embeddings)
- svc-rag-retriever (needs sentence-transformers for retrieval)

**Dockerfile pattern:**

The pattern is identical to the non-ML one; the ML dependencies come from each service's own `requirements.txt`.

```dockerfile
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```

### 3. Additional Optimizations

#### Removed from Base Requirements

- ❌ `sentence-transformers` - Only 3 services need it
- ❌ `transformers` - Only 3 services need it
- ❌ `spacy` - Only 2 services need it
- ❌ `nltk` - Only 2 services need it
- ❌ `scikit-learn` - Not needed by most services
- ❌ `numpy` - Only needed by ML services
- ❌ `aiokafka` - Using NATS instead
- ❌ `boto3/botocore` - Not needed
- ❌ `asyncio-mqtt` - Not used
- ❌ `ipaddress` - Built-in to Python
- ❌ All OpenTelemetry packages - Moved to dev
- ❌ All testing packages - Moved to dev
- ❌ All code quality tools - Moved to dev

#### Optimized in Service Requirements

- ✅ `opencv-python` → `opencv-python-headless` (smaller, no GUI)
- ✅ `langchain` → `tiktoken` (just the tokenizer, not the whole framework)
- ✅ Removed `presidio` (PII detection) - can be added later if needed
- ✅ Removed `layoutparser` - using transformers directly
- ✅ Removed `cohere` - using OpenAI/Anthropic only

### 4. Expected Results

| Service Type | Before | After | Savings |
|--------------|--------|-------|---------|
| Non-ML services (11) | 1.6GB each | ~300MB each | **81% reduction** |
| ML services (3) | 1.6GB each | ~1.2GB each | **25% reduction** |
| **Total (14 services)** | **22.4GB** | **6.9GB** | **69% reduction** |

## Implementation Checklist

### Phase 1: Requirements Files ✅

- [x] Create `libs/requirements-base.txt`
- [x] Create `libs/requirements-ml.txt`
- [x] Create `libs/requirements-pdf.txt`
- [x] Create `libs/requirements-rdf.txt`
- [x] Create `libs/requirements-dev.txt`
- [x] Update `libs/requirements.txt` to point to base

### Phase 2: Service Requirements ✅

- [x] Optimize `svc_ingestion/requirements.txt`
- [x] Optimize `svc_extract/requirements.txt`
- [x] Optimize `svc_ocr/requirements.txt`
- [x] Optimize `svc_rag_retriever/requirements.txt`
- [x] Optimize `svc_rag_indexer/requirements.txt`

### Phase 3: Dockerfiles 🟡

- [x] Update `svc_ingestion/Dockerfile`
- [ ] Update `svc_extract/Dockerfile`
- [ ] Update `svc_kg/Dockerfile`
- [ ] Update `svc_rag_retriever/Dockerfile`
- [ ] Update `svc_rag_indexer/Dockerfile`
- [ ] Update `svc_forms/Dockerfile`
- [ ] Update `svc_hmrc/Dockerfile`
- [ ] Update `svc_ocr/Dockerfile`
- [ ] Update `svc_rpa/Dockerfile`
- [ ] Update `svc_normalize_map/Dockerfile`
- [ ] Update `svc_reason/Dockerfile`
- [ ] Update `svc_firm_connectors/Dockerfile`
- [ ] Update `svc_coverage/Dockerfile`
- [ ] Update `ui_review/Dockerfile`

### Phase 4: Rebuild & Test

- [ ] Clean old images: `docker system prune -a`
- [ ] Rebuild all images
- [ ] Verify image sizes: `docker images | grep gitea.harkon.co.uk`
- [ ] Test services locally
- [ ] Push to registry

## Dockerfile Template

### For Non-ML Services (Most Services)

```dockerfile
# Multi-stage build for svc_xxx
FROM python:3.12-slim AS builder

# Install build dependencies (builder stage only; not carried into the final image)
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy requirements and install dependencies
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

# Production stage
FROM python:3.12-slim

# Install runtime dependencies (curl is used by the health check)
# and create the non-root user
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r appuser \
    && useradd -r -g appuser appuser

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY libs/ ./libs/
COPY apps/svc_xxx/ ./apps/svc_xxx/

# Give the non-root user ownership of the app and switch to it
RUN chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "-m", "uvicorn", "apps.svc_xxx.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### For ML Services (OCR, RAG Indexer, RAG Retriever)

Same as above, but service requirements already include ML dependencies.

## Verification Commands

```bash
# Check image sizes (SIZE is the last column, so use $NF rather than a fixed
# field number: the CREATED column spans a variable number of words)
docker images | grep gitea.harkon.co.uk | awk '{print $1":"$2, $NF}'

# Check what's installed in an image
docker run --rm gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0 pip list

# Compare sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep gitea

# Check layer sizes
docker history gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0
```

## Next Steps

1. **Update all Dockerfiles** to use `requirements-base.txt`
2. **Clean Docker cache**: `docker system prune -a --volumes` (note that `--volumes` also removes unused volumes, so check that nothing important is stored in them first)
3. **Rebuild images**: `./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 blue`
4. **Verify sizes**: Should see ~300MB for most services, ~1.2GB for ML services
5. **Update deployment**: Change version to `v1.0.1` in production compose files

## Benefits

1. **Faster builds** - Less to download and install
2. **Faster deployments** - Smaller images to push/pull
3. **Lower storage costs** - 69% reduction in total storage
4. **Faster startup** - Less to load into memory
5. **Better security** - Fewer dependencies = smaller attack surface
6. **Easier maintenance** - Clear separation of concerns

## Notes

- Development dependencies are now in `libs/requirements-dev.txt` - install locally with `pip install -r libs/requirements-dev.txt`
- ML services still need PyTorch, but we're using CPU-only versions where possible
- Consider using `python:3.12-alpine` for even smaller images, but it requires more build dependencies: Alpine uses musl libc, so packages without musl-compatible wheels must be compiled from source
- Monitor for any missing dependencies after deployment
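
The size targets above (~300MB for most services, ~1.2GB for ML services) could also be checked automatically after a rebuild. The following is a minimal sketch, not part of the existing scripts: the function names (`size_to_mb`, `budget_for`, `check_images`) and the budget values (400MB and 1300MB, set slightly above the targets) are hypothetical and should be adjusted to taste. It assumes sizes in the decimal units that `docker images` prints (`kB`, `MB`, `GB`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for enforcing image size budgets.
# Names and thresholds are illustrative assumptions, not existing project code.

# Convert a Docker size string (e.g. "312MB", "1.2GB") to megabytes,
# rounded to the nearest whole number.
size_to_mb() {
  local size="$1"
  case "$size" in
    *GB) awk -v v="${size%GB}" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
    *MB) awk -v v="${size%MB}" 'BEGIN { printf "%d\n", v + 0.5 }' ;;
    *kB) echo 0 ;;
    *)   echo "unrecognised size: $size" >&2; return 1 ;;
  esac
}

# Pick a budget in MB from the image name: the three ML services get a
# larger allowance than everything else (assumed values).
budget_for() {
  case "$1" in
    *svc-ocr*|*svc-rag-indexer*|*svc-rag-retriever*) echo 1300 ;;
    *) echo 400 ;;
  esac
}

# Read "repository:tag size" lines from stdin and report each image.
# Returns non-zero if any image is over budget or has an unparseable size.
check_images() {
  local failed=0 name size size_mb budget
  while read -r name size; do
    budget="$(budget_for "$name")"
    if ! size_mb="$(size_to_mb "$size")"; then
      failed=1
      continue
    fi
    if [ "$size_mb" -le "$budget" ]; then
      echo "OK   $name ${size} (budget ${budget}MB)"
    else
      echo "FAIL $name ${size} (budget ${budget}MB)"
      failed=1
    fi
  done
  return "$failed"
}
```

After sourcing the file, it could be fed from the same listing used in the verification commands, for example `docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | grep gitea.harkon.co.uk | check_images`. A non-zero exit status makes it straightforward to fail a CI step when an image regresses past its budget.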