ai-tax-agent/docs/IMAGE_SIZE_OPTIMIZATION.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00

# Docker Image Size Optimization
## Problem Identified
Initial Docker images were **1.6GB** each, which is unacceptably large for microservices.
### Root Causes
1. **Heavy ML dependencies in all services** - `sentence-transformers` (~2GB with PyTorch) was included in base requirements
2. **Development dependencies in production** - pytest, mypy, black, ruff, etc. were being installed in Docker images
3. **Unnecessary dependencies** - Many services don't need ML but were getting all ML libraries
4. **Redundant dependencies** - Overlapping packages were listed side by side (`sentence-transformers` already depends on `transformers` and PyTorch)
## Solution
### 1. Split Requirements Files
**Before:** Single `libs/requirements.txt` with everything (97 lines)
**After:** Modular requirements:
- `libs/requirements-base.txt` - Core dependencies (~30 packages, **~200MB**)
- `libs/requirements-ml.txt` - ML dependencies (only for 3 services, **~2GB**)
- `libs/requirements-pdf.txt` - PDF processing (only for services that need it)
- `libs/requirements-rdf.txt` - RDF/semantic web (only for KG service)
- `libs/requirements-dev.txt` - Development only (NOT in Docker)
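Once the files are split, it helps to confirm that no development tool has crept back into a file that gets installed in Docker. The sketch below is one way to do that; `find_dev_packages` is a hypothetical helper, and its package list is illustrative (only the tools named in this document), so extend it to match the real contents of `libs/requirements-dev.txt`.

```bash
# Hypothetical helper: print any development-only packages found in a
# requirements file. The package list is an assumption, not the full dev set.
find_dev_packages() {
    local req_file="$1"
    # Match an exact package name at the start of a line, followed by a
    # version specifier, extras bracket, whitespace, or end of line.
    # "|| true" keeps the exit status at 0 when nothing is found.
    grep -E -i '^(pytest|mypy|black|ruff)([^A-Za-z0-9_.-]|$)' "$req_file" || true
}
```

Running `find_dev_packages libs/requirements-base.txt` should print nothing if the split is clean. Plugins such as `pytest-cov` are not matched by this pattern and would need their own entries.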
### 2. Service-Specific Optimization
#### Services WITHOUT ML (11 services) - **~300MB each**
- svc-ingestion
- svc-extract
- svc-forms
- svc-hmrc
- svc-rpa
- svc-normalize-map
- svc-reason
- svc-firm-connectors
- svc-coverage
- svc-kg
- ui-review
**Dockerfile pattern:**
```dockerfile
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```
#### Services WITH ML (3 services) - **~1.2GB each**
- svc-ocr (needs transformers for document AI)
- svc-rag-indexer (needs sentence-transformers for embeddings)
- svc-rag-retriever (needs sentence-transformers for retrieval)
**Dockerfile pattern:**
```dockerfile
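# Same steps as the non-ML pattern: the ML packages come from the
# service's own requirements.txt, not from an extra install step
COPY libs/requirements-base.txt /tmp/libs-requirements.txt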
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```
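The two service groups above can be captured in a small lookup, which is handy when scripting size checks. This is a minimal sketch: `expected_image_size` is a hypothetical name, and the values are the approximate targets stated in this section, not measured sizes.

```bash
# Sketch: map a service name to the approximate image-size target above.
expected_image_size() {
    case "$1" in
        svc-ocr|svc-rag-indexer|svc-rag-retriever) echo "1.2GB" ;;  # ML services
        *) echo "300MB" ;;                                           # all other services
    esac
}
```

For example, `expected_image_size svc-kg` falls through to the non-ML target.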
### 3. Additional Optimizations
#### Removed from Base Requirements
- ❌ `sentence-transformers` - Only 3 services need it
- ❌ `transformers` - Only 3 services need it
- ❌ `spacy` - Only 2 services need it
- ❌ `nltk` - Only 2 services need it
- ❌ `scikit-learn` - Not needed by most services
- ❌ `numpy` - Only needed by ML services
- ❌ `aiokafka` - Using NATS instead
- ❌ `boto3`/`botocore` - Not needed
- ❌ `asyncio-mqtt` - Not used
- ❌ `ipaddress` - Built into Python (standard library)
- ❌ All OpenTelemetry packages - Moved to dev
- ❌ All testing packages - Moved to dev
- ❌ All code quality tools - Moved to dev
#### Optimized in Service Requirements
- ✅ `opencv-python` → `opencv-python-headless` (smaller, no GUI)
- ✅ `langchain` → `tiktoken` (just the tokenizer, not the whole framework)
- ✅ Removed `presidio` (PII detection) - can be added later if needed
- ✅ Removed `layoutparser` - using transformers directly
- ✅ Removed `cohere` - using OpenAI/Anthropic only
### 4. Expected Results
| Service Type | Before | After | Savings |
|--------------|--------|-------|---------|
| Non-ML services (11) | 1.6GB | ~300MB | **81% reduction** |
| ML services (3) | 1.6GB | ~1.2GB | **25% reduction** |
| **Total (14 services)** | **22.4GB** | **6.9GB** | **69% reduction** |
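The totals in the table follow directly from the per-image estimates. As a quick check of the arithmetic, the sketch below recomputes them with `awk`; `image_totals` is a hypothetical name, and the inputs are the estimates from the table, not measurements.

```bash
# Recompute the table's totals from the per-image estimates.
image_totals() {
    awk 'BEGIN {
        before = 14 * 1.6              # 14 images at 1.6GB each
        after  = 11 * 0.3 + 3 * 1.2    # 11 non-ML at ~300MB, 3 ML at ~1.2GB
        printf "before=%.1fGB after=%.1fGB reduction=%.0f%%\n",
               before, after, (before - after) / before * 100
    }'
}
```

Calling `image_totals` prints `before=22.4GB after=6.9GB reduction=69%`, matching the last row.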
## Implementation Checklist
### Phase 1: Requirements Files ✅
- [x] Create `libs/requirements-base.txt`
- [x] Create `libs/requirements-ml.txt`
- [x] Create `libs/requirements-pdf.txt`
- [x] Create `libs/requirements-rdf.txt`
- [x] Create `libs/requirements-dev.txt`
- [x] Update `libs/requirements.txt` to point to base
### Phase 2: Service Requirements ✅
- [x] Optimize `svc_ingestion/requirements.txt`
- [x] Optimize `svc_extract/requirements.txt`
- [x] Optimize `svc_ocr/requirements.txt`
- [x] Optimize `svc_rag_retriever/requirements.txt`
- [x] Optimize `svc_rag_indexer/requirements.txt`
### Phase 3: Dockerfiles 🟡
- [x] Update `svc_ingestion/Dockerfile`
- [ ] Update `svc_extract/Dockerfile`
- [ ] Update `svc_kg/Dockerfile`
- [ ] Update `svc_rag_retriever/Dockerfile`
- [ ] Update `svc_rag_indexer/Dockerfile`
- [ ] Update `svc_forms/Dockerfile`
- [ ] Update `svc_hmrc/Dockerfile`
- [ ] Update `svc_ocr/Dockerfile`
- [ ] Update `svc_rpa/Dockerfile`
- [ ] Update `svc_normalize_map/Dockerfile`
- [ ] Update `svc_reason/Dockerfile`
- [ ] Update `svc_firm_connectors/Dockerfile`
- [ ] Update `svc_coverage/Dockerfile`
- [ ] Update `ui_review/Dockerfile`
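Progress on this phase can be tracked by listing the Dockerfiles that do not yet reference `libs/requirements-base.txt`. The following is a sketch under assumptions: `dockerfiles_to_update` is a hypothetical helper, and it assumes each service keeps its Dockerfile under `apps/<service>/Dockerfile`, which this document implies but does not state.

```bash
# Hypothetical helper: list Dockerfiles under a directory (default: apps)
# that do not yet reference libs/requirements-base.txt.
dockerfiles_to_update() {
    local apps_dir="${1:-apps}"
    # grep -L prints the names of files that contain no matching line.
    find "$apps_dir" -name Dockerfile -type f \
        -exec grep -L 'libs/requirements-base\.txt' {} + | sort
}
```

An empty result from `dockerfiles_to_update apps` would mean every box in this phase can be ticked.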
### Phase 4: Rebuild & Test
- [ ] Clean old images: `docker system prune -a`
- [ ] Rebuild all images
- [ ] Verify image sizes: `docker images | grep gitea.harkon.co.uk`
- [ ] Test services locally
- [ ] Push to registry
## Dockerfile Template
### For Non-ML Services (Most Services)
```dockerfile
# Multi-stage build for svc_xxx
FROM python:3.12-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy requirements and install dependencies
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

# Production stage
FROM python:3.12-slim

# Install runtime dependencies (curl for the health check) and create the non-root user
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r appuser \
    && useradd -r -g appuser appuser

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY libs/ ./libs/
COPY apps/svc_xxx/ ./apps/svc_xxx/

# Give the non-root user ownership of /app and switch to it
RUN chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "-m", "uvicorn", "apps.svc_xxx.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### For ML Services (OCR, RAG Indexer, RAG Retriever)
Use the same template. No extra install step is needed because each ML service's own `requirements.txt` already lists its ML dependencies.
## Verification Commands
```bash
# Check image sizes
docker images | grep gitea.harkon.co.uk | awk '{print $1":"$2, $NF}'  # $NF = last column (SIZE), whatever the width of CREATED
# Check what's installed in an image
docker run --rm gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0 pip list
# Compare sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep gitea
# Check layer sizes
docker history gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0
```
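To turn the size check into a pass/fail step, the listing can be filtered against a budget. The sketch below is one possible approach: `over_budget` is a hypothetical helper that reads `<image> <size>` lines on stdin and prints the images above a limit in MB. It assumes sizes carry the `kB`/`MB`/`GB` suffixes that `docker images` prints and treats them as decimal units.

```bash
# Hypothetical helper: print images whose size exceeds a budget given in MB.
# Input: one "<image> <size>" pair per line on stdin.
over_budget() {
    local budget_mb="$1"
    awk -v budget="$budget_mb" '
        {
            size = $2
            mb = size + 0                       # numeric prefix, e.g. "1.2GB" -> 1.2
            if (size ~ /GB$/)      mb *= 1000   # assumes decimal units
            else if (size ~ /kB$/) mb /= 1000
            if (mb > budget) print $1, size
        }'
}
```

A possible invocation is `docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | grep gitea.harkon.co.uk | over_budget 400`, which would list anything larger than the non-ML target plus some headroom; the 400 here is an arbitrary example threshold.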
## Next Steps
1. **Update all Dockerfiles** to use `requirements-base.txt`
2. **Clean Docker cache**: `docker system prune -a --volumes`
3. **Rebuild images**: `./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 blue`
4. **Verify sizes**: Should see ~300MB for most services, ~1.2GB for ML services
5. **Update deployment**: Change version to `v1.0.1` in production compose files
## Benefits
1. **Faster builds** - Less to download and install
2. **Faster deployments** - Smaller images to push/pull
3. **Lower storage costs** - 69% reduction in total storage
4. **Faster startup** - Less to pull and unpack before a container can start on a new host
5. **Better security** - Fewer dependencies = smaller attack surface
6. **Easier maintenance** - Clear separation of concerns
## Notes
- Development dependencies are now in `libs/requirements-dev.txt` - install locally with `pip install -r libs/requirements-dev.txt`
- ML services still need PyTorch, but we're using CPU-only versions where possible
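- Consider using `python:3.12-alpine` for even smaller images (but Alpine uses musl libc, so some packages lack prebuilt wheels and must be compiled from source, which requires more build dependencies)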
- Monitor for any missing dependencies after deployment