# Docker Image Size Optimization

## Problem Identified

Initial Docker images were **1.6GB** each, which is unacceptably large for microservices.

### Root Causes

1. **Heavy ML dependencies in all services** - `sentence-transformers` (~2GB with PyTorch) was included in base requirements
2. **Development dependencies in production** - pytest, mypy, black, ruff, etc. were being installed in Docker images
3. **Unnecessary dependencies** - Many services don't need ML but were getting all ML libraries
4. **Redundant dependencies** - Multiple overlapping packages (transformers + sentence-transformers both include PyTorch)

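
One way to spot the first two root causes is to scan a requirements file for packages that belong in the dev or ML sets. The sketch below is illustrative only: the function name `flag_heavy_requirements` and the watch list are assumptions (the list reuses package names mentioned above, plus `torch`, the pip name for PyTorch). The match is a plain name comparison, so variants such as `pytest-cov` are not caught.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "line-number:requirement" for every line of a
# requirements file that names a package from the watch list below.
# The watch list is an assumption; extend it to suit the project.
flag_heavy_requirements() {
  local req_file="$1"
  local watch='pytest|mypy|black|ruff|sentence-transformers|transformers|torch'
  # The name must start the line and be followed by a version specifier,
  # an environment marker, an extras bracket, whitespace, or end of line.
  grep -niE "^(${watch})([<>=!~;[:space:]]|\[|\$)" "$req_file"
}
```

For example, `flag_heavy_requirements libs/requirements-base.txt` should print nothing (and return a non-zero status, as `grep` does when nothing matches) once the split described below is complete.
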
## Solution

### 1. Split Requirements Files

**Before:** Single `libs/requirements.txt` with everything (97 lines)

**After:** Modular requirements:
- `libs/requirements-base.txt` - Core dependencies (~30 packages, **~200MB**)
- `libs/requirements-ml.txt` - ML dependencies (only for 3 services, **~2GB**)
- `libs/requirements-pdf.txt` - PDF processing (only for services that need it)
- `libs/requirements-rdf.txt` - RDF/semantic web (only for KG service)
- `libs/requirements-dev.txt` - Development only (NOT in Docker)

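
pip lets one requirements file include another with a `-r <file>` line, resolved relative to the including file. That is one way a legacy `libs/requirements.txt` can point to the base file. The sketch below flattens such includes so the effective package list can be reviewed. `expand_requirements` is a hypothetical helper, and it only handles the short `-r <file>` form.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every requirement reachable from a file,
# following "-r other.txt" includes relative to the including file.
expand_requirements() {
  local file="$1" dir line
  dir=$(dirname "$file")
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) ;;                                         # skip blanks and comments
      '-r '*)  expand_requirements "$dir/${line#-r }" ;;  # follow the include
      *)       printf '%s\n' "$line" ;;                   # an ordinary requirement
    esac
  done < "$file"
}
```

Running `expand_requirements libs/requirements.txt | wc -l` before and after the split gives a rough count of how many packages a service would install.
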
### 2. Service-Specific Optimization

#### Services WITHOUT ML (11 services) - **~300MB each**
- svc-ingestion
- svc-extract
- svc-forms
- svc-hmrc
- svc-rpa
- svc-normalize-map
- svc-reason
- svc-firm-connectors
- svc-coverage
- svc-kg
- ui-review

**Dockerfile pattern:**
```dockerfile
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```

#### Services WITH ML (3 services) - **~1.2GB each**
- svc-ocr (needs transformers for document AI)
- svc-rag-indexer (needs sentence-transformers for embeddings)
- svc-rag-retriever (needs sentence-transformers for retrieval)

**Dockerfile pattern:**
```dockerfile
# Same steps as the non-ML pattern: the service's own requirements.txt
# already lists the ML dependencies it needs.
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt
```

### 3. Additional Optimizations

#### Removed from Base Requirements
- ❌ `sentence-transformers` - Only 3 services need it
- ❌ `transformers` - Only 3 services need it
- ❌ `spacy` - Only 2 services need it
- ❌ `nltk` - Only 2 services need it
- ❌ `scikit-learn` - Not needed by most services
- ❌ `numpy` - Only needed by ML services
- ❌ `aiokafka` - Using NATS instead
- ❌ `boto3`/`botocore` - Not needed
- ❌ `asyncio-mqtt` - Not used
- ❌ `ipaddress` - Part of the Python standard library
- ❌ All OpenTelemetry packages - Moved to dev
- ❌ All testing packages - Moved to dev
- ❌ All code quality tools - Moved to dev

#### Optimized in Service Requirements
- ✅ `opencv-python` → `opencv-python-headless` (smaller, no GUI)
- ✅ `langchain` → `tiktoken` (just the tokenizer, not the whole framework)
- ✅ Removed `presidio` (PII detection) - can be added later if needed
- ✅ Removed `layoutparser` - using transformers directly
- ✅ Removed `cohere` - using OpenAI/Anthropic only

### 4. Expected Results

| Service Type | Before | After | Savings |
|--------------|--------|-------|---------|
| Non-ML services (11) | 1.6GB each | ~300MB each | **81% reduction** |
| ML services (3) | 1.6GB each | ~1.2GB each | **25% reduction** |
| **Total (14 services)** | **22.4GB** | **~6.9GB** | **69% reduction** |

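
The totals row follows directly from the per-service figures. As a quick cross-check, the arithmetic can be reproduced with `awk`; the function name `image_totals` is only for illustration, and the inputs are the approximate sizes from the table.

```bash
#!/usr/bin/env bash
# Recompute the totals row of the table from the per-service figures.
image_totals() {
  awk 'BEGIN {
    non_ml = 11; ml = 3        # service counts from the table
    before_gb = 1.6            # size of every image before the split
    after_non_ml_gb = 0.3      # ~300MB per non-ML image
    after_ml_gb = 1.2          # ~1.2GB per ML image
    before = (non_ml + ml) * before_gb
    after  = non_ml * after_non_ml_gb + ml * after_ml_gb
    printf "before=%.1fGB after=%.1fGB reduction=%.0f%%\n",
           before, after, (1 - after / before) * 100
  }'
}
```

Calling `image_totals` prints `before=22.4GB after=6.9GB reduction=69%`, which matches the table.
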
## Implementation Checklist

### Phase 1: Requirements Files ✅
- [x] Create `libs/requirements-base.txt`
- [x] Create `libs/requirements-ml.txt`
- [x] Create `libs/requirements-pdf.txt`
- [x] Create `libs/requirements-rdf.txt`
- [x] Create `libs/requirements-dev.txt`
- [x] Update `libs/requirements.txt` to point to base

### Phase 2: Service Requirements ✅
- [x] Optimize `svc_ingestion/requirements.txt`
- [x] Optimize `svc_extract/requirements.txt`
- [x] Optimize `svc_ocr/requirements.txt`
- [x] Optimize `svc_rag_retriever/requirements.txt`
- [x] Optimize `svc_rag_indexer/requirements.txt`

### Phase 3: Dockerfiles 🟡
- [x] Update `svc_ingestion/Dockerfile`
- [ ] Update `svc_extract/Dockerfile`
- [ ] Update `svc_kg/Dockerfile`
- [ ] Update `svc_rag_retriever/Dockerfile`
- [ ] Update `svc_rag_indexer/Dockerfile`
- [ ] Update `svc_forms/Dockerfile`
- [ ] Update `svc_hmrc/Dockerfile`
- [ ] Update `svc_ocr/Dockerfile`
- [ ] Update `svc_rpa/Dockerfile`
- [ ] Update `svc_normalize_map/Dockerfile`
- [ ] Update `svc_reason/Dockerfile`
- [ ] Update `svc_firm_connectors/Dockerfile`
- [ ] Update `svc_coverage/Dockerfile`
- [ ] Update `ui_review/Dockerfile`

### Phase 4: Rebuild & Test
- [ ] Clean old images: `docker system prune -a`
- [ ] Rebuild all images
- [ ] Verify image sizes: `docker images | grep gitea.harkon.co.uk`
- [ ] Test services locally
- [ ] Push to registry

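
Progress on Phase 3 can be checked mechanically rather than by hand. The sketch below lists service Dockerfiles that do not yet mention `requirements-base.txt`. It assumes an `apps/<service>/Dockerfile` layout (inferred from the `COPY apps/svc_xxx/...` lines elsewhere in this document), and `pending_dockerfiles` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each <apps_dir>/<service>/Dockerfile that does
# not reference requirements-base.txt, i.e. has not been migrated yet.
pending_dockerfiles() {
  local apps_dir="${1:-apps}" dockerfile
  for dockerfile in "$apps_dir"/*/Dockerfile; do
    [ -f "$dockerfile" ] || continue   # the glob matched nothing
    grep -q 'requirements-base\.txt' "$dockerfile" || printf '%s\n' "$dockerfile"
  done
}
```

An empty result from `pending_dockerfiles apps` would suggest every Dockerfile has been switched over, although it does not prove that the rest of each file follows the template.
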
## Dockerfile Template

### For Non-ML Services (Most Services)

```dockerfile
# Multi-stage build for svc_xxx
FROM python:3.12-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy requirements and install dependencies
COPY libs/requirements-base.txt /tmp/libs-requirements.txt
COPY apps/svc_xxx/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /tmp/libs-requirements.txt -r /tmp/requirements.txt

# Production stage
FROM python:3.12-slim

# Install runtime dependencies and create the non-root user
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r appuser \
    && useradd -r -g appuser appuser

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY libs/ ./libs/
COPY apps/svc_xxx/ ./apps/svc_xxx/

# Give the non-root user (created above) ownership of the app and switch to it
RUN chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "-m", "uvicorn", "apps.svc_xxx.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### For ML Services (OCR, RAG Indexer, RAG Retriever)

Use the same template. No extra `COPY` line is needed, because each ML service's own `requirements.txt` already includes its ML dependencies.

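
The template uses `svc_xxx` as a placeholder for the service directory name. If the template is kept in a file, a small substitution step can generate each service's Dockerfile. This is only a sketch: `render_dockerfile` and the idea of a stored template file are assumptions, not something the repository is known to contain.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace the svc_xxx placeholder in a template file
# with a concrete service directory name and print the result.
render_dockerfile() {
  local template="$1" service="$2"
  # Accept only lowercase letters, digits and underscores so the name is
  # safe to splice into the sed expression.
  case "$service" in
    ''|*[!a-z0-9_]*)
      printf 'invalid service name: %s\n' "$service" >&2
      return 1
      ;;
  esac
  sed "s/svc_xxx/${service}/g" "$template"
}
```

With a template saved as, say, `Dockerfile.template` (a made-up file name), `render_dockerfile Dockerfile.template svc_forms > apps/svc_forms/Dockerfile` would produce the Dockerfile for that service.
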
## Verification Commands

```bash
# Check image sizes (SIZE is always the last column, whatever the CREATED text looks like)
docker images | grep gitea.harkon.co.uk | awk '{print $1":"$2, $NF}'

# Check what's installed in an image
docker run --rm gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0 pip list

# Compare sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" | grep gitea

# Check layer sizes
docker history gitea.harkon.co.uk/blue/svc-ingestion:v1.0.0
```

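
The size check can also be turned into a pass/fail gate. The sketch below reads `image size` pairs and flags any image over a budget. The function names and the budgets (400MB for non-ML images, 1300MB for the three ML services) are assumptions that add some headroom to the ~300MB and ~1.2GB targets. It also assumes Docker's usual decimal size strings such as `300MB` or `1.6GB`.

```bash
#!/usr/bin/env bash
# Convert a Docker size string such as "1.6GB" or "300MB" to whole megabytes.
size_to_mb() {
  awk -v s="$1" 'BEGIN {
    n = s + 0                          # numeric prefix, e.g. 1.6
    if (s ~ /GB$/)      n *= 1000
    else if (s ~ /MB$/) n *= 1
    else if (s ~ /kB$/) n /= 1000
    else if (s ~ /B$/)  n /= 1000000
    printf "%d\n", n + 0.5             # round to the nearest megabyte
  }'
}

# Read "image size" pairs on stdin and report images over their budget.
# Returns 1 if any image is over budget, 0 otherwise.
check_image_sizes() {
  local image size mb budget status=0
  while read -r image size; do
    case "$image" in
      *svc-ocr*|*svc-rag-indexer*|*svc-rag-retriever*) budget=1300 ;;  # ML services
      *) budget=400 ;;                                                 # everything else
    esac
    mb=$(size_to_mb "$size")
    if [ "$mb" -gt "$budget" ]; then
      printf 'OVER BUDGET: %s is %sMB (limit %sMB)\n' "$image" "$mb" "$budget"
      status=1
    fi
  done
  return "$status"
}
```

A possible invocation is `docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | grep gitea.harkon.co.uk | check_image_sizes`, which prints nothing and returns 0 when every image is within its budget.
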
## Next Steps

1. **Update all Dockerfiles** to use `requirements-base.txt`
2. **Clean Docker cache**: `docker system prune -a --volumes` (this also deletes unused volumes; omit `--volumes` if local data must be kept)
3. **Rebuild images**: `./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 blue`
4. **Verify sizes**: Expect ~300MB for most services and ~1.2GB for ML services
5. **Update deployment**: Change the version to `v1.0.1` in the production compose files

## Benefits

1. **Faster builds** - Less to download and install
2. **Faster deployments** - Smaller images to push/pull
3. **Lower storage costs** - 69% reduction in total storage
4. **Faster startup** - Less to pull before a container can start, and less to load into memory
5. **Better security** - Fewer dependencies = smaller attack surface
6. **Easier maintenance** - Clear separation of concerns

## Notes

- Development dependencies are now in `libs/requirements-dev.txt`; install them locally with `pip install -r libs/requirements-dev.txt`
- ML services still need PyTorch, but we're using CPU-only versions where possible
- Consider `python:3.12-alpine` for even smaller images, but note that Alpine uses musl libc, so packages without musl-compatible wheels must be compiled from source and need extra build dependencies
- Monitor for any missing dependencies after deployment