deployment, linting and infra configuration

This commit is contained in:
harkon
2025-10-14 07:42:31 +01:00
parent f0f7674b8d
commit eea46ac89c
41 changed files with 1017 additions and 1448 deletions


@@ -7,6 +7,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
## Current State Analysis
### Remote Server (`141.136.35.199`)
- **Location**: `/opt/compose/`
- **Existing Services**:
- Traefik v3.5.1 (reverse proxy with GoDaddy DNS challenge)
@@ -25,6 +26,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
- `portainer.harkon.co.uk`
### Local Repository (`infra/compose/`)
- **Compose Files**:
- `docker-compose.local.yml` - Full stack for local development
- `docker-compose.backend.yml` - Backend services (appears to be production-ready)
@@ -39,25 +41,30 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
## Challenges & Conflicts
### 1. **Duplicate Services**
- Both environments have Traefik and Authentik
- Need to decide: shared vs. isolated
### 2. **Network Naming**
- Remote: `frontend`, `backend`
- Local: `ai-tax-agent-frontend`, `ai-tax-agent-backend`
- Production needs: Consistent naming
### 3. **Domain Management**
- Remote: `*.harkon.co.uk` (public)
- Local: `*.local.lan` (development)
- Production: Need subdomains like `app.harkon.co.uk`, `api.harkon.co.uk`
### 4. **SSL Certificates**
- Remote: GoDaddy DNS challenge (production)
- Local: Self-signed certificates
- Production: Must use GoDaddy DNS challenge
### 5. **Resource Isolation**
- Company services need to remain stable
- Application services need independent deployment/rollback
@@ -66,6 +73,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
We will deploy the company services and the AI Tax Agent as two fully isolated stacks, each with its own Traefik and Authentik. This maximizes blast-radius isolation and avoids naming and DNS conflicts across environments.
Key implications:
- Separate external networks and DNS namespaces per stack
- Duplicate edge (Traefik) and IdP (Authentik), independent upgrades and rollbacks
- Slightly higher resource usage in exchange for strong isolation
@@ -139,6 +147,7 @@ Key implications:
### Domain Mapping
**Company Services** (existing):
- `traefik.harkon.co.uk` - Traefik dashboard
- `auth.harkon.co.uk` - Authentik SSO
- `gitea.harkon.co.uk` - Git hosting
@@ -146,6 +155,7 @@ Key implications:
- `portainer.harkon.co.uk` - Docker management
**Application Services** (app stack):
- `review.<domain>` - Review UI
- `api.<domain>` - API Gateway (microservices via Traefik)
- `vault.<domain>` - Vault UI (admin only)
@@ -159,12 +169,14 @@ Key implications:
### Authentication Strategy
**Authentik Configuration**:
1. **Company Group** - Access to Gitea, Nextcloud, Portainer
2. **App Admin Group** - Full access to all app services
3. **App User Group** - Access to Review UI and API
4. **App Reviewer Group** - Access to Review UI only
**Middleware Configuration**:
- `authentik-forwardauth` - Standard auth for all services
- `admin-auth` - Requires admin group (Vault, MinIO, Neo4j, etc.)
- `reviewer-auth` - Requires reviewer or higher
@@ -182,6 +194,7 @@ Key implications:
### Development Environment
**Keep Existing Setup**:
- Use `docker-compose.local.yml` as-is
- Domain: `*.local.lan`
- Self-signed certificates
@@ -189,6 +202,7 @@ Key implications:
- Full stack runs locally
**Benefits**:
- No dependency on remote server
- Fast iteration
- Complete isolation
@@ -217,19 +231,22 @@ make deploy-production # Deploy to remote server
### Phase 1: Preparation (Week 1)
1. **Backup Current State**
```bash
ssh deploy@141.136.35.199
cd /opt/compose
cd /opt
tar -czf ~/backup-$(date +%Y%m%d).tar.gz .
```
2. **Create Production Environment File**
- Copy `infra/compose/env.example` to `infra/compose/.env.production`
- Copy `infra/environments/production/.env.example` to `infra/environments/production/.env`
- Update all secrets and passwords
- Set `DOMAIN=harkon.co.uk`
- Configure GoDaddy API credentials
3. **Update Traefik Configuration**
- Merge local Traefik config with remote
- Add application routes
- Configure Authentik ForwardAuth
@@ -242,13 +259,15 @@ make deploy-production # Deploy to remote server
### Phase 2: Infrastructure Deployment (Week 2)
1. **Deploy Application Infrastructure**
```bash
# On remote server
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
docker compose -f infrastructure.yaml up -d
```
2. **Initialize Services**
- Vault: Unseal and configure
- Postgres: Run migrations
- Neo4j: Install plugins
@@ -262,11 +281,13 @@ make deploy-production # Deploy to remote server
### Phase 3: Application Deployment (Week 3)
1. **Deploy Microservices**
```bash
docker compose -f services.yaml up -d
```
2. **Deploy Monitoring**
```bash
docker compose -f monitoring.yaml up -d
```


@@ -10,7 +10,7 @@
### 1. Production Compose Files Created
Created three production-ready Docker Compose files in `infra/compose/production/`:
Created three production-ready Docker Compose files in `infra/base/`:
#### **infrastructure.yaml**
- Vault (secrets management)
@@ -104,7 +104,7 @@ chmod +x scripts/deploy-to-production.sh
### 3. Documentation Created
#### **infra/compose/production/README.md**
#### **infra/base manifests**
Comprehensive production deployment guide including:
- Prerequisites checklist
- Three deployment options (automated, step-by-step, manual)
@@ -221,7 +221,7 @@ Or step-by-step:
1. **Initialize Vault**
```bash
ssh deploy@141.136.35.199
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
docker exec -it vault vault operator init
# Save unseal keys!
docker exec -it vault vault operator unseal
@@ -382,7 +382,6 @@ Deployment is successful when:
If you encounter issues:
1. Check logs: `./scripts/deploy-to-production.sh logs <service>`
2. Verify status: `./scripts/deploy-to-production.sh verify`
3. Review documentation: `infra/compose/production/README.md`
3. Review manifests: `infra/base/*.yaml`
4. Check deployment plan: `docs/DEPLOYMENT_PLAN.md`
5. Follow checklist: `docs/DEPLOYMENT_CHECKLIST.md`


@@ -21,15 +21,14 @@
- ✅ Created quick start guide (`docs/QUICK_START.md`)
### 3. Production Configuration Files
- ✅ Created `infra/compose/production/infrastructure.yaml` (7 infrastructure services)
- ✅ Created `infra/compose/production/services.yaml` (14 application services + UI)
- ✅ Created `infra/compose/production/monitoring.yaml` (Prometheus, Grafana, Loki, Promtail)
- ✅ Created `infra/compose/production/README.md` (deployment guide)
- ✅ Created `infra/base/infrastructure.yaml` (infrastructure, incl. Traefik + Authentik)
- ✅ Created `infra/base/services.yaml` (application services + UI)
- ✅ Created `infra/base/monitoring.yaml` (Prometheus, Grafana, Loki, Promtail)
### 4. Monitoring Configuration
- ✅ Created Prometheus configuration (`infra/compose/prometheus/prometheus.yml`)
- ✅ Created Loki configuration (`infra/compose/loki/loki-config.yml`)
- ✅ Created Promtail configuration (`infra/compose/promtail/promtail-config.yml`)
- ✅ Created Prometheus configuration (`infra/base/prometheus/prometheus.yml`)
- ✅ Created Loki configuration (`infra/base/loki/loki-config.yml`)
- ✅ Created Promtail configuration (`infra/base/promtail/promtail-config.yml`)
- ✅ Configured service discovery for all 14 services
- ✅ Set up 30-day metrics retention
@@ -266,10 +265,9 @@ df -h
- `docs/ENVIRONMENT_COMPARISON.md` - Local vs Production comparison
2. **Configuration:**
- `infra/compose/production/README.md` - Production compose guide
- `infra/compose/production/infrastructure.yaml` - Infrastructure services
- `infra/compose/production/services.yaml` - Application services
- `infra/compose/production/monitoring.yaml` - Monitoring stack
- `infra/base/infrastructure.yaml` - Infrastructure services
- `infra/base/services.yaml` - Application services
- `infra/base/monitoring.yaml` - Monitoring stack
3. **Deployment:**
- `docs/POST_BUILD_DEPLOYMENT.md` - Post-build deployment steps
@@ -319,4 +317,3 @@ For questions or issues:
- 🟡 In Progress
- ⏳ Pending
- ❌ Blocked


@@ -12,7 +12,7 @@ This document compares the local development environment with the production env
| **SSL** | Self-signed certificates | Let's Encrypt (GoDaddy DNS) |
| **Networks** | `ai-tax-agent-frontend`<br/>`ai-tax-agent-backend` | `frontend`<br/>`backend` |
| **Compose File** | `docker-compose.local.yml` | `infrastructure.yaml`<br/>`services.yaml`<br/>`monitoring.yaml` |
| **Location** | Local machine | `deploy@141.136.35.199:/opt/compose/ai-tax-agent/` |
| **Location** | Local machine | `deploy@141.136.35.199:/opt/ai-tax-agent/` |
| **Traefik** | Isolated instance | Shared with company services |
| **Authentik** | Isolated instance | Shared with company services |
| **Data Persistence** | Local Docker volumes | Remote Docker volumes + backups |
@@ -271,7 +271,7 @@ make clean
#### Production
```bash
# Deploy infrastructure
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
docker compose -f infrastructure.yaml up -d
# Deploy services
@@ -370,7 +370,7 @@ docker compose -f services.yaml up -d --no-deps svc-ingestion
4. **Deploy to production**:
```bash
ssh deploy@141.136.35.199
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
docker compose -f services.yaml pull
docker compose -f services.yaml up -d
```
@@ -436,4 +436,3 @@ The key differences between local and production environments are:
6. **Backups**: Local has none; production has automated backups
Both environments use the same application code and Docker images, ensuring consistency and reducing deployment risks.


@@ -1,332 +0,0 @@
# Gitea Container Registry Debugging Guide
## Common Issues When Pushing Large Docker Images
### Issue 1: Not Logged In
**Symptom**: `unauthorized: authentication required`
**Solution**:
```bash
# On remote server
docker login gitea.harkon.co.uk
# Username: blue (or your Gitea username)
# Password: <your-gitea-access-token>
```
---
### Issue 2: Upload Size Limit (413 Request Entity Too Large)
**Symptom**: Push fails with `413 Request Entity Too Large` or similar error
**Root Cause**: Traefik or Gitea has a limit on request body size
**Solution A: Configure Traefik Middleware**
1. Find your Traefik configuration directory:
```bash
docker inspect traefik | grep -A 10 Mounts
```
2. Create middleware configuration:
```bash
# Example: /opt/traefik/config/middlewares.yml
sudo tee /opt/traefik/config/middlewares.yml > /dev/null << 'EOF'
http:
middlewares:
large-upload:
buffering:
maxRequestBodyBytes: 5368709120 # 5GB
memRequestBodyBytes: 104857600 # 100MB
maxResponseBodyBytes: 5368709120 # 5GB
memResponseBodyBytes: 104857600 # 100MB
EOF
```
3. Update Gitea container labels:
```yaml
labels:
- "traefik.http.routers.gitea.middlewares=large-upload@file"
```
4. Restart Traefik:
```bash
docker restart traefik
```
**Solution B: Configure Gitea Directly**
1. Edit Gitea configuration:
```bash
docker exec -it gitea-server vi /data/gitea/conf/app.ini
```
2. Add/modify these settings:
```ini
[server]
LFS_MAX_FILE_SIZE = 5368709120 ; 5GB
[repository.upload]
FILE_MAX_SIZE = 5368709120 ; 5GB
```
3. Restart Gitea:
```bash
docker restart gitea-server
```
---
### Issue 3: Network Timeout
**Symptom**: Push hangs or times out after uploading for a while
**Root Cause**: Network instability or slow connection
**Solution**: Use chunked uploads or increase timeout
1. Configure Docker daemon timeout:
```bash
# Edit /etc/docker/daemon.json
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
"max-concurrent-uploads": 1,
"max-concurrent-downloads": 3,
"registry-mirrors": []
}
EOF
sudo systemctl restart docker
```
2. Or use Traefik timeout middleware:
```yaml
http:
middlewares:
long-timeout:
buffering:
retryExpression: "IsNetworkError() && Attempts() < 3"
```
---
### Issue 4: Disk Space
**Symptom**: Push fails with "no space left on device"
**Solution**:
```bash
# Check disk space
df -h
# Clean up Docker
docker system prune -a --volumes -f
# Check again
df -h
```
---
### Issue 5: Gitea Registry Not Enabled
**Symptom**: `404 Not Found` when accessing `/v2/`
**Solution**:
```bash
# Check if registry is enabled
docker exec gitea-server cat /data/gitea/conf/app.ini | grep -A 5 "\[packages\]"
# Should show:
# [packages]
# ENABLED = true
```
If not enabled, add to `app.ini`:
```ini
[packages]
ENABLED = true
```
Restart Gitea:
```bash
docker restart gitea-server
```
---
## Debugging Steps
### Step 1: Verify Gitea Registry is Accessible
```bash
# Should return 401 Unauthorized (which is good - means registry is working)
curl -I https://gitea.harkon.co.uk/v2/
# Should return 200 OK after login
docker login gitea.harkon.co.uk
curl -u "username:token" https://gitea.harkon.co.uk/v2/
```
### Step 2: Test with Small Image
```bash
# Pull a small image
docker pull alpine:latest
# Tag it for your registry
docker tag alpine:latest gitea.harkon.co.uk/harkon/test:latest
# Try to push
docker push gitea.harkon.co.uk/harkon/test:latest
```
If this works, the issue is with large images (size limit).
### Step 3: Check Gitea Logs
```bash
# Check for errors
docker logs gitea-server --tail 100 | grep -i error
# Watch logs in real-time while pushing
docker logs -f gitea-server
```
### Step 4: Check Traefik Logs
```bash
# Check for 413 or 502 errors
docker logs traefik --tail 100 | grep -E "413|502|error"
# Watch logs in real-time
docker logs -f traefik
```
### Step 5: Check Docker Daemon Logs
```bash
# Check Docker daemon logs
sudo journalctl -u docker --since "1 hour ago" | grep -i error
```
---
## Quick Fix: Bypass Traefik for Registry
If Traefik is causing issues, you can expose Gitea's registry directly:
1. Update Gitea docker-compose to expose port 3000:
```yaml
services:
gitea:
ports:
- "3000:3000" # HTTP
```
2. Use direct connection:
```bash
docker login gitea.harkon.co.uk:3000
docker push gitea.harkon.co.uk:3000/harkon/base-ml:v1.0.1
```
**Note**: This bypasses SSL, so only use for debugging!
---
## Recommended Configuration for Large Images
### Traefik Configuration
Create `/opt/traefik/config/gitea-registry.yml`:
```yaml
http:
middlewares:
gitea-registry:
buffering:
maxRequestBodyBytes: 5368709120 # 5GB
memRequestBodyBytes: 104857600 # 100MB in memory
maxResponseBodyBytes: 5368709120 # 5GB
memResponseBodyBytes: 104857600 # 100MB in memory
routers:
gitea-registry:
rule: "Host(`gitea.harkon.co.uk`) && PathPrefix(`/v2/`)"
entryPoints:
- websecure
middlewares:
- gitea-registry
service: gitea
tls:
certResolver: letsencrypt
```
### Gitea Configuration
In `/data/gitea/conf/app.ini`:
```ini
[server]
PROTOCOL = http
DOMAIN = gitea.harkon.co.uk
ROOT_URL = https://gitea.harkon.co.uk/
HTTP_PORT = 3000
LFS_MAX_FILE_SIZE = 5368709120
[repository.upload]
FILE_MAX_SIZE = 5368709120
ENABLED = true
[packages]
ENABLED = true
CHUNKED_UPLOAD_PATH = /data/gitea/tmp/package-upload
```
---
## Testing the Fix
After applying configuration changes:
1. Restart services:
```bash
docker restart traefik
docker restart gitea-server
```
2. Test with a large layer:
```bash
# Build base-ml (has large layers)
cd /home/deploy/ai-tax-agent
docker build -f infra/docker/base-ml.Dockerfile -t gitea.harkon.co.uk/harkon/base-ml:test .
# Try to push
docker push gitea.harkon.co.uk/harkon/base-ml:test
```
3. Monitor logs:
```bash
# Terminal 1: Watch Traefik
docker logs -f traefik
# Terminal 2: Watch Gitea
docker logs -f gitea-server
# Terminal 3: Push image
docker push gitea.harkon.co.uk/harkon/base-ml:test
```
---
## Alternative: Use Docker Hub or GitHub Container Registry
If Gitea continues to have issues with large images, consider:
1. **Docker Hub**: Free for public images
2. **GitHub Container Registry (ghcr.io)**: Free for public/private
3. **GitLab Container Registry**: Free tier available
These are battle-tested for large ML images and have better defaults for large uploads.


@@ -1,194 +0,0 @@
# Gitea Container Registry - Image Naming Fix
## Issue
The initial build script was using incorrect image naming convention for Gitea's container registry.
### Incorrect Format
```
gitea.harkon.co.uk/ai-tax-agent/svc-ingestion:v1.0.0
```
### Correct Format (Per Gitea Documentation)
```
gitea.harkon.co.uk/{owner}/{image}:{tag}
```
Where `{owner}` must be your **Gitea username** or **organization name**.
**Using organization:** `harkon` (Gitea team/organization)
## Solution
Updated the build script and production compose files to use the correct naming convention.
### Changes Made
#### 1. Build Script (`scripts/build-and-push-images.sh`)
**Before:**
```bash
REGISTRY="${1:-gitea.harkon.co.uk}"
VERSION="${2:-latest}"
PROJECT="ai-tax-agent"
IMAGE_NAME="$REGISTRY/$PROJECT/$service:$VERSION"
```
**After:**
```bash
REGISTRY="${1:-gitea.harkon.co.uk}"
VERSION="${2:-latest}"
OWNER="${3:-harkon}" # Gitea organization/team name
IMAGE_NAME="$REGISTRY/$OWNER/$service:$VERSION"
```
#### 2. Production Services (`infra/compose/production/services.yaml`)
**Before:**
```yaml
svc-ingestion:
image: gitea.harkon.co.uk/ai-tax-agent/svc-ingestion:latest
```
**After:**
```yaml
svc-ingestion:
image: gitea.harkon.co.uk/harkon/svc-ingestion:latest
```
All 14 services updated:
- svc-ingestion
- svc-extract
- svc-kg
- svc-rag-retriever
- svc-rag-indexer
- svc-forms
- svc-hmrc
- svc-ocr
- svc-rpa
- svc-normalize-map
- svc-reason
- svc-firm-connectors
- svc-coverage
- ui-review
## Usage
### Build and Push Images
```bash
# With default owner (harkon organization)
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1
# With custom owner
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 <your-gitea-org>
```
### Pull Images
```bash
docker pull gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
```
### Push Images Manually
```bash
# Tag image
docker tag my-image:latest gitea.harkon.co.uk/harkon/my-image:v1.0.1
# Push image
docker push gitea.harkon.co.uk/harkon/my-image:v1.0.1
```
## Gitea Registry Documentation Reference
From Gitea's official documentation:
### Image Naming Convention
Images must follow this naming convention:
```
{registry}/{owner}/{image}
```
When building your docker image, using the naming convention above, this looks like:
```bash
# build an image with tag
docker build -t {registry}/{owner}/{image}:{tag} .
# name an existing image with tag
docker tag {some-existing-image}:{tag} {registry}/{owner}/{image}:{tag}
```
### Valid Examples
For owner `testuser` on `gitea.example.com`:
- `gitea.example.com/testuser/myimage`
- `gitea.example.com/testuser/my-image`
- `gitea.example.com/testuser/my/image`
### Important Notes
1. **Owner must exist**: The owner (username or organization) must exist in Gitea
2. **Case-insensitive tags**: `image:tag` and `image:Tag` are treated as the same
3. **Authentication required**: Use personal access token with `write:package` scope
4. **Registry URL**: Use the main Gitea domain, not a separate registry subdomain
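As a sanity check, the convention above can be expressed as a one-line helper (illustrative only; not part of the build script):

```python
def image_ref(registry: str, owner: str, image: str, tag: str = "latest") -> str:
    """Compose a reference per Gitea's {registry}/{owner}/{image}:{tag} rule."""
    return f"{registry}/{owner}/{image}:{tag}"


ref = image_ref("gitea.harkon.co.uk", "harkon", "svc-ingestion", "v1.0.1")
```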
## Verification
After the fix, verify images are pushed correctly:
```bash
# Login to Gitea
docker login gitea.harkon.co.uk
# Check pushed images in Gitea UI
# Navigate to: https://gitea.harkon.co.uk/blue/-/packages
```
## Current Build Status
**Fixed and working!**
Build command:
```bash
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 harkon
```
Expected output:
```
Logging in to registry: gitea.harkon.co.uk
Login Succeeded
Building svc-ingestion...
Building: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
✅ Built: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
Pushing: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
✅ Pushed: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
```
## Next Steps
1. ✅ Build script fixed
2. ✅ Production compose files updated
3. 🟡 Build in progress (14 services)
4. ⏳ Deploy to production (after build completes)
## References
- [Gitea Container Registry Documentation](https://docs.gitea.com/usage/packages/container)
- Build script: `scripts/build-and-push-images.sh`
- Production services: `infra/compose/production/services.yaml`


@@ -148,11 +148,11 @@ docker run --rm gitea.harkon.co.uk/harkon/svc-ocr:v1.0.1 pip list | grep torch
### 5. Update Production Deployment
Update `infra/compose/production/services.yaml` to use `v1.0.1`:
Update `infra/base/services.yaml` to use `v1.0.1`:
```bash
# Find and replace v1.0.0 with v1.0.1
sed -i '' 's/:v1.0.0/:v1.0.1/g' infra/compose/production/services.yaml
sed -i '' 's/:v1.0.0/:v1.0.1/g' infra/base/services.yaml
# Or use latest tag (already configured)
# No changes needed if using :latest


@@ -50,7 +50,7 @@ docker login gitea.harkon.co.uk
**SSH to server:**
```bash
ssh deploy@141.136.35.199
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
```
**Initialize Vault:**
@@ -62,19 +62,19 @@ docker exec -it vault vault operator unseal
**Create MinIO Buckets:**
```bash
docker exec -it minio mc alias set local http://localhost:9092 admin <MINIO_PASSWORD>
docker exec -it minio mc mb local/documents
docker exec -it minio mc mb local/models
docker exec -it apa-minio mc alias set local http://localhost:9000 admin <MINIO_PASSWORD>
docker exec -it apa-minio mc mb local/documents
docker exec -it apa-minio mc mb local/models
```
**Create NATS Streams:**
```bash
docker exec -it nats nats stream add TAX_AGENT_EVENTS \
docker exec -it apa-nats nats stream add TAX_AGENT_EVENTS \
--subjects="tax.>" --storage=file --retention=limits --max-age=7d
```
**Configure Authentik:**
1. Go to https://authentik.harkon.co.uk
1. Go to https://auth.harkon.co.uk
2. Create groups: `app-admin`, `app-user`, `app-reviewer`
3. Create OAuth providers for:
- Review UI: `app.harkon.co.uk`
@@ -94,7 +94,7 @@ curl -I https://api.harkon.co.uk/healthz
curl -I https://grafana.harkon.co.uk
# View logs
./scripts/deploy-to-production.sh logs svc-ingestion
./scripts/deploy-to-production.sh logs apa-svc-ingestion
```
---
@@ -127,8 +127,8 @@ curl -I https://grafana.harkon.co.uk
### Restart Service
```bash
ssh deploy@141.136.35.199
cd /opt/compose/ai-tax-agent
docker compose -f services.yaml restart svc-ingestion
cd /opt/ai-tax-agent
docker compose -f services.yaml restart apa-svc-ingestion
```
### Check Status
@@ -163,25 +163,25 @@ docker compose -f services.yaml logs svc-ingestion
docker compose -f infrastructure.yaml ps
# Restart
docker compose -f services.yaml restart svc-ingestion
docker compose -f services.yaml restart apa-svc-ingestion
```
### SSL Issues
```bash
# Check Traefik logs
docker logs traefik
docker logs apa-traefik
# Check certificates
sudo cat /opt/compose/traefik/certs/godaddy-acme.json | jq
sudo cat /opt/ai-tax-agent/traefik/certs/godaddy-acme.json | jq
```
### Database Connection
```bash
# Test Postgres
docker exec -it postgres pg_isready -U postgres
docker exec -it apa-postgres pg_isready -U postgres
# Check env vars
docker exec -it svc-ingestion env | grep POSTGRES
docker exec -it apa-svc-ingestion env | grep POSTGRES
```
---
@@ -190,7 +190,7 @@ docker exec -it svc-ingestion env | grep POSTGRES
```bash
ssh deploy@141.136.35.199
cd /opt/compose/ai-tax-agent
cd /opt/ai-tax-agent
# Stop services
docker compose -f services.yaml down
@@ -198,12 +198,11 @@ docker compose -f infrastructure.yaml down
docker compose -f monitoring.yaml down
# Restore backup
cd /opt/compose
cd /opt
tar -xzf ~/backups/backup-YYYYMMDD-HHMMSS.tar.gz
# Restart company services
cd /opt/compose/traefik && docker compose up -d
cd /opt/compose/authentik && docker compose up -d
# Restart application infra
cd /opt/ai-tax-agent && docker compose -f infrastructure.yaml up -d
```
---
@@ -242,4 +241,3 @@ cd /opt/compose/authentik && docker compose up -d
```bash
./scripts/deploy-to-production.sh logs <service>
```

docs/SRE.md Normal file

@@ -0,0 +1,555 @@
# ROLE
You are a **Senior Platform Engineer + Backend Lead** generating **production code** and **ops assets** for a microservice suite that powers an accounting Knowledge Graph + Vector RAG platform. Authentication/authorization are centralized at the **edge via Traefik + Authentik** (ForwardAuth). **Services are trust-bound** to Traefik and consume user/role claims via forwarded headers/JWT.
# MISSION
Produce fully working code for **all application services** (FastAPI + Python 3.12) with:
- Solid domain models, Pydantic v2 schemas, type hints, strict mypy, ruff lint.
- OpenTelemetry tracing, Prometheus metrics, structured logging.
- Vault-backed secrets, MinIO S3 client, Qdrant client, Neo4j driver, Postgres (SQLAlchemy), Redis.
- Eventing (Kafka or SQS/SNS behind an interface).
- Deterministic data contracts, end-to-end tests, Dockerfiles, Compose, CI for Gitea.
- Traefik labels + Authentik Outpost integration for every exposed route.
- Zero PII in vectors (Qdrant), evidence-based lineage in KG, and bitemporal writes.
# GLOBAL CONSTRAINTS (APPLY TO ALL SERVICES)
- **Language & Runtime:** Python **3.12**.
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2, httpx, aiokafka or boto3 (pluggable), redis-py, opentelemetry-instrumentation-fastapi, prometheus-fastapi-instrumentator.
- **Config:** `pydantic-settings` with `.env` overlay. Provide `Settings` class per service.
- **Secrets:** HashiCorp **Vault** (AppRole/JWT). Use Vault Transit to **envelope-encrypt** sensitive fields before persistence (helpers provided in `lib/security.py`).
- **Auth:** No OIDC in services. Add `TrustedProxyMiddleware`:
- Reject if request not from internal network (configurable CIDR).
- Require headers set by Traefik+Authentik (`X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer …`).
- Parse groups → `roles` list on `request.state`.
- **Observability:**
- OpenTelemetry (traceparent propagation), span attrs (service, route, user, tenant).
- Prometheus metrics endpoint `/metrics` protected by internal network check.
- Structured JSON logs (timestamp, level, svc, trace_id, msg) via `structlog`.
- **Errors:** Global exception handler → RFC7807 Problem+JSON (`type`, `title`, `status`, `detail`, `instance`, `trace_id`).
- **Testing:** `pytest`, `pytest-asyncio`, `hypothesis` (property tests for calculators), `coverage ≥ 90%` per service.
- **Static:** `ruff`, `mypy --strict`, `bandit`, `safety`, `licensecheck`.
- **Perf:** Each service exposes `/healthz`, `/readyz`, `/livez`; cold start < 500ms; p95 endpoint < 250ms (local).
- **Containers:** Distroless or slim images; non-root user; read-only FS; `/tmp` mounted for OCR where needed.
- **Docs:** OpenAPI JSON + ReDoc; MkDocs site with service READMEs.
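A minimal sketch of the Problem+JSON body the global exception handler would emit (field names follow RFC 7807; `trace_id` is this spec's extension member, and the example values are illustrative):

```python
from typing import Any


def problem_json(
    *,
    type_: str = "about:blank",
    title: str,
    status: int,
    detail: str,
    instance: str,
    trace_id: str,
) -> dict[str, Any]:
    """Build an RFC 7807 Problem Details body with a trace_id extension."""
    return {
        "type": type_,
        "title": title,
        "status": status,
        "detail": detail,
        "instance": instance,
        "trace_id": trace_id,
    }


body = problem_json(
    title="Unprocessable Entity",
    status=422,
    detail="field 'tenant_id' is required",
    instance="/v1/ingest/upload",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```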
# SHARED LIBS (GENERATE ONCE, REUSE)
Create `libs/` used by all services:
- `libs/config.py` base `Settings`, env parsing, Vault client factory, MinIO client factory, Qdrant client factory, Neo4j driver factory, Redis factory, Kafka/SQS client factory.
- `libs/security.py` Vault Transit helpers (`encrypt_field`, `decrypt_field`), header parsing, internal-CIDR validator.
- `libs/observability.py` otel init, prometheus instrumentor, logging config.
- `libs/events.py` abstract `EventBus` with `publish(topic, payload: dict)`, `subscribe(topic, handler)`. Two impls: Kafka (`aiokafka`) and SQS/SNS (`boto3`).
- `libs/schemas.py` **canonical Pydantic models** shared across services (Document, Evidence, IncomeItem, etc.) mirroring the ontology schemas. Include JSONSchema exports.
- `libs/storage.py` S3/MinIO helpers (bucket ensure, put/get, presigned).
- `libs/neo.py` Neo4j session helpers, Cypher runner with retry, SHACL validator invoker (pySHACL on exported RDF).
- `libs/rag.py` Qdrant collections CRUD, hybrid search (dense+sparse), rerank wrapper, de-identification utilities (regex + NER; hash placeholders).
- `libs/forms.py` PDF AcroForm fill via `pdfrw` with overlay fallback via `reportlab`.
- `libs/calibration.py` `calibrated_confidence(raw_score, method="temperature_scaling", params=...)`.
# EVENT TOPICS (STANDARDIZE)
- `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`
Each payload MUST include: `event_id (ulid)`, `occurred_at (iso)`, `actor`, `tenant_id`, `trace_id`, `schema_version`, and a `data` object (service-specific).
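The required envelope can be sketched as a plain dataclass (the services themselves would use a Pydantic v2 model from `libs/schemas.py`; the ULID and IDs below are hypothetical):

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class EventEnvelope:
    """Common fields every topic payload must carry."""
    event_id: str          # ULID in production
    actor: str
    tenant_id: str
    trace_id: str
    schema_version: str
    data: dict[str, Any]   # service-specific body
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


evt = EventEnvelope(
    event_id="01J9ZX6H2K3Q4R5S6T7V8W9XYZ",  # hypothetical ULID
    actor="svc-ingestion",
    tenant_id="tenant-123",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    schema_version="1.0",
    data={"doc_id": "doc-42", "bucket": "raw", "key": "tenants/tenant-123/raw/doc-42.pdf"},
)
payload = asdict(evt)
```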
# TRUST HEADERS FROM TRAEFIK + AUTHENTIK (USE EXACT KEYS)
- `X-Authenticated-User` (string)
- `X-Authenticated-Email` (string)
- `X-Authenticated-Groups` (comma-separated)
- `Authorization` (`Bearer <jwt>` from Authentik)
Reject any request missing these (except `/healthz|/readyz|/livez|/metrics` from internal CIDR).
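A sketch of the check the `TrustedProxyMiddleware` performs, using the exact header keys above (the CIDR, paths, and example values are assumptions; the real middleware would hang off the ASGI request):

```python
from ipaddress import ip_address, ip_network

REQUIRED = (
    "X-Authenticated-User",
    "X-Authenticated-Email",
    "X-Authenticated-Groups",
    "Authorization",
)
PUBLIC_PATHS = {"/healthz", "/readyz", "/livez", "/metrics"}


def check_request(
    path: str,
    client_ip: str,
    headers: dict[str, str],
    internal_cidr: str = "10.0.0.0/8",  # assumption: configurable per service
) -> list[str]:
    """Reject untrusted requests; return the caller's groups as roles."""
    if ip_address(client_ip) not in ip_network(internal_cidr):
        raise PermissionError("request not from internal network")
    if path in PUBLIC_PATHS:
        return []  # health/metrics need only the internal-CIDR check
    missing = [h for h in REQUIRED if h not in headers]
    if missing:
        raise PermissionError(f"missing trusted headers: {missing}")
    # Comma-separated groups become the roles list on request.state.
    return [g.strip() for g in headers["X-Authenticated-Groups"].split(",") if g.strip()]


roles = check_request(
    "/v1/kg/nodes/Person/42",
    "10.1.2.3",
    {
        "X-Authenticated-User": "blue",
        "X-Authenticated-Email": "blue@example.com",
        "X-Authenticated-Groups": "app-admin,app-user",
        "Authorization": "Bearer <jwt>",
    },
)
```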
---
## SERVICES TO IMPLEMENT (CODE FOR EACH)
### 1) `svc-ingestion`
**Purpose:** Accept uploads or URLs, checksum, store to MinIO, emit `doc.ingested`.
**Endpoints:**
- `POST /v1/ingest/upload` (multipart file, metadata: `tenant_id`, `kind`, `source`) → returns `{doc_id, s3_url, checksum}`
- `POST /v1/ingest/url` (json: `{url, kind, tenant_id}`) downloads to MinIO
- `GET /v1/docs/{doc_id}` metadata
**Logic:**
- Compute SHA256, dedupe by checksum; MinIO path `tenants/{tenant_id}/raw/{doc_id}.pdf`.
- Store metadata in Postgres table `ingest_documents` (alembic migrations).
- Publish `doc.ingested` with `{doc_id, bucket, key, pages?, mime}`.
**Env:** `S3_BUCKET_RAW`, `MINIO_*`, `DB_URL`.
**Traefik labels:** route `/ingest/*`.
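The checksum/dedupe step can be sketched as follows (the in-memory `seen` dict stands in for a unique checksum index on `ingest_documents`; the payload bytes are illustrative):

```python
import hashlib

seen: dict[str, str] = {}  # checksum -> doc_id; a Postgres unique index in prod


def object_key(tenant_id: str, doc_id: str) -> str:
    # MinIO layout from the spec: tenants/{tenant_id}/raw/{doc_id}.pdf
    return f"tenants/{tenant_id}/raw/{doc_id}.pdf"


def ingest(tenant_id: str, doc_id: str, payload: bytes) -> tuple[str, str, bool]:
    """Return (doc_id, key, deduped): dedupe by SHA256 before storing."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        existing = seen[digest]
        return existing, object_key(tenant_id, existing), True
    seen[digest] = doc_id
    return doc_id, object_key(tenant_id, doc_id), False


first = ingest("tenant-1", "doc-1", b"%PDF-1.7 ...")
dup = ingest("tenant-1", "doc-2", b"%PDF-1.7 ...")
```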
---
### 2) `svc-rpa`
**Purpose:** Scheduled RPA pulls from firm/client portals via Playwright.
**Tasks:**
- Playwright login flows (credentials from Vault), 2FA via Authentik OAuth device or OTP secret in Vault.
- Download statements/invoices; hand off to `svc-ingestion` via internal POST.
- Prefect flows: `pull_portal_X()`, `pull_portal_Y()` with schedules.
**Endpoints:**
- `POST /v1/rpa/run/{connector}` (manual trigger)
- `GET /v1/rpa/status/{run_id}`
**Env:** `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID`.
---
### 3) `svc-ocr`
**Purpose:** OCR & layout extraction.
**Pipeline:**
- Pull object from MinIO, detect rotation/de-skew (`opencv-python`), split pages (`pymupdf`), OCR (`pytesseract`) or bypass if text layer present (`pdfplumber`).
- Output per-page text + **bbox** for lines/words.
- Write JSON to MinIO `tenants/{tenant_id}/ocr/{doc_id}.json` and emit `doc.ocr_ready`.
**Endpoints:**
- `POST /v1/ocr/{doc_id}` (idempotent trigger)
- `GET /v1/ocr/{doc_id}` (fetch OCR JSON)
**Env:** `TESSERACT_LANGS`, `S3_BUCKET_EVIDENCE`.
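The "bypass OCR when a text layer exists" decision and the per-page output shape can be sketched as below (pdfplumber/pytesseract calls are reduced to the extracted-text argument; the 32-char threshold is an assumed default):

```python
# OCR-bypass decision plus the page record shape (text + word-level bboxes).
from typing import Optional

def needs_ocr(text_layer: Optional[str], min_chars: int = 32) -> bool:
    """OCR only when the embedded text layer is absent or too sparse to trust."""
    if text_layer is None:
        return True
    return len(text_layer.strip()) < min_chars

def page_record(page_no: int, text: str, words: list) -> dict:
    """Per-page output: text plus word-level bounding boxes."""
    return {"page": page_no, "text": text,
            "words": [{"text": w["text"], "bbox": w["bbox"]} for w in words]}
```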
---
### 4) `svc-extract`
**Purpose:** Classify docs and extract KV + tables into **schema-constrained JSON** (with bbox/page).
**Endpoints:**
- `POST /v1/extract/{doc_id}` body: `{strategy: "llm|rules|hybrid"}`
- `GET /v1/extract/{doc_id}` structured JSON
**Implementation:**
- Use prompt files in `prompts/`: `doc_classify.txt`, `kv_extract.txt`, `table_extract.txt`.
- **Validator loop**: run the LLM, validate the output against the JSON Schema, and on failure retry with the validation errors fed back into the prompt, up to N times.
- Return Pydantic models from `libs/schemas.py`.
- Emit `doc.extracted`.
**Env:** `LLM_ENGINE`, `TEMPERATURE`, `MAX_TOKENS`.
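The validator loop can be sketched generically (`call_llm` and `validate` are injected, so the loop stays testable without a model or a jsonschema dependency):

```python
# Validator-loop sketch: validate LLM output, feed errors back on retry.
def extract_with_retries(call_llm, validate, max_attempts: int = 3):
    """call_llm(errors) produces a candidate; validate(candidate) returns
    a falsy value when schema-valid, else a list of error messages."""
    errors = None
    for _ in range(max_attempts):
        candidate = call_llm(errors)     # real prompt appends prior errors
        errors = validate(candidate)
        if not errors:
            return candidate
    raise ValueError(f"schema-invalid after {max_attempts} attempts: {errors}")
```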
---
### 5) `svc-normalize-map`
**Purpose:** Normalize & map extracted data to KG.
**Logic:**
- Currency normalization (ECB or static fx table), dates, UK tax year/basis period inference.
- Entity resolution (blocking + fuzzy).
- Generate nodes/edges (+ `Evidence` with doc_id/page/bbox/text_hash).
- Use `libs/neo.py` to write with **bitemporal** fields; run **SHACL** validator; on violation, queue `review.requested`.
- Emit `kg.upserted`.
**Endpoints:**
- `POST /v1/map/{doc_id}`
- `GET /v1/map/{doc_id}/preview` (diff view, to be used by UI)
**Env:** `NEO4J_*`.
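The UK tax-year inference step can be sketched as a pure function (years run 6 April to 5 April; currency normalization and entity resolution are omitted):

```python
# UK tax-year label for a given date, e.g. 2025-01-10 -> "2024-25".
from datetime import date

def uk_tax_year(d: date) -> str:
    """Return the 'YYYY-YY' label for the tax year containing `d`."""
    start = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start}-{str(start + 1)[-2:]}"
```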
---
### 6) `svc-kg`
**Purpose:** Graph façade + RDF/SHACL utility.
**Endpoints:**
- `GET /v1/kg/nodes/{label}/{id}`
- `POST /v1/kg/cypher` (admin-gated inline query; must check `admin` role)
- `POST /v1/kg/export/rdf` (returns RDF for SHACL)
- `POST /v1/kg/validate` (run pySHACL against `schemas/shapes.ttl`)
- `GET /v1/kg/lineage/{node_id}` (traverse `DERIVED_FROM` Evidence)
**Env:** `NEO4J_*`.
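The lineage endpoint can be sketched as a breadth-first walk over an in-memory adjacency map (the real endpoint would run the equivalent Cypher over `DERIVED_FROM` edges in Neo4j):

```python
# Lineage traversal sketch: walk DERIVED_FROM edges back to Evidence nodes.
def lineage(node_id: str, derived_from: dict) -> list:
    """Breadth-first traversal; derived_from maps node -> list of parents."""
    seen, queue, out = {node_id}, [node_id], []
    while queue:
        current = queue.pop(0)
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                out.append(parent)
                queue.append(parent)
    return out
```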
---
### 7) `svc-rag-indexer`
**Purpose:** Build Qdrant indices (firm knowledge, legislation, best practices, glossary).
**Workflow:**
- Load sources (filesystem, URLs, Firm DMS via `svc-firm-connectors`).
- **De-identify PII** (regex + NER), replace with placeholders; store mapping only in Postgres.
- Chunk (layout-aware) per `retrieval/chunking.yaml`.
- Compute **dense** embeddings (e.g., `bge-small-en-v1.5`) and **sparse** (Qdrant sparse).
- Upsert to Qdrant with payload `{jurisdiction, tax_years[], topic_tags[], version, pii_free: true, doc_id/section_id/url}`.
- Emit `rag.indexed`.
**Endpoints:**
- `POST /v1/index/run`
- `GET /v1/index/status/{run_id}`
**Env:** `QDRANT_URL`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`.
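The de-identification step can be sketched with regex alone (the NER pass is omitted; the two patterns here are illustrative). The placeholder-to-hash mapping is what would land in Postgres, never in Qdrant:

```python
# PII de-identification sketch: replace matches with placeholders, keep
# only a sha256 of the original value in the mapping.
import hashlib
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "NINO": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),  # UK National Insurance no.
}

def deidentify(text: str):
    mapping = {}
    for label, pattern in PATTERNS.items():
        for match in set(pattern.findall(text)):
            placeholder = f"[{label}_{len(mapping)}]"
            mapping[placeholder] = hashlib.sha256(match.encode()).hexdigest()
            text = text.replace(match, placeholder)
    return text, mapping
```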
---
### 8) `svc-rag-retriever`
**Purpose:** Hybrid search + KG fusion with rerank and calibrated confidence.
**Endpoint:**
- `POST /v1/rag/search` `{query, tax_year?, jurisdiction?, k?}` →
```
{
"chunks": [...],
"citations": [{doc_id|url, section_id?, page?, bbox?}],
"kg_hints": [{rule_id, formula_id, node_ids[]}],
"calibrated_confidence": 0.0-1.0
}
```
**Implementation:**
- Hybrid score: `alpha * dense + beta * sparse`; rerank top-K via cross-encoder; **KG fusion** (boost chunks citing Rules/Calculations relevant to schedule).
- Use `libs/calibration.py` to expose calibrated confidence.
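The fusion step can be sketched as below (dense/sparse scores and KG rule hits arrive precomputed; the cross-encoder rerank is reduced to a sort-and-cut, and the default weights are assumptions):

```python
# Hybrid-score fusion sketch: alpha*dense + beta*sparse plus a KG boost.
def fuse(chunks: list, alpha: float = 0.7, beta: float = 0.3,
         kg_boost: float = 0.1, k: int = 5) -> list:
    """Score each chunk, boost chunks that cite a KG Rule, return top-K."""
    for c in chunks:
        c["score"] = alpha * c["dense"] + beta * c["sparse"]
        if c.get("cites_rule"):
            c["score"] += kg_boost
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]
```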
---
### 9) `svc-reason`
**Purpose:** Deterministic calculators + materializers (UK SA).
**Endpoints:**
- `POST /v1/reason/compute_schedule` `{tax_year, taxpayer_id, schedule_id}`
- `GET /v1/reason/explain/{schedule_id}` → rationale & lineage paths
**Implementation:**
- Pure functions for: employment, self-employment, property (FHL, 20% interest credit), dividends/interest, allowances, NIC (Class 2/4), HICBC, student loans (Plans 1/2/4/5, PGL).
- **Deterministic order** as defined; rounding per `FormBox.rounding_rule`.
- Use Cypher from `kg/reasoning/schedule_queries.cypher` to materialize box values; attach `DERIVED_FROM` evidence.
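One of the calculators above can be sketched to show the deterministic style: the personal-allowance taper, using `Decimal` for exact arithmetic. The figures (PA £12,570, taper from £100,000) are the 2024-25 values and are assumed here, not confirmed by the spec:

```python
# Personal-allowance taper sketch: PA shrinks by £1 per £2 of adjusted net
# income over the threshold, floored at zero. Constants are 2024-25 figures.
from decimal import Decimal

PA = Decimal("12570")
TAPER_THRESHOLD = Decimal("100000")

def personal_allowance(adjusted_net_income: Decimal) -> Decimal:
    excess = max(Decimal(0), adjusted_net_income - TAPER_THRESHOLD)
    return max(Decimal(0), PA - excess / 2)
```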
---
### 10) `svc-forms`
**Purpose:** Fill PDFs and assemble evidence bundles.
**Endpoints:**
- `POST /v1/forms/fill` `{tax_year, taxpayer_id, form_id}` → returns PDF (binary)
- `POST /v1/forms/evidence_pack` `{scope}` → ZIP + manifest + signed hashes (sha256)
**Implementation:**
- `pdfrw` for AcroForm; overlay with ReportLab if needed.
- Manifest includes `doc_id/page/bbox/text_hash` for every numeric field.
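The manifest entry can be sketched as below: every numeric field carries its provenance plus a sha256 of the source text, so the evidence pack is independently verifiable (the field names mirror the lineage fields above):

```python
# Evidence-manifest entry sketch for one filled form field.
import hashlib

def manifest_entry(field: str, value: str, doc_id: str,
                   page: int, bbox: list) -> dict:
    return {
        "field": field,
        "value": value,
        "doc_id": doc_id,
        "page": page,
        "bbox": bbox,
        "text_hash": "sha256:" + hashlib.sha256(value.encode()).hexdigest(),
    }
```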
---
### 11) `svc-hmrc`
**Purpose:** HMRC submitter (stub|sandbox|live).
**Endpoints:**
- `POST /v1/hmrc/submit` `{tax_year, taxpayer_id, dry_run}` → `{status, submission_id?, errors[]}`
- `GET /v1/hmrc/submissions/{id}`
**Implementation:**
- Rate limits, retries/backoff, signed audit log; environment toggle.
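The retry/backoff schedule can be sketched deterministically (jitter and the signed audit log are left out so the schedule itself is inspectable; base and cap are assumed defaults):

```python
# Exponential backoff sketch: delay before retry n is min(cap, base * 2**n).
def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    return [min(cap, base * (2 ** n)) for n in range(attempts)]
```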
---
### 12) `svc-firm-connectors`
**Purpose:** Read-only connectors to Firm Databases (Practice Mgmt, DMS).
**Endpoints:**
- `POST /v1/firm/sync` `{since?}` → `{objects_synced, errors[]}`
- `GET /v1/firm/objects` (paged)
**Implementation:**
- Data contracts in `config/firm_contracts/`; mappers → Secure Client Data Store (Postgres) with lineage columns (`source`, `source_id`, `synced_at`).
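The mapper step can be sketched as a pure function from a raw connector payload to a Secure Client Data Store row with the lineage columns stamped (the input field names are illustrative, not from a real contract):

```python
# Firm-connector mapper sketch: normalize a payload and stamp lineage columns.
from datetime import datetime, timezone

def to_firm_object(source: str, raw: dict) -> dict:
    return {
        "source": source,
        "source_id": str(raw["id"]),
        "type": raw.get("type", "unknown"),
        "payload": raw,
        "synced_at": datetime.now(timezone.utc).isoformat(),
    }
```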
---
### 13) `ui-review` (outline only)
- Next.js (SSO handled by Traefik+Authentik); shows extracted fields + evidence snippets; POSTs user overrides to `svc-extract`/`svc-normalize-map`.
---
## DATA CONTRACTS (ESSENTIAL EXAMPLES)
**Event: `doc.ingested`**
```json
{
"event_id": "01J...ULID",
"occurred_at": "2025-09-13T08:00:00Z",
"actor": "svc-ingestion",
"tenant_id": "t_123",
"trace_id": "abc-123",
"schema_version": "1.0",
"data": {
"doc_id": "d_abc",
"bucket": "raw",
"key": "tenants/t_123/raw/d_abc.pdf",
"checksum": "sha256:...",
"kind": "bank_statement",
"mime": "application/pdf",
"pages": 12
}
}
```
**RAG search response shape**
```json
{
"chunks": [
{
"id": "c1",
"text": "...",
"score": 0.78,
"payload": {
"jurisdiction": "UK",
"tax_years": ["2024-25"],
"topic_tags": ["FHL"],
"pii_free": true
}
}
],
"citations": [
{ "doc_id": "leg-ITA2007", "section_id": "s272A", "url": "https://..." }
],
"kg_hints": [
{
"rule_id": "UK.FHL.Qual",
"formula_id": "FHL_Test_v1",
"node_ids": ["n123", "n456"]
}
],
"calibrated_confidence": 0.81
}
```
---
## PERSISTENCE SCHEMAS (POSTGRES; ALEMBIC)
- `ingest_documents(id pk, tenant_id, doc_id, kind, checksum, bucket, key, mime, pages, created_at)`
- `firm_objects(id pk, tenant_id, source, source_id, type, payload jsonb, synced_at)`
- Qdrant PII mapping table (if absolutely needed): `pii_links(id pk, placeholder_hash, client_id, created_at)` — **encrypt with Vault Transit**; do NOT store raw values.
---
## TRAEFIK + AUTHENTIK (COMPOSE LABELS PER SERVICE)
For every service container in `infra/compose/docker-compose.local.yml`, add labels:
```yaml
- "traefik.enable=true"
- "traefik.http.routers.svc-extract.rule=Host(`api.local`) && PathPrefix(`/extract`)"
- "traefik.http.routers.svc-extract.entrypoints=websecure"
- "traefik.http.routers.svc-extract.tls=true"
- "traefik.http.routers.svc-extract.middlewares=authentik-forwardauth,rate-limit"
- "traefik.http.services.svc-extract.loadbalancer.server.port=8000"
```
Use the shared dynamic file `traefik-dynamic.yml` with `authentik-forwardauth` and `rate-limit` middlewares.
---
## OUTPUT FORMAT (STRICT)
Implement a **multi-file codebase** as fenced blocks, EXACTLY in this order:
```txt
# FILE: libs/config.py
# factories for Vault/MinIO/Qdrant/Neo4j/Redis/EventBus, Settings base
...
```
```txt
# FILE: libs/security.py
# Vault Transit helpers, header parsing, internal CIDR checks, middleware
...
```
```txt
# FILE: libs/observability.py
# otel init, prometheus, structlog
...
```
```txt
# FILE: libs/events.py
# EventBus abstraction with Kafka and SQS/SNS impls
...
```
```txt
# FILE: libs/schemas.py
# Shared Pydantic models mirroring ontology entities
...
```
```txt
# FILE: apps/svc-ingestion/main.py
# FastAPI app, endpoints, MinIO write, Postgres, publish doc.ingested
...
```
```txt
# FILE: apps/svc-rpa/main.py
# Playwright flows, Prefect tasks, triggers
...
```
```txt
# FILE: apps/svc-ocr/main.py
# OCR pipeline, endpoints
...
```
```txt
# FILE: apps/svc-extract/main.py
# Classifier + extractors with validator loop
...
```
```txt
# FILE: apps/svc-normalize-map/main.py
# normalization, entity resolution, KG mapping, SHACL validation call
...
```
```txt
# FILE: apps/svc-kg/main.py
# KG façade, RDF export, SHACL validate, lineage traversal
...
```
```txt
# FILE: apps/svc-rag-indexer/main.py
# chunk/de-id/embed/upsert to Qdrant
...
```
```txt
# FILE: apps/svc-rag-retriever/main.py
# hybrid retrieval + rerank + KG fusion
...
```
```txt
# FILE: apps/svc-reason/main.py
# deterministic calculators, schedule compute/explain
...
```
```txt
# FILE: apps/svc-forms/main.py
# PDF fill + evidence pack
...
```
```txt
# FILE: apps/svc-hmrc/main.py
# submit stub|sandbox|live with audit + retries
...
```
```txt
# FILE: apps/svc-firm-connectors/main.py
# connectors to practice mgmt & DMS, sync to Postgres
...
```
```txt
# FILE: infra/compose/docker-compose.local.yml
# Traefik, Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prom+Grafana, Loki, Unleash, all services
...
```
```txt
# FILE: infra/compose/traefik.yml
# static Traefik config
...
```
```txt
# FILE: infra/compose/traefik-dynamic.yml
# forwardAuth middleware + routers/services
...
```
```txt
# FILE: .gitea/workflows/ci.yml
# lint->test->build->scan->push->deploy
...
```
```txt
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```
```txt
# FILE: tests/e2e/test_happy_path.py
# end-to-end: ingest -> ocr -> extract -> map -> compute -> fill -> (stub) submit
...
```
```txt
# FILE: tests/unit/test_calculators.py
# boundary tests for UK SA logic (NIC, HICBC, PA taper, FHL)
...
```
```txt
# FILE: README.md
# how to run locally with docker-compose, Authentik setup, Traefik certs
...
```
## DEFINITION OF DONE
- `docker compose up` brings the full stack up; SSO via Authentik; routes secured via Traefik ForwardAuth.
- Running `pytest` yields ≥ 90% coverage; `make e2e` passes the ingest→…→submit stub flow.
- All services expose `/healthz|/readyz|/livez|/metrics`; OpenAPI at `/docs`.
- No PII stored in Qdrant; vectors carry `pii_free=true`.
- KG writes are SHACL-validated; violations produce `review.requested` events.
- Evidence lineage is present for every numeric box value.
- Gitea pipeline passes: lint, test, build, scan, push, deploy.
# START
Generate the full codebase and configs in the **exact file blocks and order** specified above.