ai-tax-agent/docs/DEPLOYMENT_PLAN.md
harkon eea46ac89c
deployment, linting and infra configuration
2025-10-14 07:42:31 +01:00

# Isolated Stacks Deployment Plan
## Executive Summary
This plan outlines the strategy to host both the **AI Tax Agent application** and **company services** (Nextcloud, Gitea, Portainer, Authentik) on the remote server at `141.136.35.199` while maintaining an efficient local development workflow.
## Current State Analysis
### Remote Server (`141.136.35.199`)
- **Location**: `/opt/compose/`
- **Existing Services**:
  - Traefik v3.5.1 (reverse proxy with GoDaddy DNS challenge)
  - Authentik 2025.8.1 (SSO/Authentication)
  - Gitea 1.24.5 (Git hosting)
  - Nextcloud (Cloud storage)
  - Portainer 2.33.1 (Docker management)
- **Networks**: `frontend` and `backend` (external)
- **Domain**: `harkon.co.uk`
- **SSL**: Let's Encrypt via GoDaddy DNS challenge
- **Exposed Subdomains**:
  - `traefik.harkon.co.uk`
  - `auth.harkon.co.uk`
  - `gitea.harkon.co.uk`
  - `cloud.harkon.co.uk`
  - `portainer.harkon.co.uk`
### Local Repository (`infra/compose/`)
- **Compose Files**:
  - `docker-compose.local.yml` - Full stack for local development
  - `docker-compose.backend.yml` - Backend services (appears to be production-ready)
- **Application Services**:
  - 13+ microservices (svc-ingestion, svc-extract, svc-forms, svc-hmrc, etc.)
  - UI Review application
  - Infrastructure: Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, NATS, Prometheus, Grafana, Loki
- **Networks**: `ai-tax-agent-frontend` and `ai-tax-agent-backend`
- **Domain**: `local.lan` (for development)
- **Authentication**: Authentik with ForwardAuth middleware
## Challenges & Conflicts
### 1. **Duplicate Services**
- Both environments have Traefik and Authentik
- Need to decide: shared vs. isolated
### 2. **Network Naming**
- Remote: `frontend`, `backend`
- Local: `ai-tax-agent-frontend`, `ai-tax-agent-backend`
- Production needs: Consistent naming
### 3. **Domain Management**
- Remote: `*.harkon.co.uk` (public)
- Local: `*.local.lan` (development)
- Production: Need subdomains like `app.harkon.co.uk`, `api.harkon.co.uk`
### 4. **SSL Certificates**
- Remote: GoDaddy DNS challenge (production)
- Local: Self-signed certificates
- Production: Must use GoDaddy DNS challenge
### 5. **Resource Isolation**
- Company services need to remain stable
- Application services need independent deployment/rollback
## Decision: Keep Stacks Completely Separate
We will deploy the company services and the AI Tax Agent as two fully isolated stacks, each with its own Traefik and Authentik. This maximizes blast-radius isolation and avoids naming and DNS conflicts across environments.
Key implications:
- Separate external networks and DNS namespaces per stack
- Duplicate edge (Traefik) and IdP (Authentik), independent upgrades and rollbacks
- Slightly higher resource usage in exchange for strong isolation
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                  Internet (*.harkon.co.uk)                  │
└────────────────────────┬────────────────────────────────────┘
                    ┌────▼────┐
                    │ Traefik │ (Port 80/443)
                    │ v3.5.1  │
                    └────┬────┘
        ┌────────────────┼────────────────┐
        │                │                │
   ┌────▼─────┐     ┌────▼────┐      ┌────▼─────┐
   │Authentik │     │ Company │      │   App    │
   │   SSO    │     │Services │      │ Services │
   └──────────┘     └─────────┘      └──────────┘
                         │                │
                    ┌────┴────┐      ┌────┴────┐
                    │  Gitea  │      │  Vault  │
                    │Nextcloud│      │  MinIO  │
                    │Portainer│      │  Neo4j  │
                    └─────────┘      │ Qdrant  │
                                     │ Postgres│
                                     │  Redis  │
                                     │  NATS   │
                                     │ 13 SVCs │
                                     │   UI    │
                                     └─────────┘
```
### Directory Structure (per stack)
```
/opt/compose/<stack>/
├── traefik/                    # Stack-local reverse proxy
│   ├── compose.yaml
│   ├── config/
│   │   ├── traefik.yaml        # Static config
│   │   ├── dynamic-company.yaml
│   │   └── dynamic-app.yaml
│   └── certs/
├── authentik/                  # Stack-local SSO
│   ├── compose.yaml
│   └── ...
├── company/                    # Company services namespace
│   ├── gitea/
│   │   └── compose.yaml
│   ├── nextcloud/
│   │   └── compose.yaml
│   └── portainer/
│       └── compose.yaml
└── ai-tax-agent/               # Application namespace (if this is the app stack)
    ├── .env                    # Production environment
    ├── infrastructure.yaml     # Vault, MinIO, Neo4j, Qdrant, etc.
    ├── services.yaml           # All microservices
    └── monitoring.yaml         # Prometheus, Grafana, Loki
```
### Network Strategy
- Use stack-scoped network names to avoid collisions: `apa-frontend`, `apa-backend`.
- Only attach services that must be public to `apa-frontend`.
- Keep internal communication on `apa-backend`.
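A minimal sketch of creating the two stack-scoped networks idempotently, so a repeated deployment run does not fail when they already exist. The `ensure_network` helper is hypothetical and not part of the repository; it only wraps the standard `docker network inspect` and `docker network create` commands.

```bash
# Hypothetical helper: create a Docker network only if it does not exist yet.
ensure_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "exists: $name"
  else
    docker network create "$name" >/dev/null && echo "created: $name"
  fi
}

# Usage (on the Docker host):
#   ensure_network apa-frontend
#   ensure_network apa-backend
```

Compose files would then presumably declare both networks as `external: true`, which matches how the existing company stack treats `frontend` and `backend`.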
### Domain Mapping
**Company Services** (existing):
- `traefik.harkon.co.uk` - Traefik dashboard
- `auth.harkon.co.uk` - Authentik SSO
- `gitea.harkon.co.uk` - Git hosting
- `cloud.harkon.co.uk` - Nextcloud
- `portainer.harkon.co.uk` - Docker management
**Application Services** (app stack):
- `review.<domain>` - Review UI
- `api.<domain>` - API Gateway (microservices via Traefik)
- `vault.<domain>` - Vault UI (admin only)
- `minio.<domain>` - MinIO Console (admin only)
- `neo4j.<domain>` - Neo4j Browser (admin only)
- `qdrant.<domain>` - Qdrant UI (admin only)
- `grafana.<domain>` - Grafana (monitoring)
- `prometheus.<domain>` - Prometheus (admin only)
- `loki.<domain>` - Loki (admin only)
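Before requesting certificates, it can help to confirm that every application hostname resolves. The sketch below is an illustration only: `app_hosts` and `check_dns` are hypothetical helpers, and the base domain is passed as an argument because the plan leaves `<domain>` open.

```bash
# Print the application hostnames for a given base domain.
app_hosts() {
  domain="$1"
  for sub in review api vault minio neo4j qdrant grafana prometheus loki; do
    printf '%s.%s\n' "$sub" "$domain"
  done
}

# Report hostnames that do not resolve yet (getent is available on glibc systems).
check_dns() {
  app_hosts "$1" | while read -r host; do
    getent hosts "$host" >/dev/null || echo "unresolved: $host"
  done
}

# Usage:
#   check_dns <domain>
```

An empty result means all nine names resolve from the machine running the check. It says nothing about whether they point at the intended server.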
### Authentication Strategy
**Authentik Configuration**:
1. **Company Group** - Access to Gitea, Nextcloud, Portainer
2. **App Admin Group** - Full access to all app services
3. **App User Group** - Access to Review UI and API
4. **App Reviewer Group** - Access to Review UI only
**Middleware Configuration**:
- `authentik-forwardauth` - Standard auth for all services
- `admin-auth` - Requires admin group (Vault, MinIO, Neo4j, etc.)
- `reviewer-auth` - Requires reviewer or higher
- `rate-limit` - Standard rate limiting
- `api-rate-limit` - Stricter API rate limiting
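As a rough illustration, two of these middlewares could be rendered into a Traefik dynamic-config file as shown below. The `write_middlewares` helper is hypothetical, the outpost base URL is supplied by the caller, and the rate-limit numbers are placeholders rather than values from this plan. The `forwardAuth` and `rateLimit` keys and the `/outpost.goauthentik.io/auth/traefik` path follow the Traefik and Authentik documentation.

```bash
# Hypothetical helper: write a Traefik dynamic-config file with two middlewares.
# $1 = output file, $2 = Authentik outpost base URL reachable from Traefik.
write_middlewares() {
  outfile="$1"
  outpost_url="$2"
  cat > "$outfile" <<EOF
http:
  middlewares:
    authentik-forwardauth:
      forwardAuth:
        address: "${outpost_url}/outpost.goauthentik.io/auth/traefik"
        trustForwardHeader: true
        authResponseHeaders:
          - X-authentik-username
          - X-authentik-groups
          - X-authentik-email
    rate-limit:
      rateLimit:
        average: 100   # placeholder: requests per second
        burst: 50      # placeholder
EOF
}

# Usage (URL is an example, not a confirmed service name):
#   write_middlewares ./dynamic-app.yaml http://<authentik-host>:9000
```

Group-based variants such as `admin-auth` would most likely be enforced through Authentik policy bindings on each application rather than in Traefik itself. That is an assumption about the implementation, not something this plan specifies.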
## Implementation Notes
- `infra/base/infrastructure.yaml` now includes Traefik and Authentik in the infrastructure stack with stack-scoped networks and service names.
- All infrastructure component service keys and container names use the `apa-` prefix to avoid DNS collisions on shared Docker hosts.
- Traefik static and dynamic configs live under `infra/base/traefik/config/`.
## Local Development Workflow
### Development Environment
**Keep Existing Setup**:
- Use `docker-compose.local.yml` as-is
- Domain: `*.local.lan`
- Self-signed certificates
- Isolated networks: `ai-tax-agent-frontend`, `ai-tax-agent-backend`
- Full stack runs locally
**Benefits**:
- No dependency on remote server
- Fast iteration
- Complete isolation
- Works offline
### Development Commands
```bash
# Local development
make bootstrap                    # Initial setup
make up                           # Start all services
make down                         # Stop all services
make logs SERVICE=svc-ingestion   # Show logs for one service

# Build and test
make build                        # Build all images
make test                         # Run tests
make test-integration             # Integration tests

# Deploy to production
make deploy-production            # Deploy to remote server
```
## Production Deployment Strategy
### Phase 1: Preparation (Week 1)
1. **Backup Current State**
   ```bash
   ssh deploy@141.136.35.199
   cd /opt
   tar -czf ~/backup-$(date +%Y%m%d).tar.gz .
   ```
2. **Create Production Environment File**
   - Copy `infra/environments/production/.env.example` to `infra/environments/production/.env`
   - Update all secrets and passwords
   - Set `DOMAIN=harkon.co.uk`
   - Configure GoDaddy API credentials
3. **Update Traefik Configuration**
   - Merge local Traefik config with remote
   - Add application routes
   - Configure Authentik ForwardAuth
4. **Prepare Docker Images**
   - Build all application images
   - Push to container registry (Gitea registry or Docker Hub)
   - Tag with version numbers
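Step 4 can be sketched as a small helper that builds, tags, and pushes one image. Everything named here is an assumption for illustration: the `image_ref` and `build_and_push` helpers, the registry path, the version string, and the build context all depend on how the repository is laid out.

```bash
# Compose an image reference from registry, service name, and version.
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: build one service image and push it under a version tag.
# $1 = registry, $2 = service, $3 = version, $4 = build context directory.
build_and_push() {
  ref="$(image_ref "$1" "$2" "$3")"
  docker build -t "$ref" "$4" &&
    docker push "$ref"
}

# Usage (registry and context are placeholders):
#   build_and_push <registry>/<owner> svc-forms 1.0.0 <path-to-build-context>
```

Tagging each release with an explicit version, rather than relying on `latest`, is what makes the service-level rollback described later possible.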
### Phase 2: Infrastructure Deployment (Week 2)
1. **Deploy Application Infrastructure**
   ```bash
   # On remote server
   cd /opt/ai-tax-agent
   docker compose -f infrastructure.yaml up -d
   ```
2. **Initialize Services**
   - Vault: Unseal and configure
   - Postgres: Run migrations
   - Neo4j: Install plugins
   - MinIO: Create buckets
3. **Configure Authentik**
   - Create application groups
   - Configure OAuth providers
   - Set up ForwardAuth outpost
### Phase 3: Application Deployment (Week 3)
1. **Deploy Microservices**
   ```bash
   docker compose -f services.yaml up -d
   ```
2. **Deploy Monitoring**
   ```bash
   docker compose -f monitoring.yaml up -d
   ```
3. **Verify Health**
   - Check all service health endpoints
   - Verify Traefik routing
   - Test authentication flow
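The health check in step 3 could be scripted as a simple polling loop. This is a sketch under assumptions: `wait_healthy` is a hypothetical helper, and the health endpoint path for each service is not defined in this plan, so the caller supplies the full URL.

```bash
# Hypothetical helper: poll a URL until it returns an HTTP success status.
# $1 = URL, $2 = max attempts (default 30), $3 = delay in seconds (default 2).
wait_healthy() {
  url="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors, -sS hides progress but keeps error text.
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "unhealthy: $url" >&2
  return 1
}

# Usage (endpoint path is a placeholder):
#   wait_healthy https://api.<domain>/<health-path> 30 2
```

Because the function returns a non-zero status on timeout, a deployment script can stop at the first service that never becomes healthy.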
### Phase 4: Testing & Validation (Week 4)
1. **Smoke Tests**
2. **Integration Tests**
3. **Performance Tests**
4. **Security Audit**
## Deployment Files Structure
Create three new compose files for production:
1. **`infrastructure.yaml`** - Traefik, Authentik, Vault, MinIO, Neo4j, Qdrant, Postgres, Redis, NATS
2. **`services.yaml`** - All 13 microservices + UI
3. **`monitoring.yaml`** - Prometheus, Grafana, Loki
## Rollback Strategy
1. **Service-Level Rollback**: Use Docker image tags
2. **Full Rollback**: Restore from backup
3. **Gradual Rollout**: Deploy services incrementally
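A service-level rollback might look like the sketch below. It assumes, without confirmation from this plan, that `services.yaml` sets each image as `...:${IMAGE_TAG}`. The `rollback_service` helper is hypothetical.

```bash
# Hypothetical helper: redeploy one service at a previously published tag.
# Assumes services.yaml references images as "...:${IMAGE_TAG}".
rollback_service() {
  service="$1"; tag="$2"
  # --no-deps recreates only the named service, leaving its dependencies running.
  IMAGE_TAG="$tag" docker compose -f services.yaml up -d --no-deps "$service"
}

# Usage:
#   rollback_service svc-forms <previous-version>
```

With a single shared `IMAGE_TAG`, a later full `up -d` must pin the same tag again or the rollback is undone. Per-service tag variables would avoid that, at the cost of a longer environment file.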
## Monitoring & Maintenance
- **Logs**: Centralized in Loki
- **Metrics**: Prometheus + Grafana
- **Alerts**: Configure Grafana alerts
- **Backups**: Daily automated backups of volumes
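One way to implement the daily volume backups is a pair of small helpers run from cron. Both `backup_volume` and `prune_backups` are hypothetical, as are the volume names, the destination directory, and the 14-day retention default.

```bash
# Hypothetical helper: archive one named Docker volume into a dated tarball.
# $1 = volume name, $2 = absolute destination directory on the host.
backup_volume() {
  volume="$1"; dest_dir="$2"
  archive="${volume}-$(date +%Y%m%d).tar.gz"
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest_dir}:/backup" \
    alpine tar -czf "/backup/${archive}" -C /data . &&
    echo "${dest_dir}/${archive}"
}

# Delete archives older than N days ($2, default 14; the retention is an assumption).
prune_backups() {
  find "$1" -name '*.tar.gz' -type f -mtime +"${2:-14}" -delete
}

# Usage (volume name and directory are placeholders):
#   backup_volume <volume-name> /srv/backups && prune_backups /srv/backups 14
```

Archiving the volume of a running database can capture an inconsistent state. For Postgres and Neo4j, a logical dump or a stopped container is the safer source.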
## Next Steps
1. Review and approve this plan
2. Create production environment file
3. Create production compose files
4. Set up CI/CD pipeline for automated deployment
5. Execute Phase 1 (Preparation)