recovered config

2025-10-16 08:57:14 +01:00
parent eea46ac89c
commit 8fe5e62fee
14 changed files with 775 additions and 1000 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,615 @@
+# AI Tax Agent - Production Microservices Suite
+
+A production-grade, AI-powered tax agent for UK Self Assessment, built on a microservices architecture with a knowledge graph, retrieval-augmented generation (RAG), and HMRC integration.
+
+## 🏗️ Architecture Overview
+
+This system implements a complete end-to-end tax processing pipeline with:
+
+- **13 Microservices** for document processing, extraction, reasoning, and submission
+- **Knowledge Graph** (Neo4j) with bitemporal modeling for audit trails
+- **Vector Database** (Qdrant) for RAG with PII protection
+- **Edge Authentication** via Traefik + Authentik SSO
+- **Event-Driven Architecture** with Kafka messaging (planned; the event streaming layer is still TBD)
+- **Comprehensive Observability** with OpenTelemetry, Prometheus, and Grafana
+
+## 🚀 Quick Start
+
+### Prerequisites
+
+- Docker and Docker Compose
+- Python 3.12+
+- Node.js 18+ (for UI components)
+- 16GB+ RAM recommended
+- OpenAI API key (for LLM extraction)
+
+### 1. Clone and Setup
+
+```bash
+git clone <repository-url>
+cd ai-tax-agent-2
+
+# Bootstrap the development environment
+make bootstrap
+
+# Edit .env with your configuration
+# Minimum required: OPENAI_API_KEY
+```
+
+### 2. Start Infrastructure (Automated)
+
+```bash
+# Start all services with automated fixes
+make run
+
+# Alternative: Start without fixes (original behavior)
+make run-simple
+
+# Or deploy infrastructure only
+make deploy-infra
+```
+
+### 3. Complete Authentik Setup
+
+After deployment, complete the SSO setup:
+
+1. Visit https://auth.local.lan/if/flow/initial-setup/
+2. Create the initial admin user
+3. Configure applications for protected services
+
+```bash
+# Run setup helper (optional)
+make setup-authentik
+```
+
+### 4. Access Services
+
+- **Traefik Dashboard**: http://localhost:8080
+- **Authentik SSO**: https://auth.local.lan
+- **Grafana**: https://grafana.local.lan
+- **Review UI**: https://review.local.lan (requires Authentik setup)
+- **API Gateway**: https://api.local.lan
+
+## 🤖 Automation & Scripts
+
+The system includes comprehensive automation for deployment and troubleshooting:
+
+### Core Commands
+
+```bash
+# Complete automated deployment with fixes
+make run
+
+# Bootstrap environment
+make bootstrap
+
+# Deploy infrastructure only
+make deploy-infra
+
+# Deploy application services only
+make deploy-services
+```
+
+### Troubleshooting & Maintenance
+
+```bash
+# Run comprehensive troubleshooting
+make troubleshoot
+
+# Fix database issues
+make fix-databases
+
+# Restart Authentik components
+make restart-authentik
+
+# Restart Unleash with fixes
+make restart-unleash
+
+# Verify all endpoints
+make verify
+
+# Check service health
+make health
+
+# View service status
+make status
+```
+
+### Automated Fixes
+
+The deployment automation handles:
+
+- **Database Initialization**: Creates required databases (unleash, authentik)
+- **Password Reset**: Fixes Authentik database authentication issues
+- **Service Ordering**: Starts services in correct dependency order
+- **Health Monitoring**: Waits for services to be healthy before proceeding
+- **Network Setup**: Creates required Docker networks
+- **Certificate Generation**: Generates self-signed TLS certificates
+- **Host Configuration**: Sets up local domain resolution
+
+## 📋 Services Overview
+
+### Core Processing Pipeline
+
+1. **svc-ingestion** (Port 8001) - Document upload and storage
+2. **svc-rpa** (Port 8002) - Browser automation for portal data
+3. **svc-ocr** (Port 8003) - OCR and layout extraction
+4. **svc-extract** (Port 8004) - LLM-based field extraction
+5. **svc-normalize-map** (Port 8005) - Data normalization and KG mapping
+6. **svc-kg** (Port 8006) - Knowledge graph operations
+
+### AI & Reasoning
+
+7. **svc-rag-indexer** (Port 8007) - Vector database indexing
+8. **svc-rag-retriever** (Port 8008) - Hybrid search with KG fusion
+9. **svc-reason** (Port 8009) - Tax calculation engine
+10. **svc-coverage** (Port 8013) - Document coverage policy evaluation
+
+### Output & Integration
+
+11. **svc-forms** (Port 8010) - PDF form filling
+12. **svc-hmrc** (Port 8011) - HMRC submission service
+13. **svc-firm-connectors** (Port 8012) - Practice management integration
+
+## 🔧 Development
+
+### Project Structure
+
+```
+ai-tax-agent-2/
+├── libs/                    # Shared libraries
+│   ├── config.py           # Configuration and factories
+│   ├── security.py         # Authentication and encryption
+│   ├── observability.py    # Tracing, metrics, logging
+│   ├── events.py           # Event bus abstraction
+│   ├── schemas.py          # Pydantic models
+│   ├── storage.py          # MinIO/S3 operations
+│   ├── neo.py              # Neo4j operations
+│   ├── rag.py              # RAG and vector operations
+│   ├── forms.py            # PDF form handling
+│   ├── calibration.py      # ML confidence calibration
+│   ├── policy.py           # Coverage policy loading and compilation
+│   ├── coverage_models.py  # Coverage system data models
+│   ├── coverage_eval.py    # Coverage evaluation engine
+│   └── coverage_schema.json # JSON schema for policy validation
+├── apps/                   # Microservices
+│   ├── svc-ingestion/      # Document ingestion service
+│   ├── svc-rpa/            # RPA automation service
+│   ├── svc-ocr/            # OCR processing service
+│   ├── svc-extract/        # Field extraction service
+│   ├── svc-normalize-map/  # Normalization service
+│   ├── svc-kg/             # Knowledge graph service
+│   ├── svc-rag-indexer/    # RAG indexing service
+│   ├── svc-rag-retriever/  # RAG retrieval service
+│   ├── svc-reason/         # Tax reasoning service
+│   ├── svc-coverage/       # Document coverage policy service
+│   ├── svc-forms/          # Form filling service
+│   ├── svc-hmrc/           # HMRC integration service
+│   └── svc-firm-connectors/ # Firm integration service
+├── infra/                  # Infrastructure
+│   ├── compose/            # Docker Compose files
+│   ├── k8s/                # Kubernetes manifests
+│   └── terraform/          # Terraform configurations
+├── tests/                  # Test suites
+│   ├── e2e/                # End-to-end tests
+│   └── unit/               # Unit tests
+├── config/                 # Configuration files
+├── schemas/                # Data schemas
+├── db/                     # Database schemas
+└── docs/                   # Documentation
+```
+
+### Running Tests
+
+```bash
+# Unit tests
+make test-unit
+
+# End-to-end tests
+make test-e2e
+
+# All tests
+make test
+```
+
+### Development Workflow
+
+```bash
+# Start development environment
+make dev
+
+# Watch logs for specific service
+make logs SERVICE=svc-extract
+
+# Restart specific service
+make restart SERVICE=svc-extract
+
+# Run linting and formatting
+make lint
+make format
+
+# Generate API documentation
+make docs
+```
+
+## 🔐 Security & Authentication
+
+### Edge Authentication
+
+- **Traefik** reverse proxy with SSL termination
+- **Authentik** SSO provider with OIDC/SAML support
+- **ForwardAuth** middleware for service authentication
+- **Zero-trust** architecture - services consume user context via headers
+
+### Data Protection
+
+- **Vault Transit** encryption for sensitive fields
+- **PII Detection** and de-identification before vector indexing
+- **Tenant Isolation** with row-level security
+- **Audit Trails** with bitemporal data modeling
+
+### Network Security
+
+- **Internal Networks** for service communication
+- **TLS Everywhere** with automatic certificate management
+- **Rate Limiting** and DDoS protection
+- **Security Headers** and CORS policies
+
+## 📊 Observability
+
+### Metrics & Monitoring
+
+- **Prometheus** for metrics collection
+- **Grafana** for visualization and alerting
+- **Custom Business Metrics** for document processing, RAG, calculations
+- **SLI/SLO Monitoring** with error budgets
+
+### Tracing & Logging
+
+- **OpenTelemetry** distributed tracing
+- **Jaeger** trace visualization
+- **Structured Logging** with correlation IDs
+- **Log Aggregation** with ELK stack (optional)
+
+### Health Checks
+
+```bash
+# Check all service health
+make health
+
+# Individual service health
+curl http://localhost:8001/health
+```
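+
+For scripted checks, the same probe can be sketched in Python. This is a minimal illustration, not the implementation behind `make health`: the port range comes from the Services Overview and the `/health` path from the curl example, while the host, the timeout, and the helper names are assumptions.
+
+```python
+from urllib.error import URLError
+from urllib.request import urlopen
+
+# Ports 8001-8013 as listed in the Services Overview.
+SERVICE_PORTS = range(8001, 8014)
+
+
+def health_url(port: int, host: str = "localhost") -> str:
+    """Build the health endpoint URL (host is an assumption)."""
+    return f"http://{host}:{port}/health"
+
+
+def check_health(port: int, timeout: float = 2.0) -> bool:
+    """Return True if the service answers its health endpoint with HTTP 200."""
+    try:
+        with urlopen(health_url(port), timeout=timeout) as resp:
+            return resp.status == 200
+    except (URLError, OSError):
+        # Connection refused, timeout, or an HTTP error status.
+        return False
+
+
+if __name__ == "__main__":
+    for port in SERVICE_PORTS:
+        print(port, "ok" if check_health(port) else "unreachable")
+```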
+
+## 🗃️ Data Architecture
+
+### Knowledge Graph (Neo4j)
+
+- **Bitemporal Modeling** with valid_time and system_time
+- **SHACL Validation** for data integrity
+- **Tenant Isolation** with security constraints
+- **Audit Trails** for all changes
+
+### Vector Database (Qdrant)
+
+- **PII-Free Indexing** with de-identification
+- **Hybrid Search** combining dense and sparse vectors
+- **Collection Management** per tenant and data type
+- **Confidence Calibration** for search results
+
+### Event Streaming (Kafka) - (TBD)
+
+- **Event-Driven Architecture** with standardized topics
+- **Exactly-Once Processing** with idempotency
+- **Dead Letter Queues** for error handling
+- **Schema Registry** for event validation
+
+## 🧮 Tax Calculation Engine
+
+### Supported Forms
+
+- **SA100** - Main Self Assessment return
+- **SA103** - Self-employment income
+- **SA105** - Property income
+- **SA106** - Foreign income
+
+### Calculation Features
+
+- **Rules Engine** with configurable tax rules
+- **Evidence Trails** linking calculations to source documents
+- **Confidence Scoring** with calibration
+- **Multi-Year Support** with basis period reform
+
+### HMRC Integration
+
+- **MTD API** integration for submissions
+- **OAuth 2.0** authentication flow
+- **Dry Run** mode for testing
+- **Validation** against HMRC business rules
+
+## 🔌 Integrations
+
+### Practice Management Systems
+
+- **IRIS** Practice Management
+- **Sage** Practice Management
+- **Xero** accounting software
+- **QuickBooks** accounting software
+- **FreeAgent** accounting software
+- **KashFlow** accounting software
+
+### Document Sources
+
+- **Direct Upload** via web interface
+- **Email Integration** with attachment processing
+- **Portal Scraping** via RPA automation
+- **API Integration** with accounting systems
+
+## 🚀 Deployment
+
+### Local Development
+
+```bash
+make up      # Start all services
+make down    # Stop all services
+make clean   # Clean up volumes and networks
+```
+
+### Production Deployment
+
+```bash
+# Using Docker Swarm
+make deploy-swarm
+
+# Using Kubernetes
+make deploy-k8s
+
+# Using Terraform (AWS/Azure/GCP)
+cd infra/terraform
+terraform init
+terraform plan
+terraform apply
+```
+
+### Environment Configuration
+
+Key environment variables:
+
+```bash
+# Database connections
+DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
+NEO4J_URI=bolt://neo4j:7687
+QDRANT_URL=http://qdrant:6333
+
+# External services
+OPENAI_API_KEY=sk-...
+VAULT_ADDR=http://vault:8200
+KAFKA_BOOTSTRAP_SERVERS=kafka:9092
+
+# Security
+AUTHENTIK_SECRET_KEY=your-secret-key
+VAULT_ROLE_ID=your-role-id
+VAULT_SECRET_ID=your-secret-id
+```
+
+## 📚 API Documentation
+
+### Authentication
+
+All API endpoints require authentication via Authentik ForwardAuth. Services receive the authenticated user, groups, and tenant as request headers:
+
+```bash
+curl -H "X-Forwarded-User: user@example.com" \
+     -H "X-Forwarded-Groups: tax_agents" \
+     -H "X-Tenant-ID: tenant-123" \
+     https://api.localhost/api/ingestion/health
+```
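+
+On the service side, reading these headers might look like the sketch below. The header names are taken from the curl example; the `UserContext` type, the function name, and the comma-separated format for multiple groups are assumptions, not the contents of `libs/security.py`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class UserContext:
+    user: str
+    groups: tuple[str, ...]
+    tenant_id: str
+
+
+def user_context_from_headers(headers: dict[str, str]) -> UserContext:
+    """Build a UserContext from the forwarded identity headers."""
+    # HTTP header names are case-insensitive, so normalise before lookup.
+    lowered = {name.lower(): value for name, value in headers.items()}
+    try:
+        user = lowered["x-forwarded-user"]
+        tenant_id = lowered["x-tenant-id"]
+    except KeyError as exc:
+        raise PermissionError(f"missing identity header: {exc.args[0]}") from exc
+    # Assumption: multiple groups arrive as a comma-separated list.
+    raw_groups = lowered.get("x-forwarded-groups", "")
+    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
+    return UserContext(user=user, groups=groups, tenant_id=tenant_id)
+```
+
+Rejecting requests that lack the user or tenant header keeps a misrouted call from being processed without an identity.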
+
+### Key Endpoints
+
+- `POST /api/ingestion/upload` - Upload documents
+- `GET /api/extract/status/{doc_id}` - Check extraction status
+- `POST /api/rag-retriever/search` - Search knowledge base
+- `POST /api/reason/compute` - Trigger tax calculations
+- `POST /api/forms/fill/{form_id}` - Fill PDF forms
+- `POST /api/hmrc/submit` - Submit to HMRC
+
+### Event Topics
+
+- `DOC_INGESTED` - Document uploaded
+- `DOC_OCR_READY` - OCR completed
+- `DOC_EXTRACTED` - Fields extracted
+- `KG_UPSERTED` - Knowledge graph updated
+- `RAG_INDEXED` - Vector indexing completed
+- `CALC_SCHEDULE_READY` - Tax calculation completed
+- `FORM_FILLED` - PDF form filled
+- `HMRC_SUBMITTED` - HMRC submission completed
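+
+One way to keep topic names consistent across services is a shared enum plus a small event envelope. The sketch below is illustrative only: the topic names are the ones listed above, but the envelope fields (`event_id`, `tenant_id`, `occurred_at`, `payload`) are assumptions and may differ from `libs/events.py`.
+
+```python
+import json
+import uuid
+from dataclasses import asdict, dataclass, field
+from datetime import datetime, timezone
+from enum import Enum
+
+
+class Topic(str, Enum):
+    DOC_INGESTED = "DOC_INGESTED"
+    DOC_OCR_READY = "DOC_OCR_READY"
+    DOC_EXTRACTED = "DOC_EXTRACTED"
+    KG_UPSERTED = "KG_UPSERTED"
+    RAG_INDEXED = "RAG_INDEXED"
+    CALC_SCHEDULE_READY = "CALC_SCHEDULE_READY"
+    FORM_FILLED = "FORM_FILLED"
+    HMRC_SUBMITTED = "HMRC_SUBMITTED"
+
+
+@dataclass
+class Event:
+    topic: Topic
+    tenant_id: str
+    payload: dict
+    # A unique ID lets consumers detect and skip duplicate deliveries.
+    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
+    occurred_at: str = field(
+        default_factory=lambda: datetime.now(timezone.utc).isoformat()
+    )
+
+    def to_json(self) -> str:
+        """Serialise the event, writing the topic as its plain string name."""
+        data = asdict(self)
+        data["topic"] = self.topic.value
+        return json.dumps(data)
+```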
+
+## 🤝 Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Add tests
+5. Run the test suite
+6. Submit a pull request
+
+### Code Standards
+
+- **Python**: Black formatting, isort imports, mypy type checking
+- **Documentation**: Docstrings for all public functions
+- **Testing**: Minimum 80% code coverage
+- **Security**: No secrets in code, use Vault for sensitive data
+
+## 📋 Coverage Policy System
+
+The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.
+
+### Policy Configuration
+
+Coverage policies are defined in `config/coverage.yaml`, with support for jurisdiction- and tenant-specific overlays:
+
+```yaml
+# config/coverage.yaml
+version: "1.0"
+jurisdiction: "UK"
+tax_year: "2024-25"
+tax_year_boundary:
+  start: "2024-04-06"
+  end: "2025-04-05"
+
+defaults:
+  confidence_thresholds:
+    ocr: 0.82
+    extract: 0.85
+  date_tolerance_days: 30
+
+triggers:
+  SA102: # Employment schedule
+    any_of:
+      - "exists(IncomeItem[type='Employment'])"
+  SA105: # Property schedule
+    any_of:
+      - "exists(IncomeItem[type='UKPropertyRent'])"
+
+schedules:
+  SA102:
+    evidence:
+      - id: "P60"
+        role: "REQUIRED"
+        boxes: ["SA102_b1", "SA102_b2"]
+        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
+      - id: "P11D"
+        role: "CONDITIONALLY_REQUIRED"
+        condition: "exists(BenefitInKind=true)"
+        boxes: ["SA102_b9"]
+```
+
+### API Usage
+
+#### Check Document Coverage
+
+```bash
+curl -X POST https://api.localhost/coverage/v1/check \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $TOKEN" \
+  -d '{
+    "taxpayer_id": "T-001",
+    "tax_year": "2024-25",
+    "jurisdiction": "UK"
+  }'
+```
+
+Response:
+
+```json
+{
+  "overall_status": "INCOMPLETE",
+  "schedules_required": ["SA102"],
+  "coverage": [
+    {
+      "schedule_id": "SA102",
+      "status": "INCOMPLETE",
+      "evidence": [
+        {
+          "id": "P60",
+          "status": "MISSING",
+          "role": "REQUIRED",
+          "found": []
+        }
+      ]
+    }
+  ],
+  "blocking_items": [
+    {
+      "schedule_id": "SA102",
+      "evidence_id": "P60",
+      "role": "REQUIRED",
+      "reason": "P60 provides year-end pay and PAYE tax figures",
+      "boxes": ["SA102_b1", "SA102_b2"],
+      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
+    }
+  ]
+}
+```
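+
+A client could turn the `blocking_items` array into a short checklist for the reviewer. The sketch below assumes only the response fields shown above; the function name and output format are illustrative.
+
+```python
+def blocking_summary(response: dict) -> list[str]:
+    """Return one human-readable line per blocking item in a coverage response."""
+    lines = []
+    for item in response.get("blocking_items", []):
+        alternatives = ", ".join(item.get("acceptable_alternatives", [])) or "none"
+        lines.append(
+            f"{item['schedule_id']}: missing {item['evidence_id']} "
+            f"({item['role']}); alternatives: {alternatives}"
+        )
+    return lines
+```
+
+For the response above, this yields a single line naming the missing P60 for SA102 along with its acceptable alternatives.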
+
+#### Generate Clarifying Questions
+
+```bash
+curl -X POST https://api.localhost/coverage/v1/clarify \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $TOKEN" \
+  -d '{
+    "taxpayer_id": "T-001",
+    "tax_year": "2024-25",
+    "jurisdiction": "UK",
+    "schedule_id": "SA102",
+    "evidence_id": "P60"
+  }'
+```
+
+### Policy Hot Reload
+
+Policies can be reloaded without restarting the service:
+
+```bash
+curl -X POST https://api.localhost/coverage/admin/reload \
+  -H "Authorization: Bearer $ADMIN_TOKEN"
+```
+
+### Predicate Language
+
+The policy system supports a domain-specific language for conditions:
+
+- `exists(Entity[filters])` - Check if entities exist with filters
+- `property_name` - Check boolean properties
+- `taxpayer_flag:flag_name` - Check taxpayer flags
+- `filing_mode:mode` - Check filing mode
+- `computed_condition` - Check computed values
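+
+To make the `exists(Entity[filters])` form concrete, here is a minimal sketch that parses and evaluates a single-filter predicate such as `exists(IncomeItem[type='Employment'])`. It is not the parser in `libs/policy.py`: it handles exactly one `key='value'` filter, and the `facts` structure (entity name mapped to a list of property dicts) is an assumption.
+
+```python
+import re
+
+# Matches exists(Entity[key='value']) with exactly one filter.
+_EXISTS = re.compile(
+    r"^exists\((?P<entity>\w+)\[(?P<key>\w+)='(?P<value>[^']*)'\]\)$"
+)
+
+
+def parse_exists(predicate: str) -> tuple[str, str, str]:
+    """Split a predicate into (entity, key, value) or raise ValueError."""
+    match = _EXISTS.match(predicate.strip())
+    if match is None:
+        raise ValueError(f"unsupported predicate: {predicate!r}")
+    return match["entity"], match["key"], match["value"]
+
+
+def evaluate_exists(predicate: str, facts: dict[str, list[dict]]) -> bool:
+    """Return True if any entity of the named type has the given property value."""
+    entity, key, value = parse_exists(predicate)
+    return any(item.get(key) == value for item in facts.get(entity, []))
+```
+
+With facts containing an `IncomeItem` whose `type` is `Employment`, the SA102 trigger from the policy example evaluates to true.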
+
+### Status Classification
+
+Evidence is classified into four statuses:
+
+- **PRESENT_VERIFIED**: High confidence OCR/extract, date within tax year
+- **PRESENT_UNVERIFIED**: Medium confidence, may need manual review
+- **CONFLICTING**: Multiple documents with conflicting information
+- **MISSING**: No evidence found or confidence too low
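+
+The classification of a single candidate document can be sketched as follows. The OCR and extraction thresholds and the tax year boundary come from the example `config/coverage.yaml`; the lower `MIN_CONFIDENCE` floor, the `conflicting` flag, and the function signature are assumptions, and the sketch ignores `date_tolerance_days`. The actual rules in `libs/coverage_eval.py` may differ.
+
+```python
+from datetime import date
+from enum import Enum
+
+
+class Status(str, Enum):
+    PRESENT_VERIFIED = "PRESENT_VERIFIED"
+    PRESENT_UNVERIFIED = "PRESENT_UNVERIFIED"
+    CONFLICTING = "CONFLICTING"
+    MISSING = "MISSING"
+
+
+OCR_THRESHOLD = 0.82      # defaults.confidence_thresholds.ocr
+EXTRACT_THRESHOLD = 0.85  # defaults.confidence_thresholds.extract
+MIN_CONFIDENCE = 0.5      # assumption: below this, treat evidence as missing
+TAX_YEAR = (date(2024, 4, 6), date(2025, 4, 5))
+
+
+def classify(
+    ocr_conf: float,
+    extract_conf: float,
+    doc_date: date,
+    conflicting: bool = False,
+) -> Status:
+    """Classify one candidate document against the thresholds above."""
+    if conflicting:
+        return Status.CONFLICTING
+    if min(ocr_conf, extract_conf) < MIN_CONFIDENCE:
+        return Status.MISSING
+    in_year = TAX_YEAR[0] <= doc_date <= TAX_YEAR[1]
+    if ocr_conf >= OCR_THRESHOLD and extract_conf >= EXTRACT_THRESHOLD and in_year:
+        return Status.PRESENT_VERIFIED
+    return Status.PRESENT_UNVERIFIED
+```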
+
+### Testing
+
+Run coverage policy tests:
+
+```bash
+# Unit tests
+pytest tests/unit/coverage/ -v
+
+# Integration tests
+pytest tests/integration/coverage/ -v
+
+# End-to-end tests
+pytest tests/e2e/test_coverage_to_compute_flow.py -v
+
+# Coverage report
+pytest tests/unit/coverage/ --cov=libs --cov-report=html
+```
+
+## 📄 License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
+
+## 🆘 Support
+
+- **Documentation**: See `/docs` directory
+- **Issues**: GitHub Issues
+- **Discussions**: GitHub Discussions
+- **Security**: security@example.com
+
+## 🗺️ Roadmap
+
+- [ ] Advanced ML models for extraction
+- [ ] Multi-jurisdiction support (EU, US)
+- [ ] Real-time collaboration features
+- [ ] Mobile application
+- [ ] Advanced analytics dashboard
+- [ ] Blockchain audit trails