# AI Tax Agent - Production Microservices Suite

A comprehensive, production-grade AI-powered tax agent system for UK Self Assessment with microservices architecture, knowledge graphs, RAG capabilities, and HMRC integration.

## 🏗️ Architecture Overview

This system implements a complete end-to-end tax processing pipeline with:

- **13 Microservices** for document processing, extraction, reasoning, and submission
- **Knowledge Graph** (Neo4j) with bitemporal modeling for audit trails
- **Vector Database** (Qdrant) for RAG with PII protection
- **Edge Authentication** via Traefik + Authentik SSO
- **Event-Driven Architecture** with Kafka messaging
- **Comprehensive Observability** with OpenTelemetry, Prometheus, and Grafana

## 🚀 Quick Start

### Prerequisites

- Docker and Docker Compose
- Python 3.12+
- Node.js 18+ (for UI components)
- 16GB+ RAM recommended
- OpenAI API key (for LLM extraction)

### 1. Clone and Setup

```bash
git clone <repository-url>
cd ai-tax-agent-2

# Bootstrap the development environment
make bootstrap

# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY
```

### 2. Start Infrastructure (Automated)

```bash
# Start all services with automated fixes
make run

# Alternative: Start without fixes (original behavior)
make run-simple

# Or deploy infrastructure only
make deploy-infra
```

### 3. Complete Authentik Setup

After deployment, complete the SSO setup:

1. Visit https://auth.local.lan/if/flow/initial-setup/
2. Create the initial admin user
3. Configure applications for protected services

```bash
# Run setup helper (optional)
make setup-authentik
```

### 4. Access Services

- **Traefik Dashboard**: http://localhost:8080
- **Authentik SSO**: https://auth.local.lan
- **Grafana**: https://grafana.local.lan
- **Review UI**: https://review.local.lan (requires Authentik setup)
- **API Gateway**: https://api.local.lan

## 🤖 Automation & Scripts

The system includes comprehensive automation for deployment and troubleshooting:

### Core Commands

```bash
# Complete automated deployment with fixes
make run

# Bootstrap environment
make bootstrap

# Deploy infrastructure only
make deploy-infra

# Deploy application services only
make deploy-services
```

### Troubleshooting & Maintenance

```bash
# Run comprehensive troubleshooting
make troubleshoot

# Fix database issues
make fix-databases

# Restart Authentik components
make restart-authentik

# Restart Unleash with fixes
make restart-unleash

# Verify all endpoints
make verify

# Check service health
make health

# View service status
make status
```

### Automated Fixes

The deployment automation handles:

- **Database Initialization**: Creates required databases (unleash, authentik)
- **Password Reset**: Fixes Authentik database authentication issues
- **Service Ordering**: Starts services in correct dependency order
- **Health Monitoring**: Waits for services to be healthy before proceeding
- **Network Setup**: Creates required Docker networks
- **Certificate Generation**: Generates self-signed TLS certificates
- **Host Configuration**: Sets up local domain resolution

## 📋 Services Overview

### Core Processing Pipeline

1. **svc-ingestion** (Port 8001) - Document upload and storage
2. **svc-rpa** (Port 8002) - Browser automation for portal data
3. **svc-ocr** (Port 8003) - OCR and layout extraction
4. **svc-extract** (Port 8004) - LLM-based field extraction
5. **svc-normalize-map** (Port 8005) - Data normalization and KG mapping
6. **svc-kg** (Port 8006) - Knowledge graph operations

### AI & Reasoning

7. **svc-rag-indexer** (Port 8007) - Vector database indexing
8. **svc-rag-retriever** (Port 8008) - Hybrid search with KG fusion
9. **svc-reason** (Port 8009) - Tax calculation engine
10. **svc-coverage** (Port 8013) - Document coverage policy evaluation

### Output & Integration

11. **svc-forms** (Port 8010) - PDF form filling
12. **svc-hmrc** (Port 8011) - HMRC submission service
13. **svc-firm-connectors** (Port 8012) - Practice management integration

## 🔧 Development

### Project Structure

```
ai-tax-agent/
├── libs/                      # Shared libraries
│   ├── config.py              # Configuration and factories
│   ├── security.py            # Authentication and encryption
│   ├── observability.py       # Tracing, metrics, logging
│   ├── events.py              # Event bus abstraction
│   ├── schemas.py             # Pydantic models
│   ├── storage.py             # MinIO/S3 operations
│   ├── neo.py                 # Neo4j operations
│   ├── rag.py                 # RAG and vector operations
│   ├── forms.py               # PDF form handling
│   ├── calibration.py         # ML confidence calibration
│   ├── policy.py              # Coverage policy loading and compilation
│   ├── coverage_models.py     # Coverage system data models
│   ├── coverage_eval.py       # Coverage evaluation engine
│   └── coverage_schema.json   # JSON schema for policy validation
├── apps/                      # Microservices
│   ├── svc-ingestion/         # Document ingestion service
│   ├── svc-rpa/               # RPA automation service
│   ├── svc-ocr/               # OCR processing service
│   ├── svc-extract/           # Field extraction service
│   ├── svc-normalize-map/     # Normalization service
│   ├── svc-kg/                # Knowledge graph service
│   ├── svc-rag-indexer/       # RAG indexing service
│   ├── svc-rag-retriever/     # RAG retrieval service
│   ├── svc-reason/            # Tax reasoning service
│   ├── svc-coverage/          # Document coverage policy service
│   ├── svc-forms/             # Form filling service
│   ├── svc-hmrc/              # HMRC integration service
│   └── svc-firm-connectors/   # Firm integration service
├── infra/                     # Infrastructure
│   ├── compose/               # Docker Compose files
│   └── k8s/                   # Kubernetes manifests
├── tests/                     # Test suites
│   ├── e2e/                   # End-to-end tests
│   └── unit/                  # Unit tests
├── config/                    # Configuration files
├── schemas/                   # Data schemas
├── db/                        # Database schemas
└── docs/                      # Documentation
```

### Running Tests

```bash
# Unit tests
make test-unit

# End-to-end tests
make test-e2e

# All tests
make test
```

### Development Workflow

```bash
# Start development environment
make dev

# Watch logs for specific service
make logs SERVICE=svc-extract

# Restart specific service
make restart SERVICE=svc-extract

# Run linting and formatting
make lint
make format

# Generate API documentation
make docs
```

## 🔐 Security & Authentication

### Edge Authentication

- **Traefik** reverse proxy with SSL termination
- **Authentik** SSO provider with OIDC/SAML support
- **ForwardAuth** middleware for service authentication
- **Zero-trust** architecture - services consume user context via headers

### Data Protection

- **Vault Transit** encryption for sensitive fields
- **PII Detection** and de-identification before vector indexing
- **Tenant Isolation** with row-level security
- **Audit Trails** with bitemporal data modeling

### Network Security

- **Internal Networks** for service communication
- **TLS Everywhere** with automatic certificate management
- **Rate Limiting** and DDoS protection
- **Security Headers** and CORS policies

## 📊 Observability

### Metrics & Monitoring

- **Prometheus** for metrics collection
- **Grafana** for visualization and alerting
- **Custom Business Metrics** for document processing, RAG, calculations
- **SLI/SLO Monitoring** with error budgets

### Tracing & Logging

- **OpenTelemetry** distributed tracing
- **Jaeger** trace visualization
- **Structured Logging** with correlation IDs
- **Log Aggregation** with ELK stack (optional)

### Health Checks

```bash
# Check all service health
make health

# Individual service health
curl http://localhost:8001/health
```

## 🗃️ Data Architecture

### Knowledge Graph (Neo4j)

- **Bitemporal Modeling** with valid_time and system_time
- **SHACL Validation** for data integrity
- **Tenant Isolation** with security constraints
- **Audit Trails** for all changes

### Vector Database (Qdrant)

- **PII-Free Indexing** with de-identification
- **Hybrid Search** combining dense and sparse vectors
- **Collection Management** per tenant and data type
- **Confidence Calibration** for search results

### Event Streaming (Kafka) - (TBD)

- **Event-Driven Architecture** with standardized topics
- **Exactly-Once Processing** with idempotency
- **Dead Letter Queues** for error handling
- **Schema Registry** for event validation

## 🧮 Tax Calculation Engine

### Supported Forms

- **SA100** - Main Self Assessment return
- **SA103** - Self-employment income
- **SA105** - Property income
- **SA106** - Foreign income

### Calculation Features

- **Rules Engine** with configurable tax rules
- **Evidence Trails** linking calculations to source documents
- **Confidence Scoring** with calibration
- **Multi-Year Support** with basis period reform

### HMRC Integration

- **MTD API** integration for submissions
- **OAuth 2.0** authentication flow
- **Dry Run** mode for testing
- **Validation** against HMRC business rules

## 🔌 Integrations

### Practice Management Systems

- **IRIS** Practice Management
- **Sage** Practice Management
- **Xero** accounting software
- **QuickBooks** accounting software
- **FreeAgent** accounting software
- **KashFlow** accounting software

### Document Sources

- **Direct Upload** via web interface
- **Email Integration** with attachment processing
- **Portal Scraping** via RPA automation
- **API Integration** with accounting systems

## 🚀 Deployment

### Local Development

```bash
make up      # Start all services
make down    # Stop all services
make clean   # Clean up volumes and networks
```

### Production Deployment

For detailed instructions, see [infra/compose/README.md](infra/compose/README.md).

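
Before a production deployment, it can help to confirm that the environment file defines every expected setting. The following is a minimal sketch and not part of the repository's tooling: `parse_env` and `missing_keys` are hypothetical helpers, and the list of required names is an assumption taken from the key environment variables documented under Environment Configuration.

```python
from typing import Dict, Iterable, List

# Assumption: these names mirror the key environment variables listed in
# this README; the set a given deployment really needs may differ.
REQUIRED_KEYS = (
    "DATABASE_URL",
    "NEO4J_URI",
    "QDRANT_URL",
    "OPENAI_API_KEY",
    "VAULT_ADDR",
    "KAFKA_BOOTSTRAP_SERVERS",
    "AUTHENTIK_SECRET_KEY",
    "VAULT_ROLE_ID",
    "VAULT_SECRET_ID",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty, in the order given."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


# Example with an inline file body; in practice the text would be read from
# the environment file used for the deployment.
example = """
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
OPENAI_API_KEY=
"""
print(missing_keys(example, ["DATABASE_URL", "NEO4J_URI", "OPENAI_API_KEY"]))
```

A check like this only confirms that values exist. It does not validate that a connection string or secret is correct.
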
The system uses a unified deployment script for production environments:

```bash
# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all
```

Ensure you have configured `infra/environments/production/.env` with the correct secrets and domain settings before deploying.

### Environment Configuration

Key environment variables:

```bash
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333

# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092

# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
```

## 📚 API Documentation

### Authentication

All API endpoints require authentication via Authentik ForwardAuth:

```bash
curl -H "X-Forwarded-User: user@example.com" \
     -H "X-Forwarded-Groups: tax_agents" \
     -H "X-Tenant-ID: tenant-123" \
     https://api.localhost/api/ingestion/health
```

### Key Endpoints

- `POST /api/ingestion/upload` - Upload documents
- `GET /api/extract/status/{doc_id}` - Check extraction status
- `POST /api/rag-retriever/search` - Search knowledge base
- `POST /api/reason/compute` - Trigger tax calculations
- `POST /api/forms/fill/{form_id}` - Fill PDF forms
- `POST /api/hmrc/submit` - Submit to HMRC

### Event Topics

- `DOC_INGESTED` - Document uploaded
- `DOC_OCR_READY` - OCR completed
- `DOC_EXTRACTED` - Fields extracted
- `KG_UPSERTED` - Knowledge graph updated
- `RAG_INDEXED` - Vector indexing completed
- `CALC_SCHEDULE_READY` - Tax calculation completed
- `FORM_FILLED` - PDF form filled
- `HMRC_SUBMITTED` - HMRC submission completed

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request

### Code Standards

- **Python**: Black formatting, isort imports, mypy type checking
- **Documentation**: Docstrings for all public functions
- **Testing**: Minimum 80% code coverage
- **Security**: No secrets in code, use Vault for sensitive data

## 📋 Coverage Policy System

The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.

### Policy Configuration

Coverage policies are defined in `config/coverage.yaml` with support for jurisdiction and tenant-specific overlays:

```yaml
# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"

defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30

triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"

schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]
```

### API Usage

#### Check Document Coverage

```bash
curl -X POST https://api.localhost/coverage/v1/check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK"
  }'
```

Response:

```json
{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
```

#### Generate Clarifying Questions

```bash
curl -X POST https://api.localhost/coverage/v1/clarify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK",
    "schedule_id": "SA102",
    "evidence_id": "P60"
  }'
```

### Policy Hot Reload

Policies can be reloaded without service restart:

```bash
curl -X POST https://api.localhost/coverage/admin/reload \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Predicate Language

The policy system supports a domain-specific language for conditions:

- `exists(Entity[filters])` - Check if entities exist with filters
- `property_name` - Check boolean properties
- `taxpayer_flag:flag_name` - Check taxpayer flags
- `filing_mode:mode` - Check filing mode
- `computed_condition` - Check computed values

### Status Classification

Evidence is classified into four statuses:

- **PRESENT_VERIFIED**: High confidence OCR/extract, date within tax year
- **PRESENT_UNVERIFIED**: Medium confidence, may need manual review
- **CONFLICTING**: Multiple documents with conflicting information
- **MISSING**: No evidence found or confidence too low

### Testing

Run coverage policy tests:

```bash
# Unit tests
pytest tests/unit/coverage/ -v

# Integration tests
pytest tests/integration/coverage/ -v

# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v

# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html
```

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

- **Documentation**: See `/docs` directory
- **Issues**: GitHub Issues
- **Discussions**: GitHub Discussions
- **Security**: security@example.com

## 🗺️ Roadmap

- [ ] Advanced ML models for extraction
- [ ] Multi-jurisdiction support (EU, US)
- [ ] Real-time collaboration features
- [ ] Mobile application
- [ ] Advanced analytics dashboard
- [ ] Blockchain audit trails