# AI Tax Agent - Production Microservices Suite

A comprehensive, production-grade AI-powered tax agent system for UK Self Assessment with microservices architecture, knowledge graphs, RAG capabilities, and HMRC integration.

## 🏗️ Architecture Overview

This system implements a complete end-to-end tax processing pipeline with:

- **13 Microservices** for document processing, extraction, reasoning, and submission
- **Knowledge Graph** (Neo4j) with bitemporal modeling for audit trails
- **Vector Database** (Qdrant) for RAG with PII protection
- **Edge Authentication** via Traefik + Authentik SSO
- **Event-Driven Architecture** with Kafka messaging
- **Comprehensive Observability** with OpenTelemetry, Prometheus, and Grafana

## 🚀 Quick Start

### Prerequisites

- Docker and Docker Compose
- Python 3.12+
- Node.js 18+ (for UI components)
- 16GB+ RAM recommended
- OpenAI API key (for LLM extraction)

### 1. Clone and Setup

```bash
git clone <repository-url>
cd ai-tax-agent-2

# Bootstrap the development environment
make bootstrap

# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY
```

### 2. Start Infrastructure (Automated)

```bash
# Start all services with automated fixes
make run

# Alternative: Start without fixes (original behavior)
make run-simple

# Or deploy infrastructure only
make deploy-infra
```

### 3. Complete Authentik Setup

After deployment, complete the SSO setup:

1. Visit https://auth.local.lan/if/flow/initial-setup/
2. Create the initial admin user
3. Configure applications for protected services

```bash
# Run setup helper (optional)
make setup-authentik
```

### 4. Access Services

- **Traefik Dashboard**: http://localhost:8080
- **Authentik SSO**: https://auth.local.lan
- **Grafana**: https://grafana.local.lan
- **Review UI**: https://review.local.lan (requires Authentik setup)
- **API Gateway**: https://api.local.lan

## 🤖 Automation & Scripts

The system includes comprehensive automation for deployment and troubleshooting:

### Core Commands

```bash
# Complete automated deployment with fixes
make run

# Bootstrap environment
make bootstrap

# Deploy infrastructure only
make deploy-infra

# Deploy application services only
make deploy-services
```

### Troubleshooting & Maintenance

```bash
# Run comprehensive troubleshooting
make troubleshoot

# Fix database issues
make fix-databases

# Restart Authentik components
make restart-authentik

# Restart Unleash with fixes
make restart-unleash

# Verify all endpoints
make verify

# Check service health
make health

# View service status
make status
```

### Automated Fixes

The deployment automation handles:

- **Database Initialization**: Creates required databases (unleash, authentik)
- **Password Reset**: Fixes Authentik database authentication issues
- **Service Ordering**: Starts services in correct dependency order
- **Health Monitoring**: Waits for services to be healthy before proceeding
- **Network Setup**: Creates required Docker networks
- **Certificate Generation**: Generates self-signed TLS certificates
- **Host Configuration**: Sets up local domain resolution

## 📋 Services Overview

### Core Processing Pipeline

1. **svc-ingestion** (Port 8001) - Document upload and storage
2. **svc-rpa** (Port 8002) - Browser automation for portal data
3. **svc-ocr** (Port 8003) - OCR and layout extraction
4. **svc-extract** (Port 8004) - LLM-based field extraction
5. **svc-normalize-map** (Port 8005) - Data normalization and KG mapping
6. **svc-kg** (Port 8006) - Knowledge graph operations

### AI & Reasoning

7. **svc-rag-indexer** (Port 8007) - Vector database indexing
8. **svc-rag-retriever** (Port 8008) - Hybrid search with KG fusion
9. **svc-reason** (Port 8009) - Tax calculation engine
10. **svc-coverage** (Port 8013) - Document coverage policy evaluation

### Output & Integration

11. **svc-forms** (Port 8010) - PDF form filling
12. **svc-hmrc** (Port 8011) - HMRC submission service
13. **svc-firm-connectors** (Port 8012) - Practice management integration

## 🔧 Development

### Project Structure

```
ai-tax-agent/
├── libs/                        # Shared libraries
│   ├── config.py                # Configuration and factories
│   ├── security.py              # Authentication and encryption
│   ├── observability.py         # Tracing, metrics, logging
│   ├── events.py                # Event bus abstraction
│   ├── schemas.py               # Pydantic models
│   ├── storage.py               # MinIO/S3 operations
│   ├── neo.py                   # Neo4j operations
│   ├── rag.py                   # RAG and vector operations
│   ├── forms.py                 # PDF form handling
│   ├── calibration.py           # ML confidence calibration
│   ├── policy.py                # Coverage policy loading and compilation
│   ├── coverage_models.py       # Coverage system data models
│   ├── coverage_eval.py         # Coverage evaluation engine
│   └── coverage_schema.json     # JSON schema for policy validation
├── apps/                        # Microservices
│   ├── svc-ingestion/           # Document ingestion service
│   ├── svc-rpa/                 # RPA automation service
│   ├── svc-ocr/                 # OCR processing service
│   ├── svc-extract/             # Field extraction service
│   ├── svc-normalize-map/       # Normalization service
│   ├── svc-kg/                  # Knowledge graph service
│   ├── svc-rag-indexer/         # RAG indexing service
│   ├── svc-rag-retriever/       # RAG retrieval service
│   ├── svc-reason/              # Tax reasoning service
│   ├── svc-coverage/            # Document coverage policy service
│   ├── svc-forms/               # Form filling service
│   ├── svc-hmrc/                # HMRC integration service
│   └── svc-firm-connectors/     # Firm integration service
├── infra/                       # Infrastructure
│   ├── compose/                 # Docker Compose files
│   └── k8s/                     # Kubernetes manifests
├── tests/                       # Test suites
│   ├── e2e/                     # End-to-end tests
│   └── unit/                    # Unit tests
├── config/                      # Configuration files
├── schemas/                     # Data schemas
├── db/                          # Database schemas
└── docs/                        # Documentation
```

### Running Tests

```bash
# Unit tests
make test-unit

# End-to-end tests
make test-e2e

# All tests
make test
```

### Development Workflow

```bash
# Start development environment
make dev

# Watch logs for specific service
make logs SERVICE=svc-extract

# Restart specific service
make restart SERVICE=svc-extract

# Run linting and formatting
make lint
make format

# Generate API documentation
make docs
```

## 🔐 Security & Authentication

### Edge Authentication

- **Traefik** reverse proxy with SSL termination
- **Authentik** SSO provider with OIDC/SAML support
- **ForwardAuth** middleware for service authentication
- **Zero-trust** architecture - services consume user context via headers

### Data Protection

- **Vault Transit** encryption for sensitive fields
- **PII Detection** and de-identification before vector indexing
- **Tenant Isolation** with row-level security
- **Audit Trails** with bitemporal data modeling
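
As a rough illustration of de-identification before vector indexing, the sketch below masks two kinds of identifier with regular expressions. The patterns, the placeholder tokens, and the `redact_pii` name are assumptions made for this example; the project's actual detector is not shown in this README and is likely to cover far more cases.

```python
import re

# Hypothetical patterns, for illustration only.
# UK National Insurance number shape, e.g. "QQ123456C", with optional spaces.
NINO_RE = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_pii(text: str) -> str:
    """Replace matched identifiers with fixed placeholder tokens."""
    text = NINO_RE.sub("[NINO]", text)
    return EMAIL_RE.sub("[EMAIL]", text)


if __name__ == "__main__":
    print(redact_pii("NI QQ123456C, mail jo@example.com"))
```

Only the redacted text would then be embedded and written to the vector store, so the original identifiers never reach the index.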

### Network Security

- **Internal Networks** for service communication
- **TLS Everywhere** with automatic certificate management
- **Rate Limiting** and DDoS protection
- **Security Headers** and CORS policies

## 📊 Observability

### Metrics & Monitoring

- **Prometheus** for metrics collection
- **Grafana** for visualization and alerting
- **Custom Business Metrics** for document processing, RAG, calculations
- **SLI/SLO Monitoring** with error budgets

### Tracing & Logging

- **OpenTelemetry** distributed tracing
- **Jaeger** trace visualization
- **Structured Logging** with correlation IDs
- **Log Aggregation** with ELK stack (optional)

### Health Checks

```bash
# Check all service health
make health

# Individual service health
curl http://localhost:8001/health
```
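
For scripting outside of `make`, the per-service check can be looped over every port listed in the Services Overview. The sketch below is not what `make health` runs (that target is not shown here); it assumes each service exposes the same `/health` path as the `curl` example and that an HTTP 200 means healthy.

```python
import urllib.error
import urllib.request

# Ports as listed in the Services Overview section.
SERVICE_PORTS = {
    "svc-ingestion": 8001, "svc-rpa": 8002, "svc-ocr": 8003,
    "svc-extract": 8004, "svc-normalize-map": 8005, "svc-kg": 8006,
    "svc-rag-indexer": 8007, "svc-rag-retriever": 8008, "svc-reason": 8009,
    "svc-forms": 8010, "svc-hmrc": 8011, "svc-firm-connectors": 8012,
    "svc-coverage": 8013,
}


def health_url(service: str, host: str = "localhost") -> str:
    """Build the health URL for a service (path assumed to be /health)."""
    return f"http://{host}:{SERVICE_PORTS[service]}/health"


def is_healthy(service: str, timeout: float = 2.0) -> bool:
    """Return True on HTTP 200; any connection or HTTP error counts as down."""
    try:
        with urllib.request.urlopen(health_url(service), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for name in SERVICE_PORTS:
        print(f"{name:22} {'ok' if is_healthy(name) else 'DOWN'}")
```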

## 🗃️ Data Architecture

### Knowledge Graph (Neo4j)

- **Bitemporal Modeling** with valid_time and system_time
- **SHACL Validation** for data integrity
- **Tenant Isolation** with security constraints
- **Audit Trails** for all changes
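
To make the bitemporal idea concrete, here is a minimal in-memory sketch, independent of Neo4j. It assumes each fact version carries a valid-time interval (when the fact holds in the real world) and a system-time interval (when the system held that version); the `FactVersion` type and its field names are hypothetical and do not reflect the project's graph schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    value: float
    valid_from: datetime             # valid time: when the fact is true in the world
    valid_to: Optional[datetime]     # None means still valid
    system_from: datetime            # system time: when this version was recorded
    system_to: Optional[datetime]    # None means this is the current version


def _covers(start: datetime, end: Optional[datetime], at: datetime) -> bool:
    """Half-open interval check: start <= at < end (open-ended if end is None)."""
    return start <= at and (end is None or at < end)


def as_of(versions, valid_at: datetime, known_at: datetime):
    """Return the version true at `valid_at` as the system knew it at `known_at`."""
    for v in versions:
        if _covers(v.valid_from, v.valid_to, valid_at) and _covers(
            v.system_from, v.system_to, known_at
        ):
            return v
    return None


# A salary recorded on 1 May 2024 and corrected on 1 Sep 2024.
history = [
    FactVersion(30000.0, datetime(2024, 4, 6), None,
                datetime(2024, 5, 1), datetime(2024, 9, 1)),
    FactVersion(31000.0, datetime(2024, 4, 6), None,
                datetime(2024, 9, 1), None),
]
```

Querying `as_of(history, datetime(2024, 6, 1), known_at=...)` with a `known_at` before and after the correction returns the old and the new value respectively, which is the property an audit trail relies on: the superseded version is closed off, never overwritten.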

### Vector Database (Qdrant)

- **PII-Free Indexing** with de-identification
- **Hybrid Search** combining dense and sparse vectors
- **Collection Management** per tenant and data type
- **Confidence Calibration** for search results
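
One common way to combine a dense and a sparse result list is reciprocal rank fusion (RRF). This README does not say which fusion method `svc-rag-retriever` uses, so the sketch below is an assumption chosen for illustration; the document IDs and the constant `k=60` are likewise illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["doc-a", "doc-b", "doc-c"]    # e.g. ranked by embedding similarity
sparse = ["doc-b", "doc-d", "doc-a"]   # e.g. ranked by keyword match
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only needs ranks, not comparable scores, which is why it is often used when the two retrievers score on different scales.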

### Event Streaming (Kafka) - (TBD)

- **Event-Driven Architecture** with standardized topics
- **Exactly-Once Processing** with idempotency
- **Dead Letter Queues** for error handling
- **Schema Registry** for event validation
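
Since this part of the design is still marked TBD, the following is only a sketch of how idempotency can give exactly-once effects on top of at-least-once delivery: each event carries a unique ID, and a handler skips IDs it has already applied. The `event_id` field name and the in-memory set are assumptions; no Kafka client is involved.

```python
from typing import Callable


class IdempotentHandler:
    """Wrap a handler so redelivered events (same event_id) are applied once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()    # stand-in for a durable store

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]    # hypothetical field name
        if event_id in self._seen:
            return False                # duplicate delivery, skipped
        self._handler(event)
        self._seen.add(event_id)        # mark only after the handler succeeds
        return True


processed: list[dict] = []
consumer = IdempotentHandler(processed.append)
event = {"event_id": "e1", "topic": "DOC_INGESTED"}
first = consumer.handle(event)     # applied
second = consumer.handle(event)    # redelivery, ignored
```

In a real deployment the seen-ID record would need to be persisted together with the handler's side effect (for example in the same database transaction), otherwise a crash between the two steps can still cause a duplicate.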

## 🧮 Tax Calculation Engine

### Supported Forms

- **SA100** - Main Self Assessment return
- **SA103** - Self-employment income
- **SA105** - Property income
- **SA106** - Foreign income

### Calculation Features

- **Rules Engine** with configurable tax rules
- **Evidence Trails** linking calculations to source documents
- **Confidence Scoring** with calibration
- **Multi-Year Support** with basis period reform

### HMRC Integration

- **MTD API** integration for submissions
- **OAuth 2.0** authentication flow
- **Dry Run** mode for testing
- **Validation** against HMRC business rules

## 🔌 Integrations

### Practice Management Systems

- **IRIS** Practice Management
- **Sage** Practice Management
- **Xero** accounting software
- **QuickBooks** accounting software
- **FreeAgent** accounting software
- **KashFlow** accounting software

### Document Sources

- **Direct Upload** via web interface
- **Email Integration** with attachment processing
- **Portal Scraping** via RPA automation
- **API Integration** with accounting systems

## 🚀 Deployment

### Local Development

```bash
make up      # Start all services
make down    # Stop all services
make clean   # Clean up volumes and networks
```

### Production Deployment

For detailed instructions, see [infra/compose/README.md](infra/compose/README.md).

The system uses a unified deployment script for production environments:

```bash
# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all
```

Ensure you have configured `infra/environments/production/.env` with the correct secrets and domain settings before deploying.

### Environment Configuration

Key environment variables:

```bash
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333

# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092

# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
```

## 📚 API Documentation

### Authentication

All API endpoints require authentication via Authentik ForwardAuth:

```bash
curl -H "X-Forwarded-User: user@example.com" \
  -H "X-Forwarded-Groups: tax_agents" \
  -H "X-Tenant-ID: tenant-123" \
  https://api.localhost/api/ingestion/health
```
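
On the service side, the headers injected by the proxy can be turned into a small user-context object. The sketch below uses the three header names from the `curl` example; the `UserContext` type, the function name, and the assumption that multiple groups arrive comma-separated are illustrative and not taken from the project's `libs/security.py`.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple[str, ...]
    tenant_id: str


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Build a UserContext; raise PermissionError if user or tenant is missing."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        user = lowered["x-forwarded-user"]
        tenant = lowered["x-tenant-id"]
    except KeyError as exc:
        raise PermissionError(f"missing auth header: {exc.args[0]}") from exc
    # Assumption: multiple groups arrive as a comma-separated list.
    raw_groups = lowered.get("x-forwarded-groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant)


ctx = user_context_from_headers({
    "X-Forwarded-User": "user@example.com",
    "X-Forwarded-Groups": "tax_agents",
    "X-Tenant-ID": "tenant-123",
})
```

Headers like these can be trusted only when services are reachable solely through the proxy; a client that can reach a service directly could set them itself.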

### Key Endpoints

- `POST /api/ingestion/upload` - Upload documents
- `GET /api/extract/status/{doc_id}` - Check extraction status
- `POST /api/rag-retriever/search` - Search knowledge base
- `POST /api/reason/compute` - Trigger tax calculations
- `POST /api/forms/fill/{form_id}` - Fill PDF forms
- `POST /api/hmrc/submit` - Submit to HMRC

### Event Topics

- `DOC_INGESTED` - Document uploaded
- `DOC_OCR_READY` - OCR completed
- `DOC_EXTRACTED` - Fields extracted
- `KG_UPSERTED` - Knowledge graph updated
- `RAG_INDEXED` - Vector indexing completed
- `CALC_SCHEDULE_READY` - Tax calculation completed
- `FORM_FILLED` - PDF form filled
- `HMRC_SUBMITTED` - HMRC submission completed

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request

### Code Standards

- **Python**: Black formatting, isort imports, mypy type checking
- **Documentation**: Docstrings for all public functions
- **Testing**: Minimum 80% code coverage
- **Security**: No secrets in code, use Vault for sensitive data

## 📋 Coverage Policy System

The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.

### Policy Configuration

Coverage policies are defined in `config/coverage.yaml` with support for jurisdiction and tenant-specific overlays:

```yaml
# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"

defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30

triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"

schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]
```

### API Usage

#### Check Document Coverage

```bash
curl -X POST https://api.localhost/coverage/v1/check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK"
  }'
```

Response:

```json
{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
```
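
A client will usually care most about `blocking_items`. The sketch below turns a response of the shape shown above into one human-readable line per blocking item; the `blocking_summary` helper is hypothetical, and the sample payload is trimmed to the fields it reads.

```python
import json


def blocking_summary(response: dict) -> list[str]:
    """Return one line per blocking item, listing any acceptable alternatives."""
    lines = []
    for item in response.get("blocking_items", []):
        alts = ", ".join(item.get("acceptable_alternatives", [])) or "none"
        lines.append(
            f"{item['schedule_id']}: {item['evidence_id']} is {item['role']}; "
            f"alternatives: {alts}"
        )
    return lines


# Trimmed copy of the sample response above.
sample = json.loads("""
{
  "overall_status": "INCOMPLETE",
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
""")

if __name__ == "__main__":
    for line in blocking_summary(sample):
        print(line)
```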

#### Generate Clarifying Questions

```bash
curl -X POST https://api.localhost/coverage/v1/clarify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK",
    "schedule_id": "SA102",
    "evidence_id": "P60"
  }'
```

### Policy Hot Reload

Policies can be reloaded without service restart:

```bash
curl -X POST https://api.localhost/coverage/admin/reload \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Predicate Language

The policy system supports a domain-specific language for conditions:

- `exists(Entity[filters])` - Check if entities exist with filters
- `property_name` - Check boolean properties
- `taxpayer_flag:flag_name` - Check taxpayer flags
- `filing_mode:mode` - Check filing mode
- `computed_condition` - Check computed values
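
As an illustration of the first form, the sketch below parses `exists(Entity[key='value'])` and evaluates it against a list of plain dictionaries. The grammar is inferred from the trigger strings in the sample policy, not from the real compiler in `libs/policy.py`; the `label` key on entities is an assumption, and other shapes such as `exists(BenefitInKind=true)` are deliberately rejected rather than guessed at.

```python
import re

# Matches exists(Entity[...]) and captures the entity name and raw filter text.
EXISTS_RE = re.compile(r"^exists\((\w+)\[(.*)\]\)$")
# Matches key='value' pairs inside the brackets.
FILTER_RE = re.compile(r"(\w+)\s*=\s*'([^']*)'")


def parse_exists(predicate: str) -> tuple[str, dict[str, str]]:
    """Split a predicate into (entity label, filter dict) or raise ValueError."""
    match = EXISTS_RE.match(predicate.strip())
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate!r}")
    entity, raw_filters = match.groups()
    return entity, dict(FILTER_RE.findall(raw_filters))


def evaluate_exists(predicate: str, entities: list[dict]) -> bool:
    """True if any entity has the named label and matches every filter."""
    label, filters = parse_exists(predicate)
    return any(
        e.get("label") == label and all(e.get(k) == v for k, v in filters.items())
        for e in entities
    )


facts = [{"label": "IncomeItem", "type": "Employment"}]
needs_sa102 = evaluate_exists("exists(IncomeItem[type='Employment'])", facts)
needs_sa105 = evaluate_exists("exists(IncomeItem[type='UKPropertyRent'])", facts)
```

With these facts only the SA102 trigger from the sample policy fires, so only the employment schedule would be required.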

### Status Classification

Evidence is classified into four statuses:

- **PRESENT_VERIFIED**: High confidence OCR/extract, date within tax year
- **PRESENT_UNVERIFIED**: Medium confidence, may need manual review
- **CONFLICTING**: Multiple documents with conflicting information
- **MISSING**: No evidence found or confidence too low
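
One possible reading of these rules is sketched below. The OCR and extraction thresholds and the tax-year dates come from the sample policy; everything else is an assumption: the `REVIEW_FLOOR` cut-off for "medium" confidence, the `FoundDoc` type, and the use of a single `amount` field to detect conflicts. The sketch also ignores `date_tolerance_days`, so it should not be taken as the behaviour of `libs/coverage_eval.py`.

```python
from dataclasses import dataclass
from datetime import date

OCR_MIN, EXTRACT_MIN = 0.82, 0.85               # from the sample policy defaults
REVIEW_FLOOR = 0.50                             # hypothetical "medium" cut-off
TAX_YEAR = (date(2024, 4, 6), date(2025, 4, 5))  # from tax_year_boundary


@dataclass(frozen=True)
class FoundDoc:
    ocr_conf: float
    extract_conf: float
    doc_date: date
    amount: float    # hypothetical extracted value used to detect conflicts


def classify(docs: list[FoundDoc]) -> str:
    """Map the documents found for one evidence item to a status string."""
    usable = [d for d in docs if min(d.ocr_conf, d.extract_conf) >= REVIEW_FLOOR]
    if not usable:
        return "MISSING"
    if len({d.amount for d in usable}) > 1:
        return "CONFLICTING"
    verified = any(
        d.ocr_conf >= OCR_MIN
        and d.extract_conf >= EXTRACT_MIN
        and TAX_YEAR[0] <= d.doc_date <= TAX_YEAR[1]
        for d in usable
    )
    return "PRESENT_VERIFIED" if verified else "PRESENT_UNVERIFIED"
```

For example, a single document with confidences of 0.90 and a date inside the tax year classifies as `PRESENT_VERIFIED`, while the same document with an OCR confidence of 0.70 drops to `PRESENT_UNVERIFIED`.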

### Testing

Run coverage policy tests:

```bash
# Unit tests
pytest tests/unit/coverage/ -v

# Integration tests
pytest tests/integration/coverage/ -v

# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v

# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html
```

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

- **Documentation**: See `/docs` directory
- **Issues**: GitHub Issues
- **Discussions**: GitHub Discussions
- **Security**: security@example.com

## 🗺️ Roadmap

- [ ] Advanced ML models for extraction
- [ ] Multi-jurisdiction support (EU, US)
- [ ] Real-time collaboration features
- [ ] Mobile application
- [ ] Advanced analytics dashboard
- [ ] Blockchain audit trails