ai-tax-agent/README.md
harkon 7e54ee9099
feat: working infra with sso
2025-12-04 12:49:43 +02:00


# AI Tax Agent - Production Microservices Suite
A comprehensive, production-grade AI-powered tax agent system for UK Self Assessment with microservices architecture, knowledge graphs, RAG capabilities, and HMRC integration.
## 🏗️ Architecture Overview
This system implements a complete end-to-end tax processing pipeline with:
- **13 Microservices** for document processing, extraction, reasoning, and submission
- **Knowledge Graph** (Neo4j) with bitemporal modeling for audit trails
- **Vector Database** (Qdrant) for RAG with PII protection
- **Edge Authentication** via Traefik + Authentik SSO
- **Event-Driven Architecture** with Kafka messaging
- **Comprehensive Observability** with OpenTelemetry, Prometheus, and Grafana
## 🚀 Quick Start
### Prerequisites
- Docker and Docker Compose
- Python 3.12+
- Node.js 18+ (for UI components)
- 16GB+ RAM recommended
- OpenAI API key (for LLM extraction)
### 1. Clone and Setup
```bash
git clone <repository-url>
cd ai-tax-agent
# Bootstrap the development environment
make bootstrap
# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY
```
### 2. Start Infrastructure (Automated)
```bash
# Start all services with automated fixes
make run
# Alternative: Start without fixes (original behavior)
make run-simple
# Or deploy infrastructure only
make deploy-infra
```
### 3. Complete Authentik Setup
After deployment, complete the SSO setup:
1. Visit https://auth.local.lan/if/flow/initial-setup/
2. Create the initial admin user
3. Configure applications for protected services
```bash
# Run setup helper (optional)
make setup-authentik
```
### 4. Access Services
- **Traefik Dashboard**: http://localhost:8080
- **Authentik SSO**: https://auth.local.lan
- **Grafana**: https://grafana.local.lan
- **Review UI**: https://review.local.lan (requires Authentik setup)
- **API Gateway**: https://api.local.lan
## 🤖 Automation & Scripts
The system includes comprehensive automation for deployment and troubleshooting:
### Core Commands
```bash
# Complete automated deployment with fixes
make run
# Bootstrap environment
make bootstrap
# Deploy infrastructure only
make deploy-infra
# Deploy application services only
make deploy-services
```
### Troubleshooting & Maintenance
```bash
# Run comprehensive troubleshooting
make troubleshoot
# Fix database issues
make fix-databases
# Restart Authentik components
make restart-authentik
# Restart Unleash with fixes
make restart-unleash
# Verify all endpoints
make verify
# Check service health
make health
# View service status
make status
```
### Automated Fixes
The deployment automation handles:
- **Database Initialization**: Creates required databases (unleash, authentik)
- **Password Reset**: Fixes Authentik database authentication issues
- **Service Ordering**: Starts services in correct dependency order
- **Health Monitoring**: Waits for services to be healthy before proceeding
- **Network Setup**: Creates required Docker networks
- **Certificate Generation**: Generates self-signed TLS certificates
- **Host Configuration**: Sets up local domain resolution
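The service-ordering step can be illustrated with a topological sort. This is a minimal sketch, not the project's deployment logic: the `DEPENDS_ON` map and the `start_order` helper are hypothetical, and the real dependencies are defined by the Makefile and Compose files.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service starts only after the
# services in its set. The real ordering is defined elsewhere in the repo.
DEPENDS_ON: dict[str, set[str]] = {
    "postgres": set(),
    "traefik": set(),
    "authentik": {"postgres"},
    "unleash": {"postgres"},
    "svc-ingestion": {"traefik", "authentik"},
}


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a start order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(start_order(DEPENDS_ON))
```

`graphlib` raises `CycleError` on circular dependencies, which is a useful failure mode for a deployment script.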
## 📋 Services Overview
### Core Processing Pipeline
1. **svc-ingestion** (Port 8001) - Document upload and storage
2. **svc-rpa** (Port 8002) - Browser automation for portal data
3. **svc-ocr** (Port 8003) - OCR and layout extraction
4. **svc-extract** (Port 8004) - LLM-based field extraction
5. **svc-normalize-map** (Port 8005) - Data normalization and KG mapping
6. **svc-kg** (Port 8006) - Knowledge graph operations
### AI & Reasoning
7. **svc-rag-indexer** (Port 8007) - Vector database indexing
8. **svc-rag-retriever** (Port 8008) - Hybrid search with KG fusion
9. **svc-reason** (Port 8009) - Tax calculation engine
10. **svc-coverage** (Port 8013) - Document coverage policy evaluation
### Output & Integration
11. **svc-forms** (Port 8010) - PDF form filling
12. **svc-hmrc** (Port 8011) - HMRC submission service
13. **svc-firm-connectors** (Port 8012) - Practice management integration
## 🔧 Development
### Project Structure
```
ai-tax-agent/
├── libs/ # Shared libraries
│ ├── config.py # Configuration and factories
│ ├── security.py # Authentication and encryption
│ ├── observability.py # Tracing, metrics, logging
│ ├── events.py # Event bus abstraction
│ ├── schemas.py # Pydantic models
│ ├── storage.py # MinIO/S3 operations
│ ├── neo.py # Neo4j operations
│ ├── rag.py # RAG and vector operations
│ ├── forms.py # PDF form handling
│ ├── calibration.py # ML confidence calibration
│ ├── policy.py # Coverage policy loading and compilation
│ ├── coverage_models.py # Coverage system data models
│ ├── coverage_eval.py # Coverage evaluation engine
│ └── coverage_schema.json # JSON schema for policy validation
├── apps/ # Microservices
│ ├── svc-ingestion/ # Document ingestion service
│ ├── svc-rpa/ # RPA automation service
│ ├── svc-ocr/ # OCR processing service
│ ├── svc-extract/ # Field extraction service
│ ├── svc-normalize-map/ # Normalization service
│ ├── svc-kg/ # Knowledge graph service
│ ├── svc-rag-indexer/ # RAG indexing service
│ ├── svc-rag-retriever/ # RAG retrieval service
│ ├── svc-reason/ # Tax reasoning service
│ ├── svc-coverage/ # Document coverage policy service
│ ├── svc-forms/ # Form filling service
│ ├── svc-hmrc/ # HMRC integration service
│ └── svc-firm-connectors/ # Firm integration service
├── infra/ # Infrastructure
│ ├── compose/ # Docker Compose files
│ └── k8s/ # Kubernetes manifests
├── tests/ # Test suites
│ ├── e2e/ # End-to-end tests
│ └── unit/ # Unit tests
├── config/ # Configuration files
├── schemas/ # Data schemas
├── db/ # Database schemas
└── docs/ # Documentation
```
### Running Tests
```bash
# Unit tests
make test-unit
# End-to-end tests
make test-e2e
# All tests
make test
```
### Development Workflow
```bash
# Start development environment
make dev
# Watch logs for specific service
make logs SERVICE=svc-extract
# Restart specific service
make restart SERVICE=svc-extract
# Run linting and formatting
make lint
make format
# Generate API documentation
make docs
```
## 🔐 Security & Authentication
### Edge Authentication
- **Traefik** reverse proxy with SSL termination
- **Authentik** SSO provider with OIDC/SAML support
- **ForwardAuth** middleware for service authentication
- **Zero-trust** architecture - services consume user context via headers
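As a rough illustration of how a service might consume user context from headers, the sketch below parses the header names used in the API authentication example of this README. The `UserContext` type, the helper name, and the comma-separated group format are assumptions, not the code in `libs/security.py`.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple[str, ...]
    tenant_id: str


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Build a user context from ForwardAuth headers; raise if one is missing."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        user = lowered["x-forwarded-user"]
        tenant_id = lowered["x-tenant-id"]
    except KeyError as exc:
        raise PermissionError(f"missing auth header: {exc.args[0]}") from exc
    # Assumption: groups arrive as a comma-separated list.
    raw_groups = lowered.get("x-forwarded-groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant_id)
```

Trusting these headers is only safe when services are reachable exclusively through the proxy, which is why the internal networks matter.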
### Data Protection
- **Vault Transit** encryption for sensitive fields
- **PII Detection** and de-identification before vector indexing
- **Tenant Isolation** with row-level security
- **Audit Trails** with bitemporal data modeling
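De-identification before vector indexing can be sketched with simple pattern replacement. The patterns below are illustrative assumptions only (an email address and a simplified National Insurance number shape); the project's actual PII detector is not documented here and would need much broader coverage.

```python
import re

# Illustrative patterns only; a real detector needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Simplified NI number shape: two letters, six digits, suffix A-D.
    "NINO": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(deidentify("Contact jane@example.com, NI number QQ123456C"))
```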
### Network Security
- **Internal Networks** for service communication
- **TLS Everywhere** with automatic certificate management
- **Rate Limiting** and DDoS protection
- **Security Headers** and CORS policies
## 📊 Observability
### Metrics & Monitoring
- **Prometheus** for metrics collection
- **Grafana** for visualization and alerting
- **Custom Business Metrics** for document processing, RAG, calculations
- **SLI/SLO Monitoring** with error budgets
### Tracing & Logging
- **OpenTelemetry** distributed tracing
- **Jaeger** trace visualization
- **Structured Logging** with correlation IDs
- **Log Aggregation** with ELK stack (optional)
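Structured logging with correlation IDs can be approximated with the standard library alone. This is a sketch under assumptions: the JSON field names and the `build_logger` helper are hypothetical and may differ from what `libs/observability.py` provides.

```python
import contextvars
import json
import logging

# Correlation ID for the current request or event; "-" when none is set.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object including the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Setting `correlation_id.set("req-42")` at the start of a request makes every later log line in that context carry the same ID.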
### Health Checks
```bash
# Check all service health
make health
# Individual service health
curl http://localhost:8001/health
```
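The same check can be scripted across all services. The sketch below uses the ports from the Services Overview and assumes every service exposes `/health` on localhost like `svc-ingestion` does, which this README does not confirm. The `check_all` helper is hypothetical.

```python
from typing import Callable
from urllib.request import urlopen

# Ports as listed in the Services Overview section.
SERVICE_PORTS = {
    "svc-ingestion": 8001, "svc-rpa": 8002, "svc-ocr": 8003,
    "svc-extract": 8004, "svc-normalize-map": 8005, "svc-kg": 8006,
    "svc-rag-indexer": 8007, "svc-rag-retriever": 8008, "svc-reason": 8009,
    "svc-forms": 8010, "svc-hmrc": 8011, "svc-firm-connectors": 8012,
    "svc-coverage": 8013,
}


def http_status(url: str, timeout: float = 2.0) -> int:
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_all(fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each service name to True when its health endpoint returns 200."""
    results: dict[str, bool] = {}
    for name, port in SERVICE_PORTS.items():
        try:
            results[name] = fetch(f"http://localhost:{port}/health") == 200
        except OSError:  # URLError, HTTPError and timeouts derive from OSError
            results[name] = False
    return results


if __name__ == "__main__":
    for service, healthy in check_all().items():
        print(f"{service}: {'ok' if healthy else 'DOWN'}")
```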
## 🗃️ Data Architecture
### Knowledge Graph (Neo4j)
- **Bitemporal Modeling** with valid_time and system_time
- **SHACL Validation** for data integrity
- **Tenant Isolation** with security constraints
- **Audit Trails** for all changes
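To make the bitemporal idea concrete, here is a small in-memory sketch of an "as of" lookup across both time axes. The field names (`valid_from`, `valid_to`, `recorded_at`, `retracted_at`) are assumptions for illustration; the actual Neo4j schema may name and store these differently.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    value: float
    valid_from: date                          # valid_time start (real world)
    valid_to: Optional[date]                  # None means "still valid"
    recorded_at: datetime                     # system_time start (stored)
    retracted_at: Optional[datetime] = None   # None means "current version"


def as_of(facts: list[Fact], valid_on: date, known_at: datetime) -> Optional[Fact]:
    """Return the version valid on `valid_on` as the system knew it at `known_at`."""
    matches = [
        f for f in facts
        if f.valid_from <= valid_on
        and (f.valid_to is None or valid_on < f.valid_to)
        and f.recorded_at <= known_at
        and (f.retracted_at is None or known_at < f.retracted_at)
    ]
    return max(matches, key=lambda f: f.recorded_at, default=None)
```

Corrections never overwrite a fact: the old version is retracted and a new one recorded, so an auditor can ask what the system believed on any past date.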
### Vector Database (Qdrant)
- **PII-Free Indexing** with de-identification
- **Hybrid Search** combining dense and sparse vectors
- **Collection Management** per tenant and data type
- **Confidence Calibration** for search results
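One common way to combine dense and sparse result lists is reciprocal rank fusion (RRF). This README does not state which fusion method `svc-rag-retriever` uses, so treat the following as a generic sketch of the technique rather than the service's implementation; `k = 60` is a conventional default, not a project setting.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["a", "b", "c"]    # hypothetical dense-vector ranking
    sparse = ["b", "d", "a"]   # hypothetical sparse-vector ranking
    print(reciprocal_rank_fusion([dense, sparse]))
```

Documents that appear in both lists rise to the top even when neither list ranks them first.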
### Event Streaming (Kafka) - (TBD)
- **Event-Driven Architecture** with standardized topics
- **Exactly-Once Processing** with idempotency
- **Dead Letter Queues** for error handling
- **Schema Registry** for event validation
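Since the Kafka integration is still marked TBD, the following is only a broker-agnostic sketch of idempotent processing with a dead letter queue. The `event_id` field, the in-memory `seen` set, and the retry count are assumptions; a real consumer would persist processed IDs durably.

```python
from typing import Callable


class IdempotentConsumer:
    """Run a handler at most once per event ID; park repeated failures in a DLQ."""

    def __init__(self, handler: Callable[[dict], None], max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.seen: set[str] = set()          # processed IDs (durable store in practice)
        self.dead_letters: list[dict] = []   # events that exhausted their retries

    def consume(self, event: dict) -> bool:
        """Return True only when the handler ran successfully for this delivery."""
        event_id = event["event_id"]
        if event_id in self.seen:
            return False  # duplicate delivery, skip
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                continue  # retry until attempts are exhausted
            self.seen.add(event_id)
            return True
        self.dead_letters.append(event)
        return False
```

Deduplicating on the event ID is what turns at-least-once delivery into effectively-once processing.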
## 🧮 Tax Calculation Engine
### Supported Forms
- **SA100** - Main Self Assessment return
- **SA103** - Self-employment income
- **SA105** - Property income
- **SA106** - Foreign income
### Calculation Features
- **Rules Engine** with configurable tax rules
- **Evidence Trails** linking calculations to source documents
- **Confidence Scoring** with calibration
- **Multi-Year Support** with basis period reform
### HMRC Integration
- **MTD API** integration for submissions
- **OAuth 2.0** authentication flow
- **Dry Run** mode for testing
- **Validation** against HMRC business rules
## 🔌 Integrations
### Practice Management Systems
- **IRIS** Practice Management
- **Sage** Practice Management
- **Xero** accounting software
- **QuickBooks** accounting software
- **FreeAgent** accounting software
- **KashFlow** accounting software
### Document Sources
- **Direct Upload** via web interface
- **Email Integration** with attachment processing
- **Portal Scraping** via RPA automation
- **API Integration** with accounting systems
## 🚀 Deployment
### Local Development
```bash
make up # Start all services
make down # Stop all services
make clean # Clean up volumes and networks
```
### Production Deployment
For detailed instructions, see [infra/compose/README.md](infra/compose/README.md).
The system uses a unified deployment script for production environments:
```bash
# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all
```
Ensure you have configured `infra/environments/production/.env` with the correct secrets and domain settings before deploying.
### Environment Configuration
Key environment variables:
```bash
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333
# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
```
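A service could read these variables at startup roughly as follows. This is a hypothetical sketch, not the contents of `libs/config.py`: the `Settings` type is invented for illustration, the fallback values simply mirror the examples above, and only `OPENAI_API_KEY` is treated as mandatory because the Quick Start names it as the minimum requirement.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    neo4j_uri: str
    qdrant_url: str
    openai_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, failing fast on the required key."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        # Fallbacks mirror the example values above; they are not real defaults.
        database_url=env.get("DATABASE_URL", "postgresql+asyncpg://user:pass@host:5432/db"),
        neo4j_uri=env.get("NEO4J_URI", "bolt://neo4j:7687"),
        qdrant_url=env.get("QDRANT_URL", "http://qdrant:6333"),
        openai_api_key=api_key,
    )
```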
## 📚 API Documentation
### Authentication
All API endpoints require authentication via Authentik ForwardAuth:
```bash
curl -H "X-Forwarded-User: user@example.com" \
-H "X-Forwarded-Groups: tax_agents" \
-H "X-Tenant-ID: tenant-123" \
https://api.localhost/api/ingestion/health
```
### Key Endpoints
- `POST /api/ingestion/upload` - Upload documents
- `GET /api/extract/status/{doc_id}` - Check extraction status
- `POST /api/rag-retriever/search` - Search knowledge base
- `POST /api/reason/compute` - Trigger tax calculations
- `POST /api/forms/fill/{form_id}` - Fill PDF forms
- `POST /api/hmrc/submit` - Submit to HMRC
### Event Topics
- `DOC_INGESTED` - Document uploaded
- `DOC_OCR_READY` - OCR completed
- `DOC_EXTRACTED` - Fields extracted
- `KG_UPSERTED` - Knowledge graph updated
- `RAG_INDEXED` - Vector indexing completed
- `CALC_SCHEDULE_READY` - Tax calculation completed
- `FORM_FILLED` - PDF form filled
- `HMRC_SUBMITTED` - HMRC submission completed
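The topics above could be modelled as an enum with a small event envelope. The topic names come from this list; the envelope fields (`tenant_id`, `payload`, `event_id`, `occurred_at`) are assumptions and may not match `libs/events.py`.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Topic(str, Enum):
    DOC_INGESTED = "DOC_INGESTED"
    DOC_OCR_READY = "DOC_OCR_READY"
    DOC_EXTRACTED = "DOC_EXTRACTED"
    KG_UPSERTED = "KG_UPSERTED"
    RAG_INDEXED = "RAG_INDEXED"
    CALC_SCHEDULE_READY = "CALC_SCHEDULE_READY"
    FORM_FILLED = "FORM_FILLED"
    HMRC_SUBMITTED = "HMRC_SUBMITTED"


@dataclass(frozen=True)
class Event:
    """Hypothetical envelope; the real event schema may differ."""
    topic: Topic
    tenant_id: str
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


if __name__ == "__main__":
    print(Event(Topic.DOC_INGESTED, "tenant-123", {"doc_id": "doc-1"}))
```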
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request
### Code Standards
- **Python**: Black formatting, isort imports, mypy type checking
- **Documentation**: Docstrings for all public functions
- **Testing**: Minimum 80% code coverage
- **Security**: No secrets in code, use Vault for sensitive data
## 📋 Coverage Policy System
The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.
### Policy Configuration
Coverage policies are defined in `config/coverage.yaml`, with support for jurisdiction-specific and tenant-specific overlays:
```yaml
# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"
defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30
triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"
schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]
```
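Trigger evaluation for the policy above might work roughly as in this sketch, which decides which schedules are required from a list of entities. It handles only the single-filter `exists(Entity[key='value'])` form shown in the triggers. The entity dictionaries with a `label` key are an assumed representation, and the real engine in `libs/coverage_eval.py` presumably queries the knowledge graph instead.

```python
import re

# Triggers copied from the policy example above.
TRIGGERS = {
    "SA102": {"any_of": ["exists(IncomeItem[type='Employment'])"]},
    "SA105": {"any_of": ["exists(IncomeItem[type='UKPropertyRent'])"]},
}

EXISTS_RE = re.compile(r"^exists\((\w+)\[(\w+)='([^']*)'\]\)$")


def predicate_holds(predicate: str, entities: list[dict]) -> bool:
    """Evaluate the single-filter exists(Entity[key='value']) form."""
    match = EXISTS_RE.match(predicate)
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate}")
    label, key, value = match.groups()
    return any(e.get("label") == label and e.get(key) == value for e in entities)


def required_schedules(triggers: dict, entities: list[dict]) -> list[str]:
    """Return schedules whose any_of list contains at least one true predicate."""
    return [
        schedule
        for schedule, rule in triggers.items()
        if any(predicate_holds(p, entities) for p in rule.get("any_of", []))
    ]


if __name__ == "__main__":
    facts = [{"label": "IncomeItem", "type": "Employment"}]
    print(required_schedules(TRIGGERS, facts))
```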
### API Usage
#### Check Document Coverage
```bash
curl -X POST https://api.localhost/coverage/v1/check \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"taxpayer_id": "T-001",
"tax_year": "2024-25",
"jurisdiction": "UK"
}'
```
Response:
```json
{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
```
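A client could turn this response into a short checklist for the reviewer. The helper below is a hypothetical sketch that relies only on the fields shown in the example response.

```python
import json


def summarize_blockers(response_body: str) -> list[str]:
    """Turn the blocking_items of a coverage response into readable lines."""
    report = json.loads(response_body)
    lines = []
    for item in report.get("blocking_items", []):
        alternatives = ", ".join(item.get("acceptable_alternatives", [])) or "none"
        lines.append(
            f"{item['schedule_id']}: {item['evidence_id']} ({item['role']}) "
            f"is blocking; alternatives: {alternatives}"
        )
    return lines


if __name__ == "__main__":
    body = json.dumps({
        "overall_status": "INCOMPLETE",
        "blocking_items": [{
            "schedule_id": "SA102",
            "evidence_id": "P60",
            "role": "REQUIRED",
            "acceptable_alternatives": ["P45", "FinalPayslipYTD"],
        }],
    })
    print("\n".join(summarize_blockers(body)))
```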
#### Generate Clarifying Questions
```bash
curl -X POST https://api.localhost/coverage/v1/clarify \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"taxpayer_id": "T-001",
"tax_year": "2024-25",
"jurisdiction": "UK",
"schedule_id": "SA102",
"evidence_id": "P60"
}'
```
### Policy Hot Reload
Policies can be reloaded without service restart:
```bash
curl -X POST https://api.localhost/coverage/admin/reload \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
### Predicate Language
The policy system supports a domain-specific language for conditions:
- `exists(Entity[filters])` - Check if entities exist with filters
- `property_name` - Check boolean properties
- `taxpayer_flag:flag_name` - Check taxpayer flags
- `filing_mode:mode` - Check filing mode
- `computed_condition` - Check computed values
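The forms listed above can be told apart with a small parser. This sketch only classifies a predicate string into a kind and an argument without evaluating it. The `Predicate` type and the kind names are assumptions, and bare identifiers are not split further because boolean properties and computed conditions look the same syntactically.

```python
import re
from typing import NamedTuple


class Predicate(NamedTuple):
    kind: str       # "exists", "taxpayer_flag", "filing_mode", or "name"
    argument: str


def parse_predicate(text: str) -> Predicate:
    """Classify a predicate string by the forms listed above."""
    text = text.strip()
    exists = re.fullmatch(r"exists\((.+)\)", text)
    if exists:
        return Predicate("exists", exists.group(1))
    for prefix in ("taxpayer_flag", "filing_mode"):
        if text.startswith(prefix + ":"):
            return Predicate(prefix, text[len(prefix) + 1:])
    if re.fullmatch(r"\w+", text):
        # Bare identifiers cover boolean properties and computed conditions.
        return Predicate("name", text)
    raise ValueError(f"unrecognised predicate: {text!r}")
```

The flag and mode names used when calling it (for example `taxpayer_flag:has_foreign_income`) are placeholders, not values defined by the policy.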
### Status Classification
Evidence is classified into four statuses:
- **PRESENT_VERIFIED**: High confidence OCR/extract, date within tax year
- **PRESENT_UNVERIFIED**: Medium confidence, may need manual review
- **CONFLICTING**: Multiple documents with conflicting information
- **MISSING**: No evidence found or confidence too low
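The classification can be sketched as a single function. The OCR and extraction thresholds and the tax year boundary come from the policy example above. Everything else is an assumption for illustration: the `MIN_CONFIDENCE` floor, the `Document` fields, comparing one `key_value` to detect conflicts, and ignoring `date_tolerance_days`.

```python
from dataclasses import dataclass
from datetime import date

OCR_THRESHOLD = 0.82       # from defaults.confidence_thresholds.ocr
EXTRACT_THRESHOLD = 0.85   # from defaults.confidence_thresholds.extract
TAX_YEAR = (date(2024, 4, 6), date(2025, 4, 5))
MIN_CONFIDENCE = 0.5       # hypothetical floor below which a document is ignored


@dataclass(frozen=True)
class Document:
    ocr_confidence: float
    extract_confidence: float
    document_date: date
    key_value: float   # hypothetical figure used to detect conflicts


def classify(docs: list[Document]) -> str:
    """Return one of the four evidence statuses for a set of candidate documents."""
    usable = [
        d for d in docs
        if min(d.ocr_confidence, d.extract_confidence) >= MIN_CONFIDENCE
    ]
    if not usable:
        return "MISSING"
    if len({d.key_value for d in usable}) > 1:
        return "CONFLICTING"
    start, end = TAX_YEAR
    verified = any(
        d.ocr_confidence >= OCR_THRESHOLD
        and d.extract_confidence >= EXTRACT_THRESHOLD
        and start <= d.document_date <= end
        for d in usable
    )
    return "PRESENT_VERIFIED" if verified else "PRESENT_UNVERIFIED"
```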
### Testing
Run coverage policy tests:
```bash
# Unit tests
pytest tests/unit/coverage/ -v
# Integration tests
pytest tests/integration/coverage/ -v
# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v
# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🆘 Support
- **Documentation**: See `/docs` directory
- **Issues**: GitHub Issues
- **Discussions**: GitHub Discussions
- **Security**: security@example.com
## 🗺️ Roadmap
- [ ] Advanced ML models for extraction
- [ ] Multi-jurisdiction support (EU, US)
- [ ] Real-time collaboration features
- [ ] Mobile application
- [ ] Advanced analytics dashboard
- [ ] Blockchain audit trails