AI Tax Agent - Production Microservices Suite
A comprehensive, production-grade AI-powered tax agent system for UK Self Assessment with microservices architecture, knowledge graphs, RAG capabilities, and HMRC integration.
🏗️ Architecture Overview
This system implements a complete end-to-end tax processing pipeline with:
- 13 Microservices for document processing, extraction, reasoning, coverage checking, and submission
- Knowledge Graph (Neo4j) with bitemporal modeling for audit trails
- Vector Database (Qdrant) for RAG with PII protection
- Edge Authentication via Traefik + Authentik SSO
- Event-Driven Architecture with Kafka messaging
- Comprehensive Observability with OpenTelemetry, Prometheus, and Grafana
🚀 Quick Start
Prerequisites
- Docker and Docker Compose
- Python 3.12+
- Node.js 18+ (for UI components)
- 16GB+ RAM recommended
- OpenAI API key (for LLM extraction)
1. Clone and Setup
git clone <repository-url>
cd ai-tax-agent-2
# Bootstrap the development environment
make bootstrap
# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY
2. Start Infrastructure (Automated)
# Start all services with automated fixes
make run
# Alternative: Start without fixes (original behavior)
make run-simple
# Or deploy infrastructure only
make deploy-infra
3. Complete Authentik Setup
After deployment, complete the SSO setup:
- Visit https://auth.local.lan/if/flow/initial-setup/
- Create the initial admin user
- Configure applications for protected services
# Run setup helper (optional)
make setup-authentik
4. Access Services
- Traefik Dashboard: http://localhost:8080
- Authentik SSO: https://auth.local.lan
- Grafana: https://grafana.local.lan
- Review UI: https://review.local.lan (requires Authentik setup)
- API Gateway: https://api.local.lan
🤖 Automation & Scripts
The system includes comprehensive automation for deployment and troubleshooting:
Core Commands
# Complete automated deployment with fixes
make run
# Bootstrap environment
make bootstrap
# Deploy infrastructure only
make deploy-infra
# Deploy application services only
make deploy-services
Troubleshooting & Maintenance
# Run comprehensive troubleshooting
make troubleshoot
# Fix database issues
make fix-databases
# Restart Authentik components
make restart-authentik
# Restart Unleash with fixes
make restart-unleash
# Verify all endpoints
make verify
# Check service health
make health
# View service status
make status
Automated Fixes
The deployment automation handles:
- Database Initialization: Creates required databases (unleash, authentik)
- Password Reset: Fixes Authentik database authentication issues
- Service Ordering: Starts services in correct dependency order
- Health Monitoring: Waits for services to be healthy before proceeding
- Network Setup: Creates required Docker networks
- Certificate Generation: Generates self-signed TLS certificates
- Host Configuration: Sets up local domain resolution
📋 Services Overview
Core Processing Pipeline
- svc-ingestion (Port 8001) - Document upload and storage
- svc-rpa (Port 8002) - Browser automation for portal data
- svc-ocr (Port 8003) - OCR and layout extraction
- svc-extract (Port 8004) - LLM-based field extraction
- svc-normalize-map (Port 8005) - Data normalization and KG mapping
- svc-kg (Port 8006) - Knowledge graph operations
AI & Reasoning
- svc-rag-indexer (Port 8007) - Vector database indexing
- svc-rag-retriever (Port 8008) - Hybrid search with KG fusion
- svc-reason (Port 8009) - Tax calculation engine
- svc-coverage (Port 8013) - Document coverage policy evaluation
Output & Integration
- svc-forms (Port 8010) - PDF form filling
- svc-hmrc (Port 8011) - HMRC submission service
- svc-firm-connectors (Port 8012) - Practice management integration
🔧 Development
Project Structure
ai-tax-agent/
├── libs/ # Shared libraries
│ ├── config.py # Configuration and factories
│ ├── security.py # Authentication and encryption
│ ├── observability.py # Tracing, metrics, logging
│ ├── events.py # Event bus abstraction
│ ├── schemas.py # Pydantic models
│ ├── storage.py # MinIO/S3 operations
│ ├── neo.py # Neo4j operations
│ ├── rag.py # RAG and vector operations
│ ├── forms.py # PDF form handling
│ ├── calibration.py # ML confidence calibration
│ ├── policy.py # Coverage policy loading and compilation
│ ├── coverage_models.py # Coverage system data models
│ ├── coverage_eval.py # Coverage evaluation engine
│ └── coverage_schema.json # JSON schema for policy validation
├── apps/ # Microservices
│ ├── svc-ingestion/ # Document ingestion service
│ ├── svc-rpa/ # RPA automation service
│ ├── svc-ocr/ # OCR processing service
│ ├── svc-extract/ # Field extraction service
│ ├── svc-normalize-map/ # Normalization service
│ ├── svc-kg/ # Knowledge graph service
│ ├── svc-rag-indexer/ # RAG indexing service
│ ├── svc-rag-retriever/ # RAG retrieval service
│ ├── svc-reason/ # Tax reasoning service
│ ├── svc-coverage/ # Document coverage policy service
│ ├── svc-forms/ # Form filling service
│ ├── svc-hmrc/ # HMRC integration service
│ └── svc-firm-connectors/ # Firm integration service
├── infra/ # Infrastructure
│ ├── compose/ # Docker Compose files
│ └── k8s/ # Kubernetes manifests
├── tests/ # Test suites
│ ├── e2e/ # End-to-end tests
│ └── unit/ # Unit tests
├── config/ # Configuration files
├── schemas/ # Data schemas
├── db/ # Database schemas
└── docs/ # Documentation
Running Tests
# Unit tests
make test-unit
# End-to-end tests
make test-e2e
# All tests
make test
Development Workflow
# Start development environment
make dev
# Watch logs for specific service
make logs SERVICE=svc-extract
# Restart specific service
make restart SERVICE=svc-extract
# Run linting and formatting
make lint
make format
# Generate API documentation
make docs
🔐 Security & Authentication
Edge Authentication
- Traefik reverse proxy with SSL termination
- Authentik SSO provider with OIDC/SAML support
- ForwardAuth middleware for service authentication
- Zero-trust architecture - services consume user context via headers
Data Protection
- Vault Transit encryption for sensitive fields
- PII Detection and de-identification before vector indexing
- Tenant Isolation with row-level security
- Audit Trails with bitemporal data modeling
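As a rough illustration of de-identification before vector indexing, the sketch below replaces matches with typed placeholders. The function name `redact_pii`, the placeholder tokens, and both regular expressions are hypothetical; the project's actual detector may use a different approach and will need to cover far more PII types.

```python
import re

# Hypothetical patterns for illustration only.
_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Simplified National Insurance number shape: two letters, six digits, one letter A-D.
    "NINO": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def redact_pii(text: str) -> str:
    """Replace each pattern match with a typed placeholder such as [EMAIL]."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Only the redacted text would then be embedded and stored, so the vector index never holds the raw identifiers.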
Network Security
- Internal Networks for service communication
- TLS Everywhere with automatic certificate management
- Rate Limiting and DDoS protection
- Security Headers and CORS policies
📊 Observability
Metrics & Monitoring
- Prometheus for metrics collection
- Grafana for visualization and alerting
- Custom Business Metrics for document processing, RAG, calculations
- SLI/SLO Monitoring with error budgets
Tracing & Logging
- OpenTelemetry distributed tracing
- Jaeger trace visualization
- Structured Logging with correlation IDs
- Log Aggregation with ELK stack (optional)
Health Checks
# Check all service health
make health
# Individual service health
curl http://localhost:8001/health
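The same check can be scripted across every service port. This is a minimal sketch, assuming each service exposes `/health` on the ports listed in the Services Overview and answers 200 when healthy; `http_status` and `check_health` are hypothetical helper names, not part of the repository.

```python
from collections.abc import Callable, Iterable
from http.client import HTTPException
from urllib.request import urlopen


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (OSError, HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses.
        return 0


def check_health(
    ports: Iterable[int], fetch: Callable[[str], int] = http_status
) -> dict[int, bool]:
    """Map each port to True when its /health endpoint answers 200."""
    return {port: fetch(f"http://localhost:{port}/health") == 200 for port in ports}
```

Calling `check_health(range(8001, 8014))` would probe ports 8001 through 8013; the `fetch` parameter exists so the function can be tested without a running stack.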
🗃️ Data Architecture
Knowledge Graph (Neo4j)
- Bitemporal Modeling with valid_time and system_time
- SHACL Validation for data integrity
- Tenant Isolation with security constraints
- Audit Trails for all changes
Vector Database (Qdrant)
- PII-Free Indexing with de-identification
- Hybrid Search combining dense and sparse vectors
- Collection Management per tenant and data type
- Confidence Calibration for search results
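One common way to combine dense and sparse result lists is reciprocal rank fusion (RRF). This README does not say which fusion method svc-rag-retriever uses, so the sketch below is an illustrative stand-in: the function name and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each hit scores 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF needs only ranks, not raw scores, which avoids having to normalize dense similarity values against sparse ones.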
Event Streaming (Kafka) - (TBD)
- Event-Driven Architecture with standardized topics
- Exactly-Once Processing with idempotency
- Dead Letter Queues for error handling
- Schema Registry for event validation
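Since the Kafka layer is still marked TBD, the following is only a sketch of the idempotency idea: a consumer-side wrapper that applies each event once even if it is redelivered. The `IdempotentHandler` class and the `event_id` field are hypothetical, and a real deployment would keep the seen-ID set in a durable store rather than in memory.

```python
from collections.abc import Callable


class IdempotentHandler:
    """Wrap a handler so that redelivered events (same event_id) are applied once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # assumption: in-memory only, for illustration

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._handler(event)
        self._seen.add(event_id)  # mark only after the handler succeeds
        return True
```

Marking the ID after the handler returns means a crash mid-handler leads to a retry rather than a lost event, so the handler itself should also be safe to repeat.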
🧮 Tax Calculation Engine
Supported Forms
- SA100 - Main Self Assessment return
- SA103 - Self-employment income
- SA105 - Property income
- SA106 - Foreign income
Calculation Features
- Rules Engine with configurable tax rules
- Evidence Trails linking calculations to source documents
- Confidence Scoring with calibration
- Multi-Year Support with basis period reform
HMRC Integration
- MTD API integration for submissions
- OAuth 2.0 authentication flow
- Dry Run mode for testing
- Validation against HMRC business rules
🔌 Integrations
Practice Management Systems
- IRIS Practice Management
- Sage Practice Management
- Xero accounting software
- QuickBooks accounting software
- FreeAgent accounting software
- KashFlow accounting software
Document Sources
- Direct Upload via web interface
- Email Integration with attachment processing
- Portal Scraping via RPA automation
- API Integration with accounting systems
🚀 Deployment
Local Development
make up # Start all services
make down # Stop all services
make clean # Clean up volumes and networks
Production Deployment
For detailed instructions, see infra/compose/README.md.
The system uses a unified deployment script for production environments:
# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all
Ensure you have configured infra/environments/production/.env with the correct secrets and domain settings before deploying.
Environment Configuration
Key environment variables:
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333
# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
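A service might read these variables at startup and fail fast when one is missing. The sketch below is an assumption about how that could look; it is not the contents of libs/config.py, and `Settings` and `load_settings` are hypothetical names covering only four of the variables above.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    neo4j_uri: str
    qdrant_url: str
    openai_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, raising if any are missing or empty."""
    required = ["DATABASE_URL", "NEO4J_URI", "QDRANT_URL", "OPENAI_API_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        neo4j_uri=env["NEO4J_URI"],
        qdrant_url=env["QDRANT_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```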
📚 API Documentation
Authentication
All API endpoints require authentication via Authentik ForwardAuth:
curl -H "X-Forwarded-User: user@example.com" \
-H "X-Forwarded-Groups: tax_agents" \
-H "X-Tenant-ID: tenant-123" \
https://api.localhost/api/ingestion/health
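On the service side, the forwarded headers in the request above can be turned into a typed user context. This is a minimal sketch: `UserContext` and `user_context_from_headers` are hypothetical names, and the comma-separated format for multiple groups is an assumption.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple[str, ...]
    tenant_id: str


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Build a UserContext from the headers set by the edge proxy."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        user = lowered["x-forwarded-user"]
        tenant_id = lowered["x-tenant-id"]
    except KeyError as exc:
        raise PermissionError(f"missing auth header: {exc.args[0]}") from exc
    # Assumption: multiple groups arrive as a comma-separated list.
    raw_groups = lowered.get("x-forwarded-groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant_id)
```

Headers like these are only trustworthy when the service is reachable solely through the proxy, since any client with direct access could set them itself.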
Key Endpoints
- POST /api/ingestion/upload - Upload documents
- GET /api/extract/status/{doc_id} - Check extraction status
- POST /api/rag-retriever/search - Search knowledge base
- POST /api/reason/compute - Trigger tax calculations
- POST /api/forms/fill/{form_id} - Fill PDF forms
- POST /api/hmrc/submit - Submit to HMRC
Event Topics
- DOC_INGESTED - Document uploaded
- DOC_OCR_READY - OCR completed
- DOC_EXTRACTED - Fields extracted
- KG_UPSERTED - Knowledge graph updated
- RAG_INDEXED - Vector indexing completed
- CALC_SCHEDULE_READY - Tax calculation completed
- FORM_FILLED - PDF form filled
- HMRC_SUBMITTED - HMRC submission completed
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Run the test suite
- Submit a pull request
Code Standards
- Python: Black formatting, isort imports, mypy type checking
- Documentation: Docstrings for all public functions
- Testing: Minimum 80% code coverage
- Security: No secrets in code, use Vault for sensitive data
📋 Coverage Policy System
The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.
Policy Configuration
Coverage policies are defined in config/coverage.yaml with support for jurisdiction and tenant-specific overlays:
# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"
defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30
triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"
schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]
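To show how the `triggers` section could decide which schedules apply, here is a small evaluator for the single predicate form used above, `exists(Entity[field='value'])`. It is a sketch only: the function names, the regular expression, and the fact representation (dicts with an `entity` key) are assumptions, and the real engine in libs/coverage_eval.py presumably supports more predicate forms.

```python
import re

# Matches only the form exists(Entity[field='value']) used in the triggers above.
_EXISTS = re.compile(r"^exists\((\w+)\[(\w+)='([^']*)'\]\)$")


def exists(predicate: str, facts: list[dict]) -> bool:
    """Evaluate one exists(...) predicate against a list of fact dicts."""
    match = _EXISTS.match(predicate)
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate}")
    entity, field, value = match.groups()
    return any(f.get("entity") == entity and f.get(field) == value for f in facts)


def required_schedules(triggers: dict[str, dict], facts: list[dict]) -> list[str]:
    """Return schedule IDs whose any_of list has at least one true predicate."""
    return [
        schedule
        for schedule, rule in triggers.items()
        if any(exists(p, facts) for p in rule.get("any_of", []))
    ]
```

With a single employment income fact, only SA102 would be required; SA105 stays out until a `UKPropertyRent` item appears.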
API Usage
Check Document Coverage
curl -X POST https://api.localhost/coverage/v1/check \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"taxpayer_id": "T-001",
"tax_year": "2024-25",
"jurisdiction": "UK"
}'
Response:
{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
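A client could turn the `blocking_items` array of this response into short messages for a reviewer. The helper below is a hypothetical sketch that relies only on the field names shown in the response above.

```python
def summarize_blocking_items(response: dict) -> list[str]:
    """Turn each blocking item into a one-line, human-readable message."""
    lines = []
    for item in response.get("blocking_items", []):
        alternatives = ", ".join(item.get("acceptable_alternatives", [])) or "none"
        lines.append(
            f"{item['schedule_id']}: missing {item['evidence_id']} "
            f"({item['role']}); alternatives: {alternatives}"
        )
    return lines
```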
Generate Clarifying Questions
curl -X POST https://api.localhost/coverage/v1/clarify \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"taxpayer_id": "T-001",
"tax_year": "2024-25",
"jurisdiction": "UK",
"schedule_id": "SA102",
"evidence_id": "P60"
}'
Policy Hot Reload
Policies can be reloaded without service restart:
curl -X POST https://api.localhost/coverage/admin/reload \
-H "Authorization: Bearer $ADMIN_TOKEN"
Predicate Language
The policy system supports a domain-specific language for conditions:
- exists(Entity[filters]) - Check if entities exist with filters
- property_name - Check boolean properties
- taxpayer_flag:flag_name - Check taxpayer flags
- filing_mode:mode - Check filing mode
- computed_condition - Check computed values
Status Classification
Evidence is classified into four statuses:
- PRESENT_VERIFIED: High confidence OCR/extract, date within tax year
- PRESENT_UNVERIFIED: Medium confidence, may need manual review
- CONFLICTING: Multiple documents with conflicting information
- MISSING: No evidence found or confidence too low
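The four statuses can be read as a small decision function. The sketch below uses the `ocr` and `extract` thresholds and the tax-year boundary from the sample policy; the `floor` value of 0.5 that separates "medium" from "too low" is purely an assumption, as is the function name, and `date_tolerance_days` is ignored for brevity.

```python
from datetime import date


def classify_evidence(
    ocr_conf: float,
    extract_conf: float,
    doc_date: date,
    conflicting: bool = False,
    *,
    ocr_min: float = 0.82,      # from defaults.confidence_thresholds.ocr
    extract_min: float = 0.85,  # from defaults.confidence_thresholds.extract
    floor: float = 0.5,         # hypothetical cut-off below which evidence counts as missing
    year_start: date = date(2024, 4, 6),
    year_end: date = date(2025, 4, 5),
) -> str:
    """Classify one piece of evidence into one of the four coverage statuses."""
    if conflicting:
        return "CONFLICTING"
    if min(ocr_conf, extract_conf) < floor:
        return "MISSING"
    in_tax_year = year_start <= doc_date <= year_end
    if ocr_conf >= ocr_min and extract_conf >= extract_min and in_tax_year:
        return "PRESENT_VERIFIED"
    return "PRESENT_UNVERIFIED"
```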
Testing
Run coverage policy tests:
# Unit tests
pytest tests/unit/coverage/ -v
# Integration tests
pytest tests/integration/coverage/ -v
# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v
# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- Documentation: See /docs directory
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Security: security@example.com
🗺️ Roadmap
- Advanced ML models for extraction
- Multi-jurisdiction support (EU, US)
- Real-time collaboration features
- Mobile application
- Advanced analytics dashboard
- Blockchain audit trails