A comprehensive, production-grade, AI-powered tax agent system for UK Self Assessment, built on a microservices architecture with knowledge graphs, RAG capabilities, and HMRC integration.

🏗️ Architecture Overview

This system implements a complete end-to-end tax processing pipeline with:

13 Microservices for document processing, extraction, reasoning, and submission
Knowledge Graph (Neo4j) with bitemporal modeling for audit trails
Vector Database (Qdrant) for RAG with PII protection
Edge Authentication via Traefik + Authentik SSO
Event-Driven Architecture with Kafka messaging
Comprehensive Observability with OpenTelemetry, Prometheus, and Grafana

🚀 Quick Start

Prerequisites

Docker and Docker Compose
Python 3.12+
Node.js 18+ (for UI components)
16GB+ RAM recommended
OpenAI API key (for LLM extraction)

1. Clone and Setup

git clone <repository-url>
cd ai-tax-agent-2

# Bootstrap the development environment
make bootstrap

# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY

2. Start Infrastructure (Automated)

# Start all services with automated fixes
make run

# Alternative: Start without fixes (original behavior)
make run-simple

# Or deploy infrastructure only
make deploy-infra

3. Complete Authentik Setup

After deployment, complete the SSO setup:

Visit https://auth.local.lan/if/flow/initial-setup/
Create the initial admin user
Configure applications for protected services

# Run setup helper (optional)
make setup-authentik

4. Access Services

Traefik Dashboard: http://localhost:8080
Authentik SSO: https://auth.local.lan
Grafana: https://grafana.local.lan
Review UI: https://review.local.lan (requires Authentik setup)
API Gateway: https://api.local.lan

🤖 Automation & Scripts

The system includes comprehensive automation for deployment and troubleshooting:

Core Commands

# Complete automated deployment with fixes
make run

# Bootstrap environment
make bootstrap

# Deploy infrastructure only
make deploy-infra

# Deploy application services only
make deploy-services

Troubleshooting & Maintenance

# Run comprehensive troubleshooting
make troubleshoot

# Fix database issues
make fix-databases

# Restart Authentik components
make restart-authentik

# Restart Unleash with fixes
make restart-unleash

# Verify all endpoints
make verify

# Check service health
make health

# View service status
make status

Automated Fixes

The deployment automation handles:

Database Initialization: Creates required databases (unleash, authentik)
Password Reset: Fixes Authentik database authentication issues
Service Ordering: Starts services in correct dependency order
Health Monitoring: Waits for services to be healthy before proceeding
Network Setup: Creates required Docker networks
Certificate Generation: Generates self-signed TLS certificates
Host Configuration: Sets up local domain resolution

📋 Services Overview

Core Processing Pipeline

svc-ingestion (Port 8001) - Document upload and storage
svc-rpa (Port 8002) - Browser automation for portal data
svc-ocr (Port 8003) - OCR and layout extraction
svc-extract (Port 8004) - LLM-based field extraction
svc-normalize-map (Port 8005) - Data normalization and KG mapping
svc-kg (Port 8006) - Knowledge graph operations

AI & Reasoning

svc-rag-indexer (Port 8007) - Vector database indexing
svc-rag-retriever (Port 8008) - Hybrid search with KG fusion
svc-reason (Port 8009) - Tax calculation engine
svc-coverage (Port 8013) - Document coverage policy evaluation

Output & Integration

svc-forms (Port 8010) - PDF form filling
svc-hmrc (Port 8011) - HMRC submission service
svc-firm-connectors (Port 8012) - Practice management integration

🔧 Development

Project Structure

ai-tax-agent/
├── libs/                   # Shared libraries
│   ├── config.py           # Configuration and factories
│   ├── security.py         # Authentication and encryption
│   ├── observability.py    # Tracing, metrics, logging
│   ├── events.py           # Event bus abstraction
│   ├── schemas.py          # Pydantic models
│   ├── storage.py          # MinIO/S3 operations
│   ├── neo.py              # Neo4j operations
│   ├── rag.py              # RAG and vector operations
│   ├── forms.py            # PDF form handling
│   ├── calibration.py      # ML confidence calibration
│   ├── policy.py           # Coverage policy loading and compilation
│   ├── coverage_models.py  # Coverage system data models
│   ├── coverage_eval.py    # Coverage evaluation engine
│   └── coverage_schema.json # JSON schema for policy validation
├── apps/                   # Microservices
│   ├── svc-ingestion/      # Document ingestion service
│   ├── svc-rpa/            # RPA automation service
│   ├── svc-ocr/            # OCR processing service
│   ├── svc-extract/        # Field extraction service
│   ├── svc-normalize-map/  # Normalization service
│   ├── svc-kg/             # Knowledge graph service
│   ├── svc-rag-indexer/    # RAG indexing service
│   ├── svc-rag-retriever/  # RAG retrieval service
│   ├── svc-reason/         # Tax reasoning service
│   ├── svc-coverage/       # Document coverage policy service
│   ├── svc-forms/          # Form filling service
│   ├── svc-hmrc/           # HMRC integration service
│   └── svc-firm-connectors/ # Firm integration service
├── infra/                  # Infrastructure
│   ├── compose/            # Docker Compose files
│   └── k8s/                # Kubernetes manifests
├── tests/                  # Test suites
│   ├── e2e/                # End-to-end tests
│   └── unit/               # Unit tests
├── config/                 # Configuration files
├── schemas/                # Data schemas
├── db/                     # Database schemas
└── docs/                   # Documentation

Running Tests

# Unit tests
make test-unit

# End-to-end tests
make test-e2e

# All tests
make test

Development Workflow

# Start development environment
make dev

# Watch logs for specific service
make logs SERVICE=svc-extract

# Restart specific service
make restart SERVICE=svc-extract

# Run linting and formatting
make lint
make format

# Generate API documentation
make docs

🔐 Security & Authentication

Edge Authentication

Traefik reverse proxy with SSL termination
Authentik SSO provider with OIDC/SAML support
ForwardAuth middleware for service authentication
Zero-trust architecture - services consume user context via headers

Data Protection

Vault Transit encryption for sensitive fields
PII Detection and de-identification before vector indexing
Tenant Isolation with row-level security
Audit Trails with bitemporal data modeling
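
To illustrate the de-identification step that runs before vector indexing, here is a minimal sketch. It is not the project's implementation: the function name `deidentify`, the placeholder tokens, and both regular expressions are assumptions, and the National Insurance number pattern is simplified (two letters, six digits, a suffix letter A-D) without the real prefix exclusions.

```python
import re

# Simplified UK National Insurance number: 2 letters, 6 digits, suffix A-D.
# Assumption: optional single spaces between the groups.
NINO = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
# Deliberately loose email pattern, good enough for masking purposes.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def deidentify(text):
    """Replace recognisable identifiers with placeholder tokens."""
    text = NINO.sub("[NINO]", text)
    return EMAIL.sub("[EMAIL]", text)


masked = deidentify("NI number QQ 12 34 56 C, contact jo@example.com")
```

Only the masked text would be embedded and stored in the vector database; the original values stay in the encrypted primary store.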

Network Security

Internal Networks for service communication
TLS Everywhere with automatic certificate management
Rate Limiting and DDoS protection
Security Headers and CORS policies

📊 Observability

Metrics & Monitoring

Prometheus for metrics collection
Grafana for visualization and alerting
Custom Business Metrics for document processing, RAG, calculations
SLI/SLO Monitoring with error budgets

Tracing & Logging

OpenTelemetry distributed tracing
Jaeger trace visualization
Structured Logging with correlation IDs
Log Aggregation with ELK stack (optional)

Health Checks

# Check all service health
make health

# Individual service health
curl http://localhost:8001/health

🗃️ Data Architecture

Knowledge Graph (Neo4j)

Bitemporal Modeling with valid_time and system_time
SHACL Validation for data integrity
Tenant Isolation with security constraints
Audit Trails for all changes
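
The bitemporal idea can be sketched in plain Python, independent of Neo4j. Each version of a fact carries a valid-time interval (when it was true in the real world) and a system-time interval (when the system held that belief). The class and function names below are hypothetical and only illustrate the lookup; they are not taken from the codebase.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    """One version of a fact with two independent time axes."""
    value: float
    valid_from: datetime            # start of valid_time
    valid_to: Optional[datetime]    # None means still valid
    system_from: datetime           # start of system_time
    system_to: Optional[datetime]   # None means current record


def _covers(start, end, instant):
    # Half-open interval [start, end); an open end never expires.
    return start <= instant and (end is None or instant < end)


def as_of(versions, valid_at, known_at):
    """Return the version true at valid_at as the system knew it at known_at."""
    for v in versions:
        if _covers(v.valid_from, v.valid_to, valid_at) and _covers(
            v.system_from, v.system_to, known_at
        ):
            return v
    return None


# A salary figure recorded on 1 June 2024 and corrected on 1 September 2024.
history = [
    FactVersion(45000.0, datetime(2024, 4, 6), None,
                datetime(2024, 6, 1), datetime(2024, 9, 1)),
    FactVersion(46000.0, datetime(2024, 4, 6), None,
                datetime(2024, 9, 1), None),
]
before_fix = as_of(history, datetime(2024, 5, 1), datetime(2024, 7, 1))
after_fix = as_of(history, datetime(2024, 5, 1), datetime(2024, 10, 1))
```

Because the superseded version is closed rather than deleted, an auditor can reproduce what the system believed at any past moment.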

Vector Database (Qdrant)

PII-Free Indexing with de-identification
Hybrid Search combining dense and sparse vectors
Collection Management per tenant and data type
Confidence Calibration for search results
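
One common way to combine a dense ranking with a sparse ranking is reciprocal rank fusion (RRF). The source does not say which fusion method the retriever uses, so treat the following as an illustrative sketch of the general technique rather than the project's algorithm. The document IDs are made up and `k=60` is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc-a", "doc-b", "doc-c"]    # hypothetical dense-vector results
sparse_hits = ["doc-b", "doc-d", "doc-a"]   # hypothetical sparse-vector results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF works on ranks only, so the dense and sparse scores do not need to be on comparable scales.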

Event Streaming (Kafka) - (TBD)

Event-Driven Architecture with standardized topics
Exactly-Once Processing with idempotency
Dead Letter Queues for error handling
Schema Registry for event validation
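
Idempotent handling is usually what makes "exactly-once" achievable in practice: the broker may redeliver an event, but the consumer applies it only once. The sketch below assumes every event carries a unique `event_id` field (a hypothetical name) and keeps the seen IDs in memory; a real service would need a durable store.

```python
class IdempotentHandler:
    """Wrap an event handler so redelivered events are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; lost on restart

    def handle(self, event):
        event_id = event["event_id"]  # hypothetical field name
        if event_id in self._seen:
            return False  # duplicate delivery: skip side effects
        self._handler(event)
        # Mark as seen only after success so a failed event can be retried.
        self._seen.add(event_id)
        return True


processed = []
consumer = IdempotentHandler(processed.append)
first = consumer.handle({"event_id": "e1", "topic": "DOC_INGESTED"})
second = consumer.handle({"event_id": "e1", "topic": "DOC_INGESTED"})
```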

🧮 Tax Calculation Engine

Supported Forms

SA100 - Main Self Assessment return
SA103 - Self-employment income
SA105 - Property income
SA106 - Foreign income

Calculation Features

Rules Engine with configurable tax rules
Evidence Trails linking calculations to source documents
Confidence Scoring with calibration
Multi-Year Support with basis period reform

HMRC Integration

MTD API integration for submissions
OAuth 2.0 authentication flow
Dry Run mode for testing
Validation against HMRC business rules

🔌 Integrations

Practice Management Systems

IRIS Practice Management
Sage Practice Management
Xero accounting software
QuickBooks accounting software
FreeAgent accounting software
KashFlow accounting software

Document Sources

Direct Upload via web interface
Email Integration with attachment processing
Portal Scraping via RPA automation
API Integration with accounting systems

🚀 Deployment

Local Development

make up      # Start all services
make down    # Stop all services
make clean   # Clean up volumes and networks

Production Deployment

For detailed instructions, see infra/compose/README.md.

The system uses a unified deployment script for production environments:

# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all

Ensure you have configured infra/environments/production/.env with the correct secrets and domain settings before deploying.

Environment Configuration

Key environment variables:

# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333

# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092

# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
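
A service can fail fast at startup when a required variable is absent. The following is a minimal sketch, not the contents of libs/config.py; the `Settings` class and `load_settings` function are hypothetical names, and the choice of which variables are mandatory is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    neo4j_uri: str
    qdrant_url: str
    openai_api_key: str


def load_settings(env=None):
    """Read required variables from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = {
        "database_url": "DATABASE_URL",
        "neo4j_uri": "NEO4J_URI",
        "qdrant_url": "QDRANT_URL",
        "openai_api_key": "OPENAI_API_KEY",
    }
    # Treat unset and empty values the same way.
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(**{field: env[name] for field, name in required.items()})


example = load_settings({
    "DATABASE_URL": "postgresql+asyncpg://user:pass@host:5432/db",
    "NEO4J_URI": "bolt://neo4j:7687",
    "QDRANT_URL": "http://qdrant:6333",
    "OPENAI_API_KEY": "sk-...",
})
```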

📚 API Documentation

Authentication

All API endpoints require authentication via Authentik ForwardAuth:

curl -H "X-Forwarded-User: user@example.com" \
     -H "X-Forwarded-Groups: tax_agents" \
     -H "X-Tenant-ID: tenant-123" \
     https://api.localhost/api/ingestion/health
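
Inside a service, the forwarded headers can be turned into a typed user context. This is a hedged sketch: `UserContext` and `user_context_from_headers` are hypothetical names, and a comma-separated group list is an assumption about the header format. Such headers are only trustworthy when the service is reachable exclusively through the proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple
    tenant_id: str


def user_context_from_headers(headers):
    """Build a UserContext from ForwardAuth headers; raise if any is missing."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        user = lowered["x-forwarded-user"]
        groups_raw = lowered["x-forwarded-groups"]
        tenant = lowered["x-tenant-id"]
    except KeyError as exc:
        raise PermissionError(f"missing auth header: {exc.args[0]}") from exc
    # Assumption: multiple groups arrive as a comma-separated string.
    groups = tuple(g.strip() for g in groups_raw.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant)


ctx = user_context_from_headers({
    "X-Forwarded-User": "user@example.com",
    "X-Forwarded-Groups": "tax_agents, reviewers",
    "X-Tenant-ID": "tenant-123",
})
```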

Key Endpoints

POST /api/ingestion/upload - Upload documents
GET /api/extract/status/{doc_id} - Check extraction status
POST /api/rag-retriever/search - Search knowledge base
POST /api/reason/compute - Trigger tax calculations
POST /api/forms/fill/{form_id} - Fill PDF forms
POST /api/hmrc/submit - Submit to HMRC

Event Topics

DOC_INGESTED - Document uploaded
DOC_OCR_READY - OCR completed
DOC_EXTRACTED - Fields extracted
KG_UPSERTED - Knowledge graph updated
RAG_INDEXED - Vector indexing completed
CALC_SCHEDULE_READY - Tax calculation completed
FORM_FILLED - PDF form filled
HMRC_SUBMITTED - HMRC submission completed

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Run the test suite
Submit a pull request

Code Standards

Python: Black formatting, isort imports, mypy type checking
Documentation: Docstrings for all public functions
Testing: Minimum 80% code coverage
Security: No secrets in code, use Vault for sensitive data

📋 Coverage Policy System

The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.

Policy Configuration

Coverage policies are defined in config/coverage.yaml, with support for jurisdiction-specific and tenant-specific overlays:

# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"

defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30

triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"

schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]

API Usage

Check Document Coverage

curl -X POST https://api.localhost/coverage/v1/check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK"
  }'

Response:

{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
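
A client can turn the `blocking_items` array into short messages for a reviewer. The helper below is a sketch with a hypothetical name; it relies only on the fields shown in the response above.

```python
def summarize_blocking_items(response):
    """Return one human-readable line per blocking item."""
    lines = []
    for item in response.get("blocking_items", []):
        line = f"{item['schedule_id']}: {item['evidence_id']} is {item['role']}"
        alternatives = item.get("acceptable_alternatives", [])
        if alternatives:
            line += " (or: " + ", ".join(alternatives) + ")"
        lines.append(line)
    return lines


sample_response = {
    "overall_status": "INCOMPLETE",
    "blocking_items": [
        {
            "schedule_id": "SA102",
            "evidence_id": "P60",
            "role": "REQUIRED",
            "acceptable_alternatives": ["P45", "FinalPayslipYTD"],
        }
    ],
}
summary = summarize_blocking_items(sample_response)
```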

Generate Clarifying Questions

curl -X POST https://api.localhost/coverage/v1/clarify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK",
    "schedule_id": "SA102",
    "evidence_id": "P60"
  }'

Policy Hot Reload

Policies can be reloaded without service restart:

curl -X POST https://api.localhost/coverage/admin/reload \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Predicate Language

The policy system supports a domain-specific language for conditions:

exists(Entity[filters]) - Check if entities exist with filters
property_name - Check boolean properties
taxpayer_flag:flag_name - Check taxpayer flags
filing_mode:mode - Check filing mode
computed_condition - Check computed values
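
As an illustration of the `exists(Entity[filters])` form, here is a small parser and evaluator. It is a sketch under assumptions, not the policy compiler in libs/policy.py: it handles only quoted `key='value'` filters, and it evaluates against plain dicts with a hypothetical `label` key instead of querying the knowledge graph.

```python
import re

_EXISTS = re.compile(r"^exists\((?P<entity>\w+)\[(?P<filters>[^\]]*)\]\)$")
_FILTER = re.compile(r"(\w+)\s*=\s*'([^']*)'")


def parse_exists(predicate):
    """Split exists(Entity[key='value', ...]) into (entity, filters dict)."""
    match = _EXISTS.match(predicate.strip())
    if match is None:
        raise ValueError(f"not an exists() predicate: {predicate!r}")
    filters = dict(_FILTER.findall(match.group("filters")))
    return match.group("entity"), filters


def evaluate_exists(predicate, entities):
    """True if any entity has the named label and matches every filter."""
    entity, filters = parse_exists(predicate)
    return any(
        e.get("label") == entity
        and all(e.get(key) == value for key, value in filters.items())
        for e in entities
    )


facts = [{"label": "IncomeItem", "type": "Employment"}]
needs_sa102 = evaluate_exists("exists(IncomeItem[type='Employment'])", facts)
needs_sa105 = evaluate_exists("exists(IncomeItem[type='UKPropertyRent'])", facts)
```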

Status Classification

Evidence is classified into four statuses:

PRESENT_VERIFIED: High confidence OCR/extract, date within tax year
PRESENT_UNVERIFIED: Medium confidence, may need manual review
CONFLICTING: Multiple documents with conflicting information
MISSING: No evidence found or confidence too low
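
The four statuses can be expressed as a small decision function. The two thresholds and the tax-year dates come from the sample policy above; everything else is an assumption made for illustration: the 0.50 floor below which a document is ignored, the use of differing amounts as the conflict signal, and the `FoundDocument` and `classify` names.

```python
from dataclasses import dataclass
from datetime import date

OCR_THRESHOLD = 0.82       # defaults.confidence_thresholds.ocr
EXTRACT_THRESHOLD = 0.85   # defaults.confidence_thresholds.extract
MIN_CONFIDENCE = 0.50      # hypothetical floor; not given in the source
TAX_YEAR = (date(2024, 4, 6), date(2025, 4, 5))


@dataclass(frozen=True)
class FoundDocument:
    ocr_confidence: float
    extract_confidence: float
    document_date: date
    amount: float


def classify(documents):
    """Map the documents found for one evidence item to a status string."""
    usable = [
        d for d in documents
        if min(d.ocr_confidence, d.extract_confidence) >= MIN_CONFIDENCE
    ]
    if not usable:
        return "MISSING"
    # Assumption: documents conflict when they report different amounts.
    if len({d.amount for d in usable}) > 1:
        return "CONFLICTING"
    verified = any(
        d.ocr_confidence >= OCR_THRESHOLD
        and d.extract_confidence >= EXTRACT_THRESHOLD
        and TAX_YEAR[0] <= d.document_date <= TAX_YEAR[1]
        for d in usable
    )
    return "PRESENT_VERIFIED" if verified else "PRESENT_UNVERIFIED"


good = FoundDocument(0.90, 0.90, date(2024, 12, 1), 100.0)
blurry = FoundDocument(0.70, 0.90, date(2024, 12, 1), 100.0)
other = FoundDocument(0.90, 0.90, date(2024, 12, 1), 250.0)
```

This sketch ignores the `date_tolerance_days` setting; a fuller version would widen the date check by that tolerance.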

Testing

Run coverage policy tests:

# Unit tests
pytest tests/unit/coverage/ -v

# Integration tests
pytest tests/integration/coverage/ -v

# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v

# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Documentation: See /docs directory
Issues: GitHub Issues
Discussions: GitHub Discussions
Security: security@example.com

🗺️ Roadmap

Advanced ML models for extraction
Multi-jurisdiction support (EU, US)
Real-time collaboration features
Mobile application
Advanced analytics dashboard
Blockchain audit trails
