
AI Tax Agent - Production Microservices Suite

A comprehensive, production-grade AI-powered tax agent system for UK Self Assessment with microservices architecture, knowledge graphs, RAG capabilities, and HMRC integration.

🏗️ Architecture Overview

This system implements a complete end-to-end tax processing pipeline with:

  • 13 Microservices for document processing, extraction, reasoning, and submission
  • Knowledge Graph (Neo4j) with bitemporal modeling for audit trails
  • Vector Database (Qdrant) for RAG with PII protection
  • Edge Authentication via Traefik + Authentik SSO
  • Event-Driven Architecture with Kafka messaging
  • Comprehensive Observability with OpenTelemetry, Prometheus, and Grafana

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.12+
  • Node.js 18+ (for UI components)
  • 16GB+ RAM recommended
  • OpenAI API key (for LLM extraction)

1. Clone and Setup

git clone <repository-url>
cd ai-tax-agent-2

# Bootstrap the development environment
make bootstrap

# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY

2. Start Infrastructure (Automated)

# Start all services with automated fixes
make run

# Alternative: Start without fixes (original behavior)
make run-simple

# Or deploy infrastructure only
make deploy-infra

3. Complete Authentik Setup

After deployment, complete the SSO setup:

  1. Visit https://auth.local.lan/if/flow/initial-setup/
  2. Create the initial admin user
  3. Configure applications for protected services

# Run setup helper (optional)
make setup-authentik

4. Access Services

🤖 Automation & Scripts

The system includes comprehensive automation for deployment and troubleshooting:

Core Commands

# Complete automated deployment with fixes
make run

# Bootstrap environment
make bootstrap

# Deploy infrastructure only
make deploy-infra

# Deploy application services only
make deploy-services

Troubleshooting & Maintenance

# Run comprehensive troubleshooting
make troubleshoot

# Fix database issues
make fix-databases

# Restart Authentik components
make restart-authentik

# Restart Unleash with fixes
make restart-unleash

# Verify all endpoints
make verify

# Check service health
make health

# View service status
make status

Automated Fixes

The deployment automation handles:

  • Database Initialization: Creates required databases (unleash, authentik)
  • Password Reset: Fixes Authentik database authentication issues
  • Service Ordering: Starts services in correct dependency order
  • Health Monitoring: Waits for services to be healthy before proceeding
  • Network Setup: Creates required Docker networks
  • Certificate Generation: Generates self-signed TLS certificates
  • Host Configuration: Sets up local domain resolution
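
The service-ordering step can be pictured as a topological sort over a dependency graph. The following is a minimal sketch only: the dependency map is hypothetical, and the actual ordering used by the Makefile and deployment scripts may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must start before it.
DEPENDENCIES = {
    "postgres": set(),
    "authentik": {"postgres"},
    "unleash": {"postgres"},
    "traefik": {"authentik"},
}


def start_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a start order in which every service follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = start_order(DEPENDENCIES)
```

TopologicalSorter raises CycleError on circular dependencies, which is a useful early failure for a deployment script.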

📋 Services Overview

Core Processing Pipeline

  1. svc-ingestion (Port 8001) - Document upload and storage
  2. svc-rpa (Port 8002) - Browser automation for portal data
  3. svc-ocr (Port 8003) - OCR and layout extraction
  4. svc-extract (Port 8004) - LLM-based field extraction
  5. svc-normalize-map (Port 8005) - Data normalization and KG mapping
  6. svc-kg (Port 8006) - Knowledge graph operations

AI & Reasoning

  1. svc-rag-indexer (Port 8007) - Vector database indexing
  2. svc-rag-retriever (Port 8008) - Hybrid search with KG fusion
  3. svc-reason (Port 8009) - Tax calculation engine
  4. svc-coverage (Port 8013) - Document coverage policy evaluation

Output & Integration

  1. svc-forms (Port 8010) - PDF form filling
  2. svc-hmrc (Port 8011) - HMRC submission service
  3. svc-firm-connectors (Port 8012) - Practice management integration

🔧 Development

Project Structure

ai-tax-agent/
├── libs/                   # Shared libraries
│   ├── config.py           # Configuration and factories
│   ├── security.py         # Authentication and encryption
│   ├── observability.py    # Tracing, metrics, logging
│   ├── events.py           # Event bus abstraction
│   ├── schemas.py          # Pydantic models
│   ├── storage.py          # MinIO/S3 operations
│   ├── neo.py              # Neo4j operations
│   ├── rag.py              # RAG and vector operations
│   ├── forms.py            # PDF form handling
│   ├── calibration.py      # ML confidence calibration
│   ├── policy.py           # Coverage policy loading and compilation
│   ├── coverage_models.py  # Coverage system data models
│   ├── coverage_eval.py    # Coverage evaluation engine
│   └── coverage_schema.json # JSON schema for policy validation
├── apps/                   # Microservices
│   ├── svc-ingestion/      # Document ingestion service
│   ├── svc-rpa/            # RPA automation service
│   ├── svc-ocr/            # OCR processing service
│   ├── svc-extract/        # Field extraction service
│   ├── svc-normalize-map/  # Normalization service
│   ├── svc-kg/             # Knowledge graph service
│   ├── svc-rag-indexer/    # RAG indexing service
│   ├── svc-rag-retriever/  # RAG retrieval service
│   ├── svc-reason/         # Tax reasoning service
│   ├── svc-coverage/       # Document coverage policy service
│   ├── svc-forms/          # Form filling service
│   ├── svc-hmrc/           # HMRC integration service
│   └── svc-firm-connectors/ # Firm integration service
├── infra/                  # Infrastructure
│   ├── compose/            # Docker Compose files
│   └── k8s/                # Kubernetes manifests
├── tests/                  # Test suites
│   ├── e2e/                # End-to-end tests
│   └── unit/               # Unit tests
├── config/                 # Configuration files
├── schemas/                # Data schemas
├── db/                     # Database schemas
└── docs/                   # Documentation

Running Tests

# Unit tests
make test-unit

# End-to-end tests
make test-e2e

# All tests
make test

Development Workflow

# Start development environment
make dev

# Watch logs for specific service
make logs SERVICE=svc-extract

# Restart specific service
make restart SERVICE=svc-extract

# Run linting and formatting
make lint
make format

# Generate API documentation
make docs

🔐 Security & Authentication

Edge Authentication

  • Traefik reverse proxy with TLS termination
  • Authentik SSO provider with OIDC/SAML support
  • ForwardAuth middleware for service authentication
  • Zero-trust architecture - services consume user context via headers

Data Protection

  • Vault Transit encryption for sensitive fields
  • PII Detection and de-identification before vector indexing
  • Tenant Isolation with row-level security
  • Audit Trails with bitemporal data modeling
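
As an illustration of de-identification before vector indexing, here is a minimal sketch using two regular expressions. The patterns, placeholder tokens, and function name are assumptions made for this example; the project's actual PII detector is not shown in this README and is likely broader.

```python
import re

# Illustrative patterns only; a real detector would cover many more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Simplified UK National Insurance number shape: 2 letters, 6 digits, 1 letter A-D.
NINO_RE = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")


def deidentify(text: str) -> str:
    """Replace matched PII with typed placeholders before the text is indexed."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return NINO_RE.sub("[NINO]", text)


redacted = deidentify("Contact jane@example.com, NINO QQ 12 34 56 C.")
```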

Network Security

  • Internal Networks for service communication
  • TLS Everywhere with automatic certificate management
  • Rate Limiting and DDoS protection
  • Security Headers and CORS policies

📊 Observability

Metrics & Monitoring

  • Prometheus for metrics collection
  • Grafana for visualization and alerting
  • Custom Business Metrics for document processing, RAG, calculations
  • SLI/SLO Monitoring with error budgets

Tracing & Logging

  • OpenTelemetry distributed tracing
  • Jaeger trace visualization
  • Structured Logging with correlation IDs
  • Log Aggregation with ELK stack (optional)
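
Structured logging with correlation IDs can be sketched with the standard library alone. The logger name, field names, and the "req-123" value below are assumptions for illustration and do not reflect the contents of libs/observability.py.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # correlation_id is attached through the `extra` argument below.
            "correlation_id": getattr(record, "correlation_id", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("svc-extract-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("extraction started", extra={"correlation_id": "req-123"})
entry = json.loads(stream.getvalue())
```

Carrying the same correlation ID through every service that handles a request lets log lines be joined with the matching trace.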

Health Checks

# Check all service health
make health

# Individual service health
curl http://localhost:8001/health

🗃️ Data Architecture

Knowledge Graph (Neo4j)

  • Bitemporal Modeling with valid_time and system_time
  • SHACL Validation for data integrity
  • Tenant Isolation with security constraints
  • Audit Trails for all changes
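
To make the bitemporal idea concrete, the sketch below keeps both time axes on each fact and answers "what did we believe on date X about date Y?". It is plain Python rather than Neo4j, and the class, fields other than valid_time and system_time, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    value: int
    valid_time: date   # when the fact became true in the real world
    system_time: date  # when the system recorded it


def as_of(facts: list[Fact], valid: date, known: date) -> Optional[Fact]:
    """Return the fact in force at `valid`, using only records known by `known`."""
    candidates = [
        f for f in facts if f.valid_time <= valid and f.system_time <= known
    ]
    # Prefer the most recent validity start, then the latest correction.
    return max(candidates, key=lambda f: (f.valid_time, f.system_time), default=None)


history = [
    Fact(30000, date(2024, 4, 6), date(2024, 5, 1)),  # original entry
    Fact(32000, date(2024, 4, 6), date(2024, 9, 1)),  # later correction
]
before = as_of(history, date(2024, 6, 1), known=date(2024, 6, 1))
after = as_of(history, date(2024, 6, 1), known=date(2024, 10, 1))
```

Because the correction is appended rather than overwriting the original, both answers remain reproducible, which is what makes the model suitable for audit trails.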

Vector Database (Qdrant)

  • PII-Free Indexing with de-identification
  • Hybrid Search combining dense and sparse vectors
  • Collection Management per tenant and data type
  • Confidence Calibration for search results
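
This README does not say how dense and sparse results are combined. One common choice is reciprocal rank fusion (RRF), used here purely as an illustration; the document IDs and the constant k=60 are assumptions, not project settings.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["doc-a", "doc-b", "doc-c"]   # hypothetical dense-vector ranking
sparse = ["doc-b", "doc-d", "doc-a"]  # hypothetical sparse-vector ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF uses only ranks, so it avoids having to normalise dense and sparse scores onto a shared scale.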

Event Streaming (Kafka) - (TBD)

  • Event-Driven Architecture with standardized topics
  • Exactly-Once Processing with idempotency
  • Dead Letter Queues for error handling
  • Schema Registry for event validation
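
Since the event streaming layer is marked TBD, the following is only a sketch of how idempotency can give an exactly-once effect on top of at-least-once delivery. The event_id field and the in-memory set are assumptions; a real consumer would need a durable store.

```python
from typing import Callable


class IdempotentConsumer:
    """Skip events whose ID has already been processed."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # in-memory stand-in for a durable store

    def consume(self, event: dict) -> bool:
        event_id = event["event_id"]  # hypothetical field name
        if event_id in self._seen:
            return False              # duplicate delivery, ignored
        self._handler(event)
        self._seen.add(event_id)      # recorded only after the handler succeeds
        return True


processed: list[str] = []
consumer = IdempotentConsumer(lambda e: processed.append(e["topic"]))
event = {"event_id": "e1", "topic": "DOC_INGESTED"}
results = [consumer.consume(event), consumer.consume(event)]
```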

🧮 Tax Calculation Engine

Supported Forms

  • SA100 - Main Self Assessment return
  • SA103 - Self-employment income
  • SA105 - Property income
  • SA106 - Foreign income

Calculation Features

  • Rules Engine with configurable tax rules
  • Evidence Trails linking calculations to source documents
  • Confidence Scoring with calibration
  • Multi-Year Support with basis period reform

HMRC Integration

  • MTD API integration for submissions
  • OAuth 2.0 authentication flow
  • Dry Run mode for testing
  • Validation against HMRC business rules

🔌 Integrations

Practice Management Systems

  • IRIS Practice Management
  • Sage Practice Management
  • Xero accounting software
  • QuickBooks accounting software
  • FreeAgent accounting software
  • KashFlow accounting software

Document Sources

  • Direct Upload via web interface
  • Email Integration with attachment processing
  • Portal Scraping via RPA automation
  • API Integration with accounting systems

🚀 Deployment

Local Development

make up      # Start all services
make down    # Stop all services
make clean   # Clean up volumes and networks

Production Deployment

For detailed instructions, see infra/compose/README.md.

The system uses a unified deployment script for production environments:

# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all

Ensure you have configured infra/environments/production/.env with the correct secrets and domain settings before deploying.

Environment Configuration

Key environment variables:

# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333

# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092

# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
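
A service can fail fast when required variables are missing. The loader below is a hedged sketch, not the contents of libs/config.py; the Settings class, the function name, and the choice of which variables are required are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("DATABASE_URL", "NEO4J_URI", "QDRANT_URL", "OPENAI_API_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    neo4j_uri: str
    qdrant_url: str
    openai_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, reporting all missing names at once."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        neo4j_uri=env["NEO4J_URI"],
        qdrant_url=env["QDRANT_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
    )


settings = load_settings({
    "DATABASE_URL": "postgresql+asyncpg://user:pass@host:5432/db",
    "NEO4J_URI": "bolt://neo4j:7687",
    "QDRANT_URL": "http://qdrant:6333",
    "OPENAI_API_KEY": "sk-placeholder",
})
```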

📚 API Documentation

Authentication

All API endpoints require authentication via Authentik ForwardAuth:

curl -H "X-Forwarded-User: user@example.com" \
     -H "X-Forwarded-Groups: tax_agents" \
     -H "X-Tenant-ID: tenant-123" \
     https://api.localhost/api/ingestion/health
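
On the service side, the forwarded headers can be turned into a small user-context object. This is a sketch under assumptions: the UserContext class and function name are hypothetical, and a comma is assumed as the group separator, which the README does not specify.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple[str, ...]
    tenant_id: str


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Extract the identity that the edge proxy forwarded with the request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        user = lowered["x-forwarded-user"]
        tenant_id = lowered["x-tenant-id"]
    except KeyError as exc:
        raise PermissionError(f"missing auth header: {exc.args[0]}") from exc
    raw_groups = lowered.get("x-forwarded-groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant_id)


ctx = user_context_from_headers({
    "X-Forwarded-User": "user@example.com",
    "X-Forwarded-Groups": "tax_agents",
    "X-Tenant-ID": "tenant-123",
})
```

These headers are only trustworthy when requests can reach the service exclusively through the proxy; a client that can connect directly could set them itself.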

Key Endpoints

  • POST /api/ingestion/upload - Upload documents
  • GET /api/extract/status/{doc_id} - Check extraction status
  • POST /api/rag-retriever/search - Search knowledge base
  • POST /api/reason/compute - Trigger tax calculations
  • POST /api/forms/fill/{form_id} - Fill PDF forms
  • POST /api/hmrc/submit - Submit to HMRC

Event Topics

  • DOC_INGESTED - Document uploaded
  • DOC_OCR_READY - OCR completed
  • DOC_EXTRACTED - Fields extracted
  • KG_UPSERTED - Knowledge graph updated
  • RAG_INDEXED - Vector indexing completed
  • CALC_SCHEDULE_READY - Tax calculation completed
  • FORM_FILLED - PDF form filled
  • HMRC_SUBMITTED - HMRC submission completed
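
The topic names above can be pinned down in an enum so that typos fail early. The envelope fields (event_id, occurred_at, payload) in this sketch are assumptions and are not taken from libs/events.py.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Topic(str, Enum):
    DOC_INGESTED = "DOC_INGESTED"
    DOC_OCR_READY = "DOC_OCR_READY"
    DOC_EXTRACTED = "DOC_EXTRACTED"
    KG_UPSERTED = "KG_UPSERTED"
    RAG_INDEXED = "RAG_INDEXED"
    CALC_SCHEDULE_READY = "CALC_SCHEDULE_READY"
    FORM_FILLED = "FORM_FILLED"
    HMRC_SUBMITTED = "HMRC_SUBMITTED"


@dataclass(frozen=True)
class Event:
    topic: Topic
    payload: dict
    # Hypothetical envelope fields, generated per event.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def make_event(topic_name: str, payload: dict) -> Event:
    """Build an event; Topic(...) raises ValueError for an unknown topic name."""
    return Event(topic=Topic(topic_name), payload=payload)


event = make_event("DOC_INGESTED", {"doc_id": "doc-1"})
```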

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run the test suite
  6. Submit a pull request

Code Standards

  • Python: Black formatting, isort imports, mypy type checking
  • Documentation: Docstrings for all public functions
  • Testing: Minimum 80% code coverage
  • Security: No secrets in code, use Vault for sensitive data

📋 Coverage Policy System

The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.

Policy Configuration

Coverage policies are defined in config/coverage.yaml with support for jurisdiction and tenant-specific overlays:

# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"

defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30

triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"

schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]

API Usage

Check Document Coverage

curl -X POST https://api.localhost/coverage/v1/check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK"
  }'

Response:

{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
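
A client can turn the blocking_items array into a short to-do list for the user. The helper name and output wording below are assumptions; the sample is a trimmed copy of the response shown above.

```python
import json


def summarize_blockers(response: dict) -> list[str]:
    """Produce one human-readable line per blocking item."""
    lines = []
    for item in response.get("blocking_items", []):
        alternatives = ", ".join(item.get("acceptable_alternatives", [])) or "none"
        lines.append(
            f'{item["schedule_id"]}: missing {item["evidence_id"]} '
            f"(alternatives: {alternatives})"
        )
    return lines


sample = json.loads("""
{
  "overall_status": "INCOMPLETE",
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
""")
summary = summarize_blockers(sample)
```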

Generate Clarifying Questions

curl -X POST https://api.localhost/coverage/v1/clarify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK",
    "schedule_id": "SA102",
    "evidence_id": "P60"
  }'

Policy Hot Reload

Policies can be reloaded without restarting the service:

curl -X POST https://api.localhost/coverage/admin/reload \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Predicate Language

The policy system supports a domain-specific language for conditions:

  • exists(Entity[filters]) - Check if entities exist with filters
  • property_name - Check boolean properties
  • taxpayer_flag:flag_name - Check taxpayer flags
  • filing_mode:mode - Check filing mode
  • computed_condition - Check computed values
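
To show how an exists(...) predicate might be evaluated, here is a toy evaluator for the single-filter form used in the triggers above. It is a sketch only: the regular expression, function names, and in-memory entity store are assumptions, and the real engine in libs/coverage_eval.py presumably handles the full predicate language.

```python
import re

# Matches only the form exists(Entity[key='value']).
EXISTS_RE = re.compile(r"^exists\((\w+)\[(\w+)='([^']*)'\]\)$")


def evaluate_exists(predicate: str, entities: dict[str, list[dict]]) -> bool:
    """Return True if any entity of the named type has the given property value."""
    match = EXISTS_RE.match(predicate)
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate}")
    entity, key, expected = match.groups()
    return any(item.get(key) == expected for item in entities.get(entity, []))


def schedule_triggered(any_of: list[str], entities: dict[str, list[dict]]) -> bool:
    """A schedule is required when at least one of its trigger predicates holds."""
    return any(evaluate_exists(p, entities) for p in any_of)


# Hypothetical taxpayer data standing in for knowledge graph query results.
taxpayer = {"IncomeItem": [{"type": "Employment"}]}
needs_sa102 = schedule_triggered(["exists(IncomeItem[type='Employment'])"], taxpayer)
needs_sa105 = schedule_triggered(["exists(IncomeItem[type='UKPropertyRent'])"], taxpayer)
```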

Status Classification

Evidence is classified into four statuses:

  • PRESENT_VERIFIED: High-confidence OCR and extraction, with the document date within the tax year
  • PRESENT_UNVERIFIED: Medium confidence, may need manual review
  • CONFLICTING: Multiple documents with conflicting information
  • MISSING: No evidence found or confidence too low
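
The classification for a single document can be sketched as below, using the ocr and extract thresholds and the tax year boundary from the sample policy. The 0.50 review floor is an assumption (the README does not define "medium confidence"), and CONFLICTING is left out because it requires comparing several documents.

```python
from datetime import date

OCR_THRESHOLD = 0.82       # defaults.confidence_thresholds.ocr
EXTRACT_THRESHOLD = 0.85   # defaults.confidence_thresholds.extract
REVIEW_FLOOR = 0.50        # assumption: below this, evidence counts as missing
TAX_YEAR = (date(2024, 4, 6), date(2025, 4, 5))


def classify(ocr_conf: float, extract_conf: float, doc_date: date) -> str:
    """Classify one candidate document for a piece of required evidence."""
    if min(ocr_conf, extract_conf) < REVIEW_FLOOR:
        return "MISSING"
    in_tax_year = TAX_YEAR[0] <= doc_date <= TAX_YEAR[1]
    if ocr_conf >= OCR_THRESHOLD and extract_conf >= EXTRACT_THRESHOLD and in_tax_year:
        return "PRESENT_VERIFIED"
    return "PRESENT_UNVERIFIED"


verified = classify(0.90, 0.90, date(2024, 6, 1))
unverified = classify(0.90, 0.80, date(2024, 6, 1))
missing = classify(0.30, 0.90, date(2024, 6, 1))
```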

Testing

Run coverage policy tests:

# Unit tests
pytest tests/unit/coverage/ -v

# Integration tests
pytest tests/integration/coverage/ -v

# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v

# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Documentation: See /docs directory
  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Security: security@example.com

🗺️ Roadmap

  • Advanced ML models for extraction
  • Multi-jurisdiction support (EU, US)
  • Real-time collaboration features
  • Mobile application
  • Advanced analytics dashboard
  • Blockchain audit trails