# AI Tax Agent - Production Microservices Suite

A production-grade, AI-powered tax agent system for UK Self Assessment, built on a microservices architecture with a knowledge graph, retrieval-augmented generation (RAG), and HMRC integration.

## 🏗️ Architecture Overview

This system implements a complete end-to-end tax processing pipeline with:

- **13 Microservices** for document processing, extraction, reasoning, and submission
- **Knowledge Graph** (Neo4j) with bitemporal modeling for audit trails
- **Vector Database** (Qdrant) for RAG with PII protection
- **Edge Authentication** via Traefik + Authentik SSO
- **Event-Driven Architecture** with Kafka messaging (TBD, see Data Architecture)
- **Comprehensive Observability** with OpenTelemetry, Prometheus, and Grafana

## 🚀 Quick Start

### Prerequisites

- Docker and Docker Compose
- Python 3.12+
- Node.js 18+ (for UI components)
- 16GB+ RAM recommended
- OpenAI API key (for LLM extraction)

### 1. Clone and Setup

```bash
git clone <repository-url>
cd ai-tax-agent

# Bootstrap the development environment
make bootstrap

# Edit .env with your configuration
# Minimum required: OPENAI_API_KEY
```

### 2. Start Infrastructure (Automated)

```bash
# Start all services with automated fixes
make run

# Alternative: Start without fixes (original behavior)
make run-simple

# Or deploy infrastructure only
make deploy-infra
```

### 3. Complete Authentik Setup

After deployment, complete the SSO setup:

1. Visit https://auth.local.lan/if/flow/initial-setup/
2. Create the initial admin user
3. Configure applications for protected services

```bash
# Run setup helper (optional)
make setup-authentik
```

### 4. Access Services

- **Traefik Dashboard**: http://localhost:8080
- **Authentik SSO**: https://auth.local.lan
- **Grafana**: https://grafana.local.lan
- **Review UI**: https://review.local.lan (requires Authentik setup)
- **API Gateway**: https://api.local.lan

## 🤖 Automation & Scripts

The system includes comprehensive automation for deployment and troubleshooting:

### Core Commands

```bash
# Complete automated deployment with fixes
make run

# Bootstrap environment
make bootstrap

# Deploy infrastructure only
make deploy-infra

# Deploy application services only
make deploy-services
```

### Troubleshooting & Maintenance

```bash
# Run comprehensive troubleshooting
make troubleshoot

# Fix database issues
make fix-databases

# Restart Authentik components
make restart-authentik

# Restart Unleash with fixes
make restart-unleash

# Verify all endpoints
make verify

# Check service health
make health

# View service status
make status
```

### Automated Fixes

The deployment automation handles:

- **Database Initialization**: Creates required databases (unleash, authentik)
- **Password Reset**: Fixes Authentik database authentication issues
- **Service Ordering**: Starts services in correct dependency order
- **Health Monitoring**: Waits for services to be healthy before proceeding
- **Network Setup**: Creates required Docker networks
- **Certificate Generation**: Generates self-signed TLS certificates
- **Host Configuration**: Sets up local domain resolution

## 📋 Services Overview

### Core Processing Pipeline

1. **svc-ingestion** (Port 8001) - Document upload and storage
2. **svc-rpa** (Port 8002) - Browser automation for portal data
3. **svc-ocr** (Port 8003) - OCR and layout extraction
4. **svc-extract** (Port 8004) - LLM-based field extraction
5. **svc-normalize-map** (Port 8005) - Data normalization and KG mapping
6. **svc-kg** (Port 8006) - Knowledge graph operations

### AI & Reasoning

7. **svc-rag-indexer** (Port 8007) - Vector database indexing
8. **svc-rag-retriever** (Port 8008) - Hybrid search with KG fusion
9. **svc-reason** (Port 8009) - Tax calculation engine
10. **svc-coverage** (Port 8013) - Document coverage policy evaluation

### Output & Integration

11. **svc-forms** (Port 8010) - PDF form filling
12. **svc-hmrc** (Port 8011) - HMRC submission service
13. **svc-firm-connectors** (Port 8012) - Practice management integration

## 🔧 Development

### Project Structure

```
ai-tax-agent/
├── libs/                   # Shared libraries
│   ├── config.py           # Configuration and factories
│   ├── security.py         # Authentication and encryption
│   ├── observability.py    # Tracing, metrics, logging
│   ├── events.py           # Event bus abstraction
│   ├── schemas.py          # Pydantic models
│   ├── storage.py          # MinIO/S3 operations
│   ├── neo.py              # Neo4j operations
│   ├── rag.py              # RAG and vector operations
│   ├── forms.py            # PDF form handling
│   ├── calibration.py      # ML confidence calibration
│   ├── policy.py           # Coverage policy loading and compilation
│   ├── coverage_models.py  # Coverage system data models
│   ├── coverage_eval.py    # Coverage evaluation engine
│   └── coverage_schema.json # JSON schema for policy validation
├── apps/                   # Microservices
│   ├── svc-ingestion/      # Document ingestion service
│   ├── svc-rpa/            # RPA automation service
│   ├── svc-ocr/            # OCR processing service
│   ├── svc-extract/        # Field extraction service
│   ├── svc-normalize-map/  # Normalization service
│   ├── svc-kg/             # Knowledge graph service
│   ├── svc-rag-indexer/    # RAG indexing service
│   ├── svc-rag-retriever/  # RAG retrieval service
│   ├── svc-reason/         # Tax reasoning service
│   ├── svc-coverage/       # Document coverage policy service
│   ├── svc-forms/          # Form filling service
│   ├── svc-hmrc/           # HMRC integration service
│   └── svc-firm-connectors/ # Firm integration service
├── infra/                  # Infrastructure
│   ├── compose/            # Docker Compose files
│   └── k8s/                # Kubernetes manifests
├── tests/                  # Test suites
│   ├── e2e/                # End-to-end tests
│   └── unit/               # Unit tests
├── config/                 # Configuration files
├── schemas/                # Data schemas
├── db/                     # Database schemas
└── docs/                   # Documentation
```

### Running Tests

```bash
# Unit tests
make test-unit

# End-to-end tests
make test-e2e

# All tests
make test
```

### Development Workflow

```bash
# Start development environment
make dev

# Watch logs for specific service
make logs SERVICE=svc-extract

# Restart specific service
make restart SERVICE=svc-extract

# Run linting and formatting
make lint
make format

# Generate API documentation
make docs
```

## 🔐 Security & Authentication

### Edge Authentication

- **Traefik** reverse proxy with SSL termination
- **Authentik** SSO provider with OIDC/SAML support
- **ForwardAuth** middleware for service authentication
- **Zero-trust** architecture - services consume user context via headers

### Data Protection

- **Vault Transit** encryption for sensitive fields
- **PII Detection** and de-identification before vector indexing
- **Tenant Isolation** with row-level security
- **Audit Trails** with bitemporal data modeling
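
The PII de-identification step can be illustrated with a minimal sketch. It assumes regex-based detection of UK National Insurance numbers and email addresses, replaced by salted hash tokens; the patterns, the `deidentify` function, and the token format are hypothetical and are not the project's actual detector.

```python
import hashlib
import re

# Hypothetical patterns for illustration; a real detector needs far broader coverage.
NINO_RE = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def _token(kind: str, value: str, salt: str) -> str:
    """Build a stable placeholder so the same value always maps to the same token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
    return f"[{kind}:{digest}]"


def deidentify(text: str, salt: str = "demo-salt") -> str:
    """Replace detected identifiers with tokens before the text is embedded."""
    text = NINO_RE.sub(
        lambda m: _token("NINO", m.group(0).replace(" ", ""), salt), text
    )
    text = EMAIL_RE.sub(lambda m: _token("EMAIL", m.group(0).lower(), salt), text)
    return text


clean = deidentify("NINO QQ123456C, contact jane.doe@example.com")
```

Because the tokens are derived from a salted hash, repeated mentions of the same identifier stay linkable inside the index without exposing the raw value.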

### Network Security

- **Internal Networks** for service communication
- **TLS Everywhere** with automatic certificate management
- **Rate Limiting** and DDoS protection
- **Security Headers** and CORS policies

## 📊 Observability

### Metrics & Monitoring

- **Prometheus** for metrics collection
- **Grafana** for visualization and alerting
- **Custom Business Metrics** for document processing, RAG, calculations
- **SLI/SLO Monitoring** with error budgets
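
As a worked example of the error-budget arithmetic behind SLO monitoring, the sketch below computes how much of a budget has been consumed. The `error_budget` function and the sample figures are illustrative assumptions, not values taken from the project's dashboards.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return allowed failures and the fraction of the error budget consumed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


# A 99.9% success SLO over 1,000,000 requests tolerates about 1,000 failures,
# so 400 failures consume roughly 40% of the budget.
report = error_budget(0.999, 1_000_000, 400)
```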

### Tracing & Logging

- **OpenTelemetry** distributed tracing
- **Jaeger** trace visualization
- **Structured Logging** with correlation IDs
- **Log Aggregation** with ELK stack (optional)
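
Structured logging with correlation IDs can be sketched with the standard library alone. This is an assumption-laden illustration: the `correlation_id` context variable, `JsonFormatter`, and `build_logger` are hypothetical names and do not describe `libs/observability.py`.

```python
import contextvars
import io
import json
import logging

# Holds the ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": correlation_id.get(),
            }
        )


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: set the ID once per request; every log line then includes it.
buffer = io.StringIO()
log = build_logger("svc-extract", stream=buffer)
correlation_id.set("req-123")
log.info("extraction started")
entry = json.loads(buffer.getvalue())
```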

### Health Checks

```bash
# Check all service health
make health

# Individual service health
curl http://localhost:8001/health
```
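
For scripted checks, the loop below probes each service in turn. It is a sketch that assumes every service exposes `/health` on the port listed in the Services Overview (the README only shows this for port 8001); `check_all` and `http_probe` are hypothetical helpers and are not what `make health` runs.

```python
import http.client
import urllib.request
from typing import Callable

# Ports taken from the Services Overview section.
SERVICE_PORTS = {
    "svc-ingestion": 8001,
    "svc-rpa": 8002,
    "svc-ocr": 8003,
    "svc-extract": 8004,
    "svc-normalize-map": 8005,
    "svc-kg": 8006,
    "svc-rag-indexer": 8007,
    "svc-rag-retriever": 8008,
    "svc-reason": 8009,
    "svc-forms": 8010,
    "svc-hmrc": 8011,
    "svc-firm-connectors": 8012,
    "svc-coverage": 8013,
}


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False


def check_all(
    probe: Callable[[str], bool] = http_probe, host: str = "localhost"
) -> dict[str, bool]:
    """Probe every service; the probe is injectable so it can be faked in tests."""
    return {
        name: probe(f"http://{host}:{port}/health")
        for name, port in SERVICE_PORTS.items()
    }
```

Calling `check_all()` with the default probe returns a name-to-boolean map that can be printed or turned into an exit code.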

## 🗃️ Data Architecture

### Knowledge Graph (Neo4j)

- **Bitemporal Modeling** with valid_time and system_time
- **SHACL Validation** for data integrity
- **Tenant Isolation** with security constraints
- **Audit Trails** for all changes
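
To make the bitemporal idea concrete, here is a minimal in-memory sketch. It assumes each fact version carries a validity interval (`valid_time`) and a recording interval (`system_time`); the `FactVersion` class and `as_of` function are hypothetical and say nothing about how the Neo4j schema is actually laid out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    value: float
    valid_from: datetime           # valid_time: when the fact holds in the real world
    valid_to: Optional[datetime]
    system_from: datetime          # system_time: when this version was recorded
    system_to: Optional[datetime]  # None means "still the current belief"


def _covers(start: datetime, end: Optional[datetime], at: datetime) -> bool:
    return start <= at and (end is None or at < end)


def as_of(versions, valid_at: datetime, system_at: datetime):
    """Return what the system believed at system_at about the state at valid_at."""
    for version in versions:
        if _covers(version.valid_from, version.valid_to, valid_at) and _covers(
            version.system_from, version.system_to, system_at
        ):
            return version
    return None


# A salary figure recorded on 2025-05-01 and corrected on 2025-06-10.
year_start, year_end = datetime(2024, 4, 6), datetime(2025, 4, 6)
history = [
    FactVersion(30000.0, year_start, year_end, datetime(2025, 5, 1), datetime(2025, 6, 10)),
    FactVersion(31000.0, year_start, year_end, datetime(2025, 6, 10), None),
]
before = as_of(history, datetime(2024, 12, 1), datetime(2025, 5, 15))
after = as_of(history, datetime(2024, 12, 1), datetime(2025, 7, 1))
```

The superseded version is closed rather than deleted, so an auditor can still reproduce the figure that was believed before the correction.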

### Vector Database (Qdrant)

- **PII-Free Indexing** with de-identification
- **Hybrid Search** combining dense and sparse vectors
- **Collection Management** per tenant and data type
- **Confidence Calibration** for search results
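
One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). The sketch below shows that technique under the assumption that each search returns an ordered list of document IDs; the README does not state which fusion method the retriever actually uses, and `reciprocal_rank_fusion` is a hypothetical helper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from an embedding search
sparse_hits = ["b", "d", "a"]  # e.g. from a keyword search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only needs ranks, not comparable scores, which is why it is a popular default when dense and sparse scores live on different scales.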

### Event Streaming (Kafka) - (TBD)

- **Event-Driven Architecture** with standardized topics
- **Exactly-Once Processing** with idempotency
- **Dead Letter Queues** for error handling
- **Schema Registry** for event validation
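
In practice, exactly-once processing is usually approximated by at-least-once delivery combined with idempotent handling. The sketch below illustrates that pattern together with a dead letter queue; it assumes every event carries a unique `event_id`, and the `IdempotentConsumer` class is hypothetical rather than the project's event bus.

```python
from typing import Callable


class IdempotentConsumer:
    """Skip already-processed events and park repeatedly failing ones."""

    def __init__(self, handler: Callable[[dict], None], max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.processed_ids: set[str] = set()  # a durable store in a real deployment
        self.dead_letters: list[dict] = []

    def consume(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return "duplicate"  # redelivery: side effects must not run twice
        last_error = ""
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                last_error = repr(exc)
                continue
            self.processed_ids.add(event_id)
            return "processed"
        self.dead_letters.append({"event": event, "error": last_error})
        return "dead_lettered"
```

Recording the ID only after the handler succeeds means a crash mid-handler leads to a retry instead of a lost event.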

## 🧮 Tax Calculation Engine

### Supported Forms

- **SA100** - Main Self Assessment return
- **SA103** - Self-employment income
- **SA105** - Property income
- **SA106** - Foreign income

### Calculation Features

- **Rules Engine** with configurable tax rules
- **Evidence Trails** linking calculations to source documents
- **Confidence Scoring** with calibration
- **Multi-Year Support** with basis period reform

### HMRC Integration

- **MTD API** integration for submissions
- **OAuth 2.0** authentication flow
- **Dry Run** mode for testing
- **Validation** against HMRC business rules

## 🔌 Integrations

### Practice Management Systems

- **IRIS** Practice Management
- **Sage** Practice Management
- **Xero** accounting software
- **QuickBooks** accounting software
- **FreeAgent** accounting software
- **KashFlow** accounting software

### Document Sources

- **Direct Upload** via web interface
- **Email Integration** with attachment processing
- **Portal Scraping** via RPA automation
- **API Integration** with accounting systems

## 🚀 Deployment

### Local Development

```bash
make up      # Start all services
make down    # Stop all services
make clean   # Clean up volumes and networks
```

### Production Deployment

For detailed instructions, see [infra/compose/README.md](infra/compose/README.md).

The system uses a unified deployment script for production environments:

```bash
# Deploy to production (Infrastructure + Services + Monitoring)
./infra/scripts/deploy.sh production all
```

Ensure you have configured `infra/environments/production/.env` with the correct secrets and domain settings before deploying.

### Environment Configuration

Key environment variables:

```bash
# Database connections
DATABASE_URL=postgresql+asyncpg://user:pass@host:5432/db
NEO4J_URI=bolt://neo4j:7687
QDRANT_URL=http://qdrant:6333

# External services
OPENAI_API_KEY=sk-...
VAULT_ADDR=http://vault:8200
KAFKA_BOOTSTRAP_SERVERS=kafka:9092

# Security
AUTHENTIK_SECRET_KEY=your-secret-key
VAULT_ROLE_ID=your-role-id
VAULT_SECRET_ID=your-secret-id
```
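
A service could read these variables along the following lines. This is a standard-library sketch, not the contents of `libs/config.py`: the `Settings` class and `load_settings` function are hypothetical, and treating the example values above as fallbacks is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Fallbacks copied from the example values above; using them as defaults is an assumption.
_DEFAULTS = {
    "NEO4J_URI": "bolt://neo4j:7687",
    "QDRANT_URL": "http://qdrant:6333",
    "VAULT_ADDR": "http://vault:8200",
    "KAFKA_BOOTSTRAP_SERVERS": "kafka:9092",
}


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    neo4j_uri: str
    qdrant_url: str
    vault_addr: str
    kafka_bootstrap_servers: str
    database_url: Optional[str] = None


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from the environment, failing fast when the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    values = {name: env.get(name, default) for name, default in _DEFAULTS.items()}
    return Settings(
        openai_api_key=api_key,
        neo4j_uri=values["NEO4J_URI"],
        qdrant_url=values["QDRANT_URL"],
        vault_addr=values["VAULT_ADDR"],
        kafka_bootstrap_servers=values["KAFKA_BOOTSTRAP_SERVERS"],
        database_url=env.get("DATABASE_URL"),
    )
```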

## 📚 API Documentation

### Authentication

All API endpoints require authentication via Authentik ForwardAuth, which passes the user context to services in request headers:

```bash
curl -H "X-Forwarded-User: user@example.com" \
     -H "X-Forwarded-Groups: tax_agents" \
     -H "X-Tenant-ID: tenant-123" \
     https://api.localhost/api/ingestion/health
```
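
On the service side, the forwarded headers can be turned into a typed user context. The sketch below assumes the three header names from the example above and a comma-separated group list; `UserContext`, `MissingIdentityError`, and `user_context_from_headers` are hypothetical names, not the API of `libs/security.py`.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UserContext:
    user: str
    groups: tuple[str, ...]
    tenant_id: str


class MissingIdentityError(Exception):
    """Raised when the proxy did not supply the expected identity headers."""


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    user = normalized.get("x-forwarded-user", "")
    tenant_id = normalized.get("x-tenant-id", "")
    if not user or not tenant_id:
        raise MissingIdentityError("X-Forwarded-User and X-Tenant-ID are required")
    # Assumption: groups arrive as a comma-separated list.
    raw_groups = normalized.get("x-forwarded-groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return UserContext(user=user, groups=groups, tenant_id=tenant_id)
```

Header-based identity is only trustworthy when the service is reachable solely through the proxy, since the proxy is what sets these values.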

### Key Endpoints

- `POST /api/ingestion/upload` - Upload documents
- `GET /api/extract/status/{doc_id}` - Check extraction status
- `POST /api/rag-retriever/search` - Search knowledge base
- `POST /api/reason/compute` - Trigger tax calculations
- `POST /api/forms/fill/{form_id}` - Fill PDF forms
- `POST /api/hmrc/submit` - Submit to HMRC

### Event Topics

- `DOC_INGESTED` - Document uploaded
- `DOC_OCR_READY` - OCR completed
- `DOC_EXTRACTED` - Fields extracted
- `KG_UPSERTED` - Knowledge graph updated
- `RAG_INDEXED` - Vector indexing completed
- `CALC_SCHEDULE_READY` - Tax calculation completed
- `FORM_FILLED` - PDF form filled
- `HMRC_SUBMITTED` - HMRC submission completed
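
The topics above could be carried in a small typed envelope, sketched here. The topic names come from the list; the `EventEnvelope` fields (`event_id`, `tenant_id`, `occurred_at`, `payload`) and the JSON encoding are assumptions and may differ from `libs/events.py`.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Topic(str, Enum):
    DOC_INGESTED = "DOC_INGESTED"
    DOC_OCR_READY = "DOC_OCR_READY"
    DOC_EXTRACTED = "DOC_EXTRACTED"
    KG_UPSERTED = "KG_UPSERTED"
    RAG_INDEXED = "RAG_INDEXED"
    CALC_SCHEDULE_READY = "CALC_SCHEDULE_READY"
    FORM_FILLED = "FORM_FILLED"
    HMRC_SUBMITTED = "HMRC_SUBMITTED"


@dataclass
class EventEnvelope:
    topic: Topic
    tenant_id: str
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        data = asdict(self)
        data["topic"] = self.topic.value
        return json.dumps(data, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "EventEnvelope":
        data = json.loads(raw)
        data["topic"] = Topic(data["topic"])  # raises ValueError for unknown topics
        return cls(**data)


event = EventEnvelope(Topic.DOC_INGESTED, "tenant-123", {"doc_id": "doc-1"})
restored = EventEnvelope.from_json(event.to_json())
```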

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request

### Code Standards

- **Python**: Black formatting, isort imports, mypy type checking
- **Documentation**: Docstrings for all public functions
- **Testing**: Minimum 80% code coverage
- **Security**: No secrets in code, use Vault for sensitive data

## 📋 Coverage Policy System

The coverage policy system ensures that all required tax documents are present and verified before computation. It uses a declarative YAML-based policy language with conditional logic.

### Policy Configuration

Coverage policies are defined in `config/coverage.yaml`, with support for jurisdiction-specific and tenant-specific overlays:

```yaml
# config/coverage.yaml
version: "1.0"
jurisdiction: "UK"
tax_year: "2024-25"
tax_year_boundary:
  start: "2024-04-06"
  end: "2025-04-05"

defaults:
  confidence_thresholds:
    ocr: 0.82
    extract: 0.85
  date_tolerance_days: 30

triggers:
  SA102: # Employment schedule
    any_of:
      - "exists(IncomeItem[type='Employment'])"
  SA105: # Property schedule
    any_of:
      - "exists(IncomeItem[type='UKPropertyRent'])"

schedules:
  SA102:
    evidence:
      - id: "P60"
        role: "REQUIRED"
        boxes: ["SA102_b1", "SA102_b2"]
        acceptable_alternatives: ["P45", "FinalPayslipYTD"]
      - id: "P11D"
        role: "CONDITIONALLY_REQUIRED"
        condition: "exists(BenefitInKind=true)"
        boxes: ["SA102_b9"]
```
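
To show how `role` and `acceptable_alternatives` might interact, the sketch below works on a dictionary that mirrors the `SA102` block above. It assumes that any acceptable alternative satisfies a `REQUIRED` item and it ignores conditional evidence entirely; `missing_required` is a hypothetical helper, not the evaluation engine in `libs/coverage_eval.py`.

```python
# Mirrors the SA102 schedule from the YAML example above.
SA102 = {
    "evidence": [
        {
            "id": "P60",
            "role": "REQUIRED",
            "boxes": ["SA102_b1", "SA102_b2"],
            "acceptable_alternatives": ["P45", "FinalPayslipYTD"],
        },
        {
            "id": "P11D",
            "role": "CONDITIONALLY_REQUIRED",
            "condition": "exists(BenefitInKind=true)",
            "boxes": ["SA102_b9"],
        },
    ]
}


def missing_required(schedule: dict, present_docs: set[str]) -> list[str]:
    """List REQUIRED evidence IDs that neither they nor an alternative cover."""
    missing = []
    for item in schedule["evidence"]:
        if item["role"] != "REQUIRED":
            continue  # conditional evidence needs predicate evaluation, skipped here
        accepted = {item["id"], *item.get("acceptable_alternatives", [])}
        if not accepted & present_docs:
            missing.append(item["id"])
    return missing
```

For example, `missing_required(SA102, {"P45"})` returns an empty list because `P45` is listed as an alternative to `P60`.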

### API Usage

#### Check Document Coverage

```bash
curl -X POST https://api.localhost/coverage/v1/check \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK"
  }'
```

Response:

```json
{
  "overall_status": "INCOMPLETE",
  "schedules_required": ["SA102"],
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "INCOMPLETE",
      "evidence": [
        {
          "id": "P60",
          "status": "MISSING",
          "role": "REQUIRED",
          "found": []
        }
      ]
    }
  ],
  "blocking_items": [
    {
      "schedule_id": "SA102",
      "evidence_id": "P60",
      "role": "REQUIRED",
      "reason": "P60 provides year-end pay and PAYE tax figures",
      "boxes": ["SA102_b1", "SA102_b2"],
      "acceptable_alternatives": ["P45", "FinalPayslipYTD"]
    }
  ]
}
```
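
A client might turn the `blocking_items` array into a short checklist for the user. The sketch below assumes the response shape shown above; `summarize_blockers` is a hypothetical helper, and the sample dictionary is a trimmed copy of that response.

```python
def summarize_blockers(response: dict) -> list[str]:
    """Render one line per blocking item, including any acceptable alternatives."""
    lines = []
    for item in response.get("blocking_items", []):
        alternatives = item.get("acceptable_alternatives") or []
        suffix = f" (or: {', '.join(alternatives)})" if alternatives else ""
        lines.append(
            f"{item['schedule_id']}: {item['evidence_id']} is {item['role']}{suffix}"
        )
    return lines


# Trimmed copy of the example response above.
sample = {
    "overall_status": "INCOMPLETE",
    "blocking_items": [
        {
            "schedule_id": "SA102",
            "evidence_id": "P60",
            "role": "REQUIRED",
            "acceptable_alternatives": ["P45", "FinalPayslipYTD"],
        }
    ],
}
checklist = summarize_blockers(sample)
```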

#### Generate Clarifying Questions

```bash
curl -X POST https://api.localhost/coverage/v1/clarify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "taxpayer_id": "T-001",
    "tax_year": "2024-25",
    "jurisdiction": "UK",
    "schedule_id": "SA102",
    "evidence_id": "P60"
  }'
```

### Policy Hot Reload

Policies can be reloaded without service restart:

```bash
curl -X POST https://api.localhost/coverage/admin/reload \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Predicate Language

The policy system supports a domain-specific language for conditions:

- `exists(Entity[filters])` - Check if entities exist with filters
- `property_name` - Check boolean properties
- `taxpayer_flag:flag_name` - Check taxpayer flags
- `filing_mode:mode` - Check filing mode
- `computed_condition` - Check computed values
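
As an illustration of how the `exists(Entity[filters])` form could be evaluated, the sketch below handles a single `key='value'` filter against in-memory facts. The grammar, the `facts` structure, and both function names are assumptions; the real compiler lives in `libs/policy.py` and supports more forms than this.

```python
import re

# Matches only exists(Entity[key='value']); other predicate forms are out of scope.
EXISTS_RE = re.compile(
    r"^exists\((?P<entity>\w+)\[(?P<key>\w+)='(?P<value>[^']*)'\]\)$"
)


def evaluate_exists(predicate: str, facts: dict[str, list[dict]]) -> bool:
    """Return True when any entity of the named type has the given property value."""
    match = EXISTS_RE.match(predicate.strip())
    if match is None:
        raise ValueError(f"unsupported predicate: {predicate!r}")
    rows = facts.get(match["entity"], [])
    return any(row.get(match["key"]) == match["value"] for row in rows)


def triggered_schedules(triggers: dict, facts: dict[str, list[dict]]) -> list[str]:
    """Return schedule IDs whose any_of list contains at least one true predicate."""
    return [
        schedule_id
        for schedule_id, rule in triggers.items()
        if any(evaluate_exists(p, facts) for p in rule.get("any_of", []))
    ]


# Triggers copied from the policy example; facts are made up for the demo.
triggers = {
    "SA102": {"any_of": ["exists(IncomeItem[type='Employment'])"]},
    "SA105": {"any_of": ["exists(IncomeItem[type='UKPropertyRent'])"]},
}
facts = {"IncomeItem": [{"type": "Employment", "amount": 30000}]}
required = triggered_schedules(triggers, facts)
```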

### Status Classification

Evidence is classified into four statuses:

- **PRESENT_VERIFIED**: High confidence OCR/extract, date within tax year
- **PRESENT_UNVERIFIED**: Medium confidence, may need manual review
- **CONFLICTING**: Multiple documents with conflicting information
- **MISSING**: No evidence found or confidence too low
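
The four statuses can be sketched as a decision function. The `0.82` and `0.85` thresholds come from the `defaults` block in the policy example; the `0.5` floor, the `key_value` conflict check, and the `Candidate` and `classify` names are hypothetical, and the sketch ignores `date_tolerance_days`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    ocr_confidence: float
    extract_confidence: float
    document_date: date
    key_value: str  # e.g. an extracted total, used here to detect conflicts


def classify(
    candidates: list[Candidate],
    *,
    ocr_min: float = 0.82,
    extract_min: float = 0.85,
    floor: float = 0.5,  # assumption: below this a document counts as not found
    year_start: date,
    year_end: date,
) -> str:
    usable = [
        c for c in candidates
        if min(c.ocr_confidence, c.extract_confidence) >= floor
    ]
    if not usable:
        return "MISSING"
    if len({c.key_value for c in usable}) > 1:
        return "CONFLICTING"
    for c in usable:
        confident = c.ocr_confidence >= ocr_min and c.extract_confidence >= extract_min
        if confident and year_start <= c.document_date <= year_end:
            return "PRESENT_VERIFIED"
    return "PRESENT_UNVERIFIED"


tax_year = {"year_start": date(2024, 4, 6), "year_end": date(2025, 4, 5)}
good = Candidate(0.95, 0.93, date(2025, 3, 31), "30000")
status = classify([good], **tax_year)
```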

### Testing

Run coverage policy tests:

```bash
# Unit tests
pytest tests/unit/coverage/ -v

# Integration tests
pytest tests/integration/coverage/ -v

# End-to-end tests
pytest tests/e2e/test_coverage_to_compute_flow.py -v

# Coverage report
pytest tests/unit/coverage/ --cov=libs --cov-report=html
```

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

- **Documentation**: See the `docs/` directory
- **Issues**: GitHub Issues
- **Discussions**: GitHub Discussions
- **Security**: security@example.com

## 🗺️ Roadmap

- [ ] Advanced ML models for extraction
- [ ] Multi-jurisdiction support (EU, US)
- [ ] Real-time collaboration features
- [ ] Mobile application
- [ ] Advanced analytics dashboard
- [ ] Blockchain audit trails