ai-tax-agent/docs/TESTPLAN.md

<!-- FILE: TESTPLAN.md -->

## Datasets, Metrics, Acceptance Criteria

### Test Datasets

#### Synthetic Data

- **Employment scenarios**: 50 synthetic P60s, payslips, and bank statements
- **Self-employment**: 30 invoice/receipt sets with varying complexity
- **Property**: 25 rental scenarios including FHL and joint ownership
- **Mixed portfolios**: 20 complete taxpayer profiles with multiple income sources
- **Edge cases**: 15 scenarios with basis period reform, loss carry-forwards, HICBC

#### Anonymized Real-like Data

- **Bank statements**: 100 anonymized statements with realistic transaction patterns
- **Invoices**: 200 business invoices with varying layouts and quality
- **Property documents**: 50 rental agreements and property statements
- **HMRC forms**: 30 completed SA100 series with known correct values

#### Golden Reference Sets

- **Schedule calculations**: Hand-verified calculations for each schedule type
- **Reconciliation tests**: Known bank-to-invoice matching scenarios
- **RAG evaluation**: Curated question-answer pairs with ground truth citations

### Extraction Metrics

#### Field-Level Precision/Recall

- **Target precision ≥ 0.97** overall for structured fields (amounts, dates, references); per-field targets are listed in the table below
- **Target recall ≥ 0.95** for mandatory fields per document type
- **OCR confidence threshold**: Reject below 0.50; route 0.50-0.85 to human review
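
The OCR confidence bands can be expressed as a small routing function. This is a minimal sketch, not the project's implementation: the `Route` enum and `route_by_confidence` are hypothetical names, and two points are assumptions, namely that scores above 0.85 are accepted automatically and that exactly 0.85 still goes to human review.

```python
from enum import Enum


class Route(Enum):
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"
    AUTO_ACCEPT = "auto_accept"


# Thresholds from the test plan; boundary handling is an assumption.
REJECT_BELOW = 0.50
REVIEW_UP_TO = 0.85


def route_by_confidence(confidence: float) -> Route:
    """Map an OCR confidence score in [0, 1] to a handling route."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence < REJECT_BELOW:
        return Route.REJECT
    if confidence <= REVIEW_UP_TO:
        return Route.HUMAN_REVIEW
    # Assumed: anything above the review band is accepted automatically.
    return Route.AUTO_ACCEPT
```

Tests for this band logic are most useful at the boundaries (0.49, 0.50, 0.85, 0.86), where off-by-one comparisons tend to hide.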

| Field Type        | Precision Target | Recall Target | Notes                     |
| ----------------- | ---------------- | ------------- | ------------------------- |
| Currency amounts  | ≥ 0.98           | ≥ 0.96        | Critical for calculations |
| Dates             | ≥ 0.95           | ≥ 0.94        | Tax year assignment       |
| Party names       | ≥ 0.90           | ≥ 0.88        | Entity resolution         |
| Reference numbers | ≥ 0.92           | ≥ 0.90        | UTR, NI, VAT validation   |
| Addresses         | ≥ 0.85           | ≥ 0.80        | Postcode validation       |
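
One way to score extraction against the targets in the table is to compare predicted and expected fields as sets. The sketch below assumes a hypothetical record shape of `(doc_id, field_type, value)` tuples and exact-match scoring; a real harness would likely normalize values (for example currency formatting) before comparing.

```python
from collections import defaultdict


def field_precision_recall(predicted, expected):
    """Compute per-field-type precision and recall.

    Both arguments are sets of (doc_id, field_type, value) tuples.
    A prediction counts as a true positive only on an exact match.
    """
    tp = defaultdict(int)
    n_pred = defaultdict(int)
    n_exp = defaultdict(int)
    for item in predicted:
        n_pred[item[1]] += 1
        if item in expected:
            tp[item[1]] += 1
    for item in expected:
        n_exp[item[1]] += 1

    scores = {}
    for field_type in set(n_pred) | set(n_exp):
        precision = tp[field_type] / n_pred[field_type] if n_pred[field_type] else 0.0
        recall = tp[field_type] / n_exp[field_type] if n_exp[field_type] else 0.0
        scores[field_type] = (precision, recall)
    return scores
```

Each `(precision, recall)` pair can then be compared with the corresponding row of the table.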

#### Document Classification

- **Overall accuracy ≥ 0.95** for document type classification
- **Confidence calibration**: Platt scaling on validation set
- **Confusion matrix analysis** for misclassification patterns
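
Classification accuracy and the most common misclassification pairs can both be derived from a table of (true, predicted) counts. The following is an illustrative sketch using only the standard library; the function names are hypothetical, and confidence calibration is not covered here.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs; off-diagonal keys are errors."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(counts):
    """Fraction of documents whose predicted type equals the true type."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total if total else 0.0


def top_confusions(counts, k=3):
    """Return the k most frequent misclassification pairs."""
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(k)
```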

### Schedule-Level Accuracy

#### Absolute Error Targets

- **SA102 Employment**: Mean absolute error ≤ £10 per box
- **SA103 Self-Employment**: Mean absolute error ≤ £50 per box
- **SA105 Property**: Mean absolute error ≤ £25 per box
- **SA110 Tax Calculation**: Mean absolute error ≤ £5 for tax due
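
The error targets above could be checked with a helper along these lines. It is a sketch under assumptions: box values are paired `Decimal` amounts in pounds, the error is averaged over all paired values for a schedule, and `MAE_LIMITS`, `mean_absolute_error`, and `within_target` are hypothetical names.

```python
from decimal import Decimal

# Per-schedule MAE limits in pounds, taken from the targets above.
MAE_LIMITS = {
    "SA102": Decimal("10"),
    "SA103": Decimal("50"),
    "SA105": Decimal("25"),
    "SA110": Decimal("5"),
}


def mean_absolute_error(computed, reference):
    """MAE over paired box values (Decimal pounds) for one schedule."""
    if len(computed) != len(reference) or not computed:
        raise ValueError("need equal-length, non-empty value lists")
    total = sum((abs(c - r) for c, r in zip(computed, reference)), Decimal("0"))
    return total / len(computed)


def within_target(schedule, computed, reference):
    """True if the schedule's MAE is at or below its limit."""
    return mean_absolute_error(computed, reference) <= MAE_LIMITS[schedule]
```

Using `Decimal` rather than `float` avoids binary rounding noise when comparing monetary amounts against a fixed limit.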

#### Reconciliation Pass-Rate

- **Target ≥ 98%** for bank statement to invoice/expense matching
- **Tolerance**: ±£0.01 for amounts, ±2 days for dates
- **Delta analysis**: Track systematic biases in reconciliation
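
The tolerances above translate directly into a match predicate. The sketch below assumes each record is a hypothetical `(amount, date)` pair and uses a simple greedy one-to-one matcher to compute the pass-rate; the actual matching strategy may differ.

```python
from datetime import date
from decimal import Decimal

AMOUNT_TOLERANCE = Decimal("0.01")  # pounds
DATE_TOLERANCE_DAYS = 2


def is_match(txn, invoice):
    """True if a bank transaction and an invoice agree within tolerance.

    Each argument is an (amount: Decimal, day: date) pair.
    """
    amount_ok = abs(txn[0] - invoice[0]) <= AMOUNT_TOLERANCE
    date_ok = abs((txn[1] - invoice[1]).days) <= DATE_TOLERANCE_DAYS
    return amount_ok and date_ok


def pass_rate(transactions, invoices):
    """Greedy one-to-one matching; returns the share of invoices matched."""
    unmatched = list(transactions)
    matched = 0
    for invoice in invoices:
        for txn in unmatched:
            if is_match(txn, invoice):
                unmatched.remove(txn)  # each transaction matches at most once
                matched += 1
                break
    return matched / len(invoices) if invoices else 1.0
```

For delta analysis, the signed differences `txn[0] - invoice[0]` of matched pairs can be collected and checked for a non-zero mean.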

### RAG Retrieval Evaluation

#### Retrieval Metrics

- **Recall@5 ≥ 0.85**: Share of relevant chunks retrieved in the top 5 results
- **nDCG@10 ≥ 0.80**: Normalized discounted cumulative gain
- **MRR ≥ 0.75**: Mean reciprocal rank of first relevant result

#### Faithfulness & Groundedness

- **Faithfulness ≥ 0.90**: Generated answers supported by retrieved chunks
- **Groundedness ≥ 0.85**: Claims traceable to source documents
- **Citation accuracy ≥ 0.95**: Correct document/page/section references

#### RAG-Specific Tests

- **Jurisdiction filtering**: Ensure UK-specific results for UK queries
- **Tax year relevance**: Retrieve rules applicable to specified tax year
- **PII leak prevention**: No personal data in vector embeddings
- **Right-to-erasure**: Complete removal of a data subject's vectors via payload-filtered deletes

### Explanation Coverage

#### Lineage Traceability

- **Target ≥ 99%** of numeric facts traceable to source evidence
- **Evidence chain completeness**: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
- **Provenance accuracy**: Correct page/bbox/text_hash references
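
Traceability coverage can be measured by walking each form box back along its lineage links and checking that the walk reaches a source document. The sketch below is a simplification: it assumes a hypothetical `derived_from` map with a single parent per node, whereas a real knowledge graph would allow several.

```python
def traces_to_document(node_id, derived_from, node_types):
    """Walk derived_from links; True if the walk ends at a Document node."""
    seen = set()
    current = node_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in malformed data
        if node_types.get(current) == "Document":
            return True
        current = derived_from.get(current)
    return False


def lineage_coverage(form_boxes, derived_from, node_types):
    """Share of form boxes with an unbroken chain back to a source document."""
    if not form_boxes:
        return 1.0
    traced = sum(traces_to_document(b, derived_from, node_types) for b in form_boxes)
    return traced / len(form_boxes)
```

The result of `lineage_coverage` is the figure to compare with the 99% target.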

#### Calculation Explanations

- **Rule application transparency**: Each calculation step with rule reference
- **Confidence propagation**: Uncertainty quantification through calculation chain
- **Alternative scenarios**: "What-if" analysis for different input values

### Security & Compliance Tests

#### Authentication & Authorization

- **Traefik+Authentik integration**: Route-level access control
- **Header spoofing prevention**: Reject requests with auth headers from untrusted sources
- **JWT validation**: Proper signature verification and claim extraction
- **Session management**: Timeout, refresh, and logout functionality

#### Data Protection

- **PII masking**: Verify no raw PII in logs, vectors, or exports
- **Encryption at rest**: All sensitive data encrypted with KMS keys
- **Encryption in transit**: TLS 1.3 for all inter-service communication
- **Access logging**: Complete audit trail of data access
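
A PII masking test can scan log lines or exports for identifier patterns that should never appear unmasked. The patterns below are illustrative assumptions, not an exhaustive scanner: a National Insurance number shape, a ten-digit UTR, and a simple email shape. `PII_PATTERNS` and `find_pii` are hypothetical names.

```python
import re

# Illustrative patterns only; a real scanner would cover more identifier types.
PII_PATTERNS = {
    # National Insurance number: two letters, six digits, suffix A-D.
    "nino": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.IGNORECASE),
    # Unique Taxpayer Reference: ten digits.
    "utr": re.compile(r"\b\d{10}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_pii(text):
    """Return the names of the PII patterns that match anywhere in the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))
```

A test would assert that `find_pii` returns an empty list for every captured log line; the ten-digit pattern in particular may produce false positives on other numeric ids and would need tuning.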

#### GDPR Compliance

- **Right-to-erasure**: Complete data removal across all systems
- **Data minimization**: Only necessary data collected and retained
- **Consent tracking**: Valid legal basis for all processing activities
- **Retention policies**: Automatic deletion per defined schedules

### Red-Team Test Cases

#### Adversarial Inputs

- **OCR noise injection**: Deliberately degraded document quality
- **Conflicting documents**: Multiple sources with contradictory information
- **Malformed data**: Invalid formats, extreme values, edge cases
- **Injection attacks**: Attempt to inject malicious content via documents

#### System Resilience

- **Rate limiting**: Verify API rate limits prevent abuse
- **Resource exhaustion**: Large document processing under load
- **Cascade failures**: Service dependency failure scenarios
- **Data corruption**: Recovery from corrupted KG/vector data

#### Privacy Attacks

- **Membership inference**: Attempt to determine if data was used in training
- **Model inversion**: Try to extract training data from model outputs
- **PII reconstruction**: Attempt to rebuild personal data from anonymized vectors
- **Cross-tenant leakage**: Verify data isolation between clients

### Performance Benchmarks

#### Throughput Targets

- **Local deployment**: 2 documents/second sustained processing
- **Scale-out**: 5 documents/second with burst to 20 documents/second
- **RAG queries**: p95 < 500 ms response time for hybrid retrieval
- **KG queries**: p95 < 200 ms for schedule calculations

#### Latency SLOs

- **Ingest → Extract**: p95 ≤ 3 minutes for typical documents
- **Extract → KG**: p95 ≤ 30 seconds for mapping and validation
- **Schedule computation**: p95 ≤ 5 seconds for complete form
- **Evidence generation**: p95 ≤ 10 seconds for full audit pack
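
The latency SLOs can be verified from recorded stage timings with a percentile helper. This sketch assumes the nearest-rank definition of p95 and hypothetical stage keys; monitoring systems often interpolate instead, so results may differ slightly for small samples.

```python
import math

# p95 limits in seconds, from the latency SLOs above; stage keys are hypothetical.
SLO_P95_SECONDS = {
    "ingest_to_extract": 180.0,
    "extract_to_kg": 30.0,
    "schedule_computation": 5.0,
    "evidence_generation": 10.0,
}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_violations(latencies_by_stage):
    """Return {stage: observed p95} for every stage whose p95 exceeds its limit."""
    return {
        stage: p95
        for stage, samples in latencies_by_stage.items()
        if (p95 := percentile(samples, 95)) > SLO_P95_SECONDS[stage]
    }
```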

### Acceptance Criteria

#### Functional Requirements

- [ ] All SA100 series schedules computed with target accuracy
- [ ] Complete audit trail from source documents to final values
- [ ] RAG system provides relevant, cited answers to tax questions
- [ ] HMRC submission integration (stub/sandbox modes)
- [ ] Multi-tenant data isolation and access control

#### Non-Functional Requirements

- [ ] System handles 1000+ documents per taxpayer
- [ ] 99.9% uptime during tax season (Jan-Apr)
- [ ] Zero data breaches or PII leaks
- [ ] Complete disaster recovery within 4 hours
- [ ] GDPR compliance audit passes

#### Integration Requirements

- [ ] Firm database connectors sync without data loss
- [ ] Traefik+Authentik SSO works across all services
- [ ] Vector and graph databases maintain consistency
- [ ] CI/CD pipeline deploys without manual intervention
- [ ] Monitoring alerts on SLO violations

### Test Execution Strategy

#### Unit Tests

- **Coverage target**: ≥ 90% line coverage for business logic
- **Property-based testing**: Randomized inputs checked against invariants for calculation functions
- **Mock external dependencies**: HMRC API, firm databases, LLM services
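
To illustrate the property-based approach, the sketch below checks invariants of a calculation function over many random inputs using only the standard library. `net_profit` is a hypothetical stand-in for a function under test, not a tax rule from this project; in practice a library such as Hypothesis would usually generate and shrink the inputs.

```python
import random
from decimal import Decimal


def net_profit(income, expenses):
    """Hypothetical stand-in for a calculation function under test."""
    return sum(income, Decimal("0")) - sum(expenses, Decimal("0"))


def random_amounts(rng, max_items=20):
    """Generate a list of random pence-precision amounts."""
    return [Decimal(rng.randint(0, 10_000_000)) / 100 for _ in range(rng.randint(0, max_items))]


def check_properties(trials=500, seed=0):
    """Check invariants of net_profit over many random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        income, expenses = random_amounts(rng), random_amounts(rng)
        result = net_profit(income, expenses)
        # Order of line items must not change the result.
        assert result == net_profit(sorted(income), list(reversed(expenses)))
        # Adding an extra income item raises the total by exactly that amount.
        extra = Decimal("12.34")
        assert net_profit(income + [extra], expenses) == result + extra
        # Swapping the roles negates the result.
        assert net_profit(expenses, income) == -result
    return trials
```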

#### Integration Tests

- **End-to-end workflows**: Document upload → extraction → calculation → submission
- **Cross-service communication**: Event-driven architecture validation
- **Database consistency**: KG and vector DB synchronization

#### Performance Tests

- **Load testing**: Gradual ramp-up to target throughput
- **Stress testing**: Beyond normal capacity to find breaking points
- **Endurance testing**: Sustained load over extended periods

#### Security Tests

- **Penetration testing**: External security assessment
- **Vulnerability scanning**: Automated SAST/DAST in CI/CD
- **Compliance auditing**: GDPR, SOC2, ISO27001 readiness

### Continuous Monitoring

#### Quality Metrics Dashboard

- **Real-time extraction accuracy**: Field-level precision tracking
- **Schedule calculation drift**: Comparison with known good values
- **RAG performance**: Retrieval quality and answer faithfulness
- **User feedback integration**: Human reviewer corrections

#### Alerting Thresholds

- **Extraction precision drop**: Alert if below 0.95 for any field type
- **Reconciliation failures**: Alert if pass-rate below 96%
- **RAG recall degradation**: Alert if recall@5 below 0.80
- **Calculation errors**: Alert on any schedule with >£100 variance
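
The alerting thresholds can be evaluated together in one function. This is a sketch with hypothetical names and input shapes: rates are passed as fractions, the RAG recall figure is taken to be recall@5, and variance is a signed `Decimal` amount in pounds per schedule.

```python
from decimal import Decimal


def evaluate_alerts(precision_by_field, reconciliation_pass_rate, recall_at_5, variance_by_schedule):
    """Return alert messages for every metric that crosses its threshold."""
    alerts = []
    for field_type, precision in sorted(precision_by_field.items()):
        if precision < 0.95:
            alerts.append(f"extraction precision low: {field_type}")
    if reconciliation_pass_rate < 0.96:
        alerts.append("reconciliation pass-rate low")
    if recall_at_5 < 0.80:
        alerts.append("RAG recall low")
    for schedule, variance in sorted(variance_by_schedule.items()):
        # Strictly greater than £100, in either direction.
        if abs(variance) > Decimal("100"):
            alerts.append(f"calculation variance high: {schedule}")
    return alerts
```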

#### Model Retraining Triggers

- **Performance degradation**: Automatic retraining when metrics decline
- **Data drift detection**: Distribution changes in input documents
- **Feedback accumulation**: Retrain when sufficient corrections collected
- **Regulatory updates**: Model updates for tax law changes