## Datasets, Metrics, Acceptance Criteria

### Test Datasets

#### Synthetic Data

- **Employment scenarios**: 50 synthetic P60s, payslips, and bank statements
- **Self-employment**: 30 invoice/receipt sets with varying complexity
- **Property**: 25 rental scenarios including FHL and joint ownership
- **Mixed portfolios**: 20 complete taxpayer profiles with multiple income sources
- **Edge cases**: 15 scenarios with basis period reform, loss carry-forwards, HICBC

#### Anonymized Real-like Data

- **Bank statements**: 100 anonymized statements with realistic transaction patterns
- **Invoices**: 200 business invoices with varying layouts and quality
- **Property documents**: 50 rental agreements and property statements
- **HMRC forms**: 30 completed SA100 series with known correct values

#### Golden Reference Sets

- **Schedule calculations**: Hand-verified calculations for each schedule type
- **Reconciliation tests**: Known bank-to-invoice matching scenarios
- **RAG evaluation**: Curated question-answer pairs with ground truth citations

### Extraction Metrics

#### Field-Level Precision/Recall

- **Target precision ≥ 0.97** for structured fields (amounts, dates, references)
- **Target recall ≥ 0.95** for mandatory fields per document type
- **OCR confidence threshold**: Reject below 0.50, human review 0.50-0.85

| Field Type        | Precision Target | Recall Target | Notes                     |
| ----------------- | ---------------- | ------------- | ------------------------- |
| Currency amounts  | ≥ 0.98           | ≥ 0.96        | Critical for calculations |
| Dates             | ≥ 0.95           | ≥ 0.94        | Tax year assignment       |
| Party names       | ≥ 0.90           | ≥ 0.88        | Entity resolution         |
| Reference numbers | ≥ 0.92           | ≥ 0.90        | UTR, NI, VAT validation   |
| Addresses         | ≥ 0.85           | ≥ 0.80        | Postcode validation       |

#### Document Classification

- **Overall accuracy ≥ 0.95** for document type classification
- **Confidence calibration**: Platt scaling on validation set
- **Confusion matrix analysis** for misclassification patterns

### Schedule-Level Accuracy

#### Absolute Error Targets

- **SA102 Employment**: Mean absolute error ≤ £10 per box
- **SA103 Self-Employment**: Mean absolute error ≤ £50 per box
- **SA105 Property**: Mean absolute error ≤ £25 per box
- **SA110 Tax Calculation**: Mean absolute error ≤ £5 for tax due

#### Reconciliation Pass-Rate

- **Target ≥ 98%** for bank statement to invoice/expense matching
- **Tolerance**: ±£0.01 for amounts, ±2 days for dates
- **Delta analysis**: Track systematic biases in reconciliation

### RAG Retrieval Evaluation

#### Retrieval Metrics

- **Top-k recall@5 ≥ 0.85**: Relevant chunks in top 5 results
- **nDCG@10 ≥ 0.80**: Normalized discounted cumulative gain
- **MRR ≥ 0.75**: Mean reciprocal rank of first relevant result

#### Faithfulness & Groundedness

- **Faithfulness ≥ 0.90**: Generated answers supported by retrieved chunks
- **Groundedness ≥ 0.85**: Claims traceable to source documents
- **Citation accuracy ≥ 0.95**: Correct document/page/section references

#### RAG-Specific Tests

- **Jurisdiction filtering**: Ensure UK-specific results for UK queries
- **Tax year relevance**: Retrieve rules applicable to specified tax year
- **PII leak prevention**: No personal data in vector embeddings
- **Right-to-erasure**: Complete removal via payload filters

### Explanation Coverage

#### Lineage Traceability

- **Target ≥ 99%** of numeric facts traceable to source evidence
- **Evidence chain completeness**: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
- **Provenance accuracy**: Correct page/bbox/text_hash references

#### Calculation Explanations

- **Rule application transparency**: Each calculation step with rule reference
- **Confidence propagation**: Uncertainty quantification through calculation chain
- **Alternative scenarios**: "What-if" analysis for different input values

### Security & Compliance Tests

#### Authentication & Authorization

- **Traefik+Authentik integration**: Route-level access control
- **Header spoofing prevention**: Reject requests with auth headers from untrusted sources
- **JWT validation**: Proper signature verification and claim extraction
- **Session management**: Timeout, refresh, and logout functionality

#### Data Protection

- **PII masking**: Verify no raw PII in logs, vectors, or exports
- **Encryption at rest**: All sensitive data encrypted with KMS keys
- **Encryption in transit**: TLS 1.3 for all inter-service communication
- **Access logging**: Complete audit trail of data access

#### GDPR Compliance

- **Right-to-erasure**: Complete data removal across all systems
- **Data minimization**: Only necessary data collected and retained
- **Consent tracking**: Valid legal basis for all processing activities
- **Retention policies**: Automatic deletion per defined schedules

### Red-Team Test Cases

#### Adversarial Inputs

- **OCR noise injection**: Deliberately degraded document quality
- **Conflicting documents**: Multiple sources with contradictory information
- **Malformed data**: Invalid formats, extreme values, edge cases
- **Injection attacks**: Attempt to inject malicious content via documents

#### System Resilience

- **Rate limiting**: Verify API rate limits prevent abuse
- **Resource exhaustion**: Large document processing under load
- **Cascade failures**: Service dependency failure scenarios
- **Data corruption**: Recovery from corrupted KG/vector data

#### Privacy Attacks

- **Membership inference**: Attempt to determine if data was used in training
- **Model inversion**: Try to extract training data from model outputs
- **PII reconstruction**: Attempt to rebuild personal data from anonymized vectors
- **Cross-tenant leakage**: Verify data isolation between clients

### Performance Benchmarks

#### Throughput Targets

- **Local deployment**: 2 documents/second sustained processing
- **Scale-out**: 5 documents/second with burst to 20 documents/second
- **RAG queries**: <500ms p95 response time for hybrid retrieval
- **KG queries**: <200ms p95 for schedule calculations

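
The p95 response-time targets for RAG and KG queries can be checked against recorded latencies with a small helper. The following is a minimal sketch, assuming latencies are collected per query kind in milliseconds; the names `P95_BUDGET_MS`, `percentile`, and `check_p95` are hypothetical, and the nearest-rank percentile is one of several common definitions, not necessarily the one the monitoring stack uses.

```python
import math

# Hypothetical budget table; the limits mirror the RAG and KG query targets above.
P95_BUDGET_MS = {"rag_query": 500.0, "kg_query": 200.0}


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based nearest rank: smallest rank covering at least pct% of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_p95(latencies_ms):
    """Map each query kind to (observed p95, True if strictly under budget)."""
    report = {}
    for kind, budget in P95_BUDGET_MS.items():
        p95 = percentile(latencies_ms[kind], 95)
        report[kind] = (p95, p95 < budget)
    return report


if __name__ == "__main__":
    # Illustrative latencies only, not measured data.
    samples = {
        "rag_query": [120, 180, 210, 250, 300, 320, 350, 400, 450, 620],
        "kg_query": [40, 55, 60, 70, 80, 90, 110, 130, 150, 190],
    }
    for kind, (p95, ok) in check_p95(samples).items():
        print(f"{kind}: p95={p95} ms, within budget={ok}")
```

With small sample sizes the nearest-rank p95 is simply the largest or near-largest observation, so a benchmark run would need enough requests per query kind for the figure to be meaningful.
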
#### Latency SLOs

- **Ingest → Extract**: p95 ≤ 3 minutes for typical documents
- **Extract → KG**: p95 ≤ 30 seconds for mapping and validation
- **Schedule computation**: p95 ≤ 5 seconds for complete form
- **Evidence generation**: p95 ≤ 10 seconds for full audit pack

### Acceptance Criteria

#### Functional Requirements

- [ ] All SA100 series schedules computed with target accuracy
- [ ] Complete audit trail from source documents to final values
- [ ] RAG system provides relevant, cited answers to tax questions
- [ ] HMRC submission integration (stub/sandbox modes)
- [ ] Multi-tenant data isolation and access control

#### Non-Functional Requirements

- [ ] System handles 1000+ documents per taxpayer
- [ ] 99.9% uptime during tax season (Jan-Apr)
- [ ] Zero data breaches or PII leaks
- [ ] Complete disaster recovery within 4 hours
- [ ] GDPR compliance audit passes

#### Integration Requirements

- [ ] Firm database connectors sync without data loss
- [ ] Traefik+Authentik SSO works across all services
- [ ] Vector and graph databases maintain consistency
- [ ] CI/CD pipeline deploys without manual intervention
- [ ] Monitoring alerts on SLO violations

### Test Execution Strategy

#### Unit Tests

- **Coverage target**: ≥ 90% line coverage for business logic
- **Property-based testing**: Fuzz testing for calculation functions
- **Mock external dependencies**: HMRC API, firm databases, LLM services

#### Integration Tests

- **End-to-end workflows**: Document upload → extraction → calculation → submission
- **Cross-service communication**: Event-driven architecture validation
- **Database consistency**: KG and vector DB synchronization

#### Performance Tests

- **Load testing**: Gradual ramp-up to target throughput
- **Stress testing**: Beyond normal capacity to find breaking points
- **Endurance testing**: Sustained load over extended periods

#### Security Tests

- **Penetration testing**: External security assessment
- **Vulnerability scanning**: Automated SAST/DAST in CI/CD
- **Compliance auditing**: GDPR, SOC2, ISO27001 readiness

### Continuous Monitoring

#### Quality Metrics Dashboard

- **Real-time extraction accuracy**: Field-level precision tracking
- **Schedule calculation drift**: Comparison with known good values
- **RAG performance**: Retrieval quality and answer faithfulness
- **User feedback integration**: Human reviewer corrections

#### Alerting Thresholds

- **Extraction precision drop**: Alert if below 0.95 for any field type
- **Reconciliation failures**: Alert if pass-rate below 0.96
- **RAG recall degradation**: Alert if top-k recall below 0.80
- **Calculation errors**: Alert on any schedule with >£100 variance

#### Model Retraining Triggers

- **Performance degradation**: Automatic retraining when metrics decline
- **Data drift detection**: Distribution changes in input documents
- **Feedback accumulation**: Retrain when sufficient corrections collected
- **Regulatory updates**: Model updates for tax law changes
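
The alerting thresholds listed in this section can be expressed as data and evaluated against a snapshot of current metrics. This is a minimal sketch under stated assumptions: the metric names, the `Threshold` class, and the `breached` function are hypothetical, rates are expressed as fractions, and the per-field-type precision rule is simplified to a single value (for example, the minimum precision across field types).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """One alerting rule: fire when the metric crosses `limit`."""

    metric: str
    limit: float
    fire_when_above: bool = False  # default: alert when the value drops below limit


# Limits copied from the alerting thresholds above; metric names are assumptions.
THRESHOLDS = [
    Threshold("extraction_precision", 0.95),
    Threshold("reconciliation_pass_rate", 0.96),
    Threshold("rag_recall_at_5", 0.80),
    Threshold("schedule_variance_gbp", 100.0, fire_when_above=True),
]


def breached(metrics):
    """Return the names of metrics that should raise an alert.

    Metrics absent from `metrics` are skipped rather than treated as breaches.
    """
    alerts = []
    for t in THRESHOLDS:
        value = metrics.get(t.metric)
        if value is None:
            continue
        crossed = value > t.limit if t.fire_when_above else value < t.limit
        if crossed:
            alerts.append(t.metric)
    return alerts


if __name__ == "__main__":
    # Illustrative snapshot only, not measured data.
    snapshot = {
        "extraction_precision": 0.97,
        "reconciliation_pass_rate": 0.955,
        "rag_recall_at_5": 0.82,
        "schedule_variance_gbp": 120.0,
    }
    print(breached(snapshot))
```

Keeping the limits in a table like this lets the same values drive both dashboards and alerts, though a real deployment would more likely encode them in the monitoring system's own rule format.
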