ai-tax-agent/docs/TESTPLAN.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00


<!-- FILE: TESTPLAN.md -->
## Datasets, Metrics, Acceptance Criteria
### Test Datasets
#### Synthetic Data
- **Employment scenarios**: 50 synthetic P60s, payslips, and bank statements
- **Self-employment**: 30 invoice/receipt sets with varying complexity
- **Property**: 25 rental scenarios including FHL and joint ownership
- **Mixed portfolios**: 20 complete taxpayer profiles with multiple income sources
- **Edge cases**: 15 scenarios with basis period reform, loss carry-forwards, HICBC
#### Anonymized Real-like Data
- **Bank statements**: 100 anonymized statements with realistic transaction patterns
- **Invoices**: 200 business invoices with varying layouts and quality
- **Property documents**: 50 rental agreements and property statements
- **HMRC forms**: 30 completed SA100 series with known correct values
#### Golden Reference Sets
- **Schedule calculations**: Hand-verified calculations for each schedule type
- **Reconciliation tests**: Known bank-to-invoice matching scenarios
- **RAG evaluation**: Curated question-answer pairs with ground truth citations
### Extraction Metrics
#### Field-Level Precision/Recall
- **Target precision ≥ 0.97** for structured fields (amounts, dates, references)
- **Target recall ≥ 0.95** for mandatory fields per document type
- **OCR confidence threshold**: Reject below 0.50; route to human review between 0.50 and 0.85
| Field Type | Precision Target | Recall Target | Notes |
| ----------------- | ---------------- | ------------- | ------------------------- |
| Currency amounts | ≥ 0.98 | ≥ 0.96 | Critical for calculations |
| Dates | ≥ 0.95 | ≥ 0.94 | Tax year assignment |
| Party names | ≥ 0.90 | ≥ 0.88 | Entity resolution |
| Reference numbers | ≥ 0.92 | ≥ 0.90 | UTR, NI, VAT validation |
| Addresses | ≥ 0.85 | ≥ 0.80 | Postcode validation |
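
As a rough illustration of how the per-field targets above could be measured, the sketch below computes precision and recall for one field over paired extraction results and golden values. It assumes exact string equality after normalization; the function name and the dictionary layout are hypothetical, not part of the system.

```python
from typing import Dict, List, Tuple


def field_precision_recall(
    predicted: List[Dict[str, str]],
    expected: List[Dict[str, str]],
    field: str,
) -> Tuple[float, float]:
    """Precision and recall for one field across paired documents.

    Assumption: predicted[i] and expected[i] describe the same document,
    and a missing key means "no value extracted" / "no value expected".
    """
    true_pos = false_pos = false_neg = 0
    for pred_doc, gold_doc in zip(predicted, expected):
        pred = pred_doc.get(field)
        gold = gold_doc.get(field)
        if pred is not None and pred == gold:
            true_pos += 1
        else:
            if pred is not None:
                false_pos += 1  # extracted value is wrong or spurious
            if gold is not None:
                false_neg += 1  # expected value was missed or wrong
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

A wrong value counts against both precision and recall here, which is one common convention; a stricter or looser scheme (for example, tolerance on amounts) would change the numbers.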
#### Document Classification
- **Overall accuracy ≥ 0.95** for document type classification
- **Confidence calibration**: Platt scaling on validation set
- **Confusion matrix analysis** for misclassification patterns
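
The confusion matrix analysis could start from something as small as the sketch below, which tallies (actual, predicted) label pairs and lists the most frequent misclassifications. The label strings and function names are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(actual: List[str], predicted: List[str]) -> Counter:
    """Count each (actual, predicted) document-type pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(counts: Counter) -> float:
    """Share of documents whose predicted type equals the actual type."""
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / total if total else 0.0


def top_confusions(counts: Counter, limit: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent off-diagonal pairs, i.e. misclassification patterns."""
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(limit)
```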
### Schedule-Level Accuracy
#### Absolute Error Targets
- **SA102 Employment**: Mean absolute error ≤ £10 per box
- **SA103 Self-Employment**: Mean absolute error ≤ £50 per box
- **SA105 Property**: Mean absolute error ≤ £25 per box
- **SA110 Tax Calculation**: Mean absolute error ≤ £5 for tax due
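
A minimal sketch of the per-box mean absolute error, assuming each return is represented as a mapping from a box identifier to a value in pounds. The box names are placeholders, and treating a missing computed box as 0.0 is an assumption of this sketch rather than a stated rule.

```python
from typing import Dict, List


def mean_absolute_error_per_box(
    computed: List[Dict[str, float]],
    reference: List[Dict[str, float]],
) -> Dict[str, float]:
    """Mean absolute error in pounds for every box in the reference set."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for comp, ref in zip(computed, reference):
        for box, ref_value in ref.items():
            # Assumption: a box the system did not compute counts as 0.0.
            error = abs(comp.get(box, 0.0) - ref_value)
            totals[box] = totals.get(box, 0.0) + error
            counts[box] = counts.get(box, 0) + 1
    return {box: totals[box] / counts[box] for box in totals}
```

Each schedule's result can then be compared against its own limit (for example, £10 for SA102 boxes).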
#### Reconciliation Pass-Rate
- **Target ≥ 98%** for bank statement to invoice/expense matching
- **Tolerance**: ±£0.01 for amounts, ±2 days for dates
- **Delta analysis**: Track systematic biases in reconciliation
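
The tolerance rule above can be sketched as follows. The `Entry` type and the greedy one-to-one matching strategy are assumptions made for illustration; the actual matcher may use a different assignment method.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List

AMOUNT_TOLERANCE = Decimal("0.01")  # ±£0.01
DATE_TOLERANCE_DAYS = 2             # ±2 days


@dataclass(frozen=True)
class Entry:
    amount: Decimal
    on: date


def is_match(txn: Entry, invoice: Entry) -> bool:
    """True when amount and date both fall within the stated tolerances."""
    return (
        abs(txn.amount - invoice.amount) <= AMOUNT_TOLERANCE
        and abs((txn.on - invoice.on).days) <= DATE_TOLERANCE_DAYS
    )


def pass_rate(transactions: List[Entry], invoices: List[Entry]) -> float:
    """Fraction of transactions matched, each invoice used at most once."""
    unmatched = list(invoices)
    matched = 0
    for txn in transactions:
        for inv in unmatched:
            if is_match(txn, inv):
                unmatched.remove(inv)
                matched += 1
                break  # stop iterating the list we just modified
    return matched / len(transactions) if transactions else 1.0
```

Using `Decimal` avoids binary floating-point rounding when comparing against a one-penny tolerance.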
### RAG Retrieval Evaluation
#### Retrieval Metrics
- **Recall@5 ≥ 0.85**: Fraction of relevant chunks appearing in the top 5 results
- **nDCG@10 ≥ 0.80**: Normalized discounted cumulative gain
- **MRR ≥ 0.75**: Mean reciprocal rank of first relevant result
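
For reference, the three retrieval metrics can be computed as in the sketch below. It assumes binary relevance (a chunk is either relevant or not) and unique chunk IDs in each ranking; graded relevance would need a different gain term in nDCG.

```python
import math
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```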
#### Faithfulness & Groundedness
- **Faithfulness ≥ 0.90**: Generated answers supported by retrieved chunks
- **Groundedness ≥ 0.85**: Claims traceable to source documents
- **Citation accuracy ≥ 0.95**: Correct document/page/section references
#### RAG-Specific Tests
- **Jurisdiction filtering**: Ensure UK-specific results for UK queries
- **Tax year relevance**: Retrieve rules applicable to specified tax year
- **PII leak prevention**: No personal data in vector embeddings
- **Right-to-erasure**: Complete removal via payload filters
### Explanation Coverage
#### Lineage Traceability
- **Target ≥ 99%** of numeric facts traceable to source evidence
- **Evidence chain completeness**: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
- **Provenance accuracy**: Correct page/bbox/text_hash references
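
One way to measure lineage coverage is to walk from each form box back towards a source document and check that every hop has the expected node type. The sketch below assumes a simplified graph in which each node has a single `derived_from` parent; the real knowledge graph may allow several parents per node, and the dictionary-based representation is hypothetical.

```python
from typing import Dict, List, Optional

# Expected node types when walking from a form box back to its source.
CHAIN = [
    {"FormBox"},
    {"Schedule"},
    {"IncomeItem", "ExpenseItem"},
    {"Evidence"},
    {"Document"},
]


def is_traceable(
    node_id: str,
    node_type: Dict[str, str],
    derived_from: Dict[str, str],
) -> bool:
    """True when the node links back to a Document through every chain stage."""
    current: Optional[str] = node_id
    for depth, allowed in enumerate(CHAIN):
        if current is None or node_type.get(current) not in allowed:
            return False
        if depth < len(CHAIN) - 1:
            current = derived_from.get(current)
    return True


def lineage_coverage(
    form_boxes: List[str],
    node_type: Dict[str, str],
    derived_from: Dict[str, str],
) -> float:
    """Fraction of form boxes with a complete evidence chain."""
    if not form_boxes:
        return 1.0
    traced = sum(is_traceable(b, node_type, derived_from) for b in form_boxes)
    return traced / len(form_boxes)
```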
#### Calculation Explanations
- **Rule application transparency**: Each calculation step with rule reference
- **Confidence propagation**: Uncertainty quantification through calculation chain
- **Alternative scenarios**: "What-if" analysis for different input values
### Security & Compliance Tests
#### Authentication & Authorization
- **Traefik+Authentik integration**: Route-level access control
- **Header spoofing prevention**: Reject requests with auth headers from untrusted sources
- **JWT validation**: Proper signature verification and claim extraction
- **Session management**: Timeout, refresh, and logout functionality
#### Data Protection
- **PII masking**: Verify no raw PII in logs, vectors, or exports
- **Encryption at rest**: All sensitive data encrypted with KMS keys
- **Encryption in transit**: TLS 1.3 for all inter-service communication
- **Access logging**: Complete audit trail of data access
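
A PII masking check over logs could look like the sketch below, which scans lines for anything shaped like a National Insurance number. The pattern is a deliberately simplified assumption (two letters, six digits, suffix A to D) and does not encode HMRC's full prefix rules; other identifiers such as UTRs would need their own patterns.

```python
import re
from typing import Iterable, List

# Simplified NI number shape, with optional spaces between groups.
NINO_PATTERN = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")


def lines_with_raw_nino(lines: Iterable[str]) -> List[str]:
    """Return every line that still contains an unmasked NI-number-like token."""
    return [line for line in lines if NINO_PATTERN.search(line)]
```

A test would then assert that the returned list is empty for captured service logs, vector payloads, and exports.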
#### GDPR Compliance
- **Right-to-erasure**: Complete data removal across all systems
- **Data minimization**: Only necessary data collected and retained
- **Consent tracking**: Valid legal basis for all processing activities
- **Retention policies**: Automatic deletion per defined schedules
### Red-Team Test Cases
#### Adversarial Inputs
- **OCR noise injection**: Deliberately degraded document quality
- **Conflicting documents**: Multiple sources with contradictory information
- **Malformed data**: Invalid formats, extreme values, edge cases
- **Injection attacks**: Attempt to inject malicious content via documents
#### System Resilience
- **Rate limiting**: Verify API rate limits prevent abuse
- **Resource exhaustion**: Large document processing under load
- **Cascade failures**: Service dependency failure scenarios
- **Data corruption**: Recovery from corrupted KG/vector data
#### Privacy Attacks
- **Membership inference**: Attempt to determine if data was used in training
- **Model inversion**: Try to extract training data from model outputs
- **PII reconstruction**: Attempt to rebuild personal data from anonymized vectors
- **Cross-tenant leakage**: Verify data isolation between clients
### Performance Benchmarks
#### Throughput Targets
- **Local deployment**: 2 documents/second sustained processing
- **Scale-out**: 5 documents/second with burst to 20 documents/second
- **RAG queries**: <500ms p95 response time for hybrid retrieval
- **KG queries**: <200ms p95 for schedule calculations
#### Latency SLOs
- **Ingest → Extract**: p95 ≤ 3 minutes for typical documents
- **Extract → KG**: p95 ≤ 30 seconds for mapping and validation
- **Schedule computation**: p95 ≤ 5 seconds for complete form
- **Evidence generation**: p95 ≤ 10 seconds for full audit pack
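
The latency figures above can be checked against recorded samples with a nearest-rank percentile, as sketched below. The stage keys are hypothetical names for the four stages, the limits are the listed values converted to seconds, and nearest-rank is only one of several percentile definitions.

```python
import math
from typing import Dict, List, Sequence

# p95 limits in seconds, taken from the latency list above.
SLO_P95_SECONDS = {
    "ingest_extract": 180.0,
    "extract_kg": 30.0,
    "schedule_computation": 5.0,
    "evidence_generation": 10.0,
}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile of empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def slo_violations(latencies: Dict[str, List[float]]) -> List[str]:
    """Stages whose observed p95 latency exceeds the configured limit."""
    return [
        stage
        for stage, limit in SLO_P95_SECONDS.items()
        if stage in latencies and latencies[stage] and percentile(latencies[stage], 95) > limit
    ]
```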
### Acceptance Criteria
#### Functional Requirements
- [ ] All SA100 series schedules computed with target accuracy
- [ ] Complete audit trail from source documents to final values
- [ ] RAG system provides relevant, cited answers to tax questions
- [ ] HMRC submission integration (stub/sandbox modes)
- [ ] Multi-tenant data isolation and access control
#### Non-Functional Requirements
- [ ] System handles 1000+ documents per taxpayer
- [ ] 99.9% uptime during tax season (Jan-Apr)
- [ ] Zero data breaches or PII leaks
- [ ] Complete disaster recovery within 4 hours
- [ ] GDPR compliance audit passes
#### Integration Requirements
- [ ] Firm database connectors sync without data loss
- [ ] Traefik+Authentik SSO works across all services
- [ ] Vector and graph databases maintain consistency
- [ ] CI/CD pipeline deploys without manual intervention
- [ ] Monitoring alerts on SLO violations
### Test Execution Strategy
#### Unit Tests
- **Coverage target**: 90% line coverage for business logic
- **Property-based testing**: Randomized inputs checked against invariants for calculation functions
- **Mock external dependencies**: HMRC API, firm databases, LLM services
#### Integration Tests
- **End-to-end workflows**: Document upload → extraction → calculation → submission
- **Cross-service communication**: Event-driven architecture validation
- **Database consistency**: KG and vector DB synchronization
#### Performance Tests
- **Load testing**: Gradual ramp-up to target throughput
- **Stress testing**: Beyond normal capacity to find breaking points
- **Endurance testing**: Sustained load over extended periods
#### Security Tests
- **Penetration testing**: External security assessment
- **Vulnerability scanning**: Automated SAST/DAST in CI/CD
- **Compliance auditing**: GDPR, SOC2, ISO27001 readiness
### Continuous Monitoring
#### Quality Metrics Dashboard
- **Real-time extraction accuracy**: Field-level precision tracking
- **Schedule calculation drift**: Comparison with known good values
- **RAG performance**: Retrieval quality and answer faithfulness
- **User feedback integration**: Human reviewer corrections
#### Alerting Thresholds
- **Extraction precision drop**: Alert if below 0.95 for any field type
- **Reconciliation failures**: Alert if pass-rate below 96%
- **RAG recall degradation**: Alert if recall@5 below 0.80
- **Calculation errors**: Alert on any schedule with >£100 variance
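
The four thresholds above amount to a simple rule check, sketched below. The function signature and the message strings are illustrative assumptions; a real deployment would more likely express these as rules in the monitoring stack.

```python
from typing import Dict, List

MIN_FIELD_PRECISION = 0.95
MIN_RECONCILIATION_PASS_RATE = 0.96
MIN_RAG_RECALL = 0.80
MAX_SCHEDULE_VARIANCE_GBP = 100.0


def evaluate_alerts(
    precision_by_field: Dict[str, float],
    reconciliation_pass_rate: float,
    rag_recall: float,
    variance_by_schedule: Dict[str, float],
) -> List[str]:
    """Return one message per breached threshold; an empty list means healthy."""
    alerts: List[str] = []
    for field, precision in sorted(precision_by_field.items()):
        if precision < MIN_FIELD_PRECISION:
            alerts.append(f"extraction precision low for {field}: {precision:.2f}")
    if reconciliation_pass_rate < MIN_RECONCILIATION_PASS_RATE:
        alerts.append(f"reconciliation pass-rate low: {reconciliation_pass_rate:.2f}")
    if rag_recall < MIN_RAG_RECALL:
        alerts.append(f"RAG recall low: {rag_recall:.2f}")
    for schedule, variance in sorted(variance_by_schedule.items()):
        if abs(variance) > MAX_SCHEDULE_VARIANCE_GBP:
            alerts.append(f"schedule variance high for {schedule}: £{abs(variance):.2f}")
    return alerts
```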
#### Model Retraining Triggers
- **Performance degradation**: Automatic retraining when metrics decline
- **Data drift detection**: Distribution changes in input documents
- **Feedback accumulation**: Retrain when sufficient corrections collected
- **Regulatory updates**: Model updates for tax law changes