<!-- FILE: TESTPLAN.md -->

## Datasets, Metrics, Acceptance Criteria

### Test Datasets

#### Synthetic Data

- **Employment scenarios**: 50 synthetic P60s, payslips, and bank statements
- **Self-employment**: 30 invoice/receipt sets with varying complexity
- **Property**: 25 rental scenarios including FHL and joint ownership
- **Mixed portfolios**: 20 complete taxpayer profiles with multiple income sources
- **Edge cases**: 15 scenarios with basis period reform, loss carry-forwards, HICBC

#### Anonymized Real-like Data

- **Bank statements**: 100 anonymized statements with realistic transaction patterns
- **Invoices**: 200 business invoices with varying layouts and quality
- **Property documents**: 50 rental agreements and property statements
- **HMRC forms**: 30 completed SA100 series with known correct values

#### Golden Reference Sets

- **Schedule calculations**: Hand-verified calculations for each schedule type
- **Reconciliation tests**: Known bank-to-invoice matching scenarios
- **RAG evaluation**: Curated question-answer pairs with ground truth citations

### Extraction Metrics

#### Field-Level Precision/Recall

- **Target precision ≥ 0.97** for structured fields (amounts, dates, references)
- **Target recall ≥ 0.95** for mandatory fields per document type
- **OCR confidence threshold**: Reject below 0.50; route 0.50-0.85 to human review

| Field Type        | Precision Target | Recall Target | Notes                     |
| ----------------- | ---------------- | ------------- | ------------------------- |
| Currency amounts  | ≥ 0.98           | ≥ 0.96        | Critical for calculations |
| Dates             | ≥ 0.95           | ≥ 0.94        | Tax year assignment       |
| Party names       | ≥ 0.90           | ≥ 0.88        | Entity resolution         |
| Reference numbers | ≥ 0.92           | ≥ 0.90        | UTR, NI, VAT validation   |
| Addresses         | ≥ 0.85           | ≥ 0.80        | Postcode validation       |

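
The field-level targets and the OCR confidence bands could be scored along the following lines. This is a minimal sketch, not the project's scorer: the function names, the dictionary-based input shape, the exact-match rule, and the `"accept"` outcome for confidence above 0.85 are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def field_precision_recall(
    predicted: Dict[str, Optional[str]],
    golden: Dict[str, str],
) -> Tuple[float, float]:
    """Score one document's extracted fields against its golden reference.

    A predicted field counts as correct only when it exactly matches the
    golden value (an assumed rule; real scoring may normalise values first).
    Fields predicted as None are treated as "not emitted".
    """
    emitted = {name: value for name, value in predicted.items() if value is not None}
    correct = sum(1 for name, value in emitted.items() if golden.get(name) == value)
    precision = correct / len(emitted) if emitted else 0.0
    recall = correct / len(golden) if golden else 0.0
    return precision, recall


def route_by_ocr_confidence(confidence: float) -> str:
    """Apply the plan's bands: reject below 0.50, human review for 0.50-0.85.

    Returning "accept" above 0.85 is an assumption; the plan does not name
    that outcome explicitly.
    """
    if confidence < 0.50:
        return "reject"
    if confidence <= 0.85:
        return "human_review"
    return "accept"
```

Aggregating the per-document counts by field type, rather than averaging per-document ratios, would give figures comparable to the per-field targets in the table.
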
#### Document Classification

- **Overall accuracy ≥ 0.95** for document type classification
- **Confidence calibration**: Platt scaling on validation set
- **Confusion matrix analysis** for misclassification patterns

### Schedule-Level Accuracy

#### Absolute Error Targets

- **SA102 Employment**: Mean absolute error ≤ £10 per box
- **SA103 Self-Employment**: Mean absolute error ≤ £50 per box
- **SA105 Property**: Mean absolute error ≤ £25 per box
- **SA110 Tax Calculation**: Mean absolute error ≤ £5 for tax due

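
A check against the per-box error ceilings might look like the sketch below. The function names, the box identifiers used in any call, and the convention of treating a missing box as £0 are assumptions; the SA110 target applies to the single tax-due figure rather than per box, so it is left out of the per-box table here.

```python
from typing import Dict

# Per-box mean-absolute-error ceilings in pounds, copied from the targets above.
MAE_TARGETS_GBP: Dict[str, float] = {"SA102": 10.0, "SA103": 50.0, "SA105": 25.0}


def mean_absolute_error(computed: Dict[str, float], expected: Dict[str, float]) -> float:
    """Average |computed - expected| over the boxes in the golden reference.

    A box missing from `computed` is scored as 0.0 (an assumed convention).
    """
    if not expected:
        raise ValueError("golden reference has no boxes")
    total = sum(abs(computed.get(box, 0.0) - value) for box, value in expected.items())
    return total / len(expected)


def meets_target(schedule: str, computed: Dict[str, float], expected: Dict[str, float]) -> bool:
    """True when the schedule's mean absolute error is within its ceiling."""
    return mean_absolute_error(computed, expected) <= MAE_TARGETS_GBP[schedule]
```

Monetary values are floats here for brevity; a production check would more likely use `decimal.Decimal` to avoid rounding artefacts.
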
#### Reconciliation Pass-Rate

- **Target ≥ 98%** for bank statement to invoice/expense matching
- **Tolerance**: ±£0.01 for amounts, ±2 days for dates
- **Delta analysis**: Track systematic biases in reconciliation

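
The matching tolerances can be illustrated with a small sketch. The `Entry` type, the greedy one-to-one matching strategy, and the choice to compute the pass rate over invoices are assumptions for illustration; the plan itself only fixes the ±£0.01 and ±2 day tolerances.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Tuple

AMOUNT_TOLERANCE = Decimal("0.01")  # ±£0.01, from the plan
DATE_TOLERANCE_DAYS = 2             # ±2 days, from the plan


@dataclass(frozen=True)
class Entry:
    """A bank transaction or an invoice (hypothetical shape)."""
    ref: str
    amount: Decimal
    on: date


def is_match(txn: Entry, invoice: Entry) -> bool:
    """Both the amount and the date must fall inside the tolerances."""
    return (
        abs(txn.amount - invoice.amount) <= AMOUNT_TOLERANCE
        and abs((txn.on - invoice.on).days) <= DATE_TOLERANCE_DAYS
    )


def reconcile(txns: List[Entry], invoices: List[Entry]) -> Tuple[List[Tuple[str, str]], float]:
    """Greedy one-to-one matching; returns matched (txn, invoice) refs and the pass rate."""
    unmatched = list(txns)
    pairs: List[Tuple[str, str]] = []
    for invoice in invoices:
        hit = next((t for t in unmatched if is_match(t, invoice)), None)
        if hit is not None:
            unmatched.remove(hit)  # each transaction may satisfy only one invoice
            pairs.append((hit.ref, invoice.ref))
    pass_rate = len(pairs) / len(invoices) if invoices else 1.0
    return pairs, pass_rate
```

Using `Decimal` keeps the one-penny tolerance exact; with binary floats, a difference of exactly £0.01 could fall on either side of the boundary.
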
### RAG Retrieval Evaluation

#### Retrieval Metrics

- **Recall@5 ≥ 0.85**: Share of relevant chunks that appear in the top 5 results
- **nDCG@10 ≥ 0.80**: Normalized discounted cumulative gain over the top 10 results
- **MRR ≥ 0.75**: Mean reciprocal rank of the first relevant result

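
For reference, the three retrieval metrics can be computed as in the sketch below. It assumes binary relevance labels and unique chunk identifiers in each ranked list; the function names are illustrative. MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs: List[List[str]], relevants: List[Set[str]]) -> float:
    """Average reciprocal rank across queries (parallel lists, one entry per query)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)
```

If the golden set carries graded relevance rather than binary labels, the gain term in `ndcg_at_k` would need to change accordingly.
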
#### Faithfulness & Groundedness

- **Faithfulness ≥ 0.90**: Generated answers supported by retrieved chunks
- **Groundedness ≥ 0.85**: Claims traceable to source documents
- **Citation accuracy ≥ 0.95**: Correct document/page/section references

#### RAG-Specific Tests

- **Jurisdiction filtering**: Ensure UK-specific results for UK queries
- **Tax year relevance**: Retrieve rules applicable to specified tax year
- **PII leak prevention**: No personal data in vector embeddings
- **Right-to-erasure**: Complete removal via payload filters

### Explanation Coverage

#### Lineage Traceability

- **Target ≥ 99%** of numeric facts traceable to source evidence
- **Evidence chain completeness**: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
- **Provenance accuracy**: Correct page/bbox/text_hash references

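
One way to measure evidence chain completeness is to walk upstream from each form box and confirm that every expected node type is present in order. The sketch below is deliberately simplified: it assumes each node has a single upstream parent (a real knowledge graph would usually allow several), and the `node_type` and `derived_from` mappings are hypothetical stand-ins for graph queries.

```python
from typing import Dict, List, Optional, Set

# Node types expected when walking upstream from a form box to its source document.
UPSTREAM_CHAIN: List[Set[str]] = [
    {"FormBox"},
    {"Schedule"},
    {"IncomeItem", "ExpenseItem"},
    {"Evidence"},
    {"Document"},
]


def chain_is_complete(
    form_box_id: str,
    node_type: Dict[str, str],
    derived_from: Dict[str, str],
) -> bool:
    """True when the box links back to a Document through every expected node type."""
    current: Optional[str] = form_box_id
    for allowed in UPSTREAM_CHAIN:
        if current is None or node_type.get(current) not in allowed:
            return False
        current = derived_from.get(current)  # step to the single assumed parent
    return True


def traceability(
    form_box_ids: List[str],
    node_type: Dict[str, str],
    derived_from: Dict[str, str],
) -> float:
    """Share of form boxes with a complete chain; compare against the 99% target."""
    if not form_box_ids:
        return 1.0
    complete = sum(chain_is_complete(b, node_type, derived_from) for b in form_box_ids)
    return complete / len(form_box_ids)
```
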
#### Calculation Explanations

- **Rule application transparency**: Each calculation step with rule reference
- **Confidence propagation**: Uncertainty quantification through calculation chain
- **Alternative scenarios**: "What-if" analysis for different input values

### Security & Compliance Tests

#### Authentication & Authorization

- **Traefik+Authentik integration**: Route-level access control
- **Header spoofing prevention**: Reject requests with auth headers from untrusted sources
- **JWT validation**: Proper signature verification and claim extraction
- **Session management**: Timeout, refresh, and logout functionality

#### Data Protection

- **PII masking**: Verify no raw PII in logs, vectors, or exports
- **Encryption at rest**: All sensitive data encrypted with KMS keys
- **Encryption in transit**: TLS 1.3 for all inter-service communication
- **Access logging**: Complete audit trail of data access

#### GDPR Compliance

- **Right-to-erasure**: Complete data removal across all systems
- **Data minimization**: Only necessary data collected and retained
- **Consent tracking**: Valid legal basis for all processing activities
- **Retention policies**: Automatic deletion per defined schedules

### Red-Team Test Cases

#### Adversarial Inputs

- **OCR noise injection**: Deliberately degraded document quality
- **Conflicting documents**: Multiple sources with contradictory information
- **Malformed data**: Invalid formats, extreme values, edge cases
- **Injection attacks**: Attempt to inject malicious content via documents

#### System Resilience

- **Rate limiting**: Verify API rate limits prevent abuse
- **Resource exhaustion**: Large document processing under load
- **Cascade failures**: Service dependency failure scenarios
- **Data corruption**: Recovery from corrupted KG/vector data

#### Privacy Attacks

- **Membership inference**: Attempt to determine if data was used in training
- **Model inversion**: Try to extract training data from model outputs
- **PII reconstruction**: Attempt to rebuild personal data from anonymized vectors
- **Cross-tenant leakage**: Verify data isolation between clients

### Performance Benchmarks

#### Throughput Targets

- **Local deployment**: 2 documents/second sustained processing
- **Scale-out**: 5 documents/second with burst to 20 documents/second
- **RAG queries**: <500ms p95 response time for hybrid retrieval
- **KG queries**: <200ms p95 for schedule calculations

#### Latency SLOs

- **Ingest → Extract**: p95 ≤ 3 minutes for typical documents
- **Extract → KG**: p95 ≤ 30 seconds for mapping and validation
- **Schedule computation**: p95 ≤ 5 seconds for complete form
- **Evidence generation**: p95 ≤ 10 seconds for full audit pack

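
The latency SLOs can be checked from recorded timings as sketched below. The stage keys and function names are hypothetical, and the nearest-rank percentile is one of several common definitions; a monitoring stack may interpolate instead and report slightly different values.

```python
import math
from typing import Dict, List

# p95 ceilings in seconds, taken from the latency SLOs above (stage keys are illustrative).
SLO_P95_SECONDS: Dict[str, float] = {
    "ingest_to_extract": 180.0,
    "extract_to_kg": 30.0,
    "schedule_computation": 5.0,
    "evidence_generation": 10.0,
}


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violations(latencies: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the observed p95 for every stage that exceeds its ceiling."""
    violations: Dict[str, float] = {}
    for stage, samples in latencies.items():
        observed = percentile(samples, 95)
        if observed > SLO_P95_SECONDS[stage]:
            violations[stage] = observed
    return violations
```
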
### Acceptance Criteria

#### Functional Requirements

- [ ] All SA100 series schedules computed with target accuracy
- [ ] Complete audit trail from source documents to final values
- [ ] RAG system provides relevant, cited answers to tax questions
- [ ] HMRC submission integration (stub/sandbox modes)
- [ ] Multi-tenant data isolation and access control

#### Non-Functional Requirements

- [ ] System handles 1000+ documents per taxpayer
- [ ] 99.9% uptime during tax season (Jan-Apr)
- [ ] Zero data breaches or PII leaks
- [ ] Complete disaster recovery within 4 hours
- [ ] GDPR compliance audit passes

#### Integration Requirements

- [ ] Firm database connectors sync without data loss
- [ ] Traefik+Authentik SSO works across all services
- [ ] Vector and graph databases maintain consistency
- [ ] CI/CD pipeline deploys without manual intervention
- [ ] Monitoring alerts on SLO violations

### Test Execution Strategy

#### Unit Tests

- **Coverage target**: ≥ 90% line coverage for business logic
- **Property-based testing**: Fuzz testing for calculation functions
- **Mock external dependencies**: HMRC API, firm databases, LLM services

#### Integration Tests

- **End-to-end workflows**: Document upload → extraction → calculation → submission
- **Cross-service communication**: Event-driven architecture validation
- **Database consistency**: KG and vector DB synchronization

#### Performance Tests

- **Load testing**: Gradual ramp-up to target throughput
- **Stress testing**: Beyond normal capacity to find breaking points
- **Endurance testing**: Sustained load over extended periods

#### Security Tests

- **Penetration testing**: External security assessment
- **Vulnerability scanning**: Automated SAST/DAST in CI/CD
- **Compliance auditing**: GDPR, SOC2, ISO27001 readiness

### Continuous Monitoring

#### Quality Metrics Dashboard

- **Real-time extraction accuracy**: Field-level precision tracking
- **Schedule calculation drift**: Comparison with known good values
- **RAG performance**: Retrieval quality and answer faithfulness
- **User feedback integration**: Human reviewer corrections

#### Alerting Thresholds

- **Extraction precision drop**: Alert if below 0.95 for any field type
- **Reconciliation failures**: Alert if pass-rate below 96%
- **RAG recall degradation**: Alert if recall@5 below 0.80
- **Calculation errors**: Alert on any schedule with >£100 variance

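
The alerting thresholds translate directly into a rule check. The sketch below assumes the metrics arrive as plain numbers from the monitoring system; the function name, argument shapes, and alert label strings are hypothetical, and variance is compared by absolute value.

```python
from typing import Dict, List

# Floors and ceilings copied from the alerting thresholds above.
PRECISION_FLOOR = 0.95
RECONCILIATION_FLOOR = 0.96
RAG_RECALL_FLOOR = 0.80
MAX_SCHEDULE_VARIANCE_GBP = 100.0


def evaluate_alerts(
    precision_by_field: Dict[str, float],
    reconciliation_pass_rate: float,
    rag_recall: float,
    variance_by_schedule: Dict[str, float],
) -> List[str]:
    """Return one label per breached threshold, in a deterministic order."""
    alerts: List[str] = []
    for field, precision in sorted(precision_by_field.items()):
        if precision < PRECISION_FLOOR:
            alerts.append(f"extraction_precision:{field}")
    if reconciliation_pass_rate < RECONCILIATION_FLOOR:
        alerts.append("reconciliation_pass_rate")
    if rag_recall < RAG_RECALL_FLOOR:
        alerts.append("rag_recall")
    for schedule, variance in sorted(variance_by_schedule.items()):
        if abs(variance) > MAX_SCHEDULE_VARIANCE_GBP:
            alerts.append(f"calculation_variance:{schedule}")
    return alerts
```
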
#### Model Retraining Triggers

- **Performance degradation**: Automatic retraining when metrics decline
- **Data drift detection**: Distribution changes in input documents
- **Feedback accumulation**: Retrain when sufficient corrections collected
- **Regulatory updates**: Model updates for tax law changes