Datasets, Metrics, Acceptance Criteria
Test Datasets
Synthetic Data
- Employment scenarios: 50 synthetic P60s, payslips, and bank statements
- Self-employment: 30 invoice/receipt sets with varying complexity
- Property: 25 rental scenarios including FHL and joint ownership
- Mixed portfolios: 20 complete taxpayer profiles with multiple income sources
- Edge cases: 15 scenarios with basis period reform, loss carry-forwards, HICBC
Anonymized Real-like Data
- Bank statements: 100 anonymized statements with realistic transaction patterns
- Invoices: 200 business invoices with varying layouts and quality
- Property documents: 50 rental agreements and property statements
- HMRC forms: 30 completed SA100 series with known correct values
Golden Reference Sets
- Schedule calculations: Hand-verified calculations for each schedule type
- Reconciliation tests: Known bank-to-invoice matching scenarios
- RAG evaluation: Curated question-answer pairs with ground truth citations
Extraction Metrics
Field-Level Precision/Recall
- Target precision ≥ 0.97 for structured fields (amounts, dates, references)
- Target recall ≥ 0.95 for mandatory fields per document type
- OCR confidence thresholds: Reject below 0.50; route to human review between 0.50 and 0.85
| Field Type | Precision Target | Recall Target | Notes |
|---|---|---|---|
| Currency amounts | ≥ 0.98 | ≥ 0.96 | Critical for calculations |
| Dates | ≥ 0.95 | ≥ 0.94 | Tax year assignment |
| Party names | ≥ 0.90 | ≥ 0.88 | Entity resolution |
| Reference numbers | ≥ 0.92 | ≥ 0.90 | UTR, NI, VAT validation |
| Addresses | ≥ 0.85 | ≥ 0.80 | Postcode validation |
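
The targets above could be scored along the following lines. This is a minimal sketch, assuming predictions and ground truth are dictionaries keyed by field name and compared by exact match; the function names and the `auto_accept` outcome for confidence above 0.85 are illustrative assumptions, not part of the specification.

```python
from typing import Dict, Tuple


def field_precision_recall(predicted: Dict[str, str], truth: Dict[str, str]) -> Tuple[float, float]:
    """Return (precision, recall) for one document using exact-match scoring."""
    # A true positive is a predicted field whose value equals the ground truth.
    true_pos = sum(1 for key, value in predicted.items() if truth.get(key) == value)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


def route_by_ocr_confidence(confidence: float) -> str:
    """Apply the OCR thresholds: reject below 0.50, human review for 0.50-0.85."""
    if confidence < 0.50:
        return "reject"
    if confidence <= 0.85:
        return "human_review"
    # Assumption: anything above 0.85 proceeds without review.
    return "auto_accept"
```

A field that is extracted with the wrong value counts against both precision and recall here; a field that is missing from the prediction counts only against recall.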
Document Classification
- Overall accuracy ≥ 0.95 for document type classification
- Confidence calibration: Platt scaling on validation set
- Confusion matrix analysis for misclassification patterns
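
For illustration, Platt scaling fits a sigmoid that maps raw classifier scores to calibrated probabilities. The sketch below is a simplified variant that uses plain gradient descent on log loss with unsmoothed 0/1 labels; the function names, learning rate, and epoch count are assumptions chosen for readability rather than the project's calibration code.

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) on a validation set by gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = _sigmoid(a * s + b)
            # Gradient of the log loss with respect to a and b.
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw classifier score to a calibrated probability."""
    return _sigmoid(a * score + b)
```

The parameters should be fitted on the validation set only, so that the calibrated confidence is not tuned on the data used to report accuracy.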
Schedule-Level Accuracy
Absolute Error Targets
- SA102 Employment: Mean absolute error ≤ £10 per box
- SA103 Self-Employment: Mean absolute error ≤ £50 per box
- SA105 Property: Mean absolute error ≤ £25 per box
- SA110 Tax Calculation: Mean absolute error ≤ £5 for tax due
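
A minimal sketch of checking these targets against a golden reference set, assuming computed and expected values are `Decimal` amounts keyed by box identifier. The box names used in the usage note are hypothetical, and treating a missing box as zero is an assumption.

```python
from decimal import Decimal

# Mean-absolute-error targets in pounds, copied from the list above.
MAE_TARGETS = {
    "SA102": Decimal("10"),
    "SA103": Decimal("50"),
    "SA105": Decimal("25"),
    "SA110": Decimal("5"),  # applies to tax due rather than to every box
}


def mean_absolute_error(computed, expected):
    """Mean absolute error over the boxes present in the golden reference."""
    if not expected:
        raise ValueError("expected must contain at least one box")
    # Assumption: a box missing from the computed output is scored as zero.
    errors = [abs(computed.get(box, Decimal("0")) - value) for box, value in expected.items()]
    return sum(errors, Decimal("0")) / len(errors)


def meets_target(schedule, computed, expected):
    """True when the schedule's MAE is within its target."""
    return mean_absolute_error(computed, expected) <= MAE_TARGETS[schedule]
```

For example, with hypothetical boxes `box_1` and `box_2` that are off by £12 and £8, the MAE is £10, which just meets the SA102 target.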
Reconciliation Pass-Rate
- Target ≥ 98% for bank statement to invoice/expense matching
- Tolerance: ±£0.01 for amounts, ±2 days for dates
- Delta analysis: Track systematic biases in reconciliation
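
The tolerances above can be sketched as a matching predicate plus a pass-rate calculation. This is an illustrative example only: the record shape (dictionaries with `amount` and `date`) and the greedy one-to-one matching strategy are assumptions, not the reconciliation service's actual logic.

```python
from datetime import date, timedelta
from decimal import Decimal

AMOUNT_TOLERANCE = Decimal("0.01")  # ±£0.01
DATE_TOLERANCE = timedelta(days=2)  # ±2 days


def is_match(txn, invoice):
    """True when a bank transaction and an invoice agree within tolerance."""
    amount_ok = abs(txn["amount"] - invoice["amount"]) <= AMOUNT_TOLERANCE
    date_ok = abs(txn["date"] - invoice["date"]) <= DATE_TOLERANCE
    return amount_ok and date_ok


def pass_rate(transactions, invoices):
    """Fraction of invoices matched to a distinct transaction (greedy, first fit)."""
    unused = list(transactions)
    matched = 0
    for invoice in invoices:
        hit = next((t for t in unused if is_match(t, invoice)), None)
        if hit is not None:
            unused.remove(hit)  # each transaction can settle only one invoice
            matched += 1
    return matched / len(invoices) if invoices else 1.0
```

Using `Decimal` rather than floats keeps the ±£0.01 comparison exact.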
RAG Retrieval Evaluation
Retrieval Metrics
- Recall@5 ≥ 0.85: Relevant chunks appear in the top 5 results
- nDCG@10 ≥ 0.80: Normalized discounted cumulative gain
- MRR ≥ 0.75: Mean reciprocal rank of first relevant result
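
The three retrieval metrics can be computed per query as sketched below and then averaged over the evaluation set (the mean of the reciprocal ranks is the MRR). The sketch assumes binary relevance and that chunks are identified by string IDs; both are assumptions about the evaluation data.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary relevance: gain 1 for a relevant chunk, 0 otherwise."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    # Ideal ranking places every relevant chunk first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```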
Faithfulness & Groundedness
- Faithfulness ≥ 0.90: Generated answers supported by retrieved chunks
- Groundedness ≥ 0.85: Claims traceable to source documents
- Citation accuracy ≥ 0.95: Correct document/page/section references
RAG-Specific Tests
- Jurisdiction filtering: Ensure UK-specific results for UK queries
- Tax year relevance: Retrieve rules applicable to specified tax year
- PII leak prevention: No personal data in vector embeddings
- Right-to-erasure: Complete removal via payload filters
Explanation Coverage
Lineage Traceability
- Target ≥ 99% of numeric facts traceable to source evidence
- Evidence chain completeness: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
- Provenance accuracy: Correct page/bbox/text_hash references
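
One way to measure traceability is to walk the evidence chain backwards from each form box and confirm that every link exists with the expected node type. The sketch below assumes a simplified graph in which each node has a single `derived_from` parent; a real knowledge graph would likely allow a box to aggregate several items, and the node IDs and data structures here are hypothetical.

```python
# Expected node types, read from the form box back to the source document.
CHAIN = [
    {"FormBox"},
    {"Schedule"},
    {"IncomeItem", "ExpenseItem"},
    {"Evidence"},
    {"Document"},
]


def is_traceable(node_id, node_types, derived_from):
    """Follow derived_from links and check each node's type against CHAIN."""
    current = node_id
    for allowed in CHAIN:
        if current is None or node_types.get(current) not in allowed:
            return False
        current = derived_from.get(current)
    return True


def traceability_rate(form_boxes, node_types, derived_from):
    """Fraction of form boxes with a complete chain back to a Document."""
    if not form_boxes:
        return 1.0
    traced = sum(1 for box in form_boxes if is_traceable(box, node_types, derived_from))
    return traced / len(form_boxes)
```

The resulting rate can be compared directly with the 99% target.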
Calculation Explanations
- Rule application transparency: Each calculation step with rule reference
- Confidence propagation: Uncertainty quantification through calculation chain
- Alternative scenarios: "What-if" analysis for different input values
Security & Compliance Tests
Authentication & Authorization
- Traefik+Authentik integration: Route-level access control
- Header spoofing prevention: Reject requests with auth headers from untrusted sources
- JWT validation: Proper signature verification and claim extraction
- Session management: Timeout, refresh, and logout functionality
Data Protection
- PII masking: Verify no raw PII in logs, vectors, or exports
- Encryption at rest: All sensitive data encrypted with KMS keys
- Encryption in transit: TLS 1.3 for all inter-service communication
- Access logging: Complete audit trail of data access
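
As an illustration of the PII masking check, a test could scan log lines or exports for patterns that look like personal identifiers. The sketch below uses deliberately simplified patterns for National Insurance numbers and email addresses; the patterns and the function name are assumptions, and a production check would need a broader and more precise detector.

```python
import re

# Simplified patterns, for illustration only.
PII_PATTERNS = {
    # Two letters, six digits, suffix A-D (ignores HMRC's excluded prefixes).
    "nino": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii(text: str):
    """Return (kind, match) pairs for every suspected PII value in text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, match) for match in pattern.findall(text))
    return hits
```

A test would then assert that `find_pii` returns an empty list for every captured log line, vector payload, and export.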
GDPR Compliance
- Right-to-erasure: Complete data removal across all systems
- Data minimization: Only necessary data collected and retained
- Consent tracking: Valid legal basis for all processing activities
- Retention policies: Automatic deletion per defined schedules
Red-Team Test Cases
Adversarial Inputs
- OCR noise injection: Deliberately degraded document quality
- Conflicting documents: Multiple sources with contradictory information
- Malformed data: Invalid formats, extreme values, edge cases
- Injection attacks: Attempt to inject malicious content via documents
System Resilience
- Rate limiting: Verify API rate limits prevent abuse
- Resource exhaustion: Large document processing under load
- Cascade failures: Service dependency failure scenarios
- Data corruption: Recovery from corrupted KG/vector data
Privacy Attacks
- Membership inference: Attempt to determine if data was used in training
- Model inversion: Try to extract training data from model outputs
- PII reconstruction: Attempt to rebuild personal data from anonymized vectors
- Cross-tenant leakage: Verify data isolation between clients
Performance Benchmarks
Throughput Targets
- Local deployment: 2 documents/second sustained processing
- Scale-out: 5 documents/second with burst to 20 documents/second
- RAG queries: <500ms p95 response time for hybrid retrieval
- KG queries: <200ms p95 for schedule calculations
Latency SLOs
- Ingest → Extract: p95 ≤ 3 minutes for typical documents
- Extract → KG: p95 ≤ 30 seconds for mapping and validation
- Schedule computation: p95 ≤ 5 seconds for complete form
- Evidence generation: p95 ≤ 10 seconds for full audit pack
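
A minimal sketch of checking observed latencies against these SLOs, assuming latencies are collected in seconds per stage. The stage keys and the nearest-rank percentile definition are assumptions; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
# p95 latency SLOs in seconds, copied from the list above.
SLO_P95_SECONDS = {
    "ingest_to_extract": 180.0,
    "extract_to_kg": 30.0,
    "schedule_computation": 5.0,
    "evidence_generation": 10.0,
}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_violations(latencies):
    """Return the stages whose observed p95 exceeds the SLO."""
    return [
        stage
        for stage, samples in latencies.items()
        if samples and p95(samples) > SLO_P95_SECONDS[stage]
    ]
```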
Acceptance Criteria
Functional Requirements
- All SA100 series schedules computed with target accuracy
- Complete audit trail from source documents to final values
- RAG system provides relevant, cited answers to tax questions
- HMRC submission integration (stub/sandbox modes)
- Multi-tenant data isolation and access control
Non-Functional Requirements
- System handles 1000+ documents per taxpayer
- 99.9% uptime during tax season (Jan-Apr)
- Zero data breaches or PII leaks
- Complete disaster recovery within 4 hours
- GDPR compliance audit passes
Integration Requirements
- Firm database connectors sync without data loss
- Traefik+Authentik SSO works across all services
- Vector and graph databases maintain consistency
- CI/CD pipeline deploys without manual intervention
- Monitoring alerts on SLO violations
Test Execution Strategy
Unit Tests
- Coverage target: ≥ 90% line coverage for business logic
- Property-based testing: Fuzz testing for calculation functions
- Mock external dependencies: HMRC API, firm databases, LLM services
Integration Tests
- End-to-end workflows: Document upload → extraction → calculation → submission
- Cross-service communication: Event-driven architecture validation
- Database consistency: KG and vector DB synchronization
Performance Tests
- Load testing: Gradual ramp-up to target throughput
- Stress testing: Beyond normal capacity to find breaking points
- Endurance testing: Sustained load over extended periods
Security Tests
- Penetration testing: External security assessment
- Vulnerability scanning: Automated SAST/DAST in CI/CD
- Compliance auditing: GDPR, SOC2, ISO27001 readiness
Continuous Monitoring
Quality Metrics Dashboard
- Real-time extraction accuracy: Field-level precision tracking
- Schedule calculation drift: Comparison with known good values
- RAG performance: Retrieval quality and answer faithfulness
- User feedback integration: Human reviewer corrections
Alerting Thresholds
- Extraction precision drop: Alert if below 0.95 for any field type
- Reconciliation failures: Alert if pass-rate below 96%
- RAG recall degradation: Alert if recall@5 below 0.80
- Calculation errors: Alert on any schedule with >£100 variance
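
These thresholds could be evaluated against a metrics snapshot roughly as follows. The snapshot layout and alert names are hypothetical; rates are expressed as fractions between 0 and 1 and variances as `Decimal` pounds.

```python
from decimal import Decimal

PRECISION_FLOOR = 0.95
RECONCILIATION_FLOOR = 0.96
RECALL_FLOOR = 0.80
VARIANCE_LIMIT = Decimal("100")  # pounds


def evaluate_alerts(metrics):
    """Return the names of all alerts triggered by a metrics snapshot."""
    alerts = []
    for field_type, precision in metrics.get("precision_by_field", {}).items():
        if precision < PRECISION_FLOOR:
            alerts.append(f"extraction_precision:{field_type}")
    if metrics.get("reconciliation_pass_rate", 1.0) < RECONCILIATION_FLOOR:
        alerts.append("reconciliation_pass_rate")
    if metrics.get("recall_at_k", 1.0) < RECALL_FLOOR:
        alerts.append("rag_recall")
    for schedule, variance in metrics.get("variance_by_schedule", {}).items():
        # Variance may be positive or negative, so compare its magnitude.
        if abs(variance) > VARIANCE_LIMIT:
            alerts.append(f"calculation_variance:{schedule}")
    return alerts
```

Missing metrics default to a passing value in this sketch; a stricter policy would raise an alert when a metric is absent.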
Model Retraining Triggers
- Performance degradation: Automatic retraining when metrics decline
- Data drift detection: Distribution changes in input documents
- Feedback accumulation: Retrain when sufficient corrections collected
- Regulatory updates: Model updates for tax law changes