ai-tax-agent/docs/TESTPLAN.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00

Datasets, Metrics, Acceptance Criteria

Test Datasets

Synthetic Data

  • Employment scenarios: 50 synthetic P60s, payslips, and bank statements
  • Self-employment: 30 invoice/receipt sets with varying complexity
  • Property: 25 rental scenarios including FHL and joint ownership
  • Mixed portfolios: 20 complete taxpayer profiles with multiple income sources
  • Edge cases: 15 scenarios with basis period reform, loss carry-forwards, HICBC

Anonymized Real-like Data

  • Bank statements: 100 anonymized statements with realistic transaction patterns
  • Invoices: 200 business invoices with varying layouts and quality
  • Property documents: 50 rental agreements and property statements
  • HMRC forms: 30 completed SA100-series returns with known correct values

Golden Reference Sets

  • Schedule calculations: Hand-verified calculations for each schedule type
  • Reconciliation tests: Known bank-to-invoice matching scenarios
  • RAG evaluation: Curated question-answer pairs with ground truth citations

Extraction Metrics

Field-Level Precision/Recall

  • Target precision ≥ 0.97 for structured fields (amounts, dates, references)
  • Target recall ≥ 0.95 for mandatory fields per document type
  • OCR confidence threshold: Reject below 0.50, human review 0.50-0.85

| Field Type | Precision Target | Recall Target | Notes |
|---|---|---|---|
| Currency amounts | ≥ 0.98 | ≥ 0.96 | Critical for calculations |
| Dates | ≥ 0.95 | ≥ 0.94 | Tax year assignment |
| Party names | ≥ 0.90 | ≥ 0.88 | Entity resolution |
| Reference numbers | ≥ 0.92 | ≥ 0.90 | UTR, NI, VAT validation |
| Addresses | ≥ 0.85 | ≥ 0.80 | Postcode validation |
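
The field-level targets and the OCR confidence bands above could be checked with a small scorer along these lines. This is a minimal sketch, not the project's evaluation harness: the function names, the exact-match rule, the `(doc_id, field_name)` keying, and treating 0.85 as the inclusive upper bound of the human-review band are all assumptions made for illustration.

```python
from collections import defaultdict


def route_by_ocr_confidence(confidence):
    """Map an OCR confidence score to a handling decision.

    Assumption: the human-review band is inclusive at both ends (0.50 to 0.85).
    """
    if confidence < 0.50:
        return "reject"
    if confidence <= 0.85:
        return "human_review"
    return "accept"


def field_precision_recall(predictions, ground_truth):
    """Compute (precision, recall) per field type.

    Both arguments map (doc_id, field_name) -> (field_type, value).
    A prediction is correct only if type and value match the ground truth exactly.
    """
    true_positives = defaultdict(int)
    predicted = defaultdict(int)
    expected = defaultdict(int)

    for key, (field_type, value) in predictions.items():
        predicted[field_type] += 1
        if ground_truth.get(key) == (field_type, value):
            true_positives[field_type] += 1

    for field_type, _ in ground_truth.values():
        expected[field_type] += 1

    scores = {}
    for field_type in set(predicted) | set(expected):
        tp = true_positives[field_type]
        precision = tp / predicted[field_type] if predicted[field_type] else 0.0
        recall = tp / expected[field_type] if expected[field_type] else 0.0
        scores[field_type] = (precision, recall)
    return scores
```

Each `(precision, recall)` pair can then be compared against the corresponding row of the table. Real scoring would likely normalize values first (for example currency formatting) rather than require exact string equality.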

Document Classification

  • Overall accuracy ≥ 0.95 for document type classification
  • Confidence calibration: Platt scaling on validation set
  • Confusion matrix analysis for misclassification patterns
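
To illustrate the confidence calibration step, Platt scaling fits a sigmoid over raw classifier scores using held-out validation labels. The sketch below is a dependency-free approximation using plain gradient descent on log loss; the function names, learning rate, and epoch count are assumptions, and it omits the target smoothing from Platt's original method. A production setup would more likely use an existing library implementation.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by minimizing log loss.

    scores: raw classifier scores on the validation set.
    labels: 1 if the predicted document type was correct, else 0.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            error = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Turn a raw score into a calibrated probability."""
    return _sigmoid(a * score + b)
```

The fitted `a` and `b` are learned once on the validation set and then applied to every new classification score.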

Schedule-Level Accuracy

Absolute Error Targets

  • SA102 Employment: Mean absolute error ≤ £10 per box
  • SA103 Self-Employment: Mean absolute error ≤ £50 per box
  • SA105 Property: Mean absolute error ≤ £25 per box
  • SA110 Tax Calculation: Mean absolute error ≤ £5 for tax due
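
The per-schedule error targets above can be expressed as a simple mean-absolute-error check against the golden reference sets. This is a sketch under assumptions: the box identifiers, the `MAE_TARGETS` mapping, and the function names are hypothetical, and the SA110 entry applies the £5 target to the tax-due figure only.

```python
from decimal import Decimal

# Targets in GBP, taken from the list above.
MAE_TARGETS = {
    "SA102": Decimal("10"),
    "SA103": Decimal("50"),
    "SA105": Decimal("25"),
    "SA110": Decimal("5"),  # applies to the tax-due figure
}


def mean_absolute_error(computed, golden):
    """Mean absolute error across boxes.

    computed, golden: dicts mapping box id -> Decimal amount in GBP.
    """
    if not golden or computed.keys() != golden.keys():
        raise ValueError("computed and golden must cover the same non-empty set of boxes")
    total = sum(abs(computed[box] - golden[box]) for box in golden)
    return total / len(golden)


def meets_target(schedule, computed, golden):
    """True if the schedule's mean absolute error is within its target."""
    return mean_absolute_error(computed, golden) <= MAE_TARGETS[schedule]
```

`Decimal` is used instead of `float` so that pence-level differences are not distorted by binary rounding.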

Reconciliation Pass-Rate

  • Target ≥ 98% for bank statement to invoice/expense matching
  • Tolerance: ±£0.01 for amounts, ±2 days for dates
  • Delta analysis: Track systematic biases in reconciliation
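
The matching tolerances above translate directly into a predicate over a bank transaction and a candidate invoice. The following is a minimal sketch; the function names and the tuple layout are assumptions for illustration, not the project's reconciliation code.

```python
from datetime import date, timedelta
from decimal import Decimal

AMOUNT_TOLERANCE = Decimal("0.01")   # ±£0.01
DATE_TOLERANCE = timedelta(days=2)   # ±2 days


def is_match(txn_amount, txn_date, inv_amount, inv_date):
    """True if a bank transaction and an invoice agree within both tolerances."""
    return (
        abs(txn_amount - inv_amount) <= AMOUNT_TOLERANCE
        and abs(txn_date - inv_date) <= DATE_TOLERANCE
    )


def pass_rate(expected_pairs):
    """Share of known-matching pairs that the tolerance rule accepts.

    expected_pairs: list of (txn_amount, txn_date, inv_amount, inv_date) tuples
    taken from the golden reconciliation scenarios.
    """
    if not expected_pairs:
        raise ValueError("no pairs to evaluate")
    matched = sum(1 for pair in expected_pairs if is_match(*pair))
    return matched / len(expected_pairs)
```

For delta analysis, the signed differences (`txn_amount - inv_amount`, `txn_date - inv_date`) could be logged alongside each decision so that systematic offsets show up in aggregate.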

RAG Retrieval Evaluation

Retrieval Metrics

  • Recall@5 ≥ 0.85: Share of relevant chunks that appear in the top 5 results
  • nDCG@10 ≥ 0.80: Normalized discounted cumulative gain
  • MRR ≥ 0.75: Mean reciprocal rank of first relevant result
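
For reference, the three retrieval metrics above can be computed per query as follows. This is a sketch using the standard definitions (binary relevance for recall and MRR, graded gains with a log2 discount for nDCG); the function names and input shapes are assumptions.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """Normalized discounted cumulative gain.

    gains: dict mapping chunk id -> graded relevance (missing ids count as 0).
    """
    dcg = sum(
        gains.get(chunk_id, 0) / math.log2(position + 2)
        for position, chunk_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 2) for position, g in enumerate(ideal_gains))
    return dcg / idcg if idcg else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```

The reported figures would be averages of these per-query values over the curated question-answer set.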

Faithfulness & Groundedness

  • Faithfulness ≥ 0.90: Generated answers supported by retrieved chunks
  • Groundedness ≥ 0.85: Claims traceable to source documents
  • Citation accuracy ≥ 0.95: Correct document/page/section references

RAG-Specific Tests

  • Jurisdiction filtering: Ensure UK-specific results for UK queries
  • Tax year relevance: Retrieve rules applicable to specified tax year
  • PII leak prevention: No personal data in vector embeddings
  • Right-to-erasure: Complete removal via payload filters

Explanation Coverage

Lineage Traceability

  • Target ≥ 99% of numeric facts traceable to source evidence
  • Evidence chain completeness: Document → Evidence → IncomeItem/ExpenseItem → Schedule → FormBox
  • Provenance accuracy: Correct page/bbox/text_hash references
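
The evidence chain above suggests a straightforward coverage check: walk back from each form box and confirm every expected link exists. The sketch below is deliberately simplified and rests on assumptions: each node has a single `derived_from` parent (a real knowledge graph would allow many), `Item` stands in for IncomeItem/ExpenseItem, and the node ids and function names are hypothetical.

```python
# Kinds expected when walking backwards from a form box to its source document.
EXPECTED_CHAIN = ["FormBox", "Schedule", "Item", "Evidence", "Document"]


def is_traceable(node_id, nodes):
    """True if node_id leads back to a Document through every expected kind.

    nodes: dict mapping node id -> (kind, derived_from_id or None).
    """
    for expected_kind in EXPECTED_CHAIN:
        node = nodes.get(node_id)
        if node is None or node[0] != expected_kind:
            return False
        node_id = node[1]  # step one link closer to the source document
    return True


def lineage_coverage(form_box_ids, nodes):
    """Share of form boxes with a complete evidence chain."""
    if not form_box_ids:
        raise ValueError("no form boxes to check")
    traceable = sum(1 for box_id in form_box_ids if is_traceable(box_id, nodes))
    return traceable / len(form_box_ids)
```

The resulting ratio is what would be compared against the 99% traceability target.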

Calculation Explanations

  • Rule application transparency: Each calculation step with rule reference
  • Confidence propagation: Uncertainty quantification through calculation chain
  • Alternative scenarios: "What-if" analysis for different input values

Security & Compliance Tests

Authentication & Authorization

  • Traefik+Authentik integration: Route-level access control
  • Header spoofing prevention: Reject requests with auth headers from untrusted sources
  • JWT validation: Proper signature verification and claim extraction
  • Session management: Timeout, refresh, and logout functionality
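
The header spoofing test above can be framed as a check that identity headers are only honoured when they arrive from the trusted proxy. The sketch below is illustrative only: the trusted network range is hypothetical, and the `x-authentik-` header prefix is an assumption about how the identity headers are named in this deployment.

```python
import ipaddress

# Hypothetical: the network the reverse proxy forwards from.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/24")]

# Assumed prefix for identity headers injected by the auth layer.
AUTH_HEADER_PREFIX = "x-authentik-"


def is_request_allowed(source_ip, headers):
    """Reject requests that carry identity headers but did not come via the proxy.

    headers: dict mapping header name -> value (names compared case-insensitively).
    """
    address = ipaddress.ip_address(source_ip)
    from_trusted_proxy = any(address in network for network in TRUSTED_PROXIES)
    carries_auth_headers = any(
        name.lower().startswith(AUTH_HEADER_PREFIX) for name in headers
    )
    return from_trusted_proxy or not carries_auth_headers
```

A request from an untrusted source without identity headers passes this particular check but is still unauthenticated, so the service would handle it as an anonymous request.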

Data Protection

  • PII masking: Verify no raw PII in logs, vectors, or exports
  • Encryption at rest: All sensitive data encrypted with KMS keys
  • Encryption in transit: TLS 1.3 for all inter-service communication
  • Access logging: Complete audit trail of data access
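
A PII masking test could scan log lines and exports for identifier-shaped strings. The patterns below are simplified assumptions for illustration (a National Insurance number as two letters, six digits, and a suffix letter A to D; a UTR as a bare 10-digit number) and will produce both false positives and false negatives on real data.

```python
import re

# Simplified, illustrative patterns; not a complete definition of either format.
PII_PATTERNS = {
    "nino": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
    "utr": re.compile(r"\b\d{10}\b"),
}


def find_pii(line):
    """Return the sorted names of PII patterns that appear in a line of text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(line))


def scan_lines(lines):
    """Return (line_number, pattern_names) for every line containing suspected PII."""
    findings = []
    for number, line in enumerate(lines, start=1):
        matches = find_pii(line)
        if matches:
            findings.append((number, matches))
    return findings
```

A test would run `scan_lines` over captured logs and exported artefacts and fail if any findings are returned.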

GDPR Compliance

  • Right-to-erasure: Complete data removal across all systems
  • Data minimization: Only necessary data collected and retained
  • Consent tracking: Valid legal basis for all processing activities
  • Retention policies: Automatic deletion per defined schedules

Red-Team Test Cases

Adversarial Inputs

  • OCR noise injection: Deliberately degraded document quality
  • Conflicting documents: Multiple sources with contradictory information
  • Malformed data: Invalid formats, extreme values, edge cases
  • Injection attacks: Attempt to inject malicious content via documents

System Resilience

  • Rate limiting: Verify API rate limits prevent abuse
  • Resource exhaustion: Large document processing under load
  • Cascade failures: Service dependency failure scenarios
  • Data corruption: Recovery from corrupted KG/vector data

Privacy Attacks

  • Membership inference: Attempt to determine if data was used in training
  • Model inversion: Try to extract training data from model outputs
  • PII reconstruction: Attempt to rebuild personal data from anonymized vectors
  • Cross-tenant leakage: Verify data isolation between clients

Performance Benchmarks

Throughput Targets

  • Local deployment: 2 documents/second sustained processing
  • Scale-out: 5 documents/second with burst to 20 documents/second
  • RAG queries: <500ms p95 response time for hybrid retrieval
  • KG queries: <200ms p95 for schedule calculations

Latency SLOs

  • Ingest → Extract: p95 ≤ 3 minutes for typical documents
  • Extract → KG: p95 ≤ 30 seconds for mapping and validation
  • Schedule computation: p95 ≤ 5 seconds for complete form
  • Evidence generation: p95 ≤ 10 seconds for full audit pack
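
The latency SLOs above are all stated as p95 values, so a benchmark run needs a percentile function and a comparison against each limit. The sketch below uses the nearest-rank percentile definition; the stage keys and function names are assumptions, and the limits are the ones listed above converted to seconds.

```python
import math

# p95 limits in seconds, from the SLO list above.
SLO_P95_SECONDS = {
    "ingest_to_extract": 180,
    "extract_to_kg": 30,
    "schedule_computation": 5,
    "evidence_generation": 10,
}


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_violations(latencies):
    """Return {stage: observed_p95} for every stage whose p95 exceeds its limit.

    latencies: dict mapping stage name -> list of observed durations in seconds.
    """
    violations = {}
    for stage, limit in SLO_P95_SECONDS.items():
        samples = latencies.get(stage)
        if not samples:
            continue
        observed = percentile(samples, 95)
        if observed > limit:
            violations[stage] = observed
    return violations
```

Other percentile definitions (for example linear interpolation) give slightly different values on small samples, so the chosen method should be fixed before comparing runs.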

Acceptance Criteria

Functional Requirements

  • All SA100 series schedules computed with target accuracy
  • Complete audit trail from source documents to final values
  • RAG system provides relevant, cited answers to tax questions
  • HMRC submission integration works in stub and sandbox modes
  • Multi-tenant data isolation and access control

Non-Functional Requirements

  • System handles 1000+ documents per taxpayer
  • 99.9% uptime during tax season (Jan-Apr)
  • Zero data breaches or PII leaks
  • Complete disaster recovery within 4 hours
  • GDPR compliance audit passes

Integration Requirements

  • Firm database connectors sync without data loss
  • Traefik+Authentik SSO works across all services
  • Vector and graph databases maintain consistency
  • CI/CD pipeline deploys without manual intervention
  • Monitoring alerts on SLO violations

Test Execution Strategy

Unit Tests

  • Coverage target: ≥ 90% line coverage for business logic
  • Property-based testing: Randomized inputs checked against invariants for calculation functions
  • Mock external dependencies: HMRC API, firm databases, LLM services

Integration Tests

  • End-to-end workflows: Document upload → extraction → calculation → submission
  • Cross-service communication: Event-driven architecture validation
  • Database consistency: KG and vector DB synchronization

Performance Tests

  • Load testing: Gradual ramp-up to target throughput
  • Stress testing: Beyond normal capacity to find breaking points
  • Endurance testing: Sustained load over extended periods

Security Tests

  • Penetration testing: External security assessment
  • Vulnerability scanning: Automated SAST/DAST in CI/CD
  • Compliance auditing: GDPR, SOC2, ISO27001 readiness

Continuous Monitoring

Quality Metrics Dashboard

  • Real-time extraction accuracy: Field-level precision tracking
  • Schedule calculation drift: Comparison with known good values
  • RAG performance: Retrieval quality and answer faithfulness
  • User feedback integration: Human reviewer corrections

Alerting Thresholds

  • Extraction precision drop: Alert if below 0.95 for any field type
  • Reconciliation failures: Alert if pass-rate below 96%
  • RAG recall degradation: Alert if recall@5 below 0.80
  • Calculation errors: Alert on any schedule with >£100 variance
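
The alerting thresholds above amount to a handful of comparisons over a metrics snapshot. A minimal sketch follows; the metric key names, the snapshot shape, and the alert identifiers are hypothetical, and only the threshold values come from the list above.

```python
def evaluate_alerts(metrics):
    """Return a list of alert identifiers for a metrics snapshot.

    Expected (hypothetical) keys:
      extraction_precision: dict field type -> precision
      reconciliation_pass_rate: float between 0 and 1
      rag_recall_at_5: float between 0 and 1
      schedule_variance_gbp: dict schedule -> signed variance in GBP
    Missing keys are treated as "no data" and raise no alert.
    """
    alerts = []

    for field_type, precision in sorted(metrics.get("extraction_precision", {}).items()):
        if precision < 0.95:
            alerts.append(f"extraction_precision:{field_type}")

    pass_rate = metrics.get("reconciliation_pass_rate")
    if pass_rate is not None and pass_rate < 0.96:
        alerts.append("reconciliation_pass_rate")

    recall = metrics.get("rag_recall_at_5")
    if recall is not None and recall < 0.80:
        alerts.append("rag_recall_at_5")

    for schedule, variance in sorted(metrics.get("schedule_variance_gbp", {}).items()):
        if abs(variance) > 100:
            alerts.append(f"schedule_variance:{schedule}")

    return alerts
```

In practice these rules would more likely live in the monitoring stack's own alert configuration; the function form is mainly useful for testing the thresholds themselves.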

Model Retraining Triggers

  • Performance degradation: Automatic retraining when metrics decline
  • Data drift detection: Distribution changes in input documents
  • Feedback accumulation: Retrain when sufficient corrections collected
  • Regulatory updates: Model updates for tax law changes