# ROLE

You are a **Senior Backend Engineer** working inside an existing monorepo that already contains the services and libraries described previously (Traefik+Authentik SSO at the edge; Python 3.12; FastAPI microservices; Vault, MinIO, Neo4j, Postgres, Redis, Qdrant; Prefect; Docker-Compose; Gitea CI).

# OBJECTIVE

Integrate the new **coverage policy** (`config/coverage.yaml`) so agents can:

1. call `CheckDocumentCoverage({tax_year, taxpayer_id})` and get a **precise, machine-readable coverage matrix** (required/conditional/optional evidence per schedule, with status and citations), and
2. call `AskClarifyingQuestion(gap, context)` to receive a **ready-to-send user question** with **why** and **citations**.

You will implement **policy loading with overlays + hot reload**, **runtime evaluation against the KG**, **citations via KG or RAG**, **validation**, **tests**, **CI**, and **deploy assets**.

---

# SCOPE (DO EXACTLY THIS)

## A) New service: `svc-coverage`

Create a dedicated microservice to encapsulate policy loading and coverage evaluation (keeps `svc-reason` calculators clean).

**Endpoints (FastAPI):**

1. `POST /v1/coverage/check`
   - Body: `{"tax_year": "YYYY-YY", "taxpayer_id": "T-xxx"}`
   - Returns: full coverage report (shape below).
2. `POST /v1/coverage/clarify`
   - Body: `{"gap": {...}, "context": {"tax_year": "...", "taxpayer_id": "...", "jurisdiction": "UK"}}`
   - Returns: `{question_text, why_it_is_needed, citations[], options_to_provide[], blocking, boxes_affected[]}`.
3. `POST /admin/coverage/reload`
   - Reloads policy from files/overrides/feature flags. **Require admin group** via forwarded header.
4. `GET /v1/coverage/policy`
   - Returns **current compiled policy** (no secrets, no PII), with version & sources.
5. `GET /v1/coverage/validate`
   - Runs cross-checks (see Validation section). Returns `{ok: bool, errors[]}`.

**Security:**

- All routes behind Traefik+Authentik.
- `/admin/*` additionally checks `X-Authenticated-Groups` contains `admin`.
- Use the existing `TrustedProxyMiddleware`.

**Observability:**

- OTel tracing, Prometheus metrics at `/metrics` (internal CIDR only), structured logs.

---

## B) Libraries & shared code (create/update)

1. **`libs/policy.py`** (new)
   - Functions:
     - `load_policy(baseline_path, jurisdiction, tax_year, tenant_id|None) -> CoveragePolicy`
     - `merge_overlays(base, *overlays) -> CoveragePolicy`
     - `apply_feature_flags(policy) -> CoveragePolicy` (optional Unleash)
     - `compile_predicates(policy) -> CompiledCoveragePolicy` (turn `condition:` DSL into callables; see DSL below)
     - `watch_and_reload()` (optional watchdog; otherwise `/admin/coverage/reload`)
   - Validate against JSON Schema (below). Raise `PolicyError` on failure.
2. **`libs/coverage_models.py`** (new)
   - Pydantic v2 models mirroring `config/coverage.yaml`: `CoveragePolicy, SchedulePolicy, EvidenceItem, Validity, StatusClassifier, QuestionTemplates, ConflictRules, GuidanceRef, Trigger, CoverageReport, CoverageItem, Citation, ClarifyResponse`.
   - Enums: `Role = REQUIRED|CONDITIONALLY_REQUIRED|OPTIONAL`, `Status = present_verified|present_unverified|missing|conflicting`.
3. **`libs/coverage_eval.py`** (new)
   - Core runtime:
     - `infer_required_schedules(taxpayer_id, tax_year, policy, kg) -> list[str]`
     - `find_evidence_docs(taxpayer_id, tax_year, evidence_ids, thresholds, kg) -> list[FoundEvidence]`
     - `classify_status(found, thresholds, tax_year_bounds, conflicts_rules) -> Status`
     - `build_reason_and_citations(schedule_id, evidence_item, status, taxpayer_id, tax_year, kg, rag) -> (str, list[Citation])`
     - `check_document_coverage(...) -> CoverageReport` (implements the A→D steps we defined)
   - Uses:
     - `libs/neo.py` for Cypher helpers (see queries below)
     - `libs/rag.py` for fallback citations (filters `{jurisdiction:'UK', tax_year}` and `pii_free:true`)
4. **`libs/coverage_schema.json`** (new)
   - JSON Schema for validating `coverage.yaml`.
     Include:
     - enum checks (`role`, `status keys`)
     - `boxes[]` contains only non-empty strings
     - every `evidence.id` and every entry in `acceptable_alternatives` refers to a kind declared in `document_kinds`
     - `triggers` exist for each schedule referenced under `schedules`
5. **`libs/neo.py`** (update)
   - Add helpers:
     - `kg_boxes_exist(box_ids: list[str]) -> dict[str,bool]`
     - `kg_find_evidence(taxpayer_id, tax_year, kinds: list[str], min_ocr: float, date_window) -> list[FoundEvidence]`
     - `kg_rule_citations(schedule_id, boxes: list[str]) -> list[Citation]`
6. **`libs/rag.py`** (update)
   - Add `rag_search_for_citations(query, filters) -> list[Citation]` (ensure `pii_free:true` and include `doc_id/url, locator`).

---

## C) Coverage DSL for conditions (compile in `compile_predicates`)

Supported condition atoms (map to KG checks):

- `exists(Entity[filters])` e.g., `exists(ExpenseItem[category='FinanceCosts'])`
- `property_joint_ownership` (bool from KG `PropertyAsset` links)
- `candidate_FHL` (bool property on `PropertyAsset` or derived)
- `claims_FTCR`, `claims_remittance_basis` (flags on `TaxpayerProfile`)
- `turnover_lt_vat_threshold` / `turnover_ge_vat_threshold` (computed from `IncomeItem` aggregates)
- `received_estate_income`, `BenefitInKind=true`, etc.

Implementation: parse simple strings with a tiny hand-rolled parser or declarative mapping table; **do not eval** raw strings. Return callables `fn(taxpayer_id, tax_year) -> bool`.

---

## D) Database migrations (Postgres; Alembic)

Create two tables (new `apps/svc-coverage/alembic`):

1. `coverage_versions`
   - `id` (serial pk), `version` (text), `jurisdiction` (text), `tax_year` (text), `tenant_id` (text null), `source_files` (jsonb), `compiled_at` (timestamptz), `hash` (text)
2. `coverage_audit`
   - `id` (serial pk), `taxpayer_id` (text), `tax_year` (text), `policy_version` (text), `overall_status` (text), `blocking_items` (jsonb), `created_at` (timestamptz), `trace_id` (text)

Write to `coverage_versions` on reload; write to `coverage_audit` on each `/v1/coverage/check`.

---

## E) API Contracts (exact shapes)

### 1) `/v1/coverage/check` (request)

```json
{ "tax_year": "2024-25", "taxpayer_id": "T-001" }
```

### 1) `/v1/coverage/check` (response)

```json
{
  "tax_year": "2024-25",
  "taxpayer_id": "T-001",
  "schedules_required": ["SA102", "SA105", "SA110"],
  "overall_status": "blocking",
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "partial",
      "evidence": [
        {
          "id": "P60",
          "role": "REQUIRED",
          "status": "present_unverified",
          "boxes": ["SA102_b1", "SA102_b2"],
          "found": [
            { "doc_id": "DOC-123", "kind": "P60", "confidence": 0.81, "pages": [2] }
          ],
          "acceptable_alternatives": ["FinalPayslipYTD", "P45"],
          "reason": "P60 present but OCR confidence 0.81 < 0.82 threshold.",
          "citations": [
            { "rule_id": "UK.SA102.P60.Required", "doc_id": "SA102-Notes-2025", "locator": "p.3 §1.1" }
          ]
        }
      ]
    }
  ],
  "blocking_items": [
    { "schedule_id": "SA105", "evidence_id": "LettingAgentStatements" }
  ]
}
```

`overall_status` is one of `ok`, `partial`, or `blocking`.

### 2) `/v1/coverage/clarify` (request)

```json
{
  "gap": {
    "schedule_id": "SA105",
    "evidence_id": "LettingAgentStatements",
    "role": "REQUIRED",
    "reason": "No rent/fees statements for 2024–25.",
    "boxes": ["SA105_b5", "SA105_b20", "SA105_b29"],
    "citations": [
      { "rule_id": "UK.SA105.RentEvidence", "doc_id": "SA105-Notes-2025", "locator": "p.4 §2.1" }
    ],
    "acceptable_alternatives": ["TenancyLedger", "BankStatements"]
  },
  "context": { "tax_year": "2024-25", "taxpayer_id": "T-001", "jurisdiction": "UK" }
}
```

### 2) `/v1/coverage/clarify` (response)

```json
{
  "question_text": "To complete the UK Property pages (SA105) for 2024–25, we need your letting agent statements showing total rents received, fees and charges. These support boxes SA105:5, SA105:20 and SA105:29. If you don’t have agent statements, you can provide a tenancy income ledger instead.",
  "why_it_is_needed": "HMRC guidance requires evidence of gross rents and allowable expenses for SA105 (see notes p.4 §2.1).",
  "citations": [
    { "rule_id": "UK.SA105.RentEvidence", "doc_id": "SA105-Notes-2025", "locator": "p.4 §2.1" }
  ],
  "options_to_provide": [
    {
      "label": "Upload agent statements (PDF/CSV)",
      "accepted_formats": ["pdf", "csv"],
      "upload_endpoint": "/v1/ingest/upload?tag=LettingAgentStatements"
    },
    {
      "label": "Upload tenancy income ledger (XLSX/CSV)",
      "accepted_formats": ["xlsx", "csv"],
      "upload_endpoint": "/v1/ingest/upload?tag=TenancyLedger"
    }
  ],
  "blocking": true,
  "boxes_affected": ["SA105_b5", "SA105_b20", "SA105_b29"]
}
```

---

## F) KG & RAG integration (implement exactly)

### Neo4j Cypher helpers (in `libs/neo.py`)

- **Presence of evidence**

  ```cypher
  MATCH (p:TaxpayerProfile {taxpayer_id:$tid})-[:OF_TAX_YEAR]->(y:TaxYear {label:$tax_year})
  MATCH (ev:Evidence)-[:DERIVED_FROM]->(d:Document)
  // Parenthesize the OR: AND binds tighter, so without this the kind/date filters
  // would not apply to evidence matched via SUPPORTS.
  WHERE ((ev)-[:SUPPORTS]->(p) OR (d)-[:BELONGS_TO]->(p))
    AND d.kind IN $kinds
    AND date(d.date) >= date(y.start_date)
    AND date(d.date) <= date(y.end_date)
  RETURN d.doc_id AS doc_id, d.kind AS kind, ev.page AS page, ev.bbox AS bbox, ev.ocr_confidence AS conf;
  ```

- **Rule citations for schedule/boxes**

  ```cypher
  MATCH (fb:FormBox)-[:GOVERNED_BY]->(r:Rule)-[:CITES]->(doc:Document)
  WHERE fb.box_id IN $box_ids
  RETURN r.rule_id AS rule_id, doc.doc_id AS doc_id, doc.locator AS locator
  LIMIT 10;
  ```

- **Check boxes exist**

  ```cypher
  UNWIND $box_ids AS bid
  OPTIONAL MATCH (fb:FormBox {box_id: bid})
  RETURN bid, fb IS NOT NULL AS exists;
  ```

### RAG fallback (in `libs/rag.py`)

- `rag_search_for_citations(query, filters={'jurisdiction':'UK','tax_year':'2024-25','pii_free':true}) -> list[Citation]`
- Use Qdrant hybrid search + rerank; return **doc_id/url** and a best-effort **locator** (heading/page).
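
The KG-first, RAG-fallback citation lookup described in this section could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Citation` dataclass stands in for the Pydantic model, `citations_kg_then_rag` is a hypothetical name, and the two lookups are injected as plain callables (in the service they would presumably wrap `kg_rule_citations` and `rag_search_for_citations`). The wording of the RAG query is also invented; the point is that it is built from schedule and box identifiers only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Citation:
    """Stand-in for the Pydantic Citation model (assumed fields)."""
    doc_id: str
    locator: str
    rule_id: Optional[str] = None  # RAG hits may not map to a KG rule


def citations_kg_then_rag(
    schedule_id: str,
    boxes: list[str],
    tax_year: str,
    kg_lookup: Callable[[str, list[str]], list[Citation]],
    rag_lookup: Callable[[str, dict], list[Citation]],
) -> list[Citation]:
    """Prefer rule citations from the KG; call RAG only when the KG has none."""
    found = kg_lookup(schedule_id, boxes)
    if found:
        return found
    # Build the query from schedule/box identifiers only, never taxpayer data.
    query = f"{schedule_id} evidence requirements for boxes {', '.join(boxes)}"
    # Always pin the pii_free filter so only PII-free chunks are searched.
    filters = {"jurisdiction": "UK", "tax_year": tax_year, "pii_free": True}
    return rag_lookup(query, filters)
```

Injecting the lookups keeps the fallback order testable without Neo4j or Qdrant: a test can pass a KG stub that returns an empty list and assert that the RAG stub received the `pii_free` filter.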

---

## G) Validation & policy correctness

Implement `/v1/coverage/validate` to run checks:

1. **YAML schema** (`libs/coverage_schema.json`) passes.
2. Every `boxes[]` exists in KG (`FormBox`).
3. Every `evidence.id` and each `acceptable_alternatives[]` is in `document_kinds`.
4. Every schedule referenced under `schedules` has a `triggers` entry.
5. Simulate a set of synthetic profiles (unit fixtures) to ensure conditional paths are exercised (e.g., with/without BIK, FHL candidate, remittance).

Return `{ok: true}` or `{ok:false, errors:[...]}`.

---

## H) Config loading, overlays & hot reload

Load order:

1. `config/coverage.yaml` (baseline)
2. `config/coverage.{jurisdiction}.{tax_year}.yaml` (if present)
3. `config/overrides/{tenant_id}.yaml` (if present)
4. Apply feature flags (if Unleash present)
5. Compile predicates; compute hash of concatenated files.

Expose `/admin/coverage/reload` to recompile; write an entry in `coverage_versions`.

---

## I) Compose & Traefik

**Add container** `svc-coverage` to `infra/compose/docker-compose.local.yml`:

- Port `8000`, labels:

  ```
  - "traefik.enable=true"
  - "traefik.http.routers.svc-coverage.rule=Host(`api.local`) && PathPrefix(`/coverage`)"
  - "traefik.http.routers.svc-coverage.entrypoints=websecure"
  - "traefik.http.routers.svc-coverage.tls=true"
  - "traefik.http.routers.svc-coverage.middlewares=authentik-forwardauth,rate-limit"
  - "traefik.http.services.svc-coverage.loadbalancer.server.port=8000"
  ```

- Mount `./config:/app/config:ro` so policy can be hot-reloaded.

---

## J) CI (Gitea) additions

- Add a job **`policy-validate`** that runs:
  - `yamllint config/coverage.yaml`
  - Policy JSON Schema validation
  - Box existence check (calls a local Neo4j with seeded `FormBox` registry or mocks via snapshot)
- Make pipeline **fail** if any validation fails.
- Ensure unit/integration tests for `svc-coverage` push test coverage to ≥ 90%.

---

## K) Tests (create all)

1. **Unit** (`tests/unit/coverage/`):
   - `test_policy_load_and_merge.py`
   - `test_predicate_compilation.py` (conditions DSL)
   - `test_status_classifier.py` (present_verified/unverified/missing/conflicting)
   - `test_question_templates.py` (string assembly, alternatives)
2. **Integration** (`tests/integration/coverage/`):
   - Spin up Neo4j with fixtures (seed form boxes + minimal rules/docs).
   - `test_check_document_coverage_happy_path.py`
   - `test_check_document_coverage_blocking_gaps.py`
   - `test_clarify_generates_citations_kg_then_rag.py` (mock RAG)
3. **E2E** (`tests/e2e/test_coverage_to_compute_flow.py`):
   - Ingest → OCR → Extract (mock) → Map → `/coverage/check` (expect blocking) → `/coverage/clarify` → upload alt doc → `/coverage/check` now ok → compute schedule.

---

## L) Error handling & codes

- Use RFC7807 Problem+JSON; standardize types:
  - `/errors/policy-invalid`, `/errors/policy-reload-failed`, `/errors/kg-query-failed`, `/errors/rag-citation-failed`
- Include `trace_id` in all errors; log with `warn/error` and span attributes `{taxpayer_id, tax_year, schedule}`.

---

## M) Acceptance criteria (DoD)

- `docker compose up` brings up `svc-coverage`.
- `POST /v1/coverage/check` returns correct **overall_status** and **blocking_items** for synthetic fixtures.
- `/v1/coverage/clarify` returns a **polite, specific question** with **boxes listed**, **upload endpoints**, and **citations**.
- `/admin/coverage/reload` picks up edited YAML without restart and logs a new `coverage_versions` row.
- `/v1/coverage/validate` returns `{ok:true}` on the provided policy; CI fails if not.
- No PII enters RAG queries (enforce `pii_free:true` filter).
- Test coverage ≥ 90% on `svc-coverage`; policy validation job green.
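
As a rough illustration of the load order in section H, which the reload criterion above depends on, the overlay merge and version hash might look like the sketch below. It works on plain dicts (as a YAML loader would return) rather than on `CoveragePolicy` models. It assumes that overlay values override baseline values, that nested mappings merge recursively, that lists are replaced wholesale, and that the hash is SHA-256; none of these choices is fixed by this spec. `deep_merge` and `policy_hash` are hypothetical helper names.

```python
import hashlib
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which overlay values win; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (assumed semantics).
            merged[key] = deepcopy(value)
    return merged


def merge_overlays(base: dict[str, Any], *overlays: dict[str, Any]) -> dict[str, Any]:
    """Apply overlays in load order: baseline, jurisdiction/year file, tenant override."""
    merged = deepcopy(base)
    for overlay in overlays:
        merged = deep_merge(merged, overlay)
    return merged


def policy_hash(file_contents: list[bytes]) -> str:
    """SHA-256 over the source files concatenated in load order."""
    digest = hashlib.sha256()
    for content in file_contents:
        digest.update(content)
    return digest.hexdigest()
```

Because the hash covers the raw file bytes in load order, any edit to a baseline or overlay file yields a new value that a reload can store in `coverage_versions.hash`, while reloading unchanged files reproduces the same value.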

---

# OUTPUT (FILES TO CREATE/UPDATE)

Generate the following files with production-quality code and docs:

```
libs/policy.py
libs/coverage_models.py
libs/coverage_schema.json
libs/coverage_eval.py
libs/neo.py # update with helpers shown
libs/rag.py # update with citation search
apps/svc-coverage/main.py
apps/svc-coverage/alembic/versions/*.py
infra/compose/docker-compose.local.yml # add service & volume
.gitea/workflows/ci.yml # add policy-validate job
tests/unit/coverage/*.py
tests/integration/coverage/*.py
tests/e2e/test_coverage_to_compute_flow.py
README.md # add section: Coverage Policy & Hot Reload
```

Use the **policy file** at `config/coverage.yaml` we already drafted. Do not change its content; only **read and validate** it.

# START

Proceed to implement and output the listed files in the order above.