# ROLE

You are a **Senior Backend Engineer** working inside an existing monorepo that already contains the services and libraries described previously (Traefik+Authentik SSO at the edge; Python 3.12; FastAPI microservices; Vault, MinIO, Neo4j, Postgres, Redis, Qdrant; Prefect; Docker-Compose; Gitea CI).

# OBJECTIVE

Integrate the new **coverage policy** (`config/coverage.yaml`) so agents can:

1. call `CheckDocumentCoverage({tax_year, taxpayer_id})` and get a **precise, machine-readable coverage matrix** (required/conditional/optional evidence per schedule, with status and citations), and
2. call `AskClarifyingQuestion(gap, context)` to receive a **ready-to-send user question** with **why** and **citations**.

You will implement **policy loading with overlays + hot reload**, **runtime evaluation against the KG**, **citations via KG or RAG**, **validation**, **tests**, **CI**, and **deploy assets**.

---

# SCOPE (DO EXACTLY THIS)

## A) New service: `svc-coverage`

Create a dedicated microservice to encapsulate policy loading and coverage evaluation (keeps `svc-reason` calculators clean).

**Endpoints (FastAPI):**

1. `POST /v1/coverage/check`

   - Body: `{"tax_year": "YYYY-YY", "taxpayer_id": "T-xxx"}`
   - Returns: full coverage report (shape below).

2. `POST /v1/coverage/clarify`

   - Body: `{"gap": {...}, "context": {"tax_year": "...", "taxpayer_id": "...", "jurisdiction": "UK"}}`
   - Returns: `{question_text, why_it_is_needed, citations[], options_to_provide[], blocking, boxes_affected[]}`.

3. `POST /admin/coverage/reload`

   - Reloads policy from files/overrides/feature flags. **Require admin group** via forwarded header.

4. `GET /v1/coverage/policy`

   - Returns **current compiled policy** (no secrets, no PII), with version & sources.

5. `GET /v1/coverage/validate`

   - Runs cross-checks (see Validation section). Returns `{ok: bool, errors[]}`.

**Security:**

- All routes behind Traefik+Authentik.
- `/admin/*` additionally checks `X-Authenticated-Groups` contains `admin`.
- Use the existing `TrustedProxyMiddleware`.

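The admin-group rule above could be sketched as a small pure function that a FastAPI dependency would call. The helper name `has_admin_group` and the comma-separated header format are assumptions; the spec only says that `X-Authenticated-Groups` must contain `admin`, so the separator is left as a parameter.

```python
from typing import Mapping


def has_admin_group(
    headers: Mapping[str, str],
    header_name: str = "X-Authenticated-Groups",
    admin_group: str = "admin",
    separator: str = ",",  # assumption: adjust to whatever the proxy emits
) -> bool:
    """Return True only if the forwarded groups header lists the admin group."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(header_name.lower(), "")
    # Compare whole group names; a substring test would accept "not-admin".
    groups = {group.strip() for group in raw.split(separator) if group.strip()}
    return admin_group in groups
```

Matching on whole group names rather than substrings matters here: a naive `"admin" in header_value` check would let a group such as `not-admin` through. The header is only trustworthy because requests arrive through the trusted proxy.
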
**Observability:**

- OTel tracing, Prometheus metrics at `/metrics` (internal CIDR only), structured logs.

---

## B) Libraries & shared code (create/update)

1. **`libs/policy.py`** (new)

   - Functions:

     - `load_policy(baseline_path, jurisdiction, tax_year, tenant_id|None) -> CoveragePolicy`
     - `merge_overlays(base, *overlays) -> CoveragePolicy`
     - `apply_feature_flags(policy) -> CoveragePolicy` (optional Unleash)
     - `compile_predicates(policy) -> CompiledCoveragePolicy` (turn `condition:` DSL into callables; see DSL below)
     - `watch_and_reload()` (optional watchdog; otherwise `/admin/coverage/reload`)

   - Validate against JSON Schema (below). Raise `PolicyError` on failure.

2. **`libs/coverage_models.py`** (new)

   - Pydantic v2 models mirroring `config/coverage.yaml`:
     `CoveragePolicy, SchedulePolicy, EvidenceItem, Validity, StatusClassifier, QuestionTemplates, ConflictRules, GuidanceRef, Trigger, CoverageReport, CoverageItem, Citation, ClarifyResponse`.
   - Enums: `Role = REQUIRED|CONDITIONALLY_REQUIRED|OPTIONAL`, `Status = present_verified|present_unverified|missing|conflicting`.

3. **`libs/coverage_eval.py`** (new)

   - Core runtime:

     - `infer_required_schedules(taxpayer_id, tax_year, policy, kg) -> list[str]`
     - `find_evidence_docs(taxpayer_id, tax_year, evidence_ids, thresholds, kg) -> list[FoundEvidence]`
     - `classify_status(found, thresholds, tax_year_bounds, conflicts_rules) -> Status`
     - `build_reason_and_citations(schedule_id, evidence_item, status, taxpayer_id, tax_year, kg, rag) -> (str, list[Citation])`
     - `check_document_coverage(...) -> CoverageReport` (implements the A→D steps we defined)

   - Uses:

     - `libs/neo.py` for Cypher helpers (see queries below)
     - `libs/rag.py` for fallback citations (filters `{jurisdiction:'UK', tax_year}` and `pii_free:true`)

4. **`libs/coverage_schema.json`** (new)

   - JSON Schema for validating `coverage.yaml`. Include:

     - enum checks (`role`, `status` keys)
     - `boxes[]` contains only non-empty strings
     - every `evidence.id` is declared in `document_kinds`, and every `acceptable_alternatives[]` entry points to a declared kind
     - a `triggers` entry exists for each schedule referenced under `schedules`

5. **`libs/neo.py`** (update)

   - Add helpers:

     - `kg_boxes_exist(box_ids: list[str]) -> dict[str,bool]`
     - `kg_find_evidence(taxpayer_id, tax_year, kinds: list[str], min_ocr: float, date_window) -> list[FoundEvidence]`
     - `kg_rule_citations(schedule_id, boxes: list[str]) -> list[Citation]`

6. **`libs/rag.py`** (update)

   - Add `rag_search_for_citations(query, filters) -> list[Citation]` (ensure `pii_free:true` and include `doc_id/url, locator`).

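As an illustration of the `Status` enum and `classify_status` listed above, here is a minimal sketch. It deliberately simplifies the signature: instead of `found, thresholds, tax_year_bounds, conflicts_rules` it takes a list of OCR confidences, a single `min_ocr` threshold, and a precomputed `has_conflict` flag. Those parameters, and the rule that the best-scoring document decides the outcome, are assumptions; only the four status values come from this spec.

```python
from enum import Enum
from typing import Sequence


class Status(str, Enum):
    PRESENT_VERIFIED = "present_verified"
    PRESENT_UNVERIFIED = "present_unverified"
    MISSING = "missing"
    CONFLICTING = "conflicting"


def classify_status(
    confidences: Sequence[float],
    min_ocr: float,
    has_conflict: bool = False,
) -> Status:
    """Classify one evidence item from the OCR confidences of its found documents."""
    if not confidences:
        return Status.MISSING
    if has_conflict:
        # Assumption: conflict detection runs upstream and outranks confidence.
        return Status.CONFLICTING
    if max(confidences) >= min_ocr:
        return Status.PRESENT_VERIFIED
    # Something was found, but no document clears the threshold.
    return Status.PRESENT_UNVERIFIED
```

With a threshold of `0.82`, a single document at confidence `0.81` classifies as `present_unverified`, which is the situation the sample `/v1/coverage/check` response in section E describes.
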

---

## C) Coverage DSL for conditions (compile in `compile_predicates`)

Supported condition atoms (map to KG checks):

- `exists(Entity[filters])` e.g., `exists(ExpenseItem[category='FinanceCosts'])`
- `property_joint_ownership` (bool from KG `PropertyAsset` links)
- `candidate_FHL` (bool property on `PropertyAsset` or derived)
- `claims_FTCR`, `claims_remittance_basis` (flags on `TaxpayerProfile`)
- `turnover_lt_vat_threshold` / `turnover_ge_vat_threshold` (computed from `IncomeItem` aggregates)
- `received_estate_income`, `BenefitInKind=true`, etc.

Implementation: parse simple strings with a tiny hand-rolled parser or declarative mapping table; **do not eval** raw strings. Return callables `fn(taxpayer_id, tax_year) -> bool`.

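One way to compile these atoms without `eval` is a mapping table for bare flags plus a regular expression for `exists(...)`. The sketch below is illustrative only: `compile_condition`, the `flags` table, and the `exists_check` callback (standing in for a KG lookup) are hypothetical names, and atoms of the `BenefitInKind=true` form are not handled.

```python
import re
from typing import Callable, Dict

Predicate = Callable[[str, str], bool]  # fn(taxpayer_id, tax_year) -> bool

# exists(Entity[key='value', ...]) is the only structured atom in this sketch.
_EXISTS_RE = re.compile(r"^exists\((?P<entity>\w+)\[(?P<filters>[^\]]*)\]\)$")
_FILTER_RE = re.compile(r"^\s*(?P<key>\w+)\s*=\s*'(?P<value>[^']*)'\s*$")


def compile_condition(
    condition: str,
    flags: Dict[str, Predicate],
    exists_check: Callable[[str, Dict[str, str], str, str], bool],
) -> Predicate:
    """Turn one condition string into a callable without using eval()."""
    condition = condition.strip()
    if condition in flags:  # bare atoms such as claims_FTCR
        return flags[condition]
    match = _EXISTS_RE.match(condition)
    if match is None:
        # Anything outside the whitelist is rejected, never executed.
        raise ValueError(f"unsupported condition: {condition!r}")
    filters: Dict[str, str] = {}
    for part in filter(None, match.group("filters").split(",")):
        key_value = _FILTER_RE.match(part)
        if key_value is None:
            raise ValueError(f"bad filter in condition: {part!r}")
        filters[key_value.group("key")] = key_value.group("value")
    entity = match.group("entity")
    return lambda taxpayer_id, tax_year: exists_check(
        entity, filters, taxpayer_id, tax_year
    )
```

Because unknown strings raise `ValueError` at compile time, a malformed or hostile `condition:` value surfaces during policy load or `/v1/coverage/validate` rather than at evaluation time.
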

---

## D) Database migrations (Postgres; Alembic)

Create two tables (new `apps/svc-coverage/alembic`):

1. `coverage_versions`

   - `id` (serial pk), `version` (text), `jurisdiction` (text), `tax_year` (text), `tenant_id` (text null),
     `source_files` (jsonb), `compiled_at` (timestamptz), `hash` (text)

2. `coverage_audit`

   - `id` (serial pk), `taxpayer_id` (text), `tax_year` (text), `policy_version` (text),
     `overall_status` (text), `blocking_items` (jsonb), `created_at` (timestamptz), `trace_id` (text)

Write to `coverage_versions` on reload; write to `coverage_audit` on each `/v1/coverage/check`.

---

## E) API Contracts (exact shapes)

### 1) `/v1/coverage/check` (request)

```json
{ "tax_year": "2024-25", "taxpayer_id": "T-001" }
```

### 1) `/v1/coverage/check` (response)

```json
{
  "tax_year": "2024-25",
  "taxpayer_id": "T-001",
  "schedules_required": ["SA102", "SA105", "SA110"],
  "overall_status": "blocking", // ok | partial | blocking
  "coverage": [
    {
      "schedule_id": "SA102",
      "status": "partial",
      "evidence": [
        {
          "id": "P60",
          "role": "REQUIRED",
          "status": "present_unverified",
          "boxes": ["SA102_b1", "SA102_b2"],
          "found": [
            {
              "doc_id": "DOC-123",
              "kind": "P60",
              "confidence": 0.81,
              "pages": [2]
            }
          ],
          "acceptable_alternatives": ["FinalPayslipYTD", "P45"],
          "reason": "P60 present but OCR confidence 0.81 < 0.82 threshold.",
          "citations": [
            {
              "rule_id": "UK.SA102.P60.Required",
              "doc_id": "SA102-Notes-2025",
              "locator": "p.3 §1.1"
            }
          ]
        }
      ]
    }
  ],
  "blocking_items": [
    { "schedule_id": "SA105", "evidence_id": "LettingAgentStatements" }
  ]
}
```

### 2) `/v1/coverage/clarify` (request)

```json
{
  "gap": {
    "schedule_id": "SA105",
    "evidence_id": "LettingAgentStatements",
    "role": "REQUIRED",
    "reason": "No rent/fees statements for 2024–25.",
    "boxes": ["SA105_b5", "SA105_b20", "SA105_b29"],
    "citations": [
      {
        "rule_id": "UK.SA105.RentEvidence",
        "doc_id": "SA105-Notes-2025",
        "locator": "p.4 §2.1"
      }
    ],
    "acceptable_alternatives": ["TenancyLedger", "BankStatements"]
  },
  "context": {
    "tax_year": "2024-25",
    "taxpayer_id": "T-001",
    "jurisdiction": "UK"
  }
}
```

### 2) `/v1/coverage/clarify` (response)

```json
{
  "question_text": "To complete the UK Property pages (SA105) for 2024–25, we need your letting agent statements showing total rents received, fees and charges. These support boxes SA105:5, SA105:20 and SA105:29. If you don’t have agent statements, you can provide a tenancy income ledger instead.",
  "why_it_is_needed": "HMRC guidance requires evidence of gross rents and allowable expenses for SA105 (see notes p.4 §2.1).",
  "citations": [
    {
      "rule_id": "UK.SA105.RentEvidence",
      "doc_id": "SA105-Notes-2025",
      "locator": "p.4 §2.1"
    }
  ],
  "options_to_provide": [
    {
      "label": "Upload agent statements (PDF/CSV)",
      "accepted_formats": ["pdf", "csv"],
      "upload_endpoint": "/v1/ingest/upload?tag=LettingAgentStatements"
    },
    {
      "label": "Upload tenancy income ledger (XLSX/CSV)",
      "accepted_formats": ["xlsx", "csv"],
      "upload_endpoint": "/v1/ingest/upload?tag=TenancyLedger"
    }
  ],
  "blocking": true,
  "boxes_affected": ["SA105_b5", "SA105_b20", "SA105_b29"]
}
```

---

## F) KG & RAG integration (implement exactly)

### Neo4j Cypher helpers (in `libs/neo.py`)

- **Presence of evidence**

```cypher
MATCH (p:TaxpayerProfile {taxpayer_id:$tid})-[:OF_TAX_YEAR]->(y:TaxYear {label:$tax_year})
MATCH (ev:Evidence)-[:DERIVED_FROM]->(d:Document)
// Parenthesise the OR: AND binds tighter, so without the parentheses any
// evidence that SUPPORTS the profile would bypass the kind and date filters.
WHERE ((ev)-[:SUPPORTS]->(p) OR (d)-[:BELONGS_TO]->(p))
  AND d.kind IN $kinds
  AND date(d.date) >= date(y.start_date) AND date(d.date) <= date(y.end_date)
RETURN d.doc_id AS doc_id, d.kind AS kind, ev.page AS page, ev.bbox AS bbox, ev.ocr_confidence AS conf;
```

- **Rule citations for schedule/boxes**

```cypher
MATCH (fb:FormBox)-[:GOVERNED_BY]->(r:Rule)-[:CITES]->(doc:Document)
WHERE fb.box_id IN $box_ids
RETURN r.rule_id AS rule_id, doc.doc_id AS doc_id, doc.locator AS locator LIMIT 10;
```

- **Check boxes exist**

```cypher
UNWIND $box_ids AS bid
OPTIONAL MATCH (fb:FormBox {box_id: bid})
RETURN bid, fb IS NOT NULL AS exists;
```

### RAG fallback (in `libs/rag.py`)

- `rag_search_for_citations(query, filters={'jurisdiction':'UK','tax_year':'2024-25','pii_free':true}) -> list[Citation]`
- Use Qdrant hybrid search + rerank; return **doc_id/url** and a best-effort **locator** (heading/page).

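The "KG first, RAG as fallback" order and the mandatory `pii_free` filter could be captured in two small helpers. Both names (`build_citation_filters`, `citations_with_fallback`) are hypothetical, and the lookups are passed in as plain callables so the sketch stays independent of Neo4j and Qdrant.

```python
from typing import Any, Callable, Dict, List, Optional


def build_citation_filters(
    tax_year: str,
    jurisdiction: str = "UK",
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build RAG filters, forcing pii_free=True regardless of caller input."""
    filters: Dict[str, Any] = dict(extra or {})
    filters.update({"jurisdiction": jurisdiction, "tax_year": tax_year})
    # Set last so a caller-supplied pii_free=False can never win.
    filters["pii_free"] = True
    return filters


def citations_with_fallback(
    kg_lookup: Callable[[], List[dict]],
    rag_lookup: Callable[[], List[dict]],
) -> List[dict]:
    """Prefer KG rule citations; query RAG only when the KG returns none."""
    citations = kg_lookup()
    return citations if citations else rag_lookup()
```

Passing the RAG search as a callable means it is never invoked when the KG already yields citations, which keeps the fallback cheap and easy to mock in tests.
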

---

## G) Validation & policy correctness

Implement `/v1/coverage/validate` to run checks:

1. **YAML schema** (`libs/coverage_schema.json`) passes.
2. Every `boxes[]` exists in KG (`FormBox`).
3. Every `evidence.id` and each `acceptable_alternatives[]` is in `document_kinds`.
4. Every schedule referenced under `schedules` has a `triggers` entry.
5. Simulate a set of synthetic profiles (unit fixtures) to ensure conditional paths are exercised (e.g., with/without BIK, FHL candidate, remittance).

Return `{ok: true}` or `{ok:false, errors:[...]}`.

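Checks 2 to 4 are plain set-membership tests once the policy is loaded. The sketch below assumes a hypothetical policy layout (`document_kinds` as a list, `schedules` and `triggers` as mappings keyed by schedule ID, each schedule holding an `evidence` list); the real structure of `config/coverage.yaml` is not shown here and may differ. `known_boxes` stands in for the result of the KG box lookup.

```python
from typing import Any, Dict, List, Set


def cross_check_policy(policy: Dict[str, Any], known_boxes: Set[str]) -> Dict[str, Any]:
    """Run the structural cross-checks and return the {ok, errors} shape."""
    errors: List[str] = []
    kinds = set(policy.get("document_kinds", []))
    triggers = policy.get("triggers", {})
    for schedule_id, schedule in policy.get("schedules", {}).items():
        # Check 4: every schedule needs a triggers entry.
        if schedule_id not in triggers:
            errors.append(f"{schedule_id}: no triggers entry")
        for item in schedule.get("evidence", []):
            # Check 3: evidence IDs and alternatives must be declared kinds.
            for kind in [item["id"], *item.get("acceptable_alternatives", [])]:
                if kind not in kinds:
                    errors.append(f"{schedule_id}: unknown document kind {kind!r}")
            # Check 2: every referenced box must exist in the KG.
            for box in item.get("boxes", []):
                if box not in known_boxes:
                    errors.append(f"{schedule_id}: unknown box {box!r}")
    return {"ok": True} if not errors else {"ok": False, "errors": errors}
```

Collecting every error instead of stopping at the first keeps the endpoint useful as a CI report, since one run lists all policy problems.
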

---

## H) Config loading, overlays & hot reload

Load order:

1. `config/coverage.yaml` (baseline)
2. `config/coverage.{jurisdiction}.{tax_year}.yaml` (if present)
3. `config/overrides/{tenant_id}.yaml` (if present)
4. Apply feature flags (if Unleash present)
5. Compile predicates; compute hash of concatenated files.

Expose `/admin/coverage/reload` to recompile; write an entry in `coverage_versions`.

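Steps 1 to 3 and the hash in step 5 can be sketched with the standard library alone. Several details here are assumptions rather than requirements of this spec: SHA-256 as the hash function, overlay lists replacing baseline lists wholesale, and the helper names themselves. YAML parsing is left out, so `deep_merge` operates on already-parsed dictionaries.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional


def candidate_files(
    config_dir: Path, jurisdiction: str, tax_year: str, tenant_id: Optional[str]
) -> List[Path]:
    """Return policy files in load order; overlays are skipped when absent."""
    baseline = config_dir / "coverage.yaml"
    overlays = [config_dir / f"coverage.{jurisdiction}.{tax_year}.yaml"]
    if tenant_id:
        overlays.append(config_dir / "overrides" / f"{tenant_id}.yaml")
    return [baseline] + [path for path in overlays if path.exists()]


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overlay into a copy of base; later values win."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced wholesale, not appended.
            merged[key] = value
    return merged


def policy_hash(file_contents: Iterable[bytes]) -> str:
    """SHA-256 over the file contents concatenated in load order."""
    digest = hashlib.sha256()
    for content in file_contents:
        digest.update(content)
    return digest.hexdigest()
```

Hashing the raw bytes in load order means any edit to any source file changes the recorded hash, so consecutive `coverage_versions` rows can be compared to tell whether a reload actually changed the policy.
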

---

## I) Compose & Traefik

**Add container** `svc-coverage` to `infra/compose/docker-compose.local.yml`:

- Port `8000`, labels:

```
- "traefik.enable=true"
- "traefik.http.routers.svc-coverage.rule=Host(`api.local`) && PathPrefix(`/coverage`)"
- "traefik.http.routers.svc-coverage.entrypoints=websecure"
- "traefik.http.routers.svc-coverage.tls=true"
- "traefik.http.routers.svc-coverage.middlewares=authentik-forwardauth,rate-limit"
- "traefik.http.services.svc-coverage.loadbalancer.server.port=8000"
```

- Mount `./config:/app/config:ro` so policy can be hot-reloaded.

---

## J) CI (Gitea) additions

- Add a job **`policy-validate`** that runs:

  - `yamllint config/coverage.yaml`
  - Policy JSON Schema validation
  - Box existence check (calls a local Neo4j with seeded `FormBox` registry or mocks via snapshot)

- Make the pipeline **fail** if any validation fails.
- Ensure unit/integration tests for `svc-coverage` bring test coverage to ≥ 90%.

---

## K) Tests (create all)

1. **Unit** (`tests/unit/coverage/`):

   - `test_policy_load_and_merge.py`
   - `test_predicate_compilation.py` (conditions DSL)
   - `test_status_classifier.py` (present_verified/unverified/missing/conflicting)
   - `test_question_templates.py` (string assembly, alternatives)

2. **Integration** (`tests/integration/coverage/`):

   - Spin up Neo4j with fixtures (seed form boxes + minimal rules/docs).
   - `test_check_document_coverage_happy_path.py`
   - `test_check_document_coverage_blocking_gaps.py`
   - `test_clarify_generates_citations_kg_then_rag.py` (mock RAG)

3. **E2E** (`tests/e2e/test_coverage_to_compute_flow.py`):

   - Ingest → OCR → Extract (mock) → Map → `/coverage/check` (expect blocking) → `/coverage/clarify` → upload alt doc → `/coverage/check` now ok → compute schedule.

---

## L) Error handling & codes

- Use RFC7807 Problem+JSON; standardize types:

  - `/errors/policy-invalid`, `/errors/policy-reload-failed`, `/errors/kg-query-failed`, `/errors/rag-citation-failed`

- Include `trace_id` in all errors; log at `warn`/`error` level with span attributes `{taxpayer_id, tax_year, schedule}`.

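A Problem+JSON body for these error types might be assembled as below. The `type`, `title`, `status`, and `detail` members are the standard RFC 7807 fields and `trace_id` is carried as an extension member; the helper name `problem` and the idea of keying types by a short name are assumptions, and the HTTP status for each type is left to the caller because this spec does not fix it.

```python
from typing import Any, Dict, Optional

PROBLEM_MEDIA_TYPE = "application/problem+json"

# The four standardized error types from this section.
PROBLEM_TYPES = {
    "policy-invalid": "/errors/policy-invalid",
    "policy-reload-failed": "/errors/policy-reload-failed",
    "kg-query-failed": "/errors/kg-query-failed",
    "rag-citation-failed": "/errors/rag-citation-failed",
}


def problem(
    kind: str,
    title: str,
    status: int,
    trace_id: str,
    detail: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an RFC 7807 body; unknown kinds raise KeyError by design."""
    body: Dict[str, Any] = {
        "type": PROBLEM_TYPES[kind],
        "title": title,
        "status": status,
        "trace_id": trace_id,  # extension member, required on every error
    }
    if detail:
        body["detail"] = detail
    return body
```

Making `trace_id` a required positional argument means an error response cannot be built without one, which enforces the "include `trace_id` in all errors" rule at the call site.
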

---

## M) Acceptance criteria (DoD)

- `docker compose up` brings up `svc-coverage`.
- `POST /v1/coverage/check` returns correct **overall_status** and **blocking_items** for synthetic fixtures.
- `/v1/coverage/clarify` returns a **polite, specific question** with **boxes listed**, **upload endpoints**, and **citations**.
- `/admin/coverage/reload` picks up edited YAML without restart and writes a new `coverage_versions` row.
- `/v1/coverage/validate` returns `{ok:true}` on the provided policy; CI fails if not.
- No PII enters RAG queries (enforce `pii_free:true` filter).
- Test coverage ≥ 90% on `svc-coverage`; policy validation job green.

---

# OUTPUT (FILES TO CREATE/UPDATE)

Generate the following files with production-quality code and docs:

```
libs/policy.py
libs/coverage_models.py
libs/coverage_schema.json
libs/coverage_eval.py
libs/neo.py # update with helpers shown
libs/rag.py # update with citation search
apps/svc-coverage/main.py
apps/svc-coverage/alembic/versions/*.py
infra/compose/docker-compose.local.yml # add service & volume
.gitea/workflows/ci.yml # add policy-validate job
tests/unit/coverage/*.py
tests/integration/coverage/*.py
tests/e2e/test_coverage_to_compute_flow.py
README.md # add section: Coverage Policy & Hot Reload
```

Use the **policy file** at `config/coverage.yaml` we already drafted. Do not change its content; only **read and validate** it.

# START

Proceed to implement and output the listed files in the order above.