ai-tax-agent/docs/BACKEND.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00

# ROLE
You are a **Senior Backend Engineer** working inside an existing monorepo that already contains the services and libraries described previously (Traefik+Authentik SSO at the edge; Python 3.12; FastAPI microservices; Vault, MinIO, Neo4j, Postgres, Redis, Qdrant; Prefect; Docker-Compose; Gitea CI).
# OBJECTIVE
Integrate the new **coverage policy** (`config/coverage.yaml`) so agents can:
1. call `CheckDocumentCoverage({tax_year, taxpayer_id})` and get a **precise, machine-readable coverage matrix** (required/conditional/optional evidence per schedule, with status and citations), and
2. call `AskClarifyingQuestion(gap, context)` to receive a **ready-to-send user question** with **why** and **citations**.
You will implement **policy loading with overlays + hot reload**, **runtime evaluation against the KG**, **citations via KG or RAG**, **validation**, **tests**, **CI**, and **deploy assets**.
---
# SCOPE (DO EXACTLY THIS)
## A) New service: `svc-coverage`
Create a dedicated microservice to encapsulate policy loading and coverage evaluation (keeps `svc-reason` calculators clean).
**Endpoints (FastAPI):**
1. `POST /v1/coverage/check`
- Body: `{"tax_year": "YYYY-YY", "taxpayer_id": "T-xxx"}`
- Returns: full coverage report (shape below).
2. `POST /v1/coverage/clarify`
- Body: `{"gap": {...}, "context": {"tax_year": "...", "taxpayer_id": "...", "jurisdiction": "UK"}}`
- Returns: `{question_text, why_it_is_needed, citations[], options_to_provide[], blocking, boxes_affected[]}`.
3. `POST /admin/coverage/reload`
- Reloads policy from files/overrides/feature flags. **Require admin group** via forwarded header.
4. `GET /v1/coverage/policy`
- Returns **current compiled policy** (no secrets, no PII), with version & sources.
5. `GET /v1/coverage/validate`
- Runs cross-checks (see Validation section). Returns `{ok: bool, errors[]}`.
**Security:**
- All routes behind Traefik+Authentik.
- `/admin/*` additionally checks `X-Authenticated-Groups` contains `admin`.
- Use the existing `TrustedProxyMiddleware`.
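
The admin check above can be sketched as a small pure function that a FastAPI dependency could call. This is a minimal sketch, not the existing middleware: `has_admin_group` is a hypothetical helper name, and the comma-separated header format is an assumption that should be verified against what the proxy actually forwards.

```python
from typing import Optional


def has_admin_group(header_value: Optional[str], required: str = "admin") -> bool:
    """Return True if the forwarded groups header lists the required group."""
    if not header_value:
        return False
    # Assumption: groups arrive as a comma-separated list. Adjust the
    # separator if the proxy forwards a different format.
    groups = {group.strip() for group in header_value.split(",")}
    # Exact membership, so "administrators" does not pass as "admin".
    return required in groups
```

Matching on whole group names rather than substrings avoids granting access to groups whose names merely contain `admin`.
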
**Observability:**
- OTel tracing, Prometheus metrics at `/metrics` (internal CIDR only), structured logs.
---
## B) Libraries & shared code (create/update)
1. **`libs/policy.py`** (new)
- Functions:
- `load_policy(baseline_path, jurisdiction, tax_year, tenant_id|None) -> CoveragePolicy`
- `merge_overlays(base, *overlays) -> CoveragePolicy`
- `apply_feature_flags(policy) -> CoveragePolicy` (optional Unleash)
- `compile_predicates(policy) -> CompiledCoveragePolicy`
(turn `condition:` DSL into callables; see DSL below)
- `watch_and_reload()` (optional watchdog; otherwise `/admin/coverage/reload`)
- Validate against JSON Schema (below). Raise `PolicyError` on failure.
2. **`libs/coverage_models.py`** (new)
- Pydantic v2 models mirroring `config/coverage.yaml`:
`CoveragePolicy, SchedulePolicy, EvidenceItem, Validity, StatusClassifier, QuestionTemplates, ConflictRules, GuidanceRef, Trigger, CoverageReport, CoverageItem, Citation, ClarifyResponse`.
- Enums: `Role = REQUIRED|CONDITIONALLY_REQUIRED|OPTIONAL`, `Status = present_verified|present_unverified|missing|conflicting`.
3. **`libs/coverage_eval.py`** (new)
- Core runtime:
- `infer_required_schedules(taxpayer_id, tax_year, policy, kg) -> list[str]`
- `find_evidence_docs(taxpayer_id, tax_year, evidence_ids, thresholds, kg) -> list[FoundEvidence]`
- `classify_status(found, thresholds, tax_year_bounds, conflicts_rules) -> Status`
- `build_reason_and_citations(schedule_id, evidence_item, status, taxpayer_id, tax_year, kg, rag) -> (str, list[Citation])`
- `check_document_coverage(...) -> CoverageReport` (implements the A→D steps we defined)
- Uses:
- `libs/neo.py` for Cypher helpers (see queries below)
- `libs/rag.py` for fallback citations (filters `{jurisdiction:'UK', tax_year}` and `pii_free:true`)
4. **`libs/coverage_schema.json`** (new)
- JSON Schema for validating `coverage.yaml`. Include:
- enum checks (`role`, `status` keys)
- `boxes[]` contains only non-empty strings
- every `evidence.id` is present in `document_kinds`, and every `acceptable_alternatives` entry points to a declared kind
- `triggers` exist for each schedule referenced under `schedules`
5. **`libs/neo.py`** (update)
- Add helpers:
- `kg_boxes_exist(box_ids: list[str]) -> dict[str,bool]`
- `kg_find_evidence(taxpayer_id, tax_year, kinds: list[str], min_ocr: float, date_window) -> list[FoundEvidence]`
- `kg_rule_citations(schedule_id, boxes: list[str]) -> list[Citation]`
6. **`libs/rag.py`** (update)
- Add `rag_search_for_citations(query, filters) -> list[Citation]` (ensure `pii_free:true` and include `doc_id/url, locator`).
---
## C) Coverage DSL for conditions (compile in `compile_predicates`)
Supported condition atoms (map to KG checks):
- `exists(Entity[filters])` e.g., `exists(ExpenseItem[category='FinanceCosts'])`
- `property_joint_ownership` (bool from KG `PropertyAsset` links)
- `candidate_FHL` (bool property on `PropertyAsset` or derived)
- `claims_FTCR`, `claims_remittance_basis` (flags on `TaxpayerProfile`)
- `turnover_lt_vat_threshold` / `turnover_ge_vat_threshold` (computed from `IncomeItem` aggregates)
- `received_estate_income`, `BenefitInKind=true`, etc.
Implementation: parse simple strings with a tiny hand-rolled parser or declarative mapping table; **do not eval** raw strings. Return callables `fn(taxpayer_id, tax_year) -> bool`.
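
One possible shape for the "mapping table plus tiny parser" approach is sketched below. It covers only the flag atoms and `exists(Entity[filters])`; the turnover and `BenefitInKind=true` atoms are left out. `InMemoryKG`, `has_flag`, and `entity_exists` are hypothetical stand-ins for the real KG helpers, and splitting filters on commas assumes filter values never contain commas.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Predicate = Callable[[str, str], bool]
Key = Tuple[str, str]  # (taxpayer_id, tax_year)


class InMemoryKG:
    """Hypothetical stand-in for the KG; real code would run Cypher."""

    def __init__(self, flags: Dict[Key, Set[str]],
                 entities: Dict[Key, List[Tuple[str, Dict[str, str]]]]):
        self.flags = flags
        self.entities = entities

    def has_flag(self, tid: str, year: str, flag: str) -> bool:
        return flag in self.flags.get((tid, year), set())

    def entity_exists(self, tid: str, year: str, label: str,
                      filters: Dict[str, str]) -> bool:
        return any(
            found_label == label
            and all(props.get(k) == v for k, v in filters.items())
            for found_label, props in self.entities.get((tid, year), [])
        )


_EXISTS_RE = re.compile(r"^exists\((\w+)(?:\[(.*)\])?\)$")
_FILTER_RE = re.compile(r"^\s*(\w+)\s*=\s*'([^']*)'\s*$")
# Declarative table of boolean atoms taken from the list above.
_FLAG_ATOMS = {
    "property_joint_ownership", "candidate_FHL", "claims_FTCR",
    "claims_remittance_basis", "received_estate_income",
}


def compile_condition(expr: str, kg: InMemoryKG) -> Predicate:
    """Turn one condition string into fn(taxpayer_id, tax_year) -> bool."""
    expr = expr.strip()
    if expr in _FLAG_ATOMS:
        return lambda tid, year: kg.has_flag(tid, year, expr)
    match = _EXISTS_RE.match(expr)
    if match:
        label, raw_filters = match.group(1), match.group(2)
        filters: Dict[str, str] = {}
        for part in (raw_filters.split(",") if raw_filters else []):
            filter_match = _FILTER_RE.match(part)
            if not filter_match:
                raise ValueError(f"Bad filter in condition: {part!r}")
            filters[filter_match.group(1)] = filter_match.group(2)
        return lambda tid, year: kg.entity_exists(tid, year, label, filters)
    # Unknown strings are rejected, never evaluated.
    raise ValueError(f"Unsupported condition: {expr!r}")
```

Because every atom is either a table entry or a regex match, an unrecognised condition fails at compile time instead of reaching `eval`.
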
---
## D) Database migrations (Postgres; Alembic)
Create two tables (new `apps/svc-coverage/alembic`):
1. `coverage_versions`
- `id` (serial pk), `version` (text), `jurisdiction` (text), `tax_year` (text), `tenant_id` (text null),
`source_files` (jsonb), `compiled_at` (timestamptz), `hash` (text)
2. `coverage_audit`
- `id` (serial pk), `taxpayer_id` (text), `tax_year` (text), `policy_version` (text),
`overall_status` (text), `blocking_items` (jsonb), `created_at` (timestamptz), `trace_id` (text)
Write to `coverage_versions` on reload; write to `coverage_audit` on each `/v1/coverage/check`.
---
## E) API Contracts (exact shapes)
### 1) `/v1/coverage/check` (request)
```json
{ "tax_year": "2024-25", "taxpayer_id": "T-001" }
```
### 1) `/v1/coverage/check` (response)
`overall_status` is one of `ok`, `partial`, or `blocking`.
```json
{
"tax_year": "2024-25",
"taxpayer_id": "T-001",
"schedules_required": ["SA102", "SA105", "SA110"],
"overall_status": "blocking",
"coverage": [
{
"schedule_id": "SA102",
"status": "partial",
"evidence": [
{
"id": "P60",
"role": "REQUIRED",
"status": "present_unverified",
"boxes": ["SA102_b1", "SA102_b2"],
"found": [
{
"doc_id": "DOC-123",
"kind": "P60",
"confidence": 0.81,
"pages": [2]
}
],
"acceptable_alternatives": ["FinalPayslipYTD", "P45"],
"reason": "P60 present but OCR confidence 0.81 < 0.82 threshold.",
"citations": [
{
"rule_id": "UK.SA102.P60.Required",
"doc_id": "SA102-Notes-2025",
"locator": "p.3 §1.1"
}
]
}
]
}
],
"blocking_items": [
{ "schedule_id": "SA105", "evidence_id": "LettingAgentStatements" }
]
}
```
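
The statuses in this response could be derived roughly as follows. This is a sketch under assumptions: the spec names the four evidence statuses and the three overall values but does not spell out the roll-up rules, so the rules below (a missing or conflicting `REQUIRED` item blocks; any other non-verified, non-optional item makes the result partial) are inferred from the example above. The function signatures are simplified for illustration.

```python
from typing import Iterable, List, Tuple


def classify_status(confidences: List[float], min_ocr: float,
                    conflicting: bool = False) -> str:
    """Classify one evidence item from the OCR confidences of found docs."""
    if not confidences:
        return "missing"
    if conflicting:
        return "conflicting"
    if max(confidences) >= min_ocr:
        return "present_verified"
    return "present_unverified"


def overall_status(items: Iterable[Tuple[str, str]]) -> str:
    """Roll up (role, status) pairs into ok | partial | blocking.

    Assumed rules, not taken from the policy file.
    """
    result = "ok"
    for role, status in items:
        if role == "REQUIRED" and status in ("missing", "conflicting"):
            return "blocking"
        if role != "OPTIONAL" and status != "present_verified":
            result = "partial"
    return result
```

With the example's threshold of 0.82, a P60 found at confidence 0.81 classifies as `present_unverified`, which matches the `reason` string shown in the response.
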
### 2) `/v1/coverage/clarify` (request)
```json
{
"gap": {
"schedule_id": "SA105",
"evidence_id": "LettingAgentStatements",
"role": "REQUIRED",
"reason": "No rent/fees statements for 2024-25.",
"boxes": ["SA105_b5", "SA105_b20", "SA105_b29"],
"citations": [
{
"rule_id": "UK.SA105.RentEvidence",
"doc_id": "SA105-Notes-2025",
"locator": "p.4 §2.1"
}
],
"acceptable_alternatives": ["TenancyLedger", "BankStatements"]
},
"context": {
"tax_year": "2024-25",
"taxpayer_id": "T-001",
"jurisdiction": "UK"
}
}
```
### 2) `/v1/coverage/clarify` (response)
```json
{
"question_text": "To complete the UK Property pages (SA105) for 2024-25, we need your letting agent statements showing total rents received, fees and charges. These support boxes SA105:5, SA105:20 and SA105:29. If you don't have agent statements, you can provide a tenancy income ledger instead.",
"why_it_is_needed": "HMRC guidance requires evidence of gross rents and allowable expenses for SA105 (see notes p.4 §2.1).",
"citations": [
{
"rule_id": "UK.SA105.RentEvidence",
"doc_id": "SA105-Notes-2025",
"locator": "p.4 §2.1"
}
],
"options_to_provide": [
{
"label": "Upload agent statements (PDF/CSV)",
"accepted_formats": ["pdf", "csv"],
"upload_endpoint": "/v1/ingest/upload?tag=LettingAgentStatements"
},
{
"label": "Upload tenancy income ledger (XLSX/CSV)",
"accepted_formats": ["xlsx", "csv"],
"upload_endpoint": "/v1/ingest/upload?tag=TenancyLedger"
}
],
"blocking": true,
"boxes_affected": ["SA105_b5", "SA105_b20", "SA105_b29"]
}
```
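
To show how the request's `gap` maps onto the response fields, here is a deliberately simple sketch. `build_clarify_response` is a hypothetical name, and the wording template, the use of the gap's `reason` as `why_it_is_needed`, and the rule that `REQUIRED` gaps are blocking are all assumptions. The real service would draw text from the policy's question templates and might offer only a subset of alternatives.

```python
from typing import Any, Dict, List


def build_clarify_response(gap: Dict[str, Any],
                           context: Dict[str, str]) -> Dict[str, Any]:
    """Assemble a clarify response from a coverage gap (illustrative only)."""
    boxes: List[str] = list(gap.get("boxes", []))
    kinds: List[str] = [gap["evidence_id"], *gap.get("acceptable_alternatives", [])]
    # One upload option per acceptable document kind, using the ingest
    # endpoint pattern shown in the example response.
    options = [
        {"label": f"Upload {kind}",
         "upload_endpoint": f"/v1/ingest/upload?tag={kind}"}
        for kind in kinds
    ]
    question = (
        f"To complete {gap['schedule_id']} for {context['tax_year']}, "
        f"we need your {gap['evidence_id']}. "
        f"This supports boxes {', '.join(boxes)}."
    )
    return {
        "question_text": question,
        "why_it_is_needed": gap.get("reason", ""),  # assumption: reuse gap reason
        "citations": list(gap.get("citations", [])),
        "options_to_provide": options,
        "blocking": gap.get("role") == "REQUIRED",  # assumption
        "boxes_affected": boxes,
    }
```
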
---
## F) KG & RAG integration (implement exactly)
### Neo4j Cypher helpers (in `libs/neo.py`)
- **Presence of evidence**
```cypher
MATCH (p:TaxpayerProfile {taxpayer_id:$tid})-[:OF_TAX_YEAR]->(y:TaxYear {label:$tax_year})
MATCH (ev:Evidence)-[:DERIVED_FROM]->(d:Document)
WHERE ((ev)-[:SUPPORTS]->(p) OR (d)-[:BELONGS_TO]->(p))
AND d.kind IN $kinds
AND date(d.date) >= date(y.start_date) AND date(d.date) <= date(y.end_date)
RETURN d.doc_id AS doc_id, d.kind AS kind, ev.page AS page, ev.bbox AS bbox, ev.ocr_confidence AS conf;
```
- **Rule citations for schedule/boxes**
```cypher
MATCH (fb:FormBox)-[:GOVERNED_BY]->(r:Rule)-[:CITES]->(doc:Document)
WHERE fb.box_id IN $box_ids
RETURN r.rule_id AS rule_id, doc.doc_id AS doc_id, doc.locator AS locator LIMIT 10;
```
- **Check boxes exist**
```cypher
UNWIND $box_ids AS bid
OPTIONAL MATCH (fb:FormBox {box_id: bid})
RETURN bid, fb IS NOT NULL AS exists;
```
### RAG fallback (in `libs/rag.py`)
- `rag_search_for_citations(query, filters={'jurisdiction':'UK','tax_year':'2024-25','pii_free':True}) -> list[Citation]`
- Use Qdrant hybrid search + rerank; return **doc_id/url** and a best-effort **locator** (heading/page).
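
The KG-first, RAG-second order can be sketched with the two lookups injected as callables, which also makes the flow easy to test with a mocked RAG. `citations_with_fallback` is a hypothetical wrapper name. The lookups stand in for `kg_rule_citations` and `rag_search_for_citations`, with signatures simplified for illustration.

```python
from typing import Any, Callable, Dict, List

Citation = Dict[str, str]


def citations_with_fallback(
    kg_lookup: Callable[[List[str]], List[Citation]],
    rag_lookup: Callable[[str, Dict[str, Any]], List[Citation]],
    box_ids: List[str],
    query: str,
    tax_year: str,
    jurisdiction: str = "UK",
) -> List[Citation]:
    """Prefer KG rule citations; fall back to RAG only when the KG has none."""
    found = kg_lookup(box_ids)
    if found:
        return found
    # The pii_free filter is always forced on so no PII-bearing chunks
    # can be returned from the RAG index.
    filters = {"jurisdiction": jurisdiction, "tax_year": tax_year, "pii_free": True}
    return rag_lookup(query, filters)
```

Building the filters inside the wrapper, rather than accepting them from the caller, keeps the `pii_free` constraint from being dropped by accident.
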
---
## G) Validation & policy correctness
Implement `/v1/coverage/validate` to run checks:
1. **YAML schema** (`libs/coverage_schema.json`) passes.
2. Every `boxes[]` exists in KG (`FormBox`).
3. Every `evidence.id` and each `acceptable_alternatives[]` is in `document_kinds`.
4. Every schedule referenced under `schedules` has a `triggers` entry.
5. Simulate a set of synthetic profiles (unit fixtures) to ensure conditional paths are exercised (e.g., with/without BIK, FHL candidate, remittance).
Return `{ok: true}` or `{ok:false, errors:[...]}`.
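
Checks 2 to 4 reduce to set lookups once the policy is loaded. The sketch below assumes a policy dict with top-level `document_kinds`, `triggers`, and `schedules` keys, each schedule holding an `evidence` list. That layout is inferred from the field names in this document, not read from `config/coverage.yaml`. `known_boxes` stands in for the result of `kg_boxes_exist`.

```python
from typing import Any, Dict, List, Set


def cross_check_policy(policy: Dict[str, Any],
                       known_boxes: Set[str]) -> Dict[str, Any]:
    """Run the box, document-kind, and trigger cross-checks on a policy."""
    errors: List[str] = []
    kinds = set(policy.get("document_kinds", []))
    triggers = policy.get("triggers", {})
    for schedule_id, schedule in policy.get("schedules", {}).items():
        if schedule_id not in triggers:
            errors.append(f"{schedule_id}: no triggers entry")
        for item in schedule.get("evidence", []):
            # Both the evidence id and its alternatives must be declared kinds.
            for kind in [item["id"], *item.get("acceptable_alternatives", [])]:
                if kind not in kinds:
                    errors.append(f"{schedule_id}: unknown document kind {kind!r}")
            for box in item.get("boxes", []):
                if box not in known_boxes:
                    errors.append(f"{schedule_id}/{item['id']}: box {box!r} not in KG")
    return {"ok": not errors, "errors": errors}
```
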
---
## H) Config loading, overlays & hot reload
Load order:
1. `config/coverage.yaml` (baseline)
2. `config/coverage.{jurisdiction}.{tax_year}.yaml` (if present)
3. `config/overrides/{tenant_id}.yaml` (if present)
4. Apply feature flags (if Unleash present)
5. Compile predicates; compute hash of concatenated files.
Expose `/admin/coverage/reload` to recompile; write an entry in `coverage_versions`.
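
The load order and hashing steps might look like the sketch below. The merge semantics are an assumption: nested dicts merge recursively while lists and scalars in an overlay replace the baseline value, and the real `merge_overlays` may choose differently. SHA-256 is likewise an assumed choice of hash, and the helper names are hypothetical.

```python
import copy
import hashlib
from pathlib import Path
from typing import Any, Dict, List, Optional


def overlay_paths(config_dir: str, jurisdiction: str, tax_year: str,
                  tenant_id: Optional[str]) -> List[Path]:
    """Candidate policy files in load order; callers skip ones that don't exist."""
    root = Path(config_dir)
    paths = [root / "coverage.yaml",
             root / f"coverage.{jurisdiction}.{tax_year}.yaml"]
    if tenant_id:
        paths.append(root / "overrides" / f"{tenant_id}.yaml")
    return paths


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with overlay applied on top of base (base untouched)."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)  # lists and scalars replace
    return merged


def policy_hash(file_contents: List[bytes]) -> str:
    """SHA-256 over the concatenated source files, in load order."""
    return hashlib.sha256(b"".join(file_contents)).hexdigest()
```

Storing the resulting digest in `coverage_versions.hash` lets a reload detect whether anything actually changed.
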
---
## I) Compose & Traefik
**Add container** `svc-coverage` to `infra/compose/docker-compose.local.yml`:
- Port `8000`, labels:
```
- "traefik.enable=true"
- "traefik.http.routers.svc-coverage.rule=Host(`api.local`) && PathPrefix(`/coverage`)"
- "traefik.http.routers.svc-coverage.entrypoints=websecure"
- "traefik.http.routers.svc-coverage.tls=true"
- "traefik.http.routers.svc-coverage.middlewares=authentik-forwardauth,rate-limit"
- "traefik.http.services.svc-coverage.loadbalancer.server.port=8000"
```
- Mount `./config:/app/config:ro` so policy can be hot-reloaded.
---
## J) CI (Gitea) additions
- Add a job **`policy-validate`** that runs:
- `yamllint config/coverage.yaml`
- Policy JSON Schema validation
- Box existence check (calls a local Neo4j with seeded `FormBox` registry or mocks via snapshot)
- Make pipeline **fail** if any validation fails.
- Ensure unit/integration tests for `svc-coverage` reach ≥ 90% code coverage.
---
## K) Tests (create all)
1. **Unit** (`tests/unit/coverage/`):
- `test_policy_load_and_merge.py`
- `test_predicate_compilation.py` (conditions DSL)
- `test_status_classifier.py` (present_verified/unverified/missing/conflicting)
- `test_question_templates.py` (string assembly, alternatives)
2. **Integration** (`tests/integration/coverage/`):
- Spin up Neo4j with fixtures (seed form boxes + minimal rules/docs).
- `test_check_document_coverage_happy_path.py`
- `test_check_document_coverage_blocking_gaps.py`
- `test_clarify_generates_citations_kg_then_rag.py` (mock RAG)
3. **E2E** (`tests/e2e/test_coverage_to_compute_flow.py`):
- Ingest → OCR → Extract (mock) → Map → `/coverage/check` (expect blocking) → `/coverage/clarify` → upload alt doc → `/coverage/check` now ok → compute schedule.
---
## L) Error handling & codes
- Use RFC7807 Problem+JSON; standardize types:
- `/errors/policy-invalid`, `/errors/policy-reload-failed`, `/errors/kg-query-failed`, `/errors/rag-citation-failed`
- Include `trace_id` in all errors; log with `warn/error` and span attributes `{taxpayer_id, tax_year, schedule}`.
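
A Problem+JSON body for these error types could be built as follows. This is a sketch: `make_problem` and `PROBLEM_CONTENT_TYPE` are hypothetical names. `type`, `title`, `status`, and `detail` are standard RFC 7807 members, while `trace_id` is carried as an extension member.

```python
from typing import Any, Dict

PROBLEM_CONTENT_TYPE = "application/problem+json"

# The standardized error types listed above.
KNOWN_TYPES = {
    "policy-invalid", "policy-reload-failed",
    "kg-query-failed", "rag-citation-failed",
}


def make_problem(kind: str, title: str, status: int,
                 detail: str, trace_id: str) -> Dict[str, Any]:
    """Build an RFC 7807 problem document with a trace_id extension."""
    if kind not in KNOWN_TYPES:
        raise ValueError(f"Unknown problem type: {kind!r}")
    return {
        "type": f"/errors/{kind}",
        "title": title,
        "status": status,
        "detail": detail,
        "trace_id": trace_id,
    }
```

A route handler would return this dict with the `application/problem+json` media type and the same HTTP status code as the `status` member.
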
---
## M) Acceptance criteria (DoD)
- `docker compose up` brings up `svc-coverage`.
- `POST /v1/coverage/check` returns correct **overall_status** and **blocking_items** for synthetic fixtures.
- `/v1/coverage/clarify` returns a **polite, specific question** with **boxes listed**, **upload endpoints**, and **citations**.
- `/admin/coverage/reload` picks up edited YAML without restart and writes a new `coverage_versions` row.
- `/v1/coverage/validate` returns `{ok:true}` on the provided policy; CI fails if not.
- No PII enters RAG queries (enforce `pii_free:true` filter).
- Code coverage ≥ 90% on `svc-coverage`; policy validation job green.
---
# OUTPUT (FILES TO CREATE/UPDATE)
Generate the following files with production-quality code and docs:
```
libs/policy.py
libs/coverage_models.py
libs/coverage_schema.json
libs/coverage_eval.py
libs/neo.py # update with helpers shown
libs/rag.py # update with citation search
apps/svc-coverage/main.py
apps/svc-coverage/alembic/versions/*.py
infra/compose/docker-compose.local.yml # add service & volume
.gitea/workflows/ci.yml # add policy-validate job
tests/unit/coverage/*.py
tests/integration/coverage/*.py
tests/e2e/test_coverage_to_compute_flow.py
README.md # add section: Coverage Policy & Hot Reload
```
Use the **policy file** at `config/coverage.yaml` we already drafted. Do not change its content; only **read and validate** it.
# START
Proceed to implement and output the listed files in the order above.