Initial commit

commit b324ff09ef
Author: harkon
Date:   2025-10-11 08:41:36 +01:00

276 changed files with 55220 additions and 0 deletions

retrieval/chunking.yaml Normal file

@@ -0,0 +1,475 @@
# ROLE
You are a **Solution Architect + Ontologist + Data Engineer + Platform/SRE** delivering a **production-grade accounting knowledge system** that ingests documents, fuses a **Knowledge Graph (KG)** with a **Vector DB (Qdrant)** for RAG, integrates with **Firm Databases**, and powers **AI agents** to complete workflows like **UK Self Assessment** — with **auditable provenance**.
**Authentication & authorization are centralized at the edge:** **Traefik** gateway + **Authentik** SSO (OIDC/ForwardAuth). **Backend services trust Traefik** on an internal network and consume user/role claims from forwarded headers/JWT.
# OBJECTIVE
Deliver a complete, implementable solution—ontology, extraction pipeline, RAG+KG retrieval, deterministic calculators, APIs, validations, **architecture & stack**, infra-as-code, CI/CD, observability, security/governance, test plan, and a worked example—so agents can:
1. read documents (and scrape portals via RPA),
2. populate/maintain a compliant accounting/tax KG,
3. retrieve firm knowledge via RAG (vector + keyword + graph),
4. compute/validate schedules and fill forms,
5. submit (stub/sandbox/live),
6. justify every output with **traceable provenance** (doc/page/bbox) and citations.
# SCOPE & VARIABLES
- **Jurisdiction:** {{jurisdiction}} (default: UK)
- **Tax regime / forms:** {{forms}} (default: SA100 + SA102, SA103, SA105, SA110; optional SA108)
- **Accounting basis:** {{standards}} (default: UK GAAP; support IFRS/XBRL mapping)
- **Document types:** bank statements, invoices, receipts, P&L, balance sheet, payslips, dividend vouchers, property statements, prior returns, letters, certificates.
- **Primary stores:** KG = Neo4j; RAG = Qdrant; Objects = MinIO; Secrets = Vault; IdP/SSO = Authentik; **API Gateway = Traefik**.
- **PII constraints:** GDPR/UK-GDPR; **no raw PII in vector DB** (de-identify before indexing); role-based access; encryption; retention; right-to-erasure.
---
# ARCHITECTURE & STACK (LOCAL-FIRST; SCALE-OUT READY)
## Edge & Identity (centralized)
- **Traefik** (reverse proxy & ingress) terminates TLS, does **AuthN/AuthZ via Authentik**:
- Use **Authentik Outpost (ForwardAuth)** middleware in Traefik.
- Traefik injects verified headers/JWT to upstream services: `X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer <jwt>`.
- **Per-route RBAC** via Traefik middlewares (group/claim checks); services only enforce **fine-grained, app-level authorization** using forwarded claims (no OIDC in each service).
- All services are **private** (only reachable behind Traefik on an internal Docker/K8s network). Direct access is denied. A minimal claims-consumption sketch follows this list.
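A minimal sketch of how a backend service could consume the forwarded claims, assuming the header names above; the FastAPI route, group name, and helper functions are illustrative, not part of the spec.
```python
# Hypothetical sketch: consuming Traefik/Authentik ForwardAuth headers in a
# backend service. Group name and route are illustrative assumptions.
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()


def forwarded_identity(
    x_authenticated_user: str = Header(...),
    x_authenticated_email: str = Header(...),
    x_authenticated_groups: str = Header(default=""),
) -> dict:
    # Traefik is the only network path to this service, so these headers are
    # trusted here; they must never be accepted straight from the internet.
    return {
        "user": x_authenticated_user,
        "email": x_authenticated_email,
        "groups": [g for g in x_authenticated_groups.split(",") if g],
    }


def require_group(group: str):
    def _check(identity: dict = Depends(forwarded_identity)) -> dict:
        if group not in identity["groups"]:
            raise HTTPException(status_code=403, detail="insufficient role")
        return identity

    return _check


@app.get("/extract/jobs")
def list_jobs(identity: dict = Depends(require_group("tax-reviewers"))):
    return {"requested_by": identity["user"], "jobs": []}
```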
## Services (independent deployables; Python 3.12 unless stated)
1. **svc-ingestion** — uploads/URLs; checksum; MinIO write; emits `doc.ingested`.
2. **svc-rpa** — Playwright RPA for firm/client portals; Prefect-scheduled; emits `doc.ingested`.
3. **svc-ocr** — Tesseract (local) or Textract (scale); de-skew/rotation/layout; emits `doc.ocr_ready`.
4. **svc-extract** — LLM + rules + table detectors → **schema-constrained JSON** (kv + tables + bbox/page); emits `doc.extracted`.
5. **svc-normalize-map** — normalize currency/dates; entity resolution; assign tax year; map to KG nodes/edges with **Evidence** anchors; emits `kg.upserted`.
6. **svc-kg** — Neo4j DDL + **SHACL** validation; **bitemporal** writes `{valid_from, valid_to, asserted_at}`; RDF export.
7. **svc-rag-indexer** — chunk/de-identify/embed; upsert **Qdrant** collections (firm knowledge, legislation, best practices, glossary).
8. **svc-rag-retriever** — **hybrid retrieval** (dense + sparse) + rerank + **KG-fusion**; returns chunks + citations + KG join hints.
9. **svc-reason** — deterministic calculators (employment, self-employment, property, dividends/interest, allowances, NIC, HICBC, student loans); Cypher materializers; explanations.
10. **svc-forms** — fill PDFs; ZIP evidence bundle (signed manifest).
11. **svc-hmrc** — submit stub|sandbox|live; rate-limit & retries; submission audit.
12. **svc-firm-connectors** — read-only connectors to Firm Databases; sync to **Secure Client Data Store** with lineage.
13. **ui-review** — Next.js reviewer portal (SSO via Traefik+Authentik); reviewers accept/override extractions.
## Orchestration & Messaging
- **Prefect 2.x** for local orchestration; **Temporal** for production scale (sagas, retries, idempotency).
- Events: Kafka (or SQS/SNS) — `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`.
## Concrete Stack (pin/assume unless replaced)
- **Languages:** Python **3.12**, TypeScript 5/Node 20
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2 (ledger), Prefect 2.x (local), Temporal (scale)
- **Gateway:** **Traefik** 3.x with **Authentik Outpost** (ForwardAuth)
- **Identity/SSO:** **Authentik** (OIDC/OAuth2)
- **Secrets:** **Vault** (AppRole/JWT; Transit for envelope encryption)
- **Object Storage:** **MinIO** (S3 API)
- **Vector DB:** **Qdrant** 1.x (dense + sparse hybrid)
- **Embeddings/Rerankers (local-first):**
Dense: `bge-m3` or `bge-small-en-v1.5`; Sparse: BM25/SPLADE (Qdrant sparse); Reranker: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- **Datastores:**
- **Secure Client Data Store:** PostgreSQL 15 (encrypted; RLS; pgcrypto)
- **KG:** Neo4j 5.x
- **Cache/locks:** Redis
- **Infra:** **Docker-Compose** for local; **Kubernetes** for scale (Helm, ArgoCD optional later)
- **CI/CD:** **Gitea** + Gitea Actions (or Drone) → container registry → deploy
## Data Layer (three pillars + fusion)
1. **Firm Databases** → **Firm Connectors** (read-only) → **Secure Client Data Store (Postgres)** with lineage.
2. **Vector DB / Knowledge Base (Qdrant)** — internal knowledge, legislation, best practices, glossary; **no PII** (placeholders + hashes).
3. **Knowledge Graph (Neo4j)** — accounting/tax ontology with evidence anchors and rules/calculations.
**Fusion strategy:** Query → RAG retrieve (Qdrant) + KG traverse → **fusion** scoring (α·dense + β·sparse + γ·KG-link-boost) → results with citations (URL/doc_id+page/anchor) and graph paths.
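A minimal sketch of that fusion score, assuming the α/β/γ weights are supplied via `RAG_ALPHA_BETA_GAMMA`; the candidate shape and the default weights are illustrative.
```python
# Minimal sketch of alpha*dense + beta*sparse + gamma*KG-link-boost scoring.
from dataclasses import dataclass, field


@dataclass
class Candidate:
    chunk_id: str
    dense_score: float
    sparse_score: float
    linked_kg_nodes: set[str] = field(default_factory=set)


def fuse(
    candidates: list[Candidate],
    active_kg_nodes: set[str],
    alpha: float = 0.6,
    beta: float = 0.3,
    gamma: float = 0.1,
) -> list[tuple[str, float]]:
    scored = []
    for c in candidates:
        # Boost results that cite Rule/Calculation/Evidence nodes already on
        # the graph path of the schedule being computed.
        kg_boost = 1.0 if c.linked_kg_nodes & active_kg_nodes else 0.0
        score = alpha * c.dense_score + beta * c.sparse_score + gamma * kg_boost
        scored.append((c.chunk_id, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```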
## Non-functional Targets
- SLOs: ingest→extract p95 ≤ 3m; reconciliation ≥ 98%; lineage coverage ≥ 99%; schedule error ≤ 1/1k
- Throughput: local 2 docs/s; scale 5 docs/s sustained; burst 20 docs/s
- Idempotency: `sha256(doc_checksum + extractor_version)` (sketch below)
- Retention: raw images 7y; derived text 2y; vectors (non-PII) 7y; PII-min logs 90d
- Erasure: per `client_id` across MinIO, KG, Qdrant (payload filter), Postgres rows
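A minimal sketch of the idempotency key; the separator between checksum and extractor version is an assumption.
```python
import hashlib


def idempotency_key(doc_checksum: str, extractor_version: str) -> str:
    # sha256 over doc checksum + extractor version; ":" separator is assumed.
    return hashlib.sha256(f"{doc_checksum}:{extractor_version}".encode()).hexdigest()
```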
---
# REPOSITORY LAYOUT (monorepo, local-first)
```
repo/
apps/
svc-ingestion/ svc-rpa/ svc-ocr/ svc-extract/
svc-normalize-map/ svc-kg/ svc-rag-indexer/ svc-rag-retriever/
svc-reason/ svc-forms/ svc-hmrc/ svc-firm-connectors/
ui-review/
kg/
ONTOLOGY.md
schemas/{nodes_and_edges.schema.json, context.jsonld, shapes.ttl}
db/{neo4j_schema.cypher, seed.cypher}
reasoning/schedule_queries.cypher
retrieval/
chunking.yaml qdrant_collections.json indexer.py retriever.py fusion.py
config/{heuristics.yaml, mapping.json}
prompts/{doc_classify.txt, kv_extract.txt, table_extract.txt, entity_link.txt, rag_answer.txt}
pipeline/etl.py
infra/
compose/{docker-compose.local.yml, traefik.yml, traefik-dynamic.yml, env.example}
k8s/ (optional later: Helm charts)
security/{dpia.md, ropa.md, retention_policy.md, threat_model.md}
ops/
runbooks/{ingest.md, calculators.md, hmrc.md, vector-indexing.md, dr-restore.md}
dashboards/grafana.json
alerts/prometheus-rules.yaml
tests/{unit, integration, e2e, data/{synthetic, golden}}
Makefile
.gitea/workflows/ci.yml
mkdocs.yml
```
---
# DELIVERABLES (RETURN ALL AS MARKED CODE BLOCKS)
1. **Ontology** (Concept model; JSON-Schema; JSON-LD; Neo4j DDL)
2. **Heuristics & Rules (YAML)**
3. **Extraction pipeline & prompts**
4. **RAG & Retrieval Layer** (chunking, Qdrant collections, indexer, retriever, fusion)
5. **Reasoning layer** (deterministic calculators + Cypher + tests)
6. **Agent interface (Tooling API)**
7. **Quality & Safety** (datasets, metrics, tests, red-team)
8. **Graph Constraints** (SHACL, IDs, bitemporal)
9. **Security & Compliance** (DPIA, ROPA, encryption, auditability)
10. **Worked Example** (end-to-end UK SA sample)
11. **Observability & SRE** (SLIs/SLOs, tracing, idempotency, DR, cost controls)
12. **Architecture & Local Infra** (**docker-compose** with Traefik + Authentik + Vault + MinIO + Qdrant + Neo4j + Postgres + Redis + Prometheus/Grafana + Loki + Unleash + services)
13. **Repo Scaffolding & Makefile** (dev tasks, lint, test, build, run)
14. **Firm Database Connectors** (data contracts, sync jobs, lineage)
15. **Traefik & Authentik configs** (static+dynamic, ForwardAuth, route labels)
---
# ONTOLOGY REQUIREMENTS (as before + RAG links)
- Nodes: `TaxpayerProfile`, `TaxYear`, `Jurisdiction`, `TaxForm`, `Schedule`, `FormBox`, `Document`, `Evidence`, `Party`, `Account`, `IncomeItem`, `ExpenseItem`, `PropertyAsset`, `BusinessActivity`, `Allowance`, `Relief`, `PensionContribution`, `StudentLoanPlan`, `Payment`, `ExchangeRate`, `Calculation`, `Rule`, `NormalizationEvent`, `Reconciliation`, `Consent`, `LegalBasis`, `ImportJob`, `ETLRun`
- Relationships: `BELONGS_TO`, `OF_TAX_YEAR`, `IN_JURISDICTION`, `HAS_SECTION`, `HAS_BOX`, `REPORTED_IN`, `COMPUTES`, `DERIVED_FROM`, `SUPPORTED_BY`, `PAID_BY`, `PAID_TO`, `OWNS`, `RENTED_BY`, `EMPLOYED_BY`, `APPLIES_TO`, `APPLIES`, `VIOLATES`, `NORMALIZED_FROM`, `HAS_VALID_BASIS`, `PRODUCED_BY`, **`CITES`**, **`DESCRIBES`**
- **Bitemporal** and **provenance** mandatory.
---
# UK-SPECIFIC REQUIREMENTS
- Year boundary 6 Apr to 5 Apr; basis period reform toggle
- Employment aggregation, BIK, PAYE offsets
- Self-employment: allowable/disallowable, capital allowances (AIA/WDA/SBA), loss rules, **NIC Class 2 & 4**
- Property: FHL tests, **mortgage interest 20% credit**, Rent-a-Room, joint splits
- Savings/dividends: allowances & rate bands; ordering
- Personal allowance tapering; Gift Aid & pension gross-up; **HICBC**; **Student Loan** plans 1/2/4/5 & PGL
- Rounding per `FormBox.rounding_rule`
---
# YAML HEURISTICS (KEEP SEPARATE FILE)
- document_kinds, field_normalization, line_item_mapping
- period_inference (UK boundary + reform), dedupe_rules
- **validation_rules:** `utr_checksum`, `ni_number_regex`, `iban_check`, `vat_gb_mod97`, `rounding_policy: "HMRC"`, `numeric_tolerance: 0.01`
- **entity_resolution:** blocking keys, fuzzy thresholds, canonical source priority
- **privacy_redaction:** `mask_except_last4` for NI/UTR/IBAN/sort_code/phone/email (see the sketch after this list)
- **jurisdiction_overrides:** by {{jurisdiction}} and {{tax_year}}
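A minimal sketch of the `mask_except_last4` rule; the mask character and whitespace handling are assumptions.
```python
def mask_except_last4(value: str, mask_char: str = "*") -> str:
    # Keep only the last four characters visible (NI/UTR/IBAN/sort code/etc.).
    compact = value.replace(" ", "")
    if len(compact) <= 4:
        return mask_char * len(compact)
    return mask_char * (len(compact) - 4) + compact[-4:]


assert mask_except_last4("AB123456C") == "*****456C"
```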
---
# EXTRACTION PIPELINE (SPECIFY CODE & PROMPTS)
- ingest → classify → OCR/layout → extract (schema-constrained JSON with bbox/page) → validate → normalize → map_to_graph → post-checks
- Prompts: `doc_classify`, `kv_extract`, `table_extract` (multi-page), `entity_link`
- Contract: **JSON schema enforcement** with retry/validator loop; temperature guidance (see the retry-loop sketch after this list)
- Reliability: de-skew/rotation/language/handwriting policy
- Mapping config: JSON mapping to nodes/edges + provenance (doc_id/page/bbox/text_hash)
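A minimal sketch of the schema-enforcement retry/validator loop; the `KVExtraction` model and `call_llm` hook are illustrative placeholders, not prescribed interfaces.
```python
# Hypothetical retry/validator loop for schema-constrained extraction.
from pydantic import BaseModel, ValidationError


class KVField(BaseModel):
    name: str
    value: str
    page: int
    bbox: list[float]  # [x0, y0, x1, y1]
    confidence: float


class KVExtraction(BaseModel):
    doc_id: str
    fields: list[KVField]


def call_llm(prompt: str, temperature: float) -> str:
    raise NotImplementedError  # provided by the extraction service


def extract_with_retries(prompt: str, max_attempts: int = 3) -> KVExtraction:
    last_error = ""
    for _ in range(max_attempts):
        raw = call_llm(
            prompt + (f"\n\nPrevious output was invalid: {last_error}" if last_error else ""),
            temperature=0.0,  # deterministic-first; validator errors are fed back on retry
        )
        try:
            return KVExtraction.model_validate_json(raw)
        except ValidationError as exc:
            last_error = str(exc)
    raise ValueError(f"Extraction failed schema validation: {last_error}")
```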
---
# RAG & RETRIEVAL LAYER (Qdrant + KG Fusion)
- Collections: `firm_knowledge`, `legislation`, `best_practices`, `glossary` (payloads include jurisdiction, tax_years, topic_tags, version, `pii_free:true`)
- Chunking: layout-aware; tables serialized; ~1.5k-token chunks, 10-15% overlap
- Indexer: de-identify PII; placeholders only; embeddings (dense) + sparse; upsert with payload
- Retriever: hybrid scoring (α·dense + β·sparse), filters (jurisdiction/tax_year), rerank; return **citations** + **KG hints**
- Fusion: boost results linked to applicable `Rule`/`Calculation`/`Evidence` for current schedule
- Right-to-erasure: purge vectors via payload filter (`client_id?` only for client-authored knowledge)
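A minimal sketch of that erasure purge, assuming qdrant-client 1.x filter-based deletes and a `client_id` payload key as above.
```python
# Delete all points whose payload carries the given client_id.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchValue


def purge_client_vectors(client: QdrantClient, collection: str, client_id: str) -> None:
    client.delete(
        collection_name=collection,
        points_selector=FilterSelector(
            filter=Filter(
                must=[FieldCondition(key="client_id", match=MatchValue(value=client_id))]
            )
        ),
    )
```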
---
# REASONING & CALCULATION (DETERMINISTIC)
- Order: incomes → allowances/capital allowances → loss offsets → personal allowance → savings/dividend bands → HICBC & student loans → NIC Class 2/4 → property 20% credit/FHL/Rent-a-Room (a sample calculator sketch follows this list)
- Cypher materializers per schedule/box; explanations via `DERIVED_FROM` and RAG `CITES`
- Unit tests per rule; golden files; property-based tests
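A minimal sketch of one deterministic calculator: the personal allowance taper (£1 of allowance lost per £2 of adjusted net income above £100,000). The £12,570 standard allowance is the published figure for recent tax years and should come from {{tax_year}} configuration rather than a hard-coded default; rounding per `FormBox.rounding_rule` is not applied here.
```python
from decimal import Decimal


def tapered_personal_allowance(
    adjusted_net_income: Decimal,
    standard_allowance: Decimal = Decimal("12570"),   # load per {{tax_year}}, not hard-coded
    taper_threshold: Decimal = Decimal("100000"),
) -> Decimal:
    # Allowance reduces by £1 for every £2 of income above the threshold, to zero.
    if adjusted_net_income <= taper_threshold:
        return standard_allowance
    reduction = (adjusted_net_income - taper_threshold) / 2
    return max(Decimal("0"), standard_allowance - reduction)


# Golden-file style check: allowance fully tapered at £125,140 with a £12,570 allowance.
assert tapered_personal_allowance(Decimal("125140")) == Decimal("0")
```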
---
# AGENT TOOLING API (JSON SCHEMAS)
1. `ComputeSchedule({tax_year, taxpayer_id, schedule_id}) -> {boxes[], totals[], explanations[]}`
2. `PopulateFormBoxes({tax_year, taxpayer_id, form_id}) -> {fields[], pdf_fields[], confidence, calibrated_confidence}`
3. `AskClarifyingQuestion({gap, candidate_values, evidence}) -> {question_text, missing_docs}`
4. `GenerateEvidencePack({scope}) -> {bundle_manifest, signed_hashes}`
5. `ExplainLineage({node_id|field}) -> {chain:[evidence], graph_paths}`
6. `CheckDocumentCoverage({tax_year, taxpayer_id}) -> {required_docs[], missing[], blockers[]}`
7. `SubmitToHMRC({tax_year, taxpayer_id, dry_run}) -> {status, submission_id?, errors[]}`
8. `ReconcileBank({account_id, period}) -> {unmatched_invoices[], unmatched_bank_lines[], deltas}`
9. `RAGSearch({query, tax_year?, jurisdiction?, k?}) -> {chunks[], citations[], kg_hints[], calibrated_confidence}` (see the Pydantic sketch at the end of this section)
10. `SyncFirmDatabases({since}) -> {objects_synced, errors[]}`
**Env flags:** `HMRC_MTD_ITSA_MODE`, `RATE_LIMITS`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`, `RAG_ALPHA_BETA_GAMMA`
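A minimal sketch of one tool contract (`RAGSearch`) as Pydantic models from which the JSON Schemas can be generated; payload field shapes beyond the signature above are assumptions.
```python
from pydantic import BaseModel, Field


class RAGSearchRequest(BaseModel):
    query: str
    tax_year: str | None = None       # e.g. "2024-25"
    jurisdiction: str | None = "UK"
    k: int = Field(default=8, ge=1, le=50)


class Citation(BaseModel):
    source: str                       # URL or doc_id
    page: int | None = None
    anchor: str | None = None


class RAGSearchResponse(BaseModel):
    chunks: list[str]
    citations: list[Citation]
    kg_hints: list[str]               # KG node/edge ids worth traversing next
    calibrated_confidence: float = Field(ge=0.0, le=1.0)


# JSON Schemas for tools/agent_tools.json can be emitted directly, e.g.
# RAGSearchRequest.model_json_schema() and RAGSearchResponse.model_json_schema().
```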
---
# SECURITY & COMPLIANCE
- **Traefik + Authentik SSO at edge** (ForwardAuth); per-route RBAC; inject verified claims headers/JWT
- **Vault** for secrets (AppRole/JWT, Transit for envelope encryption)
- **PII minimization:** no PII in Qdrant; placeholders; PII mapping only in Secure Client Data Store
- **Auditability:** tamper-evident logs (hash chain), signer identity, time sync
- **DPIA, ROPA, retention policy, right-to-erasure** workflows
---
# CI/CD (Gitea)
- Gitea Actions: `lint` (ruff/mypy/eslint), `test` (pytest+coverage, e2e), `build` (Docker), `scan` (Trivy/SAST), `push` (registry), `deploy` (compose up or K8s apply)
- SemVer tags; SBOM (Syft); OpenAPI + MkDocs publish; pre-commit hooks
---
# OBSERVABILITY & SRE
- SLIs/SLOs: ingest_time_p50, extract_precision@field ≥ 0.97, reconciliation_pass_rate ≥ 0.98, lineage_coverage ≥ 0.99, time_to_review_p95
- Dashboards: ingestion throughput, OCR error rates, extraction precision, mapping latency, calculator failures, HMRC submits, **RAG recall/precision & faithfulness**
- Alerts: OCR 5xx spike, extraction precision dip, reconciliation failures, HMRC rate-limit breaches, RAG drift
- Backups/DR: Neo4j dump (daily), Postgres PITR, Qdrant snapshot, MinIO versioning; quarterly restore test
- Cost controls: embedding cache, incremental indexing, compaction/TTL for stale vectors, cold archive for images
---
# OUTPUT FORMAT (STRICT)
Return results in the following order, each in its own fenced code block **with the exact language tag**:
```md
<!-- FILE: ONTOLOGY.md -->
# Concept Model
...
```
```json
// FILE: schemas/nodes_and_edges.schema.json
{ ... }
```
```json
// FILE: schemas/context.jsonld
{ ... }
```
```turtle
# FILE: schemas/shapes.ttl
# SHACL shapes for node/edge integrity
...
```
```cypher
// FILE: db/neo4j_schema.cypher
CREATE CONSTRAINT ...
```
```yaml
# FILE: config/heuristics.yaml
document_kinds: ...
```
```json
// FILE: config/mapping.json
{ "mappings": [ ... ] }
```
```yaml
# FILE: retrieval/chunking.yaml
# Layout-aware chunking, tables, overlap, token targets
```
```json
// FILE: retrieval/qdrant_collections.json
{
"collections": [
{ "name": "firm_knowledge", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
{ "name": "legislation", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
{ "name": "best_practices", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
{ "name": "glossary", "dense": {"size": 768}, "sparse": true, "payload_schema": { ... } }
]
}
```
```python
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
...
```
```python
# FILE: retrieval/retriever.py
# Hybrid retrieval (alpha,beta), rerank, filters, return citations + KG hints
...
```
```python
# FILE: retrieval/fusion.py
# Join RAG chunks to KG rules/calculations/evidence; boost linked results
...
```
```txt
# FILE: prompts/rag_answer.txt
[Instruction: cite every claim; forbid PII; return calibrated_confidence; JSON contract]
```
```python
# FILE: pipeline/etl.py
def ingest(...): ...
```
```txt
# FILE: prompts/kv_extract.txt
[Prompt with JSON contract + examples]
```
```cypher
// FILE: reasoning/schedule_queries.cypher
// SA105: compute property income totals
MATCH ...
```
```json
// FILE: tools/agent_tools.json
{ ... }
```
```yaml
# FILE: infra/compose/docker-compose.local.yml
# Traefik (with Authentik ForwardAuth), Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prometheus/Grafana, Loki, Unleash, all services
```
```yaml
# FILE: infra/compose/traefik.yml
# Static config: entryPoints, providers, certificates, access logs
entryPoints:
web:
address: ":80"
websecure:
address: ":443"
providers:
docker: {}
file:
filename: /etc/traefik/traefik-dynamic.yml
api:
dashboard: true
log:
level: INFO
accessLog: {}
```
```yaml
# FILE: infra/compose/traefik-dynamic.yml
# Dynamic config: Authentik ForwardAuth middleware + routers per service
http:
middlewares:
authentik-forwardauth:
forwardAuth:
address: "http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik"
trustForwardHeader: true
authResponseHeaders:
- X-Authenticated-User
- X-Authenticated-Email
- X-Authenticated-Groups
- Authorization
rate-limit:
rateLimit:
average: 50
burst: 100
routers:
svc-extract:
rule: "Host(`api.local`) && PathPrefix(`/extract`)"
entryPoints: ["websecure"]
service: svc-extract
middlewares: ["authentik-forwardauth", "rate-limit"]
tls: {}
services:
svc-extract:
loadBalancer:
servers:
- url: "http://svc-extract:8000"
```
```yaml
# FILE: infra/compose/env.example
DOMAIN=local
EMAIL=admin@local
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=miniopass
POSTGRES_PASSWORD=postgres
NEO4J_PASSWORD=neo4jpass
QDRANT__SERVICE__GRPC_PORT=6334
VAULT_DEV_ROOT_TOKEN_ID=root
AUTHENTIK_SECRET_KEY=changeme
RAG_EMBEDDING_MODEL=bge-small-en-v1.5
RAG_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
```
```yaml
# FILE: .gitea/workflows/ci.yml
# Lint → Test → Build → Scan → Push → Deploy (compose up)
```
```makefile
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```
```md
<!-- FILE: TESTPLAN.md -->
## Datasets, Metrics, Acceptance Criteria
- Extraction precision/recall per field
- Schedule-level absolute error
- Reconciliation pass-rate
- Explanation coverage
- RAG retrieval: top-k recall, nDCG, faithfulness, groundedness
- Security: Traefik+Authentik route auth tests, header spoofing prevention (internal network, trusted proxy)
- Red-team cases (OCR noise, conflicting docs, PII leak prevention)
...
```
---
# STYLE & GUARANTEES
- Be **concise but complete**; prefer schemas/code over prose.
- **No chain-of-thought.** Provide final artifacts and brief rationales.
- Every numeric output must include **lineage to Evidence → Document (page/bbox/text_hash)** and **citations** for narrative answers.
- Parameterize by {{jurisdiction}} and {{tax_year}}.
- Include **calibrated_confidence** and name calibration method.
- Enforce **SHACL** on KG writes; reject/queue fixes on violation.
- **No PII** in Qdrant. Use de-ID placeholders; keep mappings only in Secure Client Data Store.
- Deterministic IDs; reproducible builds; version-pinned dependencies.
- **Trust boundary:** only Traefik exposes ports; all services on a private network; services accept only requests with Traefik's network identity; **never trust client-supplied auth headers**.
# START
Produce the deliverables now, in the exact order and file/block structure above, implementing the **local-first stack (Python 3.12, Prefect, Vault, MinIO, Playwright, Qdrant, Authentik, Traefik, Docker-Compose, Gitea)** with optional **scale-out** notes (Temporal, K8s) where specified.

retrieval/indexer.py Normal file

@@ -0,0 +1,507 @@
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
import hashlib
import json
import logging
import re
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any
import numpy as np
import spacy
import torch
import yaml
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    PointStruct,
    SparseIndexParams,
    SparseVector,
    SparseVectorParams,
    VectorParams,
)
from sentence_transformers import SentenceTransformer
from .chunker import DocumentChunker
from .pii_detector import PIIDetector, PIIRedactor
@dataclass
class IndexingResult:
collection_name: str
points_indexed: int
points_updated: int
points_failed: int
processing_time: float
errors: list[str]
class RAGIndexer:
def __init__(self, config_path: str, qdrant_url: str = "http://localhost:6333"):
with open(config_path) as f:
self.config = yaml.safe_load(f)
self.qdrant_client = QdrantClient(url=qdrant_url)
self.chunker = DocumentChunker(config_path)
self.pii_detector = PIIDetector()
self.pii_redactor = PIIRedactor()
# Initialize embedding models
self.dense_model = SentenceTransformer(
self.config.get("embedding_model", "bge-small-en-v1.5")
)
# Initialize sparse model (BM25/SPLADE)
self.sparse_model = self._init_sparse_model()
# Initialize NLP pipeline
self.nlp = spacy.load("en_core_web_sm")
self.logger = logging.getLogger(__name__)
def _init_sparse_model(self):
"""Initialize sparse embedding model (BM25 or SPLADE)"""
sparse_config = self.config.get("sparse_model", {})
model_type = sparse_config.get("type", "bm25")
if model_type == "bm25":
from rank_bm25 import BM25Okapi
return BM25Okapi
elif model_type == "splade":
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"naver/splade-cocondenser-ensembledistil"
)
model = AutoModelForMaskedLM.from_pretrained(
"naver/splade-cocondenser-ensembledistil"
)
return {"tokenizer": tokenizer, "model": model}
else:
raise ValueError(f"Unsupported sparse model type: {model_type}")
async def index_document(
self, document_path: str, collection_name: str, metadata: dict[str, Any]
) -> IndexingResult:
"""Index a single document into the specified collection"""
start_time = datetime.now()
errors = []
points_indexed = 0
points_updated = 0
points_failed = 0
try:
# Step 1: Chunk the document
chunks = await self.chunker.chunk_document(document_path, metadata)
# Step 2: Process each chunk
points = []
for chunk in chunks:
try:
point = await self._process_chunk(chunk, collection_name, metadata)
if point:
points.append(point)
except Exception as e:
self.logger.error(
f"Failed to process chunk {chunk.get('id', 'unknown')}: {str(e)}"
)
errors.append(f"Chunk processing error: {str(e)}")
points_failed += 1
# Step 3: Upsert to Qdrant
if points:
try:
operation_info = self.qdrant_client.upsert(
collection_name=collection_name, points=points, wait=True
)
points_indexed = len(points)
self.logger.info(
f"Indexed {points_indexed} points to {collection_name}"
)
except Exception as e:
self.logger.error(f"Failed to upsert to Qdrant: {str(e)}")
errors.append(f"Qdrant upsert error: {str(e)}")
points_failed += len(points)
points_indexed = 0
except Exception as e:
self.logger.error(f"Document indexing failed: {str(e)}")
errors.append(f"Document indexing error: {str(e)}")
processing_time = (datetime.now() - start_time).total_seconds()
return IndexingResult(
collection_name=collection_name,
points_indexed=points_indexed,
points_updated=points_updated,
points_failed=points_failed,
processing_time=processing_time,
errors=errors,
)
async def _process_chunk(
self, chunk: dict[str, Any], collection_name: str, base_metadata: dict[str, Any]
) -> PointStruct | None:
"""Process a single chunk: de-identify, embed, create point"""
# Step 1: De-identify PII
content = chunk["content"]
pii_detected = self.pii_detector.detect(content)
if pii_detected:
# Redact PII and create mapping
redacted_content, pii_mapping = self.pii_redactor.redact(
content, pii_detected
)
# Store PII mapping securely (not in vector DB)
await self._store_pii_mapping(chunk["id"], pii_mapping)
# Log PII detection for audit
self.logger.warning(
f"PII detected in chunk {chunk['id']}: {[p['type'] for p in pii_detected]}"
)
else:
redacted_content = content
# Verify no PII remains
if not self._verify_pii_free(redacted_content):
self.logger.error(f"PII verification failed for chunk {chunk['id']}")
return None
# Step 2: Generate embeddings
try:
dense_vector = await self._generate_dense_embedding(redacted_content)
sparse_vector = await self._generate_sparse_embedding(redacted_content)
except Exception as e:
self.logger.error(
f"Embedding generation failed for chunk {chunk['id']}: {str(e)}"
)
return None
# Step 3: Prepare metadata
payload = self._prepare_payload(chunk, base_metadata, redacted_content)
payload["pii_free"] = True # Verified above
# Step 4: Create point
point = PointStruct(
id=chunk["id"],
vector={"dense": dense_vector, "sparse": sparse_vector},
payload=payload,
)
return point
async def _generate_dense_embedding(self, text: str) -> list[float]:
"""Generate dense vector embedding"""
try:
# Use sentence transformer for dense embeddings
embedding = self.dense_model.encode(text, normalize_embeddings=True)
return embedding.tolist()
except Exception as e:
self.logger.error(f"Dense embedding generation failed: {str(e)}")
raise
async def _generate_sparse_embedding(self, text: str) -> SparseVector:
"""Generate sparse vector embedding (BM25 or SPLADE)"""
vector = SparseVector(indices=[], values=[])
try:
sparse_config = self.config.get("sparse_model", {})
model_type = sparse_config.get("type", "bm25")
if model_type == "bm25":
# Simple BM25-style sparse representation
doc = self.nlp(text)
tokens = [
token.lemma_.lower()
for token in doc
if not token.is_stop and not token.is_punct
]
# Create term frequency vector
term_freq = {}
for token in tokens:
term_freq[token] = term_freq.get(token, 0) + 1
# Convert to sparse vector format
vocab_size = sparse_config.get("vocab_size", 30000)
indices = []
values = []
for term, freq in term_freq.items():
                    # Stable hash-based vocabulary mapping (Python's builtin
                    # hash() is salted per process, so it would not give
                    # reproducible term ids across runs)
                    digest = hashlib.sha256(term.encode("utf-8")).digest()
                    term_id = int.from_bytes(digest[:8], "big") % vocab_size
indices.append(term_id)
values.append(float(freq))
vector = SparseVector(indices=indices, values=values)
elif model_type == "splade":
# SPLADE sparse embeddings
tokenizer = self.sparse_model["tokenizer"]
model = self.sparse_model["model"]
inputs = tokenizer(
text, return_tensors="pt", truncation=True, max_length=512
)
outputs = model(**inputs)
# Extract sparse representation
logits = outputs.logits.squeeze()
sparse_rep = torch.relu(logits).detach().numpy()
# Convert to sparse format
indices = np.nonzero(sparse_rep)[0].tolist()
values = sparse_rep[indices].tolist()
vector = SparseVector(indices=indices, values=values)
return vector
except Exception as e:
self.logger.error(f"Sparse embedding generation failed: {str(e)}")
# Return empty sparse vector as fallback
return vector
def _prepare_payload(
self, chunk: dict[str, Any], base_metadata: dict[str, Any], content: str
) -> dict[str, Any]:
"""Prepare payload metadata for the chunk"""
# Start with base metadata
payload = base_metadata.copy()
# Add chunk-specific metadata
payload.update(
{
"document_id": chunk.get("document_id"),
"content": content, # De-identified content
"chunk_index": chunk.get("chunk_index", 0),
"total_chunks": chunk.get("total_chunks", 1),
"page_numbers": chunk.get("page_numbers", []),
"section_hierarchy": chunk.get("section_hierarchy", []),
"has_calculations": self._detect_calculations(content),
"has_forms": self._detect_form_references(content),
"confidence_score": chunk.get("confidence_score", 1.0),
"created_at": datetime.now().isoformat(),
"version": self.config.get("version", "1.0"),
}
)
# Extract and add topic tags
topic_tags = self._extract_topic_tags(content)
if topic_tags:
payload["topic_tags"] = topic_tags
# Add content analysis
payload.update(self._analyze_content(content))
return payload
def _detect_calculations(self, text: str) -> bool:
"""Detect if text contains calculations or formulas"""
calculation_patterns = [
r"\d+\s*[+\-*/]\s*\d+",
r"£\d+(?:,\d{3})*(?:\.\d{2})?",
r"\d+(?:\.\d+)?%",
r"total|sum|calculate|compute",
r"rate|threshold|allowance|relief",
]
for pattern in calculation_patterns:
if re.search(pattern, text, re.IGNORECASE):
return True
return False
def _detect_form_references(self, text: str) -> bool:
"""Detect references to tax forms"""
form_patterns = [
r"SA\d{3}",
r"P\d{2}",
r"CT\d{3}",
r"VAT\d{3}",
r"form\s+\w+",
r"schedule\s+\w+",
]
for pattern in form_patterns:
if re.search(pattern, text, re.IGNORECASE):
return True
return False
def _extract_topic_tags(self, text: str) -> list[str]:
"""Extract topic tags from content"""
topic_keywords = {
"employment": [
"PAYE",
"payslip",
"P60",
"employment",
"salary",
"wages",
"employer",
],
"self_employment": [
"self-employed",
"business",
"turnover",
"expenses",
"profit",
"loss",
],
"property": ["rental", "property", "landlord", "FHL", "mortgage", "rent"],
"dividends": ["dividend", "shares", "distribution", "corporation tax"],
"capital_gains": ["capital gains", "disposal", "acquisition", "CGT"],
"pensions": ["pension", "retirement", "SIPP", "occupational"],
"savings": ["interest", "savings", "ISA", "bonds"],
"inheritance": ["inheritance", "IHT", "estate", "probate"],
"vat": ["VAT", "value added tax", "registration", "return"],
}
tags = []
text_lower = text.lower()
for topic, keywords in topic_keywords.items():
for keyword in keywords:
if keyword.lower() in text_lower:
tags.append(topic)
break
return list(set(tags)) # Remove duplicates
def _analyze_content(self, text: str) -> dict[str, Any]:
"""Analyze content for additional metadata"""
doc = self.nlp(text)
return {
"word_count": len([token for token in doc if not token.is_space]),
"sentence_count": len(list(doc.sents)),
"entity_count": len(doc.ents),
"complexity_score": self._calculate_complexity(doc),
"language": doc.lang_ if hasattr(doc, "lang_") else "en",
}
    def _calculate_complexity(self, doc: Any) -> float:  # doc is a spaCy Doc
"""Calculate text complexity score"""
if not doc:
return 0.0
# Simple complexity based on sentence length and vocabulary
avg_sentence_length = sum(len(sent) for sent in doc.sents) / len(
list(doc.sents)
)
unique_words = len(set(token.lemma_.lower() for token in doc if token.is_alpha))
total_words = len([token for token in doc if token.is_alpha])
vocabulary_diversity = unique_words / total_words if total_words > 0 else 0
# Normalize to 0-1 scale
complexity = min(1.0, (avg_sentence_length / 20.0 + vocabulary_diversity) / 2.0)
return complexity
def _verify_pii_free(self, text: str) -> bool:
"""Verify that text contains no PII"""
# Quick verification using patterns
pii_patterns = [
r"\b[A-Z]{2}\d{6}[A-D]\b", # NI number
r"\b\d{10}\b", # UTR
r"\b[A-Z]{2}\d{2}[A-Z]{4}\d{14}\b", # IBAN
r"\b\d{2}-\d{2}-\d{2}\b", # Sort code
r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", # Postcode
r"\b[\w\.-]+@[\w\.-]+\.\w+\b", # Email
r"\b(?:\+44|0)\d{10,11}\b", # Phone
]
for pattern in pii_patterns:
if re.search(pattern, text):
return False
return True
async def _store_pii_mapping(
self, chunk_id: str, pii_mapping: dict[str, Any]
) -> None:
"""Store PII mapping in secure client data store (not in vector DB)"""
# This would integrate with the secure PostgreSQL client data store
# For now, just log the mapping securely
self.logger.info(
f"PII mapping stored for chunk {chunk_id}: {len(pii_mapping)} items"
)
async def create_collections(self) -> None:
"""Create all Qdrant collections based on configuration"""
collections_config_path = Path(__file__).parent / "qdrant_collections.json"
with open(collections_config_path) as f:
collections_config = json.load(f)
for collection_config in collections_config["collections"]:
collection_name = collection_config["name"]
try:
# Check if collection exists
try:
self.qdrant_client.get_collection(collection_name)
self.logger.info(f"Collection {collection_name} already exists")
continue
                except Exception:
pass # Collection doesn't exist, create it
# Create collection
                vectors_config = {}
                sparse_vectors_config = {}
                # Dense vector configuration
                if "dense" in collection_config:
                    vectors_config["dense"] = VectorParams(
                        size=collection_config["dense"]["size"],
                        distance=Distance.COSINE,
                    )
                # Sparse vector configuration: sparse vectors have no fixed
                # dimensionality in Qdrant and are declared separately via
                # sparse_vectors_config rather than as a dense VectorParams
                if collection_config.get("sparse", False):
                    sparse_vectors_config["sparse"] = SparseVectorParams(
                        index=SparseIndexParams(on_disk=True)
                    )
                self.qdrant_client.create_collection(
                    collection_name=collection_name,
                    vectors_config=vectors_config,
                    sparse_vectors_config=sparse_vectors_config or None,
                    **collection_config.get("indexing_config", {}),
                )
self.logger.info(f"Created collection: {collection_name}")
except Exception as e:
self.logger.error(
f"Failed to create collection {collection_name}: {str(e)}"
)
raise
async def batch_index(
self, documents: list[dict[str, Any]], collection_name: str
) -> list[IndexingResult]:
"""Index multiple documents in batch"""
results = []
for doc_info in documents:
result = await self.index_document(
doc_info["path"], collection_name, doc_info["metadata"]
)
results.append(result)
return results
def get_collection_stats(self, collection_name: str) -> dict[str, Any]:
"""Get statistics for a collection"""
try:
collection_info = self.qdrant_client.get_collection(collection_name)
return {
"name": collection_name,
"vectors_count": collection_info.vectors_count,
"indexed_vectors_count": collection_info.indexed_vectors_count,
"points_count": collection_info.points_count,
"segments_count": collection_info.segments_count,
"status": collection_info.status,
}
except Exception as e:
self.logger.error(f"Failed to get stats for {collection_name}: {str(e)}")
return {"error": str(e)}