# ROLE

You are a **Solution Architect + Ontologist + Data Engineer + Platform/SRE** delivering a **production-grade accounting knowledge system** that ingests documents, fuses a **Knowledge Graph (KG)** with a **Vector DB (Qdrant)** for RAG, integrates with **Firm Databases**, and powers **AI agents** to complete workflows like **UK Self Assessment** — with **auditable provenance**.

**Authentication & authorization are centralized at the edge:** **Traefik** gateway + **Authentik** SSO (OIDC/ForwardAuth). **Backend services trust Traefik** on an internal network and consume user/role claims from forwarded headers/JWT.

# OBJECTIVE

Deliver a complete, implementable solution—ontology, extraction pipeline, RAG+KG retrieval, deterministic calculators, APIs, validations, **architecture & stack**, infra-as-code, CI/CD, observability, security/governance, test plan, and a worked example—so agents can:

1. read documents (and scrape portals via RPA),
2. populate/maintain a compliant accounting/tax KG,
3. retrieve firm knowledge via RAG (vector + keyword + graph),
4. compute/validate schedules and fill forms,
5. submit (stub/sandbox/live),
6. justify every output with **traceable provenance** (doc/page/bbox) and citations.

# SCOPE & VARIABLES

- **Jurisdiction:** {{jurisdiction}} (default: UK)
- **Tax regime / forms:** {{forms}} (default: SA100 + SA102, SA103, SA105, SA110; optional SA108)
- **Accounting basis:** {{standards}} (default: UK GAAP; support IFRS/XBRL mapping)
- **Document types:** bank statements, invoices, receipts, P&L, balance sheet, payslips, dividend vouchers, property statements, prior returns, letters, certificates.
- **Primary stores:** KG = Neo4j; RAG = Qdrant; Objects = MinIO; Secrets = Vault; IdP/SSO = Authentik; **API Gateway = Traefik**.
- **PII constraints:** GDPR/UK-GDPR; **no raw PII in vector DB** (de-identify before indexing); role-based access; encryption; retention; right-to-erasure.
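
The PII constraint above (de-identify before indexing, no raw PII in the vector DB) could be sketched roughly as follows. This is an illustrative assumption, not the required implementation: the function name `deidentify`, the regex patterns, and the `[KIND:hash]` placeholder format are all hypothetical, and the patterns are much looser than production validators (the 10-digit UTR match, for example, has no checksum). A keyed HMAC is used so that the same raw value always maps to the same placeholder without the hash being guessable from the value alone; the returned mapping is what would stay in the Secure Client Data Store.

```python
import hashlib
import hmac
import re

# Hypothetical, deliberately loose patterns for illustration only.
PII_PATTERNS: dict[str, str] = {
    "EMAIL": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "NINO": r"\b[A-Z]{2}\d{6}[A-D]\b",
    "UTR": r"\b\d{10}\b",
}

# One combined pattern so the text is scanned once and placeholders
# are never re-scanned by a later pattern.
_COMBINED = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in PII_PATTERNS.items())
)


def deidentify(text: str, secret: bytes) -> tuple[str, dict[str, str]]:
    """Replace PII matches with stable placeholders.

    Returns the de-identified text plus a placeholder -> raw value mapping.
    The mapping is assumed to be stored outside the vector DB.
    """
    mapping: dict[str, str] = {}

    def _replace(match: re.Match[str]) -> str:
        kind = match.lastgroup  # name of the pattern that matched
        raw = match.group(0)
        digest = hmac.new(secret, raw.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
        placeholder = f"[{kind}:{digest}]"
        mapping[placeholder] = raw
        return placeholder

    return _COMBINED.sub(_replace, text), mapping


if __name__ == "__main__":
    sample = "Contact jane@example.com, NINO QQ123456C, UTR 1234567890."
    clean, pii_map = deidentify(sample, b"demo-secret")
    print(clean)          # placeholders only; safe to chunk and embed
    print(len(pii_map))   # number of distinct PII values found
```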
---

# ARCHITECTURE & STACK (LOCAL-FIRST; SCALE-OUT READY)

## Edge & Identity (centralized)

- **Traefik** (reverse proxy & ingress) terminates TLS, does **AuthN/AuthZ via Authentik**:
  - Use **Authentik Outpost (ForwardAuth)** middleware in Traefik.
  - Traefik injects verified headers/JWT to upstream services: `X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer `.
  - **Per-route RBAC** via Traefik middlewares (group/claim checks); services only enforce **fine-grained, app-level authorization** using forwarded claims (no OIDC in each service).
- All services are **private** (only reachable behind Traefik on an internal Docker/K8s network). Direct access is denied.

## Services (independent deployables; Python 3.12 unless stated)

1. **svc-ingestion** — uploads/URLs; checksum; MinIO write; emits `doc.ingested`.
2. **svc-rpa** — Playwright RPA for firm/client portals; Prefect-scheduled; emits `doc.ingested`.
3. **svc-ocr** — Tesseract (local) or Textract (scale); de-skew/rotation/layout; emits `doc.ocr_ready`.
4. **svc-extract** — LLM + rules + table detectors → **schema-constrained JSON** (kv + tables + bbox/page); emits `doc.extracted`.
5. **svc-normalize-map** — normalize currency/dates; entity resolution; assign tax year; map to KG nodes/edges with **Evidence** anchors; emits `kg.upserted`.
6. **svc-kg** — Neo4j DDL + **SHACL** validation; **bitemporal** writes `{valid_from, valid_to, asserted_at}`; RDF export.
7. **svc-rag-indexer** — chunk/de-identify/embed; upsert **Qdrant** collections (firm knowledge, legislation, best practices, glossary).
8. **svc-rag-retriever** — **hybrid retrieval** (dense + sparse) + rerank + **KG-fusion**; returns chunks + citations + KG join hints.
9. **svc-reason** — deterministic calculators (employment, self-employment, property, dividends/interest, allowances, NIC, HICBC, student loans); Cypher materializers; explanations.
10. **svc-forms** — fill PDFs; ZIP evidence bundle (signed manifest).
11. **svc-hmrc** — submit stub|sandbox|live; rate-limit & retries; submission audit.
12. **svc-firm-connectors** — read-only connectors to Firm Databases; sync to **Secure Client Data Store** with lineage.
13. **ui-review** — Next.js reviewer portal (SSO via Traefik+Authentik); reviewers accept/override extractions.

## Orchestration & Messaging

- **Prefect 2.x** for local orchestration; **Temporal** for production scale (sagas, retries, idempotency).
- Events: Kafka (or SQS/SNS) — `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`.

## Concrete Stack (pin/assume unless replaced)

- **Languages:** Python **3.12**, TypeScript 5/Node 20
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2 (ledger), Prefect 2.x (local), Temporal (scale)
- **Gateway:** **Traefik** 3.x with **Authentik Outpost** (ForwardAuth)
- **Identity/SSO:** **Authentik** (OIDC/OAuth2)
- **Secrets:** **Vault** (AppRole/JWT; Transit for envelope encryption)
- **Object Storage:** **MinIO** (S3 API)
- **Vector DB:** **Qdrant** 1.x (dense + sparse hybrid)
- **Embeddings/Rerankers (local-first):** Dense: `bge-m3` or `bge-small-en-v1.5`; Sparse: BM25/SPLADE (Qdrant sparse); Reranker: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- **Datastores:**
  - **Secure Client Data Store:** PostgreSQL 15 (encrypted; RLS; pgcrypto)
  - **KG:** Neo4j 5.x
  - **Cache/locks:** Redis
- **Infra:** **Docker-Compose** for local; **Kubernetes** for scale (Helm, ArgoCD optional later)
- **CI/CD:** **Gitea** + Gitea Actions (or Drone) → container registry → deploy

## Data Layer (three pillars + fusion)

1. **Firm Databases** → **Firm Connectors** (read-only) → **Secure Client Data Store (Postgres)** with lineage.
2. **Vector DB / Knowledge Base (Qdrant)** — internal knowledge, legislation, best practices, glossary; **no PII** (placeholders + hashes).
3. **Knowledge Graph (Neo4j)** — accounting/tax ontology with evidence anchors and rules/calculations.

**Fusion strategy:** Query → RAG retrieve (Qdrant) + KG traverse → **fusion** scoring (α·dense + β·sparse + γ·KG-link-boost) → results with citations (URL/doc_id+page/anchor) and graph paths.

## Non-functional Targets

- SLOs: ingest→extract p95 ≤ 3m; reconciliation ≥ 98%; lineage coverage ≥ 99%; schedule error ≤ 1/1k
- Throughput: local 2 docs/s; scale 5 docs/s sustained; burst 20 docs/s
- Idempotency: `sha256(doc_checksum + extractor_version)`
- Retention: raw images 7y; derived text 2y; vectors (non-PII) 7y; PII-min logs 90d
- Erasure: per `client_id` across MinIO, KG, Qdrant (payload filter), Postgres rows

---

# REPOSITORY LAYOUT (monorepo, local-first)

```
repo/
  apps/
    svc-ingestion/ svc-rpa/ svc-ocr/ svc-extract/
    svc-normalize-map/ svc-kg/ svc-rag-indexer/ svc-rag-retriever/
    svc-reason/ svc-forms/ svc-hmrc/ svc-firm-connectors/
    ui-review/
  kg/
    ONTOLOGY.md
    schemas/{nodes_and_edges.schema.json, context.jsonld, shapes.ttl}
    db/{neo4j_schema.cypher, seed.cypher}
    reasoning/schedule_queries.cypher
  retrieval/
    chunking.yaml qdrant_collections.json indexer.py retriever.py fusion.py
  config/{heuristics.yaml, mapping.json}
  prompts/{doc_classify.txt, kv_extract.txt, table_extract.txt, entity_link.txt, rag_answer.txt}
  pipeline/etl.py
  infra/
    compose/{docker-compose.local.yml, traefik.yml, traefik-dynamic.yml, env.example}
    k8s/ (optional later: Helm charts)
  security/{dpia.md, ropa.md, retention_policy.md, threat_model.md}
  ops/
    runbooks/{ingest.md, calculators.md, hmrc.md, vector-indexing.md, dr-restore.md}
    dashboards/grafana.json
    alerts/prometheus-rules.yaml
  tests/{unit, integration, e2e, data/{synthetic, golden}}
  Makefile
  .gitea/workflows/ci.yml
  mkdocs.yml
```

---

# DELIVERABLES (RETURN ALL AS MARKED CODE BLOCKS)

1. **Ontology** (Concept model; JSON-Schema; JSON-LD; Neo4j DDL)
2. **Heuristics & Rules (YAML)**
3. **Extraction pipeline & prompts**
4. **RAG & Retrieval Layer** (chunking, Qdrant collections, indexer, retriever, fusion)
5. **Reasoning layer** (deterministic calculators + Cypher + tests)
6. **Agent interface (Tooling API)**
7. **Quality & Safety** (datasets, metrics, tests, red-team)
8. **Graph Constraints** (SHACL, IDs, bitemporal)
9. **Security & Compliance** (DPIA, ROPA, encryption, auditability)
10. **Worked Example** (end-to-end UK SA sample)
11. **Observability & SRE** (SLIs/SLOs, tracing, idempotency, DR, cost controls)
12. **Architecture & Local Infra** (**docker-compose** with Traefik + Authentik + Vault + MinIO + Qdrant + Neo4j + Postgres + Redis + Prometheus/Grafana + Loki + Unleash + services)
13. **Repo Scaffolding & Makefile** (dev tasks, lint, test, build, run)
14. **Firm Database Connectors** (data contracts, sync jobs, lineage)
15. **Traefik & Authentik configs** (static+dynamic, ForwardAuth, route labels)

---

# ONTOLOGY REQUIREMENTS (as before + RAG links)

- Nodes: `TaxpayerProfile`, `TaxYear`, `Jurisdiction`, `TaxForm`, `Schedule`, `FormBox`, `Document`, `Evidence`, `Party`, `Account`, `IncomeItem`, `ExpenseItem`, `PropertyAsset`, `BusinessActivity`, `Allowance`, `Relief`, `PensionContribution`, `StudentLoanPlan`, `Payment`, `ExchangeRate`, `Calculation`, `Rule`, `NormalizationEvent`, `Reconciliation`, `Consent`, `LegalBasis`, `ImportJob`, `ETLRun`
- Relationships: `BELONGS_TO`, `OF_TAX_YEAR`, `IN_JURISDICTION`, `HAS_SECTION`, `HAS_BOX`, `REPORTED_IN`, `COMPUTES`, `DERIVED_FROM`, `SUPPORTED_BY`, `PAID_BY`, `PAID_TO`, `OWNS`, `RENTED_BY`, `EMPLOYED_BY`, `APPLIES_TO`, `APPLIES`, `VIOLATES`, `NORMALIZED_FROM`, `HAS_VALID_BASIS`, `PRODUCED_BY`, **`CITES`**, **`DESCRIBES`**
- **Bitemporal** and **provenance** mandatory.
---

# UK-SPECIFIC REQUIREMENTS

- Year boundary 6 Apr–5 Apr; basis period reform toggle
- Employment aggregation, BIK, PAYE offsets
- Self-employment: allowable/disallowable, capital allowances (AIA/WDA/SBA), loss rules, **NIC Class 2 & 4**
- Property: FHL tests, **mortgage interest 20% credit**, Rent-a-Room, joint splits
- Savings/dividends: allowances & rate bands; ordering
- Personal allowance tapering; Gift Aid & pension gross-up; **HICBC**; **Student Loan** plans 1/2/4/5 & PGL
- Rounding per `FormBox.rounding_rule`

---

# YAML HEURISTICS (KEEP SEPARATE FILE)

- document_kinds, field_normalization, line_item_mapping
- period_inference (UK boundary + reform), dedupe_rules
- **validation_rules:** `utr_checksum`, `ni_number_regex`, `iban_check`, `vat_gb_mod97`, `rounding_policy: "HMRC"`, `numeric_tolerance: 0.01`
- **entity_resolution:** blocking keys, fuzzy thresholds, canonical source priority
- **privacy_redaction:** `mask_except_last4` for NI/UTR/IBAN/sort_code/phone/email
- **jurisdiction_overrides:** by {{jurisdiction}} and {{tax_year}}

---

# EXTRACTION PIPELINE (SPECIFY CODE & PROMPTS)

- ingest → classify → OCR/layout → extract (schema-constrained JSON with bbox/page) → validate → normalize → map_to_graph → post-checks
- Prompts: `doc_classify`, `kv_extract`, `table_extract` (multi-page), `entity_link`
- Contract: **JSON schema enforcement** with retry/validator loop; temperature guidance
- Reliability: de-skew/rotation/language/handwriting policy
- Mapping config: JSON mapping to nodes/edges + provenance (doc_id/page/bbox/text_hash)

---

# RAG & RETRIEVAL LAYER (Qdrant + KG Fusion)

- Collections: `firm_knowledge`, `legislation`, `best_practices`, `glossary` (payloads include jurisdiction, tax_years, topic_tags, version, `pii_free:true`)
- Chunking: layout-aware; tables serialized; ~1.5k token chunks, 10–15% overlap
- Indexer: de-identify PII; placeholders only; embeddings (dense) + sparse; upsert with payload
- Retriever: hybrid scoring (α·dense + β·sparse),
  filters (jurisdiction/tax_year), rerank; return **citations** + **KG hints**
- Fusion: boost results linked to applicable `Rule`/`Calculation`/`Evidence` for current schedule
- Right-to-erasure: purge vectors via payload filter (`client_id?` only for client-authored knowledge)

---

# REASONING & CALCULATION (DETERMINISTIC)

- Order: incomes → allowances/capital allowances → loss offsets → personal allowance → savings/dividend bands → HICBC & student loans → NIC Class 2/4 → property 20% credit/FHL/Rent-a-Room
- Cypher materializers per schedule/box; explanations via `DERIVED_FROM` and RAG `CITES`
- Unit tests per rule; golden files; property-based tests

---

# AGENT TOOLING API (JSON SCHEMAS)

1. `ComputeSchedule({tax_year, taxpayer_id, schedule_id}) -> {boxes[], totals[], explanations[]}`
2. `PopulateFormBoxes({tax_year, taxpayer_id, form_id}) -> {fields[], pdf_fields[], confidence, calibrated_confidence}`
3. `AskClarifyingQuestion({gap, candidate_values, evidence}) -> {question_text, missing_docs}`
4. `GenerateEvidencePack({scope}) -> {bundle_manifest, signed_hashes}`
5. `ExplainLineage({node_id|field}) -> {chain:[evidence], graph_paths}`
6. `CheckDocumentCoverage({tax_year, taxpayer_id}) -> {required_docs[], missing[], blockers[]}`
7. `SubmitToHMRC({tax_year, taxpayer_id, dry_run}) -> {status, submission_id?, errors[]}`
8. `ReconcileBank({account_id, period}) -> {unmatched_invoices[], unmatched_bank_lines[], deltas}`
9. `RAGSearch({query, tax_year?, jurisdiction?, k?}) -> {chunks[], citations[], kg_hints[], calibrated_confidence}`
10. `SyncFirmDatabases({since}) -> {objects_synced, errors[]}`

**Env flags:** `HMRC_MTD_ITSA_MODE`, `RATE_LIMITS`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`, `RAG_ALPHA_BETA_GAMMA`

---

# SECURITY & COMPLIANCE

- **Traefik + Authentik SSO at edge** (ForwardAuth); per-route RBAC; inject verified claims headers/JWT
- **Vault** for secrets (AppRole/JWT, Transit for envelope encryption)
- **PII minimization:** no PII in Qdrant; placeholders; PII mapping only in Secure Client Data Store
- **Auditability:** tamper-evident logs (hash chain), signer identity, time sync
- **DPIA, ROPA, retention policy, right-to-erasure** workflows

---

# CI/CD (Gitea)

- Gitea Actions: `lint` (ruff/mypy/eslint), `test` (pytest+coverage, e2e), `build` (Docker), `scan` (Trivy/SAST), `push` (registry), `deploy` (compose up or K8s apply)
- SemVer tags; SBOM (Syft); OpenAPI + MkDocs publish; pre-commit hooks

---

# OBSERVABILITY & SRE

- SLIs/SLOs: ingest_time_p50, extract_precision@field≥0.97, reconciliation_pass_rate≥0.98, lineage_coverage≥0.99, time_to_review_p95
- Dashboards: ingestion throughput, OCR error rates, extraction precision, mapping latency, calculator failures, HMRC submits, **RAG recall/precision & faithfulness**
- Alerts: OCR 5xx spike, extraction precision dip, reconciliation failures, HMRC rate-limit breaches, RAG drift
- Backups/DR: Neo4j dump (daily), Postgres PITR, Qdrant snapshot, MinIO versioning; quarterly restore test
- Cost controls: embedding cache, incremental indexing, compaction/TTL for stale vectors, cold archive for images

---

# OUTPUT FORMAT (STRICT)

Return results in the following order, each in its own fenced code block **with the exact language tag**:

```md
# Concept Model
...
```

```json
// FILE: schemas/nodes_and_edges.schema.json
{ ... }
```

```json
// FILE: schemas/context.jsonld
{ ... }
```

```turtle
# FILE: schemas/shapes.ttl
# SHACL shapes for node/edge integrity
...
```

```cypher
// FILE: db/neo4j_schema.cypher
CREATE CONSTRAINT ...
```

```yaml
# FILE: config/heuristics.yaml
document_kinds: ...
```

```json
# FILE: config/mapping.json
{ "mappings": [ ... ] }
```

```yaml
# FILE: retrieval/chunking.yaml
# Layout-aware chunking, tables, overlap, token targets
```

```json
# FILE: retrieval/qdrant_collections.json
{
  "collections": [
    { "name": "firm_knowledge", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "legislation", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "best_practices", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "glossary", "dense": {"size": 768}, "sparse": true, "payload_schema": { ... } }
  ]
}
```

```python
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
...
```

```python
# FILE: retrieval/retriever.py
# Hybrid retrieval (alpha,beta), rerank, filters, return citations + KG hints
...
```

```python
# FILE: retrieval/fusion.py
# Join RAG chunks to KG rules/calculations/evidence; boost linked results
...
```

```txt
# FILE: prompts/rag_answer.txt
[Instruction: cite every claim; forbid PII; return calibrated_confidence; JSON contract]
```

```python
# FILE: pipeline/etl.py
def ingest(...): ...
```

```txt
# FILE: prompts/kv_extract.txt
[Prompt with JSON contract + examples]
```

```cypher
// FILE: reasoning/schedule_queries.cypher
// SA105: compute property income totals
MATCH ...
```

```json
// FILE: tools/agent_tools.json
{ ... }
```

```yaml
# FILE: infra/compose/docker-compose.local.yml
# Traefik (with Authentik ForwardAuth), Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prometheus/Grafana, Loki, Unleash, all services
```

```yaml
# FILE: infra/compose/traefik.yml
# Static config: entryPoints, providers, certificates, access logs
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
providers:
  docker: {}
  file:
    filename: /etc/traefik/traefik-dynamic.yml
api:
  dashboard: true
log:
  level: INFO
accessLog: {}
```

```yaml
# FILE: infra/compose/traefik-dynamic.yml
# Dynamic config: Authentik ForwardAuth middleware + routers per service
http:
  middlewares:
    authentik-forwardauth:
      forwardAuth:
        address: "http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik"
        trustForwardHeader: true
        authResponseHeaders:
          - X-Authenticated-User
          - X-Authenticated-Email
          - X-Authenticated-Groups
          - Authorization
    rate-limit:
      rateLimit:
        average: 50
        burst: 100
  routers:
    svc-extract:
      rule: "Host(`api.local`) && PathPrefix(`/extract`)"
      entryPoints: ["websecure"]
      service: svc-extract
      middlewares: ["authentik-forwardauth", "rate-limit"]
      tls: {}
  services:
    svc-extract:
      loadBalancer:
        servers:
          - url: "http://svc-extract:8000"
```

```yaml
# FILE: infra/compose/env.example
DOMAIN=local
EMAIL=admin@local
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=miniopass
POSTGRES_PASSWORD=postgres
NEO4J_PASSWORD=neo4jpass
QDRANT__SERVICE__GRPC_PORT=6334
VAULT_DEV_ROOT_TOKEN_ID=root
AUTHENTIK_SECRET_KEY=changeme
RAG_EMBEDDING_MODEL=bge-small-en-v1.5
RAG_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
```

```yaml
# FILE: .gitea/workflows/ci.yml
# Lint → Test → Build → Scan → Push → Deploy (compose up)
```

```makefile
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```

```md
## Datasets, Metrics, Acceptance Criteria
- Extraction precision/recall per field
- Schedule-level absolute error
- Reconciliation pass-rate
- Explanation coverage
- RAG retrieval: top-k recall, nDCG, faithfulness, groundedness
- Security: Traefik+Authentik route auth tests, header spoofing prevention (internal network, trusted proxy)
- Red-team cases (OCR noise, conflicting docs, PII leak prevention)
...
```

---

# STYLE & GUARANTEES

- Be **concise but complete**; prefer schemas/code over prose.
- **No chain-of-thought.** Provide final artifacts and brief rationales.
- Every numeric output must include **lineage to Evidence → Document (page/bbox/text_hash)** and **citations** for narrative answers.
- Parameterize by {{jurisdiction}} and {{tax_year}}.
- Include **calibrated_confidence** and name calibration method.
- Enforce **SHACL** on KG writes; reject/queue fixes on violation.
- **No PII** in Qdrant. Use de-ID placeholders; keep mappings only in Secure Client Data Store.
- Deterministic IDs; reproducible builds; version-pinned dependencies.
- **Trust boundary:** only Traefik exposes ports; all services on a private network; services accept only requests with Traefik’s network identity; **never trust client-supplied auth headers**.

# START

Produce the deliverables now, in the exact order and file/block structure above, implementing the **local-first stack (Python 3.12, Prefect, Vault, MinIO, Playwright, Qdrant, Authentik, Traefik, Docker-Compose, Gitea)** with optional **scale-out** notes (Temporal, K8s) where specified.