ai-tax-agent/docs/ARCHITECT.md
harkon fdba81809f
completed local setup with compose
2025-11-26 13:17:17 +00:00


ROLE

You are a Solution Architect + Ontologist + Data Engineer + Platform/SRE delivering a production-grade accounting knowledge system that ingests documents, fuses a Knowledge Graph (KG) with a Vector DB (Qdrant) for RAG, integrates with Firm Databases, and powers AI agents to complete workflows like UK Self Assessment — with auditable provenance. Authentication & authorization are centralized at the edge: Traefik gateway + Authentik SSO (OIDC/ForwardAuth). Backend services trust Traefik on an internal network and consume user/role claims from forwarded headers/JWT.

OBJECTIVE

Deliver a complete, implementable solution—ontology, extraction pipeline, RAG+KG retrieval, deterministic calculators, APIs, validations, architecture & stack, infra-as-code, CI/CD, observability, security/governance, test plan, and a worked example—so agents can:

  1. read documents (and scrape portals via RPA),
  2. populate/maintain a compliant accounting/tax KG,
  3. retrieve firm knowledge via RAG (vector + keyword + graph),
  4. compute/validate schedules and fill forms,
  5. submit (stub/sandbox/live),
  6. justify every output with traceable provenance (doc/page/bbox) and citations.

SCOPE & VARIABLES

  • Jurisdiction: {{jurisdiction}} (default: UK)
  • Tax regime / forms: {{forms}} (default: SA100 + SA102, SA103, SA105, SA110; optional SA108)
  • Accounting basis: {{standards}} (default: UK GAAP; support IFRS/XBRL mapping)
  • Document types: bank statements, invoices, receipts, P&L, balance sheet, payslips, dividend vouchers, property statements, prior returns, letters, certificates.
  • Primary stores: KG = Neo4j; RAG = Qdrant; Objects = MinIO; Secrets = Vault; IdP/SSO = Authentik; API Gateway = Traefik.
  • PII constraints: GDPR/UK-GDPR; no raw PII in vector DB (de-identify before indexing); role-based access; encryption; retention; right-to-erasure.

ARCHITECTURE & STACK (LOCAL-FIRST; SCALE-OUT READY)

Edge & Identity (centralized)

  • Traefik (reverse proxy & ingress) terminates TLS and delegates AuthN/AuthZ to Authentik:

    • Use Authentik Outpost (ForwardAuth) middleware in Traefik.
    • Traefik injects verified headers/JWT to upstream services: X-Authenticated-User, X-Authenticated-Email, X-Authenticated-Groups, Authorization: Bearer <jwt>.
    • Per-route RBAC via Traefik middlewares (group/claim checks); services only enforce fine-grained, app-level authorization using forwarded claims (no OIDC in each service).
    • All services are private (only reachable behind Traefik on an internal Docker/K8s network). Direct access is denied.
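
A minimal sketch of how a backend service might consume the forwarded claims for app-level authorization. The header names come from the list above; `Principal`, `Forbidden`, `principal_from_headers`, `require_group`, and the comma separator for groups are assumptions for illustration, not part of any Traefik or Authentik API. It is only safe when the service is reachable exclusively through Traefik, since the headers are trusted as-is.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Principal:
    user: str
    email: str
    groups: frozenset[str]


class Forbidden(Exception):
    """Raised when the forwarded claims do not satisfy an app-level rule."""


def principal_from_headers(headers: Mapping[str, str], group_sep: str = ",") -> Principal:
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    user = lowered.get("x-authenticated-user", "").strip()
    if not user:
        # No verified user header: the request did not come through the gateway.
        raise Forbidden("missing X-Authenticated-User")
    raw_groups = lowered.get("x-authenticated-groups", "")
    # Assumption: groups arrive as one delimited string (comma by default).
    groups = frozenset(g.strip() for g in raw_groups.split(group_sep) if g.strip())
    email = lowered.get("x-authenticated-email", "").strip()
    return Principal(user=user, email=email, groups=groups)


def require_group(principal: Principal, group: str) -> None:
    # Fine-grained check performed inside the service, after edge RBAC.
    if group not in principal.groups:
        raise Forbidden(f"{principal.user} is not in group {group!r}")
```

A route handler would call `principal_from_headers(request.headers)` once and pass the resulting `Principal` to whatever rule applies, for example `require_group(principal, "reviewers")`.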

Services (independent deployables; Python 3.12 unless stated)

  1. svc-ingestion — uploads/URLs; checksum; MinIO write; emits doc.ingested.
  2. svc-rpa — Playwright RPA for firm/client portals; Prefect-scheduled; emits doc.ingested.
  3. svc-ocr — Tesseract (local) or Textract (scale); de-skew/rotation/layout; emits doc.ocr_ready.
  4. svc-extract — LLM + rules + table detectors → schema-constrained JSON (kv + tables + bbox/page); emits doc.extracted.
  5. svc-normalize-map — Consumes doc.extracted events; normalizes extracted data (currencies, dates); performs entity resolution; assigns tax year; maps to KG nodes/edges with Evidence anchors; emits kg.upsert.ready events.
  6. svc-kg — Consumes kg.upsert.ready events; performs Neo4j DDL operations + SHACL validation; bitemporal writes {valid_from, valid_to, asserted_at}; RDF export; emits kg.upserted events.
  7. svc-rag-indexer — chunk/de-identify/embed; upsert Qdrant collections (firm knowledge, legislation, best practices, glossary).
  8. svc-rag-retriever — hybrid retrieval (dense + sparse) + rerank + KG-fusion; returns chunks + citations + KG join hints.
  9. svc-reason — deterministic calculators (employment, self-employment, property, dividends/interest, allowances, NIC, HICBC, student loans); Cypher materializers; explanations.
  10. svc-forms — fill PDFs; ZIP evidence bundle (signed manifest).
  11. svc-hmrc — submit stub|sandbox|live; rate-limit & retries; submission audit.
  12. svc-firm-connectors — read-only connectors to Firm Databases; sync to Secure Client Data Store with lineage.
  13. ui-review — Next.js reviewer portal (SSO via Traefik+Authentik); reviewers accept/override extractions.
  14. svc-coverage — Evaluates document coverage against policies, identifies gaps, and generates clarifying questions.

Orchestration & Messaging

  • Prefect 2.x for local orchestration; Temporal for production scale (sagas, retries, idempotency).
  • Events: Kafka (or SQS/SNS) — doc.ingested, doc.ocr_ready, doc.extracted, kg.upsert.ready, kg.upserted, rag.indexed, calc.schedule_ready, form.filled, hmrc.submitted, review.requested, review.completed, firm.sync.completed.
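
One possible shape for the event envelope, sketched with the standard library only. The topic names are taken from the list above; the `Event` class, its field names, and the JSON serialization are assumptions, and no Kafka or SQS client is involved here.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Topic names as listed in the spec.
TOPICS = frozenset({
    "doc.ingested", "doc.ocr_ready", "doc.extracted", "kg.upsert.ready",
    "kg.upserted", "rag.indexed", "calc.schedule_ready", "form.filled",
    "hmrc.submitted", "review.requested", "review.completed",
    "firm.sync.completed",
})


@dataclass(frozen=True)
class Event:
    topic: str
    payload: dict
    # Hypothetical envelope fields: a unique id and a UTC timestamp.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject typos early instead of publishing to an unknown topic.
        if self.topic not in TOPICS:
            raise ValueError(f"unknown topic: {self.topic}")

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

A producer such as svc-ingestion could build `Event("doc.ingested", {"doc_id": ...})` and hand `to_json()` to whichever broker client is in use.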

Concrete Stack (pin/assume unless replaced)

  • Languages: Python 3.12, TypeScript 5/Node 20

  • Frameworks: FastAPI, Pydantic v2, SQLAlchemy 2 (ledger), Prefect 2.x (local), Temporal (scale)

  • Gateway: Traefik 3.x with Authentik Outpost (ForwardAuth)

  • Identity/SSO: Authentik (OIDC/OAuth2)

  • Secrets: Vault (AppRole/JWT; Transit for envelope encryption)

  • Object Storage: MinIO (S3 API)

  • Vector DB: Qdrant 1.x (dense + sparse hybrid)

  • Embeddings/Rerankers (local-first): Dense: bge-m3 or bge-small-en-v1.5; Sparse: BM25/SPLADE (Qdrant sparse); Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2

  • Datastores:

    • Secure Client Data Store: PostgreSQL 15 (encrypted; RLS; pgcrypto)
    • KG: Neo4j 5.x
    • Cache/locks: Redis
  • Infra: Docker-Compose for local; Kubernetes for scale (Helm, ArgoCD optional later)

  • CI/CD: Gitea + Gitea Actions (or Drone) → container registry → deploy

Data Layer (three pillars + fusion)

  1. Firm Databases → Firm Connectors (read-only) → Secure Client Data Store (Postgres) with lineage.
  2. Vector DB / Knowledge Base (Qdrant) — internal knowledge, legislation, best practices, glossary; no PII (placeholders + hashes).
  3. Knowledge Graph (Neo4j) — accounting/tax ontology with evidence anchors and rules/calculations.

Fusion strategy: Query → RAG retrieve (Qdrant) + KG traverse → fusion scoring (α·dense + β·sparse + γ·KG-link-boost) → results with citations (URL/doc_id+page/anchor) and graph paths.
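
The fusion score above can be sketched as a plain weighted sum. This assumes dense and sparse scores are already normalized to [0, 1] and treats the KG-link boost as a binary indicator; the `Candidate` type, the `fuse` function, and the default weights are illustrative, not tuned values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    dense: float     # normalized dense similarity (assumed in [0, 1])
    sparse: float    # normalized sparse score (assumed in [0, 1])
    kg_linked: bool  # True if the chunk links to a relevant KG node


def fuse(
    candidates: list[Candidate],
    alpha: float = 0.5,
    beta: float = 0.3,
    gamma: float = 0.2,
) -> list[tuple[str, float]]:
    """Rank candidates by alpha*dense + beta*sparse + gamma*KG-link-boost."""

    def score(c: Candidate) -> float:
        return alpha * c.dense + beta * c.sparse + gamma * (1.0 if c.kg_linked else 0.0)

    scored = [(c.chunk_id, score(c)) for c in candidates]
    # Highest fused score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these weights, a chunk with a weaker dense score can outrank a stronger one when it is linked to an applicable Rule or Calculation, which is the intended effect of γ.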

Non-functional Targets

  • SLOs: ingest→extract p95 ≤ 3m; reconciliation ≥ 98%; lineage coverage ≥ 99%; schedule error ≤ 1/1k
  • Throughput: local 2 docs/s; scale 5 docs/s sustained; burst 20 docs/s
  • Idempotency: sha256(doc_checksum + extractor_version)
  • Retention: raw images 7y; derived text 2y; vectors (non-PII) 7y; PII-min logs 90d
  • Erasure: per client_id across MinIO, KG, Qdrant (payload filter), Postgres rows
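
The idempotency key above could be derived as follows. The `:` separator between the two parts is an assumption added to keep the concatenation unambiguous; the spec only states `sha256(doc_checksum + extractor_version)`. The function name is hypothetical.

```python
import hashlib


def idempotency_key(doc_bytes: bytes, extractor_version: str) -> str:
    # Checksum of the raw document, as computed at ingestion.
    doc_checksum = hashlib.sha256(doc_bytes).hexdigest()
    # Assumption: join with ':' so ("ab", "c") and ("a", "bc") cannot collide.
    material = f"{doc_checksum}:{extractor_version}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

Re-running the same extractor version over the same bytes yields the same key, so a consumer can skip work it has already done; bumping the extractor version forces reprocessing.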

REPOSITORY LAYOUT (monorepo, local-first)

repo/
  apps/
    svc-ingestion/      svc-rpa/           svc-ocr/           svc-extract/
    svc-normalize-map/  svc-kg/            svc-rag-indexer/   svc-rag-retriever/
    svc-reason/         svc-forms/         svc-hmrc/          svc-firm-connectors/
    svc-coverage/       ui-review/
  kg/
    ONTOLOGY.md
    schemas/{nodes_and_edges.schema.json, context.jsonld, shapes.ttl}
    db/{neo4j_schema.cypher, seed.cypher}
    reasoning/schedule_queries.cypher
  retrieval/
    chunking.yaml  qdrant_collections.json  indexer.py  retriever.py  fusion.py
  config/{heuristics.yaml, mapping.json}
  prompts/{doc_classify.txt, kv_extract.txt, table_extract.txt, entity_link.txt, rag_answer.txt}
  pipeline/etl.py
  infra/
    compose/{docker-compose.local.yml, traefik.yml, traefik-dynamic.yml, env.example}
    k8s/ (optional later: Helm charts)
  security/{dpia.md, ropa.md, retention_policy.md, threat_model.md}
  ops/
    runbooks/{ingest.md, calculators.md, hmrc.md, vector-indexing.md, dr-restore.md}
    dashboards/grafana.json
    alerts/prometheus-rules.yaml
  tests/{unit, integration, e2e, data/{synthetic, golden}}
  Makefile
  .gitea/workflows/ci.yml
  mkdocs.yml

DELIVERABLES (RETURN ALL AS MARKED CODE BLOCKS)

  1. Ontology (Concept model; JSON-Schema; JSON-LD; Neo4j DDL)
  2. Heuristics & Rules (YAML)
  3. Extraction pipeline & prompts
  4. RAG & Retrieval Layer (chunking, Qdrant collections, indexer, retriever, fusion)
  5. Reasoning layer (deterministic calculators + Cypher + tests)
  6. Agent interface (Tooling API)
  7. Quality & Safety (datasets, metrics, tests, red-team)
  8. Graph Constraints (SHACL, IDs, bitemporal)
  9. Security & Compliance (DPIA, ROPA, encryption, auditability)
  10. Worked Example (end-to-end UK SA sample)
  11. Observability & SRE (SLIs/SLOs, tracing, idempotency, DR, cost controls)
  12. Architecture & Local Infra (docker-compose with Traefik + Authentik + Vault + MinIO + Qdrant + Neo4j + Postgres + Redis + Prometheus/Grafana + Loki + Unleash + services)
  13. Repo Scaffolding & Makefile (dev tasks, lint, test, build, run)
  14. Firm Database Connectors (data contracts, sync jobs, lineage)
  15. Traefik & Authentik configs (static+dynamic, ForwardAuth, route labels)

ONTOLOGY REQUIREMENTS (as before + RAG links)

  • Nodes: TaxpayerProfile, TaxYear, Jurisdiction, TaxForm, Schedule, FormBox, Document, Evidence, Party, Account, IncomeItem, ExpenseItem, PropertyAsset, BusinessActivity, Allowance, Relief, PensionContribution, StudentLoanPlan, Payment, ExchangeRate, Calculation, Rule, NormalizationEvent, Reconciliation, Consent, LegalBasis, ImportJob, ETLRun
  • Relationships: BELONGS_TO, OF_TAX_YEAR, IN_JURISDICTION, HAS_SECTION, HAS_BOX, REPORTED_IN, COMPUTES, DERIVED_FROM, SUPPORTED_BY, PAID_BY, PAID_TO, OWNS, RENTED_BY, EMPLOYED_BY, APPLIES_TO, APPLIES, VIOLATES, NORMALIZED_FROM, HAS_VALID_BASIS, PRODUCED_BY, CITES, DESCRIBES
  • Bitemporal and provenance mandatory.
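
To illustrate the bitemporal requirement, here is an in-memory sketch of an "as of" lookup over `{valid_from, valid_to, asserted_at}`. The `Fact` type and `as_of` function are hypothetical, and the half-open validity interval is an assumption; in the real system this would be a Cypher query against Neo4j.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    node_id: str
    value: str
    valid_from: date
    valid_to: Optional[date]  # None = open-ended; interval is [valid_from, valid_to)
    asserted_at: datetime     # when the system recorded this version


def as_of(
    facts: list[Fact], node_id: str, valid_on: date, known_at: datetime
) -> Optional[Fact]:
    """Return the version that was valid on `valid_on` as known at `known_at`."""
    visible = [
        f for f in facts
        if f.node_id == node_id
        and f.asserted_at <= known_at
        and f.valid_from <= valid_on
        and (f.valid_to is None or valid_on < f.valid_to)
    ]
    # The latest assertion wins, so corrections supersede earlier records.
    return max(visible, key=lambda f: f.asserted_at, default=None)
```

Keeping both time axes lets an auditor reproduce what a return looked like on the day it was filed, even after a later correction.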

UK-SPECIFIC REQUIREMENTS

  • Year boundary 6 Apr–5 Apr; basis period reform toggle
  • Employment aggregation, BIK, PAYE offsets
  • Self-employment: allowable/disallowable, capital allowances (AIA/WDA/SBA), loss rules, NIC Class 2 & 4
  • Property: FHL tests, mortgage interest 20% credit, Rent-a-Room, joint splits
  • Savings/dividends: allowances & rate bands; ordering
  • Personal allowance tapering; Gift Aid & pension gross-up; HICBC; Student Loan plans 1/2/4/5 & PGL
  • Rounding per FormBox.rounding_rule
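
The year-boundary rule can be sketched as a small date helper. The `YYYY-YY` label format and both function names are assumptions; the basis period reform toggle is not modelled here.

```python
from datetime import date


def uk_tax_year(d: date) -> str:
    """Label the UK tax year (6 Apr to 5 Apr) containing `d`, e.g. '2024-25'."""
    # Dates on or after 6 April belong to the year starting in d.year.
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def uk_tax_year_bounds(label: str) -> tuple[date, date]:
    """Return the inclusive first and last day of a 'YYYY-YY' tax year."""
    start_year = int(label[:4])
    return date(start_year, 4, 6), date(start_year + 1, 4, 5)
```

For example, a payslip dated 5 April 2024 falls in `2023-24`, while one dated 6 April 2024 falls in `2024-25`.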

YAML HEURISTICS (KEEP SEPARATE FILE)

  • document_kinds, field_normalization, line_item_mapping
  • period_inference (UK boundary + reform), dedupe_rules
  • validation_rules: utr_checksum, ni_number_regex, iban_check, vat_gb_mod97, rounding_policy: "HMRC", numeric_tolerance: 0.01
  • entity_resolution: blocking keys, fuzzy thresholds, canonical source priority
  • privacy_redaction: mask_except_last4 for NI/UTR/IBAN/sort_code/phone/email
  • jurisdiction_overrides: by {{jurisdiction}} and {{tax_year}}
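
Two of the heuristics above, `mask_except_last4` and `iban_check`, sketched as plain functions. The IBAN check is the standard mod-97 test; the masking behaviour (mask alphanumerics only, keep separators) is an assumption about what the YAML rule should do, and email addresses would likely need a dedicated rule.

```python
import re


def mask_except_last4(value: str, mask_char: str = "*") -> str:
    """Mask all but the last four alphanumeric characters, keeping separators."""
    alnum_total = sum(ch.isalnum() for ch in value)
    to_mask = max(alnum_total - 4, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def iban_check(iban: str) -> bool:
    """Validate an IBAN with the mod-97 check (structure only, not existence)."""
    compact = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}", compact):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

Country-specific length checks are omitted here; a production rule would add them per jurisdiction.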

EXTRACTION PIPELINE (SPECIFY CODE & PROMPTS)

  • ingest → classify → OCR/layout → extract (schema-constrained JSON with bbox/page) → validate → normalize → map_to_graph → post-checks
  • Prompts: doc_classify, kv_extract, table_extract (multi-page), entity_link
  • Contract: JSON schema enforcement with retry/validator loop; temperature guidance
  • Reliability: de-skew/rotation/language/handwriting policy
  • Mapping config: JSON mapping to nodes/edges + provenance (doc_id/page/bbox/text_hash)
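
The retry/validator loop in the contract bullet might look like the sketch below. The required fields in `REQUIRED_FIELDS` are a hypothetical stand-in for the real JSON schema, and `call_model` is any callable that takes a prompt and returns raw text; no specific LLM client is assumed.

```python
import json
from typing import Callable

# Hypothetical minimal contract; the real schema lives in the JSON-Schema file.
REQUIRED_FIELDS = {"doc_id": str, "page": int, "bbox": list, "fields": dict}


def validate(candidate: dict) -> list[str]:
    """Return a list of contract violations (empty means valid)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in candidate:
            errors.append(f"missing field: {name}")
        elif not isinstance(candidate[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    if isinstance(candidate.get("bbox"), list) and len(candidate["bbox"]) != 4:
        errors.append("bbox must contain 4 numbers")
    return errors


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    feedback = ""
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(candidate, dict):
                errors = validate(candidate)
            else:
                errors = ["top-level value must be an object"]
            if not errors:
                return candidate
        # Feed the violations back so the next attempt can correct them.
        feedback = "\n\nPrevious output was rejected:\n- " + "\n- ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")
```

Documents that exhaust the attempts would be routed to the reviewer queue rather than written to the graph.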

RAG & RETRIEVAL LAYER (Qdrant + KG Fusion)

  • Collections: firm_knowledge, legislation, best_practices, glossary (payloads include jurisdiction, tax_years, topic_tags, version, pii_free:true)
  • Chunking: layout-aware; tables serialized; ~1.5k token chunks, 10–15% overlap
  • Indexer: de-identify PII; placeholders only; embeddings (dense) + sparse; upsert with payload
  • Retriever: hybrid scoring (α·dense + β·sparse), filters (jurisdiction/tax_year), rerank; return citations + KG hints
  • Fusion: boost results linked to applicable Rule/Calculation/Evidence for current schedule
  • Right-to-erasure: purge vectors via payload filter on client_id (present only on client-authored knowledge)
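
The chunk size and overlap targets can be sketched as a sliding window over a token list. This ignores the layout-aware and table-serialization parts and assumes the input is already tokenized; `chunk_tokens` and its defaults are illustrative.

```python
def chunk_tokens(
    tokens: list[str], target: int = 1500, overlap_ratio: float = 0.12
) -> list[list[str]]:
    """Split tokens into windows of `target` size with fractional overlap."""
    if target <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("target must be positive and overlap_ratio in [0, 1)")
    overlap = round(target * overlap_ratio)
    # Always advance by at least one token so the loop terminates.
    step = max(target - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary is still retrievable from at least one chunk.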

REASONING & CALCULATION (DETERMINISTIC)

  • Order: incomes → allowances/capital allowances → loss offsets → personal allowance → savings/dividend bands → HICBC & student loans → NIC Class 2/4 → property 20% credit/FHL/Rent-a-Room
  • Cypher materializers per schedule/box; explanations via DERIVED_FROM and RAG CITES
  • Unit tests per rule; golden files; property-based tests
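
As one example of a deterministic calculator, personal allowance tapering can be written with `Decimal` so results are reproducible. The function name is hypothetical, the thresholds are passed in rather than hard-coded, and rounding the reduction down to whole pounds is an assumption that should defer to `FormBox.rounding_rule`.

```python
from decimal import ROUND_DOWN, Decimal


def tapered_personal_allowance(
    adjusted_net_income: Decimal, allowance: Decimal, income_limit: Decimal
) -> Decimal:
    """Reduce the allowance by 1 for every 2 of income above the limit."""
    excess = max(adjusted_net_income - income_limit, Decimal(0))
    # Assumption: the reduction is truncated to whole pounds.
    reduction = (excess / 2).quantize(Decimal("1"), rounding=ROUND_DOWN)
    # The allowance never goes below zero.
    return max(allowance - reduction, Decimal(0))
```

The thresholds would come from Rule nodes for the given {{tax_year}}; values such as an allowance of 12570 and a limit of 100000 are illustrative inputs for a golden-file test, and a property-based test can assert that the result is monotonically non-increasing in income.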

AGENT TOOLING API (JSON SCHEMAS)

  1. ComputeSchedule({tax_year, taxpayer_id, schedule_id}) -> {boxes[], totals[], explanations[]}
  2. PopulateFormBoxes({tax_year, taxpayer_id, form_id}) -> {fields[], pdf_fields[], confidence, calibrated_confidence}
  3. AskClarifyingQuestion({gap, candidate_values, evidence}) -> {question_text, missing_docs}
  4. GenerateEvidencePack({scope}) -> {bundle_manifest, signed_hashes}
  5. ExplainLineage({node_id|field}) -> {chain:[evidence], graph_paths}
  6. CheckDocumentCoverage({tax_year, taxpayer_id}) -> {required_docs[], missing[], blockers[]}
  7. SubmitToHMRC({tax_year, taxpayer_id, dry_run}) -> {status, submission_id?, errors[]}
  8. ReconcileBank({account_id, period}) -> {unmatched_invoices[], unmatched_bank_lines[], deltas}
  9. RAGSearch({query, tax_year?, jurisdiction?, k?}) -> {chunks[], citations[], kg_hints[], calibrated_confidence}
  10. SyncFirmDatabases({since}) -> {objects_synced, errors[]}
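
A sketch of how the tool signatures above could be enforced before a call reaches a service. Only three tools are shown; the `TOOL_ARGS` table and `dispatch` function are hypothetical, and a real implementation would validate against the JSON schemas instead of bare argument names.

```python
from typing import Any, Callable

# Argument names copied from the signatures above; '?' marks optional ones.
TOOL_ARGS: dict[str, dict[str, set[str]]] = {
    "ComputeSchedule": {
        "required": {"tax_year", "taxpayer_id", "schedule_id"},
        "optional": set(),
    },
    "SubmitToHMRC": {
        "required": {"tax_year", "taxpayer_id", "dry_run"},
        "optional": set(),
    },
    "RAGSearch": {
        "required": {"query"},
        "optional": {"tax_year", "jurisdiction", "k"},
    },
}


def dispatch(
    tool: str, args: dict[str, Any], handlers: dict[str, Callable[..., Any]]
) -> Any:
    spec = TOOL_ARGS.get(tool)
    if spec is None:
        raise KeyError(f"unknown tool: {tool}")
    supplied = set(args)
    missing = spec["required"] - supplied
    unknown = supplied - spec["required"] - spec["optional"]
    if missing or unknown:
        raise ValueError(
            f"{tool}: missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    # Arguments are valid by name; hand off to the registered handler.
    return handlers[tool](**args)
```

Rejecting unknown arguments up front keeps agent mistakes from silently reaching calculators or the HMRC submitter.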

Env flags: HMRC_MTD_ITSA_MODE, RATE_LIMITS, RAG_EMBEDDING_MODEL, RAG_RERANKER_MODEL, RAG_ALPHA_BETA_GAMMA


SECURITY & COMPLIANCE

  • Traefik + Authentik SSO at edge (ForwardAuth); per-route RBAC; inject verified claims headers/JWT
  • Vault for secrets (AppRole/JWT, Transit for envelope encryption)
  • PII minimization: no PII in Qdrant; placeholders; PII mapping only in Secure Client Data Store
  • Auditability: tamper-evident logs (hash chain), signer identity, time sync
  • DPIA, ROPA, retention policy, right-to-erasure workflows
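
The tamper-evident log (hash chain) can be illustrated in a few lines. Entry fields, the all-zero genesis value, and the function names are assumptions; signer identity and time sync are out of scope for this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for the first entry's prev_hash


def _digest(body: dict) -> str:
    # Canonical JSON so the same entry always hashes to the same value.
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(chain: list[dict], actor: str, action: str, timestamp: str) -> dict:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {"actor": actor, "action": action, "timestamp": timestamp, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash and check each entry points at its predecessor."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, editing or deleting any earlier entry invalidates every entry after it.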

CI/CD (Gitea)

  • Gitea Actions: lint (ruff/mypy/eslint), test (pytest+coverage, e2e), build (Docker), scan (Trivy/SAST), push (registry), deploy (compose up or K8s apply)
  • SemVer tags; SBOM (Syft); OpenAPI + MkDocs publish; pre-commit hooks

OBSERVABILITY & SRE

  • SLIs/SLOs: ingest_time_p50, extract_precision@field≥0.97, reconciliation_pass_rate≥0.98, lineage_coverage≥0.99, time_to_review_p95
  • Dashboards: ingestion throughput, OCR error rates, extraction precision, mapping latency, calculator failures, HMRC submits, RAG recall/precision & faithfulness
  • Alerts: OCR 5xx spike, extraction precision dip, reconciliation failures, HMRC rate-limit breaches, RAG drift
  • Backups/DR: Neo4j dump (daily), Postgres PITR, Qdrant snapshot, MinIO versioning; quarterly restore test
  • Cost controls: embedding cache, incremental indexing, compaction/TTL for stale vectors, cold archive for images

OUTPUT FORMAT (STRICT)

Return results in the following order, each in its own fenced code block with the exact language tag:

<!-- FILE: ONTOLOGY.md -->

# Concept Model

...
// FILE: schemas/nodes_and_edges.schema.json
{ ... }
// FILE: schemas/context.jsonld
{ ... }
# FILE: schemas/shapes.ttl
# SHACL shapes for node/edge integrity
...
// FILE: db/neo4j_schema.cypher
CREATE CONSTRAINT ...
# FILE: config/heuristics.yaml
document_kinds: ...
# FILE: config/mapping.json
{ "mappings": [ ... ] }
# FILE: retrieval/chunking.yaml
# Layout-aware chunking, tables, overlap, token targets
# FILE: retrieval/qdrant_collections.json
{
  "collections": [
    { "name": "firm_knowledge", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "legislation", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "best_practices", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "glossary", "dense": {"size": 768}, "sparse": true, "payload_schema": { ... } }
  ]
}
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
...
# FILE: retrieval/retriever.py
# Hybrid retrieval (alpha,beta), rerank, filters, return citations + KG hints
...
# FILE: retrieval/fusion.py
# Join RAG chunks to KG rules/calculations/evidence; boost linked results
...
# FILE: prompts/rag_answer.txt
[Instruction: cite every claim; forbid PII; return calibrated_confidence; JSON contract]
# FILE: pipeline/etl.py
def ingest(...): ...
# FILE: prompts/kv_extract.txt
[Prompt with JSON contract + examples]
// FILE: reasoning/schedule_queries.cypher
// SA105: compute property income totals
MATCH ...
// FILE: tools/agent_tools.json
{ ... }
# FILE: infra/compose/docker-compose.local.yml
# Traefik (with Authentik ForwardAuth), Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prometheus/Grafana, Loki, Unleash, all services
# FILE: infra/compose/traefik.yml
# Static config: entryPoints, providers, certificates, access logs
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
providers:
  docker: {}
  file:
    filename: /etc/traefik/traefik-dynamic.yml
api:
  dashboard: true
log:
  level: INFO
accessLog: {}
# FILE: infra/compose/traefik-dynamic.yml
# Dynamic config: Authentik ForwardAuth middleware + routers per service
http:
  middlewares:
    authentik-forwardauth:
      forwardAuth:
        address: "http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik"
        trustForwardHeader: true
        authResponseHeaders:
          - X-Authenticated-User
          - X-Authenticated-Email
          - X-Authenticated-Groups
          - Authorization
    rate-limit:
      rateLimit:
        average: 50
        burst: 100

  routers:
    svc-extract:
      rule: "Host(`api.local`) && PathPrefix(`/extract`)"
      entryPoints: ["websecure"]
      service: svc-extract
      middlewares: ["authentik-forwardauth", "rate-limit"]
      tls: {}
  services:
    svc-extract:
      loadBalancer:
        servers:
          - url: "http://svc-extract:8000"
# FILE: infra/compose/env.example
DOMAIN=local
EMAIL=admin@local
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=miniopass
POSTGRES_PASSWORD=postgres
NEO4J_PASSWORD=neo4jpass
QDRANT__SERVICE__GRPC_PORT=6334
VAULT_DEV_ROOT_TOKEN_ID=root
AUTHENTIK_SECRET_KEY=changeme
RAG_EMBEDDING_MODEL=bge-m3
RAG_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
# FILE: .gitea/workflows/ci.yml
# Lint → Test → Build → Scan → Push → Deploy (compose up)
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
<!-- FILE: TESTPLAN.md -->

## Datasets, Metrics, Acceptance Criteria

- Extraction precision/recall per field
- Schedule-level absolute error
- Reconciliation pass-rate
- Explanation coverage
- RAG retrieval: top-k recall, nDCG, faithfulness, groundedness
- Security: Traefik+Authentik route auth tests, header spoofing prevention (internal network, trusted proxy)
- Red-team cases (OCR noise, conflicting docs, PII leak prevention)
  ...

STYLE & GUARANTEES

  • Be concise but complete; prefer schemas/code over prose.
  • No chain-of-thought. Provide final artifacts and brief rationales.
  • Every numeric output must include lineage to Evidence → Document (page/bbox/text_hash) and citations for narrative answers.
  • Parameterize by {{jurisdiction}} and {{tax_year}}.
  • Include calibrated_confidence and name calibration method.
  • Enforce SHACL on KG writes; reject/queue fixes on violation.
  • No PII in Qdrant. Use de-ID placeholders; keep mappings only in Secure Client Data Store.
  • Deterministic IDs; reproducible builds; version-pinned dependencies.
  • Trust boundary: only Traefik exposes ports; all services on a private network; services accept only requests with Traefik's network identity; never trust client-supplied auth headers.

START

Produce the deliverables now, in the exact order and file/block structure above, implementing the local-first stack (Python 3.12, Prefect, Vault, MinIO, Playwright, Qdrant, Authentik, Traefik, Docker-Compose, Gitea) with optional scale-out notes (Temporal, K8s) where specified.