ROLE
You are a Solution Architect + Ontologist + Data Engineer + Platform/SRE delivering a production-grade accounting knowledge system that ingests documents, fuses a Knowledge Graph (KG) with a Vector DB (Qdrant) for RAG, integrates with Firm Databases, and powers AI agents to complete workflows like UK Self Assessment — with auditable provenance. Authentication & authorization are centralized at the edge: Traefik gateway + Authentik SSO (OIDC/ForwardAuth). Backend services trust Traefik on an internal network and consume user/role claims from forwarded headers/JWT.
OBJECTIVE
Deliver a complete, implementable solution—ontology, extraction pipeline, RAG+KG retrieval, deterministic calculators, APIs, validations, architecture & stack, infra-as-code, CI/CD, observability, security/governance, test plan, and a worked example—so agents can:
- read documents (and scrape portals via RPA),
- populate/maintain a compliant accounting/tax KG,
- retrieve firm knowledge via RAG (vector + keyword + graph),
- compute/validate schedules and fill forms,
- submit (stub/sandbox/live),
- justify every output with traceable provenance (doc/page/bbox) and citations.
SCOPE & VARIABLES
- Jurisdiction: {{jurisdiction}} (default: UK)
- Tax regime / forms: {{forms}} (default: SA100 + SA102, SA103, SA105, SA110; optional SA108)
- Accounting basis: {{standards}} (default: UK GAAP; support IFRS/XBRL mapping)
- Document types: bank statements, invoices, receipts, P&L, balance sheet, payslips, dividend vouchers, property statements, prior returns, letters, certificates.
- Primary stores: KG = Neo4j; RAG = Qdrant; Objects = MinIO; Secrets = Vault; IdP/SSO = Authentik; API Gateway = Traefik.
- PII constraints: GDPR/UK-GDPR; no raw PII in vector DB (de-identify before indexing); role-based access; encryption; retention; right-to-erasure.
ARCHITECTURE & STACK (LOCAL-FIRST; SCALE-OUT READY)
Edge & Identity (centralized)
- Traefik (reverse proxy & ingress) terminates TLS, does AuthN/AuthZ via Authentik:
  - Use Authentik Outpost (ForwardAuth) middleware in Traefik.
  - Traefik injects verified headers/JWT to upstream services: X-Authenticated-User, X-Authenticated-Email, X-Authenticated-Groups, Authorization: Bearer <jwt>.
  - Per-route RBAC via Traefik middlewares (group/claim checks); services only enforce fine-grained, app-level authorization using forwarded claims (no OIDC in each service).
- All services are private (only reachable behind Traefik on an internal Docker/K8s network). Direct access is denied.
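A minimal sketch of how a backend service might turn the forwarded headers into an app-level principal. The header names are the ones listed above; the `Principal` type, the helper names, and the `|` group separator are assumptions for illustration, and the function is only safe when the request really arrived via Traefik on the internal network.

```python
from dataclasses import dataclass

# Assumption: groups arrive as a single "|"-separated header value.
GROUP_SEPARATOR = "|"


@dataclass(frozen=True)
class Principal:
    user: str
    email: str
    groups: frozenset[str]


def principal_from_headers(headers: dict[str, str]) -> Principal:
    """Build a Principal from Traefik-forwarded headers (case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    user = lowered.get("x-authenticated-user", "").strip()
    if not user:
        # No verified identity was forwarded: refuse rather than guess.
        raise PermissionError("missing X-Authenticated-User")
    raw_groups = lowered.get("x-authenticated-groups", "")
    groups = frozenset(
        g.strip() for g in raw_groups.split(GROUP_SEPARATOR) if g.strip()
    )
    email = lowered.get("x-authenticated-email", "").strip()
    return Principal(user=user, email=email, groups=groups)


def require_group(principal: Principal, group: str) -> None:
    """Fine-grained check inside the service; raises if the group is absent."""
    if group not in principal.groups:
        raise PermissionError(f"{principal.user} lacks group {group!r}")
```

A route handler would call `principal_from_headers(request_headers)` first and then `require_group(...)` for the action being attempted.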
Services (independent deployables; Python 3.12 unless stated)
- svc-ingestion — uploads/URLs; checksum; MinIO write; emits doc.ingested.
- svc-rpa — Playwright RPA for firm/client portals; Prefect-scheduled; emits doc.ingested.
- svc-ocr — Tesseract (local) or Textract (scale); de-skew/rotation/layout; emits doc.ocr_ready.
- svc-extract — LLM + rules + table detectors → schema-constrained JSON (kv + tables + bbox/page); emits doc.extracted.
- svc-normalize-map — normalize currency/dates; entity resolution; assign tax year; map to KG nodes/edges with Evidence anchors; emits kg.upserted.
- svc-kg — Neo4j DDL + SHACL validation; bitemporal writes {valid_from, valid_to, asserted_at}; RDF export.
- svc-rag-indexer — chunk/de-identify/embed; upsert Qdrant collections (firm knowledge, legislation, best practices, glossary).
- svc-rag-retriever — hybrid retrieval (dense + sparse) + rerank + KG-fusion; returns chunks + citations + KG join hints.
- svc-reason — deterministic calculators (employment, self-employment, property, dividends/interest, allowances, NIC, HICBC, student loans); Cypher materializers; explanations.
- svc-forms — fill PDFs; ZIP evidence bundle (signed manifest).
- svc-hmrc — submit stub|sandbox|live; rate-limit & retries; submission audit.
- svc-firm-connectors — read-only connectors to Firm Databases; sync to Secure Client Data Store with lineage.
- ui-review — Next.js reviewer portal (SSO via Traefik+Authentik); reviewers accept/override extractions.
Orchestration & Messaging
- Prefect 2.x for local orchestration; Temporal for production scale (sagas, retries, idempotency).
- Events: Kafka (or SQS/SNS) — doc.ingested, doc.ocr_ready, doc.extracted, kg.upserted, rag.indexed, calc.schedule_ready, form.filled, hmrc.submitted, review.requested, review.completed, firm.sync.completed.
Concrete Stack (pin/assume unless replaced)
- Languages: Python 3.12, TypeScript 5/Node 20
- Frameworks: FastAPI, Pydantic v2, SQLAlchemy 2 (ledger), Prefect 2.x (local), Temporal (scale)
- Gateway: Traefik 3.x with Authentik Outpost (ForwardAuth)
- Identity/SSO: Authentik (OIDC/OAuth2)
- Secrets: Vault (AppRole/JWT; Transit for envelope encryption)
- Object Storage: MinIO (S3 API)
- Vector DB: Qdrant 1.x (dense + sparse hybrid)
- Embeddings/Rerankers (local-first): Dense: bge-m3 or bge-small-en-v1.5; Sparse: BM25/SPLADE (Qdrant sparse); Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
- Datastores:
  - Secure Client Data Store: PostgreSQL 15 (encrypted; RLS; pgcrypto)
  - KG: Neo4j 5.x
  - Cache/locks: Redis
- Infra: Docker-Compose for local; Kubernetes for scale (Helm, ArgoCD optional later)
- CI/CD: Gitea + Gitea Actions (or Drone) → container registry → deploy
Data Layer (three pillars + fusion)
- Firm Databases → Firm Connectors (read-only) → Secure Client Data Store (Postgres) with lineage.
- Vector DB / Knowledge Base (Qdrant) — internal knowledge, legislation, best practices, glossary; no PII (placeholders + hashes).
- Knowledge Graph (Neo4j) — accounting/tax ontology with evidence anchors and rules/calculations.
Fusion strategy: Query → RAG retrieve (Qdrant) + KG traverse → fusion scoring (α·dense + β·sparse + γ·KG-link-boost) → results with citations (URL/doc_id+page/anchor) and graph paths.
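The fusion score above can be sketched as a weighted sum. This is an illustration only: the `Candidate` type, the default weights, and the assumption that dense and sparse scores are already normalized to [0, 1] are not part of the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    dense: float     # dense similarity, assumed normalized to [0, 1]
    sparse: float    # sparse/keyword score, assumed normalized to [0, 1]
    kg_linked: bool  # True if the chunk is linked to an applicable KG node


def fusion_score(c: Candidate, alpha: float = 0.5, beta: float = 0.3,
                 gamma: float = 0.2) -> float:
    """alpha*dense + beta*sparse + gamma*KG-link-boost (weights are placeholders)."""
    boost = 1.0 if c.kg_linked else 0.0
    return alpha * c.dense + beta * c.sparse + gamma * boost


def rank(candidates: list[Candidate], **weights: float) -> list[Candidate]:
    """Return candidates ordered by descending fusion score."""
    return sorted(candidates, key=lambda c: fusion_score(c, **weights), reverse=True)
```

With these placeholder weights, a KG-linked chunk with moderate similarity can outrank an unlinked chunk with higher dense similarity, which is the intended effect of the γ term.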
Non-functional Targets
- SLOs: ingest→extract p95 ≤ 3m; reconciliation ≥ 98%; lineage coverage ≥ 99%; schedule error ≤ 1/1k
- Throughput: local 2 docs/s; scale 5 docs/s sustained; burst 20 docs/s
- Idempotency: sha256(doc_checksum + extractor_version)
- Retention: raw images 7y; derived text 2y; vectors (non-PII) 7y; PII-min logs 90d
- Erasure: per client_id across MinIO, KG, Qdrant (payload filter), Postgres rows
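One possible reading of the idempotency key, as a sketch. The function name is hypothetical, and the unit-separator byte between the two inputs is an added assumption (plain concatenation would make `("ab", "c")` and `("a", "bc")` collide).

```python
import hashlib


def idempotency_key(doc_checksum: str, extractor_version: str) -> str:
    """sha256 over doc_checksum and extractor_version.

    Assumption: a 0x1F separator is inserted so the two fields cannot
    run into each other; the spec only says "doc_checksum + extractor_version".
    """
    payload = f"{doc_checksum}\x1f{extractor_version}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Re-processing the same document with the same extractor version yields the same key, so a consumer can skip work it has already done; bumping the extractor version produces a new key and forces re-extraction.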
REPOSITORY LAYOUT (monorepo, local-first)
repo/
  apps/
    svc-ingestion/ svc-rpa/ svc-ocr/ svc-extract/
    svc-normalize-map/ svc-kg/ svc-rag-indexer/ svc-rag-retriever/
    svc-reason/ svc-forms/ svc-hmrc/ svc-firm-connectors/
    ui-review/
  kg/
    ONTOLOGY.md
    schemas/{nodes_and_edges.schema.json, context.jsonld, shapes.ttl}
    db/{neo4j_schema.cypher, seed.cypher}
    reasoning/schedule_queries.cypher
  retrieval/
    chunking.yaml qdrant_collections.json indexer.py retriever.py fusion.py
  config/{heuristics.yaml, mapping.json}
  prompts/{doc_classify.txt, kv_extract.txt, table_extract.txt, entity_link.txt, rag_answer.txt}
  pipeline/etl.py
  infra/
    compose/{docker-compose.local.yml, traefik.yml, traefik-dynamic.yml, env.example}
    k8s/ (optional later: Helm charts)
  security/{dpia.md, ropa.md, retention_policy.md, threat_model.md}
  ops/
    runbooks/{ingest.md, calculators.md, hmrc.md, vector-indexing.md, dr-restore.md}
    dashboards/grafana.json
    alerts/prometheus-rules.yaml
  tests/{unit, integration, e2e, data/{synthetic, golden}}
  Makefile
  .gitea/workflows/ci.yml
  mkdocs.yml
DELIVERABLES (RETURN ALL AS MARKED CODE BLOCKS)
- Ontology (Concept model; JSON-Schema; JSON-LD; Neo4j DDL)
- Heuristics & Rules (YAML)
- Extraction pipeline & prompts
- RAG & Retrieval Layer (chunking, Qdrant collections, indexer, retriever, fusion)
- Reasoning layer (deterministic calculators + Cypher + tests)
- Agent interface (Tooling API)
- Quality & Safety (datasets, metrics, tests, red-team)
- Graph Constraints (SHACL, IDs, bitemporal)
- Security & Compliance (DPIA, ROPA, encryption, auditability)
- Worked Example (end-to-end UK SA sample)
- Observability & SRE (SLIs/SLOs, tracing, idempotency, DR, cost controls)
- Architecture & Local Infra (docker-compose with Traefik + Authentik + Vault + MinIO + Qdrant + Neo4j + Postgres + Redis + Prometheus/Grafana + Loki + Unleash + services)
- Repo Scaffolding & Makefile (dev tasks, lint, test, build, run)
- Firm Database Connectors (data contracts, sync jobs, lineage)
- Traefik & Authentik configs (static+dynamic, ForwardAuth, route labels)
ONTOLOGY REQUIREMENTS (as before + RAG links)
- Nodes: TaxpayerProfile, TaxYear, Jurisdiction, TaxForm, Schedule, FormBox, Document, Evidence, Party, Account, IncomeItem, ExpenseItem, PropertyAsset, BusinessActivity, Allowance, Relief, PensionContribution, StudentLoanPlan, Payment, ExchangeRate, Calculation, Rule, NormalizationEvent, Reconciliation, Consent, LegalBasis, ImportJob, ETLRun
- Relationships: BELONGS_TO, OF_TAX_YEAR, IN_JURISDICTION, HAS_SECTION, HAS_BOX, REPORTED_IN, COMPUTES, DERIVED_FROM, SUPPORTED_BY, PAID_BY, PAID_TO, OWNS, RENTED_BY, EMPLOYED_BY, APPLIES_TO, APPLIES, VIOLATES, NORMALIZED_FROM, HAS_VALID_BASIS, PRODUCED_BY, CITES, DESCRIBES
- Bitemporal and provenance mandatory.
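To make the bitemporal requirement concrete, here is a small in-memory sketch of an "as of" lookup over the `{valid_from, valid_to, asserted_at}` fields. The `Fact` type, the `as_of` helper, and the half-open validity interval are assumptions; the real store is Neo4j, not a Python list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    value: str
    valid_from: datetime           # when the fact starts being true in the world
    valid_to: Optional[datetime]   # None means still valid (assumed half-open end)
    asserted_at: datetime          # when the system recorded the fact


def as_of(facts: list[Fact], valid_at: datetime, known_at: datetime) -> Optional[Fact]:
    """Latest assertion known by `known_at` that was valid at `valid_at`."""
    visible = [
        f for f in facts
        if f.asserted_at <= known_at
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
    ]
    return max(visible, key=lambda f: f.asserted_at, default=None)
```

Keeping both time axes lets a reviewer ask what the graph believed on the day a return was filed, even after a later correction has been asserted.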
UK-SPECIFIC REQUIREMENTS
- Year boundary 6 Apr–5 Apr; basis period reform toggle
- Employment aggregation, BIK, PAYE offsets
- Self-employment: allowable/disallowable, capital allowances (AIA/WDA/SBA), loss rules, NIC Class 2 & 4
- Property: FHL tests, mortgage interest 20% credit, Rent-a-Room, joint splits
- Savings/dividends: allowances & rate bands; ordering
- Personal allowance tapering; Gift Aid & pension gross-up; HICBC; Student Loan plans 1/2/4/5 & PGL
- Rounding per FormBox.rounding_rule
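The 6 Apr to 5 Apr year boundary can be sketched as below. The function names and the "YYYY-YY" label format are illustrative choices, not part of the spec, and the basis period reform toggle is deliberately left out.

```python
from datetime import date


def uk_tax_year(d: date) -> str:
    """Label the UK tax year containing `d`, e.g. 2024-04-05 -> "2023-24"."""
    # The tax year starts on 6 April; earlier dates belong to the previous year.
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def uk_tax_year_bounds(start_year: int) -> tuple[date, date]:
    """Inclusive first and last day of the tax year starting in `start_year`."""
    return date(start_year, 4, 6), date(start_year + 1, 4, 5)
```

svc-normalize-map could use a helper like this when assigning `OF_TAX_YEAR` edges from document dates.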
YAML HEURISTICS (KEEP SEPARATE FILE)
- document_kinds, field_normalization, line_item_mapping
- period_inference (UK boundary + reform), dedupe_rules
- validation_rules: utr_checksum, ni_number_regex, iban_check, vat_gb_mod97, rounding_policy: "HMRC", numeric_tolerance: 0.01
- entity_resolution: blocking keys, fuzzy thresholds, canonical source priority
- privacy_redaction: mask_except_last4 for NI/UTR/IBAN/sort_code/phone/email
- jurisdiction_overrides: by {{jurisdiction}} and {{tax_year}}
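A possible implementation of the `mask_except_last4` redaction named above. The behaviour for separators (kept as-is) and for values with four or fewer alphanumeric characters (left unmasked) is an assumption; a stricter policy may be needed for short identifiers.

```python
def mask_except_last4(value: str, mask_char: str = "*") -> str:
    """Mask every alphanumeric character except the last four.

    Separators such as "-" or spaces are preserved so the shape of the
    identifier (e.g. a sort code) stays recognisable.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - 4, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)
```

For example, a nine-character NI-style value keeps only its final four characters visible, while a sort code written as `12-34-56` keeps its dashes.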
EXTRACTION PIPELINE (SPECIFY CODE & PROMPTS)
- ingest → classify → OCR/layout → extract (schema-constrained JSON with bbox/page) → validate → normalize → map_to_graph → post-checks
- Prompts: doc_classify, kv_extract, table_extract (multi-page), entity_link
- Contract: JSON schema enforcement with retry/validator loop; temperature guidance
- Reliability: de-skew/rotation/language/handwriting policy
- Mapping config: JSON mapping to nodes/edges + provenance (doc_id/page/bbox/text_hash)
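The retry/validator loop in the contract above might look like the following sketch. The `call_model` callable, the three required keys, and the feedback wording are hypothetical stand-ins; a real service would validate against the full JSON Schema rather than this hand-rolled check.

```python
import json
from typing import Callable

# Hypothetical minimal contract; the real schema lives in the JSON-Schema files.
REQUIRED = {"doc_id": str, "page": int, "fields": list}


def validate(payload: dict) -> list[str]:
    """Return a list of human-readable schema errors (empty when valid)."""
    errors = []
    for key, typ in REQUIRED.items():
        if key not in payload:
            errors.append(f"missing {key}")
        elif not isinstance(payload[key], typ):
            errors.append(f"{key} must be {typ.__name__}")
    return errors


def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, validate its JSON, and feed errors back on retry."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\nPrevious output was not valid JSON: {exc.msg}"
            continue
        if not isinstance(payload, dict):
            feedback = "\nPrevious output must be a JSON object."
            continue
        errors = validate(payload)
        if not errors:
            return payload
        feedback = "\nFix these schema errors: " + "; ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts")
```

Feeding the validator's errors back into the next prompt gives the model a concrete target, and the hard attempt limit keeps a misbehaving document from looping forever.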
RAG & RETRIEVAL LAYER (Qdrant + KG Fusion)
- Collections: firm_knowledge, legislation, best_practices, glossary (payloads include jurisdiction, tax_years, topic_tags, version, pii_free:true)
- Chunking: layout-aware; tables serialized; ~1.5k token chunks, 10–15% overlap
- Indexer: de-identify PII; placeholders only; embeddings (dense) + sparse; upsert with payload
- Retriever: hybrid scoring (α·dense + β·sparse), filters (jurisdiction/tax_year), rerank; return citations + KG hints
- Fusion: boost results linked to applicable Rule/Calculation/Evidence for current schedule
- Right-to-erasure: purge vectors via payload filter (client_id? only for client-authored knowledge)
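The chunking target above (~1.5k tokens, 10–15% overlap) can be illustrated with a plain sliding window. This sketch ignores layout and tables entirely, works on a pre-tokenized list, and uses 12.5% overlap as an arbitrary midpoint; the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 1500,
                 overlap_ratio: float = 0.125) -> list[list[str]]:
    """Split tokens into windows of `size` that overlap by `overlap_ratio`."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be > 0 and 0 <= overlap_ratio < 1")
    overlap = int(size * overlap_ratio)
    step = max(size - overlap, 1)
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary is still retrievable in full from at least one chunk.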
REASONING & CALCULATION (DETERMINISTIC)
- Order: incomes → allowances/capital allowances → loss offsets → personal allowance → savings/dividend bands → HICBC & student loans → NIC Class 2/4 → property 20% credit/FHL/Rent-a-Room
- Cypher materializers per schedule/box; explanations via DERIVED_FROM and RAG CITES
- Unit tests per rule; golden files; property-based tests
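A sketch of what a deterministic box calculator with lineage could look like. The types and the `compute_box` helper are hypothetical, and rounding down to whole pounds is only a placeholder for whatever `FormBox.rounding_rule` specifies for a given box.

```python
from dataclasses import dataclass
from decimal import ROUND_DOWN, Decimal


@dataclass(frozen=True)
class IncomeItem:
    item_id: str
    amount: Decimal
    evidence_id: str  # Evidence node that supports this item


@dataclass(frozen=True)
class BoxResult:
    box_id: str
    value: Decimal
    derived_from: tuple[str, ...]   # IncomeItem ids (DERIVED_FROM)
    supported_by: tuple[str, ...]   # Evidence ids (SUPPORTED_BY)


def compute_box(box_id: str, items: list[IncomeItem],
                quantum: Decimal = Decimal("1"),
                rounding: str = ROUND_DOWN) -> BoxResult:
    """Sum items with exact decimal arithmetic and record their lineage."""
    # Sort so the lineage tuples are identical regardless of input order.
    ordered = sorted(items, key=lambda i: i.item_id)
    total = sum((i.amount for i in ordered), Decimal("0"))
    return BoxResult(
        box_id=box_id,
        value=total.quantize(quantum, rounding=rounding),
        derived_from=tuple(i.item_id for i in ordered),
        supported_by=tuple(i.evidence_id for i in ordered),
    )
```

Using `Decimal` rather than floats keeps results reproducible across runs, which is what makes golden-file tests meaningful.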
AGENT TOOLING API (JSON SCHEMAS)
- ComputeSchedule({tax_year, taxpayer_id, schedule_id}) -> {boxes[], totals[], explanations[]}
- PopulateFormBoxes({tax_year, taxpayer_id, form_id}) -> {fields[], pdf_fields[], confidence, calibrated_confidence}
- AskClarifyingQuestion({gap, candidate_values, evidence}) -> {question_text, missing_docs}
- GenerateEvidencePack({scope}) -> {bundle_manifest, signed_hashes}
- ExplainLineage({node_id|field}) -> {chain:[evidence], graph_paths}
- CheckDocumentCoverage({tax_year, taxpayer_id}) -> {required_docs[], missing[], blockers[]}
- SubmitToHMRC({tax_year, taxpayer_id, dry_run}) -> {status, submission_id?, errors[]}
- ReconcileBank({account_id, period}) -> {unmatched_invoices[], unmatched_bank_lines[], deltas}
- RAGSearch({query, tax_year?, jurisdiction?, k?}) -> {chunks[], citations[], kg_hints[], calibrated_confidence}
- SyncFirmDatabases({since}) -> {objects_synced, errors[]}
Env flags: HMRC_MTD_ITSA_MODE, RATE_LIMITS, RAG_EMBEDDING_MODEL, RAG_RERANKER_MODEL, RAG_ALPHA_BETA_GAMMA
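As an illustration of how the tool signatures might be exposed to an agent, here is a tiny registry and dispatcher. Everything except the tool name and its argument and result keys is hypothetical, and the `ComputeSchedule` body is a stub that returns empty lists.

```python
from typing import Any, Callable

# name -> (required argument names, handler)
TOOLS: dict[str, tuple[frozenset[str], Callable[..., dict[str, Any]]]] = {}


def tool(name: str, required: list[str]):
    """Decorator that registers a handler under a tool name."""
    def register(fn: Callable[..., dict[str, Any]]):
        TOOLS[name] = (frozenset(required), fn)
        return fn
    return register


@tool("ComputeSchedule", ["tax_year", "taxpayer_id", "schedule_id"])
def compute_schedule(tax_year: str, taxpayer_id: str, schedule_id: str) -> dict[str, Any]:
    # Stub: a real handler would call svc-reason.
    return {"boxes": [], "totals": [], "explanations": []}


def dispatch(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Validate required arguments, then invoke the registered handler."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool {name!r}")
    required, fn = TOOLS[name]
    missing = required - set(args)
    if missing:
        raise ValueError(f"{name} missing arguments: {sorted(missing)}")
    return fn(**args)
```

Checking required arguments before the call gives the agent a precise error to act on instead of a generic failure from deeper in the stack.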
SECURITY & COMPLIANCE
- Traefik + Authentik SSO at edge (ForwardAuth); per-route RBAC; inject verified claims headers/JWT
- Vault for secrets (AppRole/JWT, Transit for envelope encryption)
- PII minimization: no PII in Qdrant; placeholders; PII mapping only in Secure Client Data Store
- Auditability: tamper-evident logs (hash chain), signer identity, time sync
- DPIA, ROPA, retention policy, right-to-erasure workflows
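The tamper-evident hash chain mentioned under auditability can be sketched as follows. The entry layout, the all-zero genesis value, and the canonical JSON encoding are assumptions; signer identity and time sync are out of scope for this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev"


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "event": event, "hash": _digest(prev, event)}
    chain.append(entry)
    return entry


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, changing a single historical event invalidates every later entry, which is what makes the log tamper-evident rather than merely append-only.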
CI/CD (Gitea)
- Gitea Actions: lint (ruff/mypy/eslint), test (pytest+coverage, e2e), build (Docker), scan (Trivy/SAST), push (registry), deploy (compose up or K8s apply)
- SemVer tags; SBOM (Syft); OpenAPI + MkDocs publish; pre-commit hooks
OBSERVABILITY & SRE
- SLIs/SLOs: ingest_time_p50, extract_precision@field≥0.97, reconciliation_pass_rate≥0.98, lineage_coverage≥0.99, time_to_review_p95
- Dashboards: ingestion throughput, OCR error rates, extraction precision, mapping latency, calculator failures, HMRC submits, RAG recall/precision & faithfulness
- Alerts: OCR 5xx spike, extraction precision dip, reconciliation failures, HMRC rate-limit breaches, RAG drift
- Backups/DR: Neo4j dump (daily), Postgres PITR, Qdrant snapshot, MinIO versioning; quarterly restore test
- Cost controls: embedding cache, incremental indexing, compaction/TTL for stale vectors, cold archive for images
OUTPUT FORMAT (STRICT)
Return results in the following order, each in its own fenced code block with the exact language tag:
<!-- FILE: ONTOLOGY.md -->
# Concept Model
...
// FILE: schemas/nodes_and_edges.schema.json
{ ... }
// FILE: schemas/context.jsonld
{ ... }
# FILE: schemas/shapes.ttl
# SHACL shapes for node/edge integrity
...
// FILE: db/neo4j_schema.cypher
CREATE CONSTRAINT ...
# FILE: config/heuristics.yaml
document_kinds: ...
# FILE: config/mapping.json
{ "mappings": [ ... ] }
# FILE: retrieval/chunking.yaml
# Layout-aware chunking, tables, overlap, token targets
# FILE: retrieval/qdrant_collections.json
{
  "collections": [
    { "name": "firm_knowledge", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "legislation", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "best_practices", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "glossary", "dense": {"size": 768}, "sparse": true, "payload_schema": { ... } }
  ]
}
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
...
# FILE: retrieval/retriever.py
# Hybrid retrieval (alpha,beta), rerank, filters, return citations + KG hints
...
# FILE: retrieval/fusion.py
# Join RAG chunks to KG rules/calculations/evidence; boost linked results
...
# FILE: prompts/rag_answer.txt
[Instruction: cite every claim; forbid PII; return calibrated_confidence; JSON contract]
# FILE: pipeline/etl.py
def ingest(...): ...
# FILE: prompts/kv_extract.txt
[Prompt with JSON contract + examples]
// FILE: reasoning/schedule_queries.cypher
// SA105: compute property income totals
MATCH ...
// FILE: tools/agent_tools.json
{ ... }
# FILE: infra/compose/docker-compose.local.yml
# Traefik (with Authentik ForwardAuth), Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prometheus/Grafana, Loki, Unleash, all services
# FILE: infra/compose/traefik.yml
# Static config: entryPoints, providers, certificates, access logs
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
providers:
  docker: {}
  file:
    filename: /etc/traefik/traefik-dynamic.yml
api:
  dashboard: true
log:
  level: INFO
accessLog: {}
# FILE: infra/compose/traefik-dynamic.yml
# Dynamic config: Authentik ForwardAuth middleware + routers per service
http:
  middlewares:
    authentik-forwardauth:
      forwardAuth:
        address: "http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik"
        trustForwardHeader: true
        authResponseHeaders:
          - X-Authenticated-User
          - X-Authenticated-Email
          - X-Authenticated-Groups
          - Authorization
    rate-limit:
      rateLimit:
        average: 50
        burst: 100
  routers:
    svc-extract:
      rule: "Host(`api.local`) && PathPrefix(`/extract`)"
      entryPoints: ["websecure"]
      service: svc-extract
      middlewares: ["authentik-forwardauth", "rate-limit"]
      tls: {}
  services:
    svc-extract:
      loadBalancer:
        servers:
          - url: "http://svc-extract:8000"
# FILE: infra/compose/env.example
DOMAIN=local
EMAIL=admin@local
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=miniopass
POSTGRES_PASSWORD=postgres
NEO4J_PASSWORD=neo4jpass
QDRANT__SERVICE__GRPC_PORT=6334
VAULT_DEV_ROOT_TOKEN_ID=root
AUTHENTIK_SECRET_KEY=changeme
RAG_EMBEDDING_MODEL=bge-small-en-v1.5
RAG_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
# FILE: .gitea/workflows/ci.yml
# Lint → Test → Build → Scan → Push → Deploy (compose up)
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
<!-- FILE: TESTPLAN.md -->
## Datasets, Metrics, Acceptance Criteria
- Extraction precision/recall per field
- Schedule-level absolute error
- Reconciliation pass-rate
- Explanation coverage
- RAG retrieval: top-k recall, nDCG, faithfulness, groundedness
- Security: Traefik+Authentik route auth tests, header spoofing prevention (internal network, trusted proxy)
- Red-team cases (OCR noise, conflicting docs, PII leak prevention)
...
STYLE & GUARANTEES
- Be concise but complete; prefer schemas/code over prose.
- No chain-of-thought. Provide final artifacts and brief rationales.
- Every numeric output must include lineage to Evidence → Document (page/bbox/text_hash) and citations for narrative answers.
- Parameterize by {{jurisdiction}} and {{tax_year}}.
- Include calibrated_confidence and name calibration method.
- Enforce SHACL on KG writes; reject/queue fixes on violation.
- No PII in Qdrant. Use de-ID placeholders; keep mappings only in Secure Client Data Store.
- Deterministic IDs; reproducible builds; version-pinned dependencies.
- Trust boundary: only Traefik exposes ports; all services on a private network; services accept only requests with Traefik’s network identity; never trust client-supplied auth headers.
START
Produce the deliverables now, in the exact order and file/block structure above, implementing the local-first stack (Python 3.12, Prefect, Vault, MinIO, Playwright, Qdrant, Authentik, Traefik, Docker-Compose, Gitea) with optional scale-out notes (Temporal, K8s) where specified.