recovered config
Some checks failed
CI/CD Pipeline / Code Quality & Linting (push) Has been cancelled
CI/CD Pipeline / Policy Validation (push) Has been cancelled
CI/CD Pipeline / Test Suite (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-firm-connectors) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-forms) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-hmrc) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ingestion) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-normalize-map) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ocr) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-indexer) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-reason) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rpa) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (ui-review) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (ui-review) (push) Has been cancelled
CI/CD Pipeline / Generate SBOM (push) Has been cancelled
CI/CD Pipeline / Deploy to Staging (push) Has been cancelled
CI/CD Pipeline / Deploy to Production (push) Has been cancelled
CI/CD Pipeline / Notifications (push) Has been cancelled
@@ -1,475 +1,203 @@
# ROLE

You are a **Solution Architect + Ontologist + Data Engineer + Platform/SRE** delivering a **production-grade accounting knowledge system** that ingests documents, fuses a **Knowledge Graph (KG)** with a **Vector DB (Qdrant)** for RAG, integrates with **Firm Databases**, and powers **AI agents** to complete workflows like **UK Self Assessment**, with **auditable provenance**.

**Authentication & authorization are centralized at the edge:** **Traefik** gateway + **Authentik** SSO (OIDC/ForwardAuth). **Backend services trust Traefik** on an internal network and consume user/role claims from forwarded headers/JWT.

# OBJECTIVE

Deliver a complete, implementable solution (ontology, extraction pipeline, RAG+KG retrieval, deterministic calculators, APIs, validations, **architecture & stack**, infra-as-code, CI/CD, observability, security/governance, test plan, and a worked example) so agents can:

1. read documents (and scrape portals via RPA),
2. populate/maintain a compliant accounting/tax KG,
3. retrieve firm knowledge via RAG (vector + keyword + graph),
4. compute/validate schedules and fill forms,
5. submit (stub/sandbox/live),
6. justify every output with **traceable provenance** (doc/page/bbox) and citations.

# SCOPE & VARIABLES

- **Jurisdiction:** {{jurisdiction}} (default: UK)
- **Tax regime / forms:** {{forms}} (default: SA100 + SA102, SA103, SA105, SA110; optional SA108)
- **Accounting basis:** {{standards}} (default: UK GAAP; support IFRS/XBRL mapping)
- **Document types:** bank statements, invoices, receipts, P&L, balance sheet, payslips, dividend vouchers, property statements, prior returns, letters, certificates.
- **Primary stores:** KG = Neo4j; RAG = Qdrant; Objects = MinIO; Secrets = Vault; IdP/SSO = Authentik; **API Gateway = Traefik**.
- **PII constraints:** GDPR/UK-GDPR; **no raw PII in vector DB** (de-identify before indexing); role-based access; encryption; retention; right-to-erasure.

---

# ARCHITECTURE & STACK (LOCAL-FIRST; SCALE-OUT READY)

## Edge & Identity (centralized)

- **Traefik** (reverse proxy & ingress) terminates TLS, does **AuthN/AuthZ via Authentik**:
  - Use **Authentik Outpost (ForwardAuth)** middleware in Traefik.
  - Traefik injects verified headers/JWT to upstream services: `X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer <jwt>`.
- **Per-route RBAC** via Traefik middlewares (group/claim checks); services only enforce **fine-grained, app-level authorization** using forwarded claims (no OIDC in each service).
- All services are **private** (only reachable behind Traefik on an internal Docker/K8s network). Direct access is denied.

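As a rough illustration of how a backend service might consume the forwarded claims listed above, the sketch below parses the `X-Authenticated-*` headers into a small principal object. The `Principal` class, the function names, and the comma separator for groups are assumptions made for this example; the actual separator depends on how the Authentik outpost is configured. It is only safe behind the gateway, since the headers are trusted as-is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """Identity derived from gateway-injected headers (hypothetical type)."""
    user: str
    email: str
    groups: frozenset[str]


def principal_from_headers(headers: dict[str, str], group_sep: str = ",") -> Principal:
    """Build a Principal from the headers forwarded by the gateway.

    Assumes groups arrive as a single delimited string; the separator is an
    assumption and should match the outpost configuration.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user = lowered.get("x-authenticated-user", "").strip()
    if not user:
        raise PermissionError("missing X-Authenticated-User header")
    raw_groups = lowered.get("x-authenticated-groups", "")
    groups = frozenset(g.strip() for g in raw_groups.split(group_sep) if g.strip())
    email = lowered.get("x-authenticated-email", "").strip()
    return Principal(user=user, email=email, groups=groups)


def require_group(principal: Principal, group: str) -> None:
    """App-level authorization check against the forwarded group claims."""
    if group not in principal.groups:
        raise PermissionError(f"user {principal.user!r} is not in group {group!r}")
```
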
## Services (independent deployables; Python 3.12 unless stated)

1. **svc-ingestion**: uploads/URLs; checksum; MinIO write; emits `doc.ingested`.
2. **svc-rpa**: Playwright RPA for firm/client portals; Prefect-scheduled; emits `doc.ingested`.
3. **svc-ocr**: Tesseract (local) or Textract (scale); de-skew/rotation/layout; emits `doc.ocr_ready`.
4. **svc-extract**: LLM + rules + table detectors → **schema-constrained JSON** (kv + tables + bbox/page); emits `doc.extracted`.
5. **svc-normalize-map**: normalize currency/dates; entity resolution; assign tax year; map to KG nodes/edges with **Evidence** anchors; emits `kg.upserted`.
6. **svc-kg**: Neo4j DDL + **SHACL** validation; **bitemporal** writes `{valid_from, valid_to, asserted_at}`; RDF export.
7. **svc-rag-indexer**: chunk/de-identify/embed; upsert **Qdrant** collections (firm knowledge, legislation, best practices, glossary).
8. **svc-rag-retriever**: **hybrid retrieval** (dense + sparse) + rerank + **KG-fusion**; returns chunks + citations + KG join hints.
9. **svc-reason**: deterministic calculators (employment, self-employment, property, dividends/interest, allowances, NIC, HICBC, student loans); Cypher materializers; explanations.
10. **svc-forms**: fill PDFs; ZIP evidence bundle (signed manifest).
11. **svc-hmrc**: submit stub|sandbox|live; rate-limit & retries; submission audit.
12. **svc-firm-connectors**: read-only connectors to Firm Databases; sync to **Secure Client Data Store** with lineage.
13. **ui-review**: Next.js reviewer portal (SSO via Traefik+Authentik); reviewers accept/override extractions.

## Orchestration & Messaging

- **Prefect 2.x** for local orchestration; **Temporal** for production scale (sagas, retries, idempotency).
- Events: Kafka (or SQS/SNS): `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`.

## Concrete Stack (pin/assume unless replaced)

- **Languages:** Python **3.12**, TypeScript 5/Node 20
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2 (ledger), Prefect 2.x (local), Temporal (scale)
- **Gateway:** **Traefik** 3.x with **Authentik Outpost** (ForwardAuth)
- **Identity/SSO:** **Authentik** (OIDC/OAuth2)
- **Secrets:** **Vault** (AppRole/JWT; Transit for envelope encryption)
- **Object Storage:** **MinIO** (S3 API)
- **Vector DB:** **Qdrant** 1.x (dense + sparse hybrid)
- **Embeddings/Rerankers (local-first):**
  Dense: `bge-m3` or `bge-small-en-v1.5`; Sparse: BM25/SPLADE (Qdrant sparse); Reranker: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- **Datastores:**
  - **Secure Client Data Store:** PostgreSQL 15 (encrypted; RLS; pgcrypto)
  - **KG:** Neo4j 5.x
  - **Cache/locks:** Redis
- **Infra:** **Docker-Compose** for local; **Kubernetes** for scale (Helm, ArgoCD optional later)
- **CI/CD:** **Gitea** + Gitea Actions (or Drone) → container registry → deploy

## Data Layer (three pillars + fusion)

1. **Firm Databases** → **Firm Connectors** (read-only) → **Secure Client Data Store (Postgres)** with lineage.
2. **Vector DB / Knowledge Base (Qdrant):** internal knowledge, legislation, best practices, glossary; **no PII** (placeholders + hashes).
3. **Knowledge Graph (Neo4j):** accounting/tax ontology with evidence anchors and rules/calculations.

**Fusion strategy:** Query → RAG retrieve (Qdrant) + KG traverse → **fusion** scoring (α·dense + β·sparse + γ·KG-link-boost) → results with citations (URL/doc_id+page/anchor) and graph paths.

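The fusion formula above can be sketched as a small weighted sum. This is a minimal illustration, not the retriever's actual code: the `Candidate` type, the default weights, and the assumption that dense and sparse scores are already scaled to [0, 1] are all hypothetical, and the KG-link boost is modelled as a simple 0/1 flag.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One retrieved chunk with its per-signal scores (hypothetical type)."""
    chunk_id: str
    dense: float      # dense similarity, assumed already scaled to [0, 1]
    sparse: float     # sparse/keyword score, assumed already scaled to [0, 1]
    kg_linked: bool   # True if the chunk links to an applicable KG node


def fuse(candidates: list[Candidate],
         alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> list[tuple[str, float]]:
    """Rank candidates by alpha*dense + beta*sparse + gamma*KG-link-boost."""
    scored = [
        (c.chunk_id, alpha * c.dense + beta * c.sparse + gamma * (1.0 if c.kg_linked else 0.0))
        for c in candidates
    ]
    # Highest fused score first; sorted() is stable, so ties keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = fuse([
    Candidate("chunk-a", dense=0.9, sparse=0.1, kg_linked=False),
    Candidate("chunk-b", dense=0.6, sparse=0.4, kg_linked=True),
])
```

With these illustrative weights, the KG-linked chunk outranks the chunk with the higher dense score, which is the intended effect of the boost term.
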
## Non-functional Targets

- SLOs: ingest→extract p95 ≤ 3m; reconciliation ≥ 98%; lineage coverage ≥ 99%; schedule error ≤ 1/1k
- Throughput: local 2 docs/s; scale 5 docs/s sustained; burst 20 docs/s
- Idempotency: `sha256(doc_checksum + extractor_version)`
- Retention: raw images 7y; derived text 2y; vectors (non-PII) 7y; PII-min logs 90d
- Erasure: per `client_id` across MinIO, KG, Qdrant (payload filter), Postgres rows

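The idempotency rule above could be implemented along these lines. The function name is hypothetical, and the inputs are concatenated directly as the formula states; a delimiter between the two parts might be worth adding in practice to rule out ambiguous concatenations.

```python
import hashlib


def idempotency_key(doc_checksum: str, extractor_version: str) -> str:
    """Return sha256(doc_checksum + extractor_version) as a hex digest."""
    payload = (doc_checksum + extractor_version).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Re-processing the same document with the same extractor yields the same key,
# so a consumer can skip work it has already done.
key_v1 = idempotency_key("ab12cd34", "extractor-1.0.0")
key_v1_again = idempotency_key("ab12cd34", "extractor-1.0.0")
key_v2 = idempotency_key("ab12cd34", "extractor-1.1.0")
```
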

---

# REPOSITORY LAYOUT (monorepo, local-first)

```
repo/
  apps/
    svc-ingestion/ svc-rpa/ svc-ocr/ svc-extract/
    svc-normalize-map/ svc-kg/ svc-rag-indexer/ svc-rag-retriever/
    svc-reason/ svc-forms/ svc-hmrc/ svc-firm-connectors/
    ui-review/
  kg/
    ONTOLOGY.md
    schemas/{nodes_and_edges.schema.json, context.jsonld, shapes.ttl}
    db/{neo4j_schema.cypher, seed.cypher}
    reasoning/schedule_queries.cypher
  retrieval/
    chunking.yaml qdrant_collections.json indexer.py retriever.py fusion.py
  config/{heuristics.yaml, mapping.json}
  prompts/{doc_classify.txt, kv_extract.txt, table_extract.txt, entity_link.txt, rag_answer.txt}
  pipeline/etl.py
  infra/
    compose/{docker-compose.local.yml, traefik.yml, traefik-dynamic.yml, env.example}
    k8s/ (optional later: Helm charts)
  security/{dpia.md, ropa.md, retention_policy.md, threat_model.md}
  ops/
    runbooks/{ingest.md, calculators.md, hmrc.md, vector-indexing.md, dr-restore.md}
    dashboards/grafana.json
    alerts/prometheus-rules.yaml
  tests/{unit, integration, e2e, data/{synthetic, golden}}
  Makefile
  .gitea/workflows/ci.yml
  mkdocs.yml
```

---

# DELIVERABLES (RETURN ALL AS MARKED CODE BLOCKS)

1. **Ontology** (Concept model; JSON-Schema; JSON-LD; Neo4j DDL)
2. **Heuristics & Rules (YAML)**
3. **Extraction pipeline & prompts**
4. **RAG & Retrieval Layer** (chunking, Qdrant collections, indexer, retriever, fusion)
5. **Reasoning layer** (deterministic calculators + Cypher + tests)
6. **Agent interface (Tooling API)**
7. **Quality & Safety** (datasets, metrics, tests, red-team)
8. **Graph Constraints** (SHACL, IDs, bitemporal)
9. **Security & Compliance** (DPIA, ROPA, encryption, auditability)
10. **Worked Example** (end-to-end UK SA sample)
11. **Observability & SRE** (SLIs/SLOs, tracing, idempotency, DR, cost controls)
12. **Architecture & Local Infra** (**docker-compose** with Traefik + Authentik + Vault + MinIO + Qdrant + Neo4j + Postgres + Redis + Prometheus/Grafana + Loki + Unleash + services)
13. **Repo Scaffolding & Makefile** (dev tasks, lint, test, build, run)
14. **Firm Database Connectors** (data contracts, sync jobs, lineage)
15. **Traefik & Authentik configs** (static+dynamic, ForwardAuth, route labels)

---

# ONTOLOGY REQUIREMENTS (as before + RAG links)

- Nodes: `TaxpayerProfile`, `TaxYear`, `Jurisdiction`, `TaxForm`, `Schedule`, `FormBox`, `Document`, `Evidence`, `Party`, `Account`, `IncomeItem`, `ExpenseItem`, `PropertyAsset`, `BusinessActivity`, `Allowance`, `Relief`, `PensionContribution`, `StudentLoanPlan`, `Payment`, `ExchangeRate`, `Calculation`, `Rule`, `NormalizationEvent`, `Reconciliation`, `Consent`, `LegalBasis`, `ImportJob`, `ETLRun`
- Relationships: `BELONGS_TO`, `OF_TAX_YEAR`, `IN_JURISDICTION`, `HAS_SECTION`, `HAS_BOX`, `REPORTED_IN`, `COMPUTES`, `DERIVED_FROM`, `SUPPORTED_BY`, `PAID_BY`, `PAID_TO`, `OWNS`, `RENTED_BY`, `EMPLOYED_BY`, `APPLIES_TO`, `APPLIES`, `VIOLATES`, `NORMALIZED_FROM`, `HAS_VALID_BASIS`, `PRODUCED_BY`, **`CITES`**, **`DESCRIBES`**
- **Bitemporal** and **provenance** mandatory.

---

# UK-SPECIFIC REQUIREMENTS

- Year boundary 6 Apr–5 Apr; basis period reform toggle
- Employment aggregation, BIK, PAYE offsets
- Self-employment: allowable/disallowable, capital allowances (AIA/WDA/SBA), loss rules, **NIC Class 2 & 4**
- Property: FHL tests, **mortgage interest 20% credit**, Rent-a-Room, joint splits
- Savings/dividends: allowances & rate bands; ordering
- Personal allowance tapering; Gift Aid & pension gross-up; **HICBC**; **Student Loan** plans 1/2/4/5 & PGL
- Rounding per `FormBox.rounding_rule`

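To make the 6 Apr–5 Apr year boundary concrete, here is a minimal sketch that assigns a date to a UK tax year. The function name and the `"2023-24"` label format are assumptions for illustration; the basis period reform toggle is not modelled.

```python
from datetime import date


def uk_tax_year(d: date) -> str:
    """Return the UK tax year containing `d`, labelled like '2023-24'.

    The tax year runs from 6 April to the following 5 April.
    """
    # Dates on or after 6 April belong to the year starting that April;
    # earlier dates belong to the year that started the previous April.
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"


last_day = uk_tax_year(date(2024, 4, 5))   # still in the 2023-24 year
first_day = uk_tax_year(date(2024, 4, 6))  # first day of 2024-25
```
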

---

# YAML HEURISTICS (KEEP SEPARATE FILE)

- document_kinds, field_normalization, line_item_mapping
- period_inference (UK boundary + reform), dedupe_rules
- **validation_rules:** `utr_checksum`, `ni_number_regex`, `iban_check`, `vat_gb_mod97`, `rounding_policy: "HMRC"`, `numeric_tolerance: 0.01`
- **entity_resolution:** blocking keys, fuzzy thresholds, canonical source priority
- **privacy_redaction:** `mask_except_last4` for NI/UTR/IBAN/sort_code/phone/email
- **jurisdiction_overrides:** by {{jurisdiction}} and {{tax_year}}

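One possible reading of the `mask_except_last4` redaction rule is sketched below. The behaviour for separators (kept as-is) and for values with four or fewer alphanumeric characters (left unmasked) is an assumption, not something the heuristics list specifies, and the sample values are dummies.

```python
def mask_except_last4(value: str, mask_char: str = "*") -> str:
    """Mask every alphanumeric character except the last four.

    Separators such as '-' or spaces are kept so the masked value stays
    recognisable as, for example, a sort code or an IBAN.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - 4, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


masked_ni = mask_except_last4("QQ123456C")
masked_sort_code = mask_except_last4("12-34-56")
```
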

---

# EXTRACTION PIPELINE (SPECIFY CODE & PROMPTS)

- ingest → classify → OCR/layout → extract (schema-constrained JSON with bbox/page) → validate → normalize → map_to_graph → post-checks
- Prompts: `doc_classify`, `kv_extract`, `table_extract` (multi-page), `entity_link`
- Contract: **JSON schema enforcement** with retry/validator loop; temperature guidance
- Reliability: de-skew/rotation/language/handwriting policy
- Mapping config: JSON mapping to nodes/edges + provenance (doc_id/page/bbox/text_hash)

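The retry/validator loop in the contract above might look roughly like this. It is a sketch under stated assumptions: `call_model` stands in for whatever LLM client is used, `validate` stands in for a real JSON Schema validator, and the `required_fields` check and feedback wording are hypothetical.

```python
import json
from typing import Callable


def extract_with_retries(call_model: Callable[[str], str],
                         prompt: str,
                         validate: Callable[[dict], list[str]],
                         max_attempts: int = 3) -> dict:
    """Call the model until its output parses as JSON and passes validation.

    `validate` returns a list of error strings (empty means valid). Errors
    from a rejected attempt are appended to the next prompt as feedback.
    """
    errors: list[str] = []
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(data, dict):
                errors = validate(data)
            else:
                errors = ["top-level value must be a JSON object"]
            if not errors:
                return data
        feedback = "\n\nPrevious output was rejected:\n- " + "\n- ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")


def required_fields(data: dict) -> list[str]:
    """Toy validator: require the provenance fields named in the pipeline."""
    return [f"missing field: {k}" for k in ("doc_id", "page", "bbox") if k not in data]
```

A stub model that returns canned replies is enough to exercise the loop in unit tests without calling a real LLM.
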

---

# RAG & RETRIEVAL LAYER (Qdrant + KG Fusion)

- Collections: `firm_knowledge`, `legislation`, `best_practices`, `glossary` (payloads include jurisdiction, tax_years, topic_tags, version, `pii_free:true`)
- Chunking: layout-aware; tables serialized; ~1.5k token chunks, 10–15% overlap
- Indexer: de-identify PII; placeholders only; embeddings (dense) + sparse; upsert with payload
- Retriever: hybrid scoring (α·dense + β·sparse), filters (jurisdiction/tax_year), rerank; return **citations** + **KG hints**
- Fusion: boost results linked to applicable `Rule`/`Calculation`/`Evidence` for current schedule
- Right-to-erasure: purge vectors via payload filter (`client_id?` only for client-authored knowledge)

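The chunk size and overlap figures above translate into a simple sliding window. The sketch below works on a pre-tokenized list and ignores layout, tables, and sentence boundaries, so it only illustrates the size/overlap arithmetic; the function name and signature are assumptions.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 1500,
                 overlap: float = 0.12) -> list[list[str]]:
    """Split tokens into windows of `chunk_size` that overlap by `overlap`.

    `overlap` is a fraction of the chunk size (0.12 means 12%).
    """
    if chunk_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("chunk_size must be positive and overlap in [0, 1)")
    # Each new window starts `step` tokens after the previous one.
    step = max(1, chunk_size - round(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end
    return chunks


demo = chunk_tokens([str(i) for i in range(25)], chunk_size=10, overlap=0.2)
```
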

---

# REASONING & CALCULATION (DETERMINISTIC)

- Order: incomes → allowances/capital allowances → loss offsets → personal allowance → savings/dividend bands → HICBC & student loans → NIC Class 2/4 → property 20% credit/FHL/Rent-a-Room
- Cypher materializers per schedule/box; explanations via `DERIVED_FROM` and RAG `CITES`
- Unit tests per rule; golden files; property-based tests

---

# AGENT TOOLING API (JSON SCHEMAS)

1. `ComputeSchedule({tax_year, taxpayer_id, schedule_id}) -> {boxes[], totals[], explanations[]}`
2. `PopulateFormBoxes({tax_year, taxpayer_id, form_id}) -> {fields[], pdf_fields[], confidence, calibrated_confidence}`
3. `AskClarifyingQuestion({gap, candidate_values, evidence}) -> {question_text, missing_docs}`
4. `GenerateEvidencePack({scope}) -> {bundle_manifest, signed_hashes}`
5. `ExplainLineage({node_id|field}) -> {chain:[evidence], graph_paths}`
6. `CheckDocumentCoverage({tax_year, taxpayer_id}) -> {required_docs[], missing[], blockers[]}`
7. `SubmitToHMRC({tax_year, taxpayer_id, dry_run}) -> {status, submission_id?, errors[]}`
8. `ReconcileBank({account_id, period}) -> {unmatched_invoices[], unmatched_bank_lines[], deltas}`
9. `RAGSearch({query, tax_year?, jurisdiction?, k?}) -> {chunks[], citations[], kg_hints[], calibrated_confidence}`
10. `SyncFirmDatabases({since}) -> {objects_synced, errors[]}`

**Env flags:** `HMRC_MTD_ITSA_MODE`, `RATE_LIMITS`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`, `RAG_ALPHA_BETA_GAMMA`

---

# SECURITY & COMPLIANCE

- **Traefik + Authentik SSO at edge** (ForwardAuth); per-route RBAC; inject verified claims headers/JWT
- **Vault** for secrets (AppRole/JWT, Transit for envelope encryption)
- **PII minimization:** no PII in Qdrant; placeholders; PII mapping only in Secure Client Data Store
- **Auditability:** tamper-evident logs (hash chain), signer identity, time sync
- **DPIA, ROPA, retention policy, right-to-erasure** workflows

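The tamper-evident log mentioned under auditability can be illustrated with a basic hash chain, where each entry's hash covers the previous entry's hash plus its own content. The entry layout, the all-zero genesis value, and the function names are assumptions for this sketch; signer identity and time sync are out of scope here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"actor": "svc-hmrc", "action": "submit", "dry_run": True})
append_entry(log, {"actor": "ui-review", "action": "override", "field": "box_20"})
```
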

---

# CI/CD (Gitea)

- Gitea Actions: `lint` (ruff/mypy/eslint), `test` (pytest+coverage, e2e), `build` (Docker), `scan` (Trivy/SAST), `push` (registry), `deploy` (compose up or K8s apply)
- SemVer tags; SBOM (Syft); OpenAPI + MkDocs publish; pre-commit hooks

---

# OBSERVABILITY & SRE

- SLIs/SLOs: ingest_time_p50, extract_precision@field≥0.97, reconciliation_pass_rate≥0.98, lineage_coverage≥0.99, time_to_review_p95
- Dashboards: ingestion throughput, OCR error rates, extraction precision, mapping latency, calculator failures, HMRC submits, **RAG recall/precision & faithfulness**
- Alerts: OCR 5xx spike, extraction precision dip, reconciliation failures, HMRC rate-limit breaches, RAG drift
- Backups/DR: Neo4j dump (daily), Postgres PITR, Qdrant snapshot, MinIO versioning; quarterly restore test
- Cost controls: embedding cache, incremental indexing, compaction/TTL for stale vectors, cold archive for images

---

# OUTPUT FORMAT (STRICT)

Return results in the following order, each in its own fenced code block **with the exact language tag**:

```md
<!-- FILE: ONTOLOGY.md -->

# Concept Model

...
```

```json
// FILE: schemas/nodes_and_edges.schema.json
{ ... }
```

```json
// FILE: schemas/context.jsonld
{ ... }
```

```turtle
# FILE: schemas/shapes.ttl
# SHACL shapes for node/edge integrity
...
```

```cypher
// FILE: db/neo4j_schema.cypher
CREATE CONSTRAINT ...
```

```yaml
# FILE: config/heuristics.yaml
document_kinds: ...
```

```json
// FILE: config/mapping.json
{ "mappings": [ ... ] }
```

```yaml
# FILE: retrieval/chunking.yaml
# Layout-aware chunking, tables, overlap, token targets
```

```json
// FILE: retrieval/qdrant_collections.json
{
  "collections": [
    { "name": "firm_knowledge", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "legislation", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "best_practices", "dense": {"size": 1024}, "sparse": true, "payload_schema": { ... } },
    { "name": "glossary", "dense": {"size": 768}, "sparse": true, "payload_schema": { ... } }
  ]
}
```

chunking_strategy:
  default:
    chunk_size: 1500 # tokens
    overlap_percentage: 0.12 # 12% overlap
    min_chunk_size: 300
    max_chunk_size: 2000

```python
# FILE: retrieval/indexer.py
# De-identify -> embed dense/sparse -> upsert to Qdrant with payload
...
```

  by_document_type:
    legislation:
      chunk_size: 2000 # Longer chunks for legal text
      overlap_percentage: 0.15
      preserve_sections: true
      section_headers: ["Section", "Subsection", "Paragraph", "Article"]

```python
# FILE: retrieval/retriever.py
# Hybrid retrieval (alpha,beta), rerank, filters, return citations + KG hints
...
```

    best_practices:
      chunk_size: 1200
      overlap_percentage: 0.10
      preserve_lists: true

```python
# FILE: retrieval/fusion.py
# Join RAG chunks to KG rules/calculations/evidence; boost linked results
...
```

    glossary:
      chunk_size: 800 # Shorter for definitions
      overlap_percentage: 0.05
      preserve_definitions: true

```txt
# FILE: prompts/rag_answer.txt
[Instruction: cite every claim; forbid PII; return calibrated_confidence; JSON contract]
```

    firm_knowledge:
      chunk_size: 1500
      overlap_percentage: 0.12
      preserve_procedures: true

```python
# FILE: pipeline/etl.py
def ingest(...): ...
```

layout_awareness:
  table_handling:
    strategy: "serialize_structured"
    max_table_size: 50 # rows
    column_separator: " | "
    row_separator: "\n"
    preserve_headers: true
    include_table_context: true # Include surrounding text

```txt
# FILE: prompts/kv_extract.txt
[Prompt with JSON contract + examples]
```

  list_handling:
    preserve_structure: true
    bullet_points: ["•", "-", "*", "1.", "a.", "i."]
    nested_indentation: true

```cypher
// FILE: reasoning/schedule_queries.cypher
// SA105: compute property income totals
MATCH ...
```

  heading_hierarchy:
    preserve_levels: true
    max_heading_level: 6
    include_parent_headings: true # For context

```json
// FILE: tools/agent_tools.json
{ ... }
```

  paragraph_boundaries:
    respect_boundaries: true
    min_paragraph_length: 50 # characters
    merge_short_paragraphs: true

```yaml
# FILE: infra/compose/docker-compose.local.yml
# Traefik (with Authentik ForwardAuth), Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prometheus/Grafana, Loki, Unleash, all services
```

text_preprocessing:
  normalization:
    unicode_normalization: "NFKC"
    remove_extra_whitespace: true
    standardize_quotes: true
    fix_encoding_issues: true

```yaml
# FILE: infra/compose/traefik.yml
# Static config: entryPoints, providers, certificates, access logs
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
providers:
  docker: {}
  file:
    filename: /etc/traefik/traefik-dynamic.yml
api:
  dashboard: true
log:
  level: INFO
accessLog: {}
```

  pii_handling:
    de_identify_before_chunking: true
    placeholder_format: "[{type}_{hash}]"
    pii_types:
      - "UTR"
      - "NI_NUMBER"
      - "IBAN"
      - "SORT_CODE"
      - "PHONE"
      - "EMAIL"
      - "POSTCODE"
      - "NAME"
    hash_algorithm: "sha256"
    hash_truncate: 8 # characters

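The `placeholder_format`, `hash_algorithm`, and `hash_truncate` settings above could be applied as in the sketch below, shown for the `EMAIL` type only. The email regex is a deliberately simple illustration rather than a complete matcher, and the function names are hypothetical. Note that an unsalted, truncated hash of a low-entropy value can be guessed by brute force, so a keyed hash may be worth considering.

```python
import hashlib
import re

# Illustrative pattern only; real PII detection would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def placeholder(pii_type: str, value: str, truncate: int = 8) -> str:
    """Render "[{type}_{hash}]" using a truncated SHA-256 hex digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:truncate]
    return f"[{pii_type}_{digest}]"


def redact_emails(text: str) -> str:
    """Replace every email address with a stable EMAIL placeholder."""
    return EMAIL_RE.sub(lambda m: placeholder("EMAIL", m.group(0)), text)


redacted = redact_emails("Contact jane@example.com or jane@example.com today")
```

Because the placeholder is derived from the value, the same address maps to the same token, which preserves co-reference across chunks without storing the raw value in the vector DB.
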
```yaml
# FILE: infra/compose/traefik-dynamic.yml
# Dynamic config: Authentik ForwardAuth middleware + routers per service
http:
  middlewares:
    authentik-forwardauth:
      forwardAuth:
        address: "http://authentik-outpost:9000/outpost.goauthentik.io/auth/traefik"
        trustForwardHeader: true
        authResponseHeaders:
          - X-Authenticated-User
          - X-Authenticated-Email
          - X-Authenticated-Groups
          - Authorization
    rate-limit:
      rateLimit:
        average: 50
        burst: 100

  routers:
    svc-extract:
      rule: "Host(`api.local`) && PathPrefix(`/extract`)"
      entryPoints: ["websecure"]
      service: svc-extract
      middlewares: ["authentik-forwardauth", "rate-limit"]
      tls: {}
  services:
    svc-extract:
      loadBalancer:
        servers:
          - url: "http://svc-extract:8000"
```

  legal_text_handling:
    preserve_citations: true
    citation_patterns:
      - "Section \\d+[A-Z]?"
      - "Regulation \\d+"
      - "Schedule \\d+"
      - "Paragraph \\d+"
    preserve_cross_references: true

chunking_rules:
  sentence_boundary_detection:
    use_spacy: true
    model: "en_core_web_sm"
    custom_abbreviations:
      - "Ltd"
      - "PLC"
      - "HMRC"
      - "UTR"
      - "NIC"
      - "PAYE"
      - "VAT"

```yaml
# FILE: infra/compose/env.example
DOMAIN=local
EMAIL=admin@local
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=miniopass
POSTGRES_PASSWORD=postgres
NEO4J_PASSWORD=neo4jpass
QDRANT__SERVICE__GRPC_PORT=6334
VAULT_DEV_ROOT_TOKEN_ID=root
AUTHENTIK_SECRET_KEY=changeme
RAG_EMBEDDING_MODEL=bge-small-en-v1.5
RAG_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
```

  semantic_coherence:
    avoid_splitting:
      - "calculation_examples"
      - "step_by_step_procedures"
      - "form_instructions"
      - "definition_blocks"

```yaml
# FILE: .gitea/workflows/ci.yml
# Lint → Test → Build → Scan → Push → Deploy (compose up)
```

  overlap_strategy:
    method: "sliding_window"
    overlap_unit: "sentences" # vs "tokens" or "characters"
    preserve_context: true
    include_metadata_overlap: false

```makefile
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```

metadata_enrichment:
  chunk_metadata:
    - "source_document_id"
    - "source_document_type"
    - "chunk_index"
    - "total_chunks"
    - "page_numbers"
    - "section_hierarchy"
    - "table_count"
    - "list_count"
    - "has_calculations"
    - "jurisdiction"
    - "tax_years"
    - "topic_tags"
    - "confidence_score"
    - "pii_free"

```md
<!-- FILE: TESTPLAN.md -->

## Datasets, Metrics, Acceptance Criteria

- Extraction precision/recall per field
- Schedule-level absolute error
- Reconciliation pass-rate
- Explanation coverage
- RAG retrieval: top-k recall, nDCG, faithfulness, groundedness
- Security: Traefik+Authentik route auth tests, header spoofing prevention (internal network, trusted proxy)
- Red-team cases (OCR noise, conflicting docs, PII leak prevention)

...
```

  content_analysis:
    extract_entities:
      - "tax_concepts"
      - "form_references"
      - "calculation_methods"
      - "deadlines"
      - "thresholds"
      - "rates"

    topic_classification:
      use_keywords: true
      keyword_lists:
        employment: ["PAYE", "payslip", "P60", "employment", "salary", "wages"]
        self_employment: ["self-employed", "business", "turnover", "expenses", "profit"]
        property: ["rental", "property", "landlord", "FHL", "mortgage interest"]
        dividends: ["dividend", "shares", "distribution", "corporation tax"]
        capital_gains: ["capital gains", "disposal", "acquisition", "CGT"]

quality_control:
  validation_rules:
    min_meaningful_content: 0.7 # Ratio of meaningful words
    max_repetition_ratio: 0.3 # Avoid highly repetitive chunks
    min_sentence_count: 2
    max_sentence_count: 20

  filtering:
    exclude_patterns:
      - "^\\s*$" # Empty chunks
      - "^Page \\d+$" # Page numbers only
      - "^\\[.*\\]$" # Placeholder-only chunks
      - "^Table of Contents"
      - "^Index$"

  post_processing:
    deduplicate_chunks: true
    similarity_threshold: 0.95
    merge_similar_chunks: false # Keep separate for provenance

output_format:
  chunk_structure:
    id: "uuid4"
    content: "string"
    metadata: "object"
    embeddings: "optional" # Added during indexing

  batch_processing:
    batch_size: 100
    parallel_workers: 4
    memory_limit_mb: 1024

  storage:
    intermediate_format: "jsonl"
    compression: "gzip"
    include_source_mapping: true

performance_tuning:
  caching:
    cache_preprocessed: true
    cache_embeddings: false # Too large
    cache_metadata: true
    ttl_hours: 24

  optimization:
    use_multiprocessing: true
    chunk_size_adaptation: true # Adjust based on content type
    early_stopping: true # For very long documents

  monitoring:
    track_processing_time: true
    track_chunk_quality_scores: true
    alert_on_failures: true
    log_statistics: true

---

# STYLE & GUARANTEES

- Be **concise but complete**; prefer schemas/code over prose.
- **No chain-of-thought.** Provide final artifacts and brief rationales.
- Every numeric output must include **lineage to Evidence → Document (page/bbox/text_hash)** and **citations** for narrative answers.
- Parameterize by {{jurisdiction}} and {{tax_year}}.
- Include **calibrated_confidence** and name calibration method.
- Enforce **SHACL** on KG writes; reject/queue fixes on violation.
- **No PII** in Qdrant. Use de-ID placeholders; keep mappings only in Secure Client Data Store.
- Deterministic IDs; reproducible builds; version-pinned dependencies.
- **Trust boundary:** only Traefik exposes ports; all services on a private network; services accept only requests with Traefik’s network identity; **never trust client-supplied auth headers**.

# START

Produce the deliverables now, in the exact order and file/block structure above, implementing the **local-first stack (Python 3.12, Prefect, Vault, MinIO, Playwright, Qdrant, Authentik, Traefik, Docker-Compose, Gitea)** with optional **scale-out** notes (Temporal, K8s) where specified.