deployment, linting and infra configuration
Some checks failed: every CI/CD Pipeline job for this push was cancelled (Code Quality & Linting; Policy Validation; Test Suite; Build Docker Images for all 14 services, svc-coverage through ui-review; Security Scanning for svc-coverage, svc-extract, svc-kg, svc-rag-retriever, ui-review; Generate SBOM; Deploy to Staging; Deploy to Production; Notifications).

This commit is contained in:
harkon
2025-10-14 07:42:31 +01:00
parent f0f7674b8d
commit eea46ac89c
41 changed files with 1017 additions and 1448 deletions


@@ -1,555 +0,0 @@
# ROLE
You are a **Senior Platform Engineer + Backend Lead** generating **production code** and **ops assets** for a microservice suite that powers an accounting Knowledge Graph + Vector RAG platform. Authentication/authorization are centralized at the **edge via Traefik + Authentik** (ForwardAuth). **Services are trust-bound** to Traefik and consume user/role claims via forwarded headers/JWT.
# MISSION
Produce fully working code for **all application services** (FastAPI + Python 3.12) with:
- Solid domain models, Pydantic v2 schemas, type hints, strict mypy, ruff lint.
- OpenTelemetry tracing, Prometheus metrics, structured logging.
- Vault-backed secrets, MinIO S3 client, Qdrant client, Neo4j driver, Postgres (SQLAlchemy), Redis.
- Eventing (Kafka or SQS/SNS behind an interface).
- Deterministic data contracts, end-to-end tests, Dockerfiles, Compose, CI for Gitea.
- Traefik labels + Authentik Outpost integration for every exposed route.
- Zero PII in vectors (Qdrant), evidence-based lineage in KG, and bitemporal writes.
# GLOBAL CONSTRAINTS (APPLY TO ALL SERVICES)
- **Language & Runtime:** Python **3.12**.
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2, httpx, aiokafka or boto3 (pluggable), redis-py, opentelemetry-instrumentation-fastapi, prometheus-fastapi-instrumentator.
- **Config:** `pydantic-settings` with `.env` overlay. Provide `Settings` class per service.
- **Secrets:** HashiCorp **Vault** (AppRole/JWT). Use Vault Transit to **envelope-encrypt** sensitive fields before persistence (helpers provided in `lib/security.py`).
- **Auth:** No OIDC in services. Add `TrustedProxyMiddleware`:
- Reject if request not from internal network (configurable CIDR).
- Require headers set by Traefik+Authentik (`X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer <jwt>`).
- Parse `X-Authenticated-Groups` into a `roles` list on `request.state`.
- **Observability:**
- OpenTelemetry (traceparent propagation), span attrs (service, route, user, tenant).
- Prometheus metrics endpoint `/metrics` protected by internal network check.
- Structured JSON logs (timestamp, level, svc, trace_id, msg) via `structlog`.
- **Errors:** Global exception handler RFC7807 Problem+JSON (`type`, `title`, `status`, `detail`, `instance`, `trace_id`).
- **Testing:** `pytest`, `pytest-asyncio`, `hypothesis` (property tests for calculators), coverage ≥ 90% per service.
- **Static:** `ruff`, `mypy --strict`, `bandit`, `safety`, `licensecheck`.
- **Perf:** Each service exposes `/healthz`, `/readyz`, `/livez`; cold start < 500ms; p95 endpoint < 250ms (local).
- **Containers:** Distroless or slim images; non-root user; read-only FS; `/tmp` mounted for OCR where needed.
- **Docs:** OpenAPI JSON + ReDoc; MkDocs site with service READMEs.
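The `TrustedProxyMiddleware` bullets above pin down a contract that can be sketched as a framework-free check (the default CIDR and the use of `PermissionError` are assumptions; the real middleware would attach the roles to `request.state`):

```python
import ipaddress

REQUIRED_HEADERS = (
    "X-Authenticated-User",
    "X-Authenticated-Email",
    "X-Authenticated-Groups",
    "Authorization",
)

def validate_trusted_request(
    client_ip: str,
    headers: dict[str, str],
    internal_cidr: str = "10.0.0.0/8",  # assumption: configurable per deployment
) -> list[str]:
    """Return the parsed roles list, or raise PermissionError."""
    # Reject requests that did not come through the internal network (Traefik)
    if ipaddress.ip_address(client_ip) not in ipaddress.ip_network(internal_cidr):
        raise PermissionError("request not from internal network")
    missing = [h for h in REQUIRED_HEADERS if not headers.get(h)]
    if missing:
        raise PermissionError(f"missing trust headers: {missing}")
    # Groups arrive comma-separated from Traefik+Authentik
    return [g.strip() for g in headers["X-Authenticated-Groups"].split(",") if g.strip()]
```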
# SHARED LIBS (GENERATE ONCE, REUSE)
Create `libs/` used by all services:
- `libs/config.py` base `Settings`, env parsing, Vault client factory, MinIO client factory, Qdrant client factory, Neo4j driver factory, Redis factory, Kafka/SQS client factory.
- `libs/security.py` Vault Transit helpers (`encrypt_field`, `decrypt_field`), header parsing, internal-CIDR validator.
- `libs/observability.py` otel init, prometheus instrumentor, logging config.
- `libs/events.py` abstract `EventBus` with `publish(topic, payload: dict)`, `subscribe(topic, handler)`. Two impls: Kafka (`aiokafka`) and SQS/SNS (`boto3`).
- `libs/schemas.py` **canonical Pydantic models** shared across services (Document, Evidence, IncomeItem, etc.) mirroring the ontology schemas. Include JSONSchema exports.
- `libs/storage.py` S3/MinIO helpers (bucket ensure, put/get, presigned).
- `libs/neo.py` Neo4j session helpers, Cypher runner with retry, SHACL validator invoker (pySHACL on exported RDF).
- `libs/rag.py` Qdrant collections CRUD, hybrid search (dense+sparse), rerank wrapper, de-identification utilities (regex + NER; hash placeholders).
- `libs/forms.py` PDF AcroForm fill via `pdfrw` with overlay fallback via `reportlab`.
- `libs/calibration.py` `calibrated_confidence(raw_score, method="temperature_scaling", params=...)`.
# EVENT TOPICS (STANDARDIZE)
- `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`
Each payload MUST include: `event_id (ulid)`, `occurred_at (iso)`, `actor`, `tenant_id`, `trace_id`, `schema_version`, and a `data` object (service-specific).
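The required envelope can be sketched with stdlib dataclasses (a real implementation would use the shared Pydantic models and a ULID library; `uuid4` below is only a stand-in for the ULID):

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class EventEnvelope:
    """Canonical event envelope shared by all topics."""
    actor: str
    tenant_id: str
    trace_id: str
    data: dict[str, Any]  # service-specific payload
    schema_version: str = "1.0"
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # stand-in for ULID
    occurred_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def make_event(actor: str, tenant_id: str, trace_id: str, data: dict[str, Any]) -> dict:
    return asdict(EventEnvelope(actor=actor, tenant_id=tenant_id, trace_id=trace_id, data=data))
```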
# TRUST HEADERS FROM TRAEFIK + AUTHENTIK (USE EXACT KEYS)
- `X-Authenticated-User` (string)
- `X-Authenticated-Email` (string)
- `X-Authenticated-Groups` (comma-separated)
- `Authorization` (`Bearer <jwt>` from Authentik)
Reject any request missing these (except `/healthz|/readyz|/livez|/metrics` from internal CIDR).
---
## SERVICES TO IMPLEMENT (CODE FOR EACH)
### 1) `svc-ingestion`
**Purpose:** Accept uploads or URLs, checksum, store to MinIO, emit `doc.ingested`.
**Endpoints:**
- `POST /v1/ingest/upload` (multipart file, metadata: `tenant_id`, `kind`, `source`) → returns `{doc_id, s3_url, checksum}`
- `POST /v1/ingest/url` (json: `{url, kind, tenant_id}`) downloads to MinIO
- `GET /v1/docs/{doc_id}` metadata
**Logic:**
- Compute SHA256, dedupe by checksum; MinIO path `tenants/{tenant_id}/raw/{doc_id}.pdf`.
- Store metadata in Postgres table `ingest_documents` (alembic migrations).
- Publish `doc.ingested` with `{doc_id, bucket, key, pages?, mime}`.
**Env:** `S3_BUCKET_RAW`, `MINIO_*`, `DB_URL`.
**Traefik labels:** route `/ingest/*`.
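The checksum, dedupe, and path logic above is simple enough to sketch directly (the `known_checksums` set stands in for the Postgres `ingest_documents` lookup):

```python
import hashlib

def sha256_checksum(payload: bytes) -> str:
    """Checksum in the `sha256:<hex>` shape used in the doc.ingested event."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()

def raw_object_key(tenant_id: str, doc_id: str) -> str:
    # Mirrors the MinIO layout tenants/{tenant_id}/raw/{doc_id}.pdf
    return f"tenants/{tenant_id}/raw/{doc_id}.pdf"

def is_duplicate(checksum: str, known_checksums: set[str]) -> bool:
    # Dedupe: an upload whose checksum is already recorded is skipped
    return checksum in known_checksums
```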
---
### 2) `svc-rpa`
**Purpose:** Scheduled RPA pulls from firm/client portals via Playwright.
**Tasks:**
- Playwright login flows (credentials from Vault), 2FA via Authentik OAuth device or OTP secret in Vault.
- Download statements/invoices; hand off to `svc-ingestion` via internal POST.
- Prefect flows: `pull_portal_X()`, `pull_portal_Y()` with schedules.
**Endpoints:**
- `POST /v1/rpa/run/{connector}` (manual trigger)
- `GET /v1/rpa/status/{run_id}`
**Env:** `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID`.
---
### 3) `svc-ocr`
**Purpose:** OCR & layout extraction.
**Pipeline:**
- Pull object from MinIO, detect rotation/de-skew (`opencv-python`), split pages (`pymupdf`), OCR (`pytesseract`) or bypass if text layer present (`pdfplumber`).
- Output per-page text + **bbox** for lines/words.
- Write JSON to MinIO `tenants/{tenant_id}/ocr/{doc_id}.json` and emit `doc.ocr_ready`.
**Endpoints:**
- `POST /v1/ocr/{doc_id}` (idempotent trigger)
- `GET /v1/ocr/{doc_id}` (fetch OCR JSON)
**Env:** `TESSERACT_LANGS`, `S3_BUCKET_EVIDENCE`.
---
### 4) `svc-extract`
**Purpose:** Classify docs and extract KV + tables into **schema-constrained JSON** (with bbox/page).
**Endpoints:**
- `POST /v1/extract/{doc_id}` body: `{strategy: "llm|rules|hybrid"}`
- `GET /v1/extract/{doc_id}` structured JSON
**Implementation:**
- Use prompt files in `prompts/`: `doc_classify.txt`, `kv_extract.txt`, `table_extract.txt`.
- **Validator loop**: run the LLM, validate the output against JSONSchema, and retry with the validation error messages, up to N times.
- Return Pydantic models from `libs/schemas.py`.
- Emit `doc.extracted`.
**Env:** `LLM_ENGINE`, `TEMPERATURE`, `MAX_TOKENS`.
---
### 5) `svc-normalize-map`
**Purpose:** Normalize & map extracted data to KG.
**Logic:**
- Currency normalization (ECB or static fx table), dates, UK tax year/basis period inference.
- Entity resolution (blocking + fuzzy).
- Generate nodes/edges (+ `Evidence` with doc_id/page/bbox/text_hash).
- Use `libs/neo.py` to write with **bitemporal** fields; run **SHACL** validator; on violation, queue `review.requested`.
- Emit `kg.upserted`.
**Endpoints:**
- `POST /v1/map/{doc_id}`
- `GET /v1/map/{doc_id}/preview` (diff view, to be used by UI)
**Env:** `NEO4J_*`.
---
### 6) `svc-kg`
**Purpose:** Graph façade + RDF/SHACL utility.
**Endpoints:**
- `GET /v1/kg/nodes/{label}/{id}`
- `POST /v1/kg/cypher` (admin-gated inline query; must check `admin` role)
- `POST /v1/kg/export/rdf` (returns RDF for SHACL)
- `POST /v1/kg/validate` (run pySHACL against `schemas/shapes.ttl`)
- `GET /v1/kg/lineage/{node_id}` (traverse `DERIVED_FROM` Evidence)
**Env:** `NEO4J_*`.
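The `/v1/kg/validate` decision can be isolated behind a callable so the review-event logic is testable; with pySHACL the callable would wrap `pyshacl.validate(data_graph, shacl_graph=shapes)`:

```python
from typing import Callable

def shacl_gate(run_validation: Callable[[], tuple[bool, str]]) -> dict:
    """Run a SHACL validation and decide whether a review event is needed.

    `run_validation` returns (conforms, report_text); on violation the
    caller publishes `review.requested` with the report attached.
    """
    conforms, report = run_validation()
    return {
        "conforms": conforms,
        "report": report,
        "emit_event": None if conforms else "review.requested",
    }
```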
---
### 7) `svc-rag-indexer`
**Purpose:** Build Qdrant indices (firm knowledge, legislation, best practices, glossary).
**Workflow:**
- Load sources (filesystem, URLs, Firm DMS via `svc-firm-connectors`).
- **De-identify PII** (regex + NER), replace with placeholders; store mapping only in Postgres.
- Chunk (layout-aware) per `retrieval/chunking.yaml`.
- Compute **dense** embeddings (e.g., `bge-small-en-v1.5`) and **sparse** (Qdrant sparse).
- Upsert to Qdrant with payload `{jurisdiction, tax_years[], topic_tags[], version, pii_free: true, doc_id/section_id/url}`.
- Emit `rag.indexed`.
**Endpoints:**
- `POST /v1/index/run`
- `GET /v1/index/status/{run_id}`
**Env:** `QDRANT_URL`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`.
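A regex-only sketch of the de-identification pass (the NER stage is omitted, the NI-number pattern is a simplification, and only the placeholder-to-hash mapping would be persisted, in Postgres, never in Qdrant payloads):

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NINO_RE = re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b")  # assumption: simplified UK NI shape

def deidentify(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with hash placeholders; return (clean_text, mapping).

    The mapping stores only a hash of the original value, never the raw value.
    """
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        value = match.group(0)
        digest = hashlib.sha256(value.encode()).hexdigest()[:12]
        placeholder = f"[PII:{digest}]"
        mapping[placeholder] = digest
        return placeholder

    clean = EMAIL_RE.sub(repl, text)
    clean = NINO_RE.sub(repl, clean)
    return clean, mapping
```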
---
### 8) `svc-rag-retriever`
**Purpose:** Hybrid search + KG fusion with rerank and calibrated confidence.
**Endpoint:**
- `POST /v1/rag/search` `{query, tax_year?, jurisdiction?, k?}`
```
{
"chunks": [...],
"citations": [{doc_id|url, section_id?, page?, bbox?}],
"kg_hints": [{rule_id, formula_id, node_ids[]}],
"calibrated_confidence": 0.0-1.0
}
```
**Implementation:**
- Hybrid score: `alpha * dense + beta * sparse`; rerank top-K via cross-encoder; **KG fusion** (boost chunks citing Rules/Calculations relevant to schedule).
- Use `libs/calibration.py` to expose calibrated confidence.
---
### 9) `svc-reason`
**Purpose:** Deterministic calculators + materializers (UK SA).
**Endpoints:**
- `POST /v1/reason/compute_schedule` `{tax_year, taxpayer_id, schedule_id}`
- `GET /v1/reason/explain/{schedule_id}` rationale & lineage paths
**Implementation:**
- Pure functions for: employment, self-employment, property (FHL, 20% interest credit), dividends/interest, allowances, NIC (Class 2/4), HICBC, student loans (Plans 1/2/4/5, PGL).
- **Deterministic order** as defined; rounding per `FormBox.rounding_rule`.
- Use Cypher from `kg/reasoning/schedule_queries.cypher` to materialize box values; attach `DERIVED_FROM` evidence.
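As an example of one pure calculator in this set, the personal-allowance taper (£1 of allowance lost per £2 of adjusted net income over the threshold; the monetary figures are 2024-25 assumptions, and rounding per `FormBox.rounding_rule` is applied at materialization, not here):

```python
def tapered_personal_allowance(
    adjusted_net_income: float,
    full_allowance: float = 12_570.0,     # assumption: 2024-25 figure
    taper_threshold: float = 100_000.0,   # assumption: 2024-25 figure
) -> float:
    """Allowance is reduced £1 for every £2 of income over the threshold, floored at 0."""
    if adjusted_net_income <= taper_threshold:
        return full_allowance
    reduction = (adjusted_net_income - taper_threshold) / 2
    return max(0.0, full_allowance - reduction)
```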
---
### 10) `svc-forms`
**Purpose:** Fill PDFs and assemble evidence bundles.
**Endpoints:**
- `POST /v1/forms/fill` `{tax_year, taxpayer_id, form_id}` returns PDF (binary)
- `POST /v1/forms/evidence_pack` `{scope}` ZIP + manifest + signed hashes (sha256)
**Implementation:**
- `pdfrw` for AcroForm; overlay with ReportLab if needed.
- Manifest includes `doc_id/page/bbox/text_hash` for every numeric field.
---
### 11) `svc-hmrc`
**Purpose:** HMRC submitter (stub|sandbox|live).
**Endpoints:**
- `POST /v1/hmrc/submit` `{tax_year, taxpayer_id, dry_run}` → returns `{status, submission_id?, errors[]}`
- `GET /v1/hmrc/submissions/{id}`
**Implementation:**
- Rate limits, retries/backoff, signed audit log; environment toggle.
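The retries/backoff behavior might be factored like this (delay and retry values are assumptions; real code would narrow the caught exceptions to retryable HMRC errors and add jitter):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Retry with exponential backoff; delays are base, 2*base, 4*base, ..."""
    last: Exception | None = None
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:  # real code would catch only retryable errors
            last = exc
            if attempt < retries:
                sleep(base_delay * (2 ** attempt))
    raise last  # type: ignore[misc]
```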
---
### 12) `svc-firm-connectors`
**Purpose:** Read-only connectors to Firm Databases (Practice Mgmt, DMS).
**Endpoints:**
- `POST /v1/firm/sync` `{since?}` → returns `{objects_synced, errors[]}`
- `GET /v1/firm/objects` (paged)
**Implementation:**
- Data contracts in `config/firm_contracts/`; mappers write into the Secure Client Data Store (Postgres) with lineage columns (`source`, `source_id`, `synced_at`).
---
### 13) `ui-review` (outline only)
- Next.js (SSO handled by Traefik+Authentik), shows extracted fields + evidence snippets; POST overrides to `svc-extract`/`svc-normalize-map`.
---
## DATA CONTRACTS (ESSENTIAL EXAMPLES)
**Event: `doc.ingested`**
```json
{
"event_id": "01J...ULID",
"occurred_at": "2025-09-13T08:00:00Z",
"actor": "svc-ingestion",
"tenant_id": "t_123",
"trace_id": "abc-123",
"schema_version": "1.0",
"data": {
"doc_id": "d_abc",
"bucket": "raw",
"key": "tenants/t_123/raw/d_abc.pdf",
"checksum": "sha256:...",
"kind": "bank_statement",
"mime": "application/pdf",
"pages": 12
}
}
```
**RAG search response shape**
```json
{
"chunks": [
{
"id": "c1",
"text": "...",
"score": 0.78,
"payload": {
"jurisdiction": "UK",
"tax_years": ["2024-25"],
"topic_tags": ["FHL"],
"pii_free": true
}
}
],
"citations": [
{ "doc_id": "leg-ITA2007", "section_id": "s272A", "url": "https://..." }
],
"kg_hints": [
{
"rule_id": "UK.FHL.Qual",
"formula_id": "FHL_Test_v1",
"node_ids": ["n123", "n456"]
}
],
"calibrated_confidence": 0.81
}
```
---
## PERSISTENCE SCHEMAS (POSTGRES; ALEMBIC)
- `ingest_documents(id pk, tenant_id, doc_id, kind, checksum, bucket, key, mime, pages, created_at)`
- `firm_objects(id pk, tenant_id, source, source_id, type, payload jsonb, synced_at)`
- Qdrant PII mapping table (if absolutely needed): `pii_links(id pk, placeholder_hash, client_id, created_at)` **encrypt with Vault Transit**; do NOT store raw values.
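The Transit envelope-encryption helper can be sketched with the Vault call injected; with hvac the callable would wrap `client.secrets.transit.encrypt_data(name=key, plaintext=b64)`, which expects base64-encoded plaintext and returns a `vault:v1:...` ciphertext:

```python
import base64
from typing import Callable

def encrypt_field(transit_encrypt: Callable[[str], dict], plaintext: str) -> str:
    """Envelope-encrypt a field via Vault Transit before persistence.

    Only the ciphertext is stored; decryption goes back through Transit,
    so raw values never touch the database.
    """
    b64 = base64.b64encode(plaintext.encode()).decode()
    response = transit_encrypt(b64)
    return response["data"]["ciphertext"]
```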
---
## TRAEFIK + AUTHENTIK (COMPOSE LABELS PER SERVICE)
For every service container in `infra/compose/docker-compose.local.yml`, add labels:
```
- "traefik.enable=true"
- "traefik.http.routers.svc-extract.rule=Host(`api.local`) && PathPrefix(`/extract`)"
- "traefik.http.routers.svc-extract.entrypoints=websecure"
- "traefik.http.routers.svc-extract.tls=true"
- "traefik.http.routers.svc-extract.middlewares=authentik-forwardauth,rate-limit"
- "traefik.http.services.svc-extract.loadbalancer.server.port=8000"
```
Use the shared dynamic file `traefik-dynamic.yml` with `authentik-forwardauth` and `rate-limit` middlewares.
---
## OUTPUT FORMAT (STRICT)
Implement a **multi-file codebase** as fenced blocks, EXACTLY in this order:
```txt
# FILE: libs/config.py
# factories for Vault/MinIO/Qdrant/Neo4j/Redis/EventBus, Settings base
...
```
```txt
# FILE: libs/security.py
# Vault Transit helpers, header parsing, internal CIDR checks, middleware
...
```
```txt
# FILE: libs/observability.py
# otel init, prometheus, structlog
...
```
```txt
# FILE: libs/events.py
# EventBus abstraction with Kafka and SQS/SNS impls
...
```
```txt
# FILE: libs/schemas.py
# Shared Pydantic models mirroring ontology entities
...
```
```txt
# FILE: apps/svc-ingestion/main.py
# FastAPI app, endpoints, MinIO write, Postgres, publish doc.ingested
...
```
```txt
# FILE: apps/svc-rpa/main.py
# Playwright flows, Prefect tasks, triggers
...
```
```txt
# FILE: apps/svc-ocr/main.py
# OCR pipeline, endpoints
...
```
```txt
# FILE: apps/svc-extract/main.py
# Classifier + extractors with validator loop
...
```
```txt
# FILE: apps/svc-normalize-map/main.py
# normalization, entity resolution, KG mapping, SHACL validation call
...
```
```txt
# FILE: apps/svc-kg/main.py
# KG façade, RDF export, SHACL validate, lineage traversal
...
```
```txt
# FILE: apps/svc-rag-indexer/main.py
# chunk/de-id/embed/upsert to Qdrant
...
```
```txt
# FILE: apps/svc-rag-retriever/main.py
# hybrid retrieval + rerank + KG fusion
...
```
```txt
# FILE: apps/svc-reason/main.py
# deterministic calculators, schedule compute/explain
...
```
```txt
# FILE: apps/svc-forms/main.py
# PDF fill + evidence pack
...
```
```txt
# FILE: apps/svc-hmrc/main.py
# submit stub|sandbox|live with audit + retries
...
```
```txt
# FILE: apps/svc-firm-connectors/main.py
# connectors to practice mgmt & DMS, sync to Postgres
...
```
```txt
# FILE: infra/compose/docker-compose.local.yml
# Traefik, Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prom+Grafana, Loki, Unleash, all services
...
```
```txt
# FILE: infra/compose/traefik.yml
# static Traefik config
...
```
```txt
# FILE: infra/compose/traefik-dynamic.yml
# forwardAuth middleware + routers/services
...
```
```txt
# FILE: .gitea/workflows/ci.yml
# lint->test->build->scan->push->deploy
...
```
```txt
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```
```txt
# FILE: tests/e2e/test_happy_path.py
# end-to-end: ingest -> ocr -> extract -> map -> compute -> fill -> (stub) submit
...
```
```txt
# FILE: tests/unit/test_calculators.py
# boundary tests for UK SA logic (NIC, HICBC, PA taper, FHL)
...
```
```txt
# FILE: README.md
# how to run locally with docker-compose, Authentik setup, Traefik certs
...
```
## DEFINITION OF DONE
- `docker compose up` brings the full stack up; SSO via Authentik; routes secured via Traefik ForwardAuth.
- Running `pytest` yields ≥ 90% coverage; `make e2e` passes the ingest→submit stub flow.
- All services expose `/healthz|/readyz|/livez|/metrics`; OpenAPI at `/docs`.
- No PII stored in Qdrant; vectors carry `pii_free=true`.
- KG writes are SHACL-validated; violations produce `review.requested` events.
- Evidence lineage is present for every numeric box value.
- Gitea pipeline passes: lint, test, build, scan, push, deploy.
# START
Generate the full codebase and configs in the **exact file blocks and order** specified above.


@@ -134,7 +134,7 @@ class Neo4jClient:
result = await self.run_query(query, {"properties": properties}, database)
node = result[0]["n"] if result else {}
# Return node ID if available, otherwise return the full node
- return node.get("id", node)
+ return node.get("id", node)  # type: ignore
async def update_node(
self,
@@ -209,7 +209,7 @@ class Neo4jClient:
database,
)
rel = result[0]["r"] if result else {}
- return rel.get("id", rel)
+ return rel.get("id", rel)  # type: ignore
# Original signature (using labels and IDs)
rel_properties = properties or {}
@@ -231,7 +231,7 @@ class Neo4jClient:
)
rel = result[0]["r"] if result else {}
# Return relationship ID if available, otherwise return the full relationship
- return rel.get("id", rel)
+ return rel.get("id", rel)  # type: ignore
async def get_node_lineage(
self, node_id: str, max_depth: int = 10, database: str = "neo4j"

libs/ocr/__init__.py (new file, 0 lines)

libs/ocr/processor.py (new file, 507 lines)

@@ -0,0 +1,507 @@
import base64
import concurrent.futures
import io
import json
import os
from pathlib import Path
from typing import Any
import numpy as np
import requests
from PIL import Image, ImageFilter
from PyPDF2 import PdfReader
class OCRProcessor:
def __init__(
self,
model_name: str = "llama3.2-vision:11b",
base_url: str = "http://localhost:11434/api/generate",
max_workers: int = 1,
provider: str = "ollama",
openai_api_key: str | None = None,
openai_base_url: str = "https://api.openai.com/v1/chat/completions",
):
self.model_name = model_name
self.base_url = base_url
self.max_workers = max_workers
self.provider = provider.lower()
self.openai_api_key = openai_api_key or os.getenv("OPENAI_API_KEY")
self.openai_base_url = openai_base_url
def _encode_image(self, image_path: str) -> str:
"""Convert image to base64 string"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
def _pdf_to_images(self, pdf_path: str) -> list[str]:
"""
Convert each page of a PDF to an image without PyMuPDF.
Strategy: extract largest embedded image per page via PyPDF2.
Saves each selected image as a temporary PNG and returns paths.
Note: Text-only pages with no embedded images will be skipped here.
Use _pdf_extract_text as a fallback for such pages.
"""
image_paths: list[str] = []
try:
reader = PdfReader(pdf_path)
for page_index, page in enumerate(reader.pages):
try:
resources = page.get("/Resources")
if resources is None:
continue
xobject = resources.get("/XObject")
if xobject is None:
continue
xobject = xobject.get_object()
largest = None
largest_area = -1
for _, obj_ref in xobject.items():
try:
obj = obj_ref.get_object()
if obj.get("/Subtype") != "/Image":
continue
width = int(obj.get("/Width", 0))
height = int(obj.get("/Height", 0))
area = width * height
if area > largest_area:
largest = obj
largest_area = area
except Exception:
continue
if largest is None:
continue
data = largest.get_data()
filt = largest.get("/Filter")
out_path = f"{pdf_path}_page{page_index}.png"
# If JPEG/JPX, write bytes directly; else convert via PIL
if filt in ("/DCTDecode",):
# JPEG
out_path = f"{pdf_path}_page{page_index}.jpg"
with open(out_path, "wb") as f:
f.write(data)
elif filt in ("/JPXDecode",):
out_path = f"{pdf_path}_page{page_index}.jp2"
with open(out_path, "wb") as f:
f.write(data)
else:
mode = "RGB"
colorspace = largest.get("/ColorSpace")
if colorspace in ("/DeviceGray",):
mode = "L"
width = int(largest.get("/Width", 0))
height = int(largest.get("/Height", 0))
try:
img = Image.frombytes(mode, (width, height), data)
except Exception:
# Best-effort decode via Pillow
img = Image.open(io.BytesIO(data))
img.save(out_path, format="PNG")
image_paths.append(out_path)
except Exception:
# Continue gracefully for problematic pages/objects
continue
return image_paths
except Exception as e:
raise ValueError(f"Could not extract images from PDF: {e}")
def _pdf_extract_text(self, pdf_path: str) -> list[str]:
"""Extract text per page using pdfplumber if available, else PyPDF2."""
texts: list[str] = []
try:
try:
import pdfplumber
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
texts.append(page.extract_text() or "")
return texts
except Exception:
# Fallback to PyPDF2
reader = PdfReader(pdf_path)
for page in reader.pages: # type: ignore
texts.append(page.extract_text() or "")
return texts
except Exception as e:
raise ValueError(f"Could not extract text from PDF: {e}")
def _call_ollama_vision(self, prompt: str, image_base64: str) -> str:
payload = {
"model": self.model_name,
"prompt": prompt,
"stream": False,
"images": [image_base64],
}
response = requests.post(self.base_url, json=payload, timeout=600)  # timeout (seconds) is a judgment call; vision OCR is slow, but don't hang forever
response.raise_for_status()
return response.json().get("response", "") # type: ignore
def _call_openai_vision(self, prompt: str, image_base64: str) -> str:
if not self.openai_api_key:
raise ValueError("OPENAI_API_KEY not set")
# Compose chat.completions payload for GPT-4o/mini vision
payload = {
"model": self.model_name or "gpt-4o-mini",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}",
},
},
],
}
],
"temperature": 0,
}
headers = {
"Authorization": f"Bearer {self.openai_api_key}",
"Content-Type": "application/json",
}
response = requests.post(self.openai_base_url, headers=headers, json=payload, timeout=120)  # timeout (seconds) is a judgment call; avoid hanging indefinitely
response.raise_for_status()
data = response.json()
try:
return data["choices"][0]["message"]["content"] # type: ignore
except Exception:
return json.dumps(data)
def _preprocess_image(self, image_path: str, language: str = "en") -> str:
"""
Preprocess image before OCR using Pillow + NumPy:
- Convert to grayscale
- Histogram equalization (contrast)
- Median denoise
- Otsu threshold and invert
"""
try:
with Image.open(image_path) as img:
if img.mode in ("RGBA", "LA"):
img = img.convert("RGB")
gray = img.convert("L")
# Histogram equalization via cumulative distribution
arr = np.asarray(gray)
hist, _ = np.histogram(arr.flatten(), 256, [0, 256]) # type: ignore
cdf = hist.cumsum()
cdf_masked = np.ma.masked_equal(cdf, 0) # type: ignore
cdf_min = cdf_masked.min() if cdf_masked.size else 0
cdf_max = cdf_masked.max() if cdf_masked.size else 0
if cdf_max == cdf_min:
eq = arr
else:
cdf_scaled = (cdf_masked - cdf_min) * 255 / (cdf_max - cdf_min)
lut = np.ma.filled(cdf_scaled, 0).astype("uint8")
eq = lut[arr]
eq_img = Image.fromarray(eq, mode="L")
# Median filter (3x3) to reduce noise
eq_img = eq_img.filter(ImageFilter.MedianFilter(size=3))
arr_eq = np.asarray(eq_img)
# Otsu threshold
hist2, _ = np.histogram(arr_eq, 256, [0, 256]) # type: ignore
total = arr_eq.size
sum_total = (np.arange(256) * hist2).sum()
sum_b = 0.0
w_b = 0.0
max_var = 0.0
thr = 0
for t in range(256):
w_b += hist2[t]
if w_b == 0:
continue
w_f = total - w_b
if w_f == 0:
break
sum_b += t * hist2[t]
m_b = sum_b / w_b
m_f = (sum_total - sum_b) / w_f
var_between = w_b * w_f * (m_b - m_f) ** 2
if var_between > max_var:
max_var = var_between
thr = t
binary = (arr_eq > thr).astype(np.uint8) * 255
# Invert: black text on white background
binary = 255 - binary
out_img = Image.fromarray(binary, mode="L")
preprocessed_path = f"{image_path}_preprocessed.jpg"
out_img.save(preprocessed_path, format="JPEG", quality=95)
return preprocessed_path
except Exception as e:
raise ValueError(f"Failed to preprocess image {image_path}: {e}")
def process_image(
self,
image_path: str,
format_type: str = "markdown",
preprocess: bool = True,
custom_prompt: str | None = None,
language: str = "en",
) -> str:
"""
Process an image (or PDF) and extract text in the specified format
Args:
image_path: Path to the image file or PDF file
format_type: One of ["markdown", "text", "json", "structured", "key_value","custom"]
preprocess: Whether to apply image preprocessing
custom_prompt: If provided, this prompt overrides the default based on format_type
language: Language code to apply language specific OCR preprocessing
"""
try:
# If the input is a PDF, process all pages
if image_path.lower().endswith(".pdf"):
image_pages = self._pdf_to_images(image_path)
responses: list[str] = []
if image_pages:
for idx, page_file in enumerate(image_pages):
# Process each page with preprocessing if enabled
if preprocess:
preprocessed_path = self._preprocess_image(
page_file, language
)
else:
preprocessed_path = page_file
image_base64 = self._encode_image(preprocessed_path)
if custom_prompt and custom_prompt.strip():
prompt = custom_prompt
else:
prompts = {
"markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
Format the output in markdown:
- Use headers (#, ##, ###) **only if they appear in the image**
- Preserve original lists (-, *, numbered lists) as they are
- Maintain all text formatting (bold, italics, underlines) exactly as seen
- **Do not add, interpret, or restructure any content**
""",
"text": f"""Extract all visible text from this image in {language} **without any changes**.
- **Do not summarize, paraphrase, or infer missing text.**
- Retain all spacing, punctuation, and formatting exactly as in the image.
- If text is unclear or partially visible, extract as much as possible without guessing.
- **Include all text, even if it seems irrelevant or repeated.**
""",
"json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
- **Do not summarize, add, or modify any text.**
- Maintain hierarchical sections and subsections as they appear.
- Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
- Include all text, even if fragmented, blurry, or unclear.
""",
"structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
- Identify and format tables **without altering content**.
- Preserve list structures (bulleted, numbered) **exactly as shown**.
- Maintain all section headings, indents, and alignments.
- **Do not add, infer, or restructure the content in any way.**
""",
"key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
- Identify and extract labels and their corresponding values without modification.
- Maintain the exact wording, punctuation, and order.
- Format each pair as 'key: value' **only if clearly structured that way in the image**.
- **Do not infer missing values or add any extra text.**
""",
"table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
- **Preserve the table structure** (rows, columns, headers) as closely as possible.
- **Do not add missing values or infer content**—if a cell is empty, leave it empty.
- Maintain all numerical, textual, and special character formatting.
- If the table contains merged cells, indicate them clearly without altering their meaning.
- Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
""",
}
prompt = prompts.get(format_type, prompts["text"])
# Route to chosen provider
if self.provider == "openai":
res = self._call_openai_vision(prompt, image_base64)
else:
res = self._call_ollama_vision(prompt, image_base64)
responses.append(f"Page {idx + 1}:\n{res}")
# Clean up temporary files
if preprocess and preprocessed_path.endswith(
"_preprocessed.jpg"
):
try:
os.remove(preprocessed_path)
except OSError:
pass
if page_file.endswith((".png", ".jpg", ".jp2")):
try:
os.remove(page_file)
except OSError:
pass
final_result = "\n".join(responses)
if format_type == "json":
try:
json_data = json.loads(final_result)
return json.dumps(json_data, indent=2)
except json.JSONDecodeError:
return final_result
return final_result
else:
# Fallback: no images found; extract raw text per page
text_pages = self._pdf_extract_text(image_path)
combined = []
for i, t in enumerate(text_pages):
combined.append(f"Page {i + 1}:\n{t}")
return "\n".join(combined)
# Process non-PDF images as before.
if preprocess:
image_path = self._preprocess_image(image_path, language)
image_base64 = self._encode_image(image_path)
# Clean up temporary files
if image_path.endswith(("_preprocessed.jpg", "_temp.jpg")):
os.remove(image_path)
if custom_prompt and custom_prompt.strip():
prompt = custom_prompt
print("Using custom prompt:", prompt)
else:
                prompts = {
                    "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                    Format the output in markdown:
                    - Use headers (#, ##, ###) **only if they appear in the image**
                    - Preserve original lists (-, *, numbered lists) as they are
                    - Maintain all text formatting (bold, italics, underlines) exactly as seen
                    - **Do not add, interpret, or restructure any content**
                    """,
                    "text": f"""Extract all visible text from this image in {language} **without any changes**.
                    - **Do not summarize, paraphrase, or infer missing text.**
                    - Retain all spacing, punctuation, and formatting exactly as in the image.
                    - If text is unclear or partially visible, extract as much as possible without guessing.
                    - **Include all text, even if it seems irrelevant or repeated.**
                    """,
                    "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                    - **Do not summarize, add, or modify any text.**
                    - Maintain hierarchical sections and subsections as they appear.
                    - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                    - Include all text, even if fragmented, blurry, or unclear.
                    """,
                    "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                    - Identify and format tables **without altering content**.
                    - Preserve list structures (bulleted, numbered) **exactly as shown**.
                    - Maintain all section headings, indents, and alignments.
                    - **Do not add, infer, or restructure the content in any way.**
                    """,
                    "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                    - Identify and extract labels and their corresponding values without modification.
                    - Maintain the exact wording, punctuation, and order.
                    - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                    - **Do not infer missing values or add any extra text.**
                    """,
                    "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                    - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                    - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                    - Maintain all numerical, textual, and special character formatting.
                    - If the table contains merged cells, indicate them clearly without altering their meaning.
                    - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                    """,
                }
                prompt = prompts.get(format_type, prompts["text"])
                print("Using default prompt:", prompt)  # Debug print
            # Call chosen provider with single image
            if self.provider == "openai":
                result = self._call_openai_vision(prompt, image_base64)
            else:
                result = self._call_ollama_vision(prompt, image_base64)
            if format_type == "json":
                try:
                    json_data = json.loads(result)
                    return json.dumps(json_data, indent=2)
                except json.JSONDecodeError:
                    return str(result)
            return str(result)
        except Exception as e:
            return f"Error processing image: {str(e)}"

    def process_batch(
        self,
        input_path: str | list[str],
        format_type: str = "markdown",
        recursive: bool = False,
        preprocess: bool = True,
        custom_prompt: str | None = None,
        language: str = "en",
    ) -> dict[str, Any]:
        """
        Process multiple images in batch

        Args:
            input_path: Path to directory or list of image paths
            format_type: Output format type
            recursive: Whether to search directories recursively
            preprocess: Whether to apply image preprocessing
            custom_prompt: If provided, this prompt overrides the default for each image
            language: Language code to apply language-specific OCR preprocessing

        Returns:
            Dictionary with results and statistics
        """
        # Collect all image paths
        image_paths: list[str | Path] = []
        if isinstance(input_path, str):
            base_path = Path(input_path)
            if base_path.is_dir():
                pattern = "**/*" if recursive else "*"
                for ext in [".png", ".jpg", ".jpeg", ".pdf", ".tiff"]:
                    image_paths.extend(base_path.glob(f"{pattern}{ext}"))
            else:
                image_paths = [base_path]
        else:
            image_paths = [Path(p) for p in input_path]

        results = {}
        errors = {}
        # Process images in parallel
        with concurrent.futures.ThreadPoolExecutor(
            max_workers=self.max_workers
        ) as executor:
            future_to_path = {
                executor.submit(
                    self.process_image,
                    str(path),
                    format_type,
                    preprocess,
                    custom_prompt,
                    language,
                ): path
                for path in image_paths
            }
            for future in concurrent.futures.as_completed(future_to_path):
                path = future_to_path[future]
                try:
                    results[str(path)] = future.result()
                except Exception as e:
                    errors[str(path)] = str(e)

        return {
            "results": results,
            "errors": errors,
            "statistics": {
                "total": len(image_paths),
                "successful": len(results),
                "failed": len(errors),
            },
        }
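The `process_batch` method above is a standard `ThreadPoolExecutor` fan-out/fan-in: submit one future per input, map each future back to its path, and fold completed futures into separate result and error maps. A minimal standalone sketch of the same aggregation pattern (the `worker` callable here is a hypothetical stand-in for `process_image`):

```python
import concurrent.futures


def process_batch(paths, worker, max_workers=4):
    """Fan a worker callable out over paths in parallel, then fold the
    completed futures back into separate result and error maps."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed.
        future_to_path = {executor.submit(worker, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as e:
                errors[path] = str(e)
    return {
        "results": results,
        "errors": errors,
        "statistics": {
            "total": len(paths),
            "successful": len(results),
            "failed": len(errors),
        },
    }
```

Because `future.result()` re-raises any exception from the worker thread, a single failing input lands in `errors` without aborting the rest of the batch.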


@@ -1,13 +1,13 @@
 # Core framework dependencies (Required by all services)
-fastapi>=0.118.0
+fastapi>=0.119.0
 uvicorn[standard]>=0.37.0
-pydantic>=2.11.9
+pydantic>=2.12.0
 pydantic-settings>=2.11.0
 # Database drivers (lightweight)
-sqlalchemy>=2.0.43
+sqlalchemy>=2.0.44
 asyncpg>=0.30.0
-psycopg2-binary>=2.9.10
+psycopg2-binary>=2.9.11
 neo4j>=6.0.2
 redis[hiredis]>=6.4.0


@@ -3,3 +3,4 @@ pdfrw>=0.4
 reportlab>=4.4.4
 PyPDF2>=3.0.1
 pdfplumber>=0.11.7
+opencv-python


@@ -79,7 +79,7 @@ class StorageClient:
         """Download object from bucket"""
         try:
             response = self.client.get_object(bucket_name, object_name)
-            data = response.read()
+            data: bytes = response.read()
             response.close()
             response.release_conn()
@@ -89,7 +89,7 @@ class StorageClient:
                 object=object_name,
                 size=len(data),
             )
-            return data  # type: ignore
+            return data
         except S3Error as e:
             logger.error(
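The last hunk swaps a `# type: ignore` on the return for an explicit `data: bytes` annotation at the assignment. A minimal illustration of why that pattern satisfies the type checker (the response class below is a hypothetical stand-in for the object `get_object` returns, not the real MinIO client):

```python
from typing import Any


class FakeResponse:
    """Stand-in for a library response object whose read() is typed as Any."""

    def read(self) -> Any:
        # Many third-party stubs return Any here; callers must pin the type.
        return b"payload"


def download(resp: FakeResponse) -> bytes:
    # Annotating the local variable narrows Any to bytes at the assignment,
    # so the typed return line needs no "# type: ignore".
    data: bytes = resp.read()
    return data
```

Pinning the type where the untyped value enters the function keeps the suppression comment out of the code and makes the checker verify every later use of `data` as `bytes`.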