deployment, linting and infra configuration
Some checks failed: every CI/CD Pipeline job for this push was cancelled (Code Quality & Linting; Policy Validation; Test Suite; Build Docker Images for all 14 services, svc-coverage through ui-review; Security Scanning for svc-coverage, svc-extract, svc-kg, svc-rag-retriever, ui-review; Generate SBOM; Deploy to Staging; Deploy to Production; Notifications).

This commit is contained in:
harkon
2025-10-14 07:42:31 +01:00
parent f0f7674b8d
commit eea46ac89c
41 changed files with 1017 additions and 1448 deletions


@@ -1,555 +0,0 @@
# ROLE
You are a **Senior Platform Engineer + Backend Lead** generating **production code** and **ops assets** for a microservice suite that powers an accounting Knowledge Graph + Vector RAG platform. Authentication/authorization are centralized at the **edge via Traefik + Authentik** (ForwardAuth). **Services are trust-bound** to Traefik and consume user/role claims via forwarded headers/JWT.
# MISSION
Produce fully working code for **all application services** (FastAPI + Python 3.12) with:
- Solid domain models, Pydantic v2 schemas, type hints, strict mypy, ruff lint.
- OpenTelemetry tracing, Prometheus metrics, structured logging.
- Vault-backed secrets, MinIO S3 client, Qdrant client, Neo4j driver, Postgres (SQLAlchemy), Redis.
- Eventing (Kafka or SQS/SNS behind an interface).
- Deterministic data contracts, end-to-end tests, Dockerfiles, Compose, CI for Gitea.
- Traefik labels + Authentik Outpost integration for every exposed route.
- Zero PII in vectors (Qdrant), evidence-based lineage in KG, and bitemporal writes.
# GLOBAL CONSTRAINTS (APPLY TO ALL SERVICES)
- **Language & Runtime:** Python **3.12**.
- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2, httpx, aiokafka or boto3 (pluggable), redis-py, opentelemetry-instrumentation-fastapi, prometheus-fastapi-instrumentator.
- **Config:** `pydantic-settings` with `.env` overlay. Provide `Settings` class per service.
- **Secrets:** HashiCorp **Vault** (AppRole/JWT). Use Vault Transit to **envelope-encrypt** sensitive fields before persistence (helpers provided in `lib/security.py`).
- **Auth:** No OIDC in services. Add `TrustedProxyMiddleware`:
- Reject if request not from internal network (configurable CIDR).
- Require headers set by Traefik+Authentik (`X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer <jwt>`).
- Parse `X-Authenticated-Groups` into a `roles` list on `request.state`.
- **Observability:**
- OpenTelemetry (traceparent propagation), span attrs (service, route, user, tenant).
- Prometheus metrics endpoint `/metrics` protected by internal network check.
- Structured JSON logs (timestamp, level, svc, trace_id, msg) via `structlog`.
- **Errors:** Global exception handler RFC7807 Problem+JSON (`type`, `title`, `status`, `detail`, `instance`, `trace_id`).
- **Testing:** `pytest`, `pytest-asyncio`, `hypothesis` (property tests for calculators), coverage ≥ 90% per service.
- **Static:** `ruff`, `mypy --strict`, `bandit`, `safety`, `licensecheck`.
- **Perf:** Each service exposes `/healthz`, `/readyz`, `/livez`; cold start < 500ms; p95 endpoint < 250ms (local).
- **Containers:** Distroless or slim images; non-root user; read-only FS; `/tmp` mounted for OCR where needed.
- **Docs:** OpenAPI JSON + ReDoc; MkDocs site with service READMEs.
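The `TrustedProxyMiddleware` bullets above pin down a contract that can be sketched as a framework-free check (the default CIDR and the use of `PermissionError` are assumptions; the real middleware would attach the roles to `request.state`):

```python
import ipaddress

REQUIRED_HEADERS = (
    "X-Authenticated-User",
    "X-Authenticated-Email",
    "X-Authenticated-Groups",
    "Authorization",
)

def validate_trusted_request(
    client_ip: str,
    headers: dict[str, str],
    internal_cidr: str = "10.0.0.0/8",  # assumption: configurable per deployment
) -> list[str]:
    """Return the parsed roles list, or raise PermissionError."""
    # Reject requests that did not come through the internal network (Traefik)
    if ipaddress.ip_address(client_ip) not in ipaddress.ip_network(internal_cidr):
        raise PermissionError("request not from internal network")
    missing = [h for h in REQUIRED_HEADERS if not headers.get(h)]
    if missing:
        raise PermissionError(f"missing trust headers: {missing}")
    # Groups arrive comma-separated from Traefik+Authentik
    return [g.strip() for g in headers["X-Authenticated-Groups"].split(",") if g.strip()]
```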
# SHARED LIBS (GENERATE ONCE, REUSE)
Create `libs/` used by all services:
- `libs/config.py` base `Settings`, env parsing, Vault client factory, MinIO client factory, Qdrant client factory, Neo4j driver factory, Redis factory, Kafka/SQS client factory.
- `libs/security.py` Vault Transit helpers (`encrypt_field`, `decrypt_field`), header parsing, internal-CIDR validator.
- `libs/observability.py` otel init, prometheus instrumentor, logging config.
- `libs/events.py` abstract `EventBus` with `publish(topic, payload: dict)`, `subscribe(topic, handler)`. Two impls: Kafka (`aiokafka`) and SQS/SNS (`boto3`).
- `libs/schemas.py` **canonical Pydantic models** shared across services (Document, Evidence, IncomeItem, etc.) mirroring the ontology schemas. Include JSONSchema exports.
- `libs/storage.py` S3/MinIO helpers (bucket ensure, put/get, presigned).
- `libs/neo.py` Neo4j session helpers, Cypher runner with retry, SHACL validator invoker (pySHACL on exported RDF).
- `libs/rag.py` Qdrant collections CRUD, hybrid search (dense+sparse), rerank wrapper, de-identification utilities (regex + NER; hash placeholders).
- `libs/forms.py` PDF AcroForm fill via `pdfrw` with overlay fallback via `reportlab`.
- `libs/calibration.py` `calibrated_confidence(raw_score, method="temperature_scaling", params=...)`.
# EVENT TOPICS (STANDARDIZE)
- `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`
Each payload MUST include: `event_id (ulid)`, `occurred_at (iso)`, `actor`, `tenant_id`, `trace_id`, `schema_version`, and a `data` object (service-specific).
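The required envelope can be sketched with stdlib dataclasses (a real implementation would use the shared Pydantic models and a ULID library; `uuid4` below is only a stand-in for the ULID):

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class EventEnvelope:
    """Canonical event envelope shared by all topics."""
    actor: str
    tenant_id: str
    trace_id: str
    data: dict[str, Any]  # service-specific payload
    schema_version: str = "1.0"
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # stand-in for ULID
    occurred_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def make_event(actor: str, tenant_id: str, trace_id: str, data: dict[str, Any]) -> dict:
    return asdict(EventEnvelope(actor=actor, tenant_id=tenant_id, trace_id=trace_id, data=data))
```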
# TRUST HEADERS FROM TRAEFIK + AUTHENTIK (USE EXACT KEYS)
- `X-Authenticated-User` (string)
- `X-Authenticated-Email` (string)
- `X-Authenticated-Groups` (comma-separated)
- `Authorization` (`Bearer <jwt>` from Authentik)
Reject any request missing these (except `/healthz|/readyz|/livez|/metrics` from internal CIDR).
---
## SERVICES TO IMPLEMENT (CODE FOR EACH)
### 1) `svc-ingestion`
**Purpose:** Accept uploads or URLs, checksum, store to MinIO, emit `doc.ingested`.
**Endpoints:**
- `POST /v1/ingest/upload` (multipart file, metadata: `tenant_id`, `kind`, `source`) → returns `{doc_id, s3_url, checksum}`
- `POST /v1/ingest/url` (json: `{url, kind, tenant_id}`) downloads to MinIO
- `GET /v1/docs/{doc_id}` metadata
**Logic:**
- Compute SHA256, dedupe by checksum; MinIO path `tenants/{tenant_id}/raw/{doc_id}.pdf`.
- Store metadata in Postgres table `ingest_documents` (alembic migrations).
- Publish `doc.ingested` with `{doc_id, bucket, key, pages?, mime}`.
**Env:** `S3_BUCKET_RAW`, `MINIO_*`, `DB_URL`.
**Traefik labels:** route `/ingest/*`.
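The checksum, dedupe, and path logic above is simple enough to sketch directly (the `known_checksums` set stands in for the Postgres `ingest_documents` lookup):

```python
import hashlib

def sha256_checksum(payload: bytes) -> str:
    """Checksum in the `sha256:<hex>` shape used in the doc.ingested event."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()

def raw_object_key(tenant_id: str, doc_id: str) -> str:
    # Mirrors the MinIO layout tenants/{tenant_id}/raw/{doc_id}.pdf
    return f"tenants/{tenant_id}/raw/{doc_id}.pdf"

def is_duplicate(checksum: str, known_checksums: set[str]) -> bool:
    # Dedupe: an upload whose checksum is already recorded is skipped
    return checksum in known_checksums
```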
---
### 2) `svc-rpa`
**Purpose:** Scheduled RPA pulls from firm/client portals via Playwright.
**Tasks:**
- Playwright login flows (credentials from Vault), 2FA via Authentik OAuth device or OTP secret in Vault.
- Download statements/invoices; hand off to `svc-ingestion` via internal POST.
- Prefect flows: `pull_portal_X()`, `pull_portal_Y()` with schedules.
**Endpoints:**
- `POST /v1/rpa/run/{connector}` (manual trigger)
- `GET /v1/rpa/status/{run_id}`
**Env:** `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID`.
---
### 3) `svc-ocr`
**Purpose:** OCR & layout extraction.
**Pipeline:**
- Pull object from MinIO, detect rotation/de-skew (`opencv-python`), split pages (`pymupdf`), OCR (`pytesseract`) or bypass if text layer present (`pdfplumber`).
- Output per-page text + **bbox** for lines/words.
- Write JSON to MinIO `tenants/{tenant_id}/ocr/{doc_id}.json` and emit `doc.ocr_ready`.
**Endpoints:**
- `POST /v1/ocr/{doc_id}` (idempotent trigger)
- `GET /v1/ocr/{doc_id}` (fetch OCR JSON)
**Env:** `TESSERACT_LANGS`, `S3_BUCKET_EVIDENCE`.
---
### 4) `svc-extract`
**Purpose:** Classify docs and extract KV + tables into **schema-constrained JSON** (with bbox/page).
**Endpoints:**
- `POST /v1/extract/{doc_id}` body: `{strategy: "llm|rules|hybrid"}`
- `GET /v1/extract/{doc_id}` structured JSON
**Implementation:**
- Use prompt files in `prompts/`: `doc_classify.txt`, `kv_extract.txt`, `table_extract.txt`.
- **Validator loop**: run the LLM, validate the output against JSONSchema, and retry with the validation error messages, up to N times.
- Return Pydantic models from `libs/schemas.py`.
- Emit `doc.extracted`.
**Env:** `LLM_ENGINE`, `TEMPERATURE`, `MAX_TOKENS`.
---
### 5) `svc-normalize-map`
**Purpose:** Normalize & map extracted data to KG.
**Logic:**
- Currency normalization (ECB or static fx table), dates, UK tax year/basis period inference.
- Entity resolution (blocking + fuzzy).
- Generate nodes/edges (+ `Evidence` with doc_id/page/bbox/text_hash).
- Use `libs/neo.py` to write with **bitemporal** fields; run **SHACL** validator; on violation, queue `review.requested`.
- Emit `kg.upserted`.
**Endpoints:**
- `POST /v1/map/{doc_id}`
- `GET /v1/map/{doc_id}/preview` (diff view, to be used by UI)
**Env:** `NEO4J_*`.
---
### 6) `svc-kg`
**Purpose:** Graph façade + RDF/SHACL utility.
**Endpoints:**
- `GET /v1/kg/nodes/{label}/{id}`
- `POST /v1/kg/cypher` (admin-gated inline query; must check `admin` role)
- `POST /v1/kg/export/rdf` (returns RDF for SHACL)
- `POST /v1/kg/validate` (run pySHACL against `schemas/shapes.ttl`)
- `GET /v1/kg/lineage/{node_id}` (traverse `DERIVED_FROM` Evidence)
**Env:** `NEO4J_*`.
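The `/v1/kg/validate` decision can be isolated behind a callable so the review-event logic is testable; with pySHACL the callable would wrap `pyshacl.validate(data_graph, shacl_graph=shapes)`:

```python
from typing import Callable

def shacl_gate(run_validation: Callable[[], tuple[bool, str]]) -> dict:
    """Run a SHACL validation and decide whether a review event is needed.

    `run_validation` returns (conforms, report_text); on violation the
    caller publishes `review.requested` with the report attached.
    """
    conforms, report = run_validation()
    return {
        "conforms": conforms,
        "report": report,
        "emit_event": None if conforms else "review.requested",
    }
```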
---
### 7) `svc-rag-indexer`
**Purpose:** Build Qdrant indices (firm knowledge, legislation, best practices, glossary).
**Workflow:**
- Load sources (filesystem, URLs, Firm DMS via `svc-firm-connectors`).
- **De-identify PII** (regex + NER), replace with placeholders; store mapping only in Postgres.
- Chunk (layout-aware) per `retrieval/chunking.yaml`.
- Compute **dense** embeddings (e.g., `bge-small-en-v1.5`) and **sparse** (Qdrant sparse).
- Upsert to Qdrant with payload `{jurisdiction, tax_years[], topic_tags[], version, pii_free: true, doc_id/section_id/url}`.
- Emit `rag.indexed`.
**Endpoints:**
- `POST /v1/index/run`
- `GET /v1/index/status/{run_id}`
**Env:** `QDRANT_URL`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`.
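A regex-only sketch of the de-identification pass (the NER stage is omitted, the NI-number pattern is a simplification, and only the placeholder-to-hash mapping would be persisted, in Postgres, never in Qdrant payloads):

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NINO_RE = re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b")  # assumption: simplified UK NI shape

def deidentify(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with hash placeholders; return (clean_text, mapping).

    The mapping stores only a hash of the original value, never the raw value.
    """
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        value = match.group(0)
        digest = hashlib.sha256(value.encode()).hexdigest()[:12]
        placeholder = f"[PII:{digest}]"
        mapping[placeholder] = digest
        return placeholder

    clean = EMAIL_RE.sub(repl, text)
    clean = NINO_RE.sub(repl, clean)
    return clean, mapping
```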
---
### 8) `svc-rag-retriever`
**Purpose:** Hybrid search + KG fusion with rerank and calibrated confidence.
**Endpoint:**
- `POST /v1/rag/search` `{query, tax_year?, jurisdiction?, k?}`
```
{
"chunks": [...],
"citations": [{doc_id|url, section_id?, page?, bbox?}],
"kg_hints": [{rule_id, formula_id, node_ids[]}],
"calibrated_confidence": 0.0-1.0
}
```
**Implementation:**
- Hybrid score: `alpha * dense + beta * sparse`; rerank top-K via cross-encoder; **KG fusion** (boost chunks citing Rules/Calculations relevant to schedule).
- Use `libs/calibration.py` to expose calibrated confidence.
---
### 9) `svc-reason`
**Purpose:** Deterministic calculators + materializers (UK SA).
**Endpoints:**
- `POST /v1/reason/compute_schedule` `{tax_year, taxpayer_id, schedule_id}`
- `GET /v1/reason/explain/{schedule_id}` rationale & lineage paths
**Implementation:**
- Pure functions for: employment, self-employment, property (FHL, 20% interest credit), dividends/interest, allowances, NIC (Class 2/4), HICBC, student loans (Plans 1/2/4/5, PGL).
- **Deterministic order** as defined; rounding per `FormBox.rounding_rule`.
- Use Cypher from `kg/reasoning/schedule_queries.cypher` to materialize box values; attach `DERIVED_FROM` evidence.
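As an example of one pure calculator in this set, the personal-allowance taper (£1 of allowance lost per £2 of adjusted net income over the threshold; the monetary figures are 2024-25 assumptions, and rounding per `FormBox.rounding_rule` is applied at materialization, not here):

```python
def tapered_personal_allowance(
    adjusted_net_income: float,
    full_allowance: float = 12_570.0,     # assumption: 2024-25 figure
    taper_threshold: float = 100_000.0,   # assumption: 2024-25 figure
) -> float:
    """Allowance is reduced £1 for every £2 of income over the threshold, floored at 0."""
    if adjusted_net_income <= taper_threshold:
        return full_allowance
    reduction = (adjusted_net_income - taper_threshold) / 2
    return max(0.0, full_allowance - reduction)
```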
---
### 10) `svc-forms`
**Purpose:** Fill PDFs and assemble evidence bundles.
**Endpoints:**
- `POST /v1/forms/fill` `{tax_year, taxpayer_id, form_id}` returns PDF (binary)
- `POST /v1/forms/evidence_pack` `{scope}` ZIP + manifest + signed hashes (sha256)
**Implementation:**
- `pdfrw` for AcroForm; overlay with ReportLab if needed.
- Manifest includes `doc_id/page/bbox/text_hash` for every numeric field.
---
### 11) `svc-hmrc`
**Purpose:** HMRC submitter (stub|sandbox|live).
**Endpoints:**
- `POST /v1/hmrc/submit` `{tax_year, taxpayer_id, dry_run}` → returns `{status, submission_id?, errors[]}`
- `GET /v1/hmrc/submissions/{id}`
**Implementation:**
- Rate limits, retries/backoff, signed audit log; environment toggle.
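The retries/backoff behavior might be factored like this (delay and retry values are assumptions; real code would narrow the caught exceptions to retryable HMRC errors and add jitter):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Retry with exponential backoff; delays are base, 2*base, 4*base, ..."""
    last: Exception | None = None
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as exc:  # real code would catch only retryable errors
            last = exc
            if attempt < retries:
                sleep(base_delay * (2 ** attempt))
    raise last  # type: ignore[misc]
```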
---
### 12) `svc-firm-connectors`
**Purpose:** Read-only connectors to Firm Databases (Practice Mgmt, DMS).
**Endpoints:**
- `POST /v1/firm/sync` `{since?}` → returns `{objects_synced, errors[]}`
- `GET /v1/firm/objects` (paged)
**Implementation:**
- Data contracts in `config/firm_contracts/`; mappers write into the Secure Client Data Store (Postgres) with lineage columns (`source`, `source_id`, `synced_at`).
---
### 13) `ui-review` (outline only)
- Next.js (SSO handled by Traefik+Authentik), shows extracted fields + evidence snippets; POST overrides to `svc-extract`/`svc-normalize-map`.
---
## DATA CONTRACTS (ESSENTIAL EXAMPLES)
**Event: `doc.ingested`**
```json
{
"event_id": "01J...ULID",
"occurred_at": "2025-09-13T08:00:00Z",
"actor": "svc-ingestion",
"tenant_id": "t_123",
"trace_id": "abc-123",
"schema_version": "1.0",
"data": {
"doc_id": "d_abc",
"bucket": "raw",
"key": "tenants/t_123/raw/d_abc.pdf",
"checksum": "sha256:...",
"kind": "bank_statement",
"mime": "application/pdf",
"pages": 12
}
}
```
**RAG search response shape**
```json
{
"chunks": [
{
"id": "c1",
"text": "...",
"score": 0.78,
"payload": {
"jurisdiction": "UK",
"tax_years": ["2024-25"],
"topic_tags": ["FHL"],
"pii_free": true
}
}
],
"citations": [
{ "doc_id": "leg-ITA2007", "section_id": "s272A", "url": "https://..." }
],
"kg_hints": [
{
"rule_id": "UK.FHL.Qual",
"formula_id": "FHL_Test_v1",
"node_ids": ["n123", "n456"]
}
],
"calibrated_confidence": 0.81
}
```
---
## PERSISTENCE SCHEMAS (POSTGRES; ALEMBIC)
- `ingest_documents(id pk, tenant_id, doc_id, kind, checksum, bucket, key, mime, pages, created_at)`
- `firm_objects(id pk, tenant_id, source, source_id, type, payload jsonb, synced_at)`
- Qdrant PII mapping table (if absolutely needed): `pii_links(id pk, placeholder_hash, client_id, created_at)` **encrypt with Vault Transit**; do NOT store raw values.
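The Transit envelope-encryption helper can be sketched with the Vault call injected; with hvac the callable would wrap `client.secrets.transit.encrypt_data(name=key, plaintext=b64)`, which expects base64-encoded plaintext and returns a `vault:v1:...` ciphertext:

```python
import base64
from typing import Callable

def encrypt_field(transit_encrypt: Callable[[str], dict], plaintext: str) -> str:
    """Envelope-encrypt a field via Vault Transit before persistence.

    Only the ciphertext is stored; decryption goes back through Transit,
    so raw values never touch the database.
    """
    b64 = base64.b64encode(plaintext.encode()).decode()
    response = transit_encrypt(b64)
    return response["data"]["ciphertext"]
```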
---
## TRAEFIK + AUTHENTIK (COMPOSE LABELS PER SERVICE)
For every service container in `infra/compose/docker-compose.local.yml`, add labels:
```
- "traefik.enable=true"
- "traefik.http.routers.svc-extract.rule=Host(`api.local`) && PathPrefix(`/extract`)"
- "traefik.http.routers.svc-extract.entrypoints=websecure"
- "traefik.http.routers.svc-extract.tls=true"
- "traefik.http.routers.svc-extract.middlewares=authentik-forwardauth,rate-limit"
- "traefik.http.services.svc-extract.loadbalancer.server.port=8000"
```
Use the shared dynamic file `traefik-dynamic.yml` with `authentik-forwardauth` and `rate-limit` middlewares.
---
## OUTPUT FORMAT (STRICT)
Implement a **multi-file codebase** as fenced blocks, EXACTLY in this order:
```txt
# FILE: libs/config.py
# factories for Vault/MinIO/Qdrant/Neo4j/Redis/EventBus, Settings base
...
```
```txt
# FILE: libs/security.py
# Vault Transit helpers, header parsing, internal CIDR checks, middleware
...
```
```txt
# FILE: libs/observability.py
# otel init, prometheus, structlog
...
```
```txt
# FILE: libs/events.py
# EventBus abstraction with Kafka and SQS/SNS impls
...
```
```txt
# FILE: libs/schemas.py
# Shared Pydantic models mirroring ontology entities
...
```
```txt
# FILE: apps/svc-ingestion/main.py
# FastAPI app, endpoints, MinIO write, Postgres, publish doc.ingested
...
```
```txt
# FILE: apps/svc-rpa/main.py
# Playwright flows, Prefect tasks, triggers
...
```
```txt
# FILE: apps/svc-ocr/main.py
# OCR pipeline, endpoints
...
```
```txt
# FILE: apps/svc-extract/main.py
# Classifier + extractors with validator loop
...
```
```txt
# FILE: apps/svc-normalize-map/main.py
# normalization, entity resolution, KG mapping, SHACL validation call
...
```
```txt
# FILE: apps/svc-kg/main.py
# KG façade, RDF export, SHACL validate, lineage traversal
...
```
```txt
# FILE: apps/svc-rag-indexer/main.py
# chunk/de-id/embed/upsert to Qdrant
...
```
```txt
# FILE: apps/svc-rag-retriever/main.py
# hybrid retrieval + rerank + KG fusion
...
```
```txt
# FILE: apps/svc-reason/main.py
# deterministic calculators, schedule compute/explain
...
```
```txt
# FILE: apps/svc-forms/main.py
# PDF fill + evidence pack
...
```
```txt
# FILE: apps/svc-hmrc/main.py
# submit stub|sandbox|live with audit + retries
...
```
```txt
# FILE: apps/svc-firm-connectors/main.py
# connectors to practice mgmt & DMS, sync to Postgres
...
```
```txt
# FILE: infra/compose/docker-compose.local.yml
# Traefik, Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prom+Grafana, Loki, Unleash, all services
...
```
```txt
# FILE: infra/compose/traefik.yml
# static Traefik config
...
```
```txt
# FILE: infra/compose/traefik-dynamic.yml
# forwardAuth middleware + routers/services
...
```
```txt
# FILE: .gitea/workflows/ci.yml
# lint->test->build->scan->push->deploy
...
```
```txt
# FILE: Makefile
# bootstrap, run, test, lint, build, deploy, format, seed
...
```
```txt
# FILE: tests/e2e/test_happy_path.py
# end-to-end: ingest -> ocr -> extract -> map -> compute -> fill -> (stub) submit
...
```
```txt
# FILE: tests/unit/test_calculators.py
# boundary tests for UK SA logic (NIC, HICBC, PA taper, FHL)
...
```
```txt
# FILE: README.md
# how to run locally with docker-compose, Authentik setup, Traefik certs
...
```
## DEFINITION OF DONE
- `docker compose up` brings the full stack up; SSO via Authentik; routes secured via Traefik ForwardAuth.
- Running `pytest` yields ≥ 90% coverage; `make e2e` passes the ingest→submit stub flow.
- All services expose `/healthz|/readyz|/livez|/metrics`; OpenAPI at `/docs`.
- No PII stored in Qdrant; vectors carry `pii_free=true`.
- KG writes are SHACL-validated; violations produce `review.requested` events.
- Evidence lineage is present for every numeric box value.
- Gitea pipeline passes: lint, test, build, scan, push, deploy.
# START
Generate the full codebase and configs in the **exact file blocks and order** specified above.


@@ -134,7 +134,7 @@ class Neo4jClient:
result = await self.run_query(query, {"properties": properties}, database)
node = result[0]["n"] if result else {}
# Return node ID if available, otherwise return the full node
- return node.get("id", node)
+ return node.get("id", node)  # type: ignore
async def update_node(
self,
@@ -209,7 +209,7 @@ class Neo4jClient:
database,
)
rel = result[0]["r"] if result else {}
- return rel.get("id", rel)
+ return rel.get("id", rel)  # type: ignore
# Original signature (using labels and IDs)
rel_properties = properties or {}
@@ -231,7 +231,7 @@ class Neo4jClient:
)
rel = result[0]["r"] if result else {}
# Return relationship ID if available, otherwise return the full relationship
- return rel.get("id", rel)
+ return rel.get("id", rel)  # type: ignore
async def get_node_lineage(
self, node_id: str, max_depth: int = 10, database: str = "neo4j"

libs/ocr/__init__.py (new file, 0 lines)

libs/ocr/processor.py (new file, 507 lines)

@@ -0,0 +1,507 @@
import base64
import concurrent.futures
import io
import json
import os
from pathlib import Path
from typing import Any
import numpy as np
import requests
from PIL import Image, ImageFilter
from PyPDF2 import PdfReader
class OCRProcessor:
def __init__(
self,
model_name: str = "llama3.2-vision:11b",
base_url: str = "http://localhost:11434/api/generate",
max_workers: int = 1,
provider: str = "ollama",
openai_api_key: str | None = None,
openai_base_url: str = "https://api.openai.com/v1/chat/completions",
):
self.model_name = model_name
self.base_url = base_url
self.max_workers = max_workers
self.provider = provider.lower()
self.openai_api_key = openai_api_key or os.getenv("OPENAI_API_KEY")
self.openai_base_url = openai_base_url
def _encode_image(self, image_path: str) -> str:
"""Convert image to base64 string"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
def _pdf_to_images(self, pdf_path: str) -> list[str]:
"""
Convert each page of a PDF to an image without PyMuPDF.
Strategy: extract largest embedded image per page via PyPDF2.
Saves each selected image as a temporary PNG and returns paths.
Note: Text-only pages with no embedded images will be skipped here.
Use _pdf_extract_text as a fallback for such pages.
"""
image_paths: list[str] = []
try:
reader = PdfReader(pdf_path)
for page_index, page in enumerate(reader.pages):
try:
resources = page.get("/Resources")
if resources is None:
continue
xobject = resources.get("/XObject")
if xobject is None:
continue
xobject = xobject.get_object()
largest = None
largest_area = -1
for _, obj_ref in xobject.items():
try:
obj = obj_ref.get_object()
if obj.get("/Subtype") != "/Image":
continue
width = int(obj.get("/Width", 0))
height = int(obj.get("/Height", 0))
area = width * height
if area > largest_area:
largest = obj
largest_area = area
except Exception:
continue
if largest is None:
continue
data = largest.get_data()
filt = largest.get("/Filter")
out_path = f"{pdf_path}_page{page_index}.png"
# If JPEG/JPX, write bytes directly; else convert via PIL
if filt in ("/DCTDecode",):
# JPEG
out_path = f"{pdf_path}_page{page_index}.jpg"
with open(out_path, "wb") as f:
f.write(data)
elif filt in ("/JPXDecode",):
out_path = f"{pdf_path}_page{page_index}.jp2"
with open(out_path, "wb") as f:
f.write(data)
else:
mode = "RGB"
colorspace = largest.get("/ColorSpace")
if colorspace in ("/DeviceGray",):
mode = "L"
width = int(largest.get("/Width", 0))
height = int(largest.get("/Height", 0))
try:
img = Image.frombytes(mode, (width, height), data)
except Exception:
# Best-effort decode via Pillow
img = Image.open(io.BytesIO(data))
img.save(out_path, format="PNG")
image_paths.append(out_path)
except Exception:
# Continue gracefully for problematic pages/objects
continue
return image_paths
except Exception as e:
raise ValueError(f"Could not extract images from PDF: {e}")
def _pdf_extract_text(self, pdf_path: str) -> list[str]:
"""Extract text per page using pdfplumber if available, else PyPDF2."""
texts: list[str] = []
try:
try:
import pdfplumber
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
texts.append(page.extract_text() or "")
return texts
except Exception:
# Fallback to PyPDF2
reader = PdfReader(pdf_path)
for page in reader.pages: # type: ignore
texts.append(page.extract_text() or "")
return texts
except Exception as e:
raise ValueError(f"Could not extract text from PDF: {e}")
def _call_ollama_vision(self, prompt: str, image_base64: str) -> str:
payload = {
"model": self.model_name,
"prompt": prompt,
"stream": False,
"images": [image_base64],
}
response = requests.post(self.base_url, json=payload, timeout=600)  # timeout (seconds) is a judgment call; vision OCR is slow, but don't hang forever
response.raise_for_status()
return response.json().get("response", "") # type: ignore
def _call_openai_vision(self, prompt: str, image_base64: str) -> str:
if not self.openai_api_key:
raise ValueError("OPENAI_API_KEY not set")
# Compose chat.completions payload for GPT-4o/mini vision
payload = {
"model": self.model_name or "gpt-4o-mini",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}",
},
},
],
}
],
"temperature": 0,
}
headers = {
"Authorization": f"Bearer {self.openai_api_key}",
"Content-Type": "application/json",
}
response = requests.post(self.openai_base_url, headers=headers, json=payload, timeout=120)  # timeout (seconds) is a judgment call; avoid hanging indefinitely
response.raise_for_status()
data = response.json()
try:
return data["choices"][0]["message"]["content"] # type: ignore
except Exception:
return json.dumps(data)
def _preprocess_image(self, image_path: str, language: str = "en") -> str:
"""
Preprocess image before OCR using Pillow + NumPy:
- Convert to grayscale
- Histogram equalization (contrast)
- Median denoise
- Otsu threshold and invert
"""
try:
with Image.open(image_path) as img:
if img.mode in ("RGBA", "LA"):
img = img.convert("RGB")
gray = img.convert("L")
# Histogram equalization via cumulative distribution
arr = np.asarray(gray)
hist, _ = np.histogram(arr.flatten(), 256, [0, 256]) # type: ignore
cdf = hist.cumsum()
cdf_masked = np.ma.masked_equal(cdf, 0) # type: ignore
cdf_min = cdf_masked.min() if cdf_masked.size else 0
cdf_max = cdf_masked.max() if cdf_masked.size else 0
if cdf_max == cdf_min:
eq = arr
else:
cdf_scaled = (cdf_masked - cdf_min) * 255 / (cdf_max - cdf_min)
lut = np.ma.filled(cdf_scaled, 0).astype("uint8")
eq = lut[arr]
eq_img = Image.fromarray(eq, mode="L")
# Median filter (3x3) to reduce noise
eq_img = eq_img.filter(ImageFilter.MedianFilter(size=3))
arr_eq = np.asarray(eq_img)
# Otsu threshold
hist2, _ = np.histogram(arr_eq, 256, [0, 256]) # type: ignore
total = arr_eq.size
sum_total = (np.arange(256) * hist2).sum()
sum_b = 0.0
w_b = 0.0
max_var = 0.0
thr = 0
for t in range(256):
w_b += hist2[t]
if w_b == 0:
continue
w_f = total - w_b
if w_f == 0:
break
sum_b += t * hist2[t]
m_b = sum_b / w_b
m_f = (sum_total - sum_b) / w_f
var_between = w_b * w_f * (m_b - m_f) ** 2
if var_between > max_var:
max_var = var_between
thr = t
binary = (arr_eq > thr).astype(np.uint8) * 255
# Invert: black text on white background
binary = 255 - binary
out_img = Image.fromarray(binary, mode="L")
preprocessed_path = f"{image_path}_preprocessed.jpg"
out_img.save(preprocessed_path, format="JPEG", quality=95)
return preprocessed_path
except Exception as e:
raise ValueError(f"Failed to preprocess image {image_path}: {e}")
def process_image(
self,
image_path: str,
format_type: str = "markdown",
preprocess: bool = True,
custom_prompt: str | None = None,
language: str = "en",
) -> str:
"""
Process an image (or PDF) and extract text in the specified format
Args:
image_path: Path to the image file or PDF file
format_type: One of ["markdown", "text", "json", "structured", "key_value","custom"]
preprocess: Whether to apply image preprocessing
custom_prompt: If provided, this prompt overrides the default based on format_type
language: Language code to apply language specific OCR preprocessing
"""
try:
# If the input is a PDF, process all pages
if image_path.lower().endswith(".pdf"):
image_pages = self._pdf_to_images(image_path)
responses: list[str] = []
if image_pages:
for idx, page_file in enumerate(image_pages):
# Process each page with preprocessing if enabled
if preprocess:
preprocessed_path = self._preprocess_image(
page_file, language
)
else:
preprocessed_path = page_file
image_base64 = self._encode_image(preprocessed_path)
if custom_prompt and custom_prompt.strip():
prompt = custom_prompt
else:
prompts = {
"markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
Format the output in markdown:
- Use headers (#, ##, ###) **only if they appear in the image**
- Preserve original lists (-, *, numbered lists) as they are
- Maintain all text formatting (bold, italics, underlines) exactly as seen
- **Do not add, interpret, or restructure any content**
""",
"text": f"""Extract all visible text from this image in {language} **without any changes**.
- **Do not summarize, paraphrase, or infer missing text.**
- Retain all spacing, punctuation, and formatting exactly as in the image.
- If text is unclear or partially visible, extract as much as possible without guessing.
- **Include all text, even if it seems irrelevant or repeated.**
""",
"json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
- **Do not summarize, add, or modify any text.**
- Maintain hierarchical sections and subsections as they appear.
- Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
- Include all text, even if fragmented, blurry, or unclear.
""",
"structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
- Identify and format tables **without altering content**.
- Preserve list structures (bulleted, numbered) **exactly as shown**.
- Maintain all section headings, indents, and alignments.
- **Do not add, infer, or restructure the content in any way.**
""",
"key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
- Identify and extract labels and their corresponding values without modification.
- Maintain the exact wording, punctuation, and order.
- Format each pair as 'key: value' **only if clearly structured that way in the image**.
- **Do not infer missing values or add any extra text.**
""",
"table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
- **Preserve the table structure** (rows, columns, headers) as closely as possible.
- **Do not add missing values or infer content**—if a cell is empty, leave it empty.
- Maintain all numerical, textual, and special character formatting.
- If the table contains merged cells, indicate them clearly without altering their meaning.
- Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
""",
}
prompt = prompts.get(format_type, prompts["text"])
# Route to chosen provider
if self.provider == "openai":
res = self._call_openai_vision(prompt, image_base64)
else:
res = self._call_ollama_vision(prompt, image_base64)
responses.append(f"Page {idx + 1}:\n{res}")
# Clean up temporary files
if preprocess and preprocessed_path.endswith(
"_preprocessed.jpg"
):
try:
os.remove(preprocessed_path)
except OSError:
pass
if page_file.endswith((".png", ".jpg", ".jp2")):
try:
os.remove(page_file)
except OSError:
pass
final_result = "\n".join(responses)
if format_type == "json":
try:
json_data = json.loads(final_result)
return json.dumps(json_data, indent=2)
except json.JSONDecodeError:
return final_result
return final_result
else:
# Fallback: no images found; extract raw text per page
text_pages = self._pdf_extract_text(image_path)
combined = []
for i, t in enumerate(text_pages):
combined.append(f"Page {i + 1}:\n{t}")
return "\n".join(combined)
# Process non-PDF images as before.
if preprocess:
image_path = self._preprocess_image(image_path, language)
image_base64 = self._encode_image(image_path)
# Clean up temporary files
if image_path.endswith(("_preprocessed.jpg", "_temp.jpg")):
os.remove(image_path)
if custom_prompt and custom_prompt.strip():
prompt = custom_prompt
print("Using custom prompt:", prompt)
else:
                prompts = {
                    "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                    Format the output in markdown:
                    - Use headers (#, ##, ###) **only if they appear in the image**
                    - Preserve original lists (-, *, numbered lists) as they are
                    - Maintain all text formatting (bold, italics, underlines) exactly as seen
                    - **Do not add, interpret, or restructure any content**
                    """,
                    "text": f"""Extract all visible text from this image in {language} **without any changes**.
                    - **Do not summarize, paraphrase, or infer missing text.**
                    - Retain all spacing, punctuation, and formatting exactly as in the image.
                    - If text is unclear or partially visible, extract as much as possible without guessing.
                    - **Include all text, even if it seems irrelevant or repeated.**
                    """,
                    "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                    - **Do not summarize, add, or modify any text.**
                    - Maintain hierarchical sections and subsections as they appear.
                    - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                    - Include all text, even if fragmented, blurry, or unclear.
                    """,
                    "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                    - Identify and format tables **without altering content**.
                    - Preserve list structures (bulleted, numbered) **exactly as shown**.
                    - Maintain all section headings, indents, and alignments.
                    - **Do not add, infer, or restructure the content in any way.**
                    """,
                    "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                    - Identify and extract labels and their corresponding values without modification.
                    - Maintain the exact wording, punctuation, and order.
                    - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                    - **Do not infer missing values or add any extra text.**
                    """,
                    "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                    - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                    - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                    - Maintain all numerical, textual, and special character formatting.
                    - If the table contains merged cells, indicate them clearly without altering their meaning.
                    - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                    """,
                }
                prompt = prompts.get(format_type, prompts["text"])
                print("Using default prompt:", prompt)  # Debug print
            # Call chosen provider with single image
            if self.provider == "openai":
                result = self._call_openai_vision(prompt, image_base64)
            else:
                result = self._call_ollama_vision(prompt, image_base64)
            if format_type == "json":
                try:
                    json_data = json.loads(result)
                    return json.dumps(json_data, indent=2)
                except json.JSONDecodeError:
                    return str(result)
            return str(result)
        except Exception as e:
            return f"Error processing image: {str(e)}"

    def process_batch(
        self,
        input_path: str | list[str],
        format_type: str = "markdown",
        recursive: bool = False,
        preprocess: bool = True,
        custom_prompt: str | None = None,
        language: str = "en",
    ) -> dict[str, Any]:
        """
        Process multiple images in batch

        Args:
            input_path: Path to directory or list of image paths
            format_type: Output format type
            recursive: Whether to search directories recursively
            preprocess: Whether to apply image preprocessing
            custom_prompt: If provided, this prompt overrides the default for each image
            language: Language code to apply language-specific OCR preprocessing

        Returns:
            Dictionary with results and statistics
        """
        # Collect all image paths
        image_paths: list[str | Path] = []
        if isinstance(input_path, str):
            base_path = Path(input_path)
            if base_path.is_dir():
                pattern = "**/*" if recursive else "*"
                for ext in [".png", ".jpg", ".jpeg", ".pdf", ".tiff"]:
                    image_paths.extend(base_path.glob(f"{pattern}{ext}"))
            else:
                image_paths = [base_path]
        else:
            image_paths = [Path(p) for p in input_path]

        results = {}
        errors = {}
        # Process images in parallel
        with concurrent.futures.ThreadPoolExecutor(
            max_workers=self.max_workers
        ) as executor:
            future_to_path = {
                executor.submit(
                    self.process_image,
                    str(path),
                    format_type,
                    preprocess,
                    custom_prompt,
                    language,
                ): path
                for path in image_paths
            }
            for future in concurrent.futures.as_completed(future_to_path):
                path = future_to_path[future]
                try:
                    results[str(path)] = future.result()
                except Exception as e:
                    errors[str(path)] = str(e)

        return {
            "results": results,
            "errors": errors,
            "statistics": {
                "total": len(image_paths),
                "successful": len(results),
                "failed": len(errors),
            },
        }
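The `process_batch` method above is a standard `ThreadPoolExecutor` fan-out/fan-in: submit one future per input, map each future back to its path, and fold completed futures into separate result and error maps. A minimal standalone sketch of the same aggregation pattern (the `worker` callable here is a hypothetical stand-in for `process_image`):

```python
import concurrent.futures


def process_batch(paths, worker, max_workers=4):
    """Fan a worker callable out over paths in parallel, then fold the
    completed futures back into separate result and error maps."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed.
        future_to_path = {executor.submit(worker, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as e:
                errors[path] = str(e)
    return {
        "results": results,
        "errors": errors,
        "statistics": {
            "total": len(paths),
            "successful": len(results),
            "failed": len(errors),
        },
    }
```

Because `future.result()` re-raises any exception from the worker thread, a single failing input lands in `errors` without aborting the rest of the batch.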


@@ -1,13 +1,13 @@
 # Core framework dependencies (Required by all services)
-fastapi>=0.118.0
+fastapi>=0.119.0
 uvicorn[standard]>=0.37.0
-pydantic>=2.11.9
+pydantic>=2.12.0
 pydantic-settings>=2.11.0
 # Database drivers (lightweight)
-sqlalchemy>=2.0.43
+sqlalchemy>=2.0.44
 asyncpg>=0.30.0
-psycopg2-binary>=2.9.10
+psycopg2-binary>=2.9.11
 neo4j>=6.0.2
 redis[hiredis]>=6.4.0


@@ -3,3 +3,4 @@ pdfrw>=0.4
 reportlab>=4.4.4
 PyPDF2>=3.0.1
 pdfplumber>=0.11.7
+opencv-python


@@ -79,7 +79,7 @@ class StorageClient:
         """Download object from bucket"""
         try:
             response = self.client.get_object(bucket_name, object_name)
-            data = response.read()
+            data: bytes = response.read()
             response.close()
             response.release_conn()
@@ -89,7 +89,7 @@ class StorageClient:
                 object=object_name,
                 size=len(data),
             )
-            return data  # type: ignore
+            return data
         except S3Error as e:
             logger.error(
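The last hunk swaps a `# type: ignore` on the return for an explicit `data: bytes` annotation at the assignment. A minimal illustration of why that pattern satisfies the type checker (the response class below is a hypothetical stand-in for the object `get_object` returns, not the real MinIO client):

```python
from typing import Any


class FakeResponse:
    """Stand-in for a library response object whose read() is typed as Any."""

    def read(self) -> Any:
        # Many third-party stubs return Any here; callers must pin the type.
        return b"payload"


def download(resp: FakeResponse) -> bytes:
    # Annotating the local variable narrows Any to bytes at the assignment,
    # so the typed return line needs no "# type: ignore".
    data: bytes = resp.read()
    return data
```

Pinning the type where the untyped value enters the function keeps the suppression comment out of the code and makes the checker verify every later use of `data` as `bytes`.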