deployment, linting and infra configuration
Some checks failed
CI/CD Pipeline / Code Quality & Linting (push) Has been cancelled
CI/CD Pipeline / Policy Validation (push) Has been cancelled
CI/CD Pipeline / Test Suite (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-firm-connectors) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-forms) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-hmrc) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ingestion) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-normalize-map) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ocr) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-indexer) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-reason) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rpa) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (ui-review) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (ui-review) (push) Has been cancelled
CI/CD Pipeline / Generate SBOM (push) Has been cancelled
CI/CD Pipeline / Deploy to Staging (push) Has been cancelled
CI/CD Pipeline / Deploy to Production (push) Has been cancelled
CI/CD Pipeline / Notifications (push) Has been cancelled
@@ -1,6 +1,6 @@
 """Database models for coverage service."""

-# FILE: apps/svc-coverage/models.py
+# FILE: apps/svc_coverage/models.py

 from datetime import datetime

apps/svc_coverage/requirements.txt (new file, +13 lines)
@@ -0,0 +1,13 @@
+# Service-specific dependencies for svc_coverage
+
+# Database migrations
+alembic>=1.14.0
+
+# OpenTelemetry (required by libs.observability)
+opentelemetry-api>=1.37.0
+opentelemetry-sdk>=1.37.0
+opentelemetry-exporter-otlp-proto-grpc>=1.37.0
+opentelemetry-instrumentation-fastapi>=0.42b0
+opentelemetry-instrumentation-httpx>=0.42b0
+opentelemetry-instrumentation-psycopg2>=0.42b0
+opentelemetry-instrumentation-redis>=0.42b0
@@ -1,17 +1,17 @@
 # Service-specific dependencies for svc_extract
 # LLM integration
-openai>=1.3.0
+openai>=2.3.0
-anthropic>=0.7.0
+anthropic>=0.69.0

 # JSON schema validation
-jsonschema>=4.20.0
+jsonschema>=4.25.1

 # Template processing
-jinja2>=3.1.0
+jinja2>=3.1.6

 # Text similarity (lightweight)
 fuzzywuzzy>=0.18.0
-python-Levenshtein>=0.23.0
+python-Levenshtein>=0.27.1

 # Data validation
-cerberus>=1.3.4
+cerberus>=1.3.7
@@ -1,45 +1,45 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
 pydantic>=2.5.0

 # Service-specific dependencies
 # Database connectors
-sqlalchemy>=2.0.0
+sqlalchemy>=2.0.44
-pymssql>=2.2.0
+pymssql>=2.3.7
 cx-Oracle>=8.3.0

 # API clients for practice management systems
-zeep>=4.2.0 # SOAP client
+zeep>=4.3.2 # SOAP client
-xmltodict>=0.13.0
+xmltodict>=1.0.2

 # OAuth for various systems
-authlib>=1.2.0
+authlib>=1.6.5
-requests-oauthlib>=1.3.0
+requests-oauthlib>=2.0.0

 # Data synchronization
-pandas>=2.1.0
+pandas>=2.3.3

 # Rate limiting
-ratelimit>=2.2.0
+ratelimit>=2.2.1

 # Retry mechanisms
-tenacity>=8.2.0
+tenacity>=9.1.2

 # CSV processing
-csvkit>=1.1.0
+csvkit>=2.1.0

 # Excel file processing
-openpyxl>=3.1.0
+openpyxl>=3.1.5
-xlrd>=2.0.0
+xlrd>=2.0.2

 # Data validation
-marshmallow>=3.20.0
+marshmallow>=4.0.1
-cerberus>=1.3.4
+cerberus>=1.3.7

 # Connection pooling (built into SQLAlchemy)
 # sqlalchemy-pool>=1.3.0 # Package doesn't exist, pooling is built into SQLAlchemy

 # Additional utilities
-python-dateutil>=2.8.0
+python-dateutil>=2.9.0
-pytz>=2023.3
+pytz>=2025.2
@@ -1,37 +1,37 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
-pydantic>=2.5.0
+pydantic>=2.12.0

 # Service-specific dependencies
 # PDF form filling
 pdfrw>=0.4
-reportlab>=4.0.0
+reportlab>=4.4.4

 # PDF processing
-PyPDF2>=3.0.0
+PyPDF2>=3.0.1
-pypdf>=3.17.0
+pypdf>=6.1.1

 # Image processing for overlays
-Pillow>=10.1.0
+Pillow>=11.3.0

 # ZIP file creation for evidence packs
 zipfile36>=0.1.3

 # Template processing
-jinja2>=3.1.0
+jinja2>=3.1.6

 # QR code generation
-qrcode>=7.4.0
+qrcode>=8.2

 # Barcode generation
-python-barcode>=0.15.0
+python-barcode>=0.16.1

 # Font handling
-fonttools>=4.44.0
+fonttools>=4.60.1

 # Additional PDF utilities
-pdfminer.six>=20231228
+pdfminer.six>=20250506

 # Document conversion
-python-docx>=1.1.0
+python-docx>=1.2.0
@@ -1,40 +1,40 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
-pydantic>=2.5.0
+pydantic>=2.12.0

 # Service-specific dependencies
 # OAuth and authentication
-authlib>=1.2.0
+authlib>=1.6.5
-oauthlib>=3.2.0
+oauthlib>=3.3.1

 # HTTP client with OAuth support
-requests-oauthlib>=1.3.0
+requests-oauthlib>=2.0.0

 # XML processing for HMRC APIs
-lxml>=4.9.0
+lxml>=6.0.2
-xmltodict>=0.13.0
+xmltodict>=1.0.2

 # JSON Web Tokens
-pyjwt>=2.8.0
+pyjwt>=2.10.1

 # UK government API utilities
-govuk-frontend-jinja>=2.8.0
+govuk-frontend-jinja>=3.8.0

 # Date and time for tax years
-python-dateutil>=2.8.0
+python-dateutil>=2.9.0

 # Retry mechanisms
-tenacity>=8.2.0
+tenacity>=9.1.2

 # Rate limiting
-ratelimit>=2.2.0
+ratelimit>=2.2.1

 # API validation
-marshmallow>=3.20.0
+marshmallow>=4.0.1

 # Encryption for sensitive data
-cryptography>=41.0.0
+cryptography>=46.0.2

 # Additional HTTP utilities
-urllib3>=2.1.0
+urllib3>=2.5.0
@@ -1,22 +1,22 @@
 # Service-specific dependencies
 # RDF and semantic web
-rdflib>=7.0.0
+rdflib>=7.2.1
-pyshacl>=0.25.0
+pyshacl>=0.30.1

 # Graph algorithms
-networkx>=3.2.0
+networkx>=3.5

 # Data export formats
-xmltodict>=0.13.0
+xmltodict>=1.0.2

 # Query optimization
-pyparsing>=3.1.0
+pyparsing>=3.2.5

 # Graph visualization (optional)
-graphviz>=0.20.0
+graphviz>=0.21

 # Additional Neo4j utilities
-neomodel>=5.2.0
+neomodel>=5.5.3

 # Cypher query building
 py2neo>=2021.2.4
@@ -1,37 +1,37 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
-pydantic>=2.5.0
+pydantic>=2.12.0

 # Service-specific dependencies
 # Data normalization and cleaning
-pandas>=2.1.0
+pandas>=2.3.3
-numpy>=1.24.0
+numpy>=2.3.3

 # Currency and exchange rates
-forex-python>=1.8
+forex-python>=1.9.2
-babel>=2.13.0
+babel>=2.17.0

 # Date and time processing
-python-dateutil>=2.8.0
+python-dateutil>=2.9.0
-pytz>=2023.3
+pytz>=2025.2

 # Text normalization
-unidecode>=1.3.0
+unidecode>=1.4.0
-phonenumbers>=8.13.0
+phonenumbers>=9.0.16

 # Entity resolution and matching
 recordlinkage>=0.16.0
 fuzzywuzzy>=0.18.0
-python-Levenshtein>=0.23.0
+python-Levenshtein>=0.27.1

 # Geographic data
-geopy>=2.4.0
+geopy>=2.4.1
-pycountry>=23.12.0
+pycountry>=24.6.1

 # Data validation
-cerberus>=1.3.4
+cerberus>=1.3.7
-marshmallow>=3.20.0
+marshmallow>=4.0.1

 # UK-specific utilities
-uk-postcode-utils>=1.0.0
+uk-postcode-utils>=1.1
@@ -1,17 +1,23 @@
 # FILE: apps/svc-ocr/main.py
 # OCR and layout extraction using Tesseract, LayoutLM, and document AI

+import asyncio
+import io
 import os

 # Import shared libraries
 import sys
 from datetime import datetime
-from typing import Any
+from typing import Any, cast

+import pytesseract
 import structlog
 import ulid
 from fastapi import BackgroundTasks, Depends, HTTPException, Request
 from fastapi.responses import JSONResponse
+from pdf2image import convert_from_bytes
+from PIL import Image
+from PyPDF2 import PdfReader

 sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))

@@ -19,6 +25,7 @@ from libs.app_factory import create_app
 from libs.config import BaseAppSettings, create_event_bus, create_minio_client
 from libs.events import EventBus, EventPayload, EventTopics
 from libs.observability import get_metrics, get_tracer, setup_observability
+from libs.ocr.processor import OCRProcessor
 from libs.schemas import ErrorResponse
 from libs.security import get_current_user, get_tenant_id
 from libs.storage import DocumentStorage, StorageClient
@@ -48,28 +55,31 @@ class OCRSettings(BaseAppSettings):
     include_coordinates: bool = True
     include_confidence: bool = True

+    # Vision/LLM OCR configuration
+    vision_provider: str = "ollama" # or "openai"
+    vision_model: str = "llama3.2-vision:11b"
+    vision_format: str = (
+        "text" # text | markdown | json | table | key_value | structured
+    )
+    vision_preprocess: bool = True
+    openai_base_url: str = "https://api.openai.com/v1/chat/completions"

-# Create app and settings
-app, settings = create_app(
-    service_name="svc-ocr",
-    title="Tax Agent OCR Service",
-    description="OCR and layout extraction service",
-    settings_class=OCRSettings,
-) # fmt: skip

 # Global clients
 storage_client: StorageClient | None = None
 document_storage: DocumentStorage | None = None
 event_bus: EventBus | None = None
-tracer = get_tracer("svc-ocr")
-metrics = get_metrics()
+vision_processor: OCRProcessor | None = None
+# Settings will be initialized after app creation
+settings: OCRSettings


-@app.on_event("startup")
-async def startup_event() -> None:
+async def init_dependencies(app_settings: OCRSettings) -> None:
     """Initialize service dependencies"""
-    global storage_client, document_storage, event_bus
+    global storage_client, document_storage, event_bus, settings, vision_processor
+
+    settings = app_settings
     logger.info("Starting OCR service")

     # Setup observability
@@ -79,42 +89,44 @@ async def startup_event() -> None:
     minio_client = create_minio_client(settings)
     storage_client = StorageClient(minio_client)
     document_storage = DocumentStorage(storage_client)

     # Initialize event bus
     event_bus = create_event_bus(settings)
     if not event_bus:
         raise HTTPException(status_code=500, detail="Event bus not initialized")

-    await event_bus.start()
+    eb = event_bus
+    # mypy: event_bus is Optional, so use local alias after check
+    await eb.start()

     # Subscribe to document ingestion events
-    await event_bus.subscribe(EventTopics.DOC_INGESTED, _handle_document_ingested)
+    await eb.subscribe(EventTopics.DOC_INGESTED, _handle_document_ingested)

+    # Initialize shared OCRProcessor for vision strategy
+    try:
+        vision_processor = OCRProcessor(
+            model_name=settings.vision_model,
+            provider=settings.vision_provider,
+            openai_base_url=settings.openai_base_url,
+        )
+    except Exception as e:
+        logger.error("Failed to initialize vision OCR processor", error=str(e))

     logger.info("OCR service started successfully")


-@app.on_event("shutdown")
-async def shutdown_event() -> None:
-    """Cleanup service dependencies"""
-    global event_bus
-
-    logger.info("Shutting down OCR service")
-
-    if event_bus:
-        await event_bus.stop()
-
-    logger.info("OCR service shutdown complete")
-
-
-@app.get("/health")
-async def health_check() -> dict[str, Any]:
-    """Health check endpoint"""
-    return {
-        "status": "healthy",
-        "service": settings.service_name,
-        "version": settings.service_version,
-        "timestamp": datetime.utcnow().isoformat(),
-    }
+# Create app and settings
+app, _settings = create_app(
+    service_name="svc-ocr",
+    title="Tax Agent OCR Service",
+    description="OCR and layout extraction service",
+    settings_class=OCRSettings,
+) # fmt: skip
+
+# Initialize dependencies immediately
+asyncio.run(init_dependencies(cast(OCRSettings, _settings)))
+
+tracer = get_tracer("svc-ocr")
+metrics = get_metrics()


 @app.post("/process/{doc_id}")
@@ -132,9 +144,14 @@ async def process_document(
         span.set_attribute("tenant_id", tenant_id)
         span.set_attribute("strategy", strategy)

+        ds = document_storage
+        if ds is None:
+            raise HTTPException(
+                status_code=500, detail="Document storage not initialized"
+            )
         try:
             # Check if document exists
-            doc_content = await document_storage.get_document(tenant_id, doc_id)
+            doc_content = await ds.get_document(tenant_id, doc_id)
             if not doc_content:
                 raise HTTPException(status_code=404, detail="Document not found")

@@ -142,9 +159,9 @@ async def process_document(
             processing_id = str(ulid.new())
             span.set_attribute("processing_id", processing_id)

-            # Start background processing
+            # Start background processing via sync wrapper (for mypy correctness)
             background_tasks.add_task(
-                _process_document_async,
+                _schedule_process_document_async,
                 doc_id,
                 tenant_id,
                 doc_content,
@@ -168,7 +185,9 @@ async def process_document(
             raise
         except Exception as e:
             logger.error("Failed to start OCR processing", doc_id=doc_id, error=str(e))
-            raise HTTPException(status_code=500, detail="Failed to start processing")
+            raise HTTPException(
+                status_code=500, detail="Failed to start processing"
+            ) from e


 @app.get("/results/{doc_id}")
@@ -183,9 +202,14 @@ async def get_ocr_results(
         span.set_attribute("doc_id", doc_id)
         span.set_attribute("tenant_id", tenant_id)

+        ds = document_storage
+        if ds is None:
+            raise HTTPException(
+                status_code=500, detail="Document storage not initialized"
+            )
         try:
             # Get OCR results from storage
-            ocr_results = await document_storage.get_ocr_result(tenant_id, doc_id)
+            ocr_results = await ds.get_ocr_result(tenant_id, doc_id)

             if not ocr_results:
                 raise HTTPException(status_code=404, detail="OCR results not found")
@@ -196,26 +220,32 @@ async def get_ocr_results(
             raise
         except Exception as e:
             logger.error("Failed to get OCR results", doc_id=doc_id, error=str(e))
-            raise HTTPException(status_code=500, detail="Failed to get OCR results")
+            raise HTTPException(
+                status_code=500, detail="Failed to get OCR results"
+            ) from e


 async def _handle_document_ingested(topic: str, payload: EventPayload) -> None:
     """Handle document ingestion events"""
-    try:
-        data = payload.data
-        doc_id = data.get("doc_id")
-        tenant_id = data.get("tenant_id")
+    data = payload.data
+    doc_id = data.get("doc_id")
+    tenant_id = data.get("tenant_id")

     if not doc_id or not tenant_id:
         logger.warning("Invalid document ingestion event", data=data)
         return
+    ds = document_storage
+    if ds is None:
+        logger.error("Document storage not initialized")
+        return

     # Auto-process PDF documents
     if data.get("content_type") == "application/pdf":
         logger.info("Auto-processing ingested document", doc_id=doc_id)

+        try:
             # Get document content
-            doc_content = await document_storage.get_document(tenant_id, doc_id)
+            doc_content = await ds.get_document(tenant_id, doc_id)
             if doc_content:
                 await _process_document_async(
                     doc_id=doc_id,
@@ -225,9 +255,10 @@ async def _handle_document_ingested(topic: str, payload: EventPayload) -> None:
                     processing_id=str(ulid.new()),
                     actor=payload.actor,
                 )
-    except Exception as e:
-        logger.error("Failed to handle document ingestion", error=str(e))
+        except Exception as e:
+            logger.error(
+                "Failed to handle document ingestion", doc_id=doc_id, error=str(e)
+            )


 async def _process_document_async(
@@ -250,8 +281,8 @@ async def _process_document_async(
         images = await _pdf_to_images(content)

         # Process each page
-        pages_data: list[Any] = []
-        for page_num, image in enumerate(images, 1):
+        pages_data: list[dict[str, Any]] = []
+        for page_num, image in enumerate(images, 0):
             page_data = await _process_page(image, page_num, strategy)
             pages_data.append(page_data)

@@ -270,7 +301,10 @@ async def _process_document_async(
         }

         # Store results
-        await document_storage.store_ocr_result(tenant_id, doc_id, ocr_results)
+        ds = document_storage
+        if ds is None:
+            raise RuntimeError("Document storage not initialized")
+        await ds.store_ocr_result(tenant_id, doc_id, ocr_results)

         # Update metrics
         metrics.counter("documents_processed_total").labels(
@@ -282,7 +316,7 @@ async def _process_document_async(
         ).observe(
             datetime.utcnow().timestamp()
             - datetime.fromisoformat(
-                ocr_results["processed_at"].replace("Z", "")
+                ocr_results["processed_at"].replace("Z", "") # type: ignore
             ).timestamp()
         )

@@ -300,7 +334,9 @@ async def _process_document_async(
             tenant_id=tenant_id,
         )

-        await event_bus.publish(EventTopics.DOC_OCR_READY, event_payload)
+        eb = event_bus
+        if eb is not None:
+            await eb.publish(EventTopics.DOC_OCR_READY, event_payload)

         logger.info(
             "OCR processing completed", doc_id=doc_id, pages=len(pages_data)
@@ -316,58 +352,91 @@ async def _process_document_async(


 async def _pdf_to_images(pdf_content: bytes) -> list[bytes]:
-    """Convert PDF to images"""
+    """Convert PDF to page images without PyMuPDF.
+
+    Primary: pdf2image (requires poppler). Fallback: extract largest embedded image per page via PyPDF2/Pillow.
+    """
+    # First try pdf2image for full-page rasterization
     try:
-        import fitz  # PyMuPDF
-
-        # Open PDF
-        pdf_doc = fitz.open(stream=pdf_content, filetype="pdf")
-
-        images: list[Any] = []
-        for page_num in range(min(len(pdf_doc), settings.max_pages)):
-            page = pdf_doc[page_num]
-
-            # Render page to image
-            mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better OCR
-            pix = page.get_pixmap(matrix=mat)
-            img_data = pix.tobytes("png")
-
-            images.append(img_data)
-
-        pdf_doc.close()
-        return images
-
-    except ImportError:
-        logger.error("PyMuPDF not available, using fallback")
-        return await _pdf_to_images_fallback(pdf_content)
-    except Exception as e:
-        logger.error("PDF conversion failed", error=str(e))
-        raise
-
-
-async def _pdf_to_images_fallback(pdf_content: bytes) -> list[bytes]:
-    """Fallback PDF to images conversion"""
-    try:
-        from pdf2image import convert_from_bytes
-
         images = convert_from_bytes(
             pdf_content, dpi=200, first_page=1, last_page=settings.max_pages
         )
-        # Convert PIL images to bytes
-        image_bytes: list[Any] = []
+        image_bytes: list[bytes] = []
         for img in images:
-            import io
-
             img_buffer = io.BytesIO()
             img.save(img_buffer, format="PNG")
             image_bytes.append(img_buffer.getvalue())

         return image_bytes
+    except Exception as e:
+        logger.warning(
+            "pdf2image conversion failed; falling back to PyPDF2", error=str(e)
+        )

-    except ImportError:
-        logger.error("pdf2image not available")
-        raise Exception("No PDF conversion library available")
+    # Fallback: extract largest embedded image per page using PyPDF2
+    try:
+        reader = PdfReader(io.BytesIO(pdf_content))
+        out_images: list[bytes] = []
+        for page_index, page in enumerate(reader.pages):
+            if page_index >= settings.max_pages:
+                break
+            try:
+                resources = page.get("/Resources")
+                if resources is None:
+                    continue
+                xobject = resources.get("/XObject")
+                if xobject is None:
+                    continue
+                xobject = xobject.get_object()
+
+                largest = None
+                largest_area = -1
+                for _, obj_ref in xobject.items():
+                    try:
+                        obj = obj_ref.get_object()
+                        if obj.get("/Subtype") != "/Image":
+                            continue
+                        width = int(obj.get("/Width", 0))
+                        height = int(obj.get("/Height", 0))
+                        area = width * height
+                        if area > largest_area:
+                            largest = obj
+                            largest_area = area
+                    except Exception:
+                        continue
+
+                if largest is None:
+                    continue
+
+                data = largest.get_data()
+                filt = largest.get("/Filter")
+
+                if filt in ("/DCTDecode", "/JPXDecode"):
+                    # JPEG or JPEG2000
+                    out_images.append(data)
+                else:
+                    # Flate or other; decode via Pillow
+                    mode = "RGB"
+                    colorspace = largest.get("/ColorSpace")
+                    if colorspace in ("/DeviceGray",):
+                        mode = "L"
+                    width = int(largest.get("/Width", 0))
+                    height = int(largest.get("/Height", 0))
+                    try:
+                        img = Image.frombytes(mode, (width, height), data)
+                    except Exception:
+                        img = Image.open(io.BytesIO(data))
+                    buf = io.BytesIO()
+                    img.save(buf, format="PNG")
+                    out_images.append(buf.getvalue())
+            except Exception:
+                continue
+
+        if not out_images:
+            raise RuntimeError("No images extracted via PyPDF2 fallback")
+        return out_images
+    except Exception as fallback_e:
+        logger.error("PDF conversion failed (both methods)", error=str(fallback_e))
+        raise


 async def _process_page(
@@ -395,6 +464,8 @@ async def _process_page(
                 layoutlm_result.get("confidence", 0),
             ),
         }
+    elif strategy == "vision":
+        return await _process_with_vision(image_data, page_num)
     else:
         raise ValueError(f"Unknown strategy: {strategy}")

@@ -402,11 +473,6 @@ async def _process_page(
 async def _process_with_tesseract(image_data: bytes, page_num: int) -> dict[str, Any]:
     """Process page with Tesseract OCR"""
     try:
-        import io
-
-        import pytesseract
-        from PIL import Image
-
         # Load image
         image = Image.open(io.BytesIO(image_data))

@@ -414,13 +480,13 @@ async def _process_with_tesseract(image_data: bytes, page_num: int) -> dict[str,
         config = f"{settings.tesseract_config} -l {settings.languages}"

         # Extract text with confidence
-        data = pytesseract.image_to_data(
+        data = pytesseract.image_to_data( # type: ignore
             image, config=config, output_type=pytesseract.Output.DICT
         )

         # Process results
-        words: list[Any] = []
-        confidences: list[Any] = []
+        words: list[dict[str, Any]] = []
+        confidences: list[float] = []

         for i in range(len(data["text"])):
             if int(data["conf"][i]) > 0: # Valid confidence
@@ -449,13 +515,6 @@ async def _process_with_tesseract(image_data: bytes, page_num: int) -> dict[str,
             "word_count": len(words),
         }

-    except ImportError:
-        logger.error("pytesseract not available")
-        return {
-            "page": page_num,
-            "strategy": "tesseract",
-            "error": "pytesseract not available",
-        }
     except Exception as e:
         logger.error("Tesseract processing failed", page=page_num, error=str(e))
         return {"page": page_num, "strategy": "tesseract", "error": str(e)}
@@ -482,6 +541,68 @@ async def _process_with_layoutlm(image_data: bytes, page_num: int) -> dict[str,
         return {"page": page_num, "strategy": "layoutlm", "error": str(e)}


+async def _process_with_vision(image_data: bytes, page_num: int) -> dict[str, Any]:
+    """Process page with LLM vision OCR via shared OCRProcessor"""
+    try:
+        vp = vision_processor
+        if vp is None:
+            raise RuntimeError("Vision OCR processor not initialized")
+
+        # Persist the page image temporarily for the processor API
+        import tempfile
+
+        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+            tmp.write(image_data)
+            tmp_path = tmp.name
+
+        try:
+            text = vp.process_image(
+                image_path=tmp_path,
+                format_type=settings.vision_format,
+                preprocess=settings.vision_preprocess,
+                language=settings.languages,
+            )
+        finally:
+            try:
+                os.remove(tmp_path)
+            except OSError:
+                pass
+
+        return {
+            "page": page_num,
+            "strategy": "vision",
+            "text": text if isinstance(text, str) else str(text),
+            "confidence": 0.0, # Not provided by LLM API
+        }
+    except Exception as e:
+        logger.error("Vision processing failed", page=page_num, error=str(e))
+        return {"page": page_num, "strategy": "vision", "error": str(e)}
+
+
+def _schedule_process_document_async(
+    doc_id: str,
+    tenant_id: str,
+    content: bytes,
+    strategy: str,
+    processing_id: str,
+    actor: str,
+) -> None:
+    """Sync wrapper to schedule the async OCR task.
+
+    This keeps FastAPI BackgroundTasks type expectations satisfied under mypy strict.
+    """
+    asyncio.create_task(
+        _process_document_async(
+            doc_id=doc_id,
+            tenant_id=tenant_id,
+            content=content,
+            strategy=strategy,
+            processing_id=processing_id,
+            actor=actor,
+        )
+    )
+
+
 @app.exception_handler(HTTPException)
 async def http_exception_handler(request: Request, exc: HTTPException) -> JSONResponse:
     """Handle HTTP exceptions with RFC7807 format"""
@@ -5,7 +5,7 @@
 pytesseract>=0.3.13

 # PDF processing
-PyMuPDF>=1.26.4
+PyPDF2>=3.0.1
 pdf2image>=1.17.0

 # Image processing
@@ -2,7 +2,7 @@
 # NOTE: ML dependencies (sentence-transformers, transformers, torch, numpy) are in base-ml image

 # Text chunking (lightweight alternative to langchain)
-tiktoken>=0.11.0
+tiktoken>=0.12.0

 # Text preprocessing (lightweight)
 beautifulsoup4>=4.14.2
@@ -1,20 +1,20 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
-pydantic>=2.5.0
+pydantic>=2.12.0

 # Service-specific dependencies
 # Mathematical calculations
 # decimal is part of Python standard library
-sympy>=1.12.0
+sympy>=1.14.0

 # Tax calculations
 numpy>=2.3.3
-pandas>=2.1.0
+pandas>=2.3.3

 # Date and time calculations
-python-dateutil>=2.8.0
+python-dateutil>=2.9.0
-pytz>=2023.3
+pytz>=2025.2

 # UK tax specific
 # uk-tax-calculator>=1.0.0 # Package may not exist, commenting out
@@ -26,10 +26,10 @@ pytz>=2023.3
 # quantlib>=1.32.0 # Package may not exist, commenting out

 # Data validation
-cerberus>=1.3.4
+cerberus>=1.3.7

 # Template processing for explanations
-jinja2>=3.1.0
+jinja2>=3.1.6

 # Statistical calculations
-scipy>=1.11.0
+scipy>=1.16.2
@@ -1,11 +1,11 @@
 # FastAPI and server
-fastapi>=0.104.1
+fastapi>=0.118.3
-uvicorn[standard]>=0.24.0
+uvicorn[standard]>=0.37.0
-pydantic>=2.5.0
+pydantic>=2.12.0

 # Service-specific dependencies
 # Browser automation
-playwright>=1.40.0
+playwright>=1.55.0

 # Additional async utilities
 # asyncio-timeout>=4.0.3 # Deprecated, use asyncio.timeout from Python 3.11+ standard library
@@ -14,4 +14,4 @@ playwright>=1.40.0
 aioredis>=2.0.1

 # Browser management
-psutil>=5.9.0
+psutil>=7.1.0
@@ -7,6 +7,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
 ## Current State Analysis

 ### Remote Server (`141.136.35.199`)

 - **Location**: `/opt/compose/`
 - **Existing Services**:
   - Traefik v3.5.1 (reverse proxy with GoDaddy DNS challenge)
@@ -25,6 +26,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
   - `portainer.harkon.co.uk`

 ### Local Repository (`infra/compose/`)

 - **Compose Files**:
   - `docker-compose.local.yml` - Full stack for local development
   - `docker-compose.backend.yml` - Backend services (appears to be production-ready)
@@ -39,25 +41,30 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
 ## Challenges & Conflicts

 ### 1. **Duplicate Services**

 - Both environments have Traefik and Authentik
 - Need to decide: shared vs. isolated

 ### 2. **Network Naming**

 - Remote: `frontend`, `backend`
 - Local: `ai-tax-agent-frontend`, `ai-tax-agent-backend`
 - Production needs: Consistent naming

 ### 3. **Domain Management**

 - Remote: `*.harkon.co.uk` (public)
 - Local: `*.local.lan` (development)
 - Production: Need subdomains like `app.harkon.co.uk`, `api.harkon.co.uk`

 ### 4. **SSL Certificates**

 - Remote: GoDaddy DNS challenge (production)
 - Local: Self-signed certificates
 - Production: Must use GoDaddy DNS challenge

 ### 5. **Resource Isolation**

 - Company services need to remain stable
 - Application services need independent deployment/rollback

@@ -66,6 +73,7 @@ This plan outlines the strategy to host both the **AI Tax Agent application** an
 We will deploy the company services and the AI Tax Agent as two fully isolated stacks, each with its own Traefik and Authentik. This maximizes blast-radius isolation and avoids naming and DNS conflicts across environments.

 Key implications:

 - Separate external networks and DNS namespaces per stack
 - Duplicate edge (Traefik) and IdP (Authentik), independent upgrades and rollbacks
 - Slightly higher resource usage in exchange for strong isolation
@@ -139,6 +147,7 @@ Key implications:
 ### Domain Mapping

 **Company Services** (existing):

 - `traefik.harkon.co.uk` - Traefik dashboard
 - `auth.harkon.co.uk` - Authentik SSO
 - `gitea.harkon.co.uk` - Git hosting
@@ -146,6 +155,7 @@ Key implications:
 - `portainer.harkon.co.uk` - Docker management

 **Application Services** (app stack):

 - `review.<domain>` - Review UI
 - `api.<domain>` - API Gateway (microservices via Traefik)
 - `vault.<domain>` - Vault UI (admin only)
@@ -159,12 +169,14 @@ Key implications:
 ### Authentication Strategy

 **Authentik Configuration**:

 1. **Company Group** - Access to Gitea, Nextcloud, Portainer
 2. **App Admin Group** - Full access to all app services
 3. **App User Group** - Access to Review UI and API
 4. **App Reviewer Group** - Access to Review UI only

 **Middleware Configuration**:

 - `authentik-forwardauth` - Standard auth for all services
 - `admin-auth` - Requires admin group (Vault, MinIO, Neo4j, etc.)
 - `reviewer-auth` - Requires reviewer or higher
@@ -182,6 +194,7 @@ Key implications:
 ### Development Environment

 **Keep Existing Setup**:

 - Use `docker-compose.local.yml` as-is
 - Domain: `*.local.lan`
 - Self-signed certificates
@@ -189,6 +202,7 @@ Key implications:
 - Full stack runs locally

 **Benefits**:

 - No dependency on remote server
 - Fast iteration
 - Complete isolation
@@ -217,19 +231,22 @@ make deploy-production # Deploy to remote server
 ### Phase 1: Preparation (Week 1)

 1. **Backup Current State**

    ```bash
    ssh deploy@141.136.35.199
-   cd /opt/compose
+   cd /opt
    tar -czf ~/backup-$(date +%Y%m%d).tar.gz .
    ```

 2. **Create Production Environment File**
-   - Copy `infra/compose/env.example` to `infra/compose/.env.production`
+   - Copy `infra/environments/production/.env.example` to `infra/environments/production/.env`
    - Update all secrets and passwords
    - Set `DOMAIN=harkon.co.uk`
    - Configure GoDaddy API credentials

 3. **Update Traefik Configuration**

    - Merge local Traefik config with remote
    - Add application routes
    - Configure Authentik ForwardAuth
@@ -242,13 +259,15 @@ make deploy-production # Deploy to remote server
 ### Phase 2: Infrastructure Deployment (Week 2)

 1. **Deploy Application Infrastructure**

    ```bash
    # On remote server
-   cd /opt/compose/ai-tax-agent
+   cd /opt/ai-tax-agent
    docker compose -f infrastructure.yaml up -d
    ```

 2. **Initialize Services**

    - Vault: Unseal and configure
    - Postgres: Run migrations
    - Neo4j: Install plugins
@@ -262,11 +281,13 @@ make deploy-production # Deploy to remote server
 ### Phase 3: Application Deployment (Week 3)

 1. **Deploy Microservices**

    ```bash
    docker compose -f services.yaml up -d
    ```

 2. **Deploy Monitoring**

    ```bash
    docker compose -f monitoring.yaml up -d
    ```
@@ -10,7 +10,7 @@

 ### 1. Production Compose Files Created

-Created three production-ready Docker Compose files in `infra/compose/production/`:
+Created three production-ready Docker Compose files in `infra/base/`:

 #### **infrastructure.yaml**
 - Vault (secrets management)
@@ -104,7 +104,7 @@ chmod +x scripts/deploy-to-production.sh

 ### 3. Documentation Created

-#### **infra/compose/production/README.md**
+#### **infra/base manifests**
 Comprehensive production deployment guide including:
 - Prerequisites checklist
 - Three deployment options (automated, step-by-step, manual)
@@ -221,7 +221,7 @@ Or step-by-step:
 1. **Initialize Vault**
    ```bash
    ssh deploy@141.136.35.199
-   cd /opt/compose/ai-tax-agent
+   cd /opt/ai-tax-agent
    docker exec -it vault vault operator init
    # Save unseal keys!
    docker exec -it vault vault operator unseal
@@ -382,7 +382,6 @@ Deployment is successful when:
 If you encounter issues:
 1. Check logs: `./scripts/deploy-to-production.sh logs <service>`
 2. Verify status: `./scripts/deploy-to-production.sh verify`
-3. Review documentation: `infra/compose/production/README.md`
+3. Review manifests: `infra/base/*.yaml`
 4. Check deployment plan: `docs/DEPLOYMENT_PLAN.md`
 5. Follow checklist: `docs/DEPLOYMENT_CHECKLIST.md`

@@ -21,15 +21,14 @@
 - ✅ Created quick start guide (`docs/QUICK_START.md`)

 ### 3. Production Configuration Files
-- ✅ Created `infra/compose/production/infrastructure.yaml` (7 infrastructure services)
+- ✅ Created `infra/base/infrastructure.yaml` (infrastructure, incl. Traefik + Authentik)
-- ✅ Created `infra/compose/production/services.yaml` (14 application services + UI)
+- ✅ Created `infra/base/services.yaml` (application services + UI)
-- ✅ Created `infra/compose/production/monitoring.yaml` (Prometheus, Grafana, Loki, Promtail)
+- ✅ Created `infra/base/monitoring.yaml` (Prometheus, Grafana, Loki, Promtail)
-- ✅ Created `infra/compose/production/README.md` (deployment guide)

 ### 4. Monitoring Configuration
-- ✅ Created Prometheus configuration (`infra/compose/prometheus/prometheus.yml`)
+- ✅ Created Prometheus configuration (`infra/base/prometheus/prometheus.yml`)
-- ✅ Created Loki configuration (`infra/compose/loki/loki-config.yml`)
+- ✅ Created Loki configuration (`infra/base/loki/loki-config.yml`)
-- ✅ Created Promtail configuration (`infra/compose/promtail/promtail-config.yml`)
+- ✅ Created Promtail configuration (`infra/base/promtail/promtail-config.yml`)
 - ✅ Configured service discovery for all 14 services
 - ✅ Set up 30-day metrics retention

@@ -266,10 +265,9 @@ df -h
 - `docs/ENVIRONMENT_COMPARISON.md` - Local vs Production comparison

 2. **Configuration:**
-   - `infra/compose/production/README.md` - Production compose guide
-   - `infra/compose/production/infrastructure.yaml` - Infrastructure services
-   - `infra/compose/production/services.yaml` - Application services
-   - `infra/compose/production/monitoring.yaml` - Monitoring stack
+   - `infra/base/infrastructure.yaml` - Infrastructure services
+   - `infra/base/services.yaml` - Application services
+   - `infra/base/monitoring.yaml` - Monitoring stack

 3. **Deployment:**
    - `docs/POST_BUILD_DEPLOYMENT.md` - Post-build deployment steps
@@ -319,4 +317,3 @@ For questions or issues:
|
|||||||
- 🟡 In Progress
|
- 🟡 In Progress
|
||||||
- ⏳ Pending
|
- ⏳ Pending
|
||||||
- ❌ Blocked
|
- ❌ Blocked
|
||||||
|
|
||||||
|
|||||||
@@ -12,7 +12,7 @@ This document compares the local development environment with the production env
|
|||||||
| **SSL** | Self-signed certificates | Let's Encrypt (GoDaddy DNS) |
|
| **SSL** | Self-signed certificates | Let's Encrypt (GoDaddy DNS) |
|
||||||
| **Networks** | `ai-tax-agent-frontend`<br/>`ai-tax-agent-backend` | `frontend`<br/>`backend` |
|
| **Networks** | `ai-tax-agent-frontend`<br/>`ai-tax-agent-backend` | `frontend`<br/>`backend` |
|
||||||
| **Compose File** | `docker-compose.local.yml` | `infrastructure.yaml`<br/>`services.yaml`<br/>`monitoring.yaml` |
|
| **Compose File** | `docker-compose.local.yml` | `infrastructure.yaml`<br/>`services.yaml`<br/>`monitoring.yaml` |
|
||||||
| **Location** | Local machine | `deploy@141.136.35.199:/opt/compose/ai-tax-agent/` |
|
| **Location** | Local machine | `deploy@141.136.35.199:/opt/ai-tax-agent/` |
|
||||||
| **Traefik** | Isolated instance | Shared with company services |
|
| **Traefik** | Isolated instance | Shared with company services |
|
||||||
| **Authentik** | Isolated instance | Shared with company services |
|
| **Authentik** | Isolated instance | Shared with company services |
|
||||||
| **Data Persistence** | Local Docker volumes | Remote Docker volumes + backups |
|
| **Data Persistence** | Local Docker volumes | Remote Docker volumes + backups |
|
||||||
@@ -271,7 +271,7 @@ make clean
|
|||||||
#### Production
|
#### Production
|
||||||
```bash
|
```bash
|
||||||
# Deploy infrastructure
|
# Deploy infrastructure
|
||||||
cd /opt/compose/ai-tax-agent
|
cd /opt/ai-tax-agent
|
||||||
docker compose -f infrastructure.yaml up -d
|
docker compose -f infrastructure.yaml up -d
|
||||||
|
|
||||||
# Deploy services
|
# Deploy services
|
||||||
@@ -370,7 +370,7 @@ docker compose -f services.yaml up -d --no-deps svc-ingestion
|
|||||||
4. **Deploy to production**:
|
4. **Deploy to production**:
|
||||||
```bash
|
```bash
|
||||||
ssh deploy@141.136.35.199
|
ssh deploy@141.136.35.199
|
||||||
cd /opt/compose/ai-tax-agent
|
cd /opt/ai-tax-agent
|
||||||
docker compose -f services.yaml pull
|
docker compose -f services.yaml pull
|
||||||
docker compose -f services.yaml up -d
|
docker compose -f services.yaml up -d
|
||||||
```
|
```
|
||||||
@@ -436,4 +436,3 @@ The key differences between local and production environments are:
|
|||||||
6. **Backups**: Local has none; production has automated backups
|
6. **Backups**: Local has none; production has automated backups
|
||||||
|
|
||||||
Both environments use the same application code and Docker images, ensuring consistency and reducing deployment risks.
|
Both environments use the same application code and Docker images, ensuring consistency and reducing deployment risks.
|
||||||
|
|
||||||
|
|||||||
@@ -1,332 +0,0 @@
|
|||||||
# Gitea Container Registry Debugging Guide
|
|
||||||
|
|
||||||
## Common Issues When Pushing Large Docker Images
|
|
||||||
|
|
||||||
### Issue 1: Not Logged In
|
|
||||||
|
|
||||||
**Symptom**: `unauthorized: authentication required`
|
|
||||||
|
|
||||||
**Solution**:
|
|
||||||
```bash
|
|
||||||
# On remote server
|
|
||||||
docker login gitea.harkon.co.uk
|
|
||||||
# Username: blue (or your Gitea username)
|
|
||||||
# Password: <your-gitea-access-token>
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Issue 2: Upload Size Limit (413 Request Entity Too Large)
|
|
||||||
|
|
||||||
**Symptom**: Push fails with `413 Request Entity Too Large` or similar error
|
|
||||||
|
|
||||||
**Root Cause**: Traefik or Gitea has a limit on request body size
|
|
||||||
|
|
||||||
**Solution A: Configure Traefik Middleware**
|
|
||||||
|
|
||||||
1. Find your Traefik configuration directory:
|
|
||||||
```bash
|
|
||||||
docker inspect traefik | grep -A 10 Mounts
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Create middleware configuration:
|
|
||||||
```bash
|
|
||||||
# Example: /opt/traefik/config/middlewares.yml
|
|
||||||
sudo tee /opt/traefik/config/middlewares.yml > /dev/null << 'EOF'
|
|
||||||
http:
|
|
||||||
middlewares:
|
|
||||||
large-upload:
|
|
||||||
buffering:
|
|
||||||
maxRequestBodyBytes: 5368709120 # 5GB
|
|
||||||
memRequestBodyBytes: 104857600 # 100MB
|
|
||||||
maxResponseBodyBytes: 5368709120 # 5GB
|
|
||||||
memResponseBodyBytes: 104857600 # 100MB
|
|
||||||
EOF
|
|
||||||
```
|
|
||||||
|
|
||||||
3. Update Gitea container labels:
|
|
||||||
```yaml
|
|
||||||
labels:
|
|
||||||
- "traefik.http.routers.gitea.middlewares=large-upload@file"
|
|
||||||
```
|
|
||||||
|
|
||||||
4. Restart Traefik:
|
|
||||||
```bash
|
|
||||||
docker restart traefik
|
|
||||||
```
|
|
||||||
|
|
||||||
**Solution B: Configure Gitea Directly**
|
|
||||||
|
|
||||||
1. Edit Gitea configuration:
|
|
||||||
```bash
|
|
||||||
docker exec -it gitea-server vi /data/gitea/conf/app.ini
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Add/modify these settings:
|
|
||||||
```ini
|
|
||||||
[server]
|
|
||||||
LFS_MAX_FILE_SIZE = 5368709120 ; 5GB
|
|
||||||
|
|
||||||
[repository.upload]
|
|
||||||
FILE_MAX_SIZE = 5368709120 ; 5GB
|
|
||||||
```
|
|
||||||
|
|
||||||
3. Restart Gitea:
|
|
||||||
```bash
|
|
||||||
docker restart gitea-server
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Issue 3: Network Timeout
|
|
||||||
|
|
||||||
**Symptom**: Push hangs or times out after uploading for a while
|
|
||||||
|
|
||||||
**Root Cause**: Network instability or slow connection
|
|
||||||
|
|
||||||
**Solution**: Use chunked uploads or increase timeout
|
|
||||||
|
|
||||||
1. Configure Docker daemon timeout:
|
|
||||||
```bash
|
|
||||||
# Edit /etc/docker/daemon.json
|
|
||||||
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
|
|
||||||
{
|
|
||||||
"max-concurrent-uploads": 1,
|
|
||||||
"max-concurrent-downloads": 3,
|
|
||||||
"registry-mirrors": []
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
|
|
||||||
sudo systemctl restart docker
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Or use Traefik timeout middleware:
|
|
||||||
```yaml
|
|
||||||
http:
|
|
||||||
middlewares:
|
|
||||||
long-timeout:
|
|
||||||
buffering:
|
|
||||||
retryExpression: "IsNetworkError() && Attempts() < 3"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Issue 4: Disk Space
|
|
||||||
|
|
||||||
**Symptom**: Push fails with "no space left on device"
|
|
||||||
|
|
||||||
**Solution**:
|
|
||||||
```bash
|
|
||||||
# Check disk space
|
|
||||||
df -h
|
|
||||||
|
|
||||||
# Clean up Docker
|
|
||||||
docker system prune -a --volumes -f
|
|
||||||
|
|
||||||
# Check again
|
|
||||||
df -h
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Issue 5: Gitea Registry Not Enabled
|
|
||||||
|
|
||||||
**Symptom**: `404 Not Found` when accessing `/v2/`
|
|
||||||
|
|
||||||
**Solution**:
|
|
||||||
```bash
|
|
||||||
# Check if registry is enabled
|
|
||||||
docker exec gitea-server cat /data/gitea/conf/app.ini | grep -A 5 "\[packages\]"
|
|
||||||
|
|
||||||
# Should show:
|
|
||||||
# [packages]
|
|
||||||
# ENABLED = true
|
|
||||||
```
|
|
||||||
|
|
||||||
If not enabled, add to `app.ini`:
|
|
||||||
```ini
|
|
||||||
[packages]
|
|
||||||
ENABLED = true
|
|
||||||
```
|
|
||||||
|
|
||||||
Restart Gitea:
|
|
||||||
```bash
|
|
||||||
docker restart gitea-server
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Debugging Steps
|
|
||||||
|
|
||||||
### Step 1: Verify Gitea Registry is Accessible
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Should return 401 Unauthorized (which is good - means registry is working)
|
|
||||||
curl -I https://gitea.harkon.co.uk/v2/
|
|
||||||
|
|
||||||
# Should return 200 OK after login
|
|
||||||
docker login gitea.harkon.co.uk
|
|
||||||
curl -u "username:token" https://gitea.harkon.co.uk/v2/
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 2: Test with Small Image
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Pull a small image
|
|
||||||
docker pull alpine:latest
|
|
||||||
|
|
||||||
# Tag it for your registry
|
|
||||||
docker tag alpine:latest gitea.harkon.co.uk/harkon/test:latest
|
|
||||||
|
|
||||||
# Try to push
|
|
||||||
docker push gitea.harkon.co.uk/harkon/test:latest
|
|
||||||
```
|
|
||||||
|
|
||||||
If this works, the issue is with large images (size limit).
|
|
||||||
|
|
||||||
### Step 3: Check Gitea Logs
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check for errors
|
|
||||||
docker logs gitea-server --tail 100 | grep -i error
|
|
||||||
|
|
||||||
# Watch logs in real-time while pushing
|
|
||||||
docker logs -f gitea-server
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4: Check Traefik Logs
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check for 413 or 502 errors
|
|
||||||
docker logs traefik --tail 100 | grep -E "413|502|error"
|
|
||||||
|
|
||||||
# Watch logs in real-time
|
|
||||||
docker logs -f traefik
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5: Check Docker Daemon Logs
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check Docker daemon logs
|
|
||||||
sudo journalctl -u docker --since "1 hour ago" | grep -i error
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Fix: Bypass Traefik for Registry
|
|
||||||
|
|
||||||
If Traefik is causing issues, you can expose Gitea's registry directly:
|
|
||||||
|
|
||||||
1. Update Gitea docker-compose to expose port 3000:
|
|
||||||
```yaml
|
|
||||||
services:
|
|
||||||
gitea:
|
|
||||||
ports:
|
|
||||||
- "3000:3000" # HTTP
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Use direct connection:
|
|
||||||
```bash
|
|
||||||
docker login gitea.harkon.co.uk:3000
|
|
||||||
docker push gitea.harkon.co.uk:3000/harkon/base-ml:v1.0.1
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note**: This connects to Gitea over plain HTTP (no TLS), so only use it for debugging!
|
|
||||||
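One caveat not covered above: when the daemon pushes to a registry over plain HTTP, Docker refuses the connection unless that host:port is whitelisted as an insecure registry. A rough sketch, assuming you merge this with any settings already in `daemon.json` rather than overwriting them:

```bash
# Debugging only - plain HTTP, no TLS. Merge with existing daemon.json settings.
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
  "insecure-registries": ["gitea.harkon.co.uk:3000"]
}
EOF
sudo systemctl restart docker
```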
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommended Configuration for Large Images
|
|
||||||
|
|
||||||
### Traefik Configuration
|
|
||||||
|
|
||||||
Create `/opt/traefik/config/gitea-registry.yml`:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
http:
|
|
||||||
middlewares:
|
|
||||||
gitea-registry:
|
|
||||||
buffering:
|
|
||||||
maxRequestBodyBytes: 5368709120 # 5GB
|
|
||||||
memRequestBodyBytes: 104857600 # 100MB in memory
|
|
||||||
maxResponseBodyBytes: 5368709120 # 5GB
|
|
||||||
memResponseBodyBytes: 104857600 # 100MB in memory
|
|
||||||
|
|
||||||
routers:
|
|
||||||
gitea-registry:
|
|
||||||
rule: "Host(`gitea.harkon.co.uk`) && PathPrefix(`/v2/`)"
|
|
||||||
entryPoints:
|
|
||||||
- websecure
|
|
||||||
middlewares:
|
|
||||||
- gitea-registry
|
|
||||||
service: gitea
|
|
||||||
tls:
|
|
||||||
certResolver: letsencrypt
|
|
||||||
```
|
|
||||||
|
|
||||||
### Gitea Configuration
|
|
||||||
|
|
||||||
In `/data/gitea/conf/app.ini`:
|
|
||||||
|
|
||||||
```ini
|
|
||||||
[server]
|
|
||||||
PROTOCOL = http
|
|
||||||
DOMAIN = gitea.harkon.co.uk
|
|
||||||
ROOT_URL = https://gitea.harkon.co.uk/
|
|
||||||
HTTP_PORT = 3000
|
|
||||||
LFS_MAX_FILE_SIZE = 5368709120
|
|
||||||
|
|
||||||
[repository.upload]
|
|
||||||
FILE_MAX_SIZE = 5368709120
|
|
||||||
ENABLED = true
|
|
||||||
|
|
||||||
[packages]
|
|
||||||
ENABLED = true
|
|
||||||
CHUNKED_UPLOAD_PATH = /data/gitea/tmp/package-upload
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Testing the Fix
|
|
||||||
|
|
||||||
After applying configuration changes:
|
|
||||||
|
|
||||||
1. Restart services:
|
|
||||||
```bash
|
|
||||||
docker restart traefik
|
|
||||||
docker restart gitea-server
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Test with a large layer:
|
|
||||||
```bash
|
|
||||||
# Build base-ml (has large layers)
|
|
||||||
cd /home/deploy/ai-tax-agent
|
|
||||||
docker build -f infra/docker/base-ml.Dockerfile -t gitea.harkon.co.uk/harkon/base-ml:test .
|
|
||||||
|
|
||||||
# Try to push
|
|
||||||
docker push gitea.harkon.co.uk/harkon/base-ml:test
|
|
||||||
```
|
|
||||||
|
|
||||||
3. Monitor logs:
|
|
||||||
```bash
|
|
||||||
# Terminal 1: Watch Traefik
|
|
||||||
docker logs -f traefik
|
|
||||||
|
|
||||||
# Terminal 2: Watch Gitea
|
|
||||||
docker logs -f gitea-server
|
|
||||||
|
|
||||||
# Terminal 3: Push image
|
|
||||||
docker push gitea.harkon.co.uk/harkon/base-ml:test
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Alternative: Use Docker Hub or GitHub Container Registry
|
|
||||||
|
|
||||||
If Gitea continues to have issues with large images, consider:
|
|
||||||
|
|
||||||
1. **Docker Hub**: Free for public images
|
|
||||||
2. **GitHub Container Registry (ghcr.io)**: Free for public/private
|
|
||||||
3. **GitLab Container Registry**: Free tier available
|
|
||||||
|
|
||||||
These are battle-tested for large ML images and have better defaults for large uploads.
|
|
||||||
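Migrating is mostly a retag-and-push. A sketch for GHCR, assuming a GitHub personal access token with the `write:packages` scope; `<github-user>` and `<github-org>` are placeholders:

```bash
# Mirror an existing image to GitHub Container Registry (sketch)
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <github-user> --password-stdin
docker tag gitea.harkon.co.uk/harkon/base-ml:v1.0.1 ghcr.io/<github-org>/base-ml:v1.0.1
docker push ghcr.io/<github-org>/base-ml:v1.0.1
```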
|
|
||||||
@@ -1,194 +0,0 @@
|
|||||||
# Gitea Container Registry - Image Naming Fix
|
|
||||||
|
|
||||||
## Issue
|
|
||||||
|
|
||||||
The initial build script used an incorrect image naming convention for Gitea's container registry.
|
|
||||||
|
|
||||||
### Incorrect Format
|
|
||||||
|
|
||||||
```
|
|
||||||
gitea.harkon.co.uk/ai-tax-agent/svc-ingestion:v1.0.0
|
|
||||||
```
|
|
||||||
|
|
||||||
### Correct Format (Per Gitea Documentation)
|
|
||||||
|
|
||||||
```
|
|
||||||
gitea.harkon.co.uk/{owner}/{image}:{tag}
|
|
||||||
```
|
|
||||||
|
|
||||||
Where `{owner}` must be your **Gitea username** or **organization name**.
|
|
||||||
|
|
||||||
**Using organization:** `harkon` (Gitea team/organization)
|
|
||||||
|
|
||||||
## Solution
|
|
||||||
|
|
||||||
Updated the build script and production compose files to use the correct naming convention.
|
|
||||||
|
|
||||||
### Changes Made
|
|
||||||
|
|
||||||
#### 1. Build Script (`scripts/build-and-push-images.sh`)
|
|
||||||
|
|
||||||
**Before:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
REGISTRY="${1:-gitea.harkon.co.uk}"
|
|
||||||
VERSION="${2:-latest}"
|
|
||||||
PROJECT="ai-tax-agent"
|
|
||||||
|
|
||||||
IMAGE_NAME="$REGISTRY/$PROJECT/$service:$VERSION"
|
|
||||||
```
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
REGISTRY="${1:-gitea.harkon.co.uk}"
|
|
||||||
VERSION="${2:-latest}"
|
|
||||||
OWNER="${3:-harkon}" # Gitea organization/team name
|
|
||||||
|
|
||||||
IMAGE_NAME="$REGISTRY/$OWNER/$service:$VERSION"
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 2. Production Services (`infra/compose/production/services.yaml`)
|
|
||||||
|
|
||||||
**Before:**
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
svc-ingestion:
|
|
||||||
image: gitea.harkon.co.uk/ai-tax-agent/svc-ingestion:latest
|
|
||||||
```
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
svc-ingestion:
|
|
||||||
image: gitea.harkon.co.uk/harkon/svc-ingestion:latest
|
|
||||||
```
|
|
||||||
|
|
||||||
All 14 services updated:
|
|
||||||
|
|
||||||
- svc-ingestion
|
|
||||||
- svc-extract
|
|
||||||
- svc-kg
|
|
||||||
- svc-rag-retriever
|
|
||||||
- svc-rag-indexer
|
|
||||||
- svc-forms
|
|
||||||
- svc-hmrc
|
|
||||||
- svc-ocr
|
|
||||||
- svc-rpa
|
|
||||||
- svc-normalize-map
|
|
||||||
- svc-reason
|
|
||||||
- svc-firm-connectors
|
|
||||||
- svc-coverage
|
|
||||||
- ui-review
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
### Build and Push Images
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# With default owner (harkon organization)
|
|
||||||
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1
|
|
||||||
|
|
||||||
# With custom owner
|
|
||||||
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 <your-gitea-org>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Pull Images
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker pull gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
|
|
||||||
```
|
|
||||||
|
|
||||||
### Push Images Manually
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Tag image
|
|
||||||
docker tag my-image:latest gitea.harkon.co.uk/harkon/my-image:v1.0.1
|
|
||||||
|
|
||||||
# Push image
|
|
||||||
docker push gitea.harkon.co.uk/harkon/my-image:v1.0.1
|
|
||||||
```
|
|
||||||
|
|
||||||
## Gitea Registry Documentation Reference
|
|
||||||
|
|
||||||
From Gitea's official documentation:
|
|
||||||
|
|
||||||
### Image Naming Convention
|
|
||||||
|
|
||||||
Images must follow this naming convention:
|
|
||||||
|
|
||||||
```
|
|
||||||
{registry}/{owner}/{image}
|
|
||||||
```
|
|
||||||
|
|
||||||
When building your Docker image with the naming convention above, the commands look like this:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# build an image with tag
|
|
||||||
docker build -t {registry}/{owner}/{image}:{tag} .
|
|
||||||
|
|
||||||
# name an existing image with tag
|
|
||||||
docker tag {some-existing-image}:{tag} {registry}/{owner}/{image}:{tag}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Valid Examples
|
|
||||||
|
|
||||||
For owner `testuser` on `gitea.example.com`:
|
|
||||||
|
|
||||||
- ✅ `gitea.example.com/testuser/myimage`
|
|
||||||
- ✅ `gitea.example.com/testuser/my-image`
|
|
||||||
- ✅ `gitea.example.com/testuser/my/image`
|
|
||||||
|
|
||||||
### Important Notes
|
|
||||||
|
|
||||||
1. **Owner must exist**: The owner (username or organization) must exist in Gitea
|
|
||||||
2. **Case-insensitive tags**: `image:tag` and `image:Tag` are treated as the same
|
|
||||||
3. **Authentication required**: Use a personal access token with the `write:package` scope (see the login sketch after this list)
|
|
||||||
4. **Registry URL**: Use the main Gitea domain, not a separate registry subdomain
|
|
||||||
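For non-interactive use (CI jobs, scripts), the same login can be done by piping the token. A sketch, assuming the token is stored in `GITEA_TOKEN` and has the `write:package` scope:

```bash
# Non-interactive registry login (sketch)
echo "$GITEA_TOKEN" | docker login gitea.harkon.co.uk -u blue --password-stdin
```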
|
|
||||||
## Verification
|
|
||||||
|
|
||||||
After the fix, verify images are pushed correctly:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Login to Gitea
|
|
||||||
docker login gitea.harkon.co.uk
|
|
||||||
|
|
||||||
# Check pushed images in Gitea UI
|
|
||||||
# Navigate to: https://gitea.harkon.co.uk/blue/-/packages
|
|
||||||
```
|
|
||||||
|
|
||||||
## Current Build Status
|
|
||||||
|
|
||||||
✅ **Fixed and working!**
|
|
||||||
|
|
||||||
Build command:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./scripts/build-and-push-images.sh gitea.harkon.co.uk v1.0.1 harkon
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected output:
|
|
||||||
|
|
||||||
```
|
|
||||||
ℹ️ Logging in to registry: gitea.harkon.co.uk
|
|
||||||
Login Succeeded
|
|
||||||
ℹ️ Building svc-ingestion...
|
|
||||||
ℹ️ Building: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
|
|
||||||
✅ Built: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
|
|
||||||
ℹ️ Pushing: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
|
|
||||||
✅ Pushed: gitea.harkon.co.uk/harkon/svc-ingestion:v1.0.1
|
|
||||||
```
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
1. ✅ Build script fixed
|
|
||||||
2. ✅ Production compose files updated
|
|
||||||
3. 🟡 Build in progress (14 services)
|
|
||||||
4. ⏳ Deploy to production (after build completes)
|
|
||||||
|
|
||||||
## References
|
|
||||||
|
|
||||||
- [Gitea Container Registry Documentation](https://docs.gitea.com/usage/packages/container)
|
|
||||||
- Build script: `scripts/build-and-push-images.sh`
|
|
||||||
- Production services: `infra/compose/production/services.yaml`
|
|
||||||
@@ -148,11 +148,11 @@ docker run --rm gitea.harkon.co.uk/harkon/svc-ocr:v1.0.1 pip list | grep torch
|
|||||||
|
|
||||||
### 5. Update Production Deployment
|
### 5. Update Production Deployment
|
||||||
|
|
||||||
Update `infra/compose/production/services.yaml` to use `v1.0.1`:
|
Update `infra/base/services.yaml` to use `v1.0.1`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Find and replace v1.0.0 with v1.0.1
|
# Find and replace v1.0.0 with v1.0.1
|
||||||
sed -i '' 's/:v1.0.0/:v1.0.1/g' infra/compose/production/services.yaml
|
sed -i '' 's/:v1.0.0/:v1.0.1/g' infra/base/services.yaml
|
||||||
|
|
||||||
# Or use latest tag (already configured)
|
# Or use latest tag (already configured)
|
||||||
# No changes needed if using :latest
|
# No changes needed if using :latest
|
||||||
|
|||||||
@@ -50,7 +50,7 @@ docker login gitea.harkon.co.uk
|
|||||||
**SSH to server:**
|
**SSH to server:**
|
||||||
```bash
|
```bash
|
||||||
ssh deploy@141.136.35.199
|
ssh deploy@141.136.35.199
|
||||||
cd /opt/compose/ai-tax-agent
|
cd /opt/ai-tax-agent
|
||||||
```
|
```
|
||||||
|
|
||||||
**Initialize Vault:**
|
**Initialize Vault:**
|
||||||
@@ -62,19 +62,19 @@ docker exec -it vault vault operator unseal
|
|||||||
|
|
||||||
**Create MinIO Buckets:**
|
**Create MinIO Buckets:**
|
||||||
```bash
|
```bash
|
||||||
docker exec -it minio mc alias set local http://localhost:9092 admin <MINIO_PASSWORD>
|
docker exec -it apa-minio mc alias set local http://localhost:9000 admin <MINIO_PASSWORD>
|
||||||
docker exec -it minio mc mb local/documents
|
docker exec -it apa-minio mc mb local/documents
|
||||||
docker exec -it minio mc mb local/models
|
docker exec -it apa-minio mc mb local/models
|
||||||
```
|
```
|
||||||
|
|
||||||
**Create NATS Streams:**
|
**Create NATS Streams:**
|
||||||
```bash
|
```bash
|
||||||
docker exec -it nats nats stream add TAX_AGENT_EVENTS \
|
docker exec -it apa-nats nats stream add TAX_AGENT_EVENTS \
|
||||||
--subjects="tax.>" --storage=file --retention=limits --max-age=7d
|
--subjects="tax.>" --storage=file --retention=limits --max-age=7d
|
||||||
```
|
```
|
||||||
|
|
||||||
**Configure Authentik:**
|
**Configure Authentik:**
|
||||||
1. Go to https://authentik.harkon.co.uk
|
1. Go to https://auth.harkon.co.uk
|
||||||
2. Create groups: `app-admin`, `app-user`, `app-reviewer`
|
2. Create groups: `app-admin`, `app-user`, `app-reviewer`
|
||||||
3. Create OAuth providers for:
|
3. Create OAuth providers for:
|
||||||
- Review UI: `app.harkon.co.uk`
|
- Review UI: `app.harkon.co.uk`
|
||||||
@@ -94,7 +94,7 @@ curl -I https://api.harkon.co.uk/healthz
|
|||||||
curl -I https://grafana.harkon.co.uk
|
curl -I https://grafana.harkon.co.uk
|
||||||
|
|
||||||
# View logs
|
# View logs
|
||||||
./scripts/deploy-to-production.sh logs svc-ingestion
|
./scripts/deploy-to-production.sh logs apa-svc-ingestion
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -127,8 +127,8 @@ curl -I https://grafana.harkon.co.uk
|
|||||||
### Restart Service
|
### Restart Service
|
||||||
```bash
|
```bash
|
||||||
ssh deploy@141.136.35.199
|
ssh deploy@141.136.35.199
|
||||||
cd /opt/compose/ai-tax-agent
|
cd /opt/ai-tax-agent
|
||||||
docker compose -f services.yaml restart svc-ingestion
|
docker compose -f services.yaml restart apa-svc-ingestion
|
||||||
```
|
```
|
||||||
|
|
||||||
### Check Status
|
### Check Status
|
||||||
@@ -163,25 +163,25 @@ docker compose -f services.yaml logs svc-ingestion
|
|||||||
docker compose -f infrastructure.yaml ps
|
docker compose -f infrastructure.yaml ps
|
||||||
|
|
||||||
# Restart
|
# Restart
|
||||||
docker compose -f services.yaml restart svc-ingestion
|
docker compose -f services.yaml restart apa-svc-ingestion
|
||||||
```
|
```
|
||||||
|
|
||||||
### SSL Issues
|
### SSL Issues
|
||||||
```bash
|
```bash
|
||||||
# Check Traefik logs
|
# Check Traefik logs
|
||||||
docker logs traefik
|
docker logs apa-traefik
|
||||||
|
|
||||||
# Check certificates
|
# Check certificates
|
||||||
sudo cat /opt/compose/traefik/certs/godaddy-acme.json | jq
|
sudo cat /opt/ai-tax-agent/traefik/certs/godaddy-acme.json | jq
|
||||||
```
|
```
|
||||||
|
|
||||||
### Database Connection
|
### Database Connection
|
||||||
```bash
|
```bash
|
||||||
# Test Postgres
|
# Test Postgres
|
||||||
docker exec -it postgres pg_isready -U postgres
|
docker exec -it apa-postgres pg_isready -U postgres
|
||||||
|
|
||||||
# Check env vars
|
# Check env vars
|
||||||
docker exec -it svc-ingestion env | grep POSTGRES
|
docker exec -it apa-svc-ingestion env | grep POSTGRES
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -190,7 +190,7 @@ docker exec -it svc-ingestion env | grep POSTGRES
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
ssh deploy@141.136.35.199
|
ssh deploy@141.136.35.199
|
||||||
cd /opt/compose/ai-tax-agent
|
cd /opt/ai-tax-agent
|
||||||
|
|
||||||
# Stop services
|
# Stop services
|
||||||
docker compose -f services.yaml down
|
docker compose -f services.yaml down
|
||||||
@@ -198,12 +198,11 @@ docker compose -f infrastructure.yaml down
|
|||||||
docker compose -f monitoring.yaml down
|
docker compose -f monitoring.yaml down
|
||||||
|
|
||||||
# Restore backup
|
# Restore backup
|
||||||
cd /opt/compose
|
cd /opt
|
||||||
tar -xzf ~/backups/backup-YYYYMMDD-HHMMSS.tar.gz
|
tar -xzf ~/backups/backup-YYYYMMDD-HHMMSS.tar.gz
|
||||||
|
|
||||||
# Restart company services
|
# Restart application infra
|
||||||
cd /opt/compose/traefik && docker compose up -d
|
cd /opt/ai-tax-agent && docker compose -f infrastructure.yaml up -d
|
||||||
cd /opt/compose/authentik && docker compose up -d
|
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -242,4 +241,3 @@ cd /opt/compose/authentik && docker compose up -d
|
|||||||
```bash
|
```bash
|
||||||
./scripts/deploy-to-production.sh logs <service>
|
./scripts/deploy-to-production.sh logs <service>
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -16,3 +16,49 @@ http:
|
|||||||
- X-authentik-meta-provider
|
- X-authentik-meta-provider
|
||||||
- X-authentik-meta-app
|
- X-authentik-meta-app
|
||||||
- X-authentik-meta-version
|
- X-authentik-meta-version
|
||||||
|
|
||||||
|
# Large upload middleware for Gitea registry
|
||||||
|
gitea-large-upload:
|
||||||
|
buffering:
|
||||||
|
maxRequestBodyBytes: 5368709120 # 5GB
|
||||||
|
memRequestBodyBytes: 104857600 # 100MB
|
||||||
|
maxResponseBodyBytes: 5368709120 # 5GB
|
||||||
|
memResponseBodyBytes: 104857600 # 100MB
|
||||||
|
retryExpression: "IsNetworkError() && Attempts() < 3"
|
||||||
|
|
||||||
|
# Rate limiting for public APIs
|
||||||
|
api-ratelimit:
|
||||||
|
rateLimit:
|
||||||
|
average: 100
|
||||||
|
burst: 50
|
||||||
|
period: 1s
|
||||||
|
|
||||||
|
# Security headers
|
||||||
|
security-headers:
|
||||||
|
headers:
|
||||||
|
frameDeny: true
|
||||||
|
sslRedirect: true
|
||||||
|
browserXssFilter: true
|
||||||
|
contentTypeNosniff: true
|
||||||
|
stsIncludeSubdomains: true
|
||||||
|
stsPreload: true
|
||||||
|
stsSeconds: 31536000
|
||||||
|
|
||||||
|
# CORS headers
|
||||||
|
api-cors:
|
||||||
|
headers:
|
||||||
|
accessControlAllowMethods:
|
||||||
|
- GET
|
||||||
|
- POST
|
||||||
|
- PUT
|
||||||
|
- DELETE
|
||||||
|
- OPTIONS
|
||||||
|
accessControlAllowOriginList:
|
||||||
|
- "https://app.harkon.co.uk"
|
||||||
|
accessControlAllowHeaders:
|
||||||
|
- "Content-Type"
|
||||||
|
- "Authorization"
|
||||||
|
accessControlMaxAge: 100
|
||||||
|
addVaryHeader: true
|
||||||
|
|
||||||
|
# Security headers
|
||||||
|
|||||||
@@ -4,7 +4,9 @@ entryPoints:
|
|||||||
address: ":80"
|
address: ":80"
|
||||||
websecure:
|
websecure:
|
||||||
address: ":443"
|
address: ":443"
|
||||||
|
transport:
|
||||||
|
respondingTimeouts:
|
||||||
|
readTimeout: 30m
|
||||||
api:
|
api:
|
||||||
dashboard: true
|
dashboard: true
|
||||||
|
|
||||||
|
|||||||
@@ -1,31 +0,0 @@
|
|||||||
# Application-specific Traefik middlewares
|
|
||||||
# These are loaded by the application infrastructure, not the external Traefik
|
|
||||||
|
|
||||||
http:
|
|
||||||
middlewares:
|
|
||||||
# Large upload middleware for Gitea registry
|
|
||||||
gitea-large-upload:
|
|
||||||
buffering:
|
|
||||||
maxRequestBodyBytes: 5368709120 # 5GB
|
|
||||||
memRequestBodyBytes: 104857600 # 100MB
|
|
||||||
maxResponseBodyBytes: 5368709120 # 5GB
|
|
||||||
memResponseBodyBytes: 104857600 # 100MB
|
|
||||||
retryExpression: "IsNetworkError() && Attempts() < 3"
|
|
||||||
|
|
||||||
# Rate limiting for public APIs
|
|
||||||
api-ratelimit:
|
|
||||||
rateLimit:
|
|
||||||
average: 100
|
|
||||||
burst: 50
|
|
||||||
period: 1s
|
|
||||||
|
|
||||||
# Security headers
|
|
||||||
security-headers:
|
|
||||||
headers:
|
|
||||||
frameDeny: true
|
|
||||||
sslRedirect: true
|
|
||||||
browserXssFilter: true
|
|
||||||
contentTypeNosniff: true
|
|
||||||
stsIncludeSubdomains: true
|
|
||||||
stsPreload: true
|
|
||||||
stsSeconds: 31536000
|
|
||||||
@@ -1,25 +0,0 @@
|
|||||||
-----BEGIN CERTIFICATE-----
|
|
||||||
MIIEHjCCAwagAwIBAgIUbOm5g4Xhb08Lk6DIpVst7+xZHOswDQYJKoZIhvcNAQEL
|
|
||||||
BQAwEDEOMAwGA1UEAwwFbG9jYWwwHhcNMjUwOTI4MTExNTM1WhcNMzUwOTI2MTEx
|
|
||||||
NTM1WjAQMQ4wDAYDVQQDDAVsb2NhbDCCASIwDQYJKoZIhvcNAQEBBQADggEPADCC
|
|
||||||
AQoCggEBAK0370DEo3dScS8uLwBsXkuaAHn9wO2fjxEHLZwHWfFo/16t+EEAi5c3
|
|
||||||
zDs7nYQ7LPLndxBfO6xZ5uWKNIVtp6ARzAeRbGgbjXDdK3fOyRdhhKR3aZVOH1D0
|
|
||||||
xUjEm/X5jEDv81sufSjk+DIQmh8hQnp3RwdHyhkIZUCTsBXMfnj+zs1UKTdRQBF5
|
|
||||||
SUplGsbh6z3xCSI4jiNRb7mNHXqV3Fv6ycwF8YdthSDfueltBP4vT/CDtebkkKPF
|
|
||||||
dx7YWEIPPUNqEoHqeI5iYP6gnWJYcr3vU+p2BuTwUICo+njzAf+P/SsjPHbujJob
|
|
||||||
dbHUclBHIrIO4BpYZtY1a7E219MbqcECAwEAAaOCAW4wggFqMB0GA1UdDgQWBBQ7
|
|
||||||
qHpza0Bb1xI1g7cMBx33JnFQljAfBgNVHSMEGDAWgBQ7qHpza0Bb1xI1g7cMBx33
|
|
||||||
JnFQljAPBgNVHRMBAf8EBTADAQH/MIIBFQYDVR0RBIIBDDCCAQiCCWxvY2FsaG9z
|
|
||||||
dIcEfwAAAYILKi5sb2NhbC5sYW6CDmF1dGgubG9jYWwubGFughFncmFmYW5hLmxv
|
|
||||||
Y2FsLmxhboIQcmV2aWV3LmxvY2FsLmxhboINYXBpLmxvY2FsLmxhboIPdmF1bHQu
|
|
||||||
bG9jYWwubGFugg9taW5pby5sb2NhbC5sYW6CE21pbmlvLWFwaS5sb2NhbC5sYW6C
|
|
||||||
EHFkcmFudC5sb2NhbC5sYW6CD25lbzRqLmxvY2FsLmxhboIUcHJvbWV0aGV1cy5s
|
|
||||||
b2NhbC5sYW6CDmxva2kubG9jYWwubGFughF1bmxlYXNoLmxvY2FsLmxhboIRdHJh
|
|
||||||
ZWZpay5sb2NhbC5sYW4wDQYJKoZIhvcNAQELBQADggEBAICf+2MZ7BHbSD/pnvll
|
|
||||||
G7Zmk+Bntj2F6RBQVZ2ZsKPWkHeZEYJDRvU0I2uL5tvvDJp4q0hjdluJllchhGgr
|
|
||||||
qfu7i+kRnhzme7oyRTFGYp8b3zHBvLyJLmdIALxuNSjIEeh1Fx0lEhKwqOlA4y6T
|
|
||||||
jziPmsGv3IonGJM2dURGNcR7DfG6H/Yl12qV8u/tVFTxqWL+hyCE7u8v+ZIcZ+fj
|
|
||||||
82X7hXt1HvfP84EhVtfqQMb5xykLtXvPqggSCFXYIj2PanWdwEdE6P5Yr2D1Yz7r
|
|
||||||
tzpmpoetrGoMWIeB0yiWgt0qJ/KK7meoCp64mqfBc48p1p/7kj2R/FRH1Jx3gFWy
|
|
||||||
dT4=
|
|
||||||
-----END CERTIFICATE-----
|
|
||||||
@@ -1,28 +0,0 @@
|
|||||||
-----BEGIN PRIVATE KEY-----
|
|
||||||
MIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQCtN+9AxKN3UnEv
|
|
||||||
Li8AbF5LmgB5/cDtn48RBy2cB1nxaP9erfhBAIuXN8w7O52EOyzy53cQXzusWebl
|
|
||||||
ijSFbaegEcwHkWxoG41w3St3zskXYYSkd2mVTh9Q9MVIxJv1+YxA7/NbLn0o5Pgy
|
|
||||||
EJofIUJ6d0cHR8oZCGVAk7AVzH54/s7NVCk3UUAReUlKZRrG4es98QkiOI4jUW+5
|
|
||||||
jR16ldxb+snMBfGHbYUg37npbQT+L0/wg7Xm5JCjxXce2FhCDz1DahKB6niOYmD+
|
|
||||||
oJ1iWHK971Pqdgbk8FCAqPp48wH/j/0rIzx27oyaG3Wx1HJQRyKyDuAaWGbWNWux
|
|
||||||
NtfTG6nBAgMBAAECggEAHvtkNcd2HX+HcxLloUPA0fDnqOo0OcxSQI9yHvhJpB5N
|
|
||||||
nterEaVRUmjOhMGy+NXEwmWYLDt8ZuVloSTJJBxq4PyN68SdCTn0YH2Oqs03tpDg
|
|
||||||
srIRFn10qHw/VTalVqed6HeCpYp5JHlf00SY7Hx8cX8oGytCAJw50AUad6ut62IM
|
|
||||||
sp/QFdtkLhtq9vGzQUqyIP92Y/+GbxhB+eHkuvvFau1KJq7K8qhroFTwQFts9er2
|
|
||||||
890Ujmz3bF2RhHixQcpXpsf/DMyylGJTbZDmSFkTDa/c1PzqvKrmL3wP7A3bk1E5
|
|
||||||
CP8/a65ykotJEX8RkWqH2XxvRKpdWtCaeuCsmWUQ4QKBgQDTLbC9DWHCUYMWJhyW
|
|
||||||
TKAeXx5xFGHIqggN28lIkXFiCVsTZyOuRDN7Q/CbOat/0JthrzyP18L+6ewZt2ZN
|
|
||||||
RjdfGdnpUCJx6LR4dtBH8Rc+CjlSnqEgJIkgfIs8b9uEhMI1eQV+BAFQON3BzdpT
|
|
||||||
wQ86aGsrdqtpfav7cImVfGcY/QKBgQDR+7OcnEwh8s/1J2niMKjk8agyCGGHWW4M
|
|
||||||
g+vIv7lptavgEGOPMBv7QgmeuUjwSszphQXL36m39ZRmI5B+J0/onuQzv04tJeZY
|
|
||||||
WZhA+T12a+1VnvUZNZm/qp0I2rW+4m+DmJoLQlvpaaFit/1fPJ6+IzI2VzPeWhw2
|
|
||||||
vUQ5QIYhFQKBgFUWZc3mpGsNOMol1QLiIOnb3YImejfF+rTKx9FLeOnNZzrsJb5D
|
|
||||||
kJKsDzgcBnPbc5/qYXZ7sv/O9OhvsvKTxh+1ZM3TEe3fm0emZ8l05K6EpBAcBkPT
|
|
||||||
NMU4KUnSsBo2+6Fb/9CEgJr4LrG15bA1a5NXG0dJ60r37eHDuEvY8hlpAoGADWv2
|
|
||||||
PhNrdlwL2NKtHO0ZTpD3vEL24OzhcOFZx9ohYtVe6BKEGpnrn/LHpKKZO+q8EE0V
|
|
||||||
YsOoGH8U/jZVvQqMPAUz9u7Kc25Ru+H2Lmj/+brKT8e6SOM5MZwZL4CzT0Ev+Yxe
|
|
||||||
hEu4jkHXM/Uot9arGuIrCngmc5b06LbOTo6GREUCgYArWyPYeETah/GVwU7/TNY5
|
|
||||||
5f8lNbWBoXZfpVbWdoUZT6tGWciZsiXSR4x9f+1/LMIuChegSEazrJUDt7TbCkZs
|
|
||||||
s4A66pnME37aYP2sMvJF3zSnQWVIyBgGI5xX0XW/WdozKl1mdFfigyWp58uo2dS2
|
|
||||||
TxE3dy8rxpUdDCUmvJT/Fw==
|
|
||||||
-----END PRIVATE KEY-----
|
|
||||||
@@ -134,7 +134,7 @@ class Neo4jClient:
|
|||||||
result = await self.run_query(query, {"properties": properties}, database)
|
result = await self.run_query(query, {"properties": properties}, database)
|
||||||
node = result[0]["n"] if result else {}
|
node = result[0]["n"] if result else {}
|
||||||
# Return node ID if available, otherwise return the full node
|
# Return node ID if available, otherwise return the full node
|
||||||
return node.get("id", node)
|
return node.get("id", node) # type: ignore
|
||||||
|
|
||||||
async def update_node(
|
async def update_node(
|
||||||
self,
|
self,
|
||||||
@@ -209,7 +209,7 @@ class Neo4jClient:
|
|||||||
database,
|
database,
|
||||||
)
|
)
|
||||||
rel = result[0]["r"] if result else {}
|
rel = result[0]["r"] if result else {}
|
||||||
return rel.get("id", rel)
|
return rel.get("id", rel) # type: ignore
|
||||||
|
|
||||||
# Original signature (using labels and IDs)
|
# Original signature (using labels and IDs)
|
||||||
rel_properties = properties or {}
|
rel_properties = properties or {}
|
||||||
@@ -231,7 +231,7 @@ class Neo4jClient:
|
|||||||
)
|
)
|
||||||
rel = result[0]["r"] if result else {}
|
rel = result[0]["r"] if result else {}
|
||||||
# Return relationship ID if available, otherwise return the full relationship
|
# Return relationship ID if available, otherwise return the full relationship
|
||||||
return rel.get("id", rel)
|
return rel.get("id", rel) # type: ignore
|
||||||
|
|
||||||
async def get_node_lineage(
|
async def get_node_lineage(
|
||||||
self, node_id: str, max_depth: int = 10, database: str = "neo4j"
|
self, node_id: str, max_depth: int = 10, database: str = "neo4j"
|
||||||
|
|||||||
libs/ocr/processor.py (new file, 507 lines)
@@ -0,0 +1,507 @@
|
|||||||
|
import base64
|
||||||
|
import concurrent.futures
|
||||||
|
import io
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import requests
|
||||||
|
from PIL import Image, ImageFilter
|
||||||
|
from PyPDF2 import PdfReader
|
||||||
|
|
||||||
|
|
||||||
|
class OCRProcessor:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
model_name: str = "llama3.2-vision:11b",
|
||||||
|
base_url: str = "http://localhost:11434/api/generate",
|
||||||
|
max_workers: int = 1,
|
||||||
|
provider: str = "ollama",
|
||||||
|
openai_api_key: str | None = None,
|
||||||
|
openai_base_url: str = "https://api.openai.com/v1/chat/completions",
|
||||||
|
):
|
||||||
|
self.model_name = model_name
|
||||||
|
self.base_url = base_url
|
||||||
|
self.max_workers = max_workers
|
||||||
|
self.provider = provider.lower()
|
||||||
|
self.openai_api_key = openai_api_key or os.getenv("OPENAI_API_KEY")
|
||||||
|
self.openai_base_url = openai_base_url
|
||||||
|
|
||||||
|
def _encode_image(self, image_path: str) -> str:
|
||||||
|
"""Convert image to base64 string"""
|
||||||
|
with open(image_path, "rb") as image_file:
|
||||||
|
return base64.b64encode(image_file.read()).decode("utf-8")
|
||||||
|
|
||||||
|
def _pdf_to_images(self, pdf_path: str) -> list[str]:
|
||||||
|
"""
|
||||||
|
Convert each page of a PDF to an image without PyMuPDF.
|
||||||
|
Strategy: extract largest embedded image per page via PyPDF2.
|
||||||
|
Saves each selected image as a temporary PNG and returns paths.
|
||||||
|
|
||||||
|
Note: Text-only pages with no embedded images will be skipped here.
|
||||||
|
Use _pdf_extract_text as a fallback for such pages.
|
||||||
|
"""
|
||||||
|
image_paths: list[str] = []
|
||||||
|
try:
|
||||||
|
reader = PdfReader(pdf_path)
|
||||||
|
for page_index, page in enumerate(reader.pages):
|
||||||
|
try:
|
||||||
|
resources = page.get("/Resources")
|
||||||
|
if resources is None:
|
||||||
|
continue
|
||||||
|
xobject = resources.get("/XObject")
|
||||||
|
if xobject is None:
|
||||||
|
continue
|
||||||
|
xobject = xobject.get_object()
|
||||||
|
largest = None
|
||||||
|
largest_area = -1
|
||||||
|
for _, obj_ref in xobject.items():
|
||||||
|
try:
|
||||||
|
obj = obj_ref.get_object()
|
||||||
|
if obj.get("/Subtype") != "/Image":
|
||||||
|
continue
|
||||||
|
width = int(obj.get("/Width", 0))
|
||||||
|
height = int(obj.get("/Height", 0))
|
||||||
|
area = width * height
|
||||||
|
if area > largest_area:
|
||||||
|
largest = obj
|
||||||
|
largest_area = area
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if largest is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
data = largest.get_data()
|
||||||
|
filt = largest.get("/Filter")
|
||||||
|
out_path = f"{pdf_path}_page{page_index}.png"
|
||||||
|
# If JPEG/JPX, write bytes directly; else convert via PIL
|
||||||
|
if filt in ("/DCTDecode",):
|
||||||
|
# JPEG
|
||||||
|
out_path = f"{pdf_path}_page{page_index}.jpg"
|
||||||
|
with open(out_path, "wb") as f:
|
||||||
|
f.write(data)
|
||||||
|
elif filt in ("/JPXDecode",):
|
||||||
|
out_path = f"{pdf_path}_page{page_index}.jp2"
|
||||||
|
with open(out_path, "wb") as f:
|
||||||
|
f.write(data)
|
||||||
|
else:
|
||||||
|
mode = "RGB"
|
||||||
|
colorspace = largest.get("/ColorSpace")
|
||||||
|
if colorspace in ("/DeviceGray",):
|
||||||
|
mode = "L"
|
||||||
|
width = int(largest.get("/Width", 0))
|
||||||
|
height = int(largest.get("/Height", 0))
|
||||||
|
try:
|
||||||
|
img = Image.frombytes(mode, (width, height), data)
|
||||||
|
except Exception:
|
||||||
|
# Best-effort decode via Pillow
|
||||||
|
img = Image.open(io.BytesIO(data))
|
||||||
|
img.save(out_path, format="PNG")
|
||||||
|
|
||||||
|
image_paths.append(out_path)
|
||||||
|
except Exception:
|
||||||
|
# Continue gracefully for problematic pages/objects
|
||||||
|
continue
|
||||||
|
return image_paths
|
||||||
|
except Exception as e:
|
||||||
|
raise ValueError(f"Could not extract images from PDF: {e}")
|
||||||
|
|
||||||
|
def _pdf_extract_text(self, pdf_path: str) -> list[str]:
|
||||||
|
"""Extract text per page using pdfplumber if available, else PyPDF2."""
|
||||||
|
texts: list[str] = []
|
||||||
|
try:
|
||||||
|
try:
|
||||||
|
import pdfplumber
|
||||||
|
|
||||||
|
with pdfplumber.open(pdf_path) as pdf:
|
||||||
|
for page in pdf.pages:
|
||||||
|
texts.append(page.extract_text() or "")
|
||||||
|
return texts
|
||||||
|
except Exception:
|
||||||
|
# Fallback to PyPDF2
|
||||||
|
reader = PdfReader(pdf_path)
|
||||||
|
for page in reader.pages: # type: ignore
|
||||||
|
texts.append(page.extract_text() or "")
|
||||||
|
return texts
|
||||||
|
except Exception as e:
|
||||||
|
raise ValueError(f"Could not extract text from PDF: {e}")
|
||||||
|
|
||||||
|
def _call_ollama_vision(self, prompt: str, image_base64: str) -> str:
|
||||||
|
payload = {
|
||||||
|
"model": self.model_name,
|
||||||
|
"prompt": prompt,
|
||||||
|
"stream": False,
|
||||||
|
"images": [image_base64],
|
||||||
|
}
|
||||||
|
response = requests.post(self.base_url, json=payload)
|
||||||
|
response.raise_for_status()
|
||||||
|
return response.json().get("response", "") # type: ignore
|
||||||
|
|
||||||
|
def _call_openai_vision(self, prompt: str, image_base64: str) -> str:
|
||||||
|
if not self.openai_api_key:
|
||||||
|
raise ValueError("OPENAI_API_KEY not set")
|
||||||
|
# Compose chat.completions payload for GPT-4o/mini vision
|
||||||
|
payload = {
|
||||||
|
"model": self.model_name or "gpt-4o-mini",
|
||||||
|
"messages": [
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": [
|
||||||
|
{"type": "text", "text": prompt},
|
||||||
|
{
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {
|
||||||
|
"url": f"data:image/jpeg;base64,{image_base64}",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
],
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"temperature": 0,
|
||||||
|
}
|
||||||
|
headers = {
|
||||||
|
"Authorization": f"Bearer {self.openai_api_key}",
|
||||||
|
"Content-Type": "application/json",
|
||||||
|
}
|
||||||
|
response = requests.post(self.openai_base_url, headers=headers, json=payload)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
try:
|
||||||
|
return data["choices"][0]["message"]["content"] # type: ignore
|
||||||
|
except Exception:
|
||||||
|
return json.dumps(data)
|
||||||
|
|
||||||
|
def _preprocess_image(self, image_path: str, language: str = "en") -> str:
|
||||||
|
"""
|
||||||
|
Preprocess image before OCR using Pillow + NumPy:
|
||||||
|
- Convert to grayscale
|
||||||
|
- Histogram equalization (contrast)
|
||||||
|
- Median denoise
|
||||||
|
- Otsu threshold and invert
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
with Image.open(image_path) as img:
|
||||||
|
if img.mode in ("RGBA", "LA"):
|
||||||
|
img = img.convert("RGB")
|
||||||
|
gray = img.convert("L")
|
||||||
|
|
||||||
|
# Histogram equalization via cumulative distribution
|
||||||
|
arr = np.asarray(gray)
|
||||||
|
hist, _ = np.histogram(arr.flatten(), 256, [0, 256]) # type: ignore
|
||||||
|
cdf = hist.cumsum()
|
||||||
|
cdf_masked = np.ma.masked_equal(cdf, 0) # type: ignore
|
||||||
|
cdf_min = cdf_masked.min() if cdf_masked.size else 0
|
||||||
|
cdf_max = cdf_masked.max() if cdf_masked.size else 0
|
||||||
|
if cdf_max == cdf_min:
|
||||||
|
eq = arr
|
||||||
|
else:
|
||||||
|
cdf_scaled = (cdf_masked - cdf_min) * 255 / (cdf_max - cdf_min)
|
||||||
|
lut = np.ma.filled(cdf_scaled, 0).astype("uint8")
|
||||||
|
eq = lut[arr]
|
||||||
|
|
||||||
|
eq_img = Image.fromarray(eq, mode="L")
|
||||||
|
# Median filter (3x3) to reduce noise
|
||||||
|
eq_img = eq_img.filter(ImageFilter.MedianFilter(size=3))
|
||||||
|
arr_eq = np.asarray(eq_img)
|
||||||
|
|
||||||
|
# Otsu threshold
|
||||||
|
hist2, _ = np.histogram(arr_eq, 256, [0, 256]) # type: ignore
|
||||||
|
total = arr_eq.size
|
||||||
|
sum_total = (np.arange(256) * hist2).sum()
|
||||||
|
sum_b = 0.0
|
||||||
|
w_b = 0.0
|
||||||
|
max_var = 0.0
|
||||||
|
thr = 0
|
||||||
|
for t in range(256):
|
||||||
|
w_b += hist2[t]
|
||||||
|
if w_b == 0:
|
||||||
|
continue
|
||||||
|
w_f = total - w_b
|
||||||
|
if w_f == 0:
|
||||||
|
break
|
||||||
|
sum_b += t * hist2[t]
|
||||||
|
m_b = sum_b / w_b
|
||||||
|
m_f = (sum_total - sum_b) / w_f
|
||||||
|
var_between = w_b * w_f * (m_b - m_f) ** 2
|
||||||
|
if var_between > max_var:
|
||||||
|
max_var = var_between
|
||||||
|
thr = t
|
||||||
|
|
||||||
|
binary = (arr_eq > thr).astype(np.uint8) * 255
|
||||||
|
# Invert: black text on white background
|
||||||
|
binary = 255 - binary
|
||||||
|
|
||||||
|
out_img = Image.fromarray(binary, mode="L")
|
||||||
|
preprocessed_path = f"{image_path}_preprocessed.jpg"
|
||||||
|
out_img.save(preprocessed_path, format="JPEG", quality=95)
|
||||||
|
return preprocessed_path
|
||||||
|
except Exception as e:
|
||||||
|
raise ValueError(f"Failed to preprocess image {image_path}: {e}")
|
||||||
|
|
||||||
|
def process_image(
|
||||||
|
self,
|
||||||
|
image_path: str,
|
||||||
|
format_type: str = "markdown",
|
||||||
|
preprocess: bool = True,
|
||||||
|
custom_prompt: str | None = None,
|
||||||
|
language: str = "en",
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
Process an image (or PDF) and extract text in the specified format
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to the image file or PDF file
|
||||||
|
format_type: One of ["markdown", "text", "json", "structured", "key_value", "table", "custom"]
|
||||||
|
preprocess: Whether to apply image preprocessing
|
||||||
|
custom_prompt: If provided, this prompt overrides the default based on format_type
|
||||||
|
language: Language code to apply language specific OCR preprocessing
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# If the input is a PDF, process all pages
|
||||||
|
if image_path.lower().endswith(".pdf"):
|
||||||
|
image_pages = self._pdf_to_images(image_path)
|
||||||
|
responses: list[str] = []
|
||||||
|
if image_pages:
|
||||||
|
for idx, page_file in enumerate(image_pages):
|
||||||
|
# Process each page with preprocessing if enabled
|
||||||
|
if preprocess:
|
||||||
|
preprocessed_path = self._preprocess_image(
|
||||||
|
page_file, language
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
preprocessed_path = page_file
|
||||||
|
|
||||||
|
image_base64 = self._encode_image(preprocessed_path)
|
||||||
|
|
||||||
|
if custom_prompt and custom_prompt.strip():
|
||||||
|
prompt = custom_prompt
|
||||||
|
else:
|
||||||
|
prompts = {
|
||||||
|
"markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
|
||||||
|
Format the output in markdown:
|
||||||
|
- Use headers (#, ##, ###) **only if they appear in the image**
|
||||||
|
- Preserve original lists (-, *, numbered lists) as they are
|
||||||
|
- Maintain all text formatting (bold, italics, underlines) exactly as seen
|
||||||
|
- **Do not add, interpret, or restructure any content**
|
||||||
|
""",
|
||||||
|
"text": f"""Extract all visible text from this image in {language} **without any changes**.
|
||||||
|
- **Do not summarize, paraphrase, or infer missing text.**
|
||||||
|
- Retain all spacing, punctuation, and formatting exactly as in the image.
|
||||||
|
- If text is unclear or partially visible, extract as much as possible without guessing.
|
||||||
|
- **Include all text, even if it seems irrelevant or repeated.**
|
||||||
|
""",
|
||||||
|
"json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
|
||||||
|
- **Do not summarize, add, or modify any text.**
|
||||||
|
- Maintain hierarchical sections and subsections as they appear.
|
||||||
|
- Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
|
||||||
|
- Include all text, even if fragmented, blurry, or unclear.
|
||||||
|
""",
|
||||||
|
"structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
|
||||||
|
- Identify and format tables **without altering content**.
|
||||||
|
- Preserve list structures (bulleted, numbered) **exactly as shown**.
|
||||||
|
- Maintain all section headings, indents, and alignments.
|
||||||
|
- **Do not add, infer, or restructure the content in any way.**
|
||||||
|
""",
|
||||||
|
"key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
|
||||||
|
- Identify and extract labels and their corresponding values without modification.
|
||||||
|
- Maintain the exact wording, punctuation, and order.
|
||||||
|
- Format each pair as 'key: value' **only if clearly structured that way in the image**.
|
||||||
|
- **Do not infer missing values or add any extra text.**
|
||||||
|
""",
|
||||||
|
"table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
|
||||||
|
- **Preserve the table structure** (rows, columns, headers) as closely as possible.
|
||||||
|
- **Do not add missing values or infer content**—if a cell is empty, leave it empty.
|
||||||
|
- Maintain all numerical, textual, and special character formatting.
|
||||||
|
- If the table contains merged cells, indicate them clearly without altering their meaning.
|
||||||
|
- Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
|
||||||
|
""",
|
||||||
|
}
|
||||||
|
prompt = prompts.get(format_type, prompts["text"])
|
||||||
|
|
||||||
|
# Route to chosen provider
|
||||||
|
if self.provider == "openai":
|
||||||
|
res = self._call_openai_vision(prompt, image_base64)
|
||||||
|
else:
|
||||||
|
res = self._call_ollama_vision(prompt, image_base64)
|
||||||
|
|
||||||
|
responses.append(f"Page {idx + 1}:\n{res}")
|
||||||
|
|
||||||
|
# Clean up temporary files
|
||||||
|
if preprocess and preprocessed_path.endswith(
|
||||||
|
"_preprocessed.jpg"
|
||||||
|
):
|
||||||
|
try:
|
||||||
|
os.remove(preprocessed_path)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
if page_file.endswith((".png", ".jpg", ".jp2")):
|
||||||
|
try:
|
||||||
|
os.remove(page_file)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
final_result = "\n".join(responses)
|
||||||
|
if format_type == "json":
|
||||||
|
try:
|
||||||
|
json_data = json.loads(final_result)
|
||||||
|
return json.dumps(json_data, indent=2)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
return final_result
|
||||||
|
return final_result
|
||||||
|
else:
|
||||||
|
# Fallback: no images found; extract raw text per page
|
||||||
|
text_pages = self._pdf_extract_text(image_path)
|
||||||
|
combined = []
|
||||||
|
for i, t in enumerate(text_pages):
|
||||||
|
combined.append(f"Page {i + 1}:\n{t}")
|
||||||
|
return "\n".join(combined)
|
||||||
|
|
||||||
|
# Process non-PDF images as before.
|
||||||
|
if preprocess:
|
||||||
|
image_path = self._preprocess_image(image_path, language)
|
||||||
|
|
||||||
|
image_base64 = self._encode_image(image_path)
|
||||||
|
|
||||||
|
# Clean up temporary files
|
||||||
|
if image_path.endswith(("_preprocessed.jpg", "_temp.jpg")):
|
||||||
|
os.remove(image_path)
|
||||||
|
|
||||||
|
if custom_prompt and custom_prompt.strip():
|
||||||
|
prompt = custom_prompt
|
||||||
|
print("Using custom prompt:", prompt)
|
||||||
|
else:
|
||||||
|
prompts = {
|
||||||
|
"markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
|
||||||
|
Format the output in markdown:
|
||||||
|
- Use headers (#, ##, ###) **only if they appear in the image**
|
||||||
|
- Preserve original lists (-, *, numbered lists) as they are
|
||||||
|
- Maintain all text formatting (bold, italics, underlines) exactly as seen
|
||||||
|
- **Do not add, interpret, or restructure any content**
|
||||||
|
""",
|
||||||
|
"text": f"""Extract all visible text from this image in {language} **without any changes**.
|
||||||
|
- **Do not summarize, paraphrase, or infer missing text.**
|
||||||
|
- Retain all spacing, punctuation, and formatting exactly as in the image.
|
||||||
|
- If text is unclear or partially visible, extract as much as possible without guessing.
|
||||||
|
- **Include all text, even if it seems irrelevant or repeated.**
|
||||||
|
""",
|
||||||
|
"json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
|
||||||
|
- **Do not summarize, add, or modify any text.**
|
||||||
|
- Maintain hierarchical sections and subsections as they appear.
|
||||||
|
- Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
|
||||||
|
- Include all text, even if fragmented, blurry, or unclear.
|
||||||
|
""",
|
||||||
|
"structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
|
||||||
|
- Identify and format tables **without altering content**.
|
||||||
|
- Preserve list structures (bulleted, numbered) **exactly as shown**.
|
||||||
|
- Maintain all section headings, indents, and alignments.
|
||||||
|
- **Do not add, infer, or restructure the content in any way.**
|
||||||
|
""",
|
||||||
|
"key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
|
||||||
|
- Identify and extract labels and their corresponding values without modification.
|
||||||
|
- Maintain the exact wording, punctuation, and order.
|
||||||
|
- Format each pair as 'key: value' **only if clearly structured that way in the image**.
|
||||||
|
- **Do not infer missing values or add any extra text.**
|
||||||
|
""",
|
||||||
|
"table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
|
||||||
|
- **Preserve the table structure** (rows, columns, headers) as closely as possible.
|
||||||
|
- **Do not add missing values or infer content**—if a cell is empty, leave it empty.
|
||||||
|
- Maintain all numerical, textual, and special character formatting.
|
||||||
|
- If the table contains merged cells, indicate them clearly without altering their meaning.
|
||||||
|
- Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
|
||||||
|
""",
|
||||||
|
}
|
||||||
|
prompt = prompts.get(format_type, prompts["text"])
|
||||||
|
print("Using default prompt:", prompt) # Debug print
|
||||||
|
|
||||||
|
# Call chosen provider with single image
|
||||||
|
if self.provider == "openai":
|
||||||
|
result = self._call_openai_vision(prompt, image_base64)
|
||||||
|
else:
|
||||||
|
result = self._call_ollama_vision(prompt, image_base64)
|
||||||
|
|
||||||
|
if format_type == "json":
|
||||||
|
try:
|
||||||
|
json_data = json.loads(result)
|
||||||
|
return json.dumps(json_data, indent=2)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
return str(result)
|
||||||
|
|
||||||
|
return str(result)
|
||||||
|
except Exception as e:
|
||||||
|
return f"Error processing image: {str(e)}"
|
||||||
|
|
||||||
|
def process_batch(
|
||||||
|
self,
|
||||||
|
input_path: str | list[str],
|
||||||
|
format_type: str = "markdown",
|
||||||
|
recursive: bool = False,
|
||||||
|
preprocess: bool = True,
|
||||||
|
custom_prompt: str | None = None,
|
||||||
|
language: str = "en",
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Process multiple images in batch
|
||||||
|
|
||||||
|
Args:
|
||||||
|
input_path: Path to directory or list of image paths
|
||||||
|
format_type: Output format type
|
||||||
|
recursive: Whether to search directories recursively
|
||||||
|
preprocess: Whether to apply image preprocessing
|
||||||
|
custom_prompt: If provided, this prompt overrides the default for each image
|
||||||
|
language: Language code to apply language specific OCR preprocessing
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with results and statistics
|
||||||
|
"""
|
||||||
|
# Collect all image paths
|
||||||
|
image_paths: list[str | Path] = []
|
||||||
|
if isinstance(input_path, str):
|
||||||
|
base_path = Path(input_path)
|
||||||
|
if base_path.is_dir():
|
||||||
|
pattern = "**/*" if recursive else "*"
|
||||||
|
for ext in [".png", ".jpg", ".jpeg", ".pdf", ".tiff"]:
|
||||||
|
image_paths.extend(base_path.glob(f"{pattern}{ext}"))
|
||||||
|
else:
|
||||||
|
image_paths = [base_path]
|
||||||
|
else:
|
||||||
|
image_paths = [Path(p) for p in input_path]
|
||||||
|
|
||||||
|
results = {}
|
||||||
|
errors = {}
|
||||||
|
|
||||||
|
# Process images in parallel
|
||||||
|
with concurrent.futures.ThreadPoolExecutor(
|
||||||
|
max_workers=self.max_workers
|
||||||
|
) as executor:
|
||||||
|
future_to_path = {
|
||||||
|
executor.submit(
|
||||||
|
self.process_image,
|
||||||
|
str(path),
|
||||||
|
format_type,
|
||||||
|
preprocess,
|
||||||
|
custom_prompt,
|
||||||
|
language,
|
||||||
|
): path
|
||||||
|
for path in image_paths
|
||||||
|
}
|
||||||
|
|
||||||
|
for future in concurrent.futures.as_completed(future_to_path):
|
||||||
|
path = future_to_path[future]
|
||||||
|
try:
|
||||||
|
results[str(path)] = future.result()
|
||||||
|
except Exception as e:
|
||||||
|
errors[str(path)] = str(e)
|
||||||
|
# pbar.update(1)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"results": results,
|
||||||
|
"errors": errors,
|
||||||
|
"statistics": {
|
||||||
|
"total": len(image_paths),
|
||||||
|
"successful": len(results),
|
||||||
|
"failed": len(errors),
|
||||||
|
},
|
||||||
|
}
|
||||||
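# Usage sketch (not part of the module; file paths, model name and worker count are placeholder assumptions):
#   processor = OCRProcessor(provider="ollama", model_name="llama3.2-vision:11b", max_workers=2)
#   markdown = processor.process_image("statement.pdf", format_type="markdown")
#   batch = processor.process_batch("./scans", format_type="text", recursive=True)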
@@ -1,13 +1,13 @@
|
|||||||
# Core framework dependencies (Required by all services)
|
# Core framework dependencies (Required by all services)
|
||||||
fastapi>=0.118.0
|
fastapi>=0.119.0
|
||||||
uvicorn[standard]>=0.37.0
|
uvicorn[standard]>=0.37.0
|
||||||
pydantic>=2.11.9
|
pydantic>=2.12.0
|
||||||
pydantic-settings>=2.11.0
|
pydantic-settings>=2.11.0
|
||||||
|
|
||||||
# Database drivers (lightweight)
|
# Database drivers (lightweight)
|
||||||
sqlalchemy>=2.0.43
|
sqlalchemy>=2.0.44
|
||||||
asyncpg>=0.30.0
|
asyncpg>=0.30.0
|
||||||
psycopg2-binary>=2.9.10
|
psycopg2-binary>=2.9.11
|
||||||
neo4j>=6.0.2
|
neo4j>=6.0.2
|
||||||
redis[hiredis]>=6.4.0
|
redis[hiredis]>=6.4.0
|
||||||
|
|
||||||
|
|||||||
@@ -3,3 +3,4 @@ pdfrw>=0.4
|
|||||||
reportlab>=4.4.4
|
reportlab>=4.4.4
|
||||||
PyPDF2>=3.0.1
|
PyPDF2>=3.0.1
|
||||||
pdfplumber>=0.11.7
|
pdfplumber>=0.11.7
|
||||||
|
opencv-python
|
||||||
|
|||||||
@@ -79,7 +79,7 @@ class StorageClient:
         """Download object from bucket"""
         try:
             response = self.client.get_object(bucket_name, object_name)
-            data = response.read()
+            data: bytes = response.read()
             response.close()
             response.release_conn()

@@ -89,7 +89,7 @@ class StorageClient:
                 object=object_name,
                 size=len(data),
             )
-            return data  # type: ignore
+            return data

         except S3Error as e:
             logger.error(
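The annotation change above is what lets the `# type: ignore` go away: minio's response object is untyped, so `read()` comes back as `Any`, and pinning the local variable gives mypy a concrete return type. A standalone sketch of the same pattern (not the project's `StorageClient`, just the stock `minio` API):

```python
# Sketch only: annotate the local to turn Any into bytes under mypy --strict.
from minio import Minio


def download_object(client: Minio, bucket: str, key: str) -> bytes:
    response = client.get_object(bucket, key)
    try:
        data: bytes = response.read()  # explicit annotation, no type: ignore needed
        return data
    finally:
        # the underlying urllib3 response must be closed and returned to the pool
        response.close()
        response.release_conn()
```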
|||||||
4
mypy.ini
4
mypy.ini
@@ -18,3 +18,7 @@ disallow_untyped_defs = False
|
|||||||
|
|
||||||
[mypy-minio.*]
|
[mypy-minio.*]
|
||||||
ignore_missing_imports = True
|
ignore_missing_imports = True
|
||||||
|
|
||||||
|
[mypy-pytesseract.*]
|
||||||
|
follow_untyped_imports = True
|
||||||
|
ignore_missing_imports = True
|
||||||
|
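For context, this is the kind of call the new `[mypy-pytesseract.*]` override keeps quiet under `mypy --strict`, since pytesseract ships without type stubs (a minimal sketch, not project code):

```python
# Without the ini override, strict mypy flags the pytesseract import as an
# untyped/missing-stub module; with it, this checks cleanly.
import pytesseract
from PIL import Image


def ocr_page(path: str, lang: str = "eng") -> str:
    text: str = pytesseract.image_to_string(Image.open(path), lang=lang)
    return text
```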
@@ -54,11 +54,20 @@ dependencies = [
     "pytesseract>=0.3.10",
     "Pillow>=10.1.0",
     "playwright>=1.40.0",
-    "pyshaql>=0.25.0",
+    "pyshacl>=0.25.0",
     "rdflib>=7.0.0",
     "spacy>=3.7.0",
     "presidio-analyzer>=2.2.0",
     "presidio-anonymizer>=2.2.0",
+    "jsonschema>=4.0.0",
+    "boto3>=1.0.0",
+    "aiokafka>=0.8.0",
+    "hvac>=1.0.0",
+    "nats-py>=2.0.0",
+    "pydantic-settings>=2.0.0",
+    "opentelemetry-exporter-otlp>=1.0.0",
+    "opentelemetry-instrumentation-psycopg2>=0.42b0",
+    "opentelemetry-instrumentation-redis>=0.42b0",
 ]

 [project.optional-dependencies]
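One of the newly declared dependencies, `pydantic-settings`, backs the per-service configuration layer with an `.env` overlay. A minimal sketch of that pattern (field names, prefix, and defaults are illustrative, not the repository's actual `Settings` class):

```python
# Minimal pydantic-settings sketch; names and defaults are assumptions.
from pydantic_settings import BaseSettings, SettingsConfigDict


class ServiceSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_prefix="SVC_")

    db_url: str = "postgresql+asyncpg://localhost/app"
    s3_bucket_raw: str = "raw"
    vault_addr: str = "http://vault:8200"


settings = ServiceSettings()  # values come from SVC_* env vars / .env, else the defaults
```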
@@ -56,6 +56,10 @@ numpy>=2.3.3
 # PDF processing
 pdfrw>=0.4
 reportlab>=4.4.4
+PyPDF2>=3.0.1
+pdf2image>=1.17.0
+pytesseract>=0.3.10
+Pillow>=10.3.0

 # Date and time utilities
 python-dateutil>=2.9.0
@@ -94,3 +98,4 @@ black>=25.9.0
 isort>=6.0.1
 bandit>=1.8.6
 safety>=3.6.2
+opencv-python
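The PDF/OCR packages added above are typically combined like this; a rough sketch (paths and DPI are illustrative, and the poppler and tesseract binaries must be present on the host image), not the service's actual pipeline:

```python
# Sketch of what pdf2image + pytesseract enable together.
from pdf2image import convert_from_path
import pytesseract


def ocr_pdf_pages(pdf_path: str, lang: str = "eng") -> list[str]:
    pages = convert_from_path(pdf_path, dpi=300)  # render each PDF page to a PIL image
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]
```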
@@ -7,9 +7,9 @@ set -e

 # Configuration
 REMOTE_HOST="deploy@141.136.35.199"
-REMOTE_PATH="/opt/compose/ai-tax-agent"
-LOCAL_COMPOSE_PATH="infra/compose/production"
-ENV_FILE="infra/compose/.env.production"
+REMOTE_PATH="/opt/ai-tax-agent"
+LOCAL_COMPOSE_PATH="infra/base"
+ENV_FILE="infra/environments/production/.env"

 # Colors for output
 RED='\033[0;31m'
@@ -66,13 +66,15 @@ backup_remote() {
     ssh $REMOTE_HOST << 'EOF'
 set -e
 mkdir -p ~/backups
-cd /opt/compose
+cd /opt

-# Backup compose directory (exclude large cert files)
-tar -czf ~/backups/backup-$(date +%Y%m%d-%H%M%S).tar.gz \
-    --exclude='./traefik/certs/godaddy-acme.json' \
-    --exclude='./*/node_modules' \
-    .
+# Backup application directory (exclude large cert files)
+if [ -d ai-tax-agent ]; then
+    tar -czf ~/backups/backup-$(date +%Y%m%d-%H%M%S).tar.gz \
+        --exclude='./traefik/certs/godaddy-acme.json' \
+        --exclude='./*/node_modules' \
+        ai-tax-agent
+fi

 # Document current state
 docker ps > ~/backups/current-services-$(date +%Y%m%d-%H%M%S).txt
@@ -100,6 +102,9 @@ prepare_remote() {
     mkdir -p $REMOTE_PATH/grafana/provisioning
     mkdir -p $REMOTE_PATH/grafana/dashboards
     mkdir -p $REMOTE_PATH/loki
+    mkdir -p $REMOTE_PATH/promtail
+    mkdir -p $REMOTE_PATH/traefik/config
+    mkdir -p $REMOTE_PATH/authentik

     echo "Directory structure created"
     ls -la $REMOTE_PATH
@@ -110,7 +115,7 @@ EOF

 # Copy files to remote server
 copy_files() {
-    log_info "Copying compose files to remote server..."
+    log_info "Copying base compose files and configs to remote server..."

     # Copy compose files
     scp $LOCAL_COMPOSE_PATH/infrastructure.yaml $REMOTE_HOST:$REMOTE_PATH/
@@ -121,10 +126,13 @@ copy_files() {
     scp $ENV_FILE $REMOTE_HOST:$REMOTE_PATH/.env

     # Copy configuration files
-    scp -r infra/compose/prometheus/* $REMOTE_HOST:$REMOTE_PATH/prometheus/
-    scp -r infra/compose/grafana/provisioning/* $REMOTE_HOST:$REMOTE_PATH/grafana/provisioning/
-    scp -r infra/compose/grafana/dashboards/* $REMOTE_HOST:$REMOTE_PATH/grafana/dashboards/
-    scp -r infra/compose/loki/* $REMOTE_HOST:$REMOTE_PATH/loki/
+    scp -r $LOCAL_COMPOSE_PATH/prometheus/* $REMOTE_HOST:$REMOTE_PATH/prometheus/
+    scp -r $LOCAL_COMPOSE_PATH/grafana/provisioning/* $REMOTE_HOST:$REMOTE_PATH/grafana/provisioning/
+    scp -r $LOCAL_COMPOSE_PATH/grafana/dashboards/* $REMOTE_HOST:$REMOTE_PATH/grafana/dashboards/
+    scp -r $LOCAL_COMPOSE_PATH/loki/* $REMOTE_HOST:$REMOTE_PATH/loki/
+    scp -r $LOCAL_COMPOSE_PATH/promtail/* $REMOTE_HOST:$REMOTE_PATH/promtail/ 2>/dev/null || true
+    scp -r $LOCAL_COMPOSE_PATH/traefik/config/* $REMOTE_HOST:$REMOTE_PATH/traefik/config/ 2>/dev/null || true
+    scp -r $LOCAL_COMPOSE_PATH/authentik/* $REMOTE_HOST:$REMOTE_PATH/authentik/ 2>/dev/null || true

     log_success "Files copied to remote server"
 }
@@ -1,555 +1,4 @@
-# ROLE
-
-You are a **Senior Platform Engineer + Backend Lead** generating **production code** and **ops assets** for a microservice suite that powers an accounting Knowledge Graph + Vector RAG platform. Authentication/authorization are centralized at the **edge via Traefik + Authentik** (ForwardAuth). **Services are trust-bound** to Traefik and consume user/role claims via forwarded headers/JWT.
-
-# MISSION
-
-Produce fully working code for **all application services** (FastAPI + Python 3.12) with:
-
-- Solid domain models, Pydantic v2 schemas, type hints, strict mypy, ruff lint.
-- Opentelemetry tracing, Prometheus metrics, structured logging.
-- Vault-backed secrets, MinIO S3 client, Qdrant client, Neo4j driver, Postgres (SQLAlchemy), Redis.
-- Eventing (Kafka or SQS/SNS behind an interface).
-- Deterministic data contracts, end-to-end tests, Dockerfiles, Compose, CI for Gitea.
-- Traefik labels + Authentik Outpost integration for every exposed route.
-- Zero PII in vectors (Qdrant), evidence-based lineage in KG, and bitemporal writes.
-
-# GLOBAL CONSTRAINTS (APPLY TO ALL SERVICES)
-
-- **Language & Runtime:** Python **3.12**.
-- **Frameworks:** FastAPI, Pydantic v2, SQLAlchemy 2, httpx, aiokafka or boto3 (pluggable), redis-py, opentelemetry-instrumentation-fastapi, prometheus-fastapi-instrumentator.
-- **Config:** `pydantic-settings` with `.env` overlay. Provide `Settings` class per service.
-- **Secrets:** HashiCorp **Vault** (AppRole/JWT). Use Vault Transit to **envelope-encrypt** sensitive fields before persistence (helpers provided in `lib/security.py`).
-- **Auth:** No OIDC in services. Add `TrustedProxyMiddleware`:
-  - Reject if request not from internal network (configurable CIDR).
-  - Require headers set by Traefik+Authentik (`X-Authenticated-User`, `X-Authenticated-Email`, `X-Authenticated-Groups`, `Authorization: Bearer …`).
-  - Parse groups → `roles` list on `request.state`.
-- **Observability:**
-  - OpenTelemetry (traceparent propagation), span attrs (service, route, user, tenant).
-  - Prometheus metrics endpoint `/metrics` protected by internal network check.
-  - Structured JSON logs (timestamp, level, svc, trace_id, msg) via `structlog`.
-- **Errors:** Global exception handler → RFC7807 Problem+JSON (`type`, `title`, `status`, `detail`, `instance`, `trace_id`).
-- **Testing:** `pytest`, `pytest-asyncio`, `hypothesis` (property tests for calculators), `coverage ≥ 90%` per service.
-- **Static:** `ruff`, `mypy --strict`, `bandit`, `safety`, `licensecheck`.
-- **Perf:** Each service exposes `/healthz`, `/readyz`, `/livez`; cold start < 500ms; p95 endpoint < 250ms (local).
-- **Containers:** Distroless or slim images; non-root user; read-only FS; `/tmp` mounted for OCR where needed.
-- **Docs:** OpenAPI JSON + ReDoc; MkDocs site with service READMEs.
-
-# SHARED LIBS (GENERATE ONCE, REUSE)
-
-Create `libs/` used by all services:
-
-- `libs/config.py` – base `Settings`, env parsing, Vault client factory, MinIO client factory, Qdrant client factory, Neo4j driver factory, Redis factory, Kafka/SQS client factory.
-- `libs/security.py` – Vault Transit helpers (`encrypt_field`, `decrypt_field`), header parsing, internal-CIDR validator.
-- `libs/observability.py` – otel init, prometheus instrumentor, logging config.
-- `libs/events.py` – abstract `EventBus` with `publish(topic, payload: dict)`, `subscribe(topic, handler)`. Two impls: Kafka (`aiokafka`) and SQS/SNS (`boto3`).
-- `libs/schemas.py` – **canonical Pydantic models** shared across services (Document, Evidence, IncomeItem, etc.) mirroring the ontology schemas. Include JSONSchema exports.
-- `libs/storage.py` – S3/MinIO helpers (bucket ensure, put/get, presigned).
-- `libs/neo.py` – Neo4j session helpers, Cypher runner with retry, SHACL validator invoker (pySHACL on exported RDF).
-- `libs/rag.py` – Qdrant collections CRUD, hybrid search (dense+sparse), rerank wrapper, de-identification utilities (regex + NER; hash placeholders).
-- `libs/forms.py` – PDF AcroForm fill via `pdfrw` with overlay fallback via `reportlab`.
-- `libs/calibration.py` – `calibrated_confidence(raw_score, method="temperature_scaling", params=...)`.
-
-# EVENT TOPICS (STANDARDIZE)
-
-- `doc.ingested`, `doc.ocr_ready`, `doc.extracted`, `kg.upserted`, `rag.indexed`, `calc.schedule_ready`, `form.filled`, `hmrc.submitted`, `review.requested`, `review.completed`, `firm.sync.completed`
-
-Each payload MUST include: `event_id (ulid)`, `occurred_at (iso)`, `actor`, `tenant_id`, `trace_id`, `schema_version`, and a `data` object (service-specific).
-
-# TRUST HEADERS FROM TRAEFIK + AUTHENTIK (USE EXACT KEYS)
-
-- `X-Authenticated-User` (string)
-- `X-Authenticated-Email` (string)
-- `X-Authenticated-Groups` (comma-separated)
-- `Authorization` (`Bearer <jwt>` from Authentik)
-
-Reject any request missing these (except `/healthz|/readyz|/livez|/metrics` from internal CIDR).
-
----
-
-## SERVICES TO IMPLEMENT (CODE FOR EACH)
-
-### 1) `svc-ingestion`
-
-**Purpose:** Accept uploads or URLs, checksum, store to MinIO, emit `doc.ingested`.
-
-**Endpoints:**
-
-- `POST /v1/ingest/upload` (multipart file, metadata: `tenant_id`, `kind`, `source`) → `{doc_id, s3_url, checksum}`
-- `POST /v1/ingest/url` (json: `{url, kind, tenant_id}`) → downloads to MinIO
-- `GET /v1/docs/{doc_id}` → metadata
-
-**Logic:**
-
-- Compute SHA256, dedupe by checksum; MinIO path `tenants/{tenant_id}/raw/{doc_id}.pdf`.
-- Store metadata in Postgres table `ingest_documents` (alembic migrations).
-- Publish `doc.ingested` with `{doc_id, bucket, key, pages?, mime}`.
-
-**Env:** `S3_BUCKET_RAW`, `MINIO_*`, `DB_URL`.
-
-**Traefik labels:** route `/ingest/*`.
-
----
-
-### 2) `svc-rpa`
-
-**Purpose:** Scheduled RPA pulls from firm/client portals via Playwright.
-
-**Tasks:**
-
-- Playwright login flows (credentials from Vault), 2FA via Authentik OAuth device or OTP secret in Vault.
-- Download statements/invoices; hand off to `svc-ingestion` via internal POST.
-- Prefect flows: `pull_portal_X()`, `pull_portal_Y()` with schedules.
-
-**Endpoints:**
-
-- `POST /v1/rpa/run/{connector}` (manual trigger)
-- `GET /v1/rpa/status/{run_id}`
-
-**Env:** `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID`.
-
----
-
-### 3) `svc-ocr`
-
-**Purpose:** OCR & layout extraction.
-
-**Pipeline:**
-
-- Pull object from MinIO, detect rotation/de-skew (`opencv-python`), split pages (`pymupdf`), OCR (`pytesseract`) or bypass if text layer present (`pdfplumber`).
-- Output per-page text + **bbox** for lines/words.
-- Write JSON to MinIO `tenants/{tenant_id}/ocr/{doc_id}.json` and emit `doc.ocr_ready`.
-
-**Endpoints:**
-
-- `POST /v1/ocr/{doc_id}` (idempotent trigger)
-- `GET /v1/ocr/{doc_id}` (fetch OCR JSON)
-
-**Env:** `TESSERACT_LANGS`, `S3_BUCKET_EVIDENCE`.
-
----
-
-### 4) `svc-extract`
-
-**Purpose:** Classify docs and extract KV + tables into **schema-constrained JSON** (with bbox/page).
-
-**Endpoints:**
-
-- `POST /v1/extract/{doc_id}` body: `{strategy: "llm|rules|hybrid"}`
-- `GET /v1/extract/{doc_id}` → structured JSON
-
-**Implementation:**
-
-- Use prompt files in `prompts/`: `doc_classify.txt`, `kv_extract.txt`, `table_extract.txt`.
-- **Validator loop**: run LLM → validate JSONSchema → retry with error messages up to N times.
-- Return Pydantic models from `libs/schemas.py`.
-- Emit `doc.extracted`.
-
-**Env:** `LLM_ENGINE`, `TEMPERATURE`, `MAX_TOKENS`.
-
----
-
-### 5) `svc-normalize-map`
-
-**Purpose:** Normalize & map extracted data to KG.
-
-**Logic:**
-
-- Currency normalization (ECB or static fx table), dates, UK tax year/basis period inference.
-- Entity resolution (blocking + fuzzy).
-- Generate nodes/edges (+ `Evidence` with doc_id/page/bbox/text_hash).
-- Use `libs/neo.py` to write with **bitemporal** fields; run **SHACL** validator; on violation, queue `review.requested`.
-- Emit `kg.upserted`.
-
-**Endpoints:**
-
-- `POST /v1/map/{doc_id}`
-- `GET /v1/map/{doc_id}/preview` (diff view, to be used by UI)
-
-**Env:** `NEO4J_*`.
-
----
-
-### 6) `svc-kg`
-
-**Purpose:** Graph façade + RDF/SHACL utility.
-
-**Endpoints:**
-
-- `GET /v1/kg/nodes/{label}/{id}`
-- `POST /v1/kg/cypher` (admin-gated inline query; must check `admin` role)
-- `POST /v1/kg/export/rdf` (returns RDF for SHACL)
-- `POST /v1/kg/validate` (run pySHACL against `schemas/shapes.ttl`)
-- `GET /v1/kg/lineage/{node_id}` (traverse `DERIVED_FROM` → Evidence)
-
-**Env:** `NEO4J_*`.
-
----
-
-### 7) `svc-rag-indexer`
-
-**Purpose:** Build Qdrant indices (firm knowledge, legislation, best practices, glossary).
-
-**Workflow:**
-
-- Load sources (filesystem, URLs, Firm DMS via `svc-firm-connectors`).
-- **De-identify PII** (regex + NER), replace with placeholders; store mapping only in Postgres.
-- Chunk (layout-aware) per `retrieval/chunking.yaml`.
-- Compute **dense** embeddings (e.g., `bge-small-en-v1.5`) and **sparse** (Qdrant sparse).
-- Upsert to Qdrant with payload `{jurisdiction, tax_years[], topic_tags[], version, pii_free: true, doc_id/section_id/url}`.
-- Emit `rag.indexed`.
-
-**Endpoints:**
-
-- `POST /v1/index/run`
-- `GET /v1/index/status/{run_id}`
-
-**Env:** `QDRANT_URL`, `RAG_EMBEDDING_MODEL`, `RAG_RERANKER_MODEL`.
-
----
-
-### 8) `svc-rag-retriever`
-
-**Purpose:** Hybrid search + KG fusion with rerank and calibrated confidence.
-
-**Endpoint:**
-
-- `POST /v1/rag/search` `{query, tax_year?, jurisdiction?, k?}` →
-
-```
-{
-  "chunks": [...],
-  "citations": [{doc_id|url, section_id?, page?, bbox?}],
-  "kg_hints": [{rule_id, formula_id, node_ids[]}],
-  "calibrated_confidence": 0.0-1.0
-}
-```
-
-**Implementation:**
-
-- Hybrid score: `alpha * dense + beta * sparse`; rerank top-K via cross-encoder; **KG fusion** (boost chunks citing Rules/Calculations relevant to schedule).
-- Use `libs/calibration.py` to expose calibrated confidence.
-
----
-
-### 9) `svc-reason`
-
-**Purpose:** Deterministic calculators + materializers (UK SA).
-
-**Endpoints:**
-
-- `POST /v1/reason/compute_schedule` `{tax_year, taxpayer_id, schedule_id}`
-- `GET /v1/reason/explain/{schedule_id}` → rationale & lineage paths
-
-**Implementation:**
-
-- Pure functions for: employment, self-employment, property (FHL, 20% interest credit), dividends/interest, allowances, NIC (Class 2/4), HICBC, student loans (Plans 1/2/4/5, PGL).
-- **Deterministic order** as defined; rounding per `FormBox.rounding_rule`.
-- Use Cypher from `kg/reasoning/schedule_queries.cypher` to materialize box values; attach `DERIVED_FROM` evidence.
-
----
-
-### 10) `svc-forms`
-
-**Purpose:** Fill PDFs and assemble evidence bundles.
-
-**Endpoints:**
-
-- `POST /v1/forms/fill` `{tax_year, taxpayer_id, form_id}` → returns PDF (binary)
-- `POST /v1/forms/evidence_pack` `{scope}` → ZIP + manifest + signed hashes (sha256)
-
-**Implementation:**
-
-- `pdfrw` for AcroForm; overlay with ReportLab if needed.
-- Manifest includes `doc_id/page/bbox/text_hash` for every numeric field.
-
----
-
-### 11) `svc-hmrc`
-
-**Purpose:** HMRC submitter (stub|sandbox|live).
-
-**Endpoints:**
-
-- `POST /v1/hmrc/submit` `{tax_year, taxpayer_id, dry_run}` → `{status, submission_id?, errors[]}`
-- `GET /v1/hmrc/submissions/{id}`
-
-**Implementation:**
-
-- Rate limits, retries/backoff, signed audit log; environment toggle.
-
----
-
-### 12) `svc-firm-connectors`
-
-**Purpose:** Read-only connectors to Firm Databases (Practice Mgmt, DMS).
-
-**Endpoints:**
-
-- `POST /v1/firm/sync` `{since?}` → `{objects_synced, errors[]}`
-- `GET /v1/firm/objects` (paged)
-
-**Implementation:**
-
-- Data contracts in `config/firm_contracts/`; mappers → Secure Client Data Store (Postgres) with lineage columns (`source`, `source_id`, `synced_at`).
-
----
-
-### 13) `ui-review` (outline only)
-
-- Next.js (SSO handled by Traefik+Authentik), shows extracted fields + evidence snippets; POST overrides to `svc-extract`/`svc-normalize-map`.
-
----
-
-## DATA CONTRACTS (ESSENTIAL EXAMPLES)
-
-**Event: `doc.ingested`**
-
-```json
-{
-  "event_id": "01J...ULID",
-  "occurred_at": "2025-09-13T08:00:00Z",
-  "actor": "svc-ingestion",
-  "tenant_id": "t_123",
-  "trace_id": "abc-123",
-  "schema_version": "1.0",
-  "data": {
-    "doc_id": "d_abc",
-    "bucket": "raw",
-    "key": "tenants/t_123/raw/d_abc.pdf",
-    "checksum": "sha256:...",
-    "kind": "bank_statement",
-    "mime": "application/pdf",
-    "pages": 12
-  }
-}
-```
-
-**RAG search response shape**
-
-```json
-{
-  "chunks": [
-    {
-      "id": "c1",
-      "text": "...",
-      "score": 0.78,
-      "payload": {
-        "jurisdiction": "UK",
-        "tax_years": ["2024-25"],
-        "topic_tags": ["FHL"],
-        "pii_free": true
-      }
-    }
-  ],
-  "citations": [
-    { "doc_id": "leg-ITA2007", "section_id": "s272A", "url": "https://..." }
-  ],
-  "kg_hints": [
-    {
-      "rule_id": "UK.FHL.Qual",
-      "formula_id": "FHL_Test_v1",
-      "node_ids": ["n123", "n456"]
-    }
-  ],
-  "calibrated_confidence": 0.81
-}
-```
-
----
-
-## PERSISTENCE SCHEMAS (POSTGRES; ALEMBIC)
-
-- `ingest_documents(id pk, tenant_id, doc_id, kind, checksum, bucket, key, mime, pages, created_at)`
-- `firm_objects(id pk, tenant_id, source, source_id, type, payload jsonb, synced_at)`
-- Qdrant PII mapping table (if absolutely needed): `pii_links(id pk, placeholder_hash, client_id, created_at)` — **encrypt with Vault Transit**; do NOT store raw values.
-
----
-
-## TRAEFIK + AUTHENTIK (COMPOSE LABELS PER SERVICE)
-
-For every service container in `infra/compose/docker-compose.local.yml`, add labels:
-
-```
-- "traefik.enable=true"
-- "traefik.http.routers.svc-extract.rule=Host(`api.local`) && PathPrefix(`/extract`)"
-- "traefik.http.routers.svc-extract.entrypoints=websecure"
-- "traefik.http.routers.svc-extract.tls=true"
-- "traefik.http.routers.svc-extract.middlewares=authentik-forwardauth,rate-limit"
-- "traefik.http.services.svc-extract.loadbalancer.server.port=8000"
-```
-
-Use the shared dynamic file `traefik-dynamic.yml` with `authentik-forwardauth` and `rate-limit` middlewares.
-
----
-
-## OUTPUT FORMAT (STRICT)
-
-Implement a **multi-file codebase** as fenced blocks, EXACTLY in this order:
-
-```txt
-# FILE: libs/config.py
-# factories for Vault/MinIO/Qdrant/Neo4j/Redis/EventBus, Settings base
-...
-```
-```txt
-# FILE: libs/security.py
-# Vault Transit helpers, header parsing, internal CIDR checks, middleware
-...
-```
-```txt
-# FILE: libs/observability.py
-# otel init, prometheus, structlog
-...
-```
-```txt
-# FILE: libs/events.py
-# EventBus abstraction with Kafka and SQS/SNS impls
-...
-```
-```txt
-# FILE: libs/schemas.py
-# Shared Pydantic models mirroring ontology entities
-...
-```
-```txt
-# FILE: apps/svc-ingestion/main.py
-# FastAPI app, endpoints, MinIO write, Postgres, publish doc.ingested
-...
-```
-```txt
-# FILE: apps/svc-rpa/main.py
-# Playwright flows, Prefect tasks, triggers
-...
-```
-```txt
-# FILE: apps/svc-ocr/main.py
-# OCR pipeline, endpoints
-...
-```
-```txt
-# FILE: apps/svc-extract/main.py
-# Classifier + extractors with validator loop
-...
-```
-```txt
-# FILE: apps/svc-normalize-map/main.py
-# normalization, entity resolution, KG mapping, SHACL validation call
-...
-```
-```txt
-# FILE: apps/svc-kg/main.py
-# KG façade, RDF export, SHACL validate, lineage traversal
-...
-```
-```txt
-# FILE: apps/svc-rag-indexer/main.py
-# chunk/de-id/embed/upsert to Qdrant
-...
-```
-```txt
-# FILE: apps/svc-rag-retriever/main.py
-# hybrid retrieval + rerank + KG fusion
-...
-```
-```txt
-# FILE: apps/svc-reason/main.py
-# deterministic calculators, schedule compute/explain
-...
-```
-```txt
-# FILE: apps/svc-forms/main.py
-# PDF fill + evidence pack
-...
-```
-```txt
-# FILE: apps/svc-hmrc/main.py
-# submit stub|sandbox|live with audit + retries
-...
-```
-```txt
-# FILE: apps/svc-firm-connectors/main.py
-# connectors to practice mgmt & DMS, sync to Postgres
-...
-```
-```txt
-# FILE: infra/compose/docker-compose.local.yml
-# Traefik, Authentik, Vault, MinIO, Qdrant, Neo4j, Postgres, Redis, Prom+Grafana, Loki, Unleash, all services
-...
-```
-```txt
-# FILE: infra/compose/traefik.yml
-# static Traefik config
-...
-```
-```txt
-# FILE: infra/compose/traefik-dynamic.yml
-# forwardAuth middleware + routers/services
-...
-```
-```txt
-# FILE: .gitea/workflows/ci.yml
-# lint->test->build->scan->push->deploy
-...
-```
-```txt
-# FILE: Makefile
-# bootstrap, run, test, lint, build, deploy, format, seed
-...
-```
-```txt
-# FILE: tests/e2e/test_happy_path.py
-# end-to-end: ingest -> ocr -> extract -> map -> compute -> fill -> (stub) submit
-...
-```
-```txt
-# FILE: tests/unit/test_calculators.py
-# boundary tests for UK SA logic (NIC, HICBC, PA taper, FHL)
-...
-```
-```txt
-# FILE: README.md
-# how to run locally with docker-compose, Authentik setup, Traefik certs
-...
-```
-
-## DEFINITION OF DONE
-
-- `docker compose up` brings the full stack up; SSO via Authentik; routes secured via Traefik ForwardAuth.
-- Running `pytest` yields ≥ 90% coverage; `make e2e` passes the ingest→…→submit stub flow.
-- All services expose `/healthz|/readyz|/livez|/metrics`; OpenAPI at `/docs`.
-- No PII stored in Qdrant; vectors carry `pii_free=true`.
-- KG writes are SHACL-validated; violations produce `review.requested` events.
-- Evidence lineage is present for every numeric box value.
-- Gitea pipeline passes: lint, test, build, scan, push, deploy.
-
-# START
-
-Generate the full codebase and configs in the **exact file blocks and order** specified above.
+import pytest
+
+def test_happy_path():
+    pass
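The replacement file above is only a placeholder. A hedged sketch of how the happy-path e2e test described in the removed spec (ingest → OCR → extract → map → compute → fill → stub submit) might eventually be wired up; the endpoint paths come from the spec, while the base URL, token, tenant, tax year, and schedule id are assumptions:

```python
# Sketch only: not runnable against the stack as-is; every literal value is illustrative.
import httpx
import pytest

BASE = "https://api.local"                       # assumed edge URL behind Traefik/Authentik
HEADERS = {"Authorization": "Bearer <token>"}    # assumed; normally injected at the edge


@pytest.mark.skip(reason="placeholder sketch, not wired to a running stack")
def test_happy_path() -> None:
    with httpx.Client(base_url=BASE, headers=HEADERS) as client:
        # ingest a fixture document
        resp = client.post(
            "/v1/ingest/upload",
            files={"file": ("stmt.pdf", b"%PDF-1.7 ...", "application/pdf")},
            data={"tenant_id": "t_123", "kind": "bank_statement", "source": "fixture"},
        )
        doc_id = resp.json()["doc_id"]

        # walk the pipeline end to end
        client.post(f"/v1/ocr/{doc_id}")
        client.post(f"/v1/extract/{doc_id}", json={"strategy": "hybrid"})
        client.post(f"/v1/map/{doc_id}")
        client.post(
            "/v1/reason/compute_schedule",
            json={"tax_year": "2024-25", "taxpayer_id": "tp_1", "schedule_id": "SA105"},
        )
        submit = client.post(
            "/v1/hmrc/submit",
            json={"tax_year": "2024-25", "taxpayer_id": "tp_1", "dry_run": True},
        )
        assert submit.status_code == 200
```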