full ingestion -> OCR -> extraction flow is now working correctly.
Some checks failed
CI/CD Pipeline / Build Docker Images (svc-rpa) (push) Has been cancelled
CI/CD Pipeline / Code Quality & Linting (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Policy Validation (push) Has been cancelled
CI/CD Pipeline / Test Suite (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-firm-connectors) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-forms) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-hmrc) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ingestion) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-normalize-map) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ocr) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-indexer) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-reason) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (ui-review) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (ui-review) (push) Has been cancelled
CI/CD Pipeline / Generate SBOM (push) Has been cancelled
CI/CD Pipeline / Deploy to Staging (push) Has been cancelled
CI/CD Pipeline / Deploy to Production (push) Has been cancelled
CI/CD Pipeline / Notifications (push) Has been cancelled

harkon
2025-11-26 15:46:59 +00:00
parent fdba81809f
commit db61b05c80
17 changed files with 170 additions and 553 deletions


@@ -118,7 +118,7 @@ async def init_dependencies(app_settings: OCRSettings) -> None:
             if attempt == max_retries:
                 raise HTTPException(
                     status_code=500, detail="Failed to connect to NATS after retries"
-                )
+                ) from e
             await asyncio.sleep(delay)
             delay *= 2  # exponential backoff
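The hunk above adds `from e` so that the final error keeps the original connection failure as its cause. A minimal, self-contained sketch of the same retry shape follows; `ConnectError`, `connect_with_retry`, and `always_failing` are hypothetical names used for illustration only, and the real service raises `HTTPException` instead.

```python
import asyncio


class ConnectError(RuntimeError):
    """Hypothetical error raised once every retry has failed."""


async def connect_with_retry(connect, max_retries=3, delay=0.01):
    """Await `connect` until it succeeds, doubling the delay between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return await connect()
        except OSError as e:
            if attempt == max_retries:
                # `from e` stores the original error on __cause__, so the
                # traceback shows why the last attempt failed.
                raise ConnectError("Failed to connect after retries") from e
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff


async def always_failing():
    """Stand-in for a connection attempt that never succeeds."""
    raise OSError("connection refused")


def demo():
    """Run the retry loop and return the chained cause of the final error."""
    try:
        asyncio.run(connect_with_retry(always_failing))
    except ConnectError as err:
        return err.__cause__
    return None
```

Without `from e`, Python would still attach the original error as implicit context, but the explicit form marks it as the direct cause and is what linters such as flake8-bugbear (B904) ask for.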
@@ -280,7 +280,7 @@ async def _handle_document_ingested(topic: str, payload: EventPayload) -> None:
         return
     # Auto-process PDF documents
-    if data.get("content_type") == "application/pdf":
+    if data.get("mime_type") == "application/pdf":
         logger.info("Auto-processing ingested document", doc_id=doc_id)
         try:
@@ -347,13 +347,13 @@ async def _process_document_async(
         await ds.store_ocr_result(tenant_id, doc_id, ocr_results)
         # Update metrics
-        metrics.counter("documents_processed_total").labels(
-            tenant_id=tenant_id, strategy=strategy
-        ).inc()
+        metrics.counter(
+            "ocr_documents_processed_total", labelnames=["tenant_id", "strategy"]
+        ).labels(tenant_id=tenant_id, strategy=strategy).inc()
-        metrics.histogram("processing_duration_seconds").labels(
-            strategy=strategy
-        ).observe(
+        metrics.histogram(
+            "ocr_processing_duration_seconds", labelnames=["strategy"]
+        ).labels(strategy=strategy).observe(
             datetime.utcnow().timestamp()
             - datetime.fromisoformat(
                 ocr_results["processed_at"].replace("Z", "")  # type: ignore
@@ -386,7 +386,10 @@ async def _process_document_async(
         logger.error("OCR processing failed", doc_id=doc_id, error=str(e))
         # Update error metrics
-        metrics.counter("processing_errors_total").labels(
+        metrics.counter(
+            "ocr_processing_errors_total",
+            labelnames=["tenant_id", "strategy", "error_type"],
+        ).labels(
             tenant_id=tenant_id, strategy=strategy, error_type=type(e).__name__
         ).inc()
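The metric hunks above all make the same change: each metric gets an `ocr_` prefix and declares its `labelnames` when it is created, before `.labels(...)` is called. The commit does not say why, but one plausible reading is that a labelled metric has to know its label names up front. The toy class below is a stdlib-only sketch of that constraint; `LabelledCounter` is hypothetical and is not the project's `metrics` helper, and the `"t1"` and `"hybrid"` values are made up.

```python
from collections import defaultdict


class LabelledCounter:
    """Toy counter that only accepts the label names declared at creation."""

    def __init__(self, name, labelnames=()):
        self.name = name
        self.labelnames = tuple(labelnames)
        self.values = defaultdict(float)

    def labels(self, **labels):
        # Reject any label set that differs from the declared names.
        if set(labels) != set(self.labelnames):
            raise ValueError(f"{self.name}: expected labels {self.labelnames}")
        key = tuple(labels[name] for name in self.labelnames)
        return _Child(self.values, key)


class _Child:
    """Handle for one concrete combination of label values."""

    def __init__(self, values, key):
        self._values = values
        self._key = key

    def inc(self, amount=1.0):
        self._values[self._key] += amount


# Declared label names: .labels() accepts the matching keyword arguments.
processed = LabelledCounter(
    "ocr_documents_processed_total", labelnames=["tenant_id", "strategy"]
)
processed.labels(tenant_id="t1", strategy="hybrid").inc()
processed.labels(tenant_id="t1", strategy="hybrid").inc()

# No declared label names: the same call is rejected.
unlabelled = LabelledCounter("documents_processed_total")
try:
    unlabelled.labels(tenant_id="t1", strategy="hybrid").inc()
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the counter declared with `labelnames` accumulates per label combination, while the one created without them refuses the labelled call, which mirrors the before and after shapes in the diff.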