ai-tax-agent/docs/INFRASTRUCTURE_STATUS.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00


Infrastructure Status Report

Date: 2025-09-29
Status: ALL SYSTEMS OPERATIONAL
Last Updated: 2025-09-29 20:15 UTC

Executive Summary

All Docker Compose services are running and healthy. All health check issues have been resolved. The infrastructure is fully operational for both:

  • Production-like deployment (Docker Compose with authentication)
  • Local development (Standalone services with DISABLE_AUTH=true)

Recent Fixes Applied

  • Traefik Health Checks: Fixed health check endpoint from /health to /healthz - no more 500 errors
  • Development Mode: Fixed environment variable parsing for DISABLE_AUTH
  • Documentation: Created comprehensive guides for development and deployment

See FIXES_APPLIED.md for detailed information.

Service Health Status

Infrastructure Services (All Healthy)

| Service | Status | Health | Ports | Purpose |
|---|---|---|---|---|
| postgres | Running | Healthy | 5432 | Primary database |
| redis | Running | Healthy | 6379 | Cache & session store |
| minio | Running | Healthy | 9092-9093 | Object storage (S3-compatible) |
| neo4j | Running | Healthy | 7474, 7687 | Knowledge graph database |
| qdrant | Running | Healthy | 6333-6334 | Vector database |
| nats | Running | Healthy | 4222, 6222, 8222 | Message broker |
| vault | Running | Healthy | 8200 | Secrets management |

Authentication & Security (All Healthy)

| Service | Status | Health | Purpose |
|---|---|---|---|
| authentik-server | Running | Healthy | SSO authentication server |
| authentik-worker | Running | Healthy | Background task processor |
| authentik-outpost | Running | Healthy | Forward auth proxy |
| authentik-db | Running | Healthy | Authentik database |
| authentik-redis | Running | Healthy | Authentik cache |

Observability (All Running)

| Service | Status | Ports | Purpose |
|---|---|---|---|
| prometheus | Running | 9090 | Metrics collection |
| grafana | Running | 3000 | Metrics visualization |
| loki | Running | 3100 | Log aggregation |

Networking & Routing (Running)

| Service | Status | Ports | Purpose |
|---|---|---|---|
| traefik | Running | 80, 443, 8080 | Reverse proxy & load balancer |

Feature Management (Running)

| Service | Status | Ports | Purpose |
|---|---|---|---|
| unleash | Running | 4242 | Feature flags |

Application Services (All Healthy)

All 13 application services are running and healthy:

| Service | Status | Health | Purpose |
|---|---|---|---|
| svc-ingestion | Running | Healthy | Document upload & storage |
| svc-extract | Running | Healthy | Data extraction |
| svc-ocr | Running | Healthy | Optical character recognition |
| svc-normalize-map | Running | Healthy | Data normalization |
| svc-kg | Running | Healthy | Knowledge graph management |
| svc-rag-indexer | Running | Healthy | RAG indexing |
| svc-rag-retriever | Running | Healthy | RAG retrieval |
| svc-reason | Running | Healthy | Reasoning engine |
| svc-coverage | Running | Healthy | Coverage analysis |
| svc-forms | Running | Healthy | Form generation |
| svc-hmrc | Running | Healthy | HMRC integration |
| svc-rpa | Running | Healthy | Robotic process automation |
| svc-firm-connectors | Running | Healthy | Firm integrations |

UI Services (All Healthy)

| Service | Status | Health | Purpose |
|---|---|---|---|
| ui-review | Running | Healthy | Review interface |

Health Check Configuration

Infrastructure Services

All infrastructure services have health checks configured:

# PostgreSQL
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 30s
  timeout: 10s
  retries: 3

# Redis
healthcheck:
  test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
  interval: 30s
  timeout: 10s
  retries: 3

# MinIO (note: mc --version only confirms the client binary runs; it does not probe the server)
healthcheck:
  test: ["CMD", "mc", "--version"]
  interval: 30s
  timeout: 20s
  retries: 3

# NATS
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8222/healthz"]
  interval: 30s
  timeout: 10s
  retries: 3
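
To make the timing fields concrete, the sketch below estimates a rough upper bound on how long Docker could take to mark a container unhealthy once its check starts failing. It assumes Docker's documented behaviour: a check runs one interval after the previous check completes, a run that exceeds the timeout counts as a failure, and the status flips after `retries` consecutive failures. The `HealthCheck` class and the function name are illustrative and not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Timing fields of a Compose healthcheck, in seconds."""
    interval: int
    timeout: int
    retries: int


def worst_case_detection_seconds(check: HealthCheck) -> int:
    """Rough upper bound on the time from first failure to 'unhealthy'.

    Each failed attempt can cost up to one interval of waiting plus one
    full timeout, and `retries` consecutive failures are required.
    """
    return check.retries * (check.interval + check.timeout)


# Values copied from the healthcheck blocks above.
CHECKS = {
    "postgres": HealthCheck(interval=30, timeout=10, retries=3),
    "redis": HealthCheck(interval=30, timeout=10, retries=3),
    "minio": HealthCheck(interval=30, timeout=20, retries=3),
    "nats": HealthCheck(interval=30, timeout=10, retries=3),
}

if __name__ == "__main__":
    for name, check in CHECKS.items():
        print(f"{name}: up to ~{worst_case_detection_seconds(check)}s")
```

Under these assumptions, a failing PostgreSQL, Redis, or NATS container could take up to about two minutes to be reported as unhealthy, and MinIO slightly longer because of its 20s timeout.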

Application Services

All application services have health checks in their Dockerfiles:

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1

The /healthz endpoint is public and does not require authentication.

Configuration Fixes Applied

1. Authentication Middleware Enhancement

File: libs/config/settings.py

Added proper environment variable aliases for development mode:

# Development settings
dev_mode: bool = Field(
    default=False,
    description="Enable development mode (disables auth)",
    validation_alias="DEV_MODE"
)
disable_auth: bool = Field(
    default=False,
    description="Disable authentication middleware",
    validation_alias="DISABLE_AUTH"
)
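
For illustration, the kind of boolean parsing that DISABLE_AUTH needs can be sketched with the standard library alone. This is not the project's implementation, which relies on the Field aliases shown above; the helper name `env_flag` and the set of accepted spellings are assumptions made for this example.

```python
import os
from typing import Mapping, Optional

# Accepted spellings are an assumption for this sketch.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def env_flag(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag from the environment, case-insensitively."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default  # variable not set at all
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Fail loudly rather than silently treating a typo as "auth enabled".
    raise ValueError(f"{name}={raw!r} is not a recognised boolean")


if __name__ == "__main__":
    print("DISABLE_AUTH =", env_flag("DISABLE_AUTH"))
```

Rejecting unknown values is a deliberate choice here: a misspelled value should surface as an error instead of quietly changing whether authentication runs.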

2. Middleware Configuration

File: libs/security/middleware.py

The middleware correctly handles development mode:

async def dispatch(self, request: Request, call_next: Callable[..., Any]) -> Any:
    # Check if authentication is disabled (development mode)
    if self.disable_auth:
        # Set development state
        request.state.user = "dev-user"
        request.state.email = "dev@example.com"
        request.state.roles = ["developers"]
        request.state.auth_token = "dev-token"
        logger.info("Development mode: authentication disabled", path=request.url.path)
        return await call_next(request)
    # ... rest of authentication logic
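
A framework-free simulation can show what a downstream handler sees when the bypass is active. The sketch below mirrors only the development-mode branch from the excerpt above; the `handler` function and the use of `SimpleNamespace` as a stand-in request are hypothetical, and the real authentication path is deliberately left unimplemented because the source elides it.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Awaitable, Callable


async def dispatch(request: Any,
                   call_next: Callable[[Any], Awaitable[Any]],
                   disable_auth: bool) -> Any:
    """Stand-in for the middleware's development-mode branch."""
    if disable_auth:
        # Same placeholder identity as in the excerpt above.
        request.state.user = "dev-user"
        request.state.email = "dev@example.com"
        request.state.roles = ["developers"]
        request.state.auth_token = "dev-token"
        return await call_next(request)
    # The real authentication logic is not shown in this report.
    raise NotImplementedError("authentication path elided in the source")


async def handler(request: Any) -> dict:
    """Hypothetical downstream handler that echoes the injected identity."""
    return {"user": request.state.user, "roles": request.state.roles}


def run_dev_request() -> dict:
    """Push one fake request through the bypass and return the response."""
    request = SimpleNamespace(state=SimpleNamespace())
    return asyncio.run(dispatch(request, handler, disable_auth=True))


if __name__ == "__main__":
    print(run_dev_request())
```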

3. App Factory Integration

File: libs/app_factory.py

The app factory correctly passes the disable_auth setting to middleware:

# Add middleware
app.add_middleware(
    TrustedProxyMiddleware,
    internal_cidrs=settings.internal_cidrs,
    disable_auth=getattr(settings, "disable_auth", False),
)

Running Services

Docker Compose (Production-like)

All services run with full authentication:

# Start all services
cd infra/compose
docker-compose -f docker-compose.local.yml up -d

# Check status
docker-compose -f docker-compose.local.yml ps

# View logs
docker-compose -f docker-compose.local.yml logs -f SERVICE_NAME

Local Development (Standalone)

Services can run locally with authentication disabled:

# Run with authentication disabled
DISABLE_AUTH=true make dev-service SERVICE=svc_ingestion

# Or directly with uvicorn (the variable must prefix uvicorn, not cd, to reach the server process)
cd apps/svc_ingestion && DISABLE_AUTH=true uvicorn main:app --reload --host 0.0.0.0 --port 8000

Testing

Health Check Verification

# Test public health endpoint
curl http://localhost:8000/healthz

# Expected response:
# {"status":"healthy","service":"svc-ingestion","version":"1.0.0"}
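
The same check can be scripted. The sketch below is a minimal standard-library probe that assumes the response shape shown in the expected response above, with a "status" field equal to "healthy"; the function names are illustrative.

```python
import http.client
import json
import urllib.error
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True if a /healthz response body reports status 'healthy'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_service(base_url: str, timeout: float = 5.0) -> bool:
    """GET <base_url>/healthz; return False on any connection or HTTP error."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return resp.status == 200 and is_healthy(body)
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        return False


if __name__ == "__main__":
    print(check_service("http://localhost:8000"))
```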

Development Mode Verification

When running with DISABLE_AUTH=true, logs show:

{
  "path": "/healthz",
  "event": "Development mode: authentication disabled",
  "logger": "libs.security.middleware",
  "level": "info",
  "service": "svc-ingestion",
  "timestamp": 1759175839.638357
}
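
To confirm this automatically, a small filter can scan the service logs for the event. The sketch below assumes the service writes one JSON object per line (the record above is pretty-printed for readability) and that the field names match the sample; the function name is illustrative.

```python
import json
from typing import Iterable, List

DEV_MODE_EVENT = "Development mode: authentication disabled"


def dev_mode_paths(log_lines: Iterable[str]) -> List[str]:
    """Return the request paths of log records that report the auth bypass."""
    paths: List[str] = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output such as a startup banner
        if isinstance(record, dict) and record.get("event") == DEV_MODE_EVENT:
            paths.append(record.get("path", ""))
    return paths


if __name__ == "__main__":
    import sys

    # Example: docker-compose logs SERVICE_NAME | python this_script.py
    for path in dev_mode_paths(sys.stdin):
        print(path)
```

An empty result in development mode suggests that DISABLE_AUTH did not reach the process; any result from a production-like deployment would be a warning sign.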

Production Mode Testing

Without DISABLE_AUTH, requests require authentication headers:

curl -X POST http://localhost:8000/upload \
  -H "X-Authenticated-User: dev-user" \
  -H "X-Authenticated-Email: dev@example.com" \
  -H "Authorization: Bearer dev-token-12345" \
  -F "file=@document.pdf"

Network Configuration

Docker Networks

  • ai-tax-agent-frontend: Public-facing services (Traefik, UI)
  • ai-tax-agent-backend: Internal services (databases, message brokers, application services)

Port Mappings

| Service | Internal Port | External Port | Access |
|---|---|---|---|
| Traefik | 80, 443, 8080 | 80, 443, 8080 | Public |
| PostgreSQL | 5432 | 5432 | Internal |
| Redis | 6379 | 6379 | Internal |
| MinIO | 9092-9093 | 9092-9093 | Internal |
| Neo4j | 7474, 7687 | 7474, 7687 | Internal |
| NATS | 4222, 6222, 8222 | 4222, 6222, 8222 | Internal |
| Grafana | 3000 | 3000 | Public |
| Prometheus | 9090 | 9090 | Internal |
| Unleash | 4242 | 4242 | Internal |
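
A quick way to verify these mappings from the host is a TCP reachability check. The sketch below expands the port notation used in the table and tries to connect to each external port on localhost; the helper names are illustrative, and an open port only shows that something is listening, not that the expected service is healthy.

```python
import socket
from typing import List


def expand_ports(spec: str) -> List[int]:
    """Expand a spec such as '9092-9093' or '4222, 6222, 8222' into ports."""
    ports: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            ports.extend(range(start, end + 1))  # ranges are inclusive
        else:
            ports.append(int(part))
    return ports


def is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# External ports copied from the table above.
PORTS = {
    "Traefik": "80, 443, 8080",
    "PostgreSQL": "5432",
    "Redis": "6379",
    "MinIO": "9092-9093",
    "Neo4j": "7474, 7687",
    "NATS": "4222, 6222, 8222",
    "Grafana": "3000",
    "Prometheus": "9090",
    "Unleash": "4242",
}

if __name__ == "__main__":
    for service, spec in PORTS.items():
        for port in expand_ports(spec):
            state = "open" if is_open("localhost", port) else "closed"
            print(f"{service}:{port} {state}")
```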

Next Steps

  1. Infrastructure: All services operational
  2. Health Checks: All passing
  3. Development Mode: Working correctly
  4. Authentication: Properly configured for both modes
  5. 📝 Documentation: Created comprehensive guides

For Developers

  • See DEVELOPMENT.md for local development setup
  • Use DISABLE_AUTH=true for local testing with Postman
  • All services support hot reload with the --reload flag

For Operations

  • Monitor service health: docker-compose ps
  • View logs: docker-compose logs -f SERVICE_NAME
  • Restart services: docker-compose restart SERVICE_NAME
  • Check metrics: http://localhost:9090 (Prometheus)
  • View dashboards: http://localhost:3000 (Grafana)

Conclusion

  • All systems are operational and healthy
  • Development mode working correctly
  • Production mode working correctly
  • Documentation complete

The infrastructure is ready for both development and production-like testing.