ai-tax-agent/docs/INFRASTRUCTURE_STATUS.md
harkon b324ff09ef
Initial commit
2025-10-11 08:41:36 +01:00


# Infrastructure Status Report
**Date**: 2025-09-29
**Status**: ✅ **ALL SYSTEMS OPERATIONAL**
**Last Updated**: 2025-09-29 20:15 UTC
## Executive Summary
All Docker Compose services are running and healthy. All health check issues have been resolved. The infrastructure is fully operational for both:
- **Production-like deployment** (Docker Compose with authentication)
- **Local development** (Standalone services with `DISABLE_AUTH=true`)
### Recent Fixes Applied
- **Traefik Health Checks**: Changed the health check endpoint from `/health` to `/healthz`, which eliminated the 500 errors
- **Development Mode**: Fixed environment variable parsing for `DISABLE_AUTH`
- **Documentation**: Created comprehensive guides for development and deployment
See [FIXES_APPLIED.md](FIXES_APPLIED.md) for detailed information.
## Service Health Status
### Infrastructure Services (All Healthy ✅)
| Service | Status | Health | Ports | Purpose |
| ------------ | ------- | ---------- | ---------------- | ------------------------------ |
| **postgres** | Running | ✅ Healthy | 5432 | Primary database |
| **redis** | Running | ✅ Healthy | 6379 | Cache & session store |
| **minio** | Running | ✅ Healthy | 9092-9093 | Object storage (S3-compatible) |
| **neo4j** | Running | ✅ Healthy | 7474, 7687 | Knowledge graph database |
| **qdrant** | Running | ✅ Healthy | 6333-6334 | Vector database |
| **nats** | Running | ✅ Healthy | 4222, 6222, 8222 | Message broker |
| **vault** | Running | ✅ Healthy | 8200 | Secrets management |
### Authentication & Security (All Healthy ✅)
| Service | Status | Health | Purpose |
| --------------------- | ------- | ---------- | ------------------------- |
| **authentik-server** | Running | ✅ Healthy | SSO authentication server |
| **authentik-worker** | Running | ✅ Healthy | Background task processor |
| **authentik-outpost** | Running | ✅ Healthy | Forward auth proxy |
| **authentik-db** | Running | ✅ Healthy | Authentik database |
| **authentik-redis** | Running | ✅ Healthy | Authentik cache |
### Observability (All Running ✅)
| Service | Status | Ports | Purpose |
| -------------- | ------- | ----- | --------------------- |
| **prometheus** | Running | 9090 | Metrics collection |
| **grafana** | Running | 3000 | Metrics visualization |
| **loki** | Running | 3100 | Log aggregation |
### Networking & Routing (Running ✅)
| Service | Status | Ports | Purpose |
| ----------- | ------- | ------------- | ----------------------------- |
| **traefik** | Running | 80, 443, 8080 | Reverse proxy & load balancer |
### Feature Management (Running ✅)
| Service | Status | Ports | Purpose |
| ----------- | ------- | ----- | ------------- |
| **unleash** | Running | 4242 | Feature flags |
### Application Services (All Healthy ✅)
All 13 application services are running and healthy:
| Service | Status | Health | Purpose |
| ----------------------- | ------- | ---------- | ----------------------------- |
| **svc-ingestion** | Running | ✅ Healthy | Document upload & storage |
| **svc-extract** | Running | ✅ Healthy | Data extraction |
| **svc-ocr** | Running | ✅ Healthy | Optical character recognition |
| **svc-normalize-map** | Running | ✅ Healthy | Data normalization |
| **svc-kg** | Running | ✅ Healthy | Knowledge graph management |
| **svc-rag-indexer** | Running | ✅ Healthy | RAG indexing |
| **svc-rag-retriever** | Running | ✅ Healthy | RAG retrieval |
| **svc-reason** | Running | ✅ Healthy | Reasoning engine |
| **svc-coverage** | Running | ✅ Healthy | Coverage analysis |
| **svc-forms** | Running | ✅ Healthy | Form generation |
| **svc-hmrc** | Running | ✅ Healthy | HMRC integration |
| **svc-rpa** | Running | ✅ Healthy | Robotic process automation |
| **svc-firm-connectors** | Running | ✅ Healthy | Firm integrations |
### UI Services (All Healthy ✅)
| Service | Status | Health | Purpose |
| ------------- | ------- | ---------- | ---------------- |
| **ui-review** | Running | ✅ Healthy | Review interface |
## Health Check Configuration
### Infrastructure Services
All infrastructure services have health checks configured:
```yaml
# PostgreSQL
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 30s
  timeout: 10s
  retries: 3

# Redis
healthcheck:
  test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
  interval: 30s
  timeout: 10s
  retries: 3

# MinIO
healthcheck:
  test: ["CMD", "mc", "--version"]
  interval: 30s
  timeout: 20s
  retries: 3

# NATS
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8222/healthz"]
  interval: 30s
  timeout: 10s
  retries: 3
```
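To get a feel for what these timings imply, the sketch below estimates how long Docker takes to mark a container unhealthy when every probe hangs until its timeout. It assumes Docker's documented behaviour (first probe one interval after start, each later probe one interval after the previous one completes, unhealthy after `retries` consecutive failures) and ignores any start period. `seconds_until_unhealthy` is a hypothetical helper for illustration, not part of this repository.

```python
def seconds_until_unhealthy(interval: float, timeout: float, retries: int) -> float:
    """Rough worst case from container start: each probe begins `interval`
    seconds after the previous one finished and hangs for the full `timeout`."""
    return retries * (interval + timeout)


# (interval, timeout, retries) copied from the Compose snippets above.
CHECKS = {
    "postgres": (30, 10, 3),
    "redis": (30, 10, 3),
    "minio": (30, 20, 3),
    "nats": (30, 10, 3),
}

for name, (interval, timeout, retries) in CHECKS.items():
    estimate = seconds_until_unhealthy(interval, timeout, retries)
    print(f"{name}: up to ~{estimate:.0f}s before 'unhealthy'")
```

Under these assumptions a hung service is flagged after roughly two minutes, so shorter intervals may be worth considering where faster detection matters.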
### Application Services
All application services have health checks in their Dockerfiles:
```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/healthz || exit 1
```
The `/healthz` endpoint is public and does not require authentication.
## Configuration Fixes Applied
### 1. Authentication Middleware Enhancement
**File**: `libs/config/settings.py`
Added proper environment variable aliases for development mode:
```python
# Development settings
dev_mode: bool = Field(
    default=False,
    description="Enable development mode (disables auth)",
    validation_alias="DEV_MODE"
)
disable_auth: bool = Field(
    default=False,
    description="Disable authentication middleware",
    validation_alias="DISABLE_AUTH"
)
```
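Parsing matters here because the process environment only carries strings, and in plain Python `bool("false")` is `True`. As an illustration of the kind of parsing involved, here is a minimal standard-library sketch. `env_flag` and the spellings it accepts are assumptions made for this example; the project itself relies on the settings fields shown above rather than on this helper.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Read a boolean flag such as DISABLE_AUTH from the environment.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Fail loudly instead of silently treating typos as "auth enabled/disabled".
    raise ValueError(f"{name} must be a boolean-like string, got {raw!r}")
```

For example, `env_flag("DISABLE_AUTH", environ={"DISABLE_AUTH": "true"})` returns `True`, while an unset variable falls back to the default of `False`.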
### 2. Middleware Configuration
**File**: `libs/security/middleware.py`
The middleware correctly handles development mode:
```python
async def dispatch(self, request: Request, call_next: Callable[..., Any]) -> Any:
    # Check if authentication is disabled (development mode)
    if self.disable_auth:
        # Set development state
        request.state.user = "dev-user"
        request.state.email = "dev@example.com"
        request.state.roles = ["developers"]
        request.state.auth_token = "dev-token"
        logger.info("Development mode: authentication disabled", path=request.url.path)
        return await call_next(request)
    # ... rest of authentication logic
```
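The bypass branch can be exercised without any web framework. The following sketch mirrors the snippet above with plain objects; the free-standing `dispatch`, the toy `handler`, `run_once`, and the `PermissionError` placeholder for the real authentication path are all hypothetical stand-ins, since the rest of the middleware is not shown here.

```python
import asyncio
from types import SimpleNamespace


async def dispatch(request, call_next, disable_auth: bool):
    """Framework-free stand-in for the middleware's development-mode branch."""
    if disable_auth:
        # Same placeholder identity as in the middleware snippet above.
        request.state.user = "dev-user"
        request.state.email = "dev@example.com"
        request.state.roles = ["developers"]
        request.state.auth_token = "dev-token"
        return await call_next(request)
    # Placeholder only: the real authentication logic is not part of this sketch.
    raise PermissionError("authentication required")


async def handler(request):
    """Toy endpoint that echoes the identity attached by dispatch."""
    return {"user": request.state.user, "roles": request.state.roles}


def run_once(disable_auth: bool):
    """Build a bare request object and push it through dispatch once."""
    request = SimpleNamespace(state=SimpleNamespace())
    return asyncio.run(dispatch(request, handler, disable_auth))
```

Calling `run_once(True)` reaches the handler with the development identity attached, whereas `run_once(False)` raises before the handler runs.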
### 3. App Factory Integration
**File**: `libs/app_factory.py`
The app factory correctly passes the `disable_auth` setting to middleware:
```python
# Add middleware
app.add_middleware(
    TrustedProxyMiddleware,
    internal_cidrs=settings.internal_cidrs,
    disable_auth=getattr(settings, "disable_auth", False),
)
```
## Running Services
### Docker Compose (Production-like)
All services run with full authentication:
```bash
# Start all services
cd infra/compose
docker-compose -f docker-compose.local.yml up -d
# Check status
docker-compose -f docker-compose.local.yml ps
# View logs
docker-compose -f docker-compose.local.yml logs -f SERVICE_NAME
```
### Local Development (Standalone)
Services can run locally with authentication disabled:
```bash
# Run with authentication disabled
DISABLE_AUTH=true make dev-service SERVICE=svc_ingestion
# Or directly with uvicorn (the variable must prefix the uvicorn command, not cd)
cd apps/svc_ingestion && DISABLE_AUTH=true uvicorn main:app --reload --host 0.0.0.0 --port 8000
```
## Testing
### Health Check Verification
```bash
# Test public health endpoint
curl http://localhost:8000/healthz
# Expected response:
# {"status":"healthy","service":"svc-ingestion","version":"1.0.0"}
```
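The same check can be scripted, for example in a smoke test. This is a minimal sketch using only the standard library; `check_health` is a hypothetical helper, and it assumes the response body has the JSON shape shown above.

```python
import json
from urllib.request import ProxyHandler, build_opener

# Local health checks should not be routed through a configured HTTP proxy.
_OPENER = build_opener(ProxyHandler({}))


def check_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/healthz and return the decoded JSON body.

    Connection problems and non-2xx responses surface as urllib.error
    exceptions; a body whose status is not "healthy" raises ValueError.
    """
    url = f"{base_url.rstrip('/')}/healthz"
    with _OPENER.open(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "healthy":
        raise ValueError(f"unexpected health payload: {body!r}")
    return body
```

With a service listening locally, `check_health("http://localhost:8000")` returns the parsed payload as a dictionary.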
### Development Mode Verification
When running with `DISABLE_AUTH=true`, logs show:
```json
{
  "path": "/healthz",
  "event": "Development mode: authentication disabled",
  "logger": "libs.security.middleware",
  "level": "info",
  "service": "svc-ingestion",
  "timestamp": 1759175839.638357
}
```
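To confirm this from captured logs rather than by eye, a small parser is enough. The sketch below assumes each log entry arrives as one JSON object per line with the fields shown above; `parse_dev_mode_log` is a hypothetical helper, not part of the codebase.

```python
import json
from datetime import datetime, timezone

DEV_MODE_EVENT = "Development mode: authentication disabled"


def parse_dev_mode_log(line: str):
    """Return (path, UTC datetime) for a dev-mode log entry, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(record, dict) or record.get("event") != DEV_MODE_EVENT:
        return None  # some other log entry
    # The timestamp field is seconds since the Unix epoch.
    when = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return record.get("path"), when
```

Applied to the sample entry, the epoch value decodes to 2025-09-29 19:57:19 UTC, shortly before the report's last update.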
### Production Mode Testing
Without `DISABLE_AUTH`, requests require authentication headers:
```bash
curl -X POST http://localhost:8000/upload \
  -H "X-Authenticated-User: dev-user" \
  -H "X-Authenticated-Email: dev@example.com" \
  -H "Authorization: Bearer dev-token-12345" \
  -F "file=@document.pdf"
```
## Network Configuration
### Docker Networks
- **ai-tax-agent-frontend**: Public-facing services (Traefik, UI)
- **ai-tax-agent-backend**: Internal services (databases, message brokers, application services)
### Port Mappings
| Service | Internal Port | External Port | Access |
| ---------- | ---------------- | ---------------- | -------- |
| Traefik | 80, 443, 8080 | 80, 443, 8080 | Public |
| PostgreSQL | 5432 | 5432 | Internal |
| Redis | 6379 | 6379 | Internal |
| MinIO | 9092-9093 | 9092-9093 | Internal |
| Neo4j | 7474, 7687 | 7474, 7687 | Internal |
| NATS | 4222, 6222, 8222 | 4222, 6222, 8222 | Internal |
| Grafana | 3000 | 3000 | Public |
| Prometheus | 9090 | 9090 | Internal |
| Unleash | 4242 | 4242 | Internal |
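
A quick way to verify these mappings from the host is to attempt a TCP connection to each external port. The sketch below copies the port list from the table; `port_open` and `closed_ports` are hypothetical helpers, and a successful connection only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# External ports copied from the table above.
PORTS = {
    "traefik": [80, 443, 8080],
    "postgres": [5432],
    "redis": [6379],
    "minio": [9092, 9093],
    "neo4j": [7474, 7687],
    "nats": [4222, 6222, 8222],
    "grafana": [3000],
    "prometheus": [9090],
    "unleash": [4242],
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host: str = "localhost") -> list[tuple[str, int]]:
    """List (service, port) pairs that refuse or time out a TCP connection."""
    return [
        (name, port)
        for name, ports in PORTS.items()
        for port in ports
        if not port_open(host, port)
    ]
```

An empty list from `closed_ports()` means every mapped port accepted a connection.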
## Next Steps
1. **Infrastructure**: All services operational
2. **Health Checks**: All passing
3. **Development Mode**: Working correctly
4. **Authentication**: Properly configured for both modes
5. 📝 **Documentation**: Created comprehensive guides
### For Developers
- See [DEVELOPMENT.md](DEVELOPMENT.md) for local development setup
- Use `DISABLE_AUTH=true` for local testing with Postman
- All services support hot reload with `--reload` flag
### For Operations
- Monitor service health: `docker-compose ps`
- View logs: `docker-compose logs -f SERVICE_NAME`
- Restart services: `docker-compose restart SERVICE_NAME`
- Check metrics: http://localhost:9090 (Prometheus)
- View dashboards: http://localhost:3000 (Grafana)
## Conclusion
- **All systems are operational and healthy**
- **Development mode working correctly**
- **Production mode working correctly**
- **Documentation complete**
The infrastructure is ready for both development and production-like testing.