# Service Level Indicators (SLIs) and Objectives (SLOs)
## AI Tax Agent System

**Document Version:** 1.0
**Date:** 2024-01-31
**Owner:** Site Reliability Engineering Team

## 1. Executive Summary

This document defines the Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets for the AI Tax Agent System. These metrics ensure reliable service delivery and guide operational decisions.

## 2. SLI/SLO Framework

### 2.1 Service Categories

| Service Category | Description | Criticality | Users |
|------------------|-------------|-------------|-------|
| **User-Facing** | Web UI, API Gateway | Critical | End users, integrations |
| **Data Processing** | ETL, OCR, Extraction | High | Background processes |
| **AI/ML Services** | LLM, RAG, Reasoning | High | Automated workflows |
| **Storage Services** | Databases, Object Storage | Critical | All services |
| **Infrastructure** | Auth, Monitoring, Networking | Critical | System operations |

### 2.2 SLI Types

- **Availability**: Service uptime and reachability
- **Latency**: Response time for requests
- **Quality**: Accuracy and correctness of outputs
- **Throughput**: Request processing capacity
- **Durability**: Data persistence and integrity

## 3. User-Facing Services

### 3.1 Review UI (ui-review)

#### 3.1.1 Availability SLI/SLO
```prometheus
# SLI: Percentage of successful HTTP requests
sli_ui_availability = (
  sum(rate(http_requests_total{service="ui-review", code!~"5.."}[5m])) /
  sum(rate(http_requests_total{service="ui-review"}[5m]))
) * 100

# SLO: 99.9% availability over 30 days
# Error Budget: 43.2 minutes downtime per month
```

**Target**: 99.9% (43.2 minutes downtime/month)
**Measurement Window**: 30 days
**Alert Threshold**: 99.5% (burn rate > 2x)

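
The availability ratio above can be reproduced offline from raw request counts, for example when checking a dashboard value by hand. The following is a minimal sketch, not part of the monitoring stack; the function names and the choice to report 100% when there is no traffic are assumptions made for illustration.

```python
def availability_percent(total_requests: int, server_errors: int) -> float:
    """Share of requests that did not end in a 5xx response, in percent."""
    if total_requests <= 0:
        # Assumption: with no traffic, report full availability instead of
        # dividing by zero.
        return 100.0
    return (total_requests - server_errors) / total_requests * 100


def meets_slo(total_requests: int, server_errors: int,
              slo_percent: float = 99.9) -> bool:
    """True when the measured availability is at or above the SLO target."""
    return availability_percent(total_requests, server_errors) >= slo_percent


# Hypothetical counts for one measurement window
print(availability_percent(100_000, 50))
print(meets_slo(100_000, 500))
```
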
#### 3.1.2 Latency SLI/SLO
```prometheus
# SLI: 95th percentile response time
# Buckets are summed by `le` so that a single service-wide quantile is returned
sli_ui_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="ui-review"}[5m]))
)

# SLO: 95% of requests < 2 seconds
sli_ui_latency_success_rate = (
  sum(rate(http_request_duration_seconds_bucket{service="ui-review", le="2.0"}[5m])) /
  sum(rate(http_request_duration_seconds_count{service="ui-review"}[5m]))
) * 100
```

**Target**: 95% of requests < 2 seconds
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

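
The success-rate form of the latency SLI divides the cumulative count in the `le="2.0"` bucket by the total request count. A minimal sketch of the same arithmetic on a snapshot of cumulative bucket counts follows; the bucket boundaries and counts are made-up illustration data, and the helper name is hypothetical.

```python
def percent_within_threshold(cumulative_buckets: dict[float, int],
                             threshold_s: float) -> float:
    """Percent of requests that completed within threshold_s seconds.

    cumulative_buckets maps each bucket upper bound (the `le` label) to the
    cumulative request count; the float("inf") entry holds the total.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 100.0  # assumption: no traffic counts as meeting the target
    if threshold_s not in cumulative_buckets:
        # The threshold must coincide with a bucket boundary to be exact.
        raise ValueError(f"no bucket with upper bound {threshold_s}")
    return cumulative_buckets[threshold_s] / total * 100


# Illustrative snapshot: 960 of 1000 requests finished within 2 seconds
snapshot = {0.5: 700, 1.0: 900, 2.0: 960, 5.0: 995, float("inf"): 1000}
print(percent_within_threshold(snapshot, 2.0))
```

The check only works when the SLO threshold is an exact bucket boundary, which is why the histogram buckets should be chosen with the 2-second target in mind.
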
### 3.2 API Gateway (traefik)

#### 3.2.1 Availability SLI/SLO
```prometheus
# SLI: API endpoint availability
sli_api_availability = (
  sum(rate(traefik_service_requests_total{code!~"5.."}[5m])) /
  sum(rate(traefik_service_requests_total[5m]))
) * 100
```

**Target**: 99.95% (21.6 minutes downtime/month)
**Measurement Window**: 30 days
**Alert Threshold**: 99.9% (burn rate > 2x)

#### 3.2.2 Latency SLI/SLO
```prometheus
# SLI: API response time (99th percentile across all Traefik services)
sli_api_latency_p99 = histogram_quantile(0.99,
  sum by (le) (rate(traefik_service_request_duration_seconds_bucket[5m]))
)
```

**Target**: 99% of requests < 5 seconds
**Measurement Window**: 5 minutes
**Alert Threshold**: 95% (burn rate > 5x)

## 4. Data Processing Services

### 4.1 Document Extraction (svc-extract)

#### 4.1.1 Processing Success Rate SLI/SLO
```prometheus
# SLI: Successful document processing rate
sli_extraction_success_rate = (
  sum(rate(document_processing_total{status="success"}[5m])) /
  sum(rate(document_processing_total[5m]))
) * 100
```

**Target**: 95% successful processing
**Measurement Window**: 1 hour
**Alert Threshold**: 90% (burn rate > 5x)

#### 4.1.2 Processing Latency SLI/SLO
```prometheus
# SLI: Document processing time (95th percentile)
sli_extraction_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(document_processing_duration_seconds_bucket[5m]))
)
```

**Target**: 95% of documents processed < 60 seconds
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

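
For spot checks against raw processing logs, the 95th percentile can also be computed directly from a list of durations. The sketch below uses the nearest-rank method, which is an assumption chosen for simplicity; Prometheus `histogram_quantile` interpolates within buckets, so the two values will generally differ slightly. The sample durations are invented.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical processing durations in seconds: 19 fast documents, 1 slow one
durations = [10.0] * 19 + [120.0]
p95 = percentile_nearest_rank(durations, 95)
print(p95, p95 < 60)  # compare against the 60-second target
```
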
#### 4.1.3 Quality SLI/SLO
```prometheus
# SLI: Field extraction accuracy
sli_extraction_accuracy = (
  sum(rate(field_extraction_correct_total[5m])) /
  sum(rate(field_extraction_total[5m]))
) * 100
```

**Target**: 97% field extraction accuracy
**Measurement Window**: 1 hour
**Alert Threshold**: 95% (burn rate > 2x)

### 4.2 Knowledge Graph Service (svc-kg)

#### 4.2.1 Query Performance SLI/SLO
```prometheus
# SLI: Cypher query response time (95th percentile)
sli_kg_query_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(neo4j_query_duration_seconds_bucket[5m]))
)
```

**Target**: 95% of queries < 10 seconds
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

#### 4.2.2 Data Consistency SLI/SLO
```prometheus
# SLI: Share of transactions without graph constraint violations
sli_kg_consistency = (
  1 - (sum(rate(neo4j_constraint_violations_total[5m])) /
       sum(rate(neo4j_transactions_total[5m])))
) * 100
```

**Target**: 99.9% constraint compliance
**Measurement Window**: 1 hour
**Alert Threshold**: 99.5% (burn rate > 2x)

## 5. AI/ML Services

### 5.1 RAG Retrieval (svc-rag-retriever)

#### 5.1.1 Retrieval Quality SLI/SLO
```prometheus
# SLI: Retrieval relevance score
# avg() cannot take a range vector directly, so each series is first averaged
# over the window (assumes rag_retrieval_relevance_score is a gauge)
sli_rag_relevance = avg(
  avg_over_time(rag_retrieval_relevance_score[5m])
)
```

**Target**: Average relevance score > 0.8
**Measurement Window**: 1 hour
**Alert Threshold**: 0.75 (burn rate > 2x)

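
The relevance SLI is a windowed average of per-retrieval scores. As a rough illustration of that idea outside Prometheus, the sketch below keeps a time-bounded window of samples and reports their mean; the class name, the timestamps in seconds, and the sample scores are all hypothetical.

```python
from collections import deque
from typing import Optional


class RollingMean:
    """Mean of the samples observed within the last window_s seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._samples: deque = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value: float) -> None:
        self._samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        while self._samples and self._samples[0][0] <= timestamp - self.window_s:
            self._samples.popleft()

    def mean(self) -> Optional[float]:
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


relevance = RollingMean(window_s=3600)  # 1-hour measurement window
relevance.add(0, 0.5)
relevance.add(1800, 0.9)
relevance.add(3600, 0.7)  # the sample at t=0 is now outside the window
print(relevance.mean())
```
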
#### 5.1.2 Retrieval Latency SLI/SLO
```prometheus
# SLI: Vector search response time (95th percentile)
sli_rag_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(rag_search_duration_seconds_bucket[5m]))
)
```

**Target**: 95% of searches < 3 seconds
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

### 5.2 Tax Reasoning (svc-reason)

#### 5.2.1 Calculation Accuracy SLI/SLO
```prometheus
# SLI: Tax calculation accuracy
sli_calculation_accuracy = (
  sum(rate(tax_calculations_correct_total[5m])) /
  sum(rate(tax_calculations_total[5m]))
) * 100
```

**Target**: 99% calculation accuracy
**Measurement Window**: 1 hour
**Alert Threshold**: 98% (burn rate > 2x)

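#### 5.2.2 Confidence Score SLI/SLO
```prometheus
# SLI: Average confidence score
# tax_calculation_confidence_score is exported as a histogram (section 9.3),
# so the mean is the rate of _sum divided by the rate of _count
sli_calculation_confidence = (
  sum(rate(tax_calculation_confidence_score_sum[5m])) /
  sum(rate(tax_calculation_confidence_score_count[5m]))
)
```

**Target**: Average confidence > 0.9
**Measurement Window**: 1 hour
**Alert Threshold**: 0.85 (burn rate > 2x)
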
## 6. Storage Services

### 6.1 PostgreSQL Database

#### 6.1.1 Availability SLI/SLO
```prometheus
# SLI: Database connection success rate
sli_postgres_availability = (
  sum(rate(postgres_connections_successful_total[5m])) /
  sum(rate(postgres_connections_total[5m]))
) * 100
```

**Target**: 99.99% (4.3 minutes downtime/month)
**Measurement Window**: 30 days
**Alert Threshold**: 99.95% (burn rate > 2x)

#### 6.1.2 Query Performance SLI/SLO
```prometheus
# SLI: Query response time (95th percentile)
sli_postgres_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(postgres_query_duration_seconds_bucket[5m]))
)
```

**Target**: 95% of queries < 1 second
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

### 6.2 Neo4j Knowledge Graph

#### 6.2.1 Availability SLI/SLO
```prometheus
# SLI: Neo4j cluster availability
sli_neo4j_availability = (
  sum(neo4j_cluster_members_available) /
  sum(neo4j_cluster_members_total)
) * 100
```

**Target**: 99.9% cluster availability
**Measurement Window**: 30 days
**Alert Threshold**: 99.5% (burn rate > 2x)

### 6.3 Qdrant Vector Database

#### 6.3.1 Search Performance SLI/SLO
```prometheus
# SLI: Vector search latency (95th percentile)
sli_qdrant_search_latency_p95 = histogram_quantile(0.95,
  sum by (le) (rate(qdrant_search_duration_seconds_bucket[5m]))
)
```

**Target**: 95% of searches < 500ms
**Measurement Window**: 5 minutes
**Alert Threshold**: 90% (burn rate > 5x)

## 7. Infrastructure Services

### 7.1 Authentication (authentik)

#### 7.1.1 Authentication Success Rate SLI/SLO
```prometheus
# SLI: Authentication success rate
sli_auth_success_rate = (
  sum(rate(authentik_auth_success_total[5m])) /
  sum(rate(authentik_auth_attempts_total[5m]))
) * 100
```

**Target**: 99.5% authentication success
**Measurement Window**: 1 hour
**Alert Threshold**: 99% (burn rate > 2x)

### 7.2 Object Storage (minio)

#### 7.2.1 Durability SLI/SLO
```prometheus
# SLI: Object integrity check success rate
sli_storage_durability = (
  sum(rate(minio_integrity_checks_success_total[5m])) /
  sum(rate(minio_integrity_checks_total[5m]))
) * 100
```

**Target**: 99.999999999% (11 9's) durability
**Measurement Window**: 30 days
**Alert Threshold**: 99.99% (burn rate > 2x)

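
An eleven-nines target leaves a failure fraction of 1e-11, which is small enough that binary floating point introduces visible rounding noise. The sketch below, offered only as an illustration, uses `decimal.Decimal` to work out how many failed integrity checks such a target would tolerate; the function name and the check volume are assumptions, not figures from this system.

```python
from decimal import Decimal


def allowed_integrity_failures(slo_percent: str, checks: int) -> Decimal:
    """Failed integrity checks tolerated by the SLO out of `checks` checks.

    slo_percent is passed as a string so the target is parsed exactly.
    """
    failure_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return failure_fraction * checks


# Hypothetical volume: one trillion integrity checks in the window
print(allowed_integrity_failures("99.999999999", 10**12))
```

At realistic check volumes the tolerated number of failures is below one, so in practice any failed integrity check is worth investigating.
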
## 8. Error Budget Management

### 8.1 Error Budget Calculation

```python
def calculate_error_budget(slo_target: float, time_window_hours: int) -> dict:
    """Calculate the error budget for a given SLO target (in percent)."""
    # Round to suppress floating-point noise (100 - 99.9 is not exactly 0.1)
    error_budget_percent = round(100 - slo_target, 12)
    total_minutes = time_window_hours * 60
    error_budget_minutes = round(total_minutes * error_budget_percent / 100, 6)

    return {
        'error_budget_percent': error_budget_percent,
        'error_budget_minutes': error_budget_minutes,
        'total_minutes': total_minutes
    }

# Example: 99.9% SLO over 30 days
error_budget = calculate_error_budget(99.9, 30 * 24)
# Result: {'error_budget_percent': 0.1, 'error_budget_minutes': 43.2, 'total_minutes': 43200}
```

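
Once the budget is known, the usual follow-up question is how much of it an incident has consumed. A minimal sketch of that bookkeeping is shown here; the function name, the returned keys, and the 10.8-minute outage are illustrative assumptions.

```python
def remaining_error_budget(slo_target: float, window_days: int,
                           downtime_minutes: float) -> dict:
    """Error budget left after downtime_minutes of downtime in the window."""
    budget_minutes = window_days * 24 * 60 * (100 - slo_target) / 100
    if budget_minutes <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return {
        "budget_minutes": round(budget_minutes, 3),
        "remaining_minutes": round(budget_minutes - downtime_minutes, 3),
        "consumed_percent": round(downtime_minutes / budget_minutes * 100, 1),
    }


# Hypothetical: a 10.8-minute outage against a 99.9% SLO over 30 days
print(remaining_error_budget(99.9, 30, 10.8))
```

A negative `remaining_minutes` value means the budget for the window is already exhausted.
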
### 8.2 Burn Rate Alerts

```yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn (2% budget in 1 hour)
      - alert: SLOFastBurn
        # Block scalar keeps the inline PromQL comment out of YAML parsing
        expr: |
          (1 - sli_ui_availability / 100) > (14.4 * 0.001)  # 14.4x normal burn rate
        for: 2m
        labels:
          severity: critical
          burn_rate: fast
        annotations:
          summary: "SLO fast burn detected - 2% budget consumed in 1 hour"

      # Slow burn (2% budget in 6 hours)
      - alert: SLOSlowBurn
        expr: |
          (1 - sli_ui_availability / 100) > (2.4 * 0.001)  # 2.4x normal burn rate
        for: 15m
        labels:
          severity: warning
          burn_rate: slow
        annotations:
          summary: "SLO slow burn detected - 2% budget consumed in 6 hours"
```

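
The burn-rate multipliers follow from simple arithmetic: a burn rate of 1x spends exactly the whole budget over the SLO window, so spending a given fraction of the budget within a shorter alert window requires a proportionally higher rate. The sketch below works this through for the fast-burn case; the function names are hypothetical and a 30-day SLO window is assumed.

```python
def burn_rate_for(budget_fraction: float, alert_window_hours: float,
                  slo_window_days: int = 30) -> float:
    """Burn rate that consumes budget_fraction of the error budget within
    alert_window_hours, given an SLO window of slo_window_days."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(burn_rate: float, slo_percent: float) -> float:
    """Error ratio (0..1) at which the given burn rate is reached."""
    return burn_rate * (100 - slo_percent) / 100


# 2% of a 30-day budget consumed in 1 hour
fast = burn_rate_for(0.02, 1)
print(round(fast, 2))                                # burn-rate multiplier
print(round(error_ratio_threshold(fast, 99.9), 4))   # error ratio to alert on
```

For a 99.9% SLO the error budget is 0.001 of requests, which is where the `0.001` factor in the alert expressions comes from.
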
## 9. Monitoring Implementation

### 9.1 Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "slo_rules.yml"
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8080']
    metrics_path: /metrics

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'neo4j'
    static_configs:
      - targets: ['neo4j:2004']

  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']
    metrics_path: /metrics
```

### 9.2 Grafana Dashboards

**SLO Dashboard Panels:**
- SLI trend graphs with SLO thresholds
- Error budget burn rate visualization
- Alert status and escalation paths
- Service dependency mapping
- Incident correlation timeline

### 9.3 Custom Metrics

```python
from prometheus_client import Counter, Histogram, Gauge

# Document processing metrics
document_processing_total = Counter(
    'document_processing_total',
    'Total document processing attempts',
    ['service', 'document_type', 'status']
)

document_processing_duration = Histogram(
    'document_processing_duration_seconds',
    'Document processing duration',
    ['service', 'document_type'],
    # Default buckets stop at 10s; include the 60s SLO threshold as a boundary
    buckets=(1, 5, 10, 30, 60, 120, 300)
)

# Field extraction accuracy
field_extraction_accuracy = Gauge(
    'field_extraction_accuracy_ratio',
    'Field extraction accuracy ratio',
    ['service', 'field_type']
)

# Tax calculation metrics
tax_calculation_confidence = Histogram(
    'tax_calculation_confidence_score',
    'Tax calculation confidence score',
    ['service', 'calculation_type'],
    # Scores lie in [0, 1], so use buckets that resolve that range
    buckets=(0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.0)
)
```

## 10. Incident Response Integration

### 10.1 SLO-Based Escalation

```yaml
escalation_policies:
  - name: "SLO Critical Burn"
    triggers:
      - alert: "SLOFastBurn"
        severity: "critical"
    actions:
      - notify: "oncall-engineer"
        delay: "0m"
      - notify: "engineering-manager"
        delay: "15m"
      - notify: "vp-engineering"
        delay: "30m"

  - name: "SLO Warning Burn"
    triggers:
      - alert: "SLOSlowBurn"
        severity: "warning"
    actions:
      - notify: "oncall-engineer"
        delay: "0m"
      - create_ticket: "jira"
        delay: "1h"
```

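
To make the delay semantics of the escalation policy concrete, the sketch below evaluates which contacts have been notified a given number of minutes after an alert fires. It is an illustration only: the in-memory policy representation, the helper names, and the assumption that delays use only `m` and `h` suffixes are not taken from any particular alerting tool.

```python
import re

# Notification steps of the "SLO Critical Burn" policy as (target, delay) pairs
CRITICAL_BURN_ACTIONS = [
    ("oncall-engineer", "0m"),
    ("engineering-manager", "15m"),
    ("vp-engineering", "30m"),
]

_UNIT_MINUTES = {"m": 1, "h": 60}


def parse_delay_minutes(delay: str) -> int:
    """Convert a delay such as '15m' or '1h' to minutes."""
    match = re.fullmatch(r"(\d+)([mh])", delay)
    if match is None:
        raise ValueError(f"unsupported delay format: {delay!r}")
    return int(match.group(1)) * _UNIT_MINUTES[match.group(2)]


def notified_after(actions: list[tuple[str, str]],
                   elapsed_minutes: int) -> list[str]:
    """Targets whose delay has elapsed, in escalation order."""
    return [target for target, delay in actions
            if parse_delay_minutes(delay) <= elapsed_minutes]


print(notified_after(CRITICAL_BURN_ACTIONS, 20))
```
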
### 10.2 Post-Incident Review

**SLO Impact Assessment:**
- Error budget consumption during incident
- SLO breach duration and severity
- Customer impact quantification
- Recovery time objectives (RTO) compliance
- Lessons learned and SLO adjustments

## 11. Continuous Improvement

### 11.1 SLO Review Process

**Monthly SLO Review:**
- Error budget consumption analysis
- SLI/SLO target adjustment recommendations
- New service SLO definition
- Alert tuning and false positive reduction

### 11.2 Capacity Planning

**SLO-Driven Capacity Planning:**
- Performance trend analysis against SLOs
- Resource scaling triggers based on SLI degradation
- Load testing scenarios to validate SLO targets
- Cost optimization while maintaining SLO compliance

---

**Document Classification**: INTERNAL
**Next Review Date**: 2024-04-30
**Approval**: SRE Team, Engineering Management