SLA & Escalation Policy
1. Service Level Agreement (SLA)
1.1 Defining the SLA in the Contract
```yaml
# contract.yaml
sla:
  # Data availability
  availability: "99.9%"    # % of time data is arriving
  # Data freshness
  freshness: "PT1H"        # ISO 8601 duration
  # Incident response times
  response_time:
    critical: "PT15M"      # P1 - 15 minutes
    high: "PT1H"           # P2 - 1 hour
    medium: "PT4H"         # P3 - 4 hours
    low: "P1D"             # P4 - 1 business day
  # Recovery time
  recovery_time: "PT2H"    # maximum time to fix
```
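The ISO 8601 durations above can be validated and enforced programmatically. Below is a minimal stdlib-only sketch, not a full ISO 8601 parser (it handles only the day/time designators used in this contract); `parse_duration` and `freshness_breached` are illustrative names, not part of any real contract tooling:

```python
# Sketch: parse the ISO 8601 durations used above and check data
# freshness against the SLA. Stdlib-only; NOT a full ISO 8601 parser
# (years and months are ignored).
import re
from datetime import datetime, timedelta

_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_duration(value: str) -> timedelta:
    """Convert e.g. 'PT1H', 'PT15M', or 'P1D' into a timedelta."""
    match = _DURATION_RE.match(value)
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v}
    return timedelta(**parts)

def freshness_breached(last_arrival: datetime, now: datetime,
                      sla_freshness: str) -> bool:
    """True if data is staler than the contract's freshness SLA."""
    return now - last_arrival > parse_duration(sla_freshness)
```

The same parser covers the `response_time` and `recovery_time` fields, so a single helper can drive all SLA timers.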
1.2 Severity Levels
| Severity | Code | Criteria | Response Time | Recovery Time |
|----------|------|----------|---------------|---------------|
| 🔴 Critical | P1 | Complete outage, data loss, >500 DLQ | 15 min | 2 hours |
| 🟠 High | P2 | Partial degradation, 100-500 DLQ, SLA breach | 1 hour | 4 hours |
| 🟡 Medium | P3 | Minor degradation, <100 DLQ | 4 hours | 1 day |
| ⚪ Low | P4 | Cosmetic issues, documentation | 1 day | 1 week |
1.3 Determining Severity
Automatic classification
```python
def determine_severity(incident):
    """Automatically determine incident severity."""
    # P1 - Critical
    if (incident.dlq_count > 500 or
            incident.data_loss or
            incident.complete_outage or
            incident.dlq_growth_rate > 100):
        return "P1"
    # P2 - High
    if (incident.dlq_count > 100 or
            incident.sla_breach or
            incident.partial_outage or
            incident.affects_critical_consumer):
        return "P2"
    # P3 - Medium
    if (incident.dlq_count > 10 or
            incident.quality_degradation):
        return "P3"
    # P4 - Low
    return "P4"
```
Impact × Urgency Matrix
```
              │  Low       Medium    High      Critical
              │  Urgency   Urgency   Urgency   Urgency
──────────────┼────────────────────────────────────────────
High Impact   │  P3        P2        P1        P1
Medium Impact │  P4        P3        P2        P2
Low Impact    │  P4        P4        P3        P3
```
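The matrix above translates directly into a lookup table. A minimal sketch; the function and dictionary names are illustrative:

```python
# Sketch: the Impact × Urgency matrix above as a lookup table.
# Keys are (impact, urgency) pairs; values are the P1-P4 priorities.
PRIORITY_MATRIX = {
    ("high",   "low"): "P3", ("high",   "medium"): "P2",
    ("high",   "high"): "P1", ("high",   "critical"): "P1",
    ("medium", "low"): "P4", ("medium", "medium"): "P3",
    ("medium", "high"): "P2", ("medium", "critical"): "P2",
    ("low",    "low"): "P4", ("low",    "medium"): "P4",
    ("low",    "high"): "P3", ("low",    "critical"): "P3",
}

def priority(impact: str, urgency: str) -> str:
    """Map impact and urgency onto a P1-P4 priority."""
    return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]
```

Keeping the matrix as data (rather than nested `if` statements) makes it trivial to diff against this document when the policy changes.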
2. Escalation Policy
2.1 Escalation Matrix
```
Time from incident start
│
├─ 0 min   ▶ Alert fired, Owner notified
│            └─ Slack: #{owner-channel}
│            └─ Email: {owner-email}
│
├─ 15 min  ▶ [P1 only] PagerDuty escalation
│            └─ On-call engineer paged
│            └─ Incident channel created
│
├─ 30 min  ▶ [P1/P2] Team Lead notified
│            └─ Slack DM to Team Lead
│            └─ SMS if no acknowledgment
│
├─ 1 hour  ▶ [P1] Manager escalation
│            └─ Engineering Manager notified
│            └─ Status page updated
│
├─ 2 hours ▶ [P1] Director escalation
│            └─ Director of Engineering notified
│            └─ War room convened
│
├─ 4 hours ▶ [P1] Executive escalation
│            └─ CTO notified
│            └─ External communication considered
│
└─ 8 hours ▶ [P1] Crisis management
             └─ CEO informed
             └─ Board notification if required
```
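The timeline above can be encoded as data so that escalation deadlines are computed rather than tracked by hand. A sketch with offsets copied from the P1 timeline; the function and constant names are assumptions, not existing tooling:

```python
# Sketch: the P1 escalation timeline above as data, so deadlines can
# be computed from the incident start time.
from datetime import datetime, timedelta

P1_ESCALATION_STEPS = [
    (timedelta(minutes=0),  "Owner notified (Slack + email)"),
    (timedelta(minutes=15), "PagerDuty escalation, on-call paged"),
    (timedelta(minutes=30), "Team Lead notified"),
    (timedelta(hours=1),    "Engineering Manager, status page updated"),
    (timedelta(hours=2),    "Director of Engineering, war room convened"),
    (timedelta(hours=4),    "CTO, external communication considered"),
    (timedelta(hours=8),    "CEO informed, board notification if required"),
]

def escalation_schedule(started_at: datetime):
    """Return (deadline, action) pairs for a P1 incident."""
    return [(started_at + offset, action)
            for offset, action in P1_ESCALATION_STEPS]
```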
2.2 Escalation Contacts
```yaml
# escalation_contacts.yaml (example)
levels:
  L1:
    name: "On-Call Engineer"
    contact:
      slack: "@oncall"
      pagerduty: "data-platform-oncall"
    response_expected: "5 min"
  L2:
    name: "Team Lead"
    contact:
      slack: "@team-lead-data"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "15 min"
  L3:
    name: "Engineering Manager"
    contact:
      slack: "@eng-manager"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "30 min"
  L4:
    name: "Director of Engineering"
    contact:
      slack: "@director-eng"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "1 hour"
  L5:
    name: "CTO"
    contact:
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "2 hours"
```
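When an alert is not acknowledged within `response_expected`, automation needs to know who is paged next. A sketch that walks the levels above in order; the inline `LEVELS` dict mirrors the YAML (in practice it would be loaded with `yaml.safe_load`), and `next_level` is an illustrative name:

```python
# Sketch: walk escalation levels in order on missed acknowledgment.
# LEVELS mirrors escalation_contacts.yaml above.
LEVELS = {
    "L1": {"name": "On-Call Engineer",        "response_expected_min": 5},
    "L2": {"name": "Team Lead",               "response_expected_min": 15},
    "L3": {"name": "Engineering Manager",     "response_expected_min": 30},
    "L4": {"name": "Director of Engineering", "response_expected_min": 60},
    "L5": {"name": "CTO",                     "response_expected_min": 120},
}

def next_level(current: str):
    """Return the next escalation level key, or None at the top."""
    order = sorted(LEVELS)  # L1..L5 sort lexicographically here
    idx = order.index(current)
    return order[idx + 1] if idx + 1 < len(order) else None
```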
2.3 Escalation Flowchart
```
   ┌─────────────────────────────────────────────┐
   │              INCIDENT DETECTED              │
   └─────────────────────────────────────────────┘
                          │
                          ▼
          ┌───────────────────────────────┐
          │  Determine Severity (P1-P4)   │
          └───────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
   ┌──────────┐      ┌──────────┐      ┌──────────┐
   │    P1    │      │  P2/P3   │      │    P4    │
   │ Critical │      │ High/Med │      │   Low    │
   └────┬─────┘      └────┬─────┘      └────┬─────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│  Immediate    │ │  Notify Owner │ │ Create Ticket │
│  PagerDuty    │ │  via Slack    │ │  for backlog  │
│  Escalation   │ │               │ │               │
└───────┬───────┘ └───────┬───────┘ └───────────────┘
        │                 │
        ▼                 ▼
┌───────────────┐ ┌───────────────┐
│ Create War    │ │ Owner         │
│ Room Channel  │ │ Acknowledges  │
│ #inc-XXXX     │ │ within SLA    │
└───────┬───────┘ └───────┬───────┘
        │                 │
        │          ┌──────┴──────┐
        │          │             │
        │          ▼             ▼
        │     ┌────────┐  ┌─────────────┐
        │     │  Yes   │  │     No      │
        │     │  Ack'd │  │ Auto-Escal. │
        │     └────────┘  └──────┬──────┘
        │                        │
        │                        ▼
        │                ┌───────────────┐
        │                │ Escalate to   │
        │                │ Team Lead (L2)│
        │                └───────────────┘
        │
        ▼
┌───────────────────────────────────────┐
│          INCIDENT MANAGEMENT          │
│                                       │
│  1. Assign Incident Commander         │
│  2. Start Timeline Documentation      │
│  3. Regular Status Updates (15 min)   │
│  4. Coordinate Resolution             │
│  5. Escalate if needed                │
└───────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│        RESOLUTION & POSTMORTEM        │
│                                       │
│  1. Confirm Issue Resolved            │
│  2. Update Status Page                │
│  3. Notify Stakeholders               │
│  4. Schedule Postmortem (P1/P2)       │
│  5. Create Follow-up Tasks            │
└───────────────────────────────────────┘
```
3. Communication Templates
3.1 Initial Alert
🚨 **Data Contract Incident: {namespace}.{entity}**
**Severity:** P{severity} ({severity_name})
**Status:** Investigating
**Started:** {timestamp}
**Impact:**
- DLQ Count: {dlq_count}
- Affected Consumers: {consumer_list}
- Data Loss: {yes/no}
**Owner:** @{owner_team}
**Incident Commander:** @{ic_name}
**Next Update:** {timestamp + 15min}
3.2 Status Update
📋 **Incident Update: {namespace}.{entity}** | {update_number}
**Status:** {status}
**Duration:** {duration}
**Current State:**
{current_state_description}
**Actions Taken:**
- {action_1}
- {action_2}
**Next Steps:**
- {next_step_1}
- {next_step_2}
**ETA:** {estimated_resolution_time}
**Next Update:** {timestamp + 15min}
3.3 Resolution
✅ **Incident Resolved: {namespace}.{entity}**
**Duration:** {total_duration}
**Root Cause:** {brief_root_cause}
**Resolution:**
{resolution_description}
**Impact Summary:**
- Records affected: {count}
- Consumers impacted: {list}
- Data recovered: {yes/no/partial}
**Follow-up:**
- Postmortem scheduled: {date}
- Tickets created: {ticket_links}
Thank you to everyone involved: {team_mentions}
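The templates above are plain placeholder strings, so they can be filled with Python's `str.format`. A sketch over a shortened Initial Alert template; the field names follow section 3.1, but the rendering helper itself is an assumption, not existing tooling:

```python
# Sketch: render a (shortened) Initial Alert template with str.format.
# Placeholder names match section 3.1.
INITIAL_ALERT = (
    "🚨 **Data Contract Incident: {namespace}.{entity}**\n"
    "**Severity:** P{severity} ({severity_name})\n"
    "**Status:** Investigating\n"
    "**Owner:** @{owner_team}"
)

def render_alert(**fields) -> str:
    """Fill the alert template; raises KeyError on missing fields."""
    return INITIAL_ALERT.format(**fields)
```

Using `format` (rather than string concatenation) means a missing field fails loudly, which is preferable to posting a half-filled incident message.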
4. RACI Matrix
| Activity | Owner | Data Platform | Team Lead | Manager | Consumer |
|----------|-------|---------------|-----------|---------|----------|
| Define contract | A, R | C | I | I | C |
| Monitor DLQ | I | R | I | I | I |
| Respond to alert | R, A | S | I | I | I |
| Fix data quality | R, A | S | I | I | I |
| Escalate issue | R | A | I | I | I |
| Postmortem | A | R | C | I | C |
| Update contract | A, R | C | I | I | C |
Legend:
- **R** - Responsible (does the work)
- **A** - Accountable (final authority)
- **C** - Consulted (provides input)
- **I** - Informed (kept in the loop)
- **S** - Support (provides resources)
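One useful automated check on a RACI matrix is that every activity has exactly one Accountable role. A sketch over a few rows copied from the table above (the helper name is illustrative; note that Monitor DLQ, as written, has no A and would be flagged):

```python
# Sketch: flag RACI rows that do not have exactly one Accountable.
# Rows copied from the matrix above.
RACI = {
    "Define contract":  {"Owner": {"A", "R"}, "Data Platform": {"C"},
                         "Team Lead": {"I"}, "Manager": {"I"}, "Consumer": {"C"}},
    "Monitor DLQ":      {"Owner": {"I"}, "Data Platform": {"R"},
                         "Team Lead": {"I"}, "Manager": {"I"}, "Consumer": {"I"}},
    "Respond to alert": {"Owner": {"R", "A"}, "Data Platform": {"S"},
                         "Team Lead": {"I"}, "Manager": {"I"}, "Consumer": {"I"}},
}

def missing_single_accountable(matrix):
    """Return activities that do not have exactly one 'A' role."""
    return [
        activity
        for activity, roles in matrix.items()
        if sum("A" in codes for codes in roles.values()) != 1
    ]
```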
5. SLA Reporting
5.1 Weekly SLA Report
```
═══════════════════════════════════════════════════════════
              DATA CONTRACTS SLA REPORT
              Week of {date_range}
═══════════════════════════════════════════════════════════

SUMMARY
───────
Total Contracts:  {count}
SLA Met:          {count} ({percentage}%)
SLA Breached:     {count} ({percentage}%)

TOP 5 PERFORMING
────────────────
1. warehouse.inventory     99.99% availability, 0 DLQ
2. marketing.campaigns     99.95% availability, 2 DLQ
3. sales.customers         99.92% availability, 5 DLQ
4. finance.transactions    99.90% availability, 8 DLQ
5. crm.leads               99.85% availability, 12 DLQ

REQUIRES ATTENTION
──────────────────
1. sales.orders            98.50% availability, 127 DLQ
   └─ Root cause: 1C integration issue
   └─ Action: Hotfix deployed, monitoring
2. warehouse.shipments     97.80% availability, 45 DLQ
   └─ Root cause: Schema mismatch after update
   └─ Action: Contract updated, reprocessing

INCIDENTS
─────────
P1 Incidents: {count} | MTTR: {avg_time}
P2 Incidents: {count} | MTTR: {avg_time}
P3 Incidents: {count} | MTTR: {avg_time}

TRENDS
──────
SLA Compliance: ▲ +2% vs last week
DLQ Volume:     ▼ -15% vs last week
MTTR:           ▼ -10 min vs last week
═══════════════════════════════════════════════════════════
```
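The availability and MTTR figures in the report can be derived from incident records. A stdlib-only sketch; the record shape (a list of incident durations plus total downtime per contract) is an assumption:

```python
# Sketch: compute the availability % and MTTR figures reported above.
from datetime import timedelta

def availability(downtime: timedelta, window: timedelta) -> float:
    """Uptime percentage over a reporting window."""
    return 100.0 * (1 - downtime / window)

def mttr(durations: list) -> timedelta:
    """Mean time to recovery across a list of incident durations."""
    return sum(durations, timedelta()) / len(durations)
```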
5.2 SLA Breach Consequences
| Consecutive Breaches | Action |
|----------------------|--------|
| 1 | Warning, investigation required |
| 2 | Formal review with Team Lead |
| 3 | Escalation to Manager, improvement plan |
| 4+ | Executive review, potential team restructuring |
Version: 1.0 Last updated: January 24, 2026