SLA & Escalation Policy

1. Service Level Agreement (SLA)

1.1 Defining the SLA in the Contract

# contract.yaml
sla:
  # Data availability
  availability: "99.9%"          # % of time data is being delivered

  # Data freshness
  freshness: "PT1H"              # ISO 8601 duration

  # Incident response time
  response_time:
    critical: "PT15M"            # P1 - 15 minutes
    high: "PT1H"                 # P2 - 1 hour
    medium: "PT4H"               # P3 - 4 hours
    low: "P1D"                   # P4 - 1 business day

  # Recovery time
  recovery_time: "PT2H"          # Maximum time to fix
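
The duration fields above are ISO 8601 durations, so any automated checker has to parse them before comparing against observed latency. Below is a minimal sketch of a freshness check, assuming timezone-aware UTC timestamps; `parse_duration` deliberately covers only the duration forms used in this contract (PnD, PTnH, PTnM), and the function names are illustrative rather than part of existing tooling.

# freshness_check.py (sketch)
import re
from datetime import datetime, timedelta, timezone

def parse_duration(value: str) -> timedelta:
    """Parse the subset of ISO 8601 durations used above, e.g. 'PT15M', 'P1D'."""
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?", value)
    if not m or not any(m.groups()):
        raise ValueError(f"Unsupported duration: {value}")
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)

def freshness_breached(last_event_at: datetime, sla_freshness: str) -> bool:
    """True if the newest record is older than the contract's freshness SLA."""
    return datetime.now(timezone.utc) - last_event_at > parse_duration(sla_freshness)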

1.2 Severity Levels

| Severity | Code | Criteria | Response Time | Recovery Time |
|----------|------|----------|---------------|---------------|
| 🔴 Critical | P1 | Complete outage, data loss, >500 DLQ | 15 min | 2 hours |
| 🟠 High | P2 | Partial degradation, 100-500 DLQ, SLA breach | 1 hour | 4 hours |
| 🟡 Medium | P3 | Minor degradation, <100 DLQ | 4 hours | 1 day |
| 🟢 Low | P4 | Cosmetic issues, documentation | 1 day | 1 week |

1.3 Determining Severity

Automatic determination

def determine_severity(incident):
    """Automatically determine incident severity (criteria from section 1.2)."""

    # P1 - Critical: complete outage, data loss, or a runaway DLQ
    if (incident.dlq_count > 500 or
        incident.data_loss or
        incident.complete_outage or
        incident.dlq_growth_rate > 100):
        return "P1"

    # P2 - High: partial degradation or a breached SLA
    if (incident.dlq_count > 100 or
        incident.sla_breach or
        incident.partial_outage or
        incident.affects_critical_consumer):
        return "P2"

    # P3 - Medium: minor degradation
    if (incident.dlq_count > 10 or
        incident.quality_degradation):
        return "P3"

    # P4 - Low: everything else
    return "P4"
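
The function assumes only that `incident` exposes the referenced attributes. For context, a hypothetical `Incident` container and a call might look like this (field names mirror the checks above; the values are made up):

# Hypothetical container, for illustration only
from dataclasses import dataclass

@dataclass
class Incident:
    dlq_count: int = 0
    dlq_growth_rate: float = 0.0     # units as defined by the monitoring metric
    data_loss: bool = False
    complete_outage: bool = False
    partial_outage: bool = False
    sla_breach: bool = False
    affects_critical_consumer: bool = False
    quality_degradation: bool = False

incident = Incident(dlq_count=250, sla_breach=True)
print(determine_severity(incident))  # -> "P2"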

Impact × Urgency Matrix

              │  Low       Medium     High       Critical
              │ Urgency    Urgency    Urgency    Urgency
──────────────┼────────────────────────────────────────────
High Impact   │   P3        P2         P1          P1
Medium Impact │   P4        P3         P2          P2
Low Impact    │   P4        P4         P3          P3
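
The matrix translates directly into a lookup table. A minimal sketch (names are illustrative):

# (impact, urgency) -> priority, row by row from the matrix above
PRIORITY = {
    ("high", "low"): "P3",    ("high", "medium"): "P2",
    ("high", "high"): "P1",   ("high", "critical"): "P1",
    ("medium", "low"): "P4",  ("medium", "medium"): "P3",
    ("medium", "high"): "P2", ("medium", "critical"): "P2",
    ("low", "low"): "P4",     ("low", "medium"): "P4",
    ("low", "high"): "P3",    ("low", "critical"): "P3",
}

def priority(impact: str, urgency: str) -> str:
    return PRIORITY[(impact.lower(), urgency.lower())]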

2. Escalation Policy

2.1 Escalation Matrix

Time from incident start
├─ 0 min    ▶ Alert fired, Owner notified
│              └─ Slack: #{owner-channel}
│              └─ Email: {owner-email}
├─ 15 min   ▶ [P1 only] PagerDuty escalation
│              └─ On-call engineer paged
│              └─ Incident channel created
├─ 30 min   ▶ [P1/P2] Team Lead notified
│              └─ Slack DM to Team Lead
│              └─ SMS if no acknowledgment
├─ 1 hour   ▶ [P1] Manager escalation
│              └─ Engineering Manager notified
│              └─ Status page updated
├─ 2 hours  ▶ [P1] Director escalation
│              └─ Director of Engineering notified
│              └─ War room convened
├─ 4 hours  ▶ [P1] Executive escalation
│              └─ CTO notified
│              └─ External communication considered
└─ 8 hours  ▶ [P1] Crisis management
               └─ CEO informed
               └─ Board notification if required
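
The timeline can be encoded as data so a scheduler, rather than a person, drives the notifications. A minimal sketch, assuming elapsed time is measured from the first alert; the action names mirror the timeline above and are illustrative:

# Escalation schedule: (minutes since incident start, applicable severities, action)
ESCALATION_SCHEDULE = [
    (0,   {"P1", "P2", "P3", "P4"}, "notify_owner"),
    (15,  {"P1"},                   "page_oncall"),
    (30,  {"P1", "P2"},             "notify_team_lead"),
    (60,  {"P1"},                   "notify_manager"),
    (120, {"P1"},                   "notify_director"),
    (240, {"P1"},                   "notify_cto"),
    (480, {"P1"},                   "crisis_management"),
]

def due_actions(severity: str, elapsed_minutes: int) -> list[str]:
    """All actions that should already have fired for this severity."""
    return [action for minutes, severities, action in ESCALATION_SCHEDULE
            if severity in severities and elapsed_minutes >= minutes]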

2.2 Escalation Contacts

# escalation_contacts.yaml (example)
levels:
  L1:
    name: "On-Call Engineer"
    contact:
      slack: "@oncall"
      pagerduty: "data-platform-oncall"
    response_expected: "5 min"

  L2:
    name: "Team Lead"
    contact:
      slack: "@team-lead-data"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "15 min"

  L3:
    name: "Engineering Manager"
    contact:
      slack: "@eng-manager"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "30 min"

  L4:
    name: "Director of Engineering"
    contact:
      slack: "@director-eng"
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "1 hour"

  L5:
    name: "CTO"
    contact:
      phone: "+7-XXX-XXX-XXXX"
    response_expected: "2 hours"
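
A sketch of how the auto-escalation step in 2.3 could consume this file, assuming PyYAML and the structure shown above; the actual notification transport (Slack, PagerDuty, SMS) is left abstract:

import yaml  # PyYAML; reads the escalation_contacts.yaml example above

with open("escalation_contacts.yaml") as f:
    LEVELS = yaml.safe_load(f)["levels"]

def next_level(current: str) -> str | None:
    """L1 -> L2 -> ... -> L5; None once the top level is reached."""
    order = sorted(LEVELS)  # ["L1", "L2", ...]
    idx = order.index(current)
    return order[idx + 1] if idx + 1 < len(order) else None

def escalate_if_unacknowledged(current: str, acknowledged: bool) -> str | None:
    """Return the next level to notify when the current one has not acknowledged."""
    return None if acknowledged else next_level(current)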

2.3 Escalation Flowchart

┌─────────────────────────────────────────────────────────────────────────────┐
│                           INCIDENT DETECTED                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                    ┌───────────────────────────────┐
                    │   Determine Severity (P1-P4)  │
                    └───────────────────────────────┘
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
       ┌──────────┐          ┌──────────┐          ┌──────────┐
       │    P1    │          │  P2/P3   │          │    P4    │
       │ Critical │          │ High/Med │          │   Low    │
       └────┬─────┘          └────┬─────┘          └────┬─────┘
            │                     │                     │
            ▼                     ▼                     ▼
    ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
    │ Immediate     │     │ Notify Owner  │     │ Create Ticket │
    │ PagerDuty     │     │ via Slack     │     │ for backlog   │
    │ Escalation    │     │               │     │               │
    └───────┬───────┘     └───────┬───────┘     └───────────────┘
            │                     │
            ▼                     ▼
    ┌───────────────┐     ┌───────────────┐
    │ Create War    │     │ Owner         │
    │ Room Channel  │     │ Acknowledges  │
    │ #inc-XXXX     │     │ within SLA    │
    └───────┬───────┘     └───────┬───────┘
            │                     │
            │              ┌──────┴──────┐
            │              │             │
            │              ▼             ▼
            │         ┌────────┐   ┌─────────────┐
            │         │  Yes   │   │    No       │
            │         │ Ack'd  │   │ Auto-Escal. │
            │         └────────┘   └──────┬──────┘
            │                             │
            │                             ▼
            │                    ┌───────────────┐
            │                    │ Escalate to   │
            │                    │ Team Lead (L2)│
            │                    └───────────────┘
    ┌───────────────────────────────────────┐
    │         INCIDENT MANAGEMENT            │
    │                                        │
    │  1. Assign Incident Commander          │
    │  2. Start Timeline Documentation       │
    │  3. Regular Status Updates (15 min)    │
    │  4. Coordinate Resolution              │
    │  5. Escalate if needed                 │
    └───────────────────────────────────────┘
    ┌───────────────────────────────────────┐
    │         RESOLUTION & POSTMORTEM        │
    │                                        │
    │  1. Confirm Issue Resolved             │
    │  2. Update Status Page                 │
    │  3. Notify Stakeholders                │
    │  4. Schedule Postmortem (P1/P2)        │
    │  5. Create Follow-up Tasks             │
    └───────────────────────────────────────┘

3. Communication Templates

3.1 Initial Alert

🚨 **Data Contract Incident: {namespace}.{entity}**

**Severity:** P{severity} ({severity_name})
**Status:** Investigating
**Started:** {timestamp}

**Impact:**
- DLQ Count: {dlq_count}
- Affected Consumers: {consumer_list}
- Data Loss: {yes/no}

**Owner:** @{owner_team}
**Incident Commander:** @{ic_name}

**Next Update:** {timestamp + 15min}
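
The placeholders are filled in by whatever automation posts the alert. A sketch of rendering this template from an incident record (the dictionary keys are illustrative, and computed placeholders such as `{timestamp + 15min}` are resolved before formatting):

from datetime import datetime, timedelta, timezone

def initial_alert(incident: dict) -> str:
    """Fill the 3.1 template; 'incident' keys are illustrative."""
    next_update = datetime.now(timezone.utc) + timedelta(minutes=15)
    return (
        f"🚨 **Data Contract Incident: {incident['namespace']}.{incident['entity']}**\n\n"
        f"**Severity:** {incident['severity']} ({incident['severity_name']})\n"
        "**Status:** Investigating\n"
        f"**Started:** {incident['started_at']:%Y-%m-%d %H:%M} UTC\n\n"
        "**Impact:**\n"
        f"- DLQ Count: {incident['dlq_count']}\n"
        f"- Affected Consumers: {', '.join(incident['consumers'])}\n\n"
        f"**Owner:** @{incident['owner_team']}\n"
        f"**Next Update:** {next_update:%H:%M} UTC\n"
    )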

3.2 Status Update

📋 **Incident Update: {namespace}.{entity}** | {update_number}

**Status:** {status}
**Duration:** {duration}

**Current State:**
{current_state_description}

**Actions Taken:**
- {action_1}
- {action_2}

**Next Steps:**
- {next_step_1}
- {next_step_2}

**ETA:** {estimated_resolution_time}
**Next Update:** {timestamp + 15min}

3.3 Resolution

✅ **Incident Resolved: {namespace}.{entity}**

**Duration:** {total_duration}
**Root Cause:** {brief_root_cause}

**Resolution:**
{resolution_description}

**Impact Summary:**
- Records affected: {count}
- Consumers impacted: {list}
- Data recovered: {yes/no/partial}

**Follow-up:**
- Postmortem scheduled: {date}
- Tickets created: {ticket_links}

Thank you to everyone involved: {team_mentions}

4. RACI Matrix

| Activity | Owner | Data Platform | Team Lead | Manager | Consumer |
|----------|-------|---------------|-----------|---------|----------|
| Define contract | A, R | C | I | I | C |
| Monitor DLQ | I | R | I | I | I |
| Respond to alert | R, A | S | I | I | I |
| Fix data quality | R, A | S | I | I | I |
| Escalate issue | R | A | I | I | I |
| Postmortem | A | R | C | I | C |
| Update contract | A, R | C | I | I | C |

Legend:

- R - Responsible (does the work)
- A - Accountable (final authority)
- C - Consulted (provides input)
- I - Informed (kept in the loop)
- S - Support (provides resources)


5. SLA Reporting

5.1 Weekly SLA Report

═══════════════════════════════════════════════════════════
           DATA CONTRACTS SLA REPORT
           Week of {date_range}
═══════════════════════════════════════════════════════════

SUMMARY
───────
Total Contracts:       {count}
SLA Met:               {count} ({percentage}%)
SLA Breached:          {count} ({percentage}%)

TOP 5 PERFORMING
────────────────
1. warehouse.inventory    99.99% availability, 0 DLQ
2. marketing.campaigns    99.95% availability, 2 DLQ
3. sales.customers        99.92% availability, 5 DLQ
4. finance.transactions   99.90% availability, 8 DLQ
5. crm.leads              99.85% availability, 12 DLQ

REQUIRES ATTENTION
──────────────────
1. sales.orders          98.50% availability, 127 DLQ
   └─ Root cause: 1C integration issue
   └─ Action: Hotfix deployed, monitoring

2. warehouse.shipments   97.80% availability, 45 DLQ
   └─ Root cause: Schema mismatch after update
   └─ Action: Contract updated, reprocessing

INCIDENTS
─────────
P1 Incidents: {count} | MTTR: {avg_time}
P2 Incidents: {count} | MTTR: {avg_time}
P3 Incidents: {count} | MTTR: {avg_time}

TRENDS
──────
SLA Compliance:  ▲ +2% vs last week
DLQ Volume:      ▼ -15% vs last week
MTTR:            ▼ -10 min vs last week
═══════════════════════════════════════════════════════════
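
The SUMMARY and MTTR figures can be derived from raw incident records. A minimal sketch, assuming each incident carries its priority plus start and resolution timestamps (field names are illustrative):

from datetime import timedelta

def mttr_by_priority(incidents: list[dict]) -> dict[str, timedelta]:
    """Mean time to resolve per priority."""
    durations: dict[str, list[timedelta]] = {}
    for inc in incidents:
        durations.setdefault(inc["priority"], []).append(
            inc["resolved_at"] - inc["started_at"])
    return {p: sum(ds, timedelta()) / len(ds) for p, ds in durations.items()}

def sla_compliance(met: int, total: int) -> float:
    """Share of contracts that met their SLA this week, as a percentage."""
    return 100.0 * met / total if total else 100.0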

5.2 SLA Breach Consequences

| Consecutive Breaches | Action |
|----------------------|--------|
| 1 | Warning, investigation required |
| 2 | Formal review with Team Lead |
| 3 | Escalation to Manager, improvement plan |
| 4+ | Executive review, potential team restructuring |
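
Enforcing this ladder means tracking a breach streak per contract. A sketch of the counter logic (persistence and the week-over-week job are out of scope here):

ACTIONS = {
    1: "Warning, investigation required",
    2: "Formal review with Team Lead",
    3: "Escalation to Manager, improvement plan",
}

def update_streak(streak: int, breached_this_week: bool) -> tuple[int, str | None]:
    """Advance a contract's consecutive-breach counter and pick the action."""
    if not breached_this_week:
        return 0, None  # the streak resets after a clean week
    streak += 1
    return streak, ACTIONS.get(streak, "Executive review, potential team restructuring")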

Version: 1.0 | Last updated: January 24, 2026