SRE & Incident Management Platform

Complete Site Reliability Engineering system — from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.

Phase 1: Reliability Assessment

Before building anything, assess where you are.

Service Catalog Entry

service:
  name: ""
  tier: ""  # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: []    # services we call
    downstream: []  # services that call us
  data_classification: ""  # public | internal | confidential | restricted
  deployment_frequency: ""  # daily | weekly | biweekly | monthly
  architecture: ""  # monolith | microservice | serverless | hybrid
  language: ""
  infra: ""  # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: ""  # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0

Maturity Assessment (Score 1-5 per dimension)

Dimension	1 (Ad-hoc)	3 (Defined)	5 (Optimized)
SLOs	No SLOs defined	SLOs exist, reviewed quarterly	Data-driven SLOs, auto error budgets
Monitoring	Basic health checks	Golden signals + dashboards	Full observability, anomaly detection
Incident Response	No runbooks, hero culture	Documented process, postmortems	Automated detection, structured ICS
Automation	Manual deployments	CI/CD pipeline, some automation	Self-healing, auto-scaling, GitOps
Chaos Engineering	No testing	Basic failure injection	Continuous chaos in production
Capacity Planning	Reactive scaling	Quarterly forecasting	Predictive auto-scaling
Toil Management	>50% toil	Toil tracked, reduction plans	<25% toil, systematic elimination
On-Call Health	Burnout, 24/7 individuals	Rotation exists, escalation paths	Balanced load, <2 pages/shift

Score interpretation: - 8-16: Firefighting mode — start with SLOs + incident process - 17-24: Foundation built — add chaos engineering + toil reduction - 25-32: Maturing — optimize error budgets + capacity planning - 33-40: Advanced — focus on predictive reliability + culture

Phase 2: SLI/SLO Framework

SLI Selection by Service Type

Service Type	Primary SLI	Secondary SLIs
API/Backend	Request success rate	Latency p50/p95/p99, throughput
Frontend/Web	Page load (LCP)	FID/INP, CLS, error rate
Data Pipeline	Freshness	Correctness, completeness, throughput
Storage	Durability	Availability, latency
Streaming	Processing latency	Throughput, ordering, data loss rate
Batch Job	Success rate	Duration, SLA compliance
ML Model	Prediction latency	Accuracy drift, feature freshness

SLI Specification Template

sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability"  # availability | latency | quality | freshness
  measurement:
    good_events: "HTTP responses with status < 500"
    total_events: "All HTTP requests excluding health checks"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"

SLO Target Selection Guide

Nines	Uptime %	Downtime/month	Appropriate for
2 nines	99%	7h 18m	Internal tools, dev environments
2.5	99.5%	3h 39m	Non-critical services, backoffice
3 nines	99.9%	43m 50s	Standard production services
3.5	99.95%	21m 55s	Important customer-facing services
4 nines	99.99%	4m 23s	Critical services, payments, auth
5 nines	99.999%	26s	Life-safety, financial clearing

Rules for setting targets: 1. Start lower than you think — you can always tighten 2. SLO < SLA (always have buffer — typically 0.1-0.5% margin) 3. Internal SLO < External SLO (catch problems before customers do) 4. Each nine costs ~10x more to achieve 5. If you can't measure it, you can't SLO it

SLO Document Template

slo:
  service: ""
  sli: ""
  target: 99.9  # percentage
  window: "28d"  # rolling window
  error_budget: 0.1  # 100% - target
  error_budget_minutes: 40  # per 28-day window

  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4  # exhausts budget in 2 hours
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0   # exhausts budget in ~5 hours
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0   # exhausts budget in 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"

  review_cadence: "monthly"
  owner: ""
  stakeholders: []

  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"

Phase 3: Error Budget Management

Error Budget Policy

error_budget_policy:
  service: ""

  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"

    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"

    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze — reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"

    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"

  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"

  reset: "Rolling 28-day window (no manual resets)"

Burn Rate Calculation

Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x

At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days

Error Budget Dashboard

Track weekly:

Metric	Trend	Status
Budget remaining (%)	↑↓→	🟢🟡🔴
Budget consumed this week
Burn rate (1h / 6h / 24h)
Incidents consuming budget
Top error contributor
Projected exhaustion date

Phase 4: Monitoring & Alerting Architecture

Four Golden Signals

Signal	What to Measure	Alert When
Latency	p50, p95, p99 response time	p99 > 2x baseline for 5 min
Traffic	Requests/sec, concurrent users	>30% drop (indicates upstream issue) OR >50% spike
Errors	5xx rate, timeout rate, exception rate	Error rate > SLO burn rate threshold
Saturation	CPU, memory, disk, connections, queue depth	>80% sustained for 10 min

USE Method (Infrastructure)

For every resource, track: - Utilization: % of capacity used (0-100%) - Saturation: queue depth / wait time (0 = no waiting) - Errors: error count / error rate

RED Method (Services)

For every service, track: - Rate: requests per second - Errors: failed requests per second - Duration: latency distribution

Alert Design Rules

Every alert must have a runbook link — no exceptions
Every alert must be actionable — if you can't act on it, delete it
Symptoms over causes — alert on "users can't check out" not "database CPU high"
Multi-window, multi-burn-rate — avoid single-threshold alerts
Page only for customer impact — everything else is a ticket
Alert fatigue = death — review alert volume monthly; target <5 pages/week per service

Alert Severity Guide

Severity	Response Time	Notification	Examples
P0/Page	<5 min	PagerDuty + phone	SLO burn rate critical, data loss, security breach
P1/Urgent	<30 min	Slack + PagerDuty	Degraded service, elevated errors, capacity warning
P2/Ticket	Next business day	Ticket auto-created	Slow burn, non-critical component down
P3/Log	Weekly review	Dashboard only	Informational, trend detection

Structured Log Standard

{
  "timestamp": "2026-02-17T11:24:00.000Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment processing failed",
  "error_type": "TimeoutException",
  "error_message": "Gateway timeout after 30s",
  "http_method": "POST",
  "http_path": "/api/v1/payments",
  "http_status": 504,
  "duration_ms": 30012,
  "customer_id": "cust_xxx",
  "payment_id": "pay_yyy",
  "amount_cents": 4999,
  "retry_count": 2,
  "environment": "production",
  "host": "payment-api-7b4d9-xk2p1",
  "region": "us-east-1"
}

Phase 5: Incident Response Framework

Severity Classification Matrix

	Impact: 1 User	Impact: <25% Users	Impact: >25% Users	Impact: All Users
Core function down	SEV3	SEV2	SEV1	SEV1
Degraded performance	SEV4	SEV3	SEV2	SEV1
Non-core feature down	SEV4	SEV3	SEV3	SEV2
Cosmetic/minor	SEV4	SEV4	SEV3	SEV3

Auto-escalation triggers: - Any data loss → SEV1 minimum - Security breach with PII → SEV1 - Revenue-impacting → SEV1 or SEV2 - SLA breach imminent → auto-escalate one level

Incident Command System (ICS)

Role	Responsibility	Assigned
Incident Commander (IC)	Owns resolution, makes decisions, manages timeline
Communications Lead	Status updates, stakeholder comms, customer-facing
Operations Lead	Hands-on-keyboard, executing fixes
Subject Matter Expert	Deep knowledge of affected system
Scribe	Documenting timeline, actions, decisions

IC Rules: 1. IC does NOT debug — IC coordinates 2. IC makes final decisions when team disagrees 3. IC can escalate severity at any time 4. IC owns handoff if rotation changes 5. IC calls end-of-incident

Incident Response Workflow

DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW

Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?

Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document

Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2

Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│   ├── Rollback last deployment
│   ├── Feature flag disable
│   ├── Traffic shift / failover
│   ├── Scale up / circuit breaker
│   └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"

Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved

Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion

Communication Templates

Initial notification (internal):

🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]

Customer-facing status:

[Service] - Investigating increased error rates

We are currently investigating reports of [symptom]. 
Some users may experience [user-visible impact].
Our team is actively working on a resolution.
We will provide an update within [time].

Resolution notification:

✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)

Phase 6: Postmortem Framework

Blameless Postmortem Template

postmortem:
  title: ""
  date: ""
  severity: ""  # SEV1-4
  duration: ""  # total incident duration
  authors: []
  reviewers: []
  status: "draft"  # draft | in-review | final

  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.

  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0

  timeline:
    - time: ""
      event: ""
      # Chronological, every significant event
      # Include detection time, escalation, mitigation attempts

  root_cause: |
    Technical explanation of WHY it happened.
    Go deep — surface causes are not root causes.

  contributing_factors:
    - ""  # What made it worse or delayed resolution?

  detection:
    how_detected: ""  # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""

  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0

  what_went_well:
    - ""  # Explicitly call out what worked

  what_went_wrong:
    - ""

  where_we_got_lucky:
    - ""  # Things that could have made it worse

  action_items:
    - id: "AI-001"
      type: ""  # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: ""  # P0 | P1 | P2
      deadline: ""
      status: "open"  # open | in-progress | done
      ticket: ""

Root Cause Analysis Methods

Five Whys (simple incidents): 1. Why did users see errors? → API returned 500s 2. Why did API return 500s? → Database connection pool exhausted 3. Why was pool exhausted? → Long-running query held connections 4. Why was query long-running? → Missing index on new column 5. Why was index missing? → Migration didn't include index; no query performance review in CI

→ Root cause: No automated query performance check in deployment pipeline → Action: Add query plan analysis to CI for migration PRs

Fishbone / Ishikawa (complex incidents):

Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?

Contributing Factor Categories: | Category | Questions | |----------|-----------| | Trigger | What change or event started it? | | Propagation | Why did it spread? Why wasn't it contained? | | Detection | Why wasn't it caught earlier? | | Resolution | What slowed the fix? | | Process | What process gaps contributed? |

Postmortem Review Meeting (60 min)

1. Timeline walk-through (15 min)
   - Author presents chronology
   - Attendees add context ("I remember seeing X at this point")

2. Root cause deep-dive (15 min)  
   - Do we agree on root cause?
   - Are there additional contributing factors?

3. Action item review (20 min)
   - Are these the RIGHT actions?
   - Are they prioritized correctly?
   - Do owners agree on deadlines?

4. Process improvements (10 min)
   - Could we have detected this sooner?
   - Could we have resolved this faster?
   - What would have prevented this entirely?

Phase 7: Chaos Engineering

Chaos Maturity Model

Level	Name	Activities
0	None	No chaos testing
1	Exploratory	Manual fault injection in staging
2	Systematic	Scheduled chaos experiments in staging
3	Production	Controlled chaos in production (Game Days)
4	Continuous	Automated chaos in production with safety controls

Chaos Experiment Template

experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"

  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""

  method:
    fault_type: ""  # network | compute | storage | dependency | data
    target: ""      # which service/component
    blast_radius: ""  # single pod | single AZ | percentage of traffic
    duration: ""

  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []

  results:
    outcome: ""  # confirmed | disproved | inconclusive
    observations: []
    action_items: []

Chaos Experiment Library

Category	Experiment	Validates
Network	Add 200ms latency to DB calls	Timeout handling, circuit breakers
Network	Drop 5% of packets to downstream	Retry logic, error handling
Network	DNS resolution failure	Caching, fallback, error messages
Compute	Kill random pod every 10 min	Auto-restart, load balancing
Compute	CPU stress to 95% on 1 node	Auto-scaling, graceful degradation
Compute	Fill disk to 95%	Disk monitoring, log rotation, alerts
Storage	Increase DB latency 5x	Connection pool handling, timeouts
Storage	Simulate cache failure (Redis down)	Cache-aside pattern, DB fallback
Dependency	Block external API (payment provider)	Circuit breaker, queuing, retry
Dependency	Return 429s from auth service	Rate limit handling, backoff
Data	Clock skew on subset of nodes	Timestamp handling, ordering
Scale	10x traffic spike over 5 minutes	Auto-scaling speed, queue depth

Game Day Runbook

PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Runbacks tested in staging
□ Stakeholders notified

GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state

POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting

Phase 8: Toil Management

Toil Identification

Definition: Work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth.

Toil Inventory Template

toil_item:
  name: ""
  category: ""  # deployment | scaling | config | data | access | monitoring | recovery
  frequency: ""  # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: ""  # low | medium | high
  automation_value: 0  # hours saved per month
  priority_score: 0  # value / difficulty

Toil Reduction Priority Matrix

	Low Effort	Medium Effort	High Effort
High Value (>10 hrs/mo)	DO FIRST	DO SECOND	PLAN
Med Value (2-10 hrs/mo)	DO SECOND	PLAN	EVALUATE
Low Value (<2 hrs/mo)	QUICK WIN	SKIP	SKIP

Common Toil Targets (Ranked by Impact)

Manual deployments → CI/CD pipeline + GitOps
Access provisioning → Self-service + auto-approval for low-risk
Certificate renewals → Auto-renewal (cert-manager, Let's Encrypt)
Scaling decisions → HPA + predictive auto-scaling
Log investigation → Structured logging + correlation + dashboards
Data fixes → Self-service admin tools + validation at ingestion
Config changes → Config-as-code + automated rollout
Incident response → Automated runbooks for known issues
Capacity reporting → Automated dashboards + forecasting
On-call triage → Noise reduction + auto-remediation for known patterns

Toil Budget Rule

Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.

Phase 9: Capacity Planning

Capacity Model Template

capacity_model:
  service: ""
  bottleneck_resource: ""  # CPU | memory | storage | connections | bandwidth

  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0

  growth_forecast:
    metric: ""  # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0

  scaling_strategy:
    type: ""  # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80  # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300

  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0

Capacity Planning Cadence

Frequency	Action
Daily	Review auto-scaling events, check for anomalies
Weekly	Review utilization trends, spot-check headroom
Monthly	Update growth model, review cost projections
Quarterly	Full capacity review, budget planning, architecture check
Pre-launch	Load test to 2x expected peak, verify scaling

Load Testing Benchmarks

Scenario	Method	Duration	Target
Baseline	Steady load at current peak	30 min	Establish metrics
Growth	2x current peak	15 min	Verify scaling works
Spike	10x normal in 60 seconds	5 min	Circuit breakers hold
Soak	1.5x normal load	4 hours	No memory leaks, degradation
Stress	Ramp until failure	Until break	Find actual limits

Phase 10: On-Call Excellence

On-Call Health Metrics

Metric	Healthy	Warning	Critical
Pages per shift	<2	2-5	>5
Off-hours pages	<1/week	1-3/week	>3/week
Time to acknowledge	<5 min	5-15 min	>15 min
Time to mitigate	<30 min	30-60 min	>60 min
False positive rate	<10%	10-30%	>30%
Escalation rate	<20%	20-40%	>40%
On-call satisfaction	>4/5	3-4/5	<3/5

On-Call Rotation Best Practices

Minimum rotation size: 5 people (one week on, four weeks off)
No back-to-back weeks unless team is too small (fix the team size)
Follow-the-sun for global teams (no one pages at 3 AM if avoidable)
Primary + secondary on-call always
Handoff document at rotation change — open issues, recent deploys, known risks
Compensation — on-call pay, time off in lieu, or equivalent

On-Call Handoff Template

## On-Call Handoff: [Date]

### Open Issues
- [Issue]: [Status, next steps]

### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]

### Known Risks
- [Event/condition]: [What to watch for]

### Scheduled Maintenance
- [When]: [What, duration, rollback plan]

### Runbook Updates
- [Any new/updated runbooks since last rotation]

Runbook Template

runbook:
  title: ""
  alert_name: ""  # exact alert that triggers this
  last_updated: ""
  owner: ""

  overview: |
    What this alert means in plain English.

  impact: |
    What users/systems are affected and how.

  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""

  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []

  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""

Phase 11: Reliability Review & Governance

Weekly SRE Review (30 min)

1. SLO Status (5 min)
   - Budget remaining per service
   - Any burn rate alerts this week?

2. Incident Review (10 min)
   - Incidents this week: count, severity, duration
   - Open postmortem action items: status check

3. On-Call Health (5 min)
   - Pages this week (total, off-hours, false positives)
   - Any on-call feedback?

4. Reliability Work (10 min)
   - Automation shipped this week
   - Toil reduced (hours saved)
   - Chaos experiments run
   - Capacity concerns

Monthly Reliability Report

monthly_report:
  period: ""

  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""

  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0

  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0

  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0

  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0

  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0

  highlights: []
  concerns: []
  next_month_priorities: []

Production Readiness Review Checklist

Before any new service goes to production:

Category	Check	Status
SLOs	SLIs defined and measured
SLOs	SLO targets set with stakeholder agreement
SLOs	Error budget policy documented
Monitoring	Golden signals dashboarded
Monitoring	Alerting configured with runbooks
Monitoring	Structured logging implemented
Monitoring	Distributed tracing enabled
Incidents	On-call rotation established
Incidents	Escalation paths documented
Incidents	Runbooks for top 5 failure modes
Capacity	Load tested to 2x expected peak
Capacity	Auto-scaling configured and tested
Capacity	Resource limits set (CPU, memory)
Resilience	Graceful degradation implemented
Resilience	Circuit breakers for dependencies
Resilience	Retry with exponential backoff
Resilience	Timeout configured for all external calls
Deploy	Rollback tested and documented
Deploy	Canary/blue-green deployment ready
Deploy	Feature flags for risky features
Security	Authentication and authorization
Security	Secrets in vault (not env vars)
Security	Dependencies scanned
Data	Backup and restore tested
Data	Data retention policy defined
Docs	Architecture diagram current
Docs	API documentation published
Docs	Operational runbook complete

Phase 12: Advanced Patterns

Self-Healing Automation

auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"

  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"

  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"

  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"

Multi-Region Reliability

Strategy	Complexity	RTO	Cost
Active-passive	Low	Minutes	1.5x
Active-active read	Medium	Seconds	1.8x
Active-active full	High	Near-zero	2-3x
Cell-based	Very high	Per-cell	2-4x

Decision guide: - SLO < 99.9% → Single region with good backups - SLO 99.9-99.95% → Active-passive with automated failover - SLO > 99.95% → Active-active (read or full) - SLO > 99.99% → Cell-based architecture

Reliability Culture Indicators

Healthy signals: - Postmortems are blameless and well-attended - Error budgets are respected (feature freeze actually happens) - On-call is shared fairly and compensated - Toil is tracked and reducing quarter-over-quarter - Chaos experiments happen regularly - Teams own their reliability (not just SRE)

Warning signs: - "Hero culture" — same person always saves the day - Postmortems are blame-focused or skipped - Error budget exhaustion doesn't change behavior - On-call is dreaded, same 2 people always paged - "We'll fix reliability after this feature ships" (always) - SRE team is just an ops team with a new name

Quality Scoring Rubric (0-100)

Dimension	Weight	0-2	3-4	5
SLO Coverage	20%	No SLOs	SLOs for critical services	All services with SLOs, error budgets, reviews
Monitoring	15%	Basic health checks	Golden signals + dashboards	Full observability stack + anomaly detection
Incident Response	15%	Ad-hoc, no process	ICS roles, runbooks, postmortems	Structured ICS, blameless culture, action tracking
Automation	15%	Manual everything	CI/CD + some automation	Self-healing, GitOps, <25% toil
Chaos Engineering	10%	None	Staging experiments	Continuous production chaos with safety
Capacity Planning	10%	Reactive	Quarterly forecasting	Predictive, auto-scaling, cost-optimized
On-Call Health	10%	Burnout, hero culture	Fair rotation, <5 pages/shift	Balanced, compensated, <2 pages/shift
Documentation	5%	Nothing written	Runbooks exist	Complete, current, tested runbooks

Natural Language Commands

"Assess reliability for [service]" → Run maturity assessment
"Define SLOs for [service]" → Walk through SLI selection + SLO setting
"Check error budget for [service]" → Calculate current budget status
"Start incident for [description]" → Create incident channel, assign IC, begin workflow
"Write postmortem for [incident]" → Generate structured postmortem
"Plan chaos experiment for [service]" → Design experiment with hypothesis
"Audit toil for [team]" → Inventory and prioritize toil
"Review on-call health" → Analyze page volume, satisfaction, fairness
"Production readiness review for [service]" → Run full checklist
"Monthly reliability report" → Generate comprehensive report
"Design runbook for [alert]" → Create structured runbook
"Plan capacity for [service] growing at [X%]" → Build capacity model