
Incident Management: On-Call Culture, Blameless Postmortems, and Runbooks

Build a mature incident management process — on-call rotation design, incident severity levels, incident commander roles, blameless postmortem templates, runbooks.

Viprasol Tech Team
May 27, 2026
12 min read


Every engineering team will have production incidents. The difference between high-performing and average teams isn't that high-performing teams have fewer incidents — it's that they recover faster, learn more from them, and systematically prevent recurrence.

This guide covers the process and culture that makes incident management a competitive advantage rather than a recurring crisis.


Incident Severity Levels

Consistent severity definitions let everyone understand impact and set response expectations:

| Severity | Definition | Response SLA | Example |
| --- | --- | --- | --- |
| P0 — Critical | Full outage; all customers affected; revenue impact | Immediate; all hands | API returning 500 for all requests |
| P1 — High | Major feature broken; significant customer impact | 15 min | Checkout flow failing for 30% of users |
| P2 — Medium | Degraded performance or partial feature loss | 1 hour | Dashboard loading in 8s instead of 1s |
| P3 — Low | Minor issue; workaround available | 1 business day | Export button broken; users can still download via API |
| P4 — Informational | No customer impact; worth tracking | Next sprint | Internal tool showing stale data |

Define these once. Put them in your engineering handbook. Make sure on-call engineers can apply them consistently without debate at 2am.
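Severity definitions are easiest to apply consistently when tooling can read them too. A minimal Python sketch of codifying the table above (the names and structure are illustrative, not from any particular incident tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    definition: str
    response_sla: str

# Hypothetical codification of the severity table; adapt to your handbook.
SEVERITIES = {
    "P0": Severity("P0", "Full outage; all customers affected", "immediate"),
    "P1": Severity("P1", "Major feature broken", "15 minutes"),
    "P2": Severity("P2", "Degraded performance or partial feature loss", "1 hour"),
    "P3": Severity("P3", "Minor issue; workaround available", "1 business day"),
    "P4": Severity("P4", "No customer impact; worth tracking", "next sprint"),
}

def response_sla(level: str) -> str:
    """Look up the response SLA for a severity level, e.g. from a chatops command."""
    return SEVERITIES[level].response_sla

print(response_sla("P0"))  # → immediate
```

Exposing this through a bot command gives on-call engineers a single source of truth instead of a debate at 2am.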


On-Call Rotation Design

What makes on-call sustainable:

  • Maximum 1 week on-call per month per engineer
  • Incidents during nights/weekends earn compensatory time off
  • Clear escalation path (primary → secondary → engineering lead)
  • New engineers shadow for 4 weeks before going primary
  • On-call budget: time to fix what you're paged for, including root cause

What kills on-call culture:

  • Alert fatigue (> 3 alerts per shift that aren't actionable)
  • No escalation path (you're on your own at 3am)
  • No time to fix problems encountered during on-call (technical debt compounds)
  • No postmortems (same incidents recur)

```hcl
# PagerDuty on-call schedule and escalation policy managed via Terraform
resource "pagerduty_schedule" "engineering_primary" {
  name      = "Engineering Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-06-01T00:00:00-04:00" # EDT offset in June
    rotation_turn_length_seconds = 604800 # 1 week
    rotation_virtual_start       = "2026-06-01T00:00:00-04:00"

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.david.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "engineering" {
  name = "Engineering Escalation"

  # Primary → secondary → engineering lead, 10 minutes per hop
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_lead.id
    }
  }
}
```


Incident Response Process

During an incident, clarity of roles prevents chaotic pile-ons where everyone is debugging simultaneously but no one is coordinating:

| Role | Responsibility |
| --- | --- |
| Incident Commander (IC) | Coordinates response; communicates status; makes go/no-go decisions |
| Technical Lead | Diagnoses root cause; implements fixes |
| Comms Lead | Updates status page; notifies customers; keeps Slack updated |
| Scribe | Documents timeline of actions in incident channel |

For small teams: IC + Technical Lead can be the same person for P2/P3. Always separate for P0/P1.

Incident Slack channel naming convention:

#inc-<year>-<mmdd>-<system>-<symptom>

Example: #inc-2026-0527-api-checkout-500
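The convention is easiest to follow if the channel name is generated when the incident is opened rather than typed by hand. A small sketch (the helper name is illustrative):

```python
from datetime import date

def incident_channel_name(system: str, symptom: str, opened: date) -> str:
    """Build a Slack channel name like #inc-2026-0527-api-checkout-500."""
    return f"#inc-{opened:%Y-%m%d}-{system}-{symptom}"

print(incident_channel_name("api", "checkout-500", date(2026, 5, 27)))
# → #inc-2026-0527-api-checkout-500
```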

Runbook Template

Runbooks are pre-written response guides for known failure modes. The time to write a runbook is not during an incident.

# Runbook: High API Error Rate (5xx)

**Last Updated:** 2026-05-20 | **Owner:** @alice | **Review Cycle:** Quarterly


When to Use This Runbook

  • Alert: api_error_rate_5xx > 1% for 5 minutes
  • Or: customer reports of widespread errors

Severity Assessment

  • > 5% error rate, all users affected → P0
  • 1-5% error rate, some users affected → P1
  • < 1% error rate, specific endpoints → P2

Step 1: Confirm the Alert (2 min)

```bash
# Check current error rate (Datadog's timeseries query API requires from/to
# epoch seconds and both an API key and an application key)
NOW=$(date +%s)
curl -s "https://api.datadoghq.com/api/v1/query?from=$((NOW - 300))&to=$NOW&query=sum:api.errors.5xx%7B*%7D.as_rate()" \
  -H "DD-API-KEY: $DATADOG_API_KEY" \
  -H "DD-APPLICATION-KEY: $DATADOG_APP_KEY" | jq '.series[0].pointlist[-1][1]'

# Check which endpoints are erroring
open "https://app.datadoghq.com/apm/services/api"
# Filter by: error rate, last 15 minutes
```

Step 2: Check Recent Deployments (2 min)

```bash
# Recent ECS deploys
aws ecs describe-services --cluster production --services api \
  --query 'services[0].deployments[*].{status:status,taskDef:taskDefinition,updated:updatedAt}'

# Recent GitHub merges
gh pr list --state merged --limit 5 --json title,mergedAt,author
```

If a deploy happened in the last 30 minutes: rollback first, investigate later.

Step 3: Rollback (if deploy is suspected cause) (5 min)

```bash
# Roll back to the previous ECS task definition.
# Note: 'deployments' lists the new PRIMARY deployment first; the previous
# deployment is only present while it is still draining. If it has already
# been removed, deploy the prior task-definition revision instead.
PREVIOUS_TASK_DEF=$(aws ecs describe-services \
  --cluster production --services api \
  --query 'services[0].deployments[-1].taskDefinition' --output text)

aws ecs update-service \
  --cluster production \
  --service api \
  --task-definition "$PREVIOUS_TASK_DEF" \
  --force-new-deployment
```

Step 4: Check Database (5 min)

```bash
# Check for long-running queries
psql "$DATABASE_URL" -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
ORDER BY duration DESC;"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# If connection count > 80% of max_connections: restart PgBouncer
ssh bastion "sudo systemctl restart pgbouncer"
```

Step 5: Check External Dependencies (5 min)
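The original runbook does not script this step. As a sketch: many providers publish an Atlassian Statuspage-style `status.json` whose `status.indicator` field summarizes health (`none`, `minor`, `major`, `critical`). The payload below is a hardcoded example; verify the real endpoint and schema for each of your dependencies:

```python
import json

# Statuspage-style payload (shape assumed; check each vendor's status API).
SAMPLE = '{"status": {"indicator": "major", "description": "Partial API outage"}}'

def dependency_degraded(status_json: str) -> bool:
    """Return True if a Statuspage-style payload reports anything worse than 'none'."""
    indicator = json.loads(status_json)["status"]["indicator"]
    return indicator != "none"  # 'minor', 'major', 'critical' all count as degraded

print(dependency_degraded(SAMPLE))  # → True
```

In practice you would fetch each dependency's status URL and page the comms lead if any report degradation.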

Escalation

  • P0: Page engineering lead immediately
  • If database issue: page DBA on-call
  • If Stripe issue: customer-facing comms + monitor their status page

Communication Templates

Status Page Update (initial):

We are investigating reports of errors affecting [feature]. Our team is actively working on a resolution. Updates every 10 minutes.

Status Page Update (resolved):

The issue affecting [feature] has been resolved. [X]% of requests were affected between [time] and [time]. A postmortem will be published within 48 hours.


---

Blameless Postmortem Template

Blameless means: no individual is blamed for the incident. Systems failed; we improve systems. This is not naive — it's what makes engineers willing to be honest about mistakes.

# Postmortem: Checkout Outage — 2026-05-15

**Severity:** P1 | **Duration:** 47 minutes (14:03–14:50 UTC)
**Impact:** ~30% of checkout attempts failed | **Revenue Impact:** ~$8,400
**Incident Commander:** @alice | **Author:** @bob | **Reviewed by:** @carol

Summary

A database migration that added an index to the orders table without CONCURRENTLY locked the table against writes for roughly 12 minutes (14:03–14:15 UTC, per the timeline below). During that time, checkout requests timed out and returned 500 errors.

Timeline

| Time (UTC) | Event |
| --- | --- |
| 14:00 | Deploy v2.4.1 started — included migration |
| 14:03 | Migration began; CREATE INDEX acquired exclusive lock |
| 14:03 | Checkout error rate spike to 35% |
| 14:04 | PagerDuty alert fired (2-minute window) |
| 14:06 | @alice acknowledged; began investigation |
| 14:12 | Identified migration as cause (table lock visible in pg_stat_activity) |
| 14:15 | Migration completed; lock released |
| 14:15 | Error rate began recovering |
| 14:20 | Error rate back to baseline |
| 14:50 | Incident declared resolved (monitoring continued 30 min) |

Root Cause

CREATE INDEX orders_user_id_idx ON orders(user_id); without CONCURRENTLY acquires a SHARE lock, which blocks all writes (INSERT/UPDATE/DELETE) for the duration of the build. The orders table has 8M rows; the index took roughly 12 minutes to build, during which all DML on the table was blocked.

Why It Wasn't Caught

  1. Migration was reviewed but reviewer was not familiar with PostgreSQL lock implications
  2. No automated check in CI for blocking DDL statements
  3. The staging database had only 10K rows; the migration ran in < 1 second

Action Items

| Action | Owner | Due |
| --- | --- | --- |
| Add CONCURRENTLY to migration (and rerun correctly) | @bob | Done |
| Add CI check: fail if migration contains CREATE INDEX without CONCURRENTLY | @carol | 2026-05-22 |
| Add migration review checklist to PR template | @alice | 2026-05-22 |
| Staging database: add data volume similar to production (subset) | @david | 2026-06-01 |
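The CI check in the action items can start as a pattern match over migration SQL. A Python sketch (the regex is a heuristic, not a full SQL parser, and the migration string here is illustrative):

```python
import re

# Flags a plain CREATE [UNIQUE] INDEX that is not followed by CONCURRENTLY.
UNSAFE_INDEX = re.compile(r"CREATE\s+(UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY)", re.IGNORECASE)

def unsafe_statements(sql: str) -> list[str]:
    """Return the offending lines so CI output points at the exact statement."""
    return [line.strip() for line in sql.splitlines() if UNSAFE_INDEX.search(line)]

print(unsafe_statements("CREATE INDEX orders_user_id_idx ON orders(user_id);"))
# → ['CREATE INDEX orders_user_id_idx ON orders(user_id);']
print(unsafe_statements("CREATE INDEX CONCURRENTLY idx ON t(c);"))  # → []
```

Run it over every new file in `migrations/` (or wherever your migrations live) and fail the build if it returns anything.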

What Went Well

  • Alert fired within 2 minutes of incident start
  • Root cause identified quickly (pg_stat_activity clearly showed the lock)
  • Communication to customers was clear and timely
  • Escalation was smooth; IC role was clear

What Didn't Go Well

  • Migration review process didn't catch the locking issue
  • Staging environment didn't reflect production data volume
  • MTTR could have been shorter if rollback procedure was attempted earlier

Lessons

We did not know the CONCURRENTLY keyword was required for production-safe index creation. This knowledge is now documented in our migration runbook, and the new CI check will catch the pattern automatically in the future.


---

Reducing Alert Fatigue

If your on-call engineers are paged more than 5 times per week, your alert quality is probably poor. Audit the alerts:

```python
# scripts/alert_audit.py — analyze PagerDuty alert patterns
import os
from collections import Counter

import pdpyras

session = pdpyras.APISession(os.environ["PAGERDUTY_API_KEY"])

incidents = session.list_all('incidents', params={
    'since': '2026-04-01T00:00:00Z',
    'until': '2026-05-01T00:00:00Z',
    'statuses[]': ['resolved'],
})

alert_titles = Counter(inc['title'] for inc in incidents)

print("Top 10 most frequent alerts (candidates for tuning):")
for title, count in alert_titles.most_common(10):
    print(f"  {count}x — {title}")

# Any alert firing > 5 times in a month without leading to a P0/P1:
# raise the threshold, add more context, or delete it if it isn't actionable.
```

Alert quality criteria:

  • Every alert is actionable (there's something to do when it fires)
  • Every alert has a runbook linked
  • False positive rate < 10% (9 out of 10 alerts represent real problems)
  • Alert fatigue score: < 3 wakeups per on-call shift
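One way to measure the false-positive criterion is to label each page during the on-call review and compute the rate. A sketch with hypothetical labels:

```python
def false_positive_rate(alerts: list[dict]) -> float:
    """Fraction of alerts that did not represent a real, actionable problem."""
    if not alerts:
        return 0.0
    false_positives = sum(1 for a in alerts if not a["actionable"])
    return false_positives / len(alerts)

# Hypothetical month of labeled alerts from an on-call review.
month = [
    {"title": "api_error_rate_5xx", "actionable": True},
    {"title": "disk_usage_warning", "actionable": False},
    {"title": "api_error_rate_5xx", "actionable": True},
    {"title": "cron_heartbeat_missed", "actionable": False},
    {"title": "checkout_latency_p99", "actionable": True},
]

print(f"False positive rate: {false_positive_rate(month):.0%}")
# → False positive rate: 40%
```

A rate above 10% means the next sprint should include alert tuning, per the criteria above.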

Working With Viprasol

We help engineering teams build incident management processes — on-call rotation design, runbook libraries, postmortem templates, alert tuning, and PagerDuty/OpsGenie configuration. A mature incident process is a product reliability investment.

Talk to our team about reliability and incident management.

