Incident Management: On-Call Culture, Blameless Postmortems, and Runbooks
Build a mature incident management process: on-call rotation design, incident severity levels, incident commander roles, blameless postmortem templates, and runbooks.
Every engineering team will have production incidents. The difference between high-performing and average teams isn't that high-performing teams have fewer incidents — it's that they recover faster, learn more from them, and systematically prevent recurrence.
This guide covers the process and culture that makes incident management a competitive advantage rather than a recurring crisis.
Incident Severity Levels
Consistent severity definitions let everyone understand impact and set response expectations:
| Severity | Definition | Response SLA | Example |
|---|---|---|---|
| P0 — Critical | Full outage; all customers affected; revenue impact | Immediate; all hands | API returning 500 for all requests |
| P1 — High | Major feature broken; significant customer impact | 15 min | Checkout flow failing for 30% of users |
| P2 — Medium | Degraded performance or partial feature loss | 1 hour | Dashboard loading in 8s instead of 1s |
| P3 — Low | Minor issue; workaround available | 1 business day | Export button broken; users can still download via API |
| P4 — Informational | No customer impact; worth tracking | Next sprint | Internal tool showing stale data |
Define these once. Put them in your engineering handbook. Make sure on-call engineers can apply them consistently without debate at 2am.
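One way to keep severity handling consistent at 2am is to encode the definitions in tooling rather than prose. A minimal Python sketch of the table above (the `Severity` enum and `should_page` helper are illustrative names, not part of any real library):

```python
from enum import Enum

class Severity(Enum):
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"
    P4 = "Informational"

# Response SLAs from the table above, in minutes; None means no paging SLA.
RESPONSE_SLA_MINUTES = {
    Severity.P0: 0,       # immediate, all hands
    Severity.P1: 15,
    Severity.P2: 60,
    Severity.P3: 8 * 60,  # one business day
    Severity.P4: None,    # next sprint
}

def should_page(severity: Severity) -> bool:
    """Page a human only for severities with a minutes-level response SLA."""
    sla = RESPONSE_SLA_MINUTES[severity]
    return sla is not None and sla <= 60
```

Wiring a table like this into alert routing means a P3 never wakes anyone up, regardless of who filed it.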
On-Call Rotation Design
What makes on-call sustainable:
- Maximum 1 week on-call per month per engineer
- Incidents during nights/weekends earn compensatory time off
- Clear escalation path (primary → secondary → engineering lead)
- New engineers shadow for 4 weeks before going primary
- On-call budget: time to fix what you're paged for, including root cause
What kills on-call culture:
- Alert fatigue (> 3 alerts per shift that aren't actionable)
- No escalation path (you're on your own at 3am)
- No time to fix problems encountered during on-call (technical debt compounds)
- No postmortems (same incidents recur)
# PagerDuty schedule via Terraform
resource "pagerduty_schedule" "engineering_primary" {
  name      = "Engineering Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-06-01T00:00:00-04:00" # EDT (UTC-4) in June
    rotation_turn_length_seconds = 604800 # 1 week
    rotation_virtual_start       = "2026-06-01T00:00:00-04:00"
    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.david.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "engineering" {
  name = "Engineering Escalation"

  # Primary has 10 minutes to acknowledge before the secondary is paged
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_lead.id
    }
  }
}
Incident Response Process
During an incident, clarity of roles prevents chaotic pile-ons where everyone is debugging simultaneously but no one is coordinating:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response; communicates status; makes go/no-go decisions |
| Technical Lead | Diagnoses root cause; implements fixes |
| Comms Lead | Updates status page; notifies customers; keeps Slack updated |
| Scribe | Documents timeline of actions in incident channel |
For small teams: IC + Technical Lead can be the same person for P2/P3. Always separate for P0/P1.
Incident Slack channel naming convention:
#inc-2026-0527-api-checkout-500
^year ^date ^system ^symptom
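A convention like this only stays consistent if channel creation is automated. A small sketch (the function name and slug rules are ours, not from any Slack SDK):

```python
import re
from datetime import date

def incident_channel_name(day: date, system: str, symptom: str) -> str:
    """Build a channel name like #inc-2026-0527-api-checkout-500."""
    # Lowercase and collapse anything that isn't a-z, 0-9, or '-' into '-'
    slug = lambda s: re.sub(r"[^a-z0-9-]+", "-", s.lower()).strip("-")
    return f"#inc-{day.year}-{day:%m%d}-{slug(system)}-{slug(symptom)}"
```

For example, `incident_channel_name(date(2026, 5, 27), "api", "checkout 500")` produces `#inc-2026-0527-api-checkout-500`.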
Runbook Template
Runbooks are pre-written response guides for known failure modes. The time to write a runbook is not during an incident.
# Runbook: High API Error Rate (5xx)
**Last Updated:** 2026-05-20 | **Owner:** @alice | **Review Cycle:** Quarterly
When to Use This Runbook
Alert: api_error_rate_5xx > 1% for 5 minutes
Or: Customer reports widespread errors
Severity Assessment
- > 5% error rate, P0/P1 users affected → P0
- 1-5% error rate, some users affected → P1
- < 1% error rate, specific endpoints → P2
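Thresholds like these can be codified so the assessment is mechanical rather than a judgment call under pressure. A sketch (function name is ours; rates are percentages of total requests):

```python
def assess_severity(error_rate_pct: float) -> str:
    """Map the current 5xx error rate (percent) to this runbook's severity."""
    if error_rate_pct > 5:
        return "P0"  # widespread failure
    if error_rate_pct >= 1:
        return "P1"  # significant subset of users affected
    return "P2"      # specific endpoints, limited impact
```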
Step 1: Confirm the Alert (2 min)
# Check current error rate
curl -s "https://api.datadoghq.com/api/v1/query?query=sum:api.errors.5xx%7B*%7D.as_rate()" \
-H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'
# Check which endpoints are erroring
open "https://app.datadoghq.com/apm/services/api"
# Filter by: error rate, last 15 minutes
Step 2: Check Recent Deployments (2 min)
# Recent ECS deploys
aws ecs describe-services --cluster production --services api \
--query 'services[0].deployments[*].{status:status,taskDef:taskDefinition,updated:updatedAt}'
# Recent GitHub merges
gh pr list --state merged --limit 5 --json title,mergedAt,author
If a deploy happened in the last 30 minutes: roll back first, investigate later.
Step 3: Rollback (if deploy is suspected cause) (5 min)
# Roll back to the previous task definition revision.
# Note: `deployments[]` only lists in-flight deployments, so once a bad
# deploy has completed it is the only entry. Derive the previous revision
# from the current task definition ARN instead.
CURRENT_TASK_DEF=$(aws ecs describe-services \
  --cluster production --services api \
  --query 'services[0].taskDefinition' --output text)
FAMILY="${CURRENT_TASK_DEF%:*}"        # ARN minus the trailing :<revision>
REVISION="${CURRENT_TASK_DEF##*:}"
aws ecs update-service \
  --cluster production \
  --service api \
  --task-definition "${FAMILY}:$((REVISION - 1))" \
  --force-new-deployment
Step 4: Check Database (5 min)
# Check for long-running queries
psql $DATABASE_URL -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
ORDER BY duration DESC;"
# Check connection count
psql $DATABASE_URL -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# If connection count > 80% of max_connections: restart PgBouncer
ssh bastion "sudo systemctl restart pgbouncer"
Step 5: Check External Dependencies (5 min)
- Stripe status: https://status.stripe.com
- AWS status: https://health.aws.amazon.com
- Redis:
redis-cli -u $REDIS_URL ping
Escalation
- P0: Page engineering lead immediately
- If database issue: page DBA on-call
- If Stripe issue: customer-facing comms + monitor their status page
Communication Templates
Status Page Update (initial):
We are investigating reports of errors affecting [feature]. Our team is actively working on a resolution. Updates every 10 minutes.
Status Page Update (resolved):
The issue affecting [feature] has been resolved. [X]% of requests were affected between [time] and [time]. A postmortem will be published within 48 hours.
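Pre-written templates are fastest when the bracketed fields can be filled programmatically. A sketch using plain `str.format` placeholders mirroring the fields above (constant names are ours):

```python
INITIAL = ("We are investigating reports of errors affecting {feature}. "
           "Our team is actively working on a resolution. Updates every 10 minutes.")

RESOLVED = ("The issue affecting {feature} has been resolved. {pct}% of requests "
            "were affected between {start} and {end}. A postmortem will be "
            "published within 48 hours.")

def status_update(template: str, **fields: str) -> str:
    """Fill a status-page template; raises KeyError if a field is missing."""
    return template.format(**fields)
```

For example, `status_update(RESOLVED, feature="checkout", pct="30", start="14:03 UTC", end="14:50 UTC")` yields a complete resolved-incident notice; the KeyError on a missing field prevents publishing an update with an unfilled `[time]` blank.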
Blameless Postmortem Template
Blameless means: no individual is blamed for the incident. Systems failed; we improve systems. This is not naive — it's what makes engineers willing to be honest about mistakes.
# Postmortem: Checkout Outage — 2026-05-15
**Severity:** P1 | **Duration:** 47 minutes (14:03–14:50 UTC)
**Impact:** ~30% of checkout attempts failed | **Revenue Impact:** ~$8,400
**Incident Commander:** @alice | **Author:** @bob | **Reviewed by:** @carol
Summary
A database migration that added an index to the orders table without CONCURRENTLY locked the table for 4 minutes. During that time, checkout requests timed out and returned 500 errors.
Timeline
| Time (UTC) | Event |
|---|---|
| 14:00 | Deploy v2.4.1 started — included migration |
| 14:03 | Migration began; CREATE INDEX acquired exclusive lock |
| 14:03 | Checkout error rate spike to 35% |
| 14:04 | PagerDuty alert fired (2-minute window) |
| 14:06 | @alice acknowledged; began investigation |
| 14:12 | Identified migration as cause (table lock visible in pg_stat_activity) |
| 14:15 | Migration completed; lock released |
| 14:15 | Error rate began recovering |
| 14:20 | Error rate back to baseline |
| 14:50 | Incident declared resolved (monitoring continued 30 min) |
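A timeline like this makes response metrics trivial to derive. A sketch computing time-to-acknowledge and time-to-resolve from this incident's timestamps (hardcoded here for illustration):

```python
from datetime import datetime

fmt = "%H:%M"
started  = datetime.strptime("14:03", fmt)  # error rate spiked
acked    = datetime.strptime("14:06", fmt)  # on-call acknowledged the page
resolved = datetime.strptime("14:50", fmt)  # incident declared resolved

mtta_min = (acked - started).total_seconds() / 60
mttr_min = (resolved - started).total_seconds() / 60
print(f"time to acknowledge: {mtta_min:.0f} min")  # 3 min
print(f"time to resolve: {mttr_min:.0f} min")      # 47 min
```

Tracking these per incident is how MTTR trends (a DORA metric) stop being a guess.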
Root Cause
CREATE INDEX orders_user_id_idx ON orders(user_id); without CONCURRENTLY acquired an AccessExclusiveLock. The orders table has 8M rows; the index took 4 minutes to build, during which all DML on the table was blocked.
Why It Wasn't Caught
- Migration was reviewed but reviewer was not familiar with PostgreSQL lock implications
- No automated check in CI for blocking DDL statements
- The staging database had only 10K rows; the migration ran in < 1 second
Action Items
| Action | Owner | Due |
|---|---|---|
| Add CONCURRENTLY to migration (and rerun correctly) | @bob | Done |
| Add CI check: fail if migration contains CREATE INDEX without CONCURRENTLY | @carol | 2026-05-22 |
| Add migration review checklist to PR template | @alice | 2026-05-22 |
| Staging database: add data volume similar to production (subset) | @david | 2026-06-01 |
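The CI check from the action items can be as simple as a regex scan over migration files. A minimal sketch (the regex is an assumption and may need tuning for your migration tool's SQL dialect, e.g. `IF NOT EXISTS` placement):

```python
import re

# Matches CREATE [UNIQUE] INDEX that is NOT immediately followed by CONCURRENTLY
BLOCKING_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
    re.IGNORECASE,
)

def check_migration(sql: str) -> list[str]:
    """Return the offending fragments: CREATE INDEX without CONCURRENTLY."""
    return [m.group(0) for m in BLOCKING_INDEX.finditer(sql)]
```

In CI, run this over every file in the migration directory and fail the build if any call returns a non-empty list.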
What Went Well
- Alert fired within 2 minutes of incident start
- Root cause identified quickly (pg_stat_activity clearly showed the lock)
- Communication to customers was clear and timely
- Escalation was smooth; IC role was clear
What Didn't Go Well
- Migration review process didn't catch the locking issue
- Staging environment didn't reflect production data volume
- MTTR could have been shorter if rollback procedure was attempted earlier
Lessons
We did not know the CONCURRENTLY keyword was required for production-safe index creation. This knowledge is now documented in our migration runbook, and the CI check ensures it will be caught automatically in future.
Reducing Alert Fatigue
If your on-call engineers are paged more than five times per week, or if pages routinely require no action, alert quality is poor. Audit your alerts:
# scripts/alert_audit.py — analyze PagerDuty alert patterns
import os
from collections import Counter

import pdpyras

# APISession takes the API token as its first positional argument
session = pdpyras.APISession(os.environ["PAGERDUTY_API_KEY"])

incidents = session.list_all('incidents', params={
    'since': '2026-04-01T00:00:00Z',
    'until': '2026-05-01T00:00:00Z',
    'statuses[]': ['resolved'],
})

alert_titles = Counter(inc['title'] for inc in incidents)

print("Top 10 most frequent alerts (candidates for tuning):")
for title, count in alert_titles.most_common(10):
    print(f"  {count}x — {title}")

# Any alert firing > 5 times in a month without leading to a P0/P1:
# raise the threshold, add more context, or delete it if not actionable
Alert quality criteria:
- Every alert is actionable (there's something to do when it fires)
- Every alert has a runbook linked
- False positive rate < 10% (9 out of 10 alerts represent real problems)
- Alert fatigue score: < 3 wakeups per on-call shift
Working With Viprasol
We help engineering teams build incident management processes — on-call rotation design, runbook libraries, postmortem templates, alert tuning, and PagerDuty/OpsGenie configuration. A mature incident process is a product reliability investment.
→ Talk to our team about reliability and incident management.
See Also
- Observability and Monitoring — the alerting foundation
- DevOps Best Practices — CI/CD and deployment practices that prevent incidents
- Engineering Metrics — MTTR as a DORA metric
- Database Migrations — preventing the common class of migration incidents
- Cloud Solutions — reliability and SRE engineering
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.