Incident Management: On-Call Culture, Blameless Postmortems, and Runbooks
Build a mature incident management process: on-call rotation design, incident severity levels, incident commander roles, blameless postmortem templates, and runbooks.
Every engineering team will have production incidents. The difference between high-performing and average teams isn't that high-performing teams have fewer incidents — it's that they recover faster, learn more from them, and systematically prevent recurrence.
This guide covers the process and culture that makes incident management a competitive advantage rather than a recurring crisis.
Incident Severity Levels
Consistent severity definitions let everyone understand impact and set response expectations:
| Severity | Definition | Response SLA | Example |
|---|---|---|---|
| P0 — Critical | Full outage; all customers affected; revenue impact | Immediate; all hands | API returning 500 for all requests |
| P1 — High | Major feature broken; significant customer impact | 15 min | Checkout flow failing for 30% of users |
| P2 — Medium | Degraded performance or partial feature loss | 1 hour | Dashboard loading in 8s instead of 1s |
| P3 — Low | Minor issue; workaround available | 1 business day | Export button broken; users can still download via API |
| P4 — Informational | No customer impact; worth tracking | Next sprint | Internal tool showing stale data |
Define these once. Put them in your engineering handbook. Make sure on-call engineers can apply them consistently without debate at 2am.
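One way to keep severity handling consistent at 2am is to encode the definitions in tooling rather than prose. A minimal Python sketch of the table above (the `Severity` enum and `should_page` helper are illustrative names, not part of any real library):

```python
from enum import Enum

class Severity(Enum):
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"
    P4 = "Informational"

# Response SLAs from the table above, in minutes; None means no paging SLA.
RESPONSE_SLA_MINUTES = {
    Severity.P0: 0,       # immediate, all hands
    Severity.P1: 15,
    Severity.P2: 60,
    Severity.P3: 8 * 60,  # one business day
    Severity.P4: None,    # next sprint
}

def should_page(severity: Severity) -> bool:
    """Page a human only for severities with a minutes-level response SLA."""
    sla = RESPONSE_SLA_MINUTES[severity]
    return sla is not None and sla <= 60
```

Wiring a table like this into alert routing means a P3 never wakes anyone up, regardless of who filed it.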
On-Call Rotation Design
What makes on-call sustainable:
- Maximum 1 week on-call per month per engineer
- Incidents during nights/weekends earn compensatory time off
- Clear escalation path (primary → secondary → engineering lead)
- New engineers shadow for 4 weeks before going primary
- On-call budget: time to fix what you're paged for, including root cause
What kills on-call culture:
- Alert fatigue (> 3 alerts per shift that aren't actionable)
- No escalation path (you're on your own at 3am)
- No time to fix problems encountered during on-call (technical debt compounds)
- No postmortems (same incidents recur)
# PagerDuty schedule via Terraform
resource "pagerduty_schedule" "engineering_primary" {
  name      = "Engineering Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-06-01T00:00:00-04:00" # EDT (UTC-4) in June
    rotation_turn_length_seconds = 604800 # 1 week
    rotation_virtual_start       = "2026-06-01T00:00:00-04:00"
    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.david.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "engineering" {
  name = "Engineering Escalation"

  # Primary has 10 minutes to acknowledge before the secondary is paged
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_lead.id
    }
  }
}
Incident Response Process
During an incident, clarity of roles prevents chaotic pile-ons where everyone is debugging simultaneously but no one is coordinating:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response; communicates status; makes go/no-go decisions |
| Technical Lead | Diagnoses root cause; implements fixes |
| Comms Lead | Updates status page; notifies customers; keeps Slack updated |
| Scribe | Documents timeline of actions in incident channel |
For small teams: IC + Technical Lead can be the same person for P2/P3. Always separate for P0/P1.
Incident Slack channel naming convention:
#inc-2026-0527-api-checkout-500
^year ^date ^system ^symptom
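A convention like this only stays consistent if channel creation is automated. A small sketch (the function name and slug rules are ours, not from any Slack SDK):

```python
import re
from datetime import date

def incident_channel_name(day: date, system: str, symptom: str) -> str:
    """Build a channel name like #inc-2026-0527-api-checkout-500."""
    # Lowercase and collapse anything that isn't a-z, 0-9, or '-' into '-'
    slug = lambda s: re.sub(r"[^a-z0-9-]+", "-", s.lower()).strip("-")
    return f"#inc-{day.year}-{day:%m%d}-{slug(system)}-{slug(symptom)}"
```

For example, `incident_channel_name(date(2026, 5, 27), "api", "checkout 500")` produces `#inc-2026-0527-api-checkout-500`.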
Runbook Template
Runbooks are pre-written response guides for known failure modes. The time to write a runbook is not during an incident.
# Runbook: High API Error Rate (5xx)
**Last Updated:** 2026-05-20 | **Owner:** @alice | **Review Cycle:** Quarterly
When to Use This Runbook
Alert: api_error_rate_5xx > 1% for 5 minutes
Or: Customer reports widespread errors
Severity Assessment
- > 5% error rate, P0/P1 users affected → P0
- 1-5% error rate, some users affected → P1
- < 1% error rate, specific endpoints → P2
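Thresholds like these can be codified so the assessment is mechanical rather than a judgment call under pressure. A sketch (function name is ours; rates are percentages of total requests):

```python
def assess_severity(error_rate_pct: float) -> str:
    """Map the current 5xx error rate (percent) to this runbook's severity."""
    if error_rate_pct > 5:
        return "P0"  # widespread failure
    if error_rate_pct >= 1:
        return "P1"  # significant subset of users affected
    return "P2"      # specific endpoints, limited impact
```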
Step 1: Confirm the Alert (2 min)
# Check current error rate
curl -s "https://api.datadoghq.com/api/v1/query?query=sum:api.errors.5xx%7B*%7D.as_rate()" \
-H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'
# Check which endpoints are erroring
open "https://app.datadoghq.com/apm/services/api"
# Filter by: error rate, last 15 minutes
Step 2: Check Recent Deployments (2 min)
# Recent ECS deploys
aws ecs describe-services --cluster production --services api \
--query 'services[0].deployments[*].{status:status,taskDef:taskDefinition,updated:updatedAt}'
# Recent GitHub merges
gh pr list --state merged --limit 5 --json title,mergedAt,author
If a deploy happened in the last 30 minutes: roll back first, investigate later.
Step 3: Rollback (if deploy is suspected cause) (5 min)
# Roll back to the previous task definition revision.
# Note: `deployments[]` only lists in-flight deployments, so once a bad
# deploy has completed it is the only entry. Derive the previous revision
# from the current task definition ARN instead.
CURRENT_TASK_DEF=$(aws ecs describe-services \
  --cluster production --services api \
  --query 'services[0].taskDefinition' --output text)
FAMILY="${CURRENT_TASK_DEF%:*}"        # ARN minus the trailing :<revision>
REVISION="${CURRENT_TASK_DEF##*:}"
aws ecs update-service \
  --cluster production \
  --service api \
  --task-definition "${FAMILY}:$((REVISION - 1))" \
  --force-new-deployment
Step 4: Check Database (5 min)
# Check for long-running queries
psql $DATABASE_URL -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
ORDER BY duration DESC;"
# Check connection count
psql $DATABASE_URL -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# If connection count > 80% of max_connections: restart PgBouncer
ssh bastion "sudo systemctl restart pgbouncer"
Step 5: Check External Dependencies (5 min)
- Stripe status: https://status.stripe.com
- AWS status: https://health.aws.amazon.com
- Redis:
redis-cli -u $REDIS_URL ping
Escalation
- P0: Page engineering lead immediately
- If database issue: page DBA on-call
- If Stripe issue: customer-facing comms + monitor their status page
Communication Templates
Status Page Update (initial):
We are investigating reports of errors affecting [feature]. Our team is actively working on a resolution. Updates every 10 minutes.
Status Page Update (resolved):
The issue affecting [feature] has been resolved. [X]% of requests were affected between [time] and [time]. A postmortem will be published within 48 hours.
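Pre-written templates are fastest when the bracketed fields can be filled programmatically. A sketch using plain `str.format` placeholders mirroring the fields above (constant names are ours):

```python
INITIAL = ("We are investigating reports of errors affecting {feature}. "
           "Our team is actively working on a resolution. Updates every 10 minutes.")

RESOLVED = ("The issue affecting {feature} has been resolved. {pct}% of requests "
            "were affected between {start} and {end}. A postmortem will be "
            "published within 48 hours.")

def status_update(template: str, **fields: str) -> str:
    """Fill a status-page template; raises KeyError if a field is missing."""
    return template.format(**fields)
```

For example, `status_update(RESOLVED, feature="checkout", pct="30", start="14:03 UTC", end="14:50 UTC")` yields a complete resolved-incident notice; the KeyError on a missing field prevents publishing an update with an unfilled `[time]` blank.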
Blameless Postmortem Template
Blameless means: no individual is blamed for the incident. Systems failed; we improve systems. This is not naive — it's what makes engineers willing to be honest about mistakes.
# Postmortem: Checkout Outage — 2026-05-15
**Severity:** P1 | **Duration:** 47 minutes (14:03–14:50 UTC)
**Impact:** ~30% of checkout attempts failed | **Revenue Impact:** ~$8,400
**Incident Commander:** @alice | **Author:** @bob | **Reviewed by:** @carol
Summary
A database migration that added an index to the orders table without CONCURRENTLY locked the table for 4 minutes. During that time, checkout requests timed out and returned 500 errors.
Timeline
| Time (UTC) | Event |
|---|---|
| 14:00 | Deploy v2.4.1 started — included migration |
| 14:03 | Migration began; CREATE INDEX acquired exclusive lock |
| 14:03 | Checkout error rate spike to 35% |
| 14:04 | PagerDuty alert fired (2-minute window) |
| 14:06 | @alice acknowledged; began investigation |
| 14:12 | Identified migration as cause (table lock visible in pg_stat_activity) |
| 14:15 | Migration completed; lock released |
| 14:15 | Error rate began recovering |
| 14:20 | Error rate back to baseline |
| 14:50 | Incident declared resolved (monitoring continued 30 min) |
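A timeline like this makes response metrics trivial to derive. A sketch computing time-to-acknowledge and time-to-resolve from this incident's timestamps (hardcoded here for illustration):

```python
from datetime import datetime

fmt = "%H:%M"
started  = datetime.strptime("14:03", fmt)  # error rate spiked
acked    = datetime.strptime("14:06", fmt)  # on-call acknowledged the page
resolved = datetime.strptime("14:50", fmt)  # incident declared resolved

mtta_min = (acked - started).total_seconds() / 60
mttr_min = (resolved - started).total_seconds() / 60
print(f"time to acknowledge: {mtta_min:.0f} min")  # 3 min
print(f"time to resolve: {mttr_min:.0f} min")      # 47 min
```

Tracking these per incident is how MTTR trends (a DORA metric) stop being a guess.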
Root Cause
CREATE INDEX orders_user_id_idx ON orders(user_id); without CONCURRENTLY acquired an AccessExclusiveLock. The orders table has 8M rows; the index took 4 minutes to build, during which all DML on the table was blocked.
Why It Wasn't Caught
- Migration was reviewed but reviewer was not familiar with PostgreSQL lock implications
- No automated check in CI for blocking DDL statements
- The staging database had only 10K rows; the migration ran in < 1 second
Action Items
| Action | Owner | Due |
|---|---|---|
| Add CONCURRENTLY to migration (and rerun correctly) | @bob | Done |
| Add CI check: fail if migration contains CREATE INDEX without CONCURRENTLY | @carol | 2026-05-22 |
| Add migration review checklist to PR template | @alice | 2026-05-22 |
| Staging database: add data volume similar to production (subset) | @david | 2026-06-01 |
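The CI check from the action items can be as simple as a regex scan over migration files. A minimal sketch (the regex is an assumption and may need tuning for your migration tool's SQL dialect, e.g. `IF NOT EXISTS` placement):

```python
import re

# Matches CREATE [UNIQUE] INDEX that is NOT immediately followed by CONCURRENTLY
BLOCKING_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
    re.IGNORECASE,
)

def check_migration(sql: str) -> list[str]:
    """Return the offending fragments: CREATE INDEX without CONCURRENTLY."""
    return [m.group(0) for m in BLOCKING_INDEX.finditer(sql)]
```

In CI, run this over every file in the migration directory and fail the build if any call returns a non-empty list.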
What Went Well
- Alert fired within 2 minutes of incident start
- Root cause identified quickly (pg_stat_activity clearly showed the lock)
- Communication to customers was clear and timely
- Escalation was smooth; IC role was clear
What Didn't Go Well
- Migration review process didn't catch the locking issue
- Staging environment didn't reflect production data volume
- MTTR could have been shorter if rollback procedure was attempted earlier
Lessons
We did not know the CONCURRENTLY keyword was required for production-safe index creation. This knowledge is now documented in our migration runbook, and the CI check ensures it will be caught automatically in future.
Reducing Alert Fatigue
If your on-call engineers are paged more than five times per week, or if pages routinely require no action, alert quality is poor. Audit your alerts:
# scripts/alert_audit.py — analyze PagerDuty alert patterns
import os
from collections import Counter

import pdpyras

# APISession takes the API token as its first positional argument
session = pdpyras.APISession(os.environ["PAGERDUTY_API_KEY"])

incidents = session.list_all('incidents', params={
    'since': '2026-04-01T00:00:00Z',
    'until': '2026-05-01T00:00:00Z',
    'statuses[]': ['resolved'],
})

alert_titles = Counter(inc['title'] for inc in incidents)

print("Top 10 most frequent alerts (candidates for tuning):")
for title, count in alert_titles.most_common(10):
    print(f"  {count}x — {title}")

# Any alert firing > 5 times in a month without leading to a P0/P1:
# raise the threshold, add more context, or delete it if not actionable
Alert quality criteria:
- Every alert is actionable (there's something to do when it fires)
- Every alert has a runbook linked
- False positive rate < 10% (9 out of 10 alerts represent real problems)
- Alert fatigue score: < 3 wakeups per on-call shift
Working With Viprasol
We help engineering teams build incident management processes — on-call rotation design, runbook libraries, postmortem templates, alert tuning, and PagerDuty/OpsGenie configuration. A mature incident process is a product reliability investment.
→ Talk to our team about reliability and incident management.
See Also
- Observability and Monitoring — the alerting foundation
- DevOps Best Practices — CI/CD and deployment practices that prevent incidents
- Engineering Metrics — MTTR as a DORA metric
- Database Migrations — preventing the common class of migration incidents
- Cloud Solutions — reliability and SRE engineering
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.