# On-Call Culture: Rotations, Alert Fatigue, Runbook Hygiene, and Blameless Retrospectives
Build a sustainable on-call culture: design fair rotation schedules, reduce alert fatigue with signal-to-noise tuning, maintain runbooks that engineers actually use, and run blameless retrospectives that produce real change.
On-call is the mechanism by which engineers bear the operational consequences of their architecture decisions. When done right, it drives better reliability because the people experiencing pain are the same people who can fix the root cause. When done wrong, it burns out your best engineers and drives turnover.
The difference between sustainable on-call and burnout comes down to four things: fair rotations, a high signal-to-noise ratio in alerts, runbooks that actually resolve incidents, and a retrospective culture that improves the system instead of assigning blame.
## On-Call Rotation Design

### Rotation Principles

**1. Minimum viable rotation size: 4 engineers**
- With 3, someone is on-call every third week.
- With 4, it's every fourth week, which is manageable.

**2. Primary + secondary coverage**
- Primary: receives the initial page
- Secondary: backup if the primary doesn't acknowledge within 5 minutes
- Never leave a primary without backup.

**3. Follow-the-sun for global teams**
- US-based primary during US hours
- EU-based primary during EU hours
- No middle-of-the-night pages for predictable incidents

**4. Shadow rotation for new engineers**
- Weeks 1-4: shadow (receive alerts, not paged)
- Weeks 5-8: secondary only
- Week 9+: eligible for primary
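The weekly hand-off these principles imply can be sketched as a small scheduling helper. This is a minimal illustration, not part of any tooling above; the roster names and anchor date are assumptions.

```typescript
// Minimal sketch: compute the primary on-call engineer for a given date
// in a weekly rotation. Roster and anchor date are illustrative.
const ROSTER = ["Priya", "Marcus", "Arjun", "Sophie"];

// Anchor: the Monday (UTC) the rotation started.
const ROTATION_START_MS = Date.UTC(2026, 9, 5); // 2026-10-05

const WEEK_MS = 7 * 24 * 3600 * 1000;

function primaryFor(date: Date): string {
  const weeksElapsed = Math.floor((date.getTime() - ROTATION_START_MS) / WEEK_MS);
  // Double-modulo keeps the index positive even for dates before the anchor
  const idx = ((weeksElapsed % ROSTER.length) + ROSTER.length) % ROSTER.length;
  return ROSTER[idx];
}
```

With a 4-person roster, each engineer is primary one week in four, and the hand-off is deterministic and auditable.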
### PagerDuty Schedule Configuration

```typescript
// src/scripts/setup-pagerduty-rotation.ts
// Configure an on-call rotation via the PagerDuty REST API

interface OnCallEngineer {
  pdUserId: string;
  name: string;
  timezone: string;
}

const ROTATION: OnCallEngineer[] = [
  { pdUserId: "P123ABC", name: "Priya", timezone: "America/New_York" },
  { pdUserId: "P456DEF", name: "Marcus", timezone: "America/Chicago" },
  { pdUserId: "P789GHI", name: "Arjun", timezone: "Asia/Kolkata" },
  { pdUserId: "P012JKL", name: "Sophie", timezone: "Europe/Berlin" },
];

async function createWeeklyRotation(
  scheduleName: string,
  escalationPolicyId: string // attached to this schedule in a separate escalation-policy update (not shown)
): Promise<void> {
  const response = await fetch("https://api.pagerduty.com/schedules", {
    method: "POST",
    headers: {
      Authorization: `Token token=${process.env.PAGERDUTY_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      schedule: {
        name: scheduleName,
        time_zone: "UTC",
        schedule_layers: [
          {
            name: "Primary On-Call",
            start: new Date().toISOString(),
            rotation_turn_length_seconds: 7 * 24 * 3600, // 1 week
            rotation_virtual_start: "2026-10-05T09:00:00Z", // Monday 9am UTC
            users: ROTATION.map((eng) => ({
              user: { id: eng.pdUserId, type: "user" },
            })),
            restrictions: [
              // Primary is on-call only during these hours (in the schedule's
              // time zone); after-hours pages go to the secondary layer
              {
                type: "daily_restriction",
                start_time_of_day: "09:00:00",
                duration_seconds: 36000, // 10 hours
              },
            ],
          },
        ],
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`PagerDuty API error: ${response.status} ${await response.text()}`);
  }

  const data = await response.json();
  console.log("Created schedule:", data.schedule.id);
}
```
### On-Call Compensation
Paging engineers at 3am has a cost. Acknowledge it:
Minimum compensation model (adjust to your market):
- Being on primary rotation: $X/week (even if no pages)
- Each page received: $Y/page (acknowledgment bonus)
- Critical incident (P0/P1): $Z/incident (recognition for extended response)
- After-hours vs business hours: different rates
Alternative: Compensatory time off
- Each on-call week = 0.5 days off the following week
- Each P0 incident handled: 0.5-1 additional day off
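The comp-time model above is simple enough to make explicit in code. A minimal sketch, assuming the 0.5-day rates listed (the function name is ours, and we use the lower 0.5-day rate per P0):

```typescript
// Comp-time owed under the model above:
// 0.5 days per on-call week, plus 0.5 days per P0 incident handled.
function compDaysOwed(onCallWeeks: number, p0IncidentsHandled: number): number {
  return onCallWeeks * 0.5 + p0IncidentsHandled * 0.5;
}
```

An engineer who covered two on-call weeks and handled one P0 would be owed 1.5 days off.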
## Alert Fatigue Reduction
Alert fatigue is when engineers stop responding to alerts because there are too many false positives. It's one of the leading causes of missed incidents.
### Alert Classification

```typescript
// src/monitoring/alert-classifier.ts
// Classify alerts by signal quality before routing to on-call

interface AlertDefinition {
  name: string;
  query: string;
  threshold: number;
  duration: string; // Must be breached for this duration before firing
  severity: "info" | "warning" | "critical";
  // Signal-to-noise ratio from historical data
  truePositiveRate?: number; // What % of pages require actual action?
  avgResolutionMinutes?: number;
}

const ALERT_DEFINITIONS: AlertDefinition[] = [
  // High signal: always page on-call
  {
    name: "API Error Rate > 5%",
    query: `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05`,
    threshold: 0.05,
    duration: "2m",
    severity: "critical",
    truePositiveRate: 0.92, // Almost always real
  },
  {
    name: "Database Connection Pool Exhausted",
    query: `pg_pool_waiting_connections > 5`,
    threshold: 5,
    duration: "1m",
    severity: "critical",
    truePositiveRate: 0.88,
  },
  // Medium signal: page on-call during business hours only
  {
    name: "Memory Usage > 85%",
    query: `process_resident_memory_bytes / node_memory_MemTotal_bytes > 0.85`,
    threshold: 0.85,
    duration: "10m", // Long duration reduces false positives
    severity: "warning",
    truePositiveRate: 0.65,
  },
  // Low signal: Slack notification only (not a page)
  {
    name: "Slow Query Count Spike",
    query: `rate(pg_slow_queries_total[5m]) > 10`,
    threshold: 10,
    duration: "5m",
    severity: "info",
    truePositiveRate: 0.40, // Noisy; often resolves itself
  },
];

// Route alerts based on true-positive rate + time of day
export function shouldPage(
  alert: AlertDefinition,
  currentHourUTC: number
): boolean {
  if (alert.severity === "critical") return true;
  if (alert.severity === "warning") {
    // Only page during business hours for warnings
    const isBusinessHours = currentHourUTC >= 9 && currentHourUTC < 18;
    return isBusinessHours && (alert.truePositiveRate ?? 0) >= 0.7;
  }
  // Info: never page
  return false;
}
```
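The `truePositiveRate` fields have to come from somewhere; one option is to recompute them periodically from page history. A sketch with an illustrative record shape, where `requiredAction` mirrors the `required_action` column used in the weekly review query:

```typescript
// Sketch: recompute an alert's true-positive rate from recent page history.
// PageRecord is an assumed shape, not an existing API.
interface PageRecord {
  alertName: string;
  requiredAction: boolean; // did the page lead to actual remediation?
}

function truePositiveRate(
  pages: PageRecord[],
  alertName: string
): number | undefined {
  const relevant = pages.filter((p) => p.alertName === alertName);
  if (relevant.length < 5) return undefined; // too little volume to judge
  const actionable = relevant.filter((p) => p.requiredAction).length;
  return actionable / relevant.length;
}
```

Returning `undefined` below a minimum volume avoids over-trusting a rate computed from one or two pages.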
### Weekly Alert Review

```sql
-- Run weekly to identify noisy alerts
SELECT
  alert_name,
  COUNT(*) AS total_pages,
  COUNT(*) FILTER (WHERE required_action) AS actionable_pages,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE required_action) / COUNT(*), 1
  ) AS signal_rate_pct,
  ROUND(AVG(resolution_minutes)::numeric, 0) AS avg_resolution_min,
  COUNT(*) FILTER (WHERE hour_of_day < 7 OR hour_of_day >= 22) AS overnight_pages
FROM (
  SELECT
    i.alert_name,
    EXTRACT(HOUR FROM i.paged_at) AS hour_of_day,
    i.required_action, -- Was any action actually taken?
    EXTRACT(EPOCH FROM (i.resolved_at - i.paged_at)) / 60 AS resolution_minutes
  FROM incident_pages i
  WHERE i.paged_at >= NOW() - INTERVAL '30 days'
) sub
GROUP BY alert_name
HAVING COUNT(*) >= 5 -- Only analyze alerts with enough volume
ORDER BY signal_rate_pct ASC; -- Noisiest alerts first

-- Target: signal_rate_pct > 80% for any alert that pages on-call
-- If below 80%: increase the duration threshold, raise the alert threshold, or demote to Slack-only
```
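The 80% target can be wired into an automated recommendation on top of the review output. A sketch; the 80% figure is the target above, while the 50% cut-off between tuning and demotion is our assumption:

```typescript
// Map a measured signal rate (percent) to a recommended next step.
// 80% comes from the review target; 50% is an assumed cut-off.
type ReviewAction = "keep" | "tune-duration-or-threshold" | "demote-to-slack";

function reviewAction(signalRatePct: number): ReviewAction {
  if (signalRatePct > 80) return "keep";
  if (signalRatePct >= 50) return "tune-duration-or-threshold";
  return "demote-to-slack";
}
```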
## Runbook Hygiene
A runbook is only useful if it actually resolves the incident. Most runbooks are written once and never updated.
### Runbook Quality Checklist
For each runbook, verify:
## Discoverability
- [ ] Alert name in PagerDuty links directly to this runbook
- [ ] Runbook is searchable by the most common description of the symptom
- [ ] Last-updated date is visible
## Content Quality
- [ ] Alert name at top: matches exactly what appears in PagerDuty
- [ ] Observable symptoms: what does the engineer see when this fires?
- [ ] Severity decision tree: when is this P0 vs P1 vs P2?
- [ ] Step-by-step remediation: commands are copy-paste ready
- [ ] Commands tested in the last 90 days (or marked "not recently tested")
- [ ] Common false positives documented: "if you see X, it's probably nothing"
- [ ] Escalation path: who to call if this runbook doesn't resolve it

## Maintenance
- [ ] Assigned owner who actually uses this runbook
- [ ] "Last worked" date: when did an engineer last use this to resolve an incident?
- [ ] Feedback link: engineers can flag outdated info after incidents

## Format Anti-Patterns
- [ ] No commands that require background knowledge to interpret
- [ ] No "check the metrics" without specifying which dashboard and what to look for
- [ ] No "escalate to the team" without naming a specific person or channel
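Checklist compliance can be partially automated with a small linter run over the runbook repository. A minimal sketch; the required section names are assumptions drawn from the runbook template:

```typescript
// Sketch: flag runbook markdown that is missing required sections.
// Section names are illustrative; adjust them to your own template.
const REQUIRED_SECTIONS = [
  "What This Alert Means",
  "Severity Decision",
  "Remediation Steps",
  "Escalation",
];

function lintRunbook(markdown: string): string[] {
  return REQUIRED_SECTIONS.filter(
    (section) => !markdown.includes(`## ${section}`)
  ).map((section) => `missing section: ${section}`);
}
```

Running this in CI turns "the runbook is incomplete" from a retrospective finding into a failed check.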
### Runbook Template
````markdown
# Runbook: [Alert Name]

**PagerDuty Alert**: `[exact alert name]`
**Owner**: @[slack handle]
**Last Updated**: 2026-09-01
**Last Worked**: 2026-08-22 (used by @marcus, resolved in 12 min)

---

## What This Alert Means

[1-2 sentences explaining what the system is telling you]

## Is This a False Positive?

Check first:

- If [specific condition], this is a known false positive. Acknowledge and close.
- If [specific condition], check [specific dashboard URL] before proceeding.

## Severity Decision

- Memory > 90% AND climbing: P1 → page backup immediately
- Memory > 85% AND stable: P2 → investigate at your pace
- Memory > 85% AND business-hours spike: P2 → likely safe to watch for 10 min

## Remediation Steps

### Step 1: Identify the problem process

```bash
# SSH to the affected host
ssh $AFFECTED_HOST

# Find the top memory consumers
ps aux --sort=-%mem | head -20

# Check for a memory-leak pattern (RSS growing over time)
while true; do ps -o pid,rss,comm -p $(pgrep node) | tail -1; sleep 5; done
```

### Step 2: Quick relief (buy time)

```bash
# If memory > 95% and the system is at risk, restart the service.
# This causes ~30s of downtime. Only do this if the system is about to OOM.
sudo systemctl restart myapp

# Verify recovery
curl -I https://api.viprasol.com/health
```

### Step 3: Find root cause

- High memory after a deployment: revert and escalate to #engineering
- High memory after a traffic spike: scale horizontally (see runbook: horizontal-scaling)
- Slow memory growth over days: memory leak → escalate to the service owner

## Escalation

If unresolved after 20 minutes:

- Business hours: ping @platform-team in #incidents
- After hours: page the secondary on-call via PagerDuty

## Related Dashboards

## Related Runbooks
````

---
## Blameless Retrospective Framework

The goal of a retrospective is to improve the system, not to identify who made a mistake. Engineers who fear blame stop reporting near-misses and hide information, which makes outages worse.
### The Five Whys (Correctly Applied)

**Wrong application:**

Why did the database go down?
→ Because Marcus ran the wrong migration.

"Because Marcus" is a dead end. It assigns blame but provides no systemic fix.

**Correct application:**

Why did the database go down?
→ Because the migration locked the table for 20 minutes.

Why did the migration lock the table?
→ Because it used `ALTER TABLE ADD COLUMN NOT NULL` without a default.

Why was this migration written without the safe pattern?
→ Because our migration review checklist doesn't include lock safety.

Why doesn't our checklist include lock safety?
→ Because we've never had a locking incident before; it wasn't on our radar.

Why didn't we have documentation about safe migration patterns?
→ Because we assumed the ORM handled it.

**Root cause**: a gap in our migration review process.
**Fix**: add lock analysis to the PR checklist; document safe migration patterns.
### Retrospective Meeting Structure

```markdown
# Incident Retrospective Agenda (60 minutes)

**Before the meeting**: the Incident Commander completes the timeline.

## 1. Timeline Review (15 min)
Read the timeline together. No interruptions, no "why did you...".
The only question allowed: "Help me understand what you were seeing at this point."

## 2. Impact Review (5 min)
- Duration of impact
- Users/revenue affected
- Severity classification (was our initial assessment correct?)

## 3. What Went Well (10 min)
Genuine positives: fast detection, good communication, quick containment.
This section matters. Teams that skip it see retrospectives as punishment.

## 4. Root Cause Analysis (15 min)
Five Whys applied to the primary cause. No blame language:
"The engineer did X" becomes "The system allowed X without safeguard Y".

## 5. Action Items (15 min)
For each root cause: one concrete, assigned, dated action item.
Not "we should improve monitoring" (that's not an action item).
Yes: "Arjun will add the memory alert to the p50 baseline by Oct 7".
```
### Retrospective Rules (post on the wall)

- Assume positive intent: everyone was doing their best with the information they had
- No "you should have"; only "the system should have"
- Action items must have an owner and a date
- Retrospective notes are shared with the whole engineering team
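The owner-and-date rule is mechanical enough to enforce in tooling, for example a linter over retro notes. A sketch, with an assumed `ActionItem` shape:

```typescript
// Sketch: an action item is valid only if it names an owner and a due date,
// per the retrospective rules above. The shape is illustrative.
interface ActionItem {
  description: string;
  owner?: string; // e.g. "@arjun"
  dueDate?: string; // e.g. "2026-10-07"
}

function isValidActionItem(item: ActionItem): boolean {
  return Boolean(
    item.description.trim() && item.owner?.trim() && item.dueDate?.trim()
  );
}
```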
---
## On-Call Health Metrics
Track these monthly to catch burnout before it happens:
```sql
-- Monthly on-call load report
SELECT
engineer_id,
COUNT(*) AS total_pages,
COUNT(*) FILTER (WHERE EXTRACT(HOUR FROM paged_at) < 7
OR EXTRACT(HOUR FROM paged_at) >= 22) AS overnight_pages,
COUNT(*) FILTER (WHERE EXTRACT(DOW FROM paged_at) IN (0, 6)) AS weekend_pages,
ROUND(AVG(resolution_minutes), 0) AS avg_resolution_min,
MAX(resolution_minutes) AS worst_incident_min
FROM incident_pages
WHERE paged_at >= NOW() - INTERVAL '30 days'
GROUP BY engineer_id
ORDER BY overnight_pages DESC;
-- Red flags: > 15 overnight pages/month, > 8 weekend pages/month
-- These engineers are at risk of burnout and should be on a lighter rotation
```
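The red-flag thresholds in the comments can be applied automatically to the report rows. A sketch, with an assumed row shape matching the query's output columns:

```typescript
// Sketch: flag engineers whose monthly load exceeds the red-flag thresholds
// (> 15 overnight pages or > 8 weekend pages). Row shape is illustrative.
interface LoadRow {
  engineerId: string;
  overnightPages: number;
  weekendPages: number;
}

function atBurnoutRisk(row: LoadRow): boolean {
  return row.overnightPages > 15 || row.weekendPages > 8;
}
```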
## See Also

- Security Incident Response – security-specific runbooks
- Observability: SLIs, SLOs, and Error Budgets – alert design
- Engineering Metrics with DORA – MTTR measurement
- Technical Debt Management – reducing incident frequency
## Working With Viprasol
Sustainable on-call culture is an engineering investment, not just a policy document. We help teams design rotation schedules, tune alert signal-to-noise ratios, build runbook libraries that resolve incidents quickly, and establish retrospective practices that continuously improve reliability without burning out the engineers responsible for it.
## About the Author

**Viprasol Tech Team**
Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.