SLI, SLO, and Error Budgets: Building Meaningful Observability in 2026
Define and implement SLIs and SLOs that engineering teams actually care about: error budget policies, Prometheus recording rules, Grafana dashboards, and alerting on error budget burn rate.
Most monitoring is reactive: a page goes down, an alert fires, someone fixes it. SLOs (Service Level Objectives) make reliability proactive: you define what "good enough" looks like, measure it continuously, and use the error budget to make deliberate tradeoffs between shipping features and hardening reliability.
The SRE framework behind SLOs comes from Google's Site Reliability Engineering book. This post takes that theory and turns it into working Prometheus rules, Grafana dashboards, and a team policy that changes behavior.
The SLI → SLO → Error Budget Chain
- SLI (measurement): "What fraction of requests succeeded in < 500ms?"
- SLO (target): "99.5% of requests must succeed in < 500ms over 30 days"
- Error budget: 100% − 99.5% = 0.5% of requests may fail. With uniform traffic, that is equivalent to 0.5% × 30d × 24h × 60m × 60s = 12,960 seconds (~3.6 hours) of full downtime per window.
- Error budget policy: when more than 50% of the budget is consumed → slow down releases; at 100% consumed → feature freeze, focus on reliability
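The arithmetic in the chain above is worth making concrete. A quick standalone sketch (values mirror the 99.5%/30d example; the uniform-traffic assumption is the same simplification noted above):

```typescript
// Error budget arithmetic for a 99.5% availability SLO over 30 days.
const slo = 0.995;
const windowDays = 30;

// Fraction of requests allowed to fail.
const errorBudget = 1 - slo; // 0.5%

// Equivalent full-outage allowance, assuming uniform traffic.
const windowSeconds = windowDays * 24 * 60 * 60; // 2,592,000 s
const downtimeSeconds = Math.round(errorBudget * windowSeconds); // 12,960 s ≈ 3.6 h
```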
Choosing SLIs
SLIs should measure what users actually experience. The four golden signals are a good starting framework:
| Signal | SLI Metric | When to Use |
|---|---|---|
| Availability | Success rate (non-5xx / total requests) | Every user-facing service |
| Latency | % of requests < threshold (p95 or p99) | Latency-sensitive APIs |
| Throughput | Requests per second | Batch systems, data pipelines |
| Error rate | Errors / total (more granular than availability) | Complex systems with error types |
Avoid CPU, memory, and other infrastructure metrics as SLIs — they measure causes, not what users experience. A service can run at 90% CPU with perfect availability, or at 20% CPU with terrible latency.
Prometheus: Recording Rules for SLIs
# prometheus/rules/slo-rules.yaml
groups:
- name: slo_api_service
interval: 30s
rules:
# Raw SLI: request success rate (not 5xx errors)
- record: job:http_requests_total:rate5m
expr: rate(http_requests_total[5m])
- record: job:http_request_errors_total:rate5m
expr: rate(http_requests_total{status=~"5.."}[5m])
# Availability SLI: fraction of successful requests
- record: job:http_availability:ratio_rate5m
expr: |
1 - (
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, service)
/
sum(rate(http_requests_total[5m])) by (job, service)
)
# Latency SLI: fraction of requests completing under 500ms
- record: job:http_latency_fast:ratio_rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (job, service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job, service)
# Error budget burn rate: how fast are we consuming the budget?
# Compare current error rate to the SLO target (99.5% = 0.5% error budget)
  - record: job:error_budget_burn_rate:ratio_rate1h
    expr: |
      (
        1 - (
          sum(rate(http_requests_total{status!~"5.."}[1h])) by (job, service)
          / sum(rate(http_requests_total[1h])) by (job, service)
        )
      ) / (1 - 0.995)  # SLO target: 99.5% availability
- record: job:error_budget_burn_rate:ratio_rate6h
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[6h])) by (job, service)
/ sum(rate(http_requests_total[6h])) by (job, service)
)
) / (1 - 0.995)
# Remaining error budget (30-day window)
- record: job:error_budget_remaining:ratio_rate30d
expr: |
1 - (
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[30d])) by (job)
/ sum(rate(http_requests_total[30d])) by (job)
)
) / (1 - 0.995)
)
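To sanity-check the burn-rate rules, here is the same arithmetic as a standalone sketch (the availability figures are made up for illustration, not taken from the rules above):

```typescript
// Burn rate = observed error rate / allowed error rate.
// A burn rate of 1 means the budget lasts exactly the 30-day window.
const sloTarget = 0.995;

function burnRate(observedAvailability: number): number {
  return (1 - observedAvailability) / (1 - sloTarget);
}

// 99.3% availability burns the budget 1.4x faster than sustainable.
const moderate = burnRate(0.993);

// 98% availability is a 4x burn: the 30-day budget is gone in 7.5 days.
const daysToExhaustion = 30 / burnRate(0.98);
```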
Multi-Window Alerting (Google's Approach)
Single-window alerts on error rate cause alert fatigue. Multi-window burn rate alerts fire only when you're consuming budget faster than sustainable:
# prometheus/rules/slo-alerts.yaml
groups:
- name: slo_alerts
rules:
  # Page: burn rate >14x sustainable on both the 1h and 6h windows —
  # severe and sustained, not a momentary blip
- alert: ErrorBudgetBurnFast
expr: |
job:error_budget_burn_rate:ratio_rate1h{job="api"} > 14
and
job:error_budget_burn_rate:ratio_rate6h{job="api"} > 14
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "API error budget burning fast ({{ $value | humanize }}x)"
      description: |
        Error budget burn rate is {{ $value | humanize }}x the sustainable rate.
        At 14x, the 30-day error budget is exhausted in roughly two days.
        Remaining budget: {{ with query "job:error_budget_remaining:ratio_rate30d{job='api'}" }}{{ . | first | value | humanizePercentage }}{{ end }}
# Ticket: sustained moderate burn
- alert: ErrorBudgetBurnModerate
expr: |
job:error_budget_burn_rate:ratio_rate6h{job="api"} > 2
and
job:error_budget_burn_rate:ratio_rate6h{job="api"} <= 14
for: 30m
labels:
severity: warning
team: platform
annotations:
summary: "API error budget consuming faster than expected"
description: "Burn rate {{ $value | humanize }}x — investigate and create ticket"
# Availability drops below SLO threshold
- alert: AvailabilityBelowSLO
expr: job:http_availability:ratio_rate5m{job="api"} < 0.995
for: 5m
labels:
severity: critical
annotations:
summary: "API availability below 99.5% SLO"
description: "Current availability: {{ $value | humanizePercentage }}"
# Latency SLO breach
- alert: LatencyBelowSLO
expr: job:http_latency_fast:ratio_rate5m{job="api"} < 0.99
for: 5m
labels:
severity: warning
annotations:
summary: "< 99% of requests completing in < 500ms"
description: "{{ $value | humanizePercentage }} of requests are fast"
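The 14x and 2x thresholds come from asking how much of the monthly budget a burn consumes before the alert fires — the derivation behind Google's commonly quoted 14.4x figure. A rough sketch of that calculation:

```typescript
// Fraction of a 30-day error budget consumed if a given burn rate
// persists for the full alert window.
function budgetConsumed(burnRate: number, windowHours: number): number {
  const sloWindowHours = 30 * 24; // 720h in the SLO window
  return (burnRate * windowHours) / sloWindowHours;
}

// 14x sustained for 1h burns ~1.9% of the month's budget: page someone.
const fastBurn = budgetConsumed(14, 1);

// 2x sustained for 6h burns ~1.7%: worth a ticket, not a page.
const slowBurn = budgetConsumed(2, 6);
```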
Grafana Dashboard: Error Budget
// Grafana dashboard panel examples (JSON model excerpt)
{
"panels": [
{
"title": "Error Budget Remaining (30d)",
"type": "gauge",
"targets": [{
"expr": "job:error_budget_remaining:ratio_rate30d{job='api'} * 100",
"legendFormat": "Budget remaining"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 25 },
{ "color": "green", "value": 50 }
]
}
}
}
},
{
"title": "Burn Rate (1h vs 6h)",
"type": "timeseries",
"targets": [
{
"expr": "job:error_budget_burn_rate:ratio_rate1h{job='api'}",
"legendFormat": "1h burn rate"
},
{
"expr": "job:error_budget_burn_rate:ratio_rate6h{job='api'}",
"legendFormat": "6h burn rate"
}
],
"thresholds": [
{ "value": 14, "color": "red", "line": true },
{ "value": 2, "color": "yellow", "line": true },
{ "value": 1, "color": "green", "line": true }
]
}
]
}
Application-Level SLI Instrumentation
// src/lib/metrics.ts — Prometheus client instrumentation
import { Registry, Counter, Histogram } from 'prom-client';
import type { FastifyInstance } from 'fastify';

export const registry = new Registry();
export const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status', 'service'],
registers: [registry],
});
export const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status', 'service'],
// Carefully chosen buckets covering your SLO thresholds
buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10],
registers: [registry],
});
// Fastify middleware to instrument all routes
export function metricsPlugin(app: FastifyInstance): void {
app.addHook('onResponse', (req, reply, done) => {
const route = req.routerPath ?? 'unknown'; // deprecated in Fastify 4 — prefer req.routeOptions.url on ≥4.10
const method = req.method;
const status = String(reply.statusCode);
httpRequestsTotal.labels(method, route, status, 'api').inc();
httpRequestDuration
.labels(method, route, status, 'api')
.observe(reply.elapsedTime / 1000);
done();
});
// /metrics endpoint for Prometheus scraping
app.get('/metrics', async (_req, reply) => {
reply.type(registry.contentType); // "text/plain; version=0.0.4; ..."
return registry.metrics();
});
}
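The latency SLI that the recording rule derives from this histogram is just a ratio of cumulative bucket counts. A minimal sketch with hypothetical counts (not real scrape data):

```typescript
// Prometheus histogram buckets are cumulative: the le="0.5" bucket already
// includes every request counted by the smaller buckets.
const buckets: Record<string, number> = {
  "0.05": 4200, "0.1": 7900, "0.2": 9100, "0.5": 9850,
  "1": 9960, "2": 9990, "5": 9998, "10": 10000,
};
const totalCount = 10000; // http_request_duration_seconds_count

// Fraction of requests under the 500ms threshold — the same ratio
// job:http_latency_fast:ratio_rate5m computes over rates.
const fastRatio = buckets["0.5"] / totalCount; // 0.985
```

At 98.5% fast requests, this example would trip the LatencyBelowSLO alert above (threshold 0.99).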
Error Budget Policy
The policy is what makes SLOs actionable — it defines what happens at each budget threshold:
## Error Budget Policy — API Service (SLO: 99.5% availability / 30d)
### Budget = 100% (start of month)
Normal development velocity. Features and infra changes proceed.
### Budget = 50% consumed
- Slow down high-risk deployments
- Require additional review for changes touching critical paths
- Review recent incidents for patterns
### Budget = 75% consumed
- Pause non-critical feature work
- Engineering focus shifts to reliability improvements
- Post-mortems required for all incidents that contributed
### Budget = 100% consumed (SLO breach)
- Feature freeze until budget recovers
- All engineering capacity on reliability work
- Daily sync between EM, product, and on-call
- No deployments without incident commander approval
### Recovery
Budget resets as the 30-day window slides past old incidents.
Feature work resumes when remaining budget > 25%.
### Exemptions
Planned maintenance declared 48h in advance may be excluded from the SLI calculation.
Note that Prometheus has no built-in exclusion window — treat an annotation such as `slo.exclude: "2026-08-18T02:00:00Z/PT2H"` as a team convention that your SLO reporting tooling must honor.
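To make the thresholds above machine-enforceable (for example, as a deploy gate or a Slack bot), the policy can be encoded directly. A sketch — the stage names are illustrative, not a standard API:

```typescript
type PolicyStage = "normal" | "slow_down" | "reliability_focus" | "freeze";

// Map the fraction of error budget consumed to the policy stages defined above.
function policyStage(budgetConsumedFraction: number): PolicyStage {
  if (budgetConsumedFraction >= 1.0) return "freeze";
  if (budgetConsumedFraction >= 0.75) return "reliability_focus";
  if (budgetConsumedFraction >= 0.5) return "slow_down";
  return "normal";
}

// 60% consumed -> slow down high-risk deployments.
const stage = policyStage(0.6);
```

A deploy pipeline could query `job:error_budget_remaining:ratio_rate30d`, convert it to a consumed fraction, and refuse to proceed when the stage is "freeze".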
Working With Viprasol
We implement SLO-based observability for engineering teams — from SLI definition and Prometheus recording rules through Grafana dashboards and error budget policy rollout.
What we deliver:
- SLI/SLO definition workshops (what to measure and why)
- Prometheus recording rules for availability and latency SLIs
- Multi-window burn rate alerting configuration
- Grafana error budget dashboard
- Error budget policy document and team rollout
→ Discuss your observability stack → Cloud infrastructure and DevOps services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.