SLI, SLO, and Error Budgets: Building Meaningful Observability in 2026
Define and implement SLIs and SLOs that engineering teams actually care about: error budget policies, Prometheus recording rules, Grafana dashboards, and alerting on error budget burn rate.
Most monitoring is reactive: a page goes down, an alert fires, someone fixes it. SLOs (Service Level Objectives) make reliability proactive: you define what "good enough" looks like, measure it continuously, and use the error budget to make deliberate tradeoffs between shipping features and hardening reliability.
The SRE framework behind SLOs comes from Google's Site Reliability Engineering book. This post takes that theory and turns it into working Prometheus rules, Grafana dashboards, and a team policy that changes behavior.
The SLI → SLO → Error Budget Chain
- SLI (measurement): "What fraction of requests succeeded in < 500ms?"
- SLO (target): "99.5% of requests must succeed in < 500ms over 30 days"
- Error budget: 100% − 99.5% = 0.5% of requests may fail. With uniform traffic, that is equivalent to 0.5% × 30d × 24h × 60m × 60s = 12,960 seconds (~3.6 hours) of full downtime per window.
- Error budget policy: when more than 50% of the budget is consumed → slow down releases; at 100% consumed → feature freeze, focus on reliability
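The arithmetic in the chain above is worth making concrete. A quick standalone sketch (values mirror the 99.5%/30d example; the uniform-traffic assumption is the same simplification noted above):

```typescript
// Error budget arithmetic for a 99.5% availability SLO over 30 days.
const slo = 0.995;
const windowDays = 30;

// Fraction of requests allowed to fail.
const errorBudget = 1 - slo; // 0.5%

// Equivalent full-outage allowance, assuming uniform traffic.
const windowSeconds = windowDays * 24 * 60 * 60; // 2,592,000 s
const downtimeSeconds = Math.round(errorBudget * windowSeconds); // 12,960 s ≈ 3.6 h
```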
Choosing SLIs
SLIs should measure what users actually experience. The four golden signals are a good starting framework:
| Signal | SLI Metric | When to Use |
|---|---|---|
| Availability | Success rate (non-5xx / total requests) | Every user-facing service |
| Latency | % of requests < threshold (p95 or p99) | Latency-sensitive APIs |
| Throughput | Requests per second | Batch systems, data pipelines |
| Error rate | Errors / total (more granular than availability) | Complex systems with error types |
Avoid CPU, memory, and other infrastructure metrics as SLIs — they measure causes, not what users experience. A service can run at 90% CPU with perfect availability, or at 20% CPU with terrible latency.
Prometheus: Recording Rules for SLIs
# prometheus/rules/slo-rules.yaml
groups:
- name: slo_api_service
interval: 30s
rules:
# Raw SLI: request success rate (not 5xx errors)
- record: job:http_requests_total:rate5m
expr: rate(http_requests_total[5m])
- record: job:http_request_errors_total:rate5m
expr: rate(http_requests_total{status=~"5.."}[5m])
# Availability SLI: fraction of successful requests
- record: job:http_availability:ratio_rate5m
expr: |
1 - (
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, service)
/
sum(rate(http_requests_total[5m])) by (job, service)
)
# Latency SLI: fraction of requests completing under 500ms
- record: job:http_latency_fast:ratio_rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (job, service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job, service)
# Error budget burn rate: how fast are we consuming the budget?
# Compare current error rate to the SLO target (99.5% = 0.5% error budget)
  - record: job:error_budget_burn_rate:ratio_rate1h
    expr: |
      (
        1 - (
          sum(rate(http_requests_total{status!~"5.."}[1h])) by (job, service)
          / sum(rate(http_requests_total[1h])) by (job, service)
        )
      ) / (1 - 0.995)  # SLO target: 99.5% availability
- record: job:error_budget_burn_rate:ratio_rate6h
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[6h])) by (job, service)
/ sum(rate(http_requests_total[6h])) by (job, service)
)
) / (1 - 0.995)
# Remaining error budget (30-day window)
- record: job:error_budget_remaining:ratio_rate30d
expr: |
1 - (
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[30d])) by (job)
/ sum(rate(http_requests_total[30d])) by (job)
)
) / (1 - 0.995)
)
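To sanity-check the burn-rate rules, here is the same arithmetic as a standalone sketch (the availability figures are made up for illustration, not taken from the rules above):

```typescript
// Burn rate = observed error rate / allowed error rate.
// A burn rate of 1 means the budget lasts exactly the 30-day window.
const sloTarget = 0.995;

function burnRate(observedAvailability: number): number {
  return (1 - observedAvailability) / (1 - sloTarget);
}

// 99.3% availability burns the budget 1.4x faster than sustainable.
const moderate = burnRate(0.993);

// 98% availability is a 4x burn: the 30-day budget is gone in 7.5 days.
const daysToExhaustion = 30 / burnRate(0.98);
```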
Multi-Window Alerting (Google's Approach)
Single-window alerts on error rate cause alert fatigue. Multi-window burn rate alerts fire only when you're consuming budget faster than sustainable:
# prometheus/rules/slo-alerts.yaml
groups:
- name: slo_alerts
rules:
  # Page: burn rate >14x sustainable on both the 1h and 6h windows —
  # severe and sustained, not a momentary blip
- alert: ErrorBudgetBurnFast
expr: |
job:error_budget_burn_rate:ratio_rate1h{job="api"} > 14
and
job:error_budget_burn_rate:ratio_rate6h{job="api"} > 14
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "API error budget burning fast ({{ $value | humanize }}x)"
      description: |
        Error budget burn rate is {{ $value | humanize }}x the sustainable rate.
        At 14x, the 30-day error budget is exhausted in roughly two days.
        Remaining budget: {{ with query "job:error_budget_remaining:ratio_rate30d{job='api'}" }}{{ . | first | value | humanizePercentage }}{{ end }}
# Ticket: sustained moderate burn
- alert: ErrorBudgetBurnModerate
expr: |
job:error_budget_burn_rate:ratio_rate6h{job="api"} > 2
and
job:error_budget_burn_rate:ratio_rate6h{job="api"} <= 14
for: 30m
labels:
severity: warning
team: platform
annotations:
summary: "API error budget consuming faster than expected"
description: "Burn rate {{ $value | humanize }}x — investigate and create ticket"
# Availability drops below SLO threshold
- alert: AvailabilityBelowSLO
expr: job:http_availability:ratio_rate5m{job="api"} < 0.995
for: 5m
labels:
severity: critical
annotations:
summary: "API availability below 99.5% SLO"
description: "Current availability: {{ $value | humanizePercentage }}"
# Latency SLO breach
- alert: LatencyBelowSLO
expr: job:http_latency_fast:ratio_rate5m{job="api"} < 0.99
for: 5m
labels:
severity: warning
annotations:
summary: "< 99% of requests completing in < 500ms"
description: "{{ $value | humanizePercentage }} of requests are fast"
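The 14x and 2x thresholds come from asking how much of the monthly budget a burn consumes before the alert fires — the derivation behind Google's commonly quoted 14.4x figure. A rough sketch of that calculation:

```typescript
// Fraction of a 30-day error budget consumed if a given burn rate
// persists for the full alert window.
function budgetConsumed(burnRate: number, windowHours: number): number {
  const sloWindowHours = 30 * 24; // 720h in the SLO window
  return (burnRate * windowHours) / sloWindowHours;
}

// 14x sustained for 1h burns ~1.9% of the month's budget: page someone.
const fastBurn = budgetConsumed(14, 1);

// 2x sustained for 6h burns ~1.7%: worth a ticket, not a page.
const slowBurn = budgetConsumed(2, 6);
```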
Grafana Dashboard: Error Budget
// Grafana dashboard panel examples (JSON model excerpt)
{
"panels": [
{
"title": "Error Budget Remaining (30d)",
"type": "gauge",
"targets": [{
"expr": "job:error_budget_remaining:ratio_rate30d{job='api'} * 100",
"legendFormat": "Budget remaining"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 25 },
{ "color": "green", "value": 50 }
]
}
}
}
},
{
"title": "Burn Rate (1h vs 6h)",
"type": "timeseries",
"targets": [
{
"expr": "job:error_budget_burn_rate:ratio_rate1h{job='api'}",
"legendFormat": "1h burn rate"
},
{
"expr": "job:error_budget_burn_rate:ratio_rate6h{job='api'}",
"legendFormat": "6h burn rate"
}
],
"thresholds": [
{ "value": 14, "color": "red", "line": true },
{ "value": 2, "color": "yellow", "line": true },
{ "value": 1, "color": "green", "line": true }
]
}
]
}
Application-Level SLI Instrumentation
// src/lib/metrics.ts — Prometheus client instrumentation
import { Registry, Counter, Histogram } from 'prom-client';
import type { FastifyInstance } from 'fastify';

export const registry = new Registry();
export const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status', 'service'],
registers: [registry],
});
export const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status', 'service'],
// Carefully chosen buckets covering your SLO thresholds
buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10],
registers: [registry],
});
// Fastify middleware to instrument all routes
export function metricsPlugin(app: FastifyInstance): void {
app.addHook('onResponse', (req, reply, done) => {
const route = req.routerPath ?? 'unknown'; // deprecated in Fastify 4 — prefer req.routeOptions.url on ≥4.10
const method = req.method;
const status = String(reply.statusCode);
httpRequestsTotal.labels(method, route, status, 'api').inc();
httpRequestDuration
.labels(method, route, status, 'api')
.observe(reply.elapsedTime / 1000);
done();
});
// /metrics endpoint for Prometheus scraping
app.get('/metrics', async (_req, reply) => {
reply.type(registry.contentType); // "text/plain; version=0.0.4; ..."
return registry.metrics();
});
}
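The latency SLI that the recording rule derives from this histogram is just a ratio of cumulative bucket counts. A minimal sketch with hypothetical counts (not real scrape data):

```typescript
// Prometheus histogram buckets are cumulative: the le="0.5" bucket already
// includes every request counted by the smaller buckets.
const buckets: Record<string, number> = {
  "0.05": 4200, "0.1": 7900, "0.2": 9100, "0.5": 9850,
  "1": 9960, "2": 9990, "5": 9998, "10": 10000,
};
const totalCount = 10000; // http_request_duration_seconds_count

// Fraction of requests under the 500ms threshold — the same ratio
// job:http_latency_fast:ratio_rate5m computes over rates.
const fastRatio = buckets["0.5"] / totalCount; // 0.985
```

At 98.5% fast requests, this example would trip the LatencyBelowSLO alert above (threshold 0.99).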
Error Budget Policy
The policy is what makes SLOs actionable — it defines what happens at each budget threshold:
## Error Budget Policy — API Service (SLO: 99.5% availability / 30d)
### Budget = 100% (start of month)
Normal development velocity. Features and infra changes proceed.
### Budget = 50% consumed
- Slow down high-risk deployments
- Require additional review for changes touching critical paths
- Review recent incidents for patterns
### Budget = 75% consumed
- Pause non-critical feature work
- Engineering focus shifts to reliability improvements
- Post-mortems required for all incidents that contributed
### Budget = 100% consumed (SLO breach)
- Feature freeze until budget recovers
- All engineering capacity on reliability work
- Daily sync between EM, product, and on-call
- No deployments without incident commander approval
### Recovery
Budget resets as the 30-day window slides past old incidents.
Feature work resumes when remaining budget > 25%.
### Exemptions
Planned maintenance declared 48h in advance may be excluded from the SLI calculation.
Note that Prometheus has no built-in exclusion window — treat an annotation such as `slo.exclude: "2026-08-18T02:00:00Z/PT2H"` as a team convention that your SLO reporting tooling must honor.
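To make the thresholds above machine-enforceable (for example, as a deploy gate or a Slack bot), the policy can be encoded directly. A sketch — the stage names are illustrative, not a standard API:

```typescript
type PolicyStage = "normal" | "slow_down" | "reliability_focus" | "freeze";

// Map the fraction of error budget consumed to the policy stages defined above.
function policyStage(budgetConsumedFraction: number): PolicyStage {
  if (budgetConsumedFraction >= 1.0) return "freeze";
  if (budgetConsumedFraction >= 0.75) return "reliability_focus";
  if (budgetConsumedFraction >= 0.5) return "slow_down";
  return "normal";
}

// 60% consumed -> slow down high-risk deployments.
const stage = policyStage(0.6);
```

A deploy pipeline could query `job:error_budget_remaining:ratio_rate30d`, convert it to a consumed fraction, and refuse to proceed when the stage is "freeze".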
Working With Viprasol
We implement SLO-based observability for engineering teams — from SLI definition and Prometheus recording rules through Grafana dashboards and error budget policy rollout.
What we deliver:
- SLI/SLO definition workshops (what to measure and why)
- Prometheus recording rules for availability and latency SLIs
- Multi-window burn rate alerting configuration
- Grafana error budget dashboard
- Error budget policy document and team rollout
→ Discuss your observability stack → Cloud infrastructure and DevOps services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.