
SLI, SLO, and Error Budgets: Building Meaningful Observability in 2026

Define and implement SLIs and SLOs that engineering teams actually care about: error budget policies, Prometheus recording rules, Grafana dashboards, and alerting on error budget burn rate.

Viprasol Tech Team
August 18, 2026
13 min read


Most monitoring is reactive: a page goes down, an alert fires, someone fixes it. SLOs (Service Level Objectives) make reliability proactive: you define what "good enough" looks like, measure it continuously, and use the error budget to make deliberate tradeoffs between shipping features and hardening reliability.

The SRE framework behind SLOs comes from Google's Site Reliability Engineering book. This post takes that theory and turns it into working Prometheus rules, Grafana dashboards, and a team policy that changes behavior.


The SLI → SLO → Error Budget Chain

SLI (measurement):    "What fraction of requests succeeded in < 500ms?"
SLO (target):         "99.5% of requests must succeed in < 500ms over 30 days"
Error budget:         100% - 99.5% = 0.5% of requests may fail
                      = 0.5% × 30d × 24h × 60m × 60s = 12,960 seconds of downtime (≈ 3.6 hours)

Error budget policy:  When we've consumed >50% of error budget → slow down releases
                      When we've consumed 100% → feature freeze, focus on reliability
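The arithmetic above is worth sanity-checking in code. A minimal sketch in plain TypeScript (no dependencies; the 99.5% / 30-day numbers mirror the example above):

```typescript
// Error budget arithmetic for a 99.5% availability SLO over a 30-day window.
const SLO_TARGET = 0.995;
const WINDOW_SECONDS = 30 * 24 * 60 * 60; // 2,592,000 s

// Fraction of requests (or time) allowed to fail.
const errorBudget = 1 - SLO_TARGET; // 0.5%

// Expressed as full-outage downtime over the window: ~12,960 s, about 3.6 h.
const budgetSeconds = errorBudget * WINDOW_SECONDS;
const budgetHours = budgetSeconds / 3600;

console.log(`${budgetSeconds.toFixed(0)} s (${budgetHours.toFixed(1)} h)`);
```

Note that the budget is a fraction of *requests*, not wall-clock time; the downtime figure is only the worst case where every request fails during an outage.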

Choosing SLIs

SLIs should measure what users actually experience. Google's four golden signals (latency, traffic, errors, saturation) are a good starting framework; the table below adapts them to request-driven services:

Signal        | SLI Metric                                        | When to Use
--------------|---------------------------------------------------|---------------------------------
Availability  | Success rate (non-5xx / total requests)           | Every user-facing service
Latency       | % of requests < threshold (p95 or p99)            | Latency-sensitive APIs
Throughput    | Requests per second                               | Batch systems, data pipelines
Error rate    | Errors / total (more granular than availability)  | Complex systems with error types

Avoid: CPU, memory, and infrastructure metrics as SLIs — they measure symptoms, not user experience. A service can have 90% CPU and perfect availability, or 20% CPU and terrible latency.



Prometheus: Recording Rules for SLIs

# prometheus/rules/slo-rules.yaml
groups:
  - name: slo_api_service
    interval: 30s
    rules:
      # Raw rates: all requests, and 5xx errors
      - record: job:http_requests_total:rate5m
        expr: rate(http_requests_total[5m])

      - record: job:http_request_errors_total:rate5m
        expr: rate(http_requests_total{status=~"5.."}[5m])

      # Availability SLI: fraction of successful requests
      - record: job:http_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, service)
            /
            sum(rate(http_requests_total[5m])) by (job, service)
          )

      # Latency SLI: fraction of requests completing under 500ms
      - record: job:http_latency_fast:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (job, service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (job, service)

      # Error budget burn rate: how fast are we consuming the budget?
      # 1x = the budget lasts exactly the 30-day window. The window in the
      # rule name must match the range in the expression, so compute the
      # 1h rule from 1h data (99.5% SLO = 0.5% error budget).
      - record: job:error_budget_burn_rate:ratio_rate1h
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h])) by (job, service)
              / sum(rate(http_requests_total[1h])) by (job, service)
            )
          ) / (1 - 0.995)   # SLO target: 99.5% availability

      - record: job:error_budget_burn_rate:ratio_rate6h
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h])) by (job, service)
              / sum(rate(http_requests_total[6h])) by (job, service)
            )
          ) / (1 - 0.995)

      # Remaining error budget (30-day window)
      - record: job:error_budget_remaining:ratio_rate30d
        expr: |
          1 - (
            (
              1 - (
                sum(rate(http_requests_total{status!~"5.."}[30d])) by (job)
                / sum(rate(http_requests_total[30d])) by (job)
              )
            ) / (1 - 0.995)
          )
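Every burn-rate rule above reduces to one ratio: observed error fraction divided by budgeted error fraction. A plain-TypeScript restatement (illustrative only; the 93% availability figure is a made-up example):

```typescript
// Burn rate: how many times faster than "sustainable" the budget is being
// consumed. 1x means the budget lasts exactly the SLO window; >1x is faster.
function burnRate(availability: number, sloTarget: number): number {
  return (1 - availability) / (1 - sloTarget);
}

// Example: 93% availability against a 99.5% SLO is 7% errors against a
// 0.5% budget, i.e. a 14x burn rate: exactly the paging threshold used
// in the alerts below.
const burn = burnRate(0.93, 0.995);
```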

Multi-Window Alerting (Google's Approach)

Single-window alerts on error rate cause alert fatigue. Multi-window burn rate alerts fire only when you're consuming budget faster than sustainable:

# prometheus/rules/slo-alerts.yaml
groups:
  - name: slo_alerts
    rules:
      # Page: budget is burning 14x faster than sustainable in the 1h window,
      # AND the 6h window agrees, so a brief spike alone doesn't page
      - alert: ErrorBudgetBurnFast
        expr: |
          job:error_budget_burn_rate:ratio_rate1h{job="api"} > 14
          and
          job:error_budget_burn_rate:ratio_rate6h{job="api"} > 14
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error budget burning fast ({{ $value | humanize }}x)"
          description: |
            Error budget burn rate is {{ $value | humanize }}x the sustainable rate.
            At 14x, the 30-day error budget is exhausted in roughly two days
            (30d / burn rate). Check job:error_budget_remaining:ratio_rate30d
            for the remaining budget.

      # Ticket: sustained moderate burn
      - alert: ErrorBudgetBurnModerate
        expr: |
          job:error_budget_burn_rate:ratio_rate6h{job="api"} > 2
          and
          job:error_budget_burn_rate:ratio_rate6h{job="api"} <= 14
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "API error budget consuming faster than expected"
          description: "Burn rate {{ $value | humanize }}x — investigate and create ticket"

      # Availability drops below SLO threshold
      - alert: AvailabilityBelowSLO
        expr: job:http_availability:ratio_rate5m{job="api"} < 0.995
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API availability below 99.5% SLO"
          description: "Current availability: {{ $value | humanizePercentage }}"

      # Latency SLO breach
      - alert: LatencyBelowSLO
        expr: job:http_latency_fast:ratio_rate5m{job="api"} < 0.99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "< 99% of requests completing in < 500ms"
          description: "{{ $value | humanizePercentage }} of requests are fast"
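The 14x and 2x thresholds are not arbitrary. In Google's SRE Workbook they fall out of "alert when X% of the budget is consumed within window W". A sketch of that derivation, assuming the 30-day SLO window used throughout this post:

```typescript
// Burn-rate threshold for "X fraction of the budget consumed in W hours",
// given a 30-day (720 h) SLO window.
const SLO_WINDOW_HOURS = 30 * 24;

function burnThreshold(budgetFraction: number, windowHours: number): number {
  return (budgetFraction * SLO_WINDOW_HOURS) / windowHours;
}

// SRE Workbook tiers: 2% of budget in 1 h -> 14.4x and 5% in 6 h -> 6x
// (both page); 10% in 3 days -> 1x (ticket).
const fastPage = burnThreshold(0.02, 1);  // 14.4 (rounded to 14 above)
const slowPage = burnThreshold(0.05, 6);  // 6
const ticket   = burnThreshold(0.10, 72); // 1

// At a given burn rate, hours until the whole budget is gone,
// e.g. at 14.4x the 30-day budget lasts only 50 hours.
function hoursToExhaustion(burn: number): number {
  return SLO_WINDOW_HOURS / burn;
}
```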


Grafana Dashboard: Error Budget

// Grafana dashboard panel examples (JSON model excerpt)
{
  "panels": [
    {
      "title": "Error Budget Remaining (30d)",
      "type": "gauge",
      "targets": [{
        "expr": "job:error_budget_remaining:ratio_rate30d{job='api'} * 100",
        "legendFormat": "Budget remaining"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "red",    "value": 0  },
              { "color": "yellow", "value": 25 },
              { "color": "green",  "value": 50 }
            ]
          }
        }
      }
    },
    {
      "title": "Burn Rate (1h vs 6h)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "job:error_budget_burn_rate:ratio_rate1h{job='api'}",
          "legendFormat": "1h burn rate"
        },
        {
          "expr": "job:error_budget_burn_rate:ratio_rate6h{job='api'}",
          "legendFormat": "6h burn rate"
        }
      ],
      "thresholds": [
        { "value": 14, "color": "red",    "line": true },
        { "value": 2,  "color": "yellow", "line": true },
        { "value": 1,  "color": "green",  "line": true }
      ]
    }
  ]
}

Application-Level SLI Instrumentation

// src/lib/metrics.ts — Prometheus client instrumentation
import { Registry, Counter, Histogram } from 'prom-client';
import type { FastifyInstance } from 'fastify';

export const registry = new Registry();

export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status', 'service'],
  registers: [registry],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status', 'service'],
  // Carefully chosen buckets covering your SLO thresholds
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10],
  registers: [registry],
});

// Fastify middleware to instrument all routes
export function metricsPlugin(app: FastifyInstance): void {
  app.addHook('onResponse', (req, reply, done) => {
    const route  = req.routerPath ?? 'unknown';
    const method = req.method;
    const status = String(reply.statusCode);

    httpRequestsTotal.labels(method, route, status, 'api').inc();
    httpRequestDuration
      .labels(method, route, status, 'api')
      .observe(reply.elapsedTime / 1000);

    done();
  });

  // /metrics endpoint for Prometheus scraping
  app.get('/metrics', async (_req, reply) => {
    reply.type(registry.contentType); // correct Prometheus exposition content type
    return registry.metrics();
  });
}
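One detail worth internalizing from the latency recording rule: Prometheus histogram buckets are cumulative, so the `le="0.5"` series already counts every faster request, and the SLI is a single division. A dependency-free sketch with made-up bucket counts:

```typescript
// Cumulative histogram buckets as scraped from /metrics: le label -> count.
type BucketCounts = Record<string, number>;

function latencySli(buckets: BucketCounts, totalCount: number, le: string): number {
  const fast = buckets[le];
  if (fast === undefined) throw new Error(`no bucket with le=${le}`);
  return totalCount === 0 ? 1 : fast / totalCount; // no traffic = no SLO breach
}

// Hypothetical scrape: 980 of 1,000 requests finished in <= 0.5 s.
const sli = latencySli({ "0.1": 700, "0.5": 980, "1": 995, "+Inf": 1000 }, 1000, "0.5");
// 0.98, i.e. below the 99% latency SLO, so LatencyBelowSLO would fire.
```

This is also why the bucket boundaries in `httpRequestDuration` must include your SLO threshold (0.5 here): you can only compute the SLI at a boundary that exists.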

Error Budget Policy

The policy is what makes SLOs actionable — it defines what happens at each budget threshold:

## Error Budget Policy — API Service (SLO: 99.5% availability / 30d)

### Budget = 100% (start of month)
Normal development velocity. Features and infra changes proceed.

### Budget = 50% consumed
- Slow down high-risk deployments
- Require additional review for changes touching critical paths
- Review recent incidents for patterns

### Budget = 75% consumed  
- Pause non-critical feature work
- Engineering focus shifts to reliability improvements
- Post-mortems required for all incidents that contributed

### Budget = 100% consumed (SLO breach)
- Feature freeze until budget recovers
- All engineering capacity on reliability work
- Daily sync between EM, product, and on-call
- No deployments without incident commander approval

### Recovery
Budget resets as the 30-day window slides past old incidents.
Feature work resumes when remaining budget > 25%.

### Exemptions
Planned maintenance declared 48h in advance may be excluded from SLI calculation.
One convention is a PrometheusRule annotation such as `slo.exclude: "2026-08-18T02:00:00Z/PT2H"`; Prometheus does not enforce this, so your SLO tooling must read and apply it.
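The thresholds in this policy are mechanical enough to encode, for example as a deploy gate. A sketch (the stage names here are made up; wire the result to whatever your CD system exposes):

```typescript
type PolicyStage = "normal" | "slow-down" | "reliability-focus" | "freeze";

// budgetConsumed: 0 = untouched budget, 1 = SLO breached.
function policyStage(budgetConsumed: number): PolicyStage {
  if (budgetConsumed >= 1.0) return "freeze";
  if (budgetConsumed >= 0.75) return "reliability-focus";
  if (budgetConsumed >= 0.5) return "slow-down";
  return "normal";
}

// A deploy gate could refuse high-risk releases outside "normal":
const canDeployHighRisk = policyStage(0.6) === "normal"; // false at 60% consumed
```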

Working With Viprasol

We implement SLO-based observability for engineering teams — from SLI definition and Prometheus recording rules through Grafana dashboards and error budget policy rollout.

What we deliver:

  • SLI/SLO definition workshops (what to measure and why)
  • Prometheus recording rules for availability and latency SLIs
  • Multi-window burn rate alerting configuration
  • Grafana error budget dashboard
  • Error budget policy document and team rollout

Discuss your observability stack · Cloud infrastructure and DevOps services



About the Author


Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
