Observability and Monitoring: Logs, Metrics, Traces, and Alerting That Works
Observability and monitoring in 2026 — structured logging, Prometheus metrics, distributed tracing with OpenTelemetry, alerting strategy, and the three pillars
An unobservable system is a system you can't debug, can't optimize, and can't confidently change. When an incident happens at 2am, the difference between a team that resolves it in 15 minutes and one that spends 4 hours guessing is observability.
Observability has three pillars: logs (what happened), metrics (how the system behaves over time), and traces (how requests flow through distributed services). This guide covers all three, plus alerting — the layer that turns raw signals into actionable notifications.
The Three Pillars
Pillar 1: Structured Logging
Structured logs are JSON objects, not unformatted strings. They're queryable, filterable, and parseable by log aggregation systems without fragile regex.
// ❌ BAD: Unstructured logging
console.log(`User ${userId} created order ${orderId} for $${amount}`);
// Output: "User abc-123 created order ord-456 for $99.99"
// → Impossible to query by amount range, hard to parse
// ✅ GOOD: Structured logging with pino
import pino from 'pino';
const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // In production, output JSON; in development, pretty-print
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
  base: {
    service: 'order-service',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});
// Log with structured context
logger.info({
  event: 'order.created',
  userId,
  orderId,
  amountCents: amount,
  itemCount: items.length,
}, 'Order created');
// Output: {"level":"info","time":1714579200000,"service":"order-service","event":"order.created","userId":"abc-123","orderId":"ord-456","amountCents":9999,"itemCount":3,"msg":"Order created"}
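The payoff comes at query time: once every log line is JSON, filtering by any field is a parse-and-filter, not a regex. A minimal sketch (the log lines below mirror the assumed shape of the pino output above):

```typescript
// Toy illustration: querying structured logs needs no regex.
type OrderLog = { event: string; orderId: string; amountCents: number };

const lines: string[] = [
  '{"event":"order.created","orderId":"ord-1","amountCents":9999}',
  '{"event":"order.created","orderId":"ord-2","amountCents":500}',
  '{"event":"order.refunded","orderId":"ord-1","amountCents":9999}',
];

// "All orders created for $50 or more" — impossible to express reliably
// against the unstructured string, trivial against parsed JSON.
const bigOrders = lines
  .map((l) => JSON.parse(l) as OrderLog)
  .filter((e) => e.event === 'order.created' && e.amountCents >= 5000)
  .map((e) => e.orderId);
// → ['ord-1']
```

This is exactly the query model that log aggregation systems (Loki, CloudWatch Logs Insights, Datadog) give you over your whole fleet.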
Request correlation: Every log line in a request should carry the same trace/request ID:
// Express middleware: attach request ID to all logs in request scope
import { AsyncLocalStorage } from 'async_hooks';
import { randomUUID } from 'crypto';
const requestContext = new AsyncLocalStorage<{ requestId: string; userId?: string }>();

app.use((req, res, next) => {
  const requestId = (req.headers['x-request-id'] as string) ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  requestContext.run({ requestId, userId: req.user?.sub }, next);
});

// Child logger that automatically includes request context
function getLogger() {
  const ctx = requestContext.getStore();
  return logger.child(ctx ?? {});
}

// Usage in any service — automatically includes requestId and userId
async function processOrder(orderId: string) {
  const log = getLogger();
  log.info({ orderId }, 'Processing order');
  // → {"requestId":"xyz-789","userId":"abc-123","orderId":"ord-456","msg":"Processing order"}
}
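Why `AsyncLocalStorage` and not a global variable? The store set by `run()` survives across `await` boundaries, and concurrent requests don't leak into each other. A framework-free sketch (names here are illustrative):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

const ctx = new AsyncLocalStorage<{ requestId: string }>();

async function deeplyNested(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 1)); // simulated async I/O
  // The context set by run() is still visible here, after the await
  return ctx.getStore()?.requestId ?? 'missing';
}

function handleRequest(requestId: string): Promise<string> {
  return ctx.run({ requestId }, () => deeplyNested());
}

async function demo(): Promise<string[]> {
  // Two interleaved "requests" — each still sees its own requestId
  return Promise.all([handleRequest('req-a'), handleRequest('req-b')]);
}
```

A module-level variable would be overwritten by whichever request ran last; `AsyncLocalStorage` keeps one store per async execution context.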
Log Levels — Use Them Correctly
// ERROR: Something failed that requires human attention
logger.error({ err, orderId }, 'Payment processing failed');
// WARN: Something unexpected but handled — monitor but don't page
logger.warn({ retryCount, orderId }, 'Retry attempt for payment');
// INFO: Normal business events — creates audit trail
logger.info({ orderId, userId }, 'Order confirmed');
// DEBUG: Detailed debugging — disabled in production
logger.debug({ payload }, 'Webhook payload received');
// NEVER: Don't log sensitive data
// ❌ logger.info({ creditCard, cvv }, 'Payment details');
// ❌ logger.info({ password }, 'Login attempt');
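In practice you enforce the "never log secrets" rule mechanically rather than by convention — pino has a built-in `redact` option (e.g. `redact: ['creditCard', '*.password']`) for exactly this. A hand-rolled sketch of the idea, so the mechanism is visible (key list and placeholder are assumptions, not a standard):

```typescript
// Key-based redaction applied before a payload reaches the logger.
const SENSITIVE_KEYS = new Set(['password', 'creditCard', 'cvv', 'ssn']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE_KEYS.has(k) ? [k, '[REDACTED]'] : [k, redact(v)],
      ),
    );
  }
  return value;
}

const safe = redact({
  userId: 'abc-123',
  payment: { creditCard: '4242424242424242', amountCents: 9999 },
});
// → { userId: 'abc-123', payment: { creditCard: '[REDACTED]', amountCents: 9999 } }
```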
Pillar 2: Metrics
Metrics are time-series measurements of system behavior. They answer questions like "what's the request rate?" and "is the error rate increasing?"
The Four Golden Signals
Focus on these four before adding anything else:
- Latency — How long requests take (p50, p95, p99)
- Traffic — How many requests/sec
- Errors — Error rate (%, not absolute count)
- Saturation — How full the system is (CPU %, queue depth, connection pool)
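Expressed as PromQL — using the metric names from the instrumentation shown below — the four signals might look like this (window sizes are typical defaults, not mandates):

```promql
# Latency: p95 request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: current fulfillment queue depth
job_queue_depth{queue_name="fulfillment"}
```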
Prometheus + Node.js
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// Business metrics
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['plan', 'payment_method'],
  registers: [registry],
});

const orderValueCents = new Histogram({
  name: 'order_value_cents',
  help: 'Order value distribution in cents',
  buckets: [500, 1000, 2500, 5000, 10000, 25000, 50000, 100000],
  registers: [registry],
});

// Queue depth (gauge — can go up and down)
const jobQueueDepth = new Gauge({
  name: 'job_queue_depth',
  help: 'Current number of jobs in queue',
  labelNames: ['queue_name'],
  registers: [registry],
});

// Middleware to instrument all HTTP requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    };
    end(labels);
    httpRequestTotal.inc(labels);
  });
  next();
});

// Metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:3000']
    metrics_path: '/metrics'
Pillar 3: Distributed Tracing (OpenTelemetry)
Traces show how a single request flows through multiple services — essential for debugging latency and errors in distributed systems.
// OpenTelemetry setup (run before any other imports)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(), // Auto-traces all PostgreSQL queries
  ],
});

sdk.start();
// Custom spans for business-critical operations
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.amount_cents': amount,
    });
    try {
      const result = await stripeClient.chargeCard(amount);
      span.setAttributes({ 'payment.intent_id': result.id });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
A trace for a single order creation might show:
[API Handler: POST /orders] → 245ms
  [Auth middleware: verify JWT] → 5ms
  [DB: SELECT user] → 8ms
  [Payment service: charge card] → 180ms
    [HTTP: POST api.stripe.com] → 175ms
  [DB: INSERT order] → 12ms
  [Queue: enqueue fulfillment] → 3ms
This immediately shows that 180ms of the 245ms total is payment processing — not a database issue.
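Cross-service stitching like this works because the HTTP instrumentation propagates trace identity between services in the W3C `traceparent` header: version `00`, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags. A small illustrative parser (this is a sketch of the wire format, not the OpenTelemetry API):

```typescript
// Parse a W3C Trace Context `traceparent` header:
//   00-<32 hex trace-id>-<16 hex parent-span-id>-<2 hex flags>
function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 0x01) === 1 };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
// → traceId '4bf92f3577b34da6a3ce929d0e0e4736', sampled: true
```

Because every service reuses the incoming trace ID and creates a new child span ID, the backend can reassemble the tree shown above from spans emitted by independent processes.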
Alerting Strategy
Alert on Symptoms, Not Causes
Bad alerts (cause-based):
- CPU > 80% (might be fine — CPU spikes are normal during processing)
- Memory > 70% (might be fine — applications cache aggressively)
- Disk > 80% (might be fine — depends on growth rate)
Good alerts (symptom-based):
- Error rate > 1% for 5 minutes (users are experiencing errors)
- p99 latency > 2 seconds for 5 minutes (users are seeing slow responses)
- No successful orders in last 10 minutes (business is stopped)
- Queue depth > 10,000 for 15 minutes (processing is falling behind)
Prometheus Alerting Rules
# alerts.yml
groups:
  - name: api-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency {{ $value | humanizeDuration }}"

      - alert: JobQueueBacklog
        expr: job_queue_depth{queue_name="fulfillment"} > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Fulfillment queue depth: {{ $value }}"
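The "no successful orders" symptom from the list above can be expressed against the `orders_created_total` counter defined in the metrics section. A sketch of such a rule (the `absent()` guard and the 10-minute window are illustrative choices, not requirements):

```yaml
- alert: NoOrdersCreated
  # Fires if no order was created in 10 minutes, or if the metric
  # itself disappeared (e.g. the service stopped reporting).
  expr: sum(increase(orders_created_total[10m])) == 0 or absent(orders_created_total)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "No orders created in the last 10 minutes"
```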
Alert Routing
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack'   # fallback receiver — must reference a receiver defined below
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: $PAGERDUTY_KEY
        description: '{{ .CommonAnnotations.summary }}'
  - name: slack
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
Observability Stack Options (2026)
| Stack | Cost | Best For |
|---|---|---|
| Open source self-hosted: Prometheus + Grafana + Loki + Tempo | ~$50–$200/month infra | Cost-sensitive, full control |
| AWS native: CloudWatch + X-Ray | $100–$500/month | AWS-native, no extra infra |
| Datadog | $200–$2,000+/month | Enterprise, best UX, all-in-one |
| Grafana Cloud | Free–$200/month | Open source stack, managed |
| Honeycomb | $100–$500/month | Best for distributed tracing |
| New Relic | $100–$1,000+/month | All-in-one, established |
Recommendation:
- Startups: Grafana Cloud (free tier is generous)
- Mid-market: Datadog (if budget allows) or self-hosted Prometheus + Grafana
- Enterprise: Datadog or Honeycomb
Implementation Cost
| Scope | Investment |
|---|---|
| Structured logging + log shipping | $3,000–$8,000 |
| Prometheus metrics + Grafana | $5,000–$15,000 |
| OpenTelemetry distributed tracing | $8,000–$20,000 |
| Full observability platform | $20,000–$50,000 |
| Alerting strategy + runbooks | $5,000–$15,000 |
Working With Viprasol
We implement observability stacks for engineering teams — structured logging, Prometheus instrumentation, distributed tracing, and alerting strategy that actually wakes the right person at the right time.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.