
Observability and Monitoring: Logs, Metrics, Traces, and Alerting That Works

Observability and monitoring in 2026 — structured logging, Prometheus metrics, distributed tracing with OpenTelemetry, alerting strategy, and the three pillars

Viprasol Tech Team
April 19, 2026
13 min read


An unobservable system is a system you can't debug, can't optimize, and can't confidently change. When an incident happens at 2am, the difference between a team that resolves it in 15 minutes and one that spends 4 hours guessing is observability.

Observability has three pillars: logs (what happened), metrics (how the system behaves over time), and traces (how requests flow through distributed services). This guide covers all three, plus alerting — the layer that turns raw signals into actionable notifications.


The Three Pillars

Pillar 1: Structured Logging

Structured logs are JSON objects, not unformatted strings. They're queryable, filterable, and parseable by log aggregation systems without fragile regex.

// ❌ BAD: Unstructured logging
console.log(`User ${userId} created order ${orderId} for $${amount}`);
// Output: "User abc-123 created order ord-456 for $99.99"
// → Impossible to query by amount range, hard to parse

// ✅ GOOD: Structured logging with pino
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // In production, output JSON; in development, pretty-print
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
  base: {
    service: 'order-service',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});

// Log with structured context
logger.info({
  event: 'order.created',
  userId,
  orderId,
  amountCents: amount,
  itemCount: items.length,
}, 'Order created');
// Output: {"level":30,"time":1714579200000,"service":"order-service","event":"order.created","userId":"abc-123","orderId":"ord-456","amountCents":9999,"itemCount":3,"msg":"Order created"}
// (pino emits numeric levels by default: 30 = info)

Request correlation: Every log line in a request should carry the same trace/request ID:

// Express middleware: attach request ID to all logs in request scope
import { AsyncLocalStorage } from 'async_hooks';
import { randomUUID } from 'crypto';

const requestContext = new AsyncLocalStorage<{ requestId: string; userId?: string }>();

app.use((req, res, next) => {
  // Header values are typed string | string[] | undefined — narrow before the fallback
  const requestId = (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
  res.setHeader('x-request-id', requestId);

  requestContext.run({ requestId, userId: req.user?.sub }, next);
});

// Child logger that automatically includes request context
function getLogger() {
  const ctx = requestContext.getStore();
  return logger.child(ctx ?? {});
}

// Usage in any service — automatically includes requestId and userId
async function processOrder(orderId: string) {
  const log = getLogger();
  log.info({ orderId }, 'Processing order');
  // → {"requestId":"xyz-789","userId":"abc-123","orderId":"ord-456","msg":"Processing order"}
}

Log Levels — Use Them Correctly

// ERROR: Something failed that requires human attention
logger.error({ err, orderId }, 'Payment processing failed');

// WARN: Something unexpected but handled — monitor but don't page
logger.warn({ retryCount, orderId }, 'Retry attempt for payment');

// INFO: Normal business events — creates audit trail
logger.info({ orderId, userId }, 'Order confirmed');

// DEBUG: Detailed debugging — disabled in production
logger.debug({ payload }, 'Webhook payload received');

// NEVER: Don't log sensitive data
// ❌ logger.info({ creditCard, cvv }, 'Payment details');
// ❌ logger.info({ password }, 'Login attempt');
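
Redaction is better enforced mechanically than by convention. pino ships a redact option for exactly this; as a library-agnostic sketch, here is a helper that censors known-sensitive keys before a context object is logged (the key list is illustrative — extend it for your domain):

```typescript
// Recursively censor sensitive fields in a log context before serialization.
const SENSITIVE_KEYS = new Set(['password', 'creditCard', 'cvv', 'ssn', 'authorization']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

With pino itself the equivalent is the built-in option: `pino({ redact: { paths: ['password', '*.cvv'], censor: '[REDACTED]' } })`.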

Pillar 2: Metrics

Metrics are time-series measurements of system behavior. They answer questions like "what's the request rate?" and "is the error rate increasing?"

The Four Golden Signals

Focus on these four before adding anything else:

  1. Latency — How long requests take (p50, p95, p99)
  2. Traffic — How many requests/sec
  3. Errors — Error rate (%, not absolute count)
  4. Saturation — How full the system is (CPU %, queue depth, connection pool)
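
Expressed as PromQL against the metrics defined in the next section (names follow this guide's instrumentation — adjust to your own), the four signals map roughly to:

```promql
# Latency: p95 request duration over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: current queue depth (gauge, read directly)
job_queue_depth{queue_name="fulfillment"}
```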

Prometheus + Node.js

import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// Business metrics
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['plan', 'payment_method'],
  registers: [registry],
});

const orderValueCents = new Histogram({
  name: 'order_value_cents',
  help: 'Order value distribution in cents',
  buckets: [500, 1000, 2500, 5000, 10000, 25000, 50000, 100000],
  registers: [registry],
});

// Queue depth (gauge — can go up and down)
const jobQueueDepth = new Gauge({
  name: 'job_queue_depth',
  help: 'Current number of jobs in queue',
  labelNames: ['queue_name'],
  registers: [registry],
});

// Middleware to instrument all HTTP requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path, // route pattern, not raw URL — keeps label cardinality bounded
      status_code: String(res.statusCode),
    };
    end(labels);
    httpRequestTotal.inc(labels);
  });
  
  next();
});

// Metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
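
Prometheus computes quantiles from these cumulative bucket counters server-side with histogram_quantile: it finds the bucket where the target rank falls and interpolates linearly within it. A self-contained sketch of that estimation (simplified — buckets must be sorted by upper bound, and Prometheus handles additional edge cases):

```typescript
interface Bucket {
  le: number;    // upper bound (the "le" label); Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

// Estimate the q-quantile (0..1) from cumulative histogram buckets,
// mirroring histogram_quantile's linear interpolation.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count; // +Inf bucket counts everything
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      if (b.le === Infinity) return prevLe; // fall back to the highest finite bound
      // Interpolate linearly inside the bucket that crosses the target rank
      return prevLe + ((b.le - prevLe) * (rank - prevCount)) / (b.count - prevCount);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}
```

This is why bucket boundaries matter: a p99 estimate is only as precise as the buckets around it, so place extra buckets near your latency SLO.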

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:3000']
    metrics_path: '/metrics'


Pillar 3: Distributed Tracing (OpenTelemetry)

Traces show how a single request flows through multiple services — essential for debugging latency and errors in distributed systems.

// OpenTelemetry setup (run before any other imports)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),  // Auto-traces all PostgreSQL queries
  ],
});

sdk.start();

// Custom spans for business-critical operations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.amount_cents': amount,
    });
    
    try {
      const result = await stripeClient.chargeCard(amount);
      span.setAttributes({ 'payment.intent_id': result.id });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

A trace for a single order creation might show:

[API Handler: POST /orders] → 245ms
  [Auth middleware: verify JWT] → 5ms
  [DB: SELECT user] → 8ms
  [Payment service: charge card] → 180ms
    [HTTP: POST api.stripe.com] → 175ms
  [DB: INSERT order] → 12ms
  [Queue: enqueue fulfillment] → 3ms

This immediately shows that 180ms of the 245ms total is payment processing — not a database issue.
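
Trace context travels between services in the W3C traceparent HTTP header, which the HTTP instrumentation above injects and extracts automatically. For illustration only (the SDK handles this for you), a sketch of parsing its format:

```typescript
// W3C traceparent: version "00" - 16-byte trace id - 8-byte parent span id - flags byte
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface TraceParent {
  version: string;
  traceId: string;   // shared by every span in the trace
  parentId: string;  // span id of the caller
  sampled: boolean;  // bit 0 of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Every service in the call chain logs and exports spans under the same traceId, which is what lets the backend stitch the waterfall above together.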


Alerting Strategy

Alert on Symptoms, Not Causes

Bad alerts (cause-based):

  • CPU > 80% (might be fine — CPU spikes are normal during processing)
  • Memory > 70% (might be fine — applications cache aggressively)
  • Disk > 80% (might be fine — depends on growth rate)

Good alerts (symptom-based):

  • Error rate > 1% for 5 minutes (users are experiencing errors)
  • p99 latency > 2 seconds for 5 minutes (users are seeing slow responses)
  • No successful orders in last 10 minutes (business is stopped)
  • Queue depth > 10,000 for 15 minutes (processing is falling behind)

Prometheus Alerting Rules

# alerts.yml
groups:
  - name: api-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency {{ $value | humanizeDuration }}"

      - alert: JobQueueBacklog
        expr: job_queue_depth{queue_name="fulfillment"} > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Fulfillment queue depth: {{ $value }}"
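
The HighErrorRate rule above pages on a fixed 1% threshold. A common refinement is multi-window burn-rate alerting (popularized by Google's SRE Workbook): page only when the error budget is being consumed fast over both a long and a short window, which suppresses flappy pages while still catching fast burns. A sketch for a 99.9% availability SLO — the 14.4 factor is the burn rate that exhausts a 30-day budget in about two days:

```yaml
      - alert: FastErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning ~14x faster than sustainable"
```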

Alert Routing

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: $PAGERDUTY_KEY  # substituted at deploy time — Alertmanager does not expand env vars itself
        description: '{{ .CommonAnnotations.summary }}'
  - name: slack
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'


Observability Stack Options (2026)

Stack | Cost | Best For
Open source self-hosted (Prometheus + Grafana + Loki + Tempo) | ~$50–$200/month infra | Cost-sensitive, full control
AWS native (CloudWatch + X-Ray) | $100–$500/month | AWS-native, no extra infra
Datadog | $200–$2,000+/month | Enterprise, best UX, all-in-one
Grafana Cloud | Free–$200/month | Open source stack, managed
Honeycomb | $100–$500/month | Best for distributed tracing
New Relic | $100–$1,000+/month | All-in-one, established

Recommendation:

  • Startups: Grafana Cloud (free tier is generous)
  • Mid-market: Datadog (if budget allows) or self-hosted Prometheus + Grafana
  • Enterprise: Datadog or Honeycomb

Implementation Cost

Scope | Investment
Structured logging + log shipping | $3,000–$8,000
Prometheus metrics + Grafana | $5,000–$15,000
OpenTelemetry distributed tracing | $8,000–$20,000
Full observability platform | $20,000–$50,000
Alerting strategy + runbooks | $5,000–$15,000

Working With Viprasol

We implement observability stacks for engineering teams — structured logging, Prometheus instrumentation, distributed tracing, and alerting strategy that actually wakes the right person at the right time.

Observability setup →
DevOps as a Service →
Cloud Solutions →


About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading

