Observability and Monitoring: Logs, Metrics, Traces, and Alerting That Works
Observability and monitoring in 2026 — structured logging, Prometheus metrics, distributed tracing with OpenTelemetry, alerting strategy, and the three pillars
An unobservable system is a system you can't debug, can't optimize, and can't confidently change. When an incident happens at 2am, the difference between a team that resolves it in 15 minutes and one that spends 4 hours guessing is observability.
Observability has three pillars: logs (what happened), metrics (how the system behaves over time), and traces (how requests flow through distributed services). This guide covers all three, plus alerting — the layer that turns raw signals into actionable notifications.
The Three Pillars
Pillar 1: Structured Logging
Structured logs are JSON objects, not unformatted strings. They're queryable, filterable, and parseable by log aggregation systems without fragile regex.
// ❌ BAD: Unstructured logging
console.log(`User ${userId} created order ${orderId} for $${amount}`);
// Output: "User abc-123 created order ord-456 for $99.99"
// → Impossible to query by amount range, hard to parse
// ✅ GOOD: Structured logging with pino
import pino from 'pino';
const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // In production, output JSON; in development, pretty-print
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
  base: {
    service: 'order-service',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});
// Log with structured context
logger.info({
  event: 'order.created',
  userId,
  orderId,
  amountCents: amount,
  itemCount: items.length,
}, 'Order created');
// Output: {"level":"info","time":1714579200000,"service":"order-service","event":"order.created","userId":"abc-123","orderId":"ord-456","amountCents":9999,"itemCount":3,"msg":"Order created"}
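The payoff comes at query time: once every log line is JSON, filtering by any field is a parse-and-filter, not a regex. A minimal sketch (the log lines below mirror the assumed shape of the pino output above):

```typescript
// Toy illustration: querying structured logs needs no regex.
type OrderLog = { event: string; orderId: string; amountCents: number };

const lines: string[] = [
  '{"event":"order.created","orderId":"ord-1","amountCents":9999}',
  '{"event":"order.created","orderId":"ord-2","amountCents":500}',
  '{"event":"order.refunded","orderId":"ord-1","amountCents":9999}',
];

// "All orders created for $50 or more" — impossible to express reliably
// against the unstructured string, trivial against parsed JSON.
const bigOrders = lines
  .map((l) => JSON.parse(l) as OrderLog)
  .filter((e) => e.event === 'order.created' && e.amountCents >= 5000)
  .map((e) => e.orderId);
// → ['ord-1']
```

This is exactly the query model that log aggregation systems (Loki, CloudWatch Logs Insights, Datadog) give you over your whole fleet.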
Request correlation: Every log line in a request should carry the same trace/request ID:
// Express middleware: attach request ID to all logs in request scope
import { AsyncLocalStorage } from 'async_hooks';
import { randomUUID } from 'crypto';
const requestContext = new AsyncLocalStorage<{ requestId: string; userId?: string }>();

app.use((req, res, next) => {
  const requestId = (req.headers['x-request-id'] as string) ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  requestContext.run({ requestId, userId: req.user?.sub }, next);
});

// Child logger that automatically includes request context
function getLogger() {
  const ctx = requestContext.getStore();
  return logger.child(ctx ?? {});
}

// Usage in any service — automatically includes requestId and userId
async function processOrder(orderId: string) {
  const log = getLogger();
  log.info({ orderId }, 'Processing order');
  // → {"requestId":"xyz-789","userId":"abc-123","orderId":"ord-456","msg":"Processing order"}
}
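Why `AsyncLocalStorage` and not a global variable? The store set by `run()` survives across `await` boundaries, and concurrent requests don't leak into each other. A framework-free sketch (names here are illustrative):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

const ctx = new AsyncLocalStorage<{ requestId: string }>();

async function deeplyNested(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 1)); // simulated async I/O
  // The context set by run() is still visible here, after the await
  return ctx.getStore()?.requestId ?? 'missing';
}

function handleRequest(requestId: string): Promise<string> {
  return ctx.run({ requestId }, () => deeplyNested());
}

async function demo(): Promise<string[]> {
  // Two interleaved "requests" — each still sees its own requestId
  return Promise.all([handleRequest('req-a'), handleRequest('req-b')]);
}
```

A module-level variable would be overwritten by whichever request ran last; `AsyncLocalStorage` keeps one store per async execution context.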
Log Levels — Use Them Correctly
// ERROR: Something failed that requires human attention
logger.error({ err, orderId }, 'Payment processing failed');
// WARN: Something unexpected but handled — monitor but don't page
logger.warn({ retryCount, orderId }, 'Retry attempt for payment');
// INFO: Normal business events — creates audit trail
logger.info({ orderId, userId }, 'Order confirmed');
// DEBUG: Detailed debugging — disabled in production
logger.debug({ payload }, 'Webhook payload received');
// NEVER: Don't log sensitive data
// ❌ logger.info({ creditCard, cvv }, 'Payment details');
// ❌ logger.info({ password }, 'Login attempt');
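In practice you enforce the "never log secrets" rule mechanically rather than by convention — pino has a built-in `redact` option (e.g. `redact: ['creditCard', '*.password']`) for exactly this. A hand-rolled sketch of the idea, so the mechanism is visible (key list and placeholder are assumptions, not a standard):

```typescript
// Key-based redaction applied before a payload reaches the logger.
const SENSITIVE_KEYS = new Set(['password', 'creditCard', 'cvv', 'ssn']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE_KEYS.has(k) ? [k, '[REDACTED]'] : [k, redact(v)],
      ),
    );
  }
  return value;
}

const safe = redact({
  userId: 'abc-123',
  payment: { creditCard: '4242424242424242', amountCents: 9999 },
});
// → { userId: 'abc-123', payment: { creditCard: '[REDACTED]', amountCents: 9999 } }
```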
Pillar 2: Metrics
Metrics are time-series measurements of system behavior. They answer questions like "what's the request rate?" and "is the error rate increasing?"
The Four Golden Signals
Focus on these four before adding anything else:
- Latency — How long requests take (p50, p95, p99)
- Traffic — How many requests/sec
- Errors — Error rate (%, not absolute count)
- Saturation — How full the system is (CPU %, queue depth, connection pool)
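Expressed as PromQL — using the metric names from the instrumentation shown below — the four signals might look like this (window sizes are typical defaults, not mandates):

```promql
# Latency: p95 request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: current fulfillment queue depth
job_queue_depth{queue_name="fulfillment"}
```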
Prometheus + Node.js
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// Business metrics
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['plan', 'payment_method'],
  registers: [registry],
});

const orderValueCents = new Histogram({
  name: 'order_value_cents',
  help: 'Order value distribution in cents',
  buckets: [500, 1000, 2500, 5000, 10000, 25000, 50000, 100000],
  registers: [registry],
});

// Queue depth (gauge — can go up and down)
const jobQueueDepth = new Gauge({
  name: 'job_queue_depth',
  help: 'Current number of jobs in queue',
  labelNames: ['queue_name'],
  registers: [registry],
});

// Middleware to instrument all HTTP requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    };
    end(labels);
    httpRequestTotal.inc(labels);
  });
  next();
});

// Metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:3000']
    metrics_path: '/metrics'
Pillar 3: Distributed Tracing (OpenTelemetry)
Traces show how a single request flows through multiple services — essential for debugging latency and errors in distributed systems.
// OpenTelemetry setup (run before any other imports)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(), // Auto-traces all PostgreSQL queries
  ],
});

sdk.start();
// Custom spans for business-critical operations
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.amount_cents': amount,
    });
    try {
      const result = await stripeClient.chargeCard(amount);
      span.setAttributes({ 'payment.intent_id': result.id });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
A trace for a single order creation might show:
[API Handler: POST /orders] → 245ms
  [Auth middleware: verify JWT] → 5ms
  [DB: SELECT user] → 8ms
  [Payment service: charge card] → 180ms
    [HTTP: POST api.stripe.com] → 175ms
  [DB: INSERT order] → 12ms
  [Queue: enqueue fulfillment] → 3ms
This immediately shows that 180ms of the 245ms total is payment processing — not a database issue.
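Cross-service stitching like this works because the HTTP instrumentation propagates trace identity between services in the W3C `traceparent` header: version `00`, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags. A small illustrative parser (this is a sketch of the wire format, not the OpenTelemetry API):

```typescript
// Parse a W3C Trace Context `traceparent` header:
//   00-<32 hex trace-id>-<16 hex parent-span-id>-<2 hex flags>
function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 0x01) === 1 };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
// → traceId '4bf92f3577b34da6a3ce929d0e0e4736', sampled: true
```

Because every service reuses the incoming trace ID and creates a new child span ID, the backend can reassemble the tree shown above from spans emitted by independent processes.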
Alerting Strategy
Alert on Symptoms, Not Causes
Bad alerts (cause-based):
- CPU > 80% (might be fine — CPU spikes are normal during processing)
- Memory > 70% (might be fine — applications cache aggressively)
- Disk > 80% (might be fine — depends on growth rate)
Good alerts (symptom-based):
- Error rate > 1% for 5 minutes (users are experiencing errors)
- p99 latency > 2 seconds for 5 minutes (users are seeing slow responses)
- No successful orders in last 10 minutes (business is stopped)
- Queue depth > 10,000 for 15 minutes (processing is falling behind)
Prometheus Alerting Rules
# alerts.yml
groups:
  - name: api-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency {{ $value | humanizeDuration }}"

      - alert: JobQueueBacklog
        expr: job_queue_depth{queue_name="fulfillment"} > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Fulfillment queue depth: {{ $value }}"
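The "no successful orders" symptom from the list above can be expressed against the `orders_created_total` counter defined in the metrics section. A sketch of such a rule (the `absent()` guard and the 10-minute window are illustrative choices, not requirements):

```yaml
- alert: NoOrdersCreated
  # Fires if no order was created in 10 minutes, or if the metric
  # itself disappeared (e.g. the service stopped reporting).
  expr: sum(increase(orders_created_total[10m])) == 0 or absent(orders_created_total)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "No orders created in the last 10 minutes"
```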
Alert Routing
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack'   # fallback receiver — must reference a receiver defined below
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: $PAGERDUTY_KEY
        description: '{{ .CommonAnnotations.summary }}'
  - name: slack
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
Observability Stack Options (2026)
| Stack | Cost | Best For |
|---|---|---|
| Open source self-hosted: Prometheus + Grafana + Loki + Tempo | ~$50–$200/month infra | Cost-sensitive, full control |
| AWS native: CloudWatch + X-Ray | $100–$500/month | AWS-native, no extra infra |
| Datadog | $200–$2,000+/month | Enterprise, best UX, all-in-one |
| Grafana Cloud | Free–$200/month | Open source stack, managed |
| Honeycomb | $100–$500/month | Best for distributed tracing |
| New Relic | $100–$1,000+/month | All-in-one, established |
Recommendation:
- Startups: Grafana Cloud (free tier is generous)
- Mid-market: Datadog (if budget allows) or self-hosted Prometheus + Grafana
- Enterprise: Datadog or Honeycomb
Implementation Cost
| Scope | Investment |
|---|---|
| Structured logging + log shipping | $3,000–$8,000 |
| Prometheus metrics + Grafana | $5,000–$15,000 |
| OpenTelemetry distributed tracing | $8,000–$20,000 |
| Full observability platform | $20,000–$50,000 |
| Alerting strategy + runbooks | $5,000–$15,000 |
Working With Viprasol
We implement observability stacks for engineering teams — structured logging, Prometheus instrumentation, distributed tracing, and alerting strategy that actually wakes the right person at the right time.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.