AWS CloudWatch Observability in 2026: Custom Metrics, Log Insights, and Anomaly Detection
Build production AWS CloudWatch observability: custom metrics with EMF, Log Insights queries, composite alarms, anomaly detection, dashboards, and Terraform automation.
CloudWatch is the default observability platform for AWS workloads, and in 2026 it's capable enough that most teams don't need to bolt on a third-party APM tool. But most teams use it badly—they rely only on default EC2/RDS metrics, write ad-hoc Log Insights queries, and set simplistic threshold alarms that fire constantly or never.
This post covers the production CloudWatch setup we implement at Viprasol: custom business metrics via Embedded Metric Format (EMF), structured Log Insights queries, composite alarms that reduce noise, anomaly detection for traffic patterns, and the Terraform that manages it all as code.
The Observability Stack
Application
│
├── Structured JSON logs → CloudWatch Logs
├── EMF custom metrics → CloudWatch Metrics
└── X-Ray traces → CloudWatch ServiceMap
│
CloudWatch
├── Log Groups → Log Insights (queries)
├── Metric Namespaces → Dashboards, Alarms
├── Anomaly Detection → Dynamic thresholds
└── Composite Alarms → Reduce noise
│
SNS → PagerDuty / Slack / Email
Structured Logging (the Foundation)
Every log line should be machine-parseable JSON. Against unstructured text, Log Insights forces you into fragile `parse` expressions; against JSON, every field is directly queryable.
// lib/logger.ts
import { createLogger, format, transports } from "winston";
const isProduction = process.env.NODE_ENV === "production";
export const logger = createLogger({
level: process.env.LOG_LEVEL ?? "info",
format: format.combine(
    format.timestamp(), // defaults to ISO 8601
format.errors({ stack: true }),
isProduction
? format.json()
: format.combine(format.colorize(), format.simple())
),
defaultMeta: {
service: process.env.SERVICE_NAME ?? "api",
version: process.env.APP_VERSION ?? "unknown",
environment: process.env.NODE_ENV,
},
transports: [new transports.Console()],
});
// Typed log helper for request events
export function logRequest(params: {
method: string;
path: string;
statusCode: number;
durationMs: number;
userId?: string;
teamId?: string;
errorCode?: string;
}) {
const level = params.statusCode >= 500 ? "error" :
params.statusCode >= 400 ? "warn" : "info";
logger[level]("http_request", {
...params,
type: "http_request",
});
}
// Structured business event
export function logEvent(event: string, data: Record<string, unknown>) {
  logger.info(event, {
    ...data,
    type: "business_event",
    // winston stores `event` in `message`; duplicate it into a named
    // field so Log Insights can filter on `eventType` directly
    eventType: event,
    timestamp: new Date().toISOString(),
  });
}
In production, CloudWatch Logs receives JSON objects. Log Insights can query any field.
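For reference, a single emitted line looks roughly like this — a hand-built sketch, not literal winston output (exact field order and extra metadata vary, and the ids/values here are hypothetical):

```typescript
// Sketch of the JSON shape one http_request log line takes in CloudWatch
// Logs. Field names mirror the logger config above.
const sampleLogLine = {
  level: "warn",
  message: "http_request",
  timestamp: "2026-01-15T09:30:00.000Z",
  service: "api",
  version: "1.4.2", // hypothetical APP_VERSION
  environment: "production",
  type: "http_request",
  method: "GET",
  path: "/api/teams/:teamId/members",
  statusCode: 404,
  durationMs: 12,
  userId: "usr_123", // hypothetical id
  errorCode: "TEAM_NOT_FOUND",
};

// Every top-level key is addressable in Log Insights, e.g.
// `filter statusCode >= 400` or `stats avg(durationMs) by path`.
const serialized = JSON.stringify(sampleLogLine);
```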
Custom Metrics with Embedded Metric Format (EMF)
EMF lets you emit custom metrics through your log stream—no PutMetricData API calls, no SDK dependency for metrics. CloudWatch parses the structured log and extracts the metrics automatically.
// lib/metrics/emf.ts
import { createMetricsLogger, Unit } from "aws-embedded-metrics";
// Business metric: track API latency by route
export async function recordApiLatency(
route: string,
method: string,
statusCode: number,
durationMs: number
) {
  const metrics = createMetricsLogger();
metrics.putDimensions({
Route: route,
Method: method,
StatusCode: String(statusCode),
Environment: process.env.NODE_ENV ?? "production",
});
metrics.putMetric("Latency", durationMs, Unit.Milliseconds);
metrics.putMetric("RequestCount", 1, Unit.Count);
if (statusCode >= 500) {
metrics.putMetric("ServerErrors", 1, Unit.Count);
} else if (statusCode >= 400) {
metrics.putMetric("ClientErrors", 1, Unit.Count);
}
metrics.setNamespace("MyApp/API");
await metrics.flush();
}
// Business metric: SaaS-specific events
export async function recordBusinessEvent(
eventType: "signup" | "subscription_created" | "payment_succeeded" | "payment_failed" | "churn",
data: {
plan?: string;
amountCents?: number;
country?: string;
} = {}
) {
  const metrics = createMetricsLogger();
metrics.putDimensions({
EventType: eventType,
Plan: data.plan ?? "unknown",
Environment: process.env.NODE_ENV ?? "production",
});
metrics.putMetric("EventCount", 1, Unit.Count);
if (data.amountCents !== undefined) {
metrics.putMetric("RevenueUSD", data.amountCents / 100, Unit.None);
}
metrics.setNamespace("MyApp/Business");
await metrics.flush();
}
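Under the hood, aws-embedded-metrics just serializes a JSON envelope defined by the EMF spec. A hand-rolled sketch, useful when you can't take the dependency (namespace and field names follow the examples above):

```typescript
// Minimal hand-rolled EMF envelope. CloudWatch extracts `Latency` as a
// metric in namespace MyApp/API with dimensions Route/Method; the same
// top-level keys remain queryable as ordinary log fields.
function emfEnvelope(route: string, method: string, latencyMs: number): string {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: "MyApp/API",
          Dimensions: [["Route", "Method"]],
          Metrics: [{ Name: "Latency", Unit: "Milliseconds" }],
        },
      ],
    },
    Route: route,
    Method: method,
    Latency: latencyMs,
  });
}

// In Lambda, writing this line to stdout is enough; on ECS/EC2 the
// CloudWatch agent must be listening for EMF:
// console.log(emfEnvelope("/api/users/:id", "GET", 42));
```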
Fastify middleware integration:
// middleware/metrics.ts (Fastify)
import { FastifyPluginAsync } from "fastify";
import { recordApiLatency } from "@/lib/metrics/emf";
const metricsPlugin: FastifyPluginAsync = async (fastify) => {
fastify.addHook("onResponse", async (request, reply) => {
    const durationMs = reply.elapsedTime; // Fastify >= 4.22; older versions use reply.getResponseTime()
const route = request.routeOptions?.url ?? request.url;
await recordApiLatency(
route,
request.method,
reply.statusCode,
durationMs
);
});
};
export default metricsPlugin;
Log Insights Query Library
Save these as named queries in CloudWatch (or via Terraform):
Error rate by route (last 1 hour)
fields @timestamp, method, path, statusCode, durationMs, errorCode
| filter type = "http_request" and statusCode >= 400
| stats
count() as errorCount,
count_distinct(userId) as affectedUsers,
avg(durationMs) as avgLatencyMs
by bin(5m), path, statusCode
| sort errorCount desc
| limit 50
Slow queries (p95 latency)
fields @timestamp, path, method, durationMs, userId
| filter type = "http_request"
| stats
count() as requests,
avg(durationMs) as avgMs,
pct(durationMs, 50) as p50Ms,
pct(durationMs, 95) as p95Ms,
pct(durationMs, 99) as p99Ms,
max(durationMs) as maxMs
by path
| sort p95Ms desc
| limit 20
Business events in the last 24 hours
fields @timestamp, eventType, plan, amountUSD
| filter type = "business_event"
| stats
count() as events,
sum(amountUSD) as totalRevenue
by bin(1h), eventType, plan
| sort @timestamp desc
Error funnel: find users hitting repeated errors
fields @timestamp, userId, path, statusCode, errorCode
| filter type = "http_request" and statusCode >= 500
| stats count() as errorCount by userId, errorCode
| filter errorCount >= 3
| sort errorCount desc
| limit 25
Lambda cold start frequency
fields @timestamp, @initDuration, @duration, @memorySize
| filter @initDuration > 0
| stats
count() as coldStarts,
avg(@initDuration) as avgInitMs,
max(@initDuration) as maxInitMs,
avg(@duration) as avgDurationMs
by bin(5m)
| sort @timestamp desc
Terraform: Alarms, Dashboards, and Anomaly Detection
# modules/cloudwatch-observability/main.tf
locals {
full_name = "${var.service_name}-${var.environment}"
common_tags = merge(var.tags, {
Module = "cloudwatch-observability"
Environment = var.environment
ManagedBy = "terraform"
})
}
# ─── Log Groups ────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_log_group" "app" {
name = "/aws/app/${local.full_name}"
retention_in_days = var.log_retention_days # 30 for prod, 7 for dev
tags = local.common_tags
}
resource "aws_cloudwatch_log_group" "lambda" {
count = var.lambda_function_name != "" ? 1 : 0
name = "/aws/lambda/${var.lambda_function_name}"
retention_in_days = var.log_retention_days
tags = local.common_tags
}
# ─── Metric Filters (extract metrics from logs) ────────────────────────────────
resource "aws_cloudwatch_log_metric_filter" "error_count" {
name = "${local.full_name}-error-count"
log_group_name = aws_cloudwatch_log_group.app.name
pattern = "{ $.type = \"http_request\" && $.statusCode >= 500 }"
metric_transformation {
name = "ServerErrorCount"
namespace = "MyApp/${var.service_name}"
value = "1"
default_value = "0"
unit = "Count"
dimensions = {
Environment = "$.environment"
}
}
}
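To make the filter pattern concrete, `{ $.type = "http_request" && $.statusCode >= 500 }` selects JSON events where both conditions hold on top-level fields — equivalent to this predicate (a sketch of the matching semantics, not the actual CloudWatch implementation):

```typescript
// Which log lines the metric filter above counts: JSON events with
// type = "http_request" and a statusCode of 500 or above.
function matchesErrorFilter(line: string): boolean {
  try {
    const event = JSON.parse(line);
    return event.type === "http_request" && Number(event.statusCode) >= 500;
  } catch {
    return false; // non-JSON lines never match a JSON filter pattern
  }
}
```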
resource "aws_cloudwatch_log_metric_filter" "payment_succeeded" {
name = "${local.full_name}-payment-succeeded"
log_group_name = aws_cloudwatch_log_group.app.name
  # Assumes payment events log a top-level amountUSD field; note the EMF
  # helper above takes amountCents, so convert (or log both) at the call site.
  pattern = "{ $.type = \"business_event\" && $.eventType = \"payment_succeeded\" }"
metric_transformation {
name = "PaymentSucceeded"
namespace = "MyApp/Business"
value = "$.amountUSD"
unit = "None"
}
}
# ─── Anomaly Detection ─────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "request_count_anomaly" {
alarm_name = "${local.full_name}-request-count-anomaly"
alarm_description = "Abnormal request volume (anomaly detection — 2σ threshold)"
comparison_operator = "GreaterThanUpperThreshold"
evaluation_periods = 3
threshold_metric_id = "e1"
treat_missing_data = "notBreaching"
metric_query {
id = "m1"
return_data = true
metric {
metric_name = "RequestCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "e1"
expression = "ANOMALY_DETECTION_BAND(m1, 2)"
label = "RequestCount (Expected)"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
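The `2` in `ANOMALY_DETECTION_BAND(m1, 2)` is the band width in standard deviations. Conceptually (a simplified sketch — CloudWatch's actual model learns hourly and daily seasonality, not a flat mean/σ):

```typescript
// Simplified view of the upper anomaly band: the alarm fires when the
// observed value exceeds expected + k * stddev for the configured number
// of evaluation periods. The k parameter only scales sensitivity.
function breachesUpperBand(
  observed: number,
  expected: number,
  stddev: number,
  k: number
): boolean {
  return observed > expected + k * stddev;
}

// With expected = 1000 req/5min and stddev = 100: k = 2 tolerates up to
// 1200; a tighter k = 1 would already alarm at 1101.
```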
# ─── Standard Threshold Alarms ────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "error_rate_high" {
alarm_name = "${local.full_name}-error-rate-high"
alarm_description = "Server error rate >2% over 5 minutes"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 2
treat_missing_data = "notBreaching"
metric_query {
id = "errors"
metric {
metric_name = "ServerErrorCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "requests"
metric {
metric_name = "RequestCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "error_rate"
expression = "(errors / requests) * 100"
label = "Error Rate (%)"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
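Worked through: the metric math above alarms on a percentage rather than a raw count, so low-traffic periods with a handful of errors don't page anyone:

```typescript
// The alarm expression `(errors / requests) * 100`, threshold = 2 (%).
function errorRatePct(serverErrors: number, requests: number): number {
  return (serverErrors / requests) * 100;
}

// 30 errors out of 1,000 requests in a 5-minute period is ~3% and
// breaches the 2% threshold; the same 30 errors out of 10,000 requests
// is ~0.3% and stays quiet.
```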
resource "aws_cloudwatch_metric_alarm" "latency_p95_high" {
alarm_name = "${local.full_name}-latency-p95"
alarm_description = "p95 API latency >2000ms"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "MyApp/${var.service_name}"
period = 60
extended_statistic = "p95"
threshold = 2000
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
# ─── Composite Alarm (reduce noise) ───────────────────────────────────────────
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
alarm_name = "${local.full_name}-service-degraded"
alarm_description = "Page on-call: both error rate AND latency are elevated"
# Only fire when BOTH conditions are true (reduces false positives)
alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.error_rate_high.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.latency_p95_high.alarm_name})"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
ok_actions = [aws_sns_topic.critical_alerts.arn]
tags = local.common_tags
}
# ─── SNS Topics ───────────────────────────────────────────────────────────────
resource "aws_sns_topic" "alerts" {
name = "${local.full_name}-alerts"
tags = local.common_tags
}
resource "aws_sns_topic" "critical_alerts" {
name = "${local.full_name}-critical-alerts"
tags = local.common_tags
}
resource "aws_sns_topic_subscription" "slack_alerts" {
  # A raw Slack incoming webhook can't complete the SNS subscription
  # handshake — in practice, route through AWS Chatbot or a forwarding Lambda.
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = var.slack_webhook_url
}
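Since a raw Slack incoming webhook can't confirm an SNS subscription, the SNS → Slack hop typically goes through AWS Chatbot or a small forwarding Lambda. A sketch of the transform such a Lambda performs (the alarm fields shown are part of the standard CloudWatch alarm notification; the payload shape matches Slack's incoming-webhook API):

```typescript
// SNS → Slack transform. The SNS message body for a CloudWatch alarm is
// JSON containing (among others) these fields.
interface AlarmNotification {
  AlarmName: string;
  NewStateValue: string; // "ALARM" | "OK" | "INSUFFICIENT_DATA"
  NewStateReason: string;
}

function toSlackPayload(snsMessageJson: string): { text: string } {
  const alarm: AlarmNotification = JSON.parse(snsMessageJson);
  const emoji =
    alarm.NewStateValue === "ALARM" ? ":red_circle:" : ":white_check_mark:";
  return {
    text: `${emoji} *${alarm.AlarmName}* is ${alarm.NewStateValue}\n${alarm.NewStateReason}`,
  };
}

// The Lambda handler then POSTs the payload to the webhook URL, e.g.
// fetch(webhookUrl, { method: "POST", body: JSON.stringify(payload) }).
```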
CloudWatch Dashboard (Terraform)
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = local.full_name
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 8
height = 6
properties = {
title = "Request Volume"
period = 60
stat = "Sum"
metrics = [
["MyApp/${var.service_name}", "RequestCount", { label = "Requests" }]
]
view = "timeSeries"
annotations = {
alarms = [aws_cloudwatch_metric_alarm.request_count_anomaly.arn]
}
}
},
{
type = "metric"
x = 8
y = 0
width = 8
height = 6
properties = {
title = "Error Rate (%)"
period = 60
view = "timeSeries"
metrics = [
[{ expression = "(errors/requests)*100", label = "Error Rate %", id = "e1" }],
["MyApp/${var.service_name}", "ServerErrorCount", { id = "errors", visible = false }],
["MyApp/${var.service_name}", "RequestCount", { id = "requests", visible = false }]
]
yAxis = { left = { min = 0, max = 10 } }
}
},
{
type = "metric"
x = 16
y = 0
width = 8
height = 6
properties = {
title = "API Latency"
period = 60
view = "timeSeries"
metrics = [
["MyApp/${var.service_name}", "Latency", { stat = "p50", label = "p50" }],
["MyApp/${var.service_name}", "Latency", { stat = "p95", label = "p95" }],
["MyApp/${var.service_name}", "Latency", { stat = "p99", label = "p99" }]
]
}
},
{
type = "log"
x = 0
y = 6
width = 24
height = 6
properties = {
title = "Recent Errors"
query = "SOURCE '${aws_cloudwatch_log_group.app.name}' | fields @timestamp, path, statusCode, errorCode, userId | filter statusCode >= 500 | sort @timestamp desc | limit 20"
region = var.region
view = "table"
}
}
]
})
}
Named Log Insights Queries (Saved Queries)
resource "aws_cloudwatch_query_definition" "slow_endpoints" {
name = "${local.full_name}/slow-endpoints"
log_group_names = [aws_cloudwatch_log_group.app.name]
query_string = <<-EOT
fields @timestamp, path, method, durationMs, userId
| filter type = "http_request"
| stats avg(durationMs) as avgMs, pct(durationMs, 95) as p95Ms, count() as requests by path
| sort p95Ms desc
| limit 20
EOT
}
resource "aws_cloudwatch_query_definition" "error_analysis" {
name = "${local.full_name}/error-analysis"
log_group_names = [aws_cloudwatch_log_group.app.name]
query_string = <<-EOT
fields @timestamp, path, statusCode, errorCode, userId, @message
| filter type = "http_request" and statusCode >= 400
| stats count() as errorCount by errorCode, path, statusCode
| sort errorCount desc
| limit 50
EOT
}
Cost Optimization
CloudWatch costs can surprise teams. Approximate 2026 pricing (us-east-1):
| Resource | Pricing | Typical Monthly Cost |
|---|---|---|
| Custom metrics | $0.30/metric/month (first 10K) | $15–$150 for 50–500 metrics |
| Log ingestion | $0.50/GB | $5–$50 for 10–100 GB |
| Log storage | $0.03/GB/month | $3–$30 for 100–1000 GB |
| Log Insights queries | $0.005/GB scanned | $1–$20 typical |
| Dashboard | $3/dashboard/month | $3–$15 |
| Alarms | $0.10/alarm/month | $5–$20 |
| Anomaly detection alarms | $0.30/alarm/month (billed as 3 alarm metrics) | $3–$15 |
Cost reduction tactics:
- Set log retention to 7–30 days (not forever)
- Use metric filters instead of EMF for simple counters
- Sample debug logs (1 in 10) in production
- Use CloudWatch Logs Insights only for ad-hoc queries; use dashboards for regular views
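The debug-sampling tactic can be a one-line wrapper around the logger — a sketch using a deterministic counter rather than randomness, so ingestion volume is predictable:

```typescript
// 1-in-N debug log sampler: keeps every Nth debug call, drops the rest.
function makeSampler(n: number): () => boolean {
  let count = 0;
  return () => {
    count += 1;
    return count % n === 0;
  };
}

const shouldLogDebug = makeSampler(10);

// Usage with the logger from lib/logger.ts:
// if (shouldLogDebug()) logger.debug("cache_miss", { key });
```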
Cost and Timeline Estimates
| Scope | Timeline | Engineering Cost |
|---|---|---|
| Basic alarms (CPU, memory, errors) | 0.5–1 day | $400–$800 |
| Custom metrics via EMF | 1–2 days | $800–$1,600 |
| Log Insights query library | 1 day | $600–$1,000 |
| Full Terraform-managed observability | 3–5 days | $2,400–$4,000 |
| Dashboard + anomaly detection + composite alarms | +2–3 days | $1,600–$2,400 |
| Complete production observability stack | 1–1.5 weeks | $5,000–$8,000 |
See Also
- AWS ECS Fargate Production — Container workloads generating these metrics
- AWS RDS Proxy — Database metrics to add to CloudWatch
- Kubernetes Cost Optimization — Container resource monitoring patterns
- Terraform State Management — Managing CloudWatch resources as code
Working With Viprasol
We build production observability stacks for AWS workloads—from initial CloudWatch setup through full incident response runbooks. Our cloud team has instrumented applications generating terabytes of logs per month, with dashboards that engineering and business teams actually use.
What we deliver:
- Complete Terraform-managed CloudWatch configuration
- EMF custom metric instrumentation in your application code
- Log Insights query library for your specific use cases
- Composite alarm strategy to minimize alert fatigue
- Slack/PagerDuty integration and escalation runbooks
See our cloud infrastructure services or contact us to discuss your observability requirements.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.