
AWS CloudWatch Observability in 2026: Custom Metrics, Log Insights, and Anomaly Detection

Build production AWS CloudWatch observability: custom metrics with EMF, Log Insights queries, composite alarms, anomaly detection, dashboards, and Terraform automation.

Viprasol Tech Team
December 25, 2026
13 min read


CloudWatch is the default observability platform for AWS workloads, and in 2026 it's capable enough that most teams don't need to bolt on a third-party APM tool. But most teams use it badly—they rely only on default EC2/RDS metrics, write ad-hoc Log Insights queries, and set simplistic threshold alarms that fire constantly or never.

This post covers the production CloudWatch setup we implement at Viprasol: custom business metrics via Embedded Metric Format (EMF), structured Log Insights queries, composite alarms that reduce noise, anomaly detection for traffic patterns, and the Terraform that manages it all as code.


The Observability Stack

Application
    │
    ├── Structured JSON logs → CloudWatch Logs
    ├── EMF custom metrics → CloudWatch Metrics
    └── X-Ray traces → CloudWatch ServiceMap
         │
CloudWatch
    ├── Log Groups → Log Insights (queries)
    ├── Metric Namespaces → Dashboards, Alarms
    ├── Anomaly Detection → Dynamic thresholds
    └── Composite Alarms → Reduce noise
         │
    SNS → PagerDuty / Slack / Email

Structured Logging (the Foundation)

Every log line should be machine-parseable JSON. Log Insights can still full-text search unstructured logs, but field-level filters, stats, and aggregations only work when the fields actually exist.

// lib/logger.ts
import { createLogger, format, transports } from "winston";

const isProduction = process.env.NODE_ENV === "production";

export const logger = createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: format.combine(
    format.timestamp(), // ISO 8601 by default
    format.errors({ stack: true }),
    isProduction
      ? format.json()
      : format.combine(format.colorize(), format.simple())
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME ?? "api",
    version: process.env.APP_VERSION ?? "unknown",
    environment: process.env.NODE_ENV,
  },
  transports: [new transports.Console()],
});

// Typed log helper for request events
export function logRequest(params: {
  method: string;
  path: string;
  statusCode: number;
  durationMs: number;
  userId?: string;
  teamId?: string;
  errorCode?: string;
}) {
  const level = params.statusCode >= 500 ? "error" : 
                params.statusCode >= 400 ? "warn" : "info";

  logger[level]("http_request", {
    ...params,
    type: "http_request",
  });
}

// Structured business event
export function logEvent(event: string, data: Record<string, unknown>) {
  logger.info(event, {
    ...data,
    type: "business_event",
    timestamp: new Date().toISOString(),
  });
}

In production, CloudWatch Logs receives JSON objects. Log Insights can query any field.
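
A single request ends up as a JSON log line that looks roughly like this (values are illustrative):

{
  "level": "info",
  "message": "http_request",
  "timestamp": "2026-12-25T10:15:00.000Z",
  "service": "api",
  "version": "1.4.2",
  "environment": "production",
  "type": "http_request",
  "method": "GET",
  "path": "/api/projects",
  "statusCode": 200,
  "durationMs": 42,
  "userId": "usr_123"
}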


☁️ Is Your Cloud Costing Too Much?

Most teams overspend 30–40% on cloud — wrong instance types, no reserved pricing, bloated storage. We audit, right-size, and automate your infrastructure.

  • AWS, GCP, Azure certified engineers
  • Infrastructure as Code (Terraform, CDK)
  • Docker, Kubernetes, GitHub Actions CI/CD
  • Typical audit recovers $500–$3,000/month in savings

Custom Metrics with Embedded Metric Format (EMF)

EMF lets you emit custom metrics through your log stream: no PutMetricData API calls, no API throttling, no extra network hops. CloudWatch parses the structured log entries and extracts the metrics asynchronously.
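
Under the hood, every EMF record is an ordinary JSON log line with an _aws envelope that tells CloudWatch which fields to extract as metrics. Roughly (values illustrative):

{
  "_aws": {
    "Timestamp": 1766656500000,
    "CloudWatchMetrics": [{
      "Namespace": "MyApp/API",
      "Dimensions": [["Route", "Method", "StatusCode", "Environment"]],
      "Metrics": [
        { "Name": "Latency", "Unit": "Milliseconds" },
        { "Name": "RequestCount", "Unit": "Count" }
      ]
    }]
  },
  "Route": "/api/projects",
  "Method": "GET",
  "StatusCode": "200",
  "Environment": "production",
  "Latency": 42,
  "RequestCount": 1
}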

// lib/metrics/emf.ts
import { createMetricsLogger, Unit } from "aws-embedded-metrics";

// Business metric: track API latency by route
export async function recordApiLatency(
  route: string,
  method: string,
  statusCode: number,
  durationMs: number
) {
  const metrics = createMetricsLogger();

  // Each unique dimension combination becomes a separate custom metric on the
  // bill, so keep cardinality low (no user IDs or raw URLs as dimensions).
  metrics.putDimensions({
    Route: route,
    Method: method,
    StatusCode: String(statusCode),
    Environment: process.env.NODE_ENV ?? "production",
  });

  metrics.putMetric("Latency", durationMs, Unit.Milliseconds);
  metrics.putMetric("RequestCount", 1, Unit.Count);

  if (statusCode >= 500) {
    metrics.putMetric("ServerErrors", 1, Unit.Count);
  } else if (statusCode >= 400) {
    metrics.putMetric("ClientErrors", 1, Unit.Count);
  }

  metrics.setNamespace("MyApp/API");
  await metrics.flush();
}

// Business metric: SaaS-specific events
export async function recordBusinessEvent(
  eventType: "signup" | "subscription_created" | "payment_succeeded" | "payment_failed" | "churn",
  data: {
    plan?: string;
    amountCents?: number;
    country?: string;
  } = {}
) {
  const metrics = createMetricsLogger();

  metrics.putDimensions({
    EventType: eventType,
    Plan: data.plan ?? "unknown",
    Environment: process.env.NODE_ENV ?? "production",
  });

  metrics.putMetric("EventCount", 1, Unit.Count);

  if (data.amountCents !== undefined) {
    metrics.putMetric("RevenueUSD", data.amountCents / 100, Unit.None);
  }

  metrics.setNamespace("MyApp/Business");
  await metrics.flush();
}
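
Calling the business-event helper from, say, a payment webhook is a one-liner (the surrounding handler is hypothetical):

// Inside a payment webhook handler (hypothetical)
await recordBusinessEvent("payment_succeeded", {
  plan: "pro",
  amountCents: 4900,
  country: "US",
});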

Fastify middleware integration:

// middleware/metrics.ts (Fastify)
import { FastifyPluginAsync } from "fastify";
import { recordApiLatency } from "@/lib/metrics/emf";

const metricsPlugin: FastifyPluginAsync = async (fastify) => {
  fastify.addHook("onResponse", async (request, reply) => {
    const durationMs = reply.elapsedTime;
    const route = request.routeOptions?.url ?? request.url;

    await recordApiLatency(
      route,
      request.method,
      reply.statusCode,
      durationMs
    );
  });
};

export default metricsPlugin;
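
Register the plugin once at startup. A minimal sketch, assuming an ESM entry point:

// app.ts (sketch)
import Fastify from "fastify";
import metricsPlugin from "./middleware/metrics";

const app = Fastify();
await app.register(metricsPlugin);
await app.listen({ port: 3000 });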

Log Insights Query Library

Save these as named queries in CloudWatch (or via Terraform):

Error rate by route (last 1 hour)

fields @timestamp, method, path, statusCode, durationMs, errorCode
| filter type = "http_request" and statusCode >= 400
| stats
    count() as errorCount,
    count_distinct(userId) as affectedUsers,
    avg(durationMs) as avgLatencyMs
  by bin(5m), path, statusCode
| sort errorCount desc
| limit 50

Slow queries (p95 latency)

fields @timestamp, path, method, durationMs, userId
| filter type = "http_request"
| stats
    count() as requests,
    avg(durationMs) as avgMs,
    pct(durationMs, 50) as p50Ms,
    pct(durationMs, 95) as p95Ms,
    pct(durationMs, 99) as p99Ms,
    max(durationMs) as maxMs
  by path
| sort p95Ms desc
| limit 20

Business events in the last 24 hours

fields @timestamp, eventType, plan, amountUSD
| filter type = "business_event"
| stats
    count() as events,
    sum(amountUSD) as totalRevenue
  by bin(1h), eventType, plan
| sort @timestamp desc

Error funnel: find users hitting repeated errors

fields @timestamp, userId, path, statusCode, errorCode
| filter type = "http_request" and statusCode >= 500
| stats count() as errorCount by userId, errorCode
| filter errorCount >= 3
| sort errorCount desc
| limit 25

Lambda cold start frequency

fields @timestamp, @initDuration, @duration, @memorySize
| filter @initDuration > 0
| stats
    count() as coldStarts,
    avg(@initDuration) as avgInitMs,
    max(@initDuration) as maxInitMs,
    avg(@duration) as avgDurationMs
  by bin(5m)
| sort @timestamp desc

⚙️ DevOps Done Right — Zero Downtime, Full Automation

Ship faster without breaking things. We build CI/CD pipelines, monitoring stacks, and auto-scaling infrastructure that your team can actually maintain.

  • Staging + production environments with feature flags
  • Automated security scanning in the pipeline
  • Uptime monitoring + alerting + runbook automation
  • On-call support handover docs included

Terraform: Alarms, Dashboards, and Anomaly Detection

# modules/cloudwatch-observability/main.tf

locals {
  full_name = "${var.service_name}-${var.environment}"
  common_tags = merge(var.tags, {
    Module      = "cloudwatch-observability"
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

# ─── Log Groups ────────────────────────────────────────────────────────────────

resource "aws_cloudwatch_log_group" "app" {
  name              = "/aws/app/${local.full_name}"
  retention_in_days = var.log_retention_days  # 30 for prod, 7 for dev
  tags              = local.common_tags
}

resource "aws_cloudwatch_log_group" "lambda" {
  count             = var.lambda_function_name != "" ? 1 : 0
  name              = "/aws/lambda/${var.lambda_function_name}"
  retention_in_days = var.log_retention_days
  tags              = local.common_tags
}

# ─── Metric Filters (extract metrics from logs) ────────────────────────────────

resource "aws_cloudwatch_log_metric_filter" "error_count" {
  name           = "${local.full_name}-error-count"
  log_group_name = aws_cloudwatch_log_group.app.name
  pattern        = "{ $.type = \"http_request\" && $.statusCode >= 500 }"

  metric_transformation {
    name      = "ServerErrorCount"
    namespace = "MyApp/${var.service_name}"
    value     = "1"
    unit      = "Count"
    # default_value can't be combined with dimensions on a metric filter,
    # so it's omitted here.
    dimensions = {
      Environment = "$.environment"
    }
  }
}

resource "aws_cloudwatch_log_metric_filter" "payment_succeeded" {
  name           = "${local.full_name}-payment-succeeded"
  log_group_name = aws_cloudwatch_log_group.app.name
  pattern        = "{ $.type = \"business_event\" && $.eventType = \"payment_succeeded\" }"

  metric_transformation {
    name      = "PaymentSucceeded"
    namespace = "MyApp/Business"
    value     = "$.amountUSD"
    unit      = "None"
  }
}

# ─── Anomaly Detection ─────────────────────────────────────────────────────────

resource "aws_cloudwatch_metric_alarm" "request_count_anomaly" {
  alarm_name          = "${local.full_name}-request-count-anomaly"
  alarm_description   = "Abnormal request volume (anomaly detection — 2σ threshold)"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "e1"
  treat_missing_data  = "notBreaching"

  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "RequestCount"
      namespace   = "MyApp/${var.service_name}"
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id          = "e1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "RequestCount (Expected)"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
  tags          = local.common_tags
}

# ─── Standard Threshold Alarms ────────────────────────────────────────────────

resource "aws_cloudwatch_metric_alarm" "error_rate_high" {
  alarm_name          = "${local.full_name}-error-rate-high"
  alarm_description   = "Server error rate >2% over 5 minutes"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 2
  treat_missing_data  = "notBreaching"

  metric_query {
    id = "errors"
    metric {
      metric_name = "ServerErrorCount"
      namespace   = "MyApp/${var.service_name}"
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "MyApp/${var.service_name}"
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id          = "error_rate"
    expression  = "(errors / requests) * 100"
    label       = "Error Rate (%)"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  tags          = local.common_tags
}

resource "aws_cloudwatch_metric_alarm" "latency_p95_high" {
  alarm_name          = "${local.full_name}-latency-p95"
  alarm_description   = "p95 API latency >2000ms"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "Latency"
  namespace           = "MyApp/${var.service_name}"
  period              = 60
  extended_statistic  = "p95"
  threshold           = 2000
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  tags                = local.common_tags
}

# ─── Composite Alarm (reduce noise) ───────────────────────────────────────────

resource "aws_cloudwatch_composite_alarm" "service_degraded" {
  alarm_name        = "${local.full_name}-service-degraded"
  alarm_description = "Page on-call: both error rate AND latency are elevated"

  # Only fire when BOTH conditions are true (reduces false positives)
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.error_rate_high.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.latency_p95_high.alarm_name})"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
  ok_actions    = [aws_sns_topic.critical_alerts.arn]
  tags          = local.common_tags
}

# ─── SNS Topics ───────────────────────────────────────────────────────────────

resource "aws_sns_topic" "alerts" {
  name = "${local.full_name}-alerts"
  tags = local.common_tags
}

resource "aws_sns_topic" "critical_alerts" {
  name = "${local.full_name}-critical-alerts"
  tags = local.common_tags
}

resource "aws_sns_topic_subscription" "slack_alerts" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = var.slack_webhook_url
}
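
A minimal forwarder sketch (assumes Node 18+ for the built-in fetch and a SLACK_WEBHOOK_URL environment variable on the function):

// lambda/sns-to-slack.ts (sketch)
import type { SNSEvent } from "aws-lambda";

export const handler = async (event: SNSEvent): Promise<void> => {
  for (const record of event.Records) {
    // CloudWatch alarm notifications arrive as a JSON string in Sns.Message
    const alarm = JSON.parse(record.Sns.Message);
    const text = `*${alarm.AlarmName}* is ${alarm.NewStateValue}\n${alarm.NewStateReason}`;

    await fetch(process.env.SLACK_WEBHOOK_URL!, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });
  }
};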

CloudWatch Dashboard (Terraform)

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = local.full_name

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 8
        height = 6
        properties = {
          title  = "Request Volume"
          period = 60
          stat   = "Sum"
          metrics = [
            ["MyApp/${var.service_name}", "RequestCount", { label = "Requests" }]
          ]
          view = "timeSeries"
          annotations = {
            alarms = [aws_cloudwatch_metric_alarm.request_count_anomaly.arn]
          }
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 0
        width  = 8
        height = 6
        properties = {
          title  = "Error Rate (%)"
          period = 60
          view   = "timeSeries"
          metrics = [
            [{ expression = "(errors/requests)*100", label = "Error Rate %", id = "e1" }],
            ["MyApp/${var.service_name}", "ServerErrorCount", { id = "errors", visible = false }],
            ["MyApp/${var.service_name}", "RequestCount", { id = "requests", visible = false }]
          ]
          yAxis = { left = { min = 0, max = 10 } }
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 0
        width  = 8
        height = 6
        properties = {
          title   = "API Latency"
          period  = 60
          view    = "timeSeries"
          metrics = [
            ["MyApp/${var.service_name}", "Latency", { stat = "p50", label = "p50" }],
            ["MyApp/${var.service_name}", "Latency", { stat = "p95", label = "p95" }],
            ["MyApp/${var.service_name}", "Latency", { stat = "p99", label = "p99" }]
          ]
        }
      },
      {
        type   = "log"
        x      = 0
        y      = 6
        width  = 24
        height = 6
        properties = {
          title  = "Recent Errors"
          query  = "SOURCE '${aws_cloudwatch_log_group.app.name}' | fields @timestamp, path, statusCode, errorCode, userId | filter statusCode >= 500 | sort @timestamp desc | limit 20"
          region = var.region
          view   = "table"
        }
      }
    ]
  })
}

Named Log Insights Queries (Saved Queries)

resource "aws_cloudwatch_query_definition" "slow_endpoints" {
  name = "${local.full_name}/slow-endpoints"

  log_group_names = [aws_cloudwatch_log_group.app.name]

  query_string = <<-EOT
    fields @timestamp, path, method, durationMs, userId
    | filter type = "http_request"
    | stats avg(durationMs) as avgMs, pct(durationMs, 95) as p95Ms, count() as requests by path
    | sort p95Ms desc
    | limit 20
  EOT
}

resource "aws_cloudwatch_query_definition" "error_analysis" {
  name = "${local.full_name}/error-analysis"

  log_group_names = [aws_cloudwatch_log_group.app.name]

  query_string = <<-EOT
    fields @timestamp, path, statusCode, errorCode, userId, @message
    | filter type = "http_request" and statusCode >= 400
    | stats count() as errorCount by errorCode, path, statusCode
    | sort errorCount desc
    | limit 50
  EOT
}

Cost Optimization

CloudWatch costs can surprise teams. 2026 pricing (us-east-1):

Resource             | Pricing                                  | Typical Monthly Cost
Custom metrics       | $0.30/metric/month (first 10K)           | $15–$150 for 50–500 metrics
Log ingestion        | $0.50/GB                                 | $5–$50 for 10–100 GB
Log storage          | $0.03/GB/month                           | $3–$30 for 100–1,000 GB
Log Insights queries | $0.005/GB scanned                        | $1–$20 typical
Dashboards           | $3/dashboard/month (first 3 free)        | $3–$15
Standard alarms      | $0.10/alarm/month                        | $5–$20
Anomaly detection    | $0.30/alarm/month (billed as 3 metrics)  | $5–$20

Cost reduction tactics:

  1. Set log retention to 7–30 days (not forever)
  2. Use metric filters instead of EMF for simple counters
  3. Sample debug logs in production, keeping roughly 1 in 10 (see the sketch after this list)
  4. Use CloudWatch Logs Insights only for ad-hoc queries; use dashboards for regular views
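
For the sampling tactic, a wrapper over the earlier logger is enough (a sketch; the 10% keep rate is an assumption to tune):

// lib/logger.ts (addition, sketch): keep ~1 in 10 debug logs in production
export function debugSampled(message: string, meta: Record<string, unknown> = {}) {
  if (process.env.NODE_ENV === "production" && Math.random() >= 0.1) return;
  logger.debug(message, { ...meta, sampled: true });
}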

Cost and Timeline Estimates

Scope                                            | Timeline    | Engineering Cost
Basic alarms (CPU, memory, errors)               | 0.5–1 day   | $400–$800
Custom metrics via EMF                           | 1–2 days    | $800–$1,600
Log Insights query library                       | 1 day       | $600–$1,000
Full Terraform-managed observability             | 3–5 days    | $2,400–$4,000
Dashboard + anomaly detection + composite alarms | +2–3 days   | $1,600–$2,400
Complete production observability stack          | 1–1.5 weeks | $5,000–$8,000


Working With Viprasol

We build production observability stacks for AWS workloads—from initial CloudWatch setup through full incident response runbooks. Our cloud team has instrumented applications generating terabytes of logs per month, with dashboards that engineering and business teams actually use.

What we deliver:

  • Complete Terraform-managed CloudWatch configuration
  • EMF custom metric instrumentation in your application code
  • Log Insights query library for your specific use cases
  • Composite alarm strategy to minimize alert fatigue
  • Slack/PagerDuty integration and escalation runbooks

See our cloud infrastructure services or contact us to discuss your observability requirements.
