AWS CloudWatch Observability in 2026: Custom Metrics, Log Insights, and Anomaly Detection
Build production AWS CloudWatch observability: custom metrics with EMF, Log Insights queries, composite alarms, anomaly detection, dashboards, and Terraform automation.
CloudWatch is the default observability platform for AWS workloads, and in 2026 it's capable enough that most teams don't need to bolt on a third-party APM tool. But most teams use it badly—they rely only on default EC2/RDS metrics, write ad-hoc Log Insights queries, and set simplistic threshold alarms that fire constantly or never.
This post covers the production CloudWatch setup we implement at Viprasol: custom business metrics via Embedded Metric Format (EMF), structured Log Insights queries, composite alarms that reduce noise, anomaly detection for traffic patterns, and the Terraform that manages it all as code.
The Observability Stack
Application
│
├── Structured JSON logs → CloudWatch Logs
├── EMF custom metrics → CloudWatch Metrics
└── X-Ray traces → CloudWatch ServiceMap
│
CloudWatch
├── Log Groups → Log Insights (queries)
├── Metric Namespaces → Dashboards, Alarms
├── Anomaly Detection → Dynamic thresholds
└── Composite Alarms → Reduce noise
│
SNS → PagerDuty / Slack / Email
Structured Logging (the Foundation)
Every log line should be machine-parseable JSON. Against unstructured text, Log Insights forces you into fragile `parse` expressions; against JSON, every field is directly queryable.
// lib/logger.ts
import { createLogger, format, transports } from "winston";
const isProduction = process.env.NODE_ENV === "production";
export const logger = createLogger({
level: process.env.LOG_LEVEL ?? "info",
format: format.combine(
    format.timestamp(), // defaults to ISO 8601
format.errors({ stack: true }),
isProduction
? format.json()
: format.combine(format.colorize(), format.simple())
),
defaultMeta: {
service: process.env.SERVICE_NAME ?? "api",
version: process.env.APP_VERSION ?? "unknown",
environment: process.env.NODE_ENV,
},
transports: [new transports.Console()],
});
// Typed log helper for request events
export function logRequest(params: {
method: string;
path: string;
statusCode: number;
durationMs: number;
userId?: string;
teamId?: string;
errorCode?: string;
}) {
const level = params.statusCode >= 500 ? "error" :
params.statusCode >= 400 ? "warn" : "info";
logger[level]("http_request", {
...params,
type: "http_request",
});
}
// Structured business event
export function logEvent(event: string, data: Record<string, unknown>) {
  logger.info(event, {
    ...data,
    type: "business_event",
    // winston stores `event` in `message`; duplicate it into a named
    // field so Log Insights can filter on `eventType` directly
    eventType: event,
    timestamp: new Date().toISOString(),
  });
}
In production, CloudWatch Logs receives JSON objects. Log Insights can query any field.
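For reference, a single emitted line looks roughly like this — a hand-built sketch, not literal winston output (exact field order and extra metadata vary, and the ids/values here are hypothetical):

```typescript
// Sketch of the JSON shape one http_request log line takes in CloudWatch
// Logs. Field names mirror the logger config above.
const sampleLogLine = {
  level: "warn",
  message: "http_request",
  timestamp: "2026-01-15T09:30:00.000Z",
  service: "api",
  version: "1.4.2", // hypothetical APP_VERSION
  environment: "production",
  type: "http_request",
  method: "GET",
  path: "/api/teams/:teamId/members",
  statusCode: 404,
  durationMs: 12,
  userId: "usr_123", // hypothetical id
  errorCode: "TEAM_NOT_FOUND",
};

// Every top-level key is addressable in Log Insights, e.g.
// `filter statusCode >= 400` or `stats avg(durationMs) by path`.
const serialized = JSON.stringify(sampleLogLine);
```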
Custom Metrics with Embedded Metric Format (EMF)
EMF lets you emit custom metrics through your log stream—no PutMetricData API calls, no SDK dependency for metrics. CloudWatch parses the structured log and extracts the metrics automatically.
// lib/metrics/emf.ts
import { createMetricsLogger, Unit } from "aws-embedded-metrics";
// Business metric: track API latency by route
export async function recordApiLatency(
route: string,
method: string,
statusCode: number,
durationMs: number
) {
  const metrics = createMetricsLogger();
metrics.putDimensions({
Route: route,
Method: method,
StatusCode: String(statusCode),
Environment: process.env.NODE_ENV ?? "production",
});
metrics.putMetric("Latency", durationMs, Unit.Milliseconds);
metrics.putMetric("RequestCount", 1, Unit.Count);
if (statusCode >= 500) {
metrics.putMetric("ServerErrors", 1, Unit.Count);
} else if (statusCode >= 400) {
metrics.putMetric("ClientErrors", 1, Unit.Count);
}
metrics.setNamespace("MyApp/API");
await metrics.flush();
}
// Business metric: SaaS-specific events
export async function recordBusinessEvent(
eventType: "signup" | "subscription_created" | "payment_succeeded" | "payment_failed" | "churn",
data: {
plan?: string;
amountCents?: number;
country?: string;
} = {}
) {
  const metrics = createMetricsLogger();
metrics.putDimensions({
EventType: eventType,
Plan: data.plan ?? "unknown",
Environment: process.env.NODE_ENV ?? "production",
});
metrics.putMetric("EventCount", 1, Unit.Count);
if (data.amountCents !== undefined) {
metrics.putMetric("RevenueUSD", data.amountCents / 100, Unit.None);
}
metrics.setNamespace("MyApp/Business");
await metrics.flush();
}
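Under the hood, aws-embedded-metrics just serializes a JSON envelope defined by the EMF spec. A hand-rolled sketch, useful when you can't take the dependency (namespace and field names follow the examples above):

```typescript
// Minimal hand-rolled EMF envelope. CloudWatch extracts `Latency` as a
// metric in namespace MyApp/API with dimensions Route/Method; the same
// top-level keys remain queryable as ordinary log fields.
function emfEnvelope(route: string, method: string, latencyMs: number): string {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: "MyApp/API",
          Dimensions: [["Route", "Method"]],
          Metrics: [{ Name: "Latency", Unit: "Milliseconds" }],
        },
      ],
    },
    Route: route,
    Method: method,
    Latency: latencyMs,
  });
}

// In Lambda, writing this line to stdout is enough; on ECS/EC2 the
// CloudWatch agent must be listening for EMF:
// console.log(emfEnvelope("/api/users/:id", "GET", 42));
```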
Fastify middleware integration:
// middleware/metrics.ts (Fastify)
import { FastifyPluginAsync } from "fastify";
import { recordApiLatency } from "@/lib/metrics/emf";
const metricsPlugin: FastifyPluginAsync = async (fastify) => {
fastify.addHook("onResponse", async (request, reply) => {
    const durationMs = reply.elapsedTime; // Fastify >= 4.22; older versions use reply.getResponseTime()
const route = request.routeOptions?.url ?? request.url;
await recordApiLatency(
route,
request.method,
reply.statusCode,
durationMs
);
});
};
export default metricsPlugin;
Log Insights Query Library
Save these as named queries in CloudWatch (or via Terraform):
Error rate by route (last 1 hour)
fields @timestamp, method, path, statusCode, durationMs, errorCode
| filter type = "http_request" and statusCode >= 400
| stats
count() as errorCount,
count_distinct(userId) as affectedUsers,
avg(durationMs) as avgLatencyMs
by bin(5m), path, statusCode
| sort errorCount desc
| limit 50
Slow queries (p95 latency)
fields @timestamp, path, method, durationMs, userId
| filter type = "http_request"
| stats
count() as requests,
avg(durationMs) as avgMs,
pct(durationMs, 50) as p50Ms,
pct(durationMs, 95) as p95Ms,
pct(durationMs, 99) as p99Ms,
max(durationMs) as maxMs
by path
| sort p95Ms desc
| limit 20
Business events in the last 24 hours
fields @timestamp, eventType, plan, amountUSD
| filter type = "business_event"
| stats
count() as events,
sum(amountUSD) as totalRevenue
by bin(1h), eventType, plan
| sort @timestamp desc
Error funnel: find users hitting repeated errors
fields @timestamp, userId, path, statusCode, errorCode
| filter type = "http_request" and statusCode >= 500
| stats count() as errorCount by userId, errorCode
| filter errorCount >= 3
| sort errorCount desc
| limit 25
Lambda cold start frequency
fields @timestamp, @initDuration, @duration, @memorySize
| filter @initDuration > 0
| stats
count() as coldStarts,
avg(@initDuration) as avgInitMs,
max(@initDuration) as maxInitMs,
avg(@duration) as avgDurationMs
by bin(5m)
| sort @timestamp desc
Terraform: Alarms, Dashboards, and Anomaly Detection
# modules/cloudwatch-observability/main.tf
locals {
full_name = "${var.service_name}-${var.environment}"
common_tags = merge(var.tags, {
Module = "cloudwatch-observability"
Environment = var.environment
ManagedBy = "terraform"
})
}
# ─── Log Groups ────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_log_group" "app" {
name = "/aws/app/${local.full_name}"
retention_in_days = var.log_retention_days # 30 for prod, 7 for dev
tags = local.common_tags
}
resource "aws_cloudwatch_log_group" "lambda" {
count = var.lambda_function_name != "" ? 1 : 0
name = "/aws/lambda/${var.lambda_function_name}"
retention_in_days = var.log_retention_days
tags = local.common_tags
}
# ─── Metric Filters (extract metrics from logs) ────────────────────────────────
resource "aws_cloudwatch_log_metric_filter" "error_count" {
name = "${local.full_name}-error-count"
log_group_name = aws_cloudwatch_log_group.app.name
pattern = "{ $.type = \"http_request\" && $.statusCode >= 500 }"
metric_transformation {
name = "ServerErrorCount"
namespace = "MyApp/${var.service_name}"
value = "1"
default_value = "0"
unit = "Count"
dimensions = {
Environment = "$.environment"
}
}
}
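To make the filter pattern concrete, `{ $.type = "http_request" && $.statusCode >= 500 }` selects JSON events where both conditions hold on top-level fields — equivalent to this predicate (a sketch of the matching semantics, not the actual CloudWatch implementation):

```typescript
// Which log lines the metric filter above counts: JSON events with
// type = "http_request" and a statusCode of 500 or above.
function matchesErrorFilter(line: string): boolean {
  try {
    const event = JSON.parse(line);
    return event.type === "http_request" && Number(event.statusCode) >= 500;
  } catch {
    return false; // non-JSON lines never match a JSON filter pattern
  }
}
```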
resource "aws_cloudwatch_log_metric_filter" "payment_succeeded" {
name = "${local.full_name}-payment-succeeded"
log_group_name = aws_cloudwatch_log_group.app.name
  # Assumes payment events log a top-level amountUSD field; note the EMF
  # helper above takes amountCents, so convert (or log both) at the call site.
  pattern = "{ $.type = \"business_event\" && $.eventType = \"payment_succeeded\" }"
metric_transformation {
name = "PaymentSucceeded"
namespace = "MyApp/Business"
value = "$.amountUSD"
unit = "None"
}
}
# ─── Anomaly Detection ─────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "request_count_anomaly" {
alarm_name = "${local.full_name}-request-count-anomaly"
alarm_description = "Abnormal request volume (anomaly detection — 2σ threshold)"
comparison_operator = "GreaterThanUpperThreshold"
evaluation_periods = 3
threshold_metric_id = "e1"
treat_missing_data = "notBreaching"
metric_query {
id = "m1"
return_data = true
metric {
metric_name = "RequestCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "e1"
expression = "ANOMALY_DETECTION_BAND(m1, 2)"
label = "RequestCount (Expected)"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
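The `2` in `ANOMALY_DETECTION_BAND(m1, 2)` is the band width in standard deviations. Conceptually (a simplified sketch — CloudWatch's actual model learns hourly and daily seasonality, not a flat mean/σ):

```typescript
// Simplified view of the upper anomaly band: the alarm fires when the
// observed value exceeds expected + k * stddev for the configured number
// of evaluation periods. The k parameter only scales sensitivity.
function breachesUpperBand(
  observed: number,
  expected: number,
  stddev: number,
  k: number
): boolean {
  return observed > expected + k * stddev;
}

// With expected = 1000 req/5min and stddev = 100: k = 2 tolerates up to
// 1200; a tighter k = 1 would already alarm at 1101.
```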
# ─── Standard Threshold Alarms ────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "error_rate_high" {
alarm_name = "${local.full_name}-error-rate-high"
alarm_description = "Server error rate >2% over 5 minutes"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 2
treat_missing_data = "notBreaching"
metric_query {
id = "errors"
metric {
metric_name = "ServerErrorCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "requests"
metric {
metric_name = "RequestCount"
namespace = "MyApp/${var.service_name}"
period = 300
stat = "Sum"
}
}
metric_query {
id = "error_rate"
expression = "(errors / requests) * 100"
label = "Error Rate (%)"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
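Worked through: the metric math above alarms on a percentage rather than a raw count, so low-traffic periods with a handful of errors don't page anyone:

```typescript
// The alarm expression `(errors / requests) * 100`, threshold = 2 (%).
function errorRatePct(serverErrors: number, requests: number): number {
  return (serverErrors / requests) * 100;
}

// 30 errors out of 1,000 requests in a 5-minute period is ~3% and
// breaches the 2% threshold; the same 30 errors out of 10,000 requests
// is ~0.3% and stays quiet.
```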
resource "aws_cloudwatch_metric_alarm" "latency_p95_high" {
alarm_name = "${local.full_name}-latency-p95"
alarm_description = "p95 API latency >2000ms"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "MyApp/${var.service_name}"
period = 60
extended_statistic = "p95"
threshold = 2000
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.alerts.arn]
tags = local.common_tags
}
# ─── Composite Alarm (reduce noise) ───────────────────────────────────────────
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
alarm_name = "${local.full_name}-service-degraded"
alarm_description = "Page on-call: both error rate AND latency are elevated"
# Only fire when BOTH conditions are true (reduces false positives)
alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.error_rate_high.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.latency_p95_high.alarm_name})"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
ok_actions = [aws_sns_topic.critical_alerts.arn]
tags = local.common_tags
}
# ─── SNS Topics ───────────────────────────────────────────────────────────────
resource "aws_sns_topic" "alerts" {
name = "${local.full_name}-alerts"
tags = local.common_tags
}
resource "aws_sns_topic" "critical_alerts" {
name = "${local.full_name}-critical-alerts"
tags = local.common_tags
}
resource "aws_sns_topic_subscription" "slack_alerts" {
  # A raw Slack incoming webhook can't complete the SNS subscription
  # handshake — in practice, route through AWS Chatbot or a forwarding Lambda.
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = var.slack_webhook_url
}
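Since a raw Slack incoming webhook can't confirm an SNS subscription, the SNS → Slack hop typically goes through AWS Chatbot or a small forwarding Lambda. A sketch of the transform such a Lambda performs (the alarm fields shown are part of the standard CloudWatch alarm notification; the payload shape matches Slack's incoming-webhook API):

```typescript
// SNS → Slack transform. The SNS message body for a CloudWatch alarm is
// JSON containing (among others) these fields.
interface AlarmNotification {
  AlarmName: string;
  NewStateValue: string; // "ALARM" | "OK" | "INSUFFICIENT_DATA"
  NewStateReason: string;
}

function toSlackPayload(snsMessageJson: string): { text: string } {
  const alarm: AlarmNotification = JSON.parse(snsMessageJson);
  const emoji =
    alarm.NewStateValue === "ALARM" ? ":red_circle:" : ":white_check_mark:";
  return {
    text: `${emoji} *${alarm.AlarmName}* is ${alarm.NewStateValue}\n${alarm.NewStateReason}`,
  };
}

// The Lambda handler then POSTs the payload to the webhook URL, e.g.
// fetch(webhookUrl, { method: "POST", body: JSON.stringify(payload) }).
```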
CloudWatch Dashboard (Terraform)
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = local.full_name
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 8
height = 6
properties = {
title = "Request Volume"
period = 60
stat = "Sum"
metrics = [
["MyApp/${var.service_name}", "RequestCount", { label = "Requests" }]
]
view = "timeSeries"
annotations = {
alarms = [aws_cloudwatch_metric_alarm.request_count_anomaly.arn]
}
}
},
{
type = "metric"
x = 8
y = 0
width = 8
height = 6
properties = {
title = "Error Rate (%)"
period = 60
view = "timeSeries"
metrics = [
[{ expression = "(errors/requests)*100", label = "Error Rate %", id = "e1" }],
["MyApp/${var.service_name}", "ServerErrorCount", { id = "errors", visible = false }],
["MyApp/${var.service_name}", "RequestCount", { id = "requests", visible = false }]
]
yAxis = { left = { min = 0, max = 10 } }
}
},
{
type = "metric"
x = 16
y = 0
width = 8
height = 6
properties = {
title = "API Latency"
period = 60
view = "timeSeries"
metrics = [
["MyApp/${var.service_name}", "Latency", { stat = "p50", label = "p50" }],
["MyApp/${var.service_name}", "Latency", { stat = "p95", label = "p95" }],
["MyApp/${var.service_name}", "Latency", { stat = "p99", label = "p99" }]
]
}
},
{
type = "log"
x = 0
y = 6
width = 24
height = 6
properties = {
title = "Recent Errors"
query = "SOURCE '${aws_cloudwatch_log_group.app.name}' | fields @timestamp, path, statusCode, errorCode, userId | filter statusCode >= 500 | sort @timestamp desc | limit 20"
region = var.region
view = "table"
}
}
]
})
}
Named Log Insights Queries (Saved Queries)
resource "aws_cloudwatch_query_definition" "slow_endpoints" {
name = "${local.full_name}/slow-endpoints"
log_group_names = [aws_cloudwatch_log_group.app.name]
query_string = <<-EOT
fields @timestamp, path, method, durationMs, userId
| filter type = "http_request"
| stats avg(durationMs) as avgMs, pct(durationMs, 95) as p95Ms, count() as requests by path
| sort p95Ms desc
| limit 20
EOT
}
resource "aws_cloudwatch_query_definition" "error_analysis" {
name = "${local.full_name}/error-analysis"
log_group_names = [aws_cloudwatch_log_group.app.name]
query_string = <<-EOT
fields @timestamp, path, statusCode, errorCode, userId, @message
| filter type = "http_request" and statusCode >= 400
| stats count() as errorCount by errorCode, path, statusCode
| sort errorCount desc
| limit 50
EOT
}
Cost Optimization
CloudWatch costs can surprise teams. Approximate 2026 pricing (us-east-1):
| Resource | Pricing | Typical Monthly Cost |
|---|---|---|
| Custom metrics | $0.30/metric/month (first 10K) | $15–$150 for 50–500 metrics |
| Log ingestion | $0.50/GB | $5–$50 for 10–100 GB |
| Log storage | $0.03/GB/month | $3–$30 for 100–1000 GB |
| Log Insights queries | $0.005/GB scanned | $1–$20 typical |
| Dashboard | $3/dashboard/month | $3–$15 |
| Alarms | $0.10/alarm/month | $5–$20 |
| Anomaly detection alarms | $0.30/alarm/month (billed as 3 alarm metrics) | $3–$15 |
Cost reduction tactics:
- Set log retention to 7–30 days (not forever)
- Use metric filters instead of EMF for simple counters
- Sample debug logs (1 in 10) in production
- Use CloudWatch Logs Insights only for ad-hoc queries; use dashboards for regular views
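The debug-sampling tactic can be a one-line wrapper around the logger — a sketch using a deterministic counter rather than randomness, so ingestion volume is predictable:

```typescript
// 1-in-N debug log sampler: keeps every Nth debug call, drops the rest.
function makeSampler(n: number): () => boolean {
  let count = 0;
  return () => {
    count += 1;
    return count % n === 0;
  };
}

const shouldLogDebug = makeSampler(10);

// Usage with the logger from lib/logger.ts:
// if (shouldLogDebug()) logger.debug("cache_miss", { key });
```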
Cost and Timeline Estimates
| Scope | Timeline | Engineering Cost |
|---|---|---|
| Basic alarms (CPU, memory, errors) | 0.5–1 day | $400–$800 |
| Custom metrics via EMF | 1–2 days | $800–$1,600 |
| Log Insights query library | 1 day | $600–$1,000 |
| Full Terraform-managed observability | 3–5 days | $2,400–$4,000 |
| Dashboard + anomaly detection + composite alarms | +2–3 days | $1,600–$2,400 |
| Complete production observability stack | 1–1.5 weeks | $5,000–$8,000 |
See Also
- AWS ECS Fargate Production — Container workloads generating these metrics
- AWS RDS Proxy — Database metrics to add to CloudWatch
- Kubernetes Cost Optimization — Container resource monitoring patterns
- Terraform State Management — Managing CloudWatch resources as code
Working With Viprasol
We build production observability stacks for AWS workloads—from initial CloudWatch setup through full incident response runbooks. Our cloud team has instrumented applications generating terabytes of logs per month, with dashboards that engineering and business teams actually use.
What we deliver:
- Complete Terraform-managed CloudWatch configuration
- EMF custom metric instrumentation in your application code
- Log Insights query library for your specific use cases
- Composite alarm strategy to minimize alert fatigue
- Slack/PagerDuty integration and escalation runbooks
See our cloud infrastructure services or contact us to discuss your observability requirements.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.