AWS CloudWatch Logs Insights: Query Patterns, Dashboards, Alarms, and Structured Logging
Master AWS CloudWatch Logs Insights for production observability. Covers structured JSON logging from Node.js, Logs Insights query syntax for errors and latency, CloudWatch dashboards, metric filters, and alarm configuration with Terraform.
Application logs are often the difference between a 5-minute incident resolution and a 5-hour one. CloudWatch Logs Insights lets you query structured JSON logs with a SQL-like language, detect anomalies, build dashboards, and trigger alarms — all within the AWS ecosystem. The catch: if your logs aren't structured, Logs Insights is much less powerful.
This guide covers structured logging setup, essential query patterns, metric filters, dashboard configuration, and alarm setup with Terraform.
Structured Logging in Node.js
// lib/logger.ts — structured JSON logger using pino
import pino from "pino";
export const logger = pino({
level: process.env.LOG_LEVEL ?? "info",
// CloudWatch doesn't need pretty printing — raw JSON is faster to ingest and query
formatters: {
level: (label) => ({ level: label }), // level: "info" not level: 30
bindings: () => ({}), // Remove pid/hostname (CloudWatch adds context)
},
base: {
service: process.env.SERVICE_NAME ?? "app",
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
},
// Redact sensitive fields before they hit logs
redact: {
paths: ["password", "token", "authorization", "*.password", "*.secret"],
censor: "[REDACTED]",
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// Typed child logger for request context
export function createRequestLogger(requestId: string, userId?: string) {
return logger.child({ requestId, userId });
}
// middleware/logging.ts — Next.js middleware or Express logger
import { NextRequest, NextResponse } from "next/server";
import { nanoid } from "nanoid";
import { logger } from "@/lib/logger";
export function withLogging(
handler: (req: NextRequest, log: ReturnType<typeof logger.child>) => Promise<NextResponse>
) {
return async (req: NextRequest): Promise<NextResponse> => {
const requestId = req.headers.get("x-request-id") ?? nanoid();
const log = logger.child({
requestId,
method: req.method,
url: req.nextUrl.pathname,
});
const start = Date.now();
try {
const response = await handler(req, log);
log.info({
event: "request_completed",
statusCode: response.status,
durationMs: Date.now() - start,
});
return response;
} catch (err) {
log.error({
event: "request_error",
error: err instanceof Error ? err.message : String(err),
stack: err instanceof Error ? err.stack : undefined,
durationMs: Date.now() - start,
});
throw err;
}
};
}
Example Structured Log Output
{
"level": "info",
"time": "2027-04-23T09:14:32.105Z",
"service": "api",
"environment": "production",
"version": "1.4.2",
"requestId": "req_01abc123",
"userId": "usr_abc456",
"event": "request_completed",
"method": "POST",
"url": "/api/invoices",
"statusCode": 201,
"durationMs": 142
}
Logs Insights Query Patterns
# 1. Error rate over time
filter level = "error"
| stats count(*) as errors by bin(5m) as period
| sort period asc
# 2. Top 10 slowest endpoints (p99 latency)
filter ispresent(durationMs)
| stats
count(*) as requests,
avg(durationMs) as avg_ms,
pct(durationMs, 95) as p95_ms,
pct(durationMs, 99) as p99_ms
by url
| sort p99_ms desc
| limit 10
# 3. Error breakdown by type
filter level = "error"
| stats count(*) as count by error
| sort count desc
| limit 20
# 4. Request volume per user (detect heavy users or abuse)
filter ispresent(userId) and event = "request_completed"
| stats count(*) as requests by userId
| sort requests desc
| limit 25
# 5. 5xx error rate by endpoint
filter event = "request_completed"
| stats
count(*) as total,
sum(statusCode >= 500 and statusCode < 600) as errors_5xx
by url
| fields url, total, errors_5xx,
round(errors_5xx / total * 100, 2) as error_rate_pct
| filter total > 10 # Only endpoints with meaningful volume
| sort error_rate_pct desc
# 6. Specific user journey: trace a requestId
filter requestId = "req_01abc123"
| fields time, level, event, statusCode, durationMs, error
| sort time asc
# 7. Database slow queries (if you log query durations)
filter event = "db_query" and queryMs > 500
| stats
count(*) as count,
avg(queryMs) as avg_ms,
max(queryMs) as max_ms
by query
| sort avg_ms desc
| limit 15
# 8. Failed logins by IP (security monitoring)
filter event = "login_failed"
| stats count(*) as attempts by ip
| filter attempts > 5
| sort attempts desc
# 9. Deployment impact: error rate before vs. after version change
filter ispresent(version)
| stats sum(level = "error") as errors, count(*) as total by version
| fields version, errors, total, round(errors/total*100,2) as error_pct
| sort version asc
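The queries above all rely on structured JSON fields. If some log lines are plain text (the caveat from the intro), the `parse` command can extract ad-hoc fields with a glob pattern. A hedged sketch, assuming hypothetical log lines shaped like `ERROR payment declined for user usr_123` (the field names `failure_reason` and `parsed_user` are invented for illustration):

```
# 10. Extracting fields from unstructured lines with parse (sketch)
parse @message "ERROR * for user *" as failure_reason, parsed_user
| filter ispresent(parsed_user)
| stats count(*) as failures by failure_reason, parsed_user
| sort failures desc
| limit 20
```

This works, but every query has to repeat the parse step, and you lose type-aware comparisons like `statusCode >= 500` — which is why structured JSON at the source is worth the setup cost.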
Metric Filters: Convert Logs to Metrics
# terraform/cloudwatch-metrics.tf
# Error count metric from structured logs
resource "aws_cloudwatch_log_metric_filter" "error_count" {
name = "${var.app_name}-error-count"
log_group_name = aws_cloudwatch_log_group.app.name
# Match JSON logs with level = "error"
pattern = "{ $.level = \"error\" }"
metric_transformation {
name = "ErrorCount"
namespace = "App/${var.app_name}"
value = "1"
unit = "Count"
# Include dimensions for filtering in dashboards
dimensions = {
Environment = "$.environment"
}
}
}
# p99 latency metric (requires numeric extraction)
resource "aws_cloudwatch_log_metric_filter" "request_latency" {
name = "${var.app_name}-request-latency"
log_group_name = aws_cloudwatch_log_group.app.name
pattern = "{ $.event = \"request_completed\" && $.durationMs > 0 }"
metric_transformation {
name = "RequestLatency"
namespace = "App/${var.app_name}"
value = "$.durationMs"
unit = "Milliseconds"
dimensions = {
Environment = "$.environment"
}
}
}
# 5xx error metric
resource "aws_cloudwatch_log_metric_filter" "server_errors" {
name = "${var.app_name}-5xx-errors"
log_group_name = aws_cloudwatch_log_group.app.name
pattern = "{ $.statusCode >= 500 && $.statusCode < 600 }"
metric_transformation {
name = "ServerErrors"
namespace = "App/${var.app_name}"
value = "1"
unit = "Count"
}
}
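A metric filter pattern like `{ $.level = "error" }` matches JSON log events by field predicate; every matching event contributes the transformation's `value` (here `1`) to the metric. As a mental model only — this simplified sketch mimics what the error filter selects, not CloudWatch's full filter syntax (which also supports `&&`, `||`, numeric comparisons, and wildcards):

```javascript
// Simplified sketch of what the `{ $.level = "error" }` pattern selects.
// Not CloudWatch's actual matcher — just the semantics for this one pattern.
function matchesErrorFilter(logLine) {
  try {
    const event = JSON.parse(logLine);
    return event.level === "error";
  } catch {
    return false; // non-JSON lines never match a JSON filter pattern
  }
}

console.log(matchesErrorFilter('{"level":"error","event":"request_error"}')); // matches
console.log(matchesErrorFilter('{"level":"info"}'));                          // does not
console.log(matchesErrorFilter("plain text line"));                           // does not
```

Note that the `dimensions` block creates one custom metric per unique dimension-value combination, and custom metrics are billed individually, so keep dimensions low-cardinality (e.g. `Environment`, never `userId`).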
Alarms
# terraform/cloudwatch-alarms.tf
# High error rate alarm
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "${var.app_name}-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ErrorCount"
namespace = "App/${var.app_name}"
period = 60 # 1-minute periods
statistic = "Sum"
threshold = 10 # Alert if >10 errors per minute
treat_missing_data = "notBreaching"
alarm_description = "Error count exceeded 10 per minute for 2 consecutive minutes"
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
dimensions = { Environment = var.environment }
}
# High latency alarm (p99 > 2 seconds)
resource "aws_cloudwatch_metric_alarm" "high_latency" {
alarm_name = "${var.app_name}-high-p99-latency"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
extended_statistic = "p99"
metric_name = "RequestLatency"
namespace = "App/${var.app_name}"
period = 60
threshold = 2000 # ms
treat_missing_data = "notBreaching"
alarm_description = "p99 latency > 2s for 3 consecutive minutes"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = { Environment = var.environment }
}
# Zero requests alarm (app might be down)
resource "aws_cloudwatch_metric_alarm" "no_requests" {
alarm_name = "${var.app_name}-no-requests"
comparison_operator = "LessThanThreshold"
evaluation_periods = 3
metric_name = "RequestLatency"
namespace = "App/${var.app_name}"
period = 60
statistic = "SampleCount"
threshold = 1
treat_missing_data = "breaching" # Missing data = app is down
alarm_description = "No requests received for 3 consecutive minutes"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = { Environment = var.environment }
}
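All three alarms publish to `aws_sns_topic.alerts`, which isn't defined above. A minimal sketch of the missing resource — `var.alert_email` is an assumed variable, and email subscriptions must be confirmed manually by the recipient before they deliver:

```
# terraform/sns.tf — alerts topic referenced by the alarms above (sketch)
resource "aws_sns_topic" "alerts" {
  name = "${var.app_name}-${var.environment}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # recipient must confirm the subscription
}
```

In production you'd more likely point this topic at a Slack webhook via Lambda or at PagerDuty, but the alarm wiring is identical.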
CloudWatch Dashboard
# terraform/cloudwatch-dashboard.tf
resource "aws_cloudwatch_dashboard" "app" {
dashboard_name = "${var.app_name}-${var.environment}"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0, y = 0, width = 12, height = 6
properties = {
title = "Error Rate"
period = 60
metrics = [[
"App/${var.app_name}", "ErrorCount",
"Environment", var.environment,
{ stat = "Sum", color = "#d62728" }
]]
view = "timeSeries"
yAxis = { left = { min = 0 } }
annotations = {
horizontal = [{ value = 10, label = "Alert threshold", color = "#ff7f0e" }]
}
}
},
{
type = "metric"
x = 12, y = 0, width = 12, height = 6
properties = {
title = "Request Latency"
period = 60
metrics = [
["App/${var.app_name}", "RequestLatency", "Environment", var.environment, { stat = "p50", label = "p50" }],
["App/${var.app_name}", "RequestLatency", "Environment", var.environment, { stat = "p95", label = "p95" }],
["App/${var.app_name}", "RequestLatency", "Environment", var.environment, { stat = "p99", label = "p99", color = "#d62728" }],
]
view = "timeSeries"
}
},
{
type = "log"
x = 0, y = 6, width = 24, height = 8
properties = {
title = "Recent Errors"
query = "SOURCE '${aws_cloudwatch_log_group.app.name}' | filter level = \"error\" | fields time, event, error, url, userId | sort time desc | limit 50"
region = var.aws_region
view = "table"
}
}
]
})
}
Log Retention and Cost
resource "aws_cloudwatch_log_group" "app" {
name = "/app/${var.app_name}/${var.environment}"
retention_in_days = var.environment == "production" ? 90 : 14
tags = { Environment = var.environment }
}
CloudWatch pricing (2026):
| Component | Price |
|---|---|
| Log ingestion | $0.50/GB |
| Log storage | $0.03/GB/month |
| Logs Insights queries | $0.005/GB scanned |
| Metric filters | Free (subject to a per-log-group quota) |
| Custom metrics | $0.30/metric/month |
| Alarms | $0.10/alarm/month |
For a medium SaaS app ingesting 10 GB/day: ingestion costs $5/day (~$150/month). With 90-day retention, steady-state storage is ~900 GB, or ~$27/month at $0.03/GB-month.
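The arithmetic generalizes to a simple back-of-envelope model using the prices in the table above. Query and custom-metric costs are excluded, so treat this as a rough estimate, not a billing calculator:

```javascript
// Rough monthly CloudWatch Logs cost from daily ingestion volume and retention.
// Prices: $0.50/GB ingested, $0.03/GB-month stored (from the table above).
function monthlyLogCost(gbPerDay, retentionDays) {
  const ingestion = gbPerDay * 0.5 * 30;     // 30-day month
  const storedGb = gbPerDay * retentionDays; // steady-state stored volume
  const storage = storedGb * 0.03;
  return { ingestion, storage, total: ingestion + storage };
}

// 10 GB/day with 90-day retention: $150 ingestion + $27 storage, ~$177/month
console.log(monthlyLogCost(10, 90));
```

Ingestion, not storage, dominates at typical retention windows — the highest-leverage cost lever is usually logging less (drop debug logs in production), not retaining less.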
See Also
- AWS CloudWatch Observability and Dashboards
- AWS CloudTrail Audit Logging
- OpenTelemetry in Node.js
- Node.js Performance Profiling
- AWS ECS Fargate Production Setup
Working With Viprasol
Observability is what separates teams that fix incidents in 5 minutes from teams that spend 5 hours correlating log files from three different places. Our team sets up structured JSON logging, Logs Insights query libraries, metric filters for error rate and latency, and CloudWatch dashboards that give you situational awareness at a glance.
What we deliver:
- pino structured logger with redaction for sensitive fields
- Request middleware with requestId, durationMs, statusCode
- Logs Insights saved queries for errors, latency p99, 5xx rate, user journeys
- Terraform: metric filters, CloudWatch alarms, and dashboard
- Log group retention policy matched to compliance requirements
Talk to our team about your observability stack →
Or explore our cloud infrastructure services.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.