Chaos Engineering: Fault Injection, Chaos Monkey, and Resilience Testing in Production
Implement chaos engineering to build resilient systems — fault injection principles, Chaos Monkey setup, AWS Fault Injection Simulator, game days, and measuring the results.
Chaos engineering is the practice of deliberately injecting failures into your system to discover weaknesses before they cause unplanned outages. The core premise: your system will fail. Chaos engineering lets you choose when, how, and with a team standing by — rather than discovering failure modes at 2am with customers affected.
Netflix coined the term and built Chaos Monkey. The principles apply to any system that needs to be resilient.
## The Chaos Engineering Hypothesis Model
Every chaos experiment follows the same structure:
- Define steady state: What does normal look like? (p99 latency < 200ms, error rate < 0.1%)
- Hypothesize: "We believe the system will remain in steady state when X fails"
- Introduce failure: Terminate instances, inject latency, block network traffic
- Observe: Did the system stay in steady state? What broke?
- Fix and repeat: Address discovered weaknesses; re-run experiment
This is not random destruction. It's scientific experimentation to build confidence in your system's resilience.
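The structure is mechanical enough to sketch in code. Below is a minimal, illustrative harness for running an experiment end to end — all names here are our own, not from any particular chaos tool:

```python
# experiment.py — the hypothesis model as a tiny harness (illustrative)
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str                   # "system stays in steady state when X fails"
    steady_state: Callable[[], bool]  # e.g. p99 < 200ms and error rate < 0.1%
    inject_fault: Callable[[], None]  # terminate instance, add latency, ...
    rollback: Callable[[], None]      # undo the fault if it doesn't self-heal

def run(exp: ChaosExperiment) -> bool:
    # Never start an experiment against a system that is already unhealthy
    assert exp.steady_state(), "System not in steady state — don't start"
    try:
        exp.inject_fault()
        ok = exp.steady_state()       # observe: did the hypothesis hold?
        print(f"{exp.name}: {'hypothesis held' if ok else 'weakness found'}")
        return ok
    finally:
        exp.rollback()                # fix and repeat
```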
## Start in Staging, Not Production
Most chaos engineering writing focuses on production experiments. For teams starting out, work through environments in order:
| Environment | When | What You Learn |
|---|---|---|
| Local | Always | Basic fault tolerance |
| Staging | First | Architecture weaknesses |
| Production | After staging is solid | Real-world behavior under actual load |
You gain confidence to run production experiments by first validating that your system handles failures in staging. Don't start with production chaos on Day 1.
## Simple Fault Injection: No Tools Required
Before installing Chaos Monkey, run these experiments manually:
```bash
# Kill one running ECS task (simulates instance failure)
TASK_ARN=$(aws ecs list-tasks \
  --cluster production \
  --service-name api \
  --query 'taskArns[0]' \
  --output text)

aws ecs stop-task \
  --cluster production \
  --task "$TASK_ARN" \
  --reason "Chaos engineering experiment: testing ECS auto-recovery"

# Watch: does ECS spin up a replacement? How long does it take?
# Expected: new task running within 30-60 seconds

# Check if ALB health checks removed the failing task from rotation
aws elbv2 describe-target-health \
  --target-group-arn "$TARGET_GROUP_ARN"
```
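To put a number on recovery time rather than watching the console, a short polling script can time how long ECS takes to restore the desired count. A sketch assuming boto3 credentials and the cluster/service names used above:

```python
# measure_recovery.py — poll ECS until the desired task count is restored
import time
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def wait_for_recovery(cluster: str = "production", service: str = "api",
                      timeout_s: int = 300) -> float:
    """Return seconds until runningCount reaches desiredCount again."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
        if svc["runningCount"] >= svc["desiredCount"]:
            return time.monotonic() - start
        time.sleep(5)
    raise TimeoutError(f"{service} did not recover within {timeout_s}s")

if __name__ == "__main__":
    # Run this immediately after stop-task; the first poll or two
    # may still report the pre-kill count.
    print(f"Recovered in {wait_for_recovery():.0f}s")
```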
```bash
# Simulate database connection exhaustion
# Open 100 connections that each run a 5-minute query, then observe API behavior
for i in {1..100}; do
  psql "$DATABASE_URL" -c "SELECT pg_sleep(300);" &
done

# Hypothesis: API should return 503 (not hang) when DB pool exhausted
# Expected: connection pool timeout returns error within 5s
# Actual: might hang for 30s without proper timeout configuration
```
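The fix for the 30-second hang is a bounded pool-acquisition timeout that the API maps to a 503. A sketch using SQLAlchemy — an assumption, since the article doesn't name the API's database layer:

```python
# db.py — bound how long a request waits for a connection
# Sketch assuming SQLAlchemy; adapt to whatever pool your API uses.
from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = create_engine(
    "postgresql://user:pass@db-host/app",  # placeholder DSN
    pool_size=20,
    max_overflow=0,   # don't grow past pool_size under pressure
    pool_timeout=5,   # raise after 5s instead of queueing forever
)

def run_query(sql: str):
    try:
        with engine.connect() as conn:
            return conn.exec_driver_sql(sql).fetchall()
    except PoolTimeoutError:
        # Re-raise so the API layer can return 503 instead of hanging
        raise
```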
## AWS Fault Injection Simulator (FIS)
AWS FIS provides managed fault injection for AWS infrastructure — terminate EC2 instances, throttle API calls, inject network latency, kill ECS tasks.
```hcl
# terraform/fis.tf — define reusable experiments

# IAM role for FIS to take actions
resource "aws_iam_role" "fis_role" {
  name = "fis-experiment-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "fis.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "fis_policy" {
  role = aws_iam_role.fis_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecs:StopTask", "ecs:ListTasks"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/Environment" = "staging"
          }
        }
      },
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances", "ec2:TerminateInstances"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/ChaosTarget" = "true"
          }
        }
      }
    ]
  })
}

# Experiment: terminate 50% of ECS tasks (simulates AZ failure)
resource "aws_fis_experiment_template" "ecs_task_termination" {
  description = "Terminate 50% of API service tasks to test auto-recovery"
  role_arn    = aws_iam_role.fis_role.arn

  action {
    name      = "TerminateTasks"
    action_id = "aws:ecs:stop-task"
    # No parameters needed — the 50% selection is handled by the
    # target's selection_mode below.
    target {
      key   = "Tasks"
      value = "ecsTaskTargets"
    }
  }

  target {
    name           = "ecsTaskTargets"
    resource_type  = "aws:ecs:task"
    selection_mode = "PERCENT(50)"
    resource_tag {
      key   = "aws:ecs:serviceName"
      value = "api-service"
    }
  }

  # Stop conditions — halt the experiment if things get too bad
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate_critical.arn
  }

  tags = { Environment = "staging" }
}
```
```python
# run_experiment.py — start FIS experiment and monitor results
import time

import boto3

fis = boto3.client('fis', region_name='us-east-1')
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def run_chaos_experiment(template_id: str, duration_seconds: int = 300):
    print(f"Starting chaos experiment: {template_id}")

    # Capture baseline metrics
    baseline = get_metrics('before')

    # Start experiment
    response = fis.start_experiment(
        experimentTemplateId=template_id,
        tags={'RunBy': 'chaos-team', 'Purpose': 'resilience-testing'},
    )
    experiment_id = response['experiment']['id']
    print(f"Experiment running: {experiment_id}")

    peak_error_rate = baseline['error_rate']
    try:
        # Monitor during the experiment, tracking the worst error rate seen
        for i in range(duration_seconds // 10):
            time.sleep(10)
            metrics = get_metrics('during')
            peak_error_rate = max(peak_error_rate, metrics['error_rate'])
            print(f"t+{(i + 1) * 10}s | Error rate: {metrics['error_rate']:.2f}% | p99: {metrics['p99_ms']:.0f}ms")
    finally:
        # Wait for the system to stabilize, then capture recovery metrics
        time.sleep(120)
        recovery = get_metrics('after')
        print("\n=== Experiment Results ===")
        print(f"Baseline error rate: {baseline['error_rate']:.2f}%")
        print(f"Peak error rate: {peak_error_rate:.2f}%")
        print(f"Post-recovery error rate: {recovery['error_rate']:.2f}%")
        print(f"Steady state resumed: {'Yes' if recovery['error_rate'] < 0.5 else 'No'}")

def get_metrics(phase: str) -> dict:
    # Query CloudWatch for current error rate and p99 latency
    # ... implementation
    pass
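```

A typical invocation — the template ID here is a hypothetical placeholder; the real one comes from the Terraform output or the FIS console:

```python
# Kick off a 5-minute run against the template created above.
# "EXT1a2b3c4d" is a placeholder experiment template ID.
run_chaos_experiment("EXT1a2b3c4d", duration_seconds=300)
```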
## The Game Day
A game day is a structured chaos engineering exercise where the team deliberately causes failures and practices responding:
```markdown
# Game Day Runbook: "What Happens When the Database Goes Down?"
## Date: 2026-05-15 | Time: 10:00 AM–12:00 PM | Team: Engineering + On-Call
## Hypothesis
When the primary PostgreSQL instance becomes unavailable, the system will:
- Automatically fail over to read replica within 60 seconds (RDS Multi-AZ)
- Return 503 errors (not 502 or timeouts) for write operations during failover
- Recover to normal operation within 90 seconds
- Generate a PagerDuty alert within 2 minutes
## Setup
- Notify stakeholders (outage window is intentional)
- Confirm staging environment health
- Set up monitoring dashboard (Datadog, CloudWatch)
- Assign roles: Experiment operator, Observer, Scribe
## Experiment Steps
- 10:00 — Capture baseline metrics
- 10:05 — Operator reboots primary RDS instance
- 10:05–10:10 — Team observes behavior; Scribe records timeline
- 10:10 — Did failover complete? How long did it take?
- 10:15 — Examine error logs during failover window
- 10:30 — Debrief: what worked, what didn't
## Success Criteria
- Failover completed in < 90 seconds
- API returned 503 (not 504) during failover
- PagerDuty alert fired within 2 minutes
- No data loss
## What to Capture
- Exact timestamps of failure and recovery
- Error messages from API during outage
- Customer-visible impact (yes/no, duration)
- Any surprises (things that broke unexpectedly)
```
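For the 10:05 step ("Operator reboots primary RDS instance"), a Multi-AZ failover can be forced with an RDS reboot. A sketch using boto3 — the instance identifier is a placeholder:

```python
# trigger_failover.py — force an RDS Multi-AZ failover (staging!)
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# reboot_db_instance with ForceFailover=True fails a Multi-AZ instance
# over to its standby, which is exactly what this game day rehearses.
rds.reboot_db_instance(
    DBInstanceIdentifier="staging-postgres",  # placeholder identifier
    ForceFailover=True,
)
```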
---
## Application-Level Fault Injection
Inject failures in your application code during testing:
```typescript
// lib/chaos.ts — fault injection for staging/testing
import type { FastifyRequest, FastifyReply } from 'fastify';

const CHAOS_ENABLED = process.env.CHAOS_ENABLED === 'true';

export function chaosMiddleware() {
  // Async Fastify hooks must not also take the `done` callback —
  // returning the promise is enough.
  return async (request: FastifyRequest, reply: FastifyReply) => {
    if (!CHAOS_ENABLED) return;

    const roll = Math.random();

    // 5% of requests: inject 2-second latency
    if (roll < 0.05) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
    // 2% of requests: return 500 error
    else if (roll < 0.07) {
      return reply.code(500).send({ error: 'Chaos engineering: injected failure' });
    }
    // 1% of requests: return 503 (service unavailable)
    else if (roll < 0.08) {
      return reply.code(503).send({ error: 'Chaos engineering: service unavailable' });
    }
  };
}

// Database chaos: randomly fail queries
export async function chaosQuery<T>(queryFn: () => Promise<T>): Promise<T> {
  if (CHAOS_ENABLED && Math.random() < 0.02) {
    throw new Error('Chaos engineering: simulated database failure');
  }
  return queryFn();
}
```
## What to Experiment On First
Prioritize experiments that test your most critical resilience mechanisms:
| Experiment | Hypothesis | Tool |
|---|---|---|
| Kill one ECS task | Service continues with remaining tasks | AWS FIS |
| Kill all tasks in one AZ | Multi-AZ routing works correctly | AWS FIS |
| RDS failover | Read replica promotes in < 90s | RDS reboot with failover |
| Redis unavailable | App degrades gracefully (cache miss, not crash) | Stop Redis container |
| Slow downstream API | Circuit breaker opens; returns cached/default response | tc (traffic control) |
| CPU spike | Horizontal scaling triggers correctly | stress-ng |
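For the slow-downstream-API row, `tc` netem injects latency at the network layer. A minimal wrapper sketch (Linux only, requires root; the interface name and delay values are assumptions):

```python
# net_chaos.py — add/remove artificial latency with tc netem (Linux, root)
import subprocess

IFACE = "eth0"  # illustrative interface name

def add_latency(delay_ms: int = 300, jitter_ms: int = 50) -> None:
    # netem delays all egress traffic on the interface
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True,
    )

def clear_latency() -> None:
    subprocess.run(
        ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    add_latency()
    input("Latency active — run your experiment, then press Enter to clear...")
    clear_latency()
```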
## Working With Viprasol
We run chaos engineering engagements for client systems — designing experiments, building fault injection infrastructure, running game days with your team, and helping address discovered weaknesses. Resilience is an engineering investment that pays off in reduced incident frequency and faster recovery.
→ Talk to our cloud team about chaos engineering and resilience testing.
## See Also
- Observability and Monitoring — the prerequisite for meaningful chaos experiments
- DevOps Best Practices — CI/CD and incident response foundations
- Load Testing Tools — performance testing before chaos testing
- Kubernetes vs ECS — container orchestration resilience features
- Cloud Solutions — cloud architecture and reliability engineering
## About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.