Cloud Cost Engineering: Rightsizing, Reserved Instances, Spot Fleets, and Savings Plans
Cut AWS cloud costs 40–70% with systematic rightsizing, Compute Savings Plans, Spot Fleet strategies, container cost allocation, and FinOps practices that scale with your organization.
The average company wastes 32% of its cloud spend, according to Flexera's 2025 State of the Cloud Report. On a $500K/year AWS bill, that's $160K in waste — more than a mid-level engineer's salary.
Cloud cost engineering is not about cutting corners. It's about paying the right price for the capacity you actually use, and using the financial instruments AWS provides to reduce that price by 40–70%.
The Cost Reduction Stack
Work through these from the bottom up — each layer builds on the one below:
Layer 5: FinOps culture (showback, chargeback, unit economics)
Layer 4: Architecture optimization (caching, CDN, right services)
Layer 3: Commitment discounts (Savings Plans, Reserved Instances)
Layer 2: Rightsizing (match instance size to actual usage)
Layer 1: Waste elimination (idle resources, orphaned volumes, unused IPs)
Most teams skip to Layer 3 (commitments) without doing Layer 1–2, which means they're committing to the wrong amount of the wrong resource types.
Layer 1: Waste Elimination
Start with resources that are running and doing nothing:
#!/bin/bash
# scripts/find-waste.sh
# Find common categories of wasted AWS spend
echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,CreateTime,AvailabilityZone]' \
--output table
echo ""
echo "=== Idle Target Groups (0 healthy targets) ==="
aws elbv2 describe-target-groups --query 'TargetGroups[*].TargetGroupArn' --output text | \
tr '\t' '\n' | \
xargs -I {} sh -c '
COUNT=$(aws elbv2 describe-target-health --target-group-arn {} --query "length(TargetHealthDescriptions[?TargetHealth.State==\`healthy\`])" --output text 2>/dev/null || echo 0)
if [ "$COUNT" -eq 0 ]; then echo "Idle: {}"; fi
'
echo ""
echo "=== Elastic IPs Not Associated with Instances ==="
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].[AllocationId,PublicIp]' \
--output table
echo ""
echo "=== Snapshots Older Than 90 Days ==="
# GNU date syntax; on macOS use: CUTOFF=$(date -u -v-90d '+%Y-%m-%dT%H:%M:%S')
CUTOFF=$(date -d '90 days ago' --iso-8601=seconds)
aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<'${CUTOFF}'].[SnapshotId,VolumeSize,StartTime]" \
--output table
echo ""
echo "=== Stopped EC2 Instances (still charging for EBS) ==="
aws ec2 describe-instances \
--filters Name=instance-state-name,Values=stopped \
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,Tags[?Key==`Name`].Value|[0],LaunchTime]' \
--output table
Automated Waste Cleanup with AWS Lambda
// src/lambda/cost-cleanup/handler.ts
import {
  EC2Client,
  DescribeVolumesCommand,
  CreateTagsCommand,
} from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: process.env.AWS_REGION });

export async function handler(): Promise<void> {
  // Find "available" (unattached) volumes older than 7 days.
  // CreateTime is a proxy: detach time isn't exposed by this API.
  const { Volumes } = await ec2.send(
    new DescribeVolumesCommand({
      Filters: [{ Name: "status", Values: ["available"] }],
    })
  );

  const staleVolumes = (Volumes ?? []).filter((v) => {
    const createTime = new Date(v.CreateTime!);
    const daysSinceCreation = (Date.now() - createTime.getTime()) / (1000 * 60 * 60 * 24);
    return daysSinceCreation > 7;
  });

  console.log(`Found ${staleVolumes.length} stale volumes`);

  for (const volume of staleVolumes) {
    // Tag for deletion review instead of deleting immediately.
    // A follow-up job can alert to Slack and delete volumes whose
    // 7-day review window has expired.
    console.log(`Tagging volume ${volume.VolumeId} for deletion review`);
    await ec2.send(
      new CreateTagsCommand({
        Resources: [volume.VolumeId!],
        Tags: [
          {
            Key: "deletion-scheduled",
            Value: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000).toISOString(),
          },
        ],
      })
    );
  }
}
Typical savings from waste elimination: 5–15% of the total bill.
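Unattached volumes translate directly into dollars. A quick sketch, assuming a gp3 rate of roughly $0.08/GB-month in us-east-1 (an illustrative figure; check current pricing for your region and volume type):

```python
# Estimate monthly spend on unattached EBS volumes.
# Rate is an assumption: roughly $0.08/GB-month for gp3 in us-east-1.
GP3_RATE_PER_GB_MONTH = 0.08

def unattached_volume_cost(volume_sizes_gb: list[int]) -> float:
    """Monthly cost of volumes that are provisioned but attached to nothing."""
    return round(sum(volume_sizes_gb) * GP3_RATE_PER_GB_MONTH, 2)

# Example: five forgotten 100 GB volumes and one 500 GB volume
print(unattached_volume_cost([100, 100, 100, 100, 100, 500]))  # → 80.0
```

Feed it the `Size` column from the `describe-volumes` output above to get a per-month waste figure for the deletion-review ticket.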
Layer 2: Rightsizing
Finding Over-Provisioned Instances
# scripts/rightsize_analysis.py
import boto3
from datetime import datetime, timedelta
from typing import NamedTuple
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
ec2 = boto3.client('ec2', region_name='us-east-1')
class InstanceMetrics(NamedTuple):
instance_id: str
instance_type: str
avg_cpu_percent: float
max_cpu_percent: float
avg_memory_percent: float
recommendation: str
def analyze_instance(instance_id: str, instance_type: str) -> InstanceMetrics:
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=14)
def get_metric(metric_name: str, namespace: str = 'AWS/EC2') -> tuple[float, float]:
response = cloudwatch.get_metric_statistics(
Namespace=namespace,
MetricName=metric_name,
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600, # 1-hour buckets
Statistics=['Average', 'Maximum'],
)
datapoints = response['Datapoints']
if not datapoints:
return 0.0, 0.0
avg = sum(d['Average'] for d in datapoints) / len(datapoints)
maximum = max(d['Maximum'] for d in datapoints)
return avg, maximum
avg_cpu, max_cpu = get_metric('CPUUtilization')
# Memory requires CloudWatch agent
avg_mem, _ = get_metric('mem_used_percent', 'CWAgent')
# Rightsizing recommendation
recommendation = "OK"
if avg_cpu < 10 and max_cpu < 40:
recommendation = "DOWNSIZE: CPU consistently low"
elif avg_cpu < 20 and avg_mem < 20:
recommendation = "DOWNSIZE: Both CPU and memory underutilized"
elif avg_cpu > 80:
recommendation = "UPSIZE: CPU consistently high"
return InstanceMetrics(
instance_id=instance_id,
instance_type=instance_type,
avg_cpu_percent=round(avg_cpu, 1),
max_cpu_percent=round(max_cpu, 1),
avg_memory_percent=round(avg_mem, 1),
recommendation=recommendation,
)
# Rough monthly savings from dropping one instance size. Prices are
# illustrative us-east-1 on-demand rates; adjust for your region/families.
HOURLY_PRICE = {
    'm6i.xlarge': 0.192,
    'm6i.2xlarge': 0.384,
    'r6i.xlarge': 0.252,
}

def estimate_downsize_savings(instance_type: str) -> float:
    hourly = HOURLY_PRICE.get(instance_type, 0.0)
    return round(hourly / 2 * 730, 2)  # one size down roughly halves the price

# Run analysis
paginator = ec2.get_paginator('describe_instances')
for page in paginator.paginate(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]):
    for reservation in page['Reservations']:
        for instance in reservation['Instances']:
            metrics = analyze_instance(instance['InstanceId'], instance['InstanceType'])
            if 'DOWNSIZE' in metrics.recommendation:
                monthly_savings = estimate_downsize_savings(metrics.instance_type)
                print(f"{metrics.instance_id} ({metrics.instance_type}): {metrics.recommendation}")
                print(f"  CPU avg/max: {metrics.avg_cpu_percent}% / {metrics.max_cpu_percent}%")
                print(f"  Estimated monthly savings: ${monthly_savings}")
Typical savings from rightsizing: 15–30% of compute costs.
Layer 3: Commitment Discounts
Savings Plans vs Reserved Instances
SAVINGS PLANS (recommended for most):
├── Compute Savings Plans (most flexible)
│ ├── Applies to EC2, Fargate, Lambda
│ ├── Any region, any instance family, any OS
│ └── Up to 66% savings vs On-Demand
├── EC2 Instance Savings Plans
│ ├── Applies to specific instance family in one region
│ └── Up to 72% savings vs On-Demand
└── SageMaker Savings Plans
└── SageMaker only, up to 64%
RESERVED INSTANCES (use for specific cases):
├── RDS instances (no Savings Plans option)
├── ElastiCache, Redshift, OpenSearch
└── EC2 if you need specific hardware guarantees
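To see what this decision tree means in dollars, here is a small sketch comparing effective monthly cost under different commitment options. Both the on-demand rate and the discount percentages are illustrative assumptions, not AWS price-list figures; real rates vary by instance family, region, term, and payment option:

```python
# Effective monthly cost for one instance under different commitments.
# ON_DEMAND_HOURLY and DISCOUNTS are assumed, illustrative values.
ON_DEMAND_HOURLY = 0.192  # e.g. an m6i.xlarge-class rate in us-east-1

DISCOUNTS = {
    "on_demand": 0.0,
    "compute_sp_1yr_no_upfront": 0.28,
    "compute_sp_3yr_all_upfront": 0.54,
    "ec2_instance_sp_3yr": 0.60,
}

def monthly_cost(option: str, hours: int = 730) -> float:
    """730 hours ≈ one month of continuous runtime."""
    return round(ON_DEMAND_HOURLY * (1 - DISCOUNTS[option]) * hours, 2)

for option in DISCOUNTS:
    print(f"{option}: ${monthly_cost(option)}/month")
```

The gap between the 1-year and 3-year rows is the price of flexibility: the longer term pays more only if the workload is still running (and still on a covered service) in year three.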
Calculating Optimal Commitment Level
// src/scripts/savings-plan-calculator.ts
import {
  CostExplorerClient,
  GetSavingsPlansPurchaseRecommendationCommand,
} from "@aws-sdk/client-cost-explorer";
const ce = new CostExplorerClient({ region: "us-east-1" });
interface SavingsPlanRecommendation {
termInYears: 1 | 3;
paymentOption: "NoUpfront" | "PartialUpfront" | "AllUpfront";
hourlyCommitment: number;
estimatedMonthlySavings: number;
estimatedSavingsPercentage: number;
estimatedROI: number;
}
export async function getSavingsPlanRecommendations(): Promise<
SavingsPlanRecommendation[]
> {
const response = await ce.send(
new GetSavingsPlansPurchaseRecommendationCommand({
SavingsPlansType: "COMPUTE_SP",
TermInYears: "ONE_YEAR",
PaymentOption: "NO_UPFRONT",
LookbackPeriodInDays: "SIXTY_DAYS",
})
);
return (
response.SavingsPlansPurchaseRecommendation?.SavingsPlansPurchaseRecommendationDetails?.map(
(detail) => ({
termInYears: 1,
paymentOption: "NoUpfront",
hourlyCommitment: Number(
detail.HourlyCommitmentToPurchase ?? 0
),
estimatedMonthlySavings: Number(
detail.EstimatedMonthlySavingsAmount ?? 0
),
estimatedSavingsPercentage: Number(
detail.EstimatedSavingsPercentage ?? 0
),
// Savings per committed dollar: monthly savings / monthly commitment
estimatedROI:
  Number(detail.HourlyCommitmentToPurchase ?? 0) > 0
    ? Number(detail.EstimatedMonthlySavingsAmount ?? 0) /
      (Number(detail.HourlyCommitmentToPurchase) * 730)
    : 0,
})
) ?? []
);
}
// Rule of thumb: commit in tranches: a 1-yr plan at your 30-day p20
// (always-on baseline), optionally a 3-yr plan up to p80.
// Everything above that goes On-Demand or Spot.
Commitment Strategy
Baseline (always running): Cover with Savings Plans (1yr No-Upfront)
┌────────────────────────────┐
│ Savings Plan commits: │
│ $X/hour (30-day p20 spend) │
└────────────────────────────┘
Variable (predictable peaks): Cover with Savings Plans (3yr if >$50K)
┌────────────────────────────┐
│ Savings Plan covers: │
│ $Y/hour (30-day p80 spend) │
└────────────────────────────┘
Burst (spiky, unpredictable): On-Demand or Spot
┌────────────────────────────┐
│ On-Demand / Spot Fleet │
│ for everything above p80 │
└────────────────────────────┘
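The p20/p80 tranches above can be derived from a month of hourly spend data. Here is a sketch using a nearest-rank percentile over a synthetic series; in practice you would pull hourly amortized spend from Cost Explorer instead:

```python
# Derive commitment tranches from ~730 hourly compute-spend samples.
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of hourly spend values."""
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

# Synthetic month: $40/h overnight baseline, $70/h weekday load, $100/h peaks
hourly_spend = [40.0] * 400 + [70.0] * 250 + [100.0] * 80

baseline_commit = percentile(hourly_spend, 20)  # cover with a 1-yr plan
peak_commit = percentile(hourly_spend, 80)      # consider a 3-yr plan up to here
print(f"1-yr tranche: ${baseline_commit}/hour, 3-yr tranche up to ${peak_commit}/hour")
```

With this series the p20 lands on the $40/hour baseline and the p80 on the $70/hour weekday plateau, so the spiky top 20% stays uncommitted.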
Layer 4: Spot Fleet for Stateless Workloads
Spot instances are spare AWS capacity at 60–90% discount. They can be interrupted with 2-minute notice, making them ideal for stateless workloads:
# terraform/spot-fleet.tf
resource "aws_spot_fleet_request" "batch_workers" {
iam_fleet_role = aws_iam_role.spot_fleet.arn
target_capacity = 10
# Use multiple instance types for availability
# If one pool runs out, Spot Fleet uses another
launch_template_config {
launch_template_specification {
id = aws_launch_template.worker.id
version = "$Latest"
}
overrides {
instance_type = "m6i.xlarge"
weighted_capacity = 1
availability_zone = "us-east-1a"
}
overrides {
instance_type = "m6a.xlarge"
weighted_capacity = 1
availability_zone = "us-east-1b"
}
overrides {
instance_type = "m5.xlarge"
weighted_capacity = 1
availability_zone = "us-east-1c"
}
overrides {
instance_type = "r6i.large"
weighted_capacity = 1
availability_zone = "us-east-1a"
}
}
  # fleet_type defaults to "maintain", which replaces interrupted
  # instances automatically
  allocation_strategy = "capacityOptimized"

  # Keep a small always-on On-Demand baseline inside the fleet
  on_demand_target_capacity = 2

  instance_interruption_behaviour = "terminate"
tags = {
Name = "batch-workers-spot-fleet"
}
}
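Workloads on these fleets should also watch for the 2-minute interruption notice themselves. A minimal watcher against the instance metadata service, where `drain` is a placeholder for your own shutdown hook (this sketch uses IMDSv1 for brevity; production code should fetch an IMDSv2 token first):

```python
# Poll EC2 instance metadata for a Spot interruption notice.
# /spot/instance-action returns 404 normally and a JSON body once the
# 2-minute interruption notice has been issued.
import time
import urllib.request
import urllib.error

IMDS = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(IMDS, timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # 404 (no notice yet) or not running on EC2

def watch(drain, poll_seconds: int = 5) -> None:
    """Block until an interruption notice appears, then run the drain hook."""
    while not interruption_pending():
        time.sleep(poll_seconds)
    drain()  # stop taking work, checkpoint, deregister from the target group
```

Run `watch` in a sidecar thread or process; two minutes is enough to finish in-flight requests and hand queued work back, but not enough to finish a long batch job, so checkpoint early and often.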
ECS Capacity Provider with Spot
# terraform/ecs-capacity-provider.tf
resource "aws_ecs_capacity_provider" "spot" {
name = "spot-workers"
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.spot.arn
managed_scaling {
maximum_scaling_step_size = 10
minimum_scaling_step_size = 1
status = "ENABLED"
target_capacity = 90
}
managed_termination_protection = "DISABLED"
}
}
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = [
"FARGATE", # Always-on baseline
"FARGATE_SPOT", # Cheap for batch/dev
aws_ecs_capacity_provider.spot.name
]
default_capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 1
base = 2 # Minimum 2 Fargate tasks always
}
default_capacity_provider_strategy {
capacity_provider = "FARGATE_SPOT"
weight = 4 # 4x more likely to use Spot
base = 0
}
}
Layer 5: FinOps — Cost Allocation and Accountability
Without cost allocation, engineers have no incentive to optimize:
// src/lambda/cost-reporter/handler.ts
import {
CostExplorerClient,
GetCostAndUsageCommand,
} from "@aws-sdk/client-cost-explorer";
const ce = new CostExplorerClient({ region: "us-east-1" });
export async function weeklyTeamCostReport(): Promise<void> {
const endDate = new Date().toISOString().split("T")[0];
const startDate = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
.toISOString()
.split("T")[0];
const response = await ce.send(
new GetCostAndUsageCommand({
TimePeriod: { Start: startDate, End: endDate },
Granularity: "DAILY",
GroupBy: [{ Type: "TAG", Key: "team" }], // Requires tagging all resources
Metrics: ["UnblendedCost"],
})
);
// Build team cost report
const teamCosts: Record<string, number> = {};
for (const result of response.ResultsByTime ?? []) {
for (const group of result.Groups ?? []) {
const team = group.Keys?.[0]?.replace("team$", "") ?? "untagged";
const cost = Number(group.Metrics?.UnblendedCost?.Amount ?? 0);
teamCosts[team] = (teamCosts[team] ?? 0) + cost;
}
}
// Post to Slack (postToSlack is an assumed webhook helper defined elsewhere)
const message = Object.entries(teamCosts)
.sort(([, a], [, b]) => b - a)
.map(([team, cost]) => `${team}: $${cost.toFixed(2)}`)
.join("\n");
await postToSlack({
channel: "#engineering-costs",
text: `Weekly AWS Cost by Team (${startDate} → ${endDate})\n\`\`\`\n${message}\n\`\`\``,
});
}
Resource Tagging Policy
// scripts/enforce-tags.ts
// Run as a CI check against the JSON output of `terraform show -json plan.out`
const REQUIRED_TAGS = ["team", "service", "environment", "cost-center"] as const;
interface TerraformPlan {
resource_changes: Array<{
type: string;
change: {
actions: string[];
after: {
tags?: Record<string, string>;
};
};
}>;
}
export function validateTagCompliance(plan: TerraformPlan): string[] {
const violations: string[] = [];
const TAGGABLE_TYPES = new Set([
"aws_instance",
"aws_db_instance",
"aws_elasticache_cluster",
"aws_ecs_service",
"aws_lambda_function",
"aws_s3_bucket",
]);
for (const resource of plan.resource_changes) {
if (!TAGGABLE_TYPES.has(resource.type)) continue;
if (!resource.change.actions.includes("create") && !resource.change.actions.includes("update")) continue;
const tags = resource.change.after.tags ?? {};
const missingTags = REQUIRED_TAGS.filter((tag) => !tags[tag]);
if (missingTags.length > 0) {
violations.push(
`${resource.type}: missing required tags: ${missingTags.join(", ")}`
);
}
}
return violations;
}
Cost Savings Reference
| Optimization | Effort | Typical Savings |
|---|---|---|
| Eliminate idle resources | Low (hours) | 5–15% |
| Rightsize EC2/RDS | Medium (days) | 15–30% |
| 1-yr Compute Savings Plan (No Upfront) | Low (minutes) | 33–40% on committed spend |
| 3-yr Compute Savings Plan (All Upfront) | Low (minutes) | 60–66% on committed spend |
| Spot Fleet for batch workloads | Medium (days) | 60–80% vs On-Demand |
| S3 Intelligent Tiering | Low (hours) | 10–40% on storage |
| Reserved RDS (1-yr) | Low (minutes) | 30–35% on database |
| Fargate Spot for dev/staging | Low (hours) | 60–70% on dev compute |
Combined realistic savings on a $500K/year bill: $150K–$300K/year with 2–4 weeks of engineering effort.
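These layers stack multiplicatively, not additively: each discount applies to whatever spend the previous layer left behind. A sanity check with illustrative mid-range rates (the rate values are assumptions picked from the table above, not AWS figures):

```python
# Savings layers compound: each rate applies to the spend remaining
# after the previous layer. Rates are illustrative mid-range picks.
LAYERS = {
    "waste_elimination": 0.12,
    "rightsizing": 0.20,
    "savings_plan_on_remaining": 0.30,
}

def remaining_after_layers(annual_spend: float) -> float:
    for rate in LAYERS.values():
        annual_spend *= 1 - rate
    return round(annual_spend, 2)

spend = 500_000.0
final = remaining_after_layers(spend)
print(f"${spend:,.0f} -> ${final:,.2f} ({(1 - final / spend):.0%} saved)")
# → $500,000 -> $246,400.00 (51% saved)
```

Roughly $254K saved on a $500K bill, consistent with the $150K–$300K range above; note that a naive sum of the three rates (62%) would overstate the result.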
See Also
- Kubernetes Operators: Automating Complex Workloads — K8s cost allocation
- Infrastructure Cost Tagging Strategies — tagging for FinOps
- Multi-Cloud Strategy and Architecture — avoiding vendor lock-in
- Serverless Architecture: When Lambda Beats EC2 — serverless economics
Working With Viprasol
Our cloud engineering team has performed cost optimization engagements for SaaS companies spending $50K–$2M/year on AWS. We combine automated waste detection, rightsizing analysis, and commitment strategy to deliver 30–60% cost reductions — typically within 30 days.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.