# Kubernetes Cost Optimization: VPA, HPA, Bin-Packing, Spot Nodes, and Karpenter
Cut Kubernetes costs by 40–60%: configure Vertical Pod Autoscaler for right-sizing, Horizontal Pod Autoscaler for traffic-based scaling, bin-packing with pod topology, spot node groups with Karpenter, and idle resource cleanup.
Most Kubernetes clusters are significantly over-provisioned. Developers set resource requests conservatively (or copy from StackOverflow), nodes run at 20–30% utilization, and the bill accumulates quietly. A systematic cost optimization pass typically finds 40–60% savings without touching application performance.
The optimization hierarchy: right-size pods first (VPA), scale to demand (HPA/KEDA), pack pods efficiently (topology), then use spot for everything that tolerates interruption (Karpenter).
## Step 1: Right-Sizing with VPA
Vertical Pod Autoscaler observes actual CPU and memory usage, then recommends (or automatically adjusts) resource requests:
```yaml
# kubernetes/vpa/recommendation-mode.yaml
# Start in recommendation mode — don't auto-apply yet
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only — read with: kubectl get vpa api-server-vpa
  resourcePolicy:
    containerPolicies:
      - containerName: api
        # Set bounds to prevent VPA from recommending unreasonable values
        minAllowed:
          cpu: 50m
          memory: 128Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
```
```bash
# Read VPA recommendations after 24–48 hours of traffic
kubectl describe vpa api-server-vpa -n production

# Output (relevant section):
# Recommendation:
#   Container Recommendations:
#     Container Name:  api
#     Lower Bound:
#       CPU:     80m
#       Memory:  256Mi
#     Target:            ← Use these values for your resource requests
#       CPU:     150m
#       Memory:  512Mi
#     Upper Bound:
#       CPU:     800m
#       Memory:  1.5Gi
#     Uncapped Target:
#       CPU:     150m
#       Memory:  512Mi
```
```yaml
# Apply VPA recommendations to your deployment
# kubernetes/deployments/api-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              cpu: "150m"      # VPA recommended target
              memory: "512Mi"  # VPA recommended target
            limits:
              cpu: "800m"      # VPA upper bound
              memory: "1.5Gi"  # Match memory limit to the upper bound (prevents OOM kills)
```
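Turning a VPA recommendation into concrete requests and limits can be sketched in a few lines. The helper below is hypothetical (VPA only emits the numbers; the 1.2× CPU buffer is a common convention, not something the autoscaler produces):

```python
import math

def requests_from_vpa(target_cpu_m: int, target_mem_mi: int,
                      upper_cpu_m: int, upper_mem_mi: int,
                      cpu_buffer: float = 1.2) -> dict:
    """Derive container requests/limits from VPA's Target and Upper Bound.

    Requests come straight from the Target. The CPU limit adds a buffer above
    the Upper Bound (CPU is compressible, throttling is survivable); the
    memory limit matches the Upper Bound, since an over-tight memory limit
    means OOM kills rather than throttling.
    """
    return {
        "requests": {"cpu": f"{target_cpu_m}m", "memory": f"{target_mem_mi}Mi"},
        "limits": {
            "cpu": f"{math.ceil(upper_cpu_m * cpu_buffer)}m",
            "memory": f"{upper_mem_mi}Mi",
        },
    }

# Using the recommendation above: Target 150m/512Mi, Upper Bound 800m/1536Mi
print(requests_from_vpa(150, 512, 800, 1536))
```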
VPA in Auto mode — only enable after validating recommendations:
```yaml
spec:
  updatePolicy:
    updateMode: "Auto"  # VPA will evict and restart pods to apply new requests
    # Note: causes brief restarts — only use with a proper PodDisruptionBudget
```
## Step 2: Horizontal Scaling with HPA
HPA scales pod count based on metrics. Don't just use CPU — scale on what actually drives load:
```yaml
# kubernetes/hpa/api-server-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2   # Never go below 2 for HA
  maxReplicas: 20
  metrics:
    # Scale on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Target 70% CPU — leaves headroom for spikes
    # Scale on a custom metric: requests per second per pod
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # Scale out when average exceeds 100 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Smooth over the last 60s of recommendations
      policies:
        - type: Pods
          value: 4             # Add at most 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25            # Remove at most 25% of pods at a time
          periodSeconds: 60
```
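Under the hood, the HPA controller computes the desired replica count with a simple ratio formula. A minimal sketch of that calculation (the 10% tolerance band matches the controller-manager's default; the policy/stabilization machinery above is omitted):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, min_replicas: int = 2,
                         max_replicas: int = 20, tolerance: float = 0.1) -> int:
    """Core HPA formula: desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]. Within the tolerance band around the
    target, HPA leaves the replica count alone to avoid flapping."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 105 RPS against a 100 RPS target: within tolerance, no change
print(hpa_desired_replicas(4, 105, 100))  # 4
# 4 pods averaging 180 RPS: scale up to ceil(4 * 1.8) = 8
print(hpa_desired_replicas(4, 180, 100))  # 8
```

This is why per-pod averages matter: halving a pod's RPS target roughly doubles the replica count at the same load.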
### KEDA for Event-Driven Scaling
```yaml
# kubernetes/keda/queue-scaler.yaml
# Scale workers based on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: background-worker
  minReplicaCount: 0    # Scale to zero when the queue is empty (saves cost)
  maxReplicaCount: 50
  pollingInterval: 15   # Check the queue every 15 seconds
  cooldownPeriod: 300   # Wait 5 min after the last message before scaling to 0
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/jobs-queue"
        queueLength: "10"  # Target: 10 messages per worker replica
        awsRegion: "us-east-1"
        scaleOnInFlight: "true"
```
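The scaling math behind this trigger is straightforward: one replica per `queueLength` messages of backlog, clamped to the replica bounds. A simplified sketch (KEDA's real implementation goes through HPA external metrics, but the arithmetic is the same):

```python
import math

def keda_sqs_replicas(visible_messages: int, in_flight: int,
                      queue_length_target: int = 10,
                      min_replicas: int = 0, max_replicas: int = 50,
                      scale_on_in_flight: bool = True) -> int:
    """KEDA-style SQS scaling: one replica per `queueLength` messages.
    With scaleOnInFlight enabled, messages already being processed
    (not-visible) count toward the backlog too."""
    backlog = visible_messages + (in_flight if scale_on_in_flight else 0)
    desired = math.ceil(backlog / queue_length_target)
    return max(min_replicas, min(max_replicas, desired))

print(keda_sqs_replicas(0, 0))    # 0 — queue empty, scale to zero
print(keda_sqs_replicas(95, 12))  # ceil(107 / 10) = 11 workers
```

Scale-to-zero is where the savings come from: workers that run a few hours a day stop costing anything the rest of the time.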
## Step 3: Bin-Packing Pod Topology
Kubernetes default scheduler spreads pods across nodes. For cost, you want the opposite — pack pods densely so fewer nodes are needed:
```yaml
# kubernetes/deployments/api-server.yaml
spec:
  template:
    spec:
      # Soft topology spread — balances replicas across nodes for resilience,
      # but ScheduleAnyway means it never blocks dense packing
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway  # Don't block scheduling if unsatisfiable
          labelSelector:
            matchLabels:
              app: api-server
      # Pod anti-affinity: keep critical replicas on different nodes (HA)
      # Use for databases and stateful services
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: api-server
                    tier: critical
```
```yaml
# Cluster-level bin-packing with Descheduler
# HighNodeUtilization evicts pods from under-utilized nodes so they can be
# packed onto busier ones (pair with the scheduler's MostAllocated scoring
# strategy so rescheduled pods actually land on fuller nodes)
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: HighNodeUtilization
        args:
          thresholds:
            cpu: 20      # Node is under-utilized if CPU requests < 20%
            memory: 20
            pods: 20
    plugins:
      balance:
        enabled:
          - HighNodeUtilization
      deschedule:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint
```
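The intuition behind consolidation is classic bin-packing. A toy first-fit-decreasing sketch shows why packing matters (real schedulers juggle memory, affinity, and disruption budgets too, so treat this as illustration only):

```python
def first_fit_decreasing(pod_cpu_requests_m: list[int],
                         node_capacity_m: int) -> int:
    """First-fit-decreasing bin-packing: place each pod (largest first) on
    the first node with room, opening a new node only when none fits.
    Returns the number of nodes needed."""
    nodes = []  # remaining free capacity per node, in millicores
    for cpu in sorted(pod_cpu_requests_m, reverse=True):
        for i, free in enumerate(nodes):
            if free >= cpu:
                nodes[i] -= cpu  # pod fits on an existing node
                break
        else:
            nodes.append(node_capacity_m - cpu)  # open a new node
    return len(nodes)

# 10 pods totalling 6000m fit on two 4-vCPU nodes when packed,
# versus up to 10 nodes if each pod lands somewhere different
pods = [500, 500, 1500, 250, 250, 1000, 1000, 500, 250, 250]
print(first_fit_decreasing(pods, 4000))  # 2
```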
## Step 4: Spot Instances with Karpenter
Karpenter provisions nodes just-in-time and prefers spot instances when workloads allow:
```yaml
# kubernetes/karpenter/node-pool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        node-type: general-purpose
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws  # the v1 API uses group/kind/name
        kind: EC2NodeClass
        name: default
      requirements:
        # Allow multiple instance families for better spot availability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6i", "m6a", "m7i", "m7a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
        # Mix spot and on-demand — Karpenter prefers spot when both are allowed
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name; removes nodes when not needed
    consolidateAfter: 30s
  limits:
    cpu: 1000      # Max 1000 vCPUs across this pool
    memory: 4000Gi
```
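Whether the spot mix pays off is simple arithmetic. The sketch below uses illustrative placeholder prices (not real AWS quotes); plug in your own on-demand rate and observed spot discount:

```python
def blended_hourly_cost(vcpus_needed: float,
                        on_demand_per_vcpu: float = 0.048,
                        spot_discount: float = 0.65,
                        spot_fraction: float = 0.8) -> float:
    """Back-of-envelope blended fleet cost: a fraction of capacity runs on
    spot at a discount, the rest stays on-demand. All rates are per vCPU-hour
    and purely illustrative."""
    spot_rate = on_demand_per_vcpu * (1 - spot_discount)
    return vcpus_needed * (spot_fraction * spot_rate +
                           (1 - spot_fraction) * on_demand_per_vcpu)

# 100 vCPUs: all on-demand would be $4.80/h; an 80% spot mix at a 65%
# discount brings the blended rate down to about $2.30/h (~52% savings)
print(round(blended_hourly_cost(100), 3))
```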
```yaml
# kubernetes/karpenter/ec2-node-class.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 family (v1 requires amiSelectorTerms)
  role: "KarpenterNodeRole-production"  # IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: production
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: production
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        encrypted: true
  tags:
    Environment: production
```
### Spot Interruption Handling
```yaml
# kubernetes/spot-handler/deployment.yaml
# AWS Node Termination Handler — gracefully drain spot nodes before interruption
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler  # Must match the selector above
    spec:
      containers:
        - name: aws-node-termination-handler
          image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0
          env:
            - name: ENABLE_SPOT_INTERRUPTION_DRAINING
              value: "true"
            - name: NODE_TERMINATION_GRACE_PERIOD
              value: "120"  # 2 minutes to drain (matches the spot interruption notice)
            - name: POD_TERMINATION_GRACE_PERIOD
              value: "60"   # 60s for pods to shut down
            - name: ENABLE_REBALANCE_MONITORING
              value: "true" # Proactive draining on rebalance recommendations
```
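Draining only helps if the application exits cleanly on SIGTERM within the grace period. A minimal worker-side sketch (hypothetical class; the loop body stands in for real job processing):

```python
import signal

class GracefulWorker:
    """Worker loop that stops taking new work when SIGTERM arrives, which is
    what the 60-second pod termination grace period above gives it time to do."""

    def __init__(self):
        self.shutting_down = False
        # Kubernetes (via the kubelet) sends SIGTERM when the node drains
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.shutting_down = True  # finish the current item, take no new work

    def run(self, jobs):
        completed = []
        for job in jobs:
            if self.shutting_down:
                break              # remaining jobs stay on the queue for retry
            completed.append(job)  # placeholder for the real work
        return completed
```

Pair this with SQS-style visibility timeouts (or another at-least-once queue) so the jobs a drained worker abandoned are redelivered elsewhere.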
## Cost Savings Comparison
| Optimization | Typical Savings | Effort | Risk |
|---|---|---|---|
| VPA right-sizing | 20–40% | Medium | Low (recommendation mode first) |
| HPA (remove over-provisioning) | 15–30% | Low | Low |
| KEDA scale-to-zero for workers | 50–80% on workers | Low | Low |
| Spot instances for workers | 60–70% vs on-demand | Medium | Medium (interruption handling) |
| Karpenter bin-packing | 20–35% on node count | Medium | Low |
| Spot for stateless app pods | 50–65% | Medium | Medium |
| Combined | 40–65% total | — | — |
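Note that the savings rows don't simply add up: each optimization applies to the spend left over after the previous one. A quick sanity check of the combined figure:

```python
def combined_savings(*savings_fractions: float) -> float:
    """Sequential savings multiply on the remaining spend rather than adding:
    30%, then 25%, then 30% leaves 0.70 * 0.75 * 0.70 ≈ 37% of the bill."""
    remaining = 1.0
    for s in savings_fractions:
        remaining *= (1 - s)
    return 1 - remaining

# Illustrative: VPA right-sizing 30%, Karpenter bin-packing 25%,
# spot on the remaining spend 30% → roughly 63% total, not 85%
print(f"{combined_savings(0.30, 0.25, 0.30):.0%}")  # 63%
```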
## Namespace Resource Quotas (Guardrails)
```yaml
# kubernetes/quotas/production-namespace.yaml
# Prevent any team from accidentally provisioning unlimited resources
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"   # Total CPU requests across all pods
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    pods: "500"           # Max pod count
    count/deployments.apps: "50"
---
# LimitRange: defaults for containers that don't specify resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
```
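How LimitRange defaulting behaves at admission time can be modeled in a few lines. This is a simplified sketch: the real admission controller also enforces min/max bounds and limit-to-request ratios, which are omitted here:

```python
def apply_limit_range(resources: dict, default_request: dict,
                      default_limit: dict) -> dict:
    """Simplified LimitRange defaulting: containers that omit requests or
    limits get the namespace defaults. A container that sets limits but no
    requests has its limits copied into the requests (standard Kubernetes
    behavior, independent of LimitRange)."""
    limits = resources.get("limits") or dict(default_limit)
    requests = resources.get("requests") or (
        dict(resources["limits"]) if resources.get("limits")
        else dict(default_request))
    return {"requests": requests, "limits": limits}

# A container with no resources block gets the namespace defaults:
print(apply_limit_range({},
                        {"cpu": "100m", "memory": "128Mi"},
                        {"cpu": "500m", "memory": "512Mi"}))
```

This is why a LimitRange belongs next to the ResourceQuota: without per-container defaults, a single pod with no requests would be rejected by the quota admission check.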
## See Also
- Multi-Cloud Strategy — cloud provider selection
- Infrastructure Cost Tagging — cost attribution
- Kubernetes Networking — service mesh, ingress
- Cloud Cost Engineering — broader cloud cost patterns
## Working With Viprasol
Kubernetes cost optimization is systematic engineering work — profiling actual resource usage, implementing autoscaling policies, migrating workloads to spot, and maintaining the guardrails that prevent costs from drifting back. Our platform engineers typically achieve 40–60% cost reduction within 4–6 weeks of engagement.
## About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.