# Kubernetes Cost Optimization: VPA, HPA, Bin-Packing, Spot Nodes, and Karpenter
Cut Kubernetes costs by 40–60%: configure Vertical Pod Autoscaler for right-sizing, Horizontal Pod Autoscaler for traffic-based scaling, bin-packing with pod topology, spot node groups with Karpenter, and idle resource cleanup.
Most Kubernetes clusters are significantly over-provisioned. Developers set resource requests conservatively (or copy from StackOverflow), nodes run at 20–30% utilization, and the bill accumulates quietly. A systematic cost optimization pass typically finds 40–60% savings without touching application performance.
The optimization hierarchy: right-size pods first (VPA), scale to demand (HPA/KEDA), pack pods efficiently (topology), then use spot for everything that tolerates interruption (Karpenter).
## Step 1: Right-Sizing with VPA
Vertical Pod Autoscaler observes actual CPU and memory usage, then recommends (or automatically adjusts) resource requests:
```yaml
# kubernetes/vpa/recommendation-mode.yaml
# Start in recommendation mode — don't auto-apply yet
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only — read with: kubectl get vpa api-server-vpa
  resourcePolicy:
    containerPolicies:
      - containerName: api
        # Set bounds to prevent VPA from recommending unreasonable values
        minAllowed:
          cpu: 50m
          memory: 128Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
```
```bash
# Read VPA recommendations after 24–48 hours of traffic
kubectl describe vpa api-server-vpa -n production

# Output (relevant section):
# Recommendation:
#   Container Recommendations:
#     Container Name:  api
#     Lower Bound:
#       CPU:     80m
#       Memory:  256Mi
#     Target:            ← Use these values for your resource requests
#       CPU:     150m
#       Memory:  512Mi
#     Upper Bound:
#       CPU:     800m
#       Memory:  1.5Gi
#     Uncapped Target:
#       CPU:     150m
#       Memory:  512Mi
```
```yaml
# Apply VPA recommendations to your deployment
# kubernetes/deployments/api-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              cpu: "150m"      # VPA recommended target
              memory: "512Mi"  # VPA recommended target
            limits:
              cpu: "800m"      # VPA upper bound
              memory: "1.5Gi"  # Match memory limit to the upper bound (prevents OOM kills)
```
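Turning a VPA recommendation into concrete requests and limits can be sketched in a few lines. The helper below is hypothetical (VPA only emits the numbers; the 1.2× CPU buffer is a common convention, not something the autoscaler produces):

```python
import math

def requests_from_vpa(target_cpu_m: int, target_mem_mi: int,
                      upper_cpu_m: int, upper_mem_mi: int,
                      cpu_buffer: float = 1.2) -> dict:
    """Derive container requests/limits from VPA's Target and Upper Bound.

    Requests come straight from the Target. The CPU limit adds a buffer above
    the Upper Bound (CPU is compressible, throttling is survivable); the
    memory limit matches the Upper Bound, since an over-tight memory limit
    means OOM kills rather than throttling.
    """
    return {
        "requests": {"cpu": f"{target_cpu_m}m", "memory": f"{target_mem_mi}Mi"},
        "limits": {
            "cpu": f"{math.ceil(upper_cpu_m * cpu_buffer)}m",
            "memory": f"{upper_mem_mi}Mi",
        },
    }

# Using the recommendation above: Target 150m/512Mi, Upper Bound 800m/1536Mi
print(requests_from_vpa(150, 512, 800, 1536))
```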
VPA in Auto mode — only enable after validating recommendations:
```yaml
spec:
  updatePolicy:
    updateMode: "Auto"  # VPA will evict and restart pods to apply new requests
    # Note: causes brief restarts — only use with a proper PodDisruptionBudget
```
## Step 2: Horizontal Scaling with HPA
HPA scales pod count based on metrics. Don't just use CPU — scale on what actually drives load:
```yaml
# kubernetes/hpa/api-server-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2   # Never go below 2 for HA
  maxReplicas: 20
  metrics:
    # Scale on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Target 70% CPU — leaves headroom for spikes
    # Scale on a custom metric: requests per second per pod
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # Scale out when average exceeds 100 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Smooth over the last 60s of recommendations
      policies:
        - type: Pods
          value: 4             # Add at most 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25            # Remove at most 25% of pods at a time
          periodSeconds: 60
```
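Under the hood, the HPA controller computes the desired replica count with a simple ratio formula. A minimal sketch of that calculation (the 10% tolerance band matches the controller-manager's default; the policy/stabilization machinery above is omitted):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, min_replicas: int = 2,
                         max_replicas: int = 20, tolerance: float = 0.1) -> int:
    """Core HPA formula: desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]. Within the tolerance band around the
    target, HPA leaves the replica count alone to avoid flapping."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 105 RPS against a 100 RPS target: within tolerance, no change
print(hpa_desired_replicas(4, 105, 100))  # 4
# 4 pods averaging 180 RPS: scale up to ceil(4 * 1.8) = 8
print(hpa_desired_replicas(4, 180, 100))  # 8
```

This is why per-pod averages matter: halving a pod's RPS target roughly doubles the replica count at the same load.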
### KEDA for Event-Driven Scaling
```yaml
# kubernetes/keda/queue-scaler.yaml
# Scale workers based on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: background-worker
  minReplicaCount: 0    # Scale to zero when the queue is empty (saves cost)
  maxReplicaCount: 50
  pollingInterval: 15   # Check the queue every 15 seconds
  cooldownPeriod: 300   # Wait 5 min after the last message before scaling to 0
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/jobs-queue"
        queueLength: "10"  # Target: 10 messages per worker replica
        awsRegion: "us-east-1"
        scaleOnInFlight: "true"
```
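The scaling math behind this trigger is straightforward: one replica per `queueLength` messages of backlog, clamped to the replica bounds. A simplified sketch (KEDA's real implementation goes through HPA external metrics, but the arithmetic is the same):

```python
import math

def keda_sqs_replicas(visible_messages: int, in_flight: int,
                      queue_length_target: int = 10,
                      min_replicas: int = 0, max_replicas: int = 50,
                      scale_on_in_flight: bool = True) -> int:
    """KEDA-style SQS scaling: one replica per `queueLength` messages.
    With scaleOnInFlight enabled, messages already being processed
    (not-visible) count toward the backlog too."""
    backlog = visible_messages + (in_flight if scale_on_in_flight else 0)
    desired = math.ceil(backlog / queue_length_target)
    return max(min_replicas, min(max_replicas, desired))

print(keda_sqs_replicas(0, 0))    # 0 — queue empty, scale to zero
print(keda_sqs_replicas(95, 12))  # ceil(107 / 10) = 11 workers
```

Scale-to-zero is where the savings come from: workers that run a few hours a day stop costing anything the rest of the time.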
## Step 3: Bin-Packing Pod Topology
Kubernetes default scheduler spreads pods across nodes. For cost, you want the opposite — pack pods densely so fewer nodes are needed:
```yaml
# kubernetes/deployments/api-server.yaml
spec:
  template:
    spec:
      # Soft topology spread — balances replicas across nodes for resilience,
      # but ScheduleAnyway means it never blocks dense packing
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway  # Don't block scheduling if unsatisfiable
          labelSelector:
            matchLabels:
              app: api-server
      # Pod anti-affinity: keep critical replicas on different nodes (HA)
      # Use for databases and stateful services
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: api-server
                    tier: critical
```
```yaml
# Cluster-level bin-packing with Descheduler
# HighNodeUtilization evicts pods from under-utilized nodes so they can be
# packed onto busier ones (pair with the scheduler's MostAllocated scoring
# strategy so rescheduled pods actually land on fuller nodes)
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: HighNodeUtilization
        args:
          thresholds:
            cpu: 20      # Node is under-utilized if CPU requests < 20%
            memory: 20
            pods: 20
    plugins:
      balance:
        enabled:
          - HighNodeUtilization
      deschedule:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint
```
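The intuition behind consolidation is classic bin-packing. A toy first-fit-decreasing sketch shows why packing matters (real schedulers juggle memory, affinity, and disruption budgets too, so treat this as illustration only):

```python
def first_fit_decreasing(pod_cpu_requests_m: list[int],
                         node_capacity_m: int) -> int:
    """First-fit-decreasing bin-packing: place each pod (largest first) on
    the first node with room, opening a new node only when none fits.
    Returns the number of nodes needed."""
    nodes = []  # remaining free capacity per node, in millicores
    for cpu in sorted(pod_cpu_requests_m, reverse=True):
        for i, free in enumerate(nodes):
            if free >= cpu:
                nodes[i] -= cpu  # pod fits on an existing node
                break
        else:
            nodes.append(node_capacity_m - cpu)  # open a new node
    return len(nodes)

# 10 pods totalling 6000m fit on two 4-vCPU nodes when packed,
# versus up to 10 nodes if each pod lands somewhere different
pods = [500, 500, 1500, 250, 250, 1000, 1000, 500, 250, 250]
print(first_fit_decreasing(pods, 4000))  # 2
```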
## Step 4: Spot Instances with Karpenter
Karpenter provisions nodes just-in-time and prefers spot instances when workloads allow:
```yaml
# kubernetes/karpenter/node-pool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        node-type: general-purpose
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws  # the v1 API uses group/kind/name
        kind: EC2NodeClass
        name: default
      requirements:
        # Allow multiple instance families for better spot availability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6i", "m6a", "m7i", "m7a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
        # Mix spot and on-demand — Karpenter prefers spot when both are allowed
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name; removes nodes when not needed
    consolidateAfter: 30s
  limits:
    cpu: 1000      # Max 1000 vCPUs across this pool
    memory: 4000Gi
```
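Whether the spot mix pays off is simple arithmetic. The sketch below uses illustrative placeholder prices (not real AWS quotes); plug in your own on-demand rate and observed spot discount:

```python
def blended_hourly_cost(vcpus_needed: float,
                        on_demand_per_vcpu: float = 0.048,
                        spot_discount: float = 0.65,
                        spot_fraction: float = 0.8) -> float:
    """Back-of-envelope blended fleet cost: a fraction of capacity runs on
    spot at a discount, the rest stays on-demand. All rates are per vCPU-hour
    and purely illustrative."""
    spot_rate = on_demand_per_vcpu * (1 - spot_discount)
    return vcpus_needed * (spot_fraction * spot_rate +
                           (1 - spot_fraction) * on_demand_per_vcpu)

# 100 vCPUs: all on-demand would be $4.80/h; an 80% spot mix at a 65%
# discount brings the blended rate down to about $2.30/h (~52% savings)
print(round(blended_hourly_cost(100), 3))
```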
```yaml
# kubernetes/karpenter/ec2-node-class.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 family (v1 requires amiSelectorTerms)
  role: "KarpenterNodeRole-production"  # IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: production
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: production
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        encrypted: true
  tags:
    Environment: production
```
### Spot Interruption Handling
```yaml
# kubernetes/spot-handler/deployment.yaml
# AWS Node Termination Handler — gracefully drain spot nodes before interruption
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler  # Must match the selector above
    spec:
      containers:
        - name: aws-node-termination-handler
          image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0
          env:
            - name: ENABLE_SPOT_INTERRUPTION_DRAINING
              value: "true"
            - name: NODE_TERMINATION_GRACE_PERIOD
              value: "120"  # 2 minutes to drain (matches the spot interruption notice)
            - name: POD_TERMINATION_GRACE_PERIOD
              value: "60"   # 60s for pods to shut down
            - name: ENABLE_REBALANCE_MONITORING
              value: "true" # Proactive draining on rebalance recommendations
```
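Draining only helps if the application exits cleanly on SIGTERM within the grace period. A minimal worker-side sketch (hypothetical class; the loop body stands in for real job processing):

```python
import signal

class GracefulWorker:
    """Worker loop that stops taking new work when SIGTERM arrives, which is
    what the 60-second pod termination grace period above gives it time to do."""

    def __init__(self):
        self.shutting_down = False
        # Kubernetes (via the kubelet) sends SIGTERM when the node drains
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.shutting_down = True  # finish the current item, take no new work

    def run(self, jobs):
        completed = []
        for job in jobs:
            if self.shutting_down:
                break              # remaining jobs stay on the queue for retry
            completed.append(job)  # placeholder for the real work
        return completed
```

Pair this with SQS-style visibility timeouts (or another at-least-once queue) so the jobs a drained worker abandoned are redelivered elsewhere.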
## Cost Savings Comparison
| Optimization | Typical Savings | Effort | Risk |
|---|---|---|---|
| VPA right-sizing | 20–40% | Medium | Low (recommendation mode first) |
| HPA (remove over-provisioning) | 15–30% | Low | Low |
| KEDA scale-to-zero for workers | 50–80% on workers | Low | Low |
| Spot instances for workers | 60–70% vs on-demand | Medium | Medium (interruption handling) |
| Karpenter bin-packing | 20–35% on node count | Medium | Low |
| Spot for stateless app pods | 50–65% | Medium | Medium |
| Combined | 40–65% total | — | — |
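Note that the savings rows don't simply add up: each optimization applies to the spend left over after the previous one. A quick sanity check of the combined figure:

```python
def combined_savings(*savings_fractions: float) -> float:
    """Sequential savings multiply on the remaining spend rather than adding:
    30%, then 25%, then 30% leaves 0.70 * 0.75 * 0.70 ≈ 37% of the bill."""
    remaining = 1.0
    for s in savings_fractions:
        remaining *= (1 - s)
    return 1 - remaining

# Illustrative: VPA right-sizing 30%, Karpenter bin-packing 25%,
# spot on the remaining spend 30% → roughly 63% total, not 85%
print(f"{combined_savings(0.30, 0.25, 0.30):.0%}")  # 63%
```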
## Namespace Resource Quotas (Guardrails)
```yaml
# kubernetes/quotas/production-namespace.yaml
# Prevent any team from accidentally provisioning unlimited resources
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"   # Total CPU requests across all pods
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    pods: "500"           # Max pod count
    count/deployments.apps: "50"
---
# LimitRange: defaults for containers that don't specify resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
```
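How LimitRange defaulting behaves at admission time can be modeled in a few lines. This is a simplified sketch: the real admission controller also enforces min/max bounds and limit-to-request ratios, which are omitted here:

```python
def apply_limit_range(resources: dict, default_request: dict,
                      default_limit: dict) -> dict:
    """Simplified LimitRange defaulting: containers that omit requests or
    limits get the namespace defaults. A container that sets limits but no
    requests has its limits copied into the requests (standard Kubernetes
    behavior, independent of LimitRange)."""
    limits = resources.get("limits") or dict(default_limit)
    requests = resources.get("requests") or (
        dict(resources["limits"]) if resources.get("limits")
        else dict(default_request))
    return {"requests": requests, "limits": limits}

# A container with no resources block gets the namespace defaults:
print(apply_limit_range({},
                        {"cpu": "100m", "memory": "128Mi"},
                        {"cpu": "500m", "memory": "512Mi"}))
```

This is why a LimitRange belongs next to the ResourceQuota: without per-container defaults, a single pod with no requests would be rejected by the quota admission check.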
## See Also
- Multi-Cloud Strategy — cloud provider selection
- Infrastructure Cost Tagging — cost attribution
- Kubernetes Networking — service mesh, ingress
- Cloud Cost Engineering — broader cloud cost patterns
## Working With Viprasol
Kubernetes cost optimization is systematic engineering work — profiling actual resource usage, implementing autoscaling policies, migrating workloads to spot, and maintaining the guardrails that prevent costs from drifting back. Our platform engineers typically achieve 40–60% cost reduction within 4–6 weeks of engagement.
## About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.