DevOps Best Practices: CI/CD, Monitoring, and Infrastructure Automation in 2026
DevOps is the discipline of reducing the time and risk between writing code and running it in production. The practices here aren't theoretical — they're the specific implementations we use for clients handling real production traffic, and the gaps we've seen cause outages or slow engineering teams to a crawl.
The Four Pillars of Mature DevOps
- Continuous Integration: Every code change is built and tested automatically
- Continuous Deployment: Passing changes deploy to production without manual intervention
- Infrastructure as Code: All infrastructure is version-controlled and reproducible
- Observability: You know what's happening in production before users tell you
Most teams have partial implementations of each. The compounding value comes from having all four working together.
Pillar 1: CI That Actually Works
A CI pipeline that takes 45 minutes to run is nearly useless — developers stop waiting for it and merge anyway. The goal is a pipeline that completes in under 10 minutes and catches real bugs before they reach main.
GitHub Actions workflow with parallel jobs:
```yaml
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main] # required: the build job below only runs on pushes to main
env:
  NODE_VERSION: '20'
jobs:
  # Fast check — runs first, fails fast
  lint-types:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check
  # Unit tests — parallel by shard
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4] # Split tests across 4 runners
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4 --coverage --coverageReporters=json
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/coverage-final.json
  # Integration tests — needs postgres
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432 # map the port so steps can reach the service on localhost
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run db:migrate:test
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
      - run: npm run test:integration
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
  # Security scan
  security:
    name: Security Audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
  # Build Docker image
  build:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [lint-types, unit-tests, integration-tests, security]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        id: build-image # id is required for other steps to read the output below
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t ${{ secrets.ECR_REGISTRY }}/app:$IMAGE_TAG .
          docker push ${{ secrets.ECR_REGISTRY }}/app:$IMAGE_TAG
          echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_OUTPUT
```
Key CI principles:
- Fail fast: Lint and type checks run first, before slower tests
- Parallel where possible: Test sharding cuts runtime proportionally
- Cache aggressively: npm ci with the setup-node npm cache saves 60–90s per run
- Security in CI: Catching vulnerabilities before production is far cheaper than after
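The sharded unit-test job uploads one coverage-final.json artifact per shard, so something has to combine them before reporting overall coverage. In practice a tool like nyc merge does this; as a sketch of what that merge amounts to, here is a hypothetical helper that assumes Istanbul's coverage format, where each file entry carries an `s` map of statement-id to hit count:

```typescript
// merge-coverage.ts — hypothetical helper; assumes Istanbul's coverage-final.json
// shape, where each file entry has an `s` map of statement-id -> hit count.
type FileCoverage = { s: Record<string, number> };
type CoverageMap = Record<string, FileCoverage>;

export function mergeCoverage(shards: CoverageMap[]): CoverageMap {
  const merged: CoverageMap = {};
  for (const shard of shards) {
    for (const [file, cov] of Object.entries(shard)) {
      if (!merged[file]) {
        // First time we see this file: copy its statement counts
        merged[file] = { s: { ...cov.s } };
      } else {
        // Same file covered by multiple shards: sum the hit counts
        for (const [stmt, hits] of Object.entries(cov.s)) {
          merged[file].s[stmt] = (merged[file].s[stmt] ?? 0) + hits;
        }
      }
    }
  }
  return merged;
}
```

A download-artifact step with a glob pattern (coverage-*) can pull all four shards into one directory before a merge step like this runs.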
☁️ Is Your Cloud Costing Too Much?
Most teams overspend 30–40% on cloud — wrong instance types, no reserved pricing, bloated storage. We audit, right-size, and automate your infrastructure.
- AWS, GCP, Azure certified engineers
- Infrastructure as Code (Terraform, CDK)
- Docker, Kubernetes, GitHub Actions CI/CD
- Typical audit recovers $500–$3,000/month in savings
Pillar 2: Continuous Deployment with Zero Downtime
ECS blue/green deployment with CodeDeploy:
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  deploy:
    # Note: `needs` cannot reference a job in another workflow file, so the CI
    # build job can't be listed here. Gate this workflow on CI instead, e.g.
    # with a `workflow_run` trigger, or keep build and deploy in one workflow.
    runs-on: ubuntu-latest
    environment: production # Requires manual approval for prod
    steps:
      - uses: actions/checkout@v4 # task-definition.json lives in the repo
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Update ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: app
          image: ${{ secrets.ECR_REGISTRY }}/app:${{ github.sha }}
      - name: Deploy to ECS with blue/green
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: production-app
          cluster: production
          wait-for-service-stability: true
          codedeploy-appspec: appspec.yaml
          codedeploy-application: production-app
          codedeploy-deployment-group: production-blue-green
      - name: Run smoke tests
        run: |
          sleep 30 # Wait for deployment to propagate
          curl -f https://api.yourapp.com/health || exit 1
      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} # v1 reads the token from env
        with:
          channel-id: 'deployments'
          slack-message: "❌ Production deployment failed for ${{ github.sha }}"
```
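The fixed sleep 30 in the smoke-test step is a race waiting to happen: deployments do not propagate on a schedule. Polling the health endpoint with retries is more robust. A sketch of that check (the attempt counts are placeholders, and fetchFn is injectable purely so the helper can be tested without a live service; in real use you would pass the global fetch):

```typescript
// smoke-check.ts — poll a health endpoint until it answers 200 or we give up,
// instead of sleeping a fixed interval and hoping the deployment has settled.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

export async function waitForHealthy(
  url: string,
  fetchFn: FetchLike, // pass the global fetch in real use; injectable for tests
  attempts = 10,
  delayMs = 3000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if ((await fetchFn(url)).ok) return true; // healthy: stop polling
    } catch {
      // network error during rollout: treat like an unhealthy response and retry
    }
    if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
  }
  return false; // never became healthy within the retry budget
}
```

Exiting nonzero when this returns false gives CodeDeploy a clean failure signal, which is what triggers the blue/green rollback.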
Pillar 3: GitOps with Terraform
Infrastructure changes should go through the same review process as code changes. GitOps means your Git repository is the single source of truth for infrastructure state.
```yaml
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '1.7.0'
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - run: terraform init
      - run: terraform validate
      - run: terraform fmt -check
      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color
        continue-on-error: true # Comment plan even on failure
      - name: Comment PR with plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan 📋
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
      - name: Fail if plan failed
        # continue-on-error swallowed the plan failure so the comment could post;
        # surface it here so a broken plan still fails the check (and never applies)
        if: steps.plan.outcome == 'failure'
        run: exit 1
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan
```
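Auto-applying on merge is safe for most changes, but destroys and replacements deserve an extra gate. Terraform can emit a machine-readable plan (terraform plan -json, one JSON object per line), and a small CI step can scan it for destructive actions. A hypothetical sketch; the line shape assumed here ({type: "planned_change", change: {resource, action}}) follows Terraform's machine-readable UI output, but verify it against your Terraform version before relying on it:

```typescript
// plan-guard.ts — hypothetical: scan `terraform plan -json` output and collect
// the addresses of resources that would be deleted or replaced, so the
// pipeline can demand explicit approval before applying such a plan.
export function destructiveActions(planJsonLines: string): string[] {
  const destructive: string[] = [];
  for (const line of planJsonLines.split('\n')) {
    if (!line.trim()) continue;
    let msg: any;
    try {
      msg = JSON.parse(line);
    } catch {
      continue; // terraform also prints non-JSON banner lines; skip them
    }
    if (msg.type === 'planned_change' && ['delete', 'replace'].includes(msg.change?.action)) {
      destructive.push(msg.change.resource?.addr ?? 'unknown');
    }
  }
  return destructive;
}
```

Wiring this in is one extra step between plan and apply that exits nonzero when the returned list is non-empty.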
⚙️ DevOps Done Right — Zero Downtime, Full Automation
Ship faster without breaking things. We build CI/CD pipelines, monitoring stacks, and auto-scaling infrastructure that your team can actually maintain.
- Staging + production environments with feature flags
- Automated security scanning in the pipeline
- Uptime monitoring + alerting + runbook automation
- On-call support handover docs included
Pillar 4: Observability Stack
A production system without observability is a black box. You need metrics (what's happening), logs (what went wrong), and traces (why a specific request was slow).
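Before the alert rules, it helps to know what the metrics they query actually look like. A Prometheus histogram such as http_request_duration_seconds is a set of cumulative buckets: every observation increments each bucket whose upper bound covers it, and those counts are what histogram_quantile() consumes. A tiny sketch of that recording logic (in a real service a client library like prom-client maintains these counters for you; the bucket bounds below are illustrative defaults):

```typescript
// Cumulative histogram buckets, as Prometheus exposes them in the
// http_request_duration_seconds_bucket series (the `le` label is the bound).
export function newBuckets(
  bounds: number[] = [0.05, 0.1, 0.25, 0.5, 1, 2, 5, Infinity],
): Map<number, number> {
  return new Map(bounds.map((b) => [b, 0]));
}

// Each observation increments every bucket whose upper bound is >= the value.
// That cumulative shape is why bucket counts only ever grow left-to-right.
export function observe(buckets: Map<number, number>, seconds: number): void {
  for (const le of buckets.keys()) {
    if (seconds <= le) buckets.set(le, (buckets.get(le) ?? 0) + 1);
  }
}
```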
Prometheus alerting rules:
```yaml
# prometheus/alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: SlowP99Latency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""} > 0.85
        for: 10m
        labels:
          severity: warning
  - name: database
    rules:
      - alert: PostgresSlowQueries
        expr: pg_stat_activity_max_tx_duration{state="active"} > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL query running for >30s"
      - alert: PostgresConnectionsHigh
        # aggregate both sides: the two metrics carry different label sets,
        # so a bare division would match no series
        expr: |
          sum(pg_stat_database_numbackends)
            / max(pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: critical
```
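The SlowP99Latency rule leans on histogram_quantile(), which estimates a quantile from cumulative buckets: find the first bucket whose count covers the target rank, then interpolate linearly inside it. A simplified sketch of that estimation (Prometheus additionally handles edge cases like empty histograms and the +Inf bucket, which are omitted here):

```typescript
// Quantile estimation from cumulative histogram buckets, the way
// histogram_quantile() does it: locate the bucket containing the target
// rank, then interpolate linearly between its bounds.
export function quantileFromBuckets(
  q: number,
  buckets: Array<{ le: number; count: number }>, // cumulative counts, sorted by le
): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // how many observations fall at or below the quantile
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // rank falls inside this bucket: interpolate within [prevLe, le]
      const span = count - prevCount;
      const frac = span === 0 ? 0 : (rank - prevCount) / span;
      return prevLe + (le - prevLe) * frac;
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}
```

The practical consequence: quantile accuracy is bounded by bucket width, so if your SLO threshold is 2s, make sure a bucket boundary sits at or near 2s.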
Structured logging in Node.js (Pino):
```typescript
// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: process.env.SERVICE_NAME,
    version: process.env.APP_VERSION,
    env: process.env.NODE_ENV,
  },
  // In production, ship JSON logs to CloudWatch/Datadog/Loki
  // In dev, pretty-print for readability
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
});

// Request context with trace ID
export function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({ requestId, userId });
}

// Usage (app is a Fastify instance; crypto.randomUUID is a Node 19+ global)
app.addHook('onRequest', async (request, reply) => {
  const requestId = (request.headers['x-request-id'] as string)
    ?? crypto.randomUUID();
  request.log = createRequestLogger(requestId, request.headers['x-user-id'] as string);
  reply.header('x-request-id', requestId);
});
```
DevOps Toolchain Reference
| Category | Tool | Cost | When |
|---|---|---|---|
| CI/CD | GitHub Actions | Free–$48/mo | Majority of teams |
| CI/CD | GitLab CI | Free–$19/user/mo | Self-hosted or GitLab users |
| Container Registry | AWS ECR | $0.10/GB | AWS-native |
| IaC | Terraform | Free (open source) | Any cloud |
| Secrets | AWS Secrets Manager | $0.40/secret/mo | Production secrets |
| Monitoring | Prometheus + Grafana | Free (self-hosted) | Kubernetes environments |
| Monitoring | Datadog | $15–23/host/mo | Managed, full-stack |
| Logging | CloudWatch | $0.50/GB ingested | AWS-native |
| Logging | Grafana Loki | Free (self-hosted) | Kubernetes + cost-sensitive |
| Alerting | PagerDuty | $21/user/mo | On-call rotation |
| APM + Tracing | OpenTelemetry | Free | Any backend |
Team Structure: DevOps vs SRE vs Platform Engineering
| Role | Focus | When You Need It |
|---|---|---|
| DevOps Engineer | CI/CD pipelines, automation, IaC | 10+ engineers, 2+ services |
| SRE | Reliability, SLOs, incident response | Mission-critical systems, 50+ engineers |
| Platform Engineer | Internal developer platforms, golden paths | 100+ engineers, multiple teams |
Most startups benefit from a part-time DevOps consultant or embedded DevOps engineer before reaching 20 engineers.
Cost of DevOps Maturity
| Maturity Level | What's Included | Monthly Cost |
|---|---|---|
| Basic | GitHub Actions CI, manual deploy | $20–50 |
| Standard | CI/CD + Terraform + CloudWatch | $150–400 |
| Advanced | Blue/green deploys + Prometheus + Datadog | $800–2,000 |
| Enterprise | Platform engineering + SRE toolchain | $3,000–10,000+ |
Tooling cost is usually less than 5% of engineering salary cost at these stages. The ROI comes from engineering time saved and incidents avoided.
Working With Viprasol
We implement DevOps pipelines for product teams that want to ship faster without adding operational risk. That typically means a GitHub Actions CI/CD pipeline, Terraform infrastructure, structured logging, and alerting — delivered in 4–8 weeks, not months.
Our clients typically see deployment frequency increase 3–5× and mean time to recovery (MTTR) drop 60–80% after a DevOps implementation.
→ Talk to our DevOps team about your current setup.
See Also
- CI/CD Pipeline Setup — deep dive on pipeline design
- Infrastructure as Code — Terraform patterns and remote state
- Kubernetes vs ECS — container orchestration for your pipeline
- Observability and Monitoring — OpenTelemetry, Prometheus, alerting
- Cloud Solutions — AWS infrastructure and DevOps services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.
Need DevOps & Cloud Expertise?
Scale your infrastructure with confidence. AWS, GCP, Azure certified team.
Free consultation • No commitment • Response within 24 hours