DevOps Best Practices: CI/CD, Monitoring, and Infrastructure Automation in 2026
DevOps is the discipline of reducing the time and risk between writing code and running it in production. The practices here aren't theoretical — they're the specific implementations we use for clients handling real production traffic, and the gaps we've seen cause outages or slow engineering teams to a crawl.
The Four Pillars of Mature DevOps
- Continuous Integration: Every code change is built and tested automatically
- Continuous Deployment: Passing changes deploy to production without manual intervention
- Infrastructure as Code: All infrastructure is version-controlled and reproducible
- Observability: You know what's happening in production before users tell you
Most teams have partial implementations of each. The compounding value comes from having all four working together.
Pillar 1: CI That Actually Works
A CI pipeline that takes 45 minutes to run is nearly useless — developers stop waiting for it and merge anyway. The goal is a pipeline that completes in under 10 minutes and catches real bugs before they reach main.
GitHub Actions workflow with parallel jobs:
```yaml
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main] # required: the build job below only runs on pushes to main
env:
  NODE_VERSION: '20'
jobs:
  # Fast check — runs first, fails fast
  lint-types:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check
  # Unit tests — parallel by shard
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4] # Split tests across 4 runners
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4 --coverage --coverageReporters=json
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/coverage-final.json
  # Integration tests — needs postgres
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432 # map the port so steps can reach the service on localhost
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run db:migrate:test
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
      - run: npm run test:integration
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
  # Security scan
  security:
    name: Security Audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
  # Build Docker image
  build:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [lint-types, unit-tests, integration-tests, security]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        id: build-image # id is required for other steps to read the output below
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t ${{ secrets.ECR_REGISTRY }}/app:$IMAGE_TAG .
          docker push ${{ secrets.ECR_REGISTRY }}/app:$IMAGE_TAG
          echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_OUTPUT
```
Key CI principles:
- Fail fast: Lint and type checks run first, before slower tests
- Parallel where possible: Test sharding cuts runtime proportionally
- Cache aggressively: npm ci with the setup-node npm cache saves 60–90s per run
- Security in CI: Catching vulnerabilities before production is far cheaper than after
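The sharded unit-test job uploads one coverage-final.json artifact per shard, so something has to combine them before reporting overall coverage. In practice a tool like nyc merge does this; as a sketch of what that merge amounts to, here is a hypothetical helper that assumes Istanbul's coverage format, where each file entry carries an `s` map of statement-id to hit count:

```typescript
// merge-coverage.ts — hypothetical helper; assumes Istanbul's coverage-final.json
// shape, where each file entry has an `s` map of statement-id -> hit count.
type FileCoverage = { s: Record<string, number> };
type CoverageMap = Record<string, FileCoverage>;

export function mergeCoverage(shards: CoverageMap[]): CoverageMap {
  const merged: CoverageMap = {};
  for (const shard of shards) {
    for (const [file, cov] of Object.entries(shard)) {
      if (!merged[file]) {
        // First time we see this file: copy its statement counts
        merged[file] = { s: { ...cov.s } };
      } else {
        // Same file covered by multiple shards: sum the hit counts
        for (const [stmt, hits] of Object.entries(cov.s)) {
          merged[file].s[stmt] = (merged[file].s[stmt] ?? 0) + hits;
        }
      }
    }
  }
  return merged;
}
```

A download-artifact step with a glob pattern (coverage-*) can pull all four shards into one directory before a merge step like this runs.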
☁️ Is Your Cloud Costing Too Much?
Most teams overspend 30–40% on cloud — wrong instance types, no reserved pricing, bloated storage. We audit, right-size, and automate your infrastructure.
- AWS, GCP, Azure certified engineers
- Infrastructure as Code (Terraform, CDK)
- Docker, Kubernetes, GitHub Actions CI/CD
- Typical audit recovers $500–$3,000/month in savings
Pillar 2: Continuous Deployment with Zero Downtime
ECS blue/green deployment with CodeDeploy:
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  deploy:
    # Note: `needs` cannot reference a job in another workflow file, so the CI
    # build job can't be listed here. Gate this workflow on CI instead, e.g.
    # with a `workflow_run` trigger, or keep build and deploy in one workflow.
    runs-on: ubuntu-latest
    environment: production # Requires manual approval for prod
    steps:
      - uses: actions/checkout@v4 # task-definition.json lives in the repo
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Update ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: app
          image: ${{ secrets.ECR_REGISTRY }}/app:${{ github.sha }}
      - name: Deploy to ECS with blue/green
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: production-app
          cluster: production
          wait-for-service-stability: true
          codedeploy-appspec: appspec.yaml
          codedeploy-application: production-app
          codedeploy-deployment-group: production-blue-green
      - name: Run smoke tests
        run: |
          sleep 30 # Wait for deployment to propagate
          curl -f https://api.yourapp.com/health || exit 1
      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} # v1 reads the token from env
        with:
          channel-id: 'deployments'
          slack-message: "❌ Production deployment failed for ${{ github.sha }}"
```
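The fixed sleep 30 in the smoke-test step is a race waiting to happen: deployments do not propagate on a schedule. Polling the health endpoint with retries is more robust. A sketch of that check (the attempt counts are placeholders, and fetchFn is injectable purely so the helper can be tested without a live service; in real use you would pass the global fetch):

```typescript
// smoke-check.ts — poll a health endpoint until it answers 200 or we give up,
// instead of sleeping a fixed interval and hoping the deployment has settled.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

export async function waitForHealthy(
  url: string,
  fetchFn: FetchLike, // pass the global fetch in real use; injectable for tests
  attempts = 10,
  delayMs = 3000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if ((await fetchFn(url)).ok) return true; // healthy: stop polling
    } catch {
      // network error during rollout: treat like an unhealthy response and retry
    }
    if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
  }
  return false; // never became healthy within the retry budget
}
```

Exiting nonzero when this returns false gives CodeDeploy a clean failure signal, which is what triggers the blue/green rollback.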
Pillar 3: GitOps with Terraform
Infrastructure changes should go through the same review process as code changes. GitOps means your Git repository is the single source of truth for infrastructure state.
```yaml
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '1.7.0'
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - run: terraform init
      - run: terraform validate
      - run: terraform fmt -check
      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color
        continue-on-error: true # Comment plan even on failure
      - name: Comment PR with plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan 📋
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
      - name: Fail if plan failed
        # continue-on-error swallowed the plan failure so the comment could post;
        # surface it here so a broken plan still fails the check (and never applies)
        if: steps.plan.outcome == 'failure'
        run: exit 1
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan
```
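Auto-applying on merge is safe for most changes, but destroys and replacements deserve an extra gate. Terraform can emit a machine-readable plan (terraform plan -json, one JSON object per line), and a small CI step can scan it for destructive actions. A hypothetical sketch; the line shape assumed here ({type: "planned_change", change: {resource, action}}) follows Terraform's machine-readable UI output, but verify it against your Terraform version before relying on it:

```typescript
// plan-guard.ts — hypothetical: scan `terraform plan -json` output and collect
// the addresses of resources that would be deleted or replaced, so the
// pipeline can demand explicit approval before applying such a plan.
export function destructiveActions(planJsonLines: string): string[] {
  const destructive: string[] = [];
  for (const line of planJsonLines.split('\n')) {
    if (!line.trim()) continue;
    let msg: any;
    try {
      msg = JSON.parse(line);
    } catch {
      continue; // terraform also prints non-JSON banner lines; skip them
    }
    if (msg.type === 'planned_change' && ['delete', 'replace'].includes(msg.change?.action)) {
      destructive.push(msg.change.resource?.addr ?? 'unknown');
    }
  }
  return destructive;
}
```

Wiring this in is one extra step between plan and apply that exits nonzero when the returned list is non-empty.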
⚙️ DevOps Done Right — Zero Downtime, Full Automation
Ship faster without breaking things. We build CI/CD pipelines, monitoring stacks, and auto-scaling infrastructure that your team can actually maintain.
- Staging + production environments with feature flags
- Automated security scanning in the pipeline
- Uptime monitoring + alerting + runbook automation
- On-call support handover docs included
Pillar 4: Observability Stack
A production system without observability is a black box. You need metrics (what's happening), logs (what went wrong), and traces (why a specific request was slow).
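Before the alert rules, it helps to know what the metrics they query actually look like. A Prometheus histogram such as http_request_duration_seconds is a set of cumulative buckets: every observation increments each bucket whose upper bound covers it, and those counts are what histogram_quantile() consumes. A tiny sketch of that recording logic (in a real service a client library like prom-client maintains these counters for you; the bucket bounds below are illustrative defaults):

```typescript
// Cumulative histogram buckets, as Prometheus exposes them in the
// http_request_duration_seconds_bucket series (the `le` label is the bound).
export function newBuckets(
  bounds: number[] = [0.05, 0.1, 0.25, 0.5, 1, 2, 5, Infinity],
): Map<number, number> {
  return new Map(bounds.map((b) => [b, 0]));
}

// Each observation increments every bucket whose upper bound is >= the value.
// That cumulative shape is why bucket counts only ever grow left-to-right.
export function observe(buckets: Map<number, number>, seconds: number): void {
  for (const le of buckets.keys()) {
    if (seconds <= le) buckets.set(le, (buckets.get(le) ?? 0) + 1);
  }
}
```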
Prometheus alerting rules:
```yaml
# prometheus/alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: SlowP99Latency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""} > 0.85
        for: 10m
        labels:
          severity: warning
  - name: database
    rules:
      - alert: PostgresSlowQueries
        expr: pg_stat_activity_max_tx_duration{state="active"} > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL query running for >30s"
      - alert: PostgresConnectionsHigh
        # aggregate both sides: the two metrics carry different label sets,
        # so a bare division would match no series
        expr: |
          sum(pg_stat_database_numbackends)
            / max(pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: critical
```
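The SlowP99Latency rule leans on histogram_quantile(), which estimates a quantile from cumulative buckets: find the first bucket whose count covers the target rank, then interpolate linearly inside it. A simplified sketch of that estimation (Prometheus additionally handles edge cases like empty histograms and the +Inf bucket, which are omitted here):

```typescript
// Quantile estimation from cumulative histogram buckets, the way
// histogram_quantile() does it: locate the bucket containing the target
// rank, then interpolate linearly between its bounds.
export function quantileFromBuckets(
  q: number,
  buckets: Array<{ le: number; count: number }>, // cumulative counts, sorted by le
): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // how many observations fall at or below the quantile
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // rank falls inside this bucket: interpolate within [prevLe, le]
      const span = count - prevCount;
      const frac = span === 0 ? 0 : (rank - prevCount) / span;
      return prevLe + (le - prevLe) * frac;
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}
```

The practical consequence: quantile accuracy is bounded by bucket width, so if your SLO threshold is 2s, make sure a bucket boundary sits at or near 2s.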
Structured logging in Node.js (Pino):
```typescript
// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: process.env.SERVICE_NAME,
    version: process.env.APP_VERSION,
    env: process.env.NODE_ENV,
  },
  // In production, ship JSON logs to CloudWatch/Datadog/Loki
  // In dev, pretty-print for readability
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
});

// Request context with trace ID
export function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({ requestId, userId });
}

// Usage (app is a Fastify instance; crypto.randomUUID is a Node 19+ global)
app.addHook('onRequest', async (request, reply) => {
  const requestId = (request.headers['x-request-id'] as string)
    ?? crypto.randomUUID();
  request.log = createRequestLogger(requestId, request.headers['x-user-id'] as string);
  reply.header('x-request-id', requestId);
});
```
DevOps Toolchain Reference
| Category | Tool | Cost | When |
|---|---|---|---|
| CI/CD | GitHub Actions | Free–$48/mo | Majority of teams |
| CI/CD | GitLab CI | Free–$19/user/mo | Self-hosted or GitLab users |
| Container Registry | AWS ECR | $0.10/GB | AWS-native |
| IaC | Terraform | Free (open source) | Any cloud |
| Secrets | AWS Secrets Manager | $0.40/secret/mo | Production secrets |
| Monitoring | Prometheus + Grafana | Free (self-hosted) | Kubernetes environments |
| Monitoring | Datadog | $15–23/host/mo | Managed, full-stack |
| Logging | CloudWatch | $0.50/GB ingested | AWS-native |
| Logging | Grafana Loki | Free (self-hosted) | Kubernetes + cost-sensitive |
| Alerting | PagerDuty | $21/user/mo | On-call rotation |
| APM + Tracing | OpenTelemetry | Free | Any backend |
Team Structure: DevOps vs SRE vs Platform Engineering
| Role | Focus | When You Need It |
|---|---|---|
| DevOps Engineer | CI/CD pipelines, automation, IaC | 10+ engineers, 2+ services |
| SRE | Reliability, SLOs, incident response | Mission-critical systems, 50+ engineers |
| Platform Engineer | Internal developer platforms, golden paths | 100+ engineers, multiple teams |
Most startups benefit from a part-time DevOps consultant or embedded DevOps engineer before reaching 20 engineers.
Cost of DevOps Maturity
| Maturity Level | What's Included | Monthly Cost |
|---|---|---|
| Basic | GitHub Actions CI, manual deploy | $20–50 |
| Standard | CI/CD + Terraform + CloudWatch | $150–400 |
| Advanced | Blue/green deploys + Prometheus + Datadog | $800–2,000 |
| Enterprise | Platform engineering + SRE toolchain | $3,000–10,000+ |
Tooling cost is usually less than 5% of engineering salary cost at these stages. The ROI comes from engineering time saved and incidents avoided.
Working With Viprasol
We implement DevOps pipelines for product teams that want to ship faster without adding operational risk. That typically means a GitHub Actions CI/CD pipeline, Terraform infrastructure, structured logging, and alerting — delivered in 4–8 weeks, not months.
Our clients typically see deployment frequency increase 3–5× and mean time to recovery (MTTR) drop 60–80% after a DevOps implementation.
→ Talk to our DevOps team about your current setup.
See Also
- CI/CD Pipeline Setup — deep dive on pipeline design
- Infrastructure as Code — Terraform patterns and remote state
- Kubernetes vs ECS — container orchestration for your pipeline
- Observability and Monitoring — OpenTelemetry, Prometheus, alerting
- Cloud Solutions — AWS infrastructure and DevOps services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.
Need DevOps & Cloud Expertise?
Scale your infrastructure with confidence. AWS, GCP, Azure certified team.
Free consultation • No commitment • Response within 24 hours