OpenTelemetry in Production: Traces, Metrics, and Logs That Actually Help
Set up OpenTelemetry in Node.js and Python services: auto-instrumentation, custom spans, OTLP export to Jaeger/Grafana Tempo, and correlating traces with logs.
Every distributed system eventually develops the same problem: something is slow or broken, and nobody knows where. "Is it the database?" "Is it the upstream API?" "Why does this request take 3 seconds sometimes?" The answer is usually buried in logs across four services that share no correlation IDs.
OpenTelemetry (OTel) solves this by creating a vendor-neutral, language-agnostic standard for traces, metrics, and logs — the three pillars of observability. In 2026, OTel is stable, widely supported, and the right answer for any team that cares about production reliability.
This post covers a production-ready OTel setup: Node.js auto-instrumentation, custom spans, OTLP export, and correlating the three signals for faster incident resolution.
The Three Signals — and How They Relate
| Signal | What It Captures | Best For |
|---|---|---|
| Traces | Request flow across services with timing | Finding where latency lives |
| Metrics | Aggregated numbers (counters, histograms, gauges) | Alerting, dashboards, SLOs |
| Logs | Timestamped text with context | Debugging specific errors |
The magic happens when you correlate them: a trace ID in your log entry links the log to the exact span where the error occurred. OTel makes this correlation automatic.
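For instance, a correlated error log line (illustrative values) carries exactly the IDs a log backend needs to jump straight to the trace:

{"level":"error","msg":"payment declined","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}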
Architecture: OTel Collector as Central Hub
App Services (Node.js, Python, Go)
        │ OTLP/gRPC (4317)
        ▼
OTel Collector ──┬── Traces ───► Jaeger / Grafana Tempo
                 ├── Metrics ──► Prometheus / Grafana Mimir
                 └── Logs ─────► Loki / Elasticsearch
Never export directly from your app to Jaeger/Prometheus in production — the OTel Collector handles batching, retry, transformation, and routing. Apps emit OTLP; the Collector fans out to backends.
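The quickest way to stand one up locally is Docker Compose. A minimal sketch, assuming the Collector config shown later in this post is saved as otel-collector-config.yaml next to it (the contrib image is needed for the tail-sampling processor used there):

# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml:ro
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP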
Node.js: Auto-Instrumentation Setup
The fastest way to get traces is auto-instrumentation — OTel wraps http, express, pg, redis, axios, and 40+ libraries automatically.
npm install \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/sdk-metrics \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
// src/instrumentation.ts — must be imported FIRST before any app code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api-service',
[SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '1.0.0',
'deployment.environment': process.env.NODE_ENV ?? 'production',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
}),
exportIntervalMillis: 15_000, // Every 15 seconds
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// Don't trace health checks
return req.url === '/health' || req.url === '/ready';
},
},
'@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown().finally(() => process.exit(0));
});
// src/index.ts — instrumentation must be first import
import './instrumentation';
import Fastify from 'fastify';
// ... rest of app
With this setup, every HTTP request, database query, and Redis operation is traced automatically. No code changes needed in your route handlers.
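If your build outputs CommonJS, you can also preload the setup with Node's --require flag (or --import for ESM builds) instead of relying on import order. A sketch, assuming compiled output lands in dist/:

node --require ./dist/instrumentation.js dist/index.js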
Custom Spans: Adding Business Context
Auto-instrumentation gives you infrastructure spans. Custom spans add business context — which payment processor was called, which feature flag was evaluated, how many items were in the cart.
// src/lib/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('api-service', '1.0.0');
// Wrapper for adding spans to async functions
export async function withSpan<T>(
name: string,
fn: () => Promise<T>,
attributes?: Record<string, string | number | boolean>,
): Promise<T> {
return tracer.startActiveSpan(name, { attributes }, async (span) => {
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
// src/services/payment.ts
import { trace } from '@opentelemetry/api';
import { withSpan } from '@/lib/tracing';
export async function processPayment(order: Order): Promise<PaymentResult> {
return withSpan(
'payment.process',
async () => {
// Stripe call is auto-instrumented via http
// We add business-level context here
const span = trace.getActiveSpan();
span?.setAttributes({
'payment.amount': order.total,
'payment.currency': order.currency,
'payment.method': order.paymentMethod,
'order.id': order.id,
'order.item_count': order.items.length,
});
const result = await stripe.charges.create({
amount: Math.round(order.total * 100),
currency: order.currency,
source: order.paymentToken,
metadata: { orderId: order.id },
});
span?.setAttributes({
'payment.charge_id': result.id,
'payment.status': result.status,
});
return result;
},
{ 'order.id': order.id },
);
}
Custom Metrics
Three instrument types cover most needs: counters for running totals, histograms for distributions, and observable gauges for point-in-time state.
// src/lib/metrics.ts
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('api-service', '1.0.0');
// Counters — track totals
export const httpRequestCounter = meter.createCounter('http.requests.total', {
description: 'Total HTTP requests by route and status',
});
// Histograms — track distributions
export const orderValueHistogram = meter.createHistogram('order.value', {
description: 'Distribution of order values in USD',
unit: 'USD',
advice: {
explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 5000],
},
});
// Gauges — track current state
export const activeWebSocketsGauge = meter.createObservableGauge(
'websocket.connections.active',
{ description: 'Currently active WebSocket connections' },
);
// Register the observable gauge callback (wsManager stands in for your
// application's WebSocket connection manager)
activeWebSocketsGauge.addCallback((observableResult) => {
  observableResult.observe(wsManager.getConnectionCount(), {
    'server.instance': process.env.HOSTNAME ?? 'unknown',
  });
});
// Usage in route handler
export function recordOrderMetrics(order: Order) {
orderValueHistogram.record(order.total, {
'order.currency': order.currency,
'order.region': order.region,
'payment.method': order.paymentMethod,
});
}
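The counter declared above still needs to be incremented somewhere. A minimal sketch, assuming a Fastify instance named app (the hook and request fields are standard Fastify; adjust for your framework):

// Wire the HTTP request counter into Fastify's response lifecycle
import { httpRequestCounter } from '@/lib/metrics';

app.addHook('onResponse', async (request, reply) => {
  httpRequestCounter.add(1, {
    'http.route': request.routeOptions?.url ?? 'unknown', // routerPath on Fastify < 4.10
    'http.method': request.method,
    'http.status_code': reply.statusCode,
  });
});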
Correlating Logs with Traces
This is where observability gets powerful. When a log entry contains the trace ID, you can jump from a Loki log line directly to the Jaeger trace.
// src/lib/logger.ts
import pino from 'pino';
import { trace } from '@opentelemetry/api';
function getTraceContext() {
const span = trace.getActiveSpan();
if (!span) return {};
const ctx = span.spanContext();
return {
traceId: ctx.traceId,
spanId: ctx.spanId,
    // snake_case duplicates so Loki/Grafana derived-field regexes can match them
'trace_id': ctx.traceId,
'span_id': ctx.spanId,
};
}
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
formatters: {
log(obj) {
return { ...obj, ...getTraceContext() };
},
},
transport: process.env.NODE_ENV !== 'production'
? { target: 'pino-pretty' }
: undefined,
});
Now every log line automatically includes traceId and spanId. In Grafana, you can configure the Loki datasource to derive fields and create links to Tempo traces.
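A sketch of the matching Loki datasource provisioning (the Tempo datasource UID tempo is an assumption; use your own):

# grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'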
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
limit_mib: 512
spike_limit_mib: 128
check_interval: 5s
# Add environment tag to all telemetry
resource:
attributes:
- key: deployment.environment
value: ${DEPLOYMENT_ENV}
action: upsert
  # Tail-based sampling: keep every error trace, sample 10% of the rest
  # (tail_sampling ships in the otelcol-contrib distribution)
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
otlp/tempo:
endpoint: http://tempo:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
namespace: otel
  # Loki 3.x ingests OTLP natively (the dedicated loki exporter is deprecated)
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
service:
pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/loki]
Python: FastAPI with OTel
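Package-wise, these are the equivalents of the npm installs above (pin exact versions in production):

pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-instrumentation-httpx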
# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
import os
def setup_telemetry(app=None):
resource = Resource.create({
SERVICE_NAME: os.getenv("SERVICE_NAME", "python-service"),
"deployment.environment": os.getenv("ENVIRONMENT", "production"),
})
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317")
# Traces
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True))
)
trace.set_tracer_provider(tracer_provider)
# Metrics
reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint=otlp_endpoint, insecure=True),
export_interval_millis=15_000,
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Auto-instrument
if app:
FastAPIInstrumentor.instrument_app(app, excluded_urls="health,ready")
SQLAlchemyInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
# main.py
from fastapi import FastAPI
from instrumentation import setup_telemetry
from opentelemetry import trace
app = FastAPI()
setup_telemetry(app)
tracer = trace.get_tracer(__name__)
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
with tracer.start_as_current_span("order.fetch") as span:
span.set_attribute("order.id", order_id)
order = await db.fetch_order(order_id)
span.set_attribute("order.status", order.status)
return order
Sampling Strategy
Sampling is essential — tracing every request in high-traffic systems generates terabytes of data.
| Strategy | When to Use | Rate |
|---|---|---|
| Head-based (probabilistic) | Uniform traffic, cost control | 1–10% |
| Tail-based (error-focused) | Keep all errors, sample successes | Errors: 100%, Success: 5% |
| Rate limiting | Bursty traffic | Max N traces/sec |
| Parent-based | Microservices — follow caller's decision | Inherit from parent |
For most production systems: tail-based sampling in the Collector — sample 5% of normal traces, 100% of error traces, and 100% of traces exceeding P95 latency.
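The latency rule is just another tail_sampling policy. One caveat: the threshold is a static millisecond value, so you set it near your observed P95 rather than the Collector computing a percentile live. A sketch extending the processor config shown earlier:

tail_sampling:
  policies:
    # error and probabilistic policies as above, plus:
    - name: keep-slow
      type: latency
      latency:
        threshold_ms: 500  # pick a value near your observed P95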
Cost Comparison: OTel Backends
| Backend | Traces | Metrics | Logs | Pricing Model | Est. Monthly (10K req/min) |
|---|---|---|---|---|---|
| Grafana Cloud | Tempo | Mimir | Loki | Usage-based | $50–$200 |
| Jaeger OSS | ✅ | ❌ | ❌ | Self-hosted | $20–$80 (infra) |
| Datadog APM | ✅ | ✅ | ✅ | Per host + spans | $300–$800 |
| Honeycomb | ✅ | Limited | Limited | Per event | $150–$500 |
| AWS X-Ray + CW | ✅ | ✅ | ✅ | Per trace/event | $100–$400 |
| Self-hosted Grafana Stack | Tempo | Mimir | Loki | Infra only | $80–$200 |
For startups: Grafana Cloud free tier (50GB traces, 10K metrics) handles most early-stage loads. Switch to self-hosted when monthly cost exceeds $200.
Working With Viprasol
Our platform engineering team implements end-to-end observability stacks — from OTel SDK setup in your services to Grafana dashboards that surface actionable insights in minutes.
What we deliver:
- OTel Collector deployment (Kubernetes/ECS) with sampling config
- Auto-instrumentation for Node.js, Python, Go services
- Custom span and metric instrumentation for business events
- Grafana dashboards: RED metrics, SLO tracking, error rate alerts
- Trace-to-log correlation across all services
→ Discuss your observability needs → Cloud infrastructure services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.