OpenTelemetry in Production: Traces, Metrics, and Logs That Actually Help
Set up OpenTelemetry in Node.js and Python services: auto-instrumentation, custom spans, OTLP export to Jaeger/Grafana Tempo, and correlating traces with logs.
Every distributed system eventually develops the same problem: something is slow or broken, and nobody knows where. "Is it the database?" "Is it the upstream API?" "Why does this request take 3 seconds sometimes?" The answer is usually buried in logs across four services that share no correlation IDs.
OpenTelemetry (OTel) solves this by creating a vendor-neutral, language-agnostic standard for traces, metrics, and logs — the three pillars of observability. In 2026, OTel is stable, widely supported, and the right answer for any team that cares about production reliability.
This post covers a production-ready OTel setup: Node.js auto-instrumentation, custom spans, OTLP export, and correlating the three signals for faster incident resolution.
The Three Signals — and How They Relate
| Signal | What It Captures | Best For |
|---|---|---|
| Traces | Request flow across services with timing | Finding where latency lives |
| Metrics | Aggregated numbers (counters, histograms, gauges) | Alerting, dashboards, SLOs |
| Logs | Timestamped text with context | Debugging specific errors |
The magic happens when you correlate them: a trace ID in your log entry links the log to the exact span where the error occurred. OTel makes this correlation automatic.
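For instance, a correlated error log line (illustrative values) carries exactly the IDs a log backend needs to jump straight to the trace:

{"level":"error","msg":"payment declined","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}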
Architecture: OTel Collector as Central Hub
App Services (Node.js, Python, Go)
        │ OTLP/gRPC (4317)
        ▼
OTel Collector ──┬── Traces ───► Jaeger / Grafana Tempo
                 ├── Metrics ──► Prometheus / Grafana Mimir
                 └── Logs ─────► Loki / Elasticsearch
Never export directly from your app to Jaeger/Prometheus in production — the OTel Collector handles batching, retry, transformation, and routing. Apps emit OTLP; the Collector fans out to backends.
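The quickest way to stand one up locally is Docker Compose. A minimal sketch, assuming the Collector config shown later in this post is saved as otel-collector-config.yaml next to it (the contrib image is needed for the tail-sampling processor used there):

# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml:ro
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP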
Node.js: Auto-Instrumentation Setup
The fastest way to get traces is auto-instrumentation — OTel wraps http, express, pg, redis, axios, and 40+ libraries automatically.
npm install \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/sdk-metrics \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
// src/instrumentation.ts — must be imported FIRST before any app code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api-service',
[SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '1.0.0',
'deployment.environment': process.env.NODE_ENV ?? 'production',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
}),
exportIntervalMillis: 15_000, // Every 15 seconds
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// Don't trace health checks
return req.url === '/health' || req.url === '/ready';
},
},
'@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown().finally(() => process.exit(0));
});
// src/index.ts — instrumentation must be first import
import './instrumentation';
import Fastify from 'fastify';
// ... rest of app
With this setup, every HTTP request, database query, and Redis operation is traced automatically. No code changes needed in your route handlers.
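If your build outputs CommonJS, you can also preload the setup with Node's --require flag (or --import for ESM builds) instead of relying on import order. A sketch, assuming compiled output lands in dist/:

node --require ./dist/instrumentation.js dist/index.js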
Custom Spans: Adding Business Context
Auto-instrumentation gives you infrastructure spans. Custom spans add business context — which payment processor was called, which feature flag was evaluated, how many items were in the cart.
// src/lib/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('api-service', '1.0.0');
// Wrapper for adding spans to async functions
export async function withSpan<T>(
name: string,
fn: () => Promise<T>,
attributes?: Record<string, string | number | boolean>,
): Promise<T> {
return tracer.startActiveSpan(name, { attributes }, async (span) => {
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
// src/services/payment.ts
import { trace } from '@opentelemetry/api';
import { withSpan } from '@/lib/tracing';
export async function processPayment(order: Order): Promise<PaymentResult> {
return withSpan(
'payment.process',
async () => {
// Stripe call is auto-instrumented via http
// We add business-level context here
const span = trace.getActiveSpan();
span?.setAttributes({
'payment.amount': order.total,
'payment.currency': order.currency,
'payment.method': order.paymentMethod,
'order.id': order.id,
'order.item_count': order.items.length,
});
const result = await stripe.charges.create({
amount: Math.round(order.total * 100),
currency: order.currency,
source: order.paymentToken,
metadata: { orderId: order.id },
});
span?.setAttributes({
'payment.charge_id': result.id,
'payment.status': result.status,
});
return result;
},
{ 'order.id': order.id },
);
}
Custom Metrics
Three instrument types cover most needs: counters for running totals, histograms for distributions, and observable gauges for point-in-time state.
// src/lib/metrics.ts
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('api-service', '1.0.0');
// Counters — track totals
export const httpRequestCounter = meter.createCounter('http.requests.total', {
description: 'Total HTTP requests by route and status',
});
// Histograms — track distributions
export const orderValueHistogram = meter.createHistogram('order.value', {
description: 'Distribution of order values in USD',
unit: 'USD',
advice: {
explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 5000],
},
});
// Gauges — track current state
export const activeWebSocketsGauge = meter.createObservableGauge(
'websocket.connections.active',
{ description: 'Currently active WebSocket connections' },
);
// Register the observable gauge callback (wsManager stands in for your
// application's WebSocket connection manager)
activeWebSocketsGauge.addCallback((observableResult) => {
  observableResult.observe(wsManager.getConnectionCount(), {
    'server.instance': process.env.HOSTNAME ?? 'unknown',
  });
});
// Usage in route handler
export function recordOrderMetrics(order: Order) {
orderValueHistogram.record(order.total, {
'order.currency': order.currency,
'order.region': order.region,
'payment.method': order.paymentMethod,
});
}
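The counter declared above still needs to be incremented somewhere. A minimal sketch, assuming a Fastify instance named app (the hook and request fields are standard Fastify; adjust for your framework):

// Wire the HTTP request counter into Fastify's response lifecycle
import { httpRequestCounter } from '@/lib/metrics';

app.addHook('onResponse', async (request, reply) => {
  httpRequestCounter.add(1, {
    'http.route': request.routeOptions?.url ?? 'unknown', // routerPath on Fastify < 4.10
    'http.method': request.method,
    'http.status_code': reply.statusCode,
  });
});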
Correlating Logs with Traces
This is where observability gets powerful. When a log entry contains the trace ID, you can jump from a Loki log line directly to the Jaeger trace.
// src/lib/logger.ts
import pino from 'pino';
import { trace } from '@opentelemetry/api';
function getTraceContext() {
const span = trace.getActiveSpan();
if (!span) return {};
const ctx = span.spanContext();
return {
traceId: ctx.traceId,
spanId: ctx.spanId,
    // snake_case duplicates so Loki/Grafana derived-field regexes can match them
'trace_id': ctx.traceId,
'span_id': ctx.spanId,
};
}
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
formatters: {
log(obj) {
return { ...obj, ...getTraceContext() };
},
},
transport: process.env.NODE_ENV !== 'production'
? { target: 'pino-pretty' }
: undefined,
});
Now every log line automatically includes traceId and spanId. In Grafana, you can configure the Loki datasource to derive fields and create links to Tempo traces.
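A sketch of the matching Loki datasource provisioning (the Tempo datasource UID tempo is an assumption; use your own):

# grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'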
OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
limit_mib: 512
spike_limit_mib: 128
check_interval: 5s
# Add environment tag to all telemetry
resource:
attributes:
- key: deployment.environment
value: ${DEPLOYMENT_ENV}
action: upsert
  # Tail-based sampling: keep every error trace, sample 10% of the rest
  # (tail_sampling ships in the otelcol-contrib distribution)
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
otlp/tempo:
endpoint: http://tempo:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
namespace: otel
  # Loki 3.x ingests OTLP natively (the dedicated loki exporter is deprecated)
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
service:
pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/loki]
Python: FastAPI with OTel
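Package-wise, these are the equivalents of the npm installs above (pin exact versions in production):

pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-instrumentation-httpx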
# instrumentation.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
import os
def setup_telemetry(app=None):
resource = Resource.create({
SERVICE_NAME: os.getenv("SERVICE_NAME", "python-service"),
"deployment.environment": os.getenv("ENVIRONMENT", "production"),
})
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317")
# Traces
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True))
)
trace.set_tracer_provider(tracer_provider)
# Metrics
reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint=otlp_endpoint, insecure=True),
export_interval_millis=15_000,
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Auto-instrument
if app:
FastAPIInstrumentor.instrument_app(app, excluded_urls="health,ready")
SQLAlchemyInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
# main.py
from fastapi import FastAPI
from instrumentation import setup_telemetry
from opentelemetry import trace
app = FastAPI()
setup_telemetry(app)
tracer = trace.get_tracer(__name__)
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
with tracer.start_as_current_span("order.fetch") as span:
span.set_attribute("order.id", order_id)
order = await db.fetch_order(order_id)
span.set_attribute("order.status", order.status)
return order
Sampling Strategy
Sampling is essential — tracing every request in high-traffic systems generates terabytes of data.
| Strategy | When to Use | Rate |
|---|---|---|
| Head-based (probabilistic) | Uniform traffic, cost control | 1–10% |
| Tail-based (error-focused) | Keep all errors, sample successes | Errors: 100%, Success: 5% |
| Rate limiting | Bursty traffic | Max N traces/sec |
| Parent-based | Microservices — follow caller's decision | Inherit from parent |
For most production systems: tail-based sampling in the Collector — sample 5% of normal traces, 100% of error traces, and 100% of traces exceeding P95 latency.
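The latency rule is just another tail_sampling policy. One caveat: the threshold is a static millisecond value, so you set it near your observed P95 rather than the Collector computing a percentile live. A sketch extending the processor config shown earlier:

tail_sampling:
  policies:
    # error and probabilistic policies as above, plus:
    - name: keep-slow
      type: latency
      latency:
        threshold_ms: 500  # pick a value near your observed P95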
Cost Comparison: OTel Backends
| Backend | Traces | Metrics | Logs | Pricing Model | Est. Monthly (10K req/min) |
|---|---|---|---|---|---|
| Grafana Cloud | Tempo | Mimir | Loki | Usage-based | $50–$200 |
| Jaeger OSS | ✅ | ❌ | ❌ | Self-hosted | $20–$80 (infra) |
| Datadog APM | ✅ | ✅ | ✅ | Per host + spans | $300–$800 |
| Honeycomb | ✅ | Limited | Limited | Per event | $150–$500 |
| AWS X-Ray + CW | ✅ | ✅ | ✅ | Per trace/event | $100–$400 |
| Self-hosted Grafana Stack | Tempo | Mimir | Loki | Infra only | $80–$200 |
For startups: Grafana Cloud free tier (50GB traces, 10K metrics) handles most early-stage loads. Switch to self-hosted when monthly cost exceeds $200.
Working With Viprasol
Our platform engineering team implements end-to-end observability stacks — from OTel SDK setup in your services to Grafana dashboards that surface actionable insights in minutes.
What we deliver:
- OTel Collector deployment (Kubernetes/ECS) with sampling config
- Auto-instrumentation for Node.js, Python, Go services
- Custom span and metric instrumentation for business events
- Grafana dashboards: RED metrics, SLO tracking, error rate alerts
- Trace-to-log correlation across all services
→ Discuss your observability needs → Cloud infrastructure services
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.