OpenTelemetry in Production: Traces, Metrics, and Logs That Actually Help
Set up OpenTelemetry in Node.js and Python services. Auto-instrumentation, custom spans, OTLP export to Jaeger/Grafana Tempo, and correlating traces with logs across services.
OpenTelemetry: Observability for Modern Applications (2026)
Observability is the difference between firefighting in the dark and methodically solving problems. At Viprasol, we've moved from the fragmented world of multiple monitoring tools to OpenTelemetry—a unified approach to collecting, processing, and exporting telemetry data. This shift has transformed how we understand what's happening inside our applications.
The Observability Crisis We Solved
Five years ago, our monitoring setup looked like this: Application Insights for some services, Datadog for others, custom logging in a few places, and manual traces scattered throughout the codebase. Each tool worked fine in isolation, but getting a complete picture of a user request flowing through our system was nearly impossible.
A user reported slow performance. We checked metrics. No spike. We checked logs. Found an error, but couldn't correlate it with anything else. We grabbed a sample trace from one service, but the next service in the chain logged differently. Three hours later, we finally found the culprit: a database connection pool was exhausted in an obscure service.
This experience pushed us to find a better way. We discovered OpenTelemetry—an open standard for observability that was gaining momentum. Instead of replacing one vendor lock-in with another, we could instrument our code once and send data to any backend we chose. That flexibility changed everything.
Understanding the Three Pillars of OpenTelemetry
OpenTelemetry unifies three types of telemetry data:
Traces
A trace represents the entire journey of a single request through your system. It shows:
- Which services processed the request
- How long each operation took
- Where errors occurred
- Dependencies between operations
When a user makes a request to your application, a trace captures every step: frontend JavaScript execution, API call, database query, cache lookup, external API call. All connected in a single timeline.
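Conceptually, a trace is just a tree of spans that share one trace ID, each recording where time went. The sketch below is purely illustrative (the span names, IDs, and timings are made up) but shows why this structure is useful: summing span durations per service immediately reveals where a slow request spent its time.

```javascript
// Illustrative only: a trace is a tree of spans sharing one trace ID.
// All names, IDs, and durations below are invented for the example.
const spans = [
  { id: 'a1', parentId: null, name: 'GET /checkout',  service: 'frontend',  durationMs: 420 },
  { id: 'b2', parentId: 'a1', name: 'POST /api/cart', service: 'api',       durationMs: 380 },
  { id: 'c3', parentId: 'b2', name: 'SELECT cart',    service: 'db',        durationMs: 150 },
  { id: 'd4', parentId: 'b2', name: 'GET inventory',  service: 'inventory', durationMs: 90 }
];

// Sum span durations per service to see where the request spent its time.
function timePerService(spans) {
  const totals = {};
  for (const s of spans) {
    totals[s.service] = (totals[s.service] || 0) + s.durationMs;
  }
  return totals;
}

console.log(timePerService(spans));
// → { frontend: 420, api: 380, db: 150, inventory: 90 }
```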
Metrics
Metrics answer the question: "What's happening in aggregate?" They measure:
- Request rates and latencies
- Error percentages
- CPU and memory usage
- Queue depths and throughput
- Business metrics (signups, purchases, etc.)
Unlike traces which are request-specific, metrics are rolled-up statistics. They tell you that your 99th percentile latency is 2 seconds, not that user Alice's request took 2 seconds.
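To make that distinction concrete, a percentile is computed over many requests, not one. The toy calculation below uses plain JavaScript with no OTel APIs (real metrics pipelines aggregate with histograms rather than storing raw samples); it shows how a single slow outlier dominates the p99 while barely moving the median.

```javascript
// Toy latency samples in milliseconds.
const latencies = [120, 95, 110, 2000, 130, 105, 98, 115, 125, 100];

// Nearest-rank percentile: sort, then take the value at ceil(p * n) - 1.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[rank];
}

console.log(percentile(latencies, 0.99)); // the one slow outlier dominates: 2000
console.log(percentile(latencies, 0.5));  // the median is unaffected: 110
```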
Logs
Logs remain important, but in OpenTelemetry they're contextualized. Instead of a log message floating in isolation, it includes trace IDs and span IDs, connecting it to the broader picture.
Code:
2026-03-07T10:15:23Z ERROR [trace_id=abc123] Payment processing failed
// vs
2026-03-07T10:15:23Z ERROR Payment processing failed (the old way)
The first log message can be found instantly by anyone looking at the payment processing trace. The second requires guesswork and hope.
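As a minimal sketch of how the correlated format is produced: the formatter below is hypothetical, not an OTel API (in practice a logging library's OTel integration injects the IDs from the active span automatically), but it shows the shape of the output.

```javascript
// Hypothetical formatter: prefix each line with the active trace/span IDs.
function formatLog(level, message, ctx) {
  const ts = new Date().toISOString();
  const ids = ctx ? `[trace_id=${ctx.traceId} span_id=${ctx.spanId}] ` : '';
  return `${ts} ${level} ${ids}${message}`;
}

const line = formatLog('ERROR', 'Payment processing failed', {
  traceId: 'abc123',
  spanId: 'def456'
});
console.log(line);
// e.g. 2026-03-07T10:15:23.000Z ERROR [trace_id=abc123 span_id=def456] Payment processing failed
```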
Setting Up OpenTelemetry in Node.js Applications
For web development projects, here's how we bootstrap OpenTelemetry:
Code:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const traceExporter = new OTLPTraceExporter({
  url: 'http://otel-collector:4318/v1/traces'
});
const metricExporter = new OTLPMetricExporter({
  url: 'http://otel-collector:4318/v1/metrics'
});

const sdk = new NodeSDK({
  traceExporter,
  instrumentations: [getNodeAutoInstrumentations()],
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter
  })
});

sdk.start();
console.log('OpenTelemetry started');
This single initialization automatically instruments:
- HTTP requests
- Database calls
- External API calls
- Async context propagation
Auto-instrumentation is powerful, but custom instrumentation is where you gain real insight:
Code:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-app');

async function processPayment(userId, amount) {
  const span = tracer.startSpan('payment.process');
  try {
    span.setAttributes({
      'user.id': userId,
      'payment.amount': amount,
      'payment.currency': 'USD'
    });
    const result = await chargeCard(userId, amount);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
Browser and Frontend Instrumentation
OpenTelemetry isn't just for backend. Modern SaaS development requires frontend observability too:
Code:
import { WebTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-web';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { trace } from '@opentelemetry/api';

const provider = new WebTracerProvider({
  resource: new Resource({
    'service.name': 'frontend-app'
  })
});
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter())
);
provider.register({
  contextManager: new ZoneContextManager()
});

const tracer = trace.getTracer('app');

// Track user interactions
document.addEventListener('click', (event) => {
  const span = tracer.startSpan('user.interaction.click');
  span.setAttributes({
    'element.id': event.target.id,
    'element.class': event.target.className
  });
  span.end();
});

Deployment Architecture
For cloud solutions, OpenTelemetry follows this pattern:
Code:
┌─────────────────────────────────────┐
│         Your Applications           │
│ (Node.js, Python, Go, Java, etc.)   │
└────────────────┬────────────────────┘
                 │ OTLP Protocol (HTTP/gRPC)
                 ▼
┌─────────────────────────────────────┐
│     OpenTelemetry Collector         │
│     - Receives telemetry            │
│     - Batches for efficiency        │
│     - Routes to multiple backends   │
└────────┬───────────────┬────────────┘
         │               │
         ▼               ▼
  Jaeger (Traces)  Prometheus (Metrics)
Each application sends telemetry to a central collector, which acts as a router. This provides:
- Decoupling: Change backends without redeploying applications
- Batching: More efficient network usage
- Filtering: Reduce storage costs by dropping unneeded data
- Transformation: Enrich telemetry with additional context
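A minimal collector configuration implementing this pattern might look like the following sketch. The endpoints and backend addresses are placeholders to adapt to your environment:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:            # batch telemetry before export for efficiency
    timeout: 5s

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder backend address
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889  # scraped by Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Swapping Jaeger for Tempo, or adding a second exporter, is a collector config change only; the applications keep sending plain OTLP.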
Practical Instrumentation Patterns
Database Observability
Most frameworks auto-instrument databases, but custom context helps:
Code:
async function queryDatabase(query, params) {
  const span = tracer.startSpan('db.query', {
    attributes: {
      'db.system': 'postgres',
      'db.statement': query.substring(0, 100), // Truncate for safety
      'db.params.count': params.length
    }
  });
  try {
    const startTime = Date.now();
    const result = await pool.query(query, params);
    span.setAttributes({
      'db.rows_affected': result.rowCount,
      'db.duration_ms': Date.now() - startTime
    });
    return result;
  } catch (error) {
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
External API Calls
Track third-party integrations:
Code:
async function callExternalAPI(service, endpoint) {
  const span = tracer.startSpan('http.client', {
    attributes: {
      'http.method': 'GET',
      'http.url': `${service}${endpoint}`,
      'http.target': endpoint
    }
  });
  const startTime = Date.now();
  try {
    const response = await fetch(`${service}${endpoint}`);
    span.setAttributes({
      'http.status_code': response.status,
      'http.response_time_ms': Date.now() - startTime
    });
    return response;
  } catch (error) {
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
Business Logic Instrumentation
This is where OpenTelemetry really shines:
Code:
import { trace, context } from '@opentelemetry/api';

async function checkoutCart(userId, items) {
  const span = tracer.startSpan('checkout.process');
  span.setAttributes({
    'user.id': userId,
    'cart.item_count': items.length,
    'cart.total': items.reduce((sum, i) => sum + i.price, 0)
  });
  // Make the checkout span the active parent for the child spans below
  const ctx = trace.setSpan(context.active(), span);

  const validationSpan = tracer.startSpan('checkout.validation', undefined, ctx);
  validateItems(items);
  validationSpan.end();

  const paymentSpan = tracer.startSpan('checkout.payment', undefined, ctx);
  const paymentResult = await processPayment(userId, items);
  paymentSpan.setAttributes({
    'payment.status': paymentResult.status,
    'payment.method': paymentResult.method
  });
  paymentSpan.end();

  span.end();
  return paymentResult;
}
Sampling Strategies for Cost Control
Collecting telemetry for every request gets expensive at scale. Sampling reduces costs while maintaining insight:
Code:
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

// Sample 10% of requests; the decision is derived from the trace ID,
// so every span in a given trace gets the same decision
const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.1)
});
Better: adaptive sampling that samples more when error rates are high:
Code:
import { Sampler, SamplingDecision, SamplingResult } from '@opentelemetry/sdk-trace-base';

class AdaptiveSampler implements Sampler {
  shouldSample(context, traceId, spanName, spanKind, attributes): SamplingResult {
    // Always sample errors
    if (attributes['error'] === true) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }
    // Sample 5% of normal requests
    if (Math.random() < 0.05) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }
    // Drop everything else
    return { decision: SamplingDecision.NOT_RECORD };
  }
}
Key Features Comparison
| Feature | Jaeger | Tempo | Datadog |
|---|---|---|---|
| Open Source | Yes | Yes | No |
| Trace Storage | Local/ES | S3/GCS | Proprietary |
| Cost | Low | Low | High |
| Ease of Setup | Medium | Easy | Very Easy |
| Query Flexibility | Good | Limited | Excellent |
For detailed implementation guidance, consult the official OpenTelemetry documentation and explore Jaeger's architecture guide to understand how distributed tracing works at scale. Also review Google Cloud's observability documentation for additional best practices.
Common Pitfalls and Solutions
Too Much Data, Too Little Insight
Don't instrument everything. Focus on:
- User-facing operations
- External integrations
- Error paths
- Business-critical workflows
Cardinality Explosion
Avoid creating spans with unbounded attributes:
Code:
// Bad: Creates thousands of unique span names
for (let i = 0; i < items.length; i++) {
tracer.startSpan(**item.process.${items[i].id}**);
}
// Good: Single span with list attribute
const span = tracer.startSpan('items.process');
span.setAttributes({
'items.count': items.length
});
Performance Impact
OpenTelemetry instrumentation has overhead. Minimize it:
Code:
// Batch exports instead of sending individually
const processor = new BatchSpanProcessor(exporter, {
maxQueueSize: 2048,
maxExportBatchSize: 512,
scheduledDelayMillis: 5000
});
Advanced Instrumentation Strategies
Request Context Propagation
Trace requests across services using W3C Trace Context:
Code:
import { context, defaultTextMapGetter } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

const propagator = new W3CTraceContextPropagator();

// Extract trace context from incoming request headers
const extractedContext = propagator.extract(
  context.active(),
  request.headers,
  defaultTextMapGetter
);

// Set as active context for downstream operations
context.with(extractedContext, async () => {
  // All operations here join the same trace
  await processRequest(request);
});
Custom Resource Attributes
Add metadata to identify your services:
Code:
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.3',
    'deployment.environment': process.env.NODE_ENV,
    'git.commit': process.env.GIT_SHA,
    'kubernetes.namespace': process.env.K8S_NAMESPACE
  })
);
Filtering and Processing Telemetry
Reduce storage costs by filtering unneeded data:
Code:
// A span processor can only drop a span by declining to forward it to a
// wrapped (delegate) processor; returning early by itself drops nothing.
class FilteringSpanProcessor {
  constructor(delegate) {
    this.delegate = delegate;
  }
  onStart(span, parentContext) {
    this.delegate.onStart(span, parentContext);
  }
  onEnd(span) {
    // Don't export health checks
    if (span.name.includes('health')) return;
    // Drop sub-millisecond spans in production
    // (span.duration is HrTime: [seconds, nanoseconds])
    const [seconds, nanos] = span.duration;
    if (process.env.NODE_ENV === 'production' && seconds === 0 && nanos < 1e6) return;
    this.delegate.onEnd(span);
  }
  shutdown() { return this.delegate.shutdown(); }
  forceFlush() { return this.delegate.forceFlush(); }
}
Correlation with Business Events
Connect telemetry to business metrics:
Code:
// In payment processing
async function processPayment(userId: string, amount: number) {
const span = tracer.startSpan('payment.process');
span.setAttributes({
'user.id': userId,
'payment.amount': amount,
'user.tier': await getUserTier(userId),
'payment.method': 'credit_card'
});
// Track business event
metrics.recordPayment(amount);
try {
const result = await chargeCard(userId, amount);
span.addEvent('payment.success', {
'transaction.id': result.transactionId
});
return result;
} catch (error) {
span.recordException(error);
metrics.recordPaymentFailure(amount);
throw error;
} finally {
span.end();
}
}
Deployment and Operations
Docker Container Setup
OpenTelemetry in containers:
Code:
FROM node:18-alpine
WORKDIR /app
COPY package.json .
RUN npm install
COPY . .
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
ENV OTEL_EXPORTER_OTLP_INSECURE=true
ENV OTEL_TRACES_EXPORTER=otlp
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_LOGS_EXPORTER=otlp
CMD ["node", "app.js"]
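For local development, the same `OTEL_*` environment variables can point at a collector started alongside the app via Docker Compose. The sketch below assumes a collector config file named `otel-collector-config.yaml` exists next to the compose file; image tags and ports are illustrative:

```yaml
services:
  app:
    build: .
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
    depends_on:
      - otel-collector
  otel-collector:
    image: otel/opentelemetry-collector:latest
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4318:4318"   # OTLP over HTTP
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
```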
Kubernetes Integration
Use sidecar pattern for the collector:
Code:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-collector
spec:
  containers:
    - name: app
      image: myapp:latest
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4318
    - name: otel-collector
      image: otel/opentelemetry-collector:latest
      ports:
        - containerPort: 4318
FAQ
Q: Do I need to use OpenTelemetry? A: If you run multiple services, yes. It's the industry standard. For single monoliths, it's still valuable for understanding performance.
Q: Can I migrate from another tool? A: Yes. OpenTelemetry works alongside existing tools. Gradually migrate by setting up both.
Q: What's the performance overhead? A: Typically 5-15% CPU impact when batched. Auto-instrumentation is more expensive than manual.
Q: How much data should I collect? A: Start with 100% sampling in development, 5-10% in production, 100% for errors.
Q: Can I query OpenTelemetry data? A: Yes, through your backend. Jaeger, Tempo, and others have query UIs.
Q: What about privacy and data retention? A: OpenTelemetry doesn't store data—backends do. Implement retention policies (30-90 days typical).
Q: How do I handle cardinality explosion? A: Avoid using unbounded values (user IDs, order IDs) as attribute keys. Use them as values instead, and limit unique values.
Q: What's the learning curve? A: Basic instrumentation is straightforward. Advanced patterns (sampling, filtering, context propagation) take more time to master.
Moving Forward with OpenTelemetry
Observability is not optional anymore. As systems grow more complex, the ability to see what's happening becomes mission-critical. OpenTelemetry provides the foundation that lets us instrument once and adapt our observability infrastructure as our needs evolve.
Start with auto-instrumentation. It gives you 80% of the value. Then add custom spans for business logic. Ship telemetry to a backend you choose. Move from reactive firefighting to proactive understanding.
The teams we work with—across web development, SaaS, and cloud infrastructure—consistently tell us that OpenTelemetry transformed how they debug production issues. What used to take hours now takes minutes. And more importantly, they catch problems before users notice them.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 1000+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement.