Distributed Tracing: OpenTelemetry, Jaeger, Tempo, and Trace-Based Debugging
In a monolith, a slow request is easy to debug: add a profiler, look at the stack trace. In a distributed system with 10 services, a 2-second latency spike could originate in any one of them — or in the network between them. Distributed tracing answers the question: where did this request actually spend its time?
A trace is the complete journey of a request through your system. A span is one unit of work within that journey. Together they give you a flamegraph of your entire request across every service it touched.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the CNCF project that standardizes observability instrumentation. It replaces vendor-specific SDKs (Jaeger SDK, Zipkin SDK, Datadog tracer) with one implementation that exports to any backend.
Your App + OTel SDK → OTel Collector → Backend (Jaeger / Tempo / Datadog / Honeycomb)
The OTel Collector decouples your application from the backend — switch from Jaeger to Honeycomb without changing application code.
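To make that wiring concrete, here is a hedged docker-compose sketch of the pipeline; the service names, image tags, and file paths are illustrative assumptions, not fixed requirements:

```yaml
# docker-compose.yml — hypothetical local wiring (images and paths assumed)
services:
  api:
    build: .
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      SERVICE_NAME: api
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
```

Swapping the backend then means changing only the Collector's exporter config, never the `api` service.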
Node.js Instrumentation
```ts
// instrumentation.ts — must be loaded before any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
// Auto-instrumentation: automatically traces HTTP, PostgreSQL, Redis, gRPC
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION ?? '0.0.1',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'development',
  }),
  // Export to the OTel Collector (which forwards to Jaeger/Tempo)
  spanProcessor: new BatchSpanProcessor(
    new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
    })
  ),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // Don't trace health checks — too noisy
        ignoreIncomingRequestHook: (req) => req.url?.includes('/health') ?? false,
      },
      '@opentelemetry/instrumentation-pg': {
        // Append a sqlcommenter comment (carrying trace context) to outgoing queries
        addSqlCommenterCommentToQueries: true,
      },
      '@opentelemetry/instrumentation-redis-4': {
        // Record only the command name, never argument values
        dbStatementSerializer: (cmdName) => cmdName,
      },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry SDK started');

// Graceful shutdown — flush buffered spans, then exit
process.on('SIGTERM', async () => {
  await sdk.shutdown();
  process.exit(0);
});

// package.json start script — load instrumentation first:
// "start": "node --require ./dist/instrumentation.js ./dist/app.js"
```
Manual Spans for Custom Operations
Auto-instrumentation traces HTTP and database calls. Add manual spans for business logic:
```ts
// routes/payment.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

app.post('/payments', async (request, reply) => {
  // Create a span for the entire payment flow
  return tracer.startActiveSpan('process_payment', async (span) => {
    try {
      span.setAttributes({
        'payment.amount_cents': request.body.amountCents,
        'payment.currency': request.body.currency,
        'payment.method': request.body.paymentMethodId,
        'user.id': request.user.id,
        'tenant.id': request.tenantId,
      });

      // Child span: validate payment
      const validationResult = await tracer.startActiveSpan(
        'validate_payment',
        async (validationSpan) => {
          try {
            const result = await validatePayment(request.body);
            validationSpan.setStatus({ code: SpanStatusCode.OK });
            return result;
          } catch (err: any) {
            validationSpan.setStatus({
              code: SpanStatusCode.ERROR,
              message: err.message,
            });
            validationSpan.recordException(err);
            throw err;
          } finally {
            validationSpan.end();
          }
        }
      );

      // Child span: charge via Stripe
      const charge = await tracer.startActiveSpan(
        'stripe_charge',
        async (stripeSpan) => {
          stripeSpan.setAttribute('stripe.payment_method', request.body.paymentMethodId);
          try {
            const result = await stripe.paymentIntents.create({
              amount: request.body.amountCents,
              currency: request.body.currency,
              payment_method: request.body.paymentMethodId,
              confirm: true,
            });
            stripeSpan.setAttribute('stripe.payment_intent_id', result.id);
            stripeSpan.setStatus({ code: SpanStatusCode.OK });
            return result;
          } catch (err: any) {
            stripeSpan.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
            stripeSpan.recordException(err);
            throw err;
          } finally {
            stripeSpan.end();
          }
        }
      );

      span.setAttribute('payment.intent_id', charge.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return reply.send({ success: true, paymentIntentId: charge.id });
    } catch (err: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
});
```
Trace Context Propagation
For traces to connect across services, HTTP clients must propagate trace context headers:
```ts
// Auto-instrumentation handles this for fetch and axios automatically.
// For manual HTTP clients (W3C trace context is the default propagator,
// so no extra setup is needed):
import { propagation, context } from '@opentelemetry/api';

async function callUserService(userId: string): Promise<User> {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };

  // Inject current trace context into outgoing request headers.
  // Adds: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  propagation.inject(context.active(), headers);

  const response = await fetch(`http://user-service/users/${userId}`, {
    headers,
  });
  return response.json();
}
```
When the user-service extracts this header, its spans automatically become children of the caller's span — building one connected trace across both services.
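Auto-instrumentation performs that extraction on the server side too. If you ever consume the headers manually (a bare `node:http` server, a queue consumer), a sketch of the extract half looks like this; the span name and handler shape are illustrative:

```ts
import http from 'node:http';
import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

// Hypothetical manual handler; normally @opentelemetry/instrumentation-http does this.
http
  .createServer((req, res) => {
    // Read traceparent/tracestate from incoming headers into a Context
    const parentCtx = propagation.extract(context.active(), req.headers);

    // Spans started inside this callback become children of the caller's span
    context.with(parentCtx, () => {
      tracer.startActiveSpan('GET /users/:id', (span) => {
        res.end(JSON.stringify({ ok: true }));
        span.end();
      });
    });
  })
  .listen(3000);
```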
OTel Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  # Add resource attributes to all telemetry
  resource:
    attributes:
      - key: cluster
        value: production
        action: upsert
  # Filter out health check spans (reduce noise)
  filter:
    traces:
      span:
        - 'attributes["http.route"] == "/health/live"'
        - 'attributes["http.route"] == "/health/ready"'
  # Head sampling: keep 10% of traces (add to the pipeline below to enable)
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10

exporters:
  # Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Also send to Datadog if you use it (add to the exporters list below)
  datadog:
    api:
      key: ${DATADOG_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter]
      exporters: [otlp/tempo]
```
Jaeger vs Grafana Tempo
| Feature | Jaeger | Grafana Tempo |
|---|---|---|
| Backend storage | Cassandra, Elasticsearch, Badger | Object storage (S3, GCS) — much cheaper |
| Query UI | Jaeger UI | Grafana (also searches Tempo) |
| Correlate with metrics | Via Grafana data source | Native in Grafana stack |
| Cost at scale | High (Elasticsearch) | Low (S3 ~$0.023/GB-month) |
| Self-host complexity | Medium | Low (especially with Grafana Cloud) |
| Best for | Existing Jaeger users, Elasticsearch users | New deployments, cost-sensitive, Grafana stack |
For new deployments: Grafana Tempo + Grafana for visualization. Tempo stores traces in S3 and is dramatically cheaper than Elasticsearch at scale.
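If you go that route, a minimal single-binary Tempo config looks roughly like the sketch below; the bucket name and region are placeholders, and production setups add compaction, retention, and auth on top:

```yaml
# tempo.yaml — minimal single-binary sketch (bucket/region are placeholders)
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

storage:
  trace:
    backend: s3
    s3:
      bucket: my-trace-bucket
      region: us-east-1
```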
Trace-Based Debugging Workflow
When you get a latency alert:
1. **Find the slow trace.** Search Tempo/Jaeger for `service=api, duration>2s, status=ok` (see the TraceQL sketch after this list). Slow *successful* requests are often the worst, because users don't know they're slow.
2. **Identify the bottleneck span.** Look at the waterfall: which span took the most time? It is often a database query (N+1 pattern), an external API call, or serialization.
3. **Check span attributes.** For example:
   - `db.statement: "SELECT * FROM events WHERE user_id = ?"`
   - `db.rows_affected: 150,000` ← that's the problem
4. **Cross-correlate with metrics.** In Grafana, link from the trace to the PostgreSQL dashboard and look for a query-volume spike at the same time as the latency spike.
5. **Fix and verify.** Add the index and deploy, re-run the query with `EXPLAIN ANALYZE`, and confirm in traces that p99 latency drops within 5 minutes.
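For Tempo specifically, step 1 translates into a TraceQL query. A minimal sketch, where the service name is whatever you set as `SERVICE_NAME`:

```
{ resource.service.name = "api" && duration > 2s && status = ok }
```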
Working With Viprasol
We implement observability stacks — OpenTelemetry instrumentation, OTel Collector configuration, Grafana Tempo or Jaeger deployment, and dashboards that connect traces to metrics and logs. Complete observability is what turns production incidents from hour-long investigations into 10-minute diagnoses.
→ Talk to our observability team about distributed tracing for your platform.
See Also
- Observability and Monitoring — metrics and alerting alongside tracing
- DevOps Best Practices — observability as a DevOps practice
- Microservices Architecture — distributed systems that need tracing
- Kubernetes vs ECS — deploying the OTel collector in containers
- Cloud Solutions — observability infrastructure and DevOps
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.