Distributed Tracing: OpenTelemetry, Jaeger, Tempo, and Trace-Based Debugging
In a monolith, a slow request is easy to debug: add a profiler, look at the stack trace. In a distributed system with 10 services, a 2-second latency spike could originate in any one of them — or in the network between them. Distributed tracing answers the question: where did this request actually spend its time?
A trace is the complete journey of a request through your system. A span is one unit of work within that journey. Together they give you a flamegraph of your entire request across every service it touched.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the CNCF project that standardizes observability instrumentation. It replaces vendor-specific SDKs (Jaeger SDK, Zipkin SDK, Datadog tracer) with one implementation that exports to any backend.
Your App + OTel SDK → OTel Collector → Backend (Jaeger / Tempo / Datadog / Honeycomb)
The OTel Collector decouples your application from the backend — switch from Jaeger to Honeycomb without changing application code.
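To make that wiring concrete, here is a hedged docker-compose sketch of the pipeline; the service names, image tags, and file paths are illustrative assumptions, not fixed requirements:

```yaml
# docker-compose.yml — hypothetical local wiring (images and paths assumed)
services:
  api:
    build: .
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      SERVICE_NAME: api
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
```

Swapping the backend then means changing only the Collector's exporter config, never the `api` service.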
Node.js Instrumentation
```ts
// instrumentation.ts — must be loaded before any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
// Auto-instrumentation: automatically traces HTTP, PostgreSQL, Redis, gRPC
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION ?? '0.0.1',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'development',
  }),
  // Export to the OTel Collector (which forwards to Jaeger/Tempo)
  spanProcessor: new BatchSpanProcessor(
    new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
    })
  ),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // Don't trace health checks — too noisy
        ignoreIncomingRequestHook: (req) => req.url?.includes('/health') ?? false,
      },
      '@opentelemetry/instrumentation-pg': {
        // Append a sqlcommenter comment (carrying trace context) to outgoing queries
        addSqlCommenterCommentToQueries: true,
      },
      '@opentelemetry/instrumentation-redis-4': {
        // Record only the command name, never argument values
        dbStatementSerializer: (cmdName) => cmdName,
      },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry SDK started');

// Graceful shutdown — flush buffered spans, then exit
process.on('SIGTERM', async () => {
  await sdk.shutdown();
  process.exit(0);
});

// package.json start script — load instrumentation first:
// "start": "node --require ./dist/instrumentation.js ./dist/app.js"
```
Manual Spans for Custom Operations
Auto-instrumentation traces HTTP and database calls. Add manual spans for business logic:
```ts
// routes/payment.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

app.post('/payments', async (request, reply) => {
  // Create a span for the entire payment flow
  return tracer.startActiveSpan('process_payment', async (span) => {
    try {
      span.setAttributes({
        'payment.amount_cents': request.body.amountCents,
        'payment.currency': request.body.currency,
        'payment.method': request.body.paymentMethodId,
        'user.id': request.user.id,
        'tenant.id': request.tenantId,
      });

      // Child span: validate payment
      const validationResult = await tracer.startActiveSpan(
        'validate_payment',
        async (validationSpan) => {
          try {
            const result = await validatePayment(request.body);
            validationSpan.setStatus({ code: SpanStatusCode.OK });
            return result;
          } catch (err: any) {
            validationSpan.setStatus({
              code: SpanStatusCode.ERROR,
              message: err.message,
            });
            validationSpan.recordException(err);
            throw err;
          } finally {
            validationSpan.end();
          }
        }
      );

      // Child span: charge via Stripe
      const charge = await tracer.startActiveSpan(
        'stripe_charge',
        async (stripeSpan) => {
          stripeSpan.setAttribute('stripe.payment_method', request.body.paymentMethodId);
          try {
            const result = await stripe.paymentIntents.create({
              amount: request.body.amountCents,
              currency: request.body.currency,
              payment_method: request.body.paymentMethodId,
              confirm: true,
            });
            stripeSpan.setAttribute('stripe.payment_intent_id', result.id);
            stripeSpan.setStatus({ code: SpanStatusCode.OK });
            return result;
          } catch (err: any) {
            stripeSpan.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
            stripeSpan.recordException(err);
            throw err;
          } finally {
            stripeSpan.end();
          }
        }
      );

      span.setAttribute('payment.intent_id', charge.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return reply.send({ success: true, paymentIntentId: charge.id });
    } catch (err: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
});
```
Trace Context Propagation
For traces to connect across services, HTTP clients must propagate trace context headers:
```ts
// Auto-instrumentation handles this for fetch and axios automatically.
// For manual HTTP clients (W3C trace context is the default propagator,
// so no extra setup is needed):
import { propagation, context } from '@opentelemetry/api';

async function callUserService(userId: string): Promise<User> {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };

  // Inject current trace context into outgoing request headers.
  // Adds: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  propagation.inject(context.active(), headers);

  const response = await fetch(`http://user-service/users/${userId}`, {
    headers,
  });
  return response.json();
}
```
When the user-service extracts this header, its spans automatically become children of the caller's span — building one connected trace across both services.
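Auto-instrumentation performs that extraction on the server side too. If you ever consume the headers manually (a bare `node:http` server, a queue consumer), a sketch of the extract half looks like this; the span name and handler shape are illustrative:

```ts
import http from 'node:http';
import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

// Hypothetical manual handler; normally @opentelemetry/instrumentation-http does this.
http
  .createServer((req, res) => {
    // Read traceparent/tracestate from incoming headers into a Context
    const parentCtx = propagation.extract(context.active(), req.headers);

    // Spans started inside this callback become children of the caller's span
    context.with(parentCtx, () => {
      tracer.startActiveSpan('GET /users/:id', (span) => {
        res.end(JSON.stringify({ ok: true }));
        span.end();
      });
    });
  })
  .listen(3000);
```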
OTel Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  # Add resource attributes to all telemetry
  resource:
    attributes:
      - key: cluster
        value: production
        action: upsert
  # Filter out health check spans (reduce noise)
  filter:
    traces:
      span:
        - 'attributes["http.route"] == "/health/live"'
        - 'attributes["http.route"] == "/health/ready"'
  # Head sampling: keep 10% of traces (add to the pipeline below to enable)
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10

exporters:
  # Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Also send to Datadog if you use it (add to the exporters list below)
  datadog:
    api:
      key: ${DATADOG_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter]
      exporters: [otlp/tempo]
```
Jaeger vs Grafana Tempo
| Feature | Jaeger | Grafana Tempo |
|---|---|---|
| Backend storage | Cassandra, Elasticsearch, Badger | Object storage (S3, GCS) — much cheaper |
| Query UI | Jaeger UI | Grafana (also searches Tempo) |
| Correlate with metrics | Via Grafana data source | Native in Grafana stack |
| Cost at scale | High (Elasticsearch) | Low (S3 ~$0.023/GB-month) |
| Self-host complexity | Medium | Low (especially with Grafana Cloud) |
| Best for | Existing Jaeger users, Elasticsearch users | New deployments, cost-sensitive, Grafana stack |
For new deployments: Grafana Tempo + Grafana for visualization. Tempo stores traces in S3 and is dramatically cheaper than Elasticsearch at scale.
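If you go that route, a minimal single-binary Tempo config looks roughly like the sketch below; the bucket name and region are placeholders, and production setups add compaction, retention, and auth on top:

```yaml
# tempo.yaml — minimal single-binary sketch (bucket/region are placeholders)
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

storage:
  trace:
    backend: s3
    s3:
      bucket: my-trace-bucket
      region: us-east-1
```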
Trace-Based Debugging Workflow
When you get a latency alert:
1. **Find the slow trace.** Search Tempo/Jaeger for `service=api, duration>2s, status=ok` (see the TraceQL sketch after this list). Slow *successful* requests are often the worst, because users don't know they're slow.
2. **Identify the bottleneck span.** Look at the waterfall: which span took the most time? It is often a database query (N+1 pattern), an external API call, or serialization.
3. **Check span attributes.** For example:
   - `db.statement: "SELECT * FROM events WHERE user_id = ?"`
   - `db.rows_affected: 150,000` ← that's the problem
4. **Cross-correlate with metrics.** In Grafana, link from the trace to the PostgreSQL dashboard and look for a query-volume spike at the same time as the latency spike.
5. **Fix and verify.** Add the index and deploy, re-run the query with `EXPLAIN ANALYZE`, and confirm in traces that p99 latency drops within 5 minutes.
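For Tempo specifically, step 1 translates into a TraceQL query. A minimal sketch, where the service name is whatever you set as `SERVICE_NAME`:

```
{ resource.service.name = "api" && duration > 2s && status = ok }
```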
Working With Viprasol
We implement observability stacks — OpenTelemetry instrumentation, OTel Collector configuration, Grafana Tempo or Jaeger deployment, and dashboards that connect traces to metrics and logs. Complete observability is what turns production incidents from hour-long investigations into 10-minute diagnoses.
→ Talk to our observability team about distributed tracing for your platform.
See Also
- Observability and Monitoring — metrics and alerting alongside tracing
- DevOps Best Practices — observability as a DevOps practice
- Microservices Architecture — distributed systems that need tracing
- Kubernetes vs ECS — deploying the OTel collector in containers
- Cloud Solutions — observability infrastructure and DevOps
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.