ADR-0043: OpenTelemetry for Distributed Tracing

Status

Accepted - 2025-01-26


Context

TVL Platform has a distributed architecture (Fastify API, Next.js frontend, BullMQ workers, Redis, PostgreSQL), so a single request can cross several services and must be traceable end to end.


Decision

Adopt OpenTelemetry (OTel) for distributed tracing, with Grafana Tempo as the trace backend.

Rationale

  1. Vendor-Neutral: Open standard, not locked to vendor
  2. Auto-Instrumentation: Automatic tracing for HTTP, DB, Redis
  3. Industry Standard: Adopted by AWS, Google Cloud, Datadog
  4. Unified: Single SDK for traces, metrics, logs

Alternatives Considered

Alternative 1: Jaeger

Rejected - Heavier self-hosted infrastructure; its client libraries are deprecated in favor of OpenTelemetry SDKs

Alternative 2: Zipkin

Rejected - Less active development, smaller ecosystem

Alternative 3: AWS X-Ray

Rejected - Vendor lock-in, doesn't work locally


Implementation

1. Install OpenTelemetry SDK

pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
pnpm add @opentelemetry/exporter-trace-otlp-http
# Packages imported directly by the snippets below (pnpm requires direct dependencies to be declared)
pnpm add @opentelemetry/api @opentelemetry/resources @opentelemetry/semantic-conventions @opentelemetry/sdk-trace-base

2. Initialize Tracing

// src/telemetry/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'tvl-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Instrument HTTP, Fastify, PostgreSQL, and Redis
      '@opentelemetry/instrumentation-http': {},
      '@opentelemetry/instrumentation-fastify': {},
      '@opentelemetry/instrumentation-pg': {},
      '@opentelemetry/instrumentation-redis-4': {},
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().then(() => console.log('Tracing terminated'));
});

export default sdk;
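To verify spans locally before Tempo is wired up, a console exporter can stand in for OTLP. A minimal sketch, assuming a separate development-only entry point (the tracing.dev.ts filename is illustrative):

// src/telemetry/tracing.dev.ts (hypothetical development-only variant)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Prints every finished span to stdout - handy for confirming instrumentation
// works before the Tempo endpoint and credentials are configured.
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();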

3. Start Application with Tracing

// src/server.ts
import './telemetry/tracing'; // Initialize FIRST, before any instrumented module is used
import fastify from 'fastify';
import { db, bookings } from './db'; // hypothetical import path; use the project's actual data-access module

const app = fastify();

// Routes are traced automatically!
app.get('/api/v1/bookings', async (req, reply) => {
  // This request is automatically traced:
  // 1. HTTP request span
  // 2. Database query spans
  // 3. Redis cache spans
  const results = await db.select().from(bookings); // renamed to avoid shadowing the `bookings` table
  return { bookings: results };
});

await app.listen({ port: 4000 });
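As a complement to the SIGTERM handler in tracing.ts, buffered spans can also be flushed when the Fastify instance closes. A small optional sketch (not required by this ADR) that extends the server.ts above:

// src/server.ts (optional addition)
import sdk from './telemetry/tracing';

app.addHook('onClose', async () => {
  await sdk.shutdown(); // forces a final export of any buffered spans
});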

Manual Spans (Custom Instrumentation)

// src/services/bookings/createBooking.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { db, bookings } from '../../db'; // hypothetical import path for the data layer

const tracer = trace.getTracer('booking-service');

export async function createBooking(data: CreateBookingInput) {
  // Create a custom span
  return await tracer.startActiveSpan('createBooking', async (span) => {
    try {
      span.setAttribute('booking.guestName', data.guestName);
      span.setAttribute('booking.propertyId', data.propertyId);

      // Business logic
      const booking = await db.insert(bookings).values(data).returning();

      // Add an event to the span
      span.addEvent('booking.created', {
        bookingId: booking[0].id,
      });

      return booking[0];
    } catch (error) {
      // Record the error on the span
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Trace Context Propagation

HTTP Request → Worker Job

// src/routes/webhooks/hostaway.ts
import { propagation, context } from '@opentelemetry/api';
import { webhookQueue } from '../../queues'; // hypothetical import path for the existing BullMQ queue

app.post('/webhooks/hostaway', async (req, reply) => {
  // Capture the current trace context
  const ctx = context.active();
  const carrier: Record<string, string> = {};
  propagation.inject(ctx, carrier);

  // Pass it along with the worker job
  await webhookQueue.add('hostaway.booking.created', {
    payload: req.body,
    traceContext: carrier, // propagate the trace context
  });

  return reply.status(200).send({ received: true });
});

// src/workers/webhookWorker.ts
import { Worker } from 'bullmq';
import { propagation, context } from '@opentelemetry/api';

const worker = new Worker('webhooks', async (job) => {
  // Extract the propagated trace context
  const parentCtx = propagation.extract(context.active(), job.data.traceContext);

  // Continue the trace
  await context.with(parentCtx, async () => {
    await processWebhook(job.data.payload);
  });
}); // BullMQ connection options omitted for brevity
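Auto-instrumented calls inside processWebhook attach to the extracted parent automatically, but the job itself only appears as its own span if one is started explicitly. A minimal sketch; the tracer name webhook-worker and the runWithTrace helper are illustrative:

// src/workers/webhookWorker.ts (hypothetical helper)
import { trace, context, propagation } from '@opentelemetry/api';

const tracer = trace.getTracer('webhook-worker');

async function runWithTrace(jobData: { payload: unknown; traceContext: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), jobData.traceContext);

  // Start an explicit span as a child of the propagated context
  await tracer.startActiveSpan('processWebhook', {}, parentCtx, async (span) => {
    try {
      await processWebhook(jobData.payload);
    } finally {
      span.end();
    }
  });
}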

Distributed Trace Example

User Request → Fastify API → PostgreSQL → Redis → BullMQ → Worker → External API
└───────────────────── Single Trace (trace_id: abc-123) ─────────────────────┘

Trace spans:

  1. GET /api/v1/bookings (HTTP request) - 250ms
    • SELECT * FROM bookings (DB query) - 150ms
    • GET org:123:bookings (Redis cache miss) - 2ms
    • webhook.queue.add (BullMQ enqueue) - 5ms
  2. processWebhook (Worker job) - 500ms
    • POST https://api.hostaway.com/bookings (External API) - 450ms

Total trace duration: 750ms (250ms API request + 500ms asynchronous worker job); the client receives its response after the 250ms API segment, while the worker continues within the same trace.
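To jump from application logs to the corresponding Tempo trace, the active trace ID can be read from the OpenTelemetry API and attached to log lines; a short sketch (the pino/Fastify logger call is illustrative):

import { trace } from '@opentelemetry/api';

// Inside any traced request handler or worker job:
const activeSpan = trace.getActiveSpan();
const traceId = activeSpan?.spanContext().traceId;

// e.g. attach it to structured logs so Grafana can link log -> trace
req.log.info({ traceId }, 'booking list requested');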


Sampling Strategy

Production Sampling (10%)

import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Sample 10% of traces (reduces cost)
  sampler: new TraceIdRatioBasedSampler(0.1),
  // ...
});

Always Sample Errors

import { SamplingDecision, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
import type { Sampler } from '@opentelemetry/sdk-trace-base';

class ErrorSampler implements Sampler {
  private ratioSampler = new TraceIdRatioBasedSampler(0.1);

  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample if an error status code is present
    if (Number(attributes['http.status_code']) >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Otherwise, fall back to 10% ratio sampling
    return this.ratioSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  }

  toString() {
    return 'ErrorSampler';
  }
}
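A sketch of wiring the custom sampler into the SDK, continuing the snippets above. Note that a head-based sampler only sees attributes set at span start, so attribute-based error sampling is best-effort here; fully reliable error sampling is typically done with tail sampling in an OpenTelemetry Collector.

import { ParentBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Root spans go through ErrorSampler; child spans follow the root's decision
  sampler: new ParentBasedSampler({ root: new ErrorSampler() }),
  traceExporter,
  // ...
});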

Exporting to Grafana Tempo

Environment Variables

# .env.local
OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.grafana.net
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64-encoded-credentials>

Grafana Cloud Setup

  1. Sign up: https://grafana.com/products/cloud/
  2. Go to Tempo → Send Traces
  3. Copy OTLP endpoint and credentials
  4. Add to Doppler secrets
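Alternatively, credentials can be passed to the exporter in code instead of via OTEL_EXPORTER_OTLP_HEADERS; a sketch where the TEMPO_BASIC_AUTH variable name is an assumption:

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // Tempo OTLP traces endpoint
  headers: {
    // base64-encoded "<instance-id>:<api-token>" from Grafana Cloud
    Authorization: `Basic ${process.env.TEMPO_BASIC_AUTH ?? ''}`,
  },
});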

Querying Traces

Grafana Tempo UI

1. Search by trace ID:
trace_id=abc-123-def-456

2. Search by service:
service.name=tvl-api

3. Search by duration:
duration > 1s

4. Search by error:
status.code=ERROR

TraceQL (Tempo Query Language)

# Find slow booking creation requests
{ resource.service.name = "tvl-api" && name = "createBooking" && duration > 500ms }

# Find failed webhook processing
{ resource.service.name = "webhook-worker" && status = error }

# Find traces with DB queries slower than 100ms
{ resource.service.name = "tvl-api" && span.db.statement != "" && duration > 100ms }

Dashboard Metrics from Traces

Request Performance:
- p50, p95, p99 latency
- Error rate (%)
- Requests/second
- Slowest endpoints

Database Performance:
- Query latency (p50, p95, p99)
- Slow queries (>100ms)
- Connection pool usage

External API Performance:
- Hostaway API latency
- Airbnb API latency
- Circuit breaker state

Consequences

Positive

  • Full Visibility: Track requests across services
  • Root Cause Analysis: Identify slow queries, API calls
  • Vendor-Neutral: Not locked to Grafana (can switch to Datadog, Honeycomb)
  • Auto-Instrumentation: No code changes for common frameworks

Negative

  • Performance Overhead: 1-5ms latency per request
  • Learning Curve: TraceQL, span attributes
  • Cost: Grafana Cloud ($8-50/month for MVP)

Mitigations

  • Use sampling (10% in production)
  • Monitor trace volume (alert if >1M spans/day)
  • Free tier sufficient for MVP.0-MVP.1 (<50k spans/month)

Validation Checklist

  • OpenTelemetry SDK initialized
  • Auto-instrumentation enabled (HTTP, DB, Redis)
  • Custom spans for business logic
  • Trace context propagation (HTTP → Worker)
  • Sampling configured (10% production)
  • Grafana Tempo integration
  • Dashboard for request performance

References