ADR-0043: OpenTelemetry for Distributed Tracing
Status
Accepted - 2025-01-26
Context
TVL Platform has distributed architecture (Fastify API, Next.js frontend, BullMQ workers, Redis, PostgreSQL) requiring request tracing across services.
Decision
OpenTelemetry (OTEL) for distributed tracing with Grafana Tempo backend.
Rationale
- Vendor-Neutral: Open standard, not locked to vendor
- Auto-Instrumentation: Automatic tracing for HTTP, DB, Redis
- Industry Standard: Adopted by AWS, Google Cloud, Datadog
- Unified: Single SDK for traces, metrics, logs
Alternatives Considered
Alternative 1: Jaeger
Rejected - Heavy self-hosted infrastructure; OpenTelemetry supersedes Jaeger's client SDKs
Alternative 2: Zipkin
Rejected - Less active development, smaller ecosystem
Alternative 3: AWS X-Ray
Rejected - Vendor lock-in; requires the X-Ray daemon, making local development awkward
Implementation
1. Install OpenTelemetry SDK
pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
pnpm add @opentelemetry/exporter-trace-otlp-http
pnpm add @opentelemetry/resources @opentelemetry/semantic-conventions
2. Initialize Tracing
// src/telemetry/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const traceExporter = new OTLPTraceExporter({
  // OTEL_EXPORTER_OTLP_ENDPOINT holds the base URL (see .env.local below);
  // when `url` is passed explicitly, the /v1/traces path must be appended
  url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`,
});
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'tvl-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Instrument Fastify, PostgreSQL, Redis, HTTP
      '@opentelemetry/instrumentation-http': {},
      '@opentelemetry/instrumentation-fastify': {},
      '@opentelemetry/instrumentation-pg': {},
      '@opentelemetry/instrumentation-redis-4': {},
    }),
  ],
});
sdk.start();
process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((err) => console.error('Error shutting down tracing', err));
});
export default sdk;
3. Start Application with Tracing
// src/server.ts
import './telemetry/tracing'; // Initialize FIRST
import fastify from 'fastify';
const app = fastify();
// Routes automatically traced!
app.get('/api/v1/bookings', async (req, reply) => {
  // This request is automatically traced:
  // 1. HTTP request span
  // 2. Database query spans
  // 3. Redis cache spans
  // (assumes `db` and the `bookings` table come from the app's data layer)
  const results = await db.select().from(bookings); // avoid shadowing the `bookings` table
  return { bookings: results };
});
await app.listen({ port: 4000 });
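Caveat: a side-effect import like this works for CommonJS builds, but under ESM the auto-instrumentation hooks may register too late to patch modules. In that case, load the SDK via a Node flag instead. A sketch, assuming compiled output in dist/:
# CommonJS build
node --require ./dist/telemetry/tracing.js dist/server.js
# ESM build (Node >= 18.19)
node --import ./dist/telemetry/tracing.js dist/server.js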
Manual Spans (Custom Instrumentation)
// src/services/bookings/createBooking.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('booking-service');
export async function createBooking(data: CreateBookingInput) {
  // Create custom span
  return await tracer.startActiveSpan('createBooking', async (span) => {
    try {
      span.setAttribute('booking.guestName', data.guestName);
      span.setAttribute('booking.propertyId', data.propertyId);
      // Business logic
      const booking = await db.insert(bookings).values(data).returning();
      // Add event to span
      span.addEvent('booking.created', {
        bookingId: booking[0].id,
      });
      return booking[0];
    } catch (error) {
      // Record error in span (normalize unknown -> Error for strict TypeScript)
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
Trace Context Propagation
HTTP Request → Worker Job
// src/routes/webhooks/hostaway.ts
import { propagation, context } from '@opentelemetry/api';
app.post('/webhooks/hostaway', async (req, reply) => {
  // Capture current trace context
  const ctx = context.active();
  const carrier: Record<string, string> = {};
  propagation.inject(ctx, carrier);
  // Pass to worker job
  await webhookQueue.add('hostaway.booking.created', {
    payload: req.body,
    traceContext: carrier, // Propagate trace context
  });
  return reply.status(200).send({ received: true });
});
// src/workers/webhookWorker.ts
import { propagation, context } from '@opentelemetry/api';
import { Worker } from 'bullmq';
const worker = new Worker('webhooks', async (job) => {
  // Extract trace context
  const parentCtx = propagation.extract(context.active(), job.data.traceContext);
  // Continue trace
  await context.with(parentCtx, async () => {
    await processWebhook(job.data.payload);
  });
});
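The processWebhook handler is not shown above; here is a minimal, hypothetical sketch illustrating that any span started inside context.with() joins the propagated trace:
import { trace } from '@opentelemetry/api';

// Hypothetical handler: because this runs inside context.with(parentCtx, ...),
// the span becomes a child of the trace started by the original HTTP request
async function processWebhook(payload: unknown) {
  const tracer = trace.getTracer('webhook-worker');
  await tracer.startActiveSpan('processWebhook', async (span) => {
    try {
      // ... handle the webhook payload
    } finally {
      span.end();
    }
  });
}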
Distributed Trace Example
User Request → Fastify API → PostgreSQL → Redis → BullMQ → Worker → External API
    │              │              │         │        │        │          │
    └──────────────┴──────────────┴─────────┴────────┴────────┴──────────┘
                    Single Trace (trace_id: abc-123)
Trace spans:
- GET /api/v1/bookings (HTTP request) - 250ms
- SELECT * FROM bookings (DB query) - 150ms
- GET org:123:bookings (Redis cache miss) - 2ms
- webhook.queue.add (BullMQ enqueue) - 5ms
- processWebhook (Worker job) - 500ms
- POST https://api.hostaway.com/bookings (External API) - 450ms
Total request time: 750ms (250ms API + 500ms worker)
Sampling Strategy
Production Sampling (10%)
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
const sdk = new NodeSDK({
  // Sample 10% of traces (reduces cost)
  sampler: new TraceIdRatioBasedSampler(0.1),
  // ...
});
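Alternatively, the same ratio can be set through the standard OpenTelemetry environment variables, with no code changes:
# .env (production)
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1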
Always Sample Errors
import { Context, Attributes, SpanKind, Link } from '@opentelemetry/api';
import { Sampler, SamplingDecision, SamplingResult, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
// Note: head-based samplers only see attributes present at span creation;
// errors recorded later require tail-based sampling (e.g., in the OTel Collector)
class ErrorSampler implements Sampler {
  private readonly fallback = new TraceIdRatioBasedSampler(0.1);
  shouldSample(ctx: Context, traceId: string, name: string, kind: SpanKind, attrs: Attributes, links: Link[]): SamplingResult {
    // Always sample if an error status is already present
    if (Number(attrs['http.status_code']) >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Otherwise, sample 10%
    return this.fallback.shouldSample(ctx, traceId, name, kind, attrs, links);
  }
  toString() {
    return 'ErrorSampler';
  }
}
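To activate the custom sampler, wrap it in a ParentBasedSampler so child spans follow their root span's decision (a sketch reusing the NodeSDK config from step 2):
import { ParentBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Root spans go through ErrorSampler; children inherit the decision
  sampler: new ParentBasedSampler({ root: new ErrorSampler() }),
  // ...
});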
Exporting to Grafana Tempo
Environment Variables
# .env.local
OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.grafana.net
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64-encoded-credentials>
Grafana Cloud Setup
1. Sign up: https://grafana.com/products/cloud/
2. Go to Tempo → Send Traces
3. Copy the OTLP endpoint and credentials
4. Add them to Doppler secrets (example below)
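For example, via the Doppler CLI (assuming the project and config are already selected):
doppler secrets set OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.grafana.net
doppler secrets set OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-encoded-credentials>"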
Querying Traces
Grafana Tempo UI
1. Search by trace ID:
   trace_id=abc-123-def-456
2. Search by service:
   service.name=tvl-api
3. Search by duration:
   duration > 1s
4. Search by error:
   status.code=ERROR
TraceQL (Tempo Query Language)
# Find slow booking creation requests
{ resource.service.name = "tvl-api" && name = "createBooking" && duration > 500ms }
# Find failed webhook processing
{ resource.service.name = "webhook-worker" && status = error }
# Find traces with DB queries >100ms
{ resource.service.name = "tvl-api" && span.db.statement != "" && duration > 100ms }
Dashboard Metrics from Traces
Request Performance:
- p50, p95, p99 latency
- Error rate (%)
- Requests/second
- Slowest endpoints
Database Performance:
- Query latency (p50, p95, p99)
- Slow queries (>100ms)
- Connection pool usage
External API Performance:
- Hostaway API latency
- Airbnb API latency
- Circuit breaker state
Consequences
Positive
- ✅ Full Visibility: Track requests across services
- ✅ Root Cause Analysis: Identify slow queries, API calls
- ✅ Vendor-Neutral: Not locked to Grafana (can switch to Datadog, Honeycomb)
- ✅ Auto-Instrumentation: No code changes for common frameworks
Negative
- ❌ Performance Overhead: 1-5ms latency per request
- ❌ Learning Curve: TraceQL, span attributes
- ❌ Cost: Grafana Cloud ($8-50/month for MVP)
Mitigations
- Use sampling (10% in production)
- Monitor trace volume (alert if >1M spans/day)
- Free tier sufficient for MVP.0-MVP.1 (<50k spans/month)
Validation Checklist
- OpenTelemetry SDK initialized
- Auto-instrumentation enabled (HTTP, DB, Redis)
- Custom spans for business logic
- Trace context propagation (HTTP → Worker)
- Sampling configured (10% production)
- Grafana Tempo integration
- Dashboard for request performance