
ADR-0044: Grafana Cloud for Unified Observability

Status

Accepted - 2025-01-26


Context

The TVL Platform needs a unified observability stack for metrics, traces, and logs, viewable through a single pane of glass.


Decision

Adopt Grafana Cloud with Mimir (metrics), Tempo (traces), and Loki (logs).

Rationale

  1. Unified Platform: Single dashboard for metrics, traces, logs
  2. Generous Free Tier: 50GB logs, 10k series, 50GB traces/month
  3. Vendor-Neutral: OpenTelemetry-based, easy to migrate
  4. Pre-Built Dashboards: Ready-made dashboards for Node.js, PostgreSQL, Redis
  5. Correlation: Jump from metric → trace → log

Alternatives Considered

Alternative 1: Datadog

Rejected - Expensive ($15/host/month), vendor lock-in

Alternative 2: AWS CloudWatch

Rejected - AWS lock-in, poor UX, expensive for logs

Alternative 3: Self-Hosted Grafana + Prometheus + Loki

Rejected - Operational overhead, no SLA


Free Tier Limits

| Component | Free Tier | Sufficient For |
|---|---|---|
| Metrics (Mimir) | 10,000 series | MVP.0-MVP.2 (500 properties) |
| Traces (Tempo) | 50GB/month | 100k spans/month (10% sampling) |
| Logs (Loki) | 50GB/month | 5M log lines/month |
| Dashboards | Unlimited | |
| Alerts | Unlimited | |
| Retention | 13 months | |

Cost: $0/month for MVP.0-MVP.1

Upgrade trigger: >1,000 properties (~V1.0)


Setup

1. Create Grafana Cloud Account

# Sign up (free tier)
https://grafana.com/products/cloud/

# Create stack: tvl-platform
# Region: US East (closest to Supabase/Railway)

2. Get Credentials

# Mimir (Metrics)
MIMIR_URL=https://prometheus-prod-us-east-0.grafana.net/api/prom/push
MIMIR_USERNAME=123456
MIMIR_API_KEY=glc_...

# Tempo (Traces)
TEMPO_URL=https://tempo-prod-us-east-0.grafana.net
TEMPO_USERNAME=123456
TEMPO_API_KEY=glc_...

# Loki (Logs)
LOKI_URL=https://logs-prod-us-east.grafana.net
LOKI_USERNAME=123456
LOKI_API_KEY=glc_...

3. Add to Doppler

# Add Grafana credentials to Doppler (staging + production)
doppler secrets set GRAFANA_MIMIR_URL="https://..." --project=tvl-platform --config=staging
doppler secrets set GRAFANA_MIMIR_USERNAME="123456" --project=tvl-platform --config=staging
doppler secrets set GRAFANA_MIMIR_API_KEY="glc_..." --project=tvl-platform --config=staging

# Repeat for TEMPO and LOKI
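
Once the secrets are in Doppler, the service can fail fast at startup if any of them is missing. The following is a minimal sketch, assuming the TEMPO and LOKI secrets follow the same `GRAFANA_<SIGNAL>_URL` / `_USERNAME` / `_API_KEY` naming as the Mimir ones; `loadGrafanaConfig` and its types are hypothetical helpers, not part of any library.

```typescript
// Hypothetical helper: validates Grafana Cloud credentials before telemetry starts.
type Signal = 'MIMIR' | 'TEMPO' | 'LOKI';

interface GrafanaCredentials {
  url: string;
  username: string;
  apiKey: string;
}

const SIGNALS: Signal[] = ['MIMIR', 'TEMPO', 'LOKI'];

function loadGrafanaConfig(
  env: Record<string, string | undefined>,
): Record<Signal, GrafanaCredentials> {
  const missing: string[] = [];

  // Read one variable, remembering its name if it is absent or empty
  const read = (name: string): string => {
    const value = env[name];
    if (!value) missing.push(name);
    return value ?? '';
  };

  const config = {} as Record<Signal, GrafanaCredentials>;
  for (const signal of SIGNALS) {
    config[signal] = {
      url: read(`GRAFANA_${signal}_URL`),
      username: read(`GRAFANA_${signal}_USERNAME`),
      apiKey: read(`GRAFANA_${signal}_API_KEY`),
    };
  }

  // Report every missing secret at once rather than failing on the first
  if (missing.length > 0) {
    throw new Error(`Missing Grafana secrets: ${missing.join(', ')}`);
  }
  return config;
}
```

In the application this would be called once as `loadGrafanaConfig(process.env)`, so a misconfigured Doppler environment surfaces at boot instead of as silently dropped telemetry.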

Component 1: Metrics (Mimir)

Prometheus Exporter

// src/telemetry/metrics.ts
import client from 'prom-client';

// Create registry
export const register = new client.Registry();

// Default metrics (CPU, memory, event loop)
client.collectDefaultMetrics({ register });

// Custom metrics
export const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

export const bookingsCreated = new client.Counter({
  name: 'bookings_created_total',
  help: 'Total bookings created',
  labelNames: ['org_id', 'channel'],
  registers: [register],
});

export const cacheHitRate = new client.Gauge({
  name: 'cache_hit_rate',
  help: 'Cache hit rate',
  registers: [register],
});
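
The `cache_hit_rate` gauge needs a value between 0 and 1 for the `< 0.7` alert threshold to make sense. A minimal sketch of how that ratio could be tracked follows; `CacheStats` is a hypothetical helper, and the choice to report nothing until the first lookup is an assumption made here so that an idle cache does not read as a 0% hit rate.

```typescript
// Hypothetical helper: tracks cache lookups and derives a hit ratio in [0, 1].
class CacheStats {
  private hits = 0;
  private misses = 0;

  recordHit(): void {
    this.hits += 1;
  }

  recordMiss(): void {
    this.misses += 1;
  }

  // Returns undefined until at least one lookup has been recorded
  hitRate(): number | undefined {
    const total = this.hits + this.misses;
    return total === 0 ? undefined : this.hits / total;
  }
}

const stats = new CacheStats();
stats.recordHit();
stats.recordHit();
stats.recordHit();
stats.recordMiss();
console.log(stats.hitRate()); // 0.75
```

The caller would then update the gauge only when a ratio is available, for example `const rate = stats.hitRate(); if (rate !== undefined) cacheHitRate.set(rate);`.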

Push to Mimir

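// src/telemetry/pushMetrics.ts
import { register } from './metrics';

// NOTE: /api/prom/push implements Prometheus remote write, which expects a
// snappy-compressed protobuf body. The text exposition format sent here must
// be converted first, or scraped and forwarded by an agent (e.g. Grafana Alloy).
async function pushMetrics(): Promise<void> {
  const url = process.env.GRAFANA_MIMIR_URL;
  if (!url) throw new Error('GRAFANA_MIMIR_URL is not set');

  const metrics = await register.metrics();
  const credentials = `${process.env.GRAFANA_MIMIR_USERNAME}:${process.env.GRAFANA_MIMIR_API_KEY}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Content-Type': 'text/plain',
      'Authorization': `Basic ${btoa(credentials)}`,
    },
    body: metrics,
  });
  if (!response.ok) throw new Error(`Mimir push failed: HTTP ${response.status}`);
}

// Push every 15 seconds; log failures instead of leaving the rejection unhandled
setInterval(() => {
  pushMetrics().catch((err) => console.error('Metrics push failed', err));
}, 15000);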
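
The push above authenticates with HTTP Basic auth built from the stack's username and API key, and the Tempo exporter uses the same header shape. A small shared helper keeps that encoding in one place; `basicAuthHeader` is a hypothetical name, and the sketch assumes ASCII-only credentials, since `btoa` rejects characters outside Latin-1.

```typescript
// Hypothetical helper: builds an HTTP Basic Authorization header value.
function basicAuthHeader(username: string, apiKey: string): string {
  // Basic auth is base64("<username>:<secret>") prefixed with the scheme name
  return `Basic ${btoa(`${username}:${apiKey}`)}`;
}

console.log(basicAuthHeader('user', 'pass')); // Basic dXNlcjpwYXNz
```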

Component 2: Traces (Tempo)

See ADR-0043 for OpenTelemetry setup

// Export traces to Grafana Tempo
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const traceExporter = new OTLPTraceExporter({
  url: `${process.env.GRAFANA_TEMPO_URL}/otlp/v1/traces`,
  headers: {
    'Authorization': `Basic ${btoa(`${process.env.GRAFANA_TEMPO_USERNAME}:${process.env.GRAFANA_TEMPO_API_KEY}`)}`,
  },
});

Component 3: Logs (Loki)

See ADR-0046 for Loki integration

// Push logs to Grafana Loki
import winston from 'winston';
import LokiTransport from 'winston-loki';

const logger = winston.createLogger({
  transports: [
    new LokiTransport({
      host: process.env.GRAFANA_LOKI_URL,
      basicAuth: `${process.env.GRAFANA_LOKI_USERNAME}:${process.env.GRAFANA_LOKI_API_KEY}`,
      labels: { service: 'tvl-api', environment: process.env.NODE_ENV },
    }),
  ],
});

Dashboards

1. Application Performance Dashboard

{
  "title": "TVL Platform - Application Performance",
  "panels": [
    {
      "title": "Request Latency (p50, p95, p99)",
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate (%)",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ]
    },
    {
      "title": "Requests/Second",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[1m]))"
        }
      ]
    }
  ]
}
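
The latency panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea in simplified form; it is not Prometheus's implementation, and `Bucket` and `histogramQuantile` are hypothetical names. It assumes the lowest bucket starts at 0 and returns NaN for a `q` outside (0, 1].

```typescript
interface Bucket {
  le: number;    // upper bound of the bucket; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0 || q <= 0 || q > 1) return NaN;

  const sorted = [...buckets].sort((a, b) => (a.le < b.le ? -1 : a.le > b.le ? 1 : 0));
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const upper = sorted[i];
  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const lowerCount = i === 0 ? 0 : sorted[i - 1].count;

  // Rank falls in the +Inf bucket: report the highest finite bound instead
  if (upper.le === Infinity) return i === 0 ? NaN : lowerBound;

  // Interpolate linearly between the bucket's lower and upper bounds
  return lowerBound + ((upper.le - lowerBound) * (rank - lowerCount)) / (upper.count - lowerCount);
}

const buckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.78 seconds
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; a p99 alert threshold of 1s is most reliable when a bucket boundary sits at or near 1s.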

2. Business Metrics Dashboard

{
  "title": "TVL Platform - Business Metrics",
  "panels": [
    {
      "title": "Bookings Created/Hour",
      "targets": [
        {
          "expr": "sum(rate(bookings_created_total[1h])) * 3600"
        }
      ]
    },
    {
      "title": "Revenue/Day",
      "targets": [
        {
          "expr": "sum(increase(booking_revenue_cents[1d])) / 100"
        }
      ]
    },
    {
      "title": "Active Properties",
      "targets": [
        {
          "expr": "count(property_active{status=\"active\"})"
        }
      ]
    }
  ]
}

Alerting Rules

# alerts.yaml (Grafana)
groups:
  - name: tvl-platform-alerts
    interval: 1m
    rules:
      # Critical: Error rate >5% for 5 minutes
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Critical: API latency p99 >1s
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API latency"

      # Warning: Cache hit rate <70%
      - alert: LowCacheHitRate
        expr: cache_hit_rate < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate low"

Notification channels:

  • Critical → PagerDuty + Slack
  • Warning → Slack only
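
The routing rule above can be expressed as a small lookup keyed on the alert's `severity` label. This is only an illustrative sketch: `channelsFor` and the channel names are hypothetical, and sending unknown severities to Slack is an assumption, not something this ADR specifies.

```typescript
type Channel = 'pagerduty' | 'slack';

// Severity label value -> channels that should be notified
const ROUTES: Record<string, Channel[]> = {
  critical: ['pagerduty', 'slack'],
  warning: ['slack'],
};

// Hypothetical helper: resolves notification channels from alert labels.
function channelsFor(labels: { severity?: string }): Channel[] {
  const route = labels.severity ? ROUTES[labels.severity] : undefined;
  // Assumption: alerts without a recognised severity still reach Slack
  return route ?? ['slack'];
}

console.log(channelsFor({ severity: 'critical' })); // [ 'pagerduty', 'slack' ]
```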

Log Queries (LogQL)

# Find all errors in last hour
{service="tvl-api", level="error"} | json | __error__=""

# Find slow requests (>1s)
{service="tvl-api"} | json | duration > 1000

# Find booking creation errors
{service="tvl-api"} | json | message="booking.created" | level="error"

# Count errors by org (assumes org_id is a field in the JSON log line, not a stream label)
sum by (org_id) (count_over_time({service="tvl-api", level="error"} | json [1h]))

Trace-to-Log Correlation

Jump from trace → logs:

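// Add trace ID to logs
import { trace } from '@opentelemetry/api';

const span = trace.getActiveSpan(); // undefined when no span is active
logger.info({
  message: 'Booking created',
  traceId: span?.spanContext().traceId, // OpenTelemetry trace ID
  bookingId: booking.id,
});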

Grafana UI:

  1. Click trace in Tempo
  2. Click "View Logs" button
  3. Loki query: {service="tvl-api"} | json | traceId="abc-123"
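
The Loki query in step 3 can also be generated programmatically, for example when linking to logs from an internal tool. A minimal sketch follows, assuming the `traceId` field name used in the logging example; `traceLogsQuery` and `escapeLogQL` are hypothetical helpers.

```typescript
// Escape backslashes and double quotes so the value is safe inside a LogQL string
function escapeLogQL(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

// Hypothetical helper: builds the LogQL query that finds all logs for one trace.
function traceLogsQuery(service: string, traceId: string): string {
  return `{service="${escapeLogQL(service)}"} | json | traceId="${escapeLogQL(traceId)}"`;
}

console.log(traceLogsQuery('tvl-api', 'abc-123'));
// {service="tvl-api"} | json | traceId="abc-123"
```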

Cost Projection

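| Phase | Properties | Log Lines/Mo | Metrics Series | Traces/Mo | Grafana Cost |
|---|---|---|---|---|---|
| MVP.0 | 200 | 2M | 500 | 50k | $0 (free) |
| MVP.1 | 1,000 | 10M | 2,000 | 200k | $0 (free) |
| MVP.2 | 5,000 | 50M | 10,000 | 500k | $0 (at limit) |
| V1.0 | 10,000 | 100M | 20,000 | 1M | $50/month |
| V2.0 | 50,000 | 500M | 100,000 | 5M | $250/month |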

Free tier sufficient until V1.0 (10,000 properties)


Consequences

Positive

  • Unified Platform: Single pane of glass
  • Generous Free Tier: $0 for MVP phases
  • Vendor-Neutral: OpenTelemetry-based
  • Pre-Built Dashboards: Node.js, PostgreSQL, Redis
  • Correlation: Metrics → Traces → Logs

Negative

  • Learning Curve: PromQL, LogQL, TraceQL
  • Free Tier Limits: Must upgrade at V1.0

Mitigations

  • Use pre-built dashboards (reduce PromQL learning)
  • Monitor usage monthly (alert at 80% of limits)
  • Plan upgrade budget for V1.0 ($50/month)
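
The monthly usage check could be automated with a comparison against the free tier limits listed earlier. The sketch below assumes usage figures are obtained separately (for example from the Grafana Cloud billing page); `Usage`, `FREE_TIER_LIMITS`, and `limitsNearCap` are hypothetical names.

```typescript
interface Usage {
  metricSeries: number; // active series
  logsGb: number;       // GB ingested this month
  tracesGb: number;     // GB ingested this month
}

// Free tier limits as stated in this ADR
const FREE_TIER_LIMITS: Usage = { metricSeries: 10000, logsGb: 50, tracesGb: 50 };

// Hypothetical helper: lists the dimensions at or above the warning threshold.
function limitsNearCap(usage: Usage, threshold = 0.8): (keyof Usage)[] {
  const keys = Object.keys(FREE_TIER_LIMITS) as (keyof Usage)[];
  return keys.filter((key) => usage[key] / FREE_TIER_LIMITS[key] >= threshold);
}

console.log(limitsNearCap({ metricSeries: 8500, logsGb: 12, tracesGb: 45 }));
// [ 'metricSeries', 'tracesGb' ]
```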

Validation Checklist

  • Grafana Cloud account created
  • Credentials added to Doppler (staging + production)
  • Metrics pushed to Mimir
  • Traces exported to Tempo
  • Logs exported to Loki
  • Application performance dashboard created
  • Business metrics dashboard created
  • Alert rules configured
  • Slack/PagerDuty notifications set up
