ADR-0044: Grafana Cloud for Unified Observability
Status
Accepted - 2025-01-26
Context
The TVL Platform needs a unified observability stack for metrics, traces, and logs, viewable through a single pane of glass.
Decision
Adopt Grafana Cloud, using Mimir for metrics, Tempo for traces, and Loki for logs.
Rationale
- Unified Platform: Single dashboard for metrics, traces, logs
- Generous Free Tier: 50GB logs, 10k series, 50GB traces/month
- Vendor-Neutral: OpenTelemetry-based, easy to migrate
- Pre-Built Dashboards: Ready-made dashboards for Node.js, PostgreSQL, Redis
- Correlation: Jump from metric → trace → log
Alternatives Considered
Alternative 1: Datadog
Rejected - Expensive ($15/host/month), vendor lock-in
Alternative 2: AWS CloudWatch
Rejected - AWS lock-in, poor UX, expensive for logs
Alternative 3: Self-Hosted Grafana + Prometheus + Loki
Rejected - Operational overhead, no SLA
Free Tier Limits
| Component | Free Tier | Sufficient For | 
|---|---|---|
| Metrics (Mimir) | 10,000 series | MVP.0-MVP.2 (5,000 properties, per Cost Projection) | 
| Traces (Tempo) | 50GB/month | 100k spans/month (10% sampling) | 
| Logs (Loki) | 50GB/month | 5M log lines/month | 
| Dashboards | Unlimited | ✓ | 
| Alerts | Unlimited | ✓ | 
| Retention | 13 months | ✓ | 
Cost: $0/month for MVP.0-MVP.1
Upgrade trigger: beyond MVP.2 scale (>5,000 properties, approaching V1.0)
Setup
1. Create Grafana Cloud Account
# Sign up (free tier)
https://grafana.com/products/cloud/
# Create stack: tvl-platform
# Region: US East (closest to Supabase/Railway)
2. Get Credentials
# Mimir (Metrics)
GRAFANA_MIMIR_URL=https://prometheus-prod-us-east-0.grafana.net/api/prom/push
GRAFANA_MIMIR_USERNAME=123456
GRAFANA_MIMIR_API_KEY=glc_...
# Tempo (Traces)
GRAFANA_TEMPO_URL=https://tempo-prod-us-east-0.grafana.net
GRAFANA_TEMPO_USERNAME=123456
GRAFANA_TEMPO_API_KEY=glc_...
# Loki (Logs)
GRAFANA_LOKI_URL=https://logs-prod-us-east.grafana.net
GRAFANA_LOKI_USERNAME=123456
GRAFANA_LOKI_API_KEY=glc_...
3. Add to Doppler
# Add Grafana credentials to Doppler (staging + production)
doppler secrets set GRAFANA_MIMIR_URL="https://..." --project=tvl-platform --config=staging
doppler secrets set GRAFANA_MIMIR_USERNAME="123456" --project=tvl-platform --config=staging
doppler secrets set GRAFANA_MIMIR_API_KEY="glc_..." --project=tvl-platform --config=staging
# Repeat for TEMPO and LOKI
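Because the application reads these secrets from the environment at runtime, a missing value would otherwise show up only as a failed push. The following is a minimal sketch of a fail-fast loader, not part of the decision itself. The names `loadGrafanaConfig`, `GrafanaComponent`, and `ComponentCredentials` are hypothetical, and the sketch assumes the `GRAFANA_<COMPONENT>_<FIELD>` naming used in the Doppler commands is repeated for Tempo and Loki.

```typescript
// Hypothetical helper: checks that every Grafana Cloud secret is present.
type GrafanaComponent = 'MIMIR' | 'TEMPO' | 'LOKI';

interface ComponentCredentials {
  url: string;
  username: string;
  apiKey: string;
}

type GrafanaConfig = Record<GrafanaComponent, ComponentCredentials>;

const GRAFANA_COMPONENTS: GrafanaComponent[] = ['MIMIR', 'TEMPO', 'LOKI'];

function loadGrafanaConfig(
  env: Record<string, string | undefined>,
): GrafanaConfig {
  const missing: string[] = [];

  // Record the variable name when it is unset or empty.
  const read = (variable: string): string => {
    const value = env[variable];
    if (!value) missing.push(variable);
    return value ?? '';
  };

  const config = {} as GrafanaConfig;
  for (const component of GRAFANA_COMPONENTS) {
    config[component] = {
      url: read(`GRAFANA_${component}_URL`),
      username: read(`GRAFANA_${component}_USERNAME`),
      apiKey: read(`GRAFANA_${component}_API_KEY`),
    };
  }

  // Report all missing secrets at once rather than one per restart.
  if (missing.length > 0) {
    throw new Error(`Missing Grafana secrets: ${missing.join(', ')}`);
  }
  return config;
}
```

At startup this would typically be called once as `loadGrafanaConfig(process.env)`, so that a misconfigured Doppler environment stops the service before it begins serving traffic.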
Component 1: Metrics (Mimir)
Prometheus Exporter
// src/telemetry/metrics.ts
import client from 'prom-client';
// Create registry
export const register = new client.Registry();
// Default metrics (CPU, memory, event loop)
client.collectDefaultMetrics({ register });
// Custom metrics
export const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});
export const bookingsCreated = new client.Counter({
  name: 'bookings_created_total',
  help: 'Total bookings created',
  labelNames: ['org_id', 'channel'],
  registers: [register],
});
export const cacheHitRate = new client.Gauge({
  name: 'cache_hit_rate',
  help: 'Cache hit rate',
  registers: [register],
});
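Each distinct label combination becomes its own series, and a histogram emits one series per bucket plus `_sum` and `_count`, so the label choices above count directly against the 10,000-series free tier. The sketch below is a rough upper-bound estimate under stated assumptions: the function `estimateSeries` is hypothetical, every label cardinality is an invented illustrative number, and the histogram is assumed to use prom-client's default buckets (11 finite bounds plus `+Inf`), since no `buckets` option is set.

```typescript
// Rough series estimate for one metric, given how many distinct values each
// label is expected to take. All cardinalities below are hypothetical.
type MetricKind = 'counter' | 'gauge' | 'histogram';

// prom-client's default histogram has 11 finite buckets; "+Inf", "_sum" and
// "_count" add three more series per label combination.
const DEFAULT_FINITE_BUCKETS = 11;

function estimateSeries(
  kind: MetricKind,
  labelCardinalities: number[],
  finiteBuckets: number = DEFAULT_FINITE_BUCKETS,
): number {
  const combinations = labelCardinalities.reduce((acc, n) => acc * n, 1);
  const seriesPerCombination = kind === 'histogram' ? finiteBuckets + 3 : 1;
  return combinations * seriesPerCombination;
}

// http_request_duration_seconds: method x route x status (assumed 4 x 25 x 6)
const httpSeries = estimateSeries('histogram', [4, 25, 6]);
// bookings_created_total: org_id x channel (assumed 200 orgs x 3 channels)
const bookingSeries = estimateSeries('counter', [200, 3]);

console.log({ httpSeries, bookingSeries, total: httpSeries + bookingSeries });
```

This is an upper bound, since not every combination occurs in practice. Under these assumed numbers the HTTP histogram dominates the budget, which suggests labelling `route` with the route template rather than the raw path and watching how `org_id` grows with the number of organisations.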
Push to Mimir
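// src/telemetry/pushMetrics.ts
import { register } from './metrics';
// NOTE: this sends the Prometheus text exposition format. Prometheus remote-write
// endpoints (such as the /api/prom/push URL above) expect snappy-compressed
// protobuf, so the body may need a remote-write client or a collector in front.
async function pushMetrics(): Promise<void> {
  const { GRAFANA_MIMIR_URL, GRAFANA_MIMIR_USERNAME, GRAFANA_MIMIR_API_KEY } = process.env;
  if (!GRAFANA_MIMIR_URL || !GRAFANA_MIMIR_USERNAME || !GRAFANA_MIMIR_API_KEY) {
    throw new Error('Grafana Mimir credentials are not configured');
  }
  const metrics = await register.metrics();
  const response = await fetch(GRAFANA_MIMIR_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'text/plain',
      'Authorization': `Basic ${btoa(`${GRAFANA_MIMIR_USERNAME}:${GRAFANA_MIMIR_API_KEY}`)}`,
    },
    body: metrics,
  });
  // fetch does not reject on HTTP errors, so check the status explicitly
  if (!response.ok) {
    throw new Error(`Mimir push failed: ${response.status} ${response.statusText}`);
  }
}
// Push every 15 seconds; log failures instead of leaving the rejection unhandled
setInterval(() => {
  pushMetrics().catch((err) => console.error('Metrics push failed', err));
}, 15000);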
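Both the Mimir push above and the Tempo exporter below build the same HTTP Basic `Authorization` value inline. As a small sketch, that construction could be factored into one helper. The name `basicAuthHeader` is hypothetical, and the sketch assumes the username and API key are plain ASCII, since `btoa` only accepts Latin-1 input.

```typescript
// Hypothetical helper: builds an HTTP Basic Authorization header value from
// a Grafana Cloud instance ID (username) and API key.
function basicAuthHeader(username: string, apiKey: string): string {
  // Basic auth is base64("username:password") prefixed with the scheme name.
  return `Basic ${btoa(`${username}:${apiKey}`)}`;
}

// Example with placeholder credentials (not real values):
console.log(basicAuthHeader('user', 'pass'));
```

Keeping the encoding in one place means a later change, such as moving from `btoa` to `Buffer`, touches a single function rather than each exporter.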
Component 2: Traces (Tempo)
See ADR-0043 for OpenTelemetry setup
// Export traces to Grafana Tempo (OTLP over HTTP)
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const traceExporter = new OTLPTraceExporter({
  url: `${process.env.GRAFANA_TEMPO_URL}/otlp/v1/traces`,
  headers: {
    'Authorization': `Basic ${btoa(`${process.env.GRAFANA_TEMPO_USERNAME}:${process.env.GRAFANA_TEMPO_API_KEY}`)}`,
  },
});
Component 3: Logs (Loki)
See ADR-0046 for Loki integration
// Push logs to Grafana Loki
import winston from 'winston';
import LokiTransport from 'winston-loki';
const logger = winston.createLogger({
  transports: [
    new LokiTransport({
      host: process.env.GRAFANA_LOKI_URL,
      basicAuth: `${process.env.GRAFANA_LOKI_USERNAME}:${process.env.GRAFANA_LOKI_API_KEY}`,
      labels: { service: 'tvl-api', environment: process.env.NODE_ENV },
    }),
  ],
});
Dashboards
1. Application Performance Dashboard
{
  "title": "TVL Platform - Application Performance",
  "panels": [
    {
      "title": "Request Latency (p50, p95, p99)",
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Error Rate (%)",
      "targets": [
        {
          "expr": "sum(rate(http_request_duration_seconds_count{status=~\"5..\"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) * 100"
        }
      ]
    },
    {
      "title": "Requests/Second",
      "targets": [
        {
          "expr": "sum(rate(http_request_duration_seconds_count[1m]))"
        }
      ]
    }
  ]
}
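To make the p50/p95/p99 panel easier to interpret, the sketch below shows how a quantile can be estimated from cumulative histogram buckets by linear interpolation, in the spirit of PromQL's `histogram_quantile`. It is a simplified illustration rather than Prometheus's actual implementation, the names `HistogramBucket` and `estimateQuantile` are hypothetical, and the bucket counts are invented. The counts are assumed to be per-window values, which is what applying `rate()` to the `_bucket` series provides.

```typescript
interface HistogramBucket {
  le: number;    // upper bound in seconds; Infinity for the "+Inf" bucket
  count: number; // cumulative count of observations <= le
}

function estimateQuantile(q: number, buckets: HistogramBucket[]): number {
  const sorted = [...buckets].sort((a, b) => (a.le < b.le ? -1 : a.le > b.le ? 1 : 0));
  const total = sorted[sorted.length - 1]?.count ?? 0;
  if (total === 0 || q < 0 || q > 1) return NaN;

  // Position of the requested quantile among all observations.
  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);
  const bucket = sorted[idx];

  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (bucket.le === Infinity) {
    return idx > 0 ? sorted[idx - 1].le : NaN;
  }

  const lowerBound = idx > 0 ? sorted[idx - 1].le : 0;
  const lowerCount = idx > 0 ? sorted[idx - 1].count : 0;
  const inBucket = bucket.count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Invented example: 100 requests, 50 under 100 ms, 90 under 500 ms, 99 under 1 s.
const exampleBuckets: HistogramBucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, exampleBuckets));
```

One consequence of the interpolation is that the reported percentile can never be more precise than the bucket boundaries, so a p99 alert threshold of 1 second works best when 1 is itself a bucket bound.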
2. Business Metrics Dashboard
{
  "title": "TVL Platform - Business Metrics",
  "panels": [
    {
      "title": "Bookings Created/Hour",
      "targets": [
        {
          "expr": "sum(rate(bookings_created_total[1h])) * 3600"
        }
      ]
    },
    {
      "title": "Revenue/Day",
      "targets": [
        {
          "expr": "sum(increase(booking_revenue_cents[1d])) / 100"
        }
      ]
    },
    {
      "title": "Active Properties",
      "targets": [
        {
          "expr": "count(property_active{status=\"active\"})"
        }
      ]
    }
  ]
}
Alerting Rules
# alerts.yaml (Grafana)
groups:
  - name: tvl-platform-alerts
    interval: 1m
    rules:
      # Critical: Error rate >5% for 5 minutes
      - alert: HighErrorRate
        expr: sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      # Critical: API latency p99 >1s
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API latency"
      # Warning: Cache hit rate <70%
      - alert: LowCacheHitRate
        expr: cache_hit_rate < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate low"
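The `for:` clause in these rules means an alert fires only after its expression has stayed true across every evaluation for the whole duration, and a single evaluation below the threshold resets the timer. The sketch below models that behaviour in a simplified way. It is an illustration of the semantics, not the rule engine's code. The names `AlertSample` and `isFiring` are hypothetical, and the samples are assumed to be ordered by time, one per evaluation interval.

```typescript
interface AlertSample {
  timestampMs: number; // evaluation time
  value: number;       // result of the alert expression at that time
}

// Returns true when `value > threshold` has held continuously for at least
// `forMs`, up to and including the latest sample.
function isFiring(samples: AlertSample[], threshold: number, forMs: number): boolean {
  let pendingSince: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      // Keep the time the condition first became true.
      pendingSince = pendingSince ?? sample.timestampMs;
    } else {
      // Condition cleared: the "for" timer resets.
      pendingSince = null;
    }
  }
  if (pendingSince === null) return false;
  const latest = samples[samples.length - 1].timestampMs;
  return latest - pendingSince >= forMs;
}

// Error ratio of 6% at each one-minute evaluation, checked against 5% for 5m.
const oneMinuteMs = 60000;
const steadyErrors: AlertSample[] = [0, 1, 2, 3, 4, 5].map((i) => ({
  timestampMs: i * oneMinuteMs,
  value: 0.06,
}));
console.log(isFiring(steadyErrors, 0.05, 5 * oneMinuteMs));
```

This is why a brief error spike does not page anyone under `HighErrorRate`: the ratio has to stay above 5% for five consecutive minutes.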
Notification channels:
- Critical → PagerDuty + Slack
- Warning → Slack only
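In Grafana this routing is configured through notification policies rather than application code, but the mapping itself is small enough to state precisely. The sketch below models it as a lookup on the `severity` label, which can be handy in a test that checks the rules file against the intended routing. The names `channelsFor`, `AlertSeverity`, and `NotifyChannel` are hypothetical, and the Slack fallback for alerts without a recognised severity is an assumption, not something this ADR specifies.

```typescript
type AlertSeverity = 'critical' | 'warning';
type NotifyChannel = 'pagerduty' | 'slack';

// Mirrors the routing listed above.
const SEVERITY_ROUTES: Record<AlertSeverity, NotifyChannel[]> = {
  critical: ['pagerduty', 'slack'],
  warning: ['slack'],
};

function channelsFor(labels: Record<string, string>): NotifyChannel[] {
  const severity = labels['severity'];
  if (severity === 'critical' || severity === 'warning') {
    return SEVERITY_ROUTES[severity];
  }
  // Assumption: alerts without a recognised severity still reach Slack.
  return ['slack'];
}

console.log(channelsFor({ alertname: 'HighErrorRate', severity: 'critical' }));
```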
Log Queries (LogQL)
# Find all errors in last hour
{service="tvl-api", level="error"} | json | __error__=""
# Find slow requests (>1s)
{service="tvl-api"} | json | duration > 1000
# Find booking creation errors
{service="tvl-api"} | json | message="booking.created" | level="error"
# Count errors by org
sum by (org_id) (count_over_time({service="tvl-api", level="error"} | json [1h]))
Trace-to-Log Correlation
Jump from trace → logs:
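// Add trace ID to logs
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan(); // undefined when no span is active
logger.info({
  message: 'Booking created',
  traceId: span?.spanContext().traceId, // OpenTelemetry trace ID
  bookingId: booking.id,
});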
Grafana UI:
- Click trace in Tempo
- Click "View Logs" button
- Loki query: {service="tvl-api"} | json | traceId="abc-123"
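The Loki query in the last step can also be built programmatically, for example to generate a deep link from an error report. The sketch below is one way to do that under stated assumptions: the function name `lokiQueryForTrace` is hypothetical, and it assumes logs carry the trace ID in a JSON field named `traceId`, as in the logging snippet above. Note that real OpenTelemetry trace IDs are 32 lowercase hex characters, so the placeholder `abc-123` would be rejected by this validation.

```typescript
// OpenTelemetry trace IDs are 16 bytes, rendered as 32 lowercase hex chars.
const TRACE_ID_PATTERN = /^[0-9a-f]{32}$/;

function lokiQueryForTrace(service: string, traceId: string): string {
  if (!TRACE_ID_PATTERN.test(traceId)) {
    throw new Error(`Not a valid OpenTelemetry trace ID: ${traceId}`);
  }
  // Escape backslashes and quotes so the service name cannot break the selector.
  const safeService = service.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `{service="${safeService}"} | json | traceId="${traceId}"`;
}

// Example with an arbitrary, well-formed trace ID:
console.log(lokiQueryForTrace('tvl-api', '0af7651916cd43dd8448eb211c80319c'));
```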
Cost Projection
| Phase | Properties | Log Lines/Mo | Metrics Series | Traces/Mo | Grafana Cost | 
|---|---|---|---|---|---|
| MVP.0 | 200 | 2M | 500 | 50k | $0 (free) | 
| MVP.1 | 1,000 | 10M | 2,000 | 200k | $0 (free) | 
| MVP.2 | 5,000 | 50M | 10,000 | 500k | $0 (at limit) | 
| V1.0 | 10,000 | 100M | 20,000 | 1M | $50/month | 
| V2.0 | 50,000 | 500M | 100,000 | 5M | $250/month | 
Free tier sufficient until V1.0 (10,000 properties)
Consequences
Positive
- ✅ Unified Platform: Single pane of glass
- ✅ Generous Free Tier: $0 for MVP phases
- ✅ Vendor-Neutral: OpenTelemetry-based
- ✅ Pre-Built Dashboards: Node.js, PostgreSQL, Redis
- ✅ Correlation: Metrics → Traces → Logs
Negative
- ❌ Learning Curve: PromQL, LogQL, TraceQL
- ❌ Free Tier Limits: Must upgrade at V1.0
Mitigations
- Use pre-built dashboards (reduce PromQL learning)
- Monitor usage monthly (alert at 80% of limits)
- Plan upgrade budget for V1.0 ($50/month)
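The monthly usage check with an 80% threshold can be expressed as a small comparison against the free-tier limits listed earlier. The sketch below assumes the usage figures are obtained separately (for example, read off the Grafana Cloud usage page) and only shows the threshold logic. The names `UsageSnapshot` and `limitsNearExhaustion` are hypothetical, and the sample usage numbers are invented.

```typescript
interface UsageSnapshot {
  activeSeries: number; // Mimir active series
  logsGb: number;       // Loki ingestion this month, in GB
  tracesGb: number;     // Tempo ingestion this month, in GB
}

// Free-tier limits as stated in this ADR.
const FREE_TIER_LIMITS: UsageSnapshot = {
  activeSeries: 10000,
  logsGb: 50,
  tracesGb: 50,
};

// Returns the names of all limits at or above the warning ratio.
function limitsNearExhaustion(
  usage: UsageSnapshot,
  limits: UsageSnapshot = FREE_TIER_LIMITS,
  warnRatio: number = 0.8,
): string[] {
  return (Object.keys(limits) as (keyof UsageSnapshot)[]).filter(
    (key) => usage[key] / limits[key] >= warnRatio,
  );
}

// Invented usage figures for illustration.
console.log(limitsNearExhaustion({ activeSeries: 8000, logsGb: 10, tracesGb: 45 }));
```

A non-empty result would be the signal to start the upgrade conversation before ingestion is throttled or dropped.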
Validation Checklist
- Grafana Cloud account created
- Credentials added to Doppler (staging + production)
- Metrics pushed to Mimir
- Traces exported to Tempo
- Logs exported to Loki
- Application performance dashboard created
- Business metrics dashboard created
- Alert rules configured
- Slack/PagerDuty notifications set up