TVL Platform - Deployment Runbook

Summary

This runbook provides step-by-step deployment procedures for the TVL platform, including pre-deployment checklists, deployment steps for each component, health verification, rollback procedures, and post-deployment validation.

Pre-Deployment Checklist

1. Code Review & Approval

  • All PRs merged to main branch
  • Code review completed by at least 1 senior engineer
  • CI/CD pipeline passing (all tests green)
  • Security scan completed (no high/critical vulnerabilities)
  • Performance regression tests passed
  • Database migration reviewed and approved

2. Environment Validation

Staging Environment

  • Staging deployment successful
  • Manual QA testing completed
  • Integration tests passing
  • External API mocks verified (Hostaway, Stripe test mode)
  • No critical errors in Sentry (staging)
  • Load test completed (if applicable)

Production Readiness

3. Change Management

  • Deployment window scheduled (prefer low-traffic periods)
  • Deployment notification sent to #engineering-deploys Slack channel
  • On-call engineer identified and available
  • Rollback plan documented and reviewed
  • Customer-facing changes documented for support team
  • Maintenance window announced (if downtime expected)

4. Backup & Failsafe

  • Database backup triggered manually: supabase db backup create
  • Current production deployment tagged in Git: git tag prod-YYYYMMDD-HHMM (see the helper sketch after this list)
  • Rollback deployment IDs noted:
    • Vercel: <deployment-id>
    • Railway: <deployment-id>
  • Redis snapshot created (Upstash dashboard)
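
The tagging and backup steps above can be scripted. A minimal helper sketch (the backup command and tag format come from the checklist items; everything else is illustrative):

#!/bin/bash
# Hypothetical pre-deploy snapshot helper; adjust to your tooling.
set -e

STAMP=$(date +%Y%m%d-%H%M)

# Tag the currently deployed commit so rollback has a known-good ref
git tag "prod-$STAMP"
git push origin "prod-$STAMP"

# Trigger a database backup (command as listed in the checklist above)
supabase db backup create

echo "Snapshot complete: prod-$STAMP"
echo "Remember to record the current Vercel and Railway deployment IDs."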

Deployment Overview

Deployment Architecture

Deployment Order

  1. Database Migrations (manual approval required)
  2. Backend API (Railway/Fly with health check gate)
  3. Worker Processes (parallel with backend)
  4. Frontend (Vercel automatic)
  5. Health Check Verification
  6. Traffic Switchover (automatic for zero-downtime)
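
Expressed as a minimal orchestration sketch (command names mirror the examples later in this runbook; the PROD_PROJECT_REF variable and manual migration gate are assumptions):

#!/bin/bash
set -e

# 1. Database migrations (manual approval required before running this)
supabase db push --project-ref "$PROD_PROJECT_REF"

# 2-3. Backend API and workers (Railway starts both from the Procfile)
railway up --environment production
./scripts/health-check.sh   # health check gate

# 4. Frontend (normally automatic on merge to main; shown for completeness)
(cd frontend && vercel --prod)

# 5. Full health verification; traffic switchover is automatic
./scripts/verify-deployment.sh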

Deployment Timing

  • Duration: ~15-20 minutes (no downtime)
  • Recommended Window:
    • Weekdays: 10am-2pm EST (low traffic)
    • Avoid: Friday 4pm+, weekends, holidays
    • Emergency hotfix: Anytime with approval

Frontend Deployment (Vercel)

Automatic Deployment

Frontend deployments are automatic on merge to main:

# Triggered by GitHub Actions
# .github/workflows/deploy-production.yml
- name: Deploy Frontend to Vercel
  run: vercel deploy --prod --token=${{ secrets.VERCEL_TOKEN }}

Manual Deployment (Emergency)

If automatic deployment fails:

# 1. Install Vercel CLI
npm install -g vercel

# 2. Login to Vercel
vercel login

# 3. Deploy to production
cd frontend
vercel --prod

# 4. Verify deployment
curl -f https://app.tvl.com/health || echo "Frontend health check failed"

Vercel Deployment Configuration

Environment: Production
Domain: app.tvl.com
Build Command: npm run build
Output Directory: dist
Node Version: 20.x

Frontend Health Check

# Check deployment status
curl -I https://app.tvl.com

# Expected response:
# HTTP/2 200
# x-vercel-id: <deployment-id>
# cache-control: public, max-age=0, must-revalidate

# Check JavaScript bundle size (curl does not expand shell globs in URLs;
# resolve the hashed bundle filename from the page HTML first)
ASSET=$(curl -s https://app.tvl.com/ | grep -o '/assets/index-[^"]*\.js' | head -1)
curl -fs "https://app.tvl.com$ASSET" | wc -c
# Expected: > 50KB

Rollback Frontend

# Option 1: Rollback via Vercel dashboard
# 1. Go to https://vercel.com/tvl/tvl-frontend/deployments
# 2. Select previous working deployment
# 3. Click "Promote to Production"

# Option 2: Rollback via CLI
vercel rollback --token=$VERCEL_TOKEN

Backend Deployment (Railway/Fly)

Automatic Deployment

Backend deployments are automatic on merge to main with health check gates:

# .github/workflows/deploy-production.yml
deploy-backend:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to Railway
      run: railway up --environment production

    - name: Wait for deployment
      run: railway status --wait

    - name: Health check
      run: |
        for i in {1..30}; do
          if curl -f https://api.tvl.com/health; then
            echo "Health check passed"
            exit 0
          fi
          echo "Waiting for health check... ($i/30)"
          sleep 10
        done
        echo "Health check failed after 5 minutes"
        exit 1

Manual Deployment (Emergency)

# 1. Install Railway CLI
npm install -g @railway/cli

# 2. Login to Railway
railway login

# 3. Link to production project
railway link tvl-production

# 4. Deploy backend
cd backend
railway up --environment production

# 5. Monitor deployment logs
railway logs --follow

# 6. Verify health check
curl -f https://api.tvl.com/health

Backend Health Check Endpoint

Endpoint: GET /health

Expected Response:

{
  "status": "healthy",
  "timestamp": "2025-10-24T10:00:00Z",
  "version": "1.2.3",
  "uptime": 3600,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "hostaway": "healthy"
  }
}
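
To confirm the build you just shipped is actually serving, compare the reported version against the release you expect. A quick sketch, assuming the payload above and jq installed:

DEPLOYED=$(curl -s https://api.tvl.com/health | jq -r '.version')
EXPECTED="1.2.3"   # the version you just released (example value from above)
if [ "$DEPLOYED" = "$EXPECTED" ]; then
  echo "✓ Version $DEPLOYED is live"
else
  echo "✗ Expected $EXPECTED but production reports $DEPLOYED"
fi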

Health Check Script:

#!/bin/bash
# scripts/health-check.sh

API_URL="https://api.tvl.com"
MAX_ATTEMPTS=30
WAIT_SECONDS=10

for i in $(seq 1 $MAX_ATTEMPTS); do
  echo "Health check attempt $i/$MAX_ATTEMPTS..."

  RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")

  if [ "$RESPONSE" -eq 200 ]; then
    echo "✓ Health check passed"

    # Verify all dependencies
    HEALTH_JSON=$(curl -s "$API_URL/health")
    DB_STATUS=$(echo "$HEALTH_JSON" | jq -r '.checks.database')
    REDIS_STATUS=$(echo "$HEALTH_JSON" | jq -r '.checks.redis')

    if [ "$DB_STATUS" = "healthy" ] && [ "$REDIS_STATUS" = "healthy" ]; then
      echo "✓ All dependencies healthy"
      exit 0
    else
      echo "✗ Dependencies unhealthy: DB=$DB_STATUS, Redis=$REDIS_STATUS"
      exit 1
    fi
  fi

  echo "Waiting $WAIT_SECONDS seconds..."
  sleep $WAIT_SECONDS
done

echo "✗ Health check failed after $((MAX_ATTEMPTS * WAIT_SECONDS)) seconds"
exit 1
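
To run it manually after a backend deploy:

chmod +x scripts/health-check.sh
./scripts/health-check.sh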

Rollback Backend

# Option 1: Rollback via Railway dashboard
# 1. Go to https://railway.app/project/tvl-production/deployments
# 2. Select previous working deployment
# 3. Click "Redeploy"

# Option 2: Rollback via CLI
railway redeploy --deployment-id <previous-deployment-id>

# Option 3: Rollback via Git
# (HEAD revert only undoes the most recent commit; for a multi-commit deploy,
# revert the merge commit instead: git revert -m 1 <merge-sha>)
git revert HEAD
git push origin main
# (triggers automatic redeployment)

Database Migrations

Migration Strategy

Approach: Forward-only migrations with manual approval

Migration Tool: Supabase CLI

Pre-Migration Checklist

  • Migration SQL reviewed by DBA or senior engineer
  • Migration tested in staging environment
  • Backup completed: supabase db backup create
  • Migration is backwards-compatible (no breaking changes)
  • RLS policies updated if schema changes affect access control
  • Indexes created for new columns (if applicable)

Migration Execution

# 1. Create migration locally (already done in development)
# supabase migration new add_pricing_snapshot_to_bookings

# 2. Review migration file
cat supabase/migrations/20251024_add_pricing_snapshot_to_bookings.sql

# 3. Apply migration to staging (automatic via CI)
# Already applied when merged to staging branch

# 4. Apply migration to production (MANUAL)
supabase db push --project-ref <production-project-id>

# 5. Verify migration applied
supabase db dump --project-ref <production-project-id> --schema public | grep pricing_snapshot

# Expected: Column exists in bookings table

Migration Rollback

Important: Supabase does not support automatic rollback. Create a compensating migration instead.

-- Example: Rollback adding column
-- supabase migration new rollback_pricing_snapshot

BEGIN;

-- Remove column (if safe to do so)
ALTER TABLE bookings DROP COLUMN IF EXISTS pricing_snapshot;

-- Restore previous state
-- (Add any necessary data restoration logic)

COMMIT;

Common Migration Patterns

Adding Column:

-- Safe: Allows NULL or has default
ALTER TABLE bookings
ADD COLUMN pricing_snapshot JSONB DEFAULT '{}';

-- Create index if needed
CREATE INDEX idx_bookings_pricing_snapshot
ON bookings USING gin(pricing_snapshot);

Adding Table:

CREATE TABLE payment_retries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id UUID NOT NULL REFERENCES payments(id),
  attempt_number INT NOT NULL,
  error_message TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Enable RLS
ALTER TABLE payment_retries ENABLE ROW LEVEL SECURITY;

-- Add RLS policy
CREATE POLICY "Users access payment_retries in their org"
ON payment_retries
FOR ALL
USING (
  EXISTS (
    SELECT 1 FROM payments p
    WHERE p.id = payment_retries.payment_id
      AND p.org_id = current_setting('app.current_org_id')::uuid
  )
);

Modifying Column:

-- Safe: Expanding constraint (varchar(50) -> varchar(100))
ALTER TABLE properties
ALTER COLUMN name TYPE VARCHAR(100);

-- Unsafe: Narrowing constraint (requires validation)
-- DO NOT DO THIS without data validation:
-- ALTER TABLE properties ALTER COLUMN name TYPE VARCHAR(10);
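
If a narrowing change is ever unavoidable, validate existing data first. A hedged pre-flight sketch, assuming direct psql access via a DATABASE_URL variable:

# Count rows that would violate VARCHAR(10) before attempting the change;
# proceed only if this returns 0
psql "$DATABASE_URL" -c \
  "SELECT count(*) FROM properties WHERE length(name) > 10;"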

Worker Deployment

Workers are deployed alongside the backend but run as separate processes.

Worker Types

  1. Sync Worker - Hostaway synchronization
  2. Payment Worker - Payment processing
  3. Notification Worker - Email/SMS notifications

Deployment Command

# Railway automatically starts workers based on Procfile
# Procfile:
# web: node dist/server.js
# sync-worker: node dist/workers/sync-worker.js
# payment-worker: node dist/workers/payment-worker.js
# notification-worker: node dist/workers/notification-worker.js

# Verify workers running
railway ps

# Expected output:
# NAME                 STATUS   REPLICAS
# web                  running  2
# sync-worker          running  2
# payment-worker       running  1
# notification-worker  running  1

Worker Health Check

# Check worker status via Redis
redis-cli -u $REDIS_URL

# Check active workers
> LLEN bull:sync-worker:active
> LLEN bull:payment-worker:active
> LLEN bull:notification-worker:active

# Check failed jobs
> LLEN bull:sync-worker:failed
> LLEN bull:payment-worker:failed

# Expected: failed = 0; active varies with load (may be 0 when queues are idle)
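
The per-queue checks above can be looped for a one-shot summary (assumes the Bull key naming shown above and a REDIS_URL variable):

#!/bin/bash
# Summarize queue depth for each worker; flag any failed jobs
for q in sync-worker payment-worker notification-worker; do
  ACTIVE=$(redis-cli -u "$REDIS_URL" LLEN "bull:$q:active")
  FAILED=$(redis-cli -u "$REDIS_URL" LLEN "bull:$q:failed")
  echo "$q: active=$ACTIVE failed=$FAILED"
  if [ "$FAILED" -gt 0 ]; then
    echo "⚠ $q has failed jobs - inspect before restarting"
  fi
done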

Restart Workers

# Restart specific worker
railway restart sync-worker

# Restart all workers
railway restart

Health Checks

Comprehensive Health Check Script

#!/bin/bash
# scripts/verify-deployment.sh

set -e

API_URL="https://api.tvl.com"
APP_URL="https://app.tvl.com"

echo "=== TVL Platform Deployment Verification ==="
echo ""

# 1. Frontend Health
echo "1. Checking Frontend..."
# `|| true` keeps set -e from aborting silently when a host is unreachable
FRONTEND_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL" || true)
if [ "$FRONTEND_STATUS" -eq 200 ]; then
  echo "✓ Frontend: OK"
else
  echo "✗ Frontend: FAILED (HTTP $FRONTEND_STATUS)"
  exit 1
fi

# 2. Backend API Health
echo "2. Checking Backend API..."
API_RESPONSE=$(curl -s "$API_URL/health" || true)
API_STATUS=$(echo "$API_RESPONSE" | jq -r '.status')
if [ "$API_STATUS" = "healthy" ]; then
  echo "✓ Backend API: OK"
else
  echo "✗ Backend API: FAILED"
  echo "Response: $API_RESPONSE"
  exit 1
fi

# 3. Database Connection
echo "3. Checking Database..."
DB_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.database')
if [ "$DB_STATUS" = "healthy" ]; then
  echo "✓ Database: OK"
else
  echo "✗ Database: FAILED"
  exit 1
fi

# 4. Redis Connection
echo "4. Checking Redis..."
REDIS_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.redis')
if [ "$REDIS_STATUS" = "healthy" ]; then
  echo "✓ Redis: OK"
else
  echo "✗ Redis: FAILED"
  exit 1
fi

# 5. External API Connectivity
echo "5. Checking External APIs..."
STRIPE_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.stripe')
HOSTAWAY_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.hostaway')

if [ "$STRIPE_STATUS" = "healthy" ]; then
  echo "✓ Stripe: OK"
else
  echo "⚠ Stripe: $STRIPE_STATUS"
fi

if [ "$HOSTAWAY_STATUS" = "healthy" ]; then
  echo "✓ Hostaway: OK"
else
  echo "⚠ Hostaway: $HOSTAWAY_STATUS"
fi

# 6. Critical Endpoints
echo "6. Checking Critical Endpoints..."

# Auth endpoint
AUTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/api/v1/auth/me" -H "Authorization: Bearer invalid" || true)
if [ "$AUTH_STATUS" -eq 401 ]; then
  echo "✓ Auth endpoint: OK (returns 401 as expected)"
else
  echo "✗ Auth endpoint: FAILED (expected 401, got $AUTH_STATUS)"
  exit 1
fi

# Properties endpoint (requires auth)
PROPERTIES_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/api/v1/properties" || true)
if [ "$PROPERTIES_STATUS" -eq 401 ]; then
  echo "✓ Properties endpoint: OK (auth required)"
else
  echo "✗ Properties endpoint: FAILED (expected 401, got $PROPERTIES_STATUS)"
fi

# 7. Metrics Endpoint
echo "7. Checking Metrics..."
METRICS_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/metrics" || true)
if [ "$METRICS_STATUS" -eq 200 ]; then
  echo "✓ Metrics: OK"
else
  echo "⚠ Metrics: FAILED (HTTP $METRICS_STATUS)"
fi

# 8. Worker Jobs
echo "8. Checking Worker Jobs..."
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"
ACTIVE_JOBS=$(redis-cli -u "$REDIS_URL" LLEN bull:sync-worker:active 2>/dev/null || echo "0")
echo "✓ Workers: $ACTIVE_JOBS active jobs"

echo ""
echo "=== Deployment Verification Complete ==="
echo "✓ All critical checks passed"
exit 0

Usage

# Run verification script
chmod +x scripts/verify-deployment.sh
./scripts/verify-deployment.sh

# Run as part of CI/CD
- name: Verify Deployment
  run: ./scripts/verify-deployment.sh

Rollback Procedures

Rollback Decision Criteria

Initiate rollback if any of the following hold (a latency spot-check sketch follows the list):

  • Health checks failing for > 5 minutes
  • Error rate > 5% in Sentry
  • Database queries timing out (P95 > 1000ms)
  • Critical functionality broken (auth, bookings, payments)
  • External API integrations failing (Stripe, Hostaway)
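
A coarse latency spot-check while deciding (illustrative only; Grafana remains the source of truth for P95):

# Slowest of 10 sequential /health calls - a rough proxy, not a true P95
for i in {1..10}; do
  curl -s -o /dev/null -w "%{time_total}\n" https://api.tvl.com/health
done | sort -n | tail -1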

Full Rollback Procedure

Estimated Time: 5-10 minutes

#!/bin/bash
# scripts/rollback-production.sh

set -e

echo "=== EMERGENCY ROLLBACK INITIATED ==="
echo "Timestamp: $(date)"
echo ""

# 1. Rollback Backend (Railway)
echo "1. Rolling back Backend..."
railway redeploy --deployment-id <PREVIOUS_DEPLOYMENT_ID>
echo "✓ Backend rollback initiated"

# 2. Rollback Frontend (Vercel)
echo "2. Rolling back Frontend..."
vercel rollback --token=$VERCEL_TOKEN
echo "✓ Frontend rollback initiated"

# 3. Wait for deployments
echo "3. Waiting for rollback to complete..."
sleep 30

# 4. Verify health
# (test directly in the if-condition: under `set -e`, a bare failing command
# would abort the script before a later `$?` check could run)
echo "4. Verifying health..."
if ./scripts/verify-deployment.sh; then
  echo "✓ Rollback successful"
  echo "Please investigate and fix the issue before redeploying"
else
  echo "✗ Rollback health check failed"
  echo "ESCALATE TO ON-CALL ENGINEER IMMEDIATELY"
  exit 1
fi

echo ""
echo "=== ROLLBACK COMPLETE ==="

Database Rollback

Warning: Database rollback is complex. Only roll back if:

  1. Migration caused immediate production outage
  2. Compensating migration ready
  3. Data corruption detected

# 1. Create compensating migration
supabase migration new rollback_<original_migration_name>

# 2. Write rollback SQL
cat > supabase/migrations/<timestamp>_rollback.sql <<EOF
-- Rollback logic here
EOF

# 3. Apply rollback migration
supabase db push --project-ref <production-project-id>

# 4. Verify
./scripts/verify-deployment.sh

Post-Rollback Actions

  • Update #engineering-deploys with rollback notification
  • Create incident post-mortem ticket
  • Document root cause and fix
  • Schedule fix deployment
  • Review rollback effectiveness

Post-Deployment Verification

Verification Checklist

Within 15 minutes of deployment:

  • Run comprehensive health check script: ./scripts/verify-deployment.sh
  • Check error rates in Sentry (should be < 1%)
  • Check Grafana dashboards for anomalies
  • Verify critical user flows:
    • User login (Google OIDC)
    • Property list load
    • Create quote
    • Create booking
    • Process payment
  • Check worker job processing rates
  • Verify external API integrations (Stripe, Hostaway)

Within 1 hour of deployment:

  • Monitor P95 latency (should be < 500ms)
  • Check for spike in failed jobs (DLQ)
  • Verify scheduled jobs running (sync workers)
  • Check customer support channels for issues
  • Review deployment logs for warnings

Within 24 hours of deployment:

  • Review error trends in Sentry
  • Check Grafana for resource utilization trends
  • Verify data consistency (no orphaned records; see the spot-check sketch after this list)
  • Complete incident report (if any issues)
  • Update runbook with lessons learned
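
For the orphaned-records check, a hypothetical consistency query (the payments.booking_id column is an assumption based on entities mentioned elsewhere in this runbook; adapt to the real schema):

# Payments that reference a booking which no longer exists
psql "$DATABASE_URL" -c "
  SELECT count(*) AS orphaned_payments
  FROM payments p
  LEFT JOIN bookings b ON b.id = p.booking_id
  WHERE b.id IS NULL;"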

Smoke Test Script

#!/bin/bash
# scripts/smoke-test.sh

API_URL="https://api.tvl.com"
TEST_USER_TOKEN="<test-user-jwt>"

# Test 1: Login
echo "Test 1: User authentication..."
curl -f -X GET "$API_URL/api/v1/auth/me" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Login OK"

# Test 2: List properties
echo "Test 2: List properties..."
curl -f -X GET "$API_URL/api/v1/properties" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Properties OK"

# Test 3: Create quote (response captured for debugging)
echo "Test 3: Create quote..."
QUOTE_RESPONSE=$(curl -f -X POST "$API_URL/api/v1/quotes" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "unit_id": "<test-unit-id>",
    "check_in": "2025-11-01",
    "check_out": "2025-11-05",
    "guests": 2
  }') || exit 1
echo "✓ Quote OK"

# Test 4: Webhook endpoint
echo "Test 4: Webhook endpoint..."
curl -f -X POST "$API_URL/webhooks/hostaway" \
  -H "Content-Type: application/json" \
  -d '{"test": true}' || exit 1
echo "✓ Webhook OK"

echo ""
echo "✓ All smoke tests passed"

Emergency Procedures

Production Outage Response

Severity Levels:

  • P0 (Critical): Complete outage, revenue impact
  • P1 (High): Major functionality broken, partial outage
  • P2 (Medium): Minor functionality broken, workaround available
  • P3 (Low): Cosmetic issue, no functionality impact

P0/P1 Response:

  1. Acknowledge (within 5 minutes)
  2. Assess (within 15 minutes)
    • Check health endpoints
    • Review Grafana dashboards
    • Check Sentry errors
    • Identify root cause
  3. Mitigate (within 30 minutes)
    • Execute rollback if recent deployment
    • Restart services if transient failure
    • Scale up resources if capacity issue
    • Disable feature flag if specific feature broken
  4. Resolve (within 1 hour)
    • Deploy hotfix if needed
    • Verify health checks
    • Monitor for 30 minutes
  5. Communicate (ongoing)
    • Update status page every 15 minutes
    • Notify customers via email (if extended outage)
    • Post resolution summary

Emergency Contacts

Role                Contact          Escalation
On-Call Engineer    PagerDuty        Immediate
Engineering Lead    Slack DM         15 minutes
CTO                 Phone            30 minutes
External Support    Vendor support   As needed

Hotfix Deployment

Fast-track deployment for critical production fixes:

# 1. Create hotfix branch from main
git checkout main
git pull
git checkout -b hotfix/fix-payment-processing

# 2. Make minimal fix
# (edit code)

# 3. Commit and push
git add .
git commit -m "hotfix: Fix payment processing race condition"
git push origin hotfix/fix-payment-processing

# 4. Create PR with "HOTFIX" label
gh pr create --title "[HOTFIX] Fix payment processing" --label hotfix

# 5. Fast-track review (1 approval required)
# 6. Merge to main
gh pr merge --squash

# 7. Monitor deployment
./scripts/verify-deployment.sh

# 8. Verify fix in production
# (test specific functionality)
