TVL Platform - Deployment Runbook

Summary

This runbook provides step-by-step deployment procedures for the TVL platform, including pre-deployment checklists, deployment steps for each component, health verification, rollback procedures, and post-deployment validation.

Pre-Deployment Checklist

1. Code Review & Approval

  • All PRs merged to main branch
  • Code review completed by at least 1 senior engineer
  • CI/CD pipeline passing (all tests green)
  • Security scan completed (no high/critical vulnerabilities)
  • Performance regression tests passed
  • Database migration reviewed and approved

2. Environment Validation

Staging Environment

  • Staging deployment successful
  • Manual QA testing completed
  • Integration tests passing
  • External API mocks verified (Hostaway, Stripe test mode)
  • No critical errors in Sentry (staging)
  • Load test completed (if applicable)

Production Readiness

3. Change Management

  • Deployment window scheduled (prefer low-traffic periods)
  • Deployment notification sent to #engineering-deploys Slack channel
  • On-call engineer identified and available
  • Rollback plan documented and reviewed
  • Customer-facing changes documented for support team
  • Maintenance window announced (if downtime expected)

4. Backup & Failsafe

  • Database backup triggered manually: supabase db backup create
  • Current production deployment tagged in Git: git tag prod-YYYYMMDD-HHMM (see the helper sketch after this list)
  • Rollback deployment IDs noted:
    • Vercel: <deployment-id>
    • Railway: <deployment-id>
  • Redis snapshot created (Upstash dashboard)
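
The tagging and backup steps above can be scripted. A minimal helper sketch (the backup command and tag format come from the checklist items; everything else is illustrative):

#!/bin/bash
# Hypothetical pre-deploy snapshot helper; adjust to your tooling.
set -e

STAMP=$(date +%Y%m%d-%H%M)

# Tag the currently deployed commit so rollback has a known-good ref
git tag "prod-$STAMP"
git push origin "prod-$STAMP"

# Trigger a database backup (command as listed in the checklist above)
supabase db backup create

echo "Snapshot complete: prod-$STAMP"
echo "Remember to record the current Vercel and Railway deployment IDs."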

Deployment Overview

Deployment Architecture

Deployment Order

  1. Database Migrations (manual approval required)
  2. Backend API (Railway/Fly with health check gate)
  3. Worker Processes (parallel with backend)
  4. Frontend (Vercel automatic)
  5. Health Check Verification
  6. Traffic Switchover (automatic for zero-downtime)
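
Expressed as a minimal orchestration sketch (command names mirror the examples later in this runbook; the PROD_PROJECT_REF variable and manual migration gate are assumptions):

#!/bin/bash
set -e

# 1. Database migrations (manual approval required before running this)
supabase db push --project-ref "$PROD_PROJECT_REF"

# 2-3. Backend API and workers (Railway starts both from the Procfile)
railway up --environment production
./scripts/health-check.sh   # health check gate

# 4. Frontend (normally automatic on merge to main; shown for completeness)
(cd frontend && vercel --prod)

# 5. Full health verification; traffic switchover is automatic
./scripts/verify-deployment.sh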

Deployment Timing

  • Duration: ~15-20 minutes (no downtime)
  • Recommended Window:
    • Weekdays: 10am-2pm EST (low traffic)
    • Avoid: Friday 4pm+, weekends, holidays
    • Emergency hotfix: Anytime with approval

Frontend Deployment (Vercel)

Automatic Deployment

Frontend deployments are automatic on merge to main:

# Triggered by GitHub Actions
# .github/workflows/deploy-production.yml
- name: Deploy Frontend to Vercel
  run: vercel deploy --prod --token=${{ secrets.VERCEL_TOKEN }}

Manual Deployment (Emergency)

If automatic deployment fails:

# 1. Install Vercel CLI
npm install -g vercel

# 2. Login to Vercel
vercel login

# 3. Deploy to production
cd frontend
vercel --prod

# 4. Verify deployment
curl -f https://app.tvl.com/health || echo "Frontend health check failed"

Vercel Deployment Configuration

Environment: Production
Domain: app.tvl.com
Build Command: npm run build
Output Directory: dist
Node Version: 20.x

Frontend Health Check

# Check deployment status
curl -I https://app.tvl.com

# Expected response:
# HTTP/2 200
# x-vercel-id: <deployment-id>
# cache-control: public, max-age=0, must-revalidate

# Check JavaScript bundle size (curl does not expand shell globs in URLs;
# resolve the hashed bundle filename from the page HTML first)
ASSET=$(curl -s https://app.tvl.com/ | grep -o '/assets/index-[^"]*\.js' | head -1)
curl -fs "https://app.tvl.com$ASSET" | wc -c
# Expected: > 50KB

Rollback Frontend

# Option 1: Rollback via Vercel dashboard
# 1. Go to https://vercel.com/tvl/tvl-frontend/deployments
# 2. Select previous working deployment
# 3. Click "Promote to Production"

# Option 2: Rollback via CLI
vercel rollback --token=$VERCEL_TOKEN

Backend Deployment (Railway/Fly)

Automatic Deployment

Backend deployments are automatic on merge to main with health check gates:

# .github/workflows/deploy-production.yml
deploy-backend:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to Railway
      run: railway up --environment production

    - name: Wait for deployment
      run: railway status --wait

    - name: Health check
      run: |
        for i in {1..30}; do
          if curl -f https://api.tvl.com/health; then
            echo "Health check passed"
            exit 0
          fi
          echo "Waiting for health check... ($i/30)"
          sleep 10
        done
        echo "Health check failed after 5 minutes"
        exit 1

Manual Deployment (Emergency)

# 1. Install Railway CLI
npm install -g @railway/cli

# 2. Login to Railway
railway login

# 3. Link to production project
railway link tvl-production

# 4. Deploy backend
cd backend
railway up --environment production

# 5. Monitor deployment logs
railway logs --follow

# 6. Verify health check
curl -f https://api.tvl.com/health

Backend Health Check Endpoint

Endpoint: GET /health

Expected Response:

{
  "status": "healthy",
  "timestamp": "2025-10-24T10:00:00Z",
  "version": "1.2.3",
  "uptime": 3600,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "hostaway": "healthy"
  }
}
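
To confirm the build you just shipped is actually serving, compare the reported version against the release you expect. A quick sketch, assuming the payload above and jq installed:

DEPLOYED=$(curl -s https://api.tvl.com/health | jq -r '.version')
EXPECTED="1.2.3"   # the version you just released (example value from above)
if [ "$DEPLOYED" = "$EXPECTED" ]; then
  echo "✓ Version $DEPLOYED is live"
else
  echo "✗ Expected $EXPECTED but production reports $DEPLOYED"
fi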

Health Check Script:

#!/bin/bash
# scripts/health-check.sh

API_URL="https://api.tvl.com"
MAX_ATTEMPTS=30
WAIT_SECONDS=10

for i in $(seq 1 $MAX_ATTEMPTS); do
  echo "Health check attempt $i/$MAX_ATTEMPTS..."

  RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")

  if [ "$RESPONSE" -eq 200 ]; then
    echo "✓ Health check passed"

    # Verify all dependencies
    HEALTH_JSON=$(curl -s "$API_URL/health")
    DB_STATUS=$(echo "$HEALTH_JSON" | jq -r '.checks.database')
    REDIS_STATUS=$(echo "$HEALTH_JSON" | jq -r '.checks.redis')

    if [ "$DB_STATUS" = "healthy" ] && [ "$REDIS_STATUS" = "healthy" ]; then
      echo "✓ All dependencies healthy"
      exit 0
    else
      echo "✗ Dependencies unhealthy: DB=$DB_STATUS, Redis=$REDIS_STATUS"
      exit 1
    fi
  fi

  echo "Waiting $WAIT_SECONDS seconds..."
  sleep $WAIT_SECONDS
done

echo "✗ Health check failed after $((MAX_ATTEMPTS * WAIT_SECONDS)) seconds"
exit 1
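
To run it manually after a backend deploy:

chmod +x scripts/health-check.sh
./scripts/health-check.sh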

Rollback Backend

# Option 1: Rollback via Railway dashboard
# 1. Go to https://railway.app/project/tvl-production/deployments
# 2. Select previous working deployment
# 3. Click "Redeploy"

# Option 2: Rollback via CLI
railway redeploy --deployment-id <previous-deployment-id>

# Option 3: Rollback via Git
# (HEAD revert only undoes the most recent commit; for a multi-commit deploy,
# revert the merge commit instead: git revert -m 1 <merge-sha>)
git revert HEAD
git push origin main
# (triggers automatic redeployment)

Database Migrations

Migration Strategy

Approach: Forward-only migrations with manual approval

Migration Tool: Supabase CLI

Pre-Migration Checklist

  • Migration SQL reviewed by DBA or senior engineer
  • Migration tested in staging environment
  • Backup completed: supabase db backup create
  • Migration is backwards-compatible (no breaking changes)
  • RLS policies updated if schema changes affect access control
  • Indexes created for new columns (if applicable)

Migration Execution

# 1. Create migration locally (already done in development)
# supabase migration new add_pricing_snapshot_to_bookings

# 2. Review migration file
cat supabase/migrations/20251024_add_pricing_snapshot_to_bookings.sql

# 3. Apply migration to staging (automatic via CI)
# Already applied when merged to staging branch

# 4. Apply migration to production (MANUAL)
supabase db push --project-ref <production-project-id>

# 5. Verify migration applied
supabase db dump --project-ref <production-project-id> --schema public | grep pricing_snapshot

# Expected: Column exists in bookings table

Migration Rollback

Important: Supabase does not support automatic rollback. Create a compensating migration instead.

-- Example: Rollback adding column
-- supabase migration new rollback_pricing_snapshot

BEGIN;

-- Remove column (if safe to do so)
ALTER TABLE bookings DROP COLUMN IF EXISTS pricing_snapshot;

-- Restore previous state
-- (Add any necessary data restoration logic)

COMMIT;

Common Migration Patterns

Adding Column:

-- Safe: Allows NULL or has default
ALTER TABLE bookings
ADD COLUMN pricing_snapshot JSONB DEFAULT '{}';

-- Create index if needed
CREATE INDEX idx_bookings_pricing_snapshot
ON bookings USING gin(pricing_snapshot);

Adding Table:

CREATE TABLE payment_retries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id UUID NOT NULL REFERENCES payments(id),
  attempt_number INT NOT NULL,
  error_message TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Enable RLS
ALTER TABLE payment_retries ENABLE ROW LEVEL SECURITY;

-- Add RLS policy
CREATE POLICY "Users access payment_retries in their org"
ON payment_retries
FOR ALL
USING (
  EXISTS (
    SELECT 1 FROM payments p
    WHERE p.id = payment_retries.payment_id
      AND p.org_id = current_setting('app.current_org_id')::uuid
  )
);

Modifying Column:

-- Safe: Expanding constraint (varchar(50) -> varchar(100))
ALTER TABLE properties
ALTER COLUMN name TYPE VARCHAR(100);

-- Unsafe: Narrowing constraint (requires validation)
-- DO NOT DO THIS without data validation:
-- ALTER TABLE properties ALTER COLUMN name TYPE VARCHAR(10);
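
If a narrowing change is ever unavoidable, validate existing data first. A hedged pre-flight sketch, assuming direct psql access via a DATABASE_URL variable:

# Count rows that would violate VARCHAR(10) before attempting the change;
# proceed only if this returns 0
psql "$DATABASE_URL" -c \
  "SELECT count(*) FROM properties WHERE length(name) > 10;"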

Worker Deployment

Workers are deployed alongside the backend but run as separate processes.

Worker Types

  1. Sync Worker - Hostaway synchronization
  2. Payment Worker - Payment processing
  3. Notification Worker - Email/SMS notifications

Deployment Command

# Railway automatically starts workers based on Procfile
# Procfile:
# web: node dist/server.js
# sync-worker: node dist/workers/sync-worker.js
# payment-worker: node dist/workers/payment-worker.js
# notification-worker: node dist/workers/notification-worker.js

# Verify workers running
railway ps

# Expected output:
# NAME                 STATUS   REPLICAS
# web                  running  2
# sync-worker          running  2
# payment-worker       running  1
# notification-worker  running  1

Worker Health Check

# Check worker status via Redis
redis-cli -u $REDIS_URL

# Check active workers
> LLEN bull:sync-worker:active
> LLEN bull:payment-worker:active
> LLEN bull:notification-worker:active

# Check failed jobs
> LLEN bull:sync-worker:failed
> LLEN bull:payment-worker:failed

# Expected: failed = 0; active varies with load (may be 0 when queues are idle)
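
The per-queue checks above can be looped for a one-shot summary (assumes the Bull key naming shown above and a REDIS_URL variable):

#!/bin/bash
# Summarize queue depth for each worker; flag any failed jobs
for q in sync-worker payment-worker notification-worker; do
  ACTIVE=$(redis-cli -u "$REDIS_URL" LLEN "bull:$q:active")
  FAILED=$(redis-cli -u "$REDIS_URL" LLEN "bull:$q:failed")
  echo "$q: active=$ACTIVE failed=$FAILED"
  if [ "$FAILED" -gt 0 ]; then
    echo "⚠ $q has failed jobs - inspect before restarting"
  fi
done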

Restart Workers

# Restart specific worker
railway restart sync-worker

# Restart all workers
railway restart

Health Checks

Comprehensive Health Check Script

#!/bin/bash
# scripts/verify-deployment.sh

set -e

API_URL="https://api.tvl.com"
APP_URL="https://app.tvl.com"

echo "=== TVL Platform Deployment Verification ==="
echo ""

# 1. Frontend Health
echo "1. Checking Frontend..."
# `|| true` keeps set -e from aborting silently when a host is unreachable
FRONTEND_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL" || true)
if [ "$FRONTEND_STATUS" -eq 200 ]; then
  echo "✓ Frontend: OK"
else
  echo "✗ Frontend: FAILED (HTTP $FRONTEND_STATUS)"
  exit 1
fi

# 2. Backend API Health
echo "2. Checking Backend API..."
API_RESPONSE=$(curl -s "$API_URL/health" || true)
API_STATUS=$(echo "$API_RESPONSE" | jq -r '.status')
if [ "$API_STATUS" = "healthy" ]; then
  echo "✓ Backend API: OK"
else
  echo "✗ Backend API: FAILED"
  echo "Response: $API_RESPONSE"
  exit 1
fi

# 3. Database Connection
echo "3. Checking Database..."
DB_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.database')
if [ "$DB_STATUS" = "healthy" ]; then
  echo "✓ Database: OK"
else
  echo "✗ Database: FAILED"
  exit 1
fi

# 4. Redis Connection
echo "4. Checking Redis..."
REDIS_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.redis')
if [ "$REDIS_STATUS" = "healthy" ]; then
  echo "✓ Redis: OK"
else
  echo "✗ Redis: FAILED"
  exit 1
fi

# 5. External API Connectivity
echo "5. Checking External APIs..."
STRIPE_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.stripe')
HOSTAWAY_STATUS=$(echo "$API_RESPONSE" | jq -r '.checks.hostaway')

if [ "$STRIPE_STATUS" = "healthy" ]; then
  echo "✓ Stripe: OK"
else
  echo "⚠ Stripe: $STRIPE_STATUS"
fi

if [ "$HOSTAWAY_STATUS" = "healthy" ]; then
  echo "✓ Hostaway: OK"
else
  echo "⚠ Hostaway: $HOSTAWAY_STATUS"
fi

# 6. Critical Endpoints
echo "6. Checking Critical Endpoints..."

# Auth endpoint
AUTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/api/v1/auth/me" -H "Authorization: Bearer invalid" || true)
if [ "$AUTH_STATUS" -eq 401 ]; then
  echo "✓ Auth endpoint: OK (returns 401 as expected)"
else
  echo "✗ Auth endpoint: FAILED (expected 401, got $AUTH_STATUS)"
  exit 1
fi

# Properties endpoint (requires auth)
PROPERTIES_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/api/v1/properties" || true)
if [ "$PROPERTIES_STATUS" -eq 401 ]; then
  echo "✓ Properties endpoint: OK (auth required)"
else
  echo "✗ Properties endpoint: FAILED (expected 401, got $PROPERTIES_STATUS)"
fi

# 7. Metrics Endpoint
echo "7. Checking Metrics..."
METRICS_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/metrics" || true)
if [ "$METRICS_STATUS" -eq 200 ]; then
  echo "✓ Metrics: OK"
else
  echo "⚠ Metrics: FAILED (HTTP $METRICS_STATUS)"
fi

# 8. Worker Jobs
echo "8. Checking Worker Jobs..."
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"
ACTIVE_JOBS=$(redis-cli -u "$REDIS_URL" LLEN bull:sync-worker:active 2>/dev/null || echo "0")
echo "✓ Workers: $ACTIVE_JOBS active jobs"

echo ""
echo "=== Deployment Verification Complete ==="
echo "✓ All critical checks passed"
exit 0

Usage

# Run verification script
chmod +x scripts/verify-deployment.sh
./scripts/verify-deployment.sh

# Run as part of CI/CD
- name: Verify Deployment
  run: ./scripts/verify-deployment.sh

Rollback Procedures

Rollback Decision Criteria

Initiate rollback if any of the following hold (a latency spot-check sketch follows the list):

  • Health checks failing for > 5 minutes
  • Error rate > 5% in Sentry
  • Database queries timing out (P95 > 1000ms)
  • Critical functionality broken (auth, bookings, payments)
  • External API integrations failing (Stripe, Hostaway)
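
A coarse latency spot-check while deciding (illustrative only; Grafana remains the source of truth for P95):

# Slowest of 10 sequential /health calls - a rough proxy, not a true P95
for i in {1..10}; do
  curl -s -o /dev/null -w "%{time_total}\n" https://api.tvl.com/health
done | sort -n | tail -1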

Full Rollback Procedure

Estimated Time: 5-10 minutes

#!/bin/bash
# scripts/rollback-production.sh

set -e

echo "=== EMERGENCY ROLLBACK INITIATED ==="
echo "Timestamp: $(date)"
echo ""

# 1. Rollback Backend (Railway)
echo "1. Rolling back Backend..."
railway redeploy --deployment-id <PREVIOUS_DEPLOYMENT_ID>
echo "✓ Backend rollback initiated"

# 2. Rollback Frontend (Vercel)
echo "2. Rolling back Frontend..."
vercel rollback --token=$VERCEL_TOKEN
echo "✓ Frontend rollback initiated"

# 3. Wait for deployments
echo "3. Waiting for rollback to complete..."
sleep 30

# 4. Verify health
# (test directly in the if-condition: under `set -e`, a bare failing command
# would abort the script before a later `$?` check could run)
echo "4. Verifying health..."
if ./scripts/verify-deployment.sh; then
  echo "✓ Rollback successful"
  echo "Please investigate and fix the issue before redeploying"
else
  echo "✗ Rollback health check failed"
  echo "ESCALATE TO ON-CALL ENGINEER IMMEDIATELY"
  exit 1
fi

echo ""
echo "=== ROLLBACK COMPLETE ==="

Database Rollback

Warning: Database rollback is complex. Only roll back if:

  1. Migration caused immediate production outage
  2. Compensating migration ready
  3. Data corruption detected

# 1. Create compensating migration
supabase migration new rollback_<original_migration_name>

# 2. Write rollback SQL
cat > supabase/migrations/<timestamp>_rollback.sql <<EOF
-- Rollback logic here
EOF

# 3. Apply rollback migration
supabase db push --project-ref <production-project-id>

# 4. Verify
./scripts/verify-deployment.sh

Post-Rollback Actions

  • Update #engineering-deploys with rollback notification
  • Create incident post-mortem ticket
  • Document root cause and fix
  • Schedule fix deployment
  • Review rollback effectiveness

Post-Deployment Verification

Verification Checklist

Within 15 minutes of deployment:

  • Run comprehensive health check script: ./scripts/verify-deployment.sh
  • Check error rates in Sentry (should be < 1%)
  • Check Grafana dashboards for anomalies
  • Verify critical user flows:
    • User login (Google OIDC)
    • Property list load
    • Create quote
    • Create booking
    • Process payment
  • Check worker job processing rates
  • Verify external API integrations (Stripe, Hostaway)

Within 1 hour of deployment:

  • Monitor P95 latency (should be < 500ms)
  • Check for spike in failed jobs (DLQ)
  • Verify scheduled jobs running (sync workers)
  • Check customer support channels for issues
  • Review deployment logs for warnings

Within 24 hours of deployment:

  • Review error trends in Sentry
  • Check Grafana for resource utilization trends
  • Verify data consistency (no orphaned records; see the spot-check sketch after this list)
  • Complete incident report (if any issues)
  • Update runbook with lessons learned
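
For the orphaned-records check, a hypothetical consistency query (the payments.booking_id column is an assumption based on entities mentioned elsewhere in this runbook; adapt to the real schema):

# Payments that reference a booking which no longer exists
psql "$DATABASE_URL" -c "
  SELECT count(*) AS orphaned_payments
  FROM payments p
  LEFT JOIN bookings b ON b.id = p.booking_id
  WHERE b.id IS NULL;"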

Smoke Test Script

#!/bin/bash
# scripts/smoke-test.sh

API_URL="https://api.tvl.com"
TEST_USER_TOKEN="<test-user-jwt>"

# Test 1: Login
echo "Test 1: User authentication..."
curl -f -X GET "$API_URL/api/v1/auth/me" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Login OK"

# Test 2: List properties
echo "Test 2: List properties..."
curl -f -X GET "$API_URL/api/v1/properties" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Properties OK"

# Test 3: Create quote (response captured for debugging)
echo "Test 3: Create quote..."
QUOTE_RESPONSE=$(curl -f -X POST "$API_URL/api/v1/quotes" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "unit_id": "<test-unit-id>",
    "check_in": "2025-11-01",
    "check_out": "2025-11-05",
    "guests": 2
  }') || exit 1
echo "✓ Quote OK"

# Test 4: Webhook endpoint
echo "Test 4: Webhook endpoint..."
curl -f -X POST "$API_URL/webhooks/hostaway" \
  -H "Content-Type: application/json" \
  -d '{"test": true}' || exit 1
echo "✓ Webhook OK"

echo ""
echo "✓ All smoke tests passed"

Emergency Procedures

Production Outage Response

Severity Levels:

  • P0 (Critical): Complete outage, revenue impact
  • P1 (High): Major functionality broken, partial outage
  • P2 (Medium): Minor functionality broken, workaround available
  • P3 (Low): Cosmetic issue, no functionality impact

P0/P1 Response:

  1. Acknowledge (within 5 minutes)
  2. Assess (within 15 minutes)
    • Check health endpoints
    • Review Grafana dashboards
    • Check Sentry errors
    • Identify root cause
  3. Mitigate (within 30 minutes)
    • Execute rollback if recent deployment
    • Restart services if transient failure
    • Scale up resources if capacity issue
    • Disable feature flag if specific feature broken
  4. Resolve (within 1 hour)
    • Deploy hotfix if needed
    • Verify health checks
    • Monitor for 30 minutes
  5. Communicate (ongoing)
    • Update status page every 15 minutes
    • Notify customers via email (if extended outage)
    • Post resolution summary

Emergency Contacts

Role                Contact          Escalation
On-Call Engineer    PagerDuty        Immediate
Engineering Lead    Slack DM         15 minutes
CTO                 Phone            30 minutes
External Support    Vendor support   As needed

Hotfix Deployment

Fast-track deployment for critical production fixes:

# 1. Create hotfix branch from main
git checkout main
git pull
git checkout -b hotfix/fix-payment-processing

# 2. Make minimal fix
# (edit code)

# 3. Commit and push
git add .
git commit -m "hotfix: Fix payment processing race condition"
git push origin hotfix/fix-payment-processing

# 4. Create PR with "HOTFIX" label
gh pr create --title "[HOTFIX] Fix payment processing" --label hotfix

# 5. Fast-track review (1 approval required)
# 6. Merge to main
gh pr merge --squash

# 7. Monitor deployment
./scripts/verify-deployment.sh

# 8. Verify fix in production
# (test specific functionality)
