TVL Platform - Deployment Runbook
Summary
This runbook provides step-by-step deployment procedures for the TVL platform, including pre-deployment checklists, deployment steps for each component, health verification, rollback procedures, and post-deployment validation.
Table of Contents
- Pre-Deployment Checklist
- Deployment Overview
- Frontend Deployment (Vercel)
- Backend Deployment (Railway/Fly)
- Database Migrations
- Worker Deployment
- Health Checks
- Rollback Procedures
- Post-Deployment Verification
- Emergency Procedures
Pre-Deployment Checklist
1. Code Review & Approval
- All PRs merged to main branch
- Code review completed by at least 1 senior engineer
- CI/CD pipeline passing (all tests green)
- Security scan completed (no high/critical vulnerabilities)
- Performance regression tests passed
- Database migration reviewed and approved
2. Environment Validation
Staging Environment
- Staging deployment successful
- Manual QA testing completed
- Integration tests passing
- External API mocks verified (Hostaway, Stripe test mode)
- No critical errors in Sentry (staging)
- Load test completed (if applicable)
Production Readiness
- All environment variables configured in secret manager
- SSL certificates valid and not expiring soon (> 30 days)
- Database backup completed within last 24 hours
- Redis persistence verified
- Third-party services operational (check status pages):
- Stripe: https://status.stripe.com
- Supabase: https://status.supabase.com
- Vercel: https://www.vercel-status.com
- Railway: https://railway.statuspage.io
 
3. Change Management
- Deployment window scheduled (prefer low-traffic periods)
- Deployment notification sent to #engineering-deploys Slack channel
- On-call engineer identified and available
- Rollback plan documented and reviewed
- Customer-facing changes documented for support team
- Maintenance window announced (if downtime expected)
4. Backup & Failsafe
- Database backup triggered manually: supabase db backup create
- Current production deployment tagged in Git: git tag prod-YYYYMMDD-HHMM
- Rollback deployment IDs noted:
  - Vercel: <deployment-id>
  - Railway: <deployment-id>
- Redis snapshot created (Upstash dashboard)
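The tagging step above can be scripted so the tag always matches the prod-YYYYMMDD-HHMM convention; a sketch (the git commands are left commented so the snippet is safe to dry-run):

```shell
#!/bin/bash
# Generate the rollback tag name in the prod-YYYYMMDD-HHMM convention above.
TAG="prod-$(date +%Y%m%d-%H%M)"
echo "Tagging release as $TAG"
# Then tag and push:
# git tag "$TAG"
# git push origin "$TAG"
```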
Deployment Overview
Deployment Architecture
Deployment Order
- Database Migrations (manual approval required)
- Backend API (Railway/Fly with health check gate)
- Worker Processes (parallel with backend)
- Frontend (Vercel automatic)
- Health Check Verification
- Traffic Switchover (automatic for zero-downtime)
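The order above can be driven from a single script; a skeleton sketch where each `true` is a placeholder for the real command from the corresponding section below:

```shell
#!/bin/bash
# Skeleton driver for the deployment order; `true` stands in for real commands
# so the skeleton runs as-is. Workers deploy alongside the backend step.
set -euo pipefail
run_step() {
  local name=$1; shift
  echo "==> $name"
  "$@" || { echo "x $name failed; stop and consider rollback"; return 1; }
}
run_step "Database migrations (manual approval)" true  # supabase db push
run_step "Backend API" true                            # railway up --environment production
run_step "Frontend" true                               # vercel deploy --prod
run_step "Health checks" true                          # ./scripts/verify-deployment.sh
```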
Deployment Timing
- Duration: ~15-20 minutes (no downtime)
- Recommended Window:
- Weekdays: 10am-2pm EST (low traffic)
- Avoid: Friday 4pm+, weekends, holidays
- Emergency hotfix: Anytime with approval
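The recommended window can be enforced in a deploy script. A sketch, with day and hour passed in as parameters so it is testable (window from the list above: weekdays 10:00-14:00 EST):

```shell
# Refuse non-emergency deploys outside the recommended window.
# dow: 1=Mon..7=Sun, hour: 0-23, both in EST (America/New_York).
in_deploy_window() {
  local dow=$1 hour=$2
  [ "$dow" -le 5 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]
}
# Usage:
# in_deploy_window "$(TZ=America/New_York date +%u)" "$(TZ=America/New_York date +%H)" \
#   || { echo "Outside deploy window; emergency hotfixes need approval"; exit 1; }
```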
 
Frontend Deployment (Vercel)
Automatic Deployment
Frontend deployments are automatic on merge to main:
# Triggered by GitHub Actions
# .github/workflows/deploy-production.yml
- name: Deploy Frontend to Vercel
  run: vercel deploy --prod --token=${{ secrets.VERCEL_TOKEN }}
Manual Deployment (Emergency)
If automatic deployment fails:
# 1. Install Vercel CLI
npm install -g vercel
# 2. Login to Vercel
vercel login
# 3. Deploy to production
cd frontend
vercel --prod
# 4. Verify deployment
curl -f https://app.tvl.com/health || echo "Frontend health check failed"
Vercel Deployment Configuration
Environment: Production
Domain: app.tvl.com
Build Command: npm run build
Output Directory: dist
Node Version: 20.x
Frontend Health Check
# Check deployment status
curl -I https://app.tvl.com
# Expected response:
# HTTP/2 200
# x-vercel-id: <deployment-id>
# cache-control: public, max-age=0, must-revalidate
# Check JavaScript assets (curl does not expand the * glob in URLs; look up
# the hashed filename from the page HTML first)
ASSET=$(curl -s https://app.tvl.com | grep -o 'assets/index-[^"]*\.js' | head -n1)
curl -sf "https://app.tvl.com/$ASSET" | wc -c
# Expected: > 50KB
Rollback Frontend
# Option 1: Rollback via Vercel dashboard
# 1. Go to https://vercel.com/tvl/tvl-frontend/deployments
# 2. Select previous working deployment
# 3. Click "Promote to Production"
# Option 2: Rollback via CLI
vercel rollback --token=$VERCEL_TOKEN
Backend Deployment (Railway/Fly)
Automatic Deployment
Backend deployments are automatic on merge to main with health check gates:
# .github/workflows/deploy-production.yml
deploy-backend:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to Railway
      run: railway up --environment production
    - name: Wait for deployment
      run: railway status --wait
    - name: Health check
      run: |
        for i in {1..30}; do
          if curl -f https://api.tvl.com/health; then
            echo "Health check passed"
            exit 0
          fi
          echo "Waiting for health check... ($i/30)"
          sleep 10
        done
        echo "Health check failed after 5 minutes"
        exit 1
Manual Deployment (Emergency)
# 1. Install Railway CLI
npm install -g @railway/cli
# 2. Login to Railway
railway login
# 3. Link to production project
railway link tvl-production
# 4. Deploy backend
cd backend
railway up --environment production
# 5. Monitor deployment logs
railway logs --follow
# 6. Verify health check
curl -f https://api.tvl.com/health
Backend Health Check Endpoint
Endpoint: GET /health
Expected Response:
{
  "status": "healthy",
  "timestamp": "2025-10-24T10:00:00Z",
  "version": "1.2.3",
  "uptime": 3600,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "hostaway": "healthy"
  }
}
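A quick way to pull the version field out of the payload above without jq (grep/sed only), e.g. to confirm the deployed version matches the release you just shipped:

```shell
# Extract .version from a /health JSON payload on stdin.
extract_version() {
  grep -o '"version"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\(.*\)"/\1/'
}
# Usage:
# DEPLOYED=$(curl -s https://api.tvl.com/health | extract_version)
# [ "$DEPLOYED" = "1.2.3" ] || echo "Version mismatch: $DEPLOYED"
```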
Health Check Script:
#!/bin/bash
# scripts/health-check.sh
API_URL="https://api.tvl.com"
MAX_ATTEMPTS=30
WAIT_SECONDS=10
for i in $(seq 1 $MAX_ATTEMPTS); do
  echo "Health check attempt $i/$MAX_ATTEMPTS..."
  RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" $API_URL/health)
  if [ "$RESPONSE" -eq 200 ]; then
    echo "✓ Health check passed"
    # Verify all dependencies
    HEALTH_JSON=$(curl -s $API_URL/health)
    DB_STATUS=$(echo $HEALTH_JSON | jq -r '.checks.database')
    REDIS_STATUS=$(echo $HEALTH_JSON | jq -r '.checks.redis')
    if [ "$DB_STATUS" = "healthy" ] && [ "$REDIS_STATUS" = "healthy" ]; then
      echo "✓ All dependencies healthy"
      exit 0
    else
      echo "✗ Dependencies unhealthy: DB=$DB_STATUS, Redis=$REDIS_STATUS"
      exit 1
    fi
  fi
  echo "Waiting $WAIT_SECONDS seconds..."
  sleep $WAIT_SECONDS
done
echo "✗ Health check failed after $((MAX_ATTEMPTS * WAIT_SECONDS)) seconds"
exit 1
Rollback Backend
# Option 1: Rollback via Railway dashboard
# 1. Go to https://railway.app/project/tvl-production/deployments
# 2. Select previous working deployment
# 3. Click "Redeploy"
# Option 2: Rollback via CLI
railway redeploy --deployment-id <previous-deployment-id>
# Option 3: Rollback via Git
git revert HEAD
git push origin main
# (triggers automatic redeployment)
Database Migrations
Migration Strategy
Approach: Forward-only migrations with manual approval
Migration Tool: Supabase CLI
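The "manual approval" part of the strategy can be enforced with a small gate in the deploy script: a sketch that requires the operator to type a confirmation before migrations run against production.

```shell
# Require an explicit typed confirmation before production migrations.
confirm_production() {
  printf 'Type "apply" to run migrations against PRODUCTION: '
  read -r answer
  [ "$answer" = "apply" ]
}
# Usage:
# confirm_production && supabase db push --project-ref <production-project-id>
```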
Pre-Migration Checklist
- Migration SQL reviewed by DBA or senior engineer
- Migration tested in staging environment
- Backup completed: supabase db backup create
- Migration is backwards-compatible (no breaking changes)
- RLS policies updated if schema changes affect access control
- Indexes created for new columns (if applicable)
Migration Execution
# 1. Create migration locally (already done in development)
# supabase migration new add_pricing_snapshot_to_bookings
# 2. Review migration file
cat supabase/migrations/20251024_add_pricing_snapshot_to_bookings.sql
# 3. Apply migration to staging (automatic via CI)
# Already applied when merged to staging branch
# 4. Apply migration to production (MANUAL)
supabase db push --project-ref <production-project-id>
# 5. Verify migration applied
supabase db dump --project-ref <production-project-id> --schema public | grep pricing_snapshot
# Expected: Column exists in bookings table
Migration Rollback
Important: Supabase does not support automatic rollback. Create a compensating migration instead.
-- Example: Rollback adding column
-- supabase migration new rollback_pricing_snapshot
BEGIN;
-- Remove column (if safe to do so)
ALTER TABLE bookings DROP COLUMN IF EXISTS pricing_snapshot;
-- Restore previous state
-- (Add any necessary data restoration logic)
COMMIT;
Common Migration Patterns
Adding Column:
-- Safe: Allows NULL or has default
ALTER TABLE bookings
ADD COLUMN pricing_snapshot JSONB DEFAULT '{}';
-- Create index if needed
CREATE INDEX idx_bookings_pricing_snapshot
ON bookings USING gin(pricing_snapshot);
Adding Table:
CREATE TABLE payment_retries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id UUID NOT NULL REFERENCES payments(id),
  attempt_number INT NOT NULL,
  error_message TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);
-- Enable RLS
ALTER TABLE payment_retries ENABLE ROW LEVEL SECURITY;
-- Add RLS policy
CREATE POLICY "Users access payment_retries in their org"
ON payment_retries
FOR ALL
USING (
  EXISTS (
    SELECT 1 FROM payments p
    WHERE p.id = payment_retries.payment_id
    AND p.org_id = current_setting('app.current_org_id')::uuid
  )
);
Modifying Column:
-- Safe: Expanding constraint (varchar(50) -> varchar(100))
ALTER TABLE properties
ALTER COLUMN name TYPE VARCHAR(100);
-- Unsafe: Narrowing constraint (requires validation)
-- DO NOT DO THIS without data validation:
-- ALTER TABLE properties ALTER COLUMN name TYPE VARCHAR(10);
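Narrowing should only happen after confirming existing data fits the new limit. A pure-shell sketch of that validation; the values would come from a `psql -Atc` export in practice (names containing whitespace would need a safer transport than word splitting):

```shell
# Check that every value fits within the proposed narrower length limit.
fits_limit() {
  local limit=$1; shift
  local v
  for v in "$@"; do
    [ "${#v}" -le "$limit" ] || { echo "too long: $v"; return 1; }
  done
}
# Usage (hypothetical):
# fits_limit 10 $(psql "$DATABASE_URL" -Atc "SELECT name FROM properties") \
#   && echo "safe to narrow"
```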
Worker Deployment
Workers are deployed alongside the backend but run as separate processes.
Worker Types
- Sync Worker - Hostaway synchronization
- Payment Worker - Payment processing
- Notification Worker - Email/SMS notifications
Deployment Command
# Railway automatically starts workers based on Procfile
# Procfile:
# web: node dist/server.js
# sync-worker: node dist/workers/sync-worker.js
# payment-worker: node dist/workers/payment-worker.js
# notification-worker: node dist/workers/notification-worker.js
# Verify workers running
railway ps
# Expected output:
# NAME                  STATUS    REPLICAS
# web                   running   2
# sync-worker           running   2
# payment-worker        running   1
# notification-worker   running   1
Worker Health Check
# Check worker status via Redis
redis-cli -u $REDIS_URL
# Check active workers
> LLEN bull:sync-worker:active
> LLEN bull:payment-worker:active
> LLEN bull:notification-worker:active
# Check failed jobs
> LLEN bull:sync-worker:failed
> LLEN bull:payment-worker:failed
# Expected: failed = 0 (active varies with load; 0 is normal when queues are idle)
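The failed-queue check can be scripted. A sketch where `queue_len` wraps the `redis-cli LLEN` call shown above (queue names follow the Bull convention from this section); stub `queue_len` out to test locally:

```shell
# Flag any worker queue with failed jobs; returns non-zero if any are found.
queue_len() { redis-cli -u "$REDIS_URL" LLEN "bull:$1:$2"; }
check_failed_queues() {
  local q failed rc=0
  for q in sync-worker payment-worker notification-worker; do
    failed=$(queue_len "$q" failed)
    if [ "$failed" -gt 0 ]; then
      echo "ALERT: $q has $failed failed jobs"
      rc=1
    fi
  done
  return $rc
}
```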
Restart Workers
# Restart specific worker
railway restart sync-worker
# Restart all workers
railway restart
Health Checks
Comprehensive Health Check Script
#!/bin/bash
# scripts/verify-deployment.sh
set -e
API_URL="https://api.tvl.com"
APP_URL="https://app.tvl.com"
echo "=== TVL Platform Deployment Verification ==="
echo ""
# 1. Frontend Health
echo "1. Checking Frontend..."
FRONTEND_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $APP_URL)
if [ "$FRONTEND_STATUS" -eq 200 ]; then
  echo "✓ Frontend: OK"
else
  echo "✗ Frontend: FAILED (HTTP $FRONTEND_STATUS)"
  exit 1
fi
# 2. Backend API Health
echo "2. Checking Backend API..."
API_RESPONSE=$(curl -s $API_URL/health)
API_STATUS=$(echo $API_RESPONSE | jq -r '.status')
if [ "$API_STATUS" = "healthy" ]; then
  echo "✓ Backend API: OK"
else
  echo "✗ Backend API: FAILED"
  echo "Response: $API_RESPONSE"
  exit 1
fi
# 3. Database Connection
echo "3. Checking Database..."
DB_STATUS=$(echo $API_RESPONSE | jq -r '.checks.database')
if [ "$DB_STATUS" = "healthy" ]; then
  echo "✓ Database: OK"
else
  echo "✗ Database: FAILED"
  exit 1
fi
# 4. Redis Connection
echo "4. Checking Redis..."
REDIS_STATUS=$(echo $API_RESPONSE | jq -r '.checks.redis')
if [ "$REDIS_STATUS" = "healthy" ]; then
  echo "✓ Redis: OK"
else
  echo "✗ Redis: FAILED"
  exit 1
fi
# 5. External API Connectivity
echo "5. Checking External APIs..."
STRIPE_STATUS=$(echo $API_RESPONSE | jq -r '.checks.stripe')
HOSTAWAY_STATUS=$(echo $API_RESPONSE | jq -r '.checks.hostaway')
if [ "$STRIPE_STATUS" = "healthy" ]; then
  echo "✓ Stripe: OK"
else
  echo "⚠ Stripe: $STRIPE_STATUS"
fi
if [ "$HOSTAWAY_STATUS" = "healthy" ]; then
  echo "✓ Hostaway: OK"
else
  echo "⚠ Hostaway: $HOSTAWAY_STATUS"
fi
# 6. Critical Endpoints
echo "6. Checking Critical Endpoints..."
# Auth endpoint
AUTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $API_URL/api/v1/auth/me -H "Authorization: Bearer invalid")
if [ "$AUTH_STATUS" -eq 401 ]; then
  echo "✓ Auth endpoint: OK (returns 401 as expected)"
else
  echo "✗ Auth endpoint: FAILED (expected 401, got $AUTH_STATUS)"
  exit 1
fi
# Properties endpoint (requires auth)
PROPERTIES_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $API_URL/api/v1/properties)
if [ "$PROPERTIES_STATUS" -eq 401 ]; then
  echo "✓ Properties endpoint: OK (auth required)"
else
  echo "✗ Properties endpoint: FAILED (expected 401, got $PROPERTIES_STATUS)"
fi
# 7. Metrics Endpoint
echo "7. Checking Metrics..."
METRICS_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $API_URL/metrics)
if [ "$METRICS_STATUS" -eq 200 ]; then
  echo "✓ Metrics: OK"
else
  echo "⚠ Metrics: FAILED (HTTP $METRICS_STATUS)"
fi
# 8. Worker Jobs
echo "8. Checking Worker Jobs..."
if [ -n "${REDIS_URL:-}" ]; then
  ACTIVE_JOBS=$(redis-cli -u "$REDIS_URL" LLEN bull:sync-worker:active 2>/dev/null || echo "0")
  echo "✓ Workers: $ACTIVE_JOBS active jobs"
else
  echo "⚠ Workers: REDIS_URL not set, skipping queue check"
fi
echo ""
echo "=== Deployment Verification Complete ==="
echo "✓ All critical checks passed"
exit 0
Usage
# Run verification script
chmod +x scripts/verify-deployment.sh
./scripts/verify-deployment.sh
# Run as part of CI/CD
- name: Verify Deployment
  run: ./scripts/verify-deployment.sh
Rollback Procedures
Rollback Decision Criteria
Initiate rollback if:
- Health checks failing for > 5 minutes
- Error rate > 5% in Sentry
- Database queries timing out (P95 > 1000ms)
- Critical functionality broken (auth, bookings, payments)
- External API integrations failing (Stripe, Hostaway)
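The numeric criteria above can be encoded as a quick check for use in monitoring scripts; a sketch using the 5% error-rate and 1000ms P95 thresholds from the list (other criteria still need human judgment):

```shell
# Return success (0) if the metrics warrant a rollback.
should_rollback() {
  # $1 = error rate in percent, $2 = P95 latency in ms
  awk -v er="$1" -v p95="$2" 'BEGIN { exit !(er > 5 || p95 > 1000) }'
}
# Usage:
# should_rollback "$ERROR_RATE" "$P95_MS" && ./scripts/rollback-production.sh
```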
Full Rollback Procedure
Estimated Time: 5-10 minutes
#!/bin/bash
# scripts/rollback-production.sh
set -e
echo "=== EMERGENCY ROLLBACK INITIATED ==="
echo "Timestamp: $(date)"
echo ""
# 1. Rollback Backend (Railway)
echo "1. Rolling back Backend..."
railway redeploy --deployment-id <PREVIOUS_DEPLOYMENT_ID>
echo "✓ Backend rollback initiated"
# 2. Rollback Frontend (Vercel)
echo "2. Rolling back Frontend..."
vercel rollback --token=$VERCEL_TOKEN
echo "✓ Frontend rollback initiated"
# 3. Wait for deployments
echo "3. Waiting for rollback to complete..."
sleep 30
# 4. Verify health
echo "4. Verifying health..."
if ./scripts/verify-deployment.sh; then
  echo "✓ Rollback successful"
  echo "Please investigate and fix the issue before redeploying"
else
  echo "✗ Rollback health check failed"
  echo "ESCALATE TO ON-CALL ENGINEER IMMEDIATELY"
  exit 1
fi
echo ""
echo "=== ROLLBACK COMPLETE ==="
Database Rollback
Warning: Database rollback is complex. Only rollback if:
- Migration caused immediate production outage
- Compensating migration ready
- Data corruption detected
# 1. Create compensating migration
supabase migration new rollback_<original_migration_name>
# 2. Write rollback SQL
cat > supabase/migrations/<timestamp>_rollback.sql <<EOF
-- Rollback logic here
EOF
# 3. Apply rollback migration
supabase db push --project-ref <production-project-id>
# 4. Verify
./scripts/verify-deployment.sh
Post-Rollback Actions
- Update #engineering-deploys with rollback notification
- Create incident post-mortem ticket
- Document root cause and fix
- Schedule fix deployment
- Review rollback effectiveness
Post-Deployment Verification
Verification Checklist
Within 15 minutes of deployment:
- Run comprehensive health check script: ./scripts/verify-deployment.sh
- Check error rates in Sentry (should be < 1%)
- Check Grafana dashboards for anomalies
- Verify critical user flows:
  - User login (Google OIDC)
  - Property list load
  - Create quote
  - Create booking
  - Process payment
- Check worker job processing rates
- Verify external API integrations (Stripe, Hostaway)
Within 1 hour of deployment:
- Monitor P95 latency (should be < 500ms)
- Check for spike in failed jobs (DLQ)
- Verify scheduled jobs running (sync workers)
- Check customer support channels for issues
- Review deployment logs for warnings
Within 24 hours of deployment:
- Review error trends in Sentry
- Check Grafana for resource utilization trends
- Verify data consistency (no orphaned records)
- Complete incident report (if any issues)
- Update runbook with lessons learned
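The monitoring items above can be partially automated. A sketch that polls an error-rate source and alerts past the 1% threshold from the checklist; `get_error_rate` is a stub standing in for your Sentry or Grafana query, and a real loop would sleep between checks:

```shell
# Poll error rate N times; return non-zero on the first breach of 1%.
get_error_rate() { echo 0.4; }   # stub: replace with a real metrics query
monitor_error_rate() {
  local checks=$1 rate
  for _ in $(seq 1 "$checks"); do
    rate=$(get_error_rate)
    if awk -v r="$rate" 'BEGIN { exit !(r > 1) }'; then
      echo "ALERT: error rate ${rate}% exceeds 1%"
      return 1
    fi
    # sleep 60  # between checks in practice
  done
  echo "Error rate nominal after $checks checks"
}
```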
Smoke Test Script
#!/bin/bash
# scripts/smoke-test.sh
API_URL="https://api.tvl.com"
TEST_USER_TOKEN="<test-user-jwt>"
# Test 1: Login
echo "Test 1: User authentication..."
curl -f -X GET "$API_URL/api/v1/auth/me" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Login OK"
# Test 2: List properties
echo "Test 2: List properties..."
curl -f -X GET "$API_URL/api/v1/properties" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" || exit 1
echo "✓ Properties OK"
# Test 3: Create quote
echo "Test 3: Create quote..."
QUOTE_RESPONSE=$(curl -f -X POST "$API_URL/api/v1/quotes" \
  -H "Authorization: Bearer $TEST_USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "unit_id": "<test-unit-id>",
    "check_in": "2025-11-01",
    "check_out": "2025-11-05",
    "guests": 2
  }') || exit 1
echo "✓ Quote OK"
# Test 4: Webhook endpoint
echo "Test 4: Webhook endpoint..."
curl -f -X POST "$API_URL/webhooks/hostaway" \
  -H "Content-Type: application/json" \
  -d '{"test": true}' || exit 1
echo "✓ Webhook OK"
echo ""
echo "✓ All smoke tests passed"
Emergency Procedures
Production Outage Response
Severity Levels:
- P0 (Critical): Complete outage, revenue impact
- P1 (High): Major functionality broken, partial outage
- P2 (Medium): Minor functionality broken, workaround available
- P3 (Low): Cosmetic issue, no functionality impact
P0/P1 Response:
1. Acknowledge (within 5 minutes)
   - Post in #incidents Slack channel
   - Update status page: https://status.tvl.com
   - Notify on-call engineer
2. Assess (within 15 minutes)
   - Check health endpoints
   - Review Grafana dashboards
   - Check Sentry errors
   - Identify root cause
3. Mitigate (within 30 minutes)
   - Execute rollback if recent deployment
   - Restart services if transient failure
   - Scale up resources if capacity issue
   - Disable feature flag if specific feature broken
4. Resolve (within 1 hour)
   - Deploy hotfix if needed
   - Verify health checks
   - Monitor for 30 minutes
5. Communicate (ongoing)
   - Update status page every 15 minutes
   - Notify customers via email (if extended outage)
   - Post resolution summary
Emergency Contacts
| Role | Contact | Escalation | 
|---|---|---|
| On-Call Engineer | PagerDuty | Immediate | 
| Engineering Lead | Slack DM | 15 minutes | 
| CTO | Phone | 30 minutes | 
| External Support | Vendor support | As needed | 
Hotfix Deployment
Fast-track deployment for critical production fixes:
# 1. Create hotfix branch from main
git checkout main
git pull
git checkout -b hotfix/fix-payment-processing
# 2. Make minimal fix
# (edit code)
# 3. Commit and push
git add .
git commit -m "hotfix: Fix payment processing race condition"
git push origin hotfix/fix-payment-processing
# 4. Create PR with "HOTFIX" label
gh pr create --title "[HOTFIX] Fix payment processing" --label hotfix
# 5. Fast-track review (1 approval required)
# 6. Merge to main
gh pr merge --squash
# 7. Monitor deployment
./scripts/verify-deployment.sh
# 8. Verify fix in production
# (test specific functionality)
Sources
- /mnt/c/GitHub/claude-test/docs/01-architecture/deployment-views.md
- /mnt/c/GitHub/claude-test/docs/01-architecture/logical-architecture.md
- /mnt/c/GitHub/claude-test/docs/00-overview/platform-overview.md