ADR-0003: Job Queue, Retry, and Backoff Strategy
Status
Accepted - 2025-10-24
Context
TVL Platform requires reliable background job processing for:
- Channel sync (availability, pricing, listings)
- Payment processing
- Notifications (email, SMS)
- Report generation
Requirements:
- Asynchronous processing (don't block API responses)
- Retry transient failures
- Dead letter queue for permanent failures
- Prioritization (critical bookings > low-priority reports)
- Observability (job status, duration, failures)
Decision
BullMQ (Redis-backed job queue) with exponential backoff + full jitter retry strategy.
Job Queue: BullMQ
import { Queue, Worker } from 'bullmq';
const syncQueue = new Queue('sync', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 1000 // Base delay; 5 attempts = 4 retries after 1s, 2s, 4s, 8s (built-in strategy, no jitter)
    },
    removeOnComplete: 100, // Keep last 100 completed
    removeOnFail: 1000      // Keep last 1000 failed
  }
});
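To make the retry timing concrete, the sketch below lists the wait before each retry under the built-in exponential strategy. It assumes BullMQ computes the wait as delay * 2^(attemptsMade - 1); `builtInExponentialSchedule` is a hypothetical helper for illustration, not a BullMQ API.

```typescript
// Hypothetical helper (not part of BullMQ): lists the wait before each retry,
// assuming the built-in 'exponential' strategy uses delay * 2^(attemptsMade - 1).
function builtInExponentialSchedule(attempts: number, delayMs: number): number[] {
  const schedule: number[] = [];
  // The first attempt runs immediately, so `attempts` allows attempts - 1 retries.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    schedule.push(delayMs * Math.pow(2, attemptsMade - 1));
  }
  return schedule;
}

// With the queue defaults (attempts: 5, delay: 1000):
const schedule = builtInExponentialSchedule(5, 1000); // [1000, 2000, 4000, 8000]
const totalWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 15000
```

Under these assumptions a job that keeps failing is given up on about 15 seconds (plus execution time) after its first failure, which is worth keeping in mind for outages that last longer than that.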
Retry Strategy: Exponential Backoff + Full Jitter
// Custom backoff with full jitter.
// attemptsMade is 1 on the first retry, so the exponent is attemptsMade - 1.
const calculateBackoff = (attemptsMade: number): number => {
  const baseDelay = 1000; // 1 second
  const maxDelay = 60000; // 60 seconds
  // Exponential ceiling: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
  const exponentialDelay = Math.min(
    baseDelay * Math.pow(2, Math.max(attemptsMade - 1, 0)),
    maxDelay
  );
  // Full jitter: random between 0 and exponentialDelay
  return Math.floor(Math.random() * exponentialDelay);
};
Rationale:
- Exponential backoff reduces load during outages
- Full jitter prevents thundering herd (many jobs retrying simultaneously)
- Industry standard (AWS, Google Cloud use similar strategies)
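The built-in `exponential` backoff type does not call a custom function on its own, so the jittered strategy has to be registered explicitly. The following is a minimal sketch, assuming BullMQ accepts a `backoffStrategy` function under the worker's `settings` and that jobs opt in with `backoff: { type: 'custom' }`. `fullJitterDelay`, `workerSettings`, and `jobOptions` are illustrative names, and the random source is injectable only to make the function testable.

```typescript
type RandomSource = () => number; // returns a value in [0, 1)

// Sketch of exponential backoff with full jitter. `attemptsMade` is assumed to
// be 1 for the first retry, matching what BullMQ passes to a custom strategy.
function fullJitterDelay(
  attemptsMade: number,
  random: RandomSource = Math.random,
  baseDelayMs = 1000,
  maxDelayMs = 60000
): number {
  const exponent = Math.max(attemptsMade - 1, 0);
  const ceiling = Math.min(baseDelayMs * Math.pow(2, exponent), maxDelayMs);
  return Math.floor(random() * ceiling);
}

// Assumed wiring: these plain objects mirror the option shapes that would be
// passed as `settings` to a Worker and as job options to queue.add().
const workerSettings = {
  backoffStrategy: (attemptsMade: number): number => fullJitterDelay(attemptsMade),
};
const jobOptions = { attempts: 5, backoff: { type: 'custom' } };
```

With a fixed random value of 0.5, the first retry waits 500 ms and the fourth waits 4000 ms, while late attempts are bounded by the 60-second cap.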
Dead Letter Queue (DLQ)
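// UnrecoverableError is exported by 'bullmq'
worker.on('failed', async (job, error) => {
  if (!job) return; // 'failed' can fire without a job reference
  const maxAttempts = job.opts.attempts ?? 1; // BullMQ runs a job once by default
  const permanent = error instanceof UnrecoverableError; // failed without retrying
  if (permanent || job.attemptsMade >= maxAttempts) {
    // Move to DLQ
    await dlqQueue.add('failed-job', {
      originalJob: job.asJSON(),
      error: error.message,
      failedAt: new Date().toISOString() // stored as JSON, so use a string
    }, {
      priority: 10 // Low priority for manual review (lower number = higher priority)
    });
    // Alert ops team
    logger.error('Job moved to DLQ', {
      jobId: job.id,
      jobName: job.name,
      attempts: job.attemptsMade,
      error: error.message
    });
  }
});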
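The routing rule in the handler can be isolated as a pure function so it can be unit tested without Redis. This is only a sketch: `shouldDeadLetter` is a hypothetical helper, and it assumes permanent failures are signalled by an error named `UnrecoverableError`.

```typescript
// Hypothetical helper: decides whether a failed job should be copied to the DLQ.
// `configuredAttempts` mirrors job.opts.attempts, which may be undefined.
function shouldDeadLetter(
  attemptsMade: number,
  configuredAttempts: number | undefined,
  errorName: string
): boolean {
  const maxAttempts = configuredAttempts ?? 1; // assumed default: a single attempt
  const retriesExhausted = attemptsMade >= maxAttempts;
  // Assumption: permanent failures are thrown as BullMQ's UnrecoverableError.
  const permanent = errorName === 'UnrecoverableError';
  return retriesExhausted || permanent;
}
```

For example, `shouldDeadLetter(2, 5, 'Error')` is false because three attempts remain, while `shouldDeadLetter(1, 5, 'UnrecoverableError')` is true because the job will not be retried.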
Job Prioritization
// Priority levels (1 = highest)
enum JobPriority {
  CRITICAL = 1,  // Payment capture, booking confirmation
  HIGH = 3,      // Availability sync, price updates
  MEDIUM = 5,    // Listing sync, notifications
  LOW = 10       // Report generation, analytics
}
await syncQueue.add('sync-availability', data, {
  priority: JobPriority.HIGH
});
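The intended ordering can be illustrated with a small in-memory model: lower priority numbers run first, and jobs with equal priority run in the order they were enqueued. This models the expected behaviour only and is not how BullMQ stores jobs internally; `PendingJob` and `processingOrder` are hypothetical names.

```typescript
enum JobPriority { CRITICAL = 1, HIGH = 3, MEDIUM = 5, LOW = 10 }

interface PendingJob { name: string; priority: JobPriority; enqueuedAt: number; }

// Illustrative ordering: lower priority number first, then FIFO by enqueue time.
function processingOrder(jobs: PendingJob[]): string[] {
  return jobs
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.priority - b.priority || a.enqueuedAt - b.enqueuedAt)
    .map((job) => job.name);
}

const order = processingOrder([
  { name: 'generate-report', priority: JobPriority.LOW, enqueuedAt: 1 },
  { name: 'sync-availability', priority: JobPriority.HIGH, enqueuedAt: 2 },
  { name: 'capture-payment', priority: JobPriority.CRITICAL, enqueuedAt: 3 },
  { name: 'sync-pricing', priority: JobPriority.HIGH, enqueuedAt: 4 },
]);
// order: capture-payment, sync-availability, sync-pricing, generate-report
```

Note that priority only orders jobs within a single queue. If payments and reports live in separate queues, their relative throughput depends on worker allocation rather than on these values.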
Alternatives Considered
Alternative 1: AWS SQS + Lambda
Rejected (for MVP)
Pros:
- Serverless (auto-scaling)
- Managed service (less ops)
- DLQ built-in
Cons:
- Additional AWS cost
- Cold start latency (Lambda)
- Vendor lock-in
- More complex than BullMQ
Decision: BullMQ simpler for MVP; reconsider if scaling beyond 1M jobs/day.
Alternative 2: Fixed Delay Retry
Rejected
// Fixed 5-second retry
backoff: { type: 'fixed', delay: 5000 }
Cons:
- Retry interval never grows, so it doesn't adapt to how long an outage lasts
- Can overload external APIs during outages
- Thundering herd problem
Decision: Exponential backoff is industry best practice.
Alternative 3: No Jitter
Rejected
Cons:
- Jobs that fail together all retry at the same moments (after 1s, 2s, 4s, 8s)
- Thundering herd on external APIs
- Can cause cascading failures
Decision: Full jitter prevents synchronized retries.
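A small simulation shows the effect. It assumes 100 jobs fail at the same instant and are all on their third retry (4-second ceiling), then counts how many retries land on the busiest millisecond. The seeded generator `makeLcg` is only there to make the run repeatable; all names are illustrative.

```typescript
// Small deterministic PRNG (linear congruential generator) for a repeatable demo.
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Retry delay for the given attempt, with full jitter when `random` is supplied.
function retryDelay(attemptsMade: number, random?: () => number): number {
  const ceiling = Math.min(1000 * Math.pow(2, attemptsMade - 1), 60000);
  return random ? Math.floor(random() * ceiling) : ceiling;
}

// Number of retries that land on the single busiest millisecond.
function busiestInstant(delays: number[]): number {
  const counts: Record<number, number> = {};
  let peak = 0;
  for (const delay of delays) {
    counts[delay] = (counts[delay] || 0) + 1;
    peak = Math.max(peak, counts[delay]);
  }
  return peak;
}

const random = makeLcg(42);
const withoutJitter: number[] = [];
const withJitter: number[] = [];
for (let job = 0; job < 100; job++) {
  withoutJitter.push(retryDelay(3));        // every job waits exactly 4000 ms
  withJitter.push(retryDelay(3, random));   // spread over [0, 4000) ms
}
const peakWithoutJitter = busiestInstant(withoutJitter); // all 100 retries coincide
const peakWithJitter = busiestInstant(withJitter);       // far fewer coincide
```

Without jitter the external API sees all 100 retries at once; with full jitter the same retries are spread across the 4-second window.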
Consequences
Positive
- Reliability
  - Transient failures (network glitches) automatically retried
  - Permanent failures (invalid data) moved to DLQ
  - No job loss across worker restarts (jobs persisted in Redis)
- Performance
  - Async processing doesn't block API responses
  - Prioritization ensures critical jobs processed first
  - Backoff prevents overwhelming external APIs
- Observability
  - Job status tracked (waiting, active, completed, failed)
  - Metrics: job duration, retry count, failure rate
  - DLQ enables manual investigation
- Cost
  - Predictable flat Redis cost ($20/mo) vs. usage-based AWS SQS pricing ($0.40/million requests, break-even near 50M requests/month)
  - BullMQ open-source (no licensing fees)
 
Negative
- Redis Dependency
  - Single point of failure (if Redis down, no job processing)
  - Mitigation: Upstash Redis has 99.99% SLA; enable AOF persistence
- Job Visibility
  - No built-in UI (unlike AWS SQS console)
  - Mitigation: Use Bull Board (open-source web UI for BullMQ)
- Manual DLQ Management
  - DLQ jobs require manual replay or investigation
  - Mitigation: Build admin UI for DLQ management
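As a starting point for DLQ replay, the sketch below turns a DLQ record back into the arguments for `queue.add(name, data, opts)`. The `DlqRecord` shape follows the `failed` handler in this ADR and is an assumption, as is the idea that `job.asJSON()` serializes `data` as a JSON string; `buildReplayRequest` is a hypothetical helper.

```typescript
// Assumed shape of a DLQ record, following the 'failed' handler in this ADR.
interface DlqRecord {
  originalJob: { name: string; data: unknown; opts?: { priority?: number } };
  error: string;
  failedAt: string;
}

interface ReplayRequest {
  name: string;
  data: unknown;
  opts: { attempts: number; priority?: number };
}

// Hypothetical helper: rebuilds the arguments for queue.add(name, data, opts).
function buildReplayRequest(record: DlqRecord, attempts = 5): ReplayRequest {
  const raw = record.originalJob.data;
  // `data` is assumed to arrive as a JSON string; plain objects are accepted too.
  const data = typeof raw === 'string' ? JSON.parse(raw) : raw;
  const priority = record.originalJob.opts ? record.originalJob.opts.priority : undefined;
  return { name: record.originalJob.name, data, opts: { attempts, priority } };
}

const replay = buildReplayRequest({
  originalJob: {
    name: 'sync-availability',
    data: '{"unitId":"u-1","syncType":"availability"}',
    opts: { priority: 3 },
  },
  error: 'Rate limited',
  failedAt: '2025-10-24T00:00:00.000Z',
});
// An admin action could then call: await syncQueue.add(replay.name, replay.data, replay.opts)
```

Replaying with a fresh attempt budget keeps the original priority but discards the failed job's retry history, which seems reasonable once the underlying cause has been fixed.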
 
Implementation
// Worker implementation
import { UnrecoverableError } from 'bullmq';
const worker = new Worker('sync', async (job) => {
  const { unitId, syncType } = job.data;
  try {
    if (syncType === 'availability') {
      await syncAvailability(unitId);
    } else if (syncType === 'pricing') {
      await syncPricing(unitId);
    }
    return { success: true, syncedAt: new Date() };
  } catch (error) {
    // Classify error
    if (isRateLimitError(error)) {
      // Transient: retry
      throw error;
    } else if (isValidationError(error)) {
      // Permanent: UnrecoverableError fails the job without further retries.
      // The job then fails with attemptsMade < attempts, so the 'failed'
      // handler must also route UnrecoverableError to the DLQ.
      const message = error instanceof Error ? error.message : String(error);
      throw new UnrecoverableError(`Permanent failure: ${message}`);
    } else {
      // Unknown: retry
      throw error;
    }
  }
}, {
  connection: redisConnection,
  concurrency: 5, // Process 5 jobs in parallel
  limiter: {
    max: 100,     // Max 100 jobs
    duration: 60000 // Per 60 seconds
  }
});
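The worker relies on `isRateLimitError` and `isValidationError`, which this ADR does not define. One possible shape is sketched below, assuming channel API clients attach an HTTP status code to the errors they throw. The `HttpLikeError` type, the status codes chosen, and `classifyFailure` are all assumptions for illustration.

```typescript
type FailureKind = 'transient' | 'permanent';

// Hypothetical error shape: assumes API clients attach an HTTP status code.
interface HttpLikeError { status?: number; message?: string; }

function statusOf(error: unknown): number | undefined {
  if (typeof error !== 'object' || error === null) return undefined;
  return (error as HttpLikeError).status;
}

// 429 Too Many Requests: the call should succeed once the rate limit resets.
function isRateLimitError(error: unknown): boolean {
  return statusOf(error) === 429;
}

// 400 / 422: the payload itself is invalid, so retrying cannot help.
function isValidationError(error: unknown): boolean {
  const status = statusOf(error);
  return status === 400 || status === 422;
}

// Anything not known to be permanent is treated as transient, so unknown errors retry.
function classifyFailure(error: unknown): FailureKind {
  return isValidationError(error) ? 'permanent' : 'transient';
}
```

Defaulting unknown errors to transient matches the worker above: a misclassified permanent failure costs a few extra retries, whereas a misclassified transient failure would send a recoverable job straight to the DLQ.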
Validation Checklist
- Jobs survive worker restarts (persisted in Redis)
- Retries follow exponential backoff
- Jitter prevents synchronized retries
- DLQ captures jobs after max attempts
- Prioritization works (critical jobs first)
- Metrics tracked (duration, failures)
- Alerts for DLQ threshold (> 100 jobs)
- Bull Board UI accessible to ops team
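The DLQ threshold check could be as small as the sketch below. It assumes the counts come from something like `dlqQueue.getJobCounts('waiting', 'prioritized')`, since jobs added with a priority may be reported under a separate state from plain waiting jobs; `dlqBacklog` and `shouldAlertOnDlq` are hypothetical helpers.

```typescript
// Hypothetical check for the "DLQ threshold (> 100 jobs)" alert.
// `counts` maps a job state name to the number of jobs in that state.
function dlqBacklog(counts: Record<string, number>): number {
  let total = 0;
  for (const state of Object.keys(counts)) {
    total += counts[state];
  }
  return total;
}

// Alerts only when the backlog is strictly greater than the threshold.
function shouldAlertOnDlq(counts: Record<string, number>, threshold = 100): boolean {
  return dlqBacklog(counts) > threshold;
}
```

For example, `shouldAlertOnDlq({ waiting: 40, prioritized: 60 })` is false because the backlog is exactly 100, and one more job tips it over the threshold.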
Sources
- meta/research-log.md
- docs/01-architecture/logical-architecture.md