ADR-0003: Job Queue, Retry, and Backoff Strategy

Status

Accepted - 2025-10-24

Context

TVL Platform requires reliable background job processing for:

Channel sync (availability, pricing, listings)
Payment processing
Notifications (email, SMS)
Report generation

Requirements:

Asynchronous processing (don't block API responses)
Retry transient failures
Dead letter queue for permanent failures
Prioritization (critical bookings > low-priority reports)
Observability (job status, duration, failures)

Decision

Use BullMQ (a Redis-backed job queue) with an exponential backoff and full jitter retry strategy.

Job Queue: BullMQ

import { Queue, Worker } from 'bullmq';

const syncQueue = new Queue('sync', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 1000 // Base delay; retries wait 1s, 2s, 4s, 8s (5 attempts = 4 retries)
    },
    removeOnComplete: 100, // Keep last 100 completed
    removeOnFail: 1000      // Keep last 1000 failed
  }
});
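
To make the timing implied by these options concrete, the sketch below computes the wait before each retry. It assumes the built-in exponential strategy waits delay * 2^(n - 1) milliseconds before retry n, and that `attempts` counts the initial run. `retrySchedule` is a hypothetical helper for illustration, not part of BullMQ.

```typescript
// Hypothetical helper, not a BullMQ API.
// Assumes the wait before retry n (n = 1, 2, ...) is delayMs * 2^(n - 1)
// and that `attempts` includes the first run of the job.
function retrySchedule(attempts: number, delayMs: number): number[] {
  const retries = Math.max(0, attempts - 1);
  const waits: number[] = [];
  for (let i = 0; i < retries; i++) {
    waits.push(delayMs * Math.pow(2, i));
  }
  return waits;
}

const exampleSchedule = retrySchedule(5, 1000);
const totalWaitMs = exampleSchedule.reduce((sum, wait) => sum + wait, 0);

console.log(exampleSchedule.join(', ')); // 1000, 2000, 4000, 8000
console.log(totalWaitMs);                // 15000
```

Under these assumptions, a job that keeps failing exhausts its attempts about 15 seconds after the first failure, not counting processing time.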

Retry Strategy: Exponential Backoff + Full Jitter

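// Custom backoff with jitter.
// attemptsMade is 1 for the first retry, which matches BullMQ's convention.
// To use it with BullMQ, register it as the Worker's settings.backoffStrategy
// and set backoff: { type: 'custom' } in the job options.
const calculateBackoff = (attemptsMade: number): number => {
  const baseDelay = 1000; // 1 second
  const maxDelay = 60000; // 60 seconds

  // Exponential ceiling: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
  const exponentialDelay = Math.min(
    baseDelay * Math.pow(2, Math.max(0, attemptsMade - 1)),
    maxDelay
  );

  // Full jitter: random integer in [0, exponentialDelay)
  return Math.floor(Math.random() * exponentialDelay);
};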

Rationale:

Exponential backoff reduces load during outages
Full jitter prevents thundering herd (many jobs retrying simultaneously)
Industry standard (AWS, Google Cloud use similar strategies)
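
The bounds of full jitter are easier to check when the random source is injectable. The following is a minimal sketch under that assumption; `backoffCeilingMs`, `fullJitterDelayMs`, and `RandomSource` are illustrative names, and retry numbering starts at 1.

```typescript
// Illustrative names only. `random` must return a value in [0, 1).
type RandomSource = () => number;

// Upper bound for the wait before retry n (n = 1 for the first retry).
function backoffCeilingMs(retryNumber: number, baseMs = 1000, maxMs = 60000): number {
  const exponent = Math.max(0, retryNumber - 1);
  return Math.min(baseMs * Math.pow(2, exponent), maxMs);
}

// Full jitter: uniform integer in [0, ceiling).
function fullJitterDelayMs(retryNumber: number, random: RandomSource = Math.random): number {
  return Math.floor(random() * backoffCeilingMs(retryNumber));
}

const ceilingsMs = [1, 2, 3, 4, 5, 6, 7, 8].map((n) => backoffCeilingMs(n));
console.log(ceilingsMs.join(', '));
// 1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000

console.log(fullJitterDelayMs(3, () => 0));   // 0: shortest possible wait
console.log(fullJitterDelayMs(3, () => 0.5)); // 2000: midpoint of the 4s ceiling
```

The expected wait is half the ceiling, so full jitter trades somewhat shorter average delays for retries that are spread across the whole interval.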

Dead Letter Queue (DLQ)

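worker.on('failed', async (job, error) => {
  if (!job) return; // BullMQ may emit 'failed' without a job reference

  const maxAttempts = job.opts.attempts ?? 1;
  const exhausted = job.attemptsMade >= maxAttempts;
  // Requires: import { UnrecoverableError } from 'bullmq'
  const permanent = error instanceof UnrecoverableError;

  if (exhausted || permanent) {
    // Move to DLQ
    await dlqQueue.add('failed-job', {
      originalJob: job.asJSON(),
      error: error.message,
      failedAt: new Date().toISOString() // Job data is stored as JSON
    }, {
      priority: 10 // Low priority (1 = highest); DLQ jobs wait for manual review
    });

    // Alert ops team
    logger.error('Job moved to DLQ', {
      jobId: job.id,
      jobName: job.name,
      attempts: job.attemptsMade,
      error: error.message
    });
  }
});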
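
The dead-letter decision can be isolated into pure functions so it is testable without Redis. This is a sketch only: `FailedJobInfo`, `DlqEntry`, `shouldDeadLetter`, and `toDlqEntry` are hypothetical names, and the `permanent` flag assumes the failure has already been classified elsewhere.

```typescript
// Hypothetical shapes for illustration; the real handler receives a BullMQ job and an Error.
interface FailedJobInfo {
  id: string;
  jobName: string;
  attemptsMade: number;
  maxAttempts?: number; // undefined when the job was added without `attempts`
}

interface DlqEntry {
  originalJobId: string;
  jobName: string;
  attempts: number;
  error: string;
  failedAt: string; // ISO 8601 string, so it survives JSON serialization
}

// Dead-letter once the retry budget is spent, or immediately when the
// failure is known to be permanent.
function shouldDeadLetter(job: FailedJobInfo, permanent: boolean): boolean {
  const maxAttempts = job.maxAttempts ?? 1;
  return permanent || job.attemptsMade >= maxAttempts;
}

function toDlqEntry(job: FailedJobInfo, error: Error, now: Date = new Date()): DlqEntry {
  return {
    originalJobId: job.id,
    jobName: job.jobName,
    attempts: job.attemptsMade,
    error: error.message,
    failedAt: now.toISOString(),
  };
}

const retrying: FailedJobInfo = { id: '42', jobName: 'sync-availability', attemptsMade: 2, maxAttempts: 5 };
console.log(shouldDeadLetter(retrying, false)); // false: 3 attempts remain
console.log(shouldDeadLetter(retrying, true));  // true: permanent failures skip the remaining retries
```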

Job Prioritization

// Priority levels (1 = highest)
enum JobPriority {
  CRITICAL = 1,  // Payment capture, booking confirmation
  HIGH = 3,      // Availability sync, price updates
  MEDIUM = 5,    // Listing sync, notifications
  LOW = 10       // Report generation, analytics
}

await syncQueue.add('sync-availability', data, {
  priority: JobPriority.HIGH
});
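
The intended ordering can be illustrated with a small model: lower numbers run first, and jobs at the same level run in the order they were enqueued. This sketch models the ordering only and is not how BullMQ stores jobs internally; `PendingJob` and `processingOrder` are hypothetical names.

```typescript
enum JobPriority {
  CRITICAL = 1,
  HIGH = 3,
  MEDIUM = 5,
  LOW = 10,
}

// Hypothetical in-memory representation of a waiting job.
interface PendingJob {
  jobName: string;
  priority: JobPriority;
  enqueuedAt: number; // monotonically increasing sequence number
}

// Lower priority number first; first-in, first-out within the same level.
function processingOrder(jobs: PendingJob[]): PendingJob[] {
  return jobs.slice().sort((a, b) => a.priority - b.priority || a.enqueuedAt - b.enqueuedAt);
}

const pendingJobs: PendingJob[] = [
  { jobName: 'generate-report', priority: JobPriority.LOW, enqueuedAt: 1 },
  { jobName: 'sync-availability', priority: JobPriority.HIGH, enqueuedAt: 2 },
  { jobName: 'capture-payment', priority: JobPriority.CRITICAL, enqueuedAt: 3 },
  { jobName: 'update-prices', priority: JobPriority.HIGH, enqueuedAt: 4 },
];

console.log(processingOrder(pendingJobs).map((j) => j.jobName).join(' -> '));
// capture-payment -> sync-availability -> update-prices -> generate-report
```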

Alternatives Considered

Alternative 1: AWS SQS + Lambda

Rejected (for MVP)

Pros:

Serverless (auto-scaling)
Managed service (less ops)
DLQ built-in

Cons:

Additional AWS cost
Cold start latency (Lambda)
Vendor lock-in
More complex than BullMQ

Decision: BullMQ is simpler for the MVP; reconsider if volume grows beyond 1M jobs/day.

Alternative 2: Fixed Delay Retry

Rejected

// Fixed 5-second retry
backoff: { type: 'fixed', delay: 5000 }

Cons:

Doesn't adapt to failure type
Can overload external APIs during outages
Thundering herd problem

Decision: Exponential backoff is industry best practice.

Alternative 3: No Jitter

Rejected

Cons:

Jobs that fail together all retry at the same moments (1s, 2s, 4s later)
Thundering herd on external APIs
Can cause cascading failures

Decision: Full jitter prevents synchronized retries.
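
A small simulation shows the difference. It assumes 1000 jobs fail at the same instant and are all on their third retry (4s ceiling with a 1s base delay), then counts how many retries land in the busiest 100 ms window. The seeded generator and all names here are illustrative choices, used only to make the run repeatable.

```typescript
// Minimal Lehmer (Park-Miller) generator so the simulation is repeatable.
// Returns values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646;
  };
}

const CEILING_MS = 4000; // third retry with a 1s base delay
const JOB_COUNT = 1000;

const nextRandom = seededRandom(42);
const withoutJitter: number[] = [];
const withFullJitter: number[] = [];
for (let i = 0; i < JOB_COUNT; i++) {
  withoutJitter.push(CEILING_MS);                              // every job waits exactly 4s
  withFullJitter.push(Math.floor(nextRandom() * CEILING_MS));  // uniform in [0, 4s)
}

// Largest number of retries that fall into the same window.
function peakPerWindow(delaysMs: number[], windowMs = 100): number {
  const counts: Record<number, number> = {};
  let peak = 0;
  for (const delay of delaysMs) {
    const bucket = Math.floor(delay / windowMs);
    const next = (counts[bucket] ?? 0) + 1;
    counts[bucket] = next;
    if (next > peak) peak = next;
  }
  return peak;
}

console.log(peakPerWindow(withoutJitter));  // 1000: every retry hits the API in the same window
console.log(peakPerWindow(withFullJitter)); // much lower: retries are spread over 40 windows
```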

Consequences

Positive

Reliability
- Transient failures (network glitches) automatically retried
- Permanent failures (invalid data) moved to DLQ
- No job loss on worker restart (jobs persist in Redis)
Performance
- Async processing doesn't block API responses
- Prioritization ensures critical jobs processed first
- Backoff prevents overwhelming external APIs
Observability
- Job status tracked (waiting, active, completed, failed)
- Metrics: job duration, retry count, failure rate
- DLQ enables manual investigation
Cost
- Flat Redis hosting cost ($20/mo) instead of SQS per-request pricing ($0.40 per million requests)
- BullMQ open-source (no licensing fees)
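
Whether a flat fee comes out cheaper than per-request pricing depends on volume. The sketch below works through the arithmetic using only the two figures quoted above. The `requestsPerJob` parameter is an assumption (a queued job typically costs more than one SQS request, for example send, receive, and delete), and free tiers and other charges are ignored.

```typescript
// Figures quoted in this ADR, expressed in cents to keep the arithmetic exact.
const REDIS_FLAT_CENTS_PER_MONTH = 2000;   // $20/mo
const SQS_CENTS_PER_MILLION_REQUESTS = 40; // $0.40 per million requests

// requestsPerJob is an assumption, not a measured value.
function sqsMonthlyCostCents(jobsPerMonth: number, requestsPerJob: number): number {
  const millionsOfRequests = (jobsPerMonth * requestsPerJob) / 1000000;
  return millionsOfRequests * SQS_CENTS_PER_MILLION_REQUESTS;
}

// Request volume at which SQS request charges equal the flat Redis fee.
const breakEvenRequestsPerMonth =
  (REDIS_FLAT_CENTS_PER_MONTH / SQS_CENTS_PER_MILLION_REQUESTS) * 1000000;

console.log(breakEvenRequestsPerMonth);         // 50000000
console.log(sqsMonthlyCostCents(30000000, 1));  // 1200 (=$12 at 1M jobs/day, 1 request per job)
console.log(sqsMonthlyCostCents(30000000, 3));  // 3600 (=$36 at 1M jobs/day, 3 requests per job)
```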

Negative

Redis Dependency
- Single point of failure (if Redis is down, no jobs are processed)
- Mitigation: Upstash Redis offers a 99.99% SLA; enable AOF persistence
Job Visibility
- No built-in UI (unlike AWS SQS console)
- Mitigation: Use Bull Board (open-source web UI for BullMQ)
Manual DLQ Management
- DLQ jobs require manual replay or investigation
- Mitigation: Build admin UI for DLQ management

Implementation

// Worker implementation
// Requires: import { UnrecoverableError, Worker } from 'bullmq';
const worker = new Worker('sync', async (job) => {
  const { unitId, syncType } = job.data;

  try {
    if (syncType === 'availability') {
      await syncAvailability(unitId);
    } else if (syncType === 'pricing') {
      await syncPricing(unitId);
    }

    return { success: true, syncedAt: new Date().toISOString() };
  } catch (error) {
    // Classify error
    if (isRateLimitError(error)) {
      // Transient: rethrow so BullMQ retries with backoff
      throw error;
    } else if (isValidationError(error)) {
      // Permanent: UnrecoverableError skips the remaining retries;
      // the 'failed' handler should route it to the DLQ
      const message = error instanceof Error ? error.message : String(error);
      throw new UnrecoverableError(`Permanent failure: ${message}`);
    } else {
      // Unknown: treat as transient and retry
      throw error;
    }
  }
}, {
  connection: redisConnection,
  concurrency: 5, // Process 5 jobs in parallel
  limiter: {
    max: 100,     // Max 100 jobs
    duration: 60000 // Per 60 seconds
  }
});
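
The worker relies on `isRateLimitError` and `isValidationError`, which this ADR does not define. One possible shape is sketched below, assuming the channel API clients attach a numeric HTTP `status` to the errors they throw. That error shape and the `statusOf`, `classifyFailure`, and `httpError` helpers are assumptions made for illustration.

```typescript
// Assumption: errors thrown by channel API clients carry a numeric `status`.
function statusOf(error: unknown): number | undefined {
  if (typeof error === 'object' && error !== null && 'status' in error) {
    const status = (error as { status?: unknown }).status;
    return typeof status === 'number' ? status : undefined;
  }
  return undefined;
}

// HTTP 429 Too Many Requests: the call is expected to succeed later.
function isRateLimitError(error: unknown): boolean {
  return statusOf(error) === 429;
}

// HTTP 400 / 422: the payload itself is rejected, so retrying cannot help.
function isValidationError(error: unknown): boolean {
  const status = statusOf(error);
  return status === 400 || status === 422;
}

type FailureClass = 'transient' | 'permanent';

// Mirrors the worker's branches: only validation errors are permanent;
// rate limits and unknown errors are retried.
function classifyFailure(error: unknown): FailureClass {
  return isValidationError(error) ? 'permanent' : 'transient';
}

// Test helper that builds an error with a status code.
function httpError(message: string, status: number): Error & { status: number } {
  const err = new Error(message) as Error & { status: number };
  err.status = status;
  return err;
}

console.log(classifyFailure(httpError('Too Many Requests', 429)));    // transient
console.log(classifyFailure(httpError('Unprocessable Entity', 422))); // permanent
console.log(classifyFailure(new Error('socket hang up')));            // transient
```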

Validation Checklist

Jobs survive worker restarts (persisted in Redis)
Retries follow exponential backoff
Jitter prevents synchronized retries
DLQ captures jobs after max attempts
Prioritization works (critical jobs first)
Metrics tracked (duration, failures)
Alerts for DLQ threshold (> 100 jobs)
Bull Board UI accessible to ops team
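
The metrics and DLQ alert items in this checklist can be verified against plain functions before wiring them to a real metrics backend. The sketch below is one way to do that; `JobRecord`, `QueueMetrics`, `summarizeJobs`, and `dlqAlertNeeded` are hypothetical names, and only the threshold of 100 comes from the checklist.

```typescript
// Hypothetical record of a finished job, as a metrics collector might store it.
interface JobRecord {
  durationMs: number;
  attemptsMade: number; // includes the first run
  failed: boolean;      // true if the job ended in the failed state
}

interface QueueMetrics {
  averageDurationMs: number;
  totalRetries: number;
  failureRate: number; // fraction between 0 and 1
}

function summarizeJobs(records: JobRecord[]): QueueMetrics {
  if (records.length === 0) {
    return { averageDurationMs: 0, totalRetries: 0, failureRate: 0 };
  }
  let durationSum = 0;
  let retries = 0;
  let failures = 0;
  for (const record of records) {
    durationSum += record.durationMs;
    retries += Math.max(0, record.attemptsMade - 1);
    if (record.failed) failures += 1;
  }
  return {
    averageDurationMs: durationSum / records.length,
    totalRetries: retries,
    failureRate: failures / records.length,
  };
}

const DLQ_ALERT_THRESHOLD = 100; // from the checklist: alert when the DLQ holds more than 100 jobs

function dlqAlertNeeded(dlqSize: number, threshold: number = DLQ_ALERT_THRESHOLD): boolean {
  return dlqSize > threshold;
}

const sampleMetrics = summarizeJobs([
  { durationMs: 200, attemptsMade: 1, failed: false },
  { durationMs: 400, attemptsMade: 3, failed: false },
  { durationMs: 600, attemptsMade: 5, failed: true },
  { durationMs: 800, attemptsMade: 1, failed: false },
]);

console.log(sampleMetrics.averageDurationMs); // 500
console.log(sampleMetrics.totalRetries);      // 6
console.log(sampleMetrics.failureRate);       // 0.25
console.log(dlqAlertNeeded(100));             // false
console.log(dlqAlertNeeded(101));             // true
```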

Sources

meta/research-log.md
docs/01-architecture/logical-architecture.md
