ADR-0003: Job Queue, Retry, and Backoff Strategy

Status

Accepted - 2025-10-24


Context

TVL Platform requires reliable background job processing for:

  • Channel sync (availability, pricing, listings)
  • Payment processing
  • Notifications (email, SMS)
  • Report generation

Requirements:

  • Asynchronous processing (don't block API responses)
  • Retry transient failures
  • Dead letter queue for permanent failures
  • Prioritization (critical bookings > low-priority reports)
  • Observability (job status, duration, failures)

Decision

BullMQ (Redis-backed job queue) with exponential backoff + full jitter retry strategy.

Job Queue: BullMQ

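import { Queue, Worker } from 'bullmq';

// redisConnection: shared Redis connection options, assumed to be defined elsewhere
const syncQueue = new Queue('sync', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 5, // 1 initial attempt + up to 4 retries
    backoff: {
      type: 'exponential',
      delay: 1000 // Retries wait 1s, 2s, 4s, 8s (built-in strategy, no jitter)
    },
    removeOnComplete: 100, // Keep last 100 completed
    removeOnFail: 1000 // Keep last 1000 failed
  }
});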
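
To make the retry timing concrete, the sketch below reproduces the formula BullMQ documents for its built-in exponential strategy, delay * 2^(attemptsMade - 1). The helper name `exponentialSchedule` is hypothetical and only illustrates the arithmetic; it is not part of BullMQ.

```typescript
// Hypothetical helper: lists the waits between attempts for an exponential
// backoff of the form delayMs * 2^(attemptsMade - 1).
function exponentialSchedule(attempts: number, delayMs: number): number[] {
  const waits: number[] = [];
  // attemptsMade runs from 1 (after the first failure) up to attempts - 1
  // (the last failure that still has a retry left).
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    waits.push(delayMs * Math.pow(2, attemptsMade - 1));
  }
  return waits;
}

const schedule = exponentialSchedule(5, 1000);
const totalWaitMs = schedule.reduce((sum, ms) => sum + ms, 0);

console.log(schedule);    // [ 1000, 2000, 4000, 8000 ]
console.log(totalWaitMs); // 15000
```

With `attempts: 5`, a job that keeps failing is retried four times and reaches its final failure roughly 15 seconds (plus processing time) after the first one.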

Retry Strategy: Exponential Backoff + Full Jitter

// Custom backoff with jitter.
// Assumes attemptsMade counts attempts already made, so it is 1 on the
// first retry (as BullMQ reports it).
const calculateBackoff = (attemptsMade: number): number => {
  const baseDelay = 1000; // 1 second
  const maxDelay = 60000; // 60 seconds

  // Exponential ceiling: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
  const exponentialDelay = Math.min(
    baseDelay * Math.pow(2, Math.max(0, attemptsMade - 1)),
    maxDelay
  );

  // Full jitter: random between 0 and exponentialDelay
  return Math.floor(Math.random() * exponentialDelay);
};
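
The function above only takes effect once it is registered with BullMQ. As the BullMQ documentation describes it, a custom strategy is supplied through the worker's `settings.backoffStrategy` option and selected per job with `backoff: { type: 'custom' }`. The sketch below keeps the option objects as plain literals so that it runs without Redis; `fullJitterBackoff`, `workerOptions`, and `jobOptions` are illustrative names, and the injectable `random` parameter is an assumption added for testability.

```typescript
// Full-jitter backoff with an injectable random source (assumption: injected
// so the function can be exercised deterministically).
function fullJitterBackoff(
  attemptsMade: number,
  random: () => number = Math.random
): number {
  const baseDelay = 1000;
  const maxDelay = 60000;
  const ceiling = Math.min(
    baseDelay * Math.pow(2, Math.max(0, attemptsMade - 1)),
    maxDelay
  );
  return Math.floor(random() * ceiling);
}

// Would be merged into the options passed as the third argument of
// `new Worker(name, processor, options)`, next to the Redis connection.
const workerOptions = {
  settings: {
    backoffStrategy: (attemptsMade: number): number =>
      fullJitterBackoff(attemptsMade),
  },
};

// Would be passed to `queue.add(name, data, jobOptions)` or used as
// `defaultJobOptions` on the queue.
const jobOptions = {
  attempts: 5,
  backoff: { type: 'custom' },
};

console.log(workerOptions.settings.backoffStrategy(3), jobOptions.backoff.type);
```

Without this registration, the queue configuration shown earlier falls back to the built-in exponential strategy, which applies no jitter.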

Rationale:

  • Exponential backoff reduces load during outages
  • Full jitter prevents thundering herd (many jobs retrying simultaneously)
  • Industry standard (AWS, Google Cloud use similar strategies)
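
The thundering-herd argument can be checked with a small simulation. The sketch below assumes 1,000 jobs fail at the same instant on their third attempt and counts how many retries land in the busiest 100 ms window, with and without full jitter. The seeded generator `makeLcg` and all function names are illustrative assumptions, used only to keep the run reproducible.

```typescript
// Small linear congruential generator so the simulation is reproducible.
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Exponential ceiling for the given attempt, capped at maxDelay.
function ceilingMs(attemptsMade: number, baseDelay = 1000, maxDelay = 60000): number {
  return Math.min(baseDelay * Math.pow(2, Math.max(0, attemptsMade - 1)), maxDelay);
}

// Largest number of retries that fall into the same window of `windowMs`.
function peakPerWindow(delays: number[], windowMs: number): number {
  const counts = new Map<number, number>();
  let peak = 0;
  for (const delay of delays) {
    const bucket = Math.floor(delay / windowMs);
    const next = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, next);
    if (next > peak) peak = next;
  }
  return peak;
}

const random = makeLcg(42);
const jobCount = 1000;
const attemptsMade = 3; // ceiling of 4 seconds

const withoutJitter = Array.from({ length: jobCount }, () => ceilingMs(attemptsMade));
const withJitter = Array.from({ length: jobCount }, () =>
  Math.floor(random() * ceilingMs(attemptsMade))
);

const peakNoJitter = peakPerWindow(withoutJitter, 100);
const peakJitter = peakPerWindow(withJitter, 100);
console.log({ peakNoJitter, peakJitter });
```

Without jitter, all 1,000 retries hit the same window. With full jitter they spread across the 40 windows between 0 and 4 seconds, so the peak should sit near the 25-per-window average of a uniform spread.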

Dead Letter Queue (DLQ)

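worker.on('failed', async (job, error) => {
  if (!job) return; // BullMQ can emit 'failed' without a job reference

  const maxAttempts = job.opts.attempts ?? 1;
  const retriesExhausted = job.attemptsMade >= maxAttempts;
  // UnrecoverableError fails a job before its attempts run out
  const unrecoverable = error.name === 'UnrecoverableError';

  if (retriesExhausted || unrecoverable) {
    // Move to DLQ
    await dlqQueue.add('failed-job', {
      originalJob: job.asJSON(),
      error: error.message,
      failedAt: new Date().toISOString()
    }, {
      priority: 10 // Lowest priority (1 = highest); held for manual review
    });

    // Alert ops team
    logger.error('Job moved to DLQ', {
      jobId: job.id,
      jobName: job.name,
      attempts: job.attemptsMade,
      error: error.message
    });
  }
});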
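
The DLQ condition is easier to test when it is isolated from the event handler. The sketch below is a hypothetical pure function, `shouldDeadLetter`, that treats a job as DLQ-eligible when its retries are exhausted or when it failed with an error named `UnrecoverableError` (the class BullMQ exports for failures that should not be retried). The `FailedJobInfo` shape is an assumption for illustration, not a BullMQ type.

```typescript
interface FailedJobInfo {
  attemptsMade: number;
  maxAttempts?: number; // mirrors job.opts.attempts, which may be unset
  errorName: string;
}

function shouldDeadLetter(info: FailedJobInfo): boolean {
  // Assumption: an unset attempts value means the job gets a single attempt.
  const maxAttempts = info.maxAttempts ?? 1;
  const retriesExhausted = info.attemptsMade >= maxAttempts;
  const unrecoverable = info.errorName === 'UnrecoverableError';
  return retriesExhausted || unrecoverable;
}

// Second failure out of five: still retrying.
console.log(shouldDeadLetter({ attemptsMade: 2, maxAttempts: 5, errorName: 'Error' })); // false
// Fifth failure out of five: retries exhausted.
console.log(shouldDeadLetter({ attemptsMade: 5, maxAttempts: 5, errorName: 'Error' })); // true
// First failure, but marked as not retryable.
console.log(shouldDeadLetter({ attemptsMade: 1, maxAttempts: 5, errorName: 'UnrecoverableError' })); // true
```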

Job Prioritization

// Priority levels (1 = highest)
enum JobPriority {
  CRITICAL = 1, // Payment capture, booking confirmation
  HIGH = 3, // Availability sync, price updates
  MEDIUM = 5, // Listing sync, notifications
  LOW = 10 // Report generation, analytics
}

await syncQueue.add('sync-availability', data, {
  priority: JobPriority.HIGH
});
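
As a mental model for the ordering these priorities produce, the sketch below sorts waiting jobs by priority number (lower first) and falls back to insertion order within the same priority. It is a simplified in-memory model, not BullMQ's implementation; `QueuedJob`, `enqueuedSeq`, and `processingOrder` are hypothetical names.

```typescript
enum JobPriority {
  CRITICAL = 1,
  HIGH = 3,
  MEDIUM = 5,
  LOW = 10,
}

interface QueuedJob {
  name: string;
  priority: JobPriority;
  enqueuedSeq: number; // insertion order, used as the tie-breaker
}

// Lower priority number first; first-in, first-out within the same priority.
function processingOrder(jobs: QueuedJob[]): string[] {
  return [...jobs]
    .sort((a, b) => a.priority - b.priority || a.enqueuedSeq - b.enqueuedSeq)
    .map((job) => job.name);
}

const waiting: QueuedJob[] = [
  { name: 'monthly-report', priority: JobPriority.LOW, enqueuedSeq: 1 },
  { name: 'sync-availability', priority: JobPriority.HIGH, enqueuedSeq: 2 },
  { name: 'capture-payment', priority: JobPriority.CRITICAL, enqueuedSeq: 3 },
  { name: 'sync-pricing', priority: JobPriority.HIGH, enqueuedSeq: 4 },
];

const order = processingOrder(waiting);
console.log(order);
// [ 'capture-payment', 'sync-availability', 'sync-pricing', 'monthly-report' ]
```

Under this model, a steady stream of CRITICAL and HIGH jobs can delay LOW jobs for as long as it lasts, which is worth keeping in mind for report generation.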

Alternatives Considered

Alternative 1: AWS SQS + Lambda

Rejected (for MVP)

Pros:

  • Serverless (auto-scaling)
  • Managed service (less ops)
  • DLQ built-in

Cons:

  • Additional AWS cost
  • Cold start latency (Lambda)
  • Vendor lock-in
  • More complex than BullMQ

Decision: BullMQ simpler for MVP; reconsider if scaling beyond 1M jobs/day.


Alternative 2: Fixed Delay Retry

Rejected

// Fixed 5-second retry
backoff: { type: 'fixed', delay: 5000 }

Cons:

  • Doesn't adapt to failure type
  • Can overload external APIs during outages
  • Thundering herd problem

Decision: Exponential backoff is industry best practice.


Alternative 3: No Jitter

Rejected

Cons:

  • Jobs that fail together retry together (all at 1s, then 2s, then 4s, ...)
  • Thundering herd on external APIs
  • Can cause cascading failures

Decision: Full jitter prevents synchronized retries.


Consequences

Positive

  1. Reliability

    • Transient failures (network glitches) automatically retried
    • Permanent failures (invalid data) moved to DLQ
    • Queued jobs persist in Redis, so worker crashes and restarts do not lose them
  2. Performance

    • Async processing doesn't block API responses
    • Prioritization ensures critical jobs processed first
    • Backoff prevents overwhelming external APIs
  3. Observability

    • Job status tracked (waiting, active, completed, failed)
    • Metrics: job duration, retry count, failure rate
    • DLQ enables manual investigation
  4. Cost

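    • Redis is a flat cost (about $20/mo) while AWS SQS bills per request ($0.40/million requests), so Redis is the cheaper option once volume exceeds roughly 50M requests/month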
    • BullMQ open-source (no licensing fees)
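
The cost comparison depends on volume, because one side is a flat fee and the other is billed per request. The arithmetic below uses the two figures quoted above. The three requests per job (send, receive, delete) and the 30-day month are assumptions for illustration, and the SQS free tier and Lambda charges are ignored. Amounts are kept in cents to avoid floating-point rounding.

```typescript
const redisFlatCentsPerMonth = 2000;   // $20/mo
const sqsCentsPerMillionRequests = 40; // $0.40 per million requests

// Monthly request volume (in millions) at which SQS request charges equal
// the flat Redis fee.
const breakEvenMillionRequests = redisFlatCentsPerMonth / sqsCentsPerMillionRequests;

// Assumption: each job costs three SQS requests (send, receive, delete).
function sqsMonthlyCents(jobsPerDay: number, requestsPerJob = 3, daysPerMonth = 30): number {
  const millionRequests = (jobsPerDay * requestsPerJob * daysPerMonth) / 1000000;
  return millionRequests * sqsCentsPerMillionRequests;
}

console.log(breakEvenMillionRequests); // 50
console.log(sqsMonthlyCents(100000));  // 360  ($3.60/mo at 100k jobs/day)
console.log(sqsMonthlyCents(1000000)); // 3600 ($36/mo at 1M jobs/day)
```

Under these assumptions, SQS request charges stay below the $20/mo Redis fee until about 555,000 jobs/day, so the cost advantage of Redis applies mainly at the higher volumes this ADR anticipates.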

Negative

  1. Redis Dependency

    • Single point of failure (if Redis down, no job processing)
    • Mitigation: Upstash Redis has 99.99% SLA; enable AOF persistence
  2. Job Visibility

    • No built-in UI (unlike AWS SQS console)
    • Mitigation: Use Bull Board (open-source web UI for BullMQ)
  3. Manual DLQ Management

    • DLQ jobs require manual replay or investigation
    • Mitigation: Build admin UI for DLQ management

Implementation

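// Worker implementation
// (assumes UnrecoverableError is imported from 'bullmq' alongside Worker)
const worker = new Worker('sync', async (job) => {
  const { unitId, syncType } = job.data;

  try {
    if (syncType === 'availability') {
      await syncAvailability(unitId);
    } else if (syncType === 'pricing') {
      await syncPricing(unitId);
    } else {
      // Unhandled sync type: fail instead of reporting success
      throw new UnrecoverableError(`Unknown syncType: ${syncType}`);
    }

    return { success: true, syncedAt: new Date() };
  } catch (error) {
    // Classify error
    if (isRateLimitError(error)) {
      // Transient: rethrow so BullMQ retries with backoff
      throw error;
    } else if (isValidationError(error)) {
      // Permanent: UnrecoverableError fails the job at once and skips the
      // remaining retries; the 'failed' handler must treat it as DLQ-eligible
      const message = error instanceof Error ? error.message : String(error);
      throw new UnrecoverableError(`Permanent failure: ${message}`);
    } else {
      // Unknown: retry
      throw error;
    }
  }
}, {
  connection: redisConnection,
  concurrency: 5, // Process 5 jobs in parallel
  limiter: {
    max: 100, // Max 100 jobs
    duration: 60000 // Per 60 seconds
  }
});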
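
The worker relies on `isRateLimitError` and `isValidationError`, which this ADR does not define. One possible shape is sketched below, assuming upstream errors carry an HTTP-style `status` field (429 for rate limits, 400 and 422 for validation failures). The `HttpLikeError` interface, the status mapping, and `classifyError` are assumptions for illustration.

```typescript
interface HttpLikeError {
  status?: number;
  message?: string;
}

// Type guard: true for any object that exposes a `status` property.
function hasStatus(error: unknown): error is HttpLikeError {
  return typeof error === 'object' && error !== null && 'status' in error;
}

// Assumption: the channel API signals rate limiting with HTTP 429.
function isRateLimitError(error: unknown): boolean {
  return hasStatus(error) && error.status === 429;
}

// Assumption: invalid payloads are rejected with HTTP 400 or 422.
function isValidationError(error: unknown): boolean {
  return hasStatus(error) && (error.status === 400 || error.status === 422);
}

type Disposition = 'retry' | 'dead-letter';

// Only validation failures are treated as permanent; rate limits and
// unknown errors are retried, matching the worker's three branches.
function classifyError(error: unknown): Disposition {
  return isValidationError(error) ? 'dead-letter' : 'retry';
}

console.log(classifyError({ status: 429 }));        // retry
console.log(classifyError({ status: 422 }));        // dead-letter
console.log(classifyError(new Error('timed out'))); // retry
```

Defaulting unknown errors to retry keeps transient faults from landing in the DLQ, at the cost of spending all attempts on failures that later turn out to be permanent.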

Validation Checklist

  • Jobs survive worker restarts (persisted in Redis)
  • Retries follow exponential backoff
  • Jitter prevents synchronized retries
  • DLQ captures jobs after max attempts
  • Prioritization works (critical jobs first)
  • Metrics tracked (duration, failures)
  • Alerts for DLQ threshold (> 100 jobs)
  • Bull Board UI accessible to ops team
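
The metrics and alert items in the checklist can be expressed as a small summary over finished jobs. The sketch below assumes a hypothetical `JobRecord` shape collected by the worker's event handlers and reuses the checklist's threshold of 100 DLQ jobs; none of these names come from BullMQ.

```typescript
interface JobRecord {
  durationMs: number;
  attemptsMade: number; // total attempts, so retries = attemptsMade - 1
  failed: boolean;      // true if the job ended in the failed state
}

interface QueueSummary {
  failureRate: number;
  meanDurationMs: number;
  meanRetries: number;
}

function summarize(records: JobRecord[]): QueueSummary {
  if (records.length === 0) {
    return { failureRate: 0, meanDurationMs: 0, meanRetries: 0 };
  }
  const count = records.length;
  const failures = records.filter((r) => r.failed).length;
  const totalDuration = records.reduce((sum, r) => sum + r.durationMs, 0);
  const totalRetries = records.reduce((sum, r) => sum + Math.max(0, r.attemptsMade - 1), 0);
  return {
    failureRate: failures / count,
    meanDurationMs: totalDuration / count,
    meanRetries: totalRetries / count,
  };
}

// Alert when the DLQ holds more jobs than the threshold (> 100 by default).
function dlqAlertNeeded(dlqSize: number, threshold = 100): boolean {
  return dlqSize > threshold;
}

const summary = summarize([
  { durationMs: 100, attemptsMade: 1, failed: false },
  { durationMs: 200, attemptsMade: 1, failed: false },
  { durationMs: 300, attemptsMade: 3, failed: false },
  { durationMs: 400, attemptsMade: 5, failed: true },
]);
console.log(summary); // { failureRate: 0.25, meanDurationMs: 250, meanRetries: 1.5 }
console.log(dlqAlertNeeded(101)); // true
```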

Sources

  • meta/research-log.md
  • docs/01-architecture/logical-architecture.md