ADR-0039: Cache Stampede Prevention

Status

Accepted - 2025-01-26


Context

When a cache entry expires under high load, many requests simultaneously fetch the same data from the database, causing a cache stampede (also known as thundering herd).

Example:

  • 1000 requests/sec for popular property
  • Cache expires (TTL 3600s)
  • All 1000 requests hit DB simultaneously
  • Database overload, slow responses
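The failure mode above can be illustrated with a minimal in-memory sketch (hypothetical names; a counter stands in for the database):

```typescript
// Illustration only: with no protection, every concurrent cache miss
// triggers its own backend fetch. A counter stands in for the database.
let dbQueries = 0;
let cache: string | null = null; // expired entry

async function fetchFromDb(): Promise<string> {
  dbQueries++;
  await new Promise((r) => setTimeout(r, 50)); // simulate query latency
  return 'property-data';
}

async function unprotectedGet(): Promise<string> {
  if (cache) return cache;
  const data = await fetchFromDb(); // every concurrent caller races here
  cache = data;
  return data;
}

async function stampede(): Promise<number> {
  await Promise.all(Array.from({ length: 100 }, () => unprotectedGet()));
  return dbQueries; // 100 concurrent misses -> 100 backend queries
}
```

Running `stampede()` once against a cold cache performs 100 backend queries; the lock-based approach reduces this to one.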

Decision

Use a distributed lock pattern with Redis to ensure that only one request fetches from the database during a cache miss.

Rationale

  1. Prevent DB Overload: Only one query per key
  2. Fast Recovery: Lock expires quickly (5s)
  3. Simple: Redis atomic operations
  4. Graceful: Stale cache served during refetch

Alternatives Considered

Alternative 1: No Protection

Rejected - DB overload on popular keys

Alternative 2: Probabilistic Early Expiry

Rejected - Complex, doesn't fully prevent stampede
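For context, the rejected probabilistic approach (often called XFetch) has each reader recompute early with a probability that rises as expiry approaches - a sketch with hypothetical parameter names:

```typescript
// Sketch of probabilistic early expiry ("XFetch"): a reader recomputes
// before the TTL ends with a probability that grows as expiry nears.
// deltaMs is the duration of the last recompute; beta tunes eagerness.
function shouldRecomputeEarly(
  expiryMs: number, // absolute expiry timestamp (ms since epoch)
  deltaMs: number,  // how long the last recompute took (ms)
  beta = 1.0,
  now = Date.now()
): boolean {
  // Math.log(Math.random()) is negative, so the subtraction pushes `now`
  // forward by a random amount scaled by deltaMs * beta.
  return now - deltaMs * beta * Math.log(Math.random()) >= expiryMs;
}
```

Each request rolls the dice independently, so under very high concurrency several callers can still recompute at once - the "doesn't fully prevent stampede" point above.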

Alternative 3: Request Coalescing

Rejected - only coalesces within a single instance; doesn't work across replicas
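For reference, per-instance coalescing is simple to sketch: concurrent callers for the same key share one in-flight promise (hypothetical names):

```typescript
// Sketch of per-instance request coalescing: concurrent callers for the
// same key share one in-flight promise. This only deduplicates within a
// single process, which is why it was rejected for a multi-replica setup.
const inFlight = new Map<string, Promise<unknown>>();

function coalesce<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const p = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Two replicas each run their own `inFlight` map, so a popular key still produces one database query per replica.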


Implementation

Lock-Based Cache Fetching

// src/cache/stampedePrevention.ts
import Redis from 'ioredis';
import { logger } from '../logger'; // shared logger (pino-style API)

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export async function getOrFetch<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>,
  options: {
    lockTimeout?: number; // Max lock duration (ms)
    waitTimeout?: number; // Max wait for lock (ms)
  } = {}
): Promise<T> {
  const lockKey = `lock:${key}`;
  const lockTimeout = options.lockTimeout ?? 5000; // 5s
  const waitTimeout = options.waitTimeout ?? 10000; // 10s

  // Try cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached) as T;
  }

  // Cache miss - acquire lock
  const lockAcquired = await redis.set(
    lockKey,
    '1',
    'PX', lockTimeout, // Lock expires after lockTimeout ms
    'NX' // Only set if not exists
  );

  if (lockAcquired) {
    // We got the lock - fetch from DB
    try {
      const data = await fetchFn();

      // Store in cache
      await redis.setex(key, ttl, JSON.stringify(data));

      return data;
    } finally {
      // Release lock (even if fetchFn throws)
      await redis.del(lockKey);
    }
  } else {
    // Lock held by another request - wait for cache
    const startTime = Date.now();

    while (Date.now() - startTime < waitTimeout) {
      // Check if cache has been populated
      const refreshed = await redis.get(key);
      if (refreshed) {
        return JSON.parse(refreshed) as T;
      }

      // Wait 50ms and retry
      await sleep(50);
    }

    // Timeout - fall back to a direct fetch
    logger.warn({ key }, 'Cache stampede wait timeout, fetching directly');
    return await fetchFn();
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

Usage Example

// src/services/properties/getProperty.ts
export async function getProperty(orgId: string, propertyId: string) {
  const key = CacheKeys.property(orgId, propertyId);

  return await getOrFetch(
    key,
    3600, // 1 hour TTL
    async () => {
      // This only runs once across all replicas
      return await db.query.properties.findFirst({
        where: and(eq(properties.orgId, orgId), eq(properties.id, propertyId)),
      });
    }
  );
}

Alternative: Stale-While-Revalidate

Serve stale cache while refreshing in background:

export async function getOrFetchStale<T>(
  key: string,
  ttl: number,
  staleTtl: number, // Extra time to serve stale
  fetchFn: () => Promise<T>
): Promise<T> {
  const lockKey = `lock:${key}`;
  const staleKey = `${key}:stale`;

  // Try fresh cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached) as T;
  }

  // Cache expired - try to take the refresh lock
  const lockAcquired = await redis.set(lockKey, '1', 'PX', 5000, 'NX');

  if (lockAcquired) {
    try {
      // Fetch fresh data
      const data = await fetchFn();
      const serialized = JSON.stringify(data);
      await redis.setex(key, ttl, serialized);
      // Keep a longer-lived stale copy for the next expiry window
      await redis.setex(staleKey, ttl + staleTtl, serialized);
      return data;
    } finally {
      // Release lock even if fetchFn throws
      await redis.del(lockKey);
    }
  } else {
    // Lock held - serve the stale copy while the refresh happens
    const stale = await redis.get(staleKey);
    if (stale) {
      return JSON.parse(stale) as T;
    }

    // No stale copy - briefly wait, then fall back to a direct fetch
    await sleep(100);
    const refreshed = await redis.get(key);
    return refreshed ? (JSON.parse(refreshed) as T) : await fetchFn();
  }
}

Monitoring

Track Stampede Events

export async function recordStampede(key: string) {
  await redis.incr('metrics:cache:stampede');
  logger.warn({ key }, 'Cache stampede detected');
}

// In getOrFetch:
if (!lockAcquired) {
  await recordStampede(key);
  // ... wait logic
}

Dashboard Metrics

Cache Stampede Protection:
- Stampede events/min (should be ~0)
- Lock wait time (p50, p99)
- Lock timeout rate (%)
- Stale cache hit rate (%)

Lock Configuration

Lock Timeout Strategy

Resource Type | Lock Timeout | Rationale
Properties    | 5s           | Fast DB query
Availability  | 10s          | Complex calculation
Pricing       | 15s          | External API call
Analytics     | 30s          | Heavy aggregation

export const LockTimeouts = {
  PROPERTIES: 5000,
  AVAILABILITY: 10000,
  PRICING: 15000,
  ANALYTICS: 30000,
};

Testing Stampede Prevention

// tests/cache/stampede.test.ts
describe('Cache Stampede Prevention', () => {
  it('should only fetch once for concurrent requests', async () => {
    const { orgId, propertyId } = await setupTestProperty();
    const key = CacheKeys.property(orgId, propertyId);

    // Clear cache
    await redis.del(key);

    let fetchCount = 0;
    const mockFetch = async () => {
      fetchCount++;
      await sleep(100); // Simulate slow DB query
      return { id: propertyId, name: 'Test Property' };
    };

    // Fire 100 concurrent requests
    const requests = Array.from({ length: 100 }, () =>
      getOrFetch(key, 3600, mockFetch)
    );

    await Promise.all(requests);

    // Should only fetch once
    expect(fetchCount).toBe(1);
  });

  it('should release lock after timeout', async () => {
    const key = 'test:key';
    const lockKey = `lock:${key}`;

    // Acquire lock
    await redis.set(lockKey, '1', 'PX', 1000, 'NX');

    // Wait for expiry
    await sleep(1100);

    // Lock should be released
    const lockExists = await redis.exists(lockKey);
    expect(lockExists).toBe(0);
  });
});

Edge Cases

Lock Never Released (Process Crash)

Auto-expiry prevents deadlock:

// Lock expires after 5s (PX flag)
await redis.set(lockKey, '1', 'PX', 5000, 'NX');

// Even if process crashes, lock auto-expires
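A related edge case the implementation above does not handle: a holder whose lock has already expired can still `DEL` a lock that a newer request now owns. A common refinement is to store a unique token per holder and release only if it still matches (in Redis this compare-and-delete is done atomically with a small Lua script). An in-memory sketch of the idea, with an ordinary Map standing in for Redis:

```typescript
// In-memory stand-in for Redis to show token-guarded lock release. The
// real version would SET NX PX a random token and compare-and-delete
// atomically via a Lua script.
const locks = new Map<string, string>();

function acquire(key: string, token: string): boolean {
  if (locks.has(key)) return false; // NX semantics: only set if absent
  locks.set(key, token);
  return true;
}

function release(key: string, token: string): boolean {
  if (locks.get(key) !== token) return false; // someone else holds it now
  locks.delete(key);
  return true;
}
```

With short lock timeouts and idempotent cache writes, the unguarded `DEL` is usually harmless here, which is why the simpler form was kept.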

Thundering Herd on Cold Start

Warm cache on startup:

// src/cache/warmup.ts
export async function warmCriticalCaches() {
  logger.info('Warming critical caches...');

  // Warm top 100 properties
  const topProperties = await getTopProperties(100);

  for (const property of topProperties) {
    await getOrFetch(
      CacheKeys.property(property.orgId, property.id),
      3600,
      async () => property
    );
  }

  logger.info('Cache warmed');
}

// Call on startup
await warmCriticalCaches();

Consequences

Positive

  • Prevents DB Overload: Only one fetch per key
  • Fast Recovery: Lock expires automatically
  • Distributed: Works across replicas
  • Graceful: Stale cache served during refresh

Negative

  • Lock Contention: High concurrency may wait
  • Redis Dependency: Requires Redis uptime

Mitigations

  • Short lock timeouts (5-10s)
  • Stale-while-revalidate for critical keys
  • Monitor stampede rate (alert if >10/min)

Validation Checklist

  • Lock-based cache fetching implemented
  • Lock timeout configured per resource type
  • Stale-while-revalidate for critical caches
  • Stampede monitoring enabled
  • Tests for concurrent requests
