ADR-0039: Cache Stampede Prevention
Status
Accepted - 2025-01-26
Context
When a cached entry expires under high load, multiple requests simultaneously fetch the same data from the database, causing a cache stampede (thundering herd).
Example:
- 1000 requests/sec for popular property
- Cache expires (TTL 3600s)
- All 1000 requests hit DB simultaneously
- Database overload, slow responses
Decision
Use a distributed lock (Redis `SET NX PX`) to ensure only one request fetches from the DB during a cache miss.
Rationale
- Prevent DB Overload: Only one query per key
- Fast Recovery: Lock expires quickly (5s)
- Simple: Redis atomic operations
- Graceful: Stale cache served during refetch
Alternatives Considered
Alternative 1: No Protection
Rejected - DB overload on popular keys
Alternative 2: Probabilistic Early Expiry
Rejected - Complex, doesn't fully prevent stampede
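For reference, the probabilistic approach has each read roll a die that makes an early refresh increasingly likely as expiry approaches; stampedes become unlikely but are never ruled out. A minimal sketch of the decision function (not part of this codebase; names and parameters are illustrative):

```typescript
// Sketch of probabilistic early expiry ("XFetch"-style): each read may decide
// to refresh before the real TTL, with probability rising as expiry nears.
// delta = observed recompute time (ms); beta > 1 refreshes more eagerly.
export function shouldRefreshEarly(
  expiryTime: number, // absolute expiry timestamp (ms)
  delta: number,      // how long the last recompute took (ms)
  beta = 1.0,
  now = Date.now()
): boolean {
  // -ln(rand) is exponentially distributed, so early refreshes stay rare
  // until the key is close to expiring.
  return now - delta * beta * Math.log(Math.random()) >= expiryTime;
}
```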
Alternative 3: Request Coalescing
Rejected - Deduplicates only within a single instance; offers no protection across replicas
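Alternative 3 can be sketched as an in-process promise map: concurrent callers for the same key share one in-flight fetch, but because the map lives in process memory, N replicas still produce N DB queries. A hypothetical sketch:

```typescript
// Per-instance request coalescing (the rejected Alternative 3).
// Concurrent callers for the same key share one in-flight promise,
// but the map is process-local - replicas do not share it.
const inflight = new Map<string, Promise<unknown>>();

export function coalesce<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fetchFn().finally(() => inflight.delete(key)); // clear when settled
  inflight.set(key, p);
  return p;
}
```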
Implementation
Lock-Based Cache Fetching
// src/cache/stampedePrevention.ts
import Redis from 'ioredis';
import { logger } from '../logger'; // logger module path is assumed

const redis = new Redis(); // shared client; connection options omitted here
export async function getOrFetch<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>,
  options: {
    lockTimeout?: number; // Max lock duration (ms)
    waitTimeout?: number; // Max wait for lock (ms)
  } = {}
): Promise<T> {
  const lockKey = `lock:${key}`;
  const lockTimeout = options.lockTimeout || 5000; // 5s
  const waitTimeout = options.waitTimeout || 10000; // 10s
  // Try cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }
  // Cache miss - acquire lock
  const lockAcquired = await redis.set(
    lockKey,
    '1',
    'PX', lockTimeout, // Lock expires after 5s
    'NX' // Only set if not exists
  );
  if (lockAcquired) {
    // We got the lock - fetch from DB
    try {
      const data = await fetchFn();
      // Store in cache
      await redis.setex(key, ttl, JSON.stringify(data));
      return data;
    } finally {
      // Release the lock. If fetchFn outlived lockTimeout, the lock has already
      // expired and this DEL may remove a lock now held by another request;
      // a release token with compare-and-delete avoids that (see Edge Cases).
      await redis.del(lockKey);
    }
  } else {
    // Lock held by another request - wait for cache
    const startTime = Date.now();
    while (Date.now() - startTime < waitTimeout) {
      // Check if cache populated
      const cached = await redis.get(key);
      if (cached) {
        return JSON.parse(cached);
      }
      // Wait 50ms and retry
      await sleep(50);
    }
    // Timeout - fall back to fetching directly (result is not cached here,
    // to avoid racing the lock holder's write)
    logger.warn({ key }, 'Cache stampede wait timeout, fetching directly');
    return await fetchFn();
  }
}
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
Usage Example
// src/services/properties/getProperty.ts
export async function getProperty(orgId: string, propertyId: string) {
  const key = CacheKeys.property(orgId, propertyId);
  return await getOrFetch(
    key,
    3600, // 1 hour TTL
    async () => {
      // This only runs once across all replicas
      return await db.query.properties.findFirst({
        where: and(eq(properties.orgId, orgId), eq(properties.id, propertyId)),
      });
    }
  );
}
Alternative: Stale-While-Revalidate
Serve stale cache while refreshing in background:
export async function getOrFetchStale<T>(
  key: string,
  ttl: number,
  staleTtl: number, // Extra seconds the stale copy remains servable
  fetchFn: () => Promise<T>
): Promise<T> {
  const lockKey = `lock:${key}`;
  const staleKey = `${key}:stale`;
  // Try fresh cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }
  // Fresh entry expired - try to acquire the refresh lock
  const lockAcquired = await redis.set(lockKey, '1', 'PX', 5000, 'NX');
  if (lockAcquired) {
    try {
      // Fetch fresh data and write both copies; the stale copy outlives
      // the fresh one by staleTtl so it can be served during the next refresh
      const data = await fetchFn();
      const serialized = JSON.stringify(data);
      await redis.setex(key, ttl, serialized);
      await redis.setex(staleKey, ttl + staleTtl, serialized);
      return data;
    } finally {
      await redis.del(lockKey);
    }
  } else {
    // Lock held by another request - serve the stale copy while it refreshes
    const stale = await redis.get(staleKey);
    if (stale) {
      return JSON.parse(stale);
    }
    // No stale copy - wait briefly, then fall back to a direct fetch
    await sleep(100);
    const refreshed = await redis.get(key);
    return refreshed ? JSON.parse(refreshed) : await fetchFn();
  }
}
Monitoring
Track Stampede Events
export async function recordStampede(key: string) {
  await redis.incr('metrics:cache:stampede');
  logger.warn({ key }, 'Cache stampede detected');
}
// In getOrFetch:
if (!lockAcquired) {
  await recordStampede(key);
  // ... wait logic
}
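The `metrics:cache:stampede` counter is cumulative; alerting on "events/min" requires converting counter deltas into a rate. A minimal sketch (the sampling wiring itself is assumed):

```typescript
// Convert two readings of a monotonic counter (e.g. metrics:cache:stampede)
// into a per-minute rate for alerting.
export function ratePerMinute(prev: number, curr: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new Error('elapsedMs must be positive');
  return ((curr - prev) * 60_000) / elapsedMs;
}
```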
Dashboard Metrics
Cache Stampede Protection:
- Stampede events/min (should be ~0)
- Lock wait time (p50, p99)
- Lock timeout rate (%)
- Stale cache hit rate (%)
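The p50/p99 lock wait times above can be computed from recorded wait samples; a minimal nearest-rank sketch (how samples are collected is left open):

```typescript
// Nearest-rank percentile over recorded samples (e.g. lock wait times in ms).
// p is in [0, 100]; operates on a sorted copy, leaving the input untouched.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```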
Lock Configuration
Lock Timeout Strategy
| Resource Type | Lock Timeout | Rationale | 
|---|---|---|
| Properties | 5s | Fast DB query | 
| Availability | 10s | Complex calculation | 
| Pricing | 15s | External API call | 
| Analytics | 30s | Heavy aggregation | 
// Lock timeouts in milliseconds
export const LockTimeouts = {
  PROPERTIES: 5000,
  AVAILABILITY: 10000,
  PRICING: 15000,
  ANALYTICS: 30000,
};
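A caller needs to pick the right timeout for a given key. A hypothetical helper that maps keys to the table above (the key prefixes are assumptions about the `CacheKeys` conventions in this codebase):

```typescript
// Hypothetical mapping from cache key to lock timeout (ms), following the
// table above. Key prefixes are assumed, not taken from CacheKeys itself.
export function lockTimeoutFor(key: string): number {
  if (key.startsWith('availability:')) return 10_000; // complex calculation
  if (key.startsWith('pricing:')) return 15_000;      // external API call
  if (key.startsWith('analytics:')) return 30_000;    // heavy aggregation
  return 5_000;                                       // default: fast DB query
}
```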
Testing Stampede Prevention
// tests/cache/stampede.test.ts
describe('Cache Stampede Prevention', () => {
  it('should only fetch once for concurrent requests', async () => {
    const { orgId, propertyId } = await setupTestProperty();
    const key = CacheKeys.property(orgId, propertyId);
    // Clear cache
    await redis.del(key);
    let fetchCount = 0;
    const mockFetch = async () => {
      fetchCount++;
      await sleep(100); // Simulate slow DB query
      return { id: propertyId, name: 'Test Property' };
    };
    // Fire 100 concurrent requests
    const requests = Array.from({ length: 100 }, () =>
      getOrFetch(key, 3600, mockFetch)
    );
    await Promise.all(requests);
    // Should only fetch once
    expect(fetchCount).toBe(1);
  });
  it('should release lock after timeout', async () => {
    const key = 'test:key';
    const lockKey = `lock:${key}`;
    // Acquire lock
    await redis.set(lockKey, '1', 'PX', 1000, 'NX');
    // Wait for expiry
    await sleep(1100);
    // Lock should be released
    const lockExists = await redis.exists(lockKey);
    expect(lockExists).toBe(0);
  });
});
Edge Cases
Lock Never Released (Process Crash)
Auto-expiry prevents deadlock:
// Lock expires after 5s (PX flag)
await redis.set(lockKey, '1', 'PX', 5000, 'NX');
// Even if process crashes, lock auto-expires
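Auto-expiry has a counterpart risk: if the fetch outlives the lock, the unconditional `DEL` can remove a lock that another request has since acquired. Releasing only when a per-request token still matches prevents this; sketched here against an in-memory `Map` standing in for Redis (in real Redis the compare-and-delete must be a single Lua script to stay atomic):

```typescript
// Token-based lock release, with a Map standing in for Redis.
// In Redis, release() must be one Lua script (GET + DEL) to remain atomic.
const locks = new Map<string, string>();

export function acquire(lockKey: string, token: string): boolean {
  if (locks.has(lockKey)) return false; // NX semantics: only set if absent
  locks.set(lockKey, token);
  return true;
}

export function release(lockKey: string, token: string): boolean {
  // Only delete if we still own the lock - prevents deleting a lock that
  // expired and was re-acquired by another request.
  if (locks.get(lockKey) !== token) return false;
  locks.delete(lockKey);
  return true;
}
```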
Thundering Herd on Cold Start
Warm cache on startup:
// src/cache/warmup.ts
export async function warmCriticalCaches() {
  logger.info('Warming critical caches...');
  // Warm top 100 properties
  const topProperties = await getTopProperties(100);
  for (const property of topProperties) {
    await getOrFetch(
      CacheKeys.property(property.orgId, property.id),
      3600,
      async () => property
    );
  }
  logger.info('Cache warmed');
}
// Call on startup
await warmCriticalCaches();
Consequences
Positive
- ✅ Prevents DB Overload: Only one fetch per key
- ✅ Fast Recovery: Lock expires automatically
- ✅ Distributed: Works across replicas
- ✅ Graceful: Stale cache served during refresh
Negative
- ❌ Lock Contention: High concurrency may wait
- ❌ Redis Dependency: Requires Redis uptime
Mitigations
- Short lock timeouts (5-10s)
- Stale-while-revalidate for critical keys
- Monitor stampede rate (alert if >10/min)
Validation Checklist
- Lock-based cache fetching implemented
- Lock timeout configured per resource type
- Stale-while-revalidate for critical caches
- Stampede monitoring enabled
- Tests for concurrent requests