ADR-0039: Cache Stampede Prevention

Status

Accepted - 2025-01-26


Context

When a cache entry expires under high load, many requests simultaneously fetch the same data from the database, causing a cache stampede (also known as thundering herd).

Example:

  • 1000 requests/sec for popular property
  • Cache expires (TTL 3600s)
  • All 1000 requests hit DB simultaneously
  • Database overload, slow responses
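The failure mode above can be illustrated with a minimal in-memory sketch (hypothetical names; a counter stands in for the database):

```typescript
// Illustration only: with no protection, every concurrent cache miss
// triggers its own backend fetch. A counter stands in for the database.
let dbQueries = 0;
let cache: string | null = null; // expired entry

async function fetchFromDb(): Promise<string> {
  dbQueries++;
  await new Promise((r) => setTimeout(r, 50)); // simulate query latency
  return 'property-data';
}

async function unprotectedGet(): Promise<string> {
  if (cache) return cache;
  const data = await fetchFromDb(); // every concurrent caller races here
  cache = data;
  return data;
}

async function stampede(): Promise<number> {
  await Promise.all(Array.from({ length: 100 }, () => unprotectedGet()));
  return dbQueries; // 100 concurrent misses -> 100 backend queries
}
```

Running `stampede()` once against a cold cache performs 100 backend queries; the lock-based approach reduces this to one.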

Decision

Use a distributed lock pattern with Redis to ensure that only one request fetches from the database during a cache miss.

Rationale

  1. Prevent DB Overload: Only one query per key
  2. Fast Recovery: Lock expires quickly (5s)
  3. Simple: Redis atomic operations
  4. Graceful: Stale cache served during refetch

Alternatives Considered

Alternative 1: No Protection

Rejected - DB overload on popular keys

Alternative 2: Probabilistic Early Expiry

Rejected - Complex, doesn't fully prevent stampede
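For context, the rejected probabilistic approach (often called XFetch) has each reader recompute early with a probability that rises as expiry approaches - a sketch with hypothetical parameter names:

```typescript
// Sketch of probabilistic early expiry ("XFetch"): a reader recomputes
// before the TTL ends with a probability that grows as expiry nears.
// deltaMs is the duration of the last recompute; beta tunes eagerness.
function shouldRecomputeEarly(
  expiryMs: number, // absolute expiry timestamp (ms since epoch)
  deltaMs: number,  // how long the last recompute took (ms)
  beta = 1.0,
  now = Date.now()
): boolean {
  // Math.log(Math.random()) is negative, so the subtraction pushes `now`
  // forward by a random amount scaled by deltaMs * beta.
  return now - deltaMs * beta * Math.log(Math.random()) >= expiryMs;
}
```

Each request rolls the dice independently, so under very high concurrency several callers can still recompute at once - the "doesn't fully prevent stampede" point above.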

Alternative 3: Request Coalescing

Rejected - only coalesces within a single instance; doesn't work across replicas
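For reference, per-instance coalescing is simple to sketch: concurrent callers for the same key share one in-flight promise (hypothetical names):

```typescript
// Sketch of per-instance request coalescing: concurrent callers for the
// same key share one in-flight promise. This only deduplicates within a
// single process, which is why it was rejected for a multi-replica setup.
const inFlight = new Map<string, Promise<unknown>>();

function coalesce<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const p = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Two replicas each run their own `inFlight` map, so a popular key still produces one database query per replica.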


Implementation

Lock-Based Cache Fetching

// src/cache/stampedePrevention.ts
import Redis from 'ioredis';
import { logger } from '../logger'; // shared logger (pino-style API)

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export async function getOrFetch<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>,
  options: {
    lockTimeout?: number; // Max lock duration (ms)
    waitTimeout?: number; // Max wait for lock (ms)
  } = {}
): Promise<T> {
  const lockKey = `lock:${key}`;
  const lockTimeout = options.lockTimeout ?? 5000; // 5s
  const waitTimeout = options.waitTimeout ?? 10000; // 10s

  // Try cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached) as T;
  }

  // Cache miss - acquire lock
  const lockAcquired = await redis.set(
    lockKey,
    '1',
    'PX', lockTimeout, // Lock expires after lockTimeout ms
    'NX' // Only set if not exists
  );

  if (lockAcquired) {
    // We got the lock - fetch from DB
    try {
      const data = await fetchFn();

      // Store in cache
      await redis.setex(key, ttl, JSON.stringify(data));

      return data;
    } finally {
      // Release lock (even if fetchFn throws)
      await redis.del(lockKey);
    }
  } else {
    // Lock held by another request - wait for cache
    const startTime = Date.now();

    while (Date.now() - startTime < waitTimeout) {
      // Check if cache has been populated
      const refreshed = await redis.get(key);
      if (refreshed) {
        return JSON.parse(refreshed) as T;
      }

      // Wait 50ms and retry
      await sleep(50);
    }

    // Timeout - fall back to a direct fetch
    logger.warn({ key }, 'Cache stampede wait timeout, fetching directly');
    return await fetchFn();
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

Usage Example

// src/services/properties/getProperty.ts
export async function getProperty(orgId: string, propertyId: string) {
  const key = CacheKeys.property(orgId, propertyId);

  return await getOrFetch(
    key,
    3600, // 1 hour TTL
    async () => {
      // This only runs once across all replicas
      return await db.query.properties.findFirst({
        where: and(eq(properties.orgId, orgId), eq(properties.id, propertyId)),
      });
    }
  );
}

Alternative: Stale-While-Revalidate

Serve stale cache while refreshing in background:

export async function getOrFetchStale<T>(
  key: string,
  ttl: number,
  staleTtl: number, // Extra time to serve stale
  fetchFn: () => Promise<T>
): Promise<T> {
  const lockKey = `lock:${key}`;
  const staleKey = `${key}:stale`;

  // Try fresh cache first
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached) as T;
  }

  // Cache expired - try to take the refresh lock
  const lockAcquired = await redis.set(lockKey, '1', 'PX', 5000, 'NX');

  if (lockAcquired) {
    try {
      // Fetch fresh data
      const data = await fetchFn();
      const serialized = JSON.stringify(data);
      await redis.setex(key, ttl, serialized);
      // Keep a longer-lived stale copy for the next expiry window
      await redis.setex(staleKey, ttl + staleTtl, serialized);
      return data;
    } finally {
      // Release lock even if fetchFn throws
      await redis.del(lockKey);
    }
  } else {
    // Lock held - serve the stale copy while the refresh happens
    const stale = await redis.get(staleKey);
    if (stale) {
      return JSON.parse(stale) as T;
    }

    // No stale copy - briefly wait, then fall back to a direct fetch
    await sleep(100);
    const refreshed = await redis.get(key);
    return refreshed ? (JSON.parse(refreshed) as T) : await fetchFn();
  }
}

Monitoring

Track Stampede Events

export async function recordStampede(key: string) {
  await redis.incr('metrics:cache:stampede');
  logger.warn({ key }, 'Cache stampede detected');
}

// In getOrFetch:
if (!lockAcquired) {
  await recordStampede(key);
  // ... wait logic
}

Dashboard Metrics

Cache Stampede Protection:
- Stampede events/min (should be ~0)
- Lock wait time (p50, p99)
- Lock timeout rate (%)
- Stale cache hit rate (%)

Lock Configuration

Lock Timeout Strategy

Resource Type | Lock Timeout | Rationale
Properties    | 5s           | Fast DB query
Availability  | 10s          | Complex calculation
Pricing       | 15s          | External API call
Analytics     | 30s          | Heavy aggregation

export const LockTimeouts = {
  PROPERTIES: 5000,
  AVAILABILITY: 10000,
  PRICING: 15000,
  ANALYTICS: 30000,
};

Testing Stampede Prevention

// tests/cache/stampede.test.ts
describe('Cache Stampede Prevention', () => {
  it('should only fetch once for concurrent requests', async () => {
    const { orgId, propertyId } = await setupTestProperty();
    const key = CacheKeys.property(orgId, propertyId);

    // Clear cache
    await redis.del(key);

    let fetchCount = 0;
    const mockFetch = async () => {
      fetchCount++;
      await sleep(100); // Simulate slow DB query
      return { id: propertyId, name: 'Test Property' };
    };

    // Fire 100 concurrent requests
    const requests = Array.from({ length: 100 }, () =>
      getOrFetch(key, 3600, mockFetch)
    );

    await Promise.all(requests);

    // Should only fetch once
    expect(fetchCount).toBe(1);
  });

  it('should release lock after timeout', async () => {
    const key = 'test:key';
    const lockKey = `lock:${key}`;

    // Acquire lock
    await redis.set(lockKey, '1', 'PX', 1000, 'NX');

    // Wait for expiry
    await sleep(1100);

    // Lock should be released
    const lockExists = await redis.exists(lockKey);
    expect(lockExists).toBe(0);
  });
});

Edge Cases

Lock Never Released (Process Crash)

Auto-expiry prevents deadlock:

// Lock expires after 5s (PX flag)
await redis.set(lockKey, '1', 'PX', 5000, 'NX');

// Even if process crashes, lock auto-expires
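A related edge case the implementation above does not handle: a holder whose lock has already expired can still `DEL` a lock that a newer request now owns. A common refinement is to store a unique token per holder and release only if it still matches (in Redis this compare-and-delete is done atomically with a small Lua script). An in-memory sketch of the idea, with an ordinary Map standing in for Redis:

```typescript
// In-memory stand-in for Redis to show token-guarded lock release. The
// real version would SET NX PX a random token and compare-and-delete
// atomically via a Lua script.
const locks = new Map<string, string>();

function acquire(key: string, token: string): boolean {
  if (locks.has(key)) return false; // NX semantics: only set if absent
  locks.set(key, token);
  return true;
}

function release(key: string, token: string): boolean {
  if (locks.get(key) !== token) return false; // someone else holds it now
  locks.delete(key);
  return true;
}
```

With short lock timeouts and idempotent cache writes, the unguarded `DEL` is usually harmless here, which is why the simpler form was kept.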

Thundering Herd on Cold Start

Warm cache on startup:

// src/cache/warmup.ts
export async function warmCriticalCaches() {
  logger.info('Warming critical caches...');

  // Warm top 100 properties
  const topProperties = await getTopProperties(100);

  for (const property of topProperties) {
    await getOrFetch(
      CacheKeys.property(property.orgId, property.id),
      3600,
      async () => property
    );
  }

  logger.info('Cache warmed');
}

// Call on startup
await warmCriticalCaches();

Consequences

Positive

  • Prevents DB Overload: Only one fetch per key
  • Fast Recovery: Lock expires automatically
  • Distributed: Works across replicas
  • Graceful: Stale cache served during refresh

Negative

  • Lock Contention: High concurrency may wait
  • Redis Dependency: Requires Redis uptime

Mitigations

  • Short lock timeouts (5-10s)
  • Stale-while-revalidate for critical keys
  • Monitor stampede rate (alert if >10/min)

Validation Checklist

  • Lock-based cache fetching implemented
  • Lock timeout configured per resource type
  • Stale-while-revalidate for critical caches
  • Stampede monitoring enabled
  • Tests for concurrent requests
