ADR-0034: Circuit Breaker Pattern for API Resilience

Status

Accepted - 2025-01-26


Context

TVL Platform integrates with external APIs (Hostaway, Airbnb, VRBO, Stripe) that may experience outages, slow responses, or rate limiting.


Decision

Adopt the Circuit Breaker pattern, implemented with the opossum library, to prevent cascading failures and enable graceful degradation.

Rationale

  1. Fail Fast: Stop calling a failing API (prevents timeout pile-up)
  2. Auto Recovery: Probe the API again automatically after a cooldown period
  3. Observability: Metrics on open/closed/half-open states
  4. Graceful Degradation: Return cached data or a default response

Alternatives Considered

Alternative 1: Retry with Exponential Backoff Only

Rejected - No failure detection; retries keep hitting a broken API

Alternative 2: Manual Circuit Breaker

Rejected - Error-prone, no built-in observability, reinvents the wheel

Alternative 3: Service Mesh (Istio)

Rejected - Overkill for MVP, complex infrastructure


Implementation

1. Install Opossum

pnpm add opossum
pnpm add -D @types/opossum

2. Create Circuit Breaker Wrapper

// src/integrations/circuitBreaker.ts
import CircuitBreaker from 'opossum';
import { logger } from '../logger'; // assumed path; adjust to the project's logger module

export interface CircuitBreakerOptions {
  timeout: number; // Request timeout (ms)
  errorThresholdPercentage: number; // % failures before opening
  resetTimeout: number; // Time before half-open attempt (ms)
  rollingCountTimeout: number; // Rolling window for error calculation (ms)
  rollingCountBuckets: number; // Number of buckets in window
}

export const DEFAULT_CIRCUIT_BREAKER_OPTIONS: CircuitBreakerOptions = {
  timeout: 10000, // 10 seconds
  errorThresholdPercentage: 50, // Open if 50% failures
  resetTimeout: 30000, // Try again after 30s
  rollingCountTimeout: 10000, // 10s rolling window
  rollingCountBuckets: 10, // 1s buckets
};

export function createCircuitBreaker<T>(
  fn: (...args: any[]) => Promise<T>,
  options: Partial<CircuitBreakerOptions> = {}
): CircuitBreaker<any[], T> {
  // Caller-supplied options override the defaults
  const breaker = new CircuitBreaker(fn, {
    ...DEFAULT_CIRCUIT_BREAKER_OPTIONS,
    ...options,
  });

  // Note: for functions passed through .bind(), fn.name is prefixed with "bound "
  const breakerName = fn.name;

  // Event listeners for observability
  breaker.on('open', () => {
    logger.warn({ breakerName }, 'Circuit breaker opened');
  });

  breaker.on('halfOpen', () => {
    logger.info({ breakerName }, 'Circuit breaker half-open (testing)');
  });

  breaker.on('close', () => {
    logger.info({ breakerName }, 'Circuit breaker closed (healthy)');
  });

  breaker.on('timeout', () => {
    logger.error({ breakerName }, 'Circuit breaker timeout');
  });

  return breaker;
}

3. Wrap External API Calls

// src/integrations/hostaway/HostawayConnector.ts
import axios from 'axios';
import type CircuitBreaker from 'opossum';
import { createCircuitBreaker } from '../circuitBreaker';
import { logger } from '../../logger'; // assumed path; adjust to the project's logger module

export class HostawayConnector implements ChannelConnector {
  private listPropertiesBreaker: CircuitBreaker<any[], Property[]>;
  private createBookingBreaker: CircuitBreaker<any[], Booking>;

  constructor(credentials: HostawayCredentials) {
    // Wrap each external API call
    this.listPropertiesBreaker = createCircuitBreaker(
      this.listPropertiesImpl.bind(this),
      { timeout: 5000 } // 5s timeout for list
    );

    this.createBookingBreaker = createCircuitBreaker(
      this.createBookingImpl.bind(this),
      { timeout: 10000 } // 10s timeout for create
    );
  }

  async listProperties(): Promise<Property[]> {
    try {
      return await this.listPropertiesBreaker.fire();
    } catch (error) {
      // Caught values are `unknown` under strict settings, so narrow before reading `code`
      if ((error as { code?: string }).code === 'EOPENBREAKER') {
        // Circuit is open - return cached data
        logger.warn('Circuit open, returning cached properties');
        return await this.getCachedProperties();
      }
      throw error;
    }
  }

  private async listPropertiesImpl(): Promise<Property[]> {
    const response = await axios.get(`${this.baseURL}/listings`, {
      headers: { Authorization: `Bearer ${this.accessToken}` },
    });
    // Arrow function keeps `this` bound inside transformProperty
    return response.data.result.map((item: unknown) => this.transformProperty(item));
  }

  async createBooking(data: BookingData): Promise<Booking> {
    try {
      return await this.createBookingBreaker.fire(data);
    } catch (error) {
      if ((error as { code?: string }).code === 'EOPENBREAKER') {
        // Circuit is open - queue for later
        logger.error('Circuit open, queueing booking creation');
        await this.queueBookingCreation(data);
        throw new Error('Booking queued due to API unavailability');
      }
      throw error;
    }
  }

  private async createBookingImpl(data: BookingData): Promise<Booking> {
    const response = await axios.post(`${this.baseURL}/reservations`, {
      listingId: data.propertyId,
      guestName: data.guestName,
      // ... other fields
    });
    return this.transformBooking(response.data.result);
  }
}

4. Graceful Degradation Strategies

// Strategy 1: Return cached data
async getCachedProperties(): Promise<Property[]> {
  const cached = await redis.get(`properties:${this.orgId}`);
  return cached ? JSON.parse(cached) : [];
}

// Strategy 2: Queue operation for later
async queueBookingCreation(data: BookingData): Promise<void> {
  await webhookQueue.add('retry.booking.create', {
    data,
    channel: 'hostaway',
    retryAt: new Date(Date.now() + 60000).toISOString(), // Retry in 1 min
  });
}

// Strategy 3: Return a conservative default
async getAvailability(propertyId: string): Promise<Availability[]> {
  try {
    return await this.availabilityBreaker.fire(propertyId);
  } catch (error) {
    // Caught values are `unknown` under strict settings, so narrow before reading `code`
    if ((error as { code?: string }).code === 'EOPENBREAKER') {
      // Return conservative "unavailable" to prevent overbooking
      logger.warn('Circuit open, returning unavailable status');
      return [{ available: false, reason: 'API unavailable' }];
    }
    throw error;
  }
}
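
The three strategies share one shape: call through the breaker, and degrade only when the error says the circuit is open. A minimal sketch of that shape follows. `isBreakerOpenError` and `withFallback` are hypothetical helper names introduced for illustration; the only detail taken from this ADR is the `EOPENBREAKER` code checked in the examples above.

```typescript
// Hypothetical helpers (not part of opossum): a type guard for the open-circuit
// error code used in this ADR, plus a wrapper that degrades only in that case.
function isBreakerOpenError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) {
    return false;
  }
  return (error as Record<string, unknown>)['code'] === 'EOPENBREAKER';
}

async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    if (isBreakerOpenError(error)) {
      // Circuit is open: degrade instead of failing
      return fallback();
    }
    // Any other error propagates unchanged
    throw error;
  }
}
```

With helpers like these, a method such as `listProperties` could reduce to a single `withFallback(() => breaker.fire(), () => this.getCachedProperties())` call, keeping the narrowing of the caught value in one place.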

Circuit Breaker States

┌─────────────┐
│   CLOSED    │ ← Normal operation
│  (healthy)  │
└──────┬──────┘
       │ 50% errors
       ▼
┌─────────────┐
│    OPEN     │ ← Failing fast (no API calls)
│  (failing)  │
└──────┬──────┘
       │ 30s cooldown
       ▼
┌─────────────┐
│  HALF-OPEN  │ ← Testing with 1 request
│  (testing)  │
└──────┬──────┘
       │
       ├─ Success → CLOSED
       └─ Failure → OPEN
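
The transitions in the diagram can be sketched as a small state machine. This is an illustration only and not how opossum is implemented: `TinyBreaker` and its methods are hypothetical, time is passed in explicitly to keep the example deterministic, and the error rate is computed over all calls since the last state change rather than over a rolling window.

```typescript
// Illustrative state machine for the diagram above; NOT opossum's implementation.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class TinyBreaker {
  private state: BreakerState = 'CLOSED';
  private successes = 0;
  private failures = 0;
  private openedAt = 0;
  private trialInFlight = false;
  private readonly thresholdPct: number;
  private readonly resetTimeoutMs: number;

  constructor(thresholdPct = 50, resetTimeoutMs = 30000) {
    this.thresholdPct = thresholdPct;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  /** Returns true if a call may go out at time `now` (ms). */
  allowRequest(now: number): boolean {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      // Cooldown elapsed: move to HALF_OPEN and allow a single trial call
      this.state = 'HALF_OPEN';
      this.trialInFlight = false;
    }
    if (this.state === 'CLOSED') {
      return true;
    }
    if (this.state === 'HALF_OPEN' && !this.trialInFlight) {
      this.trialInFlight = true;
      return true;
    }
    return false; // OPEN, or HALF_OPEN with the trial already in flight
  }

  recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.transition('CLOSED', 0); // trial succeeded
    } else {
      this.successes += 1;
    }
  }

  recordFailure(now: number): void {
    if (this.state === 'HALF_OPEN') {
      this.transition('OPEN', now); // trial failed: restart the cooldown
      return;
    }
    this.failures += 1;
    const total = this.successes + this.failures;
    if ((this.failures * 100) / total >= this.thresholdPct) {
      this.transition('OPEN', now);
    }
  }

  private transition(next: BreakerState, now: number): void {
    this.state = next;
    this.openedAt = next === 'OPEN' ? now : 0;
    this.successes = 0;
    this.failures = 0;
    this.trialInFlight = false;
  }
}
```

Note that in this sketch a single failure with no prior successes already counts as a 100% error rate, which is one way the false positives listed under Consequences can arise.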

Observability

Metrics Collection

// src/monitoring/circuitBreakerMetrics.ts
import type CircuitBreaker from 'opossum';
import { metrics } from './metrics'; // assumed path; adjust to the project's metrics client

// Returns the timer so callers can clearInterval() it on shutdown
export function collectCircuitBreakerMetrics(breaker: CircuitBreaker) {
  return setInterval(() => {
    const stats = breaker.stats;

    metrics.gauge('circuit_breaker.requests', stats.fires);
    metrics.gauge('circuit_breaker.failures', stats.failures);
    metrics.gauge('circuit_breaker.successes', stats.successes);
    metrics.gauge('circuit_breaker.timeouts', stats.timeouts);
    // cacheHits reflects opossum's own `cache` option, not the Redis fallback above
    metrics.gauge('circuit_breaker.cache_hits', stats.cacheHits);

    // State (0 = closed, 1 = open, 2 = half-open)
    const state = breaker.opened ? 1 : breaker.halfOpen ? 2 : 0;
    metrics.gauge('circuit_breaker.state', state);
  }, 10000); // Every 10s
}

Dashboard (Grafana)

Circuit Breaker Health:
- State timeline (closed/open/half-open)
- Error rate (%)
- Timeout rate (%)
- Fallback usage (cache hits)
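
The error and timeout rate panels can be derived from the counters collected above. The sketch below assumes a snapshot with `fires`, `failures`, and `timeouts` fields matching the names read from `breaker.stats`; `StatsSnapshot`, `ratePercent`, and `dashboardRates` are hypothetical names introduced for illustration.

```typescript
// Hypothetical snapshot shape mirroring the counters gauged above.
interface StatsSnapshot {
  fires: number;
  failures: number;
  timeouts: number;
}

// Percentage of `part` in `total`; reports 0 when no requests were made.
function ratePercent(part: number, total: number): number {
  return total === 0 ? 0 : (part * 100) / total;
}

function dashboardRates(stats: StatsSnapshot): { errorRatePct: number; timeoutRatePct: number } {
  return {
    errorRatePct: ratePercent(stats.failures, stats.fires),
    timeoutRatePct: ratePercent(stats.timeouts, stats.fires),
  };
}
```

Guarding the zero-request case matters for idle breakers, which would otherwise report `NaN` and break the panel.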

Consequences

Positive

  • Prevents Cascade Failures: Stop hitting broken API
  • Faster Recovery: Auto-retry after cooldown
  • Graceful Degradation: Return cached/default data
  • Observability: Metrics on breaker states

Negative

  • Delayed Error Detection: Failures are evaluated over a 10s rolling window, so the breaker can take several seconds to open
  • False Positives: May open on transient errors
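
Both drawbacks follow from how a bucketed rolling window behaves. The sketch below illustrates the idea with the ADR's defaults (10s window, 10 buckets). It is not opossum's internal implementation; `RollingWindow` and its methods are hypothetical, and timestamps are passed in explicitly and assumed to be non-decreasing.

```typescript
// Illustrative bucketed rolling window; NOT opossum's internal implementation.
interface Bucket {
  start: number; // bucket start time (ms)
  successes: number;
  failures: number;
}

class RollingWindow {
  private buckets: Bucket[] = [];
  private readonly windowMs: number;
  private readonly bucketMs: number;

  constructor(windowMs = 10000, bucketCount = 10) {
    this.windowMs = windowMs;
    this.bucketMs = windowMs / bucketCount;
  }

  /** Records one call outcome at time `now` (ms). */
  record(ok: boolean, now: number): void {
    const start = Math.floor(now / this.bucketMs) * this.bucketMs;
    let bucket = this.buckets[this.buckets.length - 1];
    if (!bucket || bucket.start !== start) {
      bucket = { start, successes: 0, failures: 0 };
      this.buckets.push(bucket);
    }
    if (ok) {
      bucket.successes += 1;
    } else {
      bucket.failures += 1;
    }
    this.prune(now);
  }

  /** Error percentage over the buckets still inside the window at `now`. */
  errorPercentage(now: number): number {
    this.prune(now);
    let ok = 0;
    let failed = 0;
    for (const b of this.buckets) {
      ok += b.successes;
      failed += b.failures;
    }
    const total = ok + failed;
    return total === 0 ? 0 : (failed * 100) / total;
  }

  // Drops buckets that ended at or before the start of the window.
  private prune(now: number): void {
    const windowStart = now - this.windowMs;
    this.buckets = this.buckets.filter((b) => b.start + this.bucketMs > windowStart);
  }
}
```

Successes recorded earlier in the window dilute fresh failures, which delays opening, while a quiet window containing only a couple of failed calls reads as a high error rate, which is where false positives come from.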

Mitigations

  • Tune errorThresholdPercentage per API (50% default)
  • Use cached data or queuing for critical operations
  • Monitor breaker open events (alert if a breaker stays open for more than 5 minutes)
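
Per-API tuning could be kept in one lookup table that is merged over the defaults. The sketch below is an assumption about how that might be organised: `API_OVERRIDES`, `optionsFor`, and the override values are hypothetical, and only the default values come from this ADR.

```typescript
// Hypothetical per-API overrides merged over the ADR's default options.
interface BreakerOptions {
  timeout: number;
  errorThresholdPercentage: number;
  resetTimeout: number;
  rollingCountTimeout: number;
  rollingCountBuckets: number;
}

const DEFAULTS: BreakerOptions = {
  timeout: 10000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  rollingCountTimeout: 10000,
  rollingCountBuckets: 10,
};

type ApiName = 'hostaway' | 'airbnb' | 'vrbo' | 'stripe';

// Illustrative values only; real thresholds should come from observed traffic.
const API_OVERRIDES: Partial<Record<ApiName, Partial<BreakerOptions>>> = {
  hostaway: { timeout: 5000 },
  stripe: { errorThresholdPercentage: 25 },
};

// Returns the defaults with any overrides for `api` applied on top.
function optionsFor(api: ApiName): BreakerOptions {
  return { ...DEFAULTS, ...(API_OVERRIDES[api] ?? {}) };
}
```

The result of `optionsFor('stripe')` could then be passed as the second argument to `createCircuitBreaker`, so thresholds live in one place instead of being scattered across connectors.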

Validation Checklist

  • Circuit breakers wrap all external API calls
  • Fallback strategies defined (cache, queue, default)
  • Metrics collected for all breakers
  • Grafana dashboard for breaker states
  • Alerts for breaker open >5 minutes
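
The "open for more than 5 minutes" alert needs to measure from the first open event, since a breaker that keeps failing its half-open trial may report open repeatedly. A minimal sketch under that assumption follows; `OpenDurationTracker` is a hypothetical helper whose `onOpen` and `onClose` methods would be called from the breaker's `open` and `close` listeners.

```typescript
// Hypothetical tracker for "breaker open longer than N ms" alerts.
class OpenDurationTracker {
  private openedAt: number | null = null;
  private readonly maxOpenMs: number;

  constructor(maxOpenMs = 5 * 60 * 1000) {
    this.maxOpenMs = maxOpenMs;
  }

  /** Call when the breaker opens; repeat opens keep the original timestamp. */
  onOpen(now: number): void {
    if (this.openedAt === null) {
      this.openedAt = now;
    }
  }

  /** Call when the breaker closes; clears the outage timer. */
  onClose(): void {
    this.openedAt = null;
  }

  /** True once the breaker has been continuously unhealthy for longer than maxOpenMs. */
  shouldAlert(now: number): boolean {
    return this.openedAt !== null && now - this.openedAt > this.maxOpenMs;
  }
}
```

Half-open periods deliberately do not reset the timer here, so an API that never passes its trial request still triggers the alert.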

References