System Architecture & Data Infrastructure - Domain Specification

First Introduced: MVP.0 (cross-cutting)
Status: Specification Complete
Last Updated: 2025-10-25


Overview

The System Architecture & Data Infrastructure domain defines how all platform domains interconnect physically and logically, including databases, services, queues, caches, and analytics layers. This cross-cutting domain provides the technical foundation that lets the platform handle operational workloads (bookings, content updates, syncs) while supporting near-real-time data pipelines, horizontal scalability, and compliance-grade reliability.

This domain establishes patterns for event-driven architecture, transactional outbox for reliable messaging, background job processing, caching strategies, and deployment infrastructure that enable all other domains to operate efficiently and scale gracefully.


Responsibilities

This domain IS responsible for:

  • Event-driven architecture patterns and event publishing infrastructure
  • Transactional outbox pattern for reliable cross-domain messaging
  • Background job queue management and worker orchestration
  • Distributed caching strategy (Redis, CDN)
  • Database infrastructure patterns (connection pooling, partitioning, replication)
  • Service layer orchestration and inter-service communication
  • Message queue and event bus infrastructure
  • Data pipeline architecture (streaming and batch)
  • Infrastructure monitoring and observability patterns
  • Deployment architecture and scaling strategies

This domain is NOT responsible for:

  • Business logic execution (→ Individual domain services)
  • Domain-specific event definitions (→ Each domain defines its own events)
  • Application-level audit logging (→ Analytics & Audit domain)
  • User-facing API design (→ Individual domain services)
  • Data warehouse analytics queries (→ Analytics domain)
  • Security policies and access control rules (→ Authorization domain)

Relationships

Depends On:

  • Identity & Tenancy - Org/Account context for multi-tenancy
  • All domain services - Supply the domain-specific event definitions carried by this infrastructure (the infrastructure itself is provided to them; see Depended On By)

Depended On By:

  • ALL DOMAINS - Every domain uses infrastructure patterns defined here:
    • Events published via transactional outbox
    • Background jobs for async processing
    • Caching for performance optimization
    • Database patterns for data persistence

Related Domains:


Core Concepts

Entity: Event

Purpose: Atomic, immutable record of something that happened in the system, forming the backbone of event-driven architecture.

Key Attributes:

  • id (UUID, primary key) - Unique event identifier
  • org_id (UUID, foreign key → organizations.id) - Tenant context
  • account_id (UUID, nullable, foreign key → accounts.id) - Account context
  • user_id (UUID, nullable, foreign key → users.id) - Actor who triggered event
  • event_type (VARCHAR, required) - Qualified event name (e.g., booking.created, space.updated)
  • aggregate_type (VARCHAR, required) - Entity type (space, booking, payment)
  • aggregate_id (UUID, required) - Entity ID
  • payload (JSONB, required) - Event data (state changes, context)
  • metadata (JSONB) - Additional context (IP, user agent, trace ID)
  • source (ENUM) - Origin: api | ui | system | integration | worker
  • occurred_at (TIMESTAMPTZ, required) - When event happened (business time)
  • published_at (TIMESTAMPTZ) - When event was published to event bus
  • version (INTEGER, default 1) - Event schema version
  • correlation_id (UUID) - Links related events across services
  • causation_id (UUID) - References the event that caused this one
  • created_at (TIMESTAMPTZ, default now())

Relationships:

  • Event → Org (*, many-to-one)
  • Event → Account (*, many-to-one, optional)
  • Event → User (*, many-to-one, optional)
  • Event → Outbox (1:1, initially) - Published via outbox pattern

Lifecycle:

  • Created: When any significant system action occurs
  • Published: Moved from outbox to event bus
  • Consumed: Processed by downstream services/workers
  • Archived: Moved to cold storage after retention period (12-24 months)

Business Rules:

  • Events are immutable once created (append-only)
  • Payload must be self-contained: consumers can process the event without looking up related rows (no foreign key dependencies)
  • Event types follow domain.action naming convention
  • All events scoped to org_id for multi-tenancy isolation
  • Correlation ID chains related events across transaction boundaries
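
The rules above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the `Event` class, the `caused_by` helper, and the regular expression used for the domain.action convention are all assumptions made for illustration, and only a subset of the attributes listed earlier is modeled.

```python
import re
import uuid
from dataclasses import FrozenInstanceError, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

# Assumed pattern for the domain.action convention (e.g. "booking.created").
EVENT_TYPE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Event:
    org_id: uuid.UUID
    event_type: str
    aggregate_type: str
    aggregate_id: uuid.UUID
    payload: dict[str, Any]  # note: frozen does not deep-freeze this dict
    correlation_id: Optional[uuid.UUID] = None
    causation_id: Optional[uuid.UUID] = None
    version: int = 1
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if not EVENT_TYPE_PATTERN.match(self.event_type):
            raise ValueError(
                f"event_type must follow domain.action, got {self.event_type!r}"
            )


def caused_by(parent: Event, event_type: str, aggregate_type: str,
              aggregate_id: uuid.UUID, payload: dict[str, Any]) -> Event:
    """Build a follow-up event that stays in the parent's correlation chain."""
    return Event(
        org_id=parent.org_id,  # follow-up events stay in the same tenant
        event_type=event_type,
        aggregate_type=aggregate_type,
        aggregate_id=aggregate_id,
        payload=payload,
        correlation_id=parent.correlation_id or parent.id,
        causation_id=parent.id,
    )
```

In this sketch the first event of a chain has no correlation ID, so its own `id` seeds the chain; a real service might instead assign the correlation ID at the API boundary.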

Event Types (Examples):

Identity Domain:
- org.created, org.updated, org.suspended
- account.created, account.updated
- user.created, user.login, membership.created

Supply Domain:
- space.created, space.updated, space.published
- unit.created, unit.updated

Availability Domain:
- booking.created, booking.confirmed, booking.canceled
- hold.created, hold.expired, hold.converted
- block.created, block.deleted

Pricing Domain:
- rate_plan.created, rate_plan.updated
- quote.generated, quote.expired

Payments Domain:
- payment.initiated, payment.succeeded, payment.failed
- payout.created, payout.completed

Channels Domain:
- channel_sync.started, channel_sync.completed
- channel_listing.synced

Entity: Outbox

Purpose: Implements the transactional outbox pattern, ensuring reliable at-least-once event publishing even during system failures.

Key Attributes:

  • id (UUID, primary key)
  • event_id (UUID, foreign key → events.id, unique) - Reference to published event
  • org_id (UUID, foreign key → organizations.id) - Tenant partition
  • event_type (VARCHAR, required) - Event type for filtering/routing
  • aggregate_type (VARCHAR, required) - Entity type
  • aggregate_id (UUID, required) - Entity ID
  • payload (JSONB, required) - Complete event payload
  • status (ENUM) - pending | published | failed | dead_letter
  • retry_count (INTEGER, default 0) - Number of publish attempts
  • next_retry_at (TIMESTAMPTZ) - Scheduled retry time (exponential backoff)
  • published_at (TIMESTAMPTZ) - When successfully published to event bus
  • error_message (TEXT) - Last error if status=failed
  • created_at (TIMESTAMPTZ, default now())
  • updated_at (TIMESTAMPTZ, default now())

Relationships:

  • Outbox → Event (1:1) - Each outbox entry publishes one event
  • Outbox → Org (*, many-to-one)

Lifecycle:

  1. Created: When event is generated, inserted in same transaction as business data
  2. Pending: Awaiting publication by outbox worker
  3. Published: Successfully sent to event bus
  4. Failed: Temporary failure, scheduled for retry
  5. Dead Letter: Permanent failure after max retries (requires manual intervention)

Business Rules:

  • Events inserted into outbox in same database transaction as source data change
  • Guarantees event will be published at least once
  • Outbox worker polls for pending events (idempotent publish)
  • Exponential backoff retry strategy (1s, 2s, 4s, 8s, 16s, 30s, 60s)
  • Dead letter after 7 retries or 24 hours
  • Published outbox entries retained for 7 days, then archived
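
As a rough illustration of the retry and dead-letter rules, the sketch below computes `next_retry_at` from the delay schedule listed above. The function name, the constants, and the reading of `retry_count` as "retries already made before this failure" are assumptions, not part of the specification.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay schedule from the business rules, in seconds.
RETRY_DELAYS_SECONDS = (1, 2, 4, 8, 16, 30, 60)
MAX_RETRIES = len(RETRY_DELAYS_SECONDS)  # 7
MAX_AGE = timedelta(hours=24)


def next_retry_at(retry_count: int, created_at: datetime,
                  now: datetime) -> Optional[datetime]:
    """Return when to retry after a failed publish, or None to dead-letter.

    retry_count is the number of retries already made before this failure
    (an assumed interpretation of the outbox column).
    """
    if retry_count >= MAX_RETRIES or now - created_at >= MAX_AGE:
        return None  # caller should set status = 'dead_letter'
    return now + timedelta(seconds=RETRY_DELAYS_SECONDS[retry_count])


if __name__ == "__main__":
    created = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for count in range(MAX_RETRIES + 1):
        print(count, next_retry_at(count, created, created))
```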

Pattern Implementation:

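-- Example: Creating a booking with its outbox event
-- PL/pgSQL-style sketch: booking_id, event_id, org_id, and payload_json are
-- variables bound by the caller; "..." marks elided columns and values.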
BEGIN;
-- 1. Insert business entity
INSERT INTO bookings (org_id, space_id, checkin, checkout, ...)
VALUES (...) RETURNING id INTO booking_id;

-- 2. Insert event
INSERT INTO events (org_id, event_type, aggregate_type, aggregate_id, payload, ...)
VALUES (...) RETURNING id INTO event_id;

-- 3. Insert into outbox (in same transaction)
INSERT INTO outbox (event_id, org_id, event_type, aggregate_type, aggregate_id, payload, status)
VALUES (event_id, org_id, 'booking.created', 'booking', booking_id, payload_json, 'pending');
COMMIT;

-- The outbox worker publishes pending rows asynchronously after COMMIT.
-- If the transaction fails, none of the three rows is committed, so a booking
-- never exists without its event (and vice versa).
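
To make the atomicity claim concrete, here is a small runnable sketch of the same three inserts. It uses Python's standard `sqlite3` module with an in-memory database purely as a stand-in for PostgreSQL; the simplified schema, the `create_booking` function, and the `fail_before_commit` switch are hypothetical and exist only to show that a failure before commit leaves no partial rows.

```python
import json
import sqlite3
import uuid

# Simplified stand-in schema (assumption): only the columns the sketch needs.
SCHEMA = """
CREATE TABLE bookings (id TEXT PRIMARY KEY, org_id TEXT NOT NULL,
                       space_id TEXT NOT NULL);
CREATE TABLE events   (id TEXT PRIMARY KEY, org_id TEXT NOT NULL,
                       event_type TEXT NOT NULL, aggregate_type TEXT NOT NULL,
                       aggregate_id TEXT NOT NULL, payload TEXT NOT NULL);
CREATE TABLE outbox   (id TEXT PRIMARY KEY, event_id TEXT NOT NULL UNIQUE,
                       org_id TEXT NOT NULL, event_type TEXT NOT NULL,
                       aggregate_type TEXT NOT NULL, aggregate_id TEXT NOT NULL,
                       payload TEXT NOT NULL,
                       status TEXT NOT NULL DEFAULT 'pending');
"""


def create_booking(conn: sqlite3.Connection, org_id: str, space_id: str,
                   fail_before_commit: bool = False) -> str:
    """Insert booking, event, and outbox row in one transaction."""
    booking_id, event_id = str(uuid.uuid4()), str(uuid.uuid4())
    payload = json.dumps({"booking_id": booking_id, "space_id": space_id})
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO bookings (id, org_id, space_id) VALUES (?, ?, ?)",
            (booking_id, org_id, space_id),
        )
        conn.execute(
            "INSERT INTO events (id, org_id, event_type, aggregate_type,"
            " aggregate_id, payload) VALUES (?, ?, ?, ?, ?, ?)",
            (event_id, org_id, "booking.created", "booking", booking_id, payload),
        )
        conn.execute(
            "INSERT INTO outbox (id, event_id, org_id, event_type,"
            " aggregate_type, aggregate_id, payload) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), event_id, org_id, "booking.created",
             "booking", booking_id, payload),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")
    return booking_id


def row_counts(conn: sqlite3.Connection) -> tuple:
    """Return row counts for (bookings, events, outbox)."""
    return tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("bookings", "events", "outbox")
    )
```

Calling `create_booking` with `fail_before_commit=True` raises and leaves all three tables unchanged, which is the property the outbox worker relies on when it polls for pending rows.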

Entity: Job

Purpose: Manages background, asynchronous, and scheduled tasks that don't block user requests.

Key Attributes:

  • id (UUID, primary key)
  • org_id (UUID, nullable, foreign key → organizations.id) - Tenant context (null for system jobs)
  • account_id (UUID, nullable, foreign key → accounts.id) - Account context
  • job_type (VARCHAR, required) - Qualified job name (e.g., ical.import, search.reindex)
  • queue (VARCHAR, default 'default') - Queue name for prioritization
  • priority (INTEGER, default 5) - Higher = more urgent (1-10 scale)
  • payload (JSONB, required) - Job parameters and context
  • status (ENUM) - pending | running | completed | failed | canceled | dead_letter
  • worker_id (VARCHAR) - Identifier of worker processing this job
  • scheduled_at (TIMESTAMPTZ, required) - When job should run
  • started_at (TIMESTAMPTZ) - When job began execution
  • completed_at (TIMESTAMPTZ) - When job finished
  • retry_count (INTEGER, default 0) - Execution attempts
  • max_retries (INTEGER, default 3) - Maximum retry attempts
  • next_retry_at (TIMESTAMPTZ) - Next scheduled retry (exponential backoff)
  • timeout_seconds (INTEGER, default 300) - Max execution time
  • result (JSONB) - Job output data
  • error_message (TEXT) - Error details if failed
  • created_at (TIMESTAMPTZ, default now())
  • updated_at (TIMESTAMPTZ, default now())

Relationships:

  • Job → Org (*, many-to-one, optional)
  • Job → Account (*, many-to-one, optional)

Lifecycle:

  1. Created: Job enqueued by service or scheduled by cron
  2. Pending: Waiting in queue
  3. Running: Worker acquired lock and executing
  4. Completed: Successful execution
  5. Failed: Temporary failure, scheduled for retry
  6. Dead Letter: Max retries exceeded, needs manual review
  7. Canceled: Explicitly canceled by admin or timeout

Business Rules:

  • Jobs are idempotent (safe to retry)
  • Jobs include retry logic with exponential backoff
  • Failed jobs moved to dead letter queue after max retries
  • Long-running jobs must support timeouts and graceful shutdown
  • Jobs scoped to org_id for tenant isolation (except system jobs)
  • Queue names enable priority-based execution (critical, default, low)
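
The sketch below shows one possible reading of the priority and retry rules: pick the due pending job with the highest priority, and move a job to dead letter once its retries are used up. The `Job` class here models only a few of the attributes above, and `pick_next` and `record_failure` are hypothetical helper names. How a failed job returns to pending at `next_retry_at` is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Job:
    job_type: str
    scheduled_at: datetime
    priority: int = 5          # 1-10, higher = more urgent
    status: str = "pending"
    retry_count: int = 0
    max_retries: int = 3


def pick_next(jobs: list[Job], now: datetime) -> Optional[Job]:
    """Choose the due pending job with the highest priority.

    Ties are broken by the earliest scheduled_at (an assumption).
    """
    due = [j for j in jobs if j.status == "pending" and j.scheduled_at <= now]
    if not due:
        return None
    return min(due, key=lambda j: (-j.priority, j.scheduled_at))


def record_failure(job: Job) -> Job:
    """Apply the Failed or Dead Letter transition after an unsuccessful run."""
    if job.retry_count >= job.max_retries:
        job.status = "dead_letter"   # retries exhausted, needs manual review
    else:
        job.retry_count += 1
        job.status = "failed"        # temporary failure, to be retried
    return job


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
    queue = [
        Job("search.reindex.full", now, priority=2),
        Job("ical.import", now, priority=8),
    ]
    print(pick_next(queue, now).job_type)
```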

Job Types (Examples):

iCal Processing:
- ical.import - Import external calendar feeds
- ical.export - Generate iCal feed file
- ical.sync - Sync availability with external calendars

Search & Indexing:
- search.reindex.space - Reindex single space
- search.reindex.full - Full index rebuild
- search.update_facets - Update aggregated filter counts

Channel Sync:
- channel.sync.export - Export listings to OTA
- channel.sync.import - Import bookings from OTA
- channel.feed.generate - Generate XML/JSON feed

Analytics & Reporting:
- analytics.aggregate_daily - Daily metric rollup
- analytics.generate_report - Generate custom report

Maintenance:
- cleanup.expired_holds - Delete expired holds
- cleanup.archive_events - Move old events to cold storage
- cleanup.purge_sessions - Delete expired sessions

Entity: CacheEntry

Purpose: Manages distributed caching strategy for performance optimization using Redis.

Key Attributes:

  • key (VARCHAR(512), primary key) - Cache key (hierarchical, e.g., org:123:space:456:availability)
  • org_id (UUID, nullable) - Tenant context for multi-tenant caching
  • value (JSONB or TEXT) - Cached data
  • value_type (ENUM) - json | string | integer | list | set | hash
  • ttl_seconds (INTEGER) - Time to live in seconds
  • expires_at (TIMESTAMPTZ) - Absolute expiration time
  • hit_count (INTEGER, default 0) - Number of cache hits
  • last_accessed_at (TIMESTAMPTZ) - Last read timestamp
  • tags (TEXT[]) - Cache invalidation tags (e.g., ['space:456', 'pricing'])
  • created_at (TIMESTAMPTZ, default now())
  • updated_at (TIMESTAMPTZ, default now())

Note: This entity documents the Redis caching strategy. The cached values themselves are stored in Redis; this table tracks metadata for monitoring and invalidation.

Relationships:

  • CacheEntry → Org (*, many-to-one, optional)

Cache Key Hierarchy:

org:{org_id}:space:{space_id}:details
org:{org_id}:space:{space_id}:availability:{start_date}:{end_date}
org:{org_id}:space:{space_id}:pricing:{rate_plan_id}
org:{org_id}:account:{account_id}:permissions
user:{user_id}:session:{session_id}
global:amenities:catalog
global:feature_flags
channel:{channel_id}:feed:{feed_id}
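
A small set of key-builder helpers keeps these patterns in one place. The sketch below is illustrative only: the function names are hypothetical, and formatting dates as ISO 8601 (`YYYY-MM-DD`) is an assumption, since the hierarchy above does not specify a date format.

```python
from datetime import date


def space_details_key(org_id: str, space_id: str) -> str:
    return f"org:{org_id}:space:{space_id}:details"


def space_availability_key(org_id: str, space_id: str,
                           start: date, end: date) -> str:
    # Dates rendered as ISO 8601 (assumption).
    return (f"org:{org_id}:space:{space_id}:availability:"
            f"{start.isoformat()}:{end.isoformat()}")


def permissions_key(org_id: str, account_id: str) -> str:
    return f"org:{org_id}:account:{account_id}:permissions"


def is_org_scoped(key: str, org_id: str) -> bool:
    """Check that a key belongs to the given tenant's namespace."""
    return key.startswith(f"org:{org_id}:")


if __name__ == "__main__":
    print(space_availability_key("123", "456", date(2025, 1, 1), date(2025, 1, 7)))
```

Including the trailing colon in the prefix check prevents org `12` from matching keys that belong to org `123`.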

TTL Strategies:

  • Session data: 24 hours
  • User permissions: 1 hour (invalidate on role change)
  • Space availability: 15 minutes (invalidate on booking/block)
  • Pricing calculations: 1 hour (invalidate on rate change)
  • Content/media URLs: 7 days
  • Static reference data: 24 hours
  • Search results: 5 minutes
  • Channel feeds: 1 hour

Invalidation Patterns:

  • Time-based: Automatic expiration via TTL
  • Event-driven: Invalidate on domain events (booking.created → clear availability cache)
  • Tag-based: Bulk invalidation by resource tags
  • Manual: Admin-triggered cache clear

Business Rules:

  • All cache keys must include org_id for tenant isolation (except global data)
  • Cached data includes version identifier for schema evolution
  • Cache misses trigger database read and cache population
  • Stale data acceptable for non-critical reads (eventual consistency)
  • Critical operations (booking confirmation) bypass cache
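
The read path (a miss triggers a database read and cache population) together with TTL and tag-based invalidation can be sketched as follows. This uses a plain in-memory dictionary as a stand-in for Redis so the example stays self-contained; `TaggedCache`, `get_or_load`, and `invalidate_tag` are hypothetical names, not an existing API.

```python
import time
from typing import Any, Callable, Iterable


class TaggedCache:
    """In-memory stand-in for Redis showing cache-aside with TTL and tags."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, value)
        self._tags: dict[str, set[str]] = {}              # tag -> keys

    def get_or_load(self, key: str, ttl_seconds: float,
                    loader: Callable[[], Any],
                    tags: Iterable[str] = ()) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                  # cache hit
        value = loader()                     # miss or expired: read the source
        self._entries[key] = (now + ttl_seconds, value)
        for tag in tags:
            self._tags.setdefault(tag, set()).add(key)
        return value

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry registered under the tag; return how many."""
        removed = 0
        for key in self._tags.pop(tag, set()):
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed


if __name__ == "__main__":
    cache = TaggedCache()
    key = "org:123:space:456:availability:2025-01-01:2025-01-07"
    value = cache.get_or_load(key, 15 * 60, lambda: {"available": True},
                              tags=["space:456"])
    print(value, cache.invalidate_tag("space:456"))
```

A consumer of booking.created could call `invalidate_tag("space:456")` to clear availability for that space, which corresponds to the event-driven pattern listed above.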

Infrastructure Patterns

Event-Driven Architecture

Pattern: Domains publish events via transactional outbox; consumers react asynchronously.

Components:

  • Event Publishers: Domain services emit events on state changes
  • Outbox Worker: Polls pending events and publishes to event bus
  • Event Bus: Message broker (Redis Streams for MVP, Kafka for scale)
  • Event Consumers: Workers subscribed to event types

Benefits:

  • Loose coupling between domains
  • Eventual consistency across services
  • Enables horizontal scaling of consumers
  • Audit trail of all system changes
  • Supports replay and debugging

Example Flow:

  1. Booking service creates booking → inserts event in outbox
  2. Outbox worker publishes booking.created event to bus
  3. Availability service consumes event → updates calendar
  4. Analytics service consumes event → records metric
  5. Channel service consumes event → triggers sync job
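
The fan-out in steps 2 to 5 can be sketched with a tiny in-process bus. This is only an illustration of the dispatch and of why consumers should be idempotent under at-least-once delivery; it is not Redis Streams or Kafka, and `InMemoryEventBus` and `idempotent` are hypothetical names. Events are represented as plain dictionaries with `id` and `event_type` keys.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryEventBus:
    """Dispatches each published event to every handler for its type."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: dict) -> int:
        """Deliver the event; return the number of handlers invoked."""
        handlers = self._handlers.get(event["event_type"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def idempotent(handler: Handler) -> Handler:
    """Wrap a handler so a redelivered event (same id) is processed once."""
    seen: set[str] = set()

    def wrapper(event: dict) -> None:
        if event["id"] in seen:
            return                      # duplicate delivery, skip
        seen.add(event["id"])
        handler(event)

    return wrapper


if __name__ == "__main__":
    bus = InMemoryEventBus()
    bus.subscribe("booking.created",
                  idempotent(lambda e: print("update calendar", e["aggregate_id"])))
    bus.subscribe("booking.created",
                  idempotent(lambda e: print("record metric", e["event_type"])))
    bus.subscribe("booking.created",
                  idempotent(lambda e: print("enqueue sync job", e["aggregate_id"])))
    event = {"id": "evt-1", "event_type": "booking.created",
             "aggregate_id": "booking-1"}
    bus.publish(event)
    bus.publish(event)  # simulated redelivery: handlers do nothing
```

In a real deployment the set of seen IDs would need to live in durable storage shared by all instances of a consumer, since an in-memory set is lost on restart.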

Database Architecture

Strategy: Single PostgreSQL cluster with logical multi-tenancy (org_id partitioning).

Components:

  • Primary Database: ACID-compliant transactional data store
  • Read Replicas: Horizontally scaled read-only replicas for analytics/reports
  • Connection Pooling: PgBouncer for connection multiplexing
  • Partitioning: Table partitioning by org_id or time (events, audit logs)
  • Indexes: Strategic indexes on org_id, foreign keys, and query patterns

Scaling Strategy:

  • MVP: Single region, single primary + 2 read replicas
  • V1: Multi-region with regional read replicas
  • V2: Sharding by org_id hash for very large customers
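
For the V2 step, sharding by org_id hash could look roughly like the sketch below. The choice of SHA-256 and simple modulo placement is an assumption made for illustration; the specification does not name a hash function or placement scheme, and `shard_for` is a hypothetical helper.

```python
import hashlib
import uuid


def shard_for(org_id: uuid.UUID, shard_count: int) -> int:
    """Map an org to a shard index in [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Use a stable cryptographic hash: Python's built-in hash() is not
    # guaranteed to be stable across processes or versions.
    digest = hashlib.sha256(org_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    org = uuid.UUID("12345678-1234-5678-1234-567812345678")
    print(shard_for(org, 4))
```

Modulo placement moves most orgs when `shard_count` changes, so a scheme with a lookup table or consistent hashing may be a better fit if shards are expected to be rebalanced.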

Backup & Recovery:

  • Continuous WAL archiving for point-in-time recovery
  • Daily full backups (encrypted, 30-day retention)
  • Standby replica for automatic failover
  • Quarterly DR simulations

Service Layer Architecture

Pattern: Modular service-oriented architecture with domain-aligned services.

Core Services:

  • identity-service - Org, Account, User, Membership management
  • authorization-service - Permission evaluation and access control
  • supply-service - Space, Unit, Content management
  • availability-service - Calendar, Booking, Hold, Block management
  • pricing-service - RatePlan, Quote calculation
  • booking-service - Reservation lifecycle orchestration
  • payment-service - Payment, Payout, Transaction processing
  • channel-service - OTA integration and syndication
  • analytics-service - Events, Metrics, Audit processing

Service Communication:

  • Synchronous: REST APIs with JWT authentication
  • Asynchronous: Event bus for cross-domain notifications
  • Orchestration: Saga pattern for distributed transactions

Service Standards:

  • Stateless for horizontal scaling
  • Health check endpoints for liveness/readiness probes
  • Structured logging with correlation IDs
  • Circuit breakers for external dependencies
  • Rate limiting per tenant
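
As an example of one of these standards, a circuit breaker around an external dependency might be sketched as follows. The class, its thresholds, and the half-open behaviour (one trial call after the reset window) are assumptions for illustration rather than a prescribed design.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Reset window elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A service would typically keep one breaker per external dependency (for example, per OTA endpoint) so that one failing integration does not block calls to the others.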

Caching Architecture

Multi-Layer Caching Strategy:

Layer 1 - Application Cache (Redis):

  • Session data
  • User permissions
  • Computed prices
  • Availability fragments
  • TTL-based invalidation

Layer 2 - CDN (CloudFront):

  • Media assets (images, videos)
  • iCal feed files
  • Static content
  • Edge caching for global distribution

Layer 3 - Database Query Cache:

  • Materialized views for complex queries
  • Aggregated metrics
  • Refreshed on schedule or event trigger

Cache Invalidation Strategy:

  • Event-driven: Clear relevant cache on domain events
  • Time-based: Automatic TTL expiration
  • Manual: Admin-triggered cache clear API
  • Pattern-based: Invalidate by key prefix or tags

Job Queue Architecture

Queue Design:

  • Default Queue: General background tasks
  • Critical Queue: High-priority jobs (payment processing)
  • Bulk Queue: Large batch operations (full reindex)
  • Scheduled Queue: Cron-like recurring jobs

Worker Pools:

  • Horizontal scaling: Add worker instances as queue depth grows
  • Dedicated pools per queue for isolation
  • Auto-scaling based on queue depth metrics

Failure Handling:

  • Exponential backoff retry with jitter
  • Dead letter queue for permanent failures
  • Alerting on dead letter threshold
  • Manual retry/cancel interface
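
Exponential backoff with jitter can be sketched in a few lines. The version below uses the "full jitter" variant, where the delay is drawn uniformly between zero and the capped exponential ceiling; the base of 1 second and cap of 60 seconds are assumed values, and `backoff_with_jitter` is a hypothetical helper name.

```python
import random
from typing import Optional


def backoff_with_jitter(attempt: int, base_seconds: float = 1.0,
                        cap_seconds: float = 60.0,
                        rng: Optional[random.Random] = None) -> float:
    """Return a retry delay in seconds for the given zero-based attempt.

    Full jitter: uniform over [0, min(cap, base * 2**attempt)].
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if rng is None:
        rng = random.Random()
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(8):
        print(attempt, round(backoff_with_jitter(attempt), 2))
```

Randomizing the delay spreads retries out so that many jobs failing at the same moment (for example, during an OTA outage) do not all retry at the same instant.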

Rationale

Why Event-Driven Architecture?

  • Enables independent deployment and scaling of services
  • Natural fit for distributed hospitality operations
  • Supports complex workflows (booking → payment → calendar → channel sync)
  • Provides complete audit trail for compliance

Why Transactional Outbox Pattern?

  • Guarantees events are published even during failures
  • Avoids dual-write problem (data + event in same transaction)
  • Supports effectively-once processing, critical for financial operations, when consumers deduplicate at-least-once deliveries by event ID

Why PostgreSQL (not NoSQL)?

  • ACID guarantees critical for bookings and payments
  • Rich relational model fits complex domain relationships
  • JSONB provides flexibility for extensibility
  • Mature ecosystem and operational tooling

Why Redis for Caching?

  • Fast in-memory access for hot data
  • Rich data structures (strings, sets, sorted sets)
  • Pub/sub for cache invalidation
  • TTL built-in

Why Background Jobs?

  • Non-blocking user experience
  • Retry logic for unreliable external integrations
  • Rate limiting for API calls to external systems
  • Scheduled tasks (nightly reports, cleanups)

MVP Implementation (MVP.0)

Infrastructure Stack:

  • Database: PostgreSQL 15+ (single instance + read replica)
  • Cache: Redis 7+ (single instance, persistence enabled)
  • Message Queue: Redis Streams (upgrade to Kafka in V1)
  • Container Orchestration: Kubernetes (GKE, EKS, or AKS)
  • CDN: CloudFront or Cloudflare
  • Object Storage: S3-compatible (AWS S3, Backblaze B2)
  • Monitoring: OpenTelemetry + Prometheus + Grafana

Enabled Patterns:

  • Event publishing via transactional outbox
  • Background job processing (iCal import, search reindex)
  • Redis caching for permissions and availability
  • Basic event bus with at-least-once delivery

Deferred to V1:

  • Kafka/PubSub for high-throughput event streaming
  • Multi-region database replication
  • Advanced caching (multi-level, warming strategies)
  • Auto-scaling worker pools
  • Event replay and time-travel debugging

Future Enhancements

V1.0: Event Streaming Platform

  • Migrate to Kafka or Google Pub/Sub
  • Event schema registry for versioning
  • Stream processing (aggregate metrics in real-time)
  • Event sourcing for critical domains

V1.1: Advanced Caching

  • Cache warming on data updates
  • Predictive caching based on access patterns
  • Multi-region cache replication
  • Cache hit ratio monitoring and optimization

V1.2: Database Scaling

  • Horizontal sharding by org_id
  • Automated shard rebalancing
  • Read replica routing by query type
  • Connection pooling optimization

V2.0: Microservices Evolution

  • Full microservices decomposition
  • Service mesh (Istio, Linkerd)
  • gRPC for inter-service communication
  • API gateway with rate limiting and circuit breaking

V2.1: Data Pipeline Modernization

  • Real-time streaming to data warehouse
  • Change data capture (Debezium) for sync
  • Data lake for historical event archives
  • ML model serving infrastructure

Operational Notes

Monitoring & Observability

Key Metrics:

  • Events: Publish latency, consumer lag, event throughput
  • Jobs: Queue depth, job duration, failure rate
  • Cache: Hit/miss ratio, eviction rate, memory usage
  • Database: Connection pool utilization, query latency, replication lag

Alerting Thresholds:

  • Critical: Database unavailable, event bus down, job queue overflow
  • Warning: High cache miss rate, slow query detected, outbox backlog

Performance Targets

MVP (Villa-only):

  • API response time: p95 < 200ms
  • Event publish latency: < 1s
  • Job execution: 95% within 5 minutes
  • Cache hit ratio: > 80%
  • Database queries: p95 < 50ms

V1 (Multi-property):

  • API response time: p95 < 150ms
  • Event publish latency: < 500ms
  • Job execution: 95% within 2 minutes
  • Cache hit ratio: > 90%

Deployment Strategy

MVP: Blue/green deployment with manual rollout
V1: Canary deployment with automated rollback
V2: Progressive delivery with feature flags

Disaster Recovery

RTO (Recovery Time Objective): 1 hour
RPO (Recovery Point Objective): 5 minutes

DR Procedures:

  • Automatic failover to standby replica
  • Restore from point-in-time backup
  • Replay events from last checkpoint
  • Validate data consistency before cutover

Physical Schema

See 001_initial_schema.sql for complete CREATE TABLE statements.

Summary:

  • 4 primary tables: events, outbox, jobs, cache_metadata
  • 15+ indexes for query performance
  • Partitioning strategies for events and jobs
  • Row-Level Security policies for multi-tenancy


Implementation Priority: HIGH - Cross-cutting infrastructure must be established before domain services can operate reliably.