System Architecture & Data Infrastructure - Domain Specification
First Introduced: MVP.0 (cross-cutting) | Status: Specification Complete | Last Updated: 2025-10-25
Overview
The System Architecture & Data Infrastructure domain defines how all platform domains interconnect physically and logically, spanning databases, services, queues, caches, and analytics layers. As a cross-cutting domain, it provides the technical foundation that lets the platform handle operational workloads (bookings, content updates, syncs) while supporting near-real-time data pipelines, horizontal scalability, and compliance-grade reliability.
This domain establishes patterns for event-driven architecture, transactional outbox for reliable messaging, background job processing, caching strategies, and deployment infrastructure that enable all other domains to operate efficiently and scale gracefully.
Responsibilities
This domain IS responsible for:
- Event-driven architecture patterns and event publishing infrastructure
- Transactional outbox pattern for reliable cross-domain messaging
- Background job queue management and worker orchestration
- Distributed caching strategy (Redis, CDN)
- Database infrastructure patterns (connection pooling, partitioning, replication)
- Service layer orchestration and inter-service communication
- Message queue and event bus infrastructure
- Data pipeline architecture (streaming and batch)
- Infrastructure monitoring and observability patterns
- Deployment architecture and scaling strategies
This domain is NOT responsible for:
- Business logic execution (→ Individual domain services)
- Domain-specific event definitions (→ Each domain defines its own events)
- Application-level audit logging (→ Analytics & Audit domain)
- User-facing API design (→ Individual domain services)
- Data warehouse analytics queries (→ Analytics domain)
- Security policies and access control rules (→ Authorization domain)
Relationships
Depends On:
- Identity & Tenancy - Org/Account context for multi-tenancy
- All domain services - Provides infrastructure that all domains leverage
Depended On By:
- ALL DOMAINS - Every domain uses infrastructure patterns defined here:
- Events published via transactional outbox
- Background jobs for async processing
- Caching for performance optimization
- Database patterns for data persistence
Related Domains:
- Analytics & Audit - Consumes events from event bus
- Search & Indexing - Uses background jobs for reindexing
- Channels & Distribution - Uses jobs for sync operations
Core Concepts
Entity: Event
Purpose: Atomic, immutable record of something that happened in the system, forming the backbone of event-driven architecture.
Key Attributes:
- id(UUID, primary key) - Unique event identifier
- org_id(UUID, foreign key → organizations.id) - Tenant context
- account_id(UUID, nullable, foreign key → accounts.id) - Account context
- user_id(UUID, nullable, foreign key → users.id) - Actor who triggered event
- event_type(VARCHAR, required) - Qualified event name (e.g., booking.created, space.updated)
- aggregate_type(VARCHAR, required) - Entity type (space, booking, payment)
- aggregate_id(UUID, required) - Entity ID
- payload(JSONB, required) - Event data (state changes, context)
- metadata(JSONB) - Additional context (IP, user agent, trace ID)
- source(ENUM) - Origin: api | ui | system | integration | worker
- occurred_at(TIMESTAMPTZ, required) - When event happened (business time)
- published_at(TIMESTAMPTZ) - When event was published to event bus
- version(INTEGER, default 1) - Event schema version
- correlation_id(UUID) - Links related events across services
- causation_id(UUID) - References the event that caused this one
- created_at(TIMESTAMPTZ, default now())
Relationships:
- Event → Org (*, many-to-one)
- Event → Account (*, many-to-one, optional)
- Event → User (*, many-to-one, optional)
- Event → Outbox (1:1, initially) - Published via outbox pattern
Lifecycle:
- Created: When any significant system action occurs
- Published: Moved from outbox to event bus
- Consumed: Processed by downstream services/workers
- Archived: Moved to cold storage after retention period (12-24 months)
Business Rules:
- Events are immutable once created (append-only)
- Must include complete data payload (no foreign key dependencies)
- Event types follow domain.action naming convention
- All events scoped to org_id for multi-tenancy isolation
- Correlation ID chains related events across transaction boundaries
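The rules above can be illustrated with a small event constructor. This is a hedged Python sketch, not part of the schema: the helper name make_event and its exact field set are assumptions for illustration only.

```python
import re
import uuid
from datetime import datetime, timezone

# domain.action naming convention from the business rules
EVENT_TYPE_RE = re.compile(r"^[a-z_]+\.[a-z_]+$")

def make_event(org_id, event_type, aggregate_type, aggregate_id,
               payload, correlation_id=None, causation_id=None):
    """Build an event record following the rules above (treated as append-only)."""
    if not EVENT_TYPE_RE.match(event_type):
        raise ValueError(f"event_type must follow domain.action: {event_type!r}")
    return {
        "id": str(uuid.uuid4()),
        "org_id": org_id,                      # tenant isolation: always required
        "event_type": event_type,
        "aggregate_type": aggregate_type,
        "aggregate_id": aggregate_id,
        "payload": payload,                    # complete state, no foreign-key lookups needed
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        # correlation links a whole workflow; causation points at the direct parent event
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "causation_id": causation_id,
        "version": 1,
    }
```

A downstream event reuses the parent's correlation_id and sets causation_id to the parent's id, which is what lets related events be chained across transaction boundaries.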
Event Types (Examples):
Identity Domain:
- org.created, org.updated, org.suspended
- account.created, account.updated
- user.created, user.login, membership.created
Supply Domain:
- space.created, space.updated, space.published
- unit.created, unit.updated
Availability Domain:
- booking.created, booking.confirmed, booking.canceled
- hold.created, hold.expired, hold.converted
- block.created, block.deleted
Pricing Domain:
- rate_plan.created, rate_plan.updated
- quote.generated, quote.expired
Payments Domain:
- payment.initiated, payment.succeeded, payment.failed
- payout.created, payout.completed
Channels Domain:
- channel_sync.started, channel_sync.completed
- channel_listing.synced
Entity: Outbox
Purpose: Implements the transactional outbox pattern, ensuring reliable at-least-once event publishing even during system failures; idempotent consumers make processing effectively once.
Key Attributes:
- id(UUID, primary key)
- event_id(UUID, foreign key → events.id, unique) - Reference to published event
- org_id(UUID, foreign key → organizations.id) - Tenant partition
- event_type(VARCHAR, required) - Event type for filtering/routing
- aggregate_type(VARCHAR, required) - Entity type
- aggregate_id(UUID, required) - Entity ID
- payload(JSONB, required) - Complete event payload
- status(ENUM) - pending | published | failed | dead_letter
- retry_count(INTEGER, default 0) - Number of publish attempts
- next_retry_at(TIMESTAMPTZ) - Scheduled retry time (exponential backoff)
- published_at(TIMESTAMPTZ) - When successfully published to event bus
- error_message(TEXT) - Last error if status=failed
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Relationships:
- Outbox → Event (1:1) - Each outbox entry publishes one event
- Outbox → Org (*, many-to-one)
Lifecycle:
- Created: When the event is generated; inserted in the same transaction as the business data change
- Pending: Awaiting publication by outbox worker
- Published: Successfully sent to event bus
- Failed: Temporary failure, scheduled for retry
- Dead Letter: Permanent failure after max retries (requires manual intervention)
Business Rules:
- Events are inserted into the outbox in the same database transaction as the source data change
- Guarantees each event is published at least once
- Outbox worker polls for pending events (idempotent publish)
- Exponential backoff retry strategy (1s, 2s, 4s, 8s, 16s, 30s, 60s)
- Dead letter after 7 retries or 24 hours
- Published events retained for 7 days, then archived
Pattern Implementation:
-- Example: creating a booking with its outbox event in one transaction
-- (illustrative; RETURNING ... INTO only works inside PL/pgSQL, so plain SQL
-- chains the inserts with data-modifying CTEs instead)
BEGIN;
WITH new_booking AS (
  -- 1. Insert the business entity
  INSERT INTO bookings (org_id, space_id, checkin, checkout, ...)
  VALUES (...)
  RETURNING id, org_id
), new_event AS (
  -- 2. Insert the event, referencing the new booking
  INSERT INTO events (org_id, event_type, aggregate_type, aggregate_id, payload)
  SELECT org_id, 'booking.created', 'booking', id, payload_json FROM new_booking
  RETURNING id, org_id, event_type, aggregate_type, aggregate_id, payload
)
-- 3. Insert into the outbox (same transaction)
INSERT INTO outbox (event_id, org_id, event_type, aggregate_type, aggregate_id, payload, status)
SELECT id, org_id, event_type, aggregate_type, aggregate_id, payload, 'pending' FROM new_event;
COMMIT;
-- The outbox worker publishes events asynchronously.
-- If the transaction fails, nothing is committed (consistency guaranteed).
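The worker side of the pattern can be sketched as the per-row publish attempt below, using the retry schedule and dead-letter threshold from the business rules. This is an illustrative in-memory model in Python; a real worker would claim rows with SELECT ... FOR UPDATE SKIP LOCKED and persist these status updates.

```python
from datetime import datetime, timedelta, timezone

BACKOFF_SECONDS = [1, 2, 4, 8, 16, 30, 60]   # schedule from the business rules
MAX_RETRIES = 7                              # then the row is dead-lettered

def process_outbox_row(row, publish, now=None):
    """One publish attempt for an outbox row the worker has already claimed.

    `publish` is any callable that raises on failure. Downstream consumers
    must be idempotent, since this gives at-least-once delivery.
    """
    now = now or datetime.now(timezone.utc)
    try:
        publish(row["payload"])
        row["status"] = "published"
        row["published_at"] = now
    except Exception as exc:
        row["retry_count"] += 1
        if row["retry_count"] >= MAX_RETRIES:
            row["status"] = "dead_letter"    # needs manual intervention
        else:
            row["status"] = "failed"
            delay = BACKOFF_SECONDS[min(row["retry_count"] - 1,
                                        len(BACKOFF_SECONDS) - 1)]
            row["next_retry_at"] = now + timedelta(seconds=delay)
        row["error_message"] = str(exc)
    return row
```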
Entity: Job
Purpose: Manages background, asynchronous, and scheduled tasks that don't block user requests.
Key Attributes:
- id(UUID, primary key)
- org_id(UUID, nullable, foreign key → organizations.id) - Tenant context (null for system jobs)
- account_id(UUID, nullable, foreign key → accounts.id) - Account context
- job_type(VARCHAR, required) - Qualified job name (e.g., ical.import, search.reindex)
- queue(VARCHAR, default 'default') - Queue name for prioritization
- priority(INTEGER, default 5) - Higher = more urgent (1-10 scale)
- payload(JSONB, required) - Job parameters and context
- status(ENUM) - pending | running | completed | failed | canceled | dead_letter
- worker_id(VARCHAR) - Identifier of worker processing this job
- scheduled_at(TIMESTAMPTZ, required) - When job should run
- started_at(TIMESTAMPTZ) - When job began execution
- completed_at(TIMESTAMPTZ) - When job finished
- retry_count(INTEGER, default 0) - Execution attempts
- max_retries(INTEGER, default 3) - Maximum retry attempts
- next_retry_at(TIMESTAMPTZ) - Next scheduled retry (exponential backoff)
- timeout_seconds(INTEGER, default 300) - Max execution time
- result(JSONB) - Job output data
- error_message(TEXT) - Error details if failed
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Relationships:
- Job → Org (*, many-to-one, optional)
- Job → Account (*, many-to-one, optional)
Lifecycle:
- Created: Job enqueued by service or scheduled by cron
- Pending: Waiting in queue
- Running: Worker acquired lock and executing
- Completed: Successful execution
- Failed: Temporary failure, scheduled for retry
- Dead Letter: Max retries exceeded, needs manual review
- Canceled: Explicitly canceled by admin or timeout
Business Rules:
- Jobs are idempotent (safe to retry)
- Jobs include retry logic with exponential backoff
- Failed jobs moved to dead letter queue after max retries
- Long-running jobs must support timeouts and graceful shutdown
- Jobs scoped to org_id for tenant isolation (except system jobs)
- Queue names enable priority-based execution (critical, default, low)
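The idempotency rule is what makes at-least-once delivery safe to retry. A minimal sketch, assuming a persistent dedupe store (e.g. a unique-keyed results table) stands behind the completed_ids set; the names here are illustrative, not a prescribed API.

```python
def run_job_idempotently(job, handler, completed_ids):
    """Skip jobs that already completed, so duplicate deliveries are harmless.

    `completed_ids` models a durable dedupe store keyed by job id.
    """
    if job["id"] in completed_ids:
        return "skipped"              # duplicate delivery: side effects not repeated
    handler(job["payload"])           # the actual work (import, reindex, sync, ...)
    completed_ids.add(job["id"])      # record success before acking the queue
    return "completed"
```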
Job Types (Examples):
iCal Processing:
- ical.import - Import external calendar feeds
- ical.export - Generate iCal feed file
- ical.sync - Sync availability with external calendars
Search & Indexing:
- search.reindex.space - Reindex single space
- search.reindex.full - Full index rebuild
- search.update_facets - Update aggregated filter counts
Channel Sync:
- channel.sync.export - Export listings to OTA
- channel.sync.import - Import bookings from OTA
- channel.feed.generate - Generate XML/JSON feed
Analytics & Reporting:
- analytics.aggregate_daily - Daily metric rollup
- analytics.generate_report - Generate custom report
Maintenance:
- cleanup.expired_holds - Delete expired holds
- cleanup.archive_events - Move old events to cold storage
- cleanup.purge_sessions - Delete expired sessions
Entity: CacheEntry
Purpose: Manages distributed caching strategy for performance optimization using Redis.
Key Attributes:
- key(VARCHAR(512), primary key) - Cache key (hierarchical, e.g., org:123:space:456:availability)
- org_id(UUID, nullable) - Tenant context for multi-tenant caching
- value(JSONB or TEXT) - Cached data
- value_type(ENUM) - json | string | integer | list | set | hash
- ttl_seconds(INTEGER) - Time to live in seconds
- expires_at(TIMESTAMPTZ) - Absolute expiration time
- hit_count(INTEGER, default 0) - Number of cache hits
- last_accessed_at(TIMESTAMPTZ) - Last read timestamp
- tags(TEXT[]) - Cache invalidation tags (e.g., ['space:456', 'pricing'])
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Note: This entity documents the Redis caching strategy. The actual cache lives in Redis; this table tracks metadata for monitoring and invalidation.
Relationships:
- CacheEntry → Org (*, many-to-one, optional)
Cache Key Hierarchy:
org:{org_id}:space:{space_id}:details
org:{org_id}:space:{space_id}:availability:{start_date}:{end_date}
org:{org_id}:space:{space_id}:pricing:{rate_plan_id}
org:{org_id}:account:{account_id}:permissions
user:{user_id}:session:{session_id}
global:amenities:catalog
global:feature_flags
channel:{channel_id}:feed:{feed_id}
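A small helper can keep key construction consistent with this hierarchy. The function names below are illustrative, not a prescribed API.

```python
def cache_key(*parts):
    """Join hierarchical cache-key segments with ':' separators."""
    return ":".join(str(p) for p in parts)

def space_availability_key(org_id, space_id, start_date, end_date):
    # Tenant-scoped key matching the hierarchy above
    return cache_key("org", org_id, "space", space_id,
                     "availability", start_date, end_date)
```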
TTL Strategies:
- Session data: 24 hours
- User permissions: 1 hour (invalidate on role change)
- Space availability: 15 minutes (invalidate on booking/block)
- Pricing calculations: 1 hour (invalidate on rate change)
- Content/media URLs: 7 days
- Static reference data: 24 hours
- Search results: 5 minutes
- Channel feeds: 1 hour
Invalidation Patterns:
- Time-based: Automatic expiration via TTL
- Event-driven: Invalidate on domain events (booking.created → clear availability cache)
- Tag-based: Bulk invalidation by resource tags
- Manual: Admin-triggered cache clear
Business Rules:
- All cache keys must include org_id for tenant isolation (except global data)
- Cached data includes version identifier for schema evolution
- Cache misses trigger database read and cache population
- Stale data acceptable for non-critical reads (eventual consistency)
- Critical operations (booking confirmation) bypass cache
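The cache-miss rule (read the database, then populate the cache) is the classic cache-aside pattern. A minimal in-memory sketch, with a dict standing in for Redis and an injectable clock for testability:

```python
import time

class CacheAside:
    """Read-through cache-aside sketch; Redis GET/SETEX would replace the dict."""

    def __init__(self, clock=time.monotonic):
        self._store = {}             # key -> (value, expires_at)
        self._clock = clock

    def get_or_load(self, key, loader, ttl_seconds):
        now = self._clock()
        hit = self._store.get(key)
        if hit and hit[1] > now:
            return hit[0]                          # cache hit
        value = loader()                           # miss: read from the database
        self._store[key] = (value, now + ttl_seconds)
        return value
```

Critical paths such as booking confirmation would call the loader directly and skip this layer entirely, per the rule above.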
Infrastructure Patterns
Event-Driven Architecture
Pattern: Domains publish events via transactional outbox; consumers react asynchronously.
Components:
- Event Publishers: Domain services emit events on state changes
- Outbox Worker: Polls pending events and publishes to event bus
- Event Bus: Message broker (Redis Streams for MVP, Kafka for scale)
- Event Consumers: Workers subscribed to event types
Benefits:
- Loose coupling between domains
- Eventual consistency across services
- Enables horizontal scaling of consumers
- Audit trail of all system changes
- Supports replay and debugging
Example Flow:
- Booking service creates booking → inserts event in outbox
- Outbox worker publishes the booking.created event to the bus
- Availability service consumes event → updates calendar
- Analytics service consumes event → records metric
- Channel service consumes event → triggers sync job
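The fan-out above can be sketched with a toy in-process bus; a real deployment would use Redis Streams or Kafka consumer groups instead, but the decoupling is the same.

```python
from collections import defaultdict

class EventBus:
    """Toy in-process bus showing the fan-out in the flow above."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        # Every subscriber reacts independently: the publisher knows nothing
        # about availability, analytics, or channel services.
        for handler in self._handlers[event["event_type"]]:
            handler(event)
```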
Database Architecture
Strategy: Single PostgreSQL cluster with logical multi-tenancy (org_id partitioning).
Components:
- Primary Database: ACID-compliant transactional data store
- Read Replicas: Horizontally scaled read-only replicas for analytics/reports
- Connection Pooling: PgBouncer for connection multiplexing
- Partitioning: Table partitioning by org_id or time (events, audit logs)
- Indexes: Strategic indexes on org_id, foreign keys, and query patterns
Scaling Strategy:
- MVP: Single region, single primary + 2 read replicas
- V1: Multi-region with regional read replicas
- V2: Sharding by org_id hash for very large customers
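The V2 sharding step could select a shard by hashing org_id. A hedged sketch: md5 is used here only because it is stable across processes (unlike Python's built-in hash); the production hash function and shard-count strategy remain open decisions.

```python
import hashlib

def shard_for(org_id, shard_count):
    """Stable shard assignment by hashing org_id (illustrative only)."""
    digest = hashlib.md5(str(org_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count
```

Note that plain modulo hashing reassigns most tenants when shard_count changes; consistent hashing or a tenant-to-shard directory table would avoid that, which is part of why automated shard rebalancing is called out as a V1.2 enhancement.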
Backup & Recovery:
- Continuous WAL archiving for point-in-time recovery
- Daily full backups (encrypted, 30-day retention)
- Standby replica for automatic failover
- Quarterly DR simulations
Service Layer Architecture
Pattern: Modular service-oriented architecture with domain-aligned services.
Core Services:
- identity-service - Org, Account, User, Membership management
- authorization-service - Permission evaluation and access control
- supply-service - Space, Unit, Content management
- availability-service - Calendar, Booking, Hold, Block management
- pricing-service - RatePlan, Quote calculation
- booking-service - Reservation lifecycle orchestration
- payment-service - Payment, Payout, Transaction processing
- channel-service - OTA integration and syndication
- analytics-service - Events, Metrics, Audit processing
Service Communication:
- Synchronous: REST APIs with JWT authentication
- Asynchronous: Event bus for cross-domain notifications
- Orchestration: Saga pattern for distributed transactions
Service Standards:
- Stateless for horizontal scaling
- Health check endpoints for liveness/readiness probes
- Structured logging with correlation IDs
- Circuit breakers for external dependencies
- Rate limiting per tenant
Caching Architecture
Multi-Layer Caching Strategy:
Layer 1 - Application Cache (Redis):
- Session data
- User permissions
- Computed prices
- Availability fragments
- TTL-based invalidation
Layer 2 - CDN (CloudFront):
- Media assets (images, videos)
- iCal feed files
- Static content
- Edge caching for global distribution
Layer 3 - Database Query Cache:
- Materialized views for complex queries
- Aggregated metrics
- Refreshed on schedule or event trigger
Cache Invalidation Strategy:
- Event-driven: Clear relevant cache on domain events
- Time-based: Automatic TTL expiration
- Manual: Admin-triggered cache clear API
- Pattern-based: Invalidate by key prefix or tags
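Tag-based invalidation can be modeled as a reverse index from tag to keys; in Redis this is typically maintained as a SET per tag. An illustrative in-memory sketch:

```python
from collections import defaultdict

class TaggedCache:
    """Sketch of tag-based bulk invalidation (a Redis SET per tag in practice)."""

    def __init__(self):
        self._values = {}
        self._by_tag = defaultdict(set)   # tag -> keys carrying that tag

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._by_tag[tag].add(key)

    def invalidate_tag(self, tag):
        # Bulk-invalidate every key tagged with this resource
        for key in self._by_tag.pop(tag, set()):
            self._values.pop(key, None)

    def get(self, key):
        return self._values.get(key)
```

An event consumer wired to booking.created could then call invalidate_tag("space:456") to clear every availability and pricing entry for that space at once.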
Job Queue Architecture
Queue Design:
- Default Queue: General background tasks
- Critical Queue: High-priority jobs (payment processing)
- Bulk Queue: Large batch operations (full reindex)
- Scheduled Queue: Cron-like recurring jobs
Worker Pools:
- Horizontal scaling: Add worker instances as queue depth grows
- Dedicated pools per queue for isolation
- Auto-scaling based on queue depth metrics
Failure Handling:
- Exponential backoff retry with jitter
- Dead letter queue for permanent failures
- Alerting on dead letter threshold
- Manual retry/cancel interface
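Exponential backoff with jitter spreads retries out so failed jobs do not stampede the downstream system when it recovers. A minimal full-jitter sketch (delay drawn uniformly from zero up to the capped exponential bound):

```python
import random

def retry_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng() * min(cap, base * (2 ** attempt))
```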
Rationale
Why Event-Driven Architecture?
- Enables independent deployment and scaling of services
- Natural fit for distributed hospitality operations
- Supports complex workflows (booking → payment → calendar → channel sync)
- Provides complete audit trail for compliance
Why Transactional Outbox Pattern?
- Guarantees events are published even during failures
- Avoids dual-write problem (data + event in same transaction)
- Combined with idempotent consumers, enables effectively-once semantics critical for financial operations
Why PostgreSQL (not NoSQL)?
- ACID guarantees critical for bookings and payments
- Rich relational model fits complex domain relationships
- JSONB provides flexibility for extensibility
- Mature ecosystem and operational tooling
Why Redis for Caching?
- Fast in-memory access for hot data
- Rich data structures (strings, sets, sorted sets)
- Pub/sub for cache invalidation
- TTL built-in
Why Background Jobs?
- Non-blocking user experience
- Retry logic for unreliable external integrations
- Rate limiting for API calls to external systems
- Scheduled tasks (nightly reports, cleanups)
MVP Implementation (MVP.0)
Infrastructure Stack:
- Database: PostgreSQL 15+ (single instance + read replica)
- Cache: Redis 7+ (single instance, persistence enabled)
- Message Queue: Redis Streams (upgrade to Kafka in V1)
- Container Orchestration: Kubernetes (GKE, EKS, or AKS)
- CDN: CloudFront or Cloudflare
- Object Storage: S3-compatible (AWS S3, Backblaze B2)
- Monitoring: OpenTelemetry + Prometheus + Grafana
Enabled Patterns:
- Event publishing via transactional outbox
- Background job processing (iCal import, search reindex)
- Redis caching for permissions and availability
- Basic event bus with at-least-once delivery
Deferred to V1:
- Kafka/PubSub for high-throughput event streaming
- Multi-region database replication
- Advanced caching (multi-level, warming strategies)
- Auto-scaling worker pools
- Event replay and time-travel debugging
Future Enhancements
V1.0: Event Streaming Platform
- Migrate to Kafka or Google Pub/Sub
- Event schema registry for versioning
- Stream processing (aggregate metrics in real-time)
- Event sourcing for critical domains
V1.1: Advanced Caching
- Cache warming on data updates
- Predictive caching based on access patterns
- Multi-region cache replication
- Cache hit ratio monitoring and optimization
V1.2: Database Scaling
- Horizontal sharding by org_id
- Automated shard rebalancing
- Read replica routing by query type
- Connection pooling optimization
V2.0: Microservices Evolution
- Full microservices decomposition
- Service mesh (Istio, Linkerd)
- gRPC for inter-service communication
- API gateway with rate limiting and circuit breaking
V2.1: Data Pipeline Modernization
- Real-time streaming to data warehouse
- Change data capture (Debezium) for sync
- Data lake for historical event archives
- ML model serving infrastructure
Operational Notes
Monitoring & Observability
Key Metrics:
- Events: Publish latency, consumer lag, event throughput
- Jobs: Queue depth, job duration, failure rate
- Cache: Hit/miss ratio, eviction rate, memory usage
- Database: Connection pool utilization, query latency, replication lag
Alerting Thresholds:
- Critical: Database unavailable, event bus down, job queue overflow
- Warning: High cache miss rate, slow query detected, outbox backlog
Performance Targets
MVP (Villa-only):
- API response time: p95 < 200ms
- Event publish latency: < 1s
- Job execution: 95% within 5 minutes
- Cache hit ratio: > 80%
- Database queries: p95 < 50ms
V1 (Multi-property):
- API response time: p95 < 150ms
- Event publish latency: < 500ms
- Job execution: 95% within 2 minutes
- Cache hit ratio: > 90%
Deployment Strategy
- MVP: Blue/green deployment with manual rollout
- V1: Canary deployment with automated rollback
- V2: Progressive delivery with feature flags
Disaster Recovery
RTO (Recovery Time Objective): 1 hour
RPO (Recovery Point Objective): 5 minutes
DR Procedures:
- Automatic failover to standby replica
- Restore from point-in-time backup
- Replay events from last checkpoint
- Validate data consistency before cutover
Physical Schema
See 001_initial_schema.sql for complete CREATE TABLE statements.
Summary:
- 4 primary tables: events, outbox, jobs, cache_metadata
- 15+ indexes for query performance
- Partitioning strategies for events and jobs
- Row-Level Security policies for multi-tenancy
Related Documents
- Platform Specification - Complete platform overview
- [Data Model](../../TVL Data and Domain Model Specification 2025-10-21 (1).md) - Detailed entity definitions
- Analytics Domain - Event consumption patterns
- Search Domain - Background job usage
- MVP.0 Overview
- Architecture Decision Records - Infrastructure decisions
Implementation Priority: HIGH - Cross-cutting infrastructure must be established before domain services can operate reliably.