System Architecture & Data Infrastructure - Domain Specification
First Introduced: MVP.0 (cross-cutting) | Status: Specification Complete | Last Updated: 2025-10-25
Overview
The System Architecture & Data Infrastructure domain defines how all platform domains interconnect physically and logically, spanning databases, services, queues, caches, and analytics layers. As a cross-cutting domain, it provides the technical foundation that lets the platform handle operational workloads (bookings, content updates, syncs) while supporting near-real-time data pipelines, horizontal scalability, and compliance-grade reliability.
This domain establishes patterns for event-driven architecture, transactional outbox for reliable messaging, background job processing, caching strategies, and deployment infrastructure that enable all other domains to operate efficiently and scale gracefully.
Responsibilities
This domain IS responsible for:
- Event-driven architecture patterns and event publishing infrastructure
- Transactional outbox pattern for reliable cross-domain messaging
- Background job queue management and worker orchestration
- Distributed caching strategy (Redis, CDN)
- Database infrastructure patterns (connection pooling, partitioning, replication)
- Service layer orchestration and inter-service communication
- Message queue and event bus infrastructure
- Data pipeline architecture (streaming and batch)
- Infrastructure monitoring and observability patterns
- Deployment architecture and scaling strategies
This domain is NOT responsible for:
- Business logic execution (→ Individual domain services)
- Domain-specific event definitions (→ Each domain defines its own events)
- Application-level audit logging (→ Analytics & Audit domain)
- User-facing API design (→ Individual domain services)
- Data warehouse analytics queries (→ Analytics domain)
- Security policies and access control rules (→ Authorization domain)
Relationships
Depends On:
- Identity & Tenancy - Org/Account context for multi-tenancy
- All domain services - Provides infrastructure that all domains leverage
Depended On By:
- ALL DOMAINS - Every domain uses infrastructure patterns defined here:
- Events published via transactional outbox
- Background jobs for async processing
- Caching for performance optimization
- Database patterns for data persistence
Related Domains:
- Analytics & Audit - Consumes events from event bus
- Search & Indexing - Uses background jobs for reindexing
- Channels & Distribution - Uses jobs for sync operations
Core Concepts
Entity: Event
Purpose: Atomic, immutable record of something that happened in the system, forming the backbone of event-driven architecture.
Key Attributes:
- id(UUID, primary key) - Unique event identifier
- org_id(UUID, foreign key → organizations.id) - Tenant context
- account_id(UUID, nullable, foreign key → accounts.id) - Account context
- user_id(UUID, nullable, foreign key → users.id) - Actor who triggered event
- event_type(VARCHAR, required) - Qualified event name (e.g., booking.created, space.updated)
- aggregate_type(VARCHAR, required) - Entity type (space, booking, payment)
- aggregate_id(UUID, required) - Entity ID
- payload(JSONB, required) - Event data (state changes, context)
- metadata(JSONB) - Additional context (IP, user agent, trace ID)
- source(ENUM) - Origin: api | ui | system | integration | worker
- occurred_at(TIMESTAMPTZ, required) - When event happened (business time)
- published_at(TIMESTAMPTZ) - When event was published to event bus
- version(INTEGER, default 1) - Event schema version
- correlation_id(UUID) - Links related events across services
- causation_id(UUID) - References the event that caused this one
- created_at(TIMESTAMPTZ, default now())
Relationships:
- Event → Org (*, many-to-one)
- Event → Account (*, many-to-one, optional)
- Event → User (*, many-to-one, optional)
- Event → Outbox (1:1, initially) - Published via outbox pattern
Lifecycle:
- Created: When any significant system action occurs
- Published: Moved from outbox to event bus
- Consumed: Processed by downstream services/workers
- Archived: Moved to cold storage after retention period (12-24 months)
Business Rules:
- Events are immutable once created (append-only)
- Must include complete data payload (no foreign key dependencies)
- Event types follow domain.action naming convention
- All events scoped to org_id for multi-tenancy isolation
- Correlation ID chains related events across transaction boundaries
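The rules above can be illustrated with a small event constructor. This is a hedged Python sketch, not part of the schema: the helper name make_event and its exact field set are assumptions for illustration only.

```python
import re
import uuid
from datetime import datetime, timezone

# domain.action naming convention from the business rules
EVENT_TYPE_RE = re.compile(r"^[a-z_]+\.[a-z_]+$")

def make_event(org_id, event_type, aggregate_type, aggregate_id,
               payload, correlation_id=None, causation_id=None):
    """Build an event record following the rules above (treated as append-only)."""
    if not EVENT_TYPE_RE.match(event_type):
        raise ValueError(f"event_type must follow domain.action: {event_type!r}")
    return {
        "id": str(uuid.uuid4()),
        "org_id": org_id,                      # tenant isolation: always required
        "event_type": event_type,
        "aggregate_type": aggregate_type,
        "aggregate_id": aggregate_id,
        "payload": payload,                    # complete state, no foreign-key lookups needed
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        # correlation links a whole workflow; causation points at the direct parent event
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "causation_id": causation_id,
        "version": 1,
    }
```

A downstream event reuses the parent's correlation_id and sets causation_id to the parent's id, which is what lets related events be chained across transaction boundaries.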
Event Types (Examples):
Identity Domain:
- org.created, org.updated, org.suspended
- account.created, account.updated
- user.created, user.login, membership.created
Supply Domain:
- space.created, space.updated, space.published
- unit.created, unit.updated
Availability Domain:
- booking.created, booking.confirmed, booking.canceled
- hold.created, hold.expired, hold.converted
- block.created, block.deleted
Pricing Domain:
- rate_plan.created, rate_plan.updated
- quote.generated, quote.expired
Payments Domain:
- payment.initiated, payment.succeeded, payment.failed
- payout.created, payout.completed
Channels Domain:
- channel_sync.started, channel_sync.completed
- channel_listing.synced
Entity: Outbox
Purpose: Implements the transactional outbox pattern, ensuring reliable at-least-once event publishing even during system failures; idempotent consumers make processing effectively once.
Key Attributes:
- id(UUID, primary key)
- event_id(UUID, foreign key → events.id, unique) - Reference to published event
- org_id(UUID, foreign key → organizations.id) - Tenant partition
- event_type(VARCHAR, required) - Event type for filtering/routing
- aggregate_type(VARCHAR, required) - Entity type
- aggregate_id(UUID, required) - Entity ID
- payload(JSONB, required) - Complete event payload
- status(ENUM) - pending | published | failed | dead_letter
- retry_count(INTEGER, default 0) - Number of publish attempts
- next_retry_at(TIMESTAMPTZ) - Scheduled retry time (exponential backoff)
- published_at(TIMESTAMPTZ) - When successfully published to event bus
- error_message(TEXT) - Last error if status=failed
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Relationships:
- Outbox → Event (1:1) - Each outbox entry publishes one event
- Outbox → Org (*, many-to-one)
Lifecycle:
- Created: When the event is generated; inserted in the same transaction as the business data change
- Pending: Awaiting publication by outbox worker
- Published: Successfully sent to event bus
- Failed: Temporary failure, scheduled for retry
- Dead Letter: Permanent failure after max retries (requires manual intervention)
Business Rules:
- Events are inserted into the outbox in the same database transaction as the source data change
- Guarantees each event is published at least once
- Outbox worker polls for pending events (idempotent publish)
- Exponential backoff retry strategy (1s, 2s, 4s, 8s, 16s, 30s, 60s)
- Dead letter after 7 retries or 24 hours
- Published events retained for 7 days, then archived
Pattern Implementation:
-- Example: creating a booking with its outbox event in one transaction
-- (illustrative; RETURNING ... INTO only works inside PL/pgSQL, so plain SQL
-- chains the inserts with data-modifying CTEs instead)
BEGIN;
WITH new_booking AS (
  -- 1. Insert the business entity
  INSERT INTO bookings (org_id, space_id, checkin, checkout, ...)
  VALUES (...)
  RETURNING id, org_id
), new_event AS (
  -- 2. Insert the event, referencing the new booking
  INSERT INTO events (org_id, event_type, aggregate_type, aggregate_id, payload)
  SELECT org_id, 'booking.created', 'booking', id, payload_json FROM new_booking
  RETURNING id, org_id, event_type, aggregate_type, aggregate_id, payload
)
-- 3. Insert into the outbox (same transaction)
INSERT INTO outbox (event_id, org_id, event_type, aggregate_type, aggregate_id, payload, status)
SELECT id, org_id, event_type, aggregate_type, aggregate_id, payload, 'pending' FROM new_event;
COMMIT;
-- The outbox worker publishes events asynchronously.
-- If the transaction fails, nothing is committed (consistency guaranteed).
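The worker side of the pattern can be sketched as the per-row publish attempt below, using the retry schedule and dead-letter threshold from the business rules. This is an illustrative in-memory model in Python; a real worker would claim rows with SELECT ... FOR UPDATE SKIP LOCKED and persist these status updates.

```python
from datetime import datetime, timedelta, timezone

BACKOFF_SECONDS = [1, 2, 4, 8, 16, 30, 60]   # schedule from the business rules
MAX_RETRIES = 7                              # then the row is dead-lettered

def process_outbox_row(row, publish, now=None):
    """One publish attempt for an outbox row the worker has already claimed.

    `publish` is any callable that raises on failure. Downstream consumers
    must be idempotent, since this gives at-least-once delivery.
    """
    now = now or datetime.now(timezone.utc)
    try:
        publish(row["payload"])
        row["status"] = "published"
        row["published_at"] = now
    except Exception as exc:
        row["retry_count"] += 1
        if row["retry_count"] >= MAX_RETRIES:
            row["status"] = "dead_letter"    # needs manual intervention
        else:
            row["status"] = "failed"
            delay = BACKOFF_SECONDS[min(row["retry_count"] - 1,
                                        len(BACKOFF_SECONDS) - 1)]
            row["next_retry_at"] = now + timedelta(seconds=delay)
        row["error_message"] = str(exc)
    return row
```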
Entity: Job
Purpose: Manages background, asynchronous, and scheduled tasks that don't block user requests.
Key Attributes:
- id(UUID, primary key)
- org_id(UUID, nullable, foreign key → organizations.id) - Tenant context (null for system jobs)
- account_id(UUID, nullable, foreign key → accounts.id) - Account context
- job_type(VARCHAR, required) - Qualified job name (e.g., ical.import, search.reindex)
- queue(VARCHAR, default 'default') - Queue name for prioritization
- priority(INTEGER, default 5) - Higher = more urgent (1-10 scale)
- payload(JSONB, required) - Job parameters and context
- status(ENUM) - pending | running | completed | failed | canceled | dead_letter
- worker_id(VARCHAR) - Identifier of worker processing this job
- scheduled_at(TIMESTAMPTZ, required) - When job should run
- started_at(TIMESTAMPTZ) - When job began execution
- completed_at(TIMESTAMPTZ) - When job finished
- retry_count(INTEGER, default 0) - Execution attempts
- max_retries(INTEGER, default 3) - Maximum retry attempts
- next_retry_at(TIMESTAMPTZ) - Next scheduled retry (exponential backoff)
- timeout_seconds(INTEGER, default 300) - Max execution time
- result(JSONB) - Job output data
- error_message(TEXT) - Error details if failed
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Relationships:
- Job → Org (*, many-to-one, optional)
- Job → Account (*, many-to-one, optional)
Lifecycle:
- Created: Job enqueued by service or scheduled by cron
- Pending: Waiting in queue
- Running: Worker acquired lock and executing
- Completed: Successful execution
- Failed: Temporary failure, scheduled for retry
- Dead Letter: Max retries exceeded, needs manual review
- Canceled: Explicitly canceled by admin or timeout
Business Rules:
- Jobs are idempotent (safe to retry)
- Jobs include retry logic with exponential backoff
- Failed jobs moved to dead letter queue after max retries
- Long-running jobs must support timeouts and graceful shutdown
- Jobs scoped to org_id for tenant isolation (except system jobs)
- Queue names enable priority-based execution (critical, default, low)
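The idempotency rule is what makes at-least-once delivery safe to retry. A minimal sketch, assuming a persistent dedupe store (e.g. a unique-keyed results table) stands behind the completed_ids set; the names here are illustrative, not a prescribed API.

```python
def run_job_idempotently(job, handler, completed_ids):
    """Skip jobs that already completed, so duplicate deliveries are harmless.

    `completed_ids` models a durable dedupe store keyed by job id.
    """
    if job["id"] in completed_ids:
        return "skipped"              # duplicate delivery: side effects not repeated
    handler(job["payload"])           # the actual work (import, reindex, sync, ...)
    completed_ids.add(job["id"])      # record success before acking the queue
    return "completed"
```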
Job Types (Examples):
iCal Processing:
- ical.import - Import external calendar feeds
- ical.export - Generate iCal feed file
- ical.sync - Sync availability with external calendars
Search & Indexing:
- search.reindex.space - Reindex single space
- search.reindex.full - Full index rebuild
- search.update_facets - Update aggregated filter counts
Channel Sync:
- channel.sync.export - Export listings to OTA
- channel.sync.import - Import bookings from OTA
- channel.feed.generate - Generate XML/JSON feed
Analytics & Reporting:
- analytics.aggregate_daily - Daily metric rollup
- analytics.generate_report - Generate custom report
Maintenance:
- cleanup.expired_holds - Delete expired holds
- cleanup.archive_events - Move old events to cold storage
- cleanup.purge_sessions - Delete expired sessions
Entity: CacheEntry
Purpose: Manages distributed caching strategy for performance optimization using Redis.
Key Attributes:
- key(VARCHAR(512), primary key) - Cache key (hierarchical, e.g., org:123:space:456:availability)
- org_id(UUID, nullable) - Tenant context for multi-tenant caching
- value(JSONB or TEXT) - Cached data
- value_type(ENUM) - json | string | integer | list | set | hash
- ttl_seconds(INTEGER) - Time to live in seconds
- expires_at(TIMESTAMPTZ) - Absolute expiration time
- hit_count(INTEGER, default 0) - Number of cache hits
- last_accessed_at(TIMESTAMPTZ) - Last read timestamp
- tags(TEXT[]) - Cache invalidation tags (e.g., ['space:456', 'pricing'])
- created_at(TIMESTAMPTZ, default now())
- updated_at(TIMESTAMPTZ, default now())
Note: This entity documents the Redis caching strategy. The actual cache lives in Redis; this table tracks metadata for monitoring and invalidation.
Relationships:
- CacheEntry → Org (*, many-to-one, optional)
Cache Key Hierarchy:
org:{org_id}:space:{space_id}:details
org:{org_id}:space:{space_id}:availability:{start_date}:{end_date}
org:{org_id}:space:{space_id}:pricing:{rate_plan_id}
org:{org_id}:account:{account_id}:permissions
user:{user_id}:session:{session_id}
global:amenities:catalog
global:feature_flags
channel:{channel_id}:feed:{feed_id}
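A small helper can keep key construction consistent with this hierarchy. The function names below are illustrative, not a prescribed API.

```python
def cache_key(*parts):
    """Join hierarchical cache-key segments with ':' separators."""
    return ":".join(str(p) for p in parts)

def space_availability_key(org_id, space_id, start_date, end_date):
    # Tenant-scoped key matching the hierarchy above
    return cache_key("org", org_id, "space", space_id,
                     "availability", start_date, end_date)
```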
TTL Strategies:
- Session data: 24 hours
- User permissions: 1 hour (invalidate on role change)
- Space availability: 15 minutes (invalidate on booking/block)
- Pricing calculations: 1 hour (invalidate on rate change)
- Content/media URLs: 7 days
- Static reference data: 24 hours
- Search results: 5 minutes
- Channel feeds: 1 hour
Invalidation Patterns:
- Time-based: Automatic expiration via TTL
- Event-driven: Invalidate on domain events (booking.created → clear availability cache)
- Tag-based: Bulk invalidation by resource tags
- Manual: Admin-triggered cache clear
Business Rules:
- All cache keys must include org_id for tenant isolation (except global data)
- Cached data includes version identifier for schema evolution
- Cache misses trigger database read and cache population
- Stale data acceptable for non-critical reads (eventual consistency)
- Critical operations (booking confirmation) bypass cache
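The cache-miss rule (read the database, then populate the cache) is the classic cache-aside pattern. A minimal in-memory sketch, with a dict standing in for Redis and an injectable clock for testability:

```python
import time

class CacheAside:
    """Read-through cache-aside sketch; Redis GET/SETEX would replace the dict."""

    def __init__(self, clock=time.monotonic):
        self._store = {}             # key -> (value, expires_at)
        self._clock = clock

    def get_or_load(self, key, loader, ttl_seconds):
        now = self._clock()
        hit = self._store.get(key)
        if hit and hit[1] > now:
            return hit[0]                          # cache hit
        value = loader()                           # miss: read from the database
        self._store[key] = (value, now + ttl_seconds)
        return value
```

Critical paths such as booking confirmation would call the loader directly and skip this layer entirely, per the rule above.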
Infrastructure Patterns
Event-Driven Architecture
Pattern: Domains publish events via transactional outbox; consumers react asynchronously.
Components:
- Event Publishers: Domain services emit events on state changes
- Outbox Worker: Polls pending events and publishes to event bus
- Event Bus: Message broker (Redis Streams for MVP, Kafka for scale)
- Event Consumers: Workers subscribed to event types
Benefits:
- Loose coupling between domains
- Eventual consistency across services
- Enables horizontal scaling of consumers
- Audit trail of all system changes
- Supports replay and debugging
Example Flow:
- Booking service creates booking → inserts event in outbox
- Outbox worker publishes the booking.created event to the bus
- Availability service consumes event → updates calendar
- Analytics service consumes event → records metric
- Channel service consumes event → triggers sync job
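The fan-out above can be sketched with a toy in-process bus; a real deployment would use Redis Streams or Kafka consumer groups instead, but the decoupling is the same.

```python
from collections import defaultdict

class EventBus:
    """Toy in-process bus showing the fan-out in the flow above."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        # Every subscriber reacts independently: the publisher knows nothing
        # about availability, analytics, or channel services.
        for handler in self._handlers[event["event_type"]]:
            handler(event)
```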
Database Architecture
Strategy: Single PostgreSQL cluster with logical multi-tenancy (org_id partitioning).
Components:
- Primary Database: ACID-compliant transactional data store
- Read Replicas: Horizontally scaled read-only replicas for analytics/reports
- Connection Pooling: PgBouncer for connection multiplexing
- Partitioning: Table partitioning by org_id or time (events, audit logs)
- Indexes: Strategic indexes on org_id, foreign keys, and query patterns
Scaling Strategy:
- MVP: Single region, single primary + 2 read replicas
- V1: Multi-region with regional read replicas
- V2: Sharding by org_id hash for very large customers
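The V2 sharding step could select a shard by hashing org_id. A hedged sketch: md5 is used here only because it is stable across processes (unlike Python's built-in hash); the production hash function and shard-count strategy remain open decisions.

```python
import hashlib

def shard_for(org_id, shard_count):
    """Stable shard assignment by hashing org_id (illustrative only)."""
    digest = hashlib.md5(str(org_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count
```

Note that plain modulo hashing reassigns most tenants when shard_count changes; consistent hashing or a tenant-to-shard directory table would avoid that, which is part of why automated shard rebalancing is called out as a V1.2 enhancement.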
Backup & Recovery:
- Continuous WAL archiving for point-in-time recovery
- Daily full backups (encrypted, 30-day retention)
- Standby replica for automatic failover
- Quarterly DR simulations
Service Layer Architecture
Pattern: Modular service-oriented architecture with domain-aligned services.
Core Services:
- identity-service - Org, Account, User, Membership management
- authorization-service - Permission evaluation and access control
- supply-service - Space, Unit, Content management
- availability-service - Calendar, Booking, Hold, Block management
- pricing-service - RatePlan, Quote calculation
- booking-service - Reservation lifecycle orchestration
- payment-service - Payment, Payout, Transaction processing
- channel-service - OTA integration and syndication
- analytics-service - Events, Metrics, Audit processing
Service Communication:
- Synchronous: REST APIs with JWT authentication
- Asynchronous: Event bus for cross-domain notifications
- Orchestration: Saga pattern for distributed transactions
Service Standards:
- Stateless for horizontal scaling
- Health check endpoints for liveness/readiness probes
- Structured logging with correlation IDs
- Circuit breakers for external dependencies
- Rate limiting per tenant
Caching Architecture
Multi-Layer Caching Strategy:
Layer 1 - Application Cache (Redis):
- Session data
- User permissions
- Computed prices
- Availability fragments
- TTL-based invalidation
Layer 2 - CDN (CloudFront):
- Media assets (images, videos)
- iCal feed files
- Static content
- Edge caching for global distribution
Layer 3 - Database Query Cache:
- Materialized views for complex queries
- Aggregated metrics
- Refreshed on schedule or event trigger
Cache Invalidation Strategy:
- Event-driven: Clear relevant cache on domain events
- Time-based: Automatic TTL expiration
- Manual: Admin-triggered cache clear API
- Pattern-based: Invalidate by key prefix or tags
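Tag-based invalidation can be modeled as a reverse index from tag to keys; in Redis this is typically maintained as a SET per tag. An illustrative in-memory sketch:

```python
from collections import defaultdict

class TaggedCache:
    """Sketch of tag-based bulk invalidation (a Redis SET per tag in practice)."""

    def __init__(self):
        self._values = {}
        self._by_tag = defaultdict(set)   # tag -> keys carrying that tag

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._by_tag[tag].add(key)

    def invalidate_tag(self, tag):
        # Bulk-invalidate every key tagged with this resource
        for key in self._by_tag.pop(tag, set()):
            self._values.pop(key, None)

    def get(self, key):
        return self._values.get(key)
```

An event consumer wired to booking.created could then call invalidate_tag("space:456") to clear every availability and pricing entry for that space at once.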
Job Queue Architecture
Queue Design:
- Default Queue: General background tasks
- Critical Queue: High-priority jobs (payment processing)
- Bulk Queue: Large batch operations (full reindex)
- Scheduled Queue: Cron-like recurring jobs
Worker Pools:
- Horizontal scaling: Add worker instances as queue depth grows
- Dedicated pools per queue for isolation
- Auto-scaling based on queue depth metrics
Failure Handling:
- Exponential backoff retry with jitter
- Dead letter queue for permanent failures
- Alerting on dead letter threshold
- Manual retry/cancel interface
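Exponential backoff with jitter spreads retries out so failed jobs do not stampede the downstream system when it recovers. A minimal full-jitter sketch (delay drawn uniformly from zero up to the capped exponential bound):

```python
import random

def retry_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng() * min(cap, base * (2 ** attempt))
```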
Rationale
Why Event-Driven Architecture?
- Enables independent deployment and scaling of services
- Natural fit for distributed hospitality operations
- Supports complex workflows (booking → payment → calendar → channel sync)
- Provides complete audit trail for compliance
Why Transactional Outbox Pattern?
- Guarantees events are published even during failures
- Avoids dual-write problem (data + event in same transaction)
- Combined with idempotent consumers, enables effectively-once semantics critical for financial operations
Why PostgreSQL (not NoSQL)?
- ACID guarantees critical for bookings and payments
- Rich relational model fits complex domain relationships
- JSONB provides flexibility for extensibility
- Mature ecosystem and operational tooling
Why Redis for Caching?
- Fast in-memory access for hot data
- Rich data structures (strings, sets, sorted sets)
- Pub/sub for cache invalidation
- TTL built-in
Why Background Jobs?
- Non-blocking user experience
- Retry logic for unreliable external integrations
- Rate limiting for API calls to external systems
- Scheduled tasks (nightly reports, cleanups)
MVP Implementation (MVP.0)
Infrastructure Stack:
- Database: PostgreSQL 15+ (single instance + read replica)
- Cache: Redis 7+ (single instance, persistence enabled)
- Message Queue: Redis Streams (upgrade to Kafka in V1)
- Container Orchestration: Kubernetes (GKE, EKS, or AKS)
- CDN: CloudFront or Cloudflare
- Object Storage: S3-compatible (AWS S3, Backblaze B2)
- Monitoring: OpenTelemetry + Prometheus + Grafana
Enabled Patterns:
- Event publishing via transactional outbox
- Background job processing (iCal import, search reindex)
- Redis caching for permissions and availability
- Basic event bus with at-least-once delivery
Deferred to V1:
- Kafka/PubSub for high-throughput event streaming
- Multi-region database replication
- Advanced caching (multi-level, warming strategies)
- Auto-scaling worker pools
- Event replay and time-travel debugging
Future Enhancements
V1.0: Event Streaming Platform
- Migrate to Kafka or Google Pub/Sub
- Event schema registry for versioning
- Stream processing (aggregate metrics in real-time)
- Event sourcing for critical domains
V1.1: Advanced Caching
- Cache warming on data updates
- Predictive caching based on access patterns
- Multi-region cache replication
- Cache hit ratio monitoring and optimization
V1.2: Database Scaling
- Horizontal sharding by org_id
- Automated shard rebalancing
- Read replica routing by query type
- Connection pooling optimization
V2.0: Microservices Evolution
- Full microservices decomposition
- Service mesh (Istio, Linkerd)
- gRPC for inter-service communication
- API gateway with rate limiting and circuit breaking
V2.1: Data Pipeline Modernization
- Real-time streaming to data warehouse
- Change data capture (Debezium) for sync
- Data lake for historical event archives
- ML model serving infrastructure
Operational Notes
Monitoring & Observability
Key Metrics:
- Events: Publish latency, consumer lag, event throughput
- Jobs: Queue depth, job duration, failure rate
- Cache: Hit/miss ratio, eviction rate, memory usage
- Database: Connection pool utilization, query latency, replication lag
Alerting Thresholds:
- Critical: Database unavailable, event bus down, job queue overflow
- Warning: High cache miss rate, slow query detected, outbox backlog
Performance Targets
MVP (Villa-only):
- API response time: p95 < 200ms
- Event publish latency: < 1s
- Job execution: 95% within 5 minutes
- Cache hit ratio: > 80%
- Database queries: p95 < 50ms
V1 (Multi-property):
- API response time: p95 < 150ms
- Event publish latency: < 500ms
- Job execution: 95% within 2 minutes
- Cache hit ratio: > 90%
Deployment Strategy
- MVP: Blue/green deployment with manual rollout
- V1: Canary deployment with automated rollback
- V2: Progressive delivery with feature flags
Disaster Recovery
RTO (Recovery Time Objective): 1 hour
RPO (Recovery Point Objective): 5 minutes
DR Procedures:
- Automatic failover to standby replica
- Restore from point-in-time backup
- Replay events from last checkpoint
- Validate data consistency before cutover
Physical Schema
See 001_initial_schema.sql for complete CREATE TABLE statements.
Summary:
- 4 primary tables: events, outbox, jobs, cache_metadata
- 15+ indexes for query performance
- Partitioning strategies for events and jobs
- Row-Level Security policies for multi-tenancy
Related Documents
- Platform Specification - Complete platform overview
- [Data Model](../../TVL Data and Domain Model Specification 2025-10-21 (1).md) - Detailed entity definitions
- Analytics Domain - Event consumption patterns
- Search Domain - Background job usage
- MVP.0 Overview
- Architecture Decision Records - Infrastructure decisions
Implementation Priority: HIGH - Cross-cutting infrastructure must be established before domain services can operate reliably.