Reliability technical appendix

This page is for technical decision makers — IT, security, procurement — who want the architecture behind the reliability overview. It documents where RentalTide runs, what we replicate, how fast we recover, and what dependencies sit outside our direct control.

Hosting and regions

RentalTide is hosted entirely on Amazon Web Services. We do not operate any on-premise servers and do not store production data on employee devices.

Component	Region	Notes
Application servers (API + web)	`us-east-1`	Containerised, autoscaled across three Availability Zones
Primary database (Aurora PostgreSQL)	`us-east-1`	Multi-AZ, synchronous replication
Object storage (photos, attachments, PDFs)	`us-east-1`	S3 with versioning and cross-region replication
Background jobs and scheduled tasks	`us-east-1`	Lambda, run independently of the web tier
Static site delivery (booking widget, docs)	Global edge	CloudFront, served from 400+ points of presence worldwide
Backups (snapshots and continuous)	`us-east-1` + `us-west-2`	Cross-region replication for disaster recovery

The booking widget and customer-facing checkout pages are served from CloudFront's global edge, so a us-east-1 regional event does not take the booking widget offline immediately. Pages still render from cached assets, and submissions queue until the API layer is reachable again.

Database resilience

The primary database is Aurora Serverless v2 PostgreSQL.

Synchronous replication across three Availability Zones. A write is acknowledged only when it is durable on storage in two zones. Loss of a single zone causes no data loss; if the writer node was in that zone, the automatic failover described below applies.
Automatic failover within a region. If the writer node becomes unhealthy, Aurora promotes a reader to writer in under 60 seconds.
Continuous backup. Every write is shipped to S3 in near-real-time. Point-in-time restore is available for any second within the last seven days.
Daily automated snapshots. Retained for 35 days.
Cross-region snapshot replication. Daily snapshots are copied to us-west-2 for full-region disaster recovery.

For full-region recovery, the recovery point objective is bounded by replication lag to us-west-2, which is typically under five minutes.
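The backup layers above imply a simple decision when picking a restore source. The sketch below is illustrative only: `choose_restore_source` and its return labels are hypothetical names, not part of any RentalTide or AWS API, and the two windows are simply the figures quoted in this section (seven days of point-in-time restore, 35 days of daily snapshots).

```python
from datetime import datetime, timedelta, timezone

# Windows quoted in this section; the function and labels are hypothetical.
PITR_WINDOW = timedelta(days=7)
SNAPSHOT_RETENTION = timedelta(days=35)


def choose_restore_source(target: datetime, now: datetime) -> str:
    """Return which backup layer could serve a restore to `target`."""
    if target > now:
        raise ValueError("cannot restore to a time in the future")
    age = now - target
    if age <= PITR_WINDOW:
        # Continuous backup: restore to any second inside the window.
        return "point-in-time"
    if age <= SNAPSHOT_RETENTION:
        # Older than the PITR window: granularity drops to one snapshot per day.
        return "daily-snapshot"
    return "unavailable"


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
print(choose_restore_source(now - timedelta(hours=3), now))
print(choose_restore_source(now - timedelta(days=20), now))
print(choose_restore_source(now - timedelta(days=40), now))
```

The point of the sketch is the change in granularity: inside seven days a restore can target an exact second, while between seven and 35 days it can only land on a daily snapshot.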

Application tier

The application tier runs as containers on AWS ECS Fargate behind an Application Load Balancer.

Autoscaled across three Availability Zones. A single AZ loss removes roughly one-third of capacity; autoscaling refills within minutes.
Health checks every 15 seconds. Failing instances are removed from the load balancer automatically.
Rolling deploys. New container versions are rolled out a fraction at a time. Health checks gate progression. A bad deploy stops itself before reaching the full fleet.
Automatic rollback. Deploys that fail health checks revert to the previous task definition without manual intervention.
Zero-downtime deploys. New containers come up healthy before old containers are drained.
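The deploy behaviour in the list above (batches gated by health checks, replacements healthy before old tasks drain, automatic rollback) can be modelled in a few lines. This is a toy simulation under stated assumptions, not the ECS deployment controller: `rolling_deploy`, the `is_healthy` callback, and the batch size are all hypothetical, and each list element simply stands for the version a task is running.

```python
from typing import Callable, List


def rolling_deploy(
    fleet: List[str],
    new_version: str,
    is_healthy: Callable[[str, int], bool],
    batch_size: int = 2,
) -> List[str]:
    """Roll `new_version` across the fleet one batch at a time.

    `is_healthy(version, slot)` is a hypothetical health check for the
    replacement task in a given slot. Returns the resulting fleet.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    previous = list(fleet)  # what an automatic rollback reverts to
    current = list(fleet)
    for start in range(0, len(current), batch_size):
        slots = range(start, min(start + batch_size, len(current)))
        # Replacements must pass health checks before old tasks are drained,
        # so a failing batch never removes serving capacity.
        if not all(is_healthy(new_version, slot) for slot in slots):
            return previous  # rollback: earlier batches revert as well
        for slot in slots:
            current[slot] = new_version
    return current


fleet = ["v1"] * 6
print(rolling_deploy(fleet, "v2", lambda version, slot: True))
print(rolling_deploy(fleet, "v2", lambda version, slot: slot < 4))
```

In the second call the third batch fails its health check, so the function returns the original fleet rather than a partially upgraded one.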

Recovery time and recovery point objectives

Recovery time objective (RTO) is the maximum acceptable time from incident start to service restoration. Recovery point objective (RPO) is the maximum acceptable data loss measured in time.

Failure scenario	RTO target	RPO target	Mechanism
Single application instance crash	60 seconds	Zero	Container restart, load balancer reroute
Single Availability Zone failure	2 minutes	Zero	Multi-AZ Aurora and ECS distribute traffic to healthy AZs
Failed deploy	5 minutes	Zero	Automated rollback to previous task definition
Aurora writer node failure	90 seconds	Zero	Aurora failover promotes a reader
Application-level bug (data corruption)	30 minutes	Bounded by detection time	Targeted SQL repair using journal_entries audit log
Catastrophic data corruption	6 hours	5 minutes	Point-in-time restore from continuous backup
Full region outage (`us-east-1`)	4 hours	5 minutes	Restore from cross-region snapshot in `us-west-2`

RTO and RPO targets are reviewed quarterly and tested via game days at least twice per year.
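One way to picture how a game-day measurement is compared with the table above is the sketch below. The scenario keys and `evaluate_game_day` are hypothetical names, the targets are copied from the RTO column, and the measured durations are made-up inputs for illustration, not real results.

```python
from datetime import timedelta
from typing import Dict

# RTO targets copied from the table above; the keys are hypothetical labels.
RTO_TARGETS: Dict[str, timedelta] = {
    "instance_crash": timedelta(seconds=60),
    "az_failure": timedelta(minutes=2),
    "failed_deploy": timedelta(minutes=5),
    "aurora_writer_failure": timedelta(seconds=90),
    "region_outage": timedelta(hours=4),
}


def evaluate_game_day(measured: Dict[str, timedelta]) -> Dict[str, bool]:
    """Map each exercised scenario to True if recovery met its RTO target."""
    return {
        scenario: duration <= RTO_TARGETS[scenario]
        for scenario, duration in measured.items()
    }


# Invented sample measurements, used only to exercise the function.
sample = {
    "aurora_writer_failure": timedelta(seconds=48),
    "az_failure": timedelta(minutes=3),
}
print(evaluate_game_day(sample))
```

A scenario that is not in `RTO_TARGETS` raises `KeyError`, which is deliberate in this sketch: an exercise without a documented target should be noticed rather than silently passed.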

Audit and reconstruction

Beyond raw database backups, RentalTide maintains an append-only journal (journal_entries) for every financial mutation: payments, refunds, fees, taxes, AR adjustments. The journal is double-entry by design.

This means:

The cache columns on rental_bookings (amount paid, outstanding balance, amount refunded) can be reconstructed from the journal at any time.
A bug that corrupts cache fields without corrupting the journal is recoverable by re-running the cache sync from journal entries.
Forensic queries (who paid what, when, on which booking) survive any cache corruption.

We treat journal_entries as the source of truth. The reliability of that table is the reliability of your books.
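To make the reconstruction idea concrete, here is a minimal sketch of rebuilding per-booking cache values by replaying journal entries. The schema is an assumption made for illustration: the real journal_entries table is double-entry and its columns are not described here, so `JournalEntry`, the three entry kinds, and the balance formula are hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JournalEntry:
    booking_id: int
    kind: str          # assumed kinds: "charge", "payment", "refund"
    amount_cents: int  # always positive; the kind gives the direction


@dataclass
class BookingCache:
    total_charged: int = 0
    amount_paid: int = 0
    amount_refunded: int = 0

    @property
    def outstanding_balance(self) -> int:
        # Simplifying assumption: balance = charges minus net payments.
        return self.total_charged - (self.amount_paid - self.amount_refunded)


def rebuild_cache(entries: Iterable[JournalEntry]) -> Dict[int, BookingCache]:
    """Recompute cache values from scratch by replaying the journal."""
    cache: Dict[int, BookingCache] = defaultdict(BookingCache)
    for entry in entries:
        row = cache[entry.booking_id]
        if entry.kind == "charge":
            row.total_charged += entry.amount_cents
        elif entry.kind == "payment":
            row.amount_paid += entry.amount_cents
        elif entry.kind == "refund":
            row.amount_refunded += entry.amount_cents
        else:
            raise ValueError(f"unknown entry kind: {entry.kind!r}")
    return dict(cache)


journal = [
    JournalEntry(1, "charge", 10_000),
    JournalEntry(1, "payment", 6_000),
    JournalEntry(2, "charge", 5_000),
    JournalEntry(2, "payment", 7_000),  # overpayment
    JournalEntry(2, "refund", 2_000),   # overpayment returned
]
for booking_id, row in rebuild_cache(journal).items():
    print(booking_id, row.amount_paid, row.amount_refunded, row.outstanding_balance)
```

Because the function never reads the existing cache, whatever value a bug left in those columns has no effect on the rebuilt result.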

External dependencies

RentalTide depends on a handful of external SaaS providers. A feature can be no more available than the providers it depends on, so we choose them carefully.

Provider	What we use it for	Their stated SLA	Failure behaviour
Stripe	Payments, terminals, Connect	99.99%	Card capture and refunds fail with clear errors; no data loss
Auth0	Staff authentication	99.9%	Existing sessions stay valid; new sign-ins blocked
SendGrid	Transactional email	99.95%	Emails queue; retried for 72 hours
Twilio	SMS and voice	99.95%	SMS retries; voice routing degrades to next provider
AWS	Compute, storage, database	99.99% per AZ	Multi-AZ design tolerates single-AZ loss invisibly
OpenAI / Anthropic	AI features (docs bot, phone, summaries)	99.9%	AI features degrade gracefully; core ops unaffected

A failure of any single external provider does not take RentalTide down. The most common visible effect is one feature (typically payment capture or outbound email) showing transient errors while the provider recovers.
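For a feature that needs several providers at once, a rough upper bound on availability is the product of their individual figures. The sketch below applies that arithmetic to the stated SLAs in the table. It assumes independent failures, treats SLA percentages as if they were measured availability, and flattens the per-AZ AWS figure into a single number, so it is a back-of-envelope estimate only; the function names are hypothetical.

```python
from math import prod
from typing import Iterable

# Stated SLAs from the table above, as fractions.
STATED_SLA = {
    "Stripe": 0.9999,
    "Auth0": 0.999,
    "SendGrid": 0.9995,
    "Twilio": 0.9995,
    "AWS": 0.9999,  # quoted per AZ in the table; simplified here
}


def composite_availability(providers: Iterable[str]) -> float:
    """Availability of a feature that needs every listed provider to be up."""
    return prod(STATED_SLA[name] for name in providers)


def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected unavailable minutes in a month of `days` days."""
    return (1.0 - availability) * days * 24 * 60


payment_capture = composite_availability(["AWS", "Stripe"])
print(f"{payment_capture:.6f}")
print(f"{monthly_downtime_minutes(payment_capture):.1f} minutes per 30 days")
```

The same calculation with only one provider shows why per-feature isolation matters: an outage at one provider affects only the features that list it as a dependency.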

Backups and retention

Asset	Frequency	Retention	Storage location
Aurora continuous backup	Real-time	7 days	S3, `us-east-1`
Aurora daily snapshots	Daily, 04:00 UTC	35 days	S3, `us-east-1`
Cross-region snapshot copy	Daily	35 days	S3, `us-west-2`
S3 object versioning	Every write	Indefinite	S3, `us-east-1`
S3 cross-region replication	Every write	Indefinite	S3, `us-west-2`
Application logs	Real-time	90 days	CloudWatch, `us-east-1`
Audit / journal entries	Real-time	Indefinite	Aurora `journal_entries` table

Backups are encrypted at rest with AWS-managed keys. Cross-region replication uses separate keys per region to limit the blast radius of a key compromise.

Security controls relevant to availability

Availability is also a security concern. We protect against:

DDoS — AWS Shield Standard is always on. CloudFront absorbs volumetric attacks at the edge.
Credential theft — Auth0 enforces MFA for staff accounts, and bearer tokens are short-lived (1 hour) with rotating refresh tokens.
Insider risk — production database access requires SSO and is audited. No engineer has standing access to customer data; access is just-in-time and logged.
Ransomware — backups are immutable for their retention period (S3 Object Lock on snapshots).

Status, communication, and runbooks

When something goes wrong, you will hear from us in these places:

status.rentaltide.com — public status page, updated within minutes of detection. This is the source of truth for the platform overall.
Your support channel — Slack (for partner-tier customers) or email, with location-specific impact and ETA.
Postmortems — every incident with more than 15 minutes of customer impact gets a written postmortem within 5 business days, in the format described in our internal COE template.

Internally, we maintain detailed runbooks for each failure mode under /docs/contingency/ in our codebase. These cover region failover, internet outage procedures, server crash response, and physical-disaster business continuity. They are not public, but the executive summaries are available on request for procurement and security reviews.

Game days and chaos testing

Twice a year (typically March and September, ahead of peak season), we run game-day exercises that test:

Aurora writer failover under load
ECS task termination across an entire AZ
Point-in-time database restore into a separate environment
Cross-region snapshot restore
Incident response paging, status page updates, and customer communication

Results from each game day are recorded internally and fed back into our recovery time objectives.

Reporting and questions

For SOC 2 reports, vendor security assessments, or anything that requires a signed document, reach out to your account contact or security@rentaltide.com. We will route the request to the right place.

For technical questions about a specific failure mode, reach out in your support channel — we are happy to walk through any of the above in more detail.
