Reliability technical appendix
This page is for technical decision makers — IT, security, procurement — who want the architecture behind the reliability overview. It documents where RentalTide runs, what we replicate, how fast we recover, and what dependencies sit outside our direct control.
Hosting and regions
RentalTide is hosted entirely on Amazon Web Services. We do not operate any on-premise servers and do not store production data on employee devices.
| Component | Primary region | Notes |
|---|---|---|
| Application servers (API + web) | us-east-1 | Containerised, autoscaled across three Availability Zones |
| Primary database (PostgreSQL Aurora) | us-east-1 | Multi-AZ, synchronous replication |
| Object storage (photos, attachments, PDFs) | us-east-1 | S3 with versioning and cross-region replication |
| Background jobs and scheduled tasks | us-east-1 | Lambda, run independently of the web tier |
| Static site delivery (booking widget, docs) | Global edge | CloudFront, served from 400+ points of presence worldwide |
| Backups (snapshots and continuous) | us-east-1 + us-west-2 | Cross-region replication for disaster recovery |
The booking widget and customer-facing checkout pages are served from CloudFront's global edge, so a us-east-1 regional event does not take the booking widget offline immediately — pages still render from cached assets, and any submission queues up against the API layer.
Database resilience
The primary database is Aurora Serverless v2 PostgreSQL.
- Synchronous replication across three Availability Zones. A write is acknowledged only when it is durable on storage in two zones. Loss of a single zone is invisible to the application.
- Automatic failover within a region. If the writer node becomes unhealthy, Aurora promotes a reader to writer in under 60 seconds.
- Continuous backup. Every write is shipped to S3 in near-real-time. Point-in-time restore is available for any second within the last seven days.
- Daily automated snapshots. Retained for 35 days.
- Cross-region snapshot replication. Daily snapshots are copied to us-west-2 for full-region disaster recovery.
For full-region recovery, the recovery point objective is bounded by replication lag to us-west-2, which is typically under five minutes.
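The backup windows above determine which restore path is available for a given moment in time. The sketch below is illustrative only: the seven-day and 35-day figures come from this page, while the function name `restore_option` and its return labels are hypothetical and do not correspond to any AWS API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Figures taken from the list above.
PITR_WINDOW = timedelta(days=7)          # continuous backup window
SNAPSHOT_RETENTION = timedelta(days=35)  # daily snapshot retention


def restore_option(target: datetime, now: datetime) -> Optional[str]:
    """Return which backup mechanism could cover `target`, or None.

    Hypothetical helper: the labels are for illustration only.
    """
    age = now - target
    if age < timedelta(0):
        return None  # cannot restore to a moment in the future
    if age <= PITR_WINDOW:
        return "point_in_time"   # any second within the window
    if age <= SNAPSHOT_RETENTION:
        return "daily_snapshot"  # nearest daily snapshot only
    return None                  # older than every retained backup


now = datetime(2025, 3, 15, 12, 0, tzinfo=timezone.utc)
print(restore_option(now - timedelta(days=2), now))
print(restore_option(now - timedelta(days=20), now))
```

Restores older than seven days can only land on a daily snapshot boundary, so the achievable restore granularity drops from one second to one day once the continuous window has passed.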
Application tier
The application tier runs as containers on AWS ECS Fargate behind an Application Load Balancer.
- Autoscaled across three Availability Zones. A single AZ loss removes roughly one-third of capacity; autoscaling refills within minutes.
- Health checks every 15 seconds. Failing instances are removed from the load balancer automatically.
- Rolling deploys. New container versions are rolled out a fraction at a time. Health checks gate progression. A bad deploy stops itself before reaching full fleet.
- Automatic rollback. Deploys that fail health checks revert to the previous task definition without manual intervention.
- Zero-downtime deploys. New containers come up healthy before old containers are drained.
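The gating and rollback behaviour described above can be modelled in a few lines. This is a simplified simulation, not the ECS deployment controller: the function name `rolling_deploy`, the `fleet` dictionary, and the `is_healthy` callback are all hypothetical, and the sketch does not model draining old containers.

```python
from typing import Callable, Dict


def rolling_deploy(
    fleet: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
    batch_size: int = 2,
) -> bool:
    """Update `fleet` (task id -> version) one batch at a time.

    Health checks gate progression: if any task in a batch is unhealthy,
    the whole fleet reverts to its previous versions and False is returned.
    """
    previous = dict(fleet)  # snapshot used for rollback
    task_ids = list(fleet)
    for start in range(0, len(task_ids), batch_size):
        batch = task_ids[start:start + batch_size]
        for task_id in batch:
            fleet[task_id] = new_version
        if not all(is_healthy(task_id, new_version) for task_id in batch):
            fleet.clear()
            fleet.update(previous)  # automatic rollback
            return False            # later batches are never touched
    return True


fleet = {"task-a": "v1", "task-b": "v1", "task-c": "v1", "task-d": "v1"}
ok = rolling_deploy(fleet, "v2", lambda task_id, version: task_id != "task-b")
print(ok, fleet)
```

In this model a bad version is only ever running on one batch before the rollback, which is the property the rolling-deploy bullet describes.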
Recovery time and recovery point objectives
Recovery time objective (RTO) is the maximum acceptable time from incident start to service restoration. Recovery point objective (RPO) is the maximum acceptable data loss measured in time.
| Failure scenario | RTO target | RPO target | Mechanism |
|---|---|---|---|
| Single application instance crash | 60 seconds | Zero | Container restart, load balancer reroute |
| Single Availability Zone failure | 2 minutes | Zero | Multi-AZ Aurora and ECS distribute traffic to healthy AZs |
| Failed deploy | 5 minutes | Zero | Automated rollback to previous task definition |
| Aurora writer node failure | 90 seconds | Zero | Aurora failover promotes a reader |
| Application-level bug (data corruption) | 30 minutes | Bounded by detection time | Targeted SQL repair using journal_entries audit log |
| Catastrophic data corruption | 6 hours | 5 minutes | Point-in-time restore from continuous backup |
| Full region outage (us-east-1) | 4 hours | 5 minutes | Restore from cross-region snapshot in us-west-2 |
RTO and RPO targets are reviewed quarterly and tested via game days at least twice per year.
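One way to make these targets checkable during a game day is to encode the table as data and compare measured results against it. The sketch below assumes hypothetical scenario keys and a hypothetical `meets_target` helper; only the numeric targets are taken from the table. The application-level bug row has no fixed RPO figure, so it is represented as `None`.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta
    rpo: Optional[timedelta]  # None means "no fixed figure" in the table


# Scenario keys are illustrative labels; the values mirror the table above.
TARGETS = {
    "single_instance_crash": RecoveryTarget(timedelta(seconds=60), timedelta(0)),
    "single_az_failure": RecoveryTarget(timedelta(minutes=2), timedelta(0)),
    "failed_deploy": RecoveryTarget(timedelta(minutes=5), timedelta(0)),
    "aurora_writer_failure": RecoveryTarget(timedelta(seconds=90), timedelta(0)),
    "application_bug": RecoveryTarget(timedelta(minutes=30), None),
    "catastrophic_corruption": RecoveryTarget(timedelta(hours=6), timedelta(minutes=5)),
    "full_region_outage": RecoveryTarget(timedelta(hours=4), timedelta(minutes=5)),
}


def meets_target(scenario: str, recovery: timedelta, data_loss: timedelta) -> bool:
    """True if a measured recovery stayed within both RTO and RPO."""
    target = TARGETS[scenario]
    rto_ok = recovery <= target.rto
    rpo_ok = target.rpo is None or data_loss <= target.rpo
    return rto_ok and rpo_ok


print(meets_target("aurora_writer_failure", timedelta(seconds=48), timedelta(0)))
```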
Audit and reconstruction
Beyond raw database backups, RentalTide maintains an append-only journal (journal_entries) for every financial mutation: payments, refunds, fees, taxes, AR adjustments. This is double-entry by design.
This means:
- The cache columns on rental_bookings (amount paid, outstanding balance, amount refunded) can be reconstructed from the journal at any time.
- A bug that corrupts cache fields without corrupting the journal is recoverable by re-running the cache sync from journal entries.
- Forensic queries (who paid what, when, on which booking) survive any cache corruption.
We treat journal_entries as the source of truth. The reliability of that table is the reliability of your books.
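To illustrate how cache fields can be derived from an append-only journal, here is a minimal sketch. It is not RentalTide's cache sync: the entry shape (`booking_id`, `kind`, `amount_cents`), the three entry kinds, and the function name `rebuild_booking_cache` are assumptions made for the example, and the double-entry structure is flattened to single-sided entries for brevity. It also assumes the outstanding balance is charges minus net payments (payments less refunds).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JournalEntry:
    booking_id: int
    kind: str          # hypothetical kinds: "charge", "payment", "refund"
    amount_cents: int  # always positive; integers avoid float rounding


def rebuild_booking_cache(entries: Iterable[JournalEntry]) -> Dict[int, Dict[str, int]]:
    """Recompute per-booking cache values purely from journal entries."""
    totals = defaultdict(lambda: {"charge": 0, "payment": 0, "refund": 0})
    for entry in entries:
        if entry.kind not in ("charge", "payment", "refund"):
            raise ValueError(f"unknown entry kind: {entry.kind}")
        totals[entry.booking_id][entry.kind] += entry.amount_cents

    cache = {}
    for booking_id, t in totals.items():
        cache[booking_id] = {
            "amount_paid": t["payment"],
            "amount_refunded": t["refund"],
            # Assumed definition: charges minus net payments.
            "outstanding_balance": t["charge"] - (t["payment"] - t["refund"]),
        }
    return cache


journal = [
    JournalEntry(1, "charge", 10_000),
    JournalEntry(1, "payment", 6_000),
    JournalEntry(1, "refund", 1_000),
]
print(rebuild_booking_cache(journal))
```

Because the function reads only the journal, running it again after a cache-corrupting bug yields the same values regardless of what the cache columns currently hold.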
External dependencies
RentalTide depends on a handful of external SaaS providers. A feature that relies on a provider can be no more available than that provider, so we choose them carefully.
| Provider | What we use it for | Their stated SLA | Failure behaviour |
|---|---|---|---|
| Stripe | Payments, terminals, Connect | 99.99% | Card capture and refunds fail with clear errors; no data loss |
| Auth0 | Staff authentication | 99.9% | Existing sessions stay valid; new sign-ins blocked |
| SendGrid | Transactional email | 99.95% | Emails queue; retried for 72 hours |
| Twilio | SMS and voice | 99.95% | SMS retries; voice routing degrades to next provider |
| AWS | Compute, storage, database | 99.99% per AZ | Multi-AZ design tolerates single-AZ loss invisibly |
| OpenAI / Anthropic | AI features (docs bot, phone, summaries) | 99.9% | AI features degrade gracefully; core ops unaffected |
A failure of any single external provider does not take RentalTide down. The most common visible effect is one feature (typically payment capture or outbound email) showing transient errors while the provider recovers.
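The effect of stacked dependencies on a single feature can be estimated with simple arithmetic. The sketch below multiplies the stated SLA figures from the table for providers that a feature needs in series. It assumes provider failures are independent and treats AWS's per-AZ figure as a single number, both of which are simplifications; the `serial_availability` helper and the choice of which providers a feature depends on are hypothetical. Stated SLAs are contractual commitments, not measured uptime.

```python
import math
from typing import Iterable

# Stated SLA figures from the table above, expressed as fractions.
STATED_SLA = {
    "Stripe": 0.9999,
    "Auth0": 0.999,
    "SendGrid": 0.9995,
    "Twilio": 0.9995,
    "AWS": 0.9999,
}


def serial_availability(providers: Iterable[str]) -> float:
    """Availability of a feature that needs every listed provider at once."""
    return math.prod(STATED_SLA[name] for name in providers)


def downtime_minutes_per_30_days(availability: float) -> float:
    """Expected unavailable minutes in a 30-day period."""
    return (1.0 - availability) * 30 * 24 * 60


# Illustrative assumption: card capture needs both AWS and Stripe.
card_capture = serial_availability(["AWS", "Stripe"])
print(f"{card_capture:.6f}")
print(f"{downtime_minutes_per_30_days(card_capture):.1f} minutes per 30 days")
```

Under these assumptions, a feature that needs two 99.99% providers at the same time has a stated availability slightly below either one alone, which is why each additional dependency is a deliberate choice.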
Backups and retention
| Asset | Frequency | Retention | Storage location |
|---|---|---|---|
| Aurora continuous backup | Real-time | 7 days | S3, us-east-1 |
| Aurora daily snapshots | Daily, 04:00 UTC | 35 days | S3, us-east-1 |
| Cross-region snapshot copy | Daily | 35 days | S3, us-west-2 |
| S3 object versioning | Every write | Indefinite | S3, us-east-1 |
| S3 cross-region replication | Every write | Indefinite | S3, us-west-2 |
| Application logs | Real-time | 90 days | CloudWatch, us-east-1 |
| Audit / journal entries | Real-time | Indefinite | Aurora journal_entries table |
Backups are encrypted at rest with AWS-managed keys. Cross-region replication uses separate keys per region to limit the blast radius of a key compromise.
Security controls relevant to availability
Availability is also a security concern. We protect against:
- DDoS — AWS Shield Standard is always on. CloudFront absorbs volumetric attacks at the edge.
- Credential theft — Auth0 enforces MFA for staff accounts, and bearer tokens are short-lived (1 hour) with rotating refresh tokens.
- Insider risk — production database access requires SSO and is audited. No engineer has standing access to customer data; access is just-in-time and logged.
- Ransomware — backups are immutable for their retention period (S3 Object Lock on snapshots).
Status, communication, and runbooks
When something goes wrong, you will hear from us in these places:
- status.rentaltide.com: public status page, updated within minutes of detection. This is the source of truth for the platform overall.
- Your support channel: Slack (for partner-tier customers) or email, with location-specific impact and ETA.
- Postmortems: every incident with more than 15 minutes of customer impact gets a written postmortem within 5 business days, in the format described in our internal COE template.
Internally, we maintain detailed runbooks for each failure mode under /docs/contingency/ in our codebase. These cover region failover, internet outage procedures, server crash response, and physical-disaster business continuity. They are not public, but the executive summaries are available on request for procurement and security reviews.
Game days and chaos testing
Twice a year (typically March and September, ahead of peak season), we run game-day exercises that test:
- Aurora writer failover under load
- ECS task termination across an entire AZ
- Point-in-time database restore into a separate environment
- Cross-region snapshot restore
- Incident response paging, status page updates, and customer communication
Results from each game day are recorded internally and fed back into our recovery time objectives.
Reporting and questions
For SOC 2 reports, vendor security assessments, or anything that requires a signed document, reach out to your account contact or security@rentaltide.com. We will route the request to the right place.
For technical questions about a specific failure mode, reach out in your support channel — we are happy to walk through any of the above in more detail.

