-
Create RDS Multi-AZ Database in Primary Region
The Multi-AZ RDS database provides automatic failover within a region. It synchronously replicates data to a standby instance in a different Availability Zone.
In AWS Console (right panel):
1. Click the "RDS Database" tab at the top
2. In "Database name" field Type: globalshop-prod-primary
3. In "Instance class" dropdown Select "db.r5.xlarge"
4. In "Storage type" Select "Provisioned IOPS SSD (io1)"
5. In "Allocated storage" Type: 500 GB
6. Under "Multi-AZ deployment":
Select radio button: "Yes - Create a standby instance"
7. In "Backup retention" Select "7 days"
8. Check box: "Enable automatic minor version upgrades"
9. Click orange "Create Database" button at bottom
Why Multi-AZ? Multi-AZ deployments carry a 99.95% availability SLA and fail over automatically, typically within 60-120 seconds, if the primary AZ fails. The standby instance stays in sync via synchronous replication.
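The console choices above correspond to parameters of the RDS CreateDBInstance API. The following is a minimal sketch, assuming boto3 is used. The engine, the IOPS value, and the master username are not given in this lab and are placeholders.

```python
def primary_db_params(engine="postgres", iops=3000, master_username="dbadmin"):
    """Build keyword arguments for boto3's rds.create_db_instance.

    engine, iops and master_username are assumptions; the lab does not specify them.
    """
    return {
        "DBInstanceIdentifier": "globalshop-prod-primary",
        "DBInstanceClass": "db.r5.xlarge",
        "Engine": engine,
        "StorageType": "io1",            # Provisioned IOPS SSD
        "Iops": iops,                    # io1 storage needs an explicit IOPS value
        "AllocatedStorage": 500,         # GiB
        "MultiAZ": True,                 # create the synchronous standby
        "BackupRetentionPeriod": 7,      # days
        "AutoMinorVersionUpgrade": True,
        "MasterUsername": master_username,
        "ManageMasterUserPassword": True,  # let RDS keep the password in Secrets Manager
    }


params = primary_db_params()
# With AWS credentials configured, this could be applied as:
#   import boto3
#   boto3.client("rds", region_name="us-east-1").create_db_instance(**params)
```

Building the parameters separately from the API call keeps the settings reviewable before anything is created.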
-
Configure Cross-Region Read Replica
Read replicas in a secondary region provide disaster recovery capability. During a regional outage, you can promote the replica to become a standalone database.
Step-by-step:
1. Still in "RDS Database" tab, scroll to "Read Replicas" section
2. In "Replica name" field Type: globalshop-prod-replica
3. In "Replica region" dropdown Select "us-west-2"
4. Check boxes:
✓ "Publicly accessible" (for testing)
✓ "Auto-promote to primary on failure"
5. In "Replication lag alert threshold" Type: 60 seconds
6. Click "Create Read Replica" button
Replication Lag: Cross-region replication is asynchronous, so the replica can trail the primary. Lag is typically <1 second under normal load. Monitor the CloudWatch metric "ReplicaLag"; the alert triggers if it exceeds 60 seconds.
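Outside the lab console, a cross-region replica is created by calling the RDS API in the destination region and naming the source instance by ARN. The sketch below assumes boto3. The account ID is a made-up placeholder, and the replica instance class is assumed to match the primary. Promotion is a separate PromoteReadReplica call, which this lab leaves to the failover automation.

```python
def replica_params(account_id="123456789012"):
    """Build keyword arguments for boto3's rds.create_db_instance_read_replica.

    account_id is a placeholder; substitute the account that owns the primary.
    """
    source_arn = f"arn:aws:rds:us-east-1:{account_id}:db:globalshop-prod-primary"
    return {
        "DBInstanceIdentifier": "globalshop-prod-replica",
        "SourceDBInstanceIdentifier": source_arn,  # an ARN is needed across regions
        "SourceRegion": "us-east-1",
        "DBInstanceClass": "db.r5.xlarge",         # assumption: same class as primary
        "PubliclyAccessible": True,                # testing only, as in the lab
    }


replica = replica_params()
# The call is made against the destination region:
#   import boto3
#   boto3.client("rds", region_name="us-west-2").create_db_instance_read_replica(**replica)
```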
-
Set Up Route 53 Failover Routing
Route 53 health checks monitor your primary region and automatically route traffic to the secondary region if primary becomes unhealthy.
Route 53 Configuration:
1. Click "Route 53" tab
2. In "Record name" field Type: www.globalshop.com
3. In "Routing policy" dropdown Select "Failover"
4. Configure PRIMARY record:
Record type: "Primary"
Value: ALB DNS in us-east-1 (e.g., primary-alb-123.us-east-1.elb.amazonaws.com)
Health check ID: "primary-health-check"
5. Configure SECONDARY record:
Record type: "Secondary"
Value: ALB DNS in us-west-2
Failover: "Evaluate target health"
6. Click "Create records"
Important: By default, health checks run every 30 seconds from multiple global locations. Failover occurs within 60-120 seconds after the primary becomes unhealthy.
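The PRIMARY and SECONDARY records above can be expressed as one Route 53 change batch. This is a sketch, assuming alias A records and boto3. The secondary ALB name, both ALB hosted zone IDs, and the health check ID are placeholders to replace with real values from your account.

```python
def failover_record(role, alb_dns, alb_zone_id, health_check_id=None):
    """Return one UPSERT change for a failover alias record.

    role must be "PRIMARY" or "SECONDARY".
    """
    record = {
        "Name": "www.globalshop.com",
        "Type": "A",
        "SetIdentifier": f"globalshop-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # the ALB's zone ID, not your own hosted zone
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "GlobalShop failover routing",
    "Changes": [
        failover_record(
            "PRIMARY",
            "primary-alb-123.us-east-1.elb.amazonaws.com",
            "ZPRIMARYALBZONEID",               # placeholder
            health_check_id="primary-health-check",  # placeholder; real IDs are UUIDs
        ),
        failover_record(
            "SECONDARY",
            "secondary-alb-456.us-west-2.elb.amazonaws.com",  # hypothetical name
            "ZSECONDARYALBZONEID",             # placeholder
        ),
    ],
}
# Applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=change_batch)
```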
-
Configure S3 Cross-Region Replication
Replicate static assets, user uploads, and backups to the secondary region. CRR provides near real-time replication, and S3 itself is designed for 99.999999999% (11 nines) durability in each region.
S3 Replication Setup:
1. Click "S3 Replication" tab
2. In "Source bucket" dropdown Select "globalshop-prod-assets-us-east-1"
3. In "Destination region" Select "us-west-2"
4. In "Destination bucket" Type: globalshop-prod-assets-us-west-2
5. Replication options:
✓ "Replicate objects encrypted with AWS KMS"
✓ "Replicate delete markers"
✓ "Replication metrics & notifications"
6. In "Replication time control (RTC)" Check "Enable"
RTC is designed to replicate 99.99% of objects within 15 minutes
7. Click "Enable Replication"
Cost Optimization: Use S3 Intelligent-Tiering for infrequently accessed objects. CRR costs ~$0.02/GB transferred.
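The replication options chosen above map onto a single replication rule. The following sketch assumes boto3's put_bucket_replication. The IAM role ARN and the destination KMS key ARN are hypothetical placeholders, and both buckets must already have versioning enabled.

```python
def replication_config(role_arn, replica_kms_key_arn):
    """Build the ReplicationConfiguration for s3.put_bucket_replication."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-all-to-us-west-2",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter = every object in the bucket
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "SourceSelectionCriteria": {
                    "SseKmsEncryptedObjects": {"Status": "Enabled"}
                },
                "Destination": {
                    "Bucket": "arn:aws:s3:::globalshop-prod-assets-us-west-2",
                    # Needed when KMS-encrypted objects are replicated.
                    "EncryptionConfiguration": {"ReplicaKmsKeyID": replica_kms_key_arn},
                    # Replication Time Control with its 15-minute target.
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }


config = replication_config(
    "arn:aws:iam::123456789012:role/globalshop-s3-replication",  # placeholder
    "arn:aws:kms:us-west-2:123456789012:key/EXAMPLE-KEY-ID",     # placeholder
)
# Applied with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="globalshop-prod-assets-us-east-1", ReplicationConfiguration=config)
```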
-
Set Up CloudWatch Cross-Region Monitoring
Centralized monitoring dashboard shows health of both regions. Alarms trigger automatic failover and notify ops team.
CloudWatch Configuration:
1. Click "CloudWatch" tab
2. Click "Create Dashboard", Name: GlobalShop-DR-Dashboard
3. Add widgets:
RDS: CPUUtilization, DatabaseConnections, ReplicaLag
ALB: TargetResponseTime, HealthyHostCount, HTTPCode_Target_5XX_Count
Route 53: HealthCheckStatus
S3: ReplicationLatency, BytesPendingReplication
4. Create alarms:
Critical Alarms:
RDS ReplicaLag > 60 seconds: notify SNS topic "DR-Ops-Team"
Route 53 Health Check fails: auto-trigger failover Lambda
ALB HealthyHostCount < 2: page on-call engineer
5. Click "Save Dashboard"
Best Practice: Use CloudWatch Anomaly Detection to automatically detect unusual patterns like sudden traffic spikes or replication lag increases.
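Two of the critical alarms above can be written as CloudWatch PutMetricAlarm parameters. This is a sketch under assumptions: the alarm names, the SNS topic ARN, and the ALB and target group dimension values are placeholders, and the choice of statistic and period is illustrative rather than prescribed by the lab.

```python
def replica_lag_alarm(sns_topic_arn):
    """Alarm when the cross-region replica falls more than 60 seconds behind."""
    return {
        "AlarmName": "globalshop-rds-replica-lag",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": "globalshop-prod-replica"}
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 60.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def healthy_hosts_alarm(sns_topic_arn, load_balancer, target_group):
    """Alarm when fewer than 2 targets behind the ALB are healthy."""
    return {
        "AlarmName": "globalshop-alb-healthy-hosts",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 2.0,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


TOPIC = "arn:aws:sns:us-east-1:123456789012:DR-Ops-Team"  # placeholder ARN
alarms = [
    replica_lag_alarm(TOPIC),
    healthy_hosts_alarm(TOPIC, "app/primary-alb/EXAMPLE", "targetgroup/web/EXAMPLE"),
]
# Each entry could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```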
-
Create Lambda Failover Automation
Automated failover Lambda function executes when primary region fails. It promotes RDS replica, updates DNS, scales secondary region, and notifies team.
Automation Steps:
1. In AWS Console, navigate to Lambda (use search bar if needed)
2. Click "Create function"
3. Function name: GlobalShop-DR-Failover
4. Runtime: Python 3.11
5. Add trigger: EventBridge (CloudWatch Events) rule on alarm state change
6. Function logic (skeleton you'll implement):
    def lambda_handler(event, context):
        """Orchestrate failover to us-west-2; each step below is still to be implemented."""
        # 1. Promote RDS read replica to primary
        # 2. Update Route 53 to point to us-west-2
        # 3. Scale Auto Scaling group in us-west-2 to 100%
        # 4. Send notification to Slack/PagerDuty
        # 5. Log failover event to DynamoDB
        raise NotImplementedError("failover steps not implemented yet")
7. Set timeout to 5 minutes
8. Attach IAM role with RDS, Route 53, EC2 Auto Scaling, SNS, and DynamoDB permissions
9. Click "Deploy"
Critical: Test failover monthly during maintenance windows. Document RTO (target: <15 min) and RPO (target: <1 min) for each drill.
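Because the function is triggered by an alarm state change, it helps to check the incoming event before doing anything destructive. The sketch below assumes the EventBridge "CloudWatch Alarm State Change" event shape. The alarm name is a hypothetical placeholder.

```python
def should_fail_over(event, expected_alarm="globalshop-primary-health"):
    """Return True only for a fresh transition of the expected alarm into ALARM."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("alarmName") == expected_alarm
        and detail.get("state", {}).get("value") == "ALARM"
        # Ignore repeat deliveries where the alarm was already in ALARM.
        and detail.get("previousState", {}).get("value") != "ALARM"
    )


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "globalshop-primary-health",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}
recovered_event = {
    **sample_event,
    "detail": {
        "alarmName": "globalshop-primary-health",
        "state": {"value": "OK"},
        "previousState": {"value": "ALARM"},
    },
}
```

A handler would call should_fail_over(event) first and return early when it is False, so that recovery notifications or unrelated alarms never start a promotion.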
-
Configure DR Drill and Validate
Simulate complete regional failure, measure actual RTO/RPO, verify data consistency, and document lessons learned.
DR Drill Procedure:
1. Schedule drill during low-traffic period (notify stakeholders 48hrs ahead)
2. Simulate primary region failure:
Manually mark Route 53 health check as "unhealthy" (for example, by inverting the health check)
Or: Shut down primary ALB target group
3. Monitor automatic failover:
Watch CloudWatch logs for Lambda execution
Verify RDS replica promotion (5-10 minutes)
Confirm DNS propagation (1-2 minutes)
Check Auto Scaling scaling activity
4. Validate application functionality:
Test user login, checkout, database writes
Verify static assets loading from S3
Check recent transactions for data loss
5. Measure metrics:
RTO: Time from failure to full restoration
RPO: Amount of data lost (in seconds)
Document any issues or gaps
6. Failback to primary region:
Reverse replication direction (a promoted replica is standalone, so create a new replica in us-east-1 from it)
Restore Route 53 to primary
Scale down secondary region
Post-Drill: Hold retrospective meeting within 24 hours. Update runbooks based on findings. Typical first-drill issues: forgotten IAM permissions, DNS TTL too high, monitoring gaps.
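One way to simulate the primary failure in step 2 is to invert the Route 53 health check for the duration of the drill and restore it afterwards. This is a sketch, assuming a boto3 Route 53 client and a health check that is not already inverted. The health check ID shown in the usage comment is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def simulated_primary_failure(route53, health_check_id):
    """Invert the health check so Route 53 treats a healthy primary as unhealthy.

    The original state is restored even if the drill body raises.
    """
    route53.update_health_check(HealthCheckId=health_check_id, Inverted=True)
    try:
        yield
    finally:
        route53.update_health_check(HealthCheckId=health_check_id, Inverted=False)


# Usage during a drill (needs AWS credentials):
#   import boto3
#   with simulated_primary_failure(boto3.client("route53"), "<health check ID>"):
#       ...  # watch failover, run validation, record timestamps
```

Using a context manager means the health check cannot be left inverted by accident if a validation step fails partway through.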
-
Set Up Secondary Region (us-west-2)
Mirror the primary region architecture in us-west-2 as a standby environment. This region will be in "warm standby" mode with reduced capacity that can scale up during failover.
Your Tasks:
Create identical VPC in us-west-2
Deploy RDS Read Replica from primary
Configure standby ALB and Auto Scaling
Set minimum capacity to 25% of primary
Enable cross-region VPC peering
Cost Optimization: Use t3.medium instances in standby region to reduce costs. Scale to r5.large during failover.
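The 25% minimum capacity can be derived from the primary group's size rather than hard-coded. The sketch below is illustrative: the primary capacity figures are made-up examples, and the output keys follow the Auto Scaling UpdateAutoScalingGroup parameter names.

```python
import math


def standby_capacity(primary_desired, primary_max, fraction=0.25):
    """Size the warm-standby group as a fraction of the primary, rounding up.

    MaxSize matches the primary so the group can grow to full size on failover.
    """
    minimum = max(1, math.ceil(primary_desired * fraction))
    return {"MinSize": minimum, "DesiredCapacity": minimum, "MaxSize": primary_max}


sizes = standby_capacity(primary_desired=8, primary_max=16)
# sizes == {"MinSize": 2, "DesiredCapacity": 2, "MaxSize": 16}
# Applied with (group name is hypothetical):
#   boto3.client("autoscaling", region_name="us-west-2").update_auto_scaling_group(
#       AutoScalingGroupName="globalshop-web-us-west-2", **sizes)
```

Rounding up and enforcing a floor of one instance keeps the standby region able to serve health checks even when the primary is small.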
-
Configure S3 Cross-Region Replication
Enable automatic replication of all S3 objects (user uploads, static assets, logs) from primary to secondary region. This ensures data consistency across regions.
Your Tasks:
Enable versioning on source and destination buckets
Create replication rule for all objects
Configure replication time control (RTC) for predictable timing
Set up replication metrics and notifications
Enable delete marker replication
Best Practice: Use S3 Intelligent-Tiering to automatically optimize storage costs while maintaining replication performance.
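Replication rules are rejected unless versioning is enabled on both buckets, so that task comes first. A minimal sketch, assuming boto3's put_bucket_versioning and get_bucket_versioning and the bucket names used earlier in this lab:

```python
BUCKETS = {
    "us-east-1": "globalshop-prod-assets-us-east-1",
    "us-west-2": "globalshop-prod-assets-us-west-2",
}


def versioning_request(bucket):
    """Keyword arguments for s3.put_bucket_versioning."""
    return {"Bucket": bucket, "VersioningConfiguration": {"Status": "Enabled"}}


def is_versioned(get_versioning_response):
    """Interpret the response of s3.get_bucket_versioning.

    The "Status" key is absent when versioning has never been enabled.
    """
    return get_versioning_response.get("Status") == "Enabled"


versioning_requests = [versioning_request(name) for name in BUCKETS.values()]
# Applied per region:
#   import boto3
#   for region, name in BUCKETS.items():
#       boto3.client("s3", region_name=region).put_bucket_versioning(**versioning_request(name))
```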
-
Implement Route 53 Health Checks & Failover
Configure Route 53 to automatically detect primary region failures and route traffic to the secondary region. Health checks monitor both endpoint availability and application-level health.
Your Tasks:
Create health checks for primary ALB endpoint
Set up application-level health check (HTTP /health endpoint)
Configure failover routing policy
Set request interval to 10 seconds (fast health checks; the standard interval is 30 seconds)
Configure SNS notifications for health check failures
Critical: Test your health check endpoint thoroughly! A misconfigured health check can cause unnecessary failovers.
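The health check tasks above can be sketched as a Route 53 CreateHealthCheck request, together with a rough estimate of how long clients may keep reaching the failed region. The HTTPS protocol, port 443, failure threshold of 3, and 60-second DNS TTL are assumptions, not values from this lab. The estimate ignores resolver behavior and is only an approximation.

```python
import uuid


def health_check_request(fqdn, request_interval=10, failure_threshold=3):
    """Keyword arguments for route53.create_health_check."""
    return {
        "CallerReference": str(uuid.uuid4()),  # must be unique per request
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": fqdn,
            "Port": 443,
            "ResourcePath": "/health",
            "RequestInterval": request_interval,   # 10 (fast) or 30 (standard)
            "FailureThreshold": failure_threshold,
        },
    }


def estimated_failover_seconds(request_interval, failure_threshold, dns_ttl):
    """Approximate detection time plus the time cached DNS answers stay valid."""
    return request_interval * failure_threshold + dns_ttl


request = health_check_request("primary-alb-123.us-east-1.elb.amazonaws.com")
fast = estimated_failover_seconds(10, 3, 60)      # 90 seconds
standard = estimated_failover_seconds(30, 3, 60)  # 150 seconds
```

Under these assumptions, the 10-second interval saves about a minute of client-visible downtime compared with the standard interval.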
-
Build Automated Failover Lambda Function
Create a Lambda function that orchestrates the failover process: promoting the RDS read replica to a standalone primary, scaling up secondary region capacity, updating security groups, and sending notifications.
Your Tasks:
Write Lambda function in Python to handle failover
Promote RDS read replica to standalone instance
Scale Auto Scaling group to production capacity
Update Route 53 DNS records
Send notifications to operations team (SNS, Slack, PagerDuty)
Testing: Create a failback function too! Once the issues are resolved, you'll need to return production traffic to the original primary region.
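The core of the tasks above might look like the following sketch. It takes the AWS clients as arguments so it can be exercised with stubs before touching real resources. The group name, capacity, and topic ARN are supplied by the caller and are hypothetical here. DNS is left out on the assumption that the Route 53 failover policy switches traffic by itself once the health check fails.

```python
import json
from datetime import datetime, timezone


def run_failover(rds, autoscaling, sns, *, replica_id, asg_name, capacity, topic_arn):
    """Promote the replica, scale the standby group, and notify the ops team.

    rds, autoscaling and sns are boto3-style clients for us-west-2.
    Returns a summary dict that can also be logged or stored.
    """
    steps = []

    # Promotion makes the replica a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    steps.append("promoted_replica")

    # capacity must fall within the group's MinSize/MaxSize bounds.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=capacity, HonorCooldown=False
    )
    steps.append("scaled_secondary")

    summary = {
        "event": "dr-failover",
        "replica": replica_id,
        "steps": steps,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
    sns.publish(
        TopicArn=topic_arn, Subject="DR failover executed", Message=json.dumps(summary)
    )
    return summary


# Inside a Lambda handler this could be wired up as:
#   import boto3
#   run_failover(
#       boto3.client("rds", region_name="us-west-2"),
#       boto3.client("autoscaling", region_name="us-west-2"),
#       boto3.client("sns", region_name="us-west-2"),
#       replica_id="globalshop-prod-replica", asg_name="<group name>",
#       capacity=8, topic_arn="<topic ARN>")
```

Passing clients in keeps the orchestration testable and makes a matching failback function straightforward to write with the same structure.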
-
Set Up CloudWatch Cross-Region Monitoring
Implement comprehensive monitoring across both regions with unified dashboards, custom metrics, and automated alerting for DR-specific metrics (replication lag, health check status, etc.).
Your Tasks:
Create CloudWatch dashboard showing both regions
Configure alarms for RDS replication lag (> 60 seconds)
Set up custom metrics for application health
Enable cross-region CloudWatch Logs aggregation
Configure automated runbooks in Systems Manager
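A single dashboard can show both regions because each widget carries its own region. The sketch below builds a small dashboard body, assuming the CloudWatch dashboard JSON format and boto3's put_dashboard. The widget selection and layout are illustrative only.

```python
import json


def metric_widget(title, region, metrics, x, y):
    """One time-series widget; metrics is a list of [namespace, metric, dim, value]."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "metrics": metrics,
            "stat": "Average",
            "period": 60,
            "view": "timeSeries",
        },
    }


def dashboard_body():
    """Return the JSON string expected by cloudwatch.put_dashboard."""
    widgets = [
        metric_widget(
            "Primary RDS CPU (us-east-1)", "us-east-1",
            [["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "globalshop-prod-primary"]],
            x=0, y=0,
        ),
        metric_widget(
            "Replica lag (us-west-2)", "us-west-2",
            [["AWS/RDS", "ReplicaLag", "DBInstanceIdentifier", "globalshop-prod-replica"]],
            x=12, y=0,
        ),
    ]
    return json.dumps({"widgets": widgets})


body = dashboard_body()
# Applied with:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="GlobalShop-DR-Dashboard", DashboardBody=body)
```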
-
Configure DR Drill & Validate RPO/RTO
Perform a full disaster recovery test to validate your architecture. Measure actual RPO and RTO, identify bottlenecks, and refine your runbook. Document everything!
Your Tasks:
Simulate primary region failure
Trigger automated failover process
Measure time to restore service (RTO)
Validate data consistency (RPO)
Test failback to primary region
Generate DR drill report with metrics and lessons learned
Compliance: Many compliance frameworks expect regular, documented DR testing; check which cadence applies to you (quarterly is a common choice). Schedule drills proactively and keep detailed records.
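RTO and RPO fall out of three timestamps recorded during the drill. The following sketch compares them with the targets stated earlier in this lab (RTO under 15 minutes, RPO under 1 minute). The sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def drill_report(failure_at, restored_at, last_replicated_write_at,
                 rto_target=timedelta(minutes=15), rpo_target=timedelta(minutes=1)):
    """Compute achieved RTO/RPO and whether each met its target.

    RTO: time from failure until service was fully restored.
    RPO: age of the last write that reached the replica when the failure hit.
    """
    rto = restored_at - failure_at
    rpo = failure_at - last_replicated_write_at
    return {
        "rto_seconds": rto.total_seconds(),
        "rpo_seconds": rpo.total_seconds(),
        "rto_met": rto < rto_target,
        "rpo_met": rpo < rpo_target,
    }


report = drill_report(
    failure_at=datetime(2024, 1, 14, 2, 0, 0),
    restored_at=datetime(2024, 1, 14, 2, 11, 30),
    last_replicated_write_at=datetime(2024, 1, 14, 1, 59, 58),
)
# report == {"rto_seconds": 690.0, "rpo_seconds": 2.0, "rto_met": True, "rpo_met": True}
```

Keeping the report as plain data makes it easy to archive alongside the drill notes and to compare across drills.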