Runbook: DR Failover

Overview

This runbook covers disaster recovery failover procedures for Cloud Aegis, including:

  • Pre-failover readiness checks
  • Automated and manual failover triggers
  • Per-CSP failover specifics (AWS, Azure, GCP)
  • Post-failover verification
  • Failback to primary
  • Stakeholder communication templates

Prerequisites

  • Access to all three CSP consoles (AWS, Azure, GCP)
  • kubectl access to primary and DR clusters
  • DNS management access (Route 53, Traffic Manager, Cloud DNS)
  • PagerDuty incident created for tracking
  • Runbook 02-incident-response.md reviewed if triggered by incident

Incident Classification

Condition             Failover Type      RTO Target   RPO Target
Region outage         Full failover      30 min       5 min
Database corruption   DB-only failover   15 min       1 min
Network partition     Routing failover   10 min       0
Compute failure       Cluster failover   20 min       2 min
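
For quick reference mid-incident, the table above can be encoded as a lookup helper. This is an illustrative shell sketch; the function name and condition keys are hypothetical, not part of any Aegis tooling:

```shell
# Hypothetical lookup mirroring the classification table; condition keys
# and output format are illustrative only.
failover_targets() {
  case "$1" in
    region-outage)     echo "full-failover rto=30m rpo=5m" ;;
    db-corruption)     echo "db-only-failover rto=15m rpo=1m" ;;
    network-partition) echo "routing-failover rto=10m rpo=0" ;;
    compute-failure)   echo "cluster-failover rto=20m rpo=2m" ;;
    *) echo "unknown condition: $1" >&2; return 1 ;;
  esac
}

failover_targets region-outage   # -> full-failover rto=30m rpo=5m
```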

Pre-Failover Checklist

Run these checks before initiating any failover.

# 1. Verify DR environment is healthy
kubectl get nodes --context=dr-cluster
kubectl get pods -n aegis --context=dr-cluster

# 2. Check database replication lag
# AWS RDS
aws rds describe-db-instances \
--db-instance-identifier aegis-db-replica \
--query 'DBInstances[0].StatusInfos' \
--region us-west-2

# 3. Verify last successful backup
aws rds describe-db-snapshots \
--db-instance-identifier aegis-db \
--query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[0]' \
--region us-east-1

# 4. Check DNS TTL (must be <=60s before failover)
dig +short SOA aegis.io
dig +noall +answer api.aegis.io   # TTL is the second column

# 5. Verify DR application config is current (compare .data only; full YAML
#    always differs in metadata such as resourceVersion and uid)
kubectl get configmap aegis-config -n aegis --context=dr-cluster -o jsonpath='{.data}' | \
diff - <(kubectl get configmap aegis-config -n aegis --context=primary-cluster -o jsonpath='{.data}')

# 6. Check DR health endpoint
curl -sf https://dr.aegis.io/health | jq .

Proceed only if all of the following hold:

  • DR cluster nodes all Ready
  • Database replication lag <30s
  • Last backup <6h old
  • DNS TTL reduced to 60s
  • DR config matches primary
  • DR health endpoint returns 200
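
Once the numbers from checks 1-6 are collected, the numeric part of the go/no-go decision can be scripted. The sketch below covers only the three threshold checks; `dr_ready` and its argument order (lag seconds, backup age hours, DNS TTL seconds) are hypothetical, not Aegis tooling:

```shell
# Illustrative go/no-go gate; thresholds mirror the checklist above.
# Arguments: replication lag (s), last-backup age (h), DNS TTL (s).
dr_ready() {
  local lag_s=$1 backup_age_h=$2 dns_ttl_s=$3 ok=0
  [ "$lag_s" -lt 30 ]       || { echo "FAIL: replication lag ${lag_s}s >= 30s"; ok=1; }
  [ "$backup_age_h" -lt 6 ] || { echo "FAIL: last backup ${backup_age_h}h >= 6h"; ok=1; }
  [ "$dns_ttl_s" -le 60 ]   || { echo "FAIL: DNS TTL ${dns_ttl_s}s > 60s"; ok=1; }
  [ "$ok" -eq 0 ] && echo "PASS: safe to initiate failover"
  return "$ok"
}

# Example with healthy numbers:
dr_ready 12 2 60
```

Node readiness, the config diff, and the health endpoint remain manual checks; only the numeric thresholds are gated here.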

Automated Failover Triggers

Cloud Aegis monitors the following conditions and triggers automated failover via the DR controller:

# Primary region health score (triggers failover at <0.3)
avg(aegis_health_status) by (region) < 0.3

# Consecutive health check failures (triggers at >5)
increase(aegis_health_check_failures_total[5m]) > 5

# Database replication lag (triggers at >120s)
aegis_db_replication_lag_seconds > 120

To check current DR controller status:

kubectl get deployment aegis-dr-controller -n aegis-system
kubectl logs -n aegis-system -l app=dr-controller --tail=50
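
The controller's decision presumably reduces to "any one trigger fires failover". A minimal sketch of that combination (assumed logic; `should_failover` is not the actual controller code):

```shell
# Assumed trigger combination (not the DR controller's actual code): any one
# condition crossing its threshold above is sufficient to fire failover.
should_failover() {
  local health_score=$1 check_failures=$2 repl_lag_s=$3
  awk -v h="$health_score" -v f="$check_failures" -v l="$repl_lag_s" \
    'BEGIN { exit !(h < 0.3 || f > 5 || l > 120) }'
}

# Example: health score below 0.3 alone is enough
if should_failover 0.25 2 40; then echo "trigger: failover"; else echo "no trigger"; fi
```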

Manual Failover Procedure

Step 1: Declare Failover

# Announce in incident channel
# #incident-YYYYMMDD-XX: "Initiating manual DR failover to [region]. ETA: 30 min."

# Disable new writes to primary (prevents split-brain)
kubectl annotate deployment aegis-api \
aegis.io/drain="true" \
--context=primary-cluster \
-n aegis

Step 2: Verify In-Flight Requests Drained

# Watch connection count drop to zero
watch -n 5 'kubectl exec -n aegis deployment/aegis-api \
--context=primary-cluster -- \
curl -s localhost:8080/debug/connections | jq .active'
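
Rather than eyeballing the watch output, the drain wait can be bounded so Step 2 cannot stall the RTO indefinitely. A generic sketch (`wait_for_zero` is a hypothetical helper; pass a function or a simple command that prints the active connection count):

```shell
# Hypothetical drain gate: polls a command that prints the active connection
# count until it reaches 0 or the timeout expires.
wait_for_zero() {
  local cmd=$1 timeout_s=${2:-300} interval_s=${3:-5} elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    [ "$($cmd)" = "0" ] && return 0
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "timed out after ${timeout_s}s; connections not drained" >&2
  return 1
}

# Self-test with a stub that counts down through a temp file:
tmp=$(mktemp); echo 2 > "$tmp"
stub_conns() { local c; c=$(( $(cat "$tmp") - 1 )); echo "$c" > "$tmp"; echo "$c"; }
wait_for_zero stub_conns 10 1 && echo "drained"
rm -f "$tmp"
```

In Step 2 the polled command would be the `/debug/connections` curl shown above, with a timeout chosen against the 30-minute RTO.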

Step 3: Initiate Database Failover

See per-CSP sections below for database-specific commands.

Step 4: Update DNS

# Route 53 (AWS)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.aegis.io",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{"Value": "dr-nlb.us-west-2.elb.amazonaws.com"}]
    }
  }]
}'

# Verify propagation
watch -n 10 'dig +short api.aegis.io'
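
The propagation watch can likewise be turned into a pass/fail gate. The resolver command is injected so the logic is testable offline; in production the resolver would be `dig +short api.aegis.io` and the expected value the DR NLB name:

```shell
# Illustrative propagation gate; `dns_points_to` is not Aegis tooling.
# Succeeds once the injected resolver returns the expected target.
dns_points_to() {
  local expected=$1; shift
  "$@" | grep -qx "$expected"
}

# Stubbed resolver standing in for dig:
stub_resolver() { echo "dr-nlb.us-west-2.elb.amazonaws.com."; }
dns_points_to "dr-nlb.us-west-2.elb.amazonaws.com." stub_resolver && echo "propagated"
```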

Step 5: Scale DR Cluster

# Scale to production capacity
kubectl scale deployment aegis-api \
--replicas=6 \
--context=dr-cluster \
-n aegis

kubectl rollout status deployment/aegis-api \
--context=dr-cluster \
-n aegis

Step 6: Verify Failover Complete

curl -sf https://api.aegis.io/health | jq '{status, region}'
# Expected: {"status": "healthy", "region": "dr"}
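
DNS caches and pod startup can make the first few health probes fail even when the failover succeeded, so it helps to wrap the Step 6 check in a bounded retry. A generic sketch (`retry` is a hypothetical helper, not Aegis tooling):

```shell
# Hypothetical bounded retry: run a command up to N times with a fixed delay,
# succeeding on the first zero exit status.
retry() {
  local attempts=$1 delay_s=$2; shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay_s"
  done
  return 1
}

# Self-test: a stub that succeeds on its third invocation
n=$(mktemp); echo 0 > "$n"
flaky() { local c=$(( $(cat "$n") + 1 )); echo "$c" > "$n"; [ "$c" -ge 3 ]; }
retry 5 0 flaky && echo "recovered after retries"
rm -f "$n"
```

Against the live endpoint this might be `retry 12 10 curl -sf https://api.aegis.io/health`, i.e. up to two minutes of probing before treating Step 6 as failed.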

Per-CSP Failover Specifics

AWS

RDS Multi-AZ Failover

# Force failover for Multi-AZ (60-120s downtime)
aws rds reboot-db-instance \
--db-instance-identifier aegis-db \
--force-failover \
--region us-east-1

# Monitor failover event
aws rds describe-events \
--source-identifier aegis-db \
--source-type db-instance \
--duration 60

Route 53 Health Check Management

# Disable primary health check (force failover routing)
aws route53 update-health-check \
--health-check-id hc-1234567890abcdef0 \
--disabled

# Verify traffic routing
aws route53 get-health-check-status \
--health-check-id hc-1234567890abcdef0

EKS Node Scaling

# Scale node group in DR region
aws eks update-nodegroup-config \
--cluster-name aegis-dr \
--nodegroup-name primary \
--scaling-config minSize=3,maxSize=10,desiredSize=6 \
--region us-west-2

# Wait for nodes ready
kubectl wait nodes \
--for=condition=Ready \
--all \
--timeout=300s \
--context=dr-cluster

Azure

SQL Failover Groups

# Initiate planned failover
az sql failover-group set-primary \
--name aegis-fog \
--resource-group aegis-dr-rg \
--server aegis-dr-sql

# Check failover group status
az sql failover-group show \
--name aegis-fog \
--resource-group aegis-dr-rg \
--server aegis-dr-sql \
--query 'replicationState'

Traffic Manager

# Disable primary endpoint
az network traffic-manager endpoint update \
--resource-group aegis-rg \
--profile-name aegis-tm \
--name primary-endpoint \
--type azureEndpoints \
--endpoint-status Disabled

# Verify routing
az network traffic-manager profile show \
--resource-group aegis-rg \
--name aegis-tm \
--query 'endpoints[].{name:name,status:endpointStatus}'

AKS Scaling

# Scale AKS node pool in DR region
az aks nodepool scale \
--resource-group aegis-dr-rg \
--cluster-name aegis-dr-aks \
--name agentpool \
--node-count 6

# Verify nodes
kubectl get nodes --context=aks-dr-cluster

GCP

Cloud SQL HA Failover

# Initiate failover to standby
gcloud sql instances failover aegis-db \
--project=aegis-prod

# Check instance status
gcloud sql instances describe aegis-db \
--project=aegis-prod \
--format='value(state,failoverReplica.available)'

Cloud DNS

# Update DNS record to DR endpoint
gcloud dns record-sets update api.aegis.io. \
--type=CNAME \
--ttl=60 \
--rrdatas=dr-lb.aegis.io. \
--zone=aegis-zone \
--project=aegis-prod

# Verify propagation
gcloud dns record-sets list \
--zone=aegis-zone \
--filter="name=api.aegis.io." \
--project=aegis-prod

GKE Node Pools

# Resize GKE node pool in DR region
gcloud container clusters resize aegis-dr \
--node-pool primary-pool \
--num-nodes 6 \
--region us-central1 \
--project=aegis-prod

# Watch nodes come up
kubectl get nodes -w --context=gke-dr-cluster

Post-Failover Verification

Run all checks before closing the incident.

# 1. API health
curl -sf https://api.aegis.io/health | jq .
# Expected: status=healthy, region=dr

# 2. Verify all deployments running
kubectl get deployment -n aegis --context=dr-cluster

# 3. Check error rate (allow 5-min window to settle)
curl -sG 'http://prometheus-dr:9090/api/v1/query' \
--data-urlencode 'query=rate(aegis_http_requests_total{status=~"5.."}[5m])' | jq '.data.result'

# 4. Data integrity check
kubectl exec -n aegis deployment/aegis-api \
--context=dr-cluster -- \
./aegis db check-integrity

# 5. Verify OPA policy engine loaded
curl -sf https://api.aegis.io/api/v1/policies/health | jq .
# Expected: {"opa_external": "ok", "opa_embedded": "ok"}

# 6. Test end-to-end scan
curl -sf -X POST https://api.aegis.io/api/v1/scans \
-H "Authorization: Bearer $API_TOKEN" \
-d '{"scope": "test", "providers": ["aws"]}' | jq '{id, status}'

Closure criteria (all must hold):

  • API health returns healthy
  • Error rate <1% for 5 consecutive minutes
  • Database writes succeeding
  • Policy engine loaded and evaluating
  • End-to-end scan completes
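
Check #3's raw Prometheus response still needs a threshold decision. One way to script the <1% criterion, assuming a known total request rate; `error_rate_ok` is illustrative and here runs against a canned payload:

```shell
# Illustrative 5xx budget check; `error_rate_ok` is not Aegis tooling.
# Reads the Prometheus query response on stdin, sums the per-series 5xx
# rates, and compares against 1% of an assumed total request rate.
error_rate_ok() {
  local total_rps=$1
  jq -e --argjson total "$total_rps" \
    '[.data.result[].value[1] | tonumber] | add // 0 | . / $total < 0.01' >/dev/null
}

# Canned payload standing in for the curl output from check #3:
sample='{"data":{"result":[{"value":[1700000000,"0.2"]},{"value":[1700000000,"0.1"]}]}}'
echo "$sample" | error_rate_ok 100 && echo "error rate within budget"
```

In production, pipe the check #3 curl output straight into `error_rate_ok`, substituting the cluster's actual total request rate.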

Failback Procedure

Return to the primary region only after it has been confirmed stable for at least 2 hours post-recovery.

# 1. Verify primary region recovered
kubectl get nodes --context=primary-cluster

# 2. Sync any data written during failover (if not using synchronous replication)
./aegis dr sync-from-dr --dry-run
./aegis dr sync-from-dr --execute

# 3. Reduce DNS TTL to 60s (if not already)
# Lower the TTL on the api.aegis.io record via the appropriate CSP DNS command;
# do not change the record target yet (that happens in step 5)

# 4. Scale primary cluster back up
kubectl scale deployment aegis-api \
--replicas=6 \
--context=primary-cluster \
-n aegis

# 5. Redirect DNS back to primary
# Reverse the DNS change from Step 4 of the failover procedure

# 6. Verify traffic on primary
watch -n 10 'curl -sf https://api.aegis.io/health | jq .region'

# 7. Scale down DR cluster to standby capacity
kubectl scale deployment aegis-api \
--replicas=2 \
--context=dr-cluster \
-n aegis

Failback completion criteria:

  • Primary cluster healthy for >2h
  • Data sync completed with zero conflicts
  • DNS propagated to primary
  • DR cluster scaled back to standby
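
The 2-hour stability requirement is easy to fumble during a long incident, so it is worth checking mechanically before step 1 runs. An illustrative guard (`failback_eligible` is hypothetical; it takes the primary's recovery time as epoch seconds):

```shell
# Illustrative guard for the 2h minimum; recovered_at is the epoch-seconds
# timestamp at which the primary region was confirmed recovered.
failback_eligible() {
  local recovered_at=$1 now=${2:-$(date +%s)}
  [ $((now - recovered_at)) -ge $((2 * 3600)) ]
}

# Example: recovered 3h ago -> window open
failback_eligible $(( $(date +%s) - 3 * 3600 )) && echo "failback window open"
```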

Communication Templates

Initial Notification (T+0)

Subject: [INCIDENT] Cloud Aegis DR Failover Initiated

Severity: SEV1
Started: YYYY-MM-DD HH:MM UTC
Incident Channel: #incident-YYYYMMDD-XX

We are initiating a DR failover for Cloud Aegis due to [reason].
Estimated completion: 30 minutes.
Expected impact: API unavailable for up to 5 minutes during DNS propagation.

Next update: T+15 min

Progress Update (T+15)

Update [T+15]:
- Database failover: COMPLETE
- DNS update: IN PROGRESS (propagating)
- DR cluster: SCALED (6/6 pods running)
- ETA to full recovery: 15 minutes

Resolution Notice

Subject: [RESOLVED] Cloud Aegis DR Failover Complete

Resolved: YYYY-MM-DD HH:MM UTC
Duration: XX minutes
Region: [primary -> DR]

All systems operational. Post-mortem scheduled for YYYY-MM-DD.
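
To keep T+0 messages consistent across incidents, the initial notification can be generated rather than hand-edited. A hypothetical helper mirroring the template above:

```shell
# Hypothetical template filler for the initial notification; only the reason
# and channel vary per incident, the rest mirrors the template above.
initial_notice() {
  local reason=$1 channel=$2
  printf 'Subject: [INCIDENT] Cloud Aegis DR Failover Initiated\n\n'
  printf 'Severity: SEV1\nStarted: %s\nIncident Channel: %s\n\n' \
    "$(date -u +'%Y-%m-%d %H:%M UTC')" "$channel"
  printf 'We are initiating a DR failover for Cloud Aegis due to %s.\n' "$reason"
  printf 'Estimated completion: 30 minutes.\n'
  printf 'Expected impact: API unavailable for up to 5 minutes during DNS propagation.\n\n'
  printf 'Next update: T+15 min\n'
}

initial_notice "us-east-1 region outage" "#incident-20250101-01"
```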

Escalation

Condition                         Action
Failover takes >30 min            Escalate to VP Engineering
Data integrity failure            Stop failover, engage DBA + Security
DR environment unhealthy          Contact Platform Team immediately
Two-region simultaneous failure   Engage executive team, activate BCP
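
The first escalation row has a hard time bound, which the incident commander can check mechanically. An illustrative sketch (`escalation_due` is hypothetical; it takes the failover start time as epoch seconds):

```shell
# Illustrative timer for the first escalation row: warn once elapsed failover
# time crosses the 30-minute threshold.
escalation_due() {
  local started_at=$1 now=${2:-$(date +%s)}
  if [ $((now - started_at)) -gt $((30 * 60)) ]; then
    echo "ESCALATE: failover running >30 min; page VP Engineering"
  else
    echo "within RTO window"
  fi
}

# Example: failover started 10 minutes ago
escalation_due $(( $(date +%s) - 10 * 60 ))
```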

Contact Information

  • On-Call: PagerDuty
  • Platform Team: #platform-support (Slack)
  • Security Team: #security-ops (Slack)
  • DBA On-Call: PagerDuty escalation policy aegis-dba