Runbook: Remediation Operations
Overview
This runbook covers operating Cloud Aegis's remediation dispatcher, including:
- Dispatcher architecture and executor model
- Starting and monitoring remediation batches
- Per-handler operations (network, storage, compute, identity, security services)
- Rollback procedures
- Emergency stop
- Troubleshooting common failures
- Reporting and SLA compliance
Prerequisites
- Write-access cloud credentials for the target account/subscription
- kubectl access to the Cloud Aegis cluster
- Cloud Aegis API token with remediation:execute scope
- Runbook 02-incident-response.md reviewed if remediating an active incident
- Change management approval for production remediations
Dispatcher Architecture
The remediation dispatcher runs as a Kubernetes Deployment (aegis-remediation) and exposes a tiered execution model:
| Tier | Handler Types | Concurrency | Timeout |
|---|---|---|---|
| T1 (Fast) | Network, Storage ACLs | 10 parallel | 30s |
| T2 (Standard) | Compute config, IAM | 5 parallel | 120s |
| T3 (Slow) | Patching, Key rotation | 2 parallel | 600s |
State snapshots are written to S3/GCS before every remediation. The rollback window is 48h.
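The tier table can be encoded as a small lookup so that wrapper scripts use client-side limits matching the dispatcher's. This is a minimal sketch: tier_concurrency and tier_timeout are hypothetical helper names, not part of the aegis CLI, and the values are copied from the table above.

```shell
# Hypothetical helpers mirroring the tier table; not part of the aegis CLI.

# Print the maximum number of parallel executions for a tier.
tier_concurrency() {
  case "$1" in
    T1) echo 10 ;;
    T2) echo 5 ;;
    T3) echo 2 ;;
    *)  echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Print the per-remediation timeout in seconds for a tier.
tier_timeout() {
  case "$1" in
    T1) echo 30 ;;
    T2) echo 120 ;;
    T3) echo 600 ;;
    *)  echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Example: cap a client call at the tier timeout with curl's --max-time flag.
# curl --max-time "$(tier_timeout T1)" ...
```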
Starting a Remediation Batch
Step 1: Dry Run (Required)
Always run a dry run first to preview the changes the batch would make.
# Dry run against a specific finding set
curl -s -X POST https://api.aegis.io/api/v1/remediations \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"finding_ids": ["f-abc123", "f-def456"],
"dry_run": true,
"executor_tier": "T1"
}' | jq '{batch_id, actions_preview, estimated_duration_s}'
# Or via CLI
./aegis remediate --finding-ids f-abc123,f-def456 --dry-run
Review the actions_preview list. Confirm each action is expected before proceeding.
Step 2: Execute Batch
# Execute with the batch_id from dry run
curl -s -X POST https://api.aegis.io/api/v1/remediations \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"batch_id": "batch-20260226-001",
"finding_ids": ["f-abc123", "f-def456"],
"dry_run": false,
"executor_tier": "T1"
}' | jq '{batch_id, status, started_at}'
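The dry-run and execute requests share the same body apart from dry_run and batch_id, so the body can be generated rather than hand-edited between steps. The sketch below assumes finding IDs contain no characters that need JSON escaping; remediation_payload is a hypothetical helper name.

```shell
# Hypothetical helper: build the JSON body for POST /api/v1/remediations.
# Usage: remediation_payload <true|false> <tier> <batch_id or ""> <finding_id>...
# Assumes IDs are plain tokens (no quotes or backslashes); no JSON escaping is done.
remediation_payload() {
  local dry_run="$1" tier="$2" batch_id="$3" ids="" id
  shift 3
  for id in "$@"; do
    # Append each ID as a quoted string, comma-separated after the first.
    ids="${ids:+$ids, }\"$id\""
  done
  if [ -n "$batch_id" ]; then
    printf '{"batch_id": "%s", "finding_ids": [%s], "dry_run": %s, "executor_tier": "%s"}\n' \
      "$batch_id" "$ids" "$dry_run" "$tier"
  else
    printf '{"finding_ids": [%s], "dry_run": %s, "executor_tier": "%s"}\n' \
      "$ids" "$dry_run" "$tier"
  fi
}

# Example: pass the generated body to curl with -d.
# curl -s -X POST https://api.aegis.io/api/v1/remediations \
#   -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" \
#   -d "$(remediation_payload true T1 "" f-abc123 f-def456)"
```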
Step 3: Monitor Progress
# Poll batch status
watch -n 10 'curl -sf https://api.aegis.io/api/v1/remediations/batch-20260226-001 | \
jq "{status, completed, failed, pending, elapsed_s}"'
# Stream dispatcher logs
kubectl logs -n aegis -l app=aegis-remediation -f | \
grep "batch-20260226-001"
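For unattended runs, a bounded polling loop may be preferable to watch. The following is a sketch only: the terminal status names "completed" and "failed" are assumptions, since this runbook does not list the API's status values, and wait_for_batch is a hypothetical helper.

```shell
# Poll a status command until the batch reaches a terminal state or attempts run out.
# The status names "completed" and "failed" are assumptions; adjust to the API's values.
# Usage: wait_for_batch <status_command> [interval_s] [max_attempts]
# Prints completed (exit 0), failed (exit 1), or timeout (exit 2).
wait_for_batch() {
  local fetch="$1" interval="${2:-10}" max_attempts="${3:-60}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failed fetch is treated as non-terminal so transient errors do not abort the wait.
    status="$("$fetch")" || status="unknown"
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed"; return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "timeout"
  return 2
}

# Example fetcher (uses the status field shown in the jq filter above):
# batch_status() { curl -sf https://api.aegis.io/api/v1/remediations/batch-20260226-001 | jq -r .status; }
# wait_for_batch batch_status 10 60
```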
Monitoring Active Remediations
Prometheus Metrics
# Active remediation count by tier
sum by (tier) (aegis_remediations_active)
# Success rate over 1h
sum(rate(aegis_remediations_total{status="success"}[1h]))
/ sum(rate(aegis_remediations_total[1h]))
# p95 duration by handler type
histogram_quantile(0.95, sum by (le, handler) (rate(aegis_remediation_duration_seconds_bucket[1h])))
# Error rate
rate(aegis_remediations_total{status="failed"}[5m])
Log-Based Monitoring
# All remediation errors in the last hour
kubectl logs -n aegis -l app=aegis-remediation \
--since=1h | grep '"level":"error"' | jq '{handler, finding_id, error}'
# Successful remediations
kubectl logs -n aegis -l app=aegis-remediation \
--since=1h | grep '"status":"success"' | \
jq -r '[.handler, .finding_id, .resource_id] | @tsv'
Per-Handler Operations
Network: BlockPublicSSH
Blocks inbound SSH (22) and RDP (3389) from 0.0.0.0/0 and ::/0.
# Trigger manually for a specific security group
curl -s -X POST https://api.aegis.io/api/v1/remediations/handlers/block-public-ssh \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"resource_id": "sg-0abc123def456789", "provider": "aws", "dry_run": true}'
# Direct AWS CLI equivalent for the IPv4 SSH rule only (for verification)
aws ec2 revoke-security-group-ingress \
--group-id sg-0abc123def456789 \
--protocol tcp \
--port 22 \
--cidr 0.0.0.0/0
# Verify rule removed
aws ec2 describe-security-groups \
--group-ids sg-0abc123def456789 \
--query 'SecurityGroups[0].IpPermissions[?FromPort==`22`]'
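The handler description covers two ports and two address families, while the CLI example above covers only SSH over IPv4. The sketch below prints (and does not run) one revoke command per combination, using the --ip-permissions shorthand of aws ec2 revoke-security-group-ingress. It assumes each rule is a single-port TCP rule; a rule with a different port range or protocol will not match and has to be revoked with its exact definition. print_public_admin_revokes is a hypothetical helper name.

```shell
# Print (do not run) revoke commands for SSH and RDP over IPv4 and IPv6.
# Hypothetical helper; review the output before executing any line.
# Assumes single-port TCP rules; revoke only matches a rule's exact definition.
print_public_admin_revokes() {
  local group_id="$1" port
  for port in 22 3389; do
    printf 'aws ec2 revoke-security-group-ingress --group-id %s --ip-permissions %s\n' \
      "$group_id" "IpProtocol=tcp,FromPort=$port,ToPort=$port,IpRanges='[{CidrIp=0.0.0.0/0}]'"
    printf 'aws ec2 revoke-security-group-ingress --group-id %s --ip-permissions %s\n' \
      "$group_id" "IpProtocol=tcp,FromPort=$port,ToPort=$port,Ipv6Ranges='[{CidrIpv6=::/0}]'"
  done
}

# Example:
# print_public_admin_revokes sg-0abc123def456789
```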
Storage: S3 Public Access Blocking
# Trigger for a specific S3 bucket
curl -s -X POST https://api.aegis.io/api/v1/remediations/handlers/s3-block-public-access \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"resource_id": "my-bucket", "provider": "aws", "dry_run": true}'
# Direct AWS CLI equivalent
aws s3api put-public-access-block \
--bucket my-bucket \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
# Verify
aws s3api get-public-access-block --bucket my-bucket
Compute: IMDSv2 Enforcement
# Trigger for a specific EC2 instance
curl -s -X POST https://api.aegis.io/api/v1/remediations/handlers/enforce-imdsv2 \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"resource_id": "i-0abc123def456789", "provider": "aws", "dry_run": true}'
# Direct AWS CLI equivalent
aws ec2 modify-instance-metadata-options \
--instance-id i-0abc123def456789 \
--http-tokens required \
--http-endpoint enabled
# Verify
aws ec2 describe-instances \
--instance-ids i-0abc123def456789 \
--query 'Reservations[0].Instances[0].MetadataOptions'
Identity: IAM Key Rotation
Deactivates keys older than the configured threshold (default: 90 days).
# Trigger key rotation check for a user
curl -s -X POST https://api.aegis.io/api/v1/remediations/handlers/rotate-iam-keys \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"resource_id": "arn:aws:iam::123456789012:user/svc-aegis", "provider": "aws", "dry_run": true}'
# Direct AWS CLI: deactivate stale key
aws iam update-access-key \
--access-key-id AKIAIOSFODNN7EXAMPLE \
--status Inactive \
--user-name svc-aegis
# Verify
aws iam list-access-keys \
--user-name svc-aegis \
--query 'AccessKeyMetadata[].{KeyId:AccessKeyId,Status:Status,Created:CreateDate}'
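When verifying by hand, the age check against the 90-day default can be reproduced with plain integer arithmetic. This sketch takes epoch seconds to stay portable and counts whole days only; it is not necessarily how the handler computes age. key_is_stale is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) when a key is older than the threshold.
# Usage: key_is_stale <created_epoch_s> <now_epoch_s> [threshold_days]
# Age is truncated to whole days, so a key must be at least threshold+1 days old.
key_is_stale() {
  local created="$1" now="$2" threshold_days="${3:-90}"
  local age_days=$(( (now - created) / 86400 ))
  [ "$age_days" -gt "$threshold_days" ]
}

# Example (converting CreateDate with date -d is a GNU date assumption):
# created=$(date -d "2025-01-15T10:00:00Z" +%s)
# key_is_stale "$created" "$(date +%s)" && echo "key exceeds 90 days"
```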
Security Services: GuardDuty Enablement
# Enable GuardDuty in a region
curl -s -X POST https://api.aegis.io/api/v1/remediations/handlers/enable-guardduty \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"resource_id": "123456789012", "region": "us-east-1", "provider": "aws", "dry_run": true}'
# Direct AWS CLI equivalent
aws guardduty create-detector \
--enable \
--finding-publishing-frequency FIFTEEN_MINUTES \
--region us-east-1
# Verify
aws guardduty list-detectors --region us-east-1
Patching: SSM Compliance Queries
# Check SSM patch compliance for an instance
aws ssm describe-instance-patch-states \
--instance-ids i-0abc123def456789 \
--query 'InstancePatchStates[0].{Missing:MissingCount,Failed:FailedCount,Installed:InstalledCount}'
# Trigger SSM patch run
aws ssm send-command \
--instance-ids i-0abc123def456789 \
--document-name "AWS-RunPatchBaseline" \
--parameters "Operation=Install" \
--timeout-seconds 600
# Poll command status
aws ssm get-command-invocation \
--command-id <command-id> \
--instance-id i-0abc123def456789 \
--query '{Status:Status,StatusDetails:StatusDetails}'
Rollback Procedures
The dispatcher snapshots resource state before every change. Rollbacks are available for 48h.
# List available snapshots for a batch
curl -sf https://api.aegis.io/api/v1/remediations/batch-20260226-001/snapshots | \
jq '.[] | {snapshot_id, resource_id, handler, taken_at}'
# Dry-run rollback
curl -s -X POST https://api.aegis.io/api/v1/remediations/batch-20260226-001/rollback \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"snapshot_id": "snap-001", "dry_run": true}'
# Execute rollback
curl -s -X POST https://api.aegis.io/api/v1/remediations/batch-20260226-001/rollback \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"snapshot_id": "snap-001", "dry_run": false}'
# Verify rollback
curl -sf https://api.aegis.io/api/v1/remediations/batch-20260226-001 | jq .rollback_status
Manual Rollback (if dispatcher unavailable)
Snapshots are stored in S3 at s3://aegis-snapshots/remediations/<batch-id>/.
# Retrieve snapshot
aws s3 cp s3://aegis-snapshots/remediations/batch-20260226-001/snap-001.json .
# Inspect and apply manually using the appropriate CSP CLI
cat snap-001.json | jq .previous_state
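During a manual rollback it can help to derive the snapshot location from the batch and snapshot IDs instead of typing the path. The sketch below follows the bucket layout and .json naming shown above; snapshot_uri is a hypothetical helper name.

```shell
# Hypothetical helper: build the S3 URI of a snapshot from the layout described above.
# Usage: snapshot_uri <batch_id> <snapshot_id>
snapshot_uri() {
  local batch_id="$1" snapshot_id="$2"
  if [ -z "$batch_id" ] || [ -z "$snapshot_id" ]; then
    echo "usage: snapshot_uri <batch_id> <snapshot_id>" >&2
    return 1
  fi
  printf 's3://aegis-snapshots/remediations/%s/%s.json\n' "$batch_id" "$snapshot_id"
}

# Example:
# aws s3 cp "$(snapshot_uri batch-20260226-001 snap-001)" .
```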
Emergency Stop
To halt all running remediations immediately:
# Kill dispatcher pods (the Deployment recreates them automatically, but the queue is drained)
kubectl delete pod -l app=aegis-remediation -n aegis
# Scale the Deployment to 0 replicas so recreated pods are removed and no new work is dequeued
kubectl patch deployment aegis-remediation -n aegis \
--type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 0}]'
# Verify stopped
kubectl get pods -n aegis -l app=aegis-remediation
# Resume after investigation
kubectl patch deployment aegis-remediation -n aegis \
--type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 2}]'
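Under pressure, a wrapper with the namespace and Deployment name filled in reduces typing errors. This is a sketch under stated assumptions: it uses kubectl scale instead of the JSON patch shown above, skips the pod deletion and goes straight to the scale-down, and the function names are hypothetical. The default of 2 replicas matches the resume step above.

```shell
# Hypothetical wrappers around the emergency-stop steps.
# Namespace, Deployment name, and pod label are taken from this runbook.

# Scale the dispatcher to zero, then list pods to confirm they are gone.
emergency_stop() {
  local ns="${1:-aegis}" deploy="${2:-aegis-remediation}"
  kubectl scale deployment "$deploy" -n "$ns" --replicas=0 || return 1
  kubectl get pods -n "$ns" -l "app=$deploy"
}

# Restore the replica count after investigation.
resume_dispatcher() {
  local replicas="${1:-2}" ns="${2:-aegis}" deploy="${3:-aegis-remediation}"
  kubectl scale deployment "$deploy" -n "$ns" --replicas="$replicas"
}
```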
Troubleshooting Common Failures
Permission Denied
Symptoms: remediation failed: AccessDenied or 403 Forbidden
Diagnosis:
kubectl logs -n aegis -l app=aegis-remediation | \
grep '"error":"AccessDenied"' | jq '{finding_id, resource_id, action}'
Resolution:
- Verify IAM role/service principal has required policy attached
- Check for Service Control Policies (SCPs) blocking the action
- Confirm resource is in scope (correct account, region)
- Check if resource has a resource-based policy denying the action
Resource Locked
Symptoms: ConflictException, ResourceInUse, or resource is being modified
Diagnosis:
# Check if resource has pending operations
aws ec2 describe-instance-status \
--instance-ids i-0abc123def456789 \
--include-all-instances \
--query 'InstanceStatuses[0].InstanceState'
Resolution:
- Wait 60s and retry; most locks are transient
- If persistent, check for concurrent automation (other tools, AWS automation)
- Manually verify the resource state and complete or cancel the competing operation
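The wait-and-retry advice can be scripted with a fixed delay and a bounded attempt count, so a persistent lock surfaces as a failure instead of looping forever. A minimal sketch; retry_with_delay is a hypothetical helper name, and the command being retried is whatever operation hit the lock.

```shell
# Retry a command with a fixed delay between attempts.
# Usage: retry_with_delay <max_attempts> <delay_s> <command> [args...]
# Returns 0 on the first success, 1 once all attempts have failed.
retry_with_delay() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: up to 3 attempts, 60s apart.
# retry_with_delay 3 60 aws ec2 modify-instance-metadata-options \
#   --instance-id i-0abc123def456789 --http-tokens required --http-endpoint enabled
```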
Dependency Conflict
Symptoms: Remediation succeeds but finding remains open; re-scan immediately re-flags the resource
Diagnosis:
# Get finding details to identify root resource vs dependent resource
curl -sf https://api.aegis.io/api/v1/findings/f-abc123 | jq .dependencies
Resolution:
- Check if a parent resource (e.g., Launch Template) is propagating the misconfiguration
- Remediate the root resource, not the derived one
- If an IaC pipeline is overwriting changes, add a policy exception and fix the IaC source
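For quick triage of a failed item, the error-string symptoms listed for Permission Denied and Resource Locked can be matched mechanically. Dependency conflicts have no error string and are not covered. The category names and classify_failure are hypothetical; only the matched substrings come from the symptoms above.

```shell
# Hypothetical triage helper: map an error message to a troubleshooting section.
# Usage: classify_failure "<error message>"
classify_failure() {
  case "$1" in
    *AccessDenied*|*"403 Forbidden"*)
      echo "permission-denied" ;;
    *ConflictException*|*ResourceInUse*|*"is being modified"*)
      echo "resource-locked" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example:
# classify_failure "remediation failed: AccessDenied"
```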
Remediation Reporting
# Success/failure rates for last 7 days
curl -sf "https://api.aegis.io/api/v1/remediations/report?period=7d" | \
jq '{total, succeeded, failed, success_rate, sla_compliance_pct}'
# Breakdown by handler
curl -sf "https://api.aegis.io/api/v1/remediations/report?period=7d&group_by=handler" | \
jq '.[] | {handler, total, success_rate, avg_duration_s}'
# SLA compliance (target: 95% of T1 remediations complete within 5 min)
curl -sG 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(aegis_remediations_total{tier="T1",status="success",duration_bucket=~"[0-9]+"}[7d])) / sum(rate(aegis_remediations_total{tier="T1"}[7d]))' | jq .
Post-Batch Checklist
- Batch status shows no failed items
- Re-scan triggered to verify finding closure
- Rollback snapshots confirmed in S3 (for audit)
- Change management ticket updated with results
- Metrics reviewed in Grafana
Escalation
| Condition | Action |
|---|---|
| Batch failure rate >20% | Stop batch, investigate permissions |
| Rollback fails | Manual remediation, engage Platform Team |
| Emergency stop needed | Scale replicas to 0, page on-call |
| SLA compliance <90% | Page on-call, review dispatcher capacity |
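The two numeric thresholds in the table can be checked in a script, for example at the end of a batch wrapper. This is a sketch: the function names and the printed action labels are hypothetical, and the inputs are assumed to come from the batch status counts and the report's sla_compliance_pct field.

```shell
# Print the escalation action for a batch failure rate (threshold: >20%).
# Usage: batch_escalation <failed_count> <total_count>
batch_escalation() {
  local failed="$1" total="$2"
  # Compare failed/total > 20/100 with integers to avoid floating point.
  if [ "$total" -gt 0 ] && [ $((failed * 100)) -gt $((total * 20)) ]; then
    echo "stop-batch"
  else
    echo "none"
  fi
}

# Print the escalation action for an SLA compliance percentage (threshold: <90%).
# Usage: sla_escalation <percentage>   (may be fractional, e.g. 89.5)
sla_escalation() {
  # awk handles the fractional comparison; exit 0 means the value is below 90.
  if awk -v pct="$1" 'BEGIN { exit !(pct + 0 < 90) }'; then
    echo "page-on-call"
  else
    echo "none"
  fi
}
```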
Contact Information
- On-Call: PagerDuty
- Platform Team: #platform-support (Slack)
- Security Team: #security-ops (Slack)