# Runbook: Incident Response

## Overview

This runbook covers incident detection, triage, mitigation, and resolution for Cloud Aegis production issues.

## Incident Classification
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | Service down, data loss risk | 15 min | API unreachable, DB corruption |
| SEV2 | Major degradation | 30 min | 50% error rate, major feature broken |
| SEV3 | Minor degradation | 2 hours | Single endpoint slow, non-critical bug |
| SEV4 | Low impact | 1 business day | UI cosmetic issue, minor inconvenience |
## Detection

### Alert Sources

- PagerDuty - Critical alerts
- Grafana - Metric-based alerts
- Datadog/CloudWatch - Log-based alerts
- Customer Reports - Support tickets

### Key Metrics to Monitor

```promql
# Error rate (should be <0.1%)
sum(rate(aegis_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(aegis_http_requests_total[5m]))

# Latency P99 (should be <500ms)
histogram_quantile(0.99, rate(aegis_http_request_duration_seconds_bucket[5m]))

# Active findings (trend)
aegis_findings_active

# AI provider availability
aegis_health_status{component="ai_provider"}
```
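If Prometheus's HTTP API is reachable, the error-rate query above can be wrapped in a small helper. This is a sketch, not part of the Aegis tooling: the `PROM_URL` default is an assumption, and the 0.1% threshold is taken from this runbook's SLO.

```shell
# Sketch of an error-rate check against Prometheus's HTTP API.
# PROM_URL is an assumption; adjust for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"
QUERY='sum(rate(aegis_http_requests_total{status=~"5.."}[5m])) / sum(rate(aegis_http_requests_total[5m]))'

# Fetch the current error rate as a decimal fraction (requires jq).
fetch_error_rate() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
    | jq -r '.data.result[0].value[1]'
}

# Compare a rate against the 0.1% SLO from this runbook.
check_error_rate() {
  if awk -v r="$1" 'BEGIN { exit !(r > 0.001) }'; then
    echo "BREACH"
  else
    echo "OK"
  fi
}
```

Usage: `check_error_rate "$(fetch_error_rate)"` prints `BREACH` when the rate exceeds 0.1%.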
## Triage Procedure

### Step 1: Initial Assessment (5 min)

```shell
# Check overall status
kubectl get pods -n aegis
kubectl top pods -n aegis

# Check recent deployments
kubectl rollout history deployment/aegis-api -n aegis

# Check logs for errors
kubectl logs -n aegis -l app=aegis-api --tail=100 | grep -i error
```
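As a minimal sketch, the pod check above can be reduced to a quick count of pods that are not fully ready, parsing standard `kubectl get pods` output (the READY column, e.g. `1/2`):

```shell
# Count pods whose READY column shows fewer ready containers than
# expected, reading `kubectl get pods` output from stdin.
count_unready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) n++ } END { print n + 0 }'
}

# usage: kubectl get pods -n aegis | count_unready
```

A nonzero result is a quick signal to dig into pod events before anything else.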
### Step 2: Impact Assessment
- How many users affected?
- Which features impacted?
- Is data at risk?
- When did it start?
### Step 3: Classification

Based on impact, classify the severity and engage the appropriate responders.

## Common Issues and Remediation

### Issue: High API Error Rate
Symptoms: 5xx errors, timeouts

Diagnosis:

```shell
# Check API logs
kubectl logs -n aegis -l app=aegis-api --tail=500 | grep "ERROR\|FATAL"

# Check resource usage
kubectl top pods -n aegis

# Check database connectivity
kubectl exec -n aegis deployment/aegis-api -- ./aegis health
```
Remediation:
- If OOM: Increase memory limits, then investigate memory leak
- If CPU: Scale horizontally, then optimize hot paths
- If DB connection: Check connection pool, DB health
- If external dependency: Check provider status, enable fallback
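For the horizontal-scaling path, one hedged sketch of a policy: double the replica count up to a cap. The cap of 10 and the use of `kubectl scale` on `deployment/aegis-api` are assumptions for illustration, not fixed Aegis settings.

```shell
# Compute the next replica count: double the current value, capped.
# $1: current replicas, $2: optional cap (assumed default 10).
next_replicas() {
  cur=$1
  cap=${2:-10}
  next=$((cur * 2))
  if [ "$next" -gt "$cap" ]; then next=$cap; fi
  echo "$next"
}

# usage:
# cur=$(kubectl get deployment/aegis-api -n aegis -o jsonpath='{.spec.replicas}')
# kubectl scale deployment/aegis-api -n aegis --replicas="$(next_replicas "$cur")"
```

Doubling converges quickly during an incident; remember to scale back down once the hot path is optimized.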
### Issue: Database Connection Failures

Symptoms: "connection refused", "too many connections"

Diagnosis:

```shell
# Check connection count
kubectl exec -n aegis deployment/aegis-api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
kubectl exec -n aegis deployment/aegis-api -- \
  psql $DATABASE_URL -c "SHOW max_connections;"
```
Remediation:

- Kill idle connections:

  ```sql
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes';
  ```

- Increase the connection pool size in config
- Add PgBouncer if not already present
- Scale up the DB instance if the connection limit is reached
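A quick headroom check makes the "too many connections" diagnosis concrete before any remediation. The 80% warning threshold is an assumption for this sketch, not a Postgres default.

```shell
# Warn when current connections exceed 80% of max_connections.
# $1: current connection count, $2: max_connections
conn_headroom() {
  awk -v c="$1" -v m="$2" 'BEGIN { if (c > 0.8 * m) print "WARN"; else print "OK" }'
}

# usage (feeding in the two psql queries from the diagnosis step):
# conn_headroom "$(psql "$DATABASE_URL" -tAc 'SELECT count(*) FROM pg_stat_activity;')" \
#               "$(psql "$DATABASE_URL" -tAc 'SHOW max_connections;')"
```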
### Issue: AI Provider Timeouts

Symptoms: slow analysis, AI-powered features fail

Diagnosis:

```shell
# Check AI provider status
curl -s https://status.anthropic.com/api/v2/status.json | jq .
curl -s https://status.openai.com/api/v2/status.json | jq .

# Check rate limit status
kubectl logs -n aegis -l app=aegis-api | grep "rate_limit"
```
Remediation:
- Enable fallback provider in config
- Increase timeout if provider slow but working
- Enable cached responses for repeat queries
- Gracefully degrade to static analysis
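Both status endpoints above follow the Statuspage v2 schema, where `status.indicator` is `"none"` when all systems are operational. A small stdin parser (grep-based here, to avoid a hard jq dependency) can gate the fallback decision; treating anything other than `none` as degraded is this sketch's assumption.

```shell
# Read Statuspage v2 status JSON on stdin; indicator "none" means healthy.
provider_ok() {
  if grep -q '"indicator": *"none"'; then
    echo "OK"
  else
    echo "DEGRADED"
  fi
}

# usage: curl -s https://status.anthropic.com/api/v2/status.json | provider_ok
```

When this prints `DEGRADED`, enabling the fallback provider proactively beats waiting for request timeouts.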
### Issue: High Memory Usage

Symptoms: OOMKilled pods, rising memory trend

Diagnosis:

```shell
# Check memory usage
kubectl top pods -n aegis

# Capture a heap profile (assumes pprof is exposed on :6060)
curl -s http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```
Remediation:
- Restart affected pods (temporary)
- Reduce batch sizes for processing
- Add memory limits enforcement
- Investigate and fix memory leak
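To pick candidates for the temporary restart, `kubectl top pods` output can be filtered for pods over a memory threshold. The 900Mi default is an assumption for this sketch; set it just below your actual limit.

```shell
# Print pod names whose reported memory exceeds a threshold in Mi,
# reading `kubectl top pods` output (NAME CPU MEMORY) from stdin.
high_mem_pods() {
  thr=${1:-900}
  awk -v t="$thr" 'NR > 1 { mem = $3; gsub(/Mi$/, "", mem); if (mem + 0 > t) print $1 }'
}

# usage: kubectl top pods -n aegis | high_mem_pods 900
```

Restarting only the flagged pods keeps capacity up while the leak is investigated.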
## Communication

### Internal Communication

- Create an incident channel: `#incident-YYYYMMDD-XX`
- Post an initial update with:
  - What's happening
  - Who is investigating
  - Current impact
- Update every 15 minutes for SEV1-2
### External Communication (if customer-facing)
- Update status page
- Prepare customer communication
- Coordinate with support team
## Post-Incident

### Immediate (within 24h)
- Document timeline
- Confirm service restored
- Remove any temporary mitigations
- Update monitoring if gap identified
### Post-Mortem (within 5 days)
- Schedule blameless post-mortem
- Document root cause
- Create action items
- Share learnings
## Escalation Matrix
| Severity | Primary | Escalation (30 min) | Escalation (1h) |
|---|---|---|---|
| SEV1 | On-Call | Engineering Manager | VP Engineering |
| SEV2 | On-Call | Tech Lead | Engineering Manager |
| SEV3 | On-Call | Tech Lead | - |
| SEV4 | Assigned Engineer | - | - |
## Contact Information
- On-Call: PagerDuty
- Engineering Manager: @eng-manager
- Security: #security-ops
- Customer Success: #customer-success