Runbook: Performance Troubleshooting
Overview
This runbook covers diagnosing and resolving performance issues in Cloud Aegis, including slow API responses, high latency, and resource exhaustion.
Prerequisites
- Access to Grafana dashboards
- kubectl access to production
- pprof endpoint access (internal network only)
Performance Baselines
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| API P50 latency | <50ms | >200ms | >500ms |
| API P99 latency | <200ms | >500ms | >2s |
| Error rate | <0.1% | >1% | >5% |
| CPU usage | <60% | >80% | >90% |
| Memory usage | <70% | >85% | >95% |
| DB query time | <50ms | >200ms | >1s |
Diagnosis Workflow
Step 1: Identify the Bottleneck
# Check overall latency
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(aegis_http_request_duration_seconds_bucket[5m]))by(le))'
# Check by endpoint
curl -s 'http://prometheus:9090/api/v1/query?query=topk(10,histogram_quantile(0.99,sum(rate(aegis_http_request_duration_seconds_bucket[5m]))by(le,path)))'
# Check resource usage
kubectl top pods -n aegis
Step 2: CPU Profiling
# Capture a 30-second CPU profile
curl -s 'http://aegis-api:6060/debug/pprof/profile?seconds=30' > cpu.prof
# Analyze locally
go tool pprof -http=:8080 cpu.prof
Common CPU issues:
- JSON serialization (use sonic or jsoniter)
- Regex compilation in hot paths (compile once and reuse; see the sketch below)
- Excessive logging
- Inefficient algorithms
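As a minimal illustration of the regex fix, compile the pattern once at package level and reuse it on every request; the pattern and function names below are illustrative, not Aegis code:

```go
package main

import (
	"fmt"
	"regexp"
)

// Compiled once at init time and reused by every request.
// Compiling inside the handler burns CPU on every call.
var findingIDPattern = regexp.MustCompile(`^FND-[0-9]{8}$`) // illustrative pattern

func validateFindingID(id string) error {
	if !findingIDPattern.MatchString(id) {
		return fmt.Errorf("invalid finding ID: %q", id)
	}
	return nil
}

func main() {
	fmt.Println(validateFindingID("FND-12345678")) // <nil>
	fmt.Println(validateFindingID("bogus"))        // invalid finding ID: "bogus"
}
```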
Step 3: Memory Profiling
# Capture heap profile
curl -s http://aegis-api:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof heap.prof
> top10
> list <function>
Common memory issues:
- Unbounded slice growth
- String concatenation in loops (see the example after this list)
- Holding references that prevent garbage collection
- Oversized object pools that pin large buffers
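For the string-concatenation case, a small sketch of the allocation-friendly pattern with strings.Builder; the report-building function is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// buildReport pre-sizes a Builder and appends into it, instead of
// `s += line` in a loop, which reallocates and copies on every pass.
func buildReport(lines []string) string {
	var b strings.Builder
	b.Grow(len(lines) * 32) // rough pre-allocation; avoids repeated growth
	for _, line := range lines {
		b.WriteString(line)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Print(buildReport([]string{"finding A", "finding B"}))
}
```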
Step 4: Database Analysis
# Check slow queries (on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
kubectl exec -n aegis deployment/aegis-api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT query, calls, mean_time, total_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;"'
# Check active connections
kubectl exec -n aegis deployment/aegis-api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"'
# Check table sizes
kubectl exec -n aegis deployment/aegis-api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"'
Step 5: Goroutine Analysis
# Check goroutine count
curl -s 'http://aegis-api:6060/debug/pprof/goroutine?debug=1' | head -50
# Full goroutine dump
curl -s 'http://aegis-api:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
Common goroutine issues:
- Goroutine leaks (missing context cancellation; illustrated below)
- Blocking on channels
- Mutex contention
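A minimal sketch of the context-cancellation fix for goroutine leaks (names and timeouts are illustrative): the worker selects on ctx.Done() so it can exit when the caller gives up, instead of blocking on the channel forever.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetch exits when the request context is cancelled rather than blocking
// forever on the results channel -- the classic goroutine-leak fix.
func fetch(ctx context.Context, results chan<- string) {
	select {
	case results <- "slow backend result":
	case <-ctx.Done(): // caller gave up; don't leak this goroutine
	}
}

func handler() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel() // always cancel so fetch can unwind

	results := make(chan string) // unbuffered: sender blocks until read or cancel
	go fetch(ctx, results)

	select {
	case r := <-results:
		fmt.Println("got:", r)
	case <-ctx.Done():
		fmt.Println("timed out:", ctx.Err())
	}
}

func main() { handler() }
```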
Common Issues and Fixes
Slow Database Queries
Symptoms: High P99 latency, increasing query times
Quick fixes:
-- Add missing index
CREATE INDEX CONCURRENTLY idx_findings_created
ON findings(created_at DESC)
WHERE status = 'open';
-- Vacuum and analyze
VACUUM ANALYZE findings;
-- Check for lock contention
SELECT * FROM pg_locks WHERE NOT granted;
Long-term fixes:
- Add query timeouts (example below)
- Implement query result caching
- Partition large tables by date
- Add read replicas for reporting queries
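A sketch of a per-query timeout using database/sql and context; the pgx driver import and the findings query are assumptions for illustration, not the actual Aegis data layer:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // assumed driver; any database/sql Postgres driver works
)

// countOpenFindings runs the query with a hard 2s deadline so one slow
// statement can't hold a connection (and a caller) indefinitely.
func countOpenFindings(ctx context.Context, db *sql.DB) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var n int
	err := db.QueryRowContext(ctx,
		`SELECT count(*) FROM findings WHERE status = 'open'`).Scan(&n)
	return n, err
}

func main() {
	db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	n, err := countOpenFindings(context.Background(), db)
	fmt.Println(n, err)
}
```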
Memory Exhaustion
Symptoms: OOMKilled, gradual memory increase
Quick fixes:
# Increase memory limit (temporary)
kubectl patch deployment aegis-api -n aegis \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "4Gi"}]'
Long-term fixes:
- Implement streaming for large responses (sketched below)
- Process data in bounded batches instead of materializing full result sets
- Cap goroutine concurrency so in-flight work cannot grow memory without bound
- Profile and fix memory leaks
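A sketch of streaming a large response with json.NewEncoder instead of building the full result set in memory before marshalling; the handler, types, and fetchPage helper are illustrative only:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// listFindings streams rows one at a time instead of building the whole
// result set in memory and marshalling it in a single large allocation.
func listFindings(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	enc := json.NewEncoder(w)

	// fetchPage is a stand-in for a paginated DB cursor.
	for page := 0; ; page++ {
		rows, more := fetchPage(page)
		for _, row := range rows {
			_ = enc.Encode(row) // newline-delimited JSON, written incrementally
		}
		if !more {
			return
		}
	}
}

type finding struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

// fetchPage is illustrative only: one small page of fake data.
func fetchPage(page int) ([]finding, bool) {
	if page > 0 {
		return nil, false
	}
	return []finding{{ID: "FND-1", Status: "open"}, {ID: "FND-2", Status: "resolved"}}, true
}

func main() {
	http.HandleFunc("/findings", listFindings)
	_ = http.ListenAndServe(":8080", nil)
}
```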
High CPU Usage
Symptoms: Throttled pods, slow responses
Quick fixes:
# Scale horizontally
kubectl scale deployment aegis-api -n aegis --replicas=5
Long-term fixes:
- Cache computed results
- Optimize hot code paths
- Use worker pools for CPU-intensive work (see the worker-pool sketch below)
- Add batch processing for bulk operations
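A minimal worker-pool sketch that bounds concurrency for CPU-intensive work; the "scoring" step is a stand-in for whatever the hot path computes:

```go
package main

import (
	"fmt"
	"sync"
)

// scanAll fans work out to a fixed pool of workers so CPU-heavy scoring
// can't spawn an unbounded number of goroutines under load.
func scanAll(items []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string, len(items)) // buffered so workers never block on send

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				results <- "scored:" + item // stand-in for CPU-intensive work
			}
		}()
	}

	for _, item := range items {
		jobs <- item
	}
	close(jobs)
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(scanAll([]string{"a", "b", "c", "d"}, 2))
}
```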
Connection Pool Exhaustion
Symptoms: "connection pool exhausted", intermittent failures
Quick fixes:
# Check and increase pool size
kubectl edit configmap aegis-config -n aegis
# Update: database.max_connections: 100
Long-term fixes:
- Add PgBouncer for connection pooling
- Reduce connection hold times and set explicit pool limits (sketch below)
- Implement connection health checking
- Add circuit breaker for downstream services
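Whichever long-term fix is chosen, explicit pool limits in the application keep a traffic spike from exhausting Postgres connections. A sketch with database/sql; the driver import and the limit values are assumptions to tune per environment:

```go
package main

import (
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // assumed driver
)

// openDB applies explicit pool limits so a traffic spike saturates the pool
// gracefully instead of exhausting Postgres connections.
func openDB() (*sql.DB, error) {
	db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(50)                  // hard cap; keep below Postgres max_connections
	db.SetMaxIdleConns(10)                  // idle connections kept warm for reuse
	db.SetConnMaxIdleTime(5 * time.Minute)  // recycle connections that sit idle
	db.SetConnMaxLifetime(30 * time.Minute) // rotate connections through PgBouncer/LB
	return db, nil
}

func main() {
	db, err := openDB()
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```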
Optimization Checklist
API Layer
- Response compression enabled (gzip)
- Connection keep-alive configured
- Request timeout limits set (server timeout sketch below)
- Rate limiting prevents overload
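A sketch of server-side timeout configuration with net/http; the values are starting points, not Aegis's actual settings:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{
		Addr:              ":8080",
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,  // slowloris protection
		ReadTimeout:       10 * time.Second, // full request body read
		WriteTimeout:      30 * time.Second, // response write, covers slow clients
		IdleTimeout:       90 * time.Second, // keep-alive connection reuse window
	}
	log.Fatal(srv.ListenAndServe())
}
```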
Database Layer
- Connection pooling configured
- Query timeouts set
- Slow query logging enabled
- Indexes optimized for common queries
Caching Layer
- Redis caching for hot data (cache-aside sketch below)
- Cache hit rate >80%
- TTL configured appropriately
- Cache invalidation working
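A sketch of the cache-aside pattern with a TTL, assuming the go-redis client; the key names, TTL, and dashboard helper are illustrative:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9" // assumed Redis client library
)

// cachedDashboard returns the rendered dashboard from Redis when present,
// otherwise recomputes it and stores it with a short TTL.
func cachedDashboard(ctx context.Context, rdb *redis.Client, orgID string) (string, error) {
	key := "dashboard:" + orgID

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // real Redis error, not just a miss
	}

	val = renderDashboard(orgID) // expensive path, taken only on a miss
	if err := rdb.Set(ctx, key, val, 5*time.Minute).Err(); err != nil {
		return "", err
	}
	return val, nil
}

// renderDashboard is a stand-in for the expensive computation being cached.
func renderDashboard(orgID string) string { return "dashboard for " + orgID }

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	fmt.Println(cachedDashboard(context.Background(), rdb, "org-123"))
}
```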
Application Layer
- Goroutine limits configured
- Memory limits enforced
- Profiling endpoints enabled (see the pprof snippet below)
- Structured logging (not excessive)
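A sketch of exposing the pprof endpoints on an internal-only port via net/http/pprof, matching the :6060 address used earlier in this runbook:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve pprof on a separate, internal-only listener so profiles are never
	// reachable through the public API port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... main API server would start here ...
	select {}
}
```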
Monitoring Queries
Grafana Dashboard Queries
# Request rate by endpoint
sum(rate(aegis_http_requests_total[5m])) by (path)
# Latency percentiles by endpoint
histogram_quantile(0.5, sum(rate(aegis_http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.95, sum(rate(aegis_http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.99, sum(rate(aegis_http_request_duration_seconds_bucket[5m])) by (le, path))
# Error rate
sum(rate(aegis_http_requests_total{status=~"5.."}[5m])) / sum(rate(aegis_http_requests_total[5m]))
# Memory usage (working set, as a fraction of the limit)
container_memory_working_set_bytes{container="aegis-api"} / container_spec_memory_limit_bytes{container="aegis-api"}
# CPU usage as a fraction of the CPU limit
rate(container_cpu_usage_seconds_total{container="aegis-api"}[5m]) / (container_spec_cpu_quota{container="aegis-api"} / container_spec_cpu_period{container="aegis-api"})
Escalation
| Condition | Action |
|---|---|
| P99 > 2s for >5 min | Page on-call |
| CPU/Memory >90% | Scale and page on-call |
| Error rate >5% | Page on-call immediately |
| DB query >30s | Kill query, investigate |