Runbook: FinOps Budget Alerts

Overview

This runbook covers operating Cloud Aegis's FinOps budget alerting system, including:

  • Budget configuration and threshold management
  • Slack and PagerDuty alert channel setup
  • Responding to budget threshold breaches
  • Cost anomaly investigation
  • Chargeback report generation

Prerequisites

  • Cloud Aegis API token with finops:manage scope (the examples below read it from $API_TOKEN)
  • Slack workspace with incoming webhook configured
  • PagerDuty service with Events API v2 integration key
  • Access to cloud billing consoles (AWS Cost Explorer, Azure Cost Management, GCP Billing)

Architecture

The FinOps alerting system consists of:

Component            | Package                               | Purpose
BudgetMonitor        | internal/finops/alerting/budget.go    | Monitors budgets against thresholds
SlackNotifier        | internal/finops/alerting/slack.go     | Sends Block Kit alerts to Slack
PagerDutyNotifier    | internal/finops/alerting/pagerduty.go | Sends Events API v2 alerts
MultiCloudAggregator | internal/finops/aggregator/           | Fetches cost data from AWS/Azure/GCP
AnomalyDetector      | internal/finops/anomaly/              | ML-based spend anomaly detection

Configuring Budget Alerts

Create a Budget

curl -s -X POST https://api.aegis.io/api/v1/budgets \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Engineering Team Q1",
    "amount": 50000,
    "currency": "USD",
    "period": "monthly",
    "owner": "[email protected]",
    "tags": {"team": "engineering", "env": "production"},
    "thresholds": [
      {"percent": 50, "channel": "slack"},
      {"percent": 80, "channel": "slack"},
      {"percent": 90, "channel": "pagerduty"},
      {"percent": 100, "channel": "pagerduty"}
    ]
  }' | jq '{budget_id, name, amount}'
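
To make the threshold semantics concrete, the sketch below shows one way the thresholds in the request above could be evaluated against month-to-date spend. It is an illustration only: crossed_thresholds is a hypothetical helper, not part of Cloud Aegis, and the actual BudgetMonitor logic may differ (for example in rounding, or in how repeat alerts are suppressed).

```shell
# Hypothetical helper: list the alert thresholds a budget has crossed.
# Usage: crossed_thresholds <spend> <budget_amount> [threshold_percent ...]
# Amounts are whole currency units; utilization is truncated to an integer percent.
# The body runs in a subshell so its variables do not leak into the caller.
crossed_thresholds() (
  spend=$1
  amount=$2
  shift 2
  # Default to the thresholds used in the example budget.
  [ $# -eq 0 ] && set -- 50 80 90 100
  pct=$(( spend * 100 / amount ))
  crossed=""
  for t in "$@"; do
    if [ "$pct" -ge "$t" ]; then
      crossed="${crossed:+$crossed }$t"
    fi
  done
  echo "$crossed"
)
```

For example, crossed_thresholds 41000 50000 reports 50 and 80, which under the example configuration would mean two Slack alerts and no PagerDuty page yet.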

Configure Slack Channel

# Set Slack webhook URL
kubectl edit configmap aegis-config -n aegis
# Update:
# finops:
#   slack:
#     webhook_url: "https://hooks.slack.com/services/T00/B00/xxx"
#     channel: "#finops-alerts"
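
After changing the webhook URL, it can help to confirm delivery before relying on budget alerts. The following is a minimal sketch that posts a plain-text message through a Slack incoming webhook; it does not reproduce the Block Kit layout that SlackNotifier sends. SLACK_WEBHOOK_URL is an assumed environment variable holding the same URL as finops.slack.webhook_url, and both function names are hypothetical.

```shell
# Build a minimal Slack incoming-webhook payload: {"text": "..."}.
# Escapes backslashes and double quotes; multi-line text is not handled.
slack_payload() (
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text": "%s"}' "$escaped"
)

# Post a test message to the webhook in $SLACK_WEBHOOK_URL (assumed variable).
# -f makes curl exit non-zero if Slack rejects the request.
send_slack_test() {
  curl -sf -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "$(slack_payload "$1")"
}

# Usage:
#   SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T00/B00/xxx" \
#     send_slack_test "FinOps alert channel test"
```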

Configure PagerDuty

# Set PagerDuty integration key
kubectl edit configmap aegis-config -n aegis
# Update:
# finops:
#   pagerduty:
#     routing_key: "abc123def456"
#     severity: "warning" # or "critical" for the 100% threshold
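
To verify a new routing key, a test event can be sent directly to the PagerDuty Events API v2 endpoint. This is a sketch under assumptions: PD_ROUTING_KEY is an assumed environment variable, the source value "cloud-aegis-finops" and the function names are hypothetical, and the warning/critical split assumes the 100% threshold is the one treated as critical. PagerDutyNotifier may build its events differently.

```shell
# Map a threshold percent to a PagerDuty severity (assumed mapping).
severity_for_percent() {
  if [ "$1" -ge 100 ]; then echo critical; else echo warning; fi
}

# Build an Events API v2 "trigger" event body.
# Usage: pd_event <routing_key> <severity> <summary>
# Arguments are assumed to contain no double quotes or backslashes.
pd_event() {
  printf '{"routing_key": "%s", "event_action": "trigger", "payload": {"summary": "%s", "source": "cloud-aegis-finops", "severity": "%s"}}' \
    "$1" "$3" "$2"
}

# Send a test event using the key in $PD_ROUTING_KEY (assumed variable).
# Usage: send_pd_test <threshold_percent> <summary>
send_pd_test() {
  curl -sf -X POST "https://events.pagerduty.com/v2/enqueue" \
    -H "Content-Type: application/json" \
    -d "$(pd_event "$PD_ROUTING_KEY" "$(severity_for_percent "$1")" "$2")"
}
```

Note that a trigger event opens a real incident on the service, so resolve it afterwards or use a non-production service for the test.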

Responding to Budget Alerts

50% Threshold (Informational)

No action required. This is an awareness notification.

80% Threshold (Warning)

  1. Review current spend by service:
curl -sf "https://api.aegis.io/api/v1/costs/summary?period=mtd&group_by=service" \
-H "Authorization: Bearer $API_TOKEN" | jq '.services[] | {service, cost, percent_of_budget}'
  2. Identify top cost drivers:
curl -sf "https://api.aegis.io/api/v1/costs/summary?period=mtd&group_by=resource" \
-H "Authorization: Bearer $API_TOKEN" | jq '.resources | sort_by(-.cost) | .[0:10]'
  3. Check for cost anomalies:
curl -sf "https://api.aegis.io/api/v1/costs/anomalies?period=7d" \
-H "Authorization: Bearer $API_TOKEN" | jq '.anomalies[] | {service, expected, actual, deviation_pct}'
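
When deciding how urgent an 80% warning is, a straight-line projection of month-end spend is often enough. The sketch below is not a Cloud Aegis feature: project_month_end is a hypothetical helper that assumes spend continues at the month-to-date daily average.

```shell
# Hypothetical helper: linear projection of month-end spend.
# Usage: project_month_end <mtd_spend> <day_of_month> <days_in_month>
project_month_end() {
  awk -v s="$1" -v d="$2" -v n="$3" 'BEGIN { printf "%.2f\n", s / d * n }'
}

# Usage: $40,000 spent by day 20 of a 30-day month
#   project_month_end 40000 20 30
```

Against the $50,000 example budget, that projection lands at 120% of budget, which would justify starting the 90% steps early.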

90% Threshold (Action Required)

  1. Identify and stop non-essential resources:
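# Find candidate services by reviewing daily cost per service over the last 7 days
# AWS (date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d)
aws ce get-cost-and-usage \
--time-period Start=$(date -v-7d +%Y-%m-%d),End=$(date +%Y-%m-%d) \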
--granularity DAILY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE \
--filter '{"Dimensions": {"Key": "RECORD_TYPE", "Values": ["Usage"]}}'
  2. Review and act on rightsizing recommendations:
curl -sf "https://api.aegis.io/api/v1/costs/recommendations" \
-H "Authorization: Bearer $API_TOKEN" | jq '.recommendations[] | {resource, current_cost, recommended_cost, savings}'
  3. Notify the budget owner via the original Slack thread.
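
The AWS command in this section computes its start date with date -v-7d, which only works with BSD/macOS date. If the runbook is executed from both macOS laptops and Linux hosts, a small wrapper along these lines avoids the difference. days_ago is a hypothetical helper and assumes either GNU coreutils or BSD date is installed.

```shell
# Print the date N days ago as YYYY-MM-DD on GNU or BSD/macOS date.
# GNU date accepts --version; BSD date rejects it, which is used to tell them apart.
days_ago() {
  if date --version >/dev/null 2>&1; then
    date -d "$1 days ago" +%Y-%m-%d   # GNU coreutils
  else
    date -v-"$1"d +%Y-%m-%d           # BSD/macOS
  fi
}

# Usage:
#   aws ce get-cost-and-usage \
#     --time-period Start=$(days_ago 7),End=$(days_ago 0) ...
```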

100% Threshold (Critical)

  1. Confirm that on-call has been paged (PagerDuty triggers automatically at this threshold).
  2. Freeze non-critical provisioning if possible.
  3. Escalate to budget owner and their management chain.
  4. Schedule a cost review meeting within 24 hours.

Cost Anomaly Investigation

When an anomaly is detected:

# 1. Get anomaly details
curl -sf "https://api.aegis.io/api/v1/costs/anomalies/anom-001" \
-H "Authorization: Bearer $API_TOKEN" | jq .

# 2. Check if it correlates with a deployment
kubectl rollout history deployment/aegis-api -n aegis

# 3. Check if it correlates with a traffic spike
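# -G with --data-urlencode keeps curl from treating the [..] in the query as a URL glob
curl -sG 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(aegis_http_requests_total[1h]))[7d:1h]'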

# 4. Check cloud-native anomaly detection
# AWS (date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d)
aws ce get-anomalies \
--date-interval StartDate=$(date -v-7d +%Y-%m-%d),EndDate=$(date +%Y-%m-%d) \
--monitor-arn arn:aws:ce::123456789012:anomalymonitor/abc-123
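
When comparing an Aegis anomaly with what the cloud provider reports, it can be useful to recompute the deviation by hand. The sketch below assumes deviation_pct is the relative difference between actual and expected spend; that definition is an assumption, not something confirmed by the AnomalyDetector source, and the helper name is hypothetical.

```shell
# Hypothetical helper: percent deviation of actual spend from expected spend.
# Usage: deviation_pct <expected> <actual>
# Prints "n/a" when expected is zero, since the ratio is undefined.
deviation_pct() {
  awk -v e="$1" -v a="$2" 'BEGIN {
    if (e == 0) { print "n/a"; exit }
    printf "%.1f\n", (a - e) / e * 100
  }'
}

# Usage: expected $1,200/day, actual $1,800/day
#   deviation_pct 1200 1800
```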

Generating Chargeback Reports

# Monthly chargeback report (CSV)
curl -sf "https://api.aegis.io/api/v1/costs/chargeback?period=2026-02&format=csv" \
-H "Authorization: Bearer $API_TOKEN" \
-o chargeback-2026-02.csv

# JSON format for programmatic consumption
curl -sf "https://api.aegis.io/api/v1/costs/chargeback?period=2026-02&format=json" \
-H "Authorization: Bearer $API_TOKEN" | jq '.teams[] | {team, total_cost, services}'
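
For a quick summary of the JSON report, each team's share of total spend can be derived from the team and total_cost fields selected above. The following is a sketch: team_shares is a hypothetical helper that reads tab-separated "team, total_cost" lines, so it is independent of any other fields the report may contain.

```shell
# Hypothetical helper: read "team<TAB>total_cost" lines on stdin and print
# team, cost, and percent share of the overall total, tab-separated.
team_shares() {
  awk -F '\t' '
    { team[NR] = $1; cost[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s\t%.2f\t%.1f%%\n", team[i], cost[i],
               (total > 0 ? cost[i] / total * 100 : 0)
    }'
}

# Usage:
#   curl -sf "https://api.aegis.io/api/v1/costs/chargeback?period=2026-02&format=json" \
#     -H "Authorization: Bearer $API_TOKEN" \
#     | jq -r '.teams[] | [.team, .total_cost] | @tsv' | team_shares
```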

Monitoring

Prometheus Metrics

# Budget utilization
max by (budget_name) (aegis_finops_budget_utilization_percent)

# Alert send success/failure rate
rate(aegis_finops_alerts_total{status="sent"}[1h])
rate(aegis_finops_alerts_total{status="failed"}[1h])

# Cost data age in seconds (should be < 86400, i.e. under 24h)
time() - aegis_finops_last_sync_timestamp
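
The freshness rule can also be checked from a shell, for instance in a cron job or a pre-report sanity check. This sketch assumes the last sync time is available as a Unix epoch value (the value of aegis_finops_last_sync_timestamp); sync_is_stale is a hypothetical helper, and the 24-hour limit comes from the comment above.

```shell
# Hypothetical helper: succeed (exit 0) if cost data is older than the limit.
# Usage: sync_is_stale <last_sync_epoch> [now_epoch] [max_age_seconds]
# now_epoch defaults to the current time, max_age_seconds to 86400 (24h).
sync_is_stale() (
  last=$1
  now=${2:-$(date +%s)}
  max=${3:-86400}
  [ $(( now - last )) -gt "$max" ]
)

# Usage:
#   if sync_is_stale "$LAST_SYNC"; then echo "cost data is stale"; fi
```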

Escalation

Condition                     | Action
Budget alert not delivered    | Check Slack webhook / PagerDuty key, verify network
Cost data stale (>24h)        | Check CSP API credentials, verify aggregator pod
Anomaly false positive        | Tune detection thresholds in config
Budget exceeded with no alert | Check BudgetMonitor logs, verify threshold config

Contact Information

  • On-Call: PagerDuty
  • FinOps Team: #finops (Slack)
  • Platform Team: #platform-support (Slack)