CSPM Aggregator - High-Level Design (HLD)
Document Control
| Version | Date | Author | Role | Contact |
|---|---|---|---|---|
| 1.0 | 8 January 2026 | Liem Vo-Nguyen | Security Architect | [email protected] |
Executive Summary
CSPM Aggregator is a multi-cloud security posture management platform that aggregates findings from AWS Security Hub, Microsoft Defender for Cloud (Azure), and GCP Security Command Center. It applies AI-powered contextual risk scoring and remediation complexity analysis to turn raw security findings into prioritized, actionable work items.
Key Differentiators:
- Contextual AI Scoring: Uses Claude API to analyze 30+ context signals (asset tier, network exposure, compensating controls, vulnerability details) and adjust finding severity beyond raw CSPM output
- Remediation Complexity Tiers: Classifies findings into Tier 1 (full automation), Tier 2 (partial automation), Tier 3 (manual) based on 25+ rules plus AI fallback
- Priority Matrix: Maps risk severity and complexity tier to a P1-P5 priority with SLA tracking
- Zero Credential Storage: Cloud access uses OIDC (AWS), Managed Identity (Azure), and Workload Identity Federation (GCP), so no long-lived cloud credentials are stored
- Target Users: Security Operations, Cloud Engineering, Compliance Teams
- Deployment: Kubernetes CronJob on Azure AKS
- Data Store: Azure Blob Storage (state persistence)
- AI: Anthropic Claude API (claude-opus-4-6)
Architecture Overview
High-Level System Diagram
Architecture Principles
| Principle | Implementation |
|---|---|
| Multi-Cloud Parity | Normalized finding schema works across AWS, Azure, GCP |
| Zero Trust | OIDC/Managed Identity/WIF - no static credentials |
| Read-Only Access | SecurityAudit/Reader roles only |
| AI-Augmented | LLM scoring with guardrails and fallbacks |
| Stateless Processing | State externalized to Azure Blob Storage |
Component Descriptions
Cloud Provider Clients
| Provider | Service | Authentication | Data Source |
|---|---|---|---|
| AWS | Security Hub | OIDC Federation | Active findings (NEW, NOTIFIED) |
| Azure | Defender for Cloud | Managed Identity | Unhealthy assessments |
| GCP | Security Command Center | Workload Identity Federation | Active findings |
Core Processing Pipeline
[Cloud Providers] -> [Normalizer] -> [Delta Detection] -> [AI Scoring] -> [Priority Matrix] -> [Output]
        |                  |                 |                  |                 |                |
  Raw findings       Common schema     NEW/EXISTING/    Risk + Complexity      P1-P5 +       Asana/Email/
                                      CLOSED/REOPENED      assessment            SLA            Reports
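To make the Normalizer stage more concrete, here is a minimal sketch of what a common schema and a severity mapping could look like. The `Finding` fields and the `normalizeSeverity` helper are illustrative assumptions, not the project's actual types; only the five severity levels and the four delta statuses come from this document.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical normalized record; the field names are assumptions.
type Finding struct {
	ID          string // provider-native finding ID
	Provider    string // "aws", "azure", or "gcp"
	ResourceID  string
	Title       string
	Severity    string // CRITICAL, HIGH, MEDIUM, LOW, or INFO
	DeltaStatus string // NEW, EXISTING, CLOSED, or REOPENED (set by delta detection)
}

// normalizeSeverity maps a provider severity label onto the five levels used
// by the priority matrix. It reports false for labels it does not recognize
// so the caller can decide how to handle them.
func normalizeSeverity(raw string) (string, bool) {
	switch s := strings.ToUpper(strings.TrimSpace(raw)); s {
	case "CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO":
		return s, true
	case "INFORMATIONAL":
		return "INFO", true
	default:
		return "", false
	}
}

func main() {
	sev, ok := normalizeSeverity("Informational")
	f := Finding{ID: "example-1", Provider: "aws", Severity: sev}
	fmt.Println(f.Provider, f.Severity, ok)
}
```

Returning a boolean instead of silently defaulting keeps the decision about unknown labels with the caller, which seems safer for a security tool than guessing a level.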
AI Scoring Layer
| Component | Purpose | Lines of Code |
|---|---|---|
| Risk Scorer | LLM-based contextual severity adjustment | 589 |
| Complexity Normalizer | Remediation tier classification (25+ rules + AI) | 905 |
| Priority Calculator | Risk + Complexity -> P1-P5 with escalations | 638 |
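The "rules + AI" approach of the Complexity Normalizer can be sketched as a rule list that is tried first, with the LLM consulted only when no rule matches. Everything below is an assumption made for illustration: the `Rule` type, the `classify` function, and the two example rules are hypothetical, and the fallback is a stub standing in for the real AI call.

```go
package main

import (
	"fmt"
	"strings"
)

// Tier is the remediation complexity tier from the design:
// 1 = full automation, 2 = partial automation, 3 = manual.
type Tier int

// Rule is a hypothetical deterministic rule: when Match returns true for a
// finding title, the finding is assigned Tier.
type Rule struct {
	Name  string
	Match func(title string) bool
	Tier  Tier
}

// classify tries each rule in order and calls the fallback (standing in for
// the LLM) only when no rule matches. The bool reports whether a rule decided.
func classify(title string, rules []Rule, fallback func(string) Tier) (Tier, bool) {
	for _, r := range rules {
		if r.Match(title) {
			return r.Tier, true
		}
	}
	return fallback(title), false
}

func main() {
	// These rules are invented for the example and are not the real rule set.
	rules := []Rule{
		{Name: "public-access", Tier: 1, Match: func(t string) bool {
			return strings.Contains(strings.ToLower(t), "public access")
		}},
		{Name: "mfa", Tier: 3, Match: func(t string) bool {
			return strings.Contains(strings.ToLower(t), "mfa")
		}},
	}
	aiFallback := func(string) Tier { return 3 } // stub: treat unknown findings as manual
	tier, byRule := classify("S3 bucket allows public access", rules, aiFallback)
	fmt.Println(tier, byRule)
}
```

Trying rules first keeps the common cases deterministic and free of API cost, which matches the rationale recorded for the complexity strategy.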
Integration Services
| Service | Purpose | Authentication |
|---|---|---|
| Asana | Task creation and sync | Personal Access Token |
| Microsoft Graph | Email distribution | OAuth 2.0 Client Credentials |
| Azure Blob Storage | State persistence | Managed Identity |
Data Flow
Finding Ingestion Flow
Data Transformation Stages
| Stage | Input | Output | Latency |
|---|---|---|---|
| 1. Query | Cloud API credentials | Raw findings | 30-60s per provider |
| 2. Normalize | Raw findings | Common schema | <1s |
| 3. Enrich | Common schema | + Org metadata (CBU, Tier, Owner) | <1s |
| 4. Delta | Current + Previous state | + DeltaStatus field | <5s |
| 5. Risk Score | Enriched finding | + AI risk assessment | 2-5s per finding |
| 6. Complexity | Enriched finding | + Complexity tier | 0.5-2s per finding |
| 7. Prioritize | Risk + Complexity | P1-P5 + SLA + Queue | <1s |
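Stage 4 (Delta) compares the current run with the previous state. A minimal sketch of that comparison follows, assuming the state is a map from finding ID to its last known status; that layout and the `detectDelta` name are assumptions, while the four status values come from the pipeline description.

```go
package main

import "fmt"

// Delta statuses named in the pipeline description.
const (
	StatusNew      = "NEW"
	StatusExisting = "EXISTING"
	StatusClosed   = "CLOSED"
	StatusReopened = "REOPENED"
)

// detectDelta compares the previous run's state (finding ID -> last status)
// with the finding IDs seen in the current run.
func detectDelta(previous map[string]string, current []string) map[string]string {
	result := make(map[string]string, len(current))
	for _, id := range current {
		switch prev, seen := previous[id]; {
		case !seen:
			result[id] = StatusNew
		case prev == StatusClosed:
			result[id] = StatusReopened
		default:
			result[id] = StatusExisting
		}
	}
	// Anything tracked before but absent now is closed (or stays closed),
	// so a later reappearance can be reported as REOPENED.
	for id := range previous {
		if _, still := result[id]; !still {
			result[id] = StatusClosed
		}
	}
	return result
}

func main() {
	prev := map[string]string{"a": StatusExisting, "b": StatusClosed, "c": StatusNew}
	fmt.Println(detectDelta(prev, []string{"a", "b", "d"}))
}
```

In this example, `a` remains EXISTING, `b` becomes REOPENED, `c` becomes CLOSED, and `d` is NEW.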
Priority Matrix
| Risk Severity | Tier 1 (Auto) | Tier 2 (Partial) | Tier 3 (Manual) |
|---|---|---|---|
| CRITICAL | P1 | P1 | P2 |
| HIGH | P1 | P2 | P3 |
| MEDIUM | P3 | P4 | P4 |
| LOW | P4 | P5 | P5 |
| INFO | P5 | P5 | P5 |
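The matrix translates directly into a lookup table. The sketch below encodes the cells exactly as listed above; the `priorityMatrix` and `priorityFor` names are assumptions, and escalation and SLA logic are left out.

```go
package main

import "fmt"

// priorityMatrix mirrors the Priority Matrix table: severity -> priority for
// Tier 1, Tier 2, and Tier 3 (array index 0 to 2).
var priorityMatrix = map[string][3]string{
	"CRITICAL": {"P1", "P1", "P2"},
	"HIGH":     {"P1", "P2", "P3"},
	"MEDIUM":   {"P3", "P4", "P4"},
	"LOW":      {"P4", "P5", "P5"},
	"INFO":     {"P5", "P5", "P5"},
}

// priorityFor returns the matrix cell for a severity and a tier (1 to 3).
func priorityFor(severity string, tier int) (string, error) {
	row, ok := priorityMatrix[severity]
	if !ok {
		return "", fmt.Errorf("unknown severity %q", severity)
	}
	if tier < 1 || tier > 3 {
		return "", fmt.Errorf("tier must be 1-3, got %d", tier)
	}
	return row[tier-1], nil
}

func main() {
	p, err := priorityFor("HIGH", 2)
	fmt.Println(p, err) // P2 <nil>
}
```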
Security Architecture
Authentication Model
| Cloud | Method | Credential Storage | Rotation |
|---|---|---|---|
| AWS | OIDC Federation | None (STS) | Automatic |
| Azure | Managed Identity | None (MSI) | Automatic |
| GCP | Workload Identity Federation | None (WIF) | Automatic |
| Anthropic | API Key | Azure Key Vault | Manual (annual) |
Data Protection
| Data Type | Classification | Protection |
|---|---|---|
| Finding metadata | Internal | Encrypted at rest (AES-256) |
| Resource IDs | Internal | No PII extracted |
| LLM prompts | Internal | Sanitized (no PII/secrets) |
| State files | Internal | Blob encryption + RBAC |
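As an illustration of the "Sanitized (no PII/secrets)" control for LLM prompts, the sketch below redacts two common patterns before text is placed in a prompt. The pattern list is an assumption and deliberately small (AWS access key IDs and email addresses); the document does not specify which patterns the real sanitizer covers.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; a production sanitizer would need a broader set.
var (
	awsKeyID = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)
	email    = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
)

// sanitizePrompt replaces matches with fixed placeholders.
func sanitizePrompt(s string) string {
	s = awsKeyID.ReplaceAllString(s, "[REDACTED_AWS_KEY]")
	s = email.ReplaceAllString(s, "[REDACTED_EMAIL]")
	return s
}

func main() {
	fmt.Println(sanitizePrompt("owner alice@example.com used key AKIAIOSFODNN7EXAMPLE"))
}
```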
Container Security
| Control | Implementation |
|---|---|
| Non-root user | UID 1000 |
| Read-only filesystem | readOnlyRootFilesystem: true |
| No privilege escalation | allowPrivilegeEscalation: false |
| Dropped capabilities | drop: [ALL] |
| Resource limits | CPU: 1000m, Memory: 1Gi |
Network Security
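# NetworkPolicy - egress only (assumed reading: DNS to any resolver, HTTPS to non-private ranges)
egress:
  - ports: [{protocol: UDP, port: 53}, {protocol: TCP, port: 53}]
  - to:
      - ipBlock: {cidr: 0.0.0.0/0, except: [10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16]}
    ports: [{protocol: TCP, port: 443}]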
Integration Points
Upstream (Data Sources)
| Source | Protocol | Rate Limit | Pagination |
|---|---|---|---|
| AWS Security Hub | HTTPS/REST | 10 TPS | Yes (paginator) |
| Azure Resource Graph | HTTPS/REST | 15 req/5s | Yes (skipToken) |
| GCP Security Command Center | HTTPS/gRPC | 1000 req/min | Yes (iterator) |
Downstream (Outputs)
| Destination | Protocol | Purpose |
|---|---|---|
| Asana | HTTPS/REST | Task creation and tracking |
| Microsoft Graph | HTTPS/REST | Email notifications |
| Azure Blob Storage | HTTPS/REST | State persistence |
| Local filesystem | File I/O | HTML/CSV reports |
AI Integration
| Provider | Model | Temperature | Max Tokens | Use Case |
|---|---|---|---|---|
| Anthropic | claude-opus-4-6 | 0.1 | 1024 | Risk scoring |
| Anthropic | claude-opus-4-6 | 0.1 | 512 | Complexity fallback |
Build vs Buy Analysis
| Component | Decision | Rationale |
|---|---|---|
| Cloud Provider Clients | Build | Custom filtering, normalized output |
| Finding Normalizer | Build | Cross-cloud schema alignment |
| Risk Scoring | Build + Buy (LLM) | Contextual analysis requires AI |
| Complexity Rules | Build | Domain-specific remediation knowledge |
| Priority Matrix | Build | Custom escalation logic |
| Asana Integration | Build | Simple REST API |
| Email (Graph) | Build | Simple REST API |
| State Storage | Buy (Azure Blob) | Managed, encrypted, cheap |
| Secrets | Buy (Key Vault) | Managed rotation, audit |
| Container Orchestration | Buy (AKS) | Managed Kubernetes |
Alternative Solutions Considered
| Alternative | Pros | Cons | Decision |
|---|---|---|---|
| Wiz/Orca | Comprehensive CSPM | $100K+/year, no AI scoring customization | Pass |
| Prisma Cloud | Multi-cloud native | Complex licensing, limited priority logic | Pass |
| Custom LLM (self-hosted) | No API costs | Ops overhead, model quality | Pass |
| ServiceNow integration | Enterprise ITSM | Requires license, complex setup | Future |
Cost Analysis
Infrastructure Costs (Monthly)
| Component | Usage | Cost | Notes |
|---|---|---|---|
| AKS Node | 1 node (B2s) | $30 | Shared cluster |
| Azure Blob Storage | <1 GB | $0.02 | State files only |
| Key Vault | 10 secrets | $0.03 | Per-operation pricing |
| Container Registry | 1 image | $5 | Basic tier |
| Egress | <1 GB | $0.09 | API responses |
| Subtotal (Infra) | | ~$35 | |
AI Costs (Monthly)
| Operation | Volume | Cost/Unit | Cost |
|---|---|---|---|
| Risk scoring | 500 findings | $0.015/call | $7.50 |
| Complexity fallback | 50 findings | $0.008/call | $0.40 |
| Subtotal (AI) | | | ~$8 |
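The AI subtotal follows from the per-call rates in the table. The sketch below reproduces that arithmetic using integer tenths of a cent to avoid floating-point rounding; the constant and function names are assumptions.

```go
package main

import "fmt"

// Per-call costs from the AI Costs table, in tenths of a cent (mills).
const (
	riskScoreMills  = 15 // $0.015 per risk-scoring call
	complexityMills = 8  // $0.008 per complexity-fallback call
)

// monthlyAICostMills returns the estimated monthly AI spend in mills.
func monthlyAICostMills(riskCalls, fallbackCalls int) int {
	return riskCalls*riskScoreMills + fallbackCalls*complexityMills
}

func main() {
	mills := monthlyAICostMills(500, 50)
	fmt.Printf("$%d.%03d\n", mills/1000, mills%1000) // $7.900
}
```

With 500 risk-scoring calls and 50 fallback calls the total is $7.90, which the table rounds to about $8.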
Total Monthly Cost
| Scenario | Findings | Infrastructure | AI | Total |
|---|---|---|---|---|
| Low | 200 | $35 | $3 | $38 |
| Medium | 500 | $35 | $8 | $43 |
| High | 2,000 | $35 | $32 | $67 |
Deployment Model
Kubernetes Architecture
AKS Cluster
├── Namespace: cspm-aggregator
│   ├── CronJob (monthly, 1st @ 8:00 AM PT)
│   │   ├── Pod (1 replica)
│   │   │   ├── Container: aggregator
│   │   │   ├── ServiceAccount + Workload Identity
│   │   │   └── Volume mounts (config, reports, tmp)
│   │   └── Secrets (from Key Vault via ESO)
│   ├── ConfigMap (config.yaml)
│   └── NetworkPolicy (egress-only)
└── Shared Infrastructure
    ├── Azure Key Vault
    └── Azure Blob Storage
Deployment Lifecycle
| Event | Trigger | Action |
|---|---|---|
| Scheduled run | CronJob (1st of month, 08:00 PT) | Full pipeline execution |
| Manual run | kubectl create job <job-name> --from=cronjob/cspm-aggregator -n cspm-aggregator | Ad-hoc execution |
| Image update | Git push to main | CI/CD -> Container Registry -> updated CronJob image (kubectl apply) |
Resource Configuration
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
DR/BC Architecture
Recovery Objectives
| Metric | Target | Rationale |
|---|---|---|
| RTO | 4 hours | Monthly batch job, not real-time |
| RPO | 30 days | State snapshot per run |
| MTTR | 1 hour | Redeploy from CI/CD |
Failure Scenarios
| Scenario | Impact | Recovery |
|---|---|---|
| AKS cluster failure | Job doesn't run | Manual trigger after cluster recovery |
| Cloud provider API outage | Partial data | Retry logic, skip unavailable provider |
| AI API outage | No scoring | Fallback to rule-based (conservative) |
| State corruption | Incorrect deltas | Re-run with --reset-state flag |
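The AI outage scenario can be sketched as a wrapper that falls back when the scorer returns an error. The interface and function names are hypothetical, and "conservative" is interpreted here as keeping the raw CSPM severity so that a finding is never downgraded without an AI assessment; the document does not define the actual fallback rules.

```go
package main

import (
	"errors"
	"fmt"
)

// RiskScorer is a hypothetical interface for anything that can adjust severity.
type RiskScorer interface {
	Score(rawSeverity string) (string, error)
}

// failingScorer simulates an AI API outage.
type failingScorer struct{}

func (failingScorer) Score(string) (string, error) {
	return "", errors.New("ai api unavailable")
}

// fixedScorer simulates a healthy scorer that always returns one severity.
type fixedScorer string

func (f fixedScorer) Score(string) (string, error) { return string(f), nil }

// scoreWithFallback asks the AI scorer first. On error it keeps the raw
// severity. The second return value names which path produced the result.
func scoreWithFallback(ai RiskScorer, rawSeverity string) (string, string) {
	adjusted, err := ai.Score(rawSeverity)
	if err != nil {
		return rawSeverity, "rule-based"
	}
	return adjusted, "ai"
}

func main() {
	sev, source := scoreWithFallback(failingScorer{}, "HIGH")
	fmt.Println(sev, source) // HIGH rule-based
}
```

Recording the source alongside the severity makes it possible to flag, in reports, which findings were scored without AI context during an outage.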
Backup Strategy
| Data | Backup Location | Retention | Frequency |
|---|---|---|---|
| State files | Azure Blob (GRS) | 90 days | Per run |
| Reports | Azure Blob (GRS) | 1 year | Per run |
| Config | Git repository | Indefinite | Per change |
| Secrets | Key Vault (soft delete) | 90 days | Manual |
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Language | Go 1.23+ | Performance, single binary, strong typing |
| AI Provider | Anthropic Claude | Strong reasoning, structured JSON output |
| Cloud SDKs | AWS SDK v2, Azure SDK, GCP SDK | Official, well-maintained |
| Logging | go.uber.org/zap | Structured JSON logging |
| Configuration | gopkg.in/yaml.v3 | Human-readable, env var support |
| Container | Docker (Alpine 3.19) | Minimal attack surface |
| Orchestration | Kubernetes (AKS) | Managed, Workload Identity native |
| Secrets | Azure Key Vault | Managed rotation, audit logging |
| State | Azure Blob Storage | Cheap, durable, encrypted |
Architecture Decision Records
| ADR | Title | Decision | Rationale |
|---|---|---|---|
| ADR-001 | Language Selection | Go | Performance, single binary deployment |
| ADR-002 | AI Provider | Anthropic Claude | Superior reasoning, JSON output |
| ADR-003 | Authentication | OIDC/MSI/WIF | Zero credential storage |
| ADR-004 | Deployment Model | Kubernetes CronJob | Scheduled batch, resource efficient |
| ADR-005 | State Storage | Azure Blob | Durable, encrypted, cheap |
| ADR-006 | Complexity Strategy | Rules + AI fallback | Deterministic where possible |
| ADR-007 | Priority Matrix | Risk x Complexity | Actionable prioritization |
| ADR-008 | Container Security | Non-root, read-only | Defense in depth |
| ADR-009 | Network Policy | Egress-only HTTPS | Minimize attack surface |
Full ADR details are in the docs/core/architecture/adr/ directory.
Future Roadmap
| Phase | Feature | Priority |
|---|---|---|
| v1.1 | HTTP API for on-demand queries | High |
| v1.2 | Auto-remediation for Tier 1 findings | High |
| v1.3 | ServiceNow integration | Medium |
| v2.0 | Real-time streaming (event-driven) | Low |
| v2.1 | Custom rule engine (user-defined) | Low |