CSPM Aggregator - High-Level Design (HLD)


Document Control

| Version | Date | Author | Role | Contact |
|---|---|---|---|---|
| 1.0 | 8 January 2026 | Liem Vo-Nguyen | Security Architect | [email protected] |

Table of Contents

  • Executive Summary
  • Architecture Overview
  • Component Descriptions
  • Data Flow
  • Security Architecture
  • Integration Points
  • Build vs Buy Analysis
  • Cost Analysis
  • Deployment Model
  • DR/BC Architecture
  • Technology Stack
  • Architecture Decision Records

Executive Summary

CSPM Aggregator is a multi-cloud security posture management platform that aggregates findings from AWS Security Hub, Microsoft Defender for Cloud, and GCP Security Command Center. It applies AI-powered contextual risk scoring and remediation complexity analysis to turn raw security findings into actionable, prioritized work items.

Key Differentiators:

  • Contextual AI Scoring: Uses Claude API to analyze 30+ context signals (asset tier, network exposure, compensating controls, vulnerability details) and adjust finding severity beyond raw CSPM output
  • Remediation Complexity Tiers: Classifies findings into Tier 1 (full automation), Tier 2 (partial automation), Tier 3 (manual) based on 25+ rules plus AI fallback
  • Priority Matrix: Combines risk severity + complexity tier -> P1-P5 prioritization with SLA tracking
  • Zero Credential Storage: Uses OIDC (AWS), Managed Identity (Azure), Workload Identity Federation (GCP)

Target Users: Security Operations, Cloud Engineering, Compliance Teams
Deployment: Kubernetes CronJob on Azure AKS
Data Store: Azure Blob Storage (state persistence)
AI: Anthropic Claude API (claude-opus-4-6)


Architecture Overview

High-Level System Diagram

[Figure: System Architecture]

Architecture Principles

| Principle | Implementation |
|---|---|
| Multi-Cloud Parity | Normalized finding schema works across AWS, Azure, GCP |
| Zero Trust | OIDC/Managed Identity/WIF - no static credentials |
| Read-Only Access | SecurityAudit/Reader roles only |
| AI-Augmented | LLM scoring with guardrails and fallbacks |
| Stateless Processing | State externalized to Azure Blob Storage |

Component Descriptions

Cloud Provider Clients

| Provider | Service | Authentication | Data Source |
|---|---|---|---|
| AWS | Security Hub | OIDC Federation | Active findings (NEW, NOTIFIED) |
| Azure | Defender for Cloud | Managed Identity | Unhealthy assessments |
| GCP | Security Command Center | Workload Identity Federation | Active findings |

Core Processing Pipeline

[Cloud Providers] -> [Normalizer] -> [Delta Detection] -> [AI Scoring] -> [Priority Matrix] -> [Output]
        |                 |                  |                 |                  |               |
  Raw findings      Common schema      NEW/EXISTING/   Risk + Complexity       P1-P5 +      Asana/Email/
                                      CLOSED/REOPENED     assessment             SLA           Reports
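
As an illustration of the Normalizer stage, the following is a minimal sketch of a common finding schema and a severity mapping onto the five-level scale used by the priority matrix. The `Finding` struct, its field names, and `NormalizeSeverity` are hypothetical and are not taken from the CSPM Aggregator codebase; mapping unknown or unspecified labels to INFO is also an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical common schema; the field names are illustrative only.
type Finding struct {
	Provider   string // "aws", "azure", or "gcp"
	ResourceID string
	Title      string
	Severity   string // CRITICAL, HIGH, MEDIUM, LOW, or INFO
}

// NormalizeSeverity maps a provider-native severity label onto the shared
// five-level scale. Labels outside CRITICAL/HIGH/MEDIUM/LOW (for example
// "INFORMATIONAL" or "SEVERITY_UNSPECIFIED") fall back to INFO, which is an
// assumption of this sketch.
func NormalizeSeverity(raw string) string {
	switch strings.ToUpper(strings.TrimSpace(raw)) {
	case "CRITICAL":
		return "CRITICAL"
	case "HIGH":
		return "HIGH"
	case "MEDIUM":
		return "MEDIUM"
	case "LOW":
		return "LOW"
	default:
		return "INFO"
	}
}

func main() {
	f := Finding{
		Provider:   "azure",
		ResourceID: "vm-1",
		Title:      "Disk not encrypted",
		Severity:   NormalizeSeverity("High"),
	}
	fmt.Printf("%+v\n", f)
}
```

Normalizing case and whitespace up front keeps the mapping independent of how each provider capitalizes its labels.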

AI Scoring Layer

| Component | Purpose | Lines of Code |
|---|---|---|
| Risk Scorer | LLM-based contextual severity adjustment | 589 |
| Complexity Normalizer | Remediation tier classification (25+ rules + AI) | 905 |
| Priority Calculator | Risk + Complexity -> P1-P5 with escalations | 638 |

Integration Services

| Service | Purpose | Authentication |
|---|---|---|
| Asana | Task creation and sync | Personal Access Token |
| Microsoft Graph | Email distribution | OAuth 2.0 Client Credentials |
| Azure Blob Storage | State persistence | Managed Identity |

Data Flow

Finding Ingestion Flow

Scoring Pipeline

Data Transformation Stages

| Stage | Input | Output | Latency |
|---|---|---|---|
| 1. Query | Cloud API credentials | Raw findings | 30-60s per provider |
| 2. Normalize | Raw findings | Common schema | <1s |
| 3. Enrich | Common schema | + Org metadata (CBU, Tier, Owner) | <1s |
| 4. Delta | Current + Previous state | + DeltaStatus field | <5s |
| 5. Risk Score | Enriched finding | + AI risk assessment | 2-5s per finding |
| 6. Complexity | Enriched finding | + Complexity tier | 0.5-2s per finding |
| 7. Prioritize | Risk + Complexity | P1-P5 + SLA + Queue | <1s |
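
The Delta stage (stage 4) can be sketched as a comparison between the previous run's state and the current set of active finding IDs. The status names NEW, EXISTING, CLOSED, and REOPENED come from the pipeline description; the `Classify` function and the shape of the state map are assumptions made for illustration, not the actual state file format.

```go
package main

import "fmt"

// Classify assigns a delta status to each finding.
// previous maps finding ID -> true if the finding was active in the last run,
// false if it had been closed. current lists the IDs active in this run.
// Both shapes are hypothetical.
func Classify(previous map[string]bool, current []string) map[string]string {
	status := make(map[string]string, len(current))
	seen := make(map[string]bool, len(current))

	for _, id := range current {
		seen[id] = true
		wasActive, known := previous[id]
		switch {
		case !known:
			status[id] = "NEW"
		case wasActive:
			status[id] = "EXISTING"
		default:
			status[id] = "REOPENED"
		}
	}

	// Anything that was active last run but is absent now has been closed.
	for id, wasActive := range previous {
		if wasActive && !seen[id] {
			status[id] = "CLOSED"
		}
	}
	return status
}

func main() {
	previous := map[string]bool{"a": true, "b": true, "c": false}
	current := []string{"a", "c", "d"}
	fmt.Println(Classify(previous, current))
}
```

Findings that were already closed and remain absent are left out of the result, so the output only carries items whose status matters for this run.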

Priority Matrix

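| Severity | Tier 1 (Auto) | Tier 2 (Partial) | Tier 3 (Manual) |
|---|---|---|---|
| CRITICAL | P1 | P1 | P2 |
| HIGH | P1 | P2 | P3 |
| MEDIUM | P3 | P4 | P4 |
| LOW | P4 | P5 | P5 |
| INFO | P5 | P5 | P5 |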
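
The matrix above maps directly onto a lookup table. The sketch below encodes the same cells; the `Priority` function name and its error handling are hypothetical, and escalation rules and SLA tracking are deliberately left out.

```go
package main

import "fmt"

// matrix mirrors the Priority Matrix table: one row per severity,
// one column per complexity tier (Tier 1, Tier 2, Tier 3).
var matrix = map[string][3]string{
	"CRITICAL": {"P1", "P1", "P2"},
	"HIGH":     {"P1", "P2", "P3"},
	"MEDIUM":   {"P3", "P4", "P4"},
	"LOW":      {"P4", "P5", "P5"},
	"INFO":     {"P5", "P5", "P5"},
}

// Priority returns the matrix cell for a severity and a tier (1, 2, or 3).
func Priority(severity string, tier int) (string, error) {
	row, ok := matrix[severity]
	if !ok {
		return "", fmt.Errorf("unknown severity %q", severity)
	}
	if tier < 1 || tier > 3 {
		return "", fmt.Errorf("tier must be 1-3, got %d", tier)
	}
	return row[tier-1], nil
}

func main() {
	p, err := Priority("HIGH", 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p) // prints "P3"
}
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the table and to change without touching control flow.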

Security Architecture

Authentication Model

| Cloud | Method | Credential Storage | Rotation |
|---|---|---|---|
| AWS | OIDC Federation | None (STS) | Automatic |
| Azure | Managed Identity | None (MSI) | Automatic |
| GCP | Workload Identity Federation | None (WIF) | Automatic |
| Anthropic | API Key | Azure Key Vault | Manual (annual) |

Data Protection

| Data Type | Classification | Protection |
|---|---|---|
| Finding metadata | Internal | Encrypted at rest (AES-256) |
| Resource IDs | Internal | No PII extracted |
| LLM prompts | Internal | Sanitized (no PII/secrets) |
| State files | Internal | Blob encryption + RBAC |

Container Security

| Control | Implementation |
|---|---|
| Non-root user | UID 1000 |
| Read-only filesystem | readOnlyRootFilesystem: true |
| No privilege escalation | allowPrivilegeEscalation: false |
| Dropped capabilities | drop: [ALL] |
| Resource limits | CPU: 1000m, Memory: 1Gi |

Network Security

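# NetworkPolicy - egress only (spec fragment)
egress:
- ports: [{protocol: UDP, port: 53}, {protocol: TCP, port: 53}]  # DNS, any destination (assumed, so cluster DNS stays reachable)
- ports: [{protocol: TCP, port: 443}]                            # HTTPS, public addresses only
  to:
  - ipBlock:
      cidr: 0.0.0.0/0
      except: [10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16]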

Integration Points

Upstream (Data Sources)

| Source | Protocol | Rate Limit | Pagination |
|---|---|---|---|
| AWS Security Hub | HTTPS/REST | 10 TPS | Yes (paginator) |
| Azure Resource Graph | HTTPS/REST | 15 req/5s | Yes (skipToken) |
| GCP Security Command Center | HTTPS/gRPC | 1000 req/min | Yes (iterator) |
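
To show how the rate limits and pagination in the table interact, here is a minimal sketch that derives a per-request spacing from a limit and drains a token-paginated source at that pace. `MinInterval`, `FetchPage`, and `CollectAll` are hypothetical names; the real clients would use each SDK's own paginator, skipToken, or iterator rather than this generic loop.

```go
package main

import (
	"fmt"
	"time"
)

// MinInterval returns the spacing between requests that keeps a client at or
// under `requests` calls per `window`. It rounds up so the limit is never
// exceeded. requests must be greater than zero.
func MinInterval(requests int, window time.Duration) time.Duration {
	n := time.Duration(requests)
	return (window + n - 1) / n
}

// FetchPage is a hypothetical page fetcher: it takes a continuation token and
// returns the page's items plus the next token ("" when no pages remain).
type FetchPage func(token string) (items []string, next string, err error)

// CollectAll drains every page, sleeping `interval` between requests.
func CollectAll(fetch FetchPage, interval time.Duration) ([]string, error) {
	var all []string
	token := ""
	for {
		items, next, err := fetch(token)
		if err != nil {
			return all, err
		}
		all = append(all, items...)
		if next == "" {
			return all, nil
		}
		token = next
		time.Sleep(interval)
	}
}

func main() {
	fmt.Println(MinInterval(10, time.Second))   // AWS Security Hub: 10 TPS
	fmt.Println(MinInterval(15, 5*time.Second)) // Azure Resource Graph: 15 req/5s
	fmt.Println(MinInterval(1000, time.Minute)) // GCP SCC: 1000 req/min
}
```

Fixed spacing is the simplest pacing strategy; a production client would more likely combine it with retry and backoff on throttling responses.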

Downstream (Outputs)

| Destination | Protocol | Purpose |
|---|---|---|
| Asana | HTTPS/REST | Task creation and tracking |
| Microsoft Graph | HTTPS/REST | Email notifications |
| Azure Blob Storage | HTTPS/REST | State persistence |
| Local filesystem | File I/O | HTML/CSV reports |

AI Integration

| Provider | Model | Temperature | Max Tokens | Use Case |
|---|---|---|---|---|
| Anthropic | claude-opus-4-6 | 0.1 | 1024 | Risk scoring |
| Anthropic | claude-opus-4-6 | 0.1 | 512 | Complexity fallback |
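
The parameters in the table correspond to fields of an Anthropic Messages API request body (`model`, `max_tokens`, `temperature`, `messages`). The sketch below only builds and serializes that body; it makes no network call. The `NewRequest` helper and the `risk_scoring` / `complexity_fallback` labels are hypothetical names chosen for this example, and the prompt text is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message and Request use the JSON field names of the Messages API request body.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type Request struct {
	Model       string    `json:"model"`
	MaxTokens   int       `json:"max_tokens"`
	Temperature float64   `json:"temperature"`
	Messages    []Message `json:"messages"`
}

// maxTokensByUseCase holds the token budgets from the AI Integration table.
// The keys are hypothetical labels for this sketch.
var maxTokensByUseCase = map[string]int{
	"risk_scoring":        1024,
	"complexity_fallback": 512,
}

// NewRequest builds a request for one of the two use cases in the table.
func NewRequest(useCase, prompt string) (Request, error) {
	maxTokens, ok := maxTokensByUseCase[useCase]
	if !ok {
		return Request{}, fmt.Errorf("unknown use case %q", useCase)
	}
	return Request{
		Model:       "claude-opus-4-6",
		MaxTokens:   maxTokens,
		Temperature: 0.1,
		Messages:    []Message{{Role: "user", Content: prompt}},
	}, nil
}

func main() {
	req, err := NewRequest("risk_scoring", "Assess this finding: <sanitized finding JSON>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	body, err := json.Marshal(req)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

A low temperature such as 0.1 keeps scoring output close to deterministic, which matters when the same finding is re-scored across monthly runs.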

Build vs Buy Analysis

| Component | Decision | Rationale |
|---|---|---|
| Cloud Provider Clients | Build | Custom filtering, normalized output |
| Finding Normalizer | Build | Cross-cloud schema alignment |
| Risk Scoring | Build + Buy (LLM) | Contextual analysis requires AI |
| Complexity Rules | Build | Domain-specific remediation knowledge |
| Priority Matrix | Build | Custom escalation logic |
| Asana Integration | Build | Simple REST API |
| Email (Graph) | Build | Simple REST API |
| State Storage | Buy (Azure Blob) | Managed, encrypted, cheap |
| Secrets | Buy (Key Vault) | Managed rotation, audit |
| Container Orchestration | Buy (AKS) | Managed Kubernetes |

Alternative Solutions Considered

| Alternative | Pros | Cons | Decision |
|---|---|---|---|
| Wiz/Orca | Comprehensive CSPM | $100K+/year, no AI scoring customization | Pass |
| Prisma Cloud | Multi-cloud native | Complex licensing, limited priority logic | Pass |
| Custom LLM (self-hosted) | No API costs | Ops overhead, model quality | Pass |
| ServiceNow integration | Enterprise ITSM | Requires license, complex setup | Future |

Cost Analysis

Infrastructure Costs (Monthly)

| Component | Usage | Cost | Notes |
|---|---|---|---|
| AKS Node | 1 node (B2s) | $30 | Shared cluster |
| Azure Blob Storage | <1 GB | $0.02 | State files only |
| Key Vault | 10 secrets | $0.03 | Per-operation pricing |
| Container Registry | 1 image | $5 | Basic tier |
| Egress | <1 GB | $0.09 | API responses |
| Subtotal (Infra) | | ~$35 | |

AI Costs (Monthly)

| Operation | Volume | Cost/Unit | Cost |
|---|---|---|---|
| Risk scoring | 500 findings | $0.015/call | $7.50 |
| Complexity fallback | 50 findings | $0.008/call | $0.40 |
| Subtotal (AI) | | | ~$8 |

Total Monthly Cost

| Scenario | Findings | Infrastructure | AI | Total |
|---|---|---|---|---|
| Low | 200 | $35 | $4 | $39 |
| Medium | 500 | $35 | $8 | $43 |
| High | 2,000 | $35 | $32 | $67 |
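
The scenario totals can be reproduced from the unit costs above under two assumptions that the tables imply but do not state: roughly one finding in ten needs the AI complexity fallback (50 of 500), and the AI cost is rounded up to the next whole dollar. The sketch below uses integer thousandths of a dollar to avoid floating-point rounding; the function and constant names are hypothetical.

```go
package main

import "fmt"

const (
	infraUSD         = 35 // monthly infrastructure subtotal from the table
	riskMilliUSD     = 15 // $0.015 per risk-scoring call, in thousandths of a dollar
	fallbackMilliUSD = 8  // $0.008 per complexity-fallback call
	fallbackDivisor  = 10 // assumption: 1 in 10 findings uses the AI fallback
)

// MonthlyCostUSD returns the AI cost and the total cost in whole dollars.
// The AI cost is rounded up, which is an assumption that matches the table.
func MonthlyCostUSD(findings int) (ai, total int) {
	milli := findings*riskMilliUSD + (findings/fallbackDivisor)*fallbackMilliUSD
	ai = (milli + 999) / 1000 // ceiling to whole dollars
	return ai, infraUSD + ai
}

func main() {
	for _, n := range []int{200, 500, 2000} {
		ai, total := MonthlyCostUSD(n)
		fmt.Printf("%d findings: AI $%d, total $%d\n", n, ai, total)
	}
}
```

For 200, 500, and 2,000 findings this yields AI costs of $4, $8, and $32 and totals of $39, $43, and $67, matching the Low, Medium, and High rows.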

Deployment Model

Kubernetes Architecture

AKS Cluster
├── Namespace: cspm-aggregator
│   ├── CronJob (monthly, 1st @ 8:00 AM PT)
│   │   ├── Pod (1 replica)
│   │   │   ├── Container: aggregator
│   │   │   ├── ServiceAccount + Workload Identity
│   │   │   └── Volume mounts (config, reports, tmp)
│   │   └── Secrets (from Key Vault via ESO)
│   ├── ConfigMap (config.yaml)
│   └── NetworkPolicy (egress-only)
└── Shared Infrastructure
    ├── Azure Key Vault
    └── Azure Blob Storage

Deployment Lifecycle

| Event | Trigger | Action |
|---|---|---|
| Scheduled run | CronJob (1st of month, 08:00 PT) | Full pipeline execution |
| Manual run | kubectl create job <job-name> --from=cronjob/cspm-aggregator | Ad-hoc execution |
| Image update | Git push to main | CI/CD -> Container Registry -> kubectl rollout |

Resource Configuration

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi

DR/BC Architecture

Recovery Objectives

| Metric | Target | Rationale |
|---|---|---|
| RTO | 4 hours | Monthly batch job, not real-time |
| RPO | 30 days | State snapshot per run |
| MTTR | 1 hour | Redeploy from CI/CD |

Failure Scenarios

| Scenario | Impact | Recovery |
|---|---|---|
| AKS cluster failure | Job doesn't run | Manual trigger after cluster recovery |
| Cloud provider API outage | Partial data | Retry logic, skip unavailable provider |
| AI API outage | No scoring | Fallback to rule-based (conservative) |
| State corruption | Incorrect deltas | Re-run with --reset-state flag |
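
The AI API outage row describes a fallback to rule-based scoring. A minimal sketch of that pattern follows. The `Scorer` type, `ScoreWithFallback`, and `KeepRaw` are hypothetical, and reading "conservative" as "keep the provider's original severity rather than downgrade it" is an assumption of this example, not something the design states.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAIUnavailable stands in for any failure of the AI scoring call.
var ErrAIUnavailable = errors.New("ai api unavailable")

// Scorer is a hypothetical signature for anything that adjusts a severity.
type Scorer func(rawSeverity string) (string, error)

// ScoreWithFallback tries the AI scorer first. If it fails, the rule-based
// scorer is used instead. The second return value reports which path ran.
func ScoreWithFallback(ai, rules Scorer, raw string) (string, string, error) {
	if sev, err := ai(raw); err == nil {
		return sev, "ai", nil
	}
	sev, err := rules(raw)
	if err != nil {
		return "", "", err
	}
	return sev, "rules", nil
}

// KeepRaw is a conservative rule-based scorer for this sketch: it returns the
// provider's severity unchanged, so a finding is never downgraded.
func KeepRaw(raw string) (string, error) {
	return raw, nil
}

func main() {
	failingAI := func(string) (string, error) { return "", ErrAIUnavailable }
	sev, source, err := ScoreWithFallback(failingAI, KeepRaw, "HIGH")
	fmt.Println(sev, source, err)
}
```

Returning the source alongside the severity lets reports flag which findings were scored without AI context, so they can be re-scored once the API is available again.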

Backup Strategy

| Data | Backup Location | Retention | Frequency |
|---|---|---|---|
| State files | Azure Blob (GRS) | 90 days | Per run |
| Reports | Azure Blob (GRS) | 1 year | Per run |
| Config | Git repository | Indefinite | Per change |
| Secrets | Key Vault (soft delete) | 90 days | Manual |

Technology Stack

| Layer | Technology | Rationale |
|---|---|---|
| Language | Go 1.23+ | Performance, single binary, strong typing |
| AI Provider | Anthropic Claude | Best-in-class reasoning, JSON mode |
| Cloud SDKs | AWS SDK v2, Azure SDK, GCP SDK | Official, well-maintained |
| Logging | go.uber.org/zap | Structured JSON logging |
| Configuration | gopkg.in/yaml.v3 | Human-readable, env var support |
| Container | Docker (Alpine 3.19) | Minimal attack surface |
| Orchestration | Kubernetes (AKS) | Managed, Workload Identity native |
| Secrets | Azure Key Vault | Managed rotation, audit logging |
| State | Azure Blob Storage | Cheap, durable, encrypted |

Architecture Decision Records

| ADR | Title | Decision | Rationale |
|---|---|---|---|
| ADR-001 | Language Selection | Go | Performance, single binary deployment |
| ADR-002 | AI Provider | Anthropic Claude | Superior reasoning, JSON output |
| ADR-003 | Authentication | OIDC/MSI/WIF | Zero credential storage |
| ADR-004 | Deployment Model | Kubernetes CronJob | Scheduled batch, resource efficient |
| ADR-005 | State Storage | Azure Blob | Durable, encrypted, cheap |
| ADR-006 | Complexity Strategy | Rules + AI fallback | Deterministic where possible |
| ADR-007 | Priority Matrix | Risk x Complexity | Actionable prioritization |
| ADR-008 | Container Security | Non-root, read-only | Defense in depth |
| ADR-009 | Network Policy | Egress-only HTTPS | Minimize attack surface |

Full ADR details are in the docs/core/architecture/adr/ directory.


Future Roadmap

| Phase | Feature | Priority |
|---|---|---|
| v1.1 | HTTP API for on-demand queries | High |
| v1.2 | Auto-remediation for Tier 1 findings | High |
| v1.3 | ServiceNow integration | Medium |
| v2.0 | Real-time streaming (event-driven) | Low |
| v2.1 | Custom rule engine (user-defined) | Low |