
CSPM Aggregator - Detailed Design Document (DDD)


Document Control

Version | Date | Author | Role | Contact
1.0 | 8 January 2026 | Liem Vo-Nguyen | Security Architect | [email protected]

Table of Contents

Architecture Decision Records
Data Flow Diagrams
Component Design
API Specifications
Data Models
AI Scoring Design
Security Design
Appendix: Technology References

Architecture Decision Records

ADR-001: Language Selection

Status: Accepted Date: 8 January 2026

Context: Need to choose a language for the aggregator service that handles multi-cloud API calls, JSON processing, and LLM integration.

Options:

Option | Pros | Cons
Go | Single binary, fast startup, strong typing, excellent concurrency | Verbose error handling
Python | Rich AI/ML ecosystem, rapid prototyping | Slower startup, dependency management
Rust | Memory safety, performance | Steep learning curve, slower development

Decision: Go

Rationale:

  • Single binary deployment simplifies container images
  • Fast cold start ideal for CronJob workloads
  • Strong typing catches errors at compile time
  • Excellent cloud SDK support (AWS, Azure, GCP)
  • Generics (available since Go 1.18; this project targets Go 1.23+) improve code reuse

ADR-002: AI Provider Selection

Status: Accepted Date: 8 January 2026

Context: Need an LLM provider for contextual risk scoring and complexity assessment fallback.

Options:

Option | Pros | Cons
Anthropic Claude | Superior reasoning, JSON mode, safety | API-only, no self-host
OpenAI GPT-4 | Wide adoption, function calling | Rate limits, cost
AWS Bedrock | Managed, multiple models | AWS lock-in, complexity
Self-hosted (Llama) | No API costs, privacy | Ops overhead, quality gap

Decision: Anthropic Claude (claude-opus-4-6)

Rationale:

  • Best-in-class reasoning for complex security context
  • Native JSON output mode reduces parsing errors
  • More consistent, near-deterministic output at low temperature (0.1)
  • Strong safety alignment for security-sensitive prompts

ADR-003: Authentication Strategy

Status: Accepted Date: 8 January 2026

Context: Need to authenticate to three cloud providers without storing credentials.

Decision: Federated Identity (OIDC/MSI/WIF)

Cloud | Method | Token Lifetime
AWS | OIDC Federation via STS | 1 hour
Azure | Managed Identity | Automatic
GCP | Workload Identity Federation | 1 hour

Rationale:

  • No stored cloud credentials, which removes the rotation burden for cloud access keys
  • Audit trail via cloud-native identity logs
  • Automatic token refresh handled by SDKs
  • Aligned with zero-trust architecture

ADR-004: Deployment Model

Status: Accepted Date: 8 January 2026

Context: Need to run the aggregator on a schedule (monthly) with minimal operational overhead.

Options:

Option | Pros | Cons
Kubernetes CronJob | Native scheduling, resource limits | Requires cluster
Azure Functions (Timer) | Serverless, pay-per-use | Cold start, 10-min limit
AWS Lambda + EventBridge | Serverless | 15-min limit, cross-cloud auth complex
VM + cron | Simple | Always-on cost, manual patching

Decision: Kubernetes CronJob on AKS

Rationale:

  • Native Workload Identity integration for Azure
  • No execution time limits (job can run 1+ hours)
  • Resource limits and pod security controls (securityContext, Pod Security Standards)
  • Reuse existing AKS infrastructure

ADR-005: State Storage

Status: Accepted Date: 8 January 2026

Context: Need to persist state between runs for delta detection (NEW/EXISTING/CLOSED/REOPENED).

Options:

Option | Cost | Durability | Complexity
Azure Blob Storage | $0.02/GB | 99.999999999% (11 9s) | Low
Azure Table Storage | $0.04/GB | 99.999999999% | Low
PostgreSQL | $25+/mo | High | Medium
Redis | $15+/mo | Medium | Low

Decision: Azure Blob Storage

Rationale:

  • Cheapest option for small state files (<1 MB)
  • At least 11 9s durability with locally redundant storage; GRS replication adds a second region and raises this to 16 9s
  • Simple JSON file read/write (no ORM)
  • Managed Identity authentication

ADR-006: Complexity Assessment Strategy

Status: Accepted Date: 8 January 2026

Context: Need to classify findings into remediation complexity tiers (Tier1/2/3).

Decision: Rule-based with AI fallback

Strategy:

1. Try rule-based matching (25+ predefined rules)
   -> Match found -> Return tier

2. AI fallback for unknown finding types
   -> LLM assesses complexity
   -> Return tier with lower confidence

3. Conservative default
   -> Unknown + AI unavailable -> Tier3

Rationale:

  • Rules provide deterministic, fast classification for known patterns
  • AI handles novel/edge cases without manual rule updates
  • Conservative default ensures no underestimation of effort
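
The three-step strategy can be sketched as a single function. This is an illustration under stated assumptions, not the aggregator's implementation: the `Assessment` type, the three sample rule keys, the `aiAssessor` function type, and the confidence values 1.0 / 0.6 / 0.0 are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Assessment is a hypothetical result type for this sketch.
type Assessment struct {
	Tier       int     // 1, 2 or 3
	Confidence float64 // lower when the tier came from the LLM
	Source     string  // "rule", "ai" or "default"
}

// rules is a tiny stand-in for the predefined rule set (illustrative keys only).
var rules = map[string]int{
	"s3-public-access": 1,
	"security-group":   2,
	"rds-config":       3,
}

// aiAssessor abstracts the LLM call so that it can be absent or fail.
type aiAssessor func(findingType string) (int, error)

// assess applies the rule set first, then the AI fallback, then the
// conservative default.
func assess(findingType string, ai aiAssessor) Assessment {
	// 1. Rule-based match.
	if tier, ok := rules[findingType]; ok {
		return Assessment{Tier: tier, Confidence: 1.0, Source: "rule"}
	}
	// 2. AI fallback for unknown finding types (0.6 is an assumed value).
	if ai != nil {
		if tier, err := ai(findingType); err == nil && tier >= 1 && tier <= 3 {
			return Assessment{Tier: tier, Confidence: 0.6, Source: "ai"}
		}
	}
	// 3. Conservative default: unknown type and no usable AI answer.
	return Assessment{Tier: 3, Confidence: 0.0, Source: "default"}
}

func main() {
	unavailable := func(string) (int, error) { return 0, errors.New("LLM unavailable") }
	fmt.Println(assess("s3-public-access", nil))
	fmt.Println(assess("novel-finding", unavailable))
}
```

An out-of-range tier from the LLM is treated the same as an error, so a malformed answer also falls through to Tier 3.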

ADR-007: Priority Matrix Design

Status: Accepted Date: 8 January 2026

Context: Need to combine risk severity and remediation complexity into actionable priorities.

Decision: 2D Matrix with Escalation Rules

Base Matrix:

Severity | Tier 1 | Tier 2 | Tier 3
CRITICAL | P1 | P1 | P2
HIGH | P1 | P2 | P3
MEDIUM | P3 | P4 | P4
LOW | P4 | P5 | P5

Escalation Rules (each raises the priority by one level, e.g. P3 -> P2, but never past the stated cap):

  1. Production environment: +1 level (capped at P1)
  2. PCI/PII data: +1 level (capped at P2)
  3. Internet-facing resource: +1 level (capped at P2)
  4. SLA overdue: +1 level

Rationale:

  • Simple mental model for security teams
  • Escalations capture business context
  • Maps cleanly to SLA timelines (P1=24h, P2=7d, P3=14d, P4=30d, P5=90d)
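
The base matrix and escalation rules above can be expressed as a lookup followed by capped one-level bumps. This is a sketch under assumptions: the `Escalation` field names are hypothetical, the rules are applied once each in the order listed, and "max P2" is read as "this rule never raises a finding past P2".

```go
package main

import "fmt"

// baseMatrix maps severity to the base priority (1 = P1) for complexity
// tiers 1, 2 and 3, copied from the Base Matrix.
var baseMatrix = map[string][3]int{
	"CRITICAL": {1, 1, 2},
	"HIGH":     {1, 2, 3},
	"MEDIUM":   {3, 4, 4},
	"LOW":      {4, 5, 5},
}

// Escalation holds the four escalation inputs (hypothetical field names).
type Escalation struct {
	Production     bool
	SensitiveData  bool // PCI/PII
	InternetFacing bool
	SLAOverdue     bool
}

// bump raises urgency by one level unless the result would pass the cap.
func bump(p, limit int) int {
	if p > limit {
		return p - 1
	}
	return p
}

// priority applies the base matrix, then each matching rule once, in the
// order the rules are listed (the ordering is an assumption of this sketch).
func priority(severity string, tier int, e Escalation) (string, error) {
	row, ok := baseMatrix[severity]
	if !ok || tier < 1 || tier > 3 {
		return "", fmt.Errorf("unknown severity %q or tier %d", severity, tier)
	}
	p := row[tier-1]
	if e.Production {
		p = bump(p, 1)
	}
	if e.SensitiveData {
		p = bump(p, 2)
	}
	if e.InternetFacing {
		p = bump(p, 2)
	}
	if e.SLAOverdue {
		p = bump(p, 1) // no cap is stated for this rule; P1 is the natural limit
	}
	return fmt.Sprintf("P%d", p), nil
}

func main() {
	p, err := priority("MEDIUM", 2, Escalation{Production: true, SensitiveData: true})
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // P2: base P4, raised twice
}
```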

ADR-008: Container Security

Status: Accepted Date: 8 January 2026

Context: Need to secure the container workload running in AKS.

Decision: Defense in Depth

Control | Setting
Non-root user | runAsUser: 1000
Read-only filesystem | readOnlyRootFilesystem: true
No privilege escalation | allowPrivilegeEscalation: false
Drop all capabilities | capabilities: { drop: [ALL] }
Resource limits | cpu: 1000m, memory: 1Gi

Rationale:

  • Non-root user limits what an attacker can do after compromising the process
  • Read-only filesystem limits persistence mechanisms
  • Dropped capabilities reduce kernel attack surface
  • Resource limits prevent DoS from runaway processes

ADR-009: Network Policy

Status: Accepted Date: 8 January 2026

Context: Need to restrict network access for the aggregator pod.

Decision: Egress-only to DNS and HTTPS

egress:
  # DNS
  - ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
  # HTTPS to public addresses only
  - ports:
      - protocol: TCP
        port: 443
    to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8
            - 172.16.0.0/12
            - 192.168.0.0/16

Rationale:

  • DNS required for service discovery
  • HTTPS required for cloud APIs and LLM
  • Exclude private ranges to prevent lateral movement
  • No ingress required (batch job, not server)
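
To make the effect of the HTTPS rule concrete, the sketch below models which destinations the `ipBlock` admits: any IPv4 address except the three private ranges. It only illustrates the rule's semantics; enforcement is done by the cluster's network plugin, and the function name `httpsEgressAllowed` is hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
)

// excluded mirrors the `except` list of the ipBlock rule.
var excluded = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
}

// httpsEgressAllowed reports whether a destination matches the HTTPS rule:
// inside 0.0.0.0/0 but outside every excepted private range.
func httpsEgressAllowed(dst string) (bool, error) {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false, err
	}
	if !addr.Is4() {
		return false, nil // the 0.0.0.0/0 block covers IPv4 only
	}
	for _, p := range excluded {
		if p.Contains(addr) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	for _, dst := range []string{"52.94.0.10", "10.0.12.7"} {
		ok, err := httpsEgressAllowed(dst)
		fmt.Println(dst, ok, err)
	}
}
```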

Data Flow Diagrams

Scoring Pipeline

Figure: Scoring Pipeline. Shows: Cloud providers -> Normalizer -> Delta Detection -> AI Scoring -> Priority Matrix -> Outputs


Priority Matrix Flow

Figure: Priority Matrix. Shows: Risk severity + Complexity tier -> Base priority -> Escalation rules -> Final priority + SLA


Component Design

Package Structure

cspm-aggregator/
├── cmd/aggregator/ # Application entrypoint
│ └── main.go # CLI flags, pipeline orchestration
├── internal/
│ ├── config/ # Configuration management
│ │ └── config.go # YAML + env var loading
│ ├── providers/ # Cloud provider clients
│ │ ├── aws/securityhub.go
│ │ ├── azure/defender.go
│ │ └── gcp/scc.go
│ ├── normalizer/ # Common finding schema
│ │ └── schema.go # Finding struct, delta detection
│ ├── scoring/ # AI scoring layer
│ │ ├── risk_scorer.go # Contextual risk assessment
│ │ ├── complexity.go # Remediation tier classification
│ │ └── priority.go # Priority matrix calculation
│ ├── asana/ # Asana task sync
│ ├── email/ # Email notifications
│ └── reporter/ # Report generation
├── configs/config.yaml # Configuration template
└── k8s/ # Kubernetes manifests

Interface Definitions

Cloud Provider Interface:

type Provider interface {
    Name() string
    GetFindings(ctx context.Context) ([]normalizer.Finding, error)
}
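
One way the pipeline could use this interface is to query all providers concurrently and keep partial results when one cloud fails. The sketch below is an assumption about orchestration, not the documented behaviour: `collect`, the `stub` provider, and `errDown` are hypothetical, and a local `Finding` type stands in for `normalizer.Finding` so the example is self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Finding stands in for normalizer.Finding to keep the sketch self-contained.
type Finding struct{ ID, CSP string }

// Provider matches the shape of the cloud provider interface.
type Provider interface {
	Name() string
	GetFindings(ctx context.Context) ([]Finding, error)
}

// errDown is a hypothetical failure used by the usage example.
var errDown = errors.New("provider down")

// collect queries all providers concurrently. A failing provider does not
// discard the others' results; its error is wrapped and joined into the
// returned error.
func collect(ctx context.Context, providers []Provider) ([]Finding, error) {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		all  []Finding
		errs []error
	)
	for _, p := range providers {
		wg.Add(1)
		go func(p Provider) {
			defer wg.Done()
			fs, err := p.GetFindings(ctx)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, fmt.Errorf("%s: %w", p.Name(), err))
				return
			}
			all = append(all, fs...)
		}(p)
	}
	wg.Wait()
	return all, errors.Join(errs...) // nil when no provider failed
}

// stub is a fake provider for the usage example.
type stub struct {
	name string
	fs   []Finding
	err  error
}

func (s stub) Name() string { return s.name }

func (s stub) GetFindings(context.Context) ([]Finding, error) { return s.fs, s.err }

func main() {
	providers := []Provider{
		stub{name: "aws", fs: []Finding{{ID: "f-1", CSP: "aws"}}},
		stub{name: "gcp", err: errDown},
	}
	findings, err := collect(context.Background(), providers)
	fmt.Println(len(findings), err)
}
```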

LLM Provider Interface:

type LLMProvider interface {
    Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error)
    Stream(ctx context.Context, req CompletionRequest) (<-chan StreamChunk, error)
    CountTokens(ctx context.Context, text string) (int, error)
    ModelName() string
    MaxContextLength() int
    IsAvailable(ctx context.Context) bool
}

Metadata Provider Interface:

type ResourceMetadataProvider interface {
    GetDependencyCount(ctx context.Context, resourceID string) (int, error)
    IsStateful(ctx context.Context, resourceID, resourceType string) (bool, error)
    IsSharedResource(ctx context.Context, resourceID string) (bool, error)
    GetSLATier(ctx context.Context, resourceID string) (string, error)
}

API Specifications

CLI Interface

# Full pipeline execution
./aggregator --config /app/configs/config.yaml

# Dry run (no external writes)
./aggregator --dry-run

# Single cloud provider
./aggregator --cloud aws

# Version info
./aggregator --version

Planned HTTP API (v1.1)

Get Prioritized Findings:

GET /api/v1/findings
?priority=P1,P2
&severity=CRITICAL,HIGH
&tier=Tier1
&automation_candidate=true
&queue=auto_remediation
&sla_status=overdue
&csp=aws,azure
&cbu=BRAND1,BRAND2
&limit=100
&cursor=xxx

Response:
{
  "findings": [PrioritizedFinding],
  "summary": {
    "total": 150,
    "by_priority": {"P1": 5, "P2": 20, "P3": 45, "P4": 60, "P5": 20},
    "auto_remediation_ready": 12,
    "quick_win_risk_reduction_pct": 35.5
  },
  "next_cursor": "yyy"
}
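
Since this endpoint is only planned, the following is a sketch of how its query string might be parsed rather than a description of existing code. It covers a subset of the parameters; the `Filter` type, `parseFilter`, and the default limit of 100 are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// Filter holds a subset of the query parameters of GET /api/v1/findings.
type Filter struct {
	Priorities []string
	Severities []string
	CSPs       []string
	Limit      int
	Cursor     string
}

// splitCSV turns "P1,P2" into ["P1" "P2"]; an empty value yields nil.
func splitCSV(v string) []string {
	if v == "" {
		return nil
	}
	return strings.Split(v, ",")
}

// parseFilter decodes the raw query string into a Filter.
func parseFilter(rawQuery string) (Filter, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return Filter{}, err
	}
	f := Filter{
		Priorities: splitCSV(q.Get("priority")),
		Severities: splitCSV(q.Get("severity")),
		CSPs:       splitCSV(q.Get("csp")),
		Limit:      100, // assumed default page size
		Cursor:     q.Get("cursor"),
	}
	if raw := q.Get("limit"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 {
			return Filter{}, fmt.Errorf("invalid limit %q", raw)
		}
		f.Limit = n
	}
	return f, nil
}

func main() {
	f, err := parseFilter("priority=P1,P2&severity=CRITICAL,HIGH&csp=aws,azure&limit=100&cursor=xxx")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", f)
}
```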

Dashboard Summary:

GET /api/v1/dashboard/summary

Response:
{
  "total_findings": 150,
  "p1_count": 5,
  "p2_count": 20,
  "p3_count": 45,
  "p4_count": 60,
  "p5_count": 20,
  "auto_remediation_ready": 12,
  "quick_wins": 8,
  "quick_win_risk_reduction_pct": 35.5,
  "on_track_sla": 120,
  "at_risk_sla": 20,
  "overdue_sla": 10,
  "trend": {
    "new_findings": 25,
    "closed_findings": 40,
    "net_change": -15,
    "closure_rate": 0.62
  }
}

Quick Wins Report:

GET /api/v1/reports/quick-wins

Response:
{
  "quick_wins": [
    {
      "finding_id": "xxx",
      "title": "S3 bucket public access",
      "priority": "P1",
      "tier": "Tier1",
      "automation_candidate": true,
      "estimated_effort_hours": 0.5,
      "risk_reduction_pct": 5.2
    }
  ],
  "total_risk_reduction_pct": 35.5
}

Data Models

Common Finding Schema

type Finding struct {
    // Core identification
    FindingID      string `json:"finding_id"`
    FindingIDShort string `json:"finding_id_short"` // SHA256 dedup key
    CSP            string `json:"csp"` // aws|azure|gcp
    AccountID      string `json:"account_id"`
    ResourceID     string `json:"resource_id"`

    // Details
    Title       string `json:"title"`
    Description string `json:"description"`
    Severity    string `json:"severity"` // CRITICAL|HIGH|MEDIUM|LOW
    Status      string `json:"status"` // ACTIVE|RESOLVED|SUPPRESSED
    ControlID   string `json:"control_id"`
    Standard    string `json:"standard"` // CIS|FSBP|MCSB

    // Classification (from org metadata)
    CBU     string `json:"cbu"` // Cost Business Unit
    Tier    string `json:"tier"` // asset tier (not the remediation complexity tier): Tier1-Prod|Tier2-NonProd|Tier3-Dev
    EnvType string `json:"env_type"` // DEV|STG|PROD
    Owner   string `json:"owner"`

    // Timestamps
    FirstSeen      time.Time `json:"first_seen"`
    LastSeen       time.Time `json:"last_seen"`
    RemediationSLA time.Time `json:"remediation_sla"`

    // Tracking
    AsanaTaskID string `json:"asana_task_id,omitempty"`
    DeltaStatus string `json:"delta_status"` // NEW|EXISTING|CLOSED|REOPENED
    DaysOpen    int    `json:"days_open"`
}
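
The schema marks `FindingIDShort` as a SHA256 dedup key but does not say which fields feed the hash. The sketch below shows one plausible derivation; the choice of input fields, the `|` separator, and the truncation to 16 hex characters are assumptions, and the sample identifiers are placeholders.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// shortID derives a stable dedup key from fields that identify the same
// finding across runs. The separator keeps ("ab", "c") and ("a", "bc") from
// hashing to the same value.
func shortID(csp, accountID, resourceID, controlID string) string {
	key := strings.Join([]string{csp, accountID, resourceID, controlID}, "|")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16] // assumed truncation length
}

func main() {
	// Placeholder identifiers for illustration only.
	fmt.Println(shortID("aws", "123456789012", "arn:aws:s3:::example-bucket", "S3.1"))
}
```

Because the key depends only on stable identifiers and not on timestamps or status, the same finding hashes to the same value on every run, which is what delta detection needs.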

Risk Assessment

type RiskAssessment struct {
    OriginalSeverity   string   `json:"original_severity"`
    AdjustedSeverity   string   `json:"adjusted_severity"`
    SeverityDirection  string   `json:"severity_direction"` // upgraded|downgraded|unchanged
    RiskScore          int      `json:"risk_score"` // 1-100
    Confidence         float64  `json:"confidence"` // 0.0-1.0
    Rationale          string   `json:"rationale"`
    MitigatingFactors  []string `json:"mitigating_factors"`
    AggravatingFactors []string `json:"aggravating_factors"`
    RecommendedAction  string   `json:"recommended_action"` // remediate|accept_risk|investigate|suppress
    AutoAcceptEligible bool     `json:"auto_accept_eligible"`
    AutoAcceptReason   string   `json:"auto_accept_reason,omitempty"`
}

Complexity Assessment

type ComplexityAssessment struct {
    Tier                int      `json:"tier"` // 1|2|3
    ComplexityScore     int      `json:"complexity_score"` // 1-100
    AutomationCandidate bool     `json:"automation_candidate"`
    AutomationBlockers  []string `json:"automation_blockers,omitempty"`

    // Coordination requirements
    RequiresAppTeam      bool `json:"requires_app_team"`
    RequiresNetworkTeam  bool `json:"requires_network_team"`
    RequiresDBTeam       bool `json:"requires_db_team"`
    RequiresChangeWindow bool `json:"requires_change_window"`
    RequiresDowntime     bool `json:"requires_downtime"`

    // Impact estimates
    ServiceImpact        string  `json:"service_impact"` // none|minimal|moderate|significant
    EstimatedDowntimeMin int     `json:"estimated_downtime_min"`
    EstimatedEffortHours float64 `json:"estimated_effort_hours"`
    RecommendedApproach  string  `json:"recommended_approach"`
    Rationale            string  `json:"rationale"`
}

Prioritized Finding

type PrioritizedFinding struct {
    Finding              Finding              `json:"finding"`
    RiskAssessment       RiskAssessment       `json:"risk_assessment"`
    ComplexityAssessment ComplexityAssessment `json:"complexity_assessment"`

    // Priority output
    Priority             string   `json:"priority"` // P1|P2|P3|P4|P5
    EscalationReasons    []string `json:"escalation_reasons,omitempty"`
    Queue                string   `json:"queue"` // auto_remediation|security_review|app_team|change_board|remediation_queue
    AutoRemediationReady bool     `json:"auto_remediation_ready"`

    // SLA tracking
    SLADeadline  time.Time `json:"sla_deadline"`
    SLAStatus    string    `json:"sla_status"` // on_track|at_risk|overdue
    DaysUntilSLA int       `json:"days_until_sla"`
}
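
The SLA tracking fields can be derived from the priority and the SLA timelines in ADR-007 (P1=24h, P2=7d, P3=14d, P4=30d, P5=90d). The sketch below is illustrative only: measuring the window from first discovery and treating the last quarter of the window as `at_risk` are assumptions, as are the names `SLA`, `evaluate`, and `discovered`.

```go
package main

import (
	"fmt"
	"time"
)

// slaWindow uses the timelines stated in ADR-007.
var slaWindow = map[string]time.Duration{
	"P1": 24 * time.Hour,
	"P2": 7 * 24 * time.Hour,
	"P3": 14 * 24 * time.Hour,
	"P4": 30 * 24 * time.Hour,
	"P5": 90 * 24 * time.Hour,
}

// SLA mirrors the three SLA tracking fields of PrioritizedFinding.
type SLA struct {
	Deadline  time.Time
	Status    string // on_track|at_risk|overdue
	DaysUntil int    // whole days remaining; negative once overdue
}

// discovered is a sample first-seen timestamp for the usage example.
var discovered = time.Date(2026, time.January, 8, 0, 0, 0, 0, time.UTC)

// evaluate computes the deadline from firstSeen (an assumption) and marks a
// finding at_risk once a quarter or less of its window remains (also assumed).
func evaluate(priority string, firstSeen, now time.Time) (SLA, error) {
	window, ok := slaWindow[priority]
	if !ok {
		return SLA{}, fmt.Errorf("unknown priority %q", priority)
	}
	deadline := firstSeen.Add(window)
	remaining := deadline.Sub(now)
	s := SLA{Deadline: deadline, DaysUntil: int(remaining.Hours() / 24)}
	switch {
	case remaining < 0:
		s.Status = "overdue"
	case remaining <= window/4:
		s.Status = "at_risk"
	default:
		s.Status = "on_track"
	}
	return s, nil
}

func main() {
	s, err := evaluate("P2", discovered, discovered.AddDate(0, 0, 6))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Status, s.DaysUntil)
}
```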

AI Scoring Design

Risk Scorer Architecture

Context Signals (30+):

Category | Signals
Asset Classification | AssetTier, EnvType, DataClassification, BusinessCriticality
Network Exposure | InternetFacing, VPCType, IngressPorts, EgressRestricted
Compensating Controls | WAFEnabled, EDREnabled, DLPEnabled, EncryptionAtRest, EncryptionInTransit, MFARequired, PrivateEndpoint
Vulnerability Context | CVSSScore, ExploitAvailable, ExploitInWild, PackageInUse, PatchAvailable
Historical Patterns | FalsePositiveHistory, FPRateForType
Business Context | ComplianceScopes, DataResidency, CostCenter, ApplicationOwner, SupportTier

Risk Scoring Flow:

1. Enrich context from metadata providers
2. Load FP history for finding type
3. Check auto-accept scenarios:
   - LOW severity in sandbox -> auto-accept
   - High FP rate (>30%) + 3+ historical FPs -> auto-accept
4. Build LLM prompt with all context signals
5. Call Claude API (temperature=0.1)
6. Parse JSON response
7. Apply guardrails:
   - Never downgrade CRITICAL on Tier1-Prod + internet-facing
   - Minimum MEDIUM for PCI/PII data
   - Cap confidence at 0.70 (70%) when package usage is unknown
   - Ensure risk score aligns with severity
Complexity Normalizer Rules

Tier 1 (Full Automation):

Cloud | Finding Types
AWS | S3 public access, logging disabled, IMDSv2, tagging, CloudTrail
Azure | HTTPS only, diagnostic logging
GCP | Public bucket access, audit logging

Tier 2 (Partial Automation):

Cloud | Finding Types
AWS | Security groups, IAM policies, TLS config, patching
Azure | NSG rules, Key Vault access
GCP | Firewall rules, IAM bindings

Tier 3 (Manual Execution):

Cloud | Finding Types
AWS | RDS config, network architecture, critical patches
Azure | SQL configuration
GCP | Cloud SQL, GKE cluster configuration

Environment Bumps:

  • Production environment: +1 tier
  • Stateful resource (database, queue): +1 tier
  • High dependencies (>5): +1 tier
  • Shared resource: Minimum Tier2, no automation
  • PCI/PII/PHI data: Minimum Tier2, requires change window
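
The environment bumps can be applied on top of the rule-assigned base tier as shown below. This is a sketch under assumptions: the `Resource` and `Outcome` types are hypothetical, bumps are cumulative, and the result is capped at Tier 3 because no higher tier is defined.

```go
package main

import "fmt"

// Resource carries the inputs to the environment bumps (hypothetical names).
type Resource struct {
	Production   bool
	Stateful     bool // database, queue
	Shared       bool
	Regulated    bool // PCI/PII/PHI data
	Dependencies int
}

// Outcome is the adjusted classification.
type Outcome struct {
	Tier                 int
	AutomationCandidate  bool
	RequiresChangeWindow bool
}

// adjust applies the bumps to a base tier and caps the result at Tier 3.
func adjust(baseTier int, automatable bool, r Resource) Outcome {
	o := Outcome{Tier: baseTier, AutomationCandidate: automatable}
	if r.Production {
		o.Tier++
	}
	if r.Stateful {
		o.Tier++
	}
	if r.Dependencies > 5 {
		o.Tier++
	}
	if r.Shared {
		if o.Tier < 2 {
			o.Tier = 2
		}
		o.AutomationCandidate = false
	}
	if r.Regulated {
		if o.Tier < 2 {
			o.Tier = 2
		}
		o.RequiresChangeWindow = true
	}
	if o.Tier > 3 {
		o.Tier = 3 // assumed cap: only three tiers exist
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", adjust(1, true, Resource{Production: true, Dependencies: 8}))
}
```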

Security Design

Authentication Flow

Figure: System Architecture. Shows: OIDC/MSI/WIF authentication to cloud providers, Key Vault for secrets

Data Access Control

Data | Read | Write | Delete
Cloud findings | Aggregator | N/A | N/A
State files | Aggregator | Aggregator | Admin
Reports | Aggregator, Users | Aggregator | Admin
Config | Aggregator | Admin | Admin
Secrets | Aggregator | Admin | Admin

Secrets Management

Secret | Storage | Rotation | Access
Anthropic API Key | Key Vault | Annual | Aggregator (MSI)
Asana PAT | Key Vault | Annual | Aggregator (MSI)
Graph Client Secret | Key Vault | Annual | Aggregator (MSI)

Audit Logging

Event | Log Level | Fields
Pipeline start | INFO | timestamp, version, config_hash
Provider query | INFO | provider, finding_count, duration
AI scoring | DEBUG | finding_id, original_severity, adjusted_severity
External write | INFO | destination, record_count
Error | ERROR | operation, error_message, stack_trace

Appendix: Technology References

Technology | Purpose | Documentation
Go 1.23 | Language runtime | Official docs
AWS SDK for Go v2 | AWS API client | AWS docs
Azure SDK for Go | Azure API client | Microsoft docs
GCP Go Client Libraries | GCP API client | Google docs
Anthropic Claude | LLM API | Anthropic docs
Zap Logger | Structured logging | Uber docs
Azure Blob Storage | State persistence | Microsoft docs
Azure Key Vault | Secrets management | Microsoft docs
Kubernetes CronJob | Scheduling | K8s docs
Asana API | Task management | Asana docs
Microsoft Graph | Email API | Microsoft docs