
Detailed Design Document: Cloud Aegis Enterprise Cloud Governance Platform


Document Control

Property         Value
Document ID      AE-DDD-001
Version          2.0
Status           Approved
Classification   Internal
Created          January 5, 2026
Last Updated     February 27, 2026

Author

Name             Role                Email
Liem Vo-Nguyen   Security Architect  [email protected]

Approvers

Name        Role                 Signature            Date
Admin One   Engineering Lead     [email protected]   Mar 4, 2026
Admin One   Security Director    [email protected]   Mar 4, 2026
Admin One   Principal Architect  [email protected]   Mar 4, 2026

Document History

Version  Date          Author        Changes
0.1      Jan 2, 2026   L. Vo-Nguyen  Initial draft
0.2      Jan 3, 2026   L. Vo-Nguyen  Added compliance module design
1.0      Jan 5, 2026   L. Vo-Nguyen  First release
1.1      Feb 14, 2026  L. Vo-Nguyen  Added Section 3.4 Remediation Dispatcher
1.2      Feb 20, 2026  L. Vo-Nguyen  Added Section 3.5 AI Governance Module (merged from AgentGuard)
1.3      Feb 25, 2026  L. Vo-Nguyen  Added Section 3.6 IaC Deploy Layer
2.0      Feb 27, 2026  L. Vo-Nguyen  Added Section 3.7 Risk Intelligence (Planned); SLA updates; version bump
2.1      Mar 20, 2026  L. Vo-Nguyen  Rename sweep: CloudForge to Cloud Aegis; OPA namespace cloudforge.ai to aegis.ai
Related Documents

Document             Link
High-Level Design    HLD.md
Component Rationale  component-rationale.md
DR/BC Plan           DR-BC.md
API Specification    Planned — not yet created

1. Introduction

1.1 Purpose

This Detailed Design Document (DDD) provides comprehensive technical specifications for implementing the Cloud Aegis Enterprise Cloud Governance Platform. It supplements the High-Level Design (HLD) with implementation-level details.

1.2 Scope

This document covers:

  • Detailed component specifications
  • Data models and schemas
  • API contracts
  • Integration patterns
  • Security implementation details
  • Performance requirements

1.3 Audience

  • Development Engineers
  • DevOps/SRE Engineers
  • Security Engineers
  • QA Engineers

2. System Context

2.1 External Integrations

┌─────────────────────────────────────────────────────────────────────────────┐
│                                 Cloud Aegis                                 │
└─────────────────────────────────────────────────────────────────────────────┘
      │            │            │            │            │
      ▼            ▼            ▼            ▼            ▼
 ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
 │   VCS   │  │  SAST   │  │   IdP   │  │   GRC   │  │  Cloud  │
 │ GitHub  │  │ Sonar   │  │ Entra   │  │ SNOW    │  │ AWS     │
 │ GitLab  │  │ Veracode│  │ Okta    │  │ Archer  │  │ Azure   │
 │ ADO     │  │ Checkov │  │         │  │         │  │ GCP     │
 └─────────┘  └─────────┘  └─────────┘  └─────────┘  └─────────┘

2.2 Integration Authentication

System        Auth Method                Credential Storage
GitHub        OAuth App / PAT            AWS Secrets Manager
GitLab        Personal Access Token      AWS Secrets Manager
Azure DevOps  PAT / Service Principal    Azure Key Vault
SonarQube     API Token                  AWS Secrets Manager
Veracode      HMAC API Credentials       AWS Secrets Manager
Entra ID      OIDC / Client Credentials  Azure Key Vault
Okta          API Token / OAuth          AWS Secrets Manager
ServiceNow    Basic Auth / OAuth         AWS Secrets Manager
Archer        Session Token              AWS Secrets Manager
AWS           OIDC Federation (WIF)      None (IAM Roles)
Azure         Workload Identity          None (Managed Identity)
GCP           Workload Identity          None (Service Account)

3. Component Detailed Design

3.1 Compliance Framework Engine

3.1.1 Package Structure

internal/compliance/
├── framework.go # Framework manager and core types
├── finding.go # Finding schema and methods
├── deduplication.go # Deduplication logic
├── ai_analyzer.go # AI-powered analysis
├── frameworks_builtin.go # CIS, NIST, ISO, PCI-DSS
├── frameworks_sector.go # HIPAA, SOX, GLBA, FFIEC
├── frameworks_gov_extended.go # CMMC, ITAR, DFARS
├── frameworks_automotive.go # ISO 21434, UN ECE R155, TISAX
└── mapper.go # Finding-to-control mapping

3.1.2 Finding Data Model

type Finding struct {
    // Core Identification
    ID              string          `json:"id"`
    Source          string          `json:"source"`
    SourceFindingID string          `json:"source_finding_id"`
    Type            FindingType     `json:"type"`
    Category        FindingCategory `json:"category"`

    // Resource Information
    ResourceType ResourceType `json:"resource_type"`
    ResourceID   string       `json:"resource_id"`
    ResourceName string       `json:"resource_name"`

    // Platform & Environment
    Platform        Platform        `json:"platform"`
    CloudProvider   CloudProvider   `json:"cloud_provider"`
    EnvironmentType EnvironmentType `json:"environment_type"`

    // Severity & Risk
    StaticSeverity string  `json:"static_severity"`
    AIRiskScore    float64 `json:"ai_risk_score"`
    AIRiskLevel    string  `json:"ai_risk_level"`

    // Workflow
    WorkflowStatus WorkflowStatus `json:"workflow_status"`
    Assignee       *AssigneeInfo  `json:"assignee,omitempty"`

    // Compliance
    ComplianceMappings []ComplianceMapping `json:"compliance_mappings"`
}

3.1.3 Deduplication Algorithm

Input: New Finding F, Existing Findings []E

1. Generate DeduplicationKey for F:
   Key = SHA256(ResourceType + ResourceID + CanonicalRuleID + Title + CVEs)

2. Check for exact duplicates:
   FOR each E in existing:
       IF E.DeduplicationKey == F.DeduplicationKey:
           RETURN (F, isDuplicate=true)

3. Check for equivalent rules:
   FOR each E in existing:
       IF E.ResourceID == F.ResourceID:
           IF areRulesEquivalent(E.SourceFindingID, F.SourceFindingID):
               IF shouldReplaceExisting(F, E):
                   MARK E for removal
                   RETURN (F, isDuplicate=false)
               ELSE:
                   F.DuplicateOf = E.ID
                   RETURN (F, isDuplicate=true)

4. RETURN (F, isDuplicate=false)
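Step 1 can be sketched in Go as follows. The hashed fields come from the algorithm above; the "|" separator, the CVE joining, and the function name are illustrative assumptions, not the shipped implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// dedupKey derives the DeduplicationKey from the five inputs named in
// step 1. Separator and ordering are illustrative assumptions.
func dedupKey(resourceType, resourceID, canonicalRuleID, title string, cves []string) string {
	material := strings.Join([]string{
		resourceType, resourceID, canonicalRuleID, title, strings.Join(cves, ","),
	}, "|")
	sum := sha256.Sum256([]byte(material))
	return hex.EncodeToString(sum[:])
}

func main() {
	k := dedupKey("s3_bucket", "arn:aws:s3:::audit-logs",
		"s3-bucket-public-access", "Bucket allows public access", nil)
	fmt.Println(len(k)) // 64 (SHA-256 as hex)
}
```

The key is deterministic for identical inputs, which is what makes the exact-duplicate check in step 2 a simple string comparison.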

3.1.4 Rule Equivalence Mappings

Canonical Rule            Equivalent Rules
s3-bucket-public-access   S3.1, S3.2, S3.3, CKV_AWS_19, CKV_AWS_20, CKV_AWS_21
ec2-security-group-open   EC2.19, EC2.2, CKV_AWS_23, CKV_AWS_24, CKV_AWS_25
iam-root-access-key       IAM.4, CKV_AWS_41
encryption-at-rest        S3.4, RDS.3, EBS.1, CKV_AWS_3, CKV_AWS_16
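One way to back the areRulesEquivalent check from step 3 of the deduplication algorithm is to invert the table above into a rule-to-canonical map. The map contents are copied from the table; the lookup helper itself is an illustrative sketch:

```go
package main

import "fmt"

// canonicalRule maps each source-specific rule ID to its canonical rule,
// per the equivalence table above.
var canonicalRule = map[string]string{
	"S3.1": "s3-bucket-public-access", "S3.2": "s3-bucket-public-access",
	"S3.3": "s3-bucket-public-access", "CKV_AWS_19": "s3-bucket-public-access",
	"CKV_AWS_20": "s3-bucket-public-access", "CKV_AWS_21": "s3-bucket-public-access",
	"EC2.19": "ec2-security-group-open", "EC2.2": "ec2-security-group-open",
	"CKV_AWS_23": "ec2-security-group-open", "CKV_AWS_24": "ec2-security-group-open",
	"CKV_AWS_25": "ec2-security-group-open",
	"IAM.4": "iam-root-access-key", "CKV_AWS_41": "iam-root-access-key",
	"S3.4": "encryption-at-rest", "RDS.3": "encryption-at-rest",
	"EBS.1": "encryption-at-rest", "CKV_AWS_3": "encryption-at-rest",
	"CKV_AWS_16": "encryption-at-rest",
}

// areRulesEquivalent reports whether two rule IDs resolve to the same
// canonical rule; unknown IDs never match.
func areRulesEquivalent(a, b string) bool {
	ca, ok1 := canonicalRule[a]
	cb, ok2 := canonicalRule[b]
	return ok1 && ok2 && ca == cb
}

func main() {
	fmt.Println(areRulesEquivalent("S3.1", "CKV_AWS_19")) // true
	fmt.Println(areRulesEquivalent("S3.1", "IAM.4"))      // false
}
```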

3.2 CI/CD Security Module

3.2.1 Package Structure

internal/cicd/
├── scanner.go # Pipeline scanner
├── dependency_scanner.go # Dependency analysis
├── vcs/
│ ├── provider.go # VCS interface
│ ├── github.go # GitHub/GH Enterprise
│ ├── gitlab.go # GitLab
│ └── azure_devops.go # Azure DevOps
└── sast/
├── provider.go # SAST interface
├── sonarqube.go # SonarQube/SonarCloud
├── checkov.go # Checkov IaC scanning
└── veracode.go # Veracode SAST/DAST

3.2.2 VCS Provider Interface

type Provider interface {
    Name() string
    GetRepositories(ctx context.Context) ([]*Repository, error)
    GetPullRequests(ctx context.Context, owner, repo, state string) ([]*PullRequest, error)
    GetPipelines(ctx context.Context, owner, repo string) ([]*Pipeline, error)
    GetSecurityAlerts(ctx context.Context, owner, repo string) ([]*SecurityAlert, error)
    CreateCheckRun(ctx context.Context, owner, repo, sha string, check *CheckRun) error
}

3.2.3 SAST Provider Interface

type Provider interface {
    Name() string
    Type() string // sast, dast, sca, iac
    Scan(ctx context.Context, req *ScanRequest) (*ScanResult, error)
    GetScanStatus(ctx context.Context, scanID string) (*ScanStatus, error)
    GetFindings(ctx context.Context, scanID string) ([]*Finding, error)
}

3.3 Identity & Zero Trust Module

3.3.1 Package Structure

internal/identity/
├── provider.go # Identity provider interface
├── entra_id.go # Microsoft Entra ID
├── okta.go # Okta
└── zero_trust.go # Zero Trust policy engine

3.3.2 Zero Trust Policy Evaluation

type PolicyDecision struct {
    Allow           bool
    RequireMFA      bool
    RequireDevice   bool
    SessionDuration time.Duration
    RiskScore       float64
    Reason          string
}

func (z *ZeroTrustEnforcer) EnforcePolicy(ctx context.Context, req AccessRequest) (*PolicyDecision, error) {
    // 1. Evaluate user risk
    userRisk := z.evaluateUserRisk(req.User)

    // 2. Evaluate device compliance
    deviceCompliance := z.evaluateDeviceCompliance(req.Device)

    // 3. Evaluate resource sensitivity
    resourceSensitivity := z.evaluateResourceSensitivity(req.Resource)

    // 4. Apply policies
    for _, policy := range z.policies {
        if policy.Matches(req) {
            return policy.Evaluate(userRisk, deviceCompliance, resourceSensitivity)
        }
    }

    // 5. Default deny
    return &PolicyDecision{Allow: false, Reason: "No matching policy"}, nil
}

3.4 Remediation Dispatcher

3.4.1 Package Structure

pkg/remediation/
├── executor.go # Batch executor, semaphore, dry-run routing
└── types.go # Remediator interface, result types, RollbackState

internal/remediation/
├── network/
│ └── block_ssh.go # BlockPublicSSHRemediator (Tier 1, AWS/GCP/Azure)
├── compute/ # (planned) EBS encryption, IMDSv2, public AMI handlers
├── identity/ # (planned) IAM root key, stale access key handlers
└── storage/ # (planned) S3 public access, versioning handlers

internal/findings/
└── finding.go # PrioritizedFinding bridge type consumed by all handlers

3.4.2 Remediator Interface

All remediation handlers implement the Remediator interface defined in pkg/remediation/types.go. The interface enforces a four-method contract covering execution, validation, tier classification, and simulation:

// Remediator is the interface that all remediation handlers must implement.
type Remediator interface {
    // Remediate executes the remediation action for the given finding.
    Remediate(ctx context.Context, finding *findings.PrioritizedFinding) (*RemediationResult, error)

    // Validate verifies that the remediation was successful.
    Validate(ctx context.Context, finding *findings.PrioritizedFinding) (*ValidationResult, error)

    // Tier returns the complexity tier (1-3) for this remediation.
    // Tier 1: Auto-safe, no approval needed (DEV/STG)
    // Tier 2: Requires verification before PROD
    // Tier 3: Requires change window
    Tier() int

    // DryRun simulates the remediation without making changes.
    DryRun(ctx context.Context, finding *findings.PrioritizedFinding) (*DryRunResult, error)
}

3.4.3 Finding Bridge Type

The internal/findings package defines the primary type consumed by all handlers. It mirrors the cspm-aggregator's scoring package schema for JSON compatibility and will be removed once the aggregator is merged:

// PrioritizedFinding contains the full assessment for a finding.
// This is the primary type consumed by remediation handlers.
type PrioritizedFinding struct {
    Finding              *Finding              `json:"finding"`
    RiskAssessment       *RiskAssessment       `json:"risk_assessment,omitempty"`
    ComplexityAssessment *ComplexityAssessment `json:"complexity_assessment,omitempty"`
    Priority             string                `json:"priority"`
    PriorityScore        int                   `json:"priority_score"`
    AutoRemediationReady bool                  `json:"auto_remediation_ready"`
    RecommendedAction    string                `json:"recommended_action"`
    AssignedQueue        string                `json:"assigned_queue"`
    RequiresApproval     bool                  `json:"requires_approval"`
    AssessedAt           time.Time             `json:"assessed_at"`
}

The Finding.Context struct carries business context — asset tier, environment type, data classification, internet-facing flag, and compliance scopes — that is used by handlers for bastion-host heuristics and tier gate decisions.

3.4.4 Tiered Execution Model

Remediation actions are classified into three tiers that gate execution authority:

Tier  Name                   Auto-Execute                       Approval Required     Scope
T1    Auto-safe              Yes                                None                  Always runs; changes are safe to apply automatically (e.g., block public SSH on non-bastion)
T2    Requires verification  Only if AutoRemediationReady=true  Pre-PROD review       Moderate blast radius; validated before applying to production
T3    Change window          No                                 Full change approval  High blast radius; requires scheduled change window

The executor enforces this gate at the Execute method level:

// Tier 1 = auto-safe (always runs). Tier 2+ require AutoRemediationReady [SEC-006]
if !finding.AutoRemediationReady && handler.Tier() > 1 {
return &RemediationResult{
FindingID: finding.Finding.ID,
Success: false,
Message: fmt.Sprintf("Auto-remediation not approved for tier %d finding", handler.Tier()),
}, nil
}

3.4.5 Executor Flow

The Executor dispatches findings to registered handlers by FindingType key. The execution sequence is:

PrioritizedFinding
        |
        v
[Nil guard + field validation]   -- SEC-001: prevents nil pointer panics
        |
        v
[Handler lookup by FindingType]
        |
        v
[Tier gate check]                -- T2/T3 require AutoRemediationReady=true
        |
        v
[Dry-run branch?]
   Yes --> handler.DryRun()   --> RemediationResult{Message: "DRY-RUN: ..."}
   No  --> handler.Remediate()
               |
               v
        handler.Validate()
               |
         IsCompliant?
   Yes --> result.Success = true
   No  --> result.Success = false (remediation applied but validation failed)

The full Execute method applies the sequence including post-remediation validation:

// Execute processes a finding and routes it to the appropriate handler.
func (e *Executor) Execute(ctx context.Context, finding *findings.PrioritizedFinding) (*RemediationResult, error) {
if finding == nil || finding.Finding == nil {
return nil, fmt.Errorf("finding or finding.Finding is nil")
}
if finding.Finding.ID == "" || finding.Finding.FindingType == "" {
return nil, fmt.Errorf("finding missing required fields: ID=%q, FindingType=%q",
finding.Finding.ID, finding.Finding.FindingType)
}

handler, ok := e.handlers[finding.Finding.FindingType]
if !ok {
return nil, fmt.Errorf("no handler registered for finding type: %s", finding.Finding.FindingType)
}

if !finding.AutoRemediationReady && handler.Tier() > 1 {
return &RemediationResult{
FindingID: finding.Finding.ID,
Success: false,
Message: fmt.Sprintf("Auto-remediation not approved for tier %d finding", handler.Tier()),
}, nil
}

if e.dryRun {
dryRunResult, err := handler.DryRun(ctx, finding)
// ...
}

result, err := handler.Remediate(ctx, finding)
// ...
validation, err := handler.Validate(ctx, finding)
// ...
}

3.4.6 Batch Processing with Semaphore

ExecuteBatch processes multiple findings concurrently up to a configurable maxConcurrency limit. Results are guaranteed to be returned in input order regardless of goroutine completion order. Context cancellation is handled gracefully — remaining items are marked as cancelled rather than deadlocked:

// ExecuteBatch processes multiple findings concurrently (up to maxConcurrency).
// Results are returned in the same order as the input batch [SEC-002].
func (e *Executor) ExecuteBatch(ctx context.Context, batch []*findings.PrioritizedFinding, maxConcurrency int) ([]*RemediationResult, error) {
if maxConcurrency <= 0 {
maxConcurrency = 5
}

results := make([]*RemediationResult, len(batch)) // pre-allocated at fixed indices
sem := make(chan struct{}, maxConcurrency)

type resultPair struct {
result *RemediationResult
err error
index int
}

resultChan := make(chan resultPair, len(batch))

for i := range batch {
select {
case sem <- struct{}{}:
case <-ctx.Done():
// Mark remaining as cancelled; do not block
}
go func(idx int, f *findings.PrioritizedFinding) {
defer func() { <-sem }()
result, err := e.Execute(ctx, f)
resultChan <- resultPair{result: result, err: err, index: idx}
}(i, batch[i])
}
// Drain resultChan and place each result at its original index
// ...
}

Key properties of the batch executor:

  • Default concurrency: 5 (configurable via maxConcurrency)
  • Context-aware semaphore acquisition prevents goroutine leaks on cancellation
  • Index-preserving results enable deterministic audit log ordering
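The index-preserving fan-out can be reduced to a minimal, self-contained sketch, with the remediation types stubbed out as ints and a buffered-channel semaphore bounding in-flight work. Names here are illustrative, not the executor's API:

```go
package main

import (
	"fmt"
	"sync"
)

// runBatch runs work over inputs with at most maxConcurrency goroutines
// in flight; each goroutine writes to its own fixed slice index, so output
// order always matches input order.
func runBatch(inputs []int, maxConcurrency int, work func(int) int) []int {
	if maxConcurrency <= 0 {
		maxConcurrency = 5 // same default as the executor
	}
	results := make([]int, len(inputs)) // pre-allocated at fixed indices
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for i, in := range inputs {
		sem <- struct{}{} // acquire a slot
		wg.Add(1)
		go func(idx, v int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[idx] = work(v)   // distinct index per goroutine: no contention
		}(i, in)
	}
	wg.Wait()
	return results
}

func main() {
	out := runBatch([]int{1, 2, 3, 4}, 2, func(v int) int { return v * 10 })
	fmt.Println(out) // [10 20 30 40]
}
```

Writing to distinct slice indices from separate goroutines is race-free in Go, which is what makes the pre-allocated results slice safe without a mutex.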

3.4.7 Handler Registration

Handlers are registered against the FindingType string key used in the finding's FindingType field. Multiple CSP variants can map to the same handler:

executor := remediation.NewExecutor(false) // false = live mode

sshHandler := network.NewBlockPublicSSHRemediator()
executor.Register("OPEN_SSH_PORT", sshHandler)
executor.Register("AWS.EC2.SecurityGroup.SSH", sshHandler)
executor.Register("GCP.OPEN_SSH_PORT", sshHandler)

The BlockPublicSSHRemediator dispatches internally by CSP based on finding.Finding.Source:

func (b *BlockPublicSSHRemediator) Remediate(ctx context.Context, finding *findings.PrioritizedFinding) (*remediation.RemediationResult, error) {
    // result (a *remediation.RemediationResult) is constructed earlier in the
    // full implementation; its setup is elided in this excerpt.
    switch {
    case strings.Contains(finding.Finding.Source, "aws"):
        return b.remediateAWS(ctx, finding, result)
    case strings.Contains(finding.Finding.Source, "gcp"):
        return b.remediateGCP(ctx, finding, result)
    case strings.Contains(finding.Finding.Source, "azure"):
        return b.remediateAzure(ctx, finding, result)
    default:
        return nil, fmt.Errorf("unsupported cloud source: %q", finding.Finding.Source)
    }
}

The AWS path calls ec2.RevokeSecurityGroupIngress to remove the 0.0.0.0/0:22 ingress rule, then validates via ec2.DescribeSecurityGroups that no public SSH rule remains. The GCP and Azure paths are stubs pending implementation.

3.4.8 Dry-Run Mode

When the executor is constructed in dry-run mode (NewExecutor(true)), every handler receives a DryRun() call instead of Remediate(). The DryRunResult carries:

  • WouldSucceed bool — whether the handler believes the action would succeed
  • PlannedActions []string — human-readable list of changes that would be made
  • PrerequisitesMet bool — whether all preconditions are satisfied
  • EstimatedImpact string — operator-readable impact statement
  • Warnings []string — conditions that block or caution against auto-execution

Example: the SSH handler heuristically detects bastion hosts and suppresses auto-execution:

// Check if this is a bastion security group (heuristic)
if strings.Contains(strings.ToLower(finding.Finding.ResourceID), "bastion") {
    dryRun.Warnings = append(dryRun.Warnings,
        "WARNING: This appears to be a bastion host security group. Public SSH may be intentional.")
    dryRun.WouldSucceed = false
}

3.4.9 Rollback Engine

The RollbackState type captures pre-remediation resource state with a 48-hour rollback window enforced at the workflow layer:

// RollbackState captures pre-remediation state needed to reverse an action.
type RollbackState struct {
    FindingID  string                 `json:"finding_id"`
    ResourceID string                 `json:"resource_id"`
    Region     string                 `json:"region"`
    AccountID  string                 `json:"account_id"`
    PreState   map[string]interface{} `json:"pre_state"` // Handler-specific state
    CapturedAt time.Time              `json:"captured_at"`
}

PreState is handler-specific — for the SSH handler it stores the original ingress rules; for future key-rotation handlers it stores the prior key ID. The rollback window expiry check (CapturedAt.Add(48 * time.Hour).Before(now)) is enforced before allowing rollback execution.

RemediationRecord provides the full audit trail linking findings, handlers, results, and Asana task URLs:

type RemediationRecord struct {
    ID           string             `json:"id"`
    FindingID    string             `json:"finding_id"`
    Domain       string             `json:"domain"`  // compute, identity, network, etc.
    Handler      string             `json:"handler"` // Specific remediator name
    Tier         int                `json:"tier"`
    Status       RemediationStatus  `json:"status"`
    Result       *RemediationResult `json:"result,omitempty"`
    Validation   *ValidationResult  `json:"validation,omitempty"`
    AsanaTaskURL string             `json:"asana_task_url,omitempty"`
    CreatedAt    time.Time          `json:"created_at"`
    UpdatedAt    time.Time          `json:"updated_at"`
}

3.4.10 Remediation State Machine

+----------+
| PENDING  |
+----+-----+
     |
[Execute called by dispatcher]
     |
     v
+-------------+
| IN_PROGRESS |
+------+------+
       |
   +---+------------------+
   |                      |
[handler error]    [handler success]
   |                      |
   v                      v
+--------+           +----------+
| FAILED |           | validate |
+--------+           +----+-----+
                          |
               +----------+----------+
               |                     |
        [not compliant]         [compliant]
               |                     |
               v                     v
          +--------+           +-----------+
          | FAILED |           | COMPLETED |
          +--------+           +-----------+
                                     |
                           [within 48h window]
                                     |
                                     v
                              +-----------+
                              | (rollback |
                              |  eligible)|
                              +-----------+

Valid status values: pending, in_progress, completed, failed, skipped.
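The state machine can be encoded as a transition table. This is a sketch; the diagram does not show which states may move to skipped, so the pending-to-skipped edge below (e.g., tier gate refusal) is an assumption:

```go
package main

import "fmt"

// validTransitions encodes the remediation state machine above.
// completed, failed, and skipped are terminal.
var validTransitions = map[string][]string{
	"pending":     {"in_progress", "skipped"}, // skipped edge is an assumption
	"in_progress": {"completed", "failed"},
	"completed":   {},
	"failed":      {},
	"skipped":     {},
}

// canTransition reports whether moving from one status to another is allowed.
func canTransition(from, to string) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("pending", "in_progress")) // true
	fmt.Println(canTransition("completed", "pending"))   // false: terminal state
}
```

A table like this lets the workflow layer reject illegal status updates (e.g., reopening a completed record) with a single lookup.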


3.5 AI Governance Module

3.5.1 Package Structure

internal/ai-governance/
├── opa/
│ └── engine.go # In-process OPA engine, policy loading, evaluation
└── models.go # Agent registry, observability traces, STRIDE+ATLAS
# threat models, maturity assessment

Migrated selectively from AgentGuard. Compliance framework models (Framework, Control, Crosswalk, GapAnalysis) are not included here as they already exist in internal/compliance/.

3.5.2 Dual-OPA Architecture

Cloud Aegis runs two distinct OPA evaluation paths that are architecturally complementary:

                  Cloud Aegis Platform
                           |
            +--------------+--------------+
            |                             |
            v                             v
[Cloud Provisioning Path]         [AI Governance Path]
internal/policy/evaluator.go      internal/ai-governance/opa/engine.go
            |                             |
            v                             v
HTTP REST to external             In-process (embedded)
OPA instance                      OPA via Go library
            |                             |
Package namespace:                Package namespace:
  terraform.*                       aegis.ai.*
            |                             |
Governs:                          Governs:
- IaC plan evaluation             - Agent tool access
- Resource compliance             - Data flow controls
- Infrastructure drift            - Rate limiting
- Environment isolation           - Prompt injection detection
The two paths are independent and non-conflicting: internal/policy/evaluator.go sends plan JSON to an external OPA HTTP endpoint; internal/ai-governance/opa/engine.go embeds the OPA Go library directly and evaluates in-process with sub-millisecond latency requirements for synchronous agent request gating.

3.5.3 Embedded OPA Engine

The Engine type wraps the OPA Go library with a sync.RWMutex-protected query cache, an in-memory data store, and lazy query preparation:

// Engine is the in-process policy evaluation engine powered by OPA.
type Engine struct {
    mu      sync.RWMutex
    queries map[string]*rego.PreparedEvalQuery
    store   storage.Store
}

// Decision represents the result of a policy evaluation.
type Decision struct {
    Allow      bool           `json:"allow"`
    Reasons    []string       `json:"reasons,omitempty"`
    Violations []Violation    `json:"violations,omitempty"`
    Metadata   map[string]any `json:"metadata,omitempty"`
    EvalTimeUs int64          `json:"eval_time_us"`
}

Policies are loaded either as individual .rego files or as pre-bundled tar.gz archives. The Rego namespace is data.aegis.ai:

func (e *Engine) LoadPolicies(ctx context.Context, paths []string) error {
    e.mu.Lock()
    defer e.mu.Unlock()
    r := rego.New(
        rego.Query("data.aegis.ai"),
        rego.Store(e.store),
        rego.Load(paths, nil),
    )
    pq, err := r.PrepareForEval(ctx)
    if err != nil {
        return err
    }
    e.queries["default"] = &pq
    return nil
}

3.5.4 Evaluation Input Schema

All policy evaluation uses the EvaluationInput struct, which carries typed contexts for the agent, the tool being invoked, the data being accessed, and the originating request:

// EvaluationInput is the input to policy evaluation.
type EvaluationInput struct {
    Agent       AgentContext      `json:"agent"`
    Tool        *ToolContext      `json:"tool,omitempty"`
    Data        *DataContext      `json:"data,omitempty"`
    Request     *RequestContext   `json:"request,omitempty"`
    Environment map[string]string `json:"environment,omitempty"`
}

// AgentContext provides agent information for policy evaluation.
type AgentContext struct {
    ID           string   `json:"id"`
    Name         string   `json:"name"`
    Team         string   `json:"team"`
    Environment  string   `json:"environment"`
    Capabilities []string `json:"capabilities"`
}

// ToolContext provides tool invocation information.
type ToolContext struct {
    Name       string         `json:"name"`
    Category   string         `json:"category"`
    Parameters map[string]any `json:"parameters"`
    External   bool           `json:"external"`
}

// DataContext provides data flow information.
type DataContext struct {
    Classification string   `json:"classification"`
    Source         string   `json:"source"`
    Destination    string   `json:"destination"`
    PIIFields      []string `json:"pii_fields,omitempty"`
}

Two convenience methods are exposed for the two primary policy domains:

// EvaluateToolAccess evaluates tool access policy for an AI agent.
func (e *Engine) EvaluateToolAccess(ctx context.Context, agent *AgentContext, tool *ToolContext) (*Decision, error) {
    input := &EvaluationInput{Agent: *agent, Tool: tool}
    return e.Evaluate(ctx, "aegis.ai.tool_access.allow", input)
}

// EvaluateDataFlow evaluates data flow policy for an AI agent.
func (e *Engine) EvaluateDataFlow(ctx context.Context, agent *AgentContext, data *DataContext) (*Decision, error) {
    input := &EvaluationInput{Agent: *agent, Data: data}
    return e.Evaluate(ctx, "aegis.ai.data_flow.allow", input)
}

3.5.5 Built-in Rego Policies

Two base policies are embedded as Go constants. Both are loaded at engine initialization and can be overridden by environment-specific bundles.

Tool Access Policy (package aegis.ai.tool_access):

  • Default deny; allow requires tool in allowed list, parameters passing forbidden-pattern regex check, and no rate-limit breach
  • Generates typed denial_reasons for audit logs
package aegis.ai.tool_access

import future.keywords.in

default allow = false

allow {
    tool_allowed
    parameters_valid
    not rate_limit_exceeded
}

tool_allowed {
    input.tool.name in data.policies.allowed_tools[input.agent.id]
}

contains_forbidden_pattern {
    pattern := data.policies.forbidden_patterns[_]
    regex.match(pattern, json.marshal(input.tool.parameters))
}

Data Flow Policy (package aegis.ai.data_flow):

  • Controls which data classifications may flow to which destinations
  • PII data to redact_destinations triggers field-level redaction
  • source_restricted check blocks flows from restricted sources to untrusted destinations
package aegis.ai.data_flow

import future.keywords.in

default allow_flow = false

allow_flow {
    destination_allowed
    not source_restricted
}

requires_redaction {
    input.data.classification == "PII"
    input.data.destination in data.policies.redact_destinations
}

3.5.6 Agent Registry

The Agent struct is the central registry record, linking an agent's identity to its capabilities, tool bindings, bound policy IDs, and operational status:

// Agent represents a registered AI agent in the system.
type Agent struct {
    ID           uuid.UUID     `json:"id" db:"id"`
    Name         string        `json:"name" db:"name"`
    Framework    string        `json:"framework" db:"framework"` // langchain, crewai, autogen
    Version      string        `json:"version" db:"version"`
    Owner        string        `json:"owner" db:"owner"`
    Team         string        `json:"team" db:"team"`
    Environment  string        `json:"environment" db:"environment"` // dev, staging, prod
    Capabilities []Capability  `json:"capabilities" db:"capabilities"`
    Tools        []ToolBinding `json:"tools" db:"tools"`
    Policies     []string      `json:"policies" db:"policies"` // Policy IDs bound to agent
    RiskLevel    string        `json:"risk_level" db:"risk_level"`
    Status       AgentStatus   `json:"status" db:"status"`
    LastActiveAt *time.Time    `json:"last_active_at,omitempty" db:"last_active_at"`
}

Status lifecycle: active -> suspended (policy violation) -> inactive (decommissioned) or deprecated (replaced by newer version).

3.5.7 Observability: Agent Traces

AgentTrace captures the full execution tree of an agent invocation. Each invocation produces a root trace with N child Span records, each typed as llm, retrieval, tool, chain, agent, or policy:

// AgentTrace represents a complete execution trace for an agent invocation.
type AgentTrace struct {
    TraceID         string           `json:"trace_id" db:"trace_id"`
    AgentID         uuid.UUID        `json:"agent_id" db:"agent_id"`
    SessionID       string           `json:"session_id" db:"session_id"`
    UserID          string           `json:"user_id" db:"user_id"`
    Status          TraceStatus      `json:"status" db:"status"`
    Spans           []Span           `json:"spans" db:"spans"`
    SecuritySignals []SecuritySignal `json:"security_signals" db:"security_signals"`
    Metrics         TraceMetrics     `json:"metrics" db:"metrics"`
}

TraceStatus includes the value blocked for traces terminated by a policy denial. SecuritySignal records detected anomalies within a trace (signal types: injection_attempt, data_exfiltration, tool_abuse, privilege_escalation, anomalous_behavior, policy_violation, rate_limit_exceeded).

TraceMetrics provides aggregate token accounting and cost estimation:

type TraceMetrics struct {
    TotalSpans        int     `json:"total_spans"`
    LLMCalls          int     `json:"llm_calls"`
    ToolInvocations   int     `json:"tool_invocations"`
    TotalTokens       int     `json:"total_tokens"`
    EstimatedCostUSD  float64 `json:"estimated_cost_usd"`
    PolicyEvaluations int     `json:"policy_evaluations"`
    SecuritySignals   int     `json:"security_signals"`
}

3.5.8 STRIDE + MITRE ATLAS Threat Models

The ThreatModel struct links threats to STRIDE categories and ATLAS technique identifiers, enabling structured threat modeling for AI systems:

// Threat represents an identified threat.
type Threat struct {
    ID              string         `json:"id"`
    Title           string         `json:"title"`
    Category        STRIDECategory `json:"category"`
    Likelihood      string         `json:"likelihood"` // low, medium, high, very_high
    Impact          string         `json:"impact"`     // low, medium, high, critical
    RiskLevel       string         `json:"risk_level"` // likelihood x impact
    ATLASTechniques []string       `json:"atlas_techniques"`
    MitigationIDs   []string       `json:"mitigation_ids"`
}

STRIDE categories mapped: spoofing, tampering, repudiation, information_disclosure, denial_of_service, elevation_of_privilege.

ATLAS techniques are referenced by identifier (e.g., AML.T0051 for LLM prompt injection). Mitigation.MappedControls links mitigations back to compliance framework control IDs in internal/compliance/.

3.5.9 Maturity Assessment

MaturityAssessment provides a 1-5 level maturity scoring across governance domains. Each DomainAssessment is weighted and composed of CapabilityAssessment records that track current vs. target levels with supporting evidence:

type MaturityAssessment struct {
    ID              string             `json:"id" db:"id"`
    OrganizationID  string             `json:"organization_id" db:"organization_id"`
    AssessmentDate  time.Time          `json:"assessment_date" db:"assessment_date"`
    Domains         []DomainAssessment `json:"domains"`
    OverallScore    float64            `json:"overall_score"`
    OverallLevel    int                `json:"overall_level"` // 1-5
    Recommendations []Recommendation   `json:"recommendations"`
}

Recommendation records include current/target level delta, effort estimate (small, medium, large), and impact classification.
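Since each domain is weighted, the OverallScore roll-up can be sketched as a weighted average. The reduced DomainAssessment below stands in for the full record, and the weighted-average formula is an assumption about how the aggregate is derived:

```go
package main

import "fmt"

// DomainAssessment is a reduced stand-in for the weighted domain records
// described above.
type DomainAssessment struct {
	Name   string
	Weight float64
	Level  float64 // 1-5 maturity level
}

// overallScore computes the weight-normalized average maturity level.
func overallScore(domains []DomainAssessment) float64 {
	var sum, weightSum float64
	for _, d := range domains {
		sum += d.Weight * d.Level
		weightSum += d.Weight
	}
	if weightSum == 0 {
		return 0 // no assessed domains
	}
	return sum / weightSum
}

func main() {
	domains := []DomainAssessment{
		{Name: "policy_management", Weight: 2, Level: 3},
		{Name: "observability", Weight: 1, Level: 4.5},
	}
	fmt.Println(overallScore(domains)) // (2*3 + 1*4.5) / 3 = 3.5
}
```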


3.6 IaC Deploy Layer

3.6.1 Directory Structure

deploy/
├── terraform/
│ ├── modules/
│ │ ├── compute/
│ │ │ ├── main.tf # GCP Cloud Run / AWS ECS Fargate / Azure Container Apps
│ │ │ └── variables.tf
│ │ ├── database/
│ │ │ ├── main.tf # GCP Cloud SQL / AWS RDS / Azure PostgreSQL Flexible
│ │ │ └── variables.tf
│ │ └── redis/
│ │ ├── main.tf # GCP Memorystore / AWS ElastiCache / Azure Cache for Redis
│ │ └── variables.tf
│ ├── environments/
│ │ ├── dev/ # Dev environment composition
│ │ └── (staging, prod planned)
│ └── policies/
│ ├── security-baseline.rego # Encryption, public IPs, TLS, IAM wildcards
│ ├── cost-controls.rego # (planned) Instance sizing, retention caps
│ ├── naming-conventions.rego # (planned) Resource naming enforcement
│ ├── network-security.rego # (planned) CIDR restrictions, subnet placement
│ └── ai-governance.rego # (planned) AI_MODEL env var, observability
├── scripts/
│ ├── plan-with-policy.sh # Terraform plan -> JSON -> conftest pipeline
│ └── deploy.sh # Apply script with pre-flight checks
├── Dockerfile.api # Multi-stage Go build
└── Dockerfile.worker # Worker container image

3.6.2 Multi-Cloud Module Design (count-based provider switching)

All three Terraform modules use a count = var.cloud_provider == "X" ? 1 : 0 pattern to select exactly one cloud provider's resources at plan time. This produces a single module interface usable across GCP, AWS, and Azure with no conditional logic at the environment composition layer:

Compute module (modules/compute/main.tf):

  • GCP: google_cloud_run_v2_service — VPC egress restricted to PRIVATE_RANGES_ONLY, service account injection, Secret Manager refs
  • AWS: aws_ecs_task_definition + aws_ecs_service — Fargate launch type, awsvpc networking, CloudWatch log group, Secrets Manager value references. assign_public_ip = false is hardcoded (not configurable).
  • Azure: azurerm_container_app — delegated Container Apps environment, secret references via environment variable secret bindings

Database module (modules/database/main.tf):

  • GCP: google_sql_database_instance — ipv4_enabled = false, private network only, require_ssl = true, PITR enabled in prod, deletion_protection in prod
  • AWS: aws_db_instance — storage_encrypted = true (hardcoded), manage_master_user_password = true (Secrets Manager rotation), publicly_accessible = false, Multi-AZ in prod
  • Azure: azurerm_postgresql_flexible_server — delegated subnet, geo-redundant backup in prod

Instance tier mappings are defined as local maps per provider:

locals {
    tier_map_gcp   = { SMALL = "db-f1-micro", STANDARD = "db-custom-2-7680" }
    tier_map_aws   = { SMALL = "db.t3.micro", STANDARD = "db.t3.medium" }
    tier_map_azure = { SMALL = "B_Standard_B1ms", STANDARD = "GP_Standard_D2s_v3" }
}

Redis module (modules/redis/main.tf):

  • GCP: google_redis_instance — transit_encryption_mode = "SERVER_AUTHENTICATION", auth_enabled = true, HA via STANDARD_HA tier
  • AWS: aws_elasticache_replication_group — at_rest_encryption_enabled = true, transit_encryption_enabled = true, automatic failover in HA mode
  • Azure: azurerm_redis_cache — enable_non_ssl_port = false, minimum_tls_version = "1.2", subnet binding

3.6.3 Security Policy Gate: conftest Pipeline

Terraform plans are validated against Rego policies using conftest before any apply is permitted. The plan-with-policy.sh script implements a four-step pipeline:

Step 1: terraform init -backend=false
Step 2: terraform plan -var="cloud_provider=${PROVIDER}" -out=plan.tfplan
Step 3: terraform show -json plan.tfplan > plan.json
Step 4: conftest test plan.json --policy policies/ --namespace terraform

Exit code semantics:

  • 0 — all checks passed; safe to apply
  • 1 — policy violations detected; apply blocked
  • 2 — warnings only; human review required before applying
The corresponding excerpt from plan-with-policy.sh captures the conftest exit code and branches on it:

CONFTEST_EXIT=0
conftest test "${PLAN_JSON}" \
  --policy "${POLICY_DIR}" \
  --namespace "terraform" \
  --output table \
  2>&1 || CONFTEST_EXIT=$?

if [[ ${CONFTEST_EXIT} -eq 0 ]]; then
  echo "[+] All policy checks PASSED."
elif [[ ${CONFTEST_EXIT} -eq 2 ]]; then
  echo "[!] Policy checks passed with WARNINGS."
  exit 2
else
  echo "[-] Policy VIOLATIONS detected. Resolve before applying."
  exit 1
fi

3.6.4 Security Baseline Rego Policies

The security-baseline.rego policy file (package terraform.security_baseline) enforces eight mandatory security controls against the terraform plan JSON:

| Policy ID | Control | Resources Checked |
|---|---|---|
| SECURITY-001 | Encryption at rest | aws_db_instance, aws_rds_cluster, google_sql_database_instance, azurerm_postgresql_flexible_server |
| SECURITY-002 | S3 server-side encryption | aws_s3_bucket |
| SECURITY-003 | No public Cloud Run ingress | google_cloud_run_v2_service |
| SECURITY-004 | No public EC2 IP | aws_instance |
| SECURITY-005 | TLS 1.2+ on ALB listeners | aws_lb_listener |
| SECURITY-006 | TLS 1.2+ on GCP SSL policies | google_compute_ssl_policy |
| SECURITY-007 | No default VPC usage | aws_instance, aws_ecs_service, aws_db_instance |
| SECURITY-008 | No wildcard IAM actions | aws_iam_policy, google_project_iam_binding |

Example denial rule:

package terraform.security_baseline

import rego.v1

deny contains msg if {
  resource := input.resource_changes[_]
  resource.type in ["aws_db_instance", "google_sql_database_instance",
                    "azurerm_postgresql_flexible_server"]
  config := resource.change.after
  not config.storage_encrypted
  msg := sprintf(
    "SECURITY-001: %s '%s' must have storage_encrypted = true",
    [resource.type, resource.name]
  )
}
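Under the exit-code semantics above, non-blocking findings come from warn rules rather than deny rules. A hypothetical warn rule in the same package might look like:

```rego
package terraform.security_baseline

import rego.v1

# Hypothetical warn rule: flags (but does not block) databases
# without deletion protection.
warn contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  not resource.change.after.deletion_protection
  msg := sprintf(
    "WARN: %s '%s' has deletion_protection disabled",
    [resource.type, resource.name]
  )
}
```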

3.6.5 Three-Layer OPA Governance Architecture

When the IaC deploy layer is combined with the existing policy modules, the full platform implements a three-layer OPA governance model:

Layer 1: IaC Plan Gate (pre-deploy)
conftest + security-baseline.rego
Triggered: before every terraform apply
Scope: infrastructure resource properties
Blocking: hard block on policy violations

Layer 2: Cloud Provisioning Runtime (post-deploy)
internal/policy/evaluator.go -> HTTP OPA server
Triggered: per API request for resource provisioning
Scope: runtime resource state + compliance context
Blocking: request denied if policy evaluation fails

Layer 3: AI Agent Governance (runtime)
internal/ai-governance/opa/engine.go (in-process)
Triggered: per tool invocation and data access by AI agents
Scope: agent capabilities, tool parameters, data classifications
Blocking: synchronous deny before tool execution

3.6.6 Container Images

Two Dockerfiles are provided, both using multi-stage Go builds:

  • Dockerfile.api — API server image, gcr.io/distroless/base-debian12 final stage, runs as non-root UID 65532
  • Dockerfile.worker — Worker/consumer image, same base, separate binary entrypoint for the remediation worker process

Both images bake the VERSION build arg as a Go linker variable (-ldflags "-X main.version=${VERSION}") for version reporting in health endpoints.


3.7 Risk Intelligence Module (Planned)

This section describes the planned Risk Intelligence module, which will introduce graph-based attack path analysis and threat intelligence enrichment. Current state: design and schema complete; implementation planned as the next major feature release.

3.7.1 Design Rationale

Traditional CSPM tools evaluate findings in isolation. A misconfigured S3 bucket is "medium." An overly permissive IAM role is "medium." A known CVE on an EC2 instance is "medium." But chain them together — internet-exposed EC2 with CVE -> lateral movement via overprivileged role -> exfiltrate from misconfigured S3 containing PII — and the aggregate is a critical attack path.

The Risk Intelligence module implements toxic combination detection: the insight that multiple low/medium-severity issues chained together create critical risk. Findings are evaluated across 7 dimensions simultaneously: network exposures, vulnerabilities, misconfigurations, identities, data stores, secrets, and malware/behavioral signals.

3.7.2 AttackPathContext Schema

The AttackPathContext struct (defined in cspm-aggregator/internal/normalizer/schema.go) carries attack path metadata enriched from cloud-native engines (Azure attack paths, GCP attack exposure score, AWS GuardDuty attack sequences) and open-source tooling (Cartography, PMapper):

// AttackPathContext contains attack path analysis context.
type AttackPathContext struct {
    Score              float64  `json:"score,omitempty"`              // 0-100 composite attack path score
    PathNodeCount      int      `json:"path_node_count,omitempty"`    // Nodes in longest attack path
    EntryPointType     string   `json:"entry_point_type,omitempty"`   // internet, lateral, insider
    TargetType         string   `json:"target_type,omitempty"`        // data, compute, identity, network
    BlastRadiusCount   int      `json:"blast_radius_count,omitempty"` // Resources reachable from finding
    IsToxicCombination bool     `json:"is_toxic_combination,omitempty"`
    IsChokepoint       bool     `json:"is_chokepoint,omitempty"`
    IAMEscalationPath  []string `json:"iam_escalation_path,omitempty"` // Privilege escalation chain
}

IsToxicCombination is set when multiple otherwise-low-severity findings chain into a critical path. IsChokepoint identifies nodes through which many attack paths pass — remediation of a chokepoint breaks the largest number of paths simultaneously.

3.7.3 ToxicComboDetails Schema

The FindingClass enum in the normalizer schema includes classes specific to graph-based analysis:

const (
    ClassThreat            FindingClass = "THREAT"
    ClassVulnerability     FindingClass = "VULNERABILITY"
    ClassMisconfiguration  FindingClass = "MISCONFIGURATION"
    ClassObservation       FindingClass = "OBSERVATION"
    ClassPostureViolation  FindingClass = "POSTURE_VIOLATION"
    ClassToxicCombination  FindingClass = "TOXIC_COMBINATION" // graph-derived
    ClassChokepoint        FindingClass = "CHOKEPOINT"        // graph-derived
    ClassSensitiveDataRisk FindingClass = "SENSITIVE_DATA_RISK"
)

These classes are set by the cloud-native engines: GCP SCC reports TOXIC_COMBINATION and CHOKEPOINT natively via the findingClass field; AWS GuardDuty Detection objects are mapped to THREAT; Azure Defender assessments default to MISCONFIGURATION.
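Those mapping rules can be sketched as a normalizer helper (the function name and provider identifiers are hypothetical, not from the actual normalizer):

```go
package main

import "fmt"

type FindingClass string

const (
	ClassThreat           FindingClass = "THREAT"
	ClassMisconfiguration FindingClass = "MISCONFIGURATION"
	ClassToxicCombination FindingClass = "TOXIC_COMBINATION"
	ClassChokepoint       FindingClass = "CHOKEPOINT"
)

// mapProviderClass is a hypothetical helper showing the mapping rules:
// GCP SCC passes its native findingClass through for graph-derived
// classes, GuardDuty detections map to THREAT, and Azure Defender
// assessments default to MISCONFIGURATION.
func mapProviderClass(provider, nativeClass string) FindingClass {
	switch provider {
	case "gcp-scc":
		switch nativeClass {
		case "TOXIC_COMBINATION":
			return ClassToxicCombination
		case "CHOKEPOINT":
			return ClassChokepoint
		}
	case "aws-guardduty":
		return ClassThreat
	}
	return ClassMisconfiguration // Azure Defender and fallback
}

func main() {
	fmt.Println(mapProviderClass("gcp-scc", "CHOKEPOINT"))
	fmt.Println(mapProviderClass("aws-guardduty", ""))
}
```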

3.7.4 Threat Intelligence Context

ThreatIntelContext carries enrichment from five public feeds: CISA KEV, EPSS v4, NVD CVSS, GreyNoise, and AlienVault OTX:

// ThreatIntelContext contains threat intelligence enrichment.
type ThreatIntelContext struct {
    CVEIDs         []string  `json:"cve_ids,omitempty"`
    InKEV          bool      `json:"in_kev,omitempty"` // CISA Known Exploited Vulnerabilities
    KEVDateAdded   string    `json:"kev_date_added,omitempty"`
    EPSSScore      float64   `json:"epss_score,omitempty"` // 0.0-1.0 exploitation probability
    EPSSPercentile float64   `json:"epss_percentile,omitempty"`
    CVSSBaseScore  float64   `json:"cvss_base_score,omitempty"`
    CVSSVector     string    `json:"cvss_vector,omitempty"`
    GreyNoiseClass string    `json:"greynoise_class,omitempty"` // benign, malicious, unknown
    OTXPulseCount  int       `json:"otx_pulse_count,omitempty"`
    EnrichedAt     time.Time `json:"enriched_at,omitempty"`
}

InKEV = true is treated as an aggravating factor in the LLM scoring prompt regardless of CVSS base score. EPSSScore provides a probabilistic exploitation likelihood that the AI scorer uses to adjust severity (high EPSS + internet-facing asset = upgrade to next severity tier).
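One rule-based rendering of that upgrade logic is sketched below; the 0.5 EPSS threshold and function names are illustrative assumptions, not values from the implementation (the real adjustment happens inside the LLM prompt plus guardrails):

```go
package main

import "fmt"

var tiers = []string{"LOW", "MEDIUM", "HIGH", "CRITICAL"}

// upgradeTier returns the next severity tier, capping at CRITICAL.
func upgradeTier(sev string) string {
	for i, t := range tiers {
		if t == sev && i < len(tiers)-1 {
			return tiers[i+1]
		}
	}
	return sev
}

// adjustSeverity applies the aggravating factors described above:
// KEV membership, or high EPSS on an internet-facing asset, bumps
// the finding one severity tier.
func adjustSeverity(sev string, inKEV bool, epss float64, internetFacing bool) string {
	if inKEV || (epss >= 0.5 && internetFacing) {
		return upgradeTier(sev)
	}
	return sev
}

func main() {
	fmt.Println(adjustSeverity("MEDIUM", false, 0.82, true)) // HIGH
	fmt.Println(adjustSeverity("CRITICAL", true, 0, false))  // CRITICAL (capped)
}
```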

3.7.5 LLM Scoring with Attack Path Context

The RiskScorer.BuildPrompt() method in cspm-aggregator/internal/scoring/risk_scorer.go conditionally appends an attack path and threat intelligence section to the Claude prompt when the relevant fields are populated:

// Add threat intelligence context if present
if ctx.InKEV || ctx.EPSSScore > 0 || ctx.AttackPathScore > 0 {
    prompt += fmt.Sprintf(`
## Threat Intelligence
- In CISA KEV: %t
- EPSS Score: %.4f (percentile: %.2f)
- Attack Path Score: %.1f
- Toxic Combination: %t
- Blast Radius (reachable resources): %d
`,
        ctx.InKEV,
        ctx.EPSSScore,
        ctx.EPSSPercentile,
        ctx.AttackPathScore,
        ctx.IsToxicCombination,
        ctx.BlastRadiusCount,
    )
}

The prompt instructs the model to treat InKEV, attack path score, and IsToxicCombination as aggravating factors. Business guardrails applied post-response:

  • CRITICAL severity is never downgraded for Tier1-Prod internet-facing assets
  • PCI/PII findings have a minimum severity floor of MEDIUM
  • Confidence is capped at 0.7 when package usage is unknown

3.7.6 Planned Graph Database Layer

The planned graph layer will use Neo4j (development) and Amazon Neptune (production AWS) as a relationship store alongside existing finding storage. Graph schema:

Nodes:
(:Asset {id, type, account, cloud, region})
(:Finding {id, severity, source, cve})
(:Identity {arn, type, permissions[]})
(:DataStore {id, classification, encrypted})

Edges:
(Asset)-[:HAS_FINDING]->(Finding)
(Asset)-[:CAN_REACH]->(Asset) // network reachability
(Identity)-[:CAN_ASSUME]->(Identity) // role chaining
(Identity)-[:CAN_ACCESS]->(DataStore)
(Asset)-[:RUNS_AS]->(Identity)

Attack path templates will be defined as Cypher query patterns:

// Internet-exposed -> vulnerable -> overprivileged -> sensitive data
MATCH path = (entry:Asset)-[:CAN_REACH*1..4]->(target:DataStore)
WHERE entry.internetExposed = true
  AND ANY(fd IN [(entry)-[:HAS_FINDING]->(f) | f]
          WHERE fd.severity IN ['HIGH', 'CRITICAL'])
  AND ANY(hop IN nodes(path)
          WHERE hop:Identity AND hop.overprivileged = true)
  AND target.classification IN ['PII', 'PHI', 'FINANCIAL']
RETURN path, length(path) AS hops
ORDER BY hops ASC

Paths are scored by: hop count (shorter = more exploitable), severity of findings along each hop, target asset criticality, and entry point exposure type.
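Those factors could combine as follows; the weights, 0-10 input scales, and function name are illustrative assumptions, since the scoring formula is not yet specified:

```go
package main

import "fmt"

// scorePath is a hypothetical composite of the factors listed above:
// shorter paths score higher, and hop severity, target criticality,
// and an internet entry point all aggravate. Result is clamped to 0-100.
func scorePath(hops int, maxHopSeverity, targetCriticality float64, internetEntry bool) float64 {
	if hops < 1 {
		hops = 1
	}
	score := (100.0 / float64(hops)) * 0.4 // shorter = more exploitable
	score += maxHopSeverity * 10 * 0.3     // worst finding severity, 0-10
	score += targetCriticality * 10 * 0.2  // target asset criticality, 0-10
	if internetEntry {
		score += 10
	}
	if score > 100 {
		score = 100
	}
	return score
}

func main() {
	// Two-hop path, CVSS-9 finding, crown-jewel target, internet entry.
	fmt.Printf("%.1f\n", scorePath(2, 9.0, 10.0, true))
}
```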

3.7.7 Cross-Account Path Analysis

With environments spanning multiple cloud accounts and projects, cross-account trust relationships represent the primary blind spot of cloud-native CSPM tools. The graph layer will ingest:

  • AWS: IAM trust policies, resource policies, cross-account roles, VPC peering, Transit Gateway attachments, PrivateLink connections
  • GCP: Organization policies, shared VPC configurations, service account impersonation chains across projects
  • Azure: Management group hierarchy, cross-subscription RBAC assignments, service principal relationships

Graph edges span account boundaries, enabling detection of lateral movement paths invisible to any single-account tool.

3.7.8 Contextual Severity Validation Design

The LLM severity adjustment output is validated against the following guardrail rules (implemented in applyGuardrails in risk_scorer.go):

| Rule | Condition | Effect |
|---|---|---|
| GR-001 | CRITICAL + Tier1-Prod + internet-facing | Never downgrade; hard floor |
| GR-002 | PCI or PII data classification | Minimum severity = MEDIUM |
| GR-003 | Package usage unknown | Cap confidence at 0.70 |
| GR-004 | Severity/score alignment | Clamp risk score to severity band |

Severity-to-score bands:

| Severity | Min Score | Max Score |
|---|---|---|
| CRITICAL | 85 | 100 |
| HIGH | 65 | 84 |
| MEDIUM | 40 | 64 |
| LOW | 15 | 39 |
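The GR-004 clamp follows directly from the band table; a sketch (the real logic lives in applyGuardrails in risk_scorer.go, and this re-derivation is for illustration):

```go
package main

import "fmt"

type band struct{ min, max float64 }

// bands mirrors the severity-to-score table above.
var bands = map[string]band{
	"CRITICAL": {85, 100},
	"HIGH":     {65, 84},
	"MEDIUM":   {40, 64},
	"LOW":      {15, 39},
}

// clampToBand implements GR-004: the risk score is forced into the
// band implied by the final severity.
func clampToBand(severity string, score float64) float64 {
	b, ok := bands[severity]
	if !ok {
		return score // unknown severity: leave score untouched
	}
	if score < b.min {
		return b.min
	}
	if score > b.max {
		return b.max
	}
	return score
}

func main() {
	fmt.Println(clampToBand("HIGH", 92)) // pulled down to 84, the HIGH ceiling
	fmt.Println(clampToBand("MEDIUM", 55))
}
```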

Auto-accept shortcuts (rule-based, skip LLM call):

  • LOW severity in sandbox environment -> auto-accept
  • FP rate for type > 30% and >= 3 historical FPs and non-CRITICAL -> downgrade to LOW
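The shortcut logic above can be sketched as follows; the struct and field names are hypothetical, chosen only to carry the two rules:

```go
package main

import "fmt"

// Finding carries only the fields the shortcuts need (names hypothetical).
type Finding struct {
	Severity    string
	Environment string
}

// History summarizes prior triage outcomes for a finding type.
type History struct {
	FPRate        float64 // historical false-positive rate for this type
	HistoricalFPs int
}

// shortcut applies the two rule-based paths that skip the LLM call:
// auto-accept sandbox LOWs, and downgrade noisy non-CRITICAL types.
func shortcut(f Finding, h History) (action string, ok bool) {
	if f.Severity == "LOW" && f.Environment == "sandbox" {
		return "auto-accept", true
	}
	if h.FPRate > 0.30 && h.HistoricalFPs >= 3 && f.Severity != "CRITICAL" {
		return "downgrade-to-LOW", true
	}
	return "", false // fall through to LLM scoring
}

func main() {
	fmt.Println(shortcut(Finding{"LOW", "sandbox"}, History{}))
	fmt.Println(shortcut(Finding{"HIGH", "prod"}, History{0.45, 5}))
}
```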

4. Data Architecture

4.1 Database Schema

4.1.1 Core Tables

-- Findings table with partitioning
CREATE TABLE findings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source VARCHAR(100) NOT NULL,
    source_finding_id VARCHAR(255),
    type VARCHAR(50) NOT NULL,
    category VARCHAR(50),
    title TEXT NOT NULL,
    description TEXT,

    -- Resource
    resource_type VARCHAR(50),
    resource_id VARCHAR(500),
    resource_name VARCHAR(255),

    -- Platform
    platform VARCHAR(50),
    cloud_provider VARCHAR(50),
    region VARCHAR(100),
    account_id VARCHAR(100),
    environment_type VARCHAR(50),

    -- Severity
    static_severity VARCHAR(20),
    ai_risk_score DECIMAL(4,2),
    ai_risk_level VARCHAR(20),
    cvss DECIMAL(3,1),

    -- Workflow
    workflow_status VARCHAR(50) DEFAULT 'new',
    assignee_id VARCHAR(255),
    assignee_email VARCHAR(255),

    -- Ownership
    service_name VARCHAR(255),
    line_of_business VARCHAR(255),
    technical_contact_email VARCHAR(255),

    -- Timestamps
    first_found_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    due_date TIMESTAMPTZ,

    -- Deduplication
    deduplication_key VARCHAR(64) NOT NULL,
    canonical_rule_id VARCHAR(255),

    -- JSONB for flexible data
    cves JSONB,
    compliance_mappings JSONB,
    raw_data JSONB,
    tags JSONB,

    -- PostgreSQL requires the partition key in unique constraints on
    -- partitioned tables, so first_found_at is included here.
    CONSTRAINT unique_dedup_key UNIQUE (deduplication_key, first_found_at)
) PARTITION BY RANGE (first_found_at);

-- Monthly partitions
CREATE TABLE findings_2026_01 PARTITION OF findings
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Indexes
CREATE INDEX idx_findings_status ON findings (workflow_status);
CREATE INDEX idx_findings_severity ON findings (static_severity);
CREATE INDEX idx_findings_resource ON findings (resource_id);
CREATE INDEX idx_findings_assignee ON findings (assignee_email);
CREATE INDEX idx_findings_gin_cves ON findings USING GIN (cves);
CREATE INDEX idx_findings_gin_compliance ON findings USING GIN (compliance_mappings);
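The 64-character deduplication_key column matches a hex-encoded SHA-256 digest. A plausible derivation is sketched below; the choice and ordering of input fields is an assumption, not the documented scheme:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupKey derives a stable 64-char hex key from identifying fields.
// The input tuple (source, rule, resource) is illustrative only.
func dedupKey(source, ruleID, resourceID string) string {
	h := sha256.Sum256([]byte(source + "|" + ruleID + "|" + resourceID))
	return hex.EncodeToString(h[:])
}

func main() {
	k := dedupKey("aws-security-hub", "S3.8", "arn:aws:s3:::my-bucket")
	fmt.Println(len(k)) // 64 hex chars, fitting VARCHAR(64)
}
```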

4.1.2 Compliance Framework Tables

CREATE TABLE compliance_frameworks (
    id VARCHAR(100) PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    version VARCHAR(50),
    description TEXT,
    sector VARCHAR(50),
    url TEXT,
    controls JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE sector_profiles (
    sector VARCHAR(50) PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    required_frameworks JSONB,
    optional_frameworks JSONB
);

4.2 Cache Strategy

| Cache Key Pattern | TTL | Purpose |
|---|---|---|
| framework:{id} | 24h | Compliance framework data |
| finding:{id} | 1h | Individual finding cache |
| dedup:{key} | 7d | Deduplication key lookup |
| user:{id}:session | 8h | User session data |
| rate:{provider}:{key} | 1min | Rate limiting counters |

5. API Specifications

5.1 Finding Endpoints

Create Finding

POST /api/v1/findings
Content-Type: application/json

{
  "source": "aws-security-hub",
  "source_finding_id": "arn:aws:securityhub:...",
  "type": "misconfiguration",
  "title": "S3 bucket allows public access",
  "resource_id": "arn:aws:s3:::my-bucket",
  "static_severity": "high",
  "environment_type": "production"
}

Response

{
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "deduplication_key": "abc123...",
  "workflow_status": "new",
  "compliance_mappings": [
    {
      "framework_id": "cis-benchmarks",
      "control_id": "3.1",
      "control_title": "Data Protection"
    }
  ],
  "ai_risk_score": 8.5,
  "ai_risk_level": "critical"
}

5.2 Error Responses

| Code | Error | Description |
|---|---|---|
| 400 | INVALID_REQUEST | Request validation failed |
| 401 | UNAUTHORIZED | Authentication required |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 409 | DUPLICATE | Finding already exists |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |

6. Security Design

6.1 Authentication Flow

User → Cloud Aegis UI → OIDC Provider (Entra/Okta)
  ↓ ID Token
Cloud Aegis API Gateway
  ↓ Token Validation + RBAC
Authorized Request

6.2 Authorization Matrix

| Role | Findings Read | Findings Write | Config | Admin |
|---|---|---|---|---|
| Viewer | Own LoB | - | - | - |
| Analyst | All | Assign/Comment | - | - |
| Engineer | All | Remediate | - | - |
| Admin | All | All | Yes | - |
| Super Admin | All | All | Yes | Yes |

6.3 Encryption

| Data State | Method | Key Management |
|---|---|---|
| At Rest (DB) | AES-256 | AWS KMS |
| At Rest (S3) | AES-256 | AWS KMS |
| In Transit | TLS 1.3 | AWS ACM |
| API Keys | Envelope | AWS Secrets Manager |

7. Performance Requirements

7.1 SLAs

| Metric | Target | Measurement |
|---|---|---|
| API Latency (p50) | < 100ms | Prometheus histogram |
| API Latency (p99) | < 500ms | Prometheus histogram |
| Finding Ingestion | 1000/sec | Kafka consumer lag |
| Compliance Mapping | < 200ms | Per finding |
| AI Analysis | < 3s | Per finding |
| Availability | 99.9% | Uptime monitoring |
| Remediation Dispatch (T1, single finding) | < 2s end-to-end | RemediationResult.Duration field |
| Remediation Dispatch (T1, batch of 50) | < 30s | ExecuteBatch total wall time |
| Remediation Validation round-trip | < 5s | Post-remediation Validate() call |
| AI Governance Policy Evaluation (in-process OPA) | < 5ms (p99) | Decision.EvalTimeUs field |
| IaC Policy Gate (conftest full plan) | < 60s | plan-with-policy.sh exit time |

7.2 Scaling Triggers

| Component | Metric | Scale Up | Scale Down |
|---|---|---|---|
| API Pods | CPU | > 70% | < 30% |
| Workers | Queue Depth | > 1000 | < 100 |
| Database | Connections | > 80% | Manual |

8. Observability

8.1 Metrics

var (
    findingsProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "aegis_findings_processed_total",
            Help: "Total findings processed",
        },
        []string{"source", "type", "severity"},
    )

    aiAnalysisLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "aegis_ai_analysis_duration_seconds",
            Help:    "AI analysis latency",
            Buckets: prometheus.ExponentialBuckets(0.1, 2, 10),
        },
        []string{"provider"},
    )
)

8.2 Logging

logger.Info("Finding processed",
    zap.String("finding_id", finding.ID),
    zap.String("source", finding.Source),
    zap.String("type", string(finding.Type)),
    zap.Float64("ai_risk_score", finding.AIRiskScore),
    zap.Duration("processing_time", elapsed),
)

8.3 Tracing

ctx, span := tracer.Start(ctx, "ProcessFinding",
    trace.WithAttributes(
        attribute.String("finding.id", finding.ID),
        attribute.String("finding.source", finding.Source),
    ),
)
defer span.End()

Appendix A: Configuration Reference

A.1 Environment Variables

| Variable | Description | Default |
|---|---|---|
| CF_DATABASE_URL | PostgreSQL connection string | - |
| CF_REDIS_URL | Redis connection string | - |
| CF_AI_PROVIDER | AI provider (anthropic/openai) | anthropic |
| CF_AI_MODEL | AI model name | claude-opus-4-6 |
| CF_LOG_LEVEL | Log level | info |
| CF_METRICS_PORT | Prometheus metrics port | 9090 |

A.2 Configuration File

server:
  port: 8080
  read_timeout: 30s
  write_timeout: 30s

database:
  host: localhost
  port: 5432
  name: aegis
  max_connections: 100

redis:
  host: localhost
  port: 6379
  db: 0

ai:
  provider: anthropic
  model: claude-opus-4-6
  max_tokens: 4096
  contextual_risk_weight: 0.4

compliance:
  enabled_sectors:
    - general
    - healthcare
    - finance
    - government
    - automotive

Appendix B: Glossary

| Term | Definition |
|---|---|
| CSPM | Cloud Security Posture Management |
| DDD | Detailed Design Document |
| HLD | High-Level Design |
| OPA | Open Policy Agent |
| OCSF | Open Cybersecurity Schema Framework |
| SCA | Software Composition Analysis |
| SAST | Static Application Security Testing |
| WIF | Workload Identity Federation |

Appendix C: References

  1. NIST Cybersecurity Framework 2.0
  2. CIS Benchmarks
  3. ISO/SAE 21434:2021
  4. OWASP ASVS
  5. OpenTelemetry Specification