
High-Level Design: Cloud Aegis Enterprise Cloud Governance Platform

Property | Value
Version | 4.0
Author | Liem Vo-Nguyen
Date | March 2026
Status | Active
LinkedIn | linkedin.com/in/liemvonguyen

Document | Description
Detailed Design Document (DDD) | Implementation-level technical specifications
Component Rationale | Technology selection with cost analysis
DR/BC Plan | Disaster Recovery and Business Continuity
Pitch Deck | Executive presentation
ADRs | Architecture Decision Records (ADR-001 through ADR-019)
Runbooks | Operational procedures (9 runbooks)

1. Executive Summary

Cloud Aegis is an Enterprise Cloud Governance Platform that provides:

  • Self-service cloud resource provisioning with built-in governance guardrails
  • Cloud Security Posture Management (CSPM) with multi-cloud aggregation
  • Multi-framework compliance mapping (CIS, NIST, ISO, PCI-DSS, HIPAA, etc.)
  • AI-powered risk analysis and toxic combination detection
  • Attack path computation and visualization
  • Automated remediation with rollback capabilities
  • CI/CD security scanning integration (SonarQube, Checkov, Veracode)
  • VCS integration (GitHub, GitLab, Azure DevOps)
  • Identity and Zero Trust policy enforcement (Entra ID, Okta)
  • FinOps cost management with budget alerting
  • AI governance with embedded OPA policy engine

1.1 Business Drivers

  • Enable self-service infrastructure provisioning without bypassing security controls
  • Enforce policy-as-code guardrails across multi-cloud environments (AWS, Azure, GCP)
  • Integrate with enterprise GRC tools (RSA Archer, ServiceNow) for exception management
  • Provide comprehensive compliance mapping across 20+ frameworks
  • AI-powered contextual risk scoring beyond static severity
  • CI/CD pipeline security with SAST/DAST/IaC scanning
  • Reduce mean time to remediation through automated security fixes
  • Control cloud costs with multi-cloud FinOps aggregation and budget alerting

2. Architecture Overview

2.1 Component Summary

Component | Purpose | Technology
Portal Layer | Self-service UI for requests and dashboards | React 19 / Vite 7 / Tailwind CSS v4 / shadcn/ui
REST API | HTTP API server with RBAC and rate limiting | Go 1.25 / gorilla/mux
Orchestration Engine | Workflow management for approvals and provisioning | Temporal
Policy Engine | Evaluate requests against governance rules (dual-OPA) | OPA / Rego (external server + embedded Go library)
AI Risk Analyzer | Contextual risk scoring, toxic combo detection | Claude Opus 4.6 / GPT-4 / AWS Bedrock
Compliance Engine | Multi-framework compliance mapping and assessment | Go
CSPM Aggregator | Multi-cloud finding normalization and enrichment | Go (AWS/Azure/GCP SDK clients)
Graph Query Engine | Multi-hop graph traversal (zero-ETL over PostgreSQL) | PuppyGraph Enterprise (Gremlin / openCypher)
Attack Path Engine | In-memory BFS graph computation | Go + ReactFlow (frontend)
Toxic Combo Detector | 4-pattern toxic combination detection | Go
Threat Intelligence | EPSS, CISA KEV, GreyNoise enrichment | Go (HTTP clients with caching)
Remediation Dispatcher | Automated security fix execution with rollback | Go (10 handlers, 8 domains, 3 tiers)
FinOps Aggregator | Multi-cloud cost aggregation and budget alerting | Go (AWS/Azure/GCP cost APIs)
WAF Module | Golden templates and compliance scanning | Go
Container Security | Image scanning, admission control | Go
Secrets Management | Multi-cloud secrets with rotation lifecycle | Go
CI/CD Security | Pipeline and dependency scanning | Go
Identity Module | Zero Trust policy enforcement, RBAC | Go (Okta/Entra ID)
AI Governance | Embedded OPA for AI agent tool/data-flow control | Go + OPA library
VCS Integration | GitHub/GitLab/Azure DevOps APIs | Go
SAST Integration | SonarQube, Veracode, Checkov | Go
GRC Integration | Archer, ServiceNow ticketing | Go (provider pattern)

3. Compliance Framework Engine

3.1 Supported Frameworks

Sector | Frameworks
General | CIS Benchmarks v8, NIST CSF 2.0, ISO 27001:2022, ISO 27017
Cloud | AWS Security Best Practices, GCP CIS v2, Azure MCSB
Healthcare | HIPAA Security Rule, HITRUST CSF v11
Finance | PCI-DSS 4.0, SOX ITGC, GLBA Safeguards Rule, FFIEC
Government | NIST 800-53 Rev 5, FedRAMP, DISA STIGs, CMMC
AI/ML | NIST AI RMF 1.0, ISO 42001:2023
Automotive | ISO 21434, UN ECE R155, TISAX

3.2 Finding Schema

The finding schema groups its fields into the following categories:

Field Category | Key Fields
Identification | ID, Source, Type, Title, Description
Resource | ResourceType, ResourceID, Platform, CloudProvider, Region
On-Prem | Hostname, SerialNumber, IPAddress, AssetTag
Environment | EnvironmentType (prod/non-prod), AccountID, VPC
Severity | StaticSeverity, AIRiskScore, AIRiskLevel, CVSS, EPSS
Vulnerability | CVEs (with hyperlinks), CWEs, ExploitAvailable
Compliance | ComplianceMappings (framework, control, section, URL)
Ownership | TechnicalContact, ServiceName, LineOfBusiness, Team
Workflow | Status, FalsePositive, TicketID, DueDate, SLABreachDate
Deduplication | DeduplicationKey, CanonicalRuleID, RelatedRules
Attack Path | AttackPathContext, BlastRadius, ToxicComboFlag, MITRETactic

3.3 AI-Powered Analysis

  • Contextual Risk Scoring: Environment, exploitability, blast radius
  • Toxic Combination Detection: 4 patterns (public storage, IAM+noMFA, internet+CVE, SG+DB)
  • Misconfiguration Analysis: Root cause, impact, remediation steps
  • Vulnerability Analysis: Exploit likelihood, attack surface, priority
  • Blast Radius Computation: Account/VPC/transit reachability analysis
  • False Positive/Negative Detection: 3 FP suppression + 3 FN escalation rules

3.4 Deduplication Logic

When a finding is captured by multiple rules:

  1. Generate deduplication key from resource + rule + finding details
  2. Map rule to canonical rule using equivalence mappings
  3. Keep most specific/relevant rule based on priority hierarchy
  4. Link related rules as references
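The four steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the `RuleHit` and `MergedFinding` types, the `Priority` field (lower value = more specific rule), and the contents of `canonicalRule` are all hypothetical, and the key here omits the "finding details" component for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// RuleHit is one rule firing on one resource. Priority is a hypothetical
// field: a lower value means a more specific rule.
type RuleHit struct {
	ResourceID string
	RuleID     string
	Priority   int
}

// canonicalRule holds illustrative equivalence mappings from scanner-specific
// rule IDs to a shared canonical rule ID.
var canonicalRule = map[string]string{
	"prowler:s3_bucket_public_access":            "STORAGE_PUBLIC_ACCESS",
	"awsconfig:s3-bucket-public-read-prohibited": "STORAGE_PUBLIC_ACCESS",
}

// MergedFinding is the surviving hit plus references to the rules it absorbed.
type MergedFinding struct {
	DeduplicationKey string
	Primary          RuleHit
	RelatedRules     []string
}

// Deduplicate groups hits by resource and canonical rule, keeps the
// highest-priority hit in each group, and links the rest as related rules.
func Deduplicate(hits []RuleHit) []MergedFinding {
	groups := map[string][]RuleHit{}
	var order []string // preserves first-seen order of keys
	for _, h := range hits {
		canon, ok := canonicalRule[h.RuleID]
		if !ok {
			canon = h.RuleID // unmapped rules are their own canonical rule
		}
		key := h.ResourceID + "\x00" + canon
		if _, seen := groups[key]; !seen {
			order = append(order, key)
		}
		groups[key] = append(groups[key], h)
	}
	merged := make([]MergedFinding, 0, len(order))
	for _, key := range order {
		g := groups[key]
		sort.SliceStable(g, func(i, j int) bool { return g[i].Priority < g[j].Priority })
		m := MergedFinding{DeduplicationKey: key, Primary: g[0]}
		for _, h := range g[1:] {
			m.RelatedRules = append(m.RelatedRules, h.RuleID)
		}
		merged = append(merged, m)
	}
	return merged
}

func main() {
	hits := []RuleHit{
		{ResourceID: "arn:aws:s3:::logs", RuleID: "awsconfig:s3-bucket-public-read-prohibited", Priority: 2},
		{ResourceID: "arn:aws:s3:::logs", RuleID: "prowler:s3_bucket_public_access", Priority: 1},
	}
	for _, m := range Deduplicate(hits) {
		fmt.Println(m.Primary.RuleID, m.RelatedRules)
	}
}
```

Both hits map to the same canonical rule on the same bucket, so they collapse into one finding whose primary rule is the higher-priority one.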

4. CI/CD Security Module

4.1 VCS Providers

Provider | Features
GitHub/GitHub Enterprise | Repos, PRs, Actions, Dependabot alerts, Check runs
GitLab | Projects, MRs, Pipelines, Vulnerability findings
Azure DevOps | Repos, PRs, Pipelines, Advanced Security alerts

4.2 SAST/DAST Tools

Tool | Type | Integration
SonarQube/SonarCloud | SAST | API-based project/issue retrieval
Checkov | IaC | CLI execution with JSON parsing
Veracode | SAST/DAST | HMAC-authenticated API

5. Identity and Zero Trust Module

5.1 Identity Providers

Provider | Capabilities
Microsoft Entra ID | User/Group management, Risk scoring, PIM integration
Okta | User/Group management, Role assignment

5.2 RBAC Model

Four backend roles enforce API access control:

Role | Description | Scope
Admin | Tenant administrator | Full access: all endpoints, user management, audit log
Operator | SecOps team | Read/update: findings, remediations, compliance, exceptions
Requester | End user | Read + submit: own exceptions, catalog browsing
Viewer | Read-only observer (rank 0) | GET only: /findings, /compliance/frameworks, /agents + traces

See ADR-006 for the full RBAC design. See ADR-013 for resource-scoped RBAC (ABAC) with ResourceScope in JWT claims.

5.3 Zero Trust Policies

  • Block high-risk sign-ins
  • Require MFA for sensitive operations
  • Device compliance verification
  • Contextual access decisions

6. Remediation Dispatcher

6.1 Architecture

The remediation dispatcher provides automated security fix execution with a tiered execution model:

Tier | Handler Types | Concurrency | Timeout
T1 (Auto-Safe) | Network ACLs, Storage ACLs | 10 parallel | 30s
T2 (Verify) | Compute config, IAM key rotation | 5 parallel | 120s
T3 (Change Window) | OS patching, key rotation | 2 parallel | 600s

6.2 Handlers (10 across 8 domains)

Domain | Handler | Cloud Provider
Network | BlockPublicSSH (SSH/RDP) | AWS
Storage | S3PublicAccessBlock | AWS
Compute | IMDSv2Enforcement | AWS
Identity | IAMKeyRotation | AWS
Security Services | GuardDutyEnablement | AWS
Security Services | AzureDefender (stub) | Azure
Secrets | RotationGuidance (manual) | Multi-cloud
Patching | SSMPatchCompliance (query-only) | AWS

6.3 Rollback

State snapshots are stored in S3/GCS before every remediation. Rollback window: 48 hours.

See ADR-009 for the full architecture decision.


7. Attack Path Analysis

7.1 Computation Engine

In-memory BFS graph engine that builds an adjacency graph from loaded findings at startup:

  • Nodes: Resources extracted from findings (keyed by resource_id)
  • Edges: Inferred relationships (same account + compatible resource types)
  • Traversal: BFS from entry points (internet-exposed) to targets (data stores)
  • Max depth: 4 hops
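A bounded breadth-first traversal of this kind might look like the sketch below. The `Graph` type, the `FindPaths` name, and the sample resource IDs are illustrative assumptions; edge inference from findings is left out, and the real engine may treat targets and cycles differently.

```go
package main

import "fmt"

// Graph is an adjacency list keyed by resource_id.
type Graph map[string][]string

// FindPaths walks breadth-first from each entry point and returns every
// cycle-free path that reaches a target within maxHops edges.
func FindPaths(g Graph, entries []string, targets map[string]bool, maxHops int) [][]string {
	var found [][]string
	queue := make([][]string, 0, len(entries))
	for _, e := range entries {
		queue = append(queue, []string{e})
	}
	for len(queue) > 0 {
		path := queue[0]
		queue = queue[1:]
		last := path[len(path)-1]
		if len(path) > 1 && targets[last] {
			found = append(found, path)
			continue // stop at the first data store on this path
		}
		if len(path)-1 >= maxHops {
			continue // depth limit reached
		}
		for _, next := range g[last] {
			if onPath(path, next) {
				continue // skip cycles
			}
			extended := append(append([]string{}, path...), next)
			queue = append(queue, extended)
		}
	}
	return found
}

// onPath reports whether id already appears in path.
func onPath(path []string, id string) bool {
	for _, p := range path {
		if p == id {
			return true
		}
	}
	return false
}

func main() {
	g := Graph{
		"igw-1": {"i-web", "alb-1"},
		"alb-1": {"i-web"},
		"i-web": {"rds-orders"},
	}
	targets := map[string]bool{"rds-orders": true}
	for _, p := range FindPaths(g, []string{"igw-1"}, targets, 4) {
		fmt.Println(p)
	}
}
```

Because the queue holds whole paths, shorter attack paths are reported before longer ones, and the hop limit caps the work done on dense graphs.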

7.2 Graph Query Engine (PuppyGraph)

For multi-hop traversal queries beyond the BFS engine (e.g., "find all findings reachable from identity X within 3 hops"), Cloud Aegis integrates PuppyGraph Enterprise as a zero-ETL graph query layer over the existing PostgreSQL data store. PuppyGraph supports both Gremlin and openCypher query languages and is accessed via POST /api/v1/graph/query. The existing Go BFS engine is retained as a fallback when the PuppyGraph service is unavailable (feature flag: PUPPYGRAPH_URL). See ADR-015 for the full architecture decision.

7.3 API

Endpoint | Method | Description
/api/v1/attack-paths | GET | Paginated attack paths (default 20/page, max 100)
/api/v1/attack-paths/{id} | GET | Single path with full finding details
/api/v1/attack-paths/stats | GET | Coverage stats (findings in paths vs isolated)

See ADR-008 for the architecture decision.


8. FinOps Cost Management

8.1 Components

Component | Package | Description
Cost Aggregator | internal/finops/aggregator/ | AWS/Azure/GCP cost API clients
Anomaly Detection | internal/finops/anomaly/ | ML-based spend anomaly alerting
Chargeback Engine | internal/finops/chargeback/ | Tag-based cost allocation + CSV export
Budget Monitor | internal/finops/alerting/ | Slack + PagerDuty budget alerts
Cost Estimation | internal/finops/estimation.go | 21-resource lookup table
Reporter | internal/finops/reporter/ | Showback/chargeback reports

8.2 Budget Alerting

Budget alerts are sent via two channels:

  • Slack: Block Kit formatted messages
  • PagerDuty: Events API v2 integration

See ADR-010 for the architecture decision.


9. Deployment Architecture

9.1 Multi-Cloud Support

9.2 Terraform Modules

Module | Path | Providers
Compute | deploy/terraform/modules/compute/ | Cloud Run, ECS Fargate, Azure Container Apps
Database | deploy/terraform/modules/database/ | Cloud SQL, RDS, Azure PostgreSQL
Redis | deploy/terraform/modules/redis/ | Memorystore, ElastiCache, Azure Cache
Network | deploy/terraform/modules/network/ | AWS VPC, Azure VNet, GCP VPC
IAM | deploy/terraform/modules/iam/ | GCP SA, AWS IAM Roles, Azure Managed Identity
Monitoring | deploy/terraform/modules/monitoring/ | Cloud Monitoring, CloudWatch, Azure Monitor
Secrets | deploy/terraform/modules/secrets/ | GCP Secret Manager, AWS Secrets Manager, Azure Key Vault
PuppyGraph | deploy/terraform/modules/puppygraph/ | AWS EC2 (POC)

Environments: dev, staging, prod in deploy/terraform/environments/.

9.3 High Availability

  • Active-Active across 2+ regions
  • Database replication with automatic failover
  • State synchronization via distributed consensus
  • < 1 minute RTO for compute failures

10. Security Considerations

10.1 Authentication & Authorization

  • JWT authentication (HS256/RS256, JWKS caching)
  • OIDC federation (Okta, Entra ID) with mock fallback for development
  • RBAC middleware (Admin, Operator, Requester, Viewer roles)
  • API rate limiting (Redis-backed, tier-based: anonymous/free/basic/professional/enterprise)
  • OIDC/WIF for cloud provider access

10.2 Data Protection

  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.3)
  • Secrets in cloud-native vaults (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager)

11. Monitoring & Observability

11.1 Telemetry Stack

Component | Tool | Purpose
Metrics | Prometheus + Grafana | System and application metrics
Logging | Structured JSON (zap) to ELK/Splunk | Centralized log aggregation
Tracing | OpenTelemetry | Distributed tracing across services
Alerting | PagerDuty/Opsgenie | Incident notification

11.2 Key Metrics

Metric | Description | Alert Threshold
aegis_http_requests_total | Total HTTP requests by method/path/status | -
aegis_http_request_duration_seconds | Request latency histogram | P99 > 500ms
aegis_findings_processed_total | Findings processed by source/type/severity | -
aegis_ai_analysis_duration_seconds | AI analysis duration | P99 > 30s
aegis_health_status | Component health (1=healthy, 0=unhealthy) | Any 0
aegis_rate_limit_hits_total | Rate limit violations | >100/min

11.3 Health Endpoints

Endpoint | Purpose | Response
/health | Detailed health check | All components with latency
/healthz | Kubernetes liveness probe | {"status": "alive"}
/ready | Kubernetes readiness probe | Full component health status
/metrics | Prometheus metrics | Prometheus format

11.4 Troubleshooting

Built-in troubleshooting capabilities provide remediation suggestions for common issues:

  • Database connection failures: Connection pooling, credential verification
  • Redis connection issues: Endpoint verification, memory analysis
  • AI provider timeouts: Fallback provider activation, rate limit handling
  • High memory/CPU usage: Profiling endpoints at /debug/pprof/

See Technical Runbooks for detailed operational procedures.


12. API Reference

12.1 Core Endpoints

Endpoint | Method | RBAC | Description
/api/v1/findings | GET | operator, admin | List findings
/api/v1/findings/{id} | GET | operator, admin | Get finding detail
/api/v1/findings/{id}/enrich | POST | operator, admin | Enrich finding with AI
/api/v1/compliance/frameworks | GET | operator, admin | List available frameworks
/api/v1/attack-paths | GET | operator, admin | List attack paths (paginated)
/api/v1/attack-paths/stats | GET | operator, admin | Attack path coverage stats
/api/v1/attack-paths/{id} | GET | operator, admin | Get attack path detail
/api/v1/remediations | GET | operator, admin | List remediations
/api/v1/remediations/{id} | GET | operator, admin | Get remediation detail
/api/v1/remediations/{id}/execute | POST | admin | Execute remediation
/api/v1/costs/summary | GET | operator, admin | Get cost summary
/api/v1/exceptions | POST | admin | Create exception
/api/v1/exceptions/{id} | GET | operator, admin | Get exception
/api/v1/exceptions/mine | GET | requester+ | Get my exceptions
/api/v1/exceptions/pending | GET | operator, admin | Pending approvals
/api/v1/exceptions/expiring | GET | operator, admin | Expiring exceptions
/api/v1/exceptions/{id}/approve | POST | admin | Submit approval
/api/v1/validate/exception | POST | operator, admin | Validate exception against policy
/api/v1/agents | GET | operator, admin | List AI agents
/api/v1/agents/{id} | GET | operator, admin | Get agent detail
/api/v1/agents/{id}/traces | GET | operator, admin | Get agent traces
/api/v1/audit-log | GET | admin | List audit log
/api/v1/users | GET | admin | List users
/api/v1/catalog/modules | GET | operator, admin | List catalog modules
/api/v1/policies | GET | operator, admin | List policies
/api/v1/workflows | GET | operator, admin | List workflows
/api/v1/workflows/{id} | GET | operator, admin | Get workflow
/api/v1/workflows/{id}/approve | POST | admin | Approve workflow
/api/v1/container/scan | GET | operator, admin | Scan container
/api/v1/container/admission | GET | operator, admin | Check admission
/api/v1/secrets | GET | operator, admin | List secrets
/api/v1/secrets/scan | POST | operator, admin | Scan for secrets (content in request body)
/api/v1/secrets/{path} | GET | operator, admin | Get secret
/api/v1/waf/templates | GET | operator, admin | List WAF templates
/api/v1/waf/compliance/{templateId} | GET | operator, admin | Validate WAF compliance
/api/v1/identity/users | GET | operator, admin | List identity users
/api/v1/identity/users/{id}/risk | GET | operator, admin | Get user risk score
/api/v1/ai/nlq | POST | operator, admin | Natural language query
/api/v1/containers | GET | operator, admin | Container security topology
/api/v1/ai/usage | GET | admin | AI budget status (monthly spend vs cap)

13. Data Ingestion Pipeline

13.1 Architecture

The ingestion subsystem normalizes findings from multiple cloud security scanners into a canonical format and deduplicates them before persistence.

13.2 Scanner Adapters

Each scanner implements the ScannerAdapter interface (Parse(ctx, data) → []NormalizedFinding):

Adapter | Source | Severity Resolution
ProwlerAdapter | Prowler JSON | Vendor severity via normalizeSeverity()
TrivyAdapter | Trivy JSON | Vendor severity via normalizeSeverity()
AWSConfigAdapter | AWS Config rules | Heuristic (rule name keywords: "root-account"/"mfa" → CRITICAL)

Severity is canonicalized to CRITICAL/HIGH/MEDIUM/LOW. INFORMATIONAL findings are intentionally dropped.
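The two severity-resolution strategies could be sketched as below. The function bodies are assumptions written to match the behavior described here: the "MODERATE" alias, the treatment of unknown labels, and the MEDIUM fallback in the heuristic are illustrative choices, and `severityFromRuleName` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

type Severity string

const (
	Critical Severity = "CRITICAL"
	High     Severity = "HIGH"
	Medium   Severity = "MEDIUM"
	Low      Severity = "LOW"
)

// normalizeSeverity maps a vendor label to a canonical severity. The second
// return value is false when the finding should be dropped.
func normalizeSeverity(vendor string) (Severity, bool) {
	switch strings.ToUpper(strings.TrimSpace(vendor)) {
	case "CRITICAL":
		return Critical, true
	case "HIGH":
		return High, true
	case "MEDIUM", "MODERATE": // the "MODERATE" alias is an assumption
		return Medium, true
	case "LOW":
		return Low, true
	default:
		// INFORMATIONAL is dropped by design; dropping unknown labels
		// as well is an assumption of this sketch.
		return "", false
	}
}

// severityFromRuleName is a keyword heuristic for sources that carry no
// vendor severity, such as AWS Config rules.
func severityFromRuleName(rule string) Severity {
	name := strings.ToLower(rule)
	if strings.Contains(name, "root-account") || strings.Contains(name, "mfa") {
		return Critical
	}
	return Medium // fallback value is an assumption
}

func main() {
	for _, v := range []string{"high", "Informational"} {
		sev, keep := normalizeSeverity(v)
		fmt.Printf("%q -> %q keep=%v\n", v, sev, keep)
	}
	fmt.Println(severityFromRuleName("iam-user-mfa-enabled"))
}
```

Returning a keep/drop flag alongside the severity lets each adapter filter INFORMATIONAL findings at parse time instead of carrying a fifth severity level through the pipeline.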

13.3 Deduplication

In-memory SHA-256 keyed cache with TTL-based eviction:

  • Key: SHA-256 of source \x00 sourceFindingID \x00 resourceID \x00 accountID (null-byte delimiters prevent field-split collisions)
  • Atomic check-and-insert: CheckOrInsert() acquires a write lock and returns both duplicate status and existing entry
  • Background eviction: goroutine on configurable interval, cancelled via context
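A minimal sketch of such a cache follows, assuming the shape described in the bullets above. `DedupKey`, `Cache`, and `RunEviction` are hypothetical names, the stored value is reduced to a first-seen timestamp, and the `CheckOrInsert` signature is a guess rather than the actual one.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// DedupKey hashes the four identifying fields with a null byte between them,
// so ("ab", "c") and ("a", "bc") never produce the same hash input.
func DedupKey(source, sourceFindingID, resourceID, accountID string) string {
	h := sha256.New()
	for i, field := range []string{source, sourceFindingID, resourceID, accountID} {
		if i > 0 {
			h.Write([]byte{0})
		}
		h.Write([]byte(field))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Cache remembers when each key was first seen, for at most ttl.
type Cache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]time.Time
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: make(map[string]time.Time)}
}

// CheckOrInsert does the lookup and the insert under one write lock, so two
// concurrent callers cannot both see "not a duplicate" for the same key.
func (c *Cache) CheckOrInsert(key string) (firstSeen time.Time, duplicate bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	if seen, ok := c.entries[key]; ok && now.Sub(seen) < c.ttl {
		return seen, true
	}
	c.entries[key] = now
	return now, false
}

// RunEviction deletes expired entries on every tick until ctx is cancelled.
func (c *Cache) RunEviction(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			c.mu.Lock()
			for k, seen := range c.entries {
				if now.Sub(seen) >= c.ttl {
					delete(c.entries, k)
				}
			}
			c.mu.Unlock()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	cache := NewCache(time.Hour)
	go cache.RunEviction(ctx, time.Minute)

	key := DedupKey("prowler", "check-42", "arn:aws:s3:::logs", "123456789012")
	_, first := cache.CheckOrInsert(key)
	_, second := cache.CheckOrInsert(key)
	fmt.Println(first, second)
}
```

A separate check followed by an insert would leave a window in which two ingest workers both persist the same finding; the single locked call closes that window.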

14. Ticket Integration System

14.1 Provider Architecture

Remediation workflows route findings to external ticket/project management systems via the TicketProvider interface:

Provider | Auth | Description Format | ID Validation
Jira | Basic (API token) | Atlassian Document Format (ADF) | ^[A-Z][A-Z0-9_]+-\d+$
Asana | Bearer (PAT) | Plain text | ^\d+$
Azure DevOps | PAT | HTML (escaped) | ^\d+$
Mock | None | In-memory | Any

All REST clients implement exponential backoff with max 3 retries on 429/5xx responses. Configuration loaded from environment variables via ConfigFromEnv().
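The retry policy can be illustrated with a transport-agnostic helper. This is a sketch under assumptions: `DoWithRetry` is a hypothetical name, "max 3 retries" is read as three retries after the initial attempt, the delay doubles from a caller-supplied base, and transport errors are returned without retrying because only 429/5xx responses are named as retryable.

```go
package main

import (
	"fmt"
	"time"
)

// retryable reports whether a status code should trigger another attempt.
func retryable(status int) bool {
	return status == 429 || (status >= 500 && status <= 599)
}

// DoWithRetry calls attempt until it returns a non-retryable status or
// maxRetries retries are used up, doubling the wait after each failed try.
func DoWithRetry(maxRetries int, base time.Duration, attempt func() (int, error)) (int, error) {
	for try := 0; ; try++ {
		status, err := attempt()
		if err != nil {
			return status, err // transport errors are not retried in this sketch
		}
		if !retryable(status) {
			return status, nil
		}
		if try == maxRetries {
			return status, fmt.Errorf("gave up after %d retries, last status %d", maxRetries, status)
		}
		time.Sleep(base << try) // base, 2*base, 4*base, ...
	}
}

func main() {
	calls := 0
	status, err := DoWithRetry(3, 10*time.Millisecond, func() (int, error) {
		calls++
		if calls < 3 {
			return 503, nil // the first two attempts fail
		}
		return 201, nil
	})
	fmt.Println(status, err, calls)
}
```

In a real client the `attempt` closure would wrap the HTTP call and return `resp.StatusCode`; adding jitter to the delay is a common refinement that the design does not mention.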

14.2 Risk-Aware Routing

The RoutingEngine maps finding severity and attack graph signals to ticket priority and SLA:

Input Condition | Priority | Team | SLA
CRITICAL + choke-point | Urgent | incident-response | 4 hours
CRITICAL | Urgent | security-ops | 24 hours
HIGH | High | security-ops | 72 hours
MEDIUM | Normal | platform-eng | 7 days
LOW (fallback) | Low | backlog | 30 days

Routing rules are first-match-wins with a configurable rule set (RoutingRule with match function + decision).
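A first-match-wins rule list with a match function and a decision per rule might be modeled as follows. The field names on `Finding`, `Decision`, and `RoutingRule` are assumptions; the rule values come from the routing table in this section.

```go
package main

import (
	"fmt"
	"time"
)

// Finding carries only the signals the router inspects (hypothetical fields).
type Finding struct {
	Severity   string
	ChokePoint bool // attack-graph signal
}

type Decision struct {
	Priority string
	Team     string
	SLA      time.Duration
}

type RoutingRule struct {
	Match    func(Finding) bool
	Decision Decision
}

func severityIs(s string) func(Finding) bool {
	return func(f Finding) bool { return f.Severity == s }
}

// DefaultRules mirrors the routing table. Order matters: the choke-point rule
// must come before the plain CRITICAL rule or it would never fire.
var DefaultRules = []RoutingRule{
	{func(f Finding) bool { return f.Severity == "CRITICAL" && f.ChokePoint },
		Decision{"Urgent", "incident-response", 4 * time.Hour}},
	{severityIs("CRITICAL"), Decision{"Urgent", "security-ops", 24 * time.Hour}},
	{severityIs("HIGH"), Decision{"High", "security-ops", 72 * time.Hour}},
	{severityIs("MEDIUM"), Decision{"Normal", "platform-eng", 7 * 24 * time.Hour}},
}

var fallback = Decision{"Low", "backlog", 30 * 24 * time.Hour}

// Route returns the decision of the first matching rule, or the fallback.
func Route(rules []RoutingRule, f Finding) Decision {
	for _, r := range rules {
		if r.Match(f) {
			return r.Decision
		}
	}
	return fallback
}

func main() {
	d := Route(DefaultRules, Finding{Severity: "CRITICAL", ChokePoint: true})
	fmt.Println(d.Priority, d.Team, d.SLA)
}
```

Keeping the rules in an ordered slice makes the precedence explicit and lets a deployment swap in its own rule set without touching `Route`.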


15. Webhook Delivery System

15.1 Architecture

Outbound webhook engine delivers Cloud Aegis events to registered HTTP endpoints with HMAC-SHA256 signing.

15.2 Event Types

Event Type | Trigger
finding.created | New finding ingested
finding.resolved | Finding marked resolved
compliance.drift | Compliance posture change
attack_path.new | New attack path discovered
exception.approved | Exception request approved
deploy.preview | Deploy preview ready

Endpoints subscribe to specific event types or receive all events (empty filter = all).

15.3 Security

  • HMAC signing: X-Aegis-Signature header with SHA-256 HMAC when endpoint has a secret
  • SSRF protection (2-layer): (1) URL validation at registration rejects non-HTTPS, private IPs, localhost, metadata endpoints; (2) safeDialContext() rejects private/link-local IPs after DNS resolution (DNS rebinding defense)
  • SA-106: HTTPS-only enforcement for webhook URLs
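The signing step and the post-resolution IP check could look roughly like this. It is a sketch, not the shipped code: hex encoding of the signature is an assumption (the header format is not specified here), and `blockedAddr` and `newSafeDialer` are hypothetical stand-ins for the role that safeDialContext() plays. The dialer's `Control` hook is used because it runs after DNS resolution and therefore sees the IP that is actually being connected to.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify is what a receiver would run; hmac.Equal compares in constant time.
func Verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

// blockedAddr reports whether a resolved address must not be dialed.
// Input that does not parse as an IP is blocked (fail closed).
func blockedAddr(host string) bool {
	ip := net.ParseIP(host)
	if ip == nil {
		return true
	}
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast()
}

// newSafeDialer returns a dialer that rejects blocked addresses at connect time.
func newSafeDialer() *net.Dialer {
	return &net.Dialer{
		Timeout: 10 * time.Second,
		Control: func(network, address string, _ syscall.RawConn) error {
			host, _, err := net.SplitHostPort(address)
			if err != nil {
				return err
			}
			if blockedAddr(host) {
				return errors.New("webhook target resolves to a blocked address: " + host)
			}
			return nil
		},
	}
}

func main() {
	sig := Sign([]byte("endpoint-secret"), []byte(`{"type":"finding.created"}`))
	fmt.Println("X-Aegis-Signature:", sig)

	_, err := newSafeDialer().Dial("tcp", "127.0.0.1:9")
	fmt.Println(err)
}
```

Checking only at registration time is not enough, because a hostname that resolved to a public IP then can later resolve to 169.254.169.254 or an internal address; the connect-time check covers that case.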

15.4 Delivery

Asynchronous fan-out: DeliverAsync() spawns one goroutine per matching endpoint. Each delivery attempt is tracked with HTTP status code and duration. HTTP client timeout: 10 seconds.


16. Integrated Operations Terminal

16.1 Architecture

WebSocket-based interactive terminal for running read-only cloud CLI commands from the browser UI.

16.2 Security Controls

Control | Implementation
Authentication | Two-phase ticket system (SA-002): JWT → 60s nonce → WS upgrade
Authorization | RBAC: operator or admin only
Command whitelist | Read-only cloud CLI subcommands only (aws, gcloud, az, kubectl, terraform, trivy)
Shell injection | Metacharacter rejection (|;&$\><(){}!#\n\r`) before parsing
Dangerous flags | Blocks --endpoint-url, --profile, --impersonate-service-account
Environment | safeEnv() strips all env vars except PATH, HOME=/tmp, TERM
Limits | 30s timeout, 512KB output, 4KB message, 2 sessions/user, 5min idle
Audit | All connect/execute/denied events logged via audit.AuditLogger
Mock fallback | Returns realistic demo output when binary not on PATH
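The pre-parse checks from the table (metacharacter rejection, binary allowlist, dangerous flags) can be sketched as a single validation function. `ValidateCommand` is a hypothetical name, and this sketch stops at the binary level: the per-binary list of read-only subcommands is not specified here and is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// metachars is the rejection set from the Shell injection row.
const metachars = "|;&$\\><(){}!#\n\r`"

var allowedBinaries = map[string]bool{
	"aws": true, "gcloud": true, "az": true,
	"kubectl": true, "terraform": true, "trivy": true,
}

var blockedFlags = map[string]bool{
	"--endpoint-url": true, "--profile": true, "--impersonate-service-account": true,
}

// ValidateCommand returns the argument vector for a line that passes all
// checks. Metacharacters are rejected before the line is split.
func ValidateCommand(line string) ([]string, error) {
	if strings.ContainsAny(line, metachars) {
		return nil, errors.New("shell metacharacter rejected")
	}
	args := strings.Fields(line)
	if len(args) == 0 {
		return nil, errors.New("empty command")
	}
	if !allowedBinaries[args[0]] {
		return nil, fmt.Errorf("binary %q is not allowed", args[0])
	}
	for _, arg := range args[1:] {
		name, _, _ := strings.Cut(arg, "=") // also catches --flag=value
		if blockedFlags[name] {
			return nil, fmt.Errorf("flag %q is not allowed", name)
		}
	}
	return args, nil
}

func main() {
	for _, line := range []string{"aws s3 ls", "aws s3 ls; rm -rf /", "aws s3 ls --profile=prod"} {
		args, err := ValidateCommand(line)
		fmt.Println(args, err)
	}
}
```

The returned slice is meant to be passed to the process launcher as separate arguments, never through a shell, so the metacharacter check is a second layer rather than the only defense.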

17. Resource Query Language (RQL)

17.1 Grammar

Hand-written lexer and recursive-descent parser for filtering findings and resources:

query     = condition { ("AND" | "OR") condition }
condition = field operator value
field     = identifier { "." identifier }
operator  = "=" | "!=" | ">" | ">=" | "<" | "<="
value     = quoted_string | unquoted_word

17.2 Evaluation

  • Field access: Decoupled via FieldAccessor function (dependency injection)
  • Precedence: AND binds tighter than OR; operators of equal precedence associate left to right. Parenthesized grouping is not supported.
  • String comparison: Case-insensitive for = and !=
  • Numeric comparison: Via strconv.ParseFloat for >, >=, <, <=
  • Ordered fields: Inverted comparison for severity-like fields (CRITICAL=1 < HIGH=2), so severity >= HIGH matches CRITICAL and HIGH
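The evaluation rules above can be sketched as follows. This is not the recursive-descent parser itself: the sketch tokenizes on whitespace (so `severity>=HIGH` without spaces, and quoted values containing spaces, are not handled), and it treats non-numeric operands of an ordering operator as a non-match, which is an assumption. `Eval`, `compare`, and `severityRank` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FieldAccessor resolves a field name against the record being filtered.
type FieldAccessor func(field string) (string, bool)

// severityRank gives ordered fields their inverted scale (most severe = 1).
var severityRank = map[string]int{"CRITICAL": 1, "HIGH": 2, "MEDIUM": 3, "LOW": 4}

var ops = map[string]func(a, b float64) bool{
	"=":  func(a, b float64) bool { return a == b },
	"!=": func(a, b float64) bool { return a != b },
	">":  func(a, b float64) bool { return a > b },
	">=": func(a, b float64) bool { return a >= b },
	"<":  func(a, b float64) bool { return a < b },
	"<=": func(a, b float64) bool { return a <= b },
}

func compare(op, actual, want string) (bool, error) {
	cmp, ok := ops[op]
	if !ok {
		return false, fmt.Errorf("unknown operator %q", op)
	}
	ra, okA := severityRank[strings.ToUpper(actual)]
	rw, okW := severityRank[strings.ToUpper(want)]
	if okA && okW {
		// Negate the ranks so that "greater" means "more severe".
		return cmp(float64(-ra), float64(-rw)), nil
	}
	if op == "=" || op == "!=" {
		return strings.EqualFold(actual, want) == (op == "="), nil
	}
	a, errA := strconv.ParseFloat(actual, 64)
	w, errW := strconv.ParseFloat(want, 64)
	if errA != nil || errW != nil {
		return false, nil // non-numeric operands never match (assumption)
	}
	return cmp(a, w), nil
}

// Eval evaluates "cond {AND|OR cond}" with AND binding tighter than OR:
// group holds the current AND-chain, result the OR of finished chains.
func Eval(query string, get FieldAccessor) (bool, error) {
	toks := strings.Fields(query)
	result, group := false, true
	for i := 0; ; i += 4 {
		if i+3 > len(toks) {
			return false, fmt.Errorf("incomplete condition at token %d", i)
		}
		actual, _ := get(toks[i]) // a missing field compares as ""
		ok, err := compare(toks[i+1], actual, strings.Trim(toks[i+2], `"`))
		if err != nil {
			return false, err
		}
		group = group && ok
		if i+3 == len(toks) {
			return result || group, nil
		}
		switch strings.ToUpper(toks[i+3]) {
		case "AND":
		case "OR":
			result, group = result || group, true
		default:
			return false, fmt.Errorf("expected AND or OR, got %q", toks[i+3])
		}
	}
}

func main() {
	rec := map[string]string{"severity": "CRITICAL", "env": "prod", "cvss": "9.1"}
	get := func(f string) (string, bool) { v, ok := rec[f]; return v, ok }
	ok, err := Eval("env = prod OR env = dev AND cvss > 10", get)
	fmt.Println(ok, err)
}
```

The sample query distinguishes the two readings: with AND binding tighter it is `env = prod OR (env = dev AND cvss > 10)`, which matches the record, whereas a strictly flat left-to-right evaluation would not.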

18. Attack Surface Management

18.1 Architecture

External-facing asset discovery that scans domains for hosts, services, ports, and TLS certificates via the ASMScanner interface.

18.2 Asset Model

Component | Fields
Asset | Hostname, IP, Services, Certificates, FirstSeen, LastSeen
ExposedService | Port, Protocol (HTTP/HTTPS/SSH/DNS/SMTP/FTP), Banner, TLS flag
Certificate | Subject, Issuer, NotBefore, NotAfter, SANs

Current implementation provides a deterministic mock scanner (SHA-256 domain seed for reproducible demo data). Real scanner implementations plug in behind the same ASMScanner interface.


19. Multi-Tenancy

19.1 Tenant Resolution

Request-scoped tenant resolution via middleware with a 3-level cascade:

Priority | Source | Restriction
1 (highest) | JWT tenant_id claim | Any authenticated user
2 | X-Tenant-ID header | Admin role only
3 | Subdomain extraction from Host header | Any request

When no tenant is resolved, the middleware defaults to ("default", "") for single-tenant backward compatibility. A nil store disables multi-tenancy (the middleware becomes a no-op).
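The cascade can be sketched as a pure function over the request attributes it inspects. The `Request` struct, `ResolveTenant`, and the subdomain rule (leftmost label of a host with at least three labels) are assumptions for illustration; the sketch returns only the tenant ID and does not model the second element of the default tuple.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Request carries only the inputs the cascade looks at (hypothetical type).
type Request struct {
	JWTTenantID    string // tenant_id claim, empty when absent
	Role           string
	HeaderTenantID string // X-Tenant-ID header
	Host           string
}

const defaultTenant = "default"

// ResolveTenant applies the three sources in priority order.
func ResolveTenant(r Request) string {
	if r.JWTTenantID != "" {
		return r.JWTTenantID
	}
	if r.HeaderTenantID != "" && r.Role == "admin" {
		return r.HeaderTenantID // header override is honored for admins only
	}
	if sub := subdomain(r.Host); sub != "" {
		return sub
	}
	return defaultTenant
}

// subdomain returns the leftmost label of hosts shaped like
// tenant.example.com; that shape is an assumption of this sketch.
func subdomain(host string) string {
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h // strip the port when one is present
	}
	if net.ParseIP(host) != nil {
		return "" // bare IP addresses carry no tenant
	}
	labels := strings.Split(host, ".")
	if len(labels) < 3 {
		return ""
	}
	return labels[0]
}

func main() {
	fmt.Println(ResolveTenant(Request{JWTTenantID: "acme", Role: "admin", HeaderTenantID: "other"}))
	fmt.Println(ResolveTenant(Request{Role: "operator", HeaderTenantID: "other", Host: "acme.aegis.example.com"}))
	fmt.Println(ResolveTenant(Request{Host: "localhost:8080"}))
}
```

Placing the signed JWT claim first means a caller cannot switch tenants by setting a header or by choosing a different hostname once authenticated.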

19.2 Tenant Configuration

Per-tenant configuration includes:

Area | Config
Branding | CompanyName, ProductName, LogoPath, PrimaryColor, AccentColor
Auth | OIDC provider (okta/entra_id/auth0/mock), Issuer, ClientID, Audience
Modules | Enabled feature modules
Rate Limits | RequestsPerMinute, BurstSize

The current store is in-memory (Phase 3 prototype). A Postgres-backed store is planned for Phase 4.


20. AI Governance

20.1 Architecture

Agent governance framework with in-process embedded OPA policy engine (microsecond-level evaluation, not HTTP sidecar). Provides agent registry, observability tracing, threat modeling (STRIDE + MITRE ATLAS), and maturity assessment.

20.2 Policy Engine

Two base policies embedded as Go constants:

Policy | Controls
BaseToolAccessPolicy | Tool allowlist/blocklist, rate limiting, forbidden parameter patterns
BaseDataFlowPolicy | Classification-based destination control, source restrictions, PII redaction

Policies are compiled once at load time into a rego.PreparedEvalQuery, which keeps evaluation sub-millisecond. Each evaluation returns a structured Decision with Allow, Reasons, Violations, and EvalTimeUs.

20.3 Observability Model

Component | Purpose
AgentTrace | Full execution trace per agent invocation
Span | Individual operation (types: llm, retrieval, tool, chain, agent, policy)
SecuritySignal | Injection attempts, data exfiltration, tool abuse, privilege escalation
TraceMetrics | Aggregated performance and cost metrics

LLM spans track token counts and cost. Retrieval spans track vector similarity scores. Tool spans include inline policy decisions.


21. Audit System

21.1 Architecture

Tamper-evident, append-only audit logging with SHA-256 integrity hashes and multiple backend support.

21.2 Event Taxonomy

Domain.Verb format across 12 domains:

Domain | Example Actions
exception | create, approve, reject, expire, revoke
finding | create, update, remediate, suppress
remediation | execute, rollback
terminal | connect, execute, denied
agent | invoke, complete, fail
deploy_preview | create, promote
user | login, logout, role_change
secret | rotate, access, scan

21.3 Integrity

computeHash() produces SHA-256 of all content fields with null-byte delimiters. Stored as IntegrityHash on every AuditEntry. Postgres backend includes automatic tenant_id scoping via tenant.IDFromContext().
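The integrity hash might be computed along these lines. The `AuditEntry` fields shown are a hypothetical subset, and `Seal` and `Verify` are illustrative helpers; only the SHA-256-over-null-delimited-fields idea is taken from the description above.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// AuditEntry lists a hypothetical subset of content fields.
type AuditEntry struct {
	Timestamp     string
	TenantID      string
	Actor         string
	Action        string // domain.verb, e.g. "finding.suppress"
	Resource      string
	Detail        string
	IntegrityHash string
}

// computeHash hashes every content field, separated by null bytes so that
// moving characters between adjacent fields changes the digest.
func computeHash(e AuditEntry) string {
	fields := []string{e.Timestamp, e.TenantID, e.Actor, e.Action, e.Resource, e.Detail}
	sum := sha256.Sum256([]byte(strings.Join(fields, "\x00")))
	return hex.EncodeToString(sum[:])
}

// Seal stamps the entry with its hash before it is appended to the log.
func Seal(e AuditEntry) AuditEntry {
	e.IntegrityHash = computeHash(e)
	return e
}

// Verify recomputes the hash and compares it with the stored value.
func Verify(e AuditEntry) bool {
	return subtle.ConstantTimeCompare([]byte(e.IntegrityHash), []byte(computeHash(e))) == 1
}

func main() {
	e := Seal(AuditEntry{
		Timestamp: "2026-03-01T12:00:00Z", TenantID: "acme",
		Actor: "alice", Action: "finding.suppress", Resource: "finding/42",
	})
	fmt.Println(Verify(e))
	e.Actor = "mallory" // edit after sealing
	fmt.Println(Verify(e))
}
```

A per-entry hash detects edits to a stored row; it does not by itself detect deleted rows, which is why it is paired with append-only storage.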


22. GRC Integration

22.1 Provider Architecture

Policy exception lifecycle management via the GRCProvider interface (8 methods). Factory pattern (NewProvider(Config)) creates the appropriate backend:

Provider | Backend | Status
Memory | In-memory map | Demo/test
Postgres | PostgreSQL | Self-hosted production
ServiceNow | ServiceNow GRC REST API | Enterprise
Archer | RSA Archer REST API | Stub (documented)

22.2 Exception Lifecycle

Approval chain: multi-level (SECURITY_LEAD → GRC_ANALYST → CISO). Empty approval chain does NOT auto-approve. ValidateException() is the integration point with the policy engine — called before provisioning.

22.3 Security

  • Credentials loaded from environment variables at init (never stored in config structs)
  • ServiceNow query injection prevention: snowSafeInput regex ^[a-zA-Z0-9._@\-]+$ + URL encoding
  • ServiceNow OAuth token caching with double-check locking pattern
  • All HTTP response bodies limited to 1MB via io.LimitReader
  • Postgres queries use parameterized placeholders ($N) and pq.Array() for batch operations
  • All Postgres queries include tenant_id scoping
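Two of the controls above, the ServiceNow input allowlist and the response-size cap, can be sketched briefly. The regular expression is the one quoted in the list; `SafeQueryValue` and `ReadLimited` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"net/url"
	"regexp"
	"strings"
)

// snowSafeInput is the allowlist pattern quoted in the list above.
var snowSafeInput = regexp.MustCompile(`^[a-zA-Z0-9._@\-]+$`)

const maxBodyBytes = 1 << 20 // 1MB

// SafeQueryValue validates a value against the allowlist, then URL-encodes
// it for use inside an encoded query string.
func SafeQueryValue(v string) (string, error) {
	if !snowSafeInput.MatchString(v) {
		return "", fmt.Errorf("value %q contains characters outside the safe set", v)
	}
	return url.QueryEscape(v), nil
}

// ReadLimited reads at most maxBodyBytes from r and silently stops there.
func ReadLimited(r io.Reader) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, maxBodyBytes))
}

func main() {
	v, err := SafeQueryValue("alice@example.com")
	fmt.Println(v, err)

	// "^" is ServiceNow's encoded-query separator, so it must never pass.
	_, err = SafeQueryValue("x^ORactive=true")
	fmt.Println(err)

	body, _ := ReadLimited(strings.NewReader(strings.Repeat("a", 2<<20)))
	fmt.Println(len(body))
}
```

Validating before encoding matters here: URL encoding alone would pass a `^` through to the server as `%5E`, where it is decoded and interpreted as a query operator.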

See ADR-007 for the architecture decision.


Appendix A: Technology Stack

Category | Technology
Language | Go 1.25
API Framework | gorilla/mux
Frontend | React 19 / Vite 7 / Tailwind CSS v4 / shadcn/ui
Database | PostgreSQL 16
Cache | Redis
Orchestration | Temporal
Policy Engine | OPA / Rego
AI | Anthropic Claude Opus 4.6, OpenAI GPT-4, AWS Bedrock (production enrichment)
IaC | Terraform
Container Runtime | Kubernetes (EKS/AKS/GKE)
Observability | OpenTelemetry, Prometheus, zap
Identity | Okta, Microsoft Entra ID (OIDC)
Deployment | Cloudflare Pages (frontend), Docker (backend)

Appendix B: Diagram Formats

Note on LucidChart Import: Mermaid diagrams are rendered as static images when imported to LucidChart. For editable diagrams:

  1. Recommended: Create directly in LucidChart or use draw.io
  2. Export: Use draw.io XML format for cross-platform compatibility
  3. Alternative: Use PlantUML with LucidChart import extension

Architecture diagrams in this document use Mermaid for GitHub rendering and can be recreated in LucidChart for presentation purposes.


Document History

Version | Date | Author | Changes
4.0 | March 2026 | L. Vo-Nguyen | Expanded from 12 to 22 sections: added Data Ingestion (13), Ticket Integration (14), Webhooks (15), Terminal (16), RQL (17), ASM (18), Multi-Tenancy (19), AI Governance (20), Audit (21), GRC (22)
3.1 | March 2026 | L. Vo-Nguyen | Added Viewer role to RBAC table, updated ADR count (009-014), added POST /api/v1/ai/nlq + GET /api/v1/containers + GET /api/v1/ai/usage to API reference, changed /secrets/scan from GET to POST
3.0 | March 2026 | L. Vo-Nguyen | Updated tech stack (Go 1.25, gorilla/mux, React 19), added remediation/attack path/FinOps/CSPM sections, full API reference from routes.go, corrected RBAC model, added ADR cross-references
2.0 | January 2026 | L. Vo-Nguyen | Architecture overview, compliance engine, CI/CD, identity, deployment
1.0 | January 2026 | L. Vo-Nguyen | Initial HLD