Component Selection Rationale and Cost Analysis

| Document | Value |
| --- | --- |
| Version | 1.0 |
| Author | Liem Vo-Nguyen |
| Date | January 2026 |
| Project | Cloud Aegis |

1. Cache Layer: Redis vs Memcached

Decision: Redis

| Criteria | Redis | Memcached |
| --- | --- | --- |
| Data Structures | Rich (strings, hashes, lists, sets, sorted sets) | Simple key-value only |
| Persistence | Optional (RDB/AOF) | None |
| Replication | Built-in master-replica | Manual |
| Pub/Sub | Yes | No |
| Lua Scripting | Yes | No |
| Cluster Mode | Yes | No (client-side sharding) |
| Memory Efficiency | Moderate | High |

Why Redis

  1. Session Management: Need TTL with complex session objects
  2. Rate Limiting: Counter increment with key expiry (INCR + EXPIRE, wrapped in a Lua script or MULTI/EXEC when the pair must be atomic)
  3. Pub/Sub: Real-time finding notifications
  4. Caching Compliance Data: Hash structures for framework controls
  5. Distributed Locks: Redlock for workflow coordination
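
The rate-limiting item above can be sketched as a fixed-window counter. This is a minimal illustration, not the project's implementation: `InMemoryCounterStore` is a hypothetical stand-in that exposes only `incr` and `expire`, the two operations the pattern needs, so the example runs without a Redis server. The key prefix, limit, and window are assumed values.

```python
import time


class InMemoryCounterStore:
    """Hypothetical stand-in exposing the two Redis-style calls used below."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}  # key -> integer counter
        self._expiry = {}  # key -> absolute expiry time

    def _purge(self, key):
        # Drop the key once its TTL has elapsed, as Redis would.
        deadline = self._expiry.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._values.pop(key, None)
            self._expiry.pop(key, None)

    def incr(self, key):
        self._purge(key)
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds):
        if key not in self._values:
            return False
        self._expiry[key] = self._clock() + seconds
        return True


def allow_request(store, client_id, limit, window_seconds):
    """Fixed-window limiter: allow at most `limit` calls per window."""
    key = f"ratelimit:{client_id}"  # assumed key naming scheme
    count = store.incr(key)
    if count == 1:
        # The first hit in a window starts the TTL clock.
        store.expire(key, window_seconds)
    return count <= limit
```

Against a real Redis server, INCR and EXPIRE are two separate commands. If the client fails between them, the counter key is left without a TTL, which is why the pair is commonly wrapped in a Lua script or a MULTI/EXEC transaction.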

Cost Analysis (AWS ElastiCache)

| Workload | Redis | Memcached |
| --- | --- | --- |
| Instance | cache.r6g.large | cache.r6g.large |
| Monthly (On-Demand) | $219.60 | $175.20 |
| Monthly (Reserved 1yr) | $142.35 | $113.88 |
| Estimated Usage | 50GB cache, 10K ops/sec | 50GB cache, 10K ops/sec |

Verdict: Redis costs ~25% more but provides essential features (persistence, pub/sub, Lua) that would require custom implementation with Memcached.


2. Compute: Lambda vs EC2 vs EKS

Decision: EKS (Primary), Lambda (Event Processing)

| Criteria | Lambda | EC2 | EKS |
| --- | --- | --- | --- |
| Cold Start | 100ms-10s | None | None |
| Max Duration | 15 min | Unlimited | Unlimited |
| Scaling | Automatic | Manual/ASG | HPA/KEDA |
| Cost Model | Per invocation | Per hour | Per hour + control plane |
| Ops Overhead | Low | High | Medium |
| State | Stateless | Stateful | Stateful pods |

Why EKS (Primary)

  1. Long-Running Workflows: Compliance scans can take hours
  2. Consistent Latency: No cold starts for API requests
  3. Multi-Tenancy: Namespace isolation for customers
  4. Sidecar Patterns: Envoy for mTLS, Fluentd for logging
  5. Stateful Workloads: Temporal workers need persistent connections

Why Lambda (Event Processing)

  1. Webhook Receivers: Sporadic GRC callbacks
  2. Scheduled Tasks: Daily compliance reports
  3. Event-Driven: S3 trigger for file processing

Cost Analysis (100 concurrent users, 1M API calls/month)

| Component | Lambda | EC2 (m6i.large) | EKS (m6i.large) |
| --- | --- | --- | --- |
| Compute/month | $200 (invocations) | $138 (2 instances) | $138 + $73 (control plane) |
| Load Balancer | N/A | $22 | $22 (ingress) |
| Data Transfer | $10 | $50 | $50 |
| Total | $210 | $210 | $283 |

At Scale (10x)

| Component | Lambda | EC2 | EKS |
| --- | --- | --- | --- |
| 10M calls/month | $2,000 | $690 (5 inst) | $690 + $73 |
| Scaling | Automatic | ASG lag | KEDA fast |

Verdict: EKS for API and long-running workloads ($73/mo premium for orchestration). Lambda for event-driven processing where cold starts are acceptable.
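
To make the scaling argument concrete, the compute figures from the two tables above can be normalized to cost per million API calls. This is a rough sketch that uses only the compute rows (load balancer and data transfer are left out because the 10x table does not list them); the dictionary layout and function name are illustrative.

```python
# Monthly compute cost in USD, copied from the tables above.
# Inner keys are monthly API call volumes in millions.
COMPUTE_COST = {
    "Lambda": {1: 200.0, 10: 2000.0},
    "EC2": {1: 138.0, 10: 690.0},
    "EKS": {1: 138.0 + 73.0, 10: 690.0 + 73.0},  # instances + control plane
}


def cost_per_million_calls(option, millions_of_calls):
    """Return compute cost per 1M API calls for one option and volume."""
    return COMPUTE_COST[option][millions_of_calls] / millions_of_calls


for option in COMPUTE_COST:
    low = cost_per_million_calls(option, 1)
    high = cost_per_million_calls(option, 10)
    print(f"{option}: ${low:.2f}/M at 1M calls, ${high:.2f}/M at 10M calls")
```

On these figures Lambda stays flat at $200 per million calls, while EKS drops from $211 to about $76 per million. The fixed $73 control-plane fee shrinks from roughly 35% of EKS compute cost at 1M calls to under 10% at 10M calls.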


3. Database: PostgreSQL vs MySQL vs DynamoDB

Decision: PostgreSQL (Primary), DynamoDB (High-Scale Findings)

| Criteria | PostgreSQL | MySQL | DynamoDB |
| --- | --- | --- | --- |
| JSONB Support | Excellent | JSON (less performant) | Native |
| Partitioning | Built-in | Manual | Automatic |
| Extensions | Rich (pg_trgm, pgvector) | Limited | None |
| Transactions | Full ACID | Full ACID | Limited |
| Geospatial | PostGIS | Limited | Limited |
| Full-Text Search | Built-in | Built-in | OpenSearch needed |
| Cost (RDS) | Similar | Similar | Higher at scale |

Why PostgreSQL

  1. JSONB for Findings: Flexible schema for multi-source findings
  2. Array Types: Compliance framework mappings as arrays
  3. pg_trgm: Fuzzy search on finding descriptions
  4. CTEs: Complex compliance reporting queries
  5. Partitioning: By date for finding history

Why DynamoDB (Adjunct)

  1. High-Volume Findings: 100K+ findings/day ingestion
  2. Single-Table Design: Finding lookup by ID
  3. TTL: Automatic expiration of transient data
  4. Global Tables: Multi-region replication

Cost Analysis (500GB data, 10K reads/sec, 1K writes/sec)

| Component | PostgreSQL (RDS) | DynamoDB |
| --- | --- | --- |
| Instance | db.r6g.large | N/A |
| Storage (500GB) | $57.50 | $125 (standard) |
| Compute | $219.60 | N/A |
| Read Capacity | Included | $500 (on-demand) |
| Write Capacity | Included | $625 (on-demand) |
| Monthly Total | $277 | $1,250 |

Verdict: PostgreSQL for the primary data store. DynamoDB only when multi-region active-active replication or more than 100K TPS is required.


4. Message Queue: SQS vs Kafka vs RabbitMQ

Decision: SQS (Primary), Kafka (High-Volume Streams)

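| Criteria | SQS | Kafka | RabbitMQ |
| --- | --- | --- | --- |
| Ordering | FIFO queues | Per-partition | Per-queue |
| Throughput | 3K msg/sec FIFO (with batching), unlimited std | 100K+ msg/sec | 50K msg/sec |
| Retention | Up to 14 days | Configurable | Until consumed |
| Ops Overhead | None (managed) | High | Medium |
| Consumer Groups | No (fan-out via SNS) | Yes | Yes |
| Replay | No | Yes | No |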

Why SQS

  1. Managed: Zero ops for queue infrastructure
  2. Integration: Native Lambda, ECS triggers
  3. Dead Letter Queues: Built-in DLQ support
  4. Sufficient Throughput: 3K/sec FIFO covers most use cases

Why Kafka (When Needed)

  1. Event Sourcing: Full history replay capability
  2. High Volume: >10K events/sec sustained
  3. Stream Processing: Kafka Streams/ksqlDB
  4. Multi-Consumer: Same events to multiple consumers

Cost Analysis (1M messages/day)

| Component | SQS | MSK (Kafka) | AmazonMQ (RabbitMQ) |
| --- | --- | --- | --- |
| Messages | $12.60 | N/A | N/A |
| Broker Costs | N/A | $438 (2x kafka.m5.large) | $213 (mq.m5.large) |
| Storage | N/A | $30 (100GB) | Included |
| Monthly Total | $13 | $468 | $213 |

Verdict: SQS for 99% of use cases. Kafka only for event sourcing requirements.
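
The throughput claim can be checked with simple arithmetic: 1M messages/day is a low average rate compared with the 3K msg/sec FIFO figure and the 10K events/sec threshold given for Kafka. The sketch below assumes a hypothetical peak-to-average factor of 20 to account for bursts; that factor is an illustrative assumption, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60

SQS_FIFO_LIMIT = 3_000     # msg/sec, from the comparison table
KAFKA_THRESHOLD = 10_000   # events/sec, from "Why Kafka"


def average_rate(messages_per_day):
    """Average messages per second over a full day."""
    return messages_per_day / SECONDS_PER_DAY


def peak_rate(messages_per_day, peak_factor):
    """Estimated burst rate; peak_factor is an assumed multiplier."""
    return average_rate(messages_per_day) * peak_factor


avg = average_rate(1_000_000)
peak = peak_rate(1_000_000, peak_factor=20)

print(f"average: {avg:.2f} msg/sec")
print(f"assumed peak: {peak:.2f} msg/sec")
print(f"fits SQS FIFO: {peak < SQS_FIFO_LIMIT}")
print(f"needs Kafka-level volume: {peak > KAFKA_THRESHOLD}")
```

The average works out to about 11.6 msg/sec, so even a 20x burst stays well under the FIFO figure.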


5. Orchestration: Temporal vs Step Functions vs Airflow

Decision: Temporal

| Criteria | Temporal | Step Functions | Airflow |
| --- | --- | --- | --- |
| Long-Running | Unlimited | 1 year max | DAG-based |
| Code-Based Workflows | Yes (Go, Java, etc.) | JSON/YAML | Python |
| Retries | Sophisticated | Basic | Basic |
| Visibility | Excellent UI | CloudWatch | Good UI |
| Testing | Unit testable | Difficult | Moderate |
| Cost | Self-hosted | Per transition | Self-hosted |

Why Temporal

  1. Code-Based: Workflows as Go code, testable
  2. Long-Running: Approval workflows can take weeks
  3. Saga Pattern: Complex multi-step provisioning
  4. Visibility: Built-in workflow history
  5. Self-Healing: Automatic retry with backoff
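
The "automatic retry with backoff" behavior can be illustrated in plain Python. This sketch does not use the Temporal SDK; it only shows the shape of an exponential backoff schedule with a cap. The parameter names are chosen to resemble common retry-policy settings and, like the helper names, are assumptions for illustration.

```python
import time


def backoff_delays(initial_interval, backoff_coefficient,
                   maximum_interval, maximum_attempts):
    """Return the wait in seconds before each retry.

    The first attempt runs immediately, so there is one fewer delay
    than there are attempts.
    """
    delays = []
    delay = initial_interval
    for _ in range(maximum_attempts - 1):
        delays.append(min(delay, maximum_interval))
        delay *= backoff_coefficient
    return delays


def run_with_retries(operation, delays, sleep=time.sleep):
    """Call `operation` until it succeeds or the delays run out.

    Returns (result, attempts_used); re-raises the last error on failure.
    """
    last_error = None
    attempts = 0
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        attempts += 1
        try:
            return operation(), attempts
        except Exception as exc:  # a sketch: real code would narrow this
            last_error = exc
    raise last_error


print(backoff_delays(1.0, 2.0, 30.0, 8))
```

With an initial interval of 1 second, a coefficient of 2, and a 30 second cap, eight attempts produce waits of 1, 2, 4, 8, 16, 30, and 30 seconds. In a workflow engine this schedule is declared as policy and persisted with the workflow state rather than hand-written around each call.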

Cost Analysis

| Component | Temporal (Self-Hosted) | Step Functions |
| --- | --- | --- |
| Compute | 2x m6i.large = $138/mo | N/A |
| Transitions (1M/mo) | N/A | $25 |
| Storage | 50GB EBS = $5/mo | Included |
| Monthly Total | $143 | $25 |

However, Step Functions imposes state transition limits, offers limited local testing, and defines workflows in JSON. Temporal's code-first approach is worth the extra cost for complex workflows.
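
The cost gap also narrows with volume, because Temporal's cost in the table is fixed while Step Functions is billed per transition. A rough break-even estimate, assuming the self-hosted cluster stays at the size shown and the per-transition rate implied by the table ($25 per 1M) holds at higher volumes:

```python
TEMPORAL_FIXED_MONTHLY = 138.0 + 5.0   # compute + storage, from the table
STEP_FUNCTIONS_PER_MILLION = 25.0      # USD per 1M transitions, from the table


def step_functions_cost(transitions):
    """Monthly Step Functions cost in USD for a transition count."""
    return transitions / 1_000_000 * STEP_FUNCTIONS_PER_MILLION


def break_even_transitions():
    """Monthly transitions at which both options cost the same."""
    return TEMPORAL_FIXED_MONTHLY / STEP_FUNCTIONS_PER_MILLION * 1_000_000


print(f"break-even: {break_even_transitions():,.0f} transitions/month")
print(f"Step Functions at 10M transitions: ${step_functions_cost(10_000_000):.2f}")
```

On these assumptions the two options cost the same at about 5.7M transitions per month; below that volume the choice of Temporal rests on the workflow features rather than on price.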


6. AI Provider: Anthropic Claude vs OpenAI GPT-4

Decision: Anthropic Claude Opus 4.6 (Primary), OpenAI GPT-4 (Fallback)

| Criteria | Claude Opus 4.6 | GPT-4 Turbo |
| --- | --- | --- |
| Context Window | 200K tokens | 128K tokens |
| Speed | Moderate | Fast |
| Reasoning | Excellent | Excellent |
| Coding | Excellent | Excellent |
| Cost (Input) | $15/1M tokens | $10/1M tokens |
| Cost (Output) | $75/1M tokens | $30/1M tokens |
| Rate Limits | Lower | Higher |

Why Claude Opus 4.6 (Primary)

  1. Context Window: 200K tokens for large finding batches
  2. Reasoning: Better at nuanced security analysis
  3. Structured Output: More consistent JSON responses
  4. Less Hallucination: More conservative in risk assessment

Why GPT-4 (Fallback)

  1. Rate Limits: Higher throughput when needed
  2. Cost: 60% cheaper for output-heavy workloads
  3. Availability: Different failure domains
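
The primary/fallback arrangement can be sketched independently of any vendor SDK. In this minimal example the two providers are plain callables and `ProviderError` is a hypothetical exception that a provider wrapper would raise on a rate limit or outage; none of these names come from a real client library.

```python
class ProviderError(Exception):
    """Hypothetical error raised by a provider wrapper on rate limit or outage."""


def analyze_with_fallback(finding, primary, fallback):
    """Try the primary provider first and use the fallback if it fails.

    `primary` and `fallback` are callables that take a finding dict and
    return an analysis dict. Returns (analysis, provider_used).
    """
    try:
        return primary(finding), "primary"
    except ProviderError:
        return fallback(finding), "fallback"


# Stub providers standing in for real API wrappers.
def rate_limited_primary(finding):
    raise ProviderError("rate limit exceeded")


def simple_fallback(finding):
    return {"finding_id": finding["id"], "risk": "medium"}


analysis, used = analyze_with_fallback(
    {"id": "F-1"}, rate_limited_primary, simple_fallback
)
print(used, analysis)
```

Recording which provider produced each analysis, as the second tuple element does here, makes it possible to audit results later if the two models assess risk differently.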

Cost Analysis (100K findings/month, ~500 input and ~250 output tokens/finding)

| Metric | Claude Opus 4.6 | GPT-4 Turbo |
| --- | --- | --- |
| Input Tokens | 50M @ $15/M = $750 | 50M @ $10/M = $500 |
| Output Tokens | 25M @ $75/M = $1,875 | 25M @ $30/M = $750 |
| Monthly Total | $2,625 | $1,250 |

Verdict: Use Claude for complex analysis (toxic combos, contextual risk) and GPT-4 for high-volume simple enrichment. A hybrid approach costs ~$1,800/mo.
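
The monthly totals and the hybrid figure can be reproduced from the per-token prices. In the sketch below, the 250 output tokens per finding is inferred from the table's 25M output tokens over 100K findings, and the resulting provider split is derived from the ~$1,800 figure rather than stated in the source.

```python
def monthly_token_cost(findings, input_tokens, output_tokens,
                       input_price, output_price):
    """Monthly cost in USD.

    Token counts are per finding; prices are per 1M tokens.
    """
    input_cost = findings * input_tokens / 1_000_000 * input_price
    output_cost = findings * output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


def claude_share_for_budget(budget, claude_cost, gpt4_cost):
    """Fraction of findings routed to Claude that yields `budget`."""
    return (budget - gpt4_cost) / (claude_cost - gpt4_cost)


claude = monthly_token_cost(100_000, 500, 250, 15.0, 75.0)
gpt4 = monthly_token_cost(100_000, 500, 250, 10.0, 30.0)
share = claude_share_for_budget(1_800.0, claude, gpt4)

print(f"Claude only: ${claude:,.0f}")
print(f"GPT-4 only: ${gpt4:,.0f}")
print(f"Claude share for $1,800/mo: {share:.0%}")
```

Under these assumptions, ~$1,800/mo corresponds to routing about 40% of findings to Claude and 60% to GPT-4.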


7. Secret Management: AWS Secrets Manager vs HashiCorp Vault

Decision: AWS Secrets Manager (Cloud), Vault (Hybrid)

| Criteria | AWS Secrets Manager | HashiCorp Vault |
| --- | --- | --- |
| Multi-Cloud | AWS only | Yes |
| Dynamic Secrets | Limited (RDS) | Extensive |
| PKI | No | Yes |
| Ops Overhead | None | High |
| Auditing | CloudTrail | Built-in |
| Cost | Per secret + API calls | Self-hosted |

Why AWS Secrets Manager (Cloud-Only)

  1. Zero Ops: Fully managed
  2. IAM Integration: Native AWS IAM policies
  3. Rotation: Built-in for RDS, Redshift
  4. Lambda Integration: Seamless for serverless

Why Vault (Hybrid/Multi-Cloud)

  1. Multi-Cloud: Single pane for AWS/Azure/GCP
  2. Dynamic DB Creds: Ephemeral credentials
  3. PKI: Certificate management
  4. SSH: Dynamic SSH credentials

Cost Analysis (100 secrets, 100K API calls/month)

| Component | AWS Secrets Manager | HashiCorp Vault (Self-Hosted) |
| --- | --- | --- |
| Secrets (100) | $40/mo | N/A |
| API Calls (100K) | $0.50/mo | N/A |
| Compute | N/A | 2x t3.medium = $60/mo |
| Storage | N/A | 50GB EBS = $5/mo |
| Monthly Total | $40.50 | $65 |

Verdict: AWS Secrets Manager for AWS-only deployments. Vault for multi-cloud deployments or when dynamic secrets or PKI are needed.


8. Summary: Monthly Cost Estimate

Small Deployment (10 users, 10K findings/month)

| Component | Choice | Monthly Cost |
| --- | --- | --- |
| Compute (EKS) | 2x m6i.large + control plane | $211 |
| Database (RDS) | db.t3.medium | $50 |
| Cache (ElastiCache) | cache.t3.micro | $12 |
| Queue (SQS) | Standard | $5 |
| AI (Claude) | 10K findings | $260 |
| Secrets | Secrets Manager | $10 |
| Networking | NAT, LB | $50 |
| Total | | ~$600/mo |

Medium Deployment (100 users, 100K findings/month)

| Component | Choice | Monthly Cost |
| --- | --- | --- |
| Compute (EKS) | 4x m6i.large + control plane | $365 |
| Database (RDS) | db.r6g.large Multi-AZ | $440 |
| Cache (ElastiCache) | cache.r6g.large | $220 |
| Queue (SQS) | Standard | $15 |
| AI (Hybrid) | 100K findings | $1,800 |
| Secrets | Secrets Manager | $50 |
| Networking | NAT, LB, Transit Gateway | $200 |
| Monitoring | CloudWatch, X-Ray | $100 |
| Total | | ~$3,200/mo |

Large Deployment (1000 users, 1M findings/month)

| Component | Choice | Monthly Cost |
| --- | --- | --- |
| Compute (EKS) | 10x m6i.xlarge + control plane | $1,500 |
| Database (RDS) | db.r6g.2xlarge Multi-AZ | $1,760 |
| Cache (ElastiCache) | cache.r6g.xlarge cluster | $880 |
| Queue (Kafka MSK) | 3x kafka.m5.large | $660 |
| AI (Hybrid) | 1M findings | $15,000 |
| Secrets | Vault cluster | $300 |
| Networking | Full mesh, WAF | $800 |
| Monitoring | Datadog/New Relic | $500 |
| Total | | ~$21,400/mo |
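
Summing the line items for each tier confirms the rounded totals and shows how the cost mix shifts with scale. The figures below are copied from the three tables above; the dictionary keys are shortened labels chosen for this sketch.

```python
# Monthly cost per component in USD, copied from the three tables above.
DEPLOYMENTS = {
    "small": {"Compute": 211, "Database": 50, "Cache": 12, "Queue": 5,
              "AI": 260, "Secrets": 10, "Networking": 50},
    "medium": {"Compute": 365, "Database": 440, "Cache": 220, "Queue": 15,
               "AI": 1800, "Secrets": 50, "Networking": 200,
               "Monitoring": 100},
    "large": {"Compute": 1500, "Database": 1760, "Cache": 880, "Queue": 660,
              "AI": 15000, "Secrets": 300, "Networking": 800,
              "Monitoring": 500},
}


def summarize(components):
    """Return (total monthly cost, AI share of that total)."""
    total = sum(components.values())
    return total, components["AI"] / total


for tier, components in DEPLOYMENTS.items():
    total, ai_share = summarize(components)
    print(f"{tier}: ${total:,}/mo, AI share {ai_share:.0%}")
```

The sums are $598, $3,190, and $21,400, matching the rounded totals. AI is the largest line item in every tier, growing from roughly 43% of the small deployment to about 70% of the large one, so AI spend is where optimization has the most leverage.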

9. Cost Optimization Recommendations

  1. Reserved Instances: 35-50% savings on compute/database
  2. Savings Plans: Commit to a 1-year term for additional discounts
  3. Spot Instances: Use for batch processing (Checkov scans)
  4. AI Caching: Cache AI responses for similar findings (estimated 30% cost reduction)
  5. S3 Intelligent-Tiering: For finding archives
  6. Right-Sizing: Monthly review of instance utilization
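
The AI caching recommendation can be sketched as a wrapper that keys responses on the fields that determine the analysis. This is a minimal in-process illustration: the field names (`rule_id`, `resource_type`, `severity`) are hypothetical, and what counts as a "similar" finding is an assumption that would need to be defined for the real data model.

```python
import hashlib
import json


def cache_key(finding, fields=("rule_id", "resource_type", "severity")):
    """Hash only the fields assumed to determine the analysis."""
    material = {name: finding.get(name) for name in fields}
    encoded = json.dumps(material, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class CachedAnalyzer:
    """Wrap an analysis callable and reuse results for similar findings."""

    def __init__(self, analyze):
        self._analyze = analyze  # callable: finding dict -> analysis
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, finding):
        key = cache_key(finding)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._analyze(finding)
        self._cache[key] = result
        return result


# Stub standing in for a paid AI call.
analyzer = CachedAnalyzer(lambda f: {"summary": f"{f['rule_id']} on {f['resource_type']}"})

first = analyzer({"rule_id": "S3-001", "resource_type": "bucket",
                  "severity": "high", "resource_id": "bucket-a"})
second = analyzer({"rule_id": "S3-001", "resource_type": "bucket",
                   "severity": "high", "resource_id": "bucket-b"})
print(analyzer.hits, analyzer.misses)
```

The two findings differ only in `resource_id`, which is not part of the key, so the second call is served from the cache. In a multi-instance deployment the dictionary could be replaced by the Redis cache already selected in section 1, with a TTL so that stale analyses expire.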