Fossil/knowledge-base

Stanislav Hubacek ef3c2f75b1 18.6.2026

2026-06-18 16:25:33 +02:00

☁️ Cloud Architecture

Providers

AWS — largest market share, broadest portfolio
Azure — strong integration with Microsoft ecosystem
GCP — Kubernetes (GKE), data & ML, network connectivity

Deployment Models

Model	Description
Public cloud	Shared provider infrastructure
Private cloud	Dedicated infrastructure (on-prem or hosted)
Hybrid cloud	Public + private interconnection
Multi-cloud	Multiple public providers

Multi-cloud Strategy

Reasons for Multi-cloud

Vendor lock-in prevention — risk diversification
Regulatory requirements — data residency in specific regions
Best-of-breed — each provider has strengths (AWS networking, Azure enterprise, GCP data/ML)
Acquisition scenarios - consolidating environments inherited through mergers and acquisitions

Multi-cloud Connectivity

Method	Latency	Throughput	Cost
Site-to-Site VPN	Medium	Limited	Low
Private interconnect (Direct Connect / ExpressRoute / Dedicated Interconnect)	Low	High	High
Cloud-to-cloud VPN	Medium	Medium	Medium
SD-WAN	Low	High	Medium

Challenges

Network complexity — different VPC/VNet concepts, security models
IAM federation — unified identities across clouds (SSO, SAML, OIDC)
Data gravity — moving data between clouds is expensive and slow
Monitoring — single pane of glass across clouds (Grafana, Datadog)

Cloud Adoption Frameworks (CAF)

Each major provider has its own Cloud Adoption Framework for a structured approach to cloud adoption:

Provider	Framework	Focus
AWS	AWS CAF	6 perspectives: Business, People, Governance, Platform, Security, Operations
Azure	Microsoft CAF	8 methodologies: Strategy, Plan, Ready, Migrate, Innovate, Govern, Manage, Secure
GCP	Google Cloud Adoption Framework	4 themes: Learn, Lead, Scale, Secure

Multi-Cloud Administration Guide (Mulder, 2024) recommends combining CAF frameworks across providers for unified governance models, especially in:

Interoperability — standardization of APIs and IaC across clouds (Terraform, Pulumi)
Data governance — unified policy for data residency and lifecycle
Compliance automation — automated audits across clouds (AWS Config, Azure Policy, GCP Org Policies)
Access management — identity federation and centralized RBAC

Migration Strategies — 6 Rs

Strategy	Description	Difficulty	Typical Scenario
Rehost (Lift & Shift)	Move VM/as-is without changes	Low	Quick migration, datacenter exit, minimal risk
Replatform (Lift & Reshape)	Migration with minor adjustments (e.g., RDS instead of self-managed DB)	Medium	Optimization without rewriting the application
Refactor (Re-architect)	Rewrite application as cloud-native (microservices, serverless)	High	Maximize cloud benefit, long-term strategy
Repurchase	Move to SaaS (e.g., Salesforce, Workday)	Low	Application is outdated, SaaS alternative exists
Retire	Decommission unused applications	Low	Application no longer in use
Retain	Keep on-prem	None	Regulatory reasons, too high migration risk

Decision Framework for 6 Rs

Start: Is the application needed?
  ├── No → Retire
  └── Yes → Does a SaaS alternative exist?
       ├── Yes → Repurchase
       └── No → Is refactoring worthwhile?
            ├── Yes → Refactor
            └── No → Is platform change sufficient?
                 ├── Yes → Replatform
                 └── No → Rehost
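
The decision tree above can be sketched as a small function. This is a minimal illustration: the `AppAssessment` fields are hypothetical names for the four questions in the tree, and Retain is left out because the tree itself has no branch for it.

```python
from dataclasses import dataclass


@dataclass
class AppAssessment:
    """Answers to the four questions in the tree (field names are illustrative)."""
    still_needed: bool
    saas_alternative_exists: bool
    refactoring_worthwhile: bool
    platform_change_sufficient: bool


def choose_migration_strategy(app: AppAssessment) -> str:
    """Walk the tree top to bottom and return the first matching strategy."""
    if not app.still_needed:
        return "Retire"
    if app.saas_alternative_exists:
        return "Repurchase"
    if app.refactoring_worthwhile:
        return "Refactor"
    if app.platform_change_sufficient:
        return "Replatform"
    return "Rehost"
```

For example, an application that is still needed, has no SaaS alternative, is not worth refactoring, but benefits from a managed database maps to `"Replatform"`.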

Well-Architected Framework (AWS)

Operational Excellence — automation, monitoring, documentation
Security — IAM, encryption, compliance
Reliability — recovery, scaling, backup plans
Performance Efficiency — right-sizing, choosing the right services
Cost Optimization — FinOps, reserved instances, spot instances
Sustainability (since late 2021) - carbon footprint, energy efficiency

Analogues: Azure Well-Architected Framework, GCP Architecture Framework

Key Questions from Well-Architected Review (~60 questions)

Operational Excellence (12 questions)

How are changes managed and automated?
How are operations documented and shared within the team?
How does the team respond to planned and unplanned operational events?
What runbooks exist for common operational scenarios?
How are incident management and postmortems conducted?

Security (12 questions)

How is identity & access management implemented?
How is data protected at rest and in transit?
How is security incident detection ensured?
What are the procedures for patch management and vulnerability remediation?
How are infrastructure credentials and secrets managed?

Reliability (12 questions)

How is service availability ensured during a component failure?
How is backup and disaster recovery implemented?
How do service limits (quotas, throttling) affect reliability?
How does automatic scaling work under changing load?
What are the SLI/SLO metrics and how are they monitored?

Performance Efficiency (12 questions)

How is the correct type and size of compute/storage selected?
How is the database layer optimized (indexes, queries, caching)?
How is monitoring used to identify bottlenecks?
How is scaling implemented (vertical vs horizontal)?

Cost Optimization (12 questions)

How are costs allocated to teams/projects (chargeback/showback)?
What tools are used for cost analysis?
How are unused resources identified and eliminated?
How is licensing optimized (BYOL, hybrid benefit)?

Key Components

Compute Layer

VM / instances — EC2, Azure VMs, GCE
Container orchestration — EKS, AKS, GKE
Serverless — Lambda, Azure Functions, Cloud Functions
PaaS — App Engine, Elastic Beanstalk, Azure App Service

Compute Comparison Matrix (AWS EC2)

Family	Type	vCPU:Memory	Use Case	Example Pricing (on-demand, us-east-1)
General purpose	M7g, m7i	1:4	Web servers, microservices, dev/test	m7i.large ~$0.088/h
Compute optimized	C7g, c7i	1:2	HPC, batch processing, CI/CD, gaming	c7i.large ~$0.078/h
Memory optimized	R7g, r7i, x2idn	1:8 to 1:32	In-memory DB (Redis), SAP HANA, real-time analytics	r7i.large ~$0.118/h
Storage optimized	I4i, im4gn	1:4 + NVMe	Transactional DB, data warehousing, Kafka	i4i.large ~$0.138/h
GPU / ML	P5, g5, trn1	GPU attach	AI training (P5), inference (g5), ML (trn1)	g5.xlarge ~$1.006/h

See GPU.en.md for GPU model and configuration details.

Storage

Object storage — S3, Blob Storage, Cloud Storage
Block storage — EBS, managed disks, persistent disks
File storage — EFS, Azure Files, Filestore
CDN — CloudFront, Azure CDN, Cloud CDN

S3 Storage Classes

Class	Availability	Retrieval Time	Price / GB / Month	Use Case
S3 Standard	99.99 %	milliseconds	~$0.023	Active data, frequent access
S3 Intelligent-Tiering	99.9 %	milliseconds	~$0.023 + monitoring fee	Unknown/variable access patterns
S3 Standard-IA	99.9 %	milliseconds	~$0.0125	Less frequent but fast access
S3 One Zone-IA	99.5 %	milliseconds	~$0.01	Reproducible data
S3 Glacier Instant	99.9 %	milliseconds	~$0.004	Archive with occasional access
S3 Glacier Flexible	99.99 %	1-5 min (expedited) / 3-5 h (standard)	~$0.0036	Long-term archive
S3 Glacier Deep Archive	99.99 %	12 h (standard) / 48 h (bulk)	~$0.00099	Cheapest, compliance archives
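
To make the price differences concrete, the sketch below computes storage-only monthly cost from the approximate per-GB prices in the table. It deliberately ignores request, retrieval, and minimum-storage-duration charges, and it omits Intelligent-Tiering because its monitoring fee depends on object count; the function and dictionary names are illustrative.

```python
# Approximate storage prices in USD per GB-month, copied from the table above.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 One Zone-IA": 0.01,
    "S3 Glacier Instant": 0.004,
    "S3 Glacier Flexible": 0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only cost; request, retrieval, and minimum-duration fees are ignored."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


if __name__ == "__main__":
    # Compare the classes for roughly 10 TB of data.
    for name in PRICE_PER_GB_MONTH:
        print(f"{name:<26} {monthly_storage_cost(10_000, name):>8.2f} USD/month")
```

With these figures, 10,000 GB costs about 230 USD per month in S3 Standard and about 9.90 USD in Glacier Deep Archive, before any retrieval fees.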

Multi-AZ and Multi-Region Architecture

Region ┌──────────────────────────────┐
       │  ┌─────────────────────────┐ │
       │  │      Load Balancer      │ │
       │  └─┬──────────┬──────────┬─┘ │
       │    │          │          │   │
       │  AZ-1       AZ-2       AZ-3  │
       │  ┌─▼─┐      ┌─▼─┐      ┌─▼─┐ │
       │  │APP│      │APP│      │APP│ │
       │  └─┬─┘      └─┬─┘      └─┬─┘ │
       │    │          │          │   │
       │  ┌─▼──────────▼──────────▼─┐ │
       │  │  Database (Primary)     │ │
       │  │  + Read Replica         │ │
       │  └─────────────────────────┘ │
       └──────────────────────────────┘

Disaster Recovery Strategies

DR Strategies on AWS (from least to most prepared)

Strategy	RTO	RPO	Cost	Description
Backup & Restore	hours	24 h	Low	Regular data backups to S3/Glacier, restore in DR region
Pilot Light	tens of minutes	minutes	Medium	Minimal running copy (DB, core services), scale on failover
Warm Standby	minutes	seconds	High	Reduced production copy running, scale on failover
Active-Active (Multi-Region)	seconds	< 1 s	Very high	Fully active in multiple regions, traffic routing (Route 53, Global Accelerator)
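
Because the table is ordered from cheapest to most expensive, choosing a strategy amounts to picking the first row that meets the RTO and RPO targets. The sketch below illustrates this; the numeric bounds are assumptions chosen to match the qualitative values in the table ("hours", "tens of minutes", and so on), not AWS guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrStrategy:
    name: str
    rto_seconds: float  # assumed upper bound for recovery time
    rpo_seconds: float  # assumed upper bound for the data loss window


# Ordered from cheapest to most expensive, as in the table above.
# The numeric bounds are illustrative assumptions.
STRATEGIES = [
    DrStrategy("Backup & Restore", rto_seconds=8 * 3600, rpo_seconds=24 * 3600),
    DrStrategy("Pilot Light", rto_seconds=30 * 60, rpo_seconds=5 * 60),
    DrStrategy("Warm Standby", rto_seconds=5 * 60, rpo_seconds=30),
    DrStrategy("Active-Active (Multi-Region)", rto_seconds=30, rpo_seconds=1),
]


def cheapest_strategy(rto_target_s: float, rpo_target_s: float) -> str:
    """Return the first (cheapest) strategy whose assumed RTO and RPO meet the targets."""
    for strategy in STRATEGIES:
        if strategy.rto_seconds <= rto_target_s and strategy.rpo_seconds <= rpo_target_s:
            return strategy.name
    raise ValueError("no listed strategy meets the targets")
```

Under these assumed bounds, a one-hour RTO with a ten-minute RPO selects Pilot Light.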

Key books on the topic:

Engineering Resilient Systems on AWS (Schwarz, Moran, Bachmeier, 2024) — practical labs for resilience patterns: back off and retry, multi-Region failover, circuit breaker, chaos engineering using AWS Fault Injection Simulator
Building Resilient Architectures on AWS (2025) — data security, backup strategies, recovery plan automation

Chaos Engineering

Deliberate fault injection to verify system resilience:

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) - managed fault injection for EC2, ECS, EKS, RDS
Tools: Chaos Mesh (Kubernetes), Gremlin, Litmus
Process: define hypothesis → run experiment → measure impact → improve system
Safety: experiments in isolated environment, safety controls, automatic rollback

Cloud Design Patterns

Strangler Fig

Gradually replacing parts of a monolithic application with microservices.

Legacy functionality is progressively redirected to new services
A routing layer in front of the legacy system (a proxy using route rules, headers, or feature flags) controls traffic migration
Advantage: incremental value delivery without big-bang rewrite

Circuit Breaker

Prevents cascading failures when a dependent service fails.

Three states: Closed (normal operation), Open (requests immediately fail), Half-Open (test request after timeout)
Parameters: failure threshold, timeout (reset timeout), half-open max requests
Implementations: Resilience4j, Hystrix (legacy), Istio (Envoy), AWS App Mesh
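
The three states can be sketched in a few lines. This is a simplified, single-threaded illustration rather than how any of the libraries above implement it: the class and parameter names are hypothetical, and "half-open max requests" is effectively fixed at one trial call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency is unavailable, failing fast")
        try:
            result = func()  # in the half-open state this is the trial request
        except Exception:
            self._failures += 1
            # A failed trial, or too many failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each outbound request, for example `breaker.call(lambda: client.get(url))` with a hypothetical `client`, and treats `CircuitOpenError` as a signal to use a fallback.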

Saga

Distributed transaction across microservices — a series of local transactions with compensating actions.

Choreography — each service publishes an event, the next service reacts (Kafka, EventBridge)
Orchestration — central orchestrator manages steps (Step Functions, Temporal, Camunda)
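
The orchestration variant can be illustrated with a minimal in-process orchestrator. This is a sketch under simplifying assumptions: steps are plain callables, compensations are assumed to succeed, and nothing is persisted, whereas engines such as Step Functions or Temporal add durable state and retries. `run_saga` and the step names are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse order.

    Returns a log of what happened so the outcome is easy to inspect.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            # Undo the local transactions that already committed, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log
```

If a hypothetical `charge_card` step fails after `reserve_stock` succeeded, the log ends with `reserve_stock: compensated`, and later steps never run.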

CQRS (Command Query Responsibility Segregation)

Separation of write (Command) and read (Query) models.

Command model: optimized for writes (normalized, transactional)
Query model: optimized for reads (denormalized, read-optimized views)
Eventual consistency between models (event bus propagates changes)
Use case: reporting, audit logs, high-throughput systems

Event Sourcing

Storing state as a sequence of events, not the current state.

Each change is an append-only event in an event store
Current state = fold of all events
Advantages: audit trail, time travel, CQRS compatibility
Implementations: EventStoreDB, Kafka (log), DynamoDB + CDC
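
"Current state = fold of all events" can be shown directly with `functools.reduce`. The bank-account domain, the event types, and the in-memory list standing in for the event store are all illustrative assumptions.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event types)
    amount: int  # amount in cents


def apply_event(balance: int, event: Event) -> int:
    """Pure transition function: previous state plus one event gives the next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def current_balance(events: Iterable[Event]) -> int:
    """Current state is the left fold of all events over the initial state 0."""
    return reduce(apply_event, events, 0)


# The append-only event store is modeled as a plain list here.
event_store = [
    Event("deposited", 10_000),
    Event("withdrawn", 2_500),
    Event("deposited", 500),
]
```

`current_balance(event_store)` returns `8000`, and folding only a prefix such as `event_store[:2]` returns `7500`, which is the "time travel" property mentioned above.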

Additional Cloud Patterns (Wilder — Cloud Architecture Patterns)

Pattern	Category	Description
Horizontally Scaling Compute	Scalability	Adding/removing instances based on load, elasticity
Queue-Centric Workflow	Scalability	Decoupling components via queues (SQS, RabbitMQ), async processing
Auto-Scaling	Scalability	Automatic scaling based on metrics (CPU, memory, request count)
MapReduce	Big Data	Distributed data processing (Hadoop, EMR, BigQuery)
Database Sharding	Big Data	Horizontal data partitioning across databases
Busy Signal	Failure Handling	Graceful degradation under overload (HTTP 503, throttling, backpressure)
Node Failure	Failure Handling	Detection and automatic recovery from compute node failure
Colocation	Distributed Users	Placing compute close to data to reduce latency
Valet Key	Distributed Users	Delegated storage access (SAS tokens, S3 presigned URLs)
Multi-Site Deployment	Distributed Users	Active deployment in multiple geographic locations
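
On the client side, the Busy Signal pattern is usually paired with retries that back off. The sketch below uses exponential backoff with full jitter; `BusyError` is a hypothetical stand-in for a throttling response such as HTTP 503, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BusyError(Exception):
    """Stands in for a throttling response such as HTTP 503 or 429."""


def call_with_backoff(func: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.1, max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry func on BusyError with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except BusyError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the busy signal to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the function can be exercised in tests without real waiting.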

Evolutionary Architecture

Definition (Ford, Parsons, Kua, 2022): An evolutionary architecture supports guided, incremental change across multiple dimensions.

Fitness Functions

Automated checks of architectural characteristics — analogous to tests for architecture:

Type	Description	Example
Atomic	Checks a single metric	Cyclomatic complexity < 10
Holistic	Checks the overall system	End-to-end latency < 200 ms
Triggered	Triggered by event (CI/CD commit, deployment)	API contract verification
Continuous	Runs continuously in production	Monitoring dependency freshness
Static	Code analysis without execution	SonarQube, ESLint
Dynamic	Runtime analysis	Load tests, chaos experiments
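
An atomic, static fitness function such as "cyclomatic complexity < 10" could be sketched with the standard `ast` module. The counting rule below is a simplified approximation of McCabe complexity, not what SonarQube or any other tool computes, and the function names are illustrative.

```python
import ast
from typing import List

# Node types that add a decision point (a simplified approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp,
                 ast.comprehension, ast.Assert)


def cyclomatic_complexity(func: ast.AST) -> int:
    """1 + number of decision points; each extra operand of and/or counts as one."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1
    return score


def complexity_violations(source: str, limit: int = 10) -> List[str]:
    """Return names of functions whose approximate complexity is not below the limit."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and cyclomatic_complexity(node) >= limit
    ]
```

Run in a CI job over changed files, a non-empty result would fail the build, which makes this a triggered fitness function as well.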

Principles of Evolutionary Architecture

Incremental change — small, safe changes thanks to CI/CD, deployment pipelines, mature DevOps
Fitness functions — automated protection of architectural characteristics (scalability, performance, security)
Coupling management — conscious work with component connections (affinity, volatility, cycles)
Evolutionary data — database migrations as first-class citizens (evolutionary schemas, expand-contract pattern)

Antipatterns

Big Design Up Front (BDUF) — trying to design everything upfront, ignoring change
No Design at All — absence of architectural thinking, purely emergent design
Premature Standardization — introducing standards before the domain is understood

Hybrid Cloud Connectivity

See also: NETWORKING.en.md — network architecture (VPN, BGP, VPC design).

Site-to-Site VPN — IPSec tunnel over the internet
Direct Connect / ExpressRoute / Dedicated Interconnect — private physical connection
Cloud VPN / Transit Gateway — hub-and-spoke topology

Cost Optimization Detail

Savings Plans vs Reserved Instances

Property	Compute Savings Plan	EC2 Instance Savings Plan	Reserved Instances
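Flexibility	Any instance family, size, region, OS, and tenancy; also applies to Fargate and Lambda	Any size, OS, and tenancy within one instance family and region	Specific instance configuration (Convertible RIs can be exchanged)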
Term	1 or 3 years	1 or 3 years	1 or 3 years
Discount (typical)	~30-50 %	~40-60 %	~40-60 %
Change instance	Yes (any)	Yes (within family)	No
Change region	Yes	No	No
Payment options	All Upfront / Partial / No Upfront	All Upfront / Partial / No Upfront	All Upfront / Partial / No Upfront

Spot Instance Best Practices

Diversification - use a mix of instance types and Availability Zones (for example via Spot Fleet) for higher availability
Graceful handling - the application must react to the interruption notice (two-minute warning)
Checkpointing — regular state saving for restart after spot interruption
Spot blocks (defined-duration Spot, 1-6 h) - discontinued by AWS and no longer available; do not plan around them
Use cases: batch processing, CI/CD runners, stateless microservices, ML training
Avoid: stateful workloads, databases (without special design)
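
Checkpointing can be as simple as periodically writing progress to a file and resuming from it after a restart. The sketch below writes to a local path for brevity; on real Spot capacity the checkpoint would need to live on storage that survives the instance (for example S3 or EFS), and the function names and state layout are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_item": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def process(items: list, path: str, checkpoint_every: int = 100) -> int:
    """Sum the items, resuming from the last checkpoint after a restart."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After an interruption, calling `process` again with the same path skips the items already counted, so at most `checkpoint_every` items of work are repeated.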

Organization and Governance

AWS Organizations

Root OU
├── Security OU
│   ├── Audit Account (CloudTrail, Config)
│   └── Security Tooling Account (GuardDuty, Security Hub)
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   ├── Shared Services Account (AD, SSO)
│   └── Log Archive Account
├── Workloads OU
│   ├── Dev OU → individual dev accounts
│   ├── Staging OU → staging accounts
│   └── Prod OU → production accounts
└── Sandbox OU → isolated experimental accounts

SCPs (Service Control Policies) - allow lists or deny lists of services and actions, attached at the root, OU, or account level
Tag policies — enforce tagging across accounts
AI services opt-out — control data usage in AWS AI services
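
As an illustration of an SCP used as a guardrail, the sketch below builds a region-restriction policy as a Python dictionary and prints it as JSON. The allowed regions and the list of exempted global services are example values chosen for this sketch, not a recommendation; a real policy needs a reviewed exemption list.

```python
import json

# Regions are examples only; adjust to your own data-residency requirements.
ALLOWED_REGIONS = ["eu-central-1", "eu-west-1"]

region_guardrail_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            # Global services are excluded from the deny so they keep working.
            "NotAction": ["iam:*", "organizations:*", "route53:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}
            },
        }
    ],
}

print(json.dumps(region_guardrail_scp, indent=2))
```

Attached to the Workloads OU from the tree above, a policy like this would deny requests to any other region for every account underneath it, regardless of the IAM permissions granted inside those accounts.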

Azure Management Groups

Tenant Root Group
├── Platform MG
│   ├── Connectivity (hub VNet, ExpressRoute)
│   ├── Management (Log Analytics, Automation)
│   └── Identity (AD DS, PIM)
├── Application MG
│   ├── DEV (dev subscriptions)
│   ├── TEST (test subscriptions)
│   └── PROD (production subscriptions)
└── Sandbox MG

Azure Policy — built-in and custom policies (similar to SCP)
Management Group hierarchy — up to 6 levels deep
Subscription limits — max 10,000 subscriptions per tenant

GCP Projects

Organization Node
├── Folder: Platform
│   ├── Project: Shared Networking (VPC, Cloud NAT, VPN)
│   ├── Project: Security (Cloud KMS, Secret Manager, Chronicle)
│   └── Project: Monitoring (Cloud Monitoring, Logging)
├── Folder: Workloads
│   ├── Folder: Dev
│   │   └── Project: [app]-dev
│   ├── Folder: Staging
│   │   └── Project: [app]-staging
│   └── Folder: Prod
│       └── Project: [app]-prod
└── Folder: Sandbox
    └── Project: [user]-sandbox

Organization policies — constraints at organization/folder level
Resource Manager — hierarchy: Organization → Folder → Project → Resources
Project limits — max 30 projects (can be increased), 10k resources per project

12-Factor App Methodology

Methodology for building cloud-native applications (Heroku, 2011), expanded by the book Multi-Cloud Handbook for Developers (Natarajan, Jacob, 2024).

#	Factor	Description	Cloud Implementation
1	Codebase	One repo, many deployments	Git + CI/CD pipeline
2	Dependencies	Explicit dependency declaration	package.json, requirements.txt, Docker image
3	Config	Configuration in environment variables	Secrets Manager, Parameter Store, env vars
4	Backing services	Dependent services as attached resources	RDS, S3, Redis — connection via connection string
5	Build, release, run	Strict separation of the build, release, and run stages	CI/CD pipeline (GitHub Actions, GitLab CI)
6	Processes	Application as stateless processes	Horizontal scaling, session in Redis
7	Port binding	Service is self-contained and exposes itself by binding to a port, with no external web server injected at runtime	Express, FastAPI, Spring Boot on own port
8	Concurrency	Scaling via process model	Horizontal Pod Autoscaler (K8s), EC2 Auto Scaling
9	Disposability	Fast startup and graceful shutdown	Health checks, SIGTERM handling, preStop hooks
10	Dev/Prod parity	Minimal difference between environments	Docker, IaC (Terraform), same backing services
11	Logs	Logs as event streams	stdout/stderr → CloudWatch, ELK, Datadog
12	Admin processes	Admin tasks as one-off processes	DB migrations, data backfill — run in isolation
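
Factors 3, 4, and 7 often meet in one place: a settings loader that reads everything from the environment. The sketch below is one possible shape; the variable names `DATABASE_URL`, `PORT`, and `LOG_LEVEL` and the defaults are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str  # factor 4: backing service addressed by a connection string
    port: int          # factor 7: the port the service binds to
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables; fail fast if a required one is missing."""
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set") from None
    return Settings(
        database_url=database_url,
        port=int(env.get("PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the environment as a parameter keeps the loader testable, and the same build artifact can then run in dev, staging, and production with only the environment changed.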

Multi-cloud Extensions (Multi-Cloud Handbook for Developers)

API-first design — consistent API interfaces across clouds (REST, gRPC)
Domain-Driven Design (DDD) — bounded contexts mapped to cloud services
Service Mesh — Istio, Linkerd for observability, traffic management and security across clouds
GitOps — declarative deployment with ArgoCD/Flux across Kubernetes clusters in different clouds

Azure Cloud Native Architecture (Map Book)

Based on The Azure Cloud Native Architecture Mapbook (2nd ed.) (Eyskens, 2025) — 40+ architectural maps across domains:

Domains of Architectural Maps

Domain	Key Azure Services	Architectural Patterns
Infrastructure	VNet, Azure Firewall, ExpressRoute, VPN Gateway	Hub-and-spoke, Virtual WAN, Private Link
Applications	App Service, API Management, Service Bus, Functions	Event-driven, Strangler Fig, Backend for Frontend
Data	Cosmos DB, SQL Database, Synapse, Data Lake	CQRS, Event Sourcing, Polyglot Persistence
Container Orchestrators	AKS, Azure Container Apps (ACA)	Sidecar, Ambassador, Adapter (service mesh)
AI	Azure OpenAI, Cognitive Services, ML Studio	RAG, model fine-tuning, MLOps
Security	Entra ID, Defender for Cloud, Key Vault, Sentinel	Zero Trust, Defense in depth, JIT Access

Cloud Adoption Framework on Azure

Strategy — business case, application catalog, portfolio rationalization
Plan — landing zone design, governance baseline, subscription taxonomy
Ready — landing zone implementation (ALZ), Azure Policy, Networking, Identity
Migrate — assessment (Azure Migrate), rehost/replatform, test and cutover
Govern — cost management, policy enforcement, compliance monitoring

Cloud Provider Comparison

Based on Cloud Computing: AWS, Azure, Google Cloud (Sario, 2025):

Area	AWS	Azure	GCP
Compute	EC2, Lambda, ECS/EKS	VMs, Functions, AKS	GCE, Cloud Functions, GKE
Storage	S3, EBS, EFS	Blob, Disk, Files	Cloud Storage, Persistent Disk, Filestore
Relational DB	RDS (MySQL, PG, SQL Server, Oracle, MariaDB)	SQL Database, MySQL/PostgreSQL	Cloud SQL (MySQL, PG, SQL Server)
NoSQL DB	DynamoDB, ElastiCache	Cosmos DB, Azure Cache for Redis	Firestore, Bigtable, Memorystore
Message queue	SQS, SNS	Service Bus, Queue Storage	Pub/Sub, Cloud Tasks
Observability	CloudWatch, X-Ray	Azure Monitor, Application Insights	Cloud Monitoring, Cloud Trace
AI/ML	SageMaker, Bedrock	Azure ML, Azure OpenAI	Vertex AI, AutoML
Pricing (compute)	On-demand, Reserved, Spot, Savings Plan	Pay-as-you-go, Reserved, Spot	On-demand, Committed Use, Spot

OpenStack as Private Cloud

OpenStack is the dominant open-source platform for building private clouds (IaaS). It provides compute (Nova), networking (Neutron), and storage services (Cinder/Swift/Manila) with a unified API.

Advantages over Commercial Solutions

Vendor-neutral API - avoids lock-in to proprietary virtualization platforms (VMware, Hyper-V)
Multi-tenancy — Keystone identity, RBAC, projects, quotas
Hybrid cloud ready — federation with AWS/Azure/GCP, Terraform provisioning
Ecosystem - dozens of services (Heat orchestration, Magnum containers, Designate DNS)

Suitable Scenarios

Scenario	Key Services
Data center with multi-tenancy and self-service	Nova, Neutron, Cinder, Horizon
Telco / NFVI / MEC	Neutron (DPDK, SR-IOV), Nova (NUMA pinning)
Science and HPC	Cyborg (GPU), Manila (NAS), Ironic (bare metal)
Academic clouds	Keystone federation, Trove (DBaaS)

Challenges

Significant deployment and operations complexity
Frequent releases (a new version every six months) with potentially breaking changes between versions
Limited enterprise support outside commercial distributions (Red Hat, Canonical, Mirantis)

Best Practices

Use infrastructure as code (Terraform, Pulumi, CDK)
Design for failure - every component can fail
Implement defense in depth - security at every layer
Monitor costs - tagging, budget alerts, anomaly detection
Use managed services where it makes sense (less operational overhead)
Least privilege for all IAM roles and policies
Cost tagging - assign tags for chargeback/showback (Environment, Team, Cost Center, Application)
Automated compliance - AWS Config, Azure Policy, GCP Org Policies for guardrails
Multi-account strategy - AWS Control Tower, Azure Landing Zones, GCP Resource Hierarchy

Resources

Links, books and standards: sources/cloud/sources.en.md

Recommended Reading

Book	Authors	ISBN	Description
The AI Cloud Infrastructure Blueprint	Thummarakoti, Vududala, Madupati, Kaushik	978-1-041-16642-9	End-to-end guide to designing, deploying, and managing AI systems on cloud platforms. Covers public/private/hybrid/multi-cloud models for AI, infrastructure for ML training and inference, MLOps. Target audience: architects, data scientists, DevOps.
AWS for Solutions Architects (3rd ed.)	Shrivastava, Srivastav, Thakur	978-1-83664-193-3	Practical guide to AWS architecture — compute (EC2, Lambda), storage (S3, EBS), databases (RDS, DynamoDB), networking, security, Well-Architected Framework, migration, cost optimization. Suitable for AWS Solutions Architect certification preparation.
