knowledge-base/CLOUD.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

☁️ Cloud Architecture

Providers

  • AWS — largest market share, broadest portfolio
  • Azure — strong integration with Microsoft ecosystem
  • GCP — Kubernetes (GKE), data & ML, network connectivity

Deployment Models

| Model | Description |
| --- | --- |
| Public cloud | Shared provider infrastructure |
| Private cloud | Dedicated infrastructure (on-prem or hosted) |
| Hybrid cloud | Public + private interconnection |
| Multi-cloud | Multiple public providers |

Multi-cloud Strategy

Reasons for Multi-cloud

  • Vendor lock-in prevention — risk diversification
  • Regulatory requirements — data residency in specific regions
  • Best-of-breed — each provider has strengths (AWS networking, Azure enterprise, GCP data/ML)
  • Acquisition scenarios — consolidating environments after mergers and acquisitions (M&A)

Multi-cloud Connectivity

| Method | Latency | Throughput | Cost |
| --- | --- | --- | --- |
| Site-to-Site VPN | Medium | Limited | Low |
| Private interconnect (Direct Connect / ExpressRoute / Dedicated Interconnect) | Low | High | High |
| Cloud-to-cloud VPN | Medium | Medium | Medium |
| SD-WAN | Low | High | Medium |

Challenges

  • Network complexity — different VPC/VNet concepts, security models
  • IAM federation — unified identities across clouds (SSO, SAML, OIDC)
  • Data gravity — moving data between clouds is expensive and slow
  • Monitoring — single pane of glass across clouds (Grafana, Datadog)

Cloud Adoption Frameworks (CAF)

Each major provider has its own Cloud Adoption Framework for a structured approach to cloud adoption:

| Provider | Framework | Focus |
| --- | --- | --- |
| AWS | AWS CAF | 6 perspectives: Business, People, Governance, Platform, Security, Operations |
| Azure | Microsoft CAF | 8 methodologies: Strategy, Plan, Ready, Migrate, Innovate, Govern, Manage, Secure |
| GCP | Google Cloud Adoption Framework | 4 themes: Learn, Lead, Scale, Secure (assessed across three phases: Tactical, Strategic, Transformational) |

Multi-Cloud Administration Guide (Mulder, 2024) recommends combining the providers' CAFs into a unified governance model, especially in the following areas:

  • Interoperability — standardization of APIs and IaC across clouds (Terraform, Pulumi)
  • Data governance — unified policy for data residency and lifecycle
  • Compliance automation — automated audits across clouds (AWS Config, Azure Policy, GCP Org Policies)
  • Access management — identity federation and centralized RBAC

Migration Strategies — 6 Rs

| Strategy | Description | Difficulty | Typical Scenario |
| --- | --- | --- | --- |
| Rehost (Lift & Shift) | Move VMs as-is, without changes | Low | Quick migration, datacenter exit, minimal risk |
| Replatform (Lift & Reshape) | Migration with minor adjustments (e.g., RDS instead of a self-managed DB) | Medium | Optimization without rewriting the application |
| Refactor (Re-architect) | Rewrite the application as cloud-native (microservices, serverless) | High | Maximize cloud benefit, long-term strategy |
| Repurchase | Move to SaaS (e.g., Salesforce, Workday) | Low | Application is outdated, SaaS alternative exists |
| Retire | Decommission unused applications | Low | Application no longer in use |
| Retain | Keep on-prem | None | Regulatory reasons, migration risk too high |

Decision Framework for 6 Rs

Start: Is the application needed?
  ├── No → Retire
  └── Yes → Does a SaaS alternative exist?
       ├── Yes → Repurchase
       └── No → Is refactoring worthwhile?
            ├── Yes → Refactor
            └── No → Is platform change sufficient?
                 ├── Yes → Replatform
                 └── No → Rehost
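
The tree above can be read as a chain of early returns. The following is a minimal sketch of it in Python; the function name and the four boolean parameters are hypothetical, chosen only to mirror the four questions in the tree.

```python
def choose_migration_strategy(needed: bool, saas_exists: bool,
                              refactor_worthwhile: bool,
                              replatform_sufficient: bool) -> str:
    """Walk the decision tree top to bottom and return the first matching strategy."""
    if not needed:
        return "Retire"
    if saas_exists:
        return "Repurchase"
    if refactor_worthwhile:
        return "Refactor"
    if replatform_sufficient:
        return "Replatform"
    return "Rehost"


# A needed application with no SaaS alternative where small platform changes are enough.
print(choose_migration_strategy(True, False, False, True))
```

Retain has no branch in the tree, so this sketch never returns it; in practice it would be decided before the tree is entered (for example, for regulatory reasons).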

Well-Architected Framework (AWS)

  1. Operational Excellence — automation, monitoring, documentation
  2. Security — IAM, encryption, compliance
  3. Reliability — recovery, scaling, backup plans
  4. Performance Efficiency — right-sizing, choosing the right services
  5. Cost Optimization — FinOps, reserved instances, spot instances
  6. Sustainability (added in December 2021) — carbon footprint, energy efficiency

Analogues: Azure Well-Architected Framework, GCP Architecture Framework

Key Questions from Well-Architected Review (~60 questions)

Operational Excellence (12 questions)

  • How are changes managed and automated?
  • How are operations documented and shared within the team?
  • How does the team respond to expected and unexpected operational events?
  • What runbooks exist for common operational scenarios?
  • How are incident management and postmortems conducted?

Security (12 questions)

  • How is identity & access management implemented?
  • How is data protected at rest and in transit?
  • How is security incident detection ensured?
  • What are the procedures for patch management and vulnerability remediation?
  • How are infrastructure credentials and secrets managed?

Reliability (12 questions)

  • How is service availability ensured during a component failure?
  • How are backup and disaster recovery implemented?
  • How do service limits (quotas, throttling) affect reliability?
  • How does automatic scaling work under changing load?
  • What are the SLI/SLO metrics and how are they monitored?

Performance Efficiency (12 questions)

  • How is the correct type and size of compute/storage selected?
  • How is the database layer optimized (indexes, queries, caching)?
  • How is monitoring used to identify bottlenecks?
  • How is scaling implemented (vertical vs horizontal)?

Cost Optimization (12 questions)

  • How are costs allocated to teams/projects (chargeback/showback)?
  • What tools are used for cost analysis?
  • How are unused resources identified and eliminated?
  • How is licensing optimized (BYOL, hybrid benefit)?

Key Components

Compute Layer

  • VM / instances — EC2, Azure VMs, GCE
  • Container orchestration — EKS, AKS, GKE
  • Serverless — Lambda, Azure Functions, Cloud Functions
  • PaaS — App Engine, Elastic Beanstalk, Azure App Service

Compute Comparison Matrix (AWS EC2)

| Family | Types | vCPU : memory (GiB) | Use Case | Example Pricing (on-demand, us-east-1) |
| --- | --- | --- | --- | --- |
| General purpose | M7g, M7i | 1:4 | Web servers, microservices, dev/test | m7i.large ~$0.088/h |
| Compute optimized | C7g, C7i | 1:2 | HPC, batch processing, CI/CD, gaming | c7i.large ~$0.078/h |
| Memory optimized | R7g, R7i, X2idn | 1:8 to 1:32 | In-memory DB (Redis), SAP HANA, real-time analytics | r7i.large ~$0.118/h |
| Storage optimized | I4i, Im4gn | 1:4 + NVMe | Transactional DB, data warehousing, Kafka | i4i.large ~$0.138/h |
| GPU / ML | P5, G5, Trn1 | GPU/accelerator attached | AI training (P5), inference (G5), ML (Trn1) | g5.xlarge ~$1.006/h |

See GPU.en.md for GPU model and configuration details.

Storage

  • Object storage — S3, Blob Storage, Cloud Storage
  • Block storage — EBS, managed disks, persistent disks
  • File storage — EFS, Azure Files, Filestore
  • CDN — CloudFront, Azure CDN, Cloud CDN

S3 Storage Classes

| Class | Availability | Retrieval Time | Price (USD per GB-month) | Use Case |
| --- | --- | --- | --- | --- |
| S3 Standard | 99.99 % | milliseconds | ~$0.023 | Active data, frequent access |
| S3 Intelligent-Tiering | 99.9 % | milliseconds | ~$0.023 + monitoring fee | Unknown/variable access patterns |
| S3 Standard-IA | 99.9 % | milliseconds | ~$0.0125 | Less frequent but fast access |
| S3 One Zone-IA | 99.5 % | milliseconds | ~$0.01 | Reproducible data |
| S3 Glacier Instant Retrieval | 99.9 % | milliseconds | ~$0.004 | Archive with occasional access |
| S3 Glacier Flexible Retrieval | 99.99 % | 1-5 min (expedited) / 3-5 h (standard) | ~$0.0036 | Long-term archive |
| S3 Glacier Deep Archive | 99.99 % | 12 h (standard) / 48 h (bulk) | ~$0.00099 | Cheapest, compliance archives |
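
To make the price differences concrete, here is a small worked calculation using the approximate per-GB prices from the table above. It is a sketch only: the dictionary keys are illustrative labels, and the function ignores request, retrieval, monitoring, and minimum-storage-duration charges, which can dominate for the colder classes.

```python
# Storage price in USD per GB-month, copied from the table above (approximate).
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "ONEZONE_IA": 0.01,
    "GLACIER_INSTANT": 0.004,
    "GLACIER_FLEXIBLE": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only cost for one month; retrieval and request fees are not included."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


# Compare the monthly storage cost of 10 TB (10,000 GB) across classes.
for name in PRICE_PER_GB_MONTH:
    print(f"{name:17s} {monthly_storage_cost(10_000, name):8.2f} USD/month")
```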

Multi-AZ and Multi-Region Architecture

Region ┌──────────────────────────────┐
       │  AZ-1       AZ-2       AZ-3  │
       │  ┌───┐      ┌───┐      ┌───┐ │
       │  │APP│──────│APP│──────│APP│ │
       │  └─┬─┘      └─┬─┘      └─┬─┘ │
       │    │          │          │   │
       │  ┌─▼──────────▼──────────▼─┐ │
       │  │      Load Balancer      │ │
       │  └────────────┬────────────┘ │
       │               │              │
       │  ┌────────────▼────────────┐ │
       │  │  Database (Primary)     │ │
       │  │  + Read Replica         │ │
       │  └─────────────────────────┘ │
       └──────────────────────────────┘

Disaster Recovery Strategies

DR Strategies on AWS (from least to most prepared)

| Strategy | RTO | RPO | Cost | Description |
| --- | --- | --- | --- | --- |
| Backup & Restore | hours | 24 h | Low | Regular data backups to S3/Glacier, restore in DR region |
| Pilot Light | tens of minutes | minutes | Medium | Minimal running copy (DB, core services), scale on failover |
| Warm Standby | minutes | seconds | High | Reduced production copy running, scale on failover |
| Active-Active (Multi-Region) | seconds | < 1 s | Very high | Fully active in multiple regions, traffic routing (Route 53, Global Accelerator) |

Key books on the topic:

  • Engineering Resilient Systems on AWS (Schwarz, Moran, Bachmeier, 2024) — practical labs for resilience patterns: backoff and retry, multi-Region failover, circuit breaker, chaos engineering using AWS Fault Injection Service (formerly Fault Injection Simulator)
  • Building Resilient Architectures on AWS (2025) — data security, backup strategies, recovery plan automation

Chaos Engineering

Deliberate fault injection to verify system resilience:

  • AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) — managed fault injection for EC2, ECS, EKS, RDS
  • Tools: Chaos Mesh (Kubernetes), Gremlin, Litmus
  • Process: define hypothesis → run experiment → measure impact → improve system
  • Safety: experiments in isolated environment, safety controls, automatic rollback

Cloud Design Patterns

Strangler Fig

Gradually replacing parts of a monolithic application with microservices.

  • Legacy functionality is progressively redirected to new services
  • A routing layer in front of the monolith (the Strangler Fig proxy, using header-based routing or feature flags) controls traffic migration
  • Advantage: incremental value delivery without big-bang rewrite
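
The routing decision made by such a proxy can be sketched in a few lines. This is an illustration under stated assumptions, not a real proxy: the path prefixes, the rollout percentages, and the backend names "legacy" and "new-service" are all hypothetical.

```python
import hashlib

# Hypothetical rollout table: path prefix -> percentage of users sent to the new service.
MIGRATED_PREFIXES = {"/orders": 100, "/invoices": 25}


def route(path: str, user_id: str) -> str:
    """Return the backend that should serve this request."""
    for prefix, percent in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            # A stable hash keeps a given user on the same backend across requests.
            bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
            return "new-service" if bucket < percent else "legacy"
    # Anything not yet migrated stays on the monolith.
    return "legacy"


print(route("/orders/42", "alice"))   # fully migrated prefix
print(route("/reports/q3", "alice"))  # not migrated yet
```

Raising a prefix's percentage step by step moves traffic gradually; once it reaches 100 and stays stable, the corresponding legacy code can be removed.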

Circuit Breaker

Prevents cascading failures when a dependent service fails.

  • Three states: Closed (normal operation), Open (requests immediately fail), Half-Open (test request after timeout)
  • Parameters: failure threshold, timeout (reset timeout), half-open max requests
  • Implementations: Resilience4j, Hystrix (legacy), Istio (Envoy), AWS App Mesh
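
The three states can be illustrated with a minimal, single-threaded sketch. The class and parameter names are hypothetical and do not follow any of the libraries listed above; in particular, this version does not limit the number of Half-Open trial requests, so every call made after the reset timeout acts as a trial.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is Closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency is failing; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial in Half-Open, or too many failures in Closed, opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is Open, callers fail fast instead of waiting on timeouts, which is what keeps a failing dependency from exhausting threads or connections upstream.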

Saga

Distributed transaction across microservices — a series of local transactions with compensating actions.

  • Choreography — each service publishes an event, the next service reacts (Kafka, EventBridge)
  • Orchestration — central orchestrator manages steps (Step Functions, Temporal, Camunda)
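
The orchestration variant can be sketched as a loop over (action, compensation) pairs. This is a simplified in-process illustration with hypothetical step names; a real orchestrator such as those listed above would also persist progress and retry failed compensations.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, the compensations of all previously completed
    steps are executed in reverse order and False is returned.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def charge_card():
    raise RuntimeError("payment declined")  # simulated failure in step 2


ok = run_saga([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (charge_card, lambda: log.append("refund card")),
])
print(ok, log)
```

Only the stock reservation is compensated, because the payment step never completed.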

CQRS (Command Query Responsibility Segregation)

Separation of write (Command) and read (Query) models.

  • Command model: optimized for writes (normalized, transactional)
  • Query model: optimized for reads (denormalized, read-optimized views)
  • Eventual consistency between models (event bus propagates changes)
  • Use case: reporting, audit logs, high-throughput systems

Event Sourcing

Storing state as a sequence of events, not the current state.

  • Each change is an append-only event in an event store
  • Current state = fold of all events
  • Advantages: audit trail, time travel, CQRS compatibility
  • Implementations: EventStoreDB, Kafka (log), DynamoDB + CDC
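
The "current state = fold of all events" idea maps directly onto a reduce over the event list. The sketch below uses a hypothetical bank-account example with made-up event names; the in-memory list stands in for a real event store.

```python
from functools import reduce


def apply_event(balance, event):
    """Pure function: previous state + one event -> next state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


# Append-only log of events; nothing is ever updated in place.
events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

current_balance = reduce(apply_event, events, 0)
# "Time travel": replay only a prefix of the log to get a historical state.
balance_after_two = reduce(apply_event, events[:2], 0)
print(current_balance, balance_after_two)
```

For long event logs, implementations typically store periodic snapshots so that a replay starts from the latest snapshot rather than from the first event.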

Additional Cloud Patterns (Wilder — Cloud Architecture Patterns)

| Pattern | Category | Description |
| --- | --- | --- |
| Horizontally Scaling Compute | Scalability | Adding/removing instances based on load, elasticity |
| Queue-Centric Workflow | Scalability | Decoupling components via queues (SQS, RabbitMQ), async processing |
| Auto-Scaling | Scalability | Automatic scaling based on metrics (CPU, memory, request count) |
| MapReduce | Big Data | Distributed data processing (Hadoop, EMR, BigQuery) |
| Database Sharding | Big Data | Horizontal data partitioning across databases |
| Busy Signal | Failure Handling | Graceful degradation under overload (HTTP 503, throttling, backpressure) |
| Node Failure | Failure Handling | Detection and automatic recovery from compute node failure |
| Colocation | Distributed Users | Placing compute close to data to reduce latency |
| Valet Key | Distributed Users | Delegated storage access (SAS tokens, S3 presigned URLs) |
| Multi-Site Deployment | Distributed Users | Active deployment in multiple geographic locations |
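
On the calling side, a Busy Signal (an HTTP 503 or a throttling error) is commonly handled by retrying with exponential backoff and jitter. The following is a minimal sketch of that idea; `BusyError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary defaults.

```python
import random
import time


class BusyError(Exception):
    """Stands in for a throttling response such as HTTP 503."""


def call_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call func, retrying on BusyError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except BusyError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The random jitter spreads retries from many clients over time, so they do not hit the overloaded service again in synchronized waves.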

Evolutionary Architecture

Definition (Ford, Parsons, Kua, 2022): An evolutionary architecture supports guided, incremental change across multiple dimensions.

Fitness Functions

Automated checks of architectural characteristics — analogous to tests for architecture:

| Type | Description | Example |
| --- | --- | --- |
| Atomic | Checks a single metric | Cyclomatic complexity < 10 |
| Holistic | Checks the overall system | End-to-end latency < 200 ms |
| Triggered | Triggered by event (CI/CD commit, deployment) | API contract verification |
| Continuous | Runs continuously in production | Monitoring dependency freshness |
| Static | Code analysis without execution | SonarQube, ESLint |
| Dynamic | Runtime analysis | Load tests, chaos experiments |
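
As an illustration of an atomic, static fitness function, the sketch below approximates cyclomatic complexity with Python's standard `ast` module and reports functions that reach the limit. It is a deliberate simplification (for example, a boolean expression counts once regardless of its number of operands) and is not equivalent to what SonarQube or similar tools compute; the function names are hypothetical.

```python
import ast

# Node types counted as decision points (a simplified approximation of McCabe complexity).
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def cyclomatic_complexity(source: str) -> dict:
    """Return {function_name: approximate complexity} for every function in the source."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, _BRANCHES) for n in ast.walk(node))
            result[node.name] = 1 + branches
    return result


def check_complexity(source: str, limit: int = 10) -> list:
    """Fitness function: names of functions whose complexity is not below the limit."""
    return [name for name, c in cyclomatic_complexity(source).items() if c >= limit]


sample = "def f(x):\n    if x:\n        return 1\n    return 0\n"
print(cyclomatic_complexity(sample), check_complexity(sample))
```

Run as a CI step, a non-empty result from `check_complexity` would fail the build, which also makes this a triggered fitness function.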

Principles of Evolutionary Architecture

  1. Incremental change — small, safe changes thanks to CI/CD, deployment pipelines, mature DevOps
  2. Fitness functions — automated protection of architectural characteristics (scalability, performance, security)
  3. Coupling management — conscious work with component connections (affinity, volatility, cycles)
  4. Evolutionary data — database migrations as first-class citizens (evolutionary schemas, expand-contract pattern)

Antipatterns

  • Big Design Up Front (BDUF) — trying to design everything upfront, ignoring change
  • No Design at All — absence of architectural thinking, purely emergent design
  • Premature Standardization — introducing standards before the domain is understood

Hybrid Cloud Connectivity

See also: NETWORKING.en.md — network architecture (VPN, BGP, VPC design).

  • Site-to-Site VPN — IPsec tunnel over the internet
  • Direct Connect / ExpressRoute / Dedicated Interconnect — private physical connection
  • Cloud VPN / Transit Gateway — hub-and-spoke topology

Cost Optimization Detail

Savings Plans vs Reserved Instances

| Property | Compute Savings Plan | EC2 Instance Savings Plan | Reserved Instances |
| --- | --- | --- | --- |
| Flexibility | Any instance family, size, region, OS | Any size and OS within a committed instance family and region | Specific instance type |
| Term | 1 or 3 years | 1 or 3 years | 1 or 3 years |
| Discount (typical) | ~30-50 % | ~40-60 % | ~40-60 % |
| Change instance | Yes (any) | Yes (within family) | No (Standard RI); Convertible RIs can be exchanged |
| Change region | Yes | No | No |
| Payment options | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront |

Spot Instance Best Practices

  • Diversification — use a mix of instance types and Availability Zones (Spot Fleet / EC2 Fleet) for higher availability
  • Graceful handling — the application must handle the interruption notice (2-minute warning)
  • Checkpointing — save state regularly so work can resume after a Spot interruption
  • Spot blocks (AWS) — defined-duration Spot (1-6 h); discontinued by AWS and no longer available for new requests
  • Use cases: batch processing, CI/CD runners, stateless microservices, ML training
  • Avoid: stateful workloads, databases (without special design)

Organization and Governance

AWS Organizations

Root OU
├── Security OU
│   ├── Audit Account (CloudTrail, Config)
│   └── Security Tooling Account (GuardDuty, Security Hub)
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   ├── Shared Services Account (AD, SSO)
│   └── Log Archive Account
├── Workloads OU
│   ├── Dev OU → individual dev accounts
│   ├── Staging OU → staging accounts
│   └── Prod OU → production accounts
└── Sandbox OU → isolated experimental accounts

  • SCP (Service Control Policies) — allow or deny services and actions at the OU level (allow-list/deny-list)
  • Tag policies — enforce tagging across accounts
  • AI services opt-out — control data usage in AWS AI services
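
As an example of a deny-list SCP, the sketch below builds a policy document that denies requests outside a set of approved regions, which ties in with the data residency requirement mentioned earlier. The function name, the `Sid`, and the list of exempted global services are illustrative assumptions; a production policy needs a reviewed exemption list.

```python
import json


def region_restriction_scp(allowed_regions):
    """Build an SCP document that denies actions outside the approved regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            # Global services are exempted; this list is illustrative and incomplete.
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
            },
        }],
    }


policy = region_restriction_scp(["eu-central-1", "eu-west-1"])
print(json.dumps(policy, indent=2))
```

Attached to an OU, a policy of this shape applies to every account below it, including accounts created later.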

Azure Management Groups

Tenant Root Group
├── Platform MG
│   ├── Connectivity (hub VNet, ExpressRoute)
│   ├── Management (Log Analytics, Automation)
│   └── Identity (AD DS, PIM)
├── Application MG
│   ├── DEV (dev subscriptions)
│   ├── TEST (test subscriptions)
│   └── PROD (production subscriptions)
└── Sandbox MG

  • Azure Policy — built-in and custom policies (similar to SCP)
  • Management Group hierarchy — up to 6 levels deep
  • Management group limits — up to 10,000 management groups per tenant

GCP Projects

Organization Node
├── Folder: Platform
│   ├── Project: Shared Networking (VPC, Cloud NAT, VPN)
│   ├── Project: Security (Cloud KMS, Secret Manager, Chronicle)
│   └── Project: Monitoring (Cloud Monitoring, Logging)
├── Folder: Workloads
│   ├── Folder: Dev
│   │   └── Project: [app]-dev
│   ├── Folder: Staging
│   │   └── Project: [app]-staging
│   └── Folder: Prod
│       └── Project: [app]-prod
└── Folder: Sandbox
    └── Project: [user]-sandbox

  • Organization policies — constraints at organization/folder level
  • Resource Manager — hierarchy: Organization → Folder → Project → Resources
  • Project limits — max 30 projects (can be increased), 10k resources per project

12-Factor App Methodology

Methodology for building cloud-native applications (Heroku, 2011), expanded by the book Multi-Cloud Handbook for Developers (Natarajan, Jacob, 2024).

| # | Factor | Description | Cloud Implementation |
| --- | --- | --- | --- |
| 1 | Codebase | One codebase, many deployments | Git + CI/CD pipeline |
| 2 | Dependencies | Explicit dependency declaration | package.json, requirements.txt, Docker image |
| 3 | Config | Configuration in environment variables | Secrets Manager, Parameter Store, env vars |
| 4 | Backing services | Dependent services as attached resources | RDS, S3, Redis — connection via connection string |
| 5 | Build, release, run | Strict separation of build, release, and run stages | CI/CD pipeline (GitHub Actions, GitLab CI) |
| 6 | Processes | Application as stateless processes | Horizontal scaling, session in Redis |
| 7 | Port binding | Service is self-contained and exposes itself by binding to a port, rather than being deployed into an external web server | Express, FastAPI, Spring Boot on own port |
| 8 | Concurrency | Scaling via process model | Horizontal Pod Autoscaler (K8s), EC2 Auto Scaling |
| 9 | Disposability | Fast startup and graceful shutdown | Health checks, SIGTERM handling, preStop hooks |
| 10 | Dev/Prod parity | Minimal difference between environments | Docker, IaC (Terraform), same backing services |
| 11 | Logs | Logs as event streams | stdout/stderr → CloudWatch, ELK, Datadog |
| 12 | Admin processes | Admin tasks as one-off processes | DB migrations, data backfill — run in isolation |
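
Factor 3 (Config) can be illustrated with a small settings loader that reads everything from the environment. The variable names `DATABASE_URL`, `LOG_LEVEL`, and `WORKER_COUNT` and the `Settings` class are hypothetical; the point of the sketch is that the same build runs unchanged in every environment and only the environment differs.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str = "INFO"
    worker_count: int = 2


def load_settings(env=os.environ) -> Settings:
    """Read configuration from environment variables (or any mapping, for tests)."""
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if missing
        log_level=env.get("LOG_LEVEL", "INFO"),
        worker_count=int(env.get("WORKER_COUNT", "2")),
    )


settings = load_settings({"DATABASE_URL": "postgres://db.internal/app", "WORKER_COUNT": "4"})
print(settings)
```

Failing fast on a missing required variable at startup is usually preferable to discovering the gap on the first database call.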

Multi-cloud Extensions (Multi-Cloud Handbook for Developers)

  • API-first design — consistent API interfaces across clouds (REST, gRPC)
  • Domain-Driven Design (DDD) — bounded contexts mapped to cloud services
  • Service Mesh — Istio, Linkerd for observability, traffic management and security across clouds
  • GitOps — declarative deployment with ArgoCD/Flux across Kubernetes clusters in different clouds

Azure Cloud Native Architecture (Map Book)

Based on The Azure Cloud Native Architecture Mapbook (2nd ed.) (Eyskens, 2025) — 40+ architectural maps across domains:

Domains of Architectural Maps

| Domain | Key Azure Services | Architectural Patterns |
| --- | --- | --- |
| Infrastructure | VNet, Azure Firewall, ExpressRoute, VPN Gateway | Hub-and-spoke, Virtual WAN, Private Link |
| Applications | App Service, API Management, Service Bus, Functions | Event-driven, Strangler Fig, Backend for Frontend |
| Data | Cosmos DB, SQL Database, Synapse, Data Lake | CQRS, Event Sourcing, Polyglot Persistence |
| Container Orchestrators | AKS, Azure Container Apps (ACA) | Sidecar, Ambassador, Adapter (service mesh) |
| AI | Azure OpenAI, Cognitive Services, ML Studio | RAG, model fine-tuning, MLOps |
| Security | Entra ID, Defender for Cloud, Key Vault, Sentinel | Zero Trust, Defense in depth, JIT Access |

Cloud Adoption Framework on Azure

  • Strategy — business case, application catalog, portfolio rationalization
  • Plan — landing zone design, governance baseline, subscription taxonomy
  • Ready — landing zone implementation (ALZ), Azure Policy, Networking, Identity
  • Migrate — assessment (Azure Migrate), rehost/replatform, test and cutover
  • Govern — cost management, policy enforcement, compliance monitoring

Cloud Provider Comparison

Based on Cloud Computing: AWS, Azure, Google Cloud (Sario, 2025):

| Area | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Compute | EC2, Lambda, ECS/EKS | VMs, Functions, AKS | GCE, Cloud Functions, GKE |
| Storage | S3, EBS, EFS | Blob, Disk, Files | Cloud Storage, Persistent Disk, Filestore |
| Relational DB | RDS (MySQL, PostgreSQL, SQL Server, Oracle, MariaDB) | SQL Database, MySQL/PostgreSQL | Cloud SQL (MySQL, PostgreSQL, SQL Server) |
| NoSQL DB | DynamoDB, ElastiCache | Cosmos DB, Redis Cache | Firestore, Bigtable, Memorystore |
| Message queue | SQS, SNS | Service Bus, Queue Storage | Pub/Sub, Cloud Tasks |
| Observability | CloudWatch, X-Ray | Monitor, Application Insights | Cloud Monitoring, Cloud Trace |
| AI/ML | SageMaker, Bedrock | Azure ML, Azure OpenAI | Vertex AI, AutoML |
| Pricing (compute) | On-demand, Reserved, Spot, Savings Plan | Pay-as-you-go, Reserved, Spot | On-demand, Committed Use, Spot |

OpenStack as Private Cloud

OpenStack is the dominant open-source platform for building private clouds (IaaS). It provides compute (Nova), networking (Neutron), and storage services (Cinder/Swift/Manila) with a unified API.

Advantages over Commercial Solutions

  • Vendor-neutral API — avoids lock-in to proprietary virtualization platforms (VMware, Hyper-V)
  • Multi-tenancy — Keystone identity, RBAC, projects, quotas
  • Hybrid cloud ready — federation with AWS/Azure/GCP, Terraform provisioning
  • Ecosystem — hundreds of services (Heat orchestration, Magnum containers, Designate DNS)

Suitable Scenarios

| Scenario | Key Services |
| --- | --- |
| Data center with multi-tenancy and self-service | Nova, Neutron, Cinder, Horizon |
| Telco / NFVI / MEC | Neutron (DPDK, SR-IOV), Nova (NUMA pinning) |
| Science and HPC | Cyborg (GPU), Manila (NAS), Ironic (bare metal) |
| Academic clouds | Keystone federation, Trove (DBaaS) |

Challenges

  • Significant deployment and operations complexity
  • Frequent breaking changes between releases (six-month release cycle, i.e. two releases per year)
  • Limited enterprise support outside commercial distributions (Red Hat, Canonical, Mirantis)

Best Practices

  • Use infrastructure as code (Terraform, Pulumi, CDK)
  • Design for failure — every component can fail
  • Implement defense in depth — security at every layer
  • Monitor costs — tagging, budget alerts, anomaly detection
  • Use managed services where it makes sense (less operations)
  • Least privilege for all IAM roles and policies

Resources

Links, books and standards: sources/cloud/sources.en.md

  • Cost tagging — assign tags for chargeback/showback (Environment, Team, Cost Center, Application)
  • Automated compliance — AWS Config, Azure Policy, GCP Org Policies for guardrails
  • Multi-account strategy — AWS Control Tower, Azure Landing Zones, GCP Resource Hierarchy

| Book | Authors | ISBN | Description |
| --- | --- | --- | --- |
| The AI Cloud Infrastructure Blueprint | Thummarakoti, Vududala, Madupati, Kaushik | 978-1-041-16642-9 | End-to-end guide to designing, deploying, and managing AI systems on cloud platforms. Covers public/private/hybrid/multi-cloud models for AI, infrastructure for ML training and inference, MLOps. Target audience: architects, data scientists, DevOps. |
| AWS for Solutions Architects (3rd ed.) | Shrivastava, Srivastav, Thakur | 978-1-83664-193-3 | Practical guide to AWS architecture — compute (EC2, Lambda), storage (S3, EBS), databases (RDS, DynamoDB), networking, security, Well-Architected Framework, migration, cost optimization. Suitable for AWS Solutions Architect certification preparation. |

Last revised: 2026-06-03