# ☁️ Cloud Architecture

## Providers

- **AWS** — largest market share, broadest portfolio
- **Azure** — strong integration with Microsoft ecosystem
- **GCP** — Kubernetes (GKE), data & ML, network connectivity

## Deployment Models

| Model | Description |
|-------|-------------|
| Public cloud | Shared provider infrastructure |
| Private cloud | Dedicated infrastructure (on-prem or hosted) |
| Hybrid cloud | Public + private interconnection |
| Multi-cloud | Multiple public providers |

## Multi-cloud Strategy

### Reasons for Multi-cloud

- **Vendor lock-in prevention** — risk diversification
- **Regulatory requirements** — data residency in specific regions
- **Best-of-breed** — each provider has strengths (AWS breadth of services, Azure enterprise integration, GCP data/ML)
- **Acquisition scenarios** — consolidating cloud estates inherited through mergers and acquisitions

### Multi-cloud Connectivity

| Method | Latency | Throughput | Cost |
|--------|---------|------------|------|
| Site-to-Site VPN | Medium | Limited | Low |
| Private interconnect (Direct Connect / ExpressRoute / Dedicated Interconnect) | Low | High | High |
| Cloud-to-cloud VPN | Medium | Medium | Medium |
| SD-WAN | Low | High | Medium |

### Challenges

- **Network complexity** — different VPC/VNet concepts, security models
- **IAM federation** — unified identities across clouds (SSO, SAML, OIDC)
- **Data gravity** — moving data between clouds is expensive and slow
- **Monitoring** — single pane of glass across clouds (Grafana, Datadog)

### Cloud Adoption Frameworks (CAF)

Each major provider publishes its own Cloud Adoption Framework, a structured approach to cloud adoption:

| Provider | Framework | Focus |
|----------|-----------|-------|
| AWS | AWS CAF | 6 perspectives: Business, People, Governance, Platform, Security, Operations |
| Azure | Microsoft CAF | 8 methodologies: Strategy, Plan, Ready, Migrate, Innovate, Govern, Manage, Secure |
| GCP | Google Cloud Adoption Framework | 4 themes: Learn, Lead, Scale, Secure |

Multi-Cloud Administration Guide (Mulder, 2024)
recommends combining the providers' CAFs into a unified governance model, especially for:

- **Interoperability** — standardization of APIs and IaC across clouds (Terraform, Pulumi)
- **Data governance** — unified policy for data residency and lifecycle
- **Compliance automation** — automated audits across clouds (AWS Config, Azure Policy, GCP Org Policies)
- **Access management** — identity federation and centralized RBAC

## Migration Strategies — 6 Rs

| Strategy | Description | Difficulty | Typical Scenario |
|----------|-------------|------------|------------------|
| **Rehost** (Lift & Shift) | Move VMs as-is, without changes | Low | Quick migration, datacenter exit, minimal risk |
| **Replatform** (Lift & Reshape) | Migration with minor adjustments (e.g., RDS instead of self-managed DB) | Medium | Optimization without rewriting the application |
| **Refactor** (Re-architect) | Rewrite application as cloud-native (microservices, serverless) | High | Maximize cloud benefit, long-term strategy |
| **Repurchase** | Move to SaaS (e.g., Salesforce, Workday) | Low | Application is outdated, SaaS alternative exists |
| **Retire** | Decommission unused applications | Low | Application no longer in use |
| **Retain** | Keep on-prem | None | Regulatory reasons, migration risk too high |

### Decision Framework for 6 Rs

```
Start: Is the application needed?
├── No → Retire
└── Yes → Does a SaaS alternative exist?
    ├── Yes → Repurchase
    └── No → Is refactoring worthwhile?
        ├── Yes → Refactor
        └── No → Is platform change sufficient?
            ├── Yes → Replatform
            └── No → Rehost
```

## Well-Architected Framework (AWS)

1. **Operational Excellence** — automation, monitoring, documentation
2. **Security** — IAM, encryption, compliance
3. **Reliability** — recovery, scaling, backup plans
4. **Performance Efficiency** — right-sizing, choosing the right services
5. **Cost Optimization** — FinOps, reserved instances, spot instances
6. **Sustainability** (since 2021) — carbon footprint, energy efficiency

Analogues: Azure Well-Architected Framework, GCP Architecture Framework

### Key Questions from Well-Architected Review (~60 questions)

**Operational Excellence (12 questions)**

- How are changes managed and automated?
- How are operations documented and shared within the team?
- How does the team respond to planned and unplanned operational events?
- What runbooks exist for common operational scenarios?
- How are incident management and postmortems conducted?

**Security (12 questions)**

- How is identity & access management implemented?
- How is data protected at rest and in transit?
- How are security incidents detected?
- What are the procedures for patch management and vulnerability remediation?
- How are infrastructure credentials and secrets managed?

**Reliability (12 questions)**

- How is service availability ensured during a component failure?
- How is backup and disaster recovery implemented?
- How do service limits (quotas, throttling) affect reliability?
- How does automatic scaling work under changing load?
- What are the SLI/SLO metrics and how are they monitored?

**Performance Efficiency (12 questions)**

- How is the correct type and size of compute/storage selected?
- How is the database layer optimized (indexes, queries, caching)?
- How is monitoring used to identify bottlenecks?
- How is scaling implemented (vertical vs horizontal)?

**Cost Optimization (12 questions)**

- How are costs allocated to teams/projects (chargeback/showback)?
- What tools are used for cost analysis?
- How are unused resources identified and eliminated?
- How is licensing optimized (BYOL, hybrid benefit)?
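As a rough illustration of the chargeback/showback question above, the sketch below groups billing line items by a cost-allocation tag. The record layout (`cost_usd`, `tags`), the `Team` tag key, and the sample figures are assumptions made for this example; real billing exports use their own schemas.

```python
from collections import defaultdict
from typing import Iterable, Mapping

UNTAGGED = "untagged"  # bucket for resources that carry no owner tag


def showback(line_items: Iterable[Mapping], tag_key: str = "Team") -> dict[str, float]:
    """Sum cost per value of `tag_key`; items without the tag land in 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, UNTAGGED)
        totals[owner] += item["cost_usd"]
    return {owner: round(total, 2) for owner, total in totals.items()}


# Hypothetical monthly line items (not real billing data).
line_items = [
    {"resource": "i-0a1", "cost_usd": 120.50, "tags": {"Team": "payments"}},
    {"resource": "bucket-logs", "cost_usd": 30.25, "tags": {"Team": "platform"}},
    {"resource": "i-0b2", "cost_usd": 79.50, "tags": {"Team": "payments"}},
    {"resource": "vol-9", "cost_usd": 12.00, "tags": {}},
]

report = showback(line_items)
for owner, cost in sorted(report.items()):
    print(f"{owner}: ${cost:.2f}")
```

The size of the `untagged` bucket is a useful signal in its own right: it shows how much spend cannot yet be attributed to any team.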

## Key Components

### Compute Layer

- **VM / instances** — EC2, Azure VMs, GCE
- **Container orchestration** — EKS, AKS, GKE
- **Serverless** — Lambda, Azure Functions, Cloud Functions
- **PaaS** — App Engine, Elastic Beanstalk, Azure App Service

### Compute Comparison Matrix (AWS EC2)

| Family | Type | vCPU:Memory (GiB) | Use Case | Example Pricing (on-demand, us-east-1) |
|--------|------|-------------------|----------|----------------------------------------|
| **General purpose** | M7g, M7i | 1:4 | Web servers, microservices, dev/test | m7i.large ~$0.088/h |
| **Compute optimized** | C7g, C7i | 1:2 | HPC, batch processing, CI/CD, gaming | c7i.large ~$0.078/h |
| **Memory optimized** | R7g, R7i, X2idn | 1:8 to 1:32 | In-memory DB (Redis), SAP HANA, real-time analytics | r7i.large ~$0.118/h |
| **Storage optimized** | I4i, Im4gn | 1:4 to 1:8 + NVMe | Transactional DB, data warehousing, Kafka | i4i.large ~$0.138/h |
| **GPU / ML** | P5, G5, Trn1 | Accelerator attached | AI training (P5), inference (G5), ML training on Trainium (Trn1) | g5.xlarge ~$1.006/h |

See [GPU.en.md](GPU.en.md) for GPU model and configuration details.
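
The vCPU:memory ratios in the matrix can be turned into a first-pass sizing helper. The sketch below is illustrative only: the thresholds are read off the table's ratios, `suggest_family` and `monthly_cost` are hypothetical helper names, and the 730-hour month is a common approximation rather than a billing rule.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 days * 24 h / 12 months


def suggest_family(vcpus: int, memory_gib: float) -> str:
    """Map a workload's memory-per-vCPU need onto the families in the matrix."""
    if vcpus <= 0 or memory_gib <= 0:
        raise ValueError("vcpus and memory_gib must be positive")
    gib_per_vcpu = memory_gib / vcpus
    if gib_per_vcpu <= 2:
        return "compute optimized"   # roughly 1:2
    if gib_per_vcpu <= 4:
        return "general purpose"     # roughly 1:4
    return "memory optimized"        # 1:8 and above


def monthly_cost(hourly_usd: float, instances: int = 1) -> float:
    """Estimate the on-demand cost of always-on instances for one month."""
    return round(hourly_usd * HOURS_PER_MONTH * instances, 2)


# Example: a service that needs 2 vCPU and 8 GiB per node, running on 3 nodes,
# priced with the approximate m7i.large figure quoted in the table.
print(suggest_family(2, 8))
print(monthly_cost(0.088, instances=3))
```

A helper like this only narrows the choice; storage-optimized and accelerator families are picked by I/O or GPU needs, which a memory ratio alone does not capture.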

### Storage

- **Object storage** — S3, Blob Storage, Cloud Storage
- **Block storage** — EBS, managed disks, persistent disks
- **File storage** — EFS, Azure Files, Filestore
- **CDN** — CloudFront, Azure CDN, Cloud CDN

### S3 Storage Classes

| Class | Availability | Retrieval Time | Price / GB / Month | Use Case |
|-------|-------------|----------------|--------------------|----------|
| **S3 Standard** | 99.99 % | milliseconds | ~$0.023 | Active data, frequent access |
| **S3 Intelligent-Tiering** | 99.9 % | milliseconds | ~$0.023 + monitoring fee | Unknown/variable access patterns |
| **S3 Standard-IA** | 99.9 % | milliseconds | ~$0.0125 | Less frequent but fast access |
| **S3 One Zone-IA** | 99.5 % | milliseconds | ~$0.01 | Reproducible data |
| **S3 Glacier Instant Retrieval** | 99.9 % | milliseconds | ~$0.004 | Archive with occasional access |
| **S3 Glacier Flexible Retrieval** | 99.99 % | 1-5 min (expedited) / 3-5 h (standard) | ~$0.0036 | Long-term archive |
| **S3 Glacier Deep Archive** | 99.99 % | 12 h (standard) / 48 h (bulk) | ~$0.00099 | Cheapest, compliance archives |

## Multi-AZ and Multi-Region Architecture

```
             Region
┌──────────────────────────────┐
│  AZ-1       AZ-2       AZ-3  │
│  ┌───┐      ┌───┐      ┌───┐ │
│  │APP│──────│APP│──────│APP│ │
│  └─┬─┘      └─┬─┘      └─┬─┘ │
│    │          │          │   │
│  ┌─▼──────────▼──────────▼─┐ │
│  │      Load Balancer      │ │
│  └────────────┬────────────┘ │
│               │              │
│  ┌────────────▼────────────┐ │
│  │   Database (Primary)    │ │
│  │     + Read Replica      │ │
│  └─────────────────────────┘ │
└──────────────────────────────┘
```

## Disaster Recovery Strategies

### DR Strategies on AWS (from least to most prepared)

| Strategy | RTO | RPO | Cost | Description |
|----------|-----|-----|------|-------------|
| **Backup & Restore** | hours | 24 h | Low | Regular data backups to S3/Glacier, restore in DR region |
| **Pilot Light** | tens of minutes | minutes | Medium | Minimal running copy (DB, core services), scale on failover |
| **Warm Standby** | minutes | seconds | High | Reduced production copy running, scale on failover |
| **Active-Active (Multi-Region)** | seconds | < 1 s | Very high | Fully active in multiple regions, traffic routing (Route 53, Global Accelerator) |

Key books on the topic:

- **Engineering Resilient Systems on AWS** (Schwarz, Moran, Bachmeier, 2024) — practical labs for resilience patterns: back off and retry, multi-Region failover, circuit breaker, chaos engineering using AWS Fault Injection Simulator
- **Building Resilient Architectures on AWS** (2025) — data security, backup strategies, recovery plan automation

### Chaos Engineering

Deliberate fault injection to verify system resilience:

- **AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)** — managed fault injection for EC2, ECS, EKS, RDS
- **Tools**: Chaos Mesh (Kubernetes), Gremlin, Litmus
- **Process**: define hypothesis → run experiment → measure impact → improve system
- **Safety**: experiments in isolated environment, safety controls, automatic rollback

## Cloud Design Patterns

### Strangler Fig

Gradually replacing parts of a monolithic application with microservices.

- Legacy functionality is progressively redirected to new services
- Strangler Fig proxy (route headers, feature flags) controls traffic migration
- Advantage: incremental value delivery without big-bang rewrite

### Circuit Breaker

Prevents cascading failures when a dependent service fails.

- Three states: **Closed** (normal operation), **Open** (requests immediately fail), **Half-Open** (test request after timeout)
- Parameters: failure threshold, timeout (reset timeout), half-open max requests
- Implementations: resilience4j, Hystrix (legacy), Istio (Envoy), AWS App Mesh

### Saga

Distributed transaction across microservices — a series of local transactions with compensating actions.
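
The idea of local transactions paired with compensating actions can be sketched in a few lines. This is a minimal in-process illustration, not a production saga engine: `SagaStep`, `run_saga`, and the order-processing steps are hypothetical names, and a real system would persist saga state and make compensations idempotent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the local transaction
    compensation: Callable[[], None]  # undoes the action if a later step fails


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: list[str] = []


def charge_card() -> None:
    # Simulated failure of the second local transaction.
    raise RuntimeError("payment declined")


steps = [
    SagaStep("inventory", lambda: log.append("reserve stock"),
             lambda: log.append("release stock")),
    SagaStep("payment", charge_card, lambda: log.append("refund card")),
    SagaStep("shipping", lambda: log.append("create shipment"),
             lambda: log.append("cancel shipment")),
]

succeeded = run_saga(steps)
print(succeeded, log)
```

Only steps that completed are compensated, so the failed payment step is not refunded and the shipping step never runs.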

- **Choreography** — each service publishes an event, the next service reacts (Kafka, EventBridge)
- **Orchestration** — central orchestrator manages steps (Step Functions, Temporal, Camunda)

### CQRS (Command Query Responsibility Segregation)

Separation of write (Command) and read (Query) models.

- Command model: optimized for writes (normalized, transactional)
- Query model: optimized for reads (denormalized, read-optimized views)
- Eventual consistency between models (event bus propagates changes)
- Use case: reporting, audit logs, high-throughput systems

### Event Sourcing

Storing state as a sequence of events, not the current state.

- Each change is an append-only event in an event store
- Current state = fold of all events
- Advantages: audit trail, time travel, CQRS compatibility
- Implementations: EventStoreDB, Kafka (log), DynamoDB + CDC

### Additional Cloud Patterns (Wilder — Cloud Architecture Patterns)

| Pattern | Category | Description |
|---------|----------|-------------|
| **Horizontally Scaling Compute** | Scalability | Adding/removing instances based on load, elasticity |
| **Queue-Centric Workflow** | Scalability | Decoupling components via queues (SQS, RabbitMQ), async processing |
| **Auto-Scaling** | Scalability | Automatic scaling based on metrics (CPU, memory, request count) |
| **MapReduce** | Big Data | Distributed data processing (Hadoop, EMR, BigQuery) |
| **Database Sharding** | Big Data | Horizontal data partitioning across databases |
| **Busy Signal** | Failure Handling | Graceful degradation under overload (HTTP 503, throttling, backpressure) |
| **Node Failure** | Failure Handling | Detection and automatic recovery from compute node failure |
| **Colocation** | Distributed Users | Placing compute close to data to reduce latency |
| **Valet Key** | Distributed Users | Delegated storage access (SAS tokens, S3 presigned URLs) |
| **Multi-Site Deployment** | Distributed Users | Active deployment in multiple geographic locations |

## Evolutionary Architecture

Definition (Ford, Parsons, Kua, 2022): *An evolutionary architecture supports guided, incremental change across multiple dimensions.*

### Fitness Functions

Automated checks of architectural characteristics — analogous to tests for architecture:

| Type | Description | Example |
|------|-------------|---------|
| **Atomic** | Checks a single metric | Cyclomatic complexity < 10 |
| **Holistic** | Checks the overall system | End-to-end latency < 200 ms |
| **Triggered** | Triggered by event (CI/CD commit, deployment) | API contract verification |
| **Continuous** | Runs continuously in production | Monitoring dependency freshness |
| **Static** | Code analysis without execution | SonarQube, ESLint |
| **Dynamic** | Runtime analysis | Load tests, chaos experiments |

### Principles of Evolutionary Architecture

1. **Incremental change** — small, safe changes thanks to CI/CD, deployment pipelines, mature DevOps
2. **Fitness functions** — automated protection of architectural characteristics (scalability, performance, security)
3. **Coupling management** — conscious work with component connections (affinity, volatility, cycles)
4. **Evolutionary data** — database migrations as first-class citizens (evolutionary schemas, expand-contract pattern)

### Antipatterns

- **Big Design Up Front (BDUF)** — trying to design everything upfront, ignoring change
- **No Design at All** — absence of architectural thinking, purely emergent design
- **Premature Standardization** — introducing standards before the domain is understood

## Hybrid Cloud Connectivity

See also: [NETWORKING.en.md](NETWORKING.en.md) — network architecture (VPN, BGP, VPC design).

- **Site-to-Site VPN** — IPsec tunnel over the internet
- **Direct Connect / ExpressRoute / Dedicated Interconnect** — private physical connection
- **Cloud VPN / Transit Gateway** — hub-and-spoke topology

## Cost Optimization Detail

### Savings Plans vs Reserved Instances

| Property | Compute Savings Plan | EC2 Instance Savings Plan | Reserved Instances |
|----------|----------------------|---------------------------|-------------------|
| Flexibility | Flexible across instance family, size, region, OS | Fixed instance family and region; flexible size and OS | Specific instance |
| Term | 1 or 3 years | 1 or 3 years | 1 or 3 years |
| Discount (typical) | ~30-50 % | ~40-60 % | ~40-60 % |
| Change instance | Yes (any) | Yes (within family) | No |
| Change region | Yes | No | No |
| Payment options | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront |

### Spot Instance Best Practices

- **Diversification** — use a mix of instance types (spot fleet) for higher availability
- **Graceful handling** — application must handle the termination notice (2-minute warning)
- **Checkpointing** — regular state saving for restart after spot interruption
- **Spot blocks** (AWS, defined duration of 1-6 h) — discontinued by AWS and no longer available; do not plan new workloads around them
- **Use cases**: batch processing, CI/CD runners, stateless microservices, ML training
- **Avoid**: stateful workloads, databases (without special design)

## Organization and Governance

### AWS Organizations

```
Root OU
├── Security OU
│   ├── Audit Account (CloudTrail, Config)
│   └── Security Tooling Account (GuardDuty, Security Hub)
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   ├── Shared Services Account (AD, SSO)
│   └── Log Archive Account
├── Workloads OU
│   ├── Dev OU → individual dev accounts
│   ├── Staging OU → staging accounts
│   └── Prod OU → production accounts
└── Sandbox OU → isolated experimental accounts
```

- **SCP** (Service Control Policies) — allow or deny services at OU level
- **Tag policies** — enforce tagging across accounts
- **AI services opt-out** — control data usage in AWS AI services

### Azure Management Groups

```
Tenant Root Group
├── Platform MG
│   ├── Connectivity (hub VNet, ExpressRoute)
│   ├── Management (Log Analytics, Automation)
│   └── Identity (AD DS, PIM)
├── Application MG
│   ├── DEV (dev subscriptions)
│   ├── TEST (test subscriptions)
│   └── PROD (production subscriptions)
└── Sandbox MG
```

- **Azure Policy** — built-in and custom policies (similar to SCP)
- **Management Group hierarchy** — up to 6 levels deep
- **Management group limits** — up to 10,000 management groups per directory (tenant)

### GCP Projects

```
Organization Node
├── Folder: Platform
│   ├── Project: Shared Networking (VPC, Cloud NAT, VPN)
│   ├── Project: Security (Cloud KMS, Secret Manager, Chronicle)
│   └── Project: Monitoring (Cloud Monitoring, Logging)
├── Folder: Workloads
│   ├── Folder: Dev
│   │   └── Project: [app]-dev
│   ├── Folder: Staging
│   │   └── Project: [app]-staging
│   └── Folder: Prod
│       └── Project: [app]-prod
└── Folder: Sandbox
    └── Project: [user]-sandbox
```

- **Organization policies** — constraints at organization/folder level
- **Resource Manager** — hierarchy: Organization → Folder → Project → Resources
- **Project limits** — max 30 projects (can be increased), 10k resources per project

## 12-Factor App Methodology

A methodology for building cloud-native applications (Heroku, 2011), extended in the book **Multi-Cloud Handbook for Developers** (Natarajan, Jacob, 2024).

| # | Factor | Description | Cloud Implementation |
|---|--------|-------------|----------------------|
| 1 | **Codebase** | One repo, many deployments | Git + CI/CD pipeline |
| 2 | **Dependencies** | Explicit dependency declaration | package.json, requirements.txt, Docker image |
| 3 | **Config** | Configuration in environment variables | Secrets Manager, Parameter Store, env vars |
| 4 | **Backing services** | Dependent services as attached resources | RDS, S3, Redis — connection via connection string |
| 5 | **Build, release, run** | Strict separation of build, release, and run stages | CI/CD pipeline (GitHub Actions, GitLab CI) |
| 6 | **Processes** | Application as stateless processes | Horizontal scaling, session in Redis |
| 7 | **Port binding** | Service is self-contained and exposes itself by binding to a port, not by being deployed into an external web server | Express, FastAPI, Spring Boot on own port |
| 8 | **Concurrency** | Scaling via process model | Horizontal Pod Autoscaler (K8s), EC2 Auto Scaling |
| 9 | **Disposability** | Fast startup and graceful shutdown | Health checks, SIGTERM handling, preStop hooks |
| 10 | **Dev/Prod parity** | Minimal difference between environments | Docker, IaC (Terraform), same backing services |
| 11 | **Logs** | Logs as event streams | stdout/stderr → CloudWatch, ELK, Datadog |
| 12 | **Admin processes** | Admin tasks as one-off processes | DB migrations, data backfill — run in isolation |

### Multi-cloud Extensions (Multi-Cloud Handbook for Developers)

- **API-first design** — consistent API interfaces across clouds (REST, gRPC)
- **Domain-Driven Design (DDD)** — bounded contexts mapped to cloud services
- **Service Mesh** — Istio, Linkerd for observability, traffic management and security across clouds
- **GitOps** — declarative deployment with ArgoCD/Flux across Kubernetes clusters in different clouds

## Azure Cloud Native Architecture (Map Book)

Based on **The Azure Cloud Native Architecture Mapbook (2nd ed.)** (Eyskens, 2025) — 40+ architectural maps across domains:

### Domains of Architectural Maps

| Domain | Key Azure Services | Architectural Patterns |
|--------|-------------------|----------------------|
| **Infrastructure** | VNet, Azure Firewall, ExpressRoute, VPN Gateway | Hub-and-spoke, Virtual WAN, Private Link |
| **Applications** | App Service, API Management, Service Bus, Functions | Event-driven, Strangler Fig, Backend for Frontend |
| **Data** | Cosmos DB, SQL Database, Synapse, Data Lake | CQRS, Event Sourcing, Polyglot Persistence |
| **Container Orchestrators** | AKS, Azure Container Apps (ACA) | Sidecar, Ambassador, Adapter (service mesh) |
| **AI** | Azure OpenAI, Cognitive Services, ML Studio | RAG, model fine-tuning, MLOps |
| **Security** | Entra ID, Defender for Cloud, Key Vault, Sentinel | Zero Trust, Defense in depth, JIT Access |

### Cloud Adoption Framework on Azure

- **Strategy** — business case, application catalog, portfolio rationalization
- **Plan** — landing zone design, governance baseline, subscription taxonomy
- **Ready** — landing zone implementation (ALZ), Azure Policy, Networking, Identity
- **Migrate** — assessment (Azure Migrate), rehost/replatform, test and cutover
- **Govern** — cost management, policy enforcement, compliance monitoring

## Cloud Provider Comparison

Based on **Cloud Computing: AWS, Azure, Google Cloud** (Sario, 2025):

| Area | AWS | Azure | GCP |
|------|-----|-------|-----|
| **Compute** | EC2, Lambda, ECS/EKS | VMs, Functions, AKS | GCE, Cloud Functions, GKE |
| **Storage** | S3, EBS, EFS | Blob, Disk, Files | Cloud Storage, Persistent Disk, Filestore |
| **Relational DB** | RDS (MySQL, PG, SQL Server, Oracle, MariaDB) | SQL Database, MySQL/PostgreSQL | Cloud SQL (MySQL, PG, SQL Server) |
| **NoSQL DB** | DynamoDB, ElastiCache | Cosmos DB, Azure Cache for Redis | Firestore, Bigtable, Memorystore |
| **Message queue** | SQS, SNS | Service Bus, Queue Storage | Pub/Sub, Cloud Tasks |
| **Observability** | CloudWatch, X-Ray | Monitor, Application Insights | Cloud Monitoring, Cloud Trace |
| **AI/ML** | SageMaker, Bedrock | Azure ML, Azure OpenAI | Vertex AI, AutoML |
| **Pricing (compute)** | On-demand, Reserved, Spot, Savings Plan | Pay-as-you-go, Reserved, Spot | On-demand, Committed Use, Spot |

## OpenStack as Private Cloud

OpenStack is the dominant open-source platform for building private clouds (IaaS). It provides compute (Nova), networking (Neutron), and storage services (Cinder/Swift/Manila) with a unified API.

### Advantages over Commercial Solutions

- **Vendor-neutral API** — avoids lock-in (VMware, Hyper-V)
- **Multi-tenancy** — Keystone identity, RBAC, projects, quotas
- **Hybrid cloud ready** — federation with AWS/Azure/GCP, Terraform provisioning
- **Ecosystem** — dozens of services (Heat orchestration, Magnum containers, Designate DNS)

### Suitable Scenarios

| Scenario | Key Services |
|----------|--------------|
| Data center with multi-tenancy and self-service | Nova, Neutron, Cinder, Horizon |
| Telco / NFVI / MEC | Neutron (DPDK, SR-IOV), Nova (NUMA pinning) |
| Science and HPC | Cyborg (GPU), Manila (NAS), Ironic (bare metal) |
| Academic clouds | Keystone federation, Trove (DBaaS) |

### Challenges

- Significant deployment and operations complexity
- Frequent breaking changes between releases (a new release every six months)
- Limited enterprise support outside commercial distributions (Red Hat, Canonical, Mirantis)

## Best Practices

- Use **infrastructure as code** (Terraform, Pulumi, CDK)
- Design for **failure** — every component can fail
- Implement **defense in depth** — security at every layer
- Monitor **costs** — tagging, budget alerts, anomaly detection
- Use **managed services** where it makes sense (less operations)
- **Least privilege** for all IAM roles and policies

## Resources

Links, books and standards: [sources/cloud/sources.en.md](sources/cloud/sources.en.md)

- **Cost tagging** — assign tags for chargeback/showback (Environment, Team, Cost Center, Application)
- **Automated compliance** — AWS Config, Azure Policy, GCP Org Policies for guardrails
- **Multi-account strategy** — AWS Control Tower, Azure Landing Zones, GCP Resource Hierarchy

### Recommended Reading

| Book | Authors | ISBN | Description |
|------|---------|------|-------------|
| The AI Cloud Infrastructure Blueprint | Thummarakoti, Vududala, Madupati, Kaushik | 978-1-041-16642-9 | End-to-end guide to designing, deploying, and managing AI systems on cloud platforms. Covers public/private/hybrid/multi-cloud models for AI, infrastructure for ML training and inference, MLOps. Target audience: architects, data scientists, DevOps. |
| AWS for Solutions Architects (3rd ed.) | Shrivastava, Srivastav, Thakur | 978-1-83664-193-3 | Practical guide to AWS architecture — compute (EC2, Lambda), storage (S3, EBS), databases (RDS, DynamoDB), networking, security, Well-Architected Framework, migration, cost optimization. Suitable for AWS Solutions Architect certification preparation. |

*Last revised: 2026-06-03*