# ☁️ Cloud Architecture

## Providers

- **AWS** — largest market share, broadest portfolio
- **Azure** — strong integration with Microsoft ecosystem
- **GCP** — strong in Kubernetes (GKE), data & ML, and network connectivity

## Deployment Models

| Model | Description |
|-------|-------------|
| Public cloud | Shared provider infrastructure |
| Private cloud | Dedicated infrastructure (on-prem or hosted) |
| Hybrid cloud | Public + private interconnection |
| Multi-cloud | Multiple public providers |

## Multi-cloud Strategy

### Reasons for Multi-cloud

- **Vendor lock-in prevention** — risk diversification
- **Regulatory requirements** — data residency in specific regions
- **Best-of-breed** — each provider has strengths (AWS networking, Azure enterprise, GCP data/ML)
- **Acquisition scenarios** — mergers and acquisitions bring in workloads on other clouds that must be unified

### Multi-cloud Connectivity

| Method | Latency | Throughput | Cost |
|--------|---------|------------|------|
| Site-to-Site VPN | Medium | Limited | Low |
| Private interconnect (Direct Connect / ExpressRoute / Dedicated Interconnect) | Low | High | High |
| Cloud-to-cloud VPN | Medium | Medium | Medium |
| SD-WAN | Low | High | Medium |

### Challenges
- **Network complexity** — different VPC/VNet concepts, security models
- **IAM federation** — unified identities across clouds (SSO, SAML, OIDC)
- **Data gravity** — moving data between clouds is expensive and slow
- **Monitoring** — single pane of glass across clouds (Grafana, Datadog)

### Cloud Adoption Frameworks (CAF)

Each major provider publishes its own Cloud Adoption Framework as a structured approach to cloud adoption:

| Provider | Framework | Focus |
|----------|-----------|-------|
| AWS | AWS CAF | 6 perspectives: Business, People, Governance, Platform, Security, Operations |
| Azure | Microsoft CAF | 8 methodologies: Strategy, Plan, Ready, Migrate, Innovate, Govern, Manage, Secure |
| GCP | Google CAF | 4 themes: Learn, Lead, Scale, Secure |

The **Multi-Cloud Administration Guide** (Mulder, 2024) recommends combining the providers' CAFs into a unified governance model, especially for:
- **Interoperability** — standardization of APIs and IaC across clouds (Terraform, Pulumi)
- **Data governance** — unified policy for data residency and lifecycle
- **Compliance automation** — automated audits across clouds (AWS Config, Azure Policy, GCP Org Policies)
- **Access management** — identity federation and centralized RBAC

## Migration Strategies — 6 Rs

| Strategy | Description | Difficulty | Typical Scenario |
|----------|-------------|------------|------------------|
| **Rehost** (Lift & Shift) | Move VMs as-is, without changes | Low | Quick migration, datacenter exit, minimal risk |
| **Replatform** (Lift & Reshape) | Migration with minor adjustments (e.g., RDS instead of self-managed DB) | Medium | Optimization without rewriting the application |
| **Refactor** (Re-architect) | Rewrite application as cloud-native (microservices, serverless) | High | Maximize cloud benefit, long-term strategy |
| **Repurchase** | Move to SaaS (e.g., Salesforce, Workday) | Low | Application is outdated, SaaS alternative exists |
| **Retire** | Decommission unused applications | Low | Application no longer in use |
| **Retain** | Keep on-prem | None | Regulatory reasons, migration risk too high |

### Decision Framework for 6 Rs

```
Start: Is the application needed?
├── No → Retire
└── Yes → Does a SaaS alternative exist?
    ├── Yes → Repurchase
    └── No → Is refactoring worthwhile?
        ├── Yes → Refactor
        └── No → Is platform change sufficient?
            ├── Yes → Replatform
            └── No → Rehost
```
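
The decision tree above can be sketched as a small function. This is a minimal illustration, not a complete assessment tool: the `AppAssessment` class and its field names are assumptions made for this example, and each answer is reduced to a yes/no flag.

```python
from dataclasses import dataclass


@dataclass
class AppAssessment:
    """Answers to the four questions in the tree (hypothetical field names)."""
    still_needed: bool
    saas_alternative_exists: bool
    refactor_worthwhile: bool
    platform_change_sufficient: bool


def choose_migration_strategy(app: AppAssessment) -> str:
    """Walk the tree top-down and return the first strategy that applies."""
    if not app.still_needed:
        return "Retire"
    if app.saas_alternative_exists:
        return "Repurchase"
    if app.refactor_worthwhile:
        return "Refactor"
    if app.platform_change_sufficient:
        return "Replatform"
    return "Rehost"
```

Like the tree itself, the sketch never returns Retain; that outcome depends on regulatory or risk constraints assessed outside this flow.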

## Well-Architected Framework (AWS)

1. **Operational Excellence** — automation, monitoring, documentation
2. **Security** — IAM, encryption, compliance
3. **Reliability** — recovery, scaling, backup plans
4. **Performance Efficiency** — right-sizing, choosing the right services
5. **Cost Optimization** — FinOps, reserved instances, spot instances
6. **Sustainability** (added in December 2021) — carbon footprint, energy efficiency

Analogues: Azure Well-Architected Framework, GCP Architecture Framework

### Key Questions from Well-Architected Review (~60 questions)

**Operational Excellence (12 questions)**
- How are changes managed and automated?
- How are operations documented and shared within the team?
- How does the team respond to expected and unexpected operational events?
- What runbooks exist for common operational scenarios?
- How are incident management and postmortems conducted?

**Security (12 questions)**
- How is identity & access management implemented?
- How is data protected at rest and in transit?
- How is security incident detection ensured?
- What are the procedures for patch management and vulnerability remediation?
- How are infrastructure credentials and secrets managed?

**Reliability (12 questions)**
- How is service availability ensured during a component failure?
- How is backup and disaster recovery implemented?
- How do service limits (quotas, throttling) affect reliability?
- How does automatic scaling work under changing load?
- What are the SLI/SLO metrics and how are they monitored?

**Performance Efficiency (12 questions)**
- How is the correct type and size of compute/storage selected?
- How is the database layer optimized (indexes, queries, caching)?
- How is monitoring used to identify bottlenecks?
- How is scaling implemented (vertical vs horizontal)?

**Cost Optimization (12 questions)**
- How are costs allocated to teams/projects (chargeback/showback)?
- What tools are used for cost analysis?
- How are unused resources identified and eliminated?
- How is licensing optimized (BYOL, hybrid benefit)?

## Key Components

### Compute Layer

- **VM / instances** — EC2, Azure VMs, GCE
- **Container orchestration** — EKS, AKS, GKE
- **Serverless** — Lambda, Azure Functions, Cloud Functions
- **PaaS** — App Engine, Elastic Beanstalk, Azure App Service

### Compute Comparison Matrix (AWS EC2)

| Family | Type | vCPU:Memory | Use Case | Example Pricing (on-demand, us-east-1) |
|--------|------|-------------|----------|----------------------------------------|
| **General purpose** | M7g, M7i | 1:4 | Web servers, microservices, dev/test | m7i.large ~$0.088/h |
| **Compute optimized** | C7g, C7i | 1:2 | HPC, batch processing, CI/CD, gaming | c7i.large ~$0.078/h |
| **Memory optimized** | R7g, R7i, X2idn | 1:8 to 1:32 | In-memory DB (Redis), SAP HANA, real-time analytics | r7i.large ~$0.118/h |
| **Storage optimized** | I4i, Im4gn | 1:4 + NVMe | Transactional DB, data warehousing, Kafka | i4i.large ~$0.138/h |
| **GPU / ML** | P5, G5, Trn1 | GPU attach | AI training (P5), inference (G5), ML (Trn1) | g5.xlarge ~$1.006/h |

See [GPU.en.md](GPU.en.md) for GPU model and configuration details.

### Storage

- **Object storage** — S3, Blob Storage, Cloud Storage
- **Block storage** — EBS, managed disks, persistent disks
- **File storage** — EFS, Azure Files, Filestore
- **CDN** — CloudFront, Azure CDN, Cloud CDN

### S3 Storage Classes

| Class | Availability | Retrieval Time | Price / GB / Month | Use Case |
|-------|-------------|----------------|--------------------|----------|
| **S3 Standard** | 99.99 % | milliseconds | ~$0.023 | Active data, frequent access |
| **S3 Intelligent-Tiering** | 99.9 % | milliseconds | ~$0.023 + monitoring fee | Unknown/variable access patterns |
| **S3 Standard-IA** | 99.9 % | milliseconds | ~$0.0125 | Less frequent but fast access |
| **S3 One Zone-IA** | 99.5 % | milliseconds | ~$0.01 | Reproducible data |
| **S3 Glacier Instant** | 99.9 % | milliseconds | ~$0.004 | Archive with occasional access |
| **S3 Glacier Flexible** | 99.99 % | 1-5 min (expedited) / 3-5 h (standard) | ~$0.0036 | Long-term archive |
| **S3 Glacier Deep Archive** | 99.99 % | 12 h (standard) / 48 h (bulk) | ~$0.00099 | Cheapest, compliance archives |
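
To make the price column concrete, the sketch below estimates the monthly storage charge for a given data volume. The prices are the approximate figures from the table; the dictionary keys and the function name are assumptions chosen for this example, and the calculation deliberately ignores request, retrieval, monitoring, and minimum-storage-duration charges.

```python
# Approximate storage prices in USD per GB-month, copied from the table above.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "ONEZONE_IA": 0.01,
    "GLACIER_INSTANT": 0.004,
    "GLACIER_FLEXIBLE": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Return the storage-only cost in USD for one month."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


def cost_by_class(size_gb: float) -> dict:
    """Compare all classes for the same data volume, cheapest first."""
    costs = {name: monthly_storage_cost(size_gb, name) for name in PRICE_PER_GB_MONTH}
    return dict(sorted(costs.items(), key=lambda item: item[1]))
```

For archives that are rarely read, the storage price dominates; for data that is read often, retrieval and request fees can outweigh the savings, so a real comparison needs access-pattern data as well.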

## Multi-AZ and Multi-Region Architecture

```
Region ┌──────────────────────────────┐
       │  AZ-1       AZ-2       AZ-3  │
       │  ┌───┐      ┌───┐      ┌───┐ │
       │  │APP│──────│APP│──────│APP│ │
       │  └─┬─┘      └─┬─┘      └─┬─┘ │
       │    │          │          │   │
       │  ┌─▼──────────▼──────────▼─┐ │
       │  │      Load Balancer      │ │
       │  └────────────┬────────────┘ │
       │               │              │
       │  ┌────────────▼────────────┐ │
       │  │   Database (Primary)    │ │
       │  │     + Read Replica      │ │
       │  └─────────────────────────┘ │
       └──────────────────────────────┘
```

## Disaster Recovery Strategies

### DR Strategies on AWS (from least to most prepared)

| Strategy | RTO | RPO | Cost | Description |
|----------|-----|-----|------|-------------|
| **Backup & Restore** | hours | 24 h | Low | Regular data backups to S3/Glacier, restore in DR region |
| **Pilot Light** | tens of minutes | minutes | Medium | Minimal running copy (DB, core services), scale on failover |
| **Warm Standby** | minutes | seconds | High | Reduced production copy running, scale on failover |
| **Active-Active (Multi-Region)** | seconds | < 1 s | Very high | Fully active in multiple regions, traffic routing (Route 53, Global Accelerator) |

Key books on the topic:
- **Engineering Resilient Systems on AWS** (Schwarz, Moran, Bachmeier, 2024) — practical labs for resilience patterns: back off and retry, multi-Region failover, circuit breaker, chaos engineering using AWS Fault Injection Simulator
- **Building Resilient Architectures on AWS** (2025) — data security, backup strategies, recovery plan automation
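
The trade-off in the DR strategy table can be sketched as a lookup: pick the cheapest strategy whose RTO and RPO still meet the targets. The numeric values below are illustrative placeholders for the table's qualitative entries (for example "hours" is taken as 4 h); they are assumptions for this example, not AWS guarantees.

```python
from typing import NamedTuple, Optional


class DrStrategy(NamedTuple):
    name: str
    rto_seconds: int
    rpo_seconds: int


# Ordered cheapest first, as in the table. All numbers are assumed placeholders.
STRATEGIES = [
    DrStrategy("Backup & Restore", 4 * 3600, 24 * 3600),
    DrStrategy("Pilot Light", 30 * 60, 5 * 60),
    DrStrategy("Warm Standby", 5 * 60, 30),
    DrStrategy("Active-Active (Multi-Region)", 30, 1),
]


def cheapest_strategy(rto_target_s: int, rpo_target_s: int) -> Optional[DrStrategy]:
    """Return the first (cheapest) strategy that meets both targets, if any."""
    for strategy in STRATEGIES:
        if strategy.rto_seconds <= rto_target_s and strategy.rpo_seconds <= rpo_target_s:
            return strategy
    return None
```

A target of one hour RTO and ten minutes RPO would select Pilot Light under these assumed numbers; tightening either target moves the choice down the table and up in cost.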

### Chaos Engineering

Deliberate fault injection to verify system resilience:
- **AWS Fault Injection Simulator (FIS)** — managed fault injection for EC2, ECS, EKS, RDS
- **Tools**: Chaos Mesh (Kubernetes), Gremlin, Litmus
- **Process**: define hypothesis → run experiment → measure impact → improve system
- **Safety**: experiments in isolated environment, safety controls, automatic rollback

## Cloud Design Patterns

### Strangler Fig
Gradually replaces parts of a monolithic application with microservices.
- Legacy functionality is progressively redirected to new services
- A routing proxy (header-based routes, feature flags) controls how traffic migrates
- Advantage: incremental value delivery without big-bang rewrite
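
The routing decision such a proxy makes can be sketched as follows. This is a simplified illustration under stated assumptions: the prefix table, the percentages, and the backend names `new-service` and `legacy-monolith` are hypothetical, and a real proxy would usually live in a gateway or service mesh rather than in application code.

```python
import hashlib

# Hypothetical rollout table: path prefix -> percentage of users sent to the new service.
MIGRATED_PREFIXES = {"/orders": 100, "/invoices": 25}


def user_bucket(user_id: str) -> int:
    """Map a user to a stable bucket 0-99 so the same user always gets the same backend."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(path: str, user_id: str) -> str:
    """Return the backend that should serve this request."""
    for prefix, percent in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return "new-service" if user_bucket(user_id) < percent else "legacy-monolith"
    # Anything not yet migrated stays on the monolith.
    return "legacy-monolith"
```

Raising a prefix's percentage step by step, and finally removing the legacy code path, is what makes the migration incremental.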

### Circuit Breaker
Prevents cascading failures when a dependent service fails.
- Three states: **Closed** (normal operation), **Open** (requests immediately fail), **Half-Open** (test request after timeout)
- Parameters: failure threshold, timeout (reset timeout), half-open max requests
- Implementations: resilience4j, Hystrix (legacy), Istio (envoy), AWS App Mesh
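
The three states can be illustrated with a minimal, single-threaded sketch. It is not the API of any of the libraries listed above; the class and parameter names are assumptions for this example, and it does not enforce a half-open request limit or provide thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open, or too many failures while
            # closed, (re)opens the circuit and restarts the timeout.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound call in `breaker.call(...)` lets callers fail fast while the dependency recovers instead of piling up timeouts.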

### Saga
Distributed transaction across microservices — a series of local transactions with compensating actions.
- **Choreography** — each service publishes an event, the next service reacts (Kafka, EventBridge)
- **Orchestration** — central orchestrator manages steps (Step Functions, Temporal, Camunda)
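
The orchestration variant can be sketched in a few lines: run each local transaction in order and, if one fails, run the compensations of the completed steps in reverse. This is an in-process illustration only; the `run_saga` function and the step tuple layout are assumptions for this example, and real orchestrators also persist progress and retry compensations.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); the layout is an assumption for this sketch.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Execute steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True
```

The failed step itself is not compensated, since its local transaction is assumed to have rolled back on its own.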

### CQRS (Command Query Responsibility Segregation)
Separation of write (Command) and read (Query) models.
- Command model: optimized for writes (normalized, transactional)
- Query model: optimized for reads (denormalized, read-optimized views)
- Eventual consistency between models (event bus propagates changes)
- Use case: reporting, audit logs, high-throughput systems

### Event Sourcing
Stores state as a sequence of events rather than as a current-state snapshot.
- Each change is an append-only event in an event store
- Current state = fold of all events
- Advantages: audit trail, time travel, CQRS compatibility
- Implementations: EventStoreDB, Kafka (log), DynamoDB + CDC
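
The "current state = fold of all events" idea can be shown with a toy account balance. The event kinds and the `Event` class are hypothetical and exist only for this sketch; a real event store would add persistence, versioning, and snapshots.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str     # "deposited" or "withdrawn" (hypothetical event types)
    amount: int


def apply(balance: int, event: Event) -> int:
    """Pure transition function: previous state + event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def current_balance(events: List[Event]) -> int:
    """Rebuild the current state by folding over the full event log."""
    return reduce(apply, events, 0)
```

Folding over a prefix of the log, for example `current_balance(events[:2])`, gives the state as of that point, which is the "time travel" property mentioned in the list.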

### Additional Cloud Patterns (Wilder — Cloud Architecture Patterns)

| Pattern | Category | Description |
|---------|----------|-------------|
| **Horizontally Scaling Compute** | Scalability | Adding/removing instances based on load, elasticity |
| **Queue-Centric Workflow** | Scalability | Decoupling components via queues (SQS, RabbitMQ), async processing |
| **Auto-Scaling** | Scalability | Automatic scaling based on metrics (CPU, memory, request count) |
| **MapReduce** | Big Data | Distributed data processing (Hadoop, EMR, BigQuery) |
| **Database Sharding** | Big Data | Horizontal data partitioning across databases |
| **Busy Signal** | Failure Handling | Graceful degradation under overload (HTTP 503, throttling, backpressure) |
| **Node Failure** | Failure Handling | Detection and automatic recovery from compute node failure |
| **Colocation** | Distributed Users | Placing compute close to data to reduce latency |
| **Valet Key** | Distributed Users | Delegated storage access (SAS tokens, S3 presigned URLs) |
| **Multi-Site Deployment** | Distributed Users | Active deployment in multiple geographic locations |
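
On the client side, the Busy Signal pattern is usually paired with retries that back off between attempts. The sketch below uses capped exponential backoff with full jitter, which is one common choice rather than something the table prescribes. `BusyError` and the parameter names are assumptions standing in for an HTTP 503 or throttling response.

```python
import random
import time


class BusyError(Exception):
    """Stands in for an HTTP 503 or throttling response (hypothetical)."""


def call_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call func, retrying on BusyError with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except BusyError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The random component spreads retries from many clients over time, so they do not all hit the overloaded service again at the same instant.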

## Evolutionary Architecture

Definition (Ford, Parsons, Kua, 2022): *An evolutionary architecture supports guided, incremental change across multiple dimensions.*

### Fitness Functions

Automated checks of architectural characteristics — analogous to tests for architecture:

| Type | Description | Example |
|------|-------------|---------|
| **Atomic** | Checks a single metric | Cyclomatic complexity < 10 |
| **Holistic** | Checks the overall system | End-to-end latency < 200 ms |
| **Triggered** | Triggered by event (CI/CD commit, deployment) | API contract verification |
| **Continuous** | Runs continuously in production | Monitoring dependency freshness |
| **Static** | Code analysis without execution | SonarQube, ESLint |
| **Dynamic** | Runtime analysis | Load tests, chaos experiments |
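
A fitness function is ultimately a check that returns pass or fail. The sketch below turns the latency example from the table into such a check. Using the 95th percentile (nearest-rank method) is an assumption made here, since the table only states a 200 ms budget; the function names are likewise illustrative.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the given latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_fitness(samples_ms: Sequence[float], budget_ms: float = 200.0) -> bool:
    """Pass when the p95 latency stays under the budget."""
    return p95(samples_ms) < budget_ms
```

Run in a pipeline after a load test, a failing result would block the deployment in the same way a failing unit test would.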

### Principles of Evolutionary Architecture

1. **Incremental change** — small, safe changes thanks to CI/CD, deployment pipelines, mature DevOps
2. **Fitness functions** — automated protection of architectural characteristics (scalability, performance, security)
3. **Coupling management** — conscious work with component connections (affinity, volatility, cycles)
4. **Evolutionary data** — database migrations as first-class citizens (evolutionary schemas, expand-contract pattern)

### Antipatterns
- **Big Design Up Front (BDUF)** — trying to design everything upfront, ignoring change
- **No Design at All** — absence of architectural thinking, purely emergent design
- **Premature Standardization** — introducing standards before the domain is understood

## Hybrid Cloud Connectivity

See also: [NETWORKING.en.md](NETWORKING.en.md) — network architecture (VPN, BGP, VPC design).

- **Site-to-Site VPN** — IPSec tunnel over the internet
- **Direct Connect / ExpressRoute / Dedicated Interconnect** — private physical connection
- **Cloud VPN / Transit Gateway** — hub-and-spoke topology

## Cost Optimization Detail

### Savings Plans vs Reserved Instances

| Property | Compute Savings Plan | EC2 Instance Savings Plan | Reserved Instances |
|----------|----------------------|---------------------------|-------------------|
| Flexibility | Any instance family, region, OS | Fixed instance family + region | Specific instance |
| Term | 1 or 3 years | 1 or 3 years | 1 or 3 years |
| Discount (typical) | ~30-50 % | ~40-60 % | ~40-60 % |
| Change instance | Yes (any) | Yes (within family) | No |
| Change region | Yes | No | No |
| Payment options | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront | All Upfront / Partial / No Upfront |

### Spot Instance Best Practices

- **Diversification** — use a mix of instance types (spot fleet) for higher availability
- **Graceful handling** — application must handle the termination notice (2-minute warning)
- **Checkpointing** — regular state saving for restart after spot interruption
- **Spot block** (AWS) — defined-duration Spot Instances (1-6 h); AWS has retired this option, so do not plan new workloads around it
- **Use cases**: batch processing, CI/CD runners, stateless microservices, ML training
- **Avoid**: stateful workloads, databases (without special design)
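
The checkpointing bullet can be illustrated with a small resumable loop. This is a local-file sketch under stated assumptions: the function names and the JSON state layout are invented for this example, `stop_after` only simulates an interruption, and a real Spot workload would write the checkpoint to durable storage outside the instance.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically: temp file plus rename, so a kill never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default when no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process(items, path, stop_after=None):
    """Sum items, checkpointing after each one; stop_after simulates an interruption."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for index in range(state["next_index"], len(items)):
        if stop_after is not None and index >= stop_after:
            return state
        state["total"] += items[index]
        state["next_index"] = index + 1
        save_checkpoint(path, state)
    return state
```

Calling `process` again with the same checkpoint path resumes at the first unprocessed item instead of starting over.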

## Organization and Governance

### AWS Organizations

```
Root OU
├── Security OU
│   ├── Audit Account (CloudTrail, Config)
│   └── Security Tooling Account (GuardDuty, Security Hub)
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   ├── Shared Services Account (AD, SSO)
│   └── Log Archive Account
├── Workloads OU
│   ├── Dev OU → individual dev accounts
│   ├── Staging OU → staging accounts
│   └── Prod OU → production accounts
└── Sandbox OU → isolated experimental accounts
```

- **SCP** (Service Control Policies) — allow-list or deny-list services at OU level
- **Tag policies** — enforce tagging across accounts
- **AI services opt-out** — control whether AWS AI services may use your data

### Azure Management Groups

```
Tenant Root Group
├── Platform MG
│   ├── Connectivity (hub VNet, ExpressRoute)
│   ├── Management (Log Analytics, Automation)
│   └── Identity (AD DS, PIM)
├── Application MG
│   ├── DEV (dev subscriptions)
│   ├── TEST (test subscriptions)
│   └── PROD (production subscriptions)
└── Sandbox MG
```

- **Azure Policy** — built-in and custom policies (similar to SCP)
- **Management Group hierarchy** — up to 6 levels deep
- **Management group limit** — max 10,000 management groups per tenant

### GCP Projects

```
Organization Node
├── Folder: Platform
│   ├── Project: Shared Networking (VPC, Cloud NAT, VPN)
│   ├── Project: Security (Cloud KMS, Secret Manager, Chronicle)
│   └── Project: Monitoring (Cloud Monitoring, Logging)
├── Folder: Workloads
│   ├── Folder: Dev
│   │   └── Project: [app]-dev
│   ├── Folder: Staging
│   │   └── Project: [app]-staging
│   └── Folder: Prod
│       └── Project: [app]-prod
└── Folder: Sandbox
    └── Project: [user]-sandbox
```

- **Organization policies** — constraints at organization/folder level
- **Resource Manager** — hierarchy: Organization → Folder → Project → Resources
- **Project limits** — max 30 projects (can be increased), 10k resources per project

## 12-Factor App Methodology

Methodology for building cloud-native applications (Heroku, 2011), expanded by the book **Multi-Cloud Handbook for Developers** (Natarajan, Jacob, 2024).

| # | Factor | Description | Cloud Implementation |
|---|--------|-------------|----------------------|
| 1 | **Codebase** | One repo, many deployments | Git + CI/CD pipeline |
| 2 | **Dependencies** | Explicit dependency declaration | package.json, requirements.txt, Docker image |
| 3 | **Config** | Configuration in environment variables | Secrets Manager, Parameter Store, env vars |
| 4 | **Backing services** | Dependent services as attached resources | RDS, S3, Redis — connection via connection string |
| 5 | **Build, release, run** | Strict separation of build, release, and run stages | CI/CD pipeline (GitHub Actions, GitLab CI) |
| 6 | **Processes** | Application as stateless processes | Horizontal scaling, session in Redis |
| 7 | **Port binding** | Service is self-contained and exposes itself by binding to a port, instead of being deployed into an external web server | Express, FastAPI, Spring Boot on own port |
| 8 | **Concurrency** | Scaling via process model | Horizontal Pod Autoscaler (K8s), EC2 Auto Scaling |
| 9 | **Disposability** | Fast startup and graceful shutdown | Health checks, SIGTERM handling, preStop hooks |
| 10 | **Dev/Prod parity** | Minimal difference between environments | Docker, IaC (Terraform), same backing services |
| 11 | **Logs** | Logs as event streams | stdout/stderr → CloudWatch, ELK, Datadog |
| 12 | **Admin processes** | Admin tasks as one-off processes | DB migrations, data backfill — run in isolation |
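
Factor 3 (Config) is the easiest one to show in code. The sketch below reads settings from environment variables and fails fast when a required one is missing. The variable names `DATABASE_URL`, `LOG_LEVEL`, and `PORT`, as well as the defaults, are assumptions chosen for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment; raises KeyError if DATABASE_URL is unset."""
    return Settings(
        database_url=env["DATABASE_URL"],          # required, no default
        log_level=env.get("LOG_LEVEL", "INFO"),    # optional with a default
        port=int(env.get("PORT", "8080")),         # env values are strings; convert explicitly
    )
```

Passing the mapping as a parameter keeps the function testable, while the same build artifact can run unchanged in every environment, which also supports factor 10.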

### Multi-cloud Extensions (Multi-Cloud Handbook for Developers)
- **API-first design** — consistent API interfaces across clouds (REST, gRPC)
- **Domain-Driven Design (DDD)** — bounded contexts mapped to cloud services
- **Service Mesh** — Istio, Linkerd for observability, traffic management and security across clouds
- **GitOps** — declarative deployment with ArgoCD/Flux across Kubernetes clusters in different clouds

## Azure Cloud Native Architecture (Map Book)

Based on **The Azure Cloud Native Architecture Mapbook (2nd ed.)** (Eyskens, 2025) — 40+ architectural maps across domains:

### Domains of Architectural Maps

| Domain | Key Azure Services | Architectural Patterns |
|--------|-------------------|----------------------|
| **Infrastructure** | VNet, Azure Firewall, ExpressRoute, VPN Gateway | Hub-and-spoke, Virtual WAN, Private Link |
| **Applications** | App Service, API Management, Service Bus, Functions | Event-driven, Strangler Fig, Backend for Frontend |
| **Data** | Cosmos DB, SQL Database, Synapse, Data Lake | CQRS, Event Sourcing, Polyglot Persistence |
| **Container Orchestrators** | AKS, Azure Container Apps (ACA) | Sidecar, Ambassador, Adapter (service mesh) |
| **AI** | Azure OpenAI, Cognitive Services, ML Studio | RAG, model fine-tuning, MLOps |
| **Security** | Entra ID, Defender for Cloud, Key Vault, Sentinel | Zero Trust, Defense in depth, JIT Access |

### Cloud Adoption Framework on Azure
- **Strategy** — business case, application catalog, portfolio rationalization
- **Plan** — landing zone design, governance baseline, subscription taxonomy
- **Ready** — landing zone implementation (ALZ), Azure Policy, Networking, Identity
- **Migrate** — assessment (Azure Migrate), rehost/replatform, test and cutover
- **Govern** — cost management, policy enforcement, compliance monitoring

## Cloud Provider Comparison

Based on **Cloud Computing: AWS, Azure, Google Cloud** (Sario, 2025):

| Area | AWS | Azure | GCP |
|------|-----|-------|-----|
| **Compute** | EC2, Lambda, ECS/EKS | VMs, Functions, AKS | GCE, Cloud Functions, GKE |
| **Storage** | S3, EBS, EFS | Blob, Disk, Files | Cloud Storage, Persistent Disk, Filestore |
| **Relational DB** | RDS (MySQL, PG, SQL Server, Oracle, MariaDB) | SQL Database, MySQL/PostgreSQL | Cloud SQL (MySQL, PG, SQL Server) |
| **NoSQL DB** | DynamoDB, ElastiCache | Cosmos DB, Redis Cache | Firestore, Bigtable, Memorystore |
| **Message queue** | SQS, SNS | Service Bus, Queue Storage | Pub/Sub, Cloud Tasks |
| **Observability** | CloudWatch, X-Ray | Monitor, Application Insights | Cloud Monitoring, Cloud Trace |
| **AI/ML** | SageMaker, Bedrock | Azure ML, OpenAI | Vertex AI, AutoML |
| **Pricing (compute)** | On-demand, Reserved, Spot, Savings Plan | Pay-as-you-go, Reserved, Spot | On-demand, Committed Use, Spot |

## OpenStack as Private Cloud

OpenStack is the dominant open-source platform for building private clouds (IaaS). It provides compute (Nova), networking (Neutron), and storage services (Cinder/Swift/Manila) with a unified API.

### Advantages over Commercial Solutions

- **Vendor-neutral API** — avoids lock-in to proprietary stacks (VMware, Hyper-V)
- **Multi-tenancy** — Keystone identity, RBAC, projects, quotas
- **Hybrid cloud ready** — federation with AWS/Azure/GCP, Terraform provisioning
- **Ecosystem** — dozens of optional services (Heat orchestration, Magnum containers, Designate DNS)

### Suitable Scenarios

| Scenario | Key Services |
|----------|--------------|
| Data center with multi-tenancy and self-service | Nova, Neutron, Cinder, Horizon |
| Telco / NFVI / MEC | Neutron (DPDK, SR-IOV), Nova (NUMA pinning) |
| Science and HPC | Cyborg (GPU), Manila (NAS), Ironic (bare metal) |
| Academic clouds | Keystone federation, Trove (DBaaS) |

### Challenges

- Significant deployment and operations complexity
- Frequent breaking changes between releases (a new release roughly every six months)
- Limited enterprise support outside commercial distributions (Red Hat, Canonical, Mirantis)

## Best Practices

- Use **infrastructure as code** (Terraform, Pulumi, CDK)
- Design for **failure** — every component can fail
- Implement **defense in depth** — security at every layer
- Monitor **costs** — tagging, budget alerts, anomaly detection
- Use **managed services** where it makes sense (less operational overhead)
- **Least privilege** for all IAM roles and policies

## Resources

Links, books and standards: [sources/cloud/sources.en.md](sources/cloud/sources.en.md)
- **Cost tagging** — assign tags for chargeback/showback (Environment, Team, Cost Center, Application)
- **Automated compliance** — AWS Config, Azure Policy, GCP Org Policies for guardrails
- **Multi-account strategy** — AWS Control Tower, Azure Landing Zones, GCP Resource Hierarchy
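
A cost-tagging guardrail boils down to checking every resource against a required tag set. The sketch below shows that check in isolation. The tag keys mirror the list above, with `CostCenter` written as one word by assumption, and the resource inventory format is hypothetical; in practice the inventory would come from a provider API or a compliance service.

```python
from typing import Dict, Mapping, Set

# Required keys taken from the cost-tagging bullet; exact spelling is an assumption.
REQUIRED_TAGS = frozenset({"Environment", "Team", "CostCenter", "Application"})


def missing_tags(resource_tags: Mapping[str, str]) -> Set[str]:
    """Return the required tag keys that are absent or empty on one resource."""
    present = {key for key, value in resource_tags.items() if value}
    return set(REQUIRED_TAGS - present)


def untagged_resources(inventory: Mapping[str, Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Map each non-compliant resource ID to its missing tag keys."""
    report = {}
    for resource_id, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

A report like this can feed a showback dashboard or block a deployment until the missing tags are supplied.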

### Recommended Reading

| Book | Authors | ISBN | Description |
|------|---------|------|-------------|
| The AI Cloud Infrastructure Blueprint | Thummarakoti, Vududala, Madupati, Kaushik | 978-1-041-16642-9 | End-to-end guide to designing, deploying, and managing AI systems on cloud platforms. Covers public/private/hybrid/multi-cloud models for AI, infrastructure for ML training and inference, MLOps. Target audience: architects, data scientists, DevOps. |
| AWS for Solutions Architects (3rd ed.) | Shrivastava, Srivastav, Thakur | 978-1-83664-193-3 | Practical guide to AWS architecture — compute (EC2, Lambda), storage (S3, EBS), databases (RDS, DynamoDB), networking, security, Well-Architected Framework, migration, cost optimization. Suitable for AWS Solutions Architect certification preparation. |

*Last revised: 2026-06-03*