Files
knowledge-base/DC-MIGRATION.en.md
Stanislav Hubacek b53714113c new files
2026-06-16 15:47:45 +02:00

246 lines
9.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 🏗️ Data Center Migration
## Migration strategies
| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
|-----------|-----|-----|--------|---------|-------------|-------|
| **Cold / Big Bang** | hoursdays | days | High | Low | days | Shut everything down, move, power up |
| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeksmonths | Workloads moved in waves |
| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
| **Re-architect** | 0 | 0 | Low | High | monthsyears | Application redesigned for new platform |
---
## Decision tree
```mermaid
flowchart TD
Start(["DC Migration"]) --> APP{"Application\nstateful?"}
APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"}
APP -->|"No"| ROLLING["Rolling / Parallel Run"]
DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"]
DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"]
DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"}
SYNC -->|"Yes, < 100 km"| ROLLING
SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"]
ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"]
ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"]
```
---
## Migration phases
### 1. Discovery and assessment
| Task | Tools | Output |
|------|----------|--------|
| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
| License audit | Flexera, SAM | Licenses, support, compliance |
**Output:** workload list with RTO/RPO, dependencies, and criticality.
### 2. Planning
- **Wave plan** — workload division into migration waves (1050 VMs per wave)
- **Dependency ordering** — DNS, NTP, LDAP, PKI first
- **Cutover window** — time window for switching (typically weekend)
- **Rollback plan** — conditions and procedure for reversal
- **Test plan** — what and how to test post-migration
- **Communication plan** — who, when, how is informed
### 3. New DC preparation
- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order)
- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers
- **Storage** — SAN zoning, NAS exports, Ceph cluster
- **Virtualization** — vCenter, Hyper-V cluster, Proxmox
### 4. Replication and synchronization
| Layer | Method | Tools |
|--------|--------|----------|
| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
### 5. Workload migration
#### Wave migration (recommended for medium/large DCs)
```mermaid
gantt
title Wave migration
dateFormat YYYY-MM-DD
section Wave 1 - Core
DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d
Monitoring + logging :done, w1b, after w1a, 2d
section Wave 2 - Network
Load balancers :active, w2a, 2026-07-06, 2d
Firewalls :active, w2b, 2026-07-08, 2d
section Wave 3 - Storage
NAS migration :w3a, 2026-07-10, 5d
SAN replication :w3b, 2026-07-10, 3d
section Wave 4 - Dev/Test
Dev VMs :w4a, 2026-07-15, 5d
section Wave 5 - Prod tier 3
Internal apps :w5a, 2026-07-22, 5d
section Wave 6 - Prod tier 2
Business apps :w6a, 2026-07-29, 5d
section Wave 7 - Prod tier 1
Critical apps :w7a, 2026-08-05, 5d
```
#### Typical single wave procedure:
1. **Day -7**: Sync data replication (initial seed)
2. **Day -1**: Incremental sync, final test
3. **Day 0 (cutover)**:
- Stop application in source DC
- Final sync (last delta)
- Start application in target DC
- DNS/Traffic switch
- Smoke test
4. **Day +1**: Monitoring (performance, errors, lag)
5. **Day +7**: Rollback window end (success confirmation)
### 6. Network strategies
#### IP re-addressing
| Approach | Description | Pros | Cons |
|---------|-------|--------|----------|
| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |
**Keep IP** is only possible with:
- L2 stretch between DCs (VXLAN, OTV) — distance limited
- BGP anycast for VIPs (load balancers)
- Applications tolerant to ARP cache changes
#### DNS cutover
```
1. Lower TTL to 60300 s (one week ahead)
2. At cutover, change A/AAAA records to new IPs
3. Wait for propagation (per TTL)
4. Monitor traffic
```
#### Traffic steering
| Technique | Use case |
|----------|----------|
| **BGP** | Change AS path / local pref for traffic steering |
| **DNS** | Lower TTL, change A records |
| **Load balancer** | Change pool members, health check |
| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
### 7. Database migration
See individual DB files for details. Summary table:
| DB | Method | RPO | RTO | Note |
|----|--------|-----|-----|----------|
| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster |
| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs |
| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover |
| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master |
### 8. Testing
| Phase | What to test | Method |
|------|-------------|--------|
| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data |
| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
| **Rollback** | Return to old DC | Tested rollback plan |
### 9. Rollback plan
Each wave must have a defined rollback:
| Condition | Action |
|----------|------|
| Application fails to start in new DC | DNS switch back, stop replication |
| Performance worse than baseline (> 20 %) | Rollback, root cause analysis |
| Integration failure (API timeout, DB connection) | Rollback, dependency check |
| Security incident | Rollback, forensic analysis |
Rollback must be tested **before** the real cutover.
---
## Special cases
### Mainframe migration
- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
- HyperSwap for storage mirroring
- Cross-system coupling facility (XCF)
- Often the last migrated component
### COTS applications (Oracle EBS, SAP)
- Require vendor-specific migration procedures
- Oracle EBS: Autoconfig, cloning (ADXLC)
- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
- License re-licensing on HW change
### Cloud migration (On-prem → Cloud)
See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs):
| Strategy | Description |
|-----------|-------|
| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
| **Re-architect** | Application rewritten as cloud-native |
| **Retire** | Decommission unnecessary applications |
| **Retain** | Application stays on-prem (review later) |
| **Repurchase** | SaaS replacement |
---
## Recommended approach per DC size
| DC Size | VM Count | Recommended strategy | Duration | Team |
|-------------|----------|---------------------|-------------|-----|
| **Small** | < 50 | Big Bang (weekend) | 24 days | 35 people |
| **Medium** | 50500 | Phased (510 waves) | 28 weeks | 510 people |
| **Large** | 5005000 | Phased + Rolling | 312 months | 1030 people |
| **Enterprise** | 5000+ | Parallel Run / Rolling | 1236 months | 30+ people |
---
## Related
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order
- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs)
- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO
- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering
- [STORAGE.en.md](STORAGE.en.md) — storage replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revision: 2026-06-12*