knowledge-base/DC-MIGRATION.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

🏗️ Data Center Migration

Migration strategies

| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
|---|---|---|---|---|---|---|
| Cold / Big Bang | hours–days | days | High | Low | days | Shut everything down, move, power up |
| Phased / Wave | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
| Rolling | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
| Parallel Run | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
| Pilot Light | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
| Lift & Shift | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
| Re-platform | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
| Re-architect | 0 | 0 | Low | High | months–years | Application redesigned for new platform |

Decision tree

flowchart TD
    Start(["DC Migration"]) --> APP{"Application\nstateful?"}
    APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"}
    APP -->|"No"| ROLLING["Rolling / Parallel Run"]

    DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"]
    DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"]
    DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"}

    SYNC -->|"Yes, < 100 km"| ROLLING
    SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"]

    ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
    ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"]
    ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"]
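
The decision tree above can also be read as a small function. The following is a minimal sketch, not a definitive implementation: the function name `choose_strategy`, its parameter names, and the returned labels are illustrative assumptions; only the branching logic is taken from the flowchart.

```python
from typing import Literal

# Downtime tolerance as used on the branches of the flowchart.
Downtime = Literal["hours", "minutes", "none"]


def choose_strategy(
    stateful: bool,
    downtime: Downtime = "none",
    sync_replication: bool = False,
    live_migration: bool = False,
) -> str:
    """Walk the decision tree and return the suggested strategy label."""
    if stateful:
        if downtime == "hours":
            return "Cold / Big Bang"
        if downtime == "minutes":
            return "Phased / Wave"
        # Zero downtime required: without sync replication, run both DCs.
        if not sync_replication:
            return "Parallel Run"
    # Stateless applications, and stateful ones with sync replication,
    # end up on the Rolling branch.
    if live_migration:
        return "Rolling (vMotion / Storage vMotion)"
    return "Rolling (storage + DB replication)"


print(choose_strategy(stateful=True, downtime="minutes"))
```

The `live_migration` flag stands for the "VMware, Hyper-V?" question in the flowchart; the "< 100 km" condition for synchronous replication has to be assessed by the caller before setting `sync_replication`.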

Migration phases

1. Discovery and assessment

| Task | Tools | Output |
|---|---|---|
| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
| License audit | Flexera, SAM | Licenses, support, compliance |

Output: workload list with RTO/RPO, dependencies, and criticality.

2. Planning

  • Wave plan — workload division into migration waves (10–50 VMs per wave)
  • Dependency ordering — DNS, NTP, LDAP, PKI first
  • Cutover window — time window for switching (typically weekend)
  • Rollback plan — conditions and procedure for reversal
  • Test plan — what and how to test post-migration
  • Communication plan — who is informed, when, and how
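
Wave planning and dependency ordering can be sketched with a topological sort. The example below is an illustration under assumptions: `plan_waves`, the workload names, and the dependency map are hypothetical, and a real plan would also weigh criticality and cutover windows. It uses `graphlib.TopologicalSorter` from the Python standard library.

```python
from graphlib import TopologicalSorter


def plan_waves(depends_on: dict[str, set[str]], max_per_wave: int = 50) -> list[list[str]]:
    """Group workloads into waves so that dependencies always move first."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: list[list[str]] = []
    while sorter.is_active():
        # Workloads whose dependencies have all been placed in earlier waves.
        ready = sorted(sorter.get_ready())
        # Items in one batch do not depend on each other, so a large batch
        # can safely be split into several waves of at most max_per_wave.
        for start in range(0, len(ready), max_per_wave):
            waves.append(ready[start:start + max_per_wave])
        sorter.done(*ready)
    return waves


# Hypothetical dependency map: workload -> services it depends on.
deps = {
    "dns": set(),
    "ldap": {"dns"},
    "pki": {"ldap"},
    "wiki": {"ldap"},
    "crm": {"ldap", "pki"},
}
for number, wave in enumerate(plan_waves(deps), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

With this input the core services (DNS, then LDAP, then PKI) land in the earliest waves, which matches the ordering rule in the list above.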

3. New DC preparation

  • Infrastructure — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see DATACENTERS.en.md — deployment order)
  • Network — BGP peering, VXLAN/VLAN, firewall rules, load balancers
  • Storage — SAN zoning, NAS exports, Ceph cluster
  • Virtualization — vCenter, Hyper-V cluster, Proxmox

4. Replication and synchronization

| Layer | Method | Tools |
|---|---|---|
| Storage (block) | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
| Storage (file) | DFS-R, rsync, robocopy | Windows DFS, Rsync |
| Storage (object) | Cross-region replication | MinIO replication, S3 CRR |
| Databases | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
| VM | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
| Kubernetes | Velero + Restic, Rook Ceph mirror | Velero, Rook |

5. Workload migration

gantt
    title Wave migration
    dateFormat  YYYY-MM-DD
    section Wave 1 - Core
    DNS, NTP, LDAP    :done, w1a, 2026-07-01, 3d
    Monitoring + logging :done, w1b, after w1a, 2d
    section Wave 2 - Network
    Load balancers     :active, w2a, 2026-07-06, 2d
    Firewalls          :active, w2b, 2026-07-08, 2d
    section Wave 3 - Storage
    NAS migration      :w3a, 2026-07-10, 5d
    SAN replication    :w3b, 2026-07-10, 3d
    section Wave 4 - Dev/Test
    Dev VMs            :w4a, 2026-07-15, 5d
    section Wave 5 - Prod tier 3
    Internal apps      :w5a, 2026-07-22, 5d
    section Wave 6 - Prod tier 2
    Business apps      :w6a, 2026-07-29, 5d
    section Wave 7 - Prod tier 1
    Critical apps      :w7a, 2026-08-05, 5d

Typical procedure for a single wave:

  1. Day -7: Start data replication (initial seed)
  2. Day -1: Incremental sync, final test
  3. Day 0 (cutover):
    • Stop application in source DC
    • Final sync (last delta)
    • Start application in target DC
    • DNS/Traffic switch
    • Smoke test
  4. Day +1: Monitoring (performance, errors, lag)
  5. Day +7: Rollback window end (success confirmation)
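
The day offsets in the procedure above translate directly into calendar dates once a cutover date is fixed. A minimal sketch, assuming a hypothetical helper `wave_schedule` and abbreviated step labels:

```python
from datetime import date, timedelta

# Offsets in days relative to cutover, taken from the procedure above.
WAVE_STEPS = [
    (-7, "Initial seed replication"),
    (-1, "Incremental sync, final test"),
    (0, "Cutover"),
    (1, "Monitoring"),
    (7, "Rollback window ends"),
]


def wave_schedule(cutover: date) -> list[tuple[date, str]]:
    """Return the calendar date of every step for a given cutover date."""
    return [(cutover + timedelta(days=offset), step) for offset, step in WAVE_STEPS]


for day, step in wave_schedule(date(2026, 7, 22)):
    print(day.isoformat(), step)
```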

6. Network strategies

IP re-addressing

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Keep IP | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
| Change IP | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
| NAT translation | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |

Keep IP is only possible with:

  • L2 stretch between DCs (VXLAN, OTV) — distance limited
  • BGP anycast for VIPs (load balancers)
  • Applications tolerant to ARP cache changes
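
For the Change IP and NAT translation approaches, a one-to-one mapping that preserves the host offset keeps old and new addresses easy to correlate. The sketch below shows one possible way to compute such a mapping with the standard `ipaddress` module; the function name `remap_ip` and the example networks are assumptions for illustration.

```python
import ipaddress


def remap_ip(old_ip: str, old_net: str, new_net: str) -> str:
    """Map an address to the same host offset inside the new network."""
    old_network = ipaddress.ip_network(old_net)
    new_network = ipaddress.ip_network(new_net)
    address = ipaddress.ip_address(old_ip)
    if old_network.version != new_network.version:
        raise ValueError("old and new networks must use the same IP version")
    if address not in old_network:
        raise ValueError(f"{old_ip} is not in {old_net}")
    if new_network.num_addresses < old_network.num_addresses:
        raise ValueError("new network is smaller than the old one")
    # Distance of the host from the start of the old range.
    offset = int(address) - int(old_network.network_address)
    return str(new_network.network_address + offset)


print(remap_ip("10.1.20.15", "10.1.0.0/16", "172.20.0.0/16"))
```

The same mapping can serve as input for DNS record updates or for static NAT rules, depending on the approach chosen.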

DNS cutover

1. Lower TTL to 60–300 s (one week ahead)
2. At cutover, change A/AAAA records to new IPs
3. Wait for propagation (per TTL)
4. Monitor traffic
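
The timing behind these steps is simple arithmetic: the lowered TTL only takes effect once the previous TTL has expired in resolver caches, and after the record change clients may keep the old address for up to one new TTL. A sketch under these assumptions (the helper `dns_cutover_times` is hypothetical, and resolvers that ignore TTLs are not covered):

```python
from datetime import datetime, timedelta


def dns_cutover_times(
    ttl_lowered_at: datetime,
    old_ttl_s: int,
    new_ttl_s: int,
    cutover_at: datetime,
) -> tuple[datetime, datetime]:
    """Return (earliest safe cutover, time when old answers should have expired)."""
    # Caches may hold the record with the old TTL until this moment.
    earliest_cutover = ttl_lowered_at + timedelta(seconds=old_ttl_s)
    if cutover_at < earliest_cutover:
        raise ValueError("resolvers may still cache the record with the old TTL")
    # After the A/AAAA change, cached answers live for at most the new TTL.
    propagated_at = cutover_at + timedelta(seconds=new_ttl_s)
    return earliest_cutover, propagated_at


earliest, propagated = dns_cutover_times(
    ttl_lowered_at=datetime(2026, 7, 15, 0, 0),
    old_ttl_s=86_400,
    new_ttl_s=300,
    cutover_at=datetime(2026, 7, 22, 2, 0),
)
print(earliest, propagated)
```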

Traffic steering

| Technique | Use case |
|---|---|
| BGP | Change AS path / local pref for traffic steering |
| DNS | Lower TTL, change A records |
| Load balancer | Change pool members, health check |
| GSLB | Global Server Load Balancing (F5 GTM, NSX ALB) |
| Cloud DNS | AWS Route53, Azure Traffic Manager, Google Cloud DNS |

7. Database migration

See individual DB files for details. Summary table:

| DB | Method | RPO | RTO | Note |
|---|---|---|---|---|
| PostgreSQL | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | minutes | Patroni auto-failover |
| MySQL | Group Replication / async replication | 0 (sync) / seconds | minutes | InnoDB Cluster |
| Oracle | Data Guard switchover | 0 (sync) | minutes | Far sync for remote DCs |
| MSSQL | AlwaysOn AG failover | 0 (sync) | minutes | Cloud witness |
| MongoDB | Replica set election | seconds | < 1 min | Priority-based failover |
| Cassandra | Multi-DC replication | eventual | 0 | Native multi-master |

8. Testing

| Phase | What to test | Method |
|---|---|---|
| Pre-migration | Application in new DC (isolated) | Dry run on replicated data |
| Cutover | Functionality, availability, latency | Smoke test, synthetic transactions |
| Post-migration | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
| Rollback | Return to old DC | Tested rollback plan |

9. Rollback plan

Each wave must have a defined rollback:

| Condition | Action |
|---|---|
| Application fails to start in new DC | DNS switch back, stop replication |
| Performance worse than baseline (> 20 %) | Rollback, root cause analysis |
| Integration failure (API timeout, DB connection) | Rollback, dependency check |
| Security incident | Rollback, forensic analysis |

Rollback must be tested before the real cutover.
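
The rollback conditions can be turned into an automated post-cutover check. The following is a sketch only: the `WaveHealth` fields and the function `rollback_reasons` are hypothetical, and using latency as the performance metric is an assumption; the 20 % threshold comes from the table above.

```python
from dataclasses import dataclass


@dataclass
class WaveHealth:
    """Observations collected after cutover (hypothetical field names)."""
    app_started: bool
    baseline_latency_ms: float
    current_latency_ms: float
    integration_errors: int = 0
    security_incident: bool = False


def rollback_reasons(health: WaveHealth, max_degradation: float = 0.20) -> list[str]:
    """Return every rollback condition that is currently met."""
    reasons = []
    if not health.app_started:
        reasons.append("application failed to start")
    if health.baseline_latency_ms > 0:
        # Relative slowdown compared with the pre-migration baseline.
        degradation = (
            health.current_latency_ms - health.baseline_latency_ms
        ) / health.baseline_latency_ms
        if degradation > max_degradation:
            reasons.append(f"performance {degradation:.0%} worse than baseline")
    if health.integration_errors > 0:
        reasons.append("integration failure")
    if health.security_incident:
        reasons.append("security incident")
    return reasons


print(rollback_reasons(WaveHealth(True, 100.0, 125.0)))
```

An empty list means no rollback condition is met; any non-empty result would trigger the corresponding action from the table.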


Special cases

Mainframe migration

  • IBM z/OS — GDPS (Geographically Dispersed Parallel Sysplex)
  • HyperSwap for storage mirroring
  • Cross-system coupling facility (XCF)
  • Often the last migrated component

COTS applications (Oracle EBS, SAP)

  • Require vendor-specific migration procedures
  • Oracle EBS: Autoconfig, cloning (ADXLC)
  • SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
  • Re-licensing may be required on HW change

Cloud migration (On-prem → Cloud)

See CLOUD.en.md — migration strategies (6 Rs):

| Strategy | Description |
|---|---|
| Re-host (Lift & Shift) | VM → Cloud VM (AWS MGN, Azure Migrate) |
| Re-platform | OS upgrade, managed DB (RDS, Cloud SQL) |
| Re-architect | Application rewritten as cloud-native |
| Retire | Decommission unnecessary applications |
| Retain | Application stays on-prem (review later) |
| Repurchase | SaaS replacement |

| DC Size | VM Count | Recommended strategy | Duration | Team |
|---|---|---|---|---|
| Small | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
| Medium | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
| Large | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
| Enterprise | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
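
As a rough aid, the sizing table can be expressed as a lookup. This is a sketch under assumptions: the function name `recommend_by_size` is hypothetical, and because the table's ranges share their boundary values (500 and 5000), the code assigns each boundary to the larger class.

```python
def recommend_by_size(vm_count: int) -> tuple[str, str]:
    """Return (DC size class, recommended strategy) for a VM count."""
    if vm_count < 0:
        raise ValueError("vm_count must not be negative")
    if vm_count < 50:
        return "Small", "Big Bang (weekend)"
    if vm_count < 500:
        return "Medium", "Phased (5-10 waves)"
    if vm_count < 5000:
        return "Large", "Phased + Rolling"
    return "Enterprise", "Parallel Run / Rolling"


print(recommend_by_size(320))
```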

Sources

Links, books, and standards: sources/infrastructure/sources.en.md

Last revision: 2026-06-12