VMware vSphere recovery between two data centers

Recovering from a failure between two remote data centers using VMware vSphere and Virtual Center has been a hot topic. What if one of the data centers goes down and you want all your virtual machines running in the second data center? In this post I am not evaluating VMware Site Recovery Manager, the software that offers data center recovery; I should probably write about it in more detail in the near future.

Remember, there will be downtime even if you intend to use Site Recovery Manager: to bring up virtual machines in the second data center you need to shut down the virtual machines in the first data center. What if the network is unreachable? You can use STONITH (“shoot the other node in the head”, the good old stuff, remember). Or you can just stop the disk array replication, set the second data center’s disk array LUN to write mode, and start all virtual machines in that data center. Remember that you also need to isolate the failed data center’s network connectivity and routing, otherwise some users will try to access the primary site, and that’s no good.

Remember, you need a good connection between the data centers so that LUN data replication does not lag or fail.
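To get a rough feel for whether a link is “good enough”, you can compare how fast data changes on the replicated LUNs with what the link can realistically carry. The sketch below is only back-of-the-envelope arithmetic: the function name, the 70% usable-bandwidth factor and the example change rates are my assumptions, not figures from any array vendor, and it ignores latency, bursts and compression.

```python
def replication_keeps_up(change_rate_gb_per_hour, link_gbps, efficiency=0.7):
    """Estimate whether a link can carry the LUN replication traffic.

    change_rate_gb_per_hour: data changed on the replicated LUNs (GB/hour).
    link_gbps: nominal link speed between DC1 and DC2 (Gbit/s).
    efficiency: assumed usable fraction of the link (hypothetical default).
    Returns (keeps_up, required_mbps, usable_mbps).
    """
    # GB/hour -> megabits per second: x8 bits, x1000 MB per GB, /3600 s.
    required_mbps = change_rate_gb_per_hour * 8 * 1000 / 3600
    # Nominal Gbit/s -> usable Mbit/s after protocol overhead and other traffic.
    usable_mbps = link_gbps * 1000 * efficiency
    return required_mbps <= usable_mbps, required_mbps, usable_mbps
```

With these assumptions, 100 GB of changes per hour needs roughly 222 Mbit/s, which fits on a 1 Gbps link, while 400 GB per hour needs roughly 889 Mbit/s and does not.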

Below is a list of what you need to build a virtual infrastructure across two data centers, so you can recover if your primary data center fails:

  1. VMware vSphere (ESXi or ESX) servers in DC1 and DC2.
  2. For fewer problems you can have an additional Virtual Center in each data center, but it’s more expensive. You can also run virtual machines and manage each ESXi server without connecting it to Virtual Center, but that might be a problem if you have HA and DRS set-ups. You can also mirror the Virtual Center configuration and instance, and run the copy only when the first instance in DC1 is shut down (so you don’t break VMware licensing rules).
  3. Good connectivity between DC1 and DC2 (at least 1 Gbps with low latency).
  4. Fibre Channel array at DC1 and DC2 with specific LUN data replication between both arrays.
  5. Correct VLANs (preferably a VLAN trunk set-up) and network routing between the data centers.
  6. Experienced tech guy(s).

So if the first data center fails, your techs perform the following steps to bring up all important virtual machines in the second data center:

  1. If possible, shut down the DC1 virtual machines and Virtual Center.
  2. Stop the LUN replication between both data centers and configure the DC2 storage array to be master and in write mode.
  3. Check the virtual machines’ guaranteed (reserved) resource settings and affinity configuration, if any.
  4. Make quick changes to virtual machine settings and/or start virtual machines by logging into the ESXi host directly.
  5. You may need to delete some lock files if virtual machines won’t start. Check the log files.
  6. Perform all the other necessary steps to isolate the DC1 ESXi servers and virtual machines, so that when the DC1 network becomes available again no users are redirected to the “old virtual machines”.
  7. Check that the virtual machines are up and the necessary services are running.
  8. Test connectivity from multiple network points (other branch offices, for example) to make sure all users are able to access the servers in DC2.
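Steps 7 and 8 are easy to script, so you don’t have to click through every server by hand during an outage. Here is a minimal sketch that checks whether a TCP port answers on each server; run it from a few different network points. The function names are mine, and the host names and ports in the comment are placeholders you would replace with your own. A port that answers only tells you something is listening, not that the service is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(targets, timeout=3.0):
    """Check a list of (host, port) pairs and return {(host, port): bool}."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}


# Example call with hypothetical DC2 servers:
# check_services([("mail.dc2.example.com", 25), ("web.dc2.example.com", 443)])
```

Anything that comes back False from one location but True from another usually points to routing or VLAN problems rather than to the virtual machine itself.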

That’s it. I hope to go into more detail shortly and draw some architecture diagrams to make this easier to understand.

