The situation: ESX HA cluster stretched over two sites A and B. The shared storage is at site B.
The task: If site A looses electricity initiate gracefull HA failover of all the virtual machines to site B considering that the hosts are licensed only with vSphere Standard edition (no vMotion).
How to do this:
- Have enough capacity for VMs from site A on hosts at site B
- UPS has to call a script on hosts on site A. An agent from the UPS supplier has to be installed on the hosts or on vSphere Management Assistant that controls the hosts. The script must be run on the hosts itself, it is not possible to execute the script remotely!
- The script is quite simple:
esxcfg-vswif -D
sleep 180
esxcfg-vswif -E
- Set HA isolation response to shut down
How it works? The first command disconects all network interfaces from the service console. This creates isolation of the host, because the heartbeat to other hosts or gateway is lost. After while the HA on the host shuts down the guest VMs. This is gracefull shutdown (if VMware Tools are installed) and takes some time, therefore the sleep command. In this case the sleep command waits 3 minutes. When the other hosts on site B detect the loss of heartbeat they try to restart the machines. They have to wait till the SCSI locks on VM files are released. The sleep time has to be long enough for all the guest machines to be shut down so the other hosts still detect the loss of heartbeat and the SCSI lock is released. Finaly the service console network interfaces are restored and the host can be shut down (either by the UPS agent or with shutdown -h now command.