How to initiate scripted VMware HA failover

The situation: an ESX HA cluster stretched over two sites, A and B. The shared storage is at site B.

The task: if site A loses electricity, initiate a graceful HA failover of all the virtual machines to site B, considering that the hosts are licensed only with vSphere Standard edition (no vMotion).

How to do this:

  1. Have enough capacity for VMs from site A on hosts at site B
  2. The UPS has to call a script on the hosts at site A. An agent from the UPS supplier has to be installed on the hosts, or on a vSphere Management Assistant that controls the hosts. The script must run on the hosts themselves; it is not possible to execute it remotely!
  3. The script is quite simple:

    esxcfg-vswif -D    # disable all service console interfaces - the host looks isolated
    sleep 180          # give HA time to shut the guests down and release the SCSI locks
    esxcfg-vswif -E    # re-enable the service console so the host can be shut down

  4. Set the HA isolation response to “Shut down”

How does it work? The first command disconnects all network interfaces from the service console. This isolates the host, because the heartbeat to the other hosts and to the gateway is lost. After a while, HA on the host shuts down the guest VMs. This is a graceful shutdown (if VMware Tools are installed) and takes some time, hence the sleep command; in this case it waits 3 minutes. When the hosts at site B detect the loss of the heartbeat, they try to restart the machines, but they have to wait until the SCSI locks on the VM files are released. The sleep time therefore has to be long enough for all the guest machines to shut down while the other hosts still see the heartbeat as lost, so that the SCSI locks get released. Finally, the service console network interfaces are restored and the host can be shut down (either by the UPS agent or with the shutdown -h now command).
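If the fixed three-minute sleep feels fragile, the script can instead poll hostd until every registered VM reports powered off; vmware-vim-cmd should keep working even with the service console interfaces down, since it talks to hostd locally rather than over the network. A minimal sketch, assuming the usual vmware-vim-cmd output format (verify it on your build):

    #!/bin/sh
    esxcfg-vswif -D                     # isolate the host
    # wait until HA has shut down every registered VM
    for vmid in $(vmware-vim-cmd vmsvc/getallvms | awk 'NR>1 {print $1}'); do
        while vmware-vim-cmd vmsvc/power.getstate $vmid | grep -q "Powered on"; do
            sleep 10
        done
    done
    sleep 60                            # extra margin for the SCSI locks to be released
    esxcfg-vswif -E                     # restore the service console
    shutdown -h now                     # power the host off before the UPS runs dry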

vSphere: Cannot remove empty virtual switch

A few days ago I was trying to migrate running virtual machines from one virtual switch to another without any downtime. When I thought I was done, I tried to remove the vacated virtual switch, but instead I was greeted with the following error:

Error: A specified parameter was not correct.

Well, after scratching my head for a while I discovered a nasty bug in vSphere. If you rename a port group, the configuration files of the VMs using this port group are not updated. If you then create a new port group with the old name, the vSphere client shows the VMs in the new port group; in reality, however, they are still residing in the old one and using its connectivity.
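Under the hood, a VM's .vmx file references its port group only by label, for example (hypothetical NIC entry):

    ethernet0.networkName = "Test"

Renaming the port group does not rewrite this line; the running VM keeps its port on the old switch, and the client simply re-matches the stale label against whatever port group currently carries that name.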

To reproduce the steps that led to this state:

  1. I created a ‘Test’ virtual machine port group on the new vSwitch2 and placed a running VM1 in it
  2. I renamed the port group to ‘Test2’ – VM1 disappeared
  3. I created a new virtual machine port group ‘Test’ on vSwitch0. VM1 immediately jumped into this new port group.
  4. I tried to delete vSwitch2 and got the error above.

Running esxcfg-vswitch -l I received this output:
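A sketch of that listing, reconstructed from the port groups above (uplink names and the layout are illustrative; the exact columns vary by build):

    Switch Name    Num Ports   Used Ports  Configured Ports  MTU   Uplinks
    vSwitch0       64          4           64                1500  vmnic0

      PortGroup Name   VLAN ID  Used Ports  Uplinks
      Test             0        0           vmnic0

    Switch Name    Num Ports   Used Ports  Configured Ports  MTU   Uplinks
    vSwitch2       64          3           64                1500  vmnic1

      PortGroup Name   VLAN ID  Used Ports  Uplinks
      Test2            0        1           vmnic1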

The supposedly empty port group ‘Test2’ is using 1 port, while the new ‘Test’ port group shows 0 used ports, even though the vSphere client shows the running VM1 in it.

So what is really happening? This is actually not a bug in the vSphere client, as it relies on the information provided by the SDK of the ESX server. Running the commands

    vmware-vim-cmd vmsvc/get.networks <vmid>
    vmware-vim-cmd hostsvc/net/vswitch_info

gives wrong information about the port group names. The vSphere client then incorrectly assumes that the VM was migrated to the other switch, but in fact it still resides on the old one. The only way out is to open the VM settings and change the network connection to a different port group and back again. The network adapter info then gets updated, and the VM finally migrates to the new switch, so the old one can be removed.
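Once the adapter has been reconnected, the used-port counts swap, and the leftovers can also be cleaned up from the service console (flags as on my build; check esxcfg-vswitch --help on yours):

    esxcfg-vswitch -l                    # 'Test' now shows 1 used port, 'Test2' shows 0
    esxcfg-vswitch -D Test2 vSwitch2     # remove the stale port group
    esxcfg-vswitch -d vSwitch2           # the empty switch now deletes without the error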

This was tested on ESX 4.0.0, build 236512.