
vSphere HA and NFS Datastores

Recently, during a vCloud Director project, we tested how long it takes to recover from an HA event. The test ran on a two-node management cluster: we loaded one host with almost all of the management VMs, shut it down, and measured how long it took for all the affected services to recover. The goal was to see whether we could meet the required SLA.

We expected the surviving host to need about 20 seconds to detect that the first one was dead, and then to start powering on the VMs according to their restart priority. The database server and the domain controller had high priority; the rest of the VMs had the default one. To our surprise, it took not 20 seconds or so but 8 minutes 40 seconds just to register the database server and start its boot procedure. For some reason this particular server sat at 95% in the Power On task status. Although whole books have been written about vSphere HA, this behaviour was not explained.

See the Start and Completed Times:

At first it looked like a bug, so we raised a support request (SR), but then we found out that it works this way by design. We were using NFS storage, and NFS locking determines how long it takes to release the locks on a VM's vmdk and vswp files. KB article 1007909 states that the time to recover a lock on NFS storage can be calculated as:

X = (NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout

which with default values is

X = (10 * 3) + 5 = 35 seconds.

However, the database server had 12 vmdk disks (2 disks per database), and the restart actually took (12 + 1) * (35 + 5) = 520 seconds, or 8:40 minutes. This means the locks were released sequentially, an additional 5 seconds was added for each file, and the lock on the VM swap file had to be released as well. This is expected behaviour for vSphere 5.0 and older. The newly released vSphere 5.1 brings the time down to 2 minutes, because 5 threads (the main vmx thread plus 4 worker threads) work in parallel and the 13 files can be released in 3 rounds of 40 seconds each.

KB article 2034935 was written about this behaviour.
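The arithmetic above can be sketched as a small shell script. This is only a rough model under stated assumptions: the function names `nfs_lock_timeout` and `ha_restart_delay` are made up for illustration, the extra 5 seconds per file is the value we observed rather than a documented constant, and the "rounds" logic simply assumes that each thread releases one file per round.

```shell
#!/bin/sh
# Per-file NFS lock timeout from KB 1007909:
# (DiskFileLockUpdateFreq * LockRenewMaxFailureNumber) + LockUpdateTimeout
nfs_lock_timeout() {
    echo $(( $1 * $2 + $3 ))
}

# Estimated delay before the VM starts to boot (hypothetical helper).
# $1 = number of vmdk disks, $2 = per-file lock timeout in seconds,
# $3 = extra seconds observed per file, $4 = threads releasing locks in parallel
ha_restart_delay() {
    vmdks=$1; lock_timeout=$2; overhead=$3; threads=$4
    files=$(( vmdks + 1 ))                          # all disks plus the swap file
    rounds=$(( (files + threads - 1) / threads ))   # ceiling of files / threads
    echo $(( rounds * (lock_timeout + overhead) ))
}

x=$(nfs_lock_timeout 10 3 5)
echo "per-file lock timeout:    ${x}s"
echo "vSphere 5.0 (sequential): $(ha_restart_delay 12 "$x" 5 1)s"
echo "vSphere 5.1 (5 threads):  $(ha_restart_delay 12 "$x" 5 5)s"
```

With the default values and 12 disks this gives 35, 520 and 120 seconds, which matches the 8:40 minutes we measured and the 2 minutes quoted for vSphere 5.1.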

If this is by design, what can you do to avoid it?

1. Upgrade to ESXi 5.1 to get up to 5 times faster HA restart times

2. Use block storage instead of NFS

3. Tweak the NFS advanced parameters (DiskFileLockUpdateFreq, LockRenewMaxFailureNumber, LockUpdateTimeout); however, this is not recommended

4. Do not use that many VMDKs. Either consolidate onto a smaller number of disks, or map the disks from inside the guest (iSCSI, NFS)

5. Just accept it when you calculate your SLAs.
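Before deciding on option 3 or plugging numbers into an SLA calculation, it helps to check which values a host actually uses. The following is a read-only sketch: it assumes the three settings are exposed under the `/NFS/` path of `esxcfg-advcfg` on your build, and the function name `show_nfs_lock_settings` is made up for this example. It changes nothing on the host.

```shell
#!/bin/sh
# Print the current NFS lock settings (read-only, hypothetical helper).
show_nfs_lock_settings() {
    for opt in DiskFileLockUpdateFreq LockRenewMaxFailureNumber LockUpdateTimeout; do
        if command -v esxcfg-advcfg >/dev/null 2>&1; then
            # -g only reads the value of an advanced option
            esxcfg-advcfg -g "/NFS/$opt"
        else
            # Not on an ESX/ESXi shell: show the command to run on the host instead
            echo "run on the host: esxcfg-advcfg -g /NFS/$opt"
        fi
    done
}

show_nfs_lock_settings
```

If the values differ from the defaults of 10, 3 and 5, the 35 second per-file timeout used above no longer applies and the restart estimate has to be recalculated.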

How to initiate scripted VMware HA failover

The situation: an ESX HA cluster stretched over two sites, A and B. The shared storage is at site B.

The task: if site A loses electricity, initiate a graceful HA failover of all the virtual machines to site B, given that the hosts are licensed only with the vSphere Standard edition (no vMotion).

How to do this:

  1. Have enough capacity for VMs from site A on hosts at site B
  2. The UPS has to call a script on the hosts at site A. An agent from the UPS supplier has to be installed on the hosts or on the vSphere Management Assistant that controls the hosts. The script must run on the hosts themselves; it cannot be executed remotely!
  3. The script is quite simple:

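    esxcfg-vswif -D   # disable all service console interfaces (isolates the host)
    sleep 180         # give HA 3 minutes to shut the guests down
    esxcfg-vswif -E   # re-enable the service console interfaces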

  4. Set HA isolation response to shut down

How does it work? The first command disables all network interfaces of the service console. This isolates the host, because the heartbeat to the other hosts or the gateway is lost. After a while, HA on the host shuts down the guest VMs. This is a graceful shutdown (if VMware Tools are installed) and takes some time, hence the sleep command, which in this case waits 3 minutes. When the hosts at site B detect the loss of heartbeat, they try to restart the machines, but they have to wait until the SCSI locks on the VM files are released. The sleep time has to be long enough for all the guest machines to shut down, so that the other hosts still detect the loss of heartbeat and the SCSI locks get released. Finally, the service console network interfaces are restored and the host can be shut down (either by the UPS agent or with the shutdown -h now command).
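The sequence described above, including the final shutdown, could be wrapped in a slightly more defensive script. This is a sketch, not the script we ran: the `DRY_RUN` and `WAIT_SECONDS` variables and the `run` and `isolate_and_shutdown` helpers are hypothetical additions, and dry-run mode is the default so that nothing is disconnected by accident while testing.

```shell
#!/bin/sh
# Hypothetical knobs: DRY_RUN=1 only prints the commands, WAIT_SECONDS is the guest shutdown window.
DRY_RUN="${DRY_RUN:-1}"
WAIT_SECONDS="${WAIT_SECONDS:-180}"

# Execute a command, or just print it when in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

isolate_and_shutdown() {
    run esxcfg-vswif -D          # isolate the host so the HA isolation response kicks in
    run sleep "$WAIT_SECONDS"    # wait for the guests to shut down and the locks to be released
    run esxcfg-vswif -E          # restore the service console interfaces
    run shutdown -h now          # power the host off (or leave this to the UPS agent)
}

isolate_and_shutdown
```

Run it once as-is to see the four commands it would issue, then set `DRY_RUN=0` in the call from the UPS agent. The wait time should be tuned to how long your slowest guest needs to shut down.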