Recently, during a vCloud Director project, we tested how long it takes to recover from an HA event. The test was done on a two-node management cluster: we loaded one host with almost all of the management VMs, shut it down, and measured how long it took all the affected services to recover. The purpose of the exercise was to see whether we could fulfill the required SLA.
The expectation was that it would take about 20 seconds for the other host to find out that the first one was dead, and that it would then start powering on all the VMs based on their restart priority. The database server and the Domain Controller had high priority; the rest of the VMs had the default one. To our surprise, it did not take 20 seconds or so, but 8 minutes 40 seconds to register the database server and start the boot procedure. For some reason that particular server was shown with a Power On status of 95%. Although there are books written about vSphere HA, this behaviour was not explained.
See the Start and Completed Times:
At first it looked like a bug, so an SR was raised, but then we found out that it is like that by design. We were using NFS storage, and NFS locking determines how long it takes to release the locks on a VM's vmdk and vswp files. KB article 1007909 states that the time to recover a lock on NFS storage can be calculated as:
X = (NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout
which with default values is
X = (10 * 3) + 5 = 35 seconds.
However, the database server had 12 vmdk disks (2 disks per database) and the restart actually took (12+1)*(35+5) = 520 seconds, or 8:40 minutes. This means the locks were released sequentially, an additional 5 seconds was added to each, and the lock on the VM swap file had to be released as well. This is expected behavior for vSphere 5.0 and older. The newly released vSphere 5.1 lowers the time to 2 minutes, as there are 5 threads (main vmx + 4 worker threads) working in parallel, so those 13 files can be released in 3 passes of 40 seconds each.
KB article 2034935 was written about this behaviour.
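The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model inferred from the numbers in this post, not anything VMware publishes: the function names, the `threads` parameter and the assumption that files are released in equal-length rounds are mine.

```python
import math

def nfs_lock_timeout(update_freq=10, max_failures=3, update_timeout=5):
    """X = (NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber)
           + NFS.LockUpdateTimeout, in seconds (defaults from KB 1007909)."""
    return update_freq * max_failures + update_timeout

def restart_delay(num_vmdks, threads=1, extra_per_file=5):
    """Estimated seconds spent releasing NFS locks before the VM boots.

    Assumption: one lock per vmdk plus one for the swap file, each costing
    X + extra_per_file seconds, released in rounds of `threads` files.
    """
    files = num_vmdks + 1                      # vmdks + vswp
    rounds = math.ceil(files / threads)        # sequential when threads == 1
    return rounds * (nfs_lock_timeout() + extra_per_file)

def fmt(seconds):
    """Format seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"

print(fmt(restart_delay(12)))             # vSphere 5.0 and older: one thread
print(fmt(restart_delay(12, threads=5)))  # vSphere 5.1: vmx + 4 workers
```

With 12 disks the single-threaded case gives 13 rounds of 40 seconds and the five-thread case gives 3 rounds, which matches the 8:40 and 2 minute figures above.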
If this is by design, what can you do to avoid it?
1. Upgrade to ESXi 5.1 to get up to 5 times faster HA restart times
2. Use block storage instead of NFS
3. Tweak the NFS advanced parameters (NFS.DiskFileLockUpdateFreq, NFS.LockRenewMaxFailureNumber, NFS.LockUpdateTimeout) – however, this is not recommended
4. Do not use that many VMDKs. Either consolidate onto a smaller number of disks or use in-guest disk mapping (iSCSI, NFS)
5. Just accept it when you calculate your SLAs.
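
For options 4 and 5 it can help to turn the calculation around and ask how many VMDKs fit into a given restart budget. The sketch below is a rough estimate only: it assumes the 40 seconds per file (35 + 5) observed in this test, ignores the time HA needs to detect the failure and the guest boot time, and `max_vmdks` is a hypothetical helper name, not a VMware tool.

```python
def max_vmdks(budget_seconds, threads=1, per_file_seconds=40):
    """Largest vmdk count whose NFS lock release fits into the budget.

    Assumption: files are released in rounds of `threads` files, each
    round taking per_file_seconds; one slot is reserved for the swap file.
    Returns None if the budget does not cover even a single round.
    """
    rounds = budget_seconds // per_file_seconds
    if rounds < 1:
        return None
    return rounds * threads - 1

# Two-minute budget for lock release:
print(max_vmdks(120))             # sequential release (vSphere 5.0 and older)
print(max_vmdks(120, threads=5))  # five threads (vSphere 5.1)
```

Under these assumptions a two-minute budget allows only 2 VMDKs with sequential release and 14 with five threads, so the disk count is worth checking against the SLA before the VM is built.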