vMSC and vSphere HA Timeout?

I am currently working on a vSphere Metro Storage Cluster (vMSC) design with NetApp MetroCluster. vMSC is basically a stretched vSphere HA cluster between two sites backed by some kind of clustered storage solution. The idea is to leverage vSphere HA and vMotion to achieve high SLAs for planned and unplanned downtime.

While researching and simulating all the failure scenarios, I found one in particular that must include a manual step. In a complete site failure, the surviving storage controller cannot distinguish between a site failure and a mere network partition (split brain). NetApp MetroCluster offers a Tie-Breaker that is deployed in a third datacenter and acts as a witness, helping the surviving controller decide what to do.

NetApp MetroCluster with Tie-Breaker and Full Site Failure

If a third datacenter is not available and the Tie-Breaker cannot be implemented, the storage controller takes no action and the storage administrator needs to perform a manual forced takeover of the storage resources on the surviving controller.

I was wondering how the vSphere infrastructure would react to this and what steps the vSphere administrator would need to take to recover the failed workloads. The VMware white paper VMware vSphere Metro Storage Cluster Case Study actually mentions this scenario and says:

NOTE: vSphere HA will stop attempting to start a virtual machine after 30 minutes by default. If the storage team has not issued the takeover command within that time frame, the vSphere administrator must manually start virtual machines when the storage is available.

I was wondering how to streamline the manual process of registering and starting the failed virtual machines, but to my surprise, in my test lab vSphere HA happily restarted all the failed VMs even when the storage failover took more than an hour.

What is actually happening can be observed in the FDM logs on the HA master node. When the site fails, the following log message repeats every minute:

[4CD9EB90 info 'Placement'] [RR::CreatePlacementRequest] 6 total VM with some excluded: 0 VM disabled; 0 VM being placed; 6 VM waiting resources; 0 VM in time delay;

The HA master node tried to find suitable hosts to restart the six failed VMs from the failed site. As none of the available hosts yet had access to the storage needed by those VMs, it retried every minute. Once I failed over the storage, the following messages appeared immediately:

[InventoryManagerImpl::ProcessVmCompatMatrixChange] Updating the existing compatibility matrix.

[VmOperationsManager::PerformPlacements] Sending a list of 6 VMs to the placement manager for placement.

The host-to-VM compatibility matrix was updated, the placement manager immediately found suitable hosts, and the HA restart process started.

This means you need not worry about the vSphere HA timeout. The timeout does not start ticking until the HA power-on operation is attempted, and HA then makes five restart attempts at the 0, 2, 6, 14 and 30 minute marks.
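The restart schedule above follows from a simple pattern: the delay between attempts starts at 2 minutes and doubles each time. A short shell loop illustrates it (this is just an illustration of the documented default schedule, not something read from FDM):

```shell
# Default vSphere HA restart schedule: 5 attempts, delay between
# attempts starting at 2 minutes and doubling each time, which
# lands the attempts at T+0, 2, 6, 14 and 30 minutes.
t=0
d=2
for attempt in 1 2 3 4 5; do
    echo "attempt ${attempt} at T+${t} min"
    t=$((t + d))
    d=$((d * 2))
done
```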

P.S. Lee Dilworth and Duncan Epping had a VMworld 2012 session about vMSC where this information is mentioned.

Edit 8/8/2013

P.P.S. Duncan Epping blogged about HA and its dependence on compatibility matrix here.

vSphere HA and NFS Datastores

Recently, during a vCloud Director project, we were testing how long it takes to recover from an HA event. The test was done on a two-node management cluster: we loaded one host with almost all of the management VMs, then shut it down and measured how long it took all the affected services to recover. This exercise was done to see if we could fulfill the required SLA.

The expectation was that it would take about 20 seconds for the other host to find out the first one was dead, and that it would then start to power up all the VMs based on their restart priority. The database server and domain controller had high priority; the rest of the VMs had the default one. To our surprise it took not 20 seconds or so, but 8:40 minutes to register the database server and start the boot procedure. For some reason that particular server was shown with a 95% Power On status. Although there are books written about vSphere HA, this behaviour was not explained in any of them.

See the Start and Completed Times:

At first it looked like a bug, so an SR was raised, but then we found out it is like that by design. We were using NFS storage, and NFS locking influences how long it takes to release the locks on the VMs' vmdk and vswp files. KB article 1007909 states that the time to recover a lock on NFS storage can be calculated as:

X = (NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout

which with default values is

X = (10 * 3) + 5 = 35 seconds.

However, the database server had 12 vmdk disks (2 disks per database) and the restart actually took (12+1) × (35+5) seconds = 8:40 minutes. This means the locks were released sequentially, an additional 5 seconds was added for each one, and the VM swap file lock also had to be released. This is expected behavior for vSphere 5.0 and older. The newly released vSphere 5.1 lowers the time to about 2 minutes, as there are 5 threads (the main vmx thread plus 4 workers) working in parallel, so the 13 files can be released in 3 passes.
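The arithmetic above can be reproduced with a short shell sketch. The variable names mirror the NFS advanced settings from KB 1007909; the extra 5 seconds per file is the overhead observed in this particular case, not a documented constant:

```shell
# Per-lock recovery time per KB 1007909, using the default values
# of the NFS advanced settings (all times in seconds).
DiskFileLockUpdateFreq=10
LockRenewMaxFailureNumber=3
LockUpdateTimeout=5
per_lock=$((DiskFileLockUpdateFreq * LockRenewMaxFailureNumber + LockUpdateTimeout))

# 12 VMDKs plus the swap file, released sequentially, with the
# ~5 s of additional overhead observed per file in this case.
files=13
total=$((files * (per_lock + 5)))

echo "per lock: ${per_lock} s, total: $((total / 60)):$((total % 60))"
```

With the defaults this prints 35 seconds per lock and a total of 8:40, matching the measured restart time.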

KB article 2034935 was written about this behaviour.

If this is by design what can you do to avoid it?

1. Upgrade to ESXi 5.1 to get up to 5 times faster HA restart times

2. Use block storage instead of NFS

3. Tweak NFS advanced parameters (DiskFileLockUpdateFreq, LockRenewMaxFailureNumber, LockUpdateTimeout) – however this is not recommended

4. Do not use that many VMDKs. Either consolidate to a smaller number of disks, or use in-guest disk mapping (iSCSI, NFS)

5. Just accept it when you calculate your SLAs.

How to initiate scripted VMware HA failover

The situation: ESX HA cluster stretched over two sites A and B. The shared storage is at site B.

The task: if site A loses electricity, initiate a graceful HA failover of all the virtual machines to site B, considering that the hosts are licensed only with the vSphere Standard edition (no vMotion).

How to do this:

  1. Have enough capacity for VMs from site A on hosts at site B
  2. The UPS has to call a script on the hosts at site A. An agent from the UPS supplier has to be installed on the hosts, or on a vSphere Management Assistant that controls the hosts. The script must run on the hosts themselves; it is not possible to execute it remotely!
  3. The script is quite simple:

    esxcfg-vswif -D    # disable all service console interfaces
    sleep 180          # give HA time to gracefully shut down the guest VMs
    esxcfg-vswif -E    # re-enable the service console interfaces

  4. Set HA isolation response to shut down

How does it work? The first command disconnects all network interfaces from the service console. This isolates the host, because the heartbeat to the other hosts and the gateway is lost. After a while, HA on the host shuts down the guest VMs. This is a graceful shutdown (if VMware Tools are installed) and takes some time, hence the sleep command; in this case it waits 3 minutes. When the other hosts at site B detect the loss of heartbeat, they try to restart the machines. They have to wait until the SCSI locks on the VM files are released. The sleep time has to be long enough for all the guest machines to shut down, so that the other hosts still detect the loss of heartbeat and the SCSI locks get released. Finally, the service console network interfaces are restored and the host can be shut down (either by the UPS agent or with the shutdown -h now command).