vMSC and vSphere HA Timeout?

I am currently working of vSphere Metro Cluster Design (vMCS) with NetApp MetroCluster. vMSC is basically a stretched vSphere HA cluster between two sites with some kind of clustered storage solution. The idea is to leverage vSphere HA and vMotion to achieve high SLAs for planned or unplanned downtime.

While researching and simulating all the failure scenarios there is one particular that must include manual step. When there is complete site failure the surviving storage controller cannot distinguish between site failure or just a network partition (split brain). NetApp MetroCluster offers Tie-Breaker that needs to be deployed in a third datacenter and acts as a witness to help the surviving controller to decide what to do.

NetApp MetroCluster with Tie-Breaker and Full Site Failure
NetApp MetroCluster with Tie-Breaker and Full Site Failure

If the third datacenter is not available and Tie-Breaker cannot be implemented the storage controller takes no action and the storage administrator needs to do manual forced takeover of the storage resources on the surviving controller.

I was wondering how will vSphere infrastructure react to this and what steps the vSphere administrator will need to take to recover the failed workloads. VMware white paper VMware vSphere Metro Storage Cluster Case Study actually mentions this scenarion and says:

NOTE: vSphere HA will stop attempting to start a virtual machine after 30 minutes by default. If the storage team has not issued the takeover command within that time frame, the vSphere administrator must manually start virtual machines when the storage is available.

I was wondering how to streamline the manual vSphere administrator process of registering and starting the failed virtual machines but to my surprise in my test lab even if the storage failover took more than an hour, vSphere HA happily restarted all the failed VMs.

What is actually happening can be observed in the FDM logs of the HA master node. When the site fails following log message is repeating every minute:

[4CD9EB90 info ‘Placement’] [RR::CreatePlacementRequest] 6 total VM with some excluded: 0 VM disabled; 0 VM being placed; 6 VM waiting resources; 0 VM in time delay;

The HA master node tried to find suitable hosts to restart the six failed VMs from the failed site. As none of the available host had yet access to the storage needed by those VMs it was retrying every minute. Once I failed over the storage  following messages immediately appeared:

[InventoryManagerImpl::ProcessVmCompatMatrixChange] Updating the existing compatibility matrix.

[VmOperationsManager::PerformPlacements] Sending a list of 6 VMs to the placement manager for placement.

The host to VM compatibility matrix was updated and immediately the placement manager found suitable hosts and HA restart process started.

This means you need not to worry about vSphere HA timeout. The timeout actually does not start ticking till the HA power on operation is attempted and then it takes 5 restart attempts in 0, 2 min, 6 min, 14 min and 30 minute time points.

P.S. Lee Dilworth and Duncan Epping actually had VMworld 2012 session about vMSC and this information is mentioned there.

Edit 8/8/2013

P.P.S. Duncan Epping blogged about HA and its dependence on compatibility matrix here.

One thought on “vMSC and vSphere HA Timeout?

  1. Hi Tom,
    I have this deployment in place, when we power off one the hosts to test HA and simulate a host failure, the vm´s are restarted in the remaining hosts after 2-8 minutes but we see a message in vCenter:
    “VMs awaiting a retry: a previous restart attempt failed, and vSphere HA is waiting for a timeout to expire before trying again.”

    There are some “Cannot open .vmx file….resource is busy” messages in hostd.log file, but not sure if we can see who is locking it, we left a post in the community but no luck yet

    Any ideas about this VMs awaiting a retry message ?
    Thank you

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.