vMSC and vSphere HA Timeout?

I am currently working on a vSphere Metro Storage Cluster (vMSC) design with NetApp MetroCluster. vMSC is basically a stretched vSphere HA cluster between two sites built on top of some kind of clustered storage solution. The idea is to leverage vSphere HA and vMotion to achieve high SLAs during planned or unplanned downtime.

While researching and simulating all the failure scenarios, I found one particular scenario that must include a manual step. When there is a complete site failure, the surviving storage controller cannot distinguish between a site failure and a mere network partition (split brain). NetApp MetroCluster offers a Tie-Breaker which needs to be deployed in a third datacenter and acts as a witness, helping the surviving controller decide what to do.

NetApp MetroCluster with Tie-Breaker and Full Site Failure

If a third datacenter is not available and the Tie-Breaker cannot be implemented, the storage controller takes no action and the storage administrator needs to perform a manual forced takeover of the storage resources on the surviving controller.
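
On a Data ONTAP 7-mode MetroCluster this forced takeover should, as far as I know, be the disaster variant of the takeover command, issued on the surviving controller (verify against the NetApp documentation for your Data ONTAP release before relying on it):

cf forcetakeover -d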

I was wondering how the vSphere infrastructure will react to this and what steps the vSphere administrator will need to take to recover the failed workloads. The VMware white paper VMware vSphere Metro Storage Cluster Case Study actually mentions this scenario and says:

NOTE: vSphere HA will stop attempting to start a virtual machine after 30 minutes by default. If the storage team has not issued the takeover command within that time frame, the vSphere administrator must manually start virtual machines when the storage is available.

I was wondering how to streamline the manual vSphere administrator process of registering and starting the failed virtual machines, but to my surprise, even when the storage failover in my test lab took more than an hour, vSphere HA happily restarted all the failed VMs.

What is actually happening can be observed in the FDM log of the HA master node. When the site fails, the following log message repeats every minute:

[4CD9EB90 info 'Placement'] [RR::CreatePlacementRequest] 6 total VM with some excluded: 0 VM disabled; 0 VM being placed; 6 VM waiting resources; 0 VM in time delay;

The HA master node tried to find suitable hosts to restart the six failed VMs from the failed site. As none of the available hosts yet had access to the storage needed by those VMs, it retried every minute. Once I failed over the storage, the following messages immediately appeared:

[InventoryManagerImpl::ProcessVmCompatMatrixChange] Updating the existing compatibility matrix.

[VmOperationsManager::PerformPlacements] Sending a list of 6 VMs to the placement manager for placement.

The host-to-VM compatibility matrix was updated, the placement manager immediately found suitable hosts, and the HA restart process started.
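
If you want to observe this yourself, the FDM log can be followed directly on the HA master host; the path below assumes the default ESXi 5.x log location:

~ # tail -f /var/log/fdm.log | grep -E 'Placement|CompatMatrix'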

This means you need not worry about the vSphere HA timeout. The timeout does not actually start ticking until the HA power-on operation is attempted, and then there are five restart attempts at the 0, 2, 6, 14 and 30 minute marks.

P.S. Lee Dilworth and Duncan Epping actually had a VMworld 2012 session about vMSC where this information is mentioned.

Edit 8/8/2013

P.P.S. Duncan Epping blogged about HA and its dependence on the compatibility matrix here.

Hardware Accelerated Fast Provisioning in vCloud Director

Recently I have been struggling to enable hardware accelerated fast provisioning in vCloud Director. It is not particularly well documented, so I am putting all the necessary steps here for the benefit of others.

First some theory: VMware technical marketing storage guru Cormac Hogan explains on his personal blog the new vSphere 5.1 storage enhancements in vCloud Director and also the new NFS VAAI Fast File Clone primitive. vCloud Director has supported linked clone based fast provisioning since version 1.5; however, the current version 5.1 adds full support for the hardware offload. A linked clone (first used in VMware View) is a duplicate of a virtual machine that uses the same base disk as the original, with a chain of delta disks to track the differences between the original and the clone. It is used mainly to speed up VM provisioning operations: creating a clone takes just a second or so, whereas a full clone operation can take minutes. It also brings significant storage efficiency. However, it has some drawbacks – the main one being a loss of performance. As we can create a clone of a clone, a long chain of related delta disks can build up (by default up to 30) and the virtual machine’s read I/O has to traverse the chain to find the right block. Also, by design, delta disks are not storage aligned as they contain the block content plus its location.
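
To make the chain idea more concrete, below is a simplified sketch of a linked clone delta disk descriptor. The names and values are made up for illustration and the exact createType and extent type depend on the disk format and vSphere version:

# Disk DescriptorFile
version=1
CID=a1b2c3d4
parentCID=d4c3b2a1
createType="vmfsSparse"
parentFileNameHint="/vmfs/volumes/NetAppNFS1/baseVM/baseVM.vmdk"
# Extent description
RW 16777216 VMFSSPARSE "cloneVM-000001-delta.vmdk"

Each delta disk points to its parent through parentFileNameHint, so a read that does not hit the delta has to walk up the chain until it finds the block in an ancestor or in the base disk.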

So here comes the hardware offload. If NFS storage is used, we can use the Fast File Clone VAAI primitive, which offloads the clone creation (basically a vmdk file copy) to the storage hardware. NFS has an advantage here over block storage, where the array has no notion of vmdk files. I have seen EMC and NetApp implementations and will describe the latter, as it can be easily simulated with the NetApp Edge Virtual Storage Appliance, which can be downloaded with a 90-day evaluation license here (note: NetApp Edge requires 2 vCPUs and utilizes them 100% all the time!).

  1. NetApp supports NAS VAAI with Data ONTAP 8.1 firmware in cluster mode and Data ONTAP 8.1.1 in 7-mode. The NetApp Edge appliance I used came with Data ONTAP 8.1.1 in 7-mode.
  2. FlexClone file technology is used to create hardware offloaded linked clones, therefore it must be licensed. FlexClone does not physically copy any data blocks; instead, new metadata is created which points to the original blocks, as shown in the picture taken from the NetApp Storage Management Guide.
    FlexClone

    If FlexClone is not licensed, vCloud Director can still offload the cloning to the array, however a slow full clone is created by the array instead. The eval license comes with a FlexClone license key which must be entered with the license add <license code> command from the Data ONTAP console.

  3. In order to enable NFS VAAI on vSphere, a storage vendor VMkernel module must be installed. NetApp provides the NetApp NAS Plugin (NetAppNasPlugin.v18.zip). It must be either incorporated into the ESXi installation image profile or installed manually from the ESXi shell:

    esxcli software vib install -d /path/to/NetAppNasPlugin.v18.zip


  4. VAAI must also be enabled on the NetApp side. This is done by enabling VMware vStorage support with the following command from the Data ONTAP console:

    options nfs.vstorage.enable on

  5. If everything is done correctly, the Hardware Acceleration column in the vSphere datastore list should show Supported. More info about the datastore can be displayed by running vmkfstools -Ph /vmfs/volumes/<datastore> (a quick command-line re-check of the whole setup is also sketched right after this list):
    ~ # vmkfstools -Ph /vmfs/volumes/NetAppNFS1
    NFS-1.00 file system spanning 1 partitions.
    File system label (if any): NetAppNFS1
    Mode: public
    Capacity 28.5 GB, 28.5 GB available, file block size 4 KB
    UUID: 1799ba01-2494838d-0000-000000000000
    Partitions spanned (on "notDCS"):
           nfs:NetAppNFS1
    NAS VAAI Supported: YES
    Is Native Snapshot Capable: YES


  6. Once we assign the NFS datastore to a storage profile which is used by a Provider VDC in vCloud Director, we should see it in the vCloud Director > System > Manage & Monitor > vSphere Resources > Datastores & Datastore Clusters menu. Here we must check the Enable VAAI fast provisioning checkbox in the datastore's General properties, which instructs vCloud Director that linked clones on this particular datastore will be hardware offloaded.

    Datastore Properties
  7. Now we can create an Organization VDC with Fast Provisioning enabled and test its functionality.
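
For reference, here is a minimal sketch of how the whole setup can be re-checked from the command line. Treat it as illustrative; the exact commands and output differ between Data ONTAP and ESXi versions. On the Data ONTAP console:

license
options nfs.vstorage.enable

The license command should list flex_clone among the licensed features and the nfs.vstorage.enable option should be reported as on. On the ESXi shell:

~ # esxcli software vib list | grep -i netapp
~ # esxcli storage nfs list

The first command should show the NetAppNasPlugin VIB installed and the second should list the NFS datastore with Hardware Acceleration reported as Supported.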

A few notes:

  • The maximum length of a FlexClone chain is 32,767, which is much higher than the vCloud Director default VAAI chain length of 256. In cases where the storage array does not support a chain length of 256, it must be lowered by changing the value in the vCloud Director database (config table -> VirtualMachine.AllowedMaxVAAIChainLength).
  • You can see if a particular vmdk was created by a hardware snapshot by examining its vmdk descriptor file (see the sketch after this list):
    isNativeSnapshot="yes" … hardware offloaded snapshot
    isNativeSnapshot="no" … regular vSphere redo-log based snapshot
  • The concept of shadow VMs stays the same as with regular fast provisioning. A FlexClone operation cannot span FlexVols, even if both are on the same aggregate. Therefore, when a clone operation between datastores is initiated, a fully cloned shadow VM is created first. Then a regular vSphere snapshot is performed on the shadow VM and a native snapshot is created for the target clone. Shadow VMs are registered in the System VDC resource pool.
  • A shadow VM is not deleted automatically even if all its clones on the datastores are removed. This actually makes sense, as additional clones could be created that leverage the fast clone operation. It can be deleted from the original VM's "Shadow VMs" tab.
  • Native clones cannot be storage vMotioned.
  • I have seen a statement that only VM hardware version 9 is supported, however I have successfully tested it with VM hardware version 8.
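
A minimal sketch of that descriptor check from the ESXi shell; the datastore, VM and disk names are just placeholders for whatever your clone is actually called:

~ # grep isNativeSnapshot /vmfs/volumes/<datastore>/<vm>/<vm>-000001.vmdk
isNativeSnapshot="yes"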