vCloud Director: Online Migration of Virtual Data Center – Part II

About two years ago I wrote a blog post describing how a service provider (with a cloud based on vCloud Director) can replace hardware without tenants noticing. The provider can stand up a new vSphere cluster on new hardware and migrate a whole Provider VDC to it while its virtual machines keep running. In my mind this is one of the features that distinguishes vCloud Director from other cloud management systems.

Recently I had an opportunity to test this in a large environment with vCloud Director 5.5 and would like to report how much simpler it has become compared to vCloud Director 5.1, which the former post was based on. No API calls are needed any more and everything can be done from the vCloud administrator GUI in the following five steps:

  1. Prepare a new cluster and create a new provider VDC on it.
  2. Merge it with the old provider VDC. Disable the resource pool(s) of the old provider VDC.
    Merge PVDCs
  3. Migrate all VMs from the old resource pool to the new one in vCloud Director (not in vSphere!!!)
    Migrate VMs
  4. Redeploy all Edge Gateways running on the old hardware.
  5. Detach the old resource pool.
    Detaching Resource Pool

While the first four steps are identical to those in vCloud Director 5.1, step 5 is where the simplification lies. The last step takes care of both the migration of catalog templates and the deletion of shadow VMs.

As before, this is possible only with elastic VDCs (Pay-As-You-Go or elastic Allocation Pool).

 

Memory Overhead in vCloud Director

This little-known fact has come up twice recently, so I will write just a short post about it.

In the past, when you created an Allocation Pool Organization Virtual Datacenter (Org VDC) and allocated, for example, 20 GB of RAM to it, you could not actually deploy VMs with a total memory of 20 GB, because the virtualization memory overhead was charged against the allocation as well. This behavior forced service providers to add an additional 5%-15% to the Org VDC memory allocation. It was also very confusing for end users, who complained that their VMs could not power on.

With the elastic Allocation Pool VDC changes that came with vCloud Director 5.1 this is no longer an issue. The reason is that in a non-elastic VDC it is vSphere (through the Org VDC resource pool object with a limit set) that performs admission control, i.e. decides how many VMs can be deployed into it. In an elastic VDC it is vCloud Director that decides whether a VM can be deployed into a particular VDC.

Allocation Pool Elasticity

So to sum it up: if you use an elastic Allocation Pool, the tenant can use it up to the last MB. The VM virtualization memory overhead is charged to the provider, who must take it into account when doing capacity management.
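To make the difference concrete, here is a minimal sketch of the two admission-control behaviors. The function name and the flat 100 MB per-VM overhead are assumptions chosen for illustration only; real overhead varies per VM and is not taken from vCloud Director.

```python
def fits_in_allocation(vm_memory_mb, allocation_mb, overhead_mb_per_vm=0):
    """Return True if all VMs fit into the Org VDC memory allocation.

    overhead_mb_per_vm models virtualization overhead charged against the
    tenant's allocation (hypothetical flat value, for illustration only).
    """
    charged = sum(mem + overhead_mb_per_vm for mem in vm_memory_mb)
    return charged <= allocation_mb


allocation = 20 * 1024   # 20 GB Org VDC memory allocation
vms = [4 * 1024] * 5     # five 4 GB VMs, 20 GB in total

# Non-elastic behavior: overhead is charged against the tenant's allocation.
non_elastic_ok = fits_in_allocation(vms, allocation, overhead_mb_per_vm=100)

# Elastic behavior: overhead is carried by the provider, not the tenant.
elastic_ok = fits_in_allocation(vms, allocation)
```

With overhead charged to the tenant the fifth VM no longer fits, which is the situation that used to confuse end users.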

Allocation Pool Organization VDC Changes in vCloud Director 5.1.2

This is a follow-up to the original article, Allocation Pool Organization vDC Changes in vCloud Director 5.1, and reflects what has changed in the recently released vCloud Director 5.1.2.

One of the new features of vCloud Director 5.1 was the elastic Allocation Pool VDC. Elastic means that the VDC can span multiple clusters, which simplifies the provider's capacity management.

The feature required some changes in how the now elastic VDC maps to vSphere resource pools, and these changes were disruptive for some customers upgrading from vCloud Director 1.5. Therefore both vCloud Director 5.1.1 and 5.1.2 tweaked the feature to make those customers happy.

For a deep dive into how Org VDC allocation types relate to vSphere resource management, see Massimo Re Ferre's post: vCloud Director 5.1(.1) Changes in Resource Entitlements (Updated).

I will concentrate only on the Allocation Pool VDC differences.

vCloud Director 5.1.0

Allocation Pool VDCs require a new parameter, vCPU speed, which defines how much CPU reservation and limit is applied to the Org VDC resource pools, which can now span multiple clusters. Each such resource pool gets a reservation and a limit based on the sum of all vCPUs of the vApps deployed in that particular resource pool.

Example: if the vCPU speed is set to 2 GHz and I have deployed 3 VMs, each with 2 vCPUs, with one placed into one resource pool and the other two into a second, the first resource pool will get a 4 GHz limit and the second an 8 GHz limit (the reservation is set as a percentage of the limit).
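The per-pool limit calculation in 5.1.0 can be sketched as follows, using a vCPU speed of 2 GHz. The function name, the pool names rp1 and rp2, and the data layout are hypothetical; this only illustrates the arithmetic, not how vCloud Director implements it.

```python
from collections import defaultdict


def pool_cpu_limits_ghz(vcpu_speed_ghz, placements):
    """Sum vCPU speed x vCPU count per resource pool (5.1.0 behavior).

    placements: iterable of (resource_pool_name, vcpus), one per deployed VM.
    """
    limits = defaultdict(float)
    for pool, vcpus in placements:
        limits[pool] += vcpu_speed_ghz * vcpus
    return dict(limits)


# Three 2-vCPU VMs: one in rp1, two in rp2 (pool names are illustrative).
limits = pool_cpu_limits_ghz(2.0, [("rp1", 2), ("rp2", 2), ("rp2", 2)])
```

Here rp1 ends up with a 4 GHz limit and rp2 with 8 GHz; the reservation would then be the guaranteed percentage of each limit.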

This means that you cannot overallocate an Org VDC in terms of vCPUs (max number of vCPUs x vCPU speed = Org VDC CPU allocation), in much the same way that memory could not be overallocated in vCloud Director 1.5.

vCloud Director 5.1.1

As mentioned above, some customers complained that vCloud Director tenants were now constrained in how many vCPUs they could deploy into their Org VDC. Providers tried to fight this by setting very low vCPU speeds, but the problem was that with only a few VMs deployed the resource pool limit was very low compared to the allocated Org VDC CPU GHz.

vCloud Director 5.1.1 came with a quick fix. The CPU limit of Allocation Pool resource pools was no longer based on the number of vCPUs deployed in the resource pool as in 5.1.0, but was set to the whole Org VDC CPU allocation instead. This means that even the first (and only) deployed vCPU can utilize the full Org VDC CPU allocation (obviously limited by the physical speed of the core). The downside is that if the Org VDC spans multiple resource pools, the tenant will get more CPU resources than he is entitled to. However, as long as the provider designed all his Provider VDCs to be backed by only one cluster/resource pool and set a low vCPU speed, the behavior was very similar to vCloud Director 1.5.
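A small sketch of the 5.1.1 downside, assuming an Org VDC with a 10 GHz CPU allocation whose VMs landed in two resource pools. The function and pool names are hypothetical and the numbers are made up for illustration.

```python
def pool_cpu_limits_511(org_vdc_cpu_allocation_ghz, pools_in_use):
    """5.1.1 behavior as described above: every resource pool backing the
    Org VDC gets the whole Org VDC CPU allocation as its limit."""
    return {pool: org_vdc_cpu_allocation_ghz for pool in pools_in_use}


allocation_ghz = 10.0
limits = pool_cpu_limits_511(allocation_ghz, ["rp1", "rp2"])

# Combined ceiling across pools versus what the tenant actually bought.
total_limit_ghz = sum(limits.values())
excess_ghz = total_limit_ghz - allocation_ghz
```

With a single resource pool the excess is zero, which is why the one-cluster design mentioned above keeps the behavior close to vCloud Director 1.5.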

vCloud Director 5.1.2

The problem with the previous approach was that once you upgraded to 5.1.1 you could not go back to the truly elastic VDCs of 5.1.0, even if you wanted to. That has changed with 5.1.2.

There is a new “Make Allocation pool Org VDCs elastic” configuration option in System Settings > General > Miscellaneous which gives you the possibility to choose the Allocation Pool behavior.

Allocation Pool Elasticity

When upgrading from a vCloud Director 5.1.0 installation that used Allocation Pool Org VDCs spanning multiple clusters, this option will be enabled; otherwise it is disabled by default.

If it is disabled, Allocation Pool Org VDCs behave exactly as in vCloud Director 1.5: no vCPU speed setting, no spanning of multiple clusters and easy vCPU overallocation.

If the option is enabled, Allocation Pool Org VDCs behave exactly as in vCloud Director 5.1.1! So beware: it does not revert to the 5.1.0 way of setting the resource pool CPU limit, but uses the 5.1.1 way, which makes it possible for a tenant to use more CPU resources than his Org VDC CPU allocation.

Personally I had hoped that the elastic behavior would be exactly as in 5.1.0. That is not the case, but it could happen in a future release.

vCloud Director: Online Migration of Virtual Data Center

vCloud Director completely abstracts the underlying virtual resources from the consumers, who get compute and storage resources represented by virtual datacenters (VDCs) with a given tiered profile (e.g. gold, silver, bronze). The provider, however, must care about the actual physical hardware and from time to time faces upgrades and migrations.

Fortunately it is possible to migrate whole Provider VDCs non-disruptively from old hardware to new, with no or minimal impact on the vCloud customers and no downtime for their VMs running in the cloud.

vCloud Director 5.1 has two features that help to accomplish this: elastic VDC (VDCs spanning multiple vSphere clusters) and merging of Provider VDCs. I already wrote about elastic VDCs in the post about Allocation Pool changes so please read that article first if that concept is new to you.

At a high level, the online migration process from the old to the new hardware works like this:

  • Let’s say that the Provider VDC (PVDC) called GoldVDC is backed by Cluster1 consisting of old hardware.
  • A new Cluster2 is created with new hardware.
  • A new PVDC, let us call it GoldVDCnew, is created from Cluster2 and merged with the old GoldVDC (backed by Cluster1). Although we could add the new Cluster2 directly to GoldVDC, this would not allow us to retire the old hardware, as it is not possible to detach the primary resource pool from a PVDC.
    Merge PVDCs
  • We can now rename PVDC GoldVDCnew to GoldVDC and disable Cluster1. This has no impact on running VMs; however, any newly deployed or powered-on VMs are now placed on Cluster2.
    Disable Resource Pool
  • Now we have to migrate all the workloads from Cluster1 to Cluster2 and then detach Cluster1 from the PVDC.
    Detach Resource Pool

The actual migration between clusters (or resource pools) has to take into account the following five types of objects that exist in the VDCs: vApps, catalog templates, catalog media (images), Edge Gateways and vApp Edges.

vApps

vCloud Director does not actually use vSphere vApp objects. From the point of view of the vSphere infrastructure, vCloud vApps are just a collection of VMs, so we only need to migrate the VMs. This cannot be done from within vSphere because vCloud Director keeps track of the resource pool in which each VM is placed. Additionally, vCloud Director needs to apply the proper resource pool reservations and limits based on the Org VDC allocation type. There is, however, a migrate option in vCloud Director that can be used, either from the GUI or through the API (see the end of this article). Note: the migration leverages vMotion with shared storage. It is not possible to migrate this way between clusters without shared storage, even though vSphere 5.1 has the so-called Enhanced vMotion (aka shared-nothing vMotion).

Migrate VMs

Catalog Templates

Migration of catalog templates is more difficult. Again, a vCloud template is quite different from a vSphere template: vCloud templates are basically powered-off VMs. Migration at the vSphere level seems less harmful than in the previous case, because no resource pool settings need to be configured (catalog VMs are never powered on), but we would still run into a problem when trying to detach Cluster1, as vCloud Director keeps track of VM to resource pool associations.

Unfortunately the GUI migration option used for vApps cannot be used here. The GUI workaround is to open each catalog and move each catalog VM to the same catalog. This creates a clone of the VM, which gets registered to a Cluster2 host, and the original VM is deleted. It is, however, a very expensive operation from the storage perspective: the cloning temporarily needs extra space and generates quite a lot of storage I/O. Fast provisioning (linked clones) can help here.

The second alternative is to use the same API call as for VMs. Although this is not documented, it works (see the example at the end of the article).

Catalog Media

ISO or FLP images are stored on vSphere datastores in a special folder: <VCD Instance Name>/media/<org ID>/<org VDC ID>/media-<media ID>.<ISO|FLP>. vSphere (vCenter) does not keep track of these objects in any way. vCloud Director stores the datastore moref and the media folders in its database. The media upload is done by the vCloud Director cells using NFC (VMware Network File Copy protocol) via any ESX host that is connected to the datastore. Therefore, as long as Cluster2 has access to the media datastores, nothing needs to be migrated.

vShield Edges

Gateway and vApp Edges are always-running VMs placed in the System vDC resource pool in Cluster1. If a vApp with a routed vApp network is powered off, its vApp Edge is destroyed. When the vApp is started again, a new vApp Edge is deployed with an identical configuration and is placed into the new Cluster2. A simple vMotion between clusters seems to work at first but is definitely not recommended: vShield Manager keeps track of the cluster to which each Edge is deployed, and any major Edge configuration change (an upgrade to a new version or an upgrade to the full configuration) would try to deploy the Edge to the original cluster.

Edge redeploy is an operation with minimal impact on the network flows going through the virtual router. A new Edge VM is deployed by vShield Manager, an identical configuration is pushed to it, and then the networks are disconnected from the old Edge and connected to the new one. This might result in losing an IPsec VPN connection or a load-balanced session; otherwise the disruption is minimal. The Edge redeploy, however, cannot be done directly from vShield Manager (too bad, as there is a nice script for this: see KB 2035939), because vShield Manager knows nothing about the restrictions made in vCloud Director on the PVDC (the disabled Cluster1). Edge Gateway redeploy and routed/fenced network reset must be done from vCloud Director, either from the GUI (although it is not trivial to find all the running vApp Edges) or with the vCloud API.

Other Considerations

There are some limitations or considerations that need to be taken into account:

  • VDC elasticity currently (version 5.1) works only within a single vCenter Datacenter, and all clusters need to use the same distributed switch for external networks and the network pool.
  • Reservation Pool Org VDCs do not currently support VDC elasticity (those workloads cannot be migrated).
  • Both clusters should have access to the same storage. If storage migration is required, do it as an independent second step.
  • vSphere vMotion restrictions apply: if the new hardware has a newer-generation CPU, leverage EVC and lower the compatibility of the new cluster to that of the old hardware. Once the old hardware is retired, the EVC mode can be changed and any restarted (full power cycle required) or new VMs can take advantage of it. Obviously, migration between different CPU architectures (AMD vs Intel) is not possible.
  • 1 GHz of an old CPU is not equal to 1 GHz of a newer-generation CPU. Therefore do not mix them in elastic VDCs except for the migration reasons mentioned above. This could also impact Chargeback: the customer will get different (higher) performance for the same cost.

vCloud API Examples

As mentioned above, vCloud VMs can, and vCloud templates should, be migrated with an API call. The request looks like this:
POST API-URL/admin/extension/resourcePool/id/action/migrateVms with Request body containing MigrateParams.

VM migration example:

Migrate VM API

Template migration example

Migrate Template API
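As a minimal sketch, the request URL described above can be assembled like this. The helper name, the host name and the resource pool id are hypothetical placeholders, and the MigrateParams request body is deliberately left out because its contents are not reproduced in this text.

```python
def migrate_vms_url(api_url, resource_pool_id):
    """Build the migrateVms action URL for a given resource pool.

    Follows the pattern API-URL/admin/extension/resourcePool/id/action/migrateVms;
    the MigrateParams body still has to be supplied with the POST request.
    """
    base = api_url.rstrip("/")
    return f"{base}/admin/extension/resourcePool/{resource_pool_id}/action/migrateVms"


# Hypothetical API URL and resource pool id, for illustration only.
url = migrate_vms_url("https://vcloud.example.com/api/", "1234")
```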

Allocation Pool Organization vDC Changes in vCloud Director 5.1

vCloud Director offers three consumption types of cloud resources (organization virtual datacenters) – so called allocation models. These are:

  • Reservation Pool, which is mostly used when a whole cluster (and its resources) is dedicated to the tenant
  • Pay-As-You-Go, which is similar to the Amazon cloud consumption model, where the tenant gets the resources only when a particular VM is deployed
  • Allocation Pool when only a percentage of provider resources is dedicated

In the first release of vCloud Director (1.0), a provider virtual datacenter (provider vDC) had to be backed by either a DRS cluster or its child resource pool. A vSphere cluster can scale up to 32 hosts, which means that the maximum compute capacity of a provider vDC was determined by the maximum cluster size. Tenants get their compute in chunks of the provider vDC, called organization vDCs, and are limited by the available free capacity in the cluster. One of the characteristics of the cloud delivery model is the ability to scale so that tenants can consume cloud resources in an elastic way. However, if the cluster backing the provider vDC is full and the tenant asks for more resources, the provider has to create a new organization vDC for the tenant from a different provider vDC backed by another resource pool from another cluster. This is not an ideal solution for the tenant, as managing multiple organization vDCs is more demanding from a capacity perspective.

The second major release of vCloud Director (version 1.5) came up with the notion of the elastic vDC. A provider vDC could be backed by multiple resource pools and thus was not bound to one vSphere cluster. However, only Pay-As-You-Go organization vDCs were able to take advantage of this elasticity. Allocation Pool and Reservation Pool organization vDCs were still bound only to the primary resource pool of the provider vDC.

The third and most recent major release of vCloud Director, version 5.1, extends the elasticity to Allocation Pool organization vDCs as well. The consumption models rely on vSphere resource management (virtual machine and resource pool reservations and limits), and the way the Allocation Pool model leverages them has changed significantly; hence the need for this post.

Firstly I would recommend reading Frank Denneman’s and Chris Colotti’s recently released white paper VMware vCloud Director Resource Allocation Models, which describes mapping of all three organization vDC allocation models to vSphere resource management for vCloud Director 1.5.x. Let me now quickly recap how the Allocation pool works in vCloud Director 1.5.x:

  • The provider creates an Allocation Pool organization vDC with a CPU and a memory allocation, of which only a percentage is guaranteed for each.
  • At the vSphere level, a child resource pool is created in the primary resource pool of the provider vDC, with limits equal to the org vDC CPU and memory allocations and reservations equal to the guaranteed percentage x the allocated value.
  • When the tenant deploys a VM in this organization vDC, no VM CPU limit or reservation is set; however, the memory reservation is set to the guaranteed memory percentage x VM memory and the limit is set equal to the VM memory.
  • This results in very flexible use of CPU resources: the tenant can overallocate vCPUs as he wishes, because the organization vDC resource pool ensures that total CPU usage does not exceed the tenant's CPU allocation. He cannot, however, overallocate memory, as the VM memory reservations must be backed by the resource pool memory reservation.

Elastic Allocation Pool (in vCloud Director 5.1) implies that the organization vDC can consist of two or more resource pools. This means that CPU and memory usage cannot be controlled at the resource pool level the way it was done in vCloud Director 1.5.x. The organization vDC could be fragmented into many resource pools, and vCloud Director has to distribute the organization vDC guaranteed resources (reservations) and allocations (limits) among them. Each resource pool's entitlements are therefore based on the virtual machines running in it. This is similar to the way the Pay-As-You-Go model works, where resource pool settings are changed every time a VM is powered on in it. Doing this for memory is not a big problem, because memory is not as elastic a resource as CPU (it is harder to take memory away from a VM that is not using it and give it to another VM, whereas it is very easy to redistribute unused CPU cycles). Unfortunately there is an impact on the way CPU resources are distributed. The Allocation Pool model now has to rely on the vCPU speed setting, which must be set when the organization vDC is created. It is then possible to calculate the entitlement of each VM in terms of CPU MHz and sum these up into the entitlement of each resource pool. As a result, resources are shared only among the VMs that are running in a particular resource pool. If some organization vDC resources are not yet deployed, they cannot be used by the VMs already running in the organization vDC.

Allocation Pool Organization vDC

What happens at vSphere level?

  • After creation, the organization vDC is backed by one resource pool in the primary resource pool of the provider vDC. The organization vDC resource pool has a 0 MHz CPU reservation and limit with no expandable reservation, and a 0 MB RAM reservation with expandable reservation and unlimited memory.
  • When a vApp is deployed (but not yet powered on), it might be placed into another resource pool backing the provider vDC, the one with the most free capacity, where available networks and datastores are also considered. vCloud Director performs admission control to check whether the organization vDC has available resources to power on the vApp. If the vApp is placed into a new provider vDC resource pool, an organization vDC resource pool is again created with a 0 MHz CPU reservation and limit with no expandable reservation, and a 0 MB RAM reservation with expandable reservation and unlimited memory.
  • When the vApp is powered on, its resource pool (RP) reservations are increased by its guaranteed allocation:

New RP CPU reservation = original RP CPU reservation + org vDC vCPU speed x VM number of vCPUs x percentage guaranteed

New RP RAM reservation = original RP RAM reservation + VM RAM x percentage guaranteed + VM memory overhead

Memory still has expandable reservation. There is still no memory limit set, but the CPU limit is increased by the VM vCPU allocation:

New RP CPU limit = original RP CPU limit + org vDC vCPU speed x VM number of vCPUs

The memory limit on the resource pool does not need to be set, because vCloud Director performs the memory admission control and the memory allocation is the natural limit at the VM level. CPU is the more liquid resource, therefore the limit is set only at the resource pool level (calculated from the vCPU speed) and is not set at the VM level (this is different from the Pay-As-You-Go model).
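The power-on bookkeeping can be sketched as below. This is an illustration only: the class and function names are hypothetical, the 100 MB memory overhead is a made-up figure, and the CPU limit is assumed to grow by the full vCPU allocation (vCPU speed x number of vCPUs), so that the reservation works out as a percentage of the limit.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """Org vDC resource pool settings; starts empty as described above."""
    cpu_reservation_mhz: float = 0.0
    cpu_limit_mhz: float = 0.0
    ram_reservation_mb: float = 0.0   # no RAM limit is tracked: it stays unlimited


def power_on_vm(rp, vcpu_speed_mhz, vcpus, vm_ram_mb,
                cpu_guaranteed, ram_guaranteed, vm_overhead_mb):
    """Increase the resource pool settings when a VM is powered on."""
    vcpu_allocation_mhz = vcpu_speed_mhz * vcpus
    rp.cpu_reservation_mhz += vcpu_allocation_mhz * cpu_guaranteed
    # Assumption: the limit grows by the full vCPU allocation (no guarantee factor).
    rp.cpu_limit_mhz += vcpu_allocation_mhz
    rp.ram_reservation_mb += vm_ram_mb * ram_guaranteed + vm_overhead_mb
    return rp


# One VM with 2 vCPUs and 4 GB RAM, 1 GHz vCPU speed, 50% guarantees.
rp = power_on_vm(ResourcePool(), vcpu_speed_mhz=1000, vcpus=2, vm_ram_mb=4096,
                 cpu_guaranteed=0.5, ram_guaranteed=0.5, vm_overhead_mb=100)
```

For this VM the pool ends up with a 1000 MHz CPU reservation, a 2000 MHz CPU limit and a 2148 MB RAM reservation.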

Design considerations

  1. Use the same CPUs for all hosts in the same provider vDC to guarantee consistent performance
  2. vCloud Director capacity management is now more important, as capacity guarantees are not enforced at the vSphere level. The provider can create an organization vDC whose guaranteed resources exceed the available resources at the provider vDC level. An information box is displayed when this happens.
  3. The tenant cannot overcommit its organization vDC, neither in memory (this was already the case in vCloud Director 1.x versions) nor in CPU (this is new). If the tenant buys 10 GHz of CPU power with a 2 GHz vCPU speed, he can deploy VMs with only 5 vCPUs in total.
  4. The CPU and memory resource guarantee percentages enable the provider to overallocate its resources. This is different from point 2, as exceeding the guaranteed resources will not result in a provisioning error and the vSphere resource sharing mechanism will come into play.
  5. The vCPU speed must be chosen carefully: too high a value limits the number of vCPUs that can be deployed (see point 3), while too low a value results in poor VM performance if the VM is one of only a few deployed in a particular resource pool. On the other hand, it gives the provider more control over the minimum VM size deployed in the cloud (the tenant cannot deploy a ½ vCPU VM).
  6. When upgrading from an older vCloud Director release consider the impact on existing allocation organization vDCs.
  7. When a VM is powered off and then powered on again, it might get relocated to another resource pool if the original resource pool has run out of resources. This is a more costly operation.
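The vCPU speed trade-off from points 3 and 5 can be illustrated with a short sketch. The function name is hypothetical and values are expressed in whole MHz to keep the arithmetic exact.

```python
def max_vcpus(cpu_allocation_mhz, vcpu_speed_mhz):
    """Maximum number of vCPUs a tenant can deploy into an org vDC,
    given its CPU allocation and the configured vCPU speed (both in MHz)."""
    if vcpu_speed_mhz <= 0:
        raise ValueError("vCPU speed must be positive")
    return cpu_allocation_mhz // vcpu_speed_mhz


# Point 3: 10 GHz bought with a 2 GHz vCPU speed allows 5 vCPUs in total.
high_speed = max_vcpus(10_000, 2_000)

# Point 5: a low vCPU speed allows many more vCPUs, but each vCPU
# contributes correspondingly less to its resource pool's entitlement.
low_speed = max_vcpus(10_000, 250)
```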