Org VDC Edge Gateway CPU/RAM Reservations

vCloud Director 8.20 allows deployment of Org VDC Edge Gateways in 4 different form factors from Compact to X-Large where each provides different level of performance and consumes different amount of resources.

As these Edge Gateways are deployed by NSX Manager which allows setting custom reservations for CPU and RAM via an API call PUT https://<NSXManager>/api/4.0/edgePublish/tuningConfiguration, it is also possible in vCloud Director to set custom reservations.

Why would you change the default reservations? Reservations at VM (Edge) level reserve the resources for itself which means no other VM can utilize them in case they are unused. They basically guarantee certain level of service that the VM (Edge) from performance perspective will always deliver. In service provider environments oversubscription provides ROI benefits and if the service provider can guarantee enough resources at cluster scale, than the VM level reservations can be set lower if at all.

This can be accomplished by tuning the networking.gatewayMemoryReservationMultiplier and networking.gatewayCpuReservationMultiplier settings via cell-management-tool from vCloud Director cell. By default the CPU multiplier is set to 64 MHz per vCPU and the Memory multiplier to 0.5.

By default Edge Gateways will be deployed with the following reservation settings:

Org VDC Edge GW Default Resource Reservations

The following command will change memory multiplier to 10%:

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n networking.gatewayMemoryReservationMultiplier -v 0.1

Note: The new reservation settings are applicable only for newly deployed Org VDC Edge Gateways. Redeploying existing edges will not change their reservation settings. You must either use NSX API to do so, or modify Org VDC Edge Gateway form factor (e.g. change Large to Compact and then back to Large) which is not so elegant as it will basically redeploy the Edge twice.

Also note that NSX 6.2 and NSX 6.3 have different sizing of Quad Large Edge. vCloud Director 8.20 is by default set for the NSX 6.3 size which is 2 GB RAM (as opposed to NSX 6.2 value of 1 GB RAM). It is possible to change the default for the reservation calculation by editing networking.full4GatewayMemoryMb setting to value ‘1024’


vCloud Director: Online Migration of Virtual Data Center – Part II

About two years ago I have written a blog post describing how service provider (with cloud based on vCloud Director) can replace hardware without tenants actually noticing this. The provider can stand up new vSphere cluster with new hardware and migrate whole Provider VDC to it with running virtual machines. In my mind this is one of the distinguishing features from other cloud management systems.

Recently I had an opportunity to test this in large environment with vCloud Director 5.5 and would like to report how it is even more simplified compared to vCloud Director 5.1 which the former post was based on. No more API calls are needed and everything can be done from vCloud Administrator GUI in the following five steps:

  1. Prepare new cluster and create new provider VDC on it.
  2. Merge it with the old provider VDC. Disable its resource pool(s).
    Merge PVDCs
  3. Migrate all VMs from the old resource pool to the new one in vCloud Director (not in vSphere!!!)
    Migrate VMs
  4. Redeploy all Edge Gateways running on the old hardware.
  5. Detach the old resource pool.
    Detaching Resource Pool

While the first 4 steps are identical as in vCloud Director 5.1 the step 5 is where the simplification is. The last step takes care of both migration of catalog templates and deletion of shadow VMs.

As before this is possible only with Elastic VDCs (Pay-As-You-Go or Elastic Allocation Pool).


Memory Overhead in vCloud Director

This little known fact came up already twice recently so I will write just a short post about it.

In the past when you created allocation pool Organization Virtual Datacenter (Org VDC) and allocated to it for example 20 GB RAM it actually did not allow you to deploy VMs with total memory sum of 20 GB due to virtualization memory overhead being charged against the allocation as well. This behavior forced service providers to add additional 5%-15% to the memory Org VDC allocation. This was also very confusing for the end users who were complaining why their VM cannot power on.

With the elastic allocation pool VDC changes which came with vCloud Director 5.1 it is no longer an issue. The reason is that in the non-elastic VDC it is vSphere (and Org VDC resource pool object with a limit set) who does the admission control – i.e. how many VMs can be deployed into it. In the elastic VDC it is actually vCloud Director who is responsible for the decision if a VM can be deployed into particular VDC.

Allocation Pool Elasticity

So to sum it up: if you use elastic allocation pool the tenant can use it up to the last MB. The virtualization VM memory overhead is charged to the provider who must take it into account when doing capacity management.

Performance Monitoring in Public Cloud (vCHS) – Part II

This is second part of the article. Read the first part here.

In the first part we have established why it is important to monitor resource utilization of workloads deployed in the public cloud and hinted how we can get some hypervisor metrics. In the second part I want to present an architecture example how it can be achieved.


Hyperic and vCenter Operations monitoring vCHSThe above diagram describes the whole setup. I am monitoring three workloads (VMs) deployed in a public cloud – in this case vCloud Hybrid Service. I have also deployed vFabric Hyperic into the same cloud, which collects the workload metrics through Hyperic agents. vCenter Operations Manager which is installed on premise is then collecting the metrics from Hyperic database and is displaying them in custom dashboards while doing its own magic (super metrics, dynamic thresholding and analytics).

I have also created custom Hyperic plugin which collects Windows VM performance metrics through VMware Tools Guest SDK.

Deployment and Configuration

vFabric Hyperic Deployment

  1. Create Org VDC network in public cloud for Hyperic Deployment.
  2. Upload vFabric Hyperic 5.7.1 Server installation and vFabric Hyperic 5.7.1 DB installation in OVF format to a public cloud catalog.
  3. Deploy Hyperic DB first to Org VDC network first and note the IP address it has been allocated.
  4. Deploy Hyperic Server and enter the Hyperic DB IP address when asked.
  5.  Install Hyperic Agent to all VMs that are going to be monitored.

vFabric Hyperic Configuration

  1. Connect to Hyperic GUI (http://<hyperic server IP>:7080). As I have used jumpbox in the public cloud I did not needed to open the port to the internet, otherwise create the appropriate DNAT and firewall rule to reach the server from outside.
  2. Go to Administration > Plugin Manager > and click Add/Update Plugin(s) button and upload vmPerfMon-plugin.xml custom plugin:
     <server name="VM Performance Counters"
     <plugin type="measurement"
     <!-- You always need availability metric, so just pick some service -->
     <metric name="Availability"
     <!-- Template filter is passed to metrics -->
     <filter name="template"
     <!-- Using object filter to reduce amount of xml -->
     <filter name="object" value="VM Memory"/>
     <metric name="Memory Reservation" alias="Memory Reservation in MB" units="MB"/>
     <metric name="Memory Limit" alias="Memory Limit in MB" units="MB"/>
     <metric name="Memory Shares" alias="Memory Shares"/>
     <metric name="Memory Active" alias="Memory Active in MB" units="MB"/>
     <metric name="Memory Ballooned" alias="Memory Ballooned in MB" units="MB"/>
     <metric name="Memory Swapped" alias="Memory Swapped in MB" units="MB"/>
     <!-- Win perf object is changed, using new one -->
     <filter name="object" value="VM Processor"/>
     <!-- Processor object needs instance information to access -->
     <filter name="instance" value="_Total"/>
     <!-- Giving new template since we now need instance info -->
     <filter name="template"
     <metric name="CPU Reservation in MHz" alias="Reservation in MHz"/>
     <metric name="CPU Limit in MHz" alias="Limit in MHz"/>
     <metric name="CPU Shares" alias="Shares"/>
     <metric name="Effective VM Speed in MHz" alias="Effective VM Speed in MHz" indicator="true"/>
     <metric name="Host processor speed in MHz" alias="Host processor speed in MHz"/>
  3. The plugin should be automatically distributed through agents to the workloads. However we need to configure it. In the Resources tab browse to the workload and in the Tools Menu select New Server. Type a name, in the Server Type drop down find Windows VM Performance Counters and type something in the Install path field.Add Server
  4. After clicking OK immediately click on Configuration Properties and check Auto-discover services.
  5. To start collecting data we need to configure collection interval for the metrics. Go to Monitor > Metric Data  and click Show All Metrics button. Select the metrics and at the bottom input collection interval and submit.Collection Interval
  6. Now when we are collecting data we could create indicators, create alerts, etc., However Hyperic is just a collection engine for us. We will feed its data to vCenter Operations Manager.

vCenter Operations Manager Configuration

  1. Assuming there is already an existing on-premise installation of vCenter Operations Manager we need to configure it to collect data from the cloud Hyperic instance. To do this first we need to open Edge Gateway firewall and create DNAT and firewall rule to the Hyperic DB server (TCP port 5432).
  2. Now we need to download Hyperic vCOps Adapter from here.
  3. To install go to vC Ops admin interface and install it through Update tab.
  4. Then go to the vC Ops custom interface and in Admin > Support click Describe icon.
  5. Next we need to configure the adapter. Go to Environment > Configuration > Adapter Instances and add new Hyperic instance for the public cloud. Then configure the (public) IP address of Hyperic DB, port and credentials.Hyperic Adapter Instance


When finished with all configurations we can create custom dashboards from the collected metrics in vC Ops. This is however out of scope of this article as it depends on the use case.

vC OpsAs an example above I am showing real CPU usage of one cloud VM as reported by the hypervisor. The host physical core speed is 2.2 GHz, however the effective VM speed was varying between 2.5 – 2.7 GHz (max turbo frequency of Intel E5-2660 CPU is 3 GHz). If I would look at the internal Windows GuestOS task manager metric I would see just meaningless 100% CPU utilization.

Performance Monitoring in Public Cloud (vCHS) – Part I

VI administrator is used to monitor the performance of the workloads running in his environment through the rich vSphere client interface. The performance tab presents multiple graphs displaying metrics related to CPU, memory, network, datastore, disk, … This helps him when troubleshooting, rightsizing the workloads or when doing capacity calculations.

So what happens when the VI admin becomes Cloud admin and starts deploying workloads to a public cloud? No access to vCenter Server means he has to solely rely on guest OS metrics (perfmon, top) or monitoring interfaces provided by his service provider. Although vCloud Director has monitoring dashboard it does not show any performance data – see my Org VDC Monitoring post.

OrgVDC Monitoring
OrgVDC Monitoring

What about those guest OS metrics? Any vSphere admin who went through VCP training knows that the guest OS metrics like CPU utilization are never to be trusted in virtual environment – the OS does now know how much of actual physical CPU time has been scheduled to its vCPUs and high CPU utilization could mean either that a demanding workload is running in the OS or that the VM is competing with other VMs in highly overallocated environment.

Should the VI/Cloud admin be concerned? It depends on the way the provider is oversubscribing (overselling) his compute resources. I have identified three schools of thought:

  1. ISP model: similarly how internet provider oversubscribes the line 1:20 the IaaS provider will sell you CPU/RAM allocation with certain percentage guaranteed (e.g. 10% for CPU). The consumer will know that during quiet times he might get 100% of requested resources, but during busy times he might get only his 10%. The consumer pays for the whole allocation.
  2. Telco model: the consumer commits to certain level of consumption and is charged extra for bursting above it. So again guaranteed percentage of resources is defined and known but the difference from the ISP model is that the consumer is charged flat rate for the guaranteed percentage plus the premium when he bursts above it.
  3. SLA model: the consumer pays for the whole allocation but does not know what resource oversubscription the provider is using. The provider must monitor the tenants to understand how much he can oversell the compute to get the highest ROI while keeping the SLAs.

All these three models are achieved by the same allocation model – Allocation Pool. Only the chargeback, amount of disclosed information and SLA differs.

It is obvious that in all three models we need performance monitoring for rightsizing the individual workloads and to correctly size the whole Org VDC. In ISP model we need to understand if we should buy more allocation because during the busy times our workloads suffer. In Telco model we need to avoid the expansive bursting and in SLA model to control the provider’s SLAs. On top of that it would be nice to be able to peek under the provider’s kimono to find out what is the overcommit level of his cloud.

By the way the need for performance monitoring still applies to Reservation Pool – where the tenant is in full control of OrgVDC overallocation and needs to understand if he went to far or not. In Pay-As-You-Go Pool it is again about understanding if my VMs are starving for CPU resources because of too aggressive oversubscription on provider’s side.

Guest SDK

One of the less known features of VMware Tools is the ability to use Guest SDK which provides read only API to for monitoring various virtual machine statistics. An example of such implementation are two additional Windows PerfMon libraries: VM Memory and VM Processor. They both contain number of counters showing very interesting information familiar to VI admin as they are exposed in vSphere client.

PerfMon VM Counters
Windows PerfMon VM Counters

A linux alternative (although not so powerful) is vmware-toolbox-cmd stat command.

We can find out what is the CPU or memory reservation, if memory ballooning is active (or even VM SWAP). We can also see what is the actual physical processor speed and what is the effective VM processor speed. This gives as quite interesting peek into the hypervisor. Btw the access to Guest SDK could be disabled by the provider via advanced VM .vmx configuration parameter (not a standard practice):

tools.guestlib.enableHostInfo = "FALSE"

In the second part I will describe how these metrics can be collected, monitored and analyzed. Stay tuned…