About two years ago I wrote a blog post describing how a service provider (with a cloud based on vCloud Director) can replace hardware without tenants noticing. The provider can stand up a new vSphere cluster on new hardware and migrate a whole Provider VDC to it while the virtual machines keep running. In my mind this is one of the features that distinguishes vCloud Director from other cloud management systems.
Recently I had an opportunity to test this in a large environment with vCloud Director 5.5, and I would like to report how much simpler it has become compared to vCloud Director 5.1, which the former post was based on. No API calls are needed any more and everything can be done from the vCloud Administrator GUI in the following five steps:
1. Prepare the new cluster and create a new provider VDC on it.
2. Merge it with the old provider VDC. Disable the old resource pool(s).
3. Migrate all VMs from the old resource pool to the new one in vCloud Director (not in vSphere!).
4. Redeploy all Edge Gateways running on the old hardware.
5. Detach the old resource pool.
While the first four steps are the same as in vCloud Director 5.1, step 5 is where the simplification lies. This last step takes care of both the migration of catalog templates and the deletion of shadow VMs.
As before, this is possible only with elastic VDCs (Pay-As-You-Go or elastic Allocation Pool).
This little-known fact has come up twice recently, so I will write a short post about it.
In the past, when you created an allocation pool Organization Virtual Datacenter (Org VDC) and allocated, for example, 20 GB of RAM to it, you could not actually deploy VMs with a total memory of 20 GB, because the virtualization memory overhead was charged against the allocation as well. This behavior forced service providers to add an extra 5%-15% to the Org VDC memory allocation. It was also very confusing for end users, who complained that their VMs could not power on.
With the elastic allocation pool VDC changes that came with vCloud Director 5.1 this is no longer an issue. The reason is that in a non-elastic VDC it is vSphere (through the Org VDC resource pool object with a limit set) that does the admission control, i.e. decides how many VMs can be deployed into it. In an elastic VDC it is vCloud Director that decides whether a VM can be deployed into a particular VDC.
So to sum it up: if you use an elastic allocation pool, the tenant can use it up to the last MB. The virtualization memory overhead of the VMs is charged to the provider, who must take it into account when doing capacity management.
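The difference can be illustrated with a small arithmetic sketch. It is a simplification, not how vSphere or vCloud Director actually implement admission control, and the per-VM overhead of 150 MB and the 4 GB VM size are hypothetical values picked only for the example (real overhead depends on the VM configuration and vSphere version).

```python
def vms_that_fit(allocation_mb, vm_memory_mb, overhead_mb, overhead_charged_to_tenant):
    """Return how many identical VMs fit into a memory allocation.

    If the overhead is charged to the tenant (old, non-elastic behavior),
    every VM consumes its configured memory plus the overhead.
    """
    per_vm_mb = vm_memory_mb + (overhead_mb if overhead_charged_to_tenant else 0)
    return allocation_mb // per_vm_mb


ALLOCATION_MB = 20 * 1024   # 20 GB Org VDC allocation (from the text)
VM_MEMORY_MB = 4 * 1024     # hypothetical VM size
OVERHEAD_MB = 150           # hypothetical per-VM virtualization overhead

non_elastic = vms_that_fit(ALLOCATION_MB, VM_MEMORY_MB, OVERHEAD_MB, True)
elastic = vms_that_fit(ALLOCATION_MB, VM_MEMORY_MB, OVERHEAD_MB, False)

# Overhead the provider has to plan for in the elastic case.
provider_overhead_mb = elastic * OVERHEAD_MB

print(f"non-elastic: {non_elastic} VMs, elastic: {elastic} VMs")
print(f"overhead carried by the provider: {provider_overhead_mb} MB")
```

With these example numbers the tenant loses one whole VM in the non-elastic case, while in the elastic case the overhead shows up only in the provider's capacity planning.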
This is the second part of the article. Read the first part here.
In the first part we established why it is important to monitor the resource utilization of workloads deployed in a public cloud and hinted at how we can get some hypervisor metrics. In this second part I want to present an example architecture that shows how this can be achieved.
The above diagram describes the whole setup. I am monitoring three workloads (VMs) deployed in a public cloud, in this case vCloud Hybrid Service. I have also deployed vFabric Hyperic into the same cloud; it collects the workload metrics through Hyperic agents. vCenter Operations Manager, which is installed on premise, then collects the metrics from the Hyperic database and displays them in custom dashboards while doing its own magic (super metrics, dynamic thresholding and analytics).
I have also created a custom Hyperic plugin which collects Windows VM performance metrics through the VMware Tools Guest SDK.
Deployment and Configuration
vFabric Hyperic Deployment
Create an Org VDC network in the public cloud for the Hyperic deployment.
Upload the vFabric Hyperic 5.7.1 Server installation and the vFabric Hyperic 5.7.1 DB installation in OVF format to a public cloud catalog.
Deploy Hyperic DB to the Org VDC network first and note the IP address it has been allocated.
Deploy Hyperic Server and enter the Hyperic DB IP address when asked.
Install the Hyperic Agent on all VMs that are going to be monitored.
vFabric Hyperic Configuration
Connect to the Hyperic GUI (http://<hyperic server IP>:7080). As I used a jumpbox in the public cloud, I did not need to open the port to the internet; otherwise create the appropriate DNAT and firewall rules to reach the server from outside.
Go to Administration > Plugin Manager, click the Add/Update Plugin(s) button and upload the vmPerfMon-plugin.xml custom plugin:
<server name="VM Performance Counters"
<!-- You always need availability metric, so just pick some service -->
<!-- Template filter is passed to metrics -->
<!-- Using object filter to reduce amount of xml -->
<filter name="object" value="VM Memory"/>
<metric name="Memory Reservation" alias="Memory Reservation in MB" units="MB"/>
<metric name="Memory Limit" alias="Memory Limit in MB" units="MB"/>
<metric name="Memory Shares" alias="Memory Shares"/>
<metric name="Memory Active" alias="Memory Active in MB" units="MB"/>
<metric name="Memory Ballooned" alias="Memory Ballooned in MB" units="MB"/>
<metric name="Memory Swapped" alias="Memory Swapped in MB" units="MB"/>
<!-- Win perf object is changed, using new one -->
<filter name="object" value="VM Processor"/>
<!-- Processor object needs instance information to access -->
<filter name="instance" value="_Total"/>
<!-- Giving new template since we now need instance info -->
<metric name="CPU Reservation in MHz" alias="Reservation in MHz"/>
<metric name="CPU Limit in MHz" alias="Limit in MHz"/>
<metric name="CPU Shares" alias="Shares"/>
<metric name="Effective VM Speed in MHz" alias="Effective VM Speed in MHz" indicator="true"/>
<metric name="Host processor speed in MHz" alias="Host processor speed in MHz"/>
The plugin should be automatically distributed through the agents to the workloads. However, we still need to configure it. In the Resources tab browse to the workload and in the Tools Menu select New Server. Type a name, find Windows VM Performance Counters in the Server Type drop-down and type something in the Install path field.
After clicking OK, immediately click Configuration Properties and check Auto-discover services.
To start collecting data we need to configure the collection interval for the metrics. Go to Monitor > Metric Data and click the Show All Metrics button. Select the metrics, enter the collection interval at the bottom and submit.
Now that we are collecting data we could create indicators, alerts, etc. However, Hyperic is just a collection engine for us. We will feed its data to vCenter Operations Manager.
vCenter Operations Manager Configuration
Assuming there is already an existing on-premise installation of vCenter Operations Manager, we need to configure it to collect data from the cloud Hyperic instance. To do this we first need to open the Edge Gateway firewall by creating a DNAT rule and a firewall rule for the Hyperic DB server (TCP port 5432).
Now we need to download the Hyperic vCOps Adapter from here.
To install it, go to the vC Ops admin interface and install it through the Update tab.
Then go to the vC Ops custom interface and in Admin > Support click the Describe icon.
Next we need to configure the adapter. Go to Environment > Configuration > Adapter Instances and add a new Hyperic instance for the public cloud. Then configure the (public) IP address of the Hyperic DB, the port and the credentials.
When all the configuration is finished we can create custom dashboards from the collected metrics in vC Ops. This is, however, out of scope of this article as it depends on the use case.
As an example, above I am showing the real CPU usage of one cloud VM as reported by the hypervisor. The host physical core speed is 2.2 GHz, however the effective VM speed was varying between 2.5 and 2.7 GHz (the max turbo frequency of the Intel E5-2660 CPU is 3 GHz). If I looked at the Task Manager inside the Windows guest OS, I would see just a meaningless 100% CPU utilization.
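To put the numbers from this example in relation, the following Python sketch compares the effective VM speed with the nominal host core speed. The function name and the interpretation in the comments are my own assumptions for illustration; only the MHz values come from the text.

```python
def effective_speed_ratio(effective_mhz, host_mhz):
    """Ratio of the effective VM speed to the nominal host core speed."""
    if host_mhz <= 0:
        raise ValueError("host speed must be positive")
    return effective_mhz / host_mhz


HOST_MHZ = 2200  # nominal host core speed from the example

for effective_mhz in (2500, 2700):
    ratio = effective_speed_ratio(effective_mhz, HOST_MHZ)
    # A ratio above 1 means the vCPU ran faster than the nominal clock (turbo).
    # A busy guest with a ratio well below 1 would hint at CPU contention.
    print(f"{effective_mhz} MHz effective = {ratio:.0%} of nominal host speed")
```

In this case both samples are above 100%, which suggests the busy VM was getting a full physical core with turbo boost, something the guest's own 100% utilization figure cannot tell us.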
A VI administrator is used to monitoring the performance of the workloads running in his environment through the rich vSphere client interface. The performance tab presents multiple graphs displaying metrics related to CPU, memory, network, datastore, disk, … This helps him when troubleshooting, rightsizing the workloads or doing capacity calculations.
So what happens when the VI admin becomes a Cloud admin and starts deploying workloads to a public cloud? No access to vCenter Server means he has to rely solely on guest OS metrics (perfmon, top) or on monitoring interfaces provided by his service provider. Although vCloud Director has a monitoring dashboard, it does not show any performance data (see my Org VDC Monitoring post).
What about those guest OS metrics? Any vSphere admin who went through VCP training knows that guest OS metrics like CPU utilization are never to be trusted in a virtual environment. The OS does not know how much actual physical CPU time has been scheduled to its vCPUs, so high CPU utilization could mean either that a demanding workload is running in the OS or that the VM is competing with other VMs in a highly overallocated environment.
Should the VI/Cloud admin be concerned? It depends on the way the provider is oversubscribing (overselling) his compute resources. I have identified three schools of thought:
ISP model: similarly to how an internet provider oversubscribes the line 1:20, the IaaS provider will sell you a CPU/RAM allocation with a certain percentage guaranteed (e.g. 10% for CPU). The consumer knows that during quiet times he might get 100% of the requested resources, but during busy times he might get only his 10%. The consumer pays for the whole allocation.
Telco model: the consumer commits to a certain level of consumption and is charged extra for bursting above it. So again the guaranteed percentage of resources is defined and known, but the difference from the ISP model is that the consumer is charged a flat rate for the guaranteed percentage plus a premium when he bursts above it.
SLA model: the consumer pays for the whole allocation but does not know what resource oversubscription the provider is using. The provider must monitor the tenants to understand how much he can oversell the compute to get the highest ROI while keeping the SLAs.
All three models are achieved with the same allocation model: Allocation Pool. Only the chargeback, the amount of disclosed information and the SLA differ.
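The charging difference between the ISP and Telco models can be sketched in a few lines of Python. All prices, the 10 GHz allocation and the function names are hypothetical and serve only to show the shape of the calculation for one billing interval; real providers define their own chargeback rules.

```python
def guaranteed(allocation_ghz, guarantee_pct):
    """Guaranteed part of an Allocation Pool allocation."""
    return allocation_ghz * guarantee_pct / 100


def isp_charge(allocation_ghz, price_per_ghz):
    """ISP model: the consumer pays for the whole allocation."""
    return allocation_ghz * price_per_ghz


def telco_charge(allocation_ghz, guarantee_pct, used_ghz, flat_price, burst_price):
    """Telco model: flat rate for the guaranteed part plus a premium for bursting."""
    base_ghz = guaranteed(allocation_ghz, guarantee_pct)
    burst_ghz = max(0.0, used_ghz - base_ghz)
    return base_ghz * flat_price + burst_ghz * burst_price


# Hypothetical example: 10 GHz allocation with 10% guaranteed.
print(guaranteed(10, 10))              # guaranteed GHz
print(isp_charge(10, 5))               # whole allocation charged
print(telco_charge(10, 10, 3, 5, 8))   # 1 GHz flat + 2 GHz burst premium
```

The same Allocation Pool settings (allocation and guarantee percentage) feed both calculations; only the chargeback formula changes.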
It is obvious that in all three models we need performance monitoring to rightsize the individual workloads and to correctly size the whole Org VDC. In the ISP model we need to understand whether we should buy more allocation because our workloads suffer during busy times. In the Telco model we need to avoid the expensive bursting, and in the SLA model we need to verify the provider’s SLAs. On top of that it would be nice to be able to peek under the provider’s kimono to find out what the overcommit level of his cloud is.
By the way, the need for performance monitoring still applies to the Reservation Pool, where the tenant is in full control of the Org VDC overallocation and needs to understand whether he went too far or not. In the Pay-As-You-Go Pool it is again about understanding whether my VMs are starving for CPU resources because of too aggressive oversubscription on the provider’s side.
One of the lesser-known features of VMware Tools is the Guest SDK, which provides a read-only API for monitoring various virtual machine statistics. An example of such an implementation are two additional Windows PerfMon objects: VM Memory and VM Processor. They both contain a number of counters showing very interesting information that is familiar to a VI admin, as the same values are exposed in the vSphere client.
A Linux alternative (although not as powerful) is the vmware-toolbox-cmd stat command.
We can find out what the CPU or memory reservation is and whether memory ballooning (or even VM swapping) is active. We can also see the actual physical processor speed and the effective VM processor speed. This gives us quite an interesting peek into the hypervisor. By the way, access to the Guest SDK can be disabled by the provider via an advanced VM .vmx configuration parameter (not a standard practice):
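As a rough illustration, the Linux command can be wrapped in a few lines of Python. The sketch assumes that VMware Tools is installed in the guest and that each stat subcommand prints a single value followed by a unit (for example a number and MHz or MB); the exact output format may differ between Tools versions, so treat the parser as an assumption.

```python
import re
import subprocess

# Subcommands of "vmware-toolbox-cmd stat" that return hypervisor-side values.
STATS = ("speed", "cpures", "cpulimit", "memres", "memlimit", "balloon", "swap")


def parse_stat(output):
    """Parse output of the assumed form '<number> <unit>' into (value, unit)."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)\s*(\S*)", output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return float(match.group(1)), match.group(2)


def read_stat(name):
    """Run 'vmware-toolbox-cmd stat <name>' and return the parsed value."""
    result = subprocess.run(
        ["vmware-toolbox-cmd", "stat", name],
        capture_output=True, text=True, check=True,
    )
    return parse_stat(result.stdout)


def collect_stats():
    """Collect all stats listed in STATS into a dictionary."""
    return {name: read_stat(name) for name in STATS}
```

Calling collect_stats() inside a Linux guest would return a dictionary keyed by subcommand name, which can then be fed into whatever monitoring pipeline is in use.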
tools.guestlib.enableHostInfo = "FALSE"
In the second part I will describe how these metrics can be collected, monitored and analyzed. Stay tuned…
I got a question from a customer about the meaning of the various datastore metrics in vCloud Director: used, provisioned and requested storage. These can be viewed on the Datastores & Datastore Clusters screen in the Manage & Monitor tab.
There are even more metrics when we open properties of a single datastore or a datastore cluster.
So let us go through them:
Total Capacity: Total size of the datastore or datastore cluster as reported by vCenter Server
Used Capacity: Actual used capacity of the datastore or the datastore cluster as reported by vCenter Server.
Available Capacity: Difference between the values above (Available Capacity = Total Capacity – Used Capacity)
I am including a screenshot from vCenter Server of the same datastore:
Provisioned Capacity: Total storage provisioned for virtual machines residing on the datastore. This number might be much bigger than the actual datastore capacity if thin provisioning is used. Again, this number is reported by vCenter Server (it can be seen in the Datastore and Datastore Clusters view on the Summary tab). I am again including the relevant vCenter Server screenshot.
Requested Storage: This is the only metric coming directly from vCloud Director. It adds up storage for all vCloud Director provisioned virtual machines, catalog templates and media and vCNS (vShield) Edges. It uses allocated storage and also includes (even potential) memory swap for virtual machines (not for catalog VMs). It does not include storage occupied by shadow VMs or intermediate disks in a linked clone tree.
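The relationships between these metrics can be summarized in a short Python sketch. The class, its field names and the sample numbers are hypothetical; only the Available Capacity formula comes from the definitions above, and the provisioned ratio is an extra derived value I find useful for spotting thin-provisioning overcommit.

```python
from dataclasses import dataclass


@dataclass
class DatastoreStats:
    total_gb: float        # Total Capacity (from vCenter Server)
    used_gb: float         # Used Capacity (from vCenter Server)
    provisioned_gb: float  # Provisioned Capacity (from vCenter Server)
    requested_gb: float    # Requested Storage (from vCloud Director)

    @property
    def available_gb(self):
        """Available Capacity = Total Capacity - Used Capacity."""
        return self.total_gb - self.used_gb

    @property
    def provisioned_ratio(self):
        """Values above 1 mean more storage is provisioned than exists."""
        return self.provisioned_gb / self.total_gb


# Hypothetical thin-provisioned datastore.
ds = DatastoreStats(total_gb=1000, used_gb=400, provisioned_gb=1500, requested_gb=900)
print(ds.available_gb)       # free space as vCenter Server sees it
print(ds.provisioned_ratio)  # overcommit through thin provisioning
```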