Performance Monitoring in Public Cloud (vCHS) – Part II

This is the second part of the article. Read the first part here.

In the first part we established why it is important to monitor the resource utilization of workloads deployed in a public cloud, and hinted at how some hypervisor metrics can be obtained. In this second part I want to present an example architecture showing how this can be achieved.

Architecture

Hyperic and vCenter Operations monitoring vCHS

The above diagram describes the whole setup. I am monitoring three workloads (VMs) deployed in a public cloud – in this case vCloud Hybrid Service. I have also deployed vFabric Hyperic into the same cloud, where it collects the workload metrics through Hyperic agents. vCenter Operations Manager, which is installed on premises, then collects the metrics from the Hyperic database and displays them in custom dashboards while doing its own magic (super metrics, dynamic thresholding and analytics).

I have also created a custom Hyperic plugin that collects Windows VM performance metrics through the VMware Tools Guest SDK.

Deployment and Configuration

vFabric Hyperic Deployment

  1. Create an Org VDC network in the public cloud for the Hyperic deployment.
  2. Upload the vFabric Hyperic 5.7.1 Server installation and the vFabric Hyperic 5.7.1 DB installation in OVF format to a public cloud catalog.
  3. Deploy Hyperic DB to the Org VDC network first and note the IP address it has been allocated.
  4. Deploy Hyperic Server and enter the Hyperic DB IP address when asked.
  5. Install the Hyperic Agent on all VMs that are going to be monitored.

vFabric Hyperic Configuration

  1. Connect to the Hyperic GUI (http://<hyperic server IP>:7080). As I used a jumpbox in the public cloud, I did not need to open the port to the internet; otherwise create the appropriate DNAT and firewall rules to reach the server from outside.
  2. Go to Administration > Plugin Manager, click the Add/Update Plugin(s) button and upload the vmPerfMon-plugin.xml custom plugin:
    <plugin>
    
     <server name="VM Performance Counters"
     platforms="Win32">
    
     <plugin type="measurement"
     class="org.hyperic.hq.product.MeasurementPlugin"/>
    
     <!-- You always need availability metric, so just pick some service -->
     <metric name="Availability"
     template="win32:Service=Eventlog:Availability"
     indicator="true"/>
    
     <!-- Template filter is passed to metrics -->
     <filter name="template"
     value="win32:Object=${object}:${alias}"/>
    
     <!-- Using object filter to reduce amount of xml -->
     <filter name="object" value="VM Memory"/>
    
     <metric name="Memory Reservation" alias="Memory Reservation in MB" units="MB"/>
     <metric name="Memory Limit" alias="Memory Limit in MB" units="MB"/>
     <metric name="Memory Shares" alias="Memory Shares"/>
     <metric name="Memory Active" alias="Memory Active in MB" units="MB"/>
     <metric name="Memory Ballooned" alias="Memory Ballooned in MB" units="MB"/>
     <metric name="Memory Swapped" alias="Memory Swapped in MB" units="MB"/>
    
     <!-- Win perf object is changed, using new one -->
     <filter name="object" value="VM Processor"/>
    
     <!-- Processor object needs instance information to access -->
     <filter name="instance" value="_Total"/>
    
     <!-- Giving new template since we now need instance info -->
     <filter name="template"
     value="win32:Object=${object},Instance=${instance}:${alias}"/>
     <metric name="CPU Reservation in MHz" alias="Reservation in MHz"/>
     <metric name="CPU Limit in MHz" alias="Limit in MHz"/>
     <metric name="CPU Shares" alias="Shares"/>
     <metric name="Effective VM Speed in MHz" alias="Effective VM Speed in MHz" indicator="true"/>
     <metric name="Host processor speed in MHz" alias="Host processor speed in MHz"/>
    
     </server>
    </plugin>
    
  3. The plugin should be distributed automatically through the agents to the workloads. However, we still need to configure it. In the Resources tab browse to the workload and in the Tools Menu select New Server. Type a name, find Windows VM Performance Counters in the Server Type drop-down and type something in the Install path field.
     Add Server
  4. After clicking OK, immediately click Configuration Properties and check Auto-discover services.
  5. To start collecting data we need to configure the collection interval for the metrics. Go to Monitor > Metric Data and click the Show All Metrics button. Select the metrics, enter the collection interval at the bottom and submit.
     Collection Interval
  6. Now that we are collecting data we could create indicators, alerts, etc. However, Hyperic is just a collection engine for us. We will feed its data to vCenter Operations Manager.
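
The filter and template lines in the plugin descriptor from step 2 are easier to follow once the placeholders are expanded by hand. The sketch below only mimics that substitution with Python's string.Template, which happens to use the same ${name} syntax; it is not Hyperic's own code, and expand_template is a hypothetical helper name. The object, instance and alias values are copied from the descriptor.

```python
from string import Template

# Template strings copied from the plugin descriptor.
MEMORY_TEMPLATE = "win32:Object=${object}:${alias}"
PROCESSOR_TEMPLATE = "win32:Object=${object},Instance=${instance}:${alias}"


def expand_template(template: str, **filters: str) -> str:
    """Replace ${name} placeholders with filter values (illustrative helper only)."""
    return Template(template).substitute(**filters)


# A memory counter needs only the perf object and the counter alias.
memory_metric = expand_template(
    MEMORY_TEMPLATE, object="VM Memory", alias="Memory Active in MB"
)

# A processor counter additionally needs the instance filter.
processor_metric = expand_template(
    PROCESSOR_TEMPLATE,
    object="VM Processor",
    instance="_Total",
    alias="Effective VM Speed in MHz",
)

print(memory_metric)
print(processor_metric)
```

This also shows why the descriptor redefines the template filter before the processor metrics: the first template has no place for the instance name.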

vCenter Operations Manager Configuration

  1. Assuming there is already an on-premises installation of vCenter Operations Manager, we need to configure it to collect data from the cloud Hyperic instance. To do this, we first need to create a DNAT rule and a firewall rule on the Edge Gateway for the Hyperic DB server (TCP port 5432).
  2. Now we need to download the Hyperic vCOps Adapter from here.
  3. To install it, go to the vC Ops admin interface and install it through the Update tab.
  4. Then go to the vC Ops custom interface and in Admin > Support click the Describe icon.
  5. Next we need to configure the adapter. Go to Environment > Configuration > Adapter Instances and add a new Hyperic instance for the public cloud. Then configure the (public) IP address of the Hyperic DB, the port and the credentials.
     Hyperic Adapter Instance
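
Before configuring the adapter instance it can save time to confirm that the DNAT and firewall rules from step 1 really expose TCP port 5432. The following is a minimal sketch of such a check run from the on-premises side; port_is_reachable is a hypothetical helper name, and the check only proves that a TCP connection opens, not that the database credentials work.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling port_is_reachable("<public IP of Hyperic DB>", 5432) should return True once the Edge Gateway rules are in place.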

Dashboards

When finished with all the configuration we can create custom dashboards from the collected metrics in vC Ops. This is, however, out of the scope of this article, as it depends on the use case.

vC Ops

As an example, above I am showing the real CPU usage of one cloud VM as reported by the hypervisor. The host physical core speed is 2.2 GHz, but the effective VM speed varied between 2.5 and 2.7 GHz (the max turbo frequency of the Intel E5-2660 CPU is 3 GHz). If I looked at the Task Manager inside the Windows guest OS, I would see just a meaningless 100% CPU utilization.
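
To make the numbers above easier to compare, the effective VM speed can be expressed as a fraction of the host's nominal core speed. This is only a rough sketch: speed_ratio is a hypothetical helper, the three samples are illustrative points inside the 2.5 to 2.7 GHz range quoted above, and reading a ratio above 1 as turbo boost is my interpretation rather than something the counters state.

```python
def speed_ratio(effective_mhz: float, host_mhz: float) -> float:
    """Effective VM speed as a fraction of the host's nominal core speed."""
    if host_mhz <= 0:
        raise ValueError("host_mhz must be positive")
    return effective_mhz / host_mhz


HOST_MHZ = 2200.0                    # "Host processor speed in MHz" counter
samples = [2500.0, 2600.0, 2700.0]   # illustrative "Effective VM Speed in MHz" values

ratios = [round(speed_ratio(s, HOST_MHZ), 2) for s in samples]
print(ratios)  # [1.14, 1.18, 1.23]
```

The same ratio could be built as a super metric in vC Ops; a value that stays well below 1 while the guest reports 100% CPU would suggest the VM is not getting the physical CPU time it asks for.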

Performance Monitoring in Public Cloud (vCHS) – Part I

A VI administrator is used to monitoring the performance of the workloads running in his environment through the rich vSphere Client interface. The Performance tab presents multiple graphs displaying metrics related to CPU, memory, network, datastore, disk, … This helps him when troubleshooting, rightsizing workloads or doing capacity calculations.

So what happens when the VI admin becomes a cloud admin and starts deploying workloads to a public cloud? No access to vCenter Server means he has to rely solely on guest OS metrics (perfmon, top) or on monitoring interfaces provided by his service provider. Although vCloud Director has a monitoring dashboard, it does not show any performance data – see my Org VDC Monitoring post.

OrgVDC Monitoring

What about those guest OS metrics? Any vSphere admin who went through VCP training knows that guest OS metrics like CPU utilization are never to be trusted in a virtual environment – the OS does not know how much actual physical CPU time has been scheduled to its vCPUs, so high CPU utilization could mean either that a demanding workload is running in the OS or that the VM is competing with other VMs in a highly overallocated environment.

Should the VI/Cloud admin be concerned? It depends on the way the provider is oversubscribing (overselling) his compute resources. I have identified three schools of thought:

  1. ISP model: similarly to how an internet provider oversubscribes a line 1:20, the IaaS provider will sell you a CPU/RAM allocation with a certain percentage guaranteed (e.g. 10% for CPU). The consumer knows that during quiet times he might get 100% of the requested resources, but during busy times he might get only his 10%. The consumer pays for the whole allocation.
  2. Telco model: the consumer commits to a certain level of consumption and is charged extra for bursting above it. So again a guaranteed percentage of resources is defined and known, but the difference from the ISP model is that the consumer is charged a flat rate for the guaranteed percentage plus a premium when he bursts above it.
  3. SLA model: the consumer pays for the whole allocation but does not know what resource oversubscription the provider is using. The provider must monitor the tenants to understand how much he can oversell the compute to get the highest ROI while keeping the SLAs.

All three models are implemented with the same allocation model – Allocation Pool. Only the chargeback, the amount of disclosed information and the SLA differ.
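
The difference between the ISP and Telco models is easiest to see with a little arithmetic. The sketch below is purely illustrative: the function names, the 10 GHz allocation, the rates and the usage figure are all made-up assumptions, not any provider's real pricing. Only the 10% guarantee comes from the example above.

```python
def guaranteed_mhz(allocation_mhz: float, guaranteed_pct: float) -> float:
    """CPU capacity the tenant can count on during busy times (ISP model)."""
    return allocation_mhz * guaranteed_pct / 100


def telco_charge(used_mhz: float, committed_mhz: float,
                 flat_rate: float, burst_rate_per_mhz: float) -> float:
    """Flat rate for the committed level plus a premium for bursting (Telco model)."""
    burst_mhz = max(0.0, used_mhz - committed_mhz)
    return flat_rate + burst_mhz * burst_rate_per_mhz


# Hypothetical tenant: 10 GHz allocation with 10% guaranteed.
floor_mhz = guaranteed_mhz(10000, 10)
# Hypothetical Telco bill: 1500 MHz used against the committed 1000 MHz.
bill = telco_charge(1500, floor_mhz, flat_rate=100, burst_rate_per_mhz=0.25)

print(floor_mhz)  # 1000.0
print(bill)       # 225.0
```

In the ISP model the bill would stay the same whatever the usage, which is why the monitoring question there is whether the guaranteed floor is enough, while in the Telco model it is how often the workloads burst above it.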

It is obvious that in all three models we need performance monitoring to rightsize the individual workloads and to correctly size the whole Org VDC. In the ISP model we need to understand whether we should buy more allocation because our workloads suffer during busy times. In the Telco model we need to avoid expensive bursting, and in the SLA model we need to check that the provider keeps his SLAs. On top of that, it would be nice to be able to peek under the provider’s kimono to find out the overcommit level of his cloud.

By the way, the need for performance monitoring also applies to Reservation Pool, where the tenant is in full control of the Org VDC overallocation and needs to understand whether he went too far or not. In Pay-As-You-Go Pool it is again about understanding whether my VMs are starving for CPU resources because of too aggressive oversubscription on the provider’s side.

Guest SDK

One of the lesser-known features of VMware Tools is the Guest SDK, which provides a read-only API for monitoring various virtual machine statistics. An example of such an implementation is a pair of additional Windows PerfMon objects: VM Memory and VM Processor. They both contain a number of counters showing very interesting information that is familiar to a VI admin, as the same values are exposed in the vSphere Client.

Windows PerfMon VM Counters

A Linux alternative (although not as powerful) is the vmware-toolbox-cmd stat command.

We can find out what the CPU or memory reservation is, and whether memory ballooning (or even VM swapping) is active. We can also see the actual physical processor speed and the effective VM processor speed. This gives us quite an interesting peek into the hypervisor. By the way, access to the Guest SDK can be disabled by the provider via an advanced VM .vmx configuration parameter (not a standard practice):

tools.guestlib.enableHostInfo = "FALSE"
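
On a Linux guest, the vmware-toolbox-cmd stat command mentioned above can be polled from a small script. The sketch below is a rough illustration: the helper names are my own, and it assumes each subcommand prints a number followed by a unit (for example "2200 MHz"), which may differ between VMware Tools versions. The runner is injectable so the parsing can be tried without VMware Tools installed.

```python
import subprocess
from typing import Callable

# Subcommands of `vmware-toolbox-cmd stat` that report a numeric value.
STAT_SUBCOMMANDS = ("speed", "cpures", "cpulimit", "memres", "memlimit", "balloon", "swap")


def run_stat(subcommand: str) -> str:
    """Run `vmware-toolbox-cmd stat <subcommand>` and return its raw output."""
    result = subprocess.run(
        ["vmware-toolbox-cmd", "stat", subcommand],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_value(raw: str) -> int:
    """Extract the leading integer from output such as '2200 MHz' (format assumed)."""
    return int(raw.split()[0])


def collect_stats(runner: Callable[[str], str] = run_stat) -> dict:
    """Collect all numeric stats into a dict keyed by subcommand name."""
    return {name: parse_value(runner(name)) for name in STAT_SUBCOMMANDS}
```

Calling collect_stats() inside the guest returns a dictionary such as {"speed": ..., "balloon": ..., ...} that can be logged or shipped to a monitoring system. If the provider has disabled host info as shown above, the command is likely to fail or return no useful data.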

In the second part I will describe how these metrics can be collected, monitored and analyzed. Stay tuned…