I had a question whether there is a way to download the vCloud Director VMware Remote Console plug-in without the need to actually log in to vCloud Director – for example, in order to prepare desktop images for your vCloud users.
Yes, it is possible and here are the links for vCloud Director 5.5:
This is the second part of the article. Read the first part here.
In the first part we established why it is important to monitor resource utilization of workloads deployed in the public cloud and hinted at how we can get some hypervisor metrics. In this second part I want to present an example architecture showing how it can be achieved.
The diagram above describes the whole setup. I am monitoring three workloads (VMs) deployed in a public cloud – in this case vCloud Hybrid Service. I have also deployed vFabric Hyperic into the same cloud; it collects the workload metrics through Hyperic agents. vCenter Operations Manager, which is installed on premises, then collects the metrics from the Hyperic database and displays them in custom dashboards while doing its own magic (super metrics, dynamic thresholding and analytics).
I have also created a custom Hyperic plugin which collects Windows VM performance metrics through the VMware Tools Guest SDK.
Deployment and Configuration
vFabric Hyperic Deployment
Create Org VDC network in public cloud for Hyperic Deployment.
Upload vFabric Hyperic 5.7.1 Server installation and vFabric Hyperic 5.7.1 DB installation in OVF format to a public cloud catalog.
Deploy the Hyperic DB to the Org VDC network first and note the IP address it has been allocated.
Deploy Hyperic Server and enter the Hyperic DB IP address when asked.
Install Hyperic Agent to all VMs that are going to be monitored.
vFabric Hyperic Configuration
Connect to the Hyperic GUI (http://<hyperic server IP>:7080). As I used a jumpbox in the public cloud I did not need to open the port to the internet; otherwise create the appropriate DNAT and firewall rules to reach the server from outside.
Go to Administration > Plugin Manager, click the Add/Update Plugin(s) button and upload the vmPerfMon-plugin.xml custom plugin:
<plugin>
    <server name="VM Performance Counters">
        <!-- You always need availability metric, so just pick some service -->
        <!-- Template filter is passed to metrics -->
        <!-- Using object filter to reduce amount of xml -->
        <filter name="object" value="VM Memory"/>
        <metric name="Memory Reservation" alias="Memory Reservation in MB" units="MB"/>
        <metric name="Memory Limit" alias="Memory Limit in MB" units="MB"/>
        <metric name="Memory Shares" alias="Memory Shares"/>
        <metric name="Memory Active" alias="Memory Active in MB" units="MB"/>
        <metric name="Memory Ballooned" alias="Memory Ballooned in MB" units="MB"/>
        <metric name="Memory Swapped" alias="Memory Swapped in MB" units="MB"/>
        <!-- Win perf object is changed, using new one -->
        <filter name="object" value="VM Processor"/>
        <!-- Processor object needs instance information to access -->
        <filter name="instance" value="_Total"/>
        <!-- Giving new template since we now need instance info -->
        <metric name="CPU Reservation in MHz" alias="Reservation in MHz"/>
        <metric name="CPU Limit in MHz" alias="Limit in MHz"/>
        <metric name="CPU Shares" alias="Shares"/>
        <metric name="Effective VM Speed in MHz" alias="Effective VM Speed in MHz" indicator="true"/>
        <metric name="Host processor speed in MHz" alias="Host processor speed in MHz"/>
    </server>
</plugin>
The plugin should be automatically distributed through the agents to the workloads; however, we need to configure it. In the Resources tab browse to the workload and in the Tools Menu select New Server. Type a name, find Windows VM Performance Counters in the Server Type drop-down, and type something in the Install path field.
After clicking OK immediately click on Configuration Properties and check Auto-discover services.
To start collecting data we need to configure collection interval for the metrics. Go to Monitor > Metric Data and click Show All Metrics button. Select the metrics and at the bottom input collection interval and submit.
Now that we are collecting data we could create indicators, alerts, etc. However, Hyperic is just a collection engine for us – we will feed its data to vCenter Operations Manager.
vCenter Operations Manager Configuration
Assuming there is already an existing on-premises installation of vCenter Operations Manager, we need to configure it to collect data from the cloud Hyperic instance. To do this we first need to open the Edge Gateway firewall and create a DNAT and firewall rule to the Hyperic DB server (TCP port 5432).
Now we need to download Hyperic vCOps Adapter from here.
To install it, go to the vC Ops admin interface and install it through the Update tab.
Then go to the vC Ops custom interface and in Admin > Support click the Describe icon.
Next we need to configure the adapter. Go to Environment > Configuration > Adapter Instances and add a new Hyperic instance for the public cloud. Then configure the (public) IP address of the Hyperic DB, the port and the credentials.
When finished with all configurations we can create custom dashboards from the collected metrics in vC Ops. This is, however, out of the scope of this article, as it depends on the use case.
As an example, above I am showing the real CPU usage of one cloud VM as reported by the hypervisor. The host physical core speed is 2.2 GHz; however, the effective VM speed varied between 2.5 and 2.7 GHz (the max turbo frequency of an Intel E5-2660 CPU is 3 GHz). If I looked at the internal Windows guest OS Task Manager metric I would see just a meaningless 100% CPU utilization.
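To make this concrete, here is a minimal sketch of why a guest-reported utilization percentage only becomes meaningful once it is scaled by the effective VM speed counter. The function name and the numbers are mine, purely for illustration:

```python
# Sketch: why 100% guest CPU utilization is ambiguous (hypothetical numbers).
# The guest only sees the cycles it was actually scheduled, so its "busy %"
# must be multiplied by the effective VM speed to estimate real consumption.

def real_cpu_usage_mhz(guest_util_pct: float, effective_vm_speed_mhz: float) -> float:
    """Approximate physical MHz consumed by the VM."""
    return guest_util_pct / 100.0 * effective_vm_speed_mhz

# The same 100% in Task Manager can mean very different real consumption:
print(real_cpu_usage_mhz(100, 2700))  # turbo boost active -> 2700.0 MHz
print(real_cpu_usage_mhz(100, 1100))  # heavy host contention -> 1100.0 MHz
```

The guest metric alone cannot distinguish the two cases; only the hypervisor-side counter resolves the ambiguity.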
A VI administrator is used to monitoring the performance of the workloads running in his environment through the rich vSphere Client interface. The Performance tab presents multiple graphs displaying metrics related to CPU, memory, network, datastore, disk, … This helps him when troubleshooting, rightsizing the workloads or doing capacity calculations.
So what happens when the VI admin becomes a Cloud admin and starts deploying workloads to a public cloud? No access to vCenter Server means he has to rely solely on guest OS metrics (perfmon, top) or the monitoring interfaces provided by his service provider. Although vCloud Director has a monitoring dashboard, it does not show any performance data – see my Org VDC Monitoring post.
What about those guest OS metrics? Any vSphere admin who went through VCP training knows that guest OS metrics like CPU utilization are never to be trusted in a virtual environment – the OS does not know how much actual physical CPU time has been scheduled to its vCPUs, and high CPU utilization could mean either that a demanding workload is running in the OS or that the VM is competing with other VMs in a highly overallocated environment.
Should the VI/Cloud admin be concerned? It depends on the way the provider is oversubscribing (overselling) his compute resources. I have identified three schools of thought:
ISP model: similarly to how an internet provider oversubscribes the line 1:20, the IaaS provider will sell you a CPU/RAM allocation with a certain percentage guaranteed (e.g. 10% for CPU). The consumer knows that during quiet times he might get 100% of the requested resources, but during busy times he might get only his 10%. The consumer pays for the whole allocation.
Telco model: the consumer commits to a certain level of consumption and is charged extra for bursting above it. So again a guaranteed percentage of resources is defined and known, but the difference from the ISP model is that the consumer is charged a flat rate for the guaranteed percentage plus a premium when he bursts above it.
SLA model: the consumer pays for the whole allocation but does not know what resource oversubscription the provider is using. The provider must monitor the tenants to understand how much he can oversell the compute to get the highest ROI while keeping the SLAs.
All three models are achieved with the same allocation model – Allocation Pool. Only the chargeback, the amount of disclosed information and the SLA differ.
It is obvious that in all three models we need performance monitoring for rightsizing the individual workloads and correctly sizing the whole Org VDC. In the ISP model we need to understand whether we should buy more allocation because our workloads suffer during busy times. In the Telco model we need to avoid expensive bursting, and in the SLA model to verify the provider's SLAs. On top of that it would be nice to be able to peek under the provider's kimono and find out the overcommit level of his cloud.
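The chargeback difference between the first two models can be sketched numerically. All prices and the 10% guarantee below are made-up illustration values, not anything a real provider publishes:

```python
# Hypothetical comparison of the ISP and Telco charging models.
# Prices and the 10% CPU guarantee are invented for illustration only.

GUARANTEE = 0.10          # 10% of the allocation is guaranteed
PRICE_PER_GHZ = 50.0      # ISP model: flat price per allocated GHz / month
PRICE_GUARANTEED = 10.0   # Telco model: price per guaranteed GHz / month
PRICE_BURST = 80.0        # Telco model: premium per GHz consumed above guarantee

def isp_cost(allocation_ghz: float) -> float:
    # The consumer pays for the whole allocation regardless of usage.
    return allocation_ghz * PRICE_PER_GHZ

def telco_cost(allocation_ghz: float, avg_usage_ghz: float) -> float:
    # Flat rate for the guaranteed part plus premium for bursting above it.
    guaranteed = allocation_ghz * GUARANTEE
    burst = max(0.0, avg_usage_ghz - guaranteed)
    return guaranteed * PRICE_GUARANTEED + burst * PRICE_BURST

print(isp_cost(20))          # 1000.0 - pay for all 20 GHz
print(telco_cost(20, 1.5))   # 20.0   - usage stays under the 2 GHz guarantee
print(telco_cost(20, 5.0))   # 260.0  - 3 GHz of bursting charged at premium
```

Without performance monitoring the Telco-model consumer has no way to see that the 3 GHz of bursting is what drives the bill.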
By the way, the need for performance monitoring still applies to the Reservation Pool – where the tenant is in full control of the Org VDC overallocation and needs to understand whether he went too far or not. In the Pay-As-You-Go Pool it is again about understanding whether my VMs are starving for CPU resources because of too aggressive oversubscription on the provider's side.
One of the less known features of VMware Tools is the ability to use the Guest SDK, which provides a read-only API for monitoring various virtual machine statistics. An example of such an implementation are two additional Windows PerfMon libraries: VM Memory and VM Processor. They both contain a number of counters showing very interesting information familiar to the VI admin, as they are exposed in the vSphere Client.
A Linux alternative (although not as powerful) is the vmware-toolbox-cmd stat command.
We can find out what the CPU or memory reservation is and whether memory ballooning (or even VM swapping) is active. We can also see the actual physical processor speed and the effective VM processor speed. This gives us quite an interesting peek into the hypervisor. By the way, access to the Guest SDK can be disabled by the provider via an advanced VM .vmx configuration parameter (not a standard practice):
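The Linux-side stats can be collected with a small wrapper around the tool. This is a sketch: the stat subcommand names and the "value unit" output format below reflect typical vmware-toolbox-cmd behavior but may vary between VMware Tools versions:

```python
import subprocess

# Hedged sketch: reading a few hypervisor-side stats from inside a Linux
# guest via `vmware-toolbox-cmd stat <name>`. Subcommand names and the
# output format ("2200 MHz", "0 MB") are assumptions based on typical
# VMware Tools output and may differ between Tools versions.

STATS = ["speed", "cpures", "cpulimit", "memres", "memlimit", "balloon", "swap"]

def parse_stat(text: str) -> tuple[float, str]:
    """Split a line like '2200 MHz' into (2200.0, 'MHz')."""
    value, _, unit = text.strip().partition(" ")
    return float(value), unit

def query_stat(name: str) -> tuple[float, str]:
    out = subprocess.check_output(["vmware-toolbox-cmd", "stat", name], text=True)
    return parse_stat(out)

if __name__ == "__main__":
    for name in STATS:
        try:
            value, unit = query_stat(name)
            print(f"{name}: {value} {unit}")
        except (OSError, subprocess.CalledProcessError):
            # Tool missing or stat unsupported (e.g. not running under VMware)
            print(f"{name}: not available")
```

Run inside a VMware guest with Tools installed; outside one the wrapper simply reports each stat as unavailable.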
tools.guestlib.enableHostInfo = "FALSE"
In the second part I will describe how these metrics can be collected, monitored and analyzed. Stay tuned…
I had a question from customer how to interpret the numbers reported in Organization VDC monitoring tab in vCloud Director. It is in Administration > Virtual Datacenter menu. If you are using vCloud Hybrid Service it is also displayed in the dashboard on homepage for each Virtual Datacenter (in the lower part of the screen).
It is not very intuitive, and we have KB article 203003 explaining that the displayed usage numbers (blue bars) do not actually correspond to real-time resource usage but instead represent how many resources have been allocated by deployed VMs in the Org VDC. These values are updated every 5 minutes.
The next question would be why is CPU usage always 0 GHz in the pictures above?
I have already written a couple of articles (here and here) about the new elastic Allocation Pool Org VDC in vCloud Director 5.1.x. Go back and read them to understand the reasoning behind the following formulas.
Elastic Allocation Pool Org VDC
CPU Used = Org VDC vCPU speed (in GHz) x # of vCPUs of deployed VMs in Org VDC
CPU Allocated = Org VDC CPU allocation (in GHz)
RAM Used = Total memory of VMs deployed in Org VDC (in GB)
RAM Allocated = Org VDC memory allocation (in GB)
If one VM with one vCPU of nominal value 1 GHz is deployed, it might use anywhere between 0 GHz and 3 GHz (the max physical core speed), but the CPU Used value in the graph will always be 1 GHz. Sounds confusing? Remember that the vCPU speed value is not a VM limit in elastic Allocation Pool Org VDCs, contrary to Pay-As-You-Go Org VDCs.
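The elastic "CPU Used" formula above can be sketched as a one-liner; the function name is mine, the math is the formula from the text:

```python
# Sketch of the elastic Allocation Pool "CPU Used" calculation: the value
# shown is purely nominal (vCPU speed times deployed vCPU count) and says
# nothing about what the VMs really consume on the hosts.

def elastic_cpu_used_ghz(vcpu_speed_ghz: float, deployed_vcpus: int) -> float:
    return vcpu_speed_ghz * deployed_vcpus

# One deployed VM with a single 1 GHz vCPU always shows 1 GHz "used",
# even while the hypervisor lets it burst to 3 GHz:
print(elastic_cpu_used_ghz(1.0, 1))  # 1.0
```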
Non-elastic Allocation Pool Org VDC
CPU Used = Always 0, as all VMs in the Org VDC share its CPU allocation
CPU Allocated = Org VDC CPU allocation (in GHz)
RAM Used = Total memory of VMs deployed in Org VDC (in GB)
RAM Allocated = Org VDC memory allocation (in GB)
The storage numbers are always calculated the same way for all Org VDC types:
Storage Used = Total allocated disk space of VMs deployed in Org VDC + Total memory of VMs deployed in Org VDC (to account for VM swap usage) + Maximum possible existing snapshot size
Storage allocated = Org VDC storage allocation
So if I have one VM with 1 GB RAM and a 10 GB disk and create a snapshot (including memory), the storage usage will be: 10 GB (disk) + 1 GB (VM swap) + 10 GB (maximum disk snapshot size) + 1 GB (memory snapshot size) = 22 GB.
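The storage formula can be sketched as a small function (the function and parameter names are mine; the arithmetic is the worked example above):

```python
# Sketch of the Org VDC storage-usage formula: allocated disk, plus memory
# (to account for the VM swap file), plus worst-case snapshot growth.

def storage_used_gb(disk_gb: float, ram_gb: float, snapshot: bool) -> float:
    used = disk_gb + ram_gb        # disk + VM swap file sized as RAM
    if snapshot:
        used += disk_gb + ram_gb   # max disk snapshot + memory snapshot
    return used

# One VM, 10 GB disk, 1 GB RAM, snapshot taken with memory:
print(storage_used_gb(10.0, 1.0, snapshot=True))   # 22.0
print(storage_used_gb(10.0, 1.0, snapshot=False))  # 11.0
```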
Yes, vCHS is using non-elastic Allocation Pool Org VDCs.
Amazon Web Services recently released a white paper describing how to deploy a highly available SQL Server 2012 instance in the AWS Cloud. It is quite an interesting read, as the Amazon AWS Cloud is not usually associated with running traditional enterprise applications that are not designed for failure. This is mainly because AWS does not offer infrastructure-level high availability, as is common with clouds based on vSphere, which offer vSphere HA by default.
Microsoft SQL Server 2012 AlwaysOn Availability Groups are built on top of Microsoft Windows Server Failover Clustering (WSFC) but do not require any shared block storage or storage replication, and are therefore a good candidate for deployment into public clouds that do not offer such highly available infrastructure. While reading the white paper I wondered what the experience and differences would be when deploying such a setup into a vCloud-type public cloud. As I am currently testing VMware vCloud Hybrid Service (vCHS), I deployed a SQL 2012 cluster into it just for fun.
Before I describe the exercise, let's talk about the reasons for doing this. We want to deploy an application into a public cloud that requires a SQL database with availability requirements higher than the cloud provider is offering. vCloud clouds with vSphere High Availability usually provide 99.9% availability, but that might not be enough if higher availability is required or if we want geo-protection. vSphere HA also does not help if we want to do rolling patch updates of the underlying Microsoft OS (every Patch Tuesday).
We can deploy multiple WSFC cluster nodes, each with a SQL DB running on it, into multiple availability zones or multiple regions of one vCloud provider, or even into clouds provided by different vCloud providers.
In short, SQL AlwaysOn Availability Groups are logical containers of databases and a unit of failover. Each set of availability databases is hosted by an availability replica, where there is always one read-write primary replica and one or more read-only secondary replicas ready for failover. The primary replica sends transaction logs to the secondary replicas, so there is no need for shared storage. AlwaysOn Availability Groups run on top of a WSFC cluster, which takes care of resource monitoring, quorum and failover of the resources in case of node failure.
I have closely followed the whitepaper design mentioned above. Using two cloud availability zones connected with VPN connectivity (provided by Org VDC Edge Gateways), I deployed a two-node cluster in different subnets. As all the nodes need to be in the same Active Directory domain, I also set up a replicated domain controller/DNS in each cloud. As we have an even number of cluster nodes, it is also necessary to provide a file share witness. For simplicity I used one provided by one of the domain controllers, but this might not be a good strategy for production deployments: failure of the availability zone where both the node and the witness file share are running would render the cluster unusable, as the surviving node would not be able to establish a majority and the failover would have to be forced manually. In production we would deploy the witness file share or a third node into a third availability zone (an on-premises data center).
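The witness-placement reasoning boils down to simple vote counting. In Node and File Share Majority quorum each node and the witness carry one vote, and the cluster stays online only while more than half of the votes survive; the sketch below just illustrates that arithmetic:

```python
# Sketch: why a two-node WSFC cluster needs its file share witness in a
# third location. With Node and File Share Majority quorum, each node and
# the witness has one vote; more than half must survive.

def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    return surviving_votes > total_votes // 2

TOTAL = 3  # node A + node B + file share witness

# Witness co-located with node A: losing that zone loses 2 of 3 votes,
# so the surviving node cannot form a majority.
print(has_quorum(1, TOTAL))  # False
# Witness in a third zone: losing either node's zone loses only 1 vote.
print(has_quorum(2, TOTAL))  # True
```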
I have not included any application logic, and all my testing was done from a simple on-premises ODBC-connected test application.
Here are the high-level deployment steps (more detail around the WSFC and SQL configuration is in the Amazon whitepaper).
Create OrgVDC networks in both clouds
On Edge Gateways establish Virtual Private Network between clouds
On Edge Gateways create firewall rules in both clouds to allow communication between all networks.
As can be seen from the screenshot, I did not go very granular, just for the sake of simplicity and laziness.
Deploy VMs (Windows 2012) for the domain controllers and create the domain. Also create DNAT rules on the Edge Gateways to be able to access the domain controllers from the internet (RDP: TCP port 3389) so they can act as jump servers. Optionally create an SNAT rule to be able to reach the internet (for updates, downloads, etc.). In the screenshot below you can also see the DNAT rule for the SQL Availability Group listener (port 5023). I used IP addresses 10.9.2.101 and 10.10.2.101 for the Windows node VMs, 10.9.2.102 and 10.10.2.102 for the WSFC cluster nodes and 10.9.2.103 and 10.10.2.103 for the availability group listener.
Create (witness) file share on the first domain controller
Deploy the cluster nodes, add them to the domain and create the cluster with Node and File Share Majority quorum.
Install SQL 2012 on both nodes, and enable AlwaysOn Availability in SQL Server Configuration Manager.
On one of the nodes, use SQL Server Management Studio to create a test SQL database in full recovery mode.
Still in SQL Server Management Studio, create the Availability Group. For the initial database synchronization (full backup and restore) we will need another share (Replica), which I also created on the first domain controller. We will also set up the listener.
As mentioned in step 4, I created DNAT rules for the listener IP addresses in each cloud and thus published the database to the internet. On the client PC I installed the SQL ODBC 11 driver, which supports AlwaysOn availability, and set up the database connection. As my client was not in the same domain, I had to create DNS records for the listener IPs and connect with SQL authentication. As the listener IPs are in different subnets, multi-subnet failover must be enabled in the driver, which means it will try to create parallel connections to all listener IP addresses, resulting in much faster failover. In my tests (I tried node failures and a split-brain scenario) it took about 10 seconds.
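A client connection with multi-subnet failover enabled can be sketched as follows. The listener DNS name, port, database and credentials are placeholders, and the driver name is an assumption (the MultiSubnetFailover keyword shipped with the SQL Server Native Client 11.0 / ODBC 11 generation of drivers):

```python
# Hedged sketch: building the ODBC connection string for an AlwaysOn
# Availability Group listener. MultiSubnetFailover=Yes makes the driver
# open parallel connections to all listener IPs, which is what brings
# failover down to seconds when the IPs live in different subnets.

def ag_connection_string(listener: str, port: int, db: str,
                         user: str, pwd: str) -> str:
    return (
        "Driver={SQL Server Native Client 11.0};"   # assumed driver name
        f"Server=tcp:{listener},{port};"
        f"Database={db};Uid={user};Pwd={pwd};"
        "MultiSubnetFailover=Yes;"
    )

# Placeholder values - substitute your own listener DNS name and credentials:
conn_str = ag_connection_string("aglistener.example.com", 5023,
                                "TestDB", "sqluser", "secret")
print(conn_str)
# With pyodbc installed, the string would be used as: pyodbc.connect(conn_str)
```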
That is it. I recommend the following Microsoft whitepaper for more information about SQL high availability.
In the Amazon whitepaper the cloud infrastructure creation and configuration steps (networks, NAT instances, security groups, IP assignments) were done with (fairly complex) CloudFormation templates. I am not aware of any similar tool for vCloud that would use the vCloud API to create an Org VDC infrastructure snapshot which could be reused to quickly redeploy to another Org VDC, but it did not take me that much time to create it through the vCloud GUI.