vCloud Availability – Updated Whitepaper

I have updated my vCAT-SP vCloud Availability whitepaper to reflect changes that came with vCloud Availability 2.0 and vSphere 6.5/6.7.

It can be downloaded from the vCAT-SP site from the Storage and Availability section. The direct link to PDF is here. You will know you have the latest document if you see June 2018 date on the title page.

Edit highlights:

  • Installer Appliance section
  • Tenant and Provider portal sections
  • PSC section update
  • Supported Org VDC Topologies
  • Application Network Design
  • Network Bandwidth Requirements
  • Monitoring updates
  • Updates and Upgrades section
  • Monitoring with vRealize Operations

vSphere Replication Issue with ESXi 6.5U1

This is a quick post to highlight an issue vSphere Replication has with ESXi 6.5U1 for To-the-cloud replication.

Only customers that use vSphere Replication for DR or migrations to the cloud endpoints (e.g. vCloud Availability for vCloud Director) with ESXi 6.5U1 hosts are affected (ESXi 6.5 and older works fine). Also host-to-host replication is not affected.

The root cause is that ESXi 6.5U1 hosts are unable to retrieve from vSphere Replication Appliance vr2c-firewall.vib that is responsible for opening outgoing communication ports for replication traffic on the ESXi host firewall.

This results in inability to perform any to-the-cloud replications. To see the issue look into the host Firewall configuration in the Security Profile section. If you do not see Replication-to-Cloud Traffic section you are affected.

The picture below which traffic it is related to (red rectangle on the left):

If you would look into esxupdate.log on the host you will see error: [Errno 14] curl#56 – “Content-Length: in 200 response”.

Until a fix is going to be released here is a workaround:

  1. Download the vr2c-firewall.vib from the vSphere Replication Appliance: https://vSphere-Replication-Appliance-ip-or-fqdn:8043/vib/vr2c-firewall.vib.
  2. Upload the vib to a shared location (datastore)
  3. Install the vib to every host with the following command: esxcli software vib install -v /vmfs/volumes/<datastore>/vr2c-firewall.vib
  4. Verify the fix was installed properly with: esxcli software vib list | grep vr2c

vCloud Availability: Replication of Powered-off VM

Just a short post about a feature I recently learned.

In vSphere Replication when you are configuring replication of powered-off VM you will get the following message:

The virtual machine is not powered on. Replication will start when the virtual machine is powered on.

The replication is actually configured and its placeholder VM is created in the recovery location (cloud) but the VM will stay in Not Active state.

Why is this? Immediate start of replication locks VM disks which means such VM would not be able to power-on until the initial sync is finished. But what if you want to replicate powered-off VMs for example templates that are never meant to run?

You can in fact force start the replication by right clicking the VM and selecting Sync Now, which asks confirmation question if we really want to do so as the VM will not be able the be powered on until the operation completes.

Is there a use case for this? As I mentioned this could be used for catalog sync as replication is much faster and efficient that OVF export / import.

Monitoring vSphere Replication RPO Compliance

Just a quick post to show how you can monitor Recovery Point Objective (RPO) compliance of a virtual machines protected with vSphere Replication.

Option 1: vCenter Server Alarm

When vSphere Replication Appliance is registered to vCenter Server multiple new vSphere Replication Event Types become available and can be used for creation of custom alarms.

List of all these event types can be queried with the following one-line PowerCLI command:

(get-view eventManager).get_Description()| select -expand Eventinfo |where FullFormat -like “*Hms*”

The following example will show how to set alarm for event “RPO violated”

Key:ExtendedEvent
Description: RPO violated
Category: error
FullFormat: com.vmware.vcHms.rpoViolatedEvent|Virtual machine vSphere Replication RPO is violated by [data.currentRpoViolation] minute(s)

  1. In vCenter Server go to Manager, Alarm Definitions and add new alarm
  2. Set alarm name, monitor VMs and specific events.
    new-alarm
  3. Enter the trigger (com.vmware.vcHms.rpoViolatedEvent)
    alarm-trigger
  4. Add Alarm actions (email, SNMP trap, run command etc.) as necessary.

Triggered alarm:

triggered-alarm

Note that this alarm applies only to VMs replicated from the particular vCenter Server. So it will not be triggered on VMs replicated to this vCenter Server.

Option 2: vCloud API

This options applies only for VM replications to or from a cloud provider who uses vCloud Availability add-on. The vCloud Director tenant APIs are extended with replication APIs. The state of each replication can be retrieved with:

GET /api/vr/replications/<replication-id>

and

GET /api/vr/failbackreplications/<replication-id>

Where list of all replications and their replication-ids is retrieved at org level with these two API calls:

GET /api/org/<org-id>/replications

and

GET /api/org/<org-id>/failbackreplications

An example of VM1 replication state (RPO 15 mins, not active with 16 min RPO violation):

replication-api

The following tables describes all the elements of the API response:

replication-details

vCloud Availability: Replication Traffic Deep Dive

VMware this week released vCloud Availability 1.0.1. It is a disaster recovery as a service solution that extends vCloud Director and enables VM replications between vSphere environments and a multitenant public cloud.

One of the unique features it offers is that there is no need for private networks for replication traffic between the tenant on-prem environment and the public cloud. The bi-directional replication can securely traverse the Internet and in this post I will dive deeper into how this is technically achieved.

On-Premises Components

The tenant on-prem environment can run almost any version of vSphere together with vSphere Replication (VR) appliance. The appliance must have access to the internet but does not need to have an external internet routable IP address. It can be behind a firewall with Source NATing and only one port open – TCP 443.

The appliance runs various services:

  • vSphere Replication Manager Service (vRMS) – the brain of the on-prem VR solution with an internal or optionally external database. It also provides the vSphere Web Client plugin extension for managing the replication.
  • vSphere Replication Server (vRS) – staging point for incoming (from-the-cloud) replications before they are de-staged via ESXi hosts to target datastores. This component can scale-out if needed by deploying additional appliances (cca 200 incoming replications for each vRS). It is not in the path for outgoing (to-the-cloud) replications.
  • vCloud Tunneling Agent (vCTA) – component that provides secure tunnel connections to the cloud. It also keeps control connection open so reverse replication can be initiated as well. More on that later.

That is it. There is nothing else needed to be installed by the tenant. This is because the actual replication engine (the vSphere Replication agent and vSCSI filter) is already present in the hypervisor – ESXi VMkernel. This also means no dependency what so ever on the storage hardware.

To-the-cloud replication flow:

ESXi host (VR Agent) > vSphere Replication Appliance (vCTA) > Internet > vCloud Availability public endpoint (Cloud Proxy load balanced VIP)

From-the-cloud replication flow:

vCloud Availability public endpoint (Cloud Proxy) > Internet > vSphere Replication Appliance (vCTA) > vSphere Replication Server (either embedded on VR appliance or standalone) > ESXi host

Public Cloud Components

In the cloud we need to have supported version of vCloud Director. It consists of multiple load balanced vCloud Director cells, database and resource vSphere environments.

For vCloud Availability we need vSphere Replication components – vRMS appliances for each resource vCenter Server and vRS appliances for replication staging that scale-out based on number of replications.

Additionally we need:

  • vSphere Replication Cloud Service (vRCS) – highly available appliances. The brain of the solution with extended vCloud VR APIs. It needs external Cassandra database and RabbitMQ to communicate with vCloud Director.
  • Cloud Proxies – load balanced vCloud Director cell like components with all vCloud services disabled with only the Cloud Proxy service running (multitenanted vCTA)
  • vCloud Availability Portal appliances – load balanced stateless components that provide portal to manage replications in the cloud when the on-prem vSphere Replication UI is not available (in disaster situations).

The provider can serve hundreds of customers with thousands of concurrent tunnels. To achieve such level of scalability, the Cloud Proxies are deployed in scale-out fashion with load balancer in front. The load balancer provides single endpoint for the on-prem vCTA control connection as well as for the to-the-cloud replication traffic.

To-the-cloud replication flow:

Tenant on-prem VR Appliance > Internet > Load balancer > Cloud Proxy (tunnel termination and decryption) > vRS > ESXi host.

From-the-cloud replication flow:

In order not to require public visibility of the on-prem tenant VR Appliance, the from-the-cloud replication is set up in a quite clever way. There are two options of doing this – one with load balancer with L7 application rules and the other without. The seconds approach is more scalable and recommended so let me describe it.

The following diagram shows the workflow:

from-the-cloud-replication

As was said before the connection is always initiated by the on-prem environment. That’s why we have the control connection (1) that is load balanced to one of the Cloud Proxies (in our example Cloud Proxy 2). The replicated traffic is coming from in the cloud resource ESXi host that sends it (2) via internal load balancer to one of the Cloud Proxies – in our case Cloud Proxy 1 (3). Through the control connection the on-prem vCTA endpoint is notified which Cloud Proxy is used for the particular replication (4-7). Now the on-prem vCTA can establish new connection to the correct Cloud Proxy 1 – it does not use its load balanced address, but instead a direct IP/FQDN that is DNATed 1:1 to the Cloud Proxy (9). Finally the two connections (3) and (9) can be stitched together (10) and we can start sending from-the-cloud replication traffic all the way to the on-prem environment.

To summarize, we need:

  • Cloud Proxy load balancer with CloudProxyBase VIP that is used for the Control connection and to-the-cloud replications.
  • Internal load balancer for resource ESXi to Cloud Proxy (from-the-cloud) traffic.
  • Additional public IP/FQDN for each Cloud Proxy for from-the-cloud traffic. This FQDN is configured on the Cloud Proxy cell in global.properties file (cloudproxy.reverseconnection.fqdn=FQDN:443).
  • As a consequence of using the same Cloud Proxy under different FQDNs (CloudProxyBase VIP and Cloud Proxy reverse connection) we need to take care that Cloud Proxy cell http certificate is set for both FQDNs. Probably the easiest way to achieve this is to use wild card certificate on cloud proxies (CN *.cloudproxy.example.com).