Availability Zone Design in vCloud Director


Service providers with multiple datacenters often want to offer their customers the choice of virtual datacenters in different availability zones. Failure of one availability zone should not impact another. Customers can then deploy their applications resiliently across both virtual datacenters, leveraging load balancing and application clustering.

Depending on the distance between the datacenters and the network latency between them, it is possible to have multiple availability zones accessible from within a single vCloud Director instance, which means one GUI or API endpoint and very easy consumption from the customer's perspective. Read the vCloud Architecture Toolkit – Architecting vCloud for more detail on latency and supportability considerations.

Multiple vCenter Server Design

The typical approach in a single-instance vCloud Director deployment is to give each availability zone its own vCenter Server and vCNS Manager. vCloud Director 5.5 can connect up to 20 vCenter Servers.

The following diagram shows how the management and resource clusters are typically placed across two sites.

Multi vC Design

Each site has its own management cluster. The shared cloud management VMs (vCloud Director cells, databases, Chargeback, AMQP, etc.) run primarily in site 1 with failover to site 2. Provider VDC management resources (vCenter Server, vCNS/NSX Managers, databases) are distributed to each site. There is no sharing of resource group components, which makes for a very clean availability zone design.

One problem for the customers is that they cannot stretch organization VDC networks between the sites. The reason is that although VXLAN networks can be stretched over routed Layer 3 networks between sites, they cannot be stretched between different vCenter Servers. A single vCNS/NSX Manager is the boundary for a VXLAN network, and there is a 1:1 relationship between vCenter Server and vCNS/NSX Manager. This means that if a customer wants to achieve communication between VMs in his VDCs in different availability zones, he has to create an Edge Gateway IPsec VPN or provide external network connectivity between them. All of this results in a quite complicated routing configuration. The following diagram shows a typical example of such a setup.

VDC design in multi vCenter Provider VDC

Single vCenter Server Design with Stretched Org Networks

I have come up with an alternative approach. The goal is to achieve a stretched Org VDC network between the two sites and have only one highly available Edge Gateway to manage. The desired target state is shown in the following diagram.

VDC design with stretched network

To accomplish this we need only one resource group vCenter Server instance, and thus one VXLAN domain, while still separating resources into two availability zones. The vCenter Server can be made resilient with vSphere HA (stretched cluster), vCenter Server Heartbeat or Site Recovery Manager.

Could we keep the same cluster design as in the multi-vCenter scenario, with each Provider VDC having its own set of clusters based on site? To answer this question I first need to describe the VXLAN transport zone (VXLAN scope) concept. VXLAN network pools created by vCloud Director have Provider VDC scope only. This means that any Org VDC network created from such a VXLAN network pool can span the clusters used by that Provider VDC. When a cluster is added to or removed from a Provider VDC, the VXLAN transport zone scope is expanded or reduced by that cluster. This can be viewed in vCNS Manager or in the NSX – Transport Zones menu.

VXLAN – Transport zone

There are two ways to expand the VXLAN transport zone.

Manual VXLAN Scope Expansion

The first one is simple and involves manually extending the VXLAN transport zone scope in vCNS or NSX Manager. The drawback is that any reconfiguration of the Provider VDC's clusters or resource pools will undo this manual change. As Provider VDC reconfigurations do not happen too often, this is a viable option.
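The same manual change can also be scripted against the vCNS/NSX REST API, which is handy if you need to re-apply it after a Provider VDC reconfiguration. The sketch below only assembles and prints the request body; the manager hostname, credentials, transport zone ID (`vdnscope-1`) and cluster MoRef (`domain-c123`) are placeholder assumptions, so substitute your own values and uncomment the curl call:

```shell
# Placeholder values -- replace with your environment's:
NSX_MGR="nsxmgr.example.com"   # vCNS/NSX Manager address (hypothetical)
SCOPE="vdnscope-1"             # transport zone ID (GET /api/2.0/vdn/scopes lists them)
CLUSTER="domain-c123"          # MoRef of the cluster to add (hypothetical)

# Request body for the transport zone expand call (NSX-v style vdnScope payload):
BODY="<vdnScope>
  <objectId>${SCOPE}</objectId>
  <clusters>
    <cluster><cluster><objectId>${CLUSTER}</objectId></cluster></cluster>
  </clusters>
</vdnScope>"

# To actually expand the scope, uncomment and supply real credentials:
# curl -k -u admin:password -H 'Content-Type: application/xml' -X POST \
#   -d "$BODY" "https://${NSX_MGR}/api/2.0/vdn/scopes/${SCOPE}?action=expand"
echo "$BODY"
```

The same caveat applies as with the UI approach: a later Provider VDC reconfiguration by vCloud Director can revert the manually expanded scope.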

Stretching Provider VDCs

The second solution involves stretching at least one Provider VDC into the other site so that its VXLAN scope covers both sites. The resulting network pool (which created the VXLAN transport zone) can then be assigned to Org VDCs that need to span networks between sites. This can be achieved by using multiple resource pools inside the clusters and assigning them to Provider VDCs. As we want to stretch only the VXLAN networks and not the actual compute resources (we do not want vCloud Director deploying VMs into the wrong site), we will use site-specific storage policies. Although a Provider VDC will have access to a resource pool from the other site, it will not have access to its storage, as only storage from the first site is assigned to it.

Hopefully the following diagram better describes the second solution:

Stretched PVDC Design


The advantage of the second approach is that it is a much cleaner solution from a support perspective, although the actual setup is more complex.

Highly Available Edge Gateway

Now that we have successfully stretched the Org VDC network between both sites, we also need to solve Edge Gateway site resiliency; resilient applications without connectivity to the external world are useless. The Edge Gateway (and the actual Org VDC) is created inside one Provider VDC (let's call it primary). The Org VDC network is marked as shared so that other Org VDCs can use it as well. The Edge Gateways are deployed by the service provider, who deploys the Edge Gateway in high availability configuration, which results in two Edge Gateway VMs deployed in the same primary Provider VDC (in a System VDC sub-resource pool). The VMs use an internal Org VDC network for heartbeat communication. The trick to making it site resilient is to go into the vCNS/NSX Edge configuration and change the resource pool (a System VDC RP with the same name but in a different cluster) and datastore of the second instance to the other site. vCNS/NSX Manager then immediately redeploys the reconfigured Edge instance to the other site. This change survives Edge Gateway redeploys from within vCloud Director without any problems.
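The appliance placement change described above can also be driven through the vCNS/NSX Edge appliances API rather than the UI. Everything below is a hypothetical placeholder (host, Edge ID, resource pool and datastore MoRefs); the sketch only prints the payload it would send, with the actual PUT call left commented out:

```shell
EDGE="edge-1"                            # hypothetical Edge ID
# MoRefs of the per-site System VDC resource pools and datastores (placeholders):
RP_SITE1="resgroup-101"; DS_SITE1="datastore-201"
RP_SITE2="resgroup-102"; DS_SITE2="datastore-202"

# One <appliance> entry per HA instance, each pinned to its own site:
BODY="<appliances>
  <applianceSize>compact</applianceSize>
  <appliance>
    <resourcePoolId>${RP_SITE1}</resourcePoolId>
    <datastoreId>${DS_SITE1}</datastoreId>
  </appliance>
  <appliance>
    <resourcePoolId>${RP_SITE2}</resourcePoolId>
    <datastoreId>${DS_SITE2}</datastoreId>
  </appliance>
</appliances>"

# Uncomment to apply against your manager (credentials and hostname are placeholders):
# curl -k -u admin:password -H 'Content-Type: application/xml' -X PUT \
#   -d "$BODY" "https://nsxmgr.example.com/api/4.0/edges/${EDGE}/appliances"
echo "$BODY"
```

After the manager processes the change, the second HA instance is redeployed into the other site's resource pool and datastore, exactly as with the UI workflow.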

HA Edge


Migration from Nexus 1000V to NSX in vCloud Director

A few service providers have asked me about migration possibilities from the Nexus 1000V to NSX when using vCloud Director. I will first explain the process at a high level and then offer a step-by-step approach.


NSX (strictly speaking, NSX for vSphere) is built on top of the vSphere Distributed Switch, which it extends with a few VMkernel modules and host-level agents. It cannot run on top of the Nexus 1000V. Therefore it is necessary to first migrate from the Nexus 1000V to the vSphere Distributed Switch (vDS) and then install NSX. The core management component of NSX is NSX Manager, which runs as a virtual appliance. It is in a sense very similar to the vShield or vCloud Networking and Security Manager, and there is actually a direct upgrade path from vCNS Manager to NSX Manager.

NSX is backward compatible with the vCNS API and therefore works without any problems with vCloud Director. However, NSX advanced features (distributed logical router and firewall, dynamic routing protocol support on Edges, etc.) are not available via the vCloud Director GUI or vCloud API. The only exception is multicast-less VXLAN, which works with vCloud Director out of the box.

Migration from Nexus 1000V to vSphere Distributed Switch

In a pure vSphere environment it is fairly easy to migrate from the Nexus 1000V to vDS, and it can even be done without any VM downtime, as both distributed switches (vDS and Nexus) can coexist on the same ESXi host. However, this is not the case when vCloud Director with VXLAN technology is involved. VXLAN is configured with per-cluster granularity, so it is not possible to mix two VXLAN providers (vDS and Nexus) in the same cluster. This unfortunately means we cannot do a live VM migration from Nexus to vDS, as the switches are in different clusters and (live) vMotion does not work across two different distributed switches. Cold migration must be used instead.

Note that if VXLAN is not involved, live migration is possible; this will be leveraged while migrating the vCloud external networks.

Another point should be noted: we are going to mix two VXLAN providers in the same Provider VDC, meaning that the VXLAN scope will span both the Nexus 1000V and the vSphere Distributed Switch. To my knowledge this is neither recommended nor supported, although it works, and both VTEP types can communicate with each other thanks to the multicast control plane. As we will be migrating VMs in the powered-off state, this is not an issue and no communication will run over the mixed VXLAN network during the migration.

High Level Steps

We will need two clusters: one legacy cluster with the Nexus 1000V and one empty cluster for the vDS. I will use two vDS switches, although one could be enough. One vDS is used for management, vMotion, NFS and VLAN traffic (vCloud external networks); the other is used purely for VXLAN VM traffic.

We first need to migrate all VLAN-based networks from the Nexus 1000V to vDS1. Then we prepare the second cluster with vDS-based VXLAN and create an elastic Provider VDC containing both clusters (resource pools). We then disable the Nexus resource pool and migrate all VMs, templates and Edges off of it to the new one – see my article vCloud Director: Online Migration of Virtual Data Center.

We will then detach the old cluster, remove and uninstall the Nexus 1000V, create a vDS with VXLAN in its place, and add it back to the Provider VDC. After this we can upgrade the vCNS Manager to NSX Manager, prepare all hosts for NSX, and install the NSX Controllers. We can optionally change the VXLAN transport mode from multicast to unicast/hybrid. We cannot, however, upgrade vCloud Director-deployed vShield Edges to NSX Edges, as that would break their compatibility with vCloud Director.

Step-by-step Procedure

  1. Create a new vDS switch. Migrate the management networks (vMotion, NFS, management) and the vCloud external networks to it (NAT rules on Edge Gateways using external network IPs will need to be removed and re-added).
    Prepare a new cluster (Cluster2) and add it to the same vDS. Optionally create a new vDS for the VXLAN networks (if the first one will not be used). Prepare the VXLAN fabric on Cluster2 in vCNS Manager.
    Distributed switches
    VXLAN fabric before migration
    Nexus1000V – Cluster1 VXLAN only
    dvSwitch01 – Cluster1 and Cluster2 external networks, vMotion, management and NFS
    dvSwitch02 – Cluster2 VXLAN only
  2. Create a new Provider VDC on the new cluster (GoldVDCnew).
  3. Merge the new Provider VDC with the old one. Make sure the new Provider VDC is primary! Disable the secondary (Cluster1) resource pool.
    Merging PVDCs
    PVDC Resource Pools
  4. Migrate VMs from Cluster1 to Cluster2 from within vCloud Director. VMs connected to a VXLAN network will need to be powered off, as it is not possible to do a live vMotion between two different distributed switches.
    VM Migration
  5. Migrate templates (by moving them to another catalog). As Cluster1 is disabled, the templates will be registered on hosts from Cluster2.
  6. Redeploy the Edges from within vCloud Director (not vCNS Manager). Again, because Cluster1 is disabled, the Edges will be registered on hosts from Cluster2.
  7. Detach the (now empty) resource pool.
    Detaching Resource Pool
  8. Rename the Provider VDC to the original name (GoldVDC).
  9. Unprepare the VXLAN fabric on Cluster1. Remove the VXLAN vmknics on all hosts from the Nexus 1000V.
  10. Remove the Nexus 1000V switch from the (now empty) cluster and extend the VXLAN vDS there (or create a new one if no L2 connectivity exists between the clusters). Remove the Nexus VEM VIBs from all hosts. Prepare the VXLAN fabric on the cluster and add it to the Provider VDC.
    VXLAN fabric after migration
  11. Upgrade vCNS Manager to NSX Manager with the vCNS to NSX 6 Upgrade Bundle.
    vCNS Manager upgrade
    NSX Manager
  12. Update the hosts in vSphere Web Client > Networking and Security > Installation > Host Preparation. This action requires a host reboot.
    Host preparation for NSX – before
    Host preparation for NSX – after
  13. (Optionally) install NSX Controllers.
    NSX Controller installation
  14. (Optionally) change VXLAN transport mode to Unicast/Hybrid.
    VXLAN transport mode
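Step 10's removal of the Nexus VEM VIBs is done per host from the ESXi shell with esxcli. The VIB name below is only an illustrative example, since it varies per Nexus 1000V release; list the Cisco VIBs on the host first and use the exact name reported there. The sketch just assembles and prints the removal command:

```shell
# On each ESXi host, first discover the actual Cisco VEM VIB name:
#   esxcli software vib list | grep -i cisco
VIB_NAME="cisco-vem-v164-esx"   # hypothetical example name -- yours will differ
CMD="esxcli software vib remove --vibname=${VIB_NAME}"
echo "$CMD"   # run this on the host once the name is confirmed
```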

vCloud Director – vCenter Single Sign-On Integration Troubleshooting

Access to the vCloud Director provider context (system administrator accounts) can be federated with vCenter Single Sign-On. This means that when a vCloud system administrator wants to log into vCloud Director (http://vCloud-FQDN/cloud), he is redirected to the vSphere Web Client, where he needs to authenticate, and is then redirected back to vCloud Director.


Here follows a collection of topics that might be useful when troubleshooting the federation:

1. When SSO federation is enabled, the end user is always redirected to the vSphere Web Client. If you want to use local authentication, go to http://vCloud-FQDN/cloud/login.jsp and type a local or LDAP account (if LDAP is configured).

2. If you have enabled SSO federation and the vCenter SSO is no longer available, you cannot unregister its lookup service. To do so, go to the vCloud database and clear the lookupservice.url value in the dbo.config table.
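On a Microsoft SQL Server backed vCloud database, the cleanup could look like the sketch below. Stop the cells and back up the database first; the column names (`name`, `value`) and the sqlcmd connection details are assumptions on my part, so verify them against your schema before running anything:

```shell
# Hypothetical example -- verify column names against your vCloud DB schema first.
SQL="UPDATE dbo.config SET value = '' WHERE name = 'lookupservice.url';"
# Run it against your vCloud database, e.g. (connection details are placeholders):
# sqlcmd -S vcd-db.example.com -d vcloud -U vcloud -Q "$SQL"
echo "$SQL"
```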

3. In case you are using a self-signed untrusted certificate on the vSphere Web Client, some browsers (Firefox) might not display the Add Security Exception button when being redirected. Therefore, first open the vSphere Web Client page directly and create the security exception; the redirection from the vCloud website should then work.

4. HTTP ERROR 500. Problem accessing /cloud/saml/HoKSSO/alias/vcd. Reason: Error determining metadata contracts
Metadata for issuer https://xxx:7444/STS wasn’t found

This error might appear after the vSphere Web Client SSO authentication, when the user is redirected back to the vCloud portal. To understand it, let's first talk about what goes on in the background. When SSO federation is enabled, vCloud Director establishes trust with vCenter SSO. The trust is needed so that the identity provider (SSO) knows the authentication request is not malicious (phishing), and so that the service provider (vCloud Director) can be sure the reply comes from the right identity provider.

The trust is established when the federation is configured, via a metadata exchange that contains keys and some information about the other party. The SSO metadata can be seen in the vCloud Director database in the dbo.saml_id_provider_settings table. During the actual authentication process, if for some reason the security token reply comes from a different source than the one expected based on the identity provider metadata, you will get this error.

This issue might happen, for example, if the vCenter SSO hostname has been changed. In the particular case I encountered, it happened on vCenter Server Appliance 5.1. The SSO had been initialized before the hostname was set, so the identity provider response came with metadata containing the issuer's IP address instead of the FQDN, which the service provider (VCD) expected based on the SSO endpoint address. The issuer information did not get updated after the hostname change.

On VCSA 5.1 the issuer information is stored in the SSO PostgreSQL database. These are the steps to change it:

  1. Get the SSO DB instance name, user and password

cat /usr/lib/vmware-sso/webapps/lookupservice/WEB-INF/classes/config.properties


  2. Connect to the DB with the password retrieved in the previous step (db.pass=…)

/opt/vmware/vpostgres/1.0/bin/psql ssodb -U ssod

Retrieve the STS issuer:

ssodb=> select issuer from ims_sts_config;

If the issuer is really incorrect, update it with the following command:

ssodb=> update ims_sts_config SET issuer='https://FQDN:7444/STS';

Note: I need to credit William Lam for helping me find where the SSO DB password is stored.

Memory Overhead in vCloud Director

This little-known fact has come up twice recently, so I will write a short post about it.

In the past, when you created an allocation pool Organization Virtual Datacenter (Org VDC) and allocated, for example, 20 GB RAM to it, it actually did not allow you to deploy VMs with a total memory sum of 20 GB, because the virtualization memory overhead was charged against the allocation as well. This behavior forced service providers to add an additional 5%–15% to the Org VDC memory allocation. It was also very confusing for end users, who complained that their VMs could not power on.
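As a back-of-the-envelope illustration of the old behavior (the per-VM overhead figure below is a made-up round number; the real overhead depends on vCPU count, configured RAM and virtual hardware version):

```shell
ALLOC_MB=$((20 * 1024))   # 20 GB Org VDC memory allocation
VM_MB=2048                # each VM configured with 2 GB
OVERHEAD_MB=100           # assumed per-VM virtualization overhead (illustrative only)

# With the overhead charged against the allocation, fewer VMs fit than expected:
FIT=$(( ALLOC_MB / (VM_MB + OVERHEAD_MB) ))
echo "VMs that power on: $FIT out of the expected $(( ALLOC_MB / VM_MB ))"
```

Which is exactly why providers padded allocations by 5%–15% before the 5.1 elasticity changes.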

With the elastic allocation pool VDC changes that came with vCloud Director 5.1, this is no longer an issue. The reason is that in a non-elastic VDC it is vSphere (via the Org VDC resource pool object with a limit set) that does the admission control – i.e. decides how many VMs can be deployed into it. In an elastic VDC it is actually vCloud Director that decides whether a VM can be deployed into a particular VDC.

Allocation Pool Elasticity

So to sum it up: with an elastic allocation pool, the tenant can use the allocation up to the last MB. The virtualization VM memory overhead is charged to the provider, who must take it into account when doing capacity management.

Datastore Cluster Issue in vCloud Director 5.5.1

I just found out there is a known issue with moving datastores in and out of datastore clusters. In vCloud Director 5.1 this worked fine: vCloud Director refers to storage via storage policies (formerly profiles), so you could change the datastore cluster structure on the fly (as long as all datastores inside had the same storage policy).

However, in vCloud Director 5.5.1, if you move a datastore in or out of a datastore cluster, vCloud Director will lose it. The fix is described in KB 2075366 and involves a cleanup of the vCloud Director database inventory data.