Load Balancing vCloud Director Cells with NSX Edge Gateway

About two years ago I wrote an article on how to use vShield Edge to load balance vCloud Director cells. I am revisiting the subject, this time with the NSX Edge load balancer. One of the main improvements of the NSX Edge load balancer is SSL termination.

Why SSL Termination?

All vCloud GUI and API requests are made over HTTPS. When the vShield (vCNS) Edge was used for load balancing, it simply passed the traffic through untouched. There was no way to inspect the requests – the load balancer saw only the source and destination IPs and encrypted data. If we want to inspect the HTTP requests, we need to terminate the SSL session on the load balancer and then create a new SSL session towards the cell pool.

SSL Termination

 

This way we can filter URLs, modify headers or do even more advanced inspection. I will demonstrate how to easily block portal access for a given organization and how to add the X-Forwarded-For header so vCloud Director can log the actual end-user's IP address and not only the load balancer's.

Basic Configuration

I am going to use exactly the same setup as in my vShield article. Two vCloud Director cells (IP addresses 10.0.1.60-61 and 10.0.1.62-63) behind Virtual IPs – 10.0.2.80 (portal/API) and 10.0.2.81 (VMRC).

vCloud Director Design

While the NSX Edge load balancer is very similar to the vShield load balancer, the UI and the configuration workflow have changed quite a bit. I will only briefly describe the steps to set up basic load balancing:

  1. Create Application Profiles for VCD HTTP (port 80), VCD HTTPS (port 443) and VCD VMRC (port 443). We will use the HTTP, HTTPS and TCP types respectively. For HTTPS, we will enable SSL passthrough for now.
  2. Create a new Service Monitor (type HTTPS, method GET, URL /cloud/server_status).
    Service Monitor
  3. Create server Pools (VCD_HTTP with members 10.0.1.60 and 10.0.1.62, port 80, monitor port 443; VCD_HTTPS with members 10.0.1.60 and 10.0.1.62, port 443, monitor port 443; and VCD_VMRC with members 10.0.1.61 and 10.0.1.63, port 443, monitor port 443). Always use the monitor created in the previous step. I used the Round Robin algorithm.
    Pools
  4. Create Virtual Servers for respective pools, application profiles and external IP/port (10.0.2.80:80 for VCD_HTTP, 10.0.2.80:443 for VCD_HTTPS and 10.0.2.81:443 for VCD_VMRC).
  5. Enable load balancer in its Global Configuration.
    Global LB Config

Now we should have load-balanced access to vCloud Director with functionality identical to the vShield Edge case.

Advanced Configuration

Now comes the fun part. To terminate the SSL session at the Edge we need to create the vCloud HTTP SSL certificate and upload it to the load balancer. Note that it is not possible to terminate the VMRC proxy, as it is a pure SSL socket connection rather than HTTPS. As I have vCloud Director 5.5, I had to use a certificate identical to the one on the cells, otherwise catalog OVF/ISO uploads would fail with an SSL thumbprint mismatch (see KB 2070908 for more details). The actual private key, certificate signing request and certificate creation and import were not straightforward, so I am listing the exact commands I used (do not create the CSR on the load balancer, as you would not be able to export the private key to later import it to the cells):

  1. Create private key with pass phrase encryption with openssl:
    openssl genrsa -aes128 -passout pass:passwd -out http.key 2048
  2. Create certificate signing request with openssl:
    openssl req -new -key http.key -out http.csr
  3. Sign CSR (http.csr) with your or public Certificate Authority to get http.crt.
  4. Upload the certificate and key to both cells. See how to import private key in my older article.
  5. Import your root CA and http certificate to the NSX Edge (Manager > Settings > Certificates).
    Certificates
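The steps above can be sketched end to end. The following is a minimal, self-contained example in which a self-signed certificate stands in for the CA-signed http.crt from step 3, and the subject CN (vcloud.example.com) is a hypothetical placeholder for your cell's FQDN. The final check verifies that the key and certificate moduli match before importing the pair to the cells and the Edge.

```shell
# 1. Private key with pass phrase encryption
openssl genrsa -aes128 -passout pass:passwd -out http.key 2048

# 2. Certificate signing request (subject given inline to avoid prompts)
openssl req -new -key http.key -passin pass:passwd \
  -subj "/CN=vcloud.example.com" -out http.csr

# 3. Self-sign for illustration only; in a real deployment http.csr
#    is signed by your or a public CA to produce http.crt
openssl x509 -req -in http.csr -signkey http.key -passin pass:passwd \
  -days 365 -out http.crt

# Sanity check: the two md5 digests below must be identical,
# proving the private key and certificate belong together
openssl rsa -noout -modulus -in http.key -passin pass:passwd | openssl md5
openssl x509 -noout -modulus -in http.crt | openssl md5
```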

Now we will create a simple Application Rule that will block vCloud portal access to organization ACME (filter /cloud/org/acme URL).

acl block_ACME path_beg -i /cloud/org/acme
block if block_ACME

Application Rule

Now we will change the previously created VCD_HTTPS Application Profile. We will disable SSL Passthrough, check Insert X-Forwarded-For HTTP header (which passes the original client IP address to vCloud Director) and enable Pool Side SSL. Select the previously imported Service Certificate.

Application Profiles

And finally we will assign the Application Rule to the VCD_HTTPS Virtual Server.

Virtual Servers

Now we can test: we should be able to access the vCloud Director portal and see the new certificate, we should not be able to access the portal of the ACME organization, and we should see both the client and proxy IP addresses in the logs.

ACME Forbidden

Client IP

For more advanced application rules check HAProxy documentation.
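As an illustration, here are a couple of hedged examples in HAProxy syntax (which NSX Edge application rules use). The second organization name and the HTTP-to-HTTPS redirect are hypothetical additions, not part of the setup above:

```
# Block portal access for two organizations at once
acl block_ACME path_beg -i /cloud/org/acme
acl block_COYOTE path_beg -i /cloud/org/coyote
block if block_ACME or block_COYOTE

# Redirect plain HTTP requests to HTTPS
redirect scheme https code 301 if !{ ssl_fc }
```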

vCloud Connector in Multisite Environment with Single VC

I have received an interesting question. A customer has a single vCenter Server environment with a large number of sites. Can they use vCloud Connector Content Sync to keep templates synced among the sites?

The answer is yes, but the setup is not so straightforward. The vCloud Connector client does not allow registration of multiple nodes pointing to the same endpoint (vCenter Server). The workaround is to fool vCloud Connector by using different FQDNs for the same endpoint (IP address). Example: vcenter01.fojta.com, vcenter01a.fojta.com and vcenter01b.fojta.com all pointing to the same IP address. This obviously impacts SSL certificate verification, which must either be turned off, or the vCenter Server certificate must include all those alternative names in its Subject Alternative Name (SAN) attribute.

So the customer would deploy a vCloud Connector node in each site and register it with the central vCenter Server, each time using a different FQDN. In the vCloud Connector client each site then looks like a different vCenter Server and Content Sync can be set up.
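As a sketch, the aliasing can be done either with DNS A records or, for a quick test, in the hosts file of each vCC node. The IP address below is a hypothetical placeholder:

```
# /etc/hosts on a vCC node – three FQDNs, one vCenter Server IP
10.0.0.10  vcenter01.fojta.com
10.0.0.10  vcenter01a.fojta.com
10.0.0.10  vcenter01b.fojta.com
```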

Availability Zone Design in vCloud Director

Introduction

Service providers with multiple datacenters often want to offer their customers the choice of having virtual datacenters in different availability zones. Failure of one availability zone should not impact another availability zone. The customer can then deploy his application resiliently across both virtual datacenters, leveraging load balancing and application clustering.

Depending on the distance between the datacenters and the network latency between them, it is possible to have multiple availability zones accessible from within a single vCloud Director instance, which means one GUI or API endpoint and very easy consumption from the customer's perspective. Read the vCloud Architecture Toolkit – Architecting vCloud for more detail on latency and supportability considerations.

Multiple vCenter Server Design

The typical approach in a single-instance vCloud Director deployment is to have a separate vCenter Server and vCNS Manager for each availability zone. vCloud Director 5.5 can connect up to 20 vCenter Servers.

The following diagram shows how the management and resource clusters are typically placed across two sites.

Multi vC Design

Each site has a management cluster. The shared cloud management VMs (vCloud Director cells, databases, Chargeback, AMQP, etc.) run primarily in site 1 with failover to site 2. Provider VDC management resources (vCenter Server, vCNS/NSX Managers, databases) are distributed to each site. There is no sharing of resource group components, which makes for a very clean availability zone design.

One problem for the customers is that they cannot stretch organization VDC networks between the sites. The reason is that although VXLAN networks can be stretched over routed Layer 3 networks between sites, they cannot be stretched between different vCenter Servers. A single vCNS/NSX Manager is the boundary of a VXLAN network and there is a 1:1 relationship between a vCenter Server and a vCNS/NSX Manager. This means that if the customer wants VMs in his VDCs from different availability zones to communicate, he has to create an Edge Gateway IPsec VPN or provide external network connectivity between them. All that results in a quite complicated routing configuration. The following diagram shows a typical example of such a setup.

VDC design in multi vCenter Provider VDC

Single vCenter Server Design with Stretched Org Networks

I have come up with an alternative approach. The goal is to be able to achieve stretched OrgVDC network between two sites and have only one highly available Edge Gateway to manage. The desirable target state is shown in the following diagram.

VDC design with stretched network

To accomplish this we need only one Resource Group vCenter Server instance and thus one VXLAN domain, while still having the separation of resources into two availability zones. The vCenter Server can be made resilient with vSphere HA (stretched cluster), vCenter Server Heartbeat or Site Recovery Manager.

Could we keep the same cluster design as in the multi-vCenter scenario, with each Provider VDC having its own set of clusters based on site? To answer this question I first need to describe the VXLAN transport zone (VXLAN scope) concept. VXLAN network pools created by vCloud Director have only Provider VDC scope. This means that any Org VDC network created from such a VXLAN network pool can span the clusters that are used by the Provider VDC. When a cluster is added to or removed from a Provider VDC, the VXLAN transport zone scope is expanded or reduced by that cluster. This can be viewed in vCNS Manager or in the NSX Transport Zone menu.

VXLAN - Transport zone

There are two ways to expand the VXLAN transport zone.

Manual VXLAN Scope Expansion

The first one is simple enough and involves manually extending the VXLAN transport zone scope in vCNS or NSX Manager. The drawback is that any reconfiguration of Provider VDC clusters or resource pools will remove this manual change. As Provider VDC reconfigurations do not happen too often, this is a viable option.

Stretching Provider VDCs

The second solution involves stretching at least one Provider VDC into the other site so that its VXLAN scope covers both sites. The resulting Network Pool (which created the VXLAN transport zone) can then be assigned to Org VDCs that need to span networks between sites. This can be achieved by using multiple Resource Pools inside clusters and assigning those to Provider VDCs. As we want to stretch only the VXLAN networks and not the actual compute resources (we do not want vCloud Director deploying VMs into the wrong site), we will have site-specific storage policies. Although a Provider VDC will have access to a Resource Pool from the other site, it will not have access to the storage, as only storage from the first site is assigned to it.

Hopefully the following diagram better describes the second solution:

Stretched PVDC Design

 

The advantage of the second approach is that it is a much cleaner solution from a support perspective, although the actual setup is more complex.

Highly Available Edge Gateway

Now that we have successfully stretched the Org VDC network between both sites, we also need to make the Edge Gateway site resilient. Resilient applications without connectivity to the external world are useless. The Edge Gateway (and the actual Org VDC) is created inside one Provider VDC (let's call it primary). The Org VDC network is marked as shared so other Org VDCs can use it as well. The Edge Gateways are deployed by the service provider. He will deploy the Edge Gateway in high availability configuration, which results in two Edge Gateway VMs deployed in the same primary Provider VDC (in the System VDC sub-resource pool). The VMs use an internal Org VDC network for heartbeat communication. The trick to make it site resilient is to go into the vCNS/NSX Edge configuration and change the Resource Pool (the System VDC RP with the same name but in the other cluster) and Datastore of the second instance to the other site. vCNS/NSX Manager then immediately redeploys the reconfigured Edge instance to the other site. This change survives Edge Gateway redeploys from within vCloud Director without any problems.

HA Edge

 

Migration from Nexus 1000V to NSX in vCloud Director

A few service providers were asking me questions about migration possibilities from Nexus 1000V to NSX if using vCloud Director. I will first try to explain the process in high level and then offer a step by step approach.

Introduction

NSX (actually NSX for vSphere) is built on top of the vSphere Distributed Switch, which is extended by a few VMkernel modules and host-level agents. It cannot run on top of Nexus 1000V. Therefore it is necessary to first migrate from Nexus 1000V to the vSphere Distributed Switch (vDS) and then install NSX. The core management component of NSX is the NSX Manager, which runs as a virtual appliance. It is very similar in this sense to the vShield or vCloud Networking and Security Manager, and there is actually a direct upgrade path from vCNS Manager to NSX Manager.

NSX is backward compatible with the vCNS API and therefore works without any problems with vCloud Director. However, NSX advanced features (distributed logical router and firewall, dynamic routing protocol support on Edges, etc.) are not available via the vCloud Director GUI or vCloud API. The only exception is multicast-less VXLAN, which works with vCloud Director out of the box.

Migration from Nexus 1000V to vSphere Distributed Switch

In a pure vSphere environment it is fairly easy to migrate from Nexus 1000V to vDS, and it can be done even without any VM downtime, as it is possible to have both distributed switches (vDS and Nexus) on the same ESXi host. However, this is not the case when vCloud Director with VXLAN technology is involved. VXLAN is configured with per-cluster granularity and therefore it is not possible to mix two VXLAN providers (vDS and Nexus) in the same cluster. This unfortunately means we cannot do live VM migration from Nexus to vDS, as the switches are in different clusters and (live) vMotion does not work across two different distributed switches. Cold migration must be used instead.

Note that if VXLAN is not involved, live migration is possible which will be leveraged while migrating vCloud external networks.

Another point should be noted: we are going to mix two VXLAN providers in the same Provider VDC, meaning that the VXLAN scope will span both Nexus 1000V and vSphere Distributed Switch. To my knowledge this is neither recommended nor supported, although it works and both VTEP types can communicate with each other thanks to the multicast control plane. As we will be migrating VMs in powered-off state this is not an issue, and no communication will run over the mixed VXLAN network during the migration.

High Level Steps

We will need two clusters: one legacy cluster with Nexus 1000V and one empty cluster for vDS. I will use two vDS switches, although one could be enough. One vDS is used for management, vMotion, NFS and VLAN traffic (vCloud external networks); the other vDS is used purely for VXLAN VM traffic.

We first need to migrate all VLAN based networks from Nexus 1000V to vDS1. Then we prepare the second cluster with vDS based VXLAN and create an elastic Provider VDC which will contain both clusters (resource pools). We will disable the Nexus resource pool and migrate all VMs, templates and Edges off of it to the new one – see my article vCloud Director: Online Migration of Virtual Data Center.

We will detach the old cluster, remove and uninstall Nexus 1000V, then create a vDS with VXLAN instead and add it back to the Provider VDC. After this we can upgrade vCNS Manager to NSX Manager, prepare all hosts for NSX and also install NSX Controllers. We can optionally change the VXLAN transport mode from Multicast to Unicast/Hybrid. We however cannot upgrade vCloud Director deployed vShield Edges to NSX Edges, as that would break their compatibility with vCloud Director.

Step-by-step Procedure

  1. Create a new vDS switch. Migrate the management networks (vMotion, NFS, management) and vCloud external networks to it (NAT rules on Edge Gateways using external network IPs will need to be removed and re-added).
    Prepare a new cluster (Cluster2) and add it to the same vDS. Optionally create a new vDS for VXLAN networks (if the first one will not be used). Prepare the VXLAN fabric on Cluster2 in vCNS Manager.
    Distributed switches
    VXLAN fabric before migration
    Nexus1000V – Cluster1 VXLAN only
    dvSwitch01 – Cluster1 and Cluster2 external networks, vMotion, management and NFS
    dvSwitch02 – Cluster2 VXLAN only
  2. Create new Provider VDC on new cluster (GoldVDCnew)
  3. Merge new Provider VDC with the old one. Make sure the new Provider VDC is primary! Disable the secondary (Cluster1) Resource Pool.
    Merging PVDCs
    PVDC Resource Pools
  4. Migrate VMs from Cluster1 to Cluster2 from within vCloud Director. VMs connected to a VXLAN network will need to be powered off as it is not possible to do live vMotion between two different distributed switches.
    VM Migration
  5. Migrate templates (move them to another catalog). As Cluster1 is disabled, the templates will be registered on hosts from Cluster2.
  6. Redeploy Edges from within vCloud Director (not vCNS Manager). Again the disabled Cluster1 will mean that the Edges will be registered on hosts from Cluster2.
  7. Detach (now empty) Resource Pool
    Detaching Resource Pool
  8. Rename Provider VDC to the original name (GoldVDC)
  9. Unprepare the VXLAN fabric on Cluster1. Remove the VXLAN vmknics from Nexus1000V on all hosts.
  10. Remove the Nexus1000V switch from the (now empty) cluster and extend the VXLAN vDS there (or create a new one if no L2 connectivity exists between the clusters). Remove the Nexus VEM VIBs from all hosts. Prepare the VXLAN fabric on the cluster and add it to the Provider VDC.
    VXLAN fabric after migration
  11. Upgrade vCNS Manager to NSX Manager with vCNS to NSX 6 Upgrade Bundle
    vCNS Manager upgrade
    NSX Manager
  12. Update hosts in vSphere Web Client > Networking and Security > Installation > Host Preparation. This action will require host reboot.
    Host preparation for NSX - beforeHost preparation for NSX - after
  13. (Optionally) install NSX controllers
    NSX Controller installation
  14. (Optionally) change VXLAN transport mode to Unicast/Hybrid.
    VXLAN transport mode

vCloud Director – vCenter Single Sign-On Integration Troubleshooting

Access to the vCloud Director provider context (system administrator accounts) can be federated with vCenter Single Sign-On. This means that when a vCloud system administrator wants to log into vCloud Director (http://vCloud-FQDN/cloud) he is redirected to the vSphere Web Client, where he needs to authenticate, and is then redirected back to vCloud Director.

VCD_SSO_Federation

Here follows a collection of topics that might be useful when troubleshooting the federation:

1. When SSO federation is enabled, the end user is always redirected to the vSphere Web Client. If you want to use local authentication, go to http://vCloud-FQDN/cloud/login.jsp and type a local or LDAP account (if LDAP is configured).

2. If you enabled SSO federation and the vCenter SSO is no longer available, you cannot unregister its lookup service. To do this, go to the vCloud database and clear the lookupservice.url value in the dbo.config table.
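A hedged sketch of that database edit, assuming the usual name/value layout of the config table (back up the vCloud database and stop the cells before editing it):

```
-- clear the registered lookup service URL (column names assumed)
UPDATE config SET value = '' WHERE name = 'lookupservice.url';
```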

3. In case you are using a self-signed untrusted certificate on the vSphere Web Client, some browsers (Firefox) might not display the Add Security Exception button when being redirected. Therefore first open the vSphere Web Client page directly, create the security exception, and then the redirection from the vCloud website should work.

4. HTTP ERROR 500. Problem accessing /cloud/saml/HoKSSO/alias/vcd. Reason: Error determining metadata contracts
….
Metadata for issuer https://xxx:7444/STS wasn’t found

This error might appear after the vSphere Web Client SSO authentication, when the user is redirected back to the vCloud portal. To understand this error, let's first talk about what is going on in the background. When SSO federation is enabled, vCloud Director establishes trust with vCenter SSO. The trust is needed so that the identity provider (SSO) knows that the request for authentication is not malicious (phishing), and also so that the service provider (vCloud Director) can be sure the reply comes from the right identity provider.

The trust is established when the federation is configured, with a metadata exchange that contains keys and some information about the other party. The SSO metadata can be seen in the vCloud Director database in the dbo.saml_id_provider_settings table. During the actual authentication process, if for some reason the security token reply comes from a different source than the one expected based on the identity provider metadata, you will get this error.

This issue might happen for example if the vCenter SSO hostname has been changed. In the particular case I encountered, it happened on vCenter Server Appliance 5.1. The SSO had been initiated before the hostname was set, so the identity provider response came with metadata containing the issuer's IP address instead of the FQDN which the service provider (VCD) expected based on the SSO endpoint address. The issuer information did not get updated after the hostname change.

In VCSA 5.1 the issuer information is stored in the SSO PostgreSQL DB. These are the steps to change it:

  1. Get the SSO DB instance name, user and password

cat /usr/lib/vmware-sso/webapps/lookupservice/WEB-INF/classes/config.properties

db.type=postgres
db.url=jdbc:postgresql://localhost:5432/ssodb
db.user=ssod
db.pass=rTjsz9PRPLOFswBkJTCNAAVp

  2. Connect to the DB with the password retrieved in the previous step (db.pass=…)

/opt/vmware/vpostgres/1.0/bin/psql ssodb -U ssod

Retrieve the STS issuer:

ssodb=> select issuer from ims_sts_config;

If the issuer is really incorrect update it with following command:

ssodb=> update ims_sts_config SET issuer='https://FQDN:7444/STS';

Note: I need to credit William Lam for helping me where to find the SSO DB password.