Control System Admin Access to VMware Cloud Director

When VMware Cloud Director is deployed in a publicly accessible environment, it is good practice to restrict system admin access to specific networks so that no brute-force attack can be launched against the publicly available UI/API endpoints.

There is actually a relatively easy way to achieve this with any web application firewall (WAF) that supports URI access filtering. The strategy is to protect only the provider authentication endpoints, which is much easier than trying to distinguish between provider and tenant URIs.

As the access (attack) can come through either the UI or the API, the solution should address both. Let us first talk about the UI. Tenants and providers use specific URLs to access their org/system context, but we do not really need to care about this at all. The UI actually uses the (public) APIs, so nothing is needed to harden the UI specifically if we harden the API endpoints. Well, the OAuth and SAML logins are the exception, so let me tackle them separately.

So how can you authenticate to VCD via API?

Integrated Authentication

The integrated basic authentication consisting of login/password is used for VCD local accounts and LDAP accounts. The system admin (provider context) uses /cloudapi/1.0.0/sessions/provider API endpoint while the tenants use /cloudapi/1.0.0/sessions.

The legacy API endpoint /api/sessions (common for both providers and tenants) has been deprecated since API version 33.0 (introduced in VCD 10.0). Note that deprecated does not mean removed: it is still available even with API version 36.x, so you can expect it to be around for some time, as VCD keeps APIs backward compatible for a few years.

You might notice that the Feature Flags section offers the possibility to enable “Legacy Login Removal”.
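To make the difference between the two integrated authentication endpoints concrete, here is a minimal Python sketch that only builds the login requests (it does not send them). The host name, accounts and API version are placeholders of mine, and the Basic credential format user@org:password (with the system org for the provider) is how I understand VCD expects it, so verify it against your release.

```python
import base64
import urllib.request

API_VERSION = "36.0"  # assumption: use any API version your VCD release supports


def build_session_request(host, user, org, password, provider=False):
    """Build (but do not send) a login request for the cloudapi sessions endpoint."""
    if provider:
        path = "/cloudapi/1.0.0/sessions/provider"
    else:
        path = "/cloudapi/1.0.0/sessions"
    # Assumption: VCD takes Basic credentials in the form user@org:password.
    credentials = f"{user}@{org}:{password}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Accept": f"application/json;version={API_VERSION}",
    }
    return urllib.request.Request(f"https://{host}{path}", headers=headers, method="POST")


# Hypothetical host and accounts, for illustration only.
provider_req = build_session_request("vcd.example.com", "administrator", "system", "secret", provider=True)
tenant_req = build_session_request("vcd.example.com", "alice", "acme", "secret")
print(provider_req.full_url)
print(tenant_req.full_url)
```

The only thing that differs between the two requests is the URI path, which is exactly what makes URI based filtering practical.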

Feature Flags

Enabling this feature will disable legacy login for both tenants and providers, but only if you use the alpha API version (in the case of VCD 10.3.3.1 it is 37.0.0-alpha-1652216327). So this is really only useful for testing your own tooling, where you can force the usage of that particular API version. The UI and any 3rd party tooling will still use the main (supported) API versions, where the legacy endpoint will still work.

However, you can forcefully disable it for provider context for any API version with the following CMT command (run from any cell, no need to restart the services):

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vcloud.api.legacy.nonprovideronly -v true

The providers will then need to use only the new /cloudapi/1.0.0/sessions/provider endpoint. So be careful, as it might break some legacy tools!

API Access Token Authentication

This is a fairly new method of authentication to VCD (introduced in version 10.3.1) that uses a secret token, generated once, for API authentication. It is mainly used by automation or orchestration tools. Generating the session token requires access to the tenant or provider OAuth API endpoints:

/oauth/tenant/<tenant_name>/token

/oauth/provider/token

This makes it easy to block the provider context via a URI filter.
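As an illustration, the following Python sketch builds (without sending) the request that exchanges an API token for a session token on either endpoint. The form fields grant_type=refresh_token and refresh_token reflect my understanding of the documented flow; the host, tenant and token values are made up.

```python
import urllib.parse
import urllib.request


def build_token_request(host, api_token, tenant=None):
    """Build (but do not send) the request that trades an API token for a session token."""
    if tenant is None:
        path = "/oauth/provider/token"
    else:
        path = f"/oauth/tenant/{urllib.parse.quote(tenant, safe='')}/token"
    # Assumption: the API token is passed as an OAuth refresh token in a form body.
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": api_token}
    ).encode("ascii")
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(f"https://{host}{path}", data=body, headers=headers, method="POST")


# Hypothetical values, for illustration only.
provider_token_req = build_token_request("vcd.example.com", "tok")
tenant_token_req = build_token_request("vcd.example.com", "tok", tenant="acme")
print(provider_token_req.full_url)
print(tenant_token_req.full_url)
```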

SAML/OAuth Authentication via UI

Here we must distinguish between the API and UI behavior. For SAML, the UI uses the /login/org/<org-name>/… endpoint. The provider context uses the default SYSTEM org as the org name, so we must filter URIs starting with /login/org/SYSTEM.

For OAuth, the UI uses the same endpoints as API access token authentication (/oauth/tenant vs /oauth/provider); the provider login additionally goes through /login/oauth?service=provider.

For API SAML/OAuth logins, the /cloudapi/1.0.0/sessions vs /cloudapi/1.0.0/sessions/provider endpoints are used.

WAF Filtering Example

Here is an example of how to set up URI filtering with VMware NSX Advanced Load Balancer.

  1. We obviously need to set up the VCD cell (SSL) pool and a Virtual Service for the external IP and port 443 (SSL).
  2. The virtual service application profile must be set to System-Secure-HTTP as we need to terminate SSL sessions on the load balancer in order to inspect the URI. That means the public SSL certificate must be uploaded to the load balancer as well. The cells can actually use self-signed certs, especially if you use the new console proxy that does not require SSL pass-through and works on port 443.
  3. In the virtual service go to Policies > HTTP Request and create the following rule:
    Rule Name: Provider Access
    Client IP Address: Is Not: <admin subnets>
    Path: Criteria – Begins with:
    /cloudapi/1.0.0/sessions/provider
    /oauth/provider
    /login/oauth?service=provider
    /login/org/SYSTEM
    Content Switch: Local response – Status Code: 403.
WAF Access Rule
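The logic of the rule above can be sketched in a few lines of Python. This is only a model of the decision, not Avi configuration: the admin subnets are hypothetical, and I assume the match is done on the request target including the query string (otherwise the /login/oauth?service=provider entry could not match).

```python
import ipaddress

# Hypothetical admin subnets; replace with your own.
ADMIN_SUBNETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]

# The four provider login prefixes from the rule.
PROVIDER_PREFIXES = (
    "/cloudapi/1.0.0/sessions/provider",
    "/oauth/provider",
    "/login/oauth?service=provider",
    "/login/org/SYSTEM",
)


def response_status(client_ip, request_target):
    """Return 403 for provider login URIs from non-admin clients, else None (forward to pool)."""
    ip = ipaddress.ip_address(client_ip)
    is_admin = any(ip in net for net in ADMIN_SUBNETS)
    is_provider_login = request_target.startswith(PROVIDER_PREFIXES)
    return 403 if is_provider_login and not is_admin else None


print(response_status("203.0.113.7", "/cloudapi/1.0.0/sessions/provider"))
print(response_status("10.1.2.3", "/cloudapi/1.0.0/sessions/provider"))
print(response_status("203.0.113.7", "/cloudapi/1.0.0/sessions"))
```

Tenant logins and all other URIs pass through untouched; only the provider authentication prefixes are rejected for clients outside the admin subnets.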

And this is what you can observe when trying to log in via integrated authentication from non-authorized subnets:

And here is an example of SAML login:

Console Proxy Traffic Enhancements

VMware Cloud Director provides direct access to tenants’ VM consoles by proxying the vSphere console traffic from the ESXi hosts running the workload, through the VCD cells and the load balancer, to the end-user browser or console client. This is a fairly complex process that requires a dedicated TCP port (by default 8443), a certificate and a load balancer configuration without SSL termination (SSL pass-through).

The dedicated certificate requirement is especially annoying, as any change to this certificate cannot be done at the load balancer level but must be performed on every cell in the VCD server group, and the cells need to be restarted.

However, VMware Cloud Director 10.3.3 for the first time showcases a newly improved console proxy. It is still an experimental feature and therefore not enabled by default, but it can be enabled in the Feature Flags section of the provider Administration.

By enabling it, you switch to the enhanced console proxy implementation that gives you the following benefits:
  • Console proxy traffic is now going over the default HTTPS 443 port together with UI/API. That means no need for dedicated port/IP/certificate.
  • This traffic can be SSL terminated at the load balancer. This means no need for specific load balancing configuration that needed the SSL pass through of port 8443.
  • The Public Addresses Console Proxy section is irrelevant and not used.

The following diagram shows the high level implementation (credit and shout-out goes to Francois Misiak – the brain behind the new functionality).

As this feature has not yet been tested at scale it is marked as experimental but it is expected that this will be the default console proxy mechanism starting in the next major VMware Cloud Director release. Note that you will still be able to revert to the legacy one if needed.

Load Balancing with Avi in VMware Cloud Director

VMware Cloud Director 10.2 is adding network load balancing (LB) functionality in NSX-T backed Organization VDCs. It is not using the native NSX-T load balancer capabilities but instead relies on Avi Networks technology that was acquired by VMware about a year ago and since then rebranded to VMware NSX Advanced Load Balancer. I will call it Avi for short in this article.

The way Avi works is quite different from the way load balancing worked in NSX-V or NSX-T. Understanding the differences and Avi architecture is essential to properly use it in multitenant VCD environments.

I will focus only on the comparison with NSX-V LB as this is relevant for VCD (NSX-T legacy LB was never viable option for VCD environments).

In VCD in an NSX-V backed Org VDC the LB is running on Org VDC Edge Gateway (VM) that can have four different sizes (compact, large, quad large and extra large) and be in standalone or active / standby configuration. That Edge VM also needs to perform routing, NATing, firewalling, VPN, DHCP and DNS relay. Load balancer on a stick is not an option with NSX-V in VCD. The LB VIP must be an IP assigned to one of external or internally attached network interfaces of the Org VDC Edge GW.

Enabling load balancing on an Org VDC Edge GW in such case is easy as the resource is already there. 

In the case of Avi LB the load balancing is performed by external (dedicated to load balancing) components which adds more flexibility, scalability and features but also means more complexity. Let’s dive into it.

You can look at Avi as another separate platform solution similar to vSphere or NSX – where vSphere is responsible for compute and storage, NSX for routing, switching and security, Avi is now responsible for load balancing.

A picture is worth a thousand words, so let me put this diagram here first and then dig deeper.

 

Control Path

You start by deploying an Avi Controller cluster (highly available, 3 nodes) which integrates with vSphere (used for compute/storage) and NSX-T (for routing LB data and control plane traffic). The controllers would sit somewhere in your management infrastructure next to all other management solutions.

The integration is done by setting up a so-called NSX-T Cloud in Avi, where you define the vCenter Server (only one is supported per NSX-T Cloud) and NSX-T Manager endpoints and the NSX-T overlay transport zone (with a 1:1 relationship between the TZ and the NSX-T Cloud definition). Those would be your tenant/workload VC/NSX-T.

You must also point to a pre-created management network segment that will be used to connect all load balancing engines (more on them later) so they can communicate with the controllers for management and control traffic. To do so, in NSX-T you would set up a dedicated Tier-1 (Avi Management) GW with the Avi Management segment connected and DHCP enabled. The expectation is that the Tier-1 GW is able to reach the Avi Controllers through the Tier-0.

Data Path

Avi Service Engines (SE) are VM resources that perform the load balancing. They are similar to NSX-T Edge Nodes in the sense that the load balancing virtual services can be placed on any SE node based on capacity or reservations (just as a Tier-1 GW can be placed on any Edge Node). Per se there is no strict relationship between a tenant’s LB and an SE node. An SE node can be shared across Org VDC Edge GWs or even tenants. An SE node is a VM with up to 10 network interfaces (NICs). One NIC is always needed for the management/control traffic (blue network). The rest (9) are used to connect to the Org VDC Edge GW (Tier-1 GW) via a Service Network logical segment (yellow and orange). The service networks are created by VCD when you enable the load balancing service on the Org VDC Edge GW, together with a DHCP service to provide IP addresses for the attached SEs. The service network will by default get the 192.168.255.0/25 subnet, but the system admin can change it if it clashes with existing Org VDC networks. Service Engines run each service interface in a different VRF context so there is no worry about IP conflicts or even cross tenant communication.

When a load balancing pool and virtual service is configured by the tenant Avi will automatically pick a Service Engine to instantiate the LB service. It might even need to first deploy (automatically) an SE node if there is no existing capacity. When SE is assigned Avi will configure static route (/32) on the Org VDC Edge GW pointing the virtual service VIP (virtual IP) to the service engine IP address (from the tenant’s LB service network).

Note: Contrary to the NSX-V LB, the VIP can be almost any arbitrary IP address. It can be a routable external IP address allocated to the Org VDC Edge GW or any non-externally routed address, but it cannot clash with any existing Org VDC networks or with the LB service network. If you use an external IP address allocated to the Org VDC Edge GW, you cannot use that address for anything else (e.g. SNAT or DNAT). That’s the way NSX-T works (no NAT and static routing at the same time). So for example if you want to use public address 1.2.3.4 for LB on port 80 but at the same time use it for SNAT, use an internal IP for the LB (e.g. 172.31.255.100) and create a DNAT port forwarding rule to it (1.2.3.4:80 to 172.31.255.100:80).
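The clash rule for an internal VIP is easy to check up front. Below is a small Python sketch of such a check; the function name and all networks in the example are hypothetical and you would pass in your own Org VDC networks and LB service network.

```python
import ipaddress


def vip_clashes(vip, org_vdc_networks, service_network):
    """Return the CIDRs (Org VDC networks or LB service network) that contain the proposed VIP."""
    address = ipaddress.ip_address(vip)
    candidates = list(org_vdc_networks) + [service_network]
    return [cidr for cidr in candidates if address in ipaddress.ip_network(cidr)]


# Hypothetical networks, for illustration only.
org_networks = ["172.16.10.0/24", "172.16.20.0/24"]
lb_service_network = "172.31.254.0/25"

print(vip_clashes("172.31.255.100", org_networks, lb_service_network))
print(vip_clashes("172.16.10.50", org_networks, lb_service_network))
```

An empty list means the address is a usable VIP candidate from the addressing point of view; it says nothing about whether an external IP is already used for NAT.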

Service Engine Groups

With the basics out of the way, let’s discuss how the service provider can manage the load balancing quality of service: performance, capacity and availability. This is done via Service Engine Groups (SEG).

SEGs are (today) configured directly in Avi Controller and imported into VCD. They specify SE node sizing (CPU, RAM, storage), bandwidth restrictions, virtual services maximums per node and availability mode.

The availability mode needs more explanation. Avi supports four availability modes:
A/S … legacy (only two nodes are deployed), service is active only on one node at a time and stand by on the other, no scale out support (service across nodes), very fast failover

A/A … elastic, service is active on at least two SEs, session info is proactively replicated, very fast failover

N+M … elastic, N is number of SE nodes service is scaled over, M is a buffer in number of failures the group can sustain, slow failover (due to controller need to re-assign services), but efficient SE utilization

N+0 … same as N+M but no buffer, the controller will deploy new SE nodes when failure occurs. The most efficient use of resources but the slowest failover time.

The base Avi licensing supports only legacy A/S high availability mode. For best availability and performance usage of elastic A/A is recommended.

As mentioned Service Engine Groups are imported into VCD where the system administrator makes a decision whether SEG is going to be dedicated (SE nodes from that group will be attached to only one Org VDC Edge GW) or shared.

Then when load balancing is enabled on a particular Org VDC Edge GW, the service provider assigns one or more SEGs to it together with capacity reservation and maximum in terms of virtual services for the particular Org VDC Edge GW.

Use case examples:

  • A/S dedicated SEG for each tenant / Org VDC Edge GW. Avi will create two SE nodes for each LB enabled Org VDC Edge GW and will provide similar service as LB on NSX-V backed Org VDC Edge GW did. Does not require additional licensing but SEG must be pre-created for each tenant / Org VDC Edge GW.
  • A/A elastic shared across all tenants. Avi will create a pool of SE nodes that are going to be shared. Only one SEG is created. Capacity allocation is managed in VCD, Avi elastically deploys and undeploys SE nodes based on actual usage (the usage is measured in number of virtual services, not actual throughput or requests per second).

Service Engine Node Placement

The service engine nodes are deployed by Avi into the (single) vCenter Server associated with the NSX-T Cloud and they live outside of VMware Cloud Director management. The placement is defined in the service engine group definition (you must use Avi 20.1.2 or newer). You can select a vCenter Server folder and limit the scope of deployment to a list of ESXi hosts and datastores. Avi has no understanding of vSphere host clusters, datastore clusters or resource pools. Avi will also not configure any DRS anti-affinity for the deployed nodes (but you can do so post-deployment).

Conclusion

The whole Avi deployment process for the system admin is described in detail here. The guide in the link refers to general Avi deployment of NSX-T Cloud, however for VCD deployment you would just stop before the step Creating Virtual Service as that would be done from VCD by the tenant.

Avi licensing is basic or enterprise and is set at Avi Controller cluster level. So it is possible to mix both licenses for two tier LB service by deploying two Avi Controller cluster instances and associating each with a different NSX-T transport zone (two vSphere clusters or Provider VDCs).

The feature differences between basic and enterprise editions are quite extensive and complex. Besides Service Engine high availability modes, the other important differences are access to metrics and the number of application types, health monitors and pool selection algorithms.

The Avi usage metering for licensing purposes is currently done via a Python script that is run at the Avi Controller to measure the Service Engine total high-water mark vCPU usage during a given period, which must be reported manually. The basic license is included for free with VCPP NSX usage and is capped to 1 vCPU per 640 GB reported vRAM of NSX base usage.
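The cap is simple arithmetic, sketched below in Python. How fractions are rounded is not something I can state, so the helper just returns the raw ratio; the function name and the sample figures are mine.

```python
def basic_license_vcpu_cap(reported_vram_gb, gb_per_vcpu=640):
    """Free basic-license SE vCPU allowance: 1 vCPU per 640 GB of reported vRAM (unrounded)."""
    return reported_vram_gb / gb_per_vcpu


# Hypothetical usage report: 6400 GB of vRAM reported for NSX base usage.
print(basic_license_vcpu_cap(6400))
```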

Update 2020/10/23: Make sure to check interoperability matrix. As of today only Avi 20.1.1 is supported with VCD 10.2.

vCloud Director H5 UI Error: 431 Request Header Fields Too Large

This is just a short blog post to describe an issue you might get with the tenant or portal HTML UI in vCloud Director where you will see errors in the browser related to request header fields too large.

You are more likely to see it with the Chrome browser and if your cloud domain is shared with other services. The root cause is that the browser API calls stop working once the request header gets larger than 8 KB. While 8 KB seems big enough, especially as the request headers vCloud Director uses contain only the session ID, JWT token and possibly load balancer headers, the request also includes all the browser cookies applicable to the vCloud Director domain that were stored by other web services.

The temporary fix is for the end-user to delete her browser cookies. But is there something the provider could do?

In our case we saw the situation where the vCloud Director instance was on the *.vmware.com domain and the browser contained lots of large OAM cookies related to the VMware Single Sign-On solution. While those cookies are essential for multiple VMware internal applications, there is no reason for vCloud Director to receive them in every API request. One way to block the cookies and thus decrease the request header size is to remove them at the load balancer. With the NSX-V load balancer this can be accomplished by utilizing SSL L7 termination and an application rule (see my older blog post on how to configure the NSX-V Edge load balancer).

In my case the application rule I use is:

Update 2019/10/24: The initial rule would remove all Cookies. I have now amended it with another rule that removes all but vcloud_session_id and vcloud_jwt cookies if they are present.

reqirep ^Cookie:\s.*(vcloud_session_id=[^;]*)|(vcloud_jwt=[^;]*) Cookie:\ \1;\ \2
reqidel ^Cookie:.*OAM*

The reqidel rule deletes all cookies starting with the OAM string from the request header.

Update 2021/11/23:

This single rule with a better formed regex seems to work the best:

reqirep ^Cookie:.*?((?:vcloud_session_id|vcloud_jwt)=[^;]*)(?:;.*((?:vcloud_session_id|vcloud_jwt)=[^;]*))? Cookie:\ \1;\ \2
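To show the intended effect of the rule, here is a Python sketch that does the same filtering on a Cookie header value. Note that it is not a translation of the regex above: it simply splits the header and keeps an allowlist of cookie names, and the sample cookie values are made up.

```python
KEEP = ("vcloud_session_id", "vcloud_jwt")


def filter_cookie_header(value, keep=KEEP):
    """Keep only the allowlisted cookies from a Cookie header value."""
    kept = []
    for part in value.split(";"):
        part = part.strip()
        name = part.split("=", 1)[0]
        if name in keep:
            kept.append(part)
    return "; ".join(kept)


# Made-up header with a large unrelated cookie in front of the VCD ones.
original = "OAM_ID=" + "x" * 4000 + "; vcloud_session_id=123; other=1; vcloud_jwt=tok"
filtered = filter_cookie_header(original)
print(len(original), len(filtered))
print(filtered)
```

The size difference printed on the first line is the point of the exercise: the unrelated cookies are what push the request header over the limit.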

Load Balancing vCloud Director with NSX-T

I have just had a chance, for the first time, to set up a vCloud Director installation that was fronted by an NSX-T based load balancer (version 2.4.1). In the past I have blogged about how to load balance vCloud Director cells with NSX-V:

Load Balancing vCloud Director Cells with NSX Edge Gateway

vCloud OpenAPI – Large Payload Issue with Load Balancer

NSX-T differs quite a lot from NSX-V therefore the need for this article. The load balancer instance is deployed into the NSX-T Edge Cluster which is a set of virtual or physical NSX-T Edge Nodes. There are also strict sizing guidelines related to the size and number of LB and size of Edge Nodes – see the official docs.

Certificates

Import your VCD public cert in the NSX Manager UI: System > Certificates > Import Certificate. You will need to provide a name, the full certificate chain and the private key, and set it as a Service Certificate. If it is signed by an Enterprise CA, you will first need to import the CA cert.

Monitor

Create new monitor in Networking > Load Balancing > Monitors > Add Active Monitor HTTPs

  • protocol HTTPs
  • monitoring port 443
  • default timers
  • HTTP Request Configuration: GET /cloud/server_status, HTTP Request Version: 1
  • HTTP Response Configuration: HTTP response body: Service is up.
  • SSL Configuration: Enabled, Client Certificate: None
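What the monitor does can be mimicked with a short Python sketch, which is handy for testing a cell by hand. I am assuming here that the cells may use self-signed certificates, hence the disabled certificate verification; the function names are mine and the monitor itself is of course configured in NSX-T, not in Python.

```python
import ssl
import urllib.request


def is_cell_healthy(body):
    """Mirror the monitor's check: the response body must contain 'Service is up.'"""
    return "Service is up." in body


def check_cell(host, timeout=5.0):
    """GET /cloud/server_status from one cell and evaluate the response body."""
    # Assumption: cells may use self-signed certificates, so verification is disabled.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = f"https://{host}/cloud/server_status"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return is_cell_healthy(resp.read().decode("utf-8", "replace"))
    except OSError:
        # Connection errors, timeouts and HTTP errors all count as unhealthy.
        return False


print(is_cell_healthy("Service is up."))
```

Calling check_cell with the address of one of your cells performs the actual request.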

Profiles

Application Profile

Networking > Load Balancing > Profiles > Select Profile Type: Application > Add Application Profile > HTTP

Here in the UI we can set only Request Header Size and Request Body Size. Set both to the maximum: 65535 for header size and at least 52428800 for body size, as ISO/OVA uploads use 50 MB chunks. We will later use the API to also configure Response Header Size.

Persistence and SSL Profiles

I will reuse existing default-source-ip-lb-persistence-profile and default-balanced-client-ssl-profile.

Server Pools

Networking > Load Balancing > Server Pools > Add Server Pool

  • Algorithm: Least Connection
  • Active Monitor: picked the one created before
  • Select members: Enter individual members (do not enter port as we will reuse the pool for multiple ports)

 

Virtual Servers

We will add two virtual servers. One for UI/API and another for VM Remote Console connections. For both I have picked the same IP address from the cell logical segment. Ports will be different (443 vs 8443).

vCloud UI

  • Add virtual server: L7 HTTP
  • Ports: 443
  • Ignore Load Balancer placement for now
  • Server Pool: the one we created before
  • Application Profile: the one we created before
  • Persistence: default-source-ip-lb-persistence-profile
  • SSL Configuration: Client SSL: Enabled, Default Certificate: the one we imported before, Client SSL Profile: default-balanced-client-ssl-profile
    Server SSL: Enabled, Client Certificate: None, Server SSL Profile: default-balanced-client-ssl-profile

vCloud Console

  • Add virtual server: L4 TCP
  • Ports: 8443
  • Ignore Load Balancer placement for now
  • Server Pool: the one we created before
  • Application Profile: default-tcp-lb-app-profile
  • Persistence: disabled

Load Balancer

Now we can create load balancer instance and associate the virtual servers with it. Create the LB instance on the Tier 1 Gateway which routes to your VCD cell network. Make sure the Tier 1 Gateway runs on an Edge node with the proper size (see the doc link before).

Networking > Load Balancing > Load Balancers > Add Load Balancer

  • Size: small
  • Tier 1 Gateway
  • Add Virtual Servers: add the 2 virtual servers created in the previous step

Now that we have the load balancer up and running, you should get all green in the status column. We are not done yet though.

Firstly we need to increase the response header size as vCloud Director Open API sends huge headers with links. Without this, you would get H5 UI errors (Nginx 502 Bad Gateway) and some API calls would fail.  This can be currently done only with NSX Policy API. Fire up Postman or Curl and do GET and then PUT on the following URI:

NSX-manager/policy/api/v1/infra/lb-app-profiles/<profile-name>

In the payload, change the response_header_size to at least 50000 bytes.

And finally we will need to set up NAT so our load balanced virtual servers are available both from the outside world (on the Tier-0 Gateway) as well as from the internal networks. This is quite network topology specific, but do not forget that the cells themselves must be able to connect to the public (load balanced) URL configured in vCloud Director public addresses.
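If you prefer a script over Postman, the GET and PUT round trip can be sketched in Python as below. The sketch only prepares the payload and the PUT request; the manager address, profile name, credentials and the sample profile body are placeholders I made up, and the whole object returned by GET is sent back so that fields such as the revision stay intact.

```python
import json
import urllib.request


def with_response_header_size(profile, size=50000):
    """Return a copy of the profile (as returned by GET) with response_header_size changed."""
    updated = dict(profile)
    updated["response_header_size"] = size
    return updated


def build_put_request(nsx_manager, profile_name, profile, auth_header):
    """Build (but do not send) the PUT request for the application profile."""
    url = f"https://{nsx_manager}/policy/api/v1/infra/lb-app-profiles/{profile_name}"
    return urllib.request.Request(
        url,
        data=json.dumps(profile).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="PUT",
    )


# Illustrative stand-in for the body returned by GET; real profiles carry more fields.
current = {"display_name": "vcd-http", "response_header_size": 4096, "_revision": 3}
updated = with_response_header_size(current)
put_req = build_put_request("nsx.example.com", "vcd-http", updated, "Basic <credentials>")
print(put_req.get_method(), put_req.full_url)
```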