VMware Cloud Director on VMware Cloud Foundation

There has been increasing interest lately among service providers in using VMware Cloud Foundation (VCF) as the underlying virtualization platform in their datacenters. VCF is maturing and offers automated lifecycle capabilities that service providers appreciate when operating infrastructure at scale.

I want to focus on how you would design and deploy VMware Cloud Director (VCD) on top of VCF, with a specific example. While there are whitepapers written on this topic, they do not go into the nitty-gritty details. This should not be considered a prescriptive architecture – just one way to skin a cat that should inspire your own design.

VCF 4.0 consists of a management domain – a smaller infrastructure with usually one vSphere 7 cluster, NSX-T 3 and vRealize components (vRealize Suite Lifecycle Manager, vRealize Operations Manager, vRealize Log Insight). It is also used for the deployment of management components for workload domains, which are separate vSphere 7 + NSX-T 3 environments.

VCF has a prescribed architecture, based on VMware Validated Designs (VVD), for how all the management components are deployed. Some are on VLAN-backed networks, but some are on overlay logical segments created in NSX-T (VVD calls them application virtual networks – AVN) and routed via NSX-T Edge Gateways. The following picture shows the typical logical architecture of the management cluster which we will start with:

Reg-MGMT and X-Reg-MGMT are overlay segments, the rest are VLAN networks.
VC Mgmt … Management vCenter Server
VC Res … Workload domain (resource) vCenter Server
NSX Mgmt … Management NSX-T Managers (3x)
Res Mgmt … Workload domain (resource) NSX-T Managers (3x)
SDDC Mgr … SDDC Manager
Edge Nodes … NSX-T Edge Nodes VMs (2x) that provide resources for Tier-0, Tier-1 gateways and Load Balancer
vRLCM … vRealize Suite Lifecycle Manager
vROps … vRealize Operations Manager (two or more nodes)
vROps RC … vRealize Operations Manager Remote Collectors (optional)
vRLI … vRealize Log Insight (two or more nodes)
WS1A … Workspace ONE Access (former VIDM, one or more nodes)

Now we are going to add the VMware Cloud Director solution. I will focus on the following components:

  • VCD cells
  • RabbitMQ (needed for extensibility such as vROps Tenant App or Container Service Extension)
  • vRealize Operations Tenant App (provides multitenant vROps view in VCD and Chargeback functionality)
  • Usage Meter

I have followed these design principles:

  • The VCD solution will utilize overlay (AVN) networks
  • Leverage the existing VCF infrastructure where it makes sense
  • Consider future scalability
  • Separate internet traffic from the management traffic

And here is the proposed design:

A new overlay segment (AVN) called VCD DMZ has been added to separate the internet traffic. It is routed via a separate Tier-1 gateway but connected to the existing Tier-0. The VCD cells (three or more) have their primary (eth0) interface on this network, fronted by an NSX-T load balancer (running on its own Tier-1, similar to the vROps one). The vRealize Operations Tenant App VM sits on this network as well.

The existing Reg-MGMT segment is used for the secondary interface of the VCD cells, for the Usage Meter VM and for the vSAN File Services NFS share that the VCD cells require.

And finally, the cross-region X-Reg-MGMT segment is utilized for the RabbitMQ nodes (two or more) in order to leverage the existing vROps load balancer and avoid deploying an additional one just for RabbitMQ.
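To illustrate the networking part of the design, the dedicated Tier-1 gateway and the VCD DMZ overlay segment could be created through the NSX-T Policy API. The following is a minimal sketch in Python using the requests module – the NSX Manager address, credentials, Tier-0/Tier-1 names, transport zone path and subnet are hypothetical placeholders, not values taken from the design above:

#!/usr/bin/python3
# Hedged sketch: create a dedicated Tier-1 gateway and the VCD DMZ overlay segment
# via the NSX-T Policy API. All names, paths and addresses are placeholders.
# Edge cluster assignment and the load balancer configuration are intentionally omitted.
import requests

NSX_MANAGER = 'nsx-mgmt.example.com'            # hypothetical NSX Manager FQDN
AUTH = ('admin', 'VMware1!VMware1!')            # use proper credential handling in practice
TZ_PATH = '/infra/sites/default/enforcement-points/default/transport-zones/<overlay-tz-uuid>'

session = requests.Session()
session.auth = AUTH
session.verify = False                          # lab only; use a trusted certificate in production

# Tier-1 gateway dedicated to the DMZ, linked to the existing Tier-0
tier1 = {
    'display_name': 'vcd-dmz-t1',
    'tier0_path': '/infra/tier-0s/vcf-t0',      # placeholder for the existing Tier-0
    'route_advertisement_types': ['TIER1_CONNECTED', 'TIER1_LB_VIP']
}
session.patch(f'https://{NSX_MANAGER}/policy/api/v1/infra/tier-1s/vcd-dmz-t1', json=tier1)

# Overlay segment for the VCD cells' primary (eth0) interfaces
segment = {
    'display_name': 'VCD-DMZ',
    'connectivity_path': '/infra/tier-1s/vcd-dmz-t1',
    'transport_zone_path': TZ_PATH,
    'subnets': [{'gateway_address': '192.168.100.1/24'}]
}
session.patch(f'https://{NSX_MANAGER}/policy/api/v1/infra/segments/vcd-dmz', json=segment)

In practice you would more likely click this together in the NSX-T UI or drive it from your provisioning tooling, but the API calls show which objects are involved.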

Additional notes:

  • VCF deploys two NSX-T Edge nodes in a 2-node NSX-T Edge cluster. This cluster currently cannot easily be scaled out. Therefore I would recommend deploying additional Edge nodes in a separate NSX-T Edge cluster (created directly in NSX-T) for the DMZ Tier-1 gateway and the VCD load balancer. This guarantees compute and networking resources, especially for the load balancer that will perform SSL termination (this might not apply if you choose to use a different load balancer, e.g. Avi). It also adds the possibility to deploy a separate Tier-0 for more N/S bandwidth.
  • vSAN FS NFS deployment is described here. Do not forget to enable MAC learning on the Reg-MGMT NSX-T logical segment (via segment profile) – see the sketch after this list.
  • Both Tier-1 gateways can provide north-south firewalling for additional security.
  • As all the incoming internet traffic to VCD goes over the VCD load balancer, which provides source NAT, I have opted to put the default route of the VCD cells on the management interface and thus avoid any static routes that would otherwise be necessary to separate tenant and management traffic.
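As mentioned in the vSAN File Services note above, MAC learning must be enabled on the Reg-MGMT segment. In NSX-T this is done with a MAC discovery profile bound to the segment. A minimal sketch with the same caveats as before (hypothetical NSX Manager address, credentials and segment ID):

#!/usr/bin/python3
# Hedged sketch: enable MAC learning on an NSX-T segment by creating a MAC discovery
# profile and binding it to the segment via the Policy API. Names are placeholders.
import requests

NSX_MANAGER = 'nsx-mgmt.example.com'
AUTH = ('admin', 'VMware1!VMware1!')

session = requests.Session()
session.auth = AUTH
session.verify = False   # lab only

# MAC discovery profile with MAC learning turned on
profile = {
    'display_name': 'mac-learning-enabled',
    'mac_learning_enabled': True
}
session.patch(f'https://{NSX_MANAGER}/policy/api/v1/infra/mac-discovery-profiles/mac-learning-enabled',
              json=profile)

# Bind the profile to the Reg-MGMT segment (segment ID is a placeholder)
binding = {
    'mac_discovery_profile_path': '/infra/mac-discovery-profiles/mac-learning-enabled'
}
session.patch(f'https://{NSX_MANAGER}/policy/api/v1/infra/segments/reg-mgmt'
              '/segment-discovery-profile-binding-maps/mac-learning-binding',
              json=binding)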

Let me know in the comments if you plan VCD on VCF and if you are facing any challenges.

Enable MAC Learning as Default on vSphere Distributed Switch

This short PowerCLI script changes the vSphere Distributed Switch default port group configuration to enable the MAC learning policy. This means every port group on such a switch inherits this configuration and will have MAC learning enabled unless it is specifically disabled.

For more information on why you would need that, read William Lam's blog.

# Select the target distributed switch (name is environment specific)
$vds = Get-VDSwitch 'DSwitch1'

# Build a reconfiguration spec against the switch-wide default port configuration
$spec = New-Object VMware.Vim.VMwareDVSConfigSpec
$spec.ConfigVersion = $vds.ExtensionData.Config.ConfigVersion
$spec.DefaultPortConfig = New-Object VMware.Vim.VMwareDVSPortSetting
$spec.DefaultPortConfig.MacManagementPolicy = New-Object VMware.Vim.DVSMacManagementPolicy
$spec.DefaultPortConfig.MacManagementPolicy.MacLearningPolicy = New-Object VMware.Vim.DVSMacLearningPolicy

# Enable MAC learning with unicast flooding, up to 4000 learned MACs per port, dropping new MACs above the limit
$spec.DefaultPortConfig.MacManagementPolicy.MacLearningPolicy.Enabled = $True
$spec.DefaultPortConfig.MacManagementPolicy.MacLearningPolicy.AllowUnicastFlooding = $True
$spec.DefaultPortConfig.MacManagementPolicy.MacLearningPolicy.Limit = 4000
$spec.DefaultPortConfig.MacManagementPolicy.MacLearningPolicy.LimitPolicy = "DROP"

# Apply the reconfiguration to the switch
$vds.ExtensionData.ReconfigureDvs_Task($spec)

 

Update 08/07/2020

In case you are using this approach for a nested vSphere lab instead of the old promiscuous mode, make sure the vmk0 vmkernel port has a different MAC address than the vmnic of the nested ESXi host. This is because when vmk0 is migrated to a different ESXi host uplink, the vDS will not learn its MAC address on the new switch port, as it conflicts with the MAC already assigned on the first uplink port (the same MAC cannot be learnt on two ports).

The vmkernel port MAC can be easily changed by editing the /etc/vmware/esx.conf file.

vCloud Director 9.7 JMS Certificate Issue

Are you still running vCloud Director 9.7 (VCD) in a multi-cell configuration? Then you are susceptible to a Java Message Service (JMS) certificate expiration issue. Read on.

Background

In a multi-cell setup the VCD cells need to communicate among themselves. They use a shared database, but for much faster and more efficient communication they also use an internal ActiveMQ message bus. It is used for activity sharing and vCenter Server event notifications. If the message bus is dysfunctional, operations slow almost to a halt. For this particular certificate issue you will see a message similar to the following in the logs:

Could not accept connection from tcp://<primary-cell-IP:port> : javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown

In vCloud Director 9.7 the bus communication became encrypted in preparation for other use cases (read here). On upgrade or new deployment of each cell, a new certificate was issued by the internal VCD_CA with a 365-day validity. In vCloud Director 10.0 and VMware Cloud Director 10.1 the certificate is regenerated upon upgrade and its validity is extended to 3 years.

To find out the certificates' expiry dates, run the following command from any cell:


/opt/vmware/vcloud-director/bin/cell-management-tool jms-certificates -status

It will print out the JMS certificate details for every cell:

Cell with UUID fd0d2ca0-e357-4aae-9a3b-1c1c5d143d05 and IP 192.168.3.12 has jms certificate: [
[
Version: V3
Subject: CN=vcd-node2.vmware.local
Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11

Key: Sun RSA public key, 2048 bits
modulus: 25783371233977292378120920630797680107189459997753528550984552091043966882929535251723319624716621953887829800012972122297123129787471789561707788386606748136996502359020718390547612728910865287660771295203595369999641232667311679842803649012857218342116800466886995068595408721695568494955018577009303299305701023577567636203030094089380853265887935886100721105614590849657516345704143859471289058449674598267919118170969835480316562172830266483561584406559147911791471716017535443255297063175552471807941727578887748405832530327303427058308095740913857590061680873666005329704917078933424596015255346720758701070463
public exponent: 65537
Validity: [From: Wed Jun 12 15:38:11 UTC 2019,
To: Thu Jun 11 15:38:11 UTC 2020]

 

Yes, this particular cell's certificate will expire on June 11, 2020 – in less than two months!

The Fix

Set a calendar reminder and, when the certificate expiration day is approaching, run the following command:

/opt/vmware/vcloud-director/bin/cell-management-tool jms-certificates --certgen

Or upgrade to vCloud Director 10.0 or newer.
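If you would rather automate the reminder, the expiry dates can be extracted from the jms-certificates status output shown earlier. A minimal sketch that assumes the date format from that output and an arbitrary 30-day warning threshold:

#!/usr/bin/python3
# Hedged sketch: parse the 'cell-management-tool jms-certificates -status' output
# and warn when a JMS certificate is close to expiry. The date format is taken from
# the sample output above; the 30-day threshold is an arbitrary choice.
import re
import subprocess
from datetime import datetime, timedelta

CMT = '/opt/vmware/vcloud-director/bin/cell-management-tool'
THRESHOLD = timedelta(days=30)

output = subprocess.check_output([CMT, 'jms-certificates', '-status']).decode()

# Example line in the output: 'To: Thu Jun 11 15:38:11 UTC 2020]'
for match in re.finditer(r'To:\s+(\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} UTC \d{4})', output):
    expiry = datetime.strptime(match.group(1), '%a %b %d %H:%M:%S %Z %Y')
    remaining = expiry - datetime.utcnow()
    if remaining < THRESHOLD:
        print('JMS certificate expires on {} - regenerate it soon!'.format(expiry))
    else:
        print('JMS certificate valid until {} ({} days left)'.format(expiry, remaining.days))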

Update 21/05/2020: KB 78964 has been published on this topic. Also, if the CA signing certificate has expired, you will need to disable SSL altogether, restart the cell, regenerate the certificate and re-enable SSL.

VMware Cloud Director: Push Notifications in Tenant Context

In VMware Cloud Director 10.1 (VCD), organization users can subscribe to event and task push notifications. This might be useful if the tenant needs to keep track of activity in the cloud or connect a CMDB or any other custom solution, and does not want to permanently poll the audit log via the API.

Access to notifications was in the past only in the realm of service providers, who needed to deploy RabbitMQ and connect their Cloud Director cells to it. They can still do so, and in fact have to if they need blocking tasks or use a VCD API extension (for example Container Service Extension, App Launch Pad or vRealize Operations Tenant App).

The new functionality is enabled by the internal Artemis ActiveMQ bus that runs on every VCD cell. The MQTT client connects to the public HTTPS endpoint and uses a WebSocket connection to the bus. Authentication is provided via the JWT authentication token. The official documentation provides some detail here, but not enough to actually set this up.

Therefore I want to demonstrate here, with a very simple Python script, how to set up the connection and start utilizing this feature.

The Python 3 script leverages the Pyvcloud module (22.0 or newer is required) and the Paho MQTT Python client. Both can be installed simply with pip.

pip install pyvcloud paho-mqtt

In the example, org admin credentials are used, which allows subscription to all organization messages via the publish/<org UUID>/* subscription string. It can also be used by the system administrator after changing the subscription string to publish/*/*.

#!/usr/bin/python3

import paho.mqtt.client as mqtt
import json
import datetime
import pyvcloud.vcd.client
import pyvcloud.vcd.vdc

vcdHost = 'vcloud.fojta.com'
vcdPort = 443
path = "/messaging/mqtt"
logFile = 'vcd_log.log'

#org admin credentials
user = 'acmeadmin'
password = 'VMware1!'
org = 'acme'

# Log in via pyvcloud and obtain the JWT access token used for the MQTT connection
credentials = pyvcloud.vcd.client.BasicLoginCredentials(user, org, password)
vcdClient = pyvcloud.vcd.client.Client(vcdHost+":"+str(vcdPort),None,True,logFile)
vcdClient.set_credentials(credentials)
accessToken = vcdClient.get_access_token()
headers = {"Authorization": "Bearer "+ accessToken}

# API version 34.0 corresponds to VMware Cloud Director 10.1
if max(vcdClient.get_supported_versions_list()) < "34.0":
    exit('VMware Cloud Director 10.1 or newer is required')

# Resolve the organization UUID used in the subscription string
org = vcdClient.get_org_list()
orgId = (org[0].attrib).get('id').split('org:',1)[1]

# Extract the nested payload JSON from the incoming message and print it with a timestamp
def on_message(client, userdata, message):
    event = message.payload.decode('utf-8')
    event = event.replace('\\','')
    eventPayload = event.split('payload":"',1)[1]
    eventPayload = eventPayload[:-2]
    event_json = json.loads(eventPayload)
    print(datetime.datetime.now())
    print (event_json)

# Enable for logging
# def on_log(client, userdata, level, buf):
#     print("log: ",buf)

# MQTT client over WebSocket: connect to the /messaging/mqtt path with the JWT bearer token in the headers
client = mqtt.Client(client_id = "PythonMQTT",transport = "websockets")
client.ws_set_options(path = path, headers = headers)
client.tls_set_context(context=None)
# client.tls_insecure_set(True)
client.on_message=on_message
# client.on_log=on_log  #enable for logging
client.connect(host = vcdHost, port = vcdPort , keepalive = 60)
print('Connected')
# Subscribe to all messages of the organization and process incoming events until interrupted (Ctrl+C)
client.subscribe("publish/"+orgId+"/*")
client.loop_forever()

Notice that the client needs to connect to the /messaging/mqtt path on the VCD endpoint and must provide a valid JWT authentication token in the header. That rules out some MQTT WebSocket clients that do not support custom headers (e.g. JavaScript ones).

The actual event is in JSON format, with a nested payload JSON providing the details. The code example prints the time when the event was received and just the nested payload JSON. The script runs forever until interrupted with Ctrl+C.
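As a side note, instead of the string splitting used in the on_message handler above, the nested payload can usually be unwrapped with two json.loads calls. This is a hedged alternative that assumes the outer message body is valid JSON and its payload field is an escaped JSON string; if the format differs, the original string handling is the safer bet:

# Hedged alternative to the string splitting above; assumes the outer message body is
# valid JSON and the 'payload' field contains the escaped JSON of the actual event.
def on_message(client, userdata, message):
    outer = json.loads(message.payload.decode('utf-8'))
    event_json = json.loads(outer['payload'])
    print(datetime.datetime.now())
    print(event_json)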

Note: The actual RabbitMQ extensibility configuration in VCD and the Non-blocking AMQP Notifications setting in the UI have no impact on this functionality and can be disabled if not used by the service provider.

How to Migrate VMware Cloud Director from NSX-V to NSX-T

VMware Cloud Director as a cloud management solution is built on top of the underlying compute and networking platforms that virtualize the physical infrastructure. For the compute and storage part, VMware vSphere has always been used. However, the networking platform is more interesting. It all started with vShield Edge, which was later rebranded to vCloud Networking and Security; Cisco Nexus 1000V was briefly an option, and currently NSX for vSphere (NSX-V) and NSX-T Data Center are supported.

VMware has announced the sunsetting of NSX-V (the current end of general support is planned for January 2022) and is fully committed going forward to the NSX-T Data Center flavor. The two NSX releases, while similar, are completely different products and there is no direct upgrade path from one to the other. So it is natural that all existing NSX-V users are asking how to migrate their environments to NSX-T.

The NSX-T Data Center Migration Coordinator has been available for some time, but the way it works is quite destructive for Cloud Director and it cannot be used in such environments.

Therefore, with VMware Cloud Director 10.1, VMware is releasing a compatible migration tool called VMware NSX Migration for VMware Cloud Director.

The philosophy of the tool is the following:

  • Enable migration of tenant workloads and networking at Org VDC granularity with minimum downtime, from an NSX-V backed Provider VDC (PVDC) to an NSX-T backed PVDC
  • Check and allow migration of only supported networking features
  • Evolve with new releases of NSX-T and Cloud Director

In other words, it is not an in-place migration. The provider will need to stand up new NSX-T backed cluster(s) next to the NSX-V backed ones in the same vCenter Server. Also, the current NSX-T feature set in Cloud Director is not equivalent to the NSX-V one. Therefore there are networking features that cannot, in principle, be migrated. For a comparison of the NSX-V and NSX-T Cloud Director feature sets, see the table at the end of this blog post.

The service provider will thus need to evaluate which Org VDCs can be migrated today based on the existing limitations and functionality. Start with the simple Org VDCs and, as new releases are provided, migrate the rest.

How does the tool work?

  • It is a Python-based CLI tool that is installed and run by the system administrator. It uses public APIs to communicate with Cloud Director, NSX-T and vCenter Server to perform the migrations.
  • The environment must be prepared in such a way that an NSX-T backed PVDC exists in the same vCenter Server as the source NSX-V PVDC and that their external networks are equivalent at the infrastructure level, as the existing external IP addresses are seamlessly migrated as well.
  • The service provider defines which source Org VDC (NSX-V backed) is going to be migrated and which Provider VDC (NSX-T backed) is the target.
  • The service provider must prepare dedicated NSX-T Edge Cluster whose sole purpose is to perform Layer-2 bridging of source and destination Org VDC networks. This Edge Cluster needs one node for each migrated network and must be deployed in the NSX-V prepared cluster as it will perform VXLAN port group to NSX-T Overlay (Geneve) Logical Segment bridging.
  • When the tool is started, it will first discover the source Org VDC feature set and assess if there are any incompatible (unsupported) features. If so, the migration will be halted.
  • Then it will create the target Org VDC and start cloning the network topology, establish bridging, disconnect the target networks and run basic network checks to see if the bridges work properly. If not, a rollback is performed (the bridges and the target Org VDC are destroyed).
  • In the next step the north/south networking flow is reconfigured to flow through the target Org VDC. This is done by disconnecting the source networks from the gateway and reconnecting the target ones. During this step a brief N/S network disruption is expected. Also notice that the source Org VDC Edge GW needs to be connected temporarily to a dummy network, as NSX-V requires at least one connected interface on the Edge at all times.
  • Each vApp is then vMotioned from the source Org VDC to the target one. As this is a live vMotion, no significant network/compute disruption is expected.
  • Once the provider verifies the correct functionality of the target VDC she can manually trigger the cleanup step that migrates source catalogs, destroys bridges and the source Org VDC and renames the target Org VDC.
  • Rinse and repeat for the other Org VDCs.

Please make sure you read the release notes and user guide for the list of supported solutions and features. The tool will rapidly evolve – the short-term roadmap already includes pre-validation and rollback features. You are also encouraged to provide early feedback to help VMware decide how the tool should evolve.