VMware Cloud Director on VMware Cloud Foundation

There has been growing interest lately among service providers in using VMware Cloud Foundation (VCF) as the underlying virtualization platform in their datacenters. VCF is maturing with every release and offers automated lifecycle capabilities that service providers appreciate when operating infrastructure at scale.

I want to focus on how you would design and deploy VMware Cloud Director (VCD) on top of VCF, with a specific example. While there are whitepapers written on this topic, they do not go into the nitty-gritty details. This should not be considered a prescribed architecture – it is just one way to skin a cat that should inspire you for your own design.

VCF 4.0 consists of a management domain – a smaller infrastructure with one vSphere 7 cluster, NSX-T 3 and vRealize components (vRealize Suite Lifecycle Manager, vRealize Operations Manager, vRealize Log Insight). It is also used for the deployment of management components for workload domains, which are separate vSphere 7 + NSX-T 3 environments.

VCF has a prescribed architecture, based on VMware Validated Designs (VVD), for how all the management components are deployed. Some are on VLAN-backed networks, but some are on overlay logical segments created in NSX-T (VVD calls them application virtual networks – AVNs) and routed via NSX-T Edge Gateways. The following picture shows the typical logical architecture of the management cluster, which we will start with:

Reg-MGMT and X-Reg-MGMT are overlay segments; the rest are VLAN networks.
VC Mgmt … Management vCenter Server
VC Res … Workload domain (resource) vCenter Server
NSX Mgmt … Management NSX-T Managers (3x)
Res Mgmt … Workload domain (resource) NSX-T Managers (3x)
SDDC Mgr … SDDC Manager
Edge Nodes … NSX-T Edge Node VMs (2x) that provide resources for Tier-0 gateways, Tier-1 gateways and the Load Balancer
vRLCM … vRealize Suite Lifecycle Manager
vROps … vRealize Operations Manager (two or more nodes)
vROps RC … vRealize Operations Remote Collectors (optional)
vRLI … vRealize Log Insight (two or more nodes)
WS1A … Workspace ONE Access (formerly VIDM, one or more nodes)

Now we are going to add the VMware Cloud Director solution. I will focus on the following components:

  • VCD cells
  • RabbitMQ (needed for extensibility such as vROps Tenant App or Container Service Extension)
  • vRealize Operations Tenant App (provides multitenant vROps view in VCD and Chargeback functionality)
  • Usage Meter

I have followed these design principles:

  • VCD solution will utilize overlay (AVN) networks
  • leverage existing VCF infrastructure when it makes sense
  • consider future scalability
  • separate internet traffic from the management one

And here is the proposed design:

A new overlay segment (AVN) called VCD DMZ has been added to separate the internet traffic. It is routed via a separate Tier-1 GW but connected to the existing Tier-0. The VCD cells (3 or more) have their primary (eth0) interface on this network, behind an NSX-T Load Balancer (running on its own Tier-1, similar to the vROps one). And finally, the vRealize Operations Tenant App VM sits on this network as well.

The existing Reg-MGMT segment is used for the secondary interface of the VCD cells, for the Usage Meter VM, and for the vSAN File Services NFS share that the VCD cells require.
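
As an illustration, here is a minimal sketch of how the shared transfer directory could be mounted on each cell. The vSAN File Services endpoint vsan-fs.corp.local and the share name /vcd-transfer are assumptions; /opt/vmware/vcloud-director/data/transfer is the standard VCD transfer directory path.

# Mount the vSAN File Services NFS share as the VCD transfer directory
# (server and share names are placeholders for your environment)
mkdir -p /opt/vmware/vcloud-director/data/transfer
mount -t nfs vsan-fs.corp.local:/vcd-transfer /opt/vmware/vcloud-director/data/transfer

# Persist the mount across reboots on every cell
echo "vsan-fs.corp.local:/vcd-transfer /opt/vmware/vcloud-director/data/transfer nfs defaults 0 0" >> /etc/fstab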

And finally, the cross-region X-Reg-MGMT segment is utilized for the RabbitMQ nodes (2 or more) in order to leverage the existing vROps Load Balancer and avoid deploying an additional one just for RabbitMQ.
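
To make the design more tangible, here is a minimal sketch of creating the VCD DMZ Tier-1 gateway and segment via the NSX-T Policy API. The manager FQDN, object names, transport zone ID and subnet are all assumptions for illustration:

# Create a dedicated Tier-1 GW for the DMZ, attached to the existing Tier-0
# (t0-vcf is a placeholder for the VCF-deployed Tier-0 ID)
curl -k -u admin -X PATCH https://nsx-mgr.corp.local/policy/api/v1/infra/tier-1s/t1-vcd-dmz \
  -H 'Content-Type: application/json' \
  -d '{"resource_type":"Tier1","tier0_path":"/infra/tier-0s/t0-vcf","route_advertisement_types":["TIER1_CONNECTED","TIER1_LB_VIP"]}'

# Create the VCD DMZ overlay segment connected to that Tier-1
curl -k -u admin -X PATCH https://nsx-mgr.corp.local/policy/api/v1/infra/segments/vcd-dmz \
  -H 'Content-Type: application/json' \
  -d '{"resource_type":"Segment","connectivity_path":"/infra/tier-1s/t1-vcd-dmz","transport_zone_path":"/infra/sites/default/enforcement-points/default/transport-zones/<overlay-tz-id>","subnets":[{"gateway_address":"192.168.100.1/24"}]}'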

Additional notes:

  • VCF deploys two NSX-T Edge nodes in a 2-node NSX-T Edge cluster. This cluster currently cannot easily be scaled out. Therefore I would recommend deploying additional Edge nodes in a separate NSX-T Edge cluster (directly in NSX-T) for the DMZ Tier-1 gateway and the VCD load balancer. This guarantees compute and networking resources, especially for the load balancer, which will perform SSL termination (this might not apply if you choose a different load balancer, e.g. Avi). It also adds the possibility of deploying a separate Tier-0 for more N/S bandwidth.
  • vSAN FS NFS deployment is described here. Do not forget to enable MAC learning on the Reg-MGMT NSX-T logical segment (via a segment profile – see the API sketch after this list).
  • Both Tier-1 gateways can provide north-south firewalling for additional security.
  • As all the incoming internet traffic to VCD goes over the VCD load balancer, which provides source NAT, I have opted to put the default route of the VCD cells on the management interface, avoiding any need for static routes to separate tenant and management traffic.
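
As promised above, here is a minimal sketch of enabling MAC learning on the segment via the NSX-T Policy API; the profile name and segment ID are assumptions:

# Create a MAC discovery profile with MAC learning enabled
curl -k -u admin -X PATCH https://nsx-mgr.corp.local/policy/api/v1/infra/mac-discovery-profiles/mac-learning-on \
  -H 'Content-Type: application/json' \
  -d '{"resource_type":"MacDiscoveryProfile","mac_learning_enabled":true}'

# Bind the profile to the Reg-MGMT segment (reg-mgmt is a placeholder ID)
curl -k -u admin -X PATCH https://nsx-mgr.corp.local/policy/api/v1/infra/segments/reg-mgmt/segment-discovery-profile-binding-maps/default-binding \
  -H 'Content-Type: application/json' \
  -d '{"resource_type":"SegmentDiscoveryProfileBindingMap","mac_discovery_profile_path":"/infra/mac-discovery-profiles/mac-learning-on"}'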

Let me know in the comments if you plan VCD on VCF and if you are facing any challenges.

VCSA Convergence: Failed to Get RPMs

One of the vSphere 6.7 U3 environments I am managing was still using an external Platform Services Controller (PSC) from the times when that was the prescribed architecture. That is no longer the case, so to simplify my management I wanted to get rid of the external PSC via the so-called convergence to an embedded PSC.

Unfortunately, although there is a very nice UI to do this, it never worked for me, and I did try multiple times. The error I always ended up with was:

Failed to get RPMs.

The /var/log/vmware/converge/converge.log log did not show any error, but what was peculiar were these entries referring to the download of VCSA 6.5.0 files?!

2019-10-29T16:02:01.223Z INFO converge currentURL = https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8d167796-34d5-4899-be0a-6daade4005a3/6.5.0.10000.latest/
2019-10-29T16:02:01.223Z INFO converge Manifest file = https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8d167796-34d5-4899-be0a-6daade4005a3/6.5.0.10000.latest/manifest/manifest-latest.xml

These are obviously not correct for my 6.7 U3 VCSA appliance. This VMware Communities thread finally pushed me in the right direction.

Here are the steps how to resolve this:

  1. Delete the content of the /root/vema directory on the VCSA
  2. Download the correct VCSA ISO installation media corresponding to the version of your VC. In my case it was the full 4 GB VMware-VCSA-all-6.7.0-14836122.iso. The patch media VMware-vCenter-Server-Appliance-6.7.0.41000-14836122-patch-FP.iso, circa 2 GB large, did not work.
  3. Mount the ISO to your VCSA
  4. Re-run the convergence via the UI (see the shell sketch of steps 1 and 3 below)
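
For reference, here is a minimal sketch of steps 1 and 3 from the VCSA shell. It assumes the ISO has already been attached to the appliance's virtual CD-ROM drive; the device name and mount point are illustrative and may differ in your environment.

# Step 1: remove the stale downloaded files that point to the wrong version
rm -rf /root/vema/*

# Step 3: mount the attached installation ISO inside the appliance
mkdir -p /mnt/cdrom
mount /dev/cdrom /mnt/cdrom

# Sanity check – the installation RPMs should be visible
# before re-running the convergence wizard in the UI
ls /mnt/cdrom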

vCenter Server Issue: Recent Tasks Show xxx.label

I had an annoying issue in my lab. Some time ago, when I performed the vSphere 6.7 PSC convergence, my vCenter stopped displaying the proper names of tasks in the vSphere Client UIs (both Flex and H5) and showed only placeholders with names like xxx.label.

While there are some KB and communities articles about the issue (and fix), none of them was applicable to my situation (running vCenter Server 6.7U1). I thought that VCSA patches, or even deploying a new appliance with a backup restore, would fix it, but they did not.

After a little research I found out that the issue is caused by a missing catalog.zip file in the /etc/vmware-vpx/locale/ folder. I had another lab with the exact same vCenter Server build deployed, so I just copied that file and transferred it to my vCenter Server Appliance. After a service restart via the VAMI UI, task names were back.
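
If you want to do the same over the network, here is a minimal sketch, assuming SSH with the bash shell is enabled on both appliances; healthy-vcsa.corp.local is a placeholder for a donor vCenter of the exact same build:

# Pull catalog.zip from a healthy appliance of the same build
scp root@healthy-vcsa.corp.local:/etc/vmware-vpx/locale/catalog.zip /etc/vmware-vpx/locale/

# Restart the vCenter Server service (or restart services via the VAMI UI)
service-control --stop vmware-vpxd
service-control --start vmware-vpxd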

I do not know the root cause, but if you have the same issue, give it a go.

Update 5/6/2021: Diego Holzer emailed me the steps needed to rectify this issue in vSphere 7.0U2a:

During the first login, the association files are cached under /etc/vmware/vsphere-ui/cmCatalog – BUT not after a restore. He wrote an article with a workaround for version 7.0U2a. It is in German, but the workaround works well on English installations 😃

cd /etc/vmware/vsphere-ui/cmCatalog

# download the replacement catalog archives
wget https://data.vcloud24.ch/bugfix_vcenter_70_label/com.vmware.cis.vcenterserver.zip
wget https://data.vcloud24.ch/bugfix_vcenter_70_label/com.vmware.cis.com.vmware.vsphere.client.zip

# fix ownership and permissions of the downloaded files
chown vsphere-ui:users com.vmware.cis.vcenterserver.zip
chmod 0644 com.vmware.cis.vcenterserver.zip
chown vsphere-ui:users com.vmware.cis.com.vmware.vsphere.client.zip
chmod 0644 com.vmware.cis.com.vmware.vsphere.client.zip

vCloud Availability: Replication of Powered-off VM

Just a short post about a feature I recently learned about.

In vSphere Replication, when you are configuring replication of a powered-off VM, you will get the following message:

The virtual machine is not powered on. Replication will start when the virtual machine is powered on.

The replication is actually configured, and its placeholder VM is created in the recovery location (cloud), but the VM will stay in the Not Active state.

Why is this? An immediate start of replication locks the VM disks, which means such a VM would not be able to power on until the initial sync finished. But what if you want to replicate powered-off VMs – for example, templates that are never meant to run?

You can in fact force-start the replication by right-clicking the VM and selecting Sync Now. This prompts a confirmation question asking whether you really want to do so, as the VM will not be able to be powered on until the operation completes.

Is there a use case for this? As I mentioned, it could be used for catalog sync, as replication is much faster and more efficient than OVF export / import.

vCloud Architecture Toolkit for Service Provider Update

The vCloud Architecture Toolkit for Service Providers website has been updated with a new set of documents. All documents were re-branded with the new VMware Cloud Provider Program logos that replace the old vCloud Air Network brand.

My Architecting a VMware vCloud Director Solution for VMware Cloud Providers whitepaper has been refreshed to include vCloud Director 8.10 and 8.20 additions that were missing in the previous version. The current version of the document is 2.8, with an August 2017 release date.

Here is a summary of the new or updated topics:

  • Cell sizing
  • vCloud DB performance tips
  • New vCenter Chargeback Manager network metrics
  • vRealize Business for Cloud
  • vRealize Log Insight
  • vRealize Operations Manager
  • NSX Networking updates
  • Storage support
  • vCloud RBAC
  • Org VDC vSphere Resource Settings
  • VCDNI deprecation
  • New Org VDC Edge GW features
  • Distributed Firewall
  • VM Auto import
  • vCloud API for NSX
  • vCloud Director orchestrated upgrade

The document can be downloaded in PDF format or viewed online.