vCloud Director with NSX: Edge Cluster (Part 2)

In my previous article vCloud Director with NSX: Edge Cluster I described various design options for the NSX Edge Cluster in a vCloud Director environment. In this article I would like to discuss an additional option which extends Design Option III – Dedicated Edge Cluster. Below is the picture showing the scenario from the previous post.

Spine/leaf with Dedicated Edge Cluster

There is one Provider-deployed Edge in the Edge Cluster for each Transit vCloud Director External network to which Org VDC Edge Gateways are connected. The option works quite well for use cases where the Provider Edge is dedicated to a single tenant – e.g. it is providing VPN services or L2 bridging. (Note that in the L2 bridging use case the Org VDC Edge Gateway is not deployed and Org VDC networks connect directly to the tenant-dedicated external network.)

However, when we want to provide access to a shared service (for example the internet) and deploy multiple Org VDC Edge Gateways of different tenants connected to the same external network, they all have to go through a single Provider Edge, which can become a bottleneck.

As of NSX version 6.1, Edge Gateways can however be deployed in an ECMP (Equal Cost Multi-Path) configuration where we can aggregate the bandwidth of up to 8 Edges (8 × 10 Gbps = 80 Gbps throughput). High availability of ECMP Edges is then achieved with a dynamic routing protocol (BGP or OSPF) using aggressive timers for short failover times (about 3 seconds), which quickly removes a failed path from the routing tables.
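
As an illustration, below is a minimal Python sketch of what such a configuration could look like when pushed through the NSX Manager REST API with the requests library: ECMP enabled in the global routing configuration and a BGP neighbour with 1-second keep-alive and 3-second hold-down timers. The NSX Manager address, credentials, edge ID and the exact XML element names are assumptions – verify them against the NSX-V API guide for your version.

```python
# Minimal sketch: enable ECMP and aggressive BGP timers on a Provider Edge
# via the NSX Manager REST API. Endpoint paths, XML element names and the
# edge ID are assumptions -- check the NSX-V API guide for your version.
import requests

NSX_MANAGER = "https://nsx-manager.example.com"      # hypothetical address
AUTH = ("admin", "password")                         # hypothetical credentials
EDGE_ID = "edge-1"                                   # hypothetical Provider Edge ID

headers = {"Content-Type": "application/xml"}

# Enable ECMP in the edge global routing configuration (assumed schema).
ecmp_body = """
<routingGlobalConfig>
  <routerId>10.0.0.1</routerId>
  <ecmp>true</ecmp>
</routingGlobalConfig>
"""
requests.put(f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/global",
             data=ecmp_body, headers=headers, auth=AUTH, verify=False)

# Configure a BGP neighbour with aggressive timers (1 s keep-alive, 3 s hold-down)
# so a failed path is withdrawn within roughly three seconds (assumed schema).
bgp_body = """
<bgp>
  <enabled>true</enabled>
  <localAS>65001</localAS>
  <bgpNeighbours>
    <bgpNeighbour>
      <ipAddress>10.0.0.254</ipAddress>
      <remoteAS>65000</remoteAS>
      <keepAliveTimer>1</keepAliveTimer>
      <holdDownTimer>3</holdDownTimer>
    </bgpNeighbour>
  </bgpNeighbours>
</bgp>
"""
requests.put(f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/bgp",
             data=bgp_body, headers=headers, auth=AUTH, verify=False)
```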

The problem is that (as of vCloud Director 5.6) Organization VDC Edges are deployed in the legacy (vShield/vCNS) mode and support neither ECMP routing nor dynamic routing protocols. The design I propose gets around this limitation by deploying a Distributed Logical Router between the Provider and Organization VDC Edges.

Spine/leaf with Dedicated Edge Cluster and ECMP Edges

The picture above shows two Provider ECMP Edges (this can scale up to 8), each with two physical VLAN connections to the upstream physical routers and one internal interface to the Transit Edge logical switch. A Distributed Logical Router (DLR) then connects the Transit Edge logical switch with the Transit vCloud Director External Network to which all tenant Org VDC Edge Gateways are connected. The DLR has ECMP routing enabled as well as OSPF or BGP dynamic routing peering with the Provider Edges. The DLR provides two (or more) equal paths to the upstream Provider Edges and chooses one based on a hash of the source and destination IP of the routed packet.
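
The principle of that path selection can be shown with a few lines of Python: hash the source and destination IP of a packet and use the result to pick one of the equal-cost next hops, so a given flow always lands on the same Provider Edge. The hash function below is only a stand-in for the real ESXi/DLR algorithm and the addresses are made up.

```python
# Illustration only: pick one of several equal-cost next hops by hashing the
# source and destination IP of a packet. The real ESXi/DLR hash differs, but
# the principle -- a flow always maps to the same next hop -- is the same.
import ipaddress
import zlib

ecmp_next_hops = ["192.168.255.1", "192.168.255.2"]   # the two Provider Edges

def select_next_hop(src_ip: str, dst_ip: str) -> str:
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    index = zlib.crc32(key.to_bytes(4, "big")) % len(ecmp_next_hops)
    return ecmp_next_hops[index]

print(select_next_hop("10.10.10.5", "8.8.8.8"))    # always the same edge for this flow
print(select_next_hop("10.10.20.7", "8.8.4.4"))    # may hash to the other edge
```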

The two Org VDC Edge Gateways shown (which can belong to two different tenants) can then take advantage of all the bandwidth provided by the Edge Cluster (indicated by the orange arrows).

The picture also depicts the DLR Control VM. This is the protocol endpoint which peers with the Provider Edges and learns and announces routes. These are then distributed to the ESXi host vmkernel routing process by the NSX Controller Cluster (not shown in the picture). A failure of the DLR Control VM impacts routing information learned via OSPF/BGP even if the DLR is highly available in an active-standby configuration, because of the aggressive protocol timers (DLR Control VM failover takes more than 3 seconds). Therefore we create a static route for the Transit vCloud Director External network subnet on all ECMP Provider Edges. That is enough for north-south routing, as Org VDC subnets are always NATed by the tenant Org VDC Edge Gateway. South-north routing is static, as the Org VDC Edge Gateways are configured with the default gateway defined in the External Network properties.
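
A minimal sketch of pushing that static route to every ECMP Provider Edge follows, assuming a hypothetical addressing scheme (the Transit vCloud Director External network as 192.168.100.0/24 reachable via the DLR at 192.168.255.3); the REST path and XML element names are again assumptions to check against the NSX-V API guide.

```python
# Sketch: push the same static route for the Transit vCloud Director External
# network subnet to every ECMP Provider Edge. Addresses, edge IDs, the REST
# path and the XML element names are assumptions, not verified values.
import requests

NSX_MANAGER = "https://nsx-manager.example.com"
AUTH = ("admin", "password")
PROVIDER_EDGES = ["edge-1", "edge-2"]        # up to 8 ECMP edges
TRANSIT_SUBNET = "192.168.100.0/24"          # Transit vCD External network
DLR_FORWARDING_IP = "192.168.255.3"          # DLR interface on the Transit Edge LS

route_body = f"""
<staticRouting>
  <staticRoutes>
    <route>
      <network>{TRANSIT_SUBNET}</network>
      <nextHop>{DLR_FORWARDING_IP}</nextHop>
    </route>
  </staticRoutes>
</staticRouting>
"""

for edge_id in PROVIDER_EDGES:
    requests.put(
        f"{NSX_MANAGER}/api/4.0/edges/{edge_id}/routing/config/static",
        data=route_body, headers={"Content-Type": "application/xml"},
        auth=AUTH, verify=False)
```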

The other consideration is the placement of the DLR Control VM. If it fails together with one of the ECMP Provider Edges, the ESXi host vmkernel routes are not updated until the DLR Control VM functionality fails over to the passive instance, and meanwhile the route to the dead Provider Edge is black-holing traffic. If we have enough hosts in the Edge Cluster we should deploy the DLR Control VMs with anti-affinity to all ECMP Edges. Most likely we will not have enough hosts, therefore we would deploy the DLR Control VMs to one of the compute clusters. The VMs are very small (512 MB, 1 vCPU), therefore the cluster capacity impact is negligible.
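
The placement decision itself boils down to simple anti-affinity logic, sketched below with made-up host and Edge names: prefer Edge Cluster hosts that run no ECMP Edge, otherwise fall back to a compute cluster.

```python
# Illustration of the anti-affinity placement decision for the DLR Control VMs.
# Host and VM names are hypothetical.
edge_cluster_hosts = {
    "esx-edge-01": ["provider-edge-1"],
    "esx-edge-02": ["provider-edge-2"],
    "esx-edge-03": [],
    "esx-edge-04": [],
}
compute_cluster_hosts = ["esx-comp-01", "esx-comp-02"]

def place_dlr_control_vms(edge_hosts, compute_hosts, count=2):
    """Prefer Edge Cluster hosts that run no ECMP edge; otherwise use compute hosts."""
    free_edge_hosts = [h for h, edges in edge_hosts.items() if not edges]
    if len(free_edge_hosts) >= count:
        return free_edge_hosts[:count]
    return compute_hosts[:count]   # 512 MB / 1 vCPU each, negligible impact

print(place_dlr_control_vms(edge_cluster_hosts, compute_cluster_hosts))
# ['esx-edge-03', 'esx-edge-04']
```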

vCloud Director with NSX: Edge Cluster

I see more and more that new and existing vCloud Director deployments leverage NSX as the networking component instead of the legacy vShield / vCloud Networking and Security (vCNS). The main reasons are the announced end-of-life of vCNS and the additional features that NSX brings to the table (although most of them are not yet tenant consumable in vCloud Director – as of version 5.6.4).

When deploying NSX with vCloud Director, what new considerations should be included when designing the architecture? In this post I want to concentrate on the concept of the Edge Cluster.

What is an Edge Cluster?

VMware has published a very good NSX-V Network Virtualization Design Guide. It is a very detailed document describing all NSX concepts as well as how they should be properly architected. The concept of the Edge Cluster is discussed in quite some detail as well, so let me just summarize it here.

NSX overlay networks allow the creation of logical networks over an existing IP network fabric. This enables a highly scalable network design using a leaf/spine architecture where the boundary between L2 and L3 networks is at the rack level (the leaves) and all communication between racks is L3 only, going through a set of spine routers.

NSX spans logical networks across all racks; however, in the end we need to connect virtual workloads from the logical networks to the outside physical world (WAN, internet, colocated physical servers, etc.). These networks are represented by a set of VLAN networks, and because we are not stretching L2 across the racks we cannot trunk them everywhere – so they are connected only to one rack (or two for redundancy), which thus becomes the Edge Cluster.

So the purpose of the Edge Cluster is to host virtual routers – Edge Service Gateways – that provide the connectivity between the physical world (VLANs) and the virtual world (VXLAN logical switches). Note that this does not mean that every Edge Gateway needs to be deployed there. If an Edge Gateway provides connectivity between two VXLAN logical switches, it can be deployed anywhere, as logical switches span all clusters.

vCloud Director Edges

vCloud Director deploys Edge VMs in order to provide Organization VDC or vApp connectivity. The actual deployment is done through vCNS or NSX Manager, but it is vCloud Director that makes the decision about the placement and configuration of the Edges. A vCloud Director Edge Gateway provides connectivity between one or more vCloud Director External Networks and one or more Organization VDC Networks. It is deployed inside the Provider VDC in a special System VDC Resource Pool on a datastore belonging to the Org VDC default storage policy. The vCloud Director placement engine selects the most appropriate cluster where the Edge Gateway VM will be deployed – based on which clusters belong to the Provider VDC, what their available capacity is and, most importantly, their access to the right storage and external networks.
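
The decision can be approximated with a short sketch – a simplification of what vCloud Director actually does, with made-up cluster attributes: filter the Provider VDC clusters to those that can reach the required external networks and storage policy, then pick the one with the most free capacity.

```python
# Simplified approximation of the Edge Gateway placement decision.
# Cluster attributes and names are hypothetical.
clusters = [
    {"name": "edge-compute", "external_vlans": {"vlan-100"}, "storage_policies": {"gold"}, "free_ghz": 40},
    {"name": "compute-01",   "external_vlans": set(),        "storage_policies": {"gold"}, "free_ghz": 120},
]

def place_edge_gateway(clusters, required_vlans, storage_policy):
    candidates = [c for c in clusters
                  if required_vlans <= c["external_vlans"]
                  and storage_policy in c["storage_policies"]]
    if not candidates:
        raise RuntimeError("no cluster can host this Edge Gateway")
    return max(candidates, key=lambda c: c["free_ghz"])["name"]

print(place_edge_gateway(clusters, {"vlan-100"}, "gold"))   # -> 'edge-compute'
```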

vApp Edges provide connectivity between an Organization VDC network and a vApp network. They always have only one external and one internal interface. They are also deployed by vCloud Director to the Provider VDC System VDC Resource Pool and exist only when the vApp is in deployed mode (Powered On).

Transport Zone

A Transport Zone defines the scope of a VXLAN logical switch. It consists of one or more vSphere clusters. A Transport Zone can be created manually; however, vCloud Director automatically creates one Transport Zone for each Provider VDC, matching the clusters that are added to the Provider VDC, and associates it with a VXLAN Network Pool. When an Organization VDC is created by the vCloud System Administrator, a Network Pool must be assigned – all Organization VDC and vApp networks will then have its scope.
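
If you want to check which Transport Zones (VDN scopes) vCloud Director has created, a quick sketch against the NSX Manager API could look like the following; the endpoint path and the response element names are assumptions based on the NSX-V API and should be verified for your version.

```python
# Sketch: list the VXLAN transport zones (VDN scopes) known to NSX Manager and
# the clusters they contain. The endpoint path and response element names are
# assumptions -- check the NSX-V API guide for your version.
import requests
import xml.etree.ElementTree as ET

NSX_MANAGER = "https://nsx-manager.example.com"
AUTH = ("admin", "password")

response = requests.get(f"{NSX_MANAGER}/api/2.0/vdn/scopes", auth=AUTH, verify=False)
for scope in ET.fromstring(response.content).findall("vdnScope"):
    name = scope.findtext("name")
    clusters = [c.findtext("cluster/name") for c in scope.findall("clusters/cluster")]
    print(name, clusters)
```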

Design Option I – Traditional

In the traditional Access/Aggregation/Core network architecture the L2/L3 boundary is at the aggregation switches. This means all racks connected to the same set of aggregation switches have access to the same VLANs, and thus there is no need for an Edge Cluster, as an Edge connecting VLAN and VXLAN based networks can run in any rack. In vCloud Director it means that as long as the external networks (VLANs) are trunked to the aggregation switches we do not need to worry about Edge placement. The set of racks (clusters) connected to the same aggregation domain usually maps to a vCloud Director Provider VDC. The Transport Zone is then identical to the aggregation domain.

Traditional Access/Aggregation/Core architecture

The drawback of such a design is that Provider VDCs cannot span multiple aggregation domains.

Design Option II – Combined Edge/Compute Cluster

When a spine/leaf network architecture is used, the VLANs backing vCloud Director external networks are trunked to only one cluster, which in this design option we will call the Edge/Compute Cluster. As explained above, the vCloud Director placement engine deploys Edge VMs to a cluster based on VLAN connectivity – therefore it automatically places all Edge Gateways into the Edge/Compute cluster, as this is the only cluster where the external connectivity (VLANs) exists. vCloud Director will, however, also opportunistically place regular tenant VMs into this cluster (hence the name Edge/Compute).

Spine/leaf with Edge/Compute Cluster

This design option has all the scale advantages of the spine/leaf architecture; the drawback is the possibility of tenant workloads consuming the limited capacity of the Edge/Compute cluster. There are two potential options to remediate this:

  1. vCloud Director Edge Gateways are always deployed by the vCloud System Administrator. He/she could make sure that prior to Edge Gateway deployment there is enough capacity in the Edge/Compute cluster. If not, some tenant workloads can be migrated away to another cluster – this must be done from within vCloud Director (the Resource Pool / Migrate to option). Live migration is however possible only if the Edge/Compute Cluster shares the same VXLAN-prepared vSphere Distributed Switch (vDS) with the other clusters, and this requires at least four network uplinks on the Edge/Compute Cluster hosts (two uplinks for the vDS with external VLANs and two uplinks for the VXLAN vDS).
  2. Artificially limit the size of the Edge/Compute Cluster so the placement engine does not choose it for regular tenant workloads. This can be done by leveraging a Resource Pool which is created manually in the Edge/Compute cluster and attached to the Provider VDC instead of the whole cluster. An artificial limit is then set by the System Administrator and increased only when a new Edge Gateway needs to be deployed (see the sketch after this list).
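
For the second option, a minimal pyVmomi sketch of capping the manually created Edge resource pool is shown below; the vCenter address, credentials, resource pool name and the chosen CPU/memory limits are all hypothetical.

```python
# Sketch: cap the manually created Edge resource pool so the vCloud Director
# placement engine does not pick it for regular tenant workloads. Connection
# details, object names and the chosen limits are hypothetical.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

def find_resource_pool(content, name):
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.ResourcePool], True)
    try:
        return next(rp for rp in view.view if rp.name == name)
    finally:
        view.DestroyView()

rp = find_resource_pool(si.RetrieveContent(), "Edge-RP")   # hypothetical pool name

def allocation(limit_value):
    # UpdateConfig expects a fully populated allocation object.
    return vim.ResourceAllocationInfo(
        reservation=0, expandableReservation=False,
        limit=limit_value,
        shares=vim.SharesInfo(level=vim.SharesInfo.Level.normal, shares=4000))

spec = vim.ResourceConfigSpec(cpuAllocation=allocation(4000),      # 4 GHz cap (MHz)
                              memoryAllocation=allocation(8192))   # 8 GB cap (MB)
rp.UpdateConfig(name=None, config=spec)   # raise the caps before deploying a new Edge

Disconnect(si)
```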

Both options unfortunately introduce significant operational overhead.

Design Option IIb – Combined Edge/Compute Cluster with Non-elastic VDC

While elastic Org VDC types (such as Pay-As-You-Go or Allocation Pool) can span multiple clusters, what would be the impact of a non-elastic VDC, such as Reservation Pool, in this design option?

In a non-elastic Org VDC all tenant workloads are deployed into the primary Provider VDC resource pool. Edge VMs, however, can be deployed into secondary resource pools. This means that as long as the Edge/Compute cluster is added as a secondary Resource Pool to the Provider VDC, this design option can still be used.

Spine/leaf with Edge/Compute Cluster and non-elastic VDC

Design Option III – Dedicated Edge Cluster

This design option extends the previous one, but in this case we have a dedicated Edge Cluster which is not managed by vCloud Director at all. We also introduce a new Edge Gateway type – Provider Edges. These are deployed manually by the service provider, completely outside of vCloud Director, into the Edge Cluster. Their external uplinks are connected to external VLAN based networks and their internal interfaces are connected to a transit VXLAN Logical Switch spanning all compute clusters and the Edge Cluster (a manually created transport zone containing all clusters). The transit network(s) are then consumed by vCloud Director as External Networks – note that a little workaround is needed to do so – read here.
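
A minimal sketch of creating that transit logical switch in the manually created transport zone through the NSX Manager API could look like this; the transport zone (scope) ID, the endpoint path and the XML element names are assumptions to verify against the NSX-V API guide.

```python
# Sketch: create the transit VXLAN logical switch in the manually created
# transport zone that spans all compute clusters and the Edge Cluster.
# The scope ID, endpoint path and XML element names are assumptions.
import requests

NSX_MANAGER = "https://nsx-manager.example.com"
AUTH = ("admin", "password")
TRANSPORT_ZONE_ID = "vdnscope-1"          # hypothetical transport zone (scope) ID

body = """
<virtualWireCreateSpec>
  <name>transit-vcd-external</name>
  <description>Transit network between Provider Edges and Org VDC Edge Gateways</description>
  <tenantId>provider</tenantId>
  <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
</virtualWireCreateSpec>
"""

response = requests.post(
    f"{NSX_MANAGER}/api/2.0/vdn/scopes/{TRANSPORT_ZONE_ID}/virtualwires",
    data=body, headers={"Content-Type": "application/xml"},
    auth=AUTH, verify=False)
print(response.status_code, response.text)   # should return the new logical switch ID
```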

The Provider Edges can provide all NSX functionality (dynamic routing protocols on external uplinks, L2 bridging, L2 VPN, etc.). They can scale as additional vCloud Director External Networks are added (the current maximum in VCD 5.6 is 750 External Networks). The Edges deployed by vCloud Director then go into the compute clusters, as all their interfaces connect to VXLAN logical switches spanning the whole Provider VDC.

Spine/leaf with Dedicated Edge Cluster

Read vCloud Director with NSX: Edge Cluster (Part 2) here.

vRealize Automation with Multiple Cloud Endpoints

One of my customers had deployed a true hybrid vRealize Automation with multiple cloud endpoints: vCloud Air, an internal vCloud Director and AWS. I was called in to troubleshoot a strange issue where deployment of a cloud multi-machine blueprint (vApp) would sometimes work, but most often it would fail with the following message:

VCloud Clone VM failed for machine: XXX100 [Workflow Instance Id=19026]
System.InvalidOperationException: Error occurred while getting vApp template with ID: urn:vcloud:vapptemplate:a21de50d-8b5e-41a6-81d1-acfd8ab8364b

INNER EXCEPTION: com.vmware.vcloud.sdk.utility.VCloudException: [ 8ae6fbca-e0d2-43e7-bc94-5bc9d776bf8d ] No access to entity “com.vmware.vcloud.entity.vapptemplate:a21de50d-8b5e-41a6-81d1-acfd8ab8364b”

The endpoint was properly configured and the template existed, so what could be wrong? Why were we denied access to the template?

It turns out that by design vRealize Automation does not match a template to a particular endpoint; it identifies it just by name. So in our case it would sometimes try to deploy the blueprint to the wrong endpoint, where a template of that name did not exist.
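
The behaviour can be illustrated with a tiny sketch (endpoint and template names are made up): when the only key is the template name, whichever endpoint gets picked must also happen to contain a template of that name, otherwise the request fails exactly as above.

```python
# Illustration of name-only template matching across endpoints.
# Endpoint and template names are hypothetical.
endpoints = {
    "vcloud-air":   {"centos-template", "w2k12-template"},
    "internal-vcd": {"centos-template"},
    "aws":          set(),
}

def deploy(template_name, endpoint):
    if template_name not in endpoints[endpoint]:
        raise RuntimeError(f"No access to entity '{template_name}' on {endpoint}")
    return f"deployed {template_name} on {endpoint}"

print(deploy("w2k12-template", "vcloud-air"))         # works
try:
    print(deploy("w2k12-template", "internal-vcd"))   # blueprint landed on the wrong endpoint
except RuntimeError as err:
    print(err)
```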

The fix is simple:

  • Define reservation policies which identify each endpoint.
  • Assign them to the proper reservations.
  • Assign the reservation policies to the Cloud vApp blueprint. This way there will never be any confusion about which template to provision to which endpoint.