NSX-T 3.1: Sharing Transport VLAN between Host and Edge Nodes

When NSX-T 3.1 was released a few days ago, the feature I was most looking forward to was the ability to share the Geneve overlay transport VLAN between ESXi transport nodes and Edge transport nodes.

Before NSX-T 3.1, in a collapsed design where Edge transport nodes ran on ESXi transport nodes (in other words, NSX-T Edge VMs were deployed to an NSX-T prepared ESXi cluster), you could not share the same transport (TEP) VLAN unless you dedicated separate physical uplinks to Edge traffic and to ESXi underlay host traffic. The reason is that Geneve encapsulation/decapsulation happened only at the physical uplink ingress/egress, and that point would be skipped on the intra-host datapath between the Edge and the host TEP VMkernel port.

This was quite annoying because the two transport VLANs need to route between each other at full jumbo frame size (MTU > 1600). So in lab scenarios you needed an additional router to take care of that, and I have seen issues caused by a misconfigured router MTU size multiple times.
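To see where the extra MTU headroom goes, here is a minimal sketch of the encapsulation arithmetic. The header sizes are the standard minimums (outer IPv4 without options, UDP, the fixed Geneve header); the function name and the idea of passing Geneve option bytes as a parameter are my own illustration, not something NSX-T exposes.

```python
# Rough Geneve overhead arithmetic for the underlay (TEP) network.
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
GENEVE_BASE = 8      # fixed Geneve header; variable-length options come on top
UDP = 8              # outer UDP header
OUTER_IPV4 = 20      # outer IPv4 header without IP options


def required_underlay_mtu(inner_mtu: int, geneve_options: int = 0) -> int:
    """Return the smallest underlay IP MTU that carries an inner packet of inner_mtu bytes."""
    return inner_mtu + INNER_ETHERNET + GENEVE_BASE + geneve_options + UDP + OUTER_IPV4


print(required_underlay_mtu(1500))
```

A standard 1500-byte guest MTU already needs 1550 bytes on the underlay before any Geneve options are added, which is why the transport VLANs (and any router between them) need the larger MTU.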

After upgrading my lab to NSX-T 3.1 I was eager to test it.

Here are the steps I used to migrate to a single transport VLAN:

  1. The collapsed Edge Nodes need to use trunk uplinks created as NSX-T logical segments. My Edge Nodes used regular VDS port groups, so I renamed the old ones in vCenter and created new trunks in NSX-T Manager.
  2. (Optional) Create a new TEP IP Address Pool for the Edges. You can of course reuse the ESXi host IP Pool, as they will now share the same subnet, or you can use static IP addressing. I opted for a new IP Address Pool with the same subnet as my ESXi host TEP IP Address Pool but a different range, so I can easily distinguish host and Edge TEP IPs.
  3. Create a new Edge Uplink Profile with its transport VLAN matching the ESXi transport VLAN.
  4. Now for each Edge node repeat this process: edit the node in the Edge Transport Node Overview tab and change its Uplink Profile, IP Pool and uplinks to the ones created in steps #1, #2 and #3. Refresh and observe the Tunnel health.
  5. Clean up the now unused Uplink Profile, IP Pool and VDS uplinks.
  6. Deprovision the now unused Edge transport VLAN from the physical switches and from the physical router interface.

During the migration I saw one or two pings drop, but that was it. If you see tunnel issues, try putting the Edge node briefly into NSX Maintenance Mode.
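If you go with a separate Edge TEP pool in the shared subnet (step 2), it is worth checking up front that the two ranges sit in the same subnet and do not overlap. The sketch below does that with the Python standard library only; the function name and the addresses are made up for illustration and are not taken from my lab or from any NSX-T API.

```python
import ipaddress


def tep_pools_ok(subnet, host_range, edge_range):
    """Check that both TEP ranges lie in the shared subnet and do not overlap.

    subnet is a CIDR string; each range is a (start, end) tuple of IP strings.
    """
    net = ipaddress.ip_network(subnet)
    host_start, host_end = (ipaddress.ip_address(a) for a in host_range)
    edge_start, edge_end = (ipaddress.ip_address(a) for a in edge_range)
    same_subnet = all(a in net for a in (host_start, host_end, edge_start, edge_end))
    disjoint = host_end < edge_start or edge_end < host_start
    return same_subnet and disjoint


# Hypothetical example values: hosts use .11-.50, Edges use .101-.110.
print(tep_pools_ok("10.0.10.0/24",
                   ("10.0.10.11", "10.0.10.50"),
                   ("10.0.10.101", "10.0.10.110")))
```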

Quotas and Quota Policies in VMware Cloud Director

In this article I want to highlight a new neat feature in VMware Cloud Director 10.2 – the ability to assign quotas and create quota policies.

This can be done at multiple levels, by both the service provider and the organization administrator.

The following resources today can be managed via quotas:

  • Memory
  • CPU
  • Storage
  • All VMs (includes vApp template VMs)
  • Running VMs
  • TKG Clusters

The list might expand in the future; you can easily find out which quota capabilities are available via the API.

The service provider can create quotas at the organization level in the Organization > Configure > Quotas section:

The org administrator can assign quotas to individual users or groups. This is done from the Administration > Access Control > User or Group > Set Quota section.
A quota assigned at the group level is inherited by each user in the group (so it is not enforced at the aggregate group level) but can be overridden at the individual user level. Also, if a user is a member of multiple groups, the least restrictive combination of those groups' quotas is applied to that user.
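The inheritance rules can be sketched in a few lines. This is only my reading of the behavior described above, not VCD code: I assume quotas are per-resource limits, that `None` stands for unlimited, that a user-level quota overrides the group values per resource, and that a group which says nothing about a resource is simply ignored for it. All names are hypothetical.

```python
from typing import Dict, List, Optional

# resource name -> limit; None means unlimited (an assumption of this sketch)
Quota = Dict[str, Optional[int]]


def effective_quota(user_quota: Quota, group_quotas: List[Quota]) -> Quota:
    """Combine a user's own quota with the quotas of all groups the user belongs to."""
    resources = set(user_quota)
    for group in group_quotas:
        resources.update(group)

    result: Quota = {}
    for resource in sorted(resources):
        if resource in user_quota:
            # An explicit user-level quota overrides anything inherited from groups.
            result[resource] = user_quota[resource]
            continue
        limits = [g[resource] for g in group_quotas if resource in g]
        # Least restrictive wins: unlimited beats any number, otherwise take the largest.
        result[resource] = None if None in limits else max(limits)
    return result


groups = [{"running_vms": 5, "storage_gb": 100}, {"running_vms": 10}]
print(effective_quota({"storage_gb": 50}, groups))
```

In this example the user may run 10 VMs (the more generous of the two groups) but is held to 50 GB of storage by the individual override.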

At the same place the user or org admin can see the actual user’s usage compared to the quota.

Org admins can use quotas to easily control good behavior of org users (not running too many VMs concurrently, not consuming too much storage, etc.), while system admins can set safety quotas at org level when using Org VDC allocation models with unlimited consumption with Pay per use billing.

One hidden feature available only via API is the ability to create more generic quota policies that can combine (pool) multiple quota elements and use those to assign them to organizations, groups or individual users. Think of quota policy: Power User vs Regular User, where the former can power on more VMs.

When a specific quota is assigned to a user, group or org object, a quota policy is still created in the backend, but it is specific to just that one object. An edit of the Power User quota policy, by contrast, applies to every user that has that quota policy assigned.

The feature comes with new specific rights, so it can be easily enabled or disabled:

  • Organization: Manage Quotas of Organization
  • Organization: Edit Quotas Policy
  • General: View Quota Policy Capabilities
  • General: Manage Quota Policy
  • General: View Quota Policy

New Networking Features in VMware Cloud Director 10.2

From a networking perspective, the 10.2 release of VMware Cloud Director was a massive one. The gap between NSX-V and NSX-T was closed, and in some cases NSX-T backed Org VDCs now provide more networking functionality than the NSX-V backed ones. The UI has been redesigned with new dedicated Networking sections; however, some new features are currently available only via the API.
Let me dive straight in so you do not miss any.

NSX-T Advanced Load Balancing (Avi) support

This is a big feature that requires its own blog post. Please read here. In short, NSX-T backed Org VDCs can now consume network load balancer services that are provided by the new NSX-T ALB / Avi.

Distributed Firewall and Data Center Groups

Another big feature combines Cross VDC networking, shared networks and distributed firewall (DFW) functionality. The service provider must first create a Compute Provider Scope. This is basically a tag, an abstraction of compute fault domains / availability zones, and it is set either at the vCenter Server level or at the Provider VDC level.

The same can be done for each NSX-T Manager, where you would define a Network Provider Scope.

Once that is done, the provider can create Data Center Group(s) for a particular tenant. This is done from the new networking UI in the Tenant portal by selecting one or multiple Org VDCs. The Data Center Group will now become a routing domain with networks spanning all Org VDCs that are part of the group, with a single egress point (Org VDC Gateway) and the distributed firewall.

Routed networks will automatically be added to a Data Center Group if they are connected to the group Org VDC Edge Gateway. Isolated networks must be added explicitly. An Org VDC can be a member of multiple Data Center Groups.

If you want the tenant to use DFW, it must be explicitly enabled and the tenant Organization has to have the correct rights. The DFW supports IP Sets and Security Groups containing network objects that apply rules to all connected VMs.

Note that only one Org VDC Edge Gateway can be added to a Data Center Group. This is due to the limitation that an NSX-T logical segment can be attached to and routed via only a single Tier-1 GW. The Tier-1 GW runs in active / standby mode and can theoretically span multiple sites, but only a single instance is active at a time (no multi-egress).

VRF-Lite Support

VRF-Lite is an object that allows slicing a single NSX-T Tier-0 GW into up to 100 independent virtual routing instances. Lite means that while these instances are very similar to a real Tier-0 GW, they support only a subset of its features: routing, firewalling and NATing.

In VCD, when a tenant requires direct connectivity to an on-prem WAN/MPLS with fully routed networks (instead of just NAT-routed ones), the provider in the past had to dedicate a whole external network backed by a Tier-0 GW to that tenant. Now the same can be achieved with a VRF, which greatly enhances the scalability of the feature.

There are some limitations:

  • VRF inherits its parent Tier-0 deployment mode (HA A/A vs A/S, Edge Cluster), BGP local ASN and graceful restart setting
  • all VRFs share the physical bandwidth of their parent's uplinks
  • VRF uplinks and peering with upstream routers must be configured individually, using VLANs from a VLAN trunk or unique Geneve segments (if the upstream router is another Tier-0)
  • as an alternative to the previous point, EVPN can be used, which allows a single MP-BGP session for all VRFs and upstream routers, with VXLAN encapsulation in the data plane. The upstream routers must obviously support EVPN.
  • the provider can import into VCD as an external network either the parent Tier-0 GW or its child VRFs, but not both (mixed mode)

IPv6

VMware Cloud Director now supports dual stack IPv4/IPv6 (both for NSX-V and NSX-T backed networks). This must be currently enabled via API version 35 either during network creation or via PUT on the OpenAPI network object by specifying:

"enableDualSubnetNetwork": true

In the same payload you also have to add the second subnet definition.

PUT https://{{host}}/cloudapi/1.0.0/orgVdcNetworks/urn:vcloud:network:c02e0c68-104c-424b-ba20-e6e37c6e1f73

...
    "subnets": {
        "values": [
            {
                "gateway": "172.16.100.1",
                "prefixLength": 24,
                "dnsSuffix": "fojta.com",
                "dnsServer1": "10.0.2.210",
                "dnsServer2": "10.0.2.209",
                "ipRanges": {
                    "values": [
                        {
                            "startAddress": "172.16.100.2",
                            "endAddress": "172.16.100.99"
                        }
                    ]
                },
                "enabled": true,
                "totalIpCount": 98,
                "usedIpCount": 1
            },
            {
                "gateway": "fd13:5905:f858:e502::1",
                "prefixLength": 64,
                "dnsSuffix": "",
                "dnsServer1": "",
                "dnsServer2": "",
                "ipRanges": {
                    "values": [
                        {
                            "startAddress": "fd13:5905:f858:e502::2",
                            "endAddress": "fd13:5905:f858:e502::ff"
                        }
                    ]
                },
                "enabled": true,
                "totalIpCount": 255,
                "usedIpCount": 0
            }
        ]
    }
...
    "enableDualSubnetNetwork": true,
    "status": "REALIZED",
...
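As a rough illustration of preparing such a PUT body, the sketch below takes an existing network object (as a Python dict), appends the second subnet and sets the dual subnet flag. The field names are copied from the payload above; the helper name is hypothetical, the `totalIpCount` / `usedIpCount` fields are left out on the assumption that they are values reported by VCD, and no HTTP call is made.

```python
import copy
import ipaddress


def add_second_subnet(network, gateway, prefix_length, start, end):
    """Return a copy of the network body with a second subnet and dual stack enabled."""
    subnet = ipaddress.ip_interface(f"{gateway}/{prefix_length}").network
    for address in (start, end):
        # Catch typos early: the pool must sit inside the gateway's subnet.
        if ipaddress.ip_address(address) not in subnet:
            raise ValueError(f"{address} is outside {subnet}")

    body = copy.deepcopy(network)  # leave the caller's object untouched
    body["subnets"]["values"].append({
        "gateway": gateway,
        "prefixLength": prefix_length,
        "dnsSuffix": "",
        "dnsServer1": "",
        "dnsServer2": "",
        "ipRanges": {"values": [{"startAddress": start, "endAddress": end}]},
        "enabled": True,
    })
    body["enableDualSubnetNetwork"] = True
    return body


existing = {"subnets": {"values": [{"gateway": "172.16.100.1", "prefixLength": 24}]}}
dual = add_second_subnet(existing, "fd13:5905:f858:e502::1", 64,
                         "fd13:5905:f858:e502::2", "fd13:5905:f858:e502::ff")
print(len(dual["subnets"]["values"]), dual["enableDualSubnetNetwork"])
```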

 

The UI will still show only the primary subnet and IP address. The allocation of the secondary IP to a VM must be done either from its guest OS or via automated network assignment (DHCP, DHCPv6 or SLAAC). DHCPv6 and SLAAC are available only for NSX-T backed Org VDC networks, but for NSX-V backed networks you could use IPv6 as the primary subnet (with an IPv6 pool) and IPv4 with DHCP addressing as the secondary.

To enable IPv6 capability in NSX-T, the provider must enable it in the Global Networking Config.
VCD automatically creates ND (Neighbor Discovery) Profiles in NSX-T for each NSX-T backed Org VDC Edge GW. Via the /1.0.0/edgeGateways/{gatewayId}/slaacProfile API the tenant can then set the Edge GW profile to either DHCPv6 or SLAAC. For example:
PUT https://{{host}}/cloudapi/1.0.0/edgeGateways/urn:vcloud:gateway:5234d305-72d4-490b-ab53-02f752c8df70/slaacProfile
{
    "enabled": true,
    "mode": "SLAAC",
    "dnsConfig": {
        "domainNames": [],
        "dnsServerIpv6Addresses": [
            "2001:4860:4860::8888",
            "2001:4860:4860::8844"
        ]
    }
}

And here is the corresponding view from NSX-T Manager:

And finally a view on deployed VM’s networking stack:

DHCP

Speaking of DHCP, NSX-T supports two modes: Network mode, where the DHCP service is attached directly to a network and needs an IP from that network, and Edge mode, where the DHCP service runs on the Tier-1 GW loopback address. VCD now supports both modes (via API only). The DHCP Network mode works for isolated networks and is portable with the network (meaning the network can be attached to or disconnected from the Org VDC Edge GW) without DHCP service disruption. However, before you can deploy the DHCP service in Network mode you need to specify a Services Edge Cluster (for Edge mode that is not needed, as the service runs on the Tier-1 Edge GW). The cluster definition is done via a Network Profile at the Org VDC level.

In order to use DHCPv6, the network must be configured in Network mode and attached to an Org VDC Edge GW whose SLAAC profile is configured in DHCPv6 mode.

Other Features

  • vSphere Distributed Switch support for NSX-T segments (also known as Converged VDS), although this feature was already available in VCD 10.1.1+
  • NSX-T IPSec VPN support in UI
  • NSX-T L2VPN support, API only
  • port group backed external networks (used for NSX-V backed Org VDCs) can now have multiple port groups from the same vCenter Server instance (useful if you have vDS per cluster for example)
  • /31 external network subnets are supported
  • Org VDC Edge GW object now supports metadata

NSX-V vs NSX-T Feature Parity

Let me conclude with an updated chart showing comparison of NSX-V vs NSX-T features in VMware Cloud Director 10.2. I highlighted new additions in green.

VMware Cloud Director – Storage IOPS Management – Part II

This is a follow-up to the article I posted about a year ago. It describes the new IOPS management functionality in VMware Cloud Director (VCD) 10.2.

Next to compute, networking and storage capacity, storage IOPS is a limited resource that service providers want to manage in order to fairly share the underlying physical resources in a multitenant environment.

As described in the original article, VCD already supported storage IOPS management; however, the feature was quite hidden and available only via API. The recent release of VMware Cloud Director not only fully exposes the functionality in the UI but also adds some new functionality. Let’s dive into it.

There are now two main mechanisms for managing IOPS.

vCenter Server managed IOPS

This mechanism relies on setting IOPS limits at the storage policy level directly in vCenter Server. That is possible with host based and with vSAN based storage policies. The mechanism is quite simple: when a VM disk is provisioned to such an IOPS limited storage policy, it inherits the IOPS limit, a constant number per policy. You will not be able to set proportional IOPS based on disk capacity.

vSAN Storage Policy with IOPS Limit
Host Based non vSAN Storage Policy with IOPS Limit

I would recommend using this mechanism only if you want to avoid noisy neighbors. The concept is not new: VCD has been able to use such vSAN policies for some time, and host based policies were already supported in VCD 10.1. The only difference is that now in 10.2 the tenant will see the limit reservation set at the VM disk level but will not be able to change it.

Non-editable Disk IOPS

VCD Managed IOPS

This is a much more sophisticated mechanism where you can really manage IOPS as a pool of available capacity that you slice and allocate to tenant Org VDCs. This is the mechanism that until now was available only via API.

You start by tagging your datastores with their IOPS capacity. That has not changed and still must be done from within VC via custom properties.

At Provider VDC level you can then create IOPS managed storage policies and define their service level in terms of disk IOPS defaults, maximums or IOPS allocation based on disk size (0 means unlimited).
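To make the interplay of these settings more concrete, here is a small sketch of how a disk's IOPS value could be derived from them. It is an assumption about the semantics, not VCD's actual algorithm: I treat 0 as unlimited everywhere, use the default when the tenant requests nothing, and cap the result by both the per-disk maximum and a per-GB maximum scaled by disk size. The function and parameter names are hypothetical.

```python
from typing import Optional


def disk_iops(size_gb: int, requested: Optional[int],
              default_iops: int, max_iops: int, max_iops_per_gb: int) -> int:
    """Return the IOPS value for a disk; 0 means unlimited throughout this sketch."""
    wanted = default_iops if requested is None else requested
    # Collect every limit that is actually set (non-zero) and take the tightest one.
    limits = [value for value in (wanted, max_iops, max_iops_per_gb * size_gb) if value > 0]
    return min(limits) if limits else 0


# 100 GB disk, policy: default 500, max 2000 per disk, max 10 IOPS per GB.
print(disk_iops(100, None, 500, 2000, 10))   # nothing requested, default applies
print(disk_iops(100, 5000, 500, 2000, 10))   # request is capped by size (10 * 100)
```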

This storage policy configuration can be inherited or overridden at Org VDC level. This is a big improvement compared to the old approach where you always had to create such storage policies at Org VDC level.

Another new thing is that you can disable the IOPS placement mechanism for such a storage policy. This is useful in case you want to use Datastore Clusters. VCD will then no longer try to place each virtual disk based on a particular datastore's available IOPS. The placement decision is instead made by vCenter Server, so you should enable Storage DRS with I/O balancing automation. In that case there is no need to tag individual datastores in VC with their IOPS capacity.

Some of the old caveats still apply:

  • Disk IOPS can be assigned only to regular VMs or named (independent) disks, not to VM templates.
  • The disk IOPS will be always allocated against the Org VDC storage profile even if the VM is powered-off. This means the cloud provider can oversubscribe IOPS at the provider VDC storage profile level.
  • System administrator can override IOPS limits when deploying/editing tenant VMs in the system context.

Load Balancing with Avi in VMware Cloud Director

VMware Cloud Director 10.2 adds network load balancing (LB) functionality to NSX-T backed Organization VDCs. It does not use the native NSX-T load balancer capabilities but instead relies on technology from Avi Networks, which was acquired by VMware about a year ago and has since been rebranded to VMware NSX Advanced Load Balancer. I will call it Avi for short in this article.

The way Avi works is quite different from the way load balancing worked in NSX-V or NSX-T. Understanding the differences and Avi architecture is essential to properly use it in multitenant VCD environments.

I will focus only on the comparison with the NSX-V LB as this is relevant for VCD (the NSX-T legacy LB was never a viable option for VCD environments).

In VCD, in an NSX-V backed Org VDC the LB runs on the Org VDC Edge Gateway (VM), which can have four different sizes (compact, large, quad large and extra large) and be in a standalone or active / standby configuration. That Edge VM also needs to perform routing, NATing, firewalling, VPN, DHCP and DNS relay. Load balancer on a stick is not an option with NSX-V in VCD. The LB VIP must be an IP assigned to one of the external or internally attached network interfaces of the Org VDC Edge GW.

Enabling load balancing on an Org VDC Edge GW in such case is easy as the resource is already there. 

In the case of Avi LB, the load balancing is performed by external components (dedicated to load balancing), which adds more flexibility, scalability and features but also means more complexity. Let’s dive into it.

You can look at Avi as another separate platform solution similar to vSphere or NSX – where vSphere is responsible for compute and storage, NSX for routing, switching and security, Avi is now responsible for load balancing.

A picture is worth a thousand words, so let me put this diagram here first and then dig deeper.

Control Path

You start by deploying an Avi Controller cluster (three nodes for high availability), which integrates with vSphere (used for compute/storage) and NSX-T (for routing LB data and control plane traffic). The controllers would sit somewhere in your management infrastructure next to all other management solutions.

The integration is done by setting up a so-called NSX-T Cloud in Avi, where you define the vCenter Server (only one is supported per NSX-T Cloud) and NSX-T Manager endpoints and the NSX-T overlay transport zone (with a 1:1 relationship between TZ and NSX-T Cloud definition). Those would be your tenant/workload VC/NSX-T.

You must also point to a pre-created management network segment that will be used to connect all load balancing engines (more on them later) so they can communicate with the controllers for management and control traffic. To do so, in NSX-T you would set up a dedicated Tier-1 (Avi Management) GW with the Avi Management segment connected and DHCP enabled. The expectation is that the Tier-1 GW can reach the Avi Controllers through the Tier-0.

Data Path

Avi Service Engines (SE) are the VM resources that perform the load balancing. They are similar to NSX-T Edge Nodes in the sense that load balancing virtual services can be placed on any SE node based on capacity or reservations (just as a Tier-1 GW can be placed on any Edge Node). There is no strict relationship between a tenant’s LB and an SE node. An SE node can be shared across Org VDC Edge GWs or even tenants.

An SE node is a VM with up to 10 network interfaces (NICs). One NIC is always needed for the management/control traffic (blue network). The remaining nine are used to connect to Org VDC Edge GWs (Tier-1 GWs) via Service Network logical segments (yellow and orange). The service networks are created by VCD when you enable the load balancing service on the Org VDC Edge GW, together with a DHCP service to provide IP addresses for the attached SEs. A service network gets the 192.168.255.0/25 subnet by default, but the system admin can change it if it clashes with existing Org VDC networks. Service Engines run each service interface in a different VRF context, so there is no need to worry about IP conflicts or cross-tenant communication.

When a load balancing pool and virtual service are configured by the tenant, Avi automatically picks a Service Engine to instantiate the LB service. It might even need to first (automatically) deploy an SE node if there is no existing capacity. When the SE is assigned, Avi configures a static route (/32) on the Org VDC Edge GW, pointing the virtual service VIP (virtual IP) to the Service Engine IP address (from the tenant’s LB service network).

Note: Contrary to the NSX-V LB, the VIP can be almost any arbitrary IP address. It can be a routable external IP address allocated to the Org VDC Edge GW or any non-externally routed address, but it cannot clash with any existing Org VDC network or with the LB service network. If you use an external IP address allocated to the Org VDC Edge GW, you cannot use that address for anything else (e.g. SNAT or DNAT). That’s the way NSX-T works (no NAT and static routing at the same time). So for example, if you want to use the public address 1.2.3.4 for LB on port 80 but at the same time use it for SNAT, use an internal IP for the LB (e.g. 172.31.255.100) and create a DNAT port forwarding rule to it (1.2.3.4:80 to 172.31.255.100:80).
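The "no clash" rule for a candidate VIP is easy to check ahead of time. The sketch below uses only the Python standard library; the function name is hypothetical, the default service network is the 192.168.255.0/25 subnet mentioned earlier, and the Org VDC network in the example is made up. It does not check whether an external IP is already used for NAT.

```python
import ipaddress


def vip_is_usable(vip, org_vdc_networks, lb_service_network="192.168.255.0/25"):
    """Return True if the VIP falls into neither an Org VDC network nor the LB service network."""
    address = ipaddress.ip_address(vip)
    reserved = [ipaddress.ip_network(n) for n in org_vdc_networks]
    reserved.append(ipaddress.ip_network(lb_service_network))
    return not any(address in network for network in reserved)


org_networks = ["172.16.100.0/24"]  # hypothetical Org VDC network
print(vip_is_usable("172.31.255.100", org_networks))  # internal VIP, no clash
print(vip_is_usable("192.168.255.10", org_networks))  # inside the LB service network
```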

Service Engine Groups

With the basics out of the way, let’s discuss how the service provider can manage the load balancing quality of service: performance, capacity and availability. This is done via Service Engine Groups (SEG).

SEGs are (today) configured directly in Avi Controller and imported into VCD. They specify SE node sizing (CPU, RAM, storage), bandwidth restrictions, virtual services maximums per node and availability mode.

The availability mode needs more explanation. Avi supports four availability modes:

  • A/S … legacy (only two nodes are deployed), service is active only on one node at a time and standby on the other, no scale out support (service across nodes), very fast failover
  • A/A … elastic, service is active on at least two SEs, session info is proactively replicated, very fast failover
  • N+M … elastic, N is the number of SE nodes the service is scaled over, M is a buffer in number of failures the group can sustain, slow failover (because the controller needs to re-assign services), but efficient SE utilization
  • N+0 … same as N+M but with no buffer, the controller will deploy new SE nodes when a failure occurs. The most efficient use of resources but the slowest failover time.

The base Avi licensing supports only legacy A/S high availability mode. For best availability and performance usage of elastic A/A is recommended.

As mentioned Service Engine Groups are imported into VCD where the system administrator makes a decision whether SEG is going to be dedicated (SE nodes from that group will be attached to only one Org VDC Edge GW) or shared.

Then when load balancing is enabled on a particular Org VDC Edge GW, the service provider assigns one or more SEGs to it together with capacity reservation and maximum in terms of virtual services for the particular Org VDC Edge GW.

Use case examples:

  • A/S dedicated SEG for each tenant / Org VDC Edge GW. Avi will create two SE nodes for each LB enabled Org VDC Edge GW and will provide similar service as LB on NSX-V backed Org VDC Edge GW did. Does not require additional licensing but SEG must be pre-created for each tenant / Org VDC Edge GW.
  • A/A elastic shared across all tenants. Avi will create a pool of SE nodes that are going to be shared. Only one SEG is created. Capacity allocation is managed in VCD, Avi elastically deploys and undeploys SE nodes based on actual usage (the usage is measured in number of virtual services, not actual throughput or requests per second).

Service Engine Node Placement

The service engine nodes are deployed by Avi into the (single) vCenter Server associated with the NSX-T Cloud and they live outside of VMware Cloud Director management. The placement is defined in the service engine group definition (you must use Avi 20.1.2 or newer). You can select a vCenter Server folder and limit the scope of deployment to a list of ESXi hosts and datastores. Avi has no understanding of vSphere host clusters, datastore clusters or resource pools. Avi will also not configure any DRS anti-affinity for the deployed nodes (but you can do so post-deployment).

Conclusion

The whole Avi deployment process for the system admin is described in detail here. The guide in the link refers to general Avi deployment of NSX-T Cloud, however for VCD deployment you would just stop before the step Creating Virtual Service as that would be done from VCD by the tenant.

Avi licensing is basic or enterprise and is set at Avi Controller cluster level. So it is possible to mix both licenses for two tier LB service by deploying two Avi Controller cluster instances and associating each with a different NSX-T transport zone (two vSphere clusters or Provider VDCs).

The feature differences between basic and enterprise editions are quite extensive and complex. Besides Service Engine high availability modes, the other important differences are access to metrics, the number of application types, health monitors and pool selection algorithms.

The Avi usage metering for licensing purposes is currently done via a Python script that is run on the Avi Controller to measure the total high-water mark of Service Engine vCPU usage during a given period; the result must be reported manually. The basic license is included for free with VCPP NSX usage and is capped at 1 vCPU per 640 GB of reported vRAM of NSX base usage.
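The cap is simple arithmetic, sketched below. The 640 GB ratio comes from the text above; rounding down to whole vCPUs is my own assumption, the function names are hypothetical, and this is not the actual metering script.

```python
VRAM_GB_PER_FREE_VCPU = 640  # 1 Service Engine vCPU per 640 GB of reported vRAM


def free_basic_vcpu_cap(reported_vram_gb: float) -> int:
    """Return the number of SE vCPUs covered by the free basic license (rounded down)."""
    return int(reported_vram_gb // VRAM_GB_PER_FREE_VCPU)


def within_free_cap(se_vcpu_high_mark: int, reported_vram_gb: float) -> bool:
    """Compare the measured vCPU high-water mark against the free cap."""
    return se_vcpu_high_mark <= free_basic_vcpu_cap(reported_vram_gb)


# Hypothetical provider reporting 6400 GB of vRAM: the cap works out to 10 vCPUs.
print(free_basic_vcpu_cap(6400), within_free_cap(8, 6400))
```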

Update 2020/10/23: Make sure to check interoperability matrix. As of today only Avi 20.1.1 is supported with VCD 10.2.