vCloud Director 9: Create VXLAN Network Pool

vCloud Director uses Network Pools to programmatically create on-demand L2 network segments for Org VDC and vApp networks. Network pools can be based on VLANs, VXLAN, port groups and the legacy (deprecated) vCloud Network Isolation (VCDNI) technology.

The VXLAN Network Pool is the recommended option as it scales the best. Until version 9, vCloud Director would automatically create a new VXLAN Network Pool for each Provider VDC, backed by an NSX Transport Zone (also created automatically) scoped to the clusters belonging to that particular Provider VDC. This resulted in multiple VXLAN network pools and potential confusion about which one to use for a particular Org VDC.

In vCloud Director 9 we have the option to create our own VXLAN network pool, backed by a manually created NSX Transport Zone scoped to whichever clusters we want (and using any control plane mode).

During Provider VDC creation we then have the choice to create a new VXLAN Network Pool (the legacy behavior) or to use an existing one.
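
For illustration, such a manually scoped Transport Zone can be created directly against the NSX Manager API before it is consumed by a new VXLAN network pool. The following curl call is only a sketch assuming NSX for vSphere; the NSX Manager host name, credentials, zone name and cluster MoRef (domain-c26) are placeholders to replace with real values:

# Create a Transport Zone (vdn scope) spanning one cluster, using unicast control plane mode
curl -k -u 'admin:password' -X POST \
  -H 'Content-Type: application/xml' \
  -d '<vdnScope>
        <name>TZ-vCloud</name>
        <clusters>
          <cluster>
            <cluster>
              <objectId>domain-c26</objectId>
            </cluster>
          </cluster>
        </clusters>
        <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
      </vdnScope>' \
  https://nsxmanager.example.com/api/2.0/vdn/scopes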

Advantages of the new feature are:

  • No more clutter from a large number of VXLAN network pools (if there are many Provider VDCs)
  • Simpler way to use hybrid or unicast control plane modes (vCloud Director would always default to multicast before)
  • Control over scope of VXLAN networks – especially useful for sharing Org VDC networks between Org VDCs from different Provider VDCs.
  • Adhering to the best practice of scoping the transport zone to the whole vDS (more here)

VCDNI to VXLAN Migration

vCloud Network Isolation (VCDNI or VCNI) is a legacy mechanism for creating overlay logical networks independently of the physical network underlay. It was originally used in VMware vCenter Lab Manager (where it was known as Cross Host Fencing). vCloud Director offers it as one of several mechanisms for the creation of logical networks (next to VXLAN, VLAN and port group backings). VCDNI uses a VMware proprietary MAC-in-MAC encapsulation performed by the vCloud Agent running in the ESXi host vmkernel.

It has for some time been superseded by VXLAN, which is much more scalable, provides better performance and is an industry-standard technology. VXLAN network pools have been available in vCloud Director since version 5.1.

VCDNI is consumed by manually creating a vCloud Network Isolation backed network pool that is mapped to an underlay VLAN, with up to 1000 logical networks per pool (VLAN).

As a deprecated and obsolete technology, it is no longer supported in vSphere 6.5, and vCloud Director 8.20 is the last release that supports such network pools. vCloud Director 8.20 also provides a simple mechanism to perform low-disruption migration of Org VDC and vApp networks to VXLAN backed networks. Such migration must be done before the upgrade to vSphere 6.5 (see more in KB 2148381).

The migration can be performed via the UI or the API by the system administrator, with Org VDC granularity.

Migration via UI

  1. For an Org VDC using a VCDNI network pool, open the Org VDC properties in the System tab – Manage & Monitor (note that doing the same from the Org tab will not work).
    org-vdc
  2. Go to the Network Pool & Services tab, change the VCDNI backed network pool to a VXLAN backed one, and click OK.
    network-pool
  3. Open the Network Pool & Services tab of the Org VDC again. A Migrate to VXLAN button will now appear.
    migrate-to-vxlan
  4. Click the button, confirm the message and start the migration.
    confirmation
  5. After a while the Org VDC status will change from busy to ready and the migration is finished. Details (and possible errors) can be reviewed in Recent Tasks or the Audit Log.
    audit-log

Migration with vCloud API

Org VDC network migration is triggered by a single API POST call at the Org VDC level.

POST /api/admin/vdc/<org VDC UUID>/migrateVcdniToVxlan
Content-Type: application/vnd.vmware.admin.vdcnitovxlanmigration+xml
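
For illustration, here is a minimal curl sketch of the same call, assuming a system administrator session and vCloud API version 27.0; the host name, credentials and Org VDC UUID are placeholders:

# Log in as system administrator and capture the session token
# returned in the x-vcloud-authorization response header.
TOKEN=$(curl -ski -X POST -u 'administrator@System:password' \
  -H 'Accept: application/*+xml;version=27.0' \
  https://vcloud.example.com/api/sessions | grep -i 'x-vcloud-authorization' | awk '{print $2}' | tr -d '\r')

# Trigger the VCDNI to VXLAN migration for the Org VDC (an empty request body is assumed).
curl -k -X POST \
  -H "x-vcloud-authorization: $TOKEN" \
  -H 'Accept: application/*+xml;version=27.0' \
  -H 'Content-Type: application/vnd.vmware.admin.vdcnitovxlanmigration+xml' \
  "https://vcloud.example.com/api/admin/vdc/<org VDC UUID>/migrateVcdniToVxlan"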

The Process

The following happens in the background for each VCDNI backed network in the Org VDC when the migration is triggered:

  1. A ‘dummy’ VXLAN logical switch is created
  2. All VMs connected to the VCDNI network are reconnected to the new VXLAN logical switch
  3. Edge Gateways connected to the VCDNI network are connected to the new VXLAN logical switch
  4. The Org VDC/vApp network backing is changed in the vCloud DB to use the new VXLAN logical switch
  5. The original VCDNI port group is deleted

A small network disruption is expected during VM and Edge Gateway reconnections. The following Recent Tasks screenshot from the vSphere Client shows what happens at the vCenter Server level and how much time each task can take. In the example, one Org VDC network and one vApp network were migrated, with VM1 and Edge Gateway ACME-GW2 involved.

vc-recent-tasks

Update 5/8/2017: Engineering informed me that, due to a vSphere bug, fenced parameters are not removed from NSX Edge VMs’ vmx files during the migration. This impacts the Edge’s connectivity to the migrated network. As a workaround, redeploy the Edge Gateway after the migration.
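
The redeploy can be done from the UI or scripted; the following is only a hedged sketch using the vCloud API Edge Gateway redeploy action (the Edge Gateway UUID is a placeholder and $TOKEN is a session token obtained as in the login example above; verify the action is available in your API version):

curl -k -X POST \
  -H "x-vcloud-authorization: $TOKEN" \
  -H 'Accept: application/*+xml;version=27.0' \
  "https://vcloud.example.com/api/admin/edgeGateway/<edge gateway UUID>/action/redeploy"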

Automate ESXi Host VTEP Default Gateway

As discussed in my older article, a routed VXLAN transport network requires setting the default gateway of the vxlan netstack on each ESXi host. While NSX has the concept of IP Pools, which allows automatic VTEP configuration (including the gateway), the older vCloud Networking and Security (vShield) technology does not have this feature and the VTEP IP address must be configured via DHCP or manually.

The following quick and dirty PowerCLI script shows how this can be automated at the cluster level:


$hosts = Get-Cluster Cluster1 | Get-VMHost
foreach ($vihost in $hosts) {
    # Get an EsxCli object for the host and print the host name
    $esxcli = Get-VMHost $vihost | Get-EsxCli
    $vihost.Name
    # Add the default route (gateway 10.40.0.1) for the vxlan netstack
    $result = $esxcli.network.ip.route.ipv4.add("10.40.0.1", "vxlan", "default")
    # Show the VTEP IP address and the vxlan routing table for verification
    $esxcli.network.ip.interface.ipv4.get("", "vxlan") | Format-List IPv4Address
    $esxcli.network.ip.route.ipv4.list("vxlan") | Format-Table
}

The script sets the vxlan netstack default gateway to 10.40.0.1 on each host in the cluster ‘Cluster1’ and displays each host’s name, VTEP IP address and vxlan routing table.

VXLAN Routing Script

Credit for esxcli to PowerCLI command conversion goes to Virten.net.


VXLAN on Routed Transport Network

One of the major benefits of VXLAN technology is that it allows creating virtual Layer 2 segments over Layer 3 routed networks. VTEPs (VXLAN Tunnel End Points) encapsulate and decapsulate Ethernet frames of VMs on virtual networks and send them as UDP packets. However, there still must be a mechanism that allows the sending VTEPs to find the receiving VTEPs for broadcast, unknown unicast and multicast (BUM) traffic.

In NSX we can use multicast, hybrid and unicast modes. Hybrid and unicast modes leverage the controller cluster, which has knowledge of the entire VTEP topology. However, in vCloud Networking and Security (vCNS) we can use only multicast mode.

While setting up multicast in a flat Layer 2 network is very easy and only requires enabling IGMP snooping and an IGMP querier on the physical switch infrastructure, routed multicast is much harder. That is why the hybrid and unicast modes that NSX provides are so useful. In unicast mode all BUM traffic is replicated by the VTEPs. In hybrid mode, multicast is used within each L2 segment of the transport network, while unicast is used to replicate the traffic to the other segments.

In my recent VXLAN deployment, however, we had to stick to pure multicast mode as we used vCNS. To route multicast traffic, the physical router was enabled for PIM-SM (Protocol Independent Multicast in Sparse Mode) with a rendezvous point. However, it turned out that setting up the VTEPs is not straightforward and not very well documented, with some misinformation in blog posts I found on the web.

Each VTEP needs to have an IP address assigned. In vCNS the assignment happens over the DHCP protocol only.

Auto-assigned VTEP IP address

Next to DHCP, NSX also provides the ability to use IP pools. As we were using vCNS and had no DHCP servers on the VXLAN transport network, we had to go to each host and manually assign the VTEP vmkernel port IP address through the vSphere Client. Unfortunately, this is not enough for routed communication on the transport network. The default gateway of the VXLAN network stack must also be defined.

Missing gateway

The default gateway must be added through the ESXi CLI; as can be seen in the screenshot above, it is not configurable via the GUI. Originally we created a static route to the other segment, but that is not enough (actually not needed at all); instead, the default gateway must be defined with the following command:

esxcli network ip route ipv4 add -n default -g 1.1.1.1 -N vxlan

where 1.1.1.1 is the gateway IP address and vxlan is the networking stack.
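
The new route can also be checked directly on the host with esxcli (the same listing the PowerCLI script in the previous section performs):

esxcli network ip route ipv4 list -N vxlan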

Verification that the gateway is set properly can be done with the net-vdl2 -l command.

net-vdl2


Troubleshooting Multicast with Linux

I was looking for a lightweight tool that would help me troubleshoot multicast on the VXLAN transport network (underlay). While both vCNS and NSX have built-in tools (pings of various sizes and broadcast packets), I needed something more flexible where I could do arbitrary IGMP joins and leaves.

I used a CentOS VM with one interface directly on the transport network and the SMCRoute software. This link contains a binary package that works on RHEL/CentOS. Some other notes:

  • If you have multiple interfaces, make sure multicast is routed through the correct one:
    route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
  • I also had to install the glibc package:

    yum -y install glibc.i686

  • Make sure the kernel supports multicast

    cat /boot/config-<kernel version> | grep CONFIG_IP_MULTICAST

  • Enable ICMP ECHO responses to broadcast/multicast:

    sysctl net.ipv4.icmp_echo_ignore_broadcasts=0

  • Start the smcroute daemon first:
    smcroute -d

To join and leave a multicast group, use the -j and -l commands:
smcroute -j eth0 239.0.0.1
smcroute -l eth0 239.0.0.1

To check current memberships use:
netstat -ng

or

ip maddr

The IGMP version can be changed with the following command:

echo "2" > /proc/sys/net/ipv4/conf/eth0/force_igmp_version

Additional useful statistics about IGMP joins can be displayed with:

cat /proc/net/igmp

To see which hosts are members of a particular IGMP group, just ping it and see who replies:

[root@CentOS~]# ping 239.1.0.10
PING 239.1.0.10 (239.1.0.10) 56(84) bytes of data.
64 bytes from 1.1.0.1: icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from 1.1.0.3: icmp_seq=1 ttl=64 time=0.256 ms (DUP!)

Hosts 1.1.0.1 and 1.1.0.3 replied to the ping of the 239.1.0.10 multicast group.
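
Putting the pieces together, here is a small sketch of a helper script that joins a group, pings it to see which VTEPs reply, and leaves the group again (the interface name, group address and ping count are assumptions, and smcroute -d must already be running):

#!/bin/bash
# Quick multicast membership test on the VXLAN transport network.
# Usage: ./mcast-test.sh [interface] [group]
IFACE=${1:-eth0}
GROUP=${2:-239.1.0.10}

smcroute -j "$IFACE" "$GROUP"    # join the multicast group on the given interface
ip maddr show dev "$IFACE"       # confirm the membership was added
ping -c 3 "$GROUP"               # members of the group reply; duplicates are marked DUP!
smcroute -l "$IFACE" "$GROUP"    # leave the group again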