VXLAN on Routed Transport Network

One of the major benefits of VXLAN technology is that it allows creating virtual Layer 2 segments over Layer 3 routed networks. VTEPs (VXLAN Tunnel End Points) encapsulate and decapsulate the Ethernet frames of VMs on virtual networks and send them as UDP packets. However, there still must be a mechanism that lets the sending VTEP find the receiving VTEPs for broadcast, unknown unicast and multicast (BUM) traffic.

In NSX we can use multicast, hybrid and unicast modes. Hybrid and unicast modes leverage the controller cluster, which has knowledge of the entire VTEP topology. In vCloud Network and Security (vCNS), however, we can use only multicast mode.

While setting up multicast in a flat Layer 2 network is very easy and only requires enabling IGMP snooping and a querier on the physical switch infrastructure, routed multicast is much harder. That is why the hybrid and unicast modes that NSX provides are so useful. In unicast mode all BUM traffic is replicated by VTEPs. In hybrid mode, multicast is used within each L2 segment of the transport network, while unicast is used to send the traffic to the other segments, where it is then replicated locally.

In my recent VXLAN deployment, however, we had to stick to pure multicast mode as we used vCNS. To route multicast traffic, PIM-SM (Protocol Independent Multicast in Sparse Mode) with a rendezvous point was enabled on the physical router. However, it turned out that setting up the VTEPs is not straightforward and not very well documented, with some misinformation in blog posts I found on the web.

Each VTEP needs to have an IP address assigned. In vCNS the assignment happens over DHCP only.

Auto-assigned VTEP IP address

Besides DHCP, NSX also provides the ability to use IP pools. As we were using vCNS and had no DHCP servers in the VXLAN transport network, we had to go into each host and manually assign the VTEP vmkernel port IP address through the vSphere Client. Unfortunately this is not enough for routed communication on the transport network. A default gateway in the VXLAN network stack must be defined as well.

Missing gateway

As can be seen in the screenshot above, the default gateway is not configurable via the GUI and must be added through the ESXi CLI. Originally we created a static route to the other segment, but that is not enough (actually it is not needed at all); instead the default gateway must be defined with the following command.

esxcli network ip route ipv4 add -n default -g <gateway IP> -N vxlan

where <gateway IP> is the gateway IP address and vxlan is the networking stack.

That the gateway is set properly can be verified with the net-vdl2 -l command.
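Because the gateway has to be set on every host by hand, it is easy to mistype it. The following is only a minimal sketch of a helper that checks the gateway sits in the VTEP subnet and prints the command shown above. The function name and the addresses are made up for illustration and are not part of any VMware tooling.

```python
import ipaddress


def vxlan_gateway_command(vtep_cidr: str, gateway: str) -> str:
    """Build the esxcli command that sets the vxlan netstack default gateway.

    vtep_cidr is the VTEP vmknic address with prefix length,
    e.g. "192.168.10.21/24" (a hypothetical address).
    """
    vtep = ipaddress.ip_interface(vtep_cidr)
    gw = ipaddress.ip_address(gateway)
    # A default gateway is only reachable if it sits in the VTEP's own subnet.
    if gw not in vtep.network:
        raise ValueError(f"gateway {gw} is outside the VTEP subnet {vtep.network}")
    if gw == vtep.ip:
        raise ValueError("gateway must differ from the VTEP address")
    return f"esxcli network ip route ipv4 add -n default -g {gw} -N vxlan"


if __name__ == "__main__":
    # Hypothetical VTEP address and gateway; replace with your own.
    print(vxlan_gateway_command("192.168.10.21/24", "192.168.10.1"))
```

The printed line can then be pasted into an SSH session on the host.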



Troubleshooting Multicast with Linux

I was looking for a lightweight tool that would help me with troubleshooting multicast on the VXLAN transport network (underlay). While both vCNS and NSX have built-in tools (pings of various sizes and broadcast packets), I needed something more flexible where I could do arbitrary IGMP joins and leaves.

I used a CentOS VM with one interface directly on the transport network and the SMCRoute software. This link contains a binary package that works on RHEL/CentOS. Some other notes:

  • If you have multiple interfaces, make sure the multicast is routed through the correct one:
    route add -net <multicast network> netmask <netmask> dev eth0
  • I also had to install the glibc package:

    yum -y install glibc.i686

  • Make sure the kernel supports multicast

    cat /boot/config-<kernel version> | grep CONFIG_IP_MULTICAST

  • Enable ICMP ECHO on broadcast/multicast

    sysctl net.ipv4.icmp_echo_ignore_broadcasts=0

  • Start the smcroute daemon first:
    smcroute -d

To join and leave a multicast group use the -j and -l options:
smcroute -j eth0 <multicast group>
smcroute -l eth0 <multicast group>
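If SMCRoute is not at hand, a membership can also be held from a few lines of Python, because the kernel sends the IGMP report as soon as a socket joins a group. This is only a sketch using the standard socket options IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP. The group 239.1.1.1 and the interface address 192.168.10.50 are hypothetical values.

```python
import ipaddress
import socket
import time


def build_mreq(group: str, local_ip: str = "0.0.0.0") -> bytes:
    """Pack struct ip_mreq: the group address followed by the local interface address."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def join_group(group: str, local_ip: str = "0.0.0.0") -> socket.socket:
    """Join the group; the membership lasts as long as the socket stays open."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, build_mreq(group, local_ip))
    return sock


def leave_group(sock: socket.socket, group: str, local_ip: str = "0.0.0.0") -> None:
    """Leave the group explicitly and close the socket."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, build_mreq(group, local_ip))
    sock.close()


if __name__ == "__main__":
    # Hypothetical group and interface address; adjust to your transport network.
    s = join_group("239.1.1.1", "192.168.10.50")
    time.sleep(60)  # keep the membership long enough to inspect it on the switch/router
    leave_group(s, "239.1.1.1", "192.168.10.50")
```

Passing the interface address rather than 0.0.0.0 matters on a multi-homed VM, for the same reason as the route note above.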

To check current memberships use:
netstat -ng

or

ip maddr

The IGMP version can be changed with the following command:

echo 2 > /proc/sys/net/ipv4/conf/eth0/force_igmp_version

Additional useful statistics about IGMP joins:

cat /proc/net/igmp
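The group addresses in /proc/net/igmp are printed as hexadecimal numbers, and on little-endian machines (x86) the bytes appear reversed, so 010000E0 is 224.0.0.1. A small sketch that decodes the file into dotted addresses per interface follows. The sample text is a made-up excerpt and the little-endian byte order is an assumption that holds for typical x86 VMs.

```python
import socket


def decode_group(hex_group: str, byteorder: str = "little") -> str:
    """Convert a group as printed in /proc/net/igmp to dotted notation."""
    raw = int(hex_group, 16).to_bytes(4, byteorder)
    return socket.inet_ntoa(raw)


def parse_igmp(text: str, byteorder: str = "little") -> dict:
    """Return {device: [group, ...]} from the contents of /proc/net/igmp."""
    memberships = {}
    device = None
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if not fields:
            continue
        if line[0].isdigit():
            # Device line, e.g. "2  eth0  :  2  V2"
            device = fields[1].rstrip(":")
            memberships[device] = []
        elif device is not None:
            # Group line, indented with tabs, e.g. "010101EF  1 0:00000000  0"
            memberships[device].append(decode_group(fields[0], byteorder))
    return memberships


# Made-up sample in the layout the kernel prints.
SAMPLE = (
    "Idx\tDevice    : Count Querier\tGroup    Users Timer\tReporter\n"
    "1\tlo        :     1      V3\n"
    "\t\t\t\t010000E0     1 0:00000000\t\t0\n"
    "2\teth0      :     2      V2\n"
    "\t\t\t\t010101EF     1 0:00000000\t\t0\n"
    "\t\t\t\t010000E0     1 0:00000000\t\t0\n"
)

if __name__ == "__main__":
    with open("/proc/net/igmp") as handle:
        for dev, groups in parse_igmp(handle.read()).items():
            print(dev, groups)
```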

To see which hosts are members of a particular IGMP group, just ping it and see who replies:

[root@CentOS~]# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from icmp_seq=1 ttl=64 time=0.256 ms (DUP!)

Two hosts replied to the ping on the multicast group.

How To Change VXLAN VTEP MTU Size and Teaming Policy

One of my customers had configured VXLAN in a vCloud Director environment and then created multiple Provider and Org VDCs and deployed virtual networks. Then we found out that the MTU and teaming policy configuration was set up incorrectly. Redeployment of the whole environment would take too much time. Fortunately there is a way to fix this without a rip-and-replace approach.

First, a little bit of background. VXLAN VTEPs are configured in vShield Manager or in NSX Manager (via the vSphere Web Client plugin) at the cluster/distributed switch level. vShield/NSX Manager creates one distributed switch port group with the given parameters (VLAN, teaming policy) and then, for each host added to the cluster, creates a VTEP vmknic (with the configured MTU size and DHCP/IP Pool addressing scheme). This means that the teaming policy can be easily changed at the vSphere level by editing the distributed switch port group directly, and the MTU size can be changed on each host's VTEP vmknic. However, every new host deployed into the VXLAN-prepared cluster would still use the wrong MTU size set in vShield/NSX Manager. Note that as there can be only one VTEP port group per distributed switch, clusters sharing the same vSwitch need to have identical VTEP teaming policy and VLAN ID.

The actual vCNS/NSX Manager VTEP configuration can be changed via the following REST API call:

PUT https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches/<switch ID>

with the Body containing the new configuration.

Example using Firefox RESTClient plugin:

  1. Install Firefox RESTClient plugin.
  2. Make sure vCNS/NSX Manager certificate is trusted by Firefox.
  3. In Firefox toolbar click on RESTClient icon.
  4. Create authentication header: Authentication > Basic Authentication > enter vCNS/NSX Manager credentials
  5. Select GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches
    VDS Contexts
  6. This will retrieve all vswitch contexts in vCNS/NSX domain. Find ID of the one you want to change and use it in the following GET call
  7. Select GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches/<switch-ID>
    VDS Context
  8. Now copy the Response Body and paste it into the Request Body box. In the XML edit the parameters you want to change. In my case I have changed:
    <mtu>9000</mtu> to <mtu>1600</mtu> and
    <teaming>ETHER_CHANNEL</teaming> to <teaming>FAILOVER_ORDER</teaming>
  9. Change the method to PUT and add a new header: Content-Type: application/xml.
    PUT Request
  10. Send the request. If everything went successfully we should get Status Code: 200 OK response.
    OK Response
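The edit in step 8 can also be scripted instead of being done by hand in the Request Body box. Below is a minimal sketch that rewrites the <mtu> and <teaming> elements of a response body. Only those two element names come from the steps above; the surrounding sample XML (root element, switch ID) is a shortened, assumed stand-in for the real response, and sending the result back with PUT is left out.

```python
import xml.etree.ElementTree as ET


def update_vds_context(xml_text: str, mtu=None, teaming=None) -> str:
    """Return the XML body with <mtu> and/or <teaming> replaced."""
    root = ET.fromstring(xml_text)
    for tag, value in {"mtu": mtu, "teaming": teaming}.items():
        if value is None:
            continue  # leave this parameter unchanged
        element = root.find(f".//{tag}")
        if element is None:
            raise KeyError(f"<{tag}> not found in the response body")
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Shortened, assumed example of a GET response body; the real one has more elements.
SAMPLE_BODY = (
    "<vdsContext>"
    "<switch><objectId>dvs-21</objectId></switch>"
    "<mtu>9000</mtu>"
    "<teaming>ETHER_CHANNEL</teaming>"
    "</vdsContext>"
)

if __name__ == "__main__":
    print(update_vds_context(SAMPLE_BODY, mtu=1600, teaming="FAILOVER_ORDER"))
```

Everything that is not touched is passed through unchanged, which mirrors the copy, edit and PUT workflow of the manual steps.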

Now, in the vSphere Client, we need to change the MTU size of the VTEP vmknic on all existing hosts to the new value and also change the teaming policy on the VTEP portgroup (in my case from Route based on IP hash to Use explicit failover order).

vCloud Network and Security (vShield Manager) supports the following teaming policies:

  • LACP_V2

NSX adds the following two teaming policies for multiple VTEP vmknics:


Update 9/22/2014

Existing VXLAN VNI portgroups (virtual wires) will keep the original teaming policy, therefore they need to be changed to match the new one as well.

When using the FAILOVER_ORDER teaming policy, the XML must also specify the uplinks. The uplinks should use the names as defined at the distributed switch level.

<uplinkPortName>Uplink 2</uplinkPortName>
<uplinkPortName>Uplink 1</uplinkPortName>

Multitenant Service Network in vCloud Director

Service providers often have to provide additional services to their cloud tenants. An example is providing licensing services (KMS) for Windows VMs deployed from a provider-managed catalog, or RHEL Satellite servers for licensing and patching Red Hat VMs. The question is then where to deploy these shared services virtual machines so they are securely available in a multitenant environment.

In my older blog post Centralized Logging in vCloud Director Environments I described how a shared vCloud Director external logging network can be used to collect logs from Edge Gateways. So the idea is to use the same network for connection to the shared services VMs (KMS/Satellite) running in the Administration Organization. The Edge Gateway can have only 10 interfaces, so it is good that we do not waste another one. Let’s have a look at the following diagram:

Edge GW Logging and admin services

We have 3 organizations and one Org VDC in each: Customer 1 and Customer 2 (the tenants) and the Admin Organization (managed by the provider). The tenants connect their vApps to the shared internet network (yellow) via the Edge Gateways by using sub-allocated public addresses (8.8.8.x) and source or destination NAT of their Org VDC network. Each Edge Gateway is connected to another vCloud external network (black) that is used both for Edge logging and for access to shared services running in the Admin Organization.

Notice that there are two IP subnet ranges assigned to the service external network. The first one is used solely for Edge logging. The syslog server sits in this network and a firewall in front of it ensures that only Edge logs get there. The Edge Gateway IP from this network is not sub-allocated for tenant use, so tenants cannot create NAT rules with it. They could only route (one way) from their Org VDC networks and send UDP packets, but the syslog firewall denies such traffic as it is coming from internal IPs.

The second IP subnet range of the service network is used for the communication to the service VMs running in the Admin Organization. So how is this achieved securely?

  1. The provider sub-allocates the Edge IP from this range to the tenant so the tenant can create NAT rules: one address is sub-allocated to Customer 1, another to Customer 2.
  2. The provider pre-creates an SNAT rule for each deployed Edge Gateway. The rule must be applied on the Service network, the original IP range is everything and the translated IP is the sub-allocated IP of the Edge.
    SNAT rule
    The tenant has to be told not to delete or alter the rule, otherwise his access to shared services will stop working.
  3. The provider creates destination NAT rules for his service VMs running in the Admin Organization. To do this he first needs to have sub-allocated IP addresses, which he then DNATs to the VM internal IPs. Obviously port forwarding could be used as well to save some IPs, as long as the port numbers of the services are not the same.

That’s it. Any traffic from the tenant’s VM to the external IP address of the service VM will be SNATed by the tenant Edge GW, DNATed by the Admin Edge GW and securely delivered without the tenants being able to contact each other (unless they create DNAT rules as well, which could be prevented by MAC ACLs on the external network).
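To make the two translations easier to follow, here is a toy model of the packet path in Python. It is only an illustration of the SNAT and DNAT steps described above, not how the Edge implements NAT, and every address in it is a hypothetical placeholder rather than one from the diagram.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def tenant_edge_snat(packet: Packet, edge_service_ip: str) -> Packet:
    """Tenant Edge: hide any Org VDC source behind the sub-allocated Edge IP."""
    return replace(packet, src=edge_service_ip)


def admin_edge_dnat(packet: Packet, dnat_rules: dict) -> Packet:
    """Admin Edge: map the external service IP to the internal VM IP."""
    return replace(packet, dst=dnat_rules.get(packet.dst, packet.dst))


# Hypothetical addresses for illustration only.
TENANT_EDGE_IP = "172.31.0.11"                   # sub-allocated to Customer 1
ADMIN_DNAT = {"172.31.0.100": "192.168.100.10"}  # external KMS IP -> internal KMS VM

if __name__ == "__main__":
    original = Packet(src="10.0.1.5", dst="172.31.0.100")  # tenant VM -> service IP
    on_service_net = tenant_edge_snat(original, TENANT_EDGE_IP)
    delivered = admin_edge_dnat(on_service_net, ADMIN_DNAT)
    print(original, on_service_net, delivered, sep="\n")
```

On the service network the packet only ever carries the two Edge-owned addresses, which is why overlapping tenant Org VDC ranges do not matter there.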

I would also advise using some obscure IP ranges for the service network so they do not overlap with customer-defined Org VDC network ranges.

VCP-NV Exam Experience

During VMworld 2014 VMware released a new certification track: Network Virtualization. There is already quite a large number of bootstrapped VCDX-NVs, which is the highest certification level, and it is now also possible to schedule the entry-level VCP-NV exam.


As I think that NSX is a great technology, I am going for this certification track. I immediately scheduled the VCP-NV at my nearby Pearson VUE test center and took the exam today.

While not having much time for preparation I obviously downloaded the exam blueprint and was surprised how extensive it is –  nine objective categories ranging from the NSX architecture, VXLAN, distributed routing and firewalling, Edge services up to service composer, vSphere standard and distributed switch features and vCloud Automation Center integration. From the sheer content it looks like it is not going to be a simple exam.

I have been working with NSX for some time, so I was pretty confident in all the areas. Prior to the exam I reviewed the areas I work with less (Service Composer, Activity Monitoring, dynamic routing protocols such as BGP and IS-IS) and went through the packet walks (VM, VTEP, Controller, Multicast, Unicast, etc.) for switched, routed and bridged traffic.

In April I passed the Cisco CCNA certification, so this gave me a good opportunity to compare these two entry-level networking exams from two major vendors with completely different SDN strategies.

VCP-NV is obviously heavily based on VMware NSX, so do not expect much OpenFlow SDN or any Cisco ACI there. Compared to CCNA there is also no basic network theory (subnetting, OSI model, protocols). There are 120 questions in a 2-hour time window, which is quite a lot. But all are multiple choice questions, with no CLI simulators or Flash-based questions. The questions cover all blueprint areas and my assumption is they are up to the level of the VMware NSX: Install, Configure, Manage training, which I did not take (only its VMware internal bootcamp predecessor). I was able to go through the test quite quickly; there is usually no reason to dwell on a particular question longer than 30 seconds. You either know the answer or not.

The questions were mostly clearly written, which made taking the exam quite an enjoyable experience (well, it might have been shorter). You get the result immediately and in my case it was a pass.

My recommendation for potential candidates: know vSphere networking (including the advanced features such as NetFlow and Port Mirroring) and have hands-on experience with NSX. If you cannot get the bits or do not have a lab, use the NSX Hands-On Labs, which are really good. And lastly, take the NSX ICM course!

Now back to my VCDX-NV design…