VCD-SP 5.6 Upgradability

Just a short post.

vCloud Director 5.6.3, the first release solely for service providers, was released on Tuesday. There is, however, no upgrade path from vCloud Director 5.5.2, which was released a month ago. So if you are a service provider, do not upgrade to 5.5.2 unless you are willing to wait for the next VCD-SP release, which will support the upgrade.

Promiscuous Portgroup Myth

The topic of promiscuous portgroups on a virtual switch has come up lately from several directions, so I decided to summarize some information and also debunk one particular myth.

What is a promiscuous port? This is what Wikipedia says:

In computer networking, promiscuous mode or promisc mode is a mode for a wired network interface controller (NIC) or wireless network interface controller (WNIC) that causes the controller to pass all traffic it receives to the central processing unit (CPU) rather than passing only the frames that the controller is intended to receive. This mode is normally used for packet sniffing that takes place on a router or on a computer connected to a hub (instead of a switch)…

So how does this relate to a virtual environment and a virtual switch? VMware KB article 1002934 sheds some light here:

By default, a guest operating system’s virtual network adapter only receives frames that are meant for it. Placing the guest’s network adapter in promiscuous mode causes it to receive all frames passed on the virtual switch that are allowed under the VLAN policy for the associated portgroup. This can be useful for intrusion detection monitoring or if a sniffer needs to analyze all traffic on the network segment.

So does this mean that enabling a promiscuous port on a vSwitch will make all vSwitch frames visible to the VM connected to that port? Let’s step back and explain how the VMware vSwitch works. The main difference from a physical switch is that it does not learn MAC addresses by observing passing traffic (which is why you sometimes hear networking people say it is not a real switch). Instead it relies on the information the hypervisor (VMkernel) provides about VM vNIC MAC addresses. It basically knows that all non-uplink vSwitch ports are used only by VMs (with known MAC addresses). So a frame originating from a VM connected to the vSwitch will either be delivered to the right port on the same host (if it is in the same VLAN and matches the destination MAC) or sent to the uplink (usually trunk) port. There the physical switching infrastructure either floods it or switches it to the right port, depending on whether it is an unknown or a known unicast, and it is eventually delivered to the right host (if it belongs to a VM) or to a physical device.

If you think about the behavior described above, it should be clear that a VM connected to a promiscuous port or portgroup will not see all the vSwitch traffic, but only the traffic that reaches the host where the VM resides.

Let’s have a look at the following example with three virtual machines (VM A, VM B, VM C), each on a different ESX host (Host A, Host B, Host C) connected via a physical switch (Port A, Port B, Port C):

Three Host Diagram

 

All VMs are in the same portgroup of a VMware vSphere Distributed Switch in VLAN 100, while the port of VM A is set as promiscuous.

So which traffic between VM B and VM C will VM A see, and which will it not?

When VM B (with MAC B) tries to talk to VM C, it sends a broadcast ARP request to find VM C’s MAC address (MAC C). The physical switch sees the broadcast frame coming in on Port B, notes that MAC B on VLAN 100 is behind this port and floods the frame to all other ports. The vSwitch on host A gets this frame and forwards it to all ports on VLAN 100 (it is a broadcast frame), and thus to VM A.
VM C gets the ARP request and replies with a unicast frame from MAC C to MAC B. The physical switch enters MAC C into its MAC table, noting that it is behind Port C on VLAN 100, and switches the frame to the already learned location of MAC B, which is Port B. The vSwitch on host B then delivers the frame to VM B. As you can see, no frame was delivered to host A and therefore VM A will not see the reply.

Now that communication between VM B and VM C has been established, they can start talking to each other, and the physical switch, knowing the locations of MAC B and MAC C, will switch the frames only between ports B and C of hosts B and C. VM A will see nothing of this unicast communication.

After a while the MAC table entries on the physical switch will expire (if the switch has a shorter timeout than the ARP cache of VM B or C). In that case the switch forgets the location of MAC B or C and floods the next frame destined to B (or C) to all VLAN 100 ports. Only then will VM A get the frame, because the flooded frame reaches host A as well.

Broadcast and possibly multicast traffic from VM B or C will reach host A and thus VM A as well.
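
The walkthrough above can be sketched as a toy simulation in Python. This is only a minimal model of a learning switch with a single VLAN and one port per host; the class and function names are made up for illustration, and real switches have aging timers and many more details.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class PhysicalSwitch:
    """Toy learning switch: one VLAN, one port per ESX host."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # learned MAC address -> port

    def forward(self, src_mac, dst_mac, in_port):
        """Learn the source MAC and return the list of egress ports."""
        self.mac_table[src_mac] = in_port
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]  # known unicast: switched
        # Broadcast or unknown unicast: flooded to all other ports.
        return [p for p in self.ports if p != in_port]

    def expire(self, mac):
        """Simulate the MAC table entry timing out."""
        self.mac_table.pop(mac, None)


def run_walkthrough():
    """Replay the four steps and record whether each frame reaches host A."""
    switch = PhysicalSwitch(["Port A", "Port B", "Port C"])
    reaches_host_a = {}

    egress = switch.forward("MAC B", BROADCAST, "Port B")
    reaches_host_a["ARP request from B"] = "Port A" in egress

    egress = switch.forward("MAC C", "MAC B", "Port C")
    reaches_host_a["ARP reply from C"] = "Port A" in egress

    egress = switch.forward("MAC B", "MAC C", "Port B")
    reaches_host_a["unicast B to C"] = "Port A" in egress

    switch.expire("MAC C")
    egress = switch.forward("MAC B", "MAC C", "Port B")
    reaches_host_a["unicast B to C after expiry"] = "Port A" in egress

    return reaches_host_a
```

Only the frames that reach Port A can ever be seen by VM A, no matter how its vSwitch port is configured.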

This should debunk the myth that a promiscuous port can be used to sniff all traffic on the virtual switch. For that you need port mirroring.

There are, however, use cases for a promiscuous port, and these are related to the (non-)learning behavior of the vSwitch. If VM A needs to see traffic for an additional MAC address D that is not hardcoded on its vNIC, a promiscuous port is a requirement. Examples of such use cases are nested VMs (VM A is a virtual ESXi host) or a floating MAC for highly available load balancing VMs (MAC masquerade). As MAC D responds to ARP requests, the physical switch will learn that MAC D is behind Port A and will deliver the frame properly. The vSwitch on host A will then flood the traffic to all promiscuous ports in the VLAN on the host, as it does not otherwise know where to deliver it. Read William Lam’s article on how to improve the efficiency of this with a VMware Fling (VMkernel VIB plugin) that gives the vSwitch learning ability.

Nested VM
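
The delivery rule for a frame arriving from the uplink can be sketched as follows. This is a simplified model, not VMware code: the port dictionaries and the function name are hypothetical, and broadcast handling is left out.

```python
def vswitch_deliver(dst_mac, vlan, ports):
    """Return the names of local ports that receive a unicast frame.

    A port gets the frame if it is in the frame's VLAN and either owns
    the destination MAC (known from the hypervisor) or is promiscuous.
    """
    receivers = []
    for port in ports:
        if port["vlan"] != vlan:
            continue
        if port["mac"] == dst_mac or port["promiscuous"]:
            receivers.append(port["name"])
    return receivers


# Hypothetical host A: VM A is a nested ESXi host that carries MAC D inside.
HOST_A_PORTS = [
    {"name": "VM A", "mac": "MAC A", "vlan": 100, "promiscuous": True},
    {"name": "VM X", "mac": "MAC X", "vlan": 100, "promiscuous": False},
    {"name": "VM Y", "mac": "MAC Y", "vlan": 200, "promiscuous": True},
]
```

With these ports, a frame for MAC D in VLAN 100 is delivered only to VM A, and it would be dropped if the port of VM A were not promiscuous. The promiscuous port also receives frames addressed to VM X, which is the inefficiency the learning Fling addresses.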

VXLAN on Routed Transport Network

One of the major benefits of VXLAN technology is that it allows creating virtual Layer 2 segments over Layer 3 routed networks. VTEPs (VXLAN Tunnel End Points) encapsulate and decapsulate Ethernet frames of VMs on virtual networks and send them as UDP packets. However, there must still be a mechanism that lets the sending VTEP find the receiving VTEPs for broadcast, unknown unicast and multicast (BUM) traffic.
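
Because every VM frame is wrapped in additional headers, the transport network needs a larger MTU than the guests. A small sketch of the arithmetic, assuming an IPv4 outer header and the standard header sizes (the function name is made up for illustration):

```python
INNER_ETHERNET_HEADER = 14  # the VM's own Ethernet header travels inside
INNER_VLAN_TAG = 4          # only if the inner frame keeps an 802.1Q tag
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4_HEADER = 20      # assumption: IPv4 transport network


def required_transport_mtu(guest_mtu=1500, inner_vlan_tag=False):
    """Smallest transport MTU that carries a full-size guest packet."""
    inner_frame = guest_mtu + INNER_ETHERNET_HEADER
    if inner_vlan_tag:
        inner_frame += INNER_VLAN_TAG
    return inner_frame + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4_HEADER
```

For a guest MTU of 1500 this gives 1550 bytes (1554 with an inner VLAN tag), which is why a transport MTU of 1600 leaves some headroom. The outer Ethernet header is not counted because it is not part of the MTU.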

In NSX we can use multicast, hybrid and unicast modes. Hybrid and unicast modes leverage the controller cluster, which has knowledge of the entire VTEP topology. In vCloud Network and Security (vCNS), however, we can use only multicast mode.

While setting up multicast in a flat layer 2 network is very easy and only requires enabling IGMP snooping and an IGMP querier on the physical switch infrastructure, routed multicast is much harder. That is why the hybrid and unicast modes that NSX provides are so useful. In unicast mode all BUM traffic is replicated by VTEPs. In hybrid mode, multicast is used within each L2 segment of the transport network, while unicast is used to replicate the traffic to the other segments.
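
To get a feeling for the difference between the modes, here is a rough sketch that counts how many copies of one BUM frame the source VTEP itself has to send. It rests on an assumption that goes beyond the description above: for remote segments the source sends a single unicast copy to one proxy VTEP per segment, which then replicates locally. The function and parameter names are hypothetical.

```python
def copies_sent_by_source(mode, vteps_per_segment, source_segment):
    """Count packets the source VTEP transmits for one BUM frame.

    vteps_per_segment maps an L2 segment name to the number of VTEPs
    in it that need the frame (the source VTEP included in its own).
    """
    local_peers = vteps_per_segment[source_segment] - 1
    remote_segments = sum(
        1
        for segment, count in vteps_per_segment.items()
        if segment != source_segment and count > 0
    )
    if mode == "multicast":
        # One packet to the group; the physical network replicates it.
        return 1 if (local_peers or remote_segments) else 0
    if mode == "hybrid":
        # One L2 multicast locally plus one unicast per remote segment.
        return (1 if local_peers else 0) + remote_segments
    if mode == "unicast":
        # One unicast per local peer plus one per remote segment (assumed proxy).
        return local_peers + remote_segments
    raise ValueError(f"unknown mode: {mode}")
```

With four VTEPs in the local segment and three in a remote one, this model gives 1 packet in multicast mode, 2 in hybrid mode and 4 in unicast mode.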

In my recent VXLAN deployment, however, we had to stick to pure multicast mode as we used vCNS. To route multicast traffic, the physical router was configured to use PIM-SM (Protocol Independent Multicast in Sparse Mode) with a rendezvous point. However, it turned out that setting up the VTEPs is not straightforward and not very well documented, with some misinformation in a blog post I found on the web.

Each VTEP needs to have an IP address assigned. In vCNS the assignment happens over DHCP only.

Auto-assigned VTEP IP address

In addition to DHCP, NSX can also use IP pools. As we were using vCNS and had no DHCP servers in the VXLAN transport network, we had to go into each host and manually assign the IP address of the VTEP vmkernel port through vSphere Client. Unfortunately this is not enough for routed communication on the transport network. A default gateway must be defined in the VXLAN network stack.

Missing gateway

The default gateway must be added through the ESXi CLI; as can be seen in the screenshot above, it is not configurable via the GUI. Originally we created a static route to the other segment, but that is not enough (and actually not needed at all). Instead the default gateway must be defined with the following command.

esxcli network ip route ipv4 add -n default -g 1.1.1.1 -N vxlan

where 1.1.1.1 is the gateway IP address and vxlan is the networking stack.
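
Since the command has to be run on every host, and each transport segment has its own gateway, it can help to generate the command lines. A minimal sketch that uses only the options shown above; the function name is made up and nothing is executed on any host:

```python
import ipaddress
import shlex


def vxlan_default_gateway_command(gateway, netstack="vxlan"):
    """Build the esxcli command line for one host.

    Raises ValueError if the gateway is not a valid IPv4 address.
    """
    ipaddress.IPv4Address(gateway)  # validation only
    args = [
        "esxcli", "network", "ip", "route", "ipv4", "add",
        "-n", "default",
        "-g", gateway,
        "-N", netstack,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(vxlan_default_gateway_command("1.1.1.1"))
```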

Whether the gateway is set properly can be verified with the net-vdl2 -l command.

net-vdl2

 

Troubleshooting Multicast with Linux

I was looking for a lightweight tool to help me troubleshoot multicast on the VXLAN transport network (underlay). While both vCNS and NSX have built-in tools (pings of various sizes and broadcast packets), I needed something more flexible that would let me do arbitrary IGMP joins and leaves.

I used a CentOS VM with one interface directly on the transport network and the SMCRoute software. This link contains a binary package that works on RHEL/CentOS. Some other notes:

  • If you have multiple interfaces, make sure multicast is routed through the correct one:
    route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
  • I also had to install the 32-bit glibc package:

    yum -y install glibc.i686

  • Make sure the kernel supports multicast

    cat /boot/config-<kernel version> | grep CONFIG_IP_MULTICAST

  • Enable ICMP ECHO on broadcast/multicast

    sysctl net.ipv4.icmp_echo_ignore_broadcasts=0

  • Start the smcroute daemon first:
    smcroute -d

To join and leave a multicast group use the -j and -l options:
smcroute -j eth0 239.0.0.1
smcroute -l eth0 239.0.0.1
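
If SMCRoute is not available, a join and leave can also be triggered from Python with the standard socket API. This is a sketch of an alternative, not what smcroute does internally. The function names are made up; with interface_ip left at 0.0.0.0 the kernel picks the interface from the routing table, so the multicast route note above still applies.

```python
import ipaddress
import socket


def build_mreq(group, interface_ip="0.0.0.0"):
    """Pack the ip_mreq structure: group address + local interface address."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)


def join_group(group, interface_ip="0.0.0.0"):
    """Join the group; membership lasts while the returned socket is open."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_ADD_MEMBERSHIP,
        build_mreq(group, interface_ip),
    )
    return sock


def leave_group(sock, group, interface_ip="0.0.0.0"):
    """Leave the group explicitly and close the socket."""
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_DROP_MEMBERSHIP,
        build_mreq(group, interface_ip),
    )
    sock.close()
```

A typical session would be sock = join_group("239.0.0.1"), then a look at the switch or router IGMP tables, then leave_group(sock, "239.0.0.1").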

To check current memberships use:
netstat -ng

or

ip maddr

The IGMP version can be changed with the following command:

echo 2 > /proc/sys/net/ipv4/conf/eth0/force_igmp_version

Additional useful statistics about IGMP joins:

cat /proc/net/igmp

To see which hosts are members of a particular IGMP group, just ping it and see who replies:

[root@CentOS~]# ping 239.1.0.10
PING 239.1.0.10 (239.1.0.10) 56(84) bytes of data.
64 bytes from 1.1.0.1: icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from 1.1.0.3: icmp_seq=1 ttl=64 time=0.256 ms (DUP!)

Hosts 1.1.0.1 and 1.1.0.3 replied to the ping on the 239.1.0.10 multicast group.
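
With many group members the ping output gets long, so a few lines of Python can reduce it to the list of responders. A minimal sketch that assumes the Linux ping output format shown above; the function name is made up:

```python
import re

REPLY_PATTERN = re.compile(r"bytes from (\d+\.\d+\.\d+\.\d+):")


def responders(ping_output):
    """Return the unique source addresses of replies, in order of appearance."""
    seen = []
    for line in ping_output.splitlines():
        match = REPLY_PATTERN.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


SAMPLE_OUTPUT = """\
PING 239.1.0.10 (239.1.0.10) 56(84) bytes of data.
64 bytes from 1.1.0.1: icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from 1.1.0.3: icmp_seq=1 ttl=64 time=0.256 ms (DUP!)
"""
```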

How To Change VXLAN VTEP MTU Size and Teaming Policy

One of my customers had configured VXLAN in a vCloud Director environment, then created multiple Provider and Org VDCs and deployed virtual networks. Then we found out that the MTU and teaming policy had been set up incorrectly. Redeploying the whole environment would take too much time. Fortunately there is a way to fix this without a rip-and-replace approach.

First, a little bit of background. VXLAN VTEPs are configured in vShield Manager or NSX Manager (via the vSphere Web Client plugin) at the cluster/distributed switch level. vShield/NSX Manager creates one distributed switch port group with the given parameters (VLAN, teaming policy) and then, for each host added to the cluster, creates a VTEP vmknic (with the configured MTU size and DHCP/IP pool addressing scheme). This means that the teaming policy can easily be changed directly at the vSphere level by editing the distributed switch port group, and the MTU size can be changed on each host’s VTEP vmknic. However, every new host deployed into the VXLAN-prepared cluster would still use the wrong MTU size set in vShield/NSX Manager. Note that as there can be only one VTEP port group per distributed switch, clusters sharing the same vSwitch need to have an identical VTEP teaming policy and VLAN ID.

The actual vCNS/NSX Manager VTEP configuration can be changed via the following REST API call:

PUT https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches/<switch ID>

with the body containing the new configuration.

Example using Firefox RESTClient plugin:

  1. Install Firefox RESTClient plugin.
  2. Make sure vCNS/NSX Manager certificate is trusted by Firefox.
  3. In Firefox toolbar click on RESTClient icon.
  4. Create authentication header: Authentication > Basic Authentication > enter vCNS/NSX Manager credentials
  5. Select GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches
    VDS Contexts
  6. This will retrieve all vSwitch contexts in the vCNS/NSX domain. Find the ID of the one you want to change and use it in the following GET call.
  7. Select the GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches/<switch-ID>
    VDS Context
  8. Now copy the Response Body and paste it into the Request Body box. In the XML edit the parameters you want to change. In my case I have changed:
    <mtu>9000</mtu> to <mtu>1600</mtu> and
    <teaming>ETHER_CHANNEL</teaming> to <teaming>FAILOVER_ORDER</teaming>
  9. Change the method to PUT and add a new header: Content-Type: application/xml.
    PUT Request
  10. Send the request. If everything went well, we should get a Status Code: 200 OK response.
    OK Response
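
The same GET, edit and PUT sequence can be scripted. Below is a minimal sketch using only the Python standard library. The sample XML is a trimmed, hypothetical response body (the real one contains more elements, which the edit leaves untouched), and the function names are made up. The request is only built here, not sent.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical, shortened response body from the GET call in step 7.
SAMPLE_BODY = (
    "<vdsContext><switch><objectId>dvs-21</objectId></switch>"
    "<mtu>9000</mtu><teaming>ETHER_CHANNEL</teaming></vdsContext>"
)


def edit_vds_context(xml_text, mtu=None, teaming=None):
    """Return the XML with <mtu> and/or <teaming> replaced."""
    root = ET.fromstring(xml_text)
    for tag, value in (("mtu", mtu), ("teaming", teaming)):
        if value is None:
            continue
        element = root.find(f".//{tag}")
        if element is None:
            raise ValueError(f"<{tag}> not found in the response body")
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")


def build_put_request(manager_fqdn, switch_id, user, password, body):
    """Build (but do not send) the PUT request from steps 9 and 10."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{manager_fqdn}/api/2.0/vdn/switches/{switch_id}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/xml",
        },
    )
```

Passing the result of build_put_request to urllib.request.urlopen would send it, provided the manager certificate is trusted by the machine running the script.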

Now, in vSphere Client, we need to change the MTU size of the VTEP vmknic on all existing hosts to the new value and also change the teaming policy on the VTEP portgroup (in my case from Route based on IP hash to Use explicit failover order).

vCloud Network and Security (vShield Manager) supports the following teaming policies:

  • FAILOVER_ORDER
  • ETHER_CHANNEL
  • LACP_ACTIVE
  • LACP_PASSIVE
  • LACP_V2

NSX adds the following two teaming policies for multiple VTEP vmknics:

  • LOADBALANCE_SRCID
  • LOADBALANCE_SRCMAC

Update 9/22/2014

Existing VXLAN VNI portgroups (virtual wires) will keep using the original teaming policy, so they need to be changed to match the new one as well.

When using the FAILOVER_ORDER teaming policy, the uplinks must also be specified in the XML. The uplinks should use the names defined at the distributed switch level.

<teaming>FAILOVER_ORDER</teaming>
<uplinkPortName>Uplink 2</uplinkPortName>
<uplinkPortName>Uplink 1</uplinkPortName>
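
Adding the uplink elements by hand is easy to get wrong, so here is a small sketch that sets the policy and inserts the uplinks right after the teaming element, in the given order. The function name and the sample input in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_failover_order(xml_text, uplinks):
    """Set FAILOVER_ORDER and place <uplinkPortName> entries after <teaming>."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        teaming = parent.find("teaming")
        if teaming is None:
            continue
        teaming.text = "FAILOVER_ORDER"
        # Remove stale uplink entries before adding the new ones.
        for old in parent.findall("uplinkPortName"):
            parent.remove(old)
        position = list(parent).index(teaming) + 1
        for offset, name in enumerate(uplinks):
            uplink = ET.Element("uplinkPortName")
            uplink.text = name
            parent.insert(position + offset, uplink)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("<teaming> not found")
```

For example, set_failover_order(body, ["Uplink 2", "Uplink 1"]) applied to a context containing <teaming>ETHER_CHANNEL</teaming> yields the fragment shown above.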