How to Monitor Health of NSX Edge Gateways

NSX Edge Service Gateways are virtual machines deployed by NSX Manager that provide network services (routing, bridging, load balancing, VPNs, DNS relay, DHCP, …). This makes them a critical component of the infrastructure, so there may be a need to keep a close eye on their availability.

While NSX Manager reports the status of the Edges in its GUI and logs, it might take some time to register a change in health. If you want an up-to-date status of an Edge's health, you need to query NSX Manager via the NSX API. NSX Manager then retrieves the current status of the Edge. The mechanism of communication between NSX Manager and the Edge appliance(s) depends on the Edge version and the vSphere cluster status:

VIX Communication

This is the legacy mode of communication. NSX Manager utilizes the VIX API to query vCenter Server and the ESXi host which runs the Edge appliance, which then talks to the actual VM via VMware Tools. This mode of communication is used for legacy Edges of version 5.5.x (deployed via the compatibility vShield v2 API) and as a fallback mode when, for some reason, Message Bus Communication is not possible.

VIX Guest Operations
source: https://www.vmware.com/support/developer/vix-api/guestOps50_technote.pdf


Message Bus Communication

This is a direct (and faster) communication between NSX Manager and the ESXi host (its vsfwd process) running the Edge appliance. It requires that, at the time of the Edge deployment, the cluster the Edge is deployed to is prepared for NSX and has no issues.

Message Bus
source: NSXvSphereDesignGuidev2.1.pdf


Querying Edge health is an expensive operation: it takes time and creates load on NSX Manager. This should be considered if there is a large number of Edges (for example, in a service provider scenario).
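Under the hood the health query is a single authenticated REST call. A sketch of the request the function below issues (edge-1 stands in for a real Edge ID):

```
GET https://<NSX Manager FQDN>/api/3.0/edges/edge-1/status
Authorization: Basic <base64 of username:password>
```

The response is an XML document whose edgeStatus element carries the overall state plus a featureStatus element per network service.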

To test the viability of checking the health status of all Edges at least once within a given interval, I have created a simple PowerShell function Get-NSXEdgeHealth:

function Get-NSXEdgeHealth {
<#
.SYNOPSIS
Gathers the health status of an NSX Edge.
.DESCRIPTION
Queries NSX Manager for the health of a particular NSX Edge.
.NOTES
Author: Tomas Fojta
.PARAMETER NSXManager
The FQDN or IP of your NSX Manager.
.PARAMETER Username
The username to connect with. Defaults to admin if nothing is provided.
.PARAMETER Password
The password to connect with.
.PARAMETER EdgeId
ID of the Edge to gather health data for.
.EXAMPLE
PS> Get-NSXEdgeHealth -NSXManager nsxmgr.fqdn -Username admin -Password password -EdgeId EdgeId
#>
[CmdletBinding()]
param(
    [Parameter(Mandatory=$true,Position=0)]
    [String]$NSXManager,
    [Parameter(Mandatory=$false,Position=1)]
    [String]$Username = "admin",
    [Parameter(Mandatory=$true)]
    [String]$Password,
    [Parameter(Mandatory=$true)]
    [String]$EdgeId
)
Process {
    ### Ignore TLS/SSL certificate errors (define the policy type only once per session)
    if (-not ("TrustAllCertsPolicy" -as [type])) {
        Add-Type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
    }
}
"@
    }
    [System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
    ### Create the Basic authorization string and store it in $head
    $auth = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($Username + ":" + $Password))
    $head = @{"Authorization"="Basic $auth"}
    $HealthRequest = "https://$NSXManager/api/3.0/edges/$EdgeId/status"

    $h = @{} | Select-Object Health, Detail

    $r = Invoke-WebRequest -Uri $HealthRequest -Headers $head -ContentType "application/xml" -ErrorAction:Stop
    [xml]$rxml = $r.Content
    $h.Health = $rxml.edgeStatus.edgeStatus
    $Details = @()
    foreach ($Feature in $rxml.edgeStatus.featureStatuses.featureStatus)
    {
        $n = @{} | Select-Object Service, Status
        $n.Service = $Feature.service
        $n.Status = $Feature.status
        $Details += $n
    }
    $h.Detail = $Details

    return ,$h

} # End of process

} # End of function

PowerShell 3.0 or higher, connectivity to NSX Manager, and credentials (at least with the auditor role) are needed.

Usage example:

Example


As can be seen, the function takes the Edge ID parameter and returns both the overall Edge health and a detailed status of each of its network services.

The following health states are defined:

  • green – good. This is the only state that guarantees that the Edge is functional.
  • red – no backing appliance is in a serving state
  • grey – unknown status (for example, an undeployed Edge)
  • yellow – intermittent health check failures (if more than 5 consecutive health checks fail, the status goes to red)

The following function, Get-NSXEdges, collects all Edges in the environment:

function Get-NSXEdges {
<#
.SYNOPSIS
Gathers NSX Edges from NSX Manager.
.DESCRIPTION
Inventories all of your NSX Edges from NSX Manager.
.NOTES
Author: Tomas Fojta
.PARAMETER NSXManager
The FQDN or IP of your NSX Manager.
.PARAMETER Username
The username to connect with. Defaults to admin if nothing is provided.
.PARAMETER Password
The password to connect with.
.EXAMPLE
PS> Get-NSXEdges -NSXManager nsxmgr.fqdn -Username admin -Password password
#>
[CmdletBinding()]
param(
    [Parameter(Mandatory=$true,Position=0)]
    [String]$NSXManager,
    [Parameter(Mandatory=$false,Position=1)]
    [String]$Username = "admin",
    [Parameter(Mandatory=$true)]
    [String]$Password
)

Process {
    ### Ignore TLS/SSL certificate errors (define the policy type only once per session)
    if (-not ("TrustAllCertsPolicy" -as [type])) {
        Add-Type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
    }
}
"@
    }
    [System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
    ### Create the Basic authorization string and store it in $head
    $auth = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($Username + ":" + $Password))
    $head = @{"Authorization"="Basic $auth"}
    ### Connect to NSX Manager via the API; the first call only retrieves the total Edge count
    $Request = "https://$NSXManager/api/3.0/edges"
    $r = Invoke-WebRequest -Uri ($Request+"?startIndex=0&pageSize=1") -Headers $head -ContentType "application/xml" -ErrorAction:Stop
    $TotalCount = ([xml]$r.Content).pagedEdgeList.edgePage.pagingInfo.totalCount

    ### The second call retrieves all Edges in a single page
    $r = Invoke-WebRequest -Uri ($Request+"?startIndex=0&pageSize="+$TotalCount) -Headers $head -ContentType "application/xml" -ErrorAction:Stop
    [xml]$rxml = $r.Content

    ### Return the NSX Edges
    $Edges = @()
    foreach ($EdgeSummary in $rxml.pagedEdgeList.edgePage.edgeSummary)
    {
        $n = @{} | Select-Object Name, Id
        $n.Name = $EdgeSummary.name
        $n.Id = $EdgeSummary.objectId
        $Edges += $n
    }
    return ,$Edges

} # End of process

} # End of function

And here is a sample script leveraging both functions above that continuously displays the health status of all Edges in the environment, together with the time needed to go through all of them.

$NSXManager = "nsx01.fojta.com"
$Username = "admin"
$Password = "default"

$AllEdges = Get-NSXEdges -NSXManager $NSXManager -Username $Username -Password $Password

do
{
    $StartTime = Get-Date
    foreach ($Edge in $AllEdges)
    {
        $Health = Get-NSXEdgeHealth -NSXManager $NSXManager -Username $Username -Password $Password -EdgeId $Edge.Id
        Write-Host $Edge.Name $Health.Health
    }
    $Duration = (Get-Date) - $StartTime
    Write-Host
    Write-Host "Duration:" $Duration.Minutes "Minutes" $Duration.Seconds "Seconds"
    Write-Host
} while ($true)

In my lab it took at least about 2 seconds to get the status of one Edge (depending on the mode of communication and its actual health). Most of that time NSX Manager spends communicating with the ESXi host, so the task can and should be parallelized. When running 5 sessions at the same time, the retrieval of the health status of one Edge went up to 3-4 seconds (for a green Edge), while the load on NSX Manager went up 14% (I run NSX Manager with only 2 vCPUs in my lab).
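One way to parallelize the polling is with PowerShell background jobs. A minimal sketch, assuming the two functions above are saved in a (hypothetical) NsxFunctions.ps1 file that each job can dot-source:

```powershell
# Parallelization sketch: one background job per Edge.
# C:\Scripts\NsxFunctions.ps1 is a hypothetical file holding the two
# functions above; adjust the path, and consider throttling the number of
# simultaneous jobs so NSX Manager is not overloaded.
$jobs = foreach ($Edge in $AllEdges) {
    Start-Job -ArgumentList $NSXManager, $Username, $Password, $Edge.Id -ScriptBlock {
        param($Manager, $User, $Pass, $Id)
        . C:\Scripts\NsxFunctions.ps1
        Get-NSXEdgeHealth -NSXManager $Manager -Username $User -Password $Pass -EdgeId $Id
    }
}
$results = $jobs | Wait-Job | Receive-Job
$jobs | Remove-Job
```

Each job runs in its own PowerShell session, which is why the functions must be dot-sourced inside the script block rather than inherited from the caller.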

Monitoring

While the article mentions only NSX, the scripts should also work with vShield / vCloud Networking and Security Manager.

Automate ESXi Host VTEP Default Gateway

As discussed in my older article, a routed VXLAN transport network requires setting the default gateway of the vxlan stack on each ESXi host. While NSX has the concept of IP Pools, which allows automatic VTEP configuration (including the gateway), the older vCloud Networking and Security (vShield) technology does not have this feature and the VTEP IP address must be configured via DHCP or manually.

The following quick-and-dirty PowerCLI script shows how this can be automated at the cluster level:


$hosts = Get-Cluster Cluster1 | Get-VMHost
foreach ($vihost in $hosts) {
    $esxcli = Get-VMHost $vihost | Get-EsxCli
    $vihost.Name
    $result = $esxcli.network.ip.route.ipv4.add("10.40.0.1","vxlan","default")
    $esxcli.network.ip.interface.ipv4.get("","vxlan") | Format-List IPv4Address
    $esxcli.network.ip.route.ipv4.list("vxlan") | Format-Table
}

The script sets the vxlan stack default gateway to 10.40.0.1 on each host in the cluster 'Cluster1' and displays each host's name, VTEP IP address and vxlan routing table.

VXLAN Routing Script

Credit for the esxcli to PowerCLI command conversion goes to Virten.net.


How To Change VXLAN VTEP MTU Size and Teaming Policy

One of my customers configured VXLAN in a vCloud Director environment, created multiple Provider and Org VDCs and deployed virtual networks. Then we found out that the MTU and teaming policy configuration was set up incorrectly. Redeployment of the whole environment would take too much time; fortunately, there is a way to fix this without a rip-and-replace approach.

First a little bit of background. VXLAN VTEPs are configured in vShield Manager or in NSX Manager (via the vSphere Web Client plugin) at the cluster/distributed switch level. vShield/NSX Manager creates one distributed switch port group with the given parameters (VLAN, teaming policy) and then, for each host added to the cluster, creates a VTEP vmknic (with the configured MTU size and DHCP/IP Pool addressing scheme). This means that the teaming policy can easily be changed directly at the vSphere level by editing the distributed switch port group, and the MTU size can be changed on each host's VTEP vmknic. However, every new host deployed into the VXLAN-prepared cluster would still use the wrong MTU size set in vShield/NSX Manager. Note that as there can be only one VTEP port group per distributed switch, clusters sharing the same distributed switch need to have an identical VTEP teaming policy and VLAN ID.

The actual vCNS/NSX Manager VTEP configuration can be changed via the following REST API call:

PUT https://<vCNS/NSX Manager FQDN>/api/api/2.0/vdn/switches/<switch ID>

with the Body containing the new configuration.

Example using Firefox RESTClient plugin:

  1. Install the Firefox RESTClient plugin.
  2. Make sure the vCNS/NSX Manager certificate is trusted by Firefox.
  3. In the Firefox toolbar click the RESTClient icon.
  4. Create the authentication header: Authentication > Basic Authentication > enter the vCNS/NSX Manager credentials.
  5. Select the GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches
    VDS Contexts
  6. This retrieves all vswitch contexts in the vCNS/NSX domain. Find the ID of the one you want to change and use it in the following GET call.
  7. Select the GET method and in the URL enter https://<vCNS/NSX Manager FQDN>/api/api/2.0/vdn/switches/<switch-ID>
    VDS Context
  8. Now copy the Response Body and paste it into the Request Body box. In the XML, edit the parameters you want to change. In my case I changed:
    <mtu>9000</mtu> to <mtu>1600</mtu> and
    <teaming>ETHER_CHANNEL</teaming> to <teaming>FAILOVER_ORDER</teaming>
  9. Change the method to PUT and add a new header: Content-Type: application/xml.
    PUT Request
  10. Send the request. If everything went successfully, we should get a Status Code: 200 OK response.
    OK Response
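Put together, the whole change boils down to three REST calls. A sketch of the sequence (dvs-21 stands in for the real switch ID retrieved in step 6):

```
GET  https://<vCNS/NSX Manager FQDN>/api/2.0/vdn/switches
GET  https://<vCNS/NSX Manager FQDN>/api/api/2.0/vdn/switches/dvs-21
PUT  https://<vCNS/NSX Manager FQDN>/api/api/2.0/vdn/switches/dvs-21
     Content-Type: application/xml
     Body: the XML from the previous GET with <mtu> and <teaming> edited
```

Any REST client (or a script) can replay these calls; only the Basic authentication header and the Content-Type header on the PUT are required.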

Now we need to change, in the vSphere Client, the MTU size of all existing hosts to the new value and also change the teaming policy on the VTEP portgroup (in my case from Route based on IP hash to Use explicit failover order).
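The per-host MTU change can also be scripted. A hedged PowerCLI sketch, assuming the hosts sit in a cluster named Cluster1 and the VTEP vmknic is vmk1 on every host (verify the actual vmknic name first):

```powershell
# Set the VTEP vmknic MTU on every host in the cluster.
# Cluster1 and vmk1 are assumptions for this particular environment.
foreach ($vmhost in Get-Cluster Cluster1 | Get-VMHost) {
    Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel -Name vmk1 |
        Set-VMHostNetworkAdapter -Mtu 1600 -Confirm:$false
}
```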

vCloud Networking and Security (vShield Manager) supports the following teaming policies:

  • FAILOVER_ORDER
  • ETHER_CHANNEL
  • LACP_ACTIVE
  • LACP_PASSIVE
  • LACP_V2

NSX adds the following two teaming policies for multiple VTEP vmknics:

  • LOADBALANCE_SRCID
  • LOADBALANCE_SRCMAC

Update 9/22/2014

Existing VXLAN VNI portgroups (virtual wires) will keep the original teaming policy, therefore they need to be changed to match the new one as well.

When using the FAILOVER_ORDER teaming policy, the uplinks must also be specified in the XML. The uplinks should use the names defined at the distributed switch level:

<teaming>FAILOVER_ORDER</teaming>
<uplinkPortName>Uplink 2</uplinkPortName>
<uplinkPortName>Uplink 1</uplinkPortName>

Update 4/1/2015

As mentioned in the comments below, vCNS and NSX differ slightly in this API call. For NSX the correct call is:

PUT https://nsx01.fojta.com/api/2.0/switches

(without the switch-id at the end).

Rate Limiting of External Networks in vCloud Director and Nexus 1000V

There is a new feature in vCloud Director 5.1 which was requested a lot by service providers: configurable rate limits on routed external networks (for example the Internet) for each tenant. Limits can be set for both incoming and outgoing directions by the vCloud Administrator on the tenant's Edge Gateway.

Edge Rate Limit Configuration

However, this feature only works with the VMware vSphere Distributed Switch; it does not work with the Cisco Nexus 1000V or the VMware standard switch. Why? Although the feature is provided by the Edge Gateway, what actually happens in the background is that vShield Manager instructs vCenter to create a traffic shaping policy on the distributed switch port used by the Edge VM.

vSphere Distributed Switch Traffic Shaping

The standard switch does not allow port-specific traffic shaping, and the Nexus 1000V management plane (Virtual Supervisor Module) is not accessible to vShield Manager/vCenter. The rate limit could be applied manually on the port of the Cisco switch; however, any Edge redeploy operation (which is accessible to the tenant via the GUI) would deploy a new Edge using a different port on the virtual switch, so the tenant could easily disable the limit.

For a standard switch backed external network the vCloud Director GUI will not even present the option to set the rate limit; for a Nexus backed external network, however, the operation will fail with an error similar to:

Cannot update edge gateway “ACME_GW”
java.util.concurrent.ExecutionException: com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10086): Traffic shaping policy can be set only for a Vnic connected to a vmware distributed virtual portgroup configured with static port binding. Invalid portgroup ‘dvportgroup-9781’.

Nexus 1000V Error

By the way, the rate limit can also be set on the Edge (when not using vCloud Director) via vShield Manager or its API; it is called Traffic Shaping Policy and is configurable in the vSM > Edge > Configure > Interfaces > Actions menu.

vShield Manager Traffic Shaping

Do not forget to consider this when designing vCloud Director environments and choosing the virtual switch technology.

Load Balancing vCloud Director Cells with vShield Edge

Big deployments of vCloud Director should have at least two vCloud Director cells for high availability and load balancing reasons. This implies the usage of a load balancer. One can choose either a physical box (for example F5) or a virtual one (Citrix NetScaler, Riverbed Stingray, Zenloadbalancer, …). With the new release of VMware vCloud Networking and Security (vCNS), the successor of VMware vShield, it is possible to use the Edge (version 5.1) as a load balancer.

Compared to the old vShield Edge (5.0) there are quite a few enhancements. Besides load balancing HTTP connections, as was the case in previous versions, load balancing of HTTPS and generic TCP connections is also supported. Additionally, the new Edge can have up to 10 network interfaces, can connect to VXLAN networks, provides traffic shaping, DNS relay and SSL VPN, and scales up to 3 sizes (compact, large, x-large) with stateful active-passive high availability.

I am going to describe how to use Edge as a load balancer in front of two vCloud Director cells. The following picture shows my lab network setup.

This is based on a quite standard architecture where the vCloud Director cells sit in a DMZ zone, usually separated by two firewalls from the Internet and the management zone. In order to deploy the Edge, vCNS Manager (the former vShield Manager) must be deployed first. If two different vCenter Servers are used for the management of the resource group cluster and the management cluster, two different vCNS Managers must also be used, as there is a 1:1 relationship between vCenter and vCNS Manager.

Deployment Process

1. Deploy the vCNS Manager (OVF virtual appliance), configure it and register it with the management cluster vCenter.

2. Either using the vSphere Client (use the .NET version, as no vShield plugin for the Web Client is available yet) or directly through the vCNS Manager web GUI, go to the Hosts and Clusters view, select the Datacenter and click the Network Virtualization tab. Click the + icon to add a new Edge.

3. Configure the Edge deployment size, HA, network interfaces (portgroups, IPs and subnets), default firewall policy and placement. In my lab I used the compact size, no HA and two interfaces (INT and EXT as shown in the picture).

4. Once the Edge is deployed (the Manager deploys the OVF and then pushes the configuration to the Edge VM via the VIX API), select it and click the gear icon (Actions) to go to the Manage menu.

Before we configure the load balancer we must add an additional IP (or IPs) to the external interface. This is a vCloud Director requirement, as both the portal/API and the VMware Remote Console (VMRC) Proxy use the same port 443. I used the default Edge external IP address for the vCloud Director portal and added a second one for the VMRC Proxy. This can be done in the Configure tab, Interfaces menu.

5. Now we can configure the load balancer. Firstly, Pools of real servers must be created, and then Virtual Servers can be configured.

I have created two pools. The first, VCD_80-443, has two services enabled, HTTP and HTTPS, both using the LEAST_CONN balancing method on ports 80 and 443. I enabled the HTTP health check with the default settings on the URI /cloud/server_status. The members were the VCD cells with IPs 10.0.1.60 and 10.0.1.62 and the respective ports 80 and 443 on each IP.

The second pool, VMRC_443, has a TCP service with the LEAST_CONN balancing method and the default TCP health check on port 443. The VCD cell IPs 10.0.1.61 and 10.0.1.62 with port 443 were added.

6. Two Virtual Servers were then created. One for each external IP from step 4. “vcloud” Virtual Server uses VCD_80-443 Pool with 10.0.2.80 external IP address. “VMRC” Virtual Server uses VMRC_443 Pool with 10.0.2.81 external IP address.

7. The configurations must be uploaded to the Edge by clicking the Publish Changes button.

Happy load balancing.