Migration from Nexus 1000V to NSX in vCloud Director

A few service providers have asked me about migration possibilities from Nexus 1000V to NSX when using vCloud Director. I will first explain the process at a high level and then offer a step-by-step approach.

Introduction

NSX (actually NSX for vSphere) is built on top of the vSphere Distributed Switch, which it extends with a few VMkernel modules and host-level agents. It cannot run on top of Nexus 1000V. Therefore it is necessary to first migrate from Nexus 1000V to the vSphere Distributed Switch (vDS) and then install NSX. The core management component of NSX is NSX Manager, which runs as a virtual appliance. It is in a sense very similar to vShield Manager / vCloud Networking and Security (vCNS) Manager, and there is actually a direct upgrade path from vCNS Manager to NSX Manager.

NSX is backward compatible with the vCNS API and therefore works without any problems with vCloud Director. However, NSX advanced features (distributed logical router and firewall, dynamic routing protocol support on Edges, etc.) are not available via the vCloud Director GUI or vCloud API. The only exception is multicast-less VXLAN, which works with vCloud Director out of the box.
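
A quick way to convince yourself of that API compatibility is to hit the same vCNS-style REST endpoint vCloud Director uses for VXLAN network scopes before and after the manager upgrade. This is only a sketch – replace the manager FQDN and credentials with your own – and the call should keep returning the same network scopes once NSX Manager is in place.

curl -k -u 'admin:<password>' https://vcns-or-nsx-manager.example.com/api/2.0/vdn/scopes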

Migration from Nexus 1000V to vSphere Distributed Switch

In a pure vSphere environment it is pretty easy to migrate from Nexus 1000V to vDS, and it can be done even without any VM downtime because both distributed switches (vDS and Nexus) can coexist on the same ESXi host. However, this is not the case when vCloud Director with VXLAN technology is involved. VXLAN is configured with per-cluster granularity and therefore it is not possible to mix two VXLAN providers (vDS and Nexus) on the same cluster. This unfortunately means we cannot do live VM migration from Nexus to vDS, as the two switches are on different clusters and (live) vMotion does not work across two different distributed switches. Cold migration must be used instead.

Note that if VXLAN is not involved, live migration is possible; this will be leveraged while migrating the vCloud external networks.

Another point should be noted: we are going to mix two VXLAN providers in the same Provider VDC, meaning that the VXLAN scope will span both Nexus 1000V and vSphere Distributed Switch. To my knowledge this is neither recommended nor supported, although it works and both VTEP types can communicate with each other thanks to the multicast control plane. As we will be migrating VMs in a powered-off state, this is not an issue and no traffic will run over the mixed VXLAN network during the migration.
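
If you want to verify that VTEPs on the Nexus-backed and vDS-backed clusters really can reach each other, a quick test from the ESXi shell is enough. This is only a sketch – vmk3 and the target address are placeholders for your own VXLAN vmknic and a VTEP IP on a host in the other cluster – and the -d/-s flags also check that the transport network carries VXLAN-sized frames without fragmentation (assuming a 1600 byte MTU).

# identify the VXLAN VTEP vmknic and its IP address
esxcli network ip interface ipv4 get
# ping a VTEP in the other cluster with don't-fragment set and a large payload
# (if the VTEP vmknic lives in a dedicated VXLAN TCP/IP stack, add ++netstack=vxlan)
vmkping -I vmk3 -d -s 1570 192.168.250.12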

High Level Steps

We will need two clusters: one legacy cluster with Nexus 1000V and one empty cluster for vDS. I will use two vDS switches, although one could be enough. One vDS is used for management, vMotion, NFS and VLAN traffic (vCloud external networks); the other is used purely for VXLAN VM traffic.

We first need to migrate all VLAN-based networks from Nexus 1000V to the first vDS. Then we prepare the second cluster with vDS-based VXLAN and create an elastic Provider VDC which will contain both clusters (resource pools). We will disable the Nexus resource pool and migrate all VMs, templates and Edges off of it to the new one – see my article vCloud Director: Online Migration of Virtual Data Center.

We will then detach the old cluster, remove and uninstall Nexus 1000V, create a vDS with VXLAN instead, and add the cluster back to the Provider VDC. After this we can upgrade vCNS Manager to NSX Manager, prepare all hosts for NSX and also install NSX Controllers. We can optionally change the VXLAN transport mode from Multicast to Unicast/Hybrid. We however cannot upgrade the vCloud Director deployed vShield Edges to NSX Edges, as that would break their compatibility with vCloud Director.

Step-by-step Procedure

  1. Create new vDS switch. Migrate management networks (vMotion, NFS, management) and vCloud external networks to it (NAT rules on Edge Gateways using external network IPs will need to be removed and re-added).
    Prepare new cluster (Cluster2) and add it to the same vDS. Optionally create a new vDS for VXLAN networks (if the first one will not be used). Prepare VXLAN fabric on Cluster2 in vCNS Manager.
    Distributed switches
    VXLAN fabric before migration
    Nexus1000V – Cluster1 VXLAN only
    dvSwitch01 – Cluster 1 and Cluster 2 external networks, vMotion, management and NFS
    dvSwitch02 – Cluster2 VXLAN only
  2. Create new Provider VDC on new cluster (GoldVDCnew)
  3. Merge new Provider VDC with the old one. Make sure the new Provider VDC is primary! Disable the secondary (Cluster1) Resource Pool.
    Merging PVDCs
    PVDC Resource Pools
  4. Migrate VMs from Cluster1 to Cluster2 from within vCloud Director. VMs connected to a VXLAN network will need to be powered off as it is not possible to do live vMotion between two different distributed switches.
    VM Migration
  5. Migrate templates (move them to another catalog). As Cluster1 is disabled, the templates will be registered on hosts from Cluster2.
  6. Redeploy the Edges from within vCloud Director (not vCNS Manager). Again, because Cluster1 is disabled, the Edges will be registered on hosts from Cluster2.
  7. Detach (now empty) Resource Pool
    Detaching Resource Pool
  8. Rename Provider VDC to the original name (GoldVDC)
  9. Unprepare VXLAN fabric on Cluster1. Remove VXLAN vmknics on all hosts from Nexus1000V.
  10. Remove the Nexus 1000V switch from the (now empty) cluster and extend the VXLAN vDS there (or create a new one if no L2 connectivity exists between the clusters). Remove the Nexus VEM VIBs from all hosts (see the esxcli sketch after this list). Prepare the VXLAN fabric on the cluster and add it to the Provider VDC.
    VXLAN fabric after migration
  11. Upgrade vCNS Manager to NSX Manager with vCNS to NSX 6 Upgrade Bundle
    vCNS Manager upgrade
    NSX Manager
  12. Update hosts in vSphere Web Client > Networking and Security > Installation > Host Preparation. This action will require host reboot.
    Host preparation for NSX - before
    Host preparation for NSX - after
  13. (Optionally) install NSX controllers
    NSX Controller installation
  14. (Optionally) change VXLAN transport mode to Unicast/Hybrid.
    VXLAN transport mode
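
For step 10, here is a minimal sketch of the VEM VIB removal from the ESXi shell (the exact VIB name differs between VEM releases, so list it first and substitute the name your host reports; ideally run it with the host in maintenance mode):

# find the exact name of the Cisco VEM package on the host
esxcli software vib list | grep -i cisco
# remove it – the name below is an example only
esxcli software vib remove -n cisco-vem-v160-esx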

vCloud Director – vCenter Single Sign-On Integration Troubleshooting

Access to the vCloud Director provider context (system administrator accounts) can be federated with vCenter Single Sign-On. This means that when a vCloud system administrator wants to log into vCloud Director (http://vCloud-FQDN/cloud), he is redirected to the vSphere Web Client, where he needs to authenticate, and is then redirected back to vCloud Director.

VCD_SSO_Federation

Here follows a collection of topics that might be useful when troubleshooting the federation:

1. When SSO federation is enabled, the end user is always redirected to the vSphere Web Client. If you want to use local authentication, use http://vCloud-FQDN/cloud/login.jsp and enter a local or LDAP account (if LDAP is configured).

2. If you have enabled SSO federation and the vCenter SSO is no longer available, you cannot unregister its lookup service from the GUI. To do this, go to the vCloud database and clear the lookupservice.url value in the dbo.config table.
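
As a sketch, the database edit can look like this (stop the vCloud Director cells first and back up the database; I am assuming the usual name/value columns of the 5.x config table, so verify against your own schema):

SELECT name, value FROM dbo.config WHERE name = 'lookupservice.url';
UPDATE dbo.config SET value = '' WHERE name = 'lookupservice.url';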

3. In case you are using a self-signed, untrusted certificate on the vSphere Web Client, some browsers (Firefox) might not display the Add Security Exception button when being redirected. Therefore, first open the vSphere Web Client page directly and create the security exception; the redirection from the vCloud website should then work.

4. HTTP ERROR 500. Problem accessing /cloud/saml/HoKSSO/alias/vcd. Reason: Error determining metadata contracts
….
Metadata for issuer https://xxx:7444/STS wasn’t found

This error might appear after the vSphere Web Client SSO authentication when the user is redirected back to the vCloud portal. To understand this error, let's first talk about what is going on in the background. When SSO federation is enabled, vCloud Director establishes trust with vCenter SSO. The trust is needed so that the identity provider (SSO) knows that the request for authentication is not malicious (phishing), and so that the service provider (vCloud Director) can be sure the reply comes from the right identity provider.

The trust is established when the federation is configured via a metadata exchange that contains keys and some information about the other party. The SSO metadata can be seen in the vCloud Director database in the dbo.saml_id_provider_settings table. Now, during the actual authentication process, if for some reason the security token reply comes from a different source than the one expected based on the identity provider metadata, you will get this error.
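
If you want to see what the service provider side actually expects, you can inspect that table directly; a read-only query like this is enough:

SELECT * FROM dbo.saml_id_provider_settings;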

This issue might happen, for example, if the vCenter SSO hostname has been changed. In the particular case I encountered, it happened on vCenter Server Appliance 5.1. SSO had been initialized before the hostname was set, so the identity provider response came with metadata containing the issuer's IP address instead of the FQDN which the service provider (VCD) expected based on the SSO endpoint address. The issuer information did not get updated after the hostname change.

In VCSA 5.1 the issuer information is stored in the SSO PostgreSQL DB. These are the steps to change it:

  1. Get the SSO DB instance name, user and password

cat /usr/lib/vmware-sso/webapps/lookupservice/WEB-INF/classes/config.properties

db.type=postgres
db.url=jdbc:postgresql://localhost:5432/ssodb
db.user=ssod
db.pass=rTjsz9PRPLOFswBkJTCNAAVp

  2. Connect to the DB with the password retrieved in the previous step (db.pass=…)

/opt/vmware/vpostgres/1.0/bin/psql ssodb -U ssod

Retrieve the STS issuer:

ssodb=> select issuer from ims_sts_config;

If the issuer is indeed incorrect, update it with the following command:

ssodb=> update ims_sts_config SET issuer='https://FQDN:7444/STS';

Note: I need to credit William Lam for helping me find where the SSO DB password is stored.

Memory Overhead in vCloud Director

This little-known fact has come up twice recently, so I will write just a short post about it.

In the past, when you created an Allocation Pool Organization Virtual Datacenter (Org VDC) and allocated to it, for example, 20 GB RAM, it actually did not allow you to deploy VMs whose memory summed to 20 GB, because the virtualization memory overhead was charged against the allocation as well. This behavior forced service providers to add an additional 5%-15% to the Org VDC memory allocation. It was also very confusing for end users, who complained that their VM could not power on.

With the elastic Allocation Pool VDC changes that came with vCloud Director 5.1, this is no longer an issue. The reason is that in a non-elastic VDC it is vSphere (the Org VDC resource pool object with a limit set) that does the admission control – i.e. decides how many VMs can be deployed into it. In an elastic VDC it is actually vCloud Director that is responsible for the decision whether a VM can be deployed into a particular VDC.
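
A quick illustration with made-up numbers: take a 20 GB allocation and five VMs with 4 GB RAM each. In the old non-elastic model each VM also charged its virtualization overhead (roughly tens to a few hundred MB per VM, depending on vCPU count, memory size and so on) against the 20 GB resource pool limit, so the fifth VM could fail vSphere admission control. In the elastic model vCloud Director counts only the 5 x 4 GB of configured memory against the Org VDC, all five VMs power on, and the overhead is absorbed on the provider side.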

Allocation Pool Elasticity

So to sum it up: if you use an elastic Allocation Pool, the tenant can use it up to the last MB. The virtualization VM memory overhead is charged to the provider, who must take it into account when doing capacity management.

Datastore Cluster Issue in vCloud Director 5.5.1

I just found out there is a known issue with moving datastores in and out of datastore clusters. In vCloud Director 5.1 this was working fine: vCloud Director refers to storage via storage policies (formerly profiles), so you could change the datastore cluster structure on the fly (as long as all datastores inside had the same storage policy).

However, in vCloud Director 5.5.1, if you move a datastore in or out of a datastore cluster, vCloud Director will lose it. The fix is described in KB 2075366 and involves cleaning up the vCloud Director database inventory data.

VXLAN as an External vCloud Director Network

I was asked by a customer how to use a VXLAN network as an external network in vCloud Director. I thought a blog article had already been written about it but did not find any, so writing the answer here will hopefully benefit others as well.

Why?

The first question would be: why would you do it? Aren't vCloud Director external networks supposed to be the way to connect internal vCloud networks (usually VXLAN based) to the external world via VLAN-based networks through an Edge Gateway device? Yes, but there are a few use cases for using a VXLAN network as an external network.

  • Usage of a different virtual edge router than vShield Edge that supports needed features (IPv6, dynamic routing protocols). In the picture below you see a virtual Fortigate router in place of vShield Edge. The router is deployed manually and its internal interface is connected to a VXLAN network (again created manually) which acts as an external network directly connected to the Org VDC network. This helps save VLANs, which are usually a scarce resource in a service provider environment.
    Virtual Router
  • A service network spanning multiple pods and crossing L3 boundaries. Each pod (cluster) has its own L2 networking, so a VLAN cannot span all clusters; however, a VXLAN can. So a service network (for example a syslog or monitoring network) can be used by any VM in any rack. See this article on how to secure such a network in a multitenant environment.
    Service Network

How?

Although you can easily create a VXLAN network manually directly in vShield Manager (or in the vSphere Web Client if you use NSX), you will not see the VXLAN portgroup in the vCloud Director GUI.

Service Network

External Networks 1

The fix is simple: vCloud Director filters out all portgroups whose names start with the 'vxw' string. Rename the portgroup in vCenter Server (remove the string) and you will be able to select the portgroup as an External Network.

External Networks 2