Architecting a VMware vCloud Availability for vCloud Director Solution

Another vCloud Architecture Toolkit whitepaper that I authored was published on the vCAT SP website – it discusses how to architect vCloud Availability solution in large production scenarios.

It is based on real live deployments and includes the following chapters:

 

 

 

  • Introduction
  • Use Cases
    • Disaster Recovery
    • Migration
  • vCloud Availability Architecture Design Overview
    • vCloud Availability Architecture
    • Network Flows
    • Conceptual Architecture
  • vCloud Availability Management Components
    • Logical Architecture
    • vCloud Availability Portal
    • Cloud Proxy
    • RabbitMQ
    • Cassandra Database
    • VMware Platform Services Controller
    • vSphere Replication Cloud Service
    • vSphere Replication Manager
    • vSphere Replication Servers
    • ESXi Hosts
    • vCloud Availability Metering
    • vRealize Orchestrator
    • Management Component Resiliency Considerations
  • vCloud Director Configuration
    • User Roles
    • Tenant Limits and Leases
    • Organization Virtual Data Center
    • Network Management
    • Storage Management
    • vApps and Virtual Machines
  • Billing
  • vRealize Orchestrator Configuration
    • On-Premises Deployment
    • In-the-Cloud Deployment
    • Provider Deployment
    • Failover Orchestration
  • Monitoring
    • Component Monitoring
    • VM Replication Monitoring
    • Backup Strategy
  • Appendix A – Port Requirements / Firewall Rules
  • Appendix B – Glossary
  • Appendix C – Maximums
  • Appendix D – Reference Documents
  • Appendix E – Tenant API Structure
  • Appendix F – Undocumented HybridSettings vCloud API
  • Appendix G – Monitoring

Download from the vCAT-SP website: https://www.vmware.com/solutions/cloud-computing/vcat-sp.html or direct link to pdf.

vCloud Director 8.20: Granular Role Based Access Control

vCloud Director 8.20 introduces the possibility to create granular roles at tenant and system level. This is important for service providers who want to differentiate which tenants have access to specific features (for example advanced networking services). This also gives opportunity to tenants to create their own roles that correspond to their team structure (e.g. network administrator). And lastly, system administrator can create additional roles in system context with access to subset of features.

A role is a set of rights which can be assigned to a user or a group. There are many new rights in vCloud Director 8.20. A few examples:

Access to Distributed Firewall:

  • Enable / Disable Distributed Firewall

Gateway Advanced Services

  • Configure IPSEC VPN
  • Configure Load Balancer
  • Configure BGP Routing
  • Configure OSPF Routing
  • Configure SSL VPN
  • Configure Firewall
  • Configure DHCP
  • Configure NAT
  • Configure L2 VPN
  • Configure Static Routing

Or system level rights like:

Host

  • Upgrade Host
  • Repair Host
  • Migrate Host VMs
  • Open a Host in vSphere
  • Enable / Disable a Host
  • Prepare / Unprepare a Host
  • View Host

Prior vCloud Director 8.20

  • Only global roles could be created by system administrator next to handful of predefined roles (vApp Author, Organization Administrator, …).
  • Every organization would have access to the global and predefined roles.
  • Organization administrator could assign the roles to organization users.
  • Service provider could not differentiate access to features among different tenants.
  • There was only one system administrator role with access to everything.

vCloud Director 8.20

  • Roles are no longer global, but instead are organization specific.
  • Former global and predefined roles become role templates.
  • Service provider can create new role templates.
  • Role templates are used to instantiate organization specific roles.
  • Service provider can selectively grant rights to specific organizations.
  • Organization administrator can create own organization specific roles from subset of granted rights.
  • New roles can be created in the system context from subset of system administrator rights.

The transition from pre-vCloud Director 8.20 role management happens during upgrade to 8.20. Existing roles are transferred to role templates and each organization gets its own roles instantiation based on the role templates. The UI has changed and now includes Organization column and filter. A new System organization is added with default System Administrator role.

vCloud Director 8.10 UI
vCloud Director 8.20 UI

Tenant Rights and Role Management

When a new organization is created it will have access to all rights that are used in role templates. System administrator can grant additional rights to the organization with vCloud API only:

GET /api/admin … get references to all rights in VCD instance

GET /api/admin/org/<org-id>/rights … get references to all rights in the organization

PUT /api/admin/org/<org-id>/rights edit rights in the organization

System administrator or Organization Administrator can create new roles in its organization with vCloud API only:

POST /admin/org/<org-id>/roles

Note: While system administrator can edit tenant roles in the UI, editing of a role based on role template would change the role template and thus change it for all organizations (more below).

How to Create Global Role

The UI no longer allows creation of global roles, only organization specific roles can be created that way.

However, there is a way to create global role (actually role template) with the legacy API (e.g. version 9.0, 20.0 but not 27.0). Here is an example:

POST /api/admin/roles
Header:
Accept: application/*;version=9.0
Content-Type: application/vnd.vmware.admin.role+xml

Body:

<?xml version="1.0" encoding="UTF-8"?>
<Role xmlns="http://www.vmware.com/vcloud/v1.5" name="New Global Role">
	<Description>My new global role</Description>
	<RightReferences>
		<RightReference href="https://vcloud.fojta.com/api/admin/right/0b8c8cd2-5af9-32ad-a0bd-dc356503a552" name="General: Administrator View" type="application/vnd.vmware.admin.right+xml"/>
		<RightReference href="https://vcloud.fojta.com/api/admin/right/5e579955-fe9d-3f0b-bc6b-a3da4db328f1" name="Group / User: View" type="application/vnd.vmware.admin.right+xml"/>
		<RightReference href="https://vcloud.fojta.com/api/admin/right/2cd2d9d7-262c-34f8-8bee-fd92f422cc2c" name="General: Administrator Control" type="application/vnd.vmware.admin.right+xml"/>
	</RightReferences>
</Role>

Note: Using above API call with API version 27.0 would create the role in the system organization.

How to Edit Global Roles?

Again with legacy vCloud API we can list all global (template) and system organization roles:

GET /api/admin
Header:
Accept: application/*;version=9.0
Response:
<RoleReferences>
	...
	<RoleReference href="https://vcloud.fojta.com/api/admin/role/75717adf-8700-419e-afe1-d5e2ea3b0bd6" name="User Admin" type="application/vnd.vmware.admin.role+xml"/>
	...
</RoleReferences>

After finding the right role reference we can delete the role template with the following call:

DELETE /api/admin/role/<role-id>
Header:
Accept: application/*;version=9.0

Addition and removal of rights from a role template:

  • In UI add/remove the right from the role which is based on role template from any organization.
  • To add a new right, the organization needs to have access to the right. If it does not have, add it first with the API calls mentioned above.
  • Adding or removing rights to a role based on role template will affect all other organizations.
    • Adding right: other organizations will see the new right if their instance of role template has been granted the right. If the organization did not have access to the right, the right will not be added!
    • Removing right: in other organizations the right will be removed from the role based on the role template

 

The post was written with kind support of John Kilroy.

vCloud Director 8.20: Distributed Firewall

NSX Distributed Firewall (DFW) is the most popular feature of NSX which enables microsegmenation of networks with vNIC level firewalls in hypervisor. For real technical deep dive into the feature I recommend reading Wade Holmes free e-book available here.

vCloud Director 8.20 provides this feature to tenants with brand new HTML5 UI and API. It is managed at Org VDC level from Manage Firewall link. This opens new tab with the new user interface.

manage-firewall

dfw-ui

Firewall Comparison

vCloud Director now offers three different firewalls types for tenants, which might be confusing. So let me quickly compare them.

firewall-comparison

The picture above shows two Org VDCs each with different network topologies. Org VDC 1 is using Org VDC Edge Gateway that provides firewalling as well as other networking services (load balancing, VPNs, NAT, routing, etc.). It has also brand new UI and Network API. Firewalling at this level is enforced only on packets routed through the Edge Gateway.

One level below we see vApps with vApp Edges. These provide routing, firewalling and NAT between routed vApp Network and Org VDC network. There is no change in firewall capability of vApp Edge in vCloud Director 8.20 and old flash UI and vCloud API can be used for its configuration. Firewalling at vApp Edge level is enforced only on packets routed between Org VDC and vApp networks.

Distributed firewall is applied at the vNIC level of virtual machines. It means it can inspect every packet and frame coming and leaving VM and is therefore completely independent from the network topology and can be used for microsegmentation of layer 2 network. Both layer 3 and layer 2 rules can be created.

Obviously all three firewall types can be combined and used together.

Managing Access to Distributed Firewall

There are four new access rights related to DFW in vCloud Director.

  • Manage Firewall
  • Configure Distributed Firewall Rules
  • View Distributed Firewall Rules
  • Enable / Disable Distributed Firewall

The last right is by default available only to system administrators, therefore the provider can control which tenant can and cannot use DFW and it can thus be offered as a value added service. The provider can either enable DFW selectively for specific Org VDCs or alternatively grant Enable/Disable Distributed Firewall right to a specific organization via API. The tenant can enable DFW by himself.

Distributed Firewall under the Hood

Each tenant is given a section in the NSX firewall table and can only apply rules to VMs and Edge Gateways in his domain. There is one section for each Org VDC that has DFW enabled and it is created always on top.

Edit 3/14/2017: In fact it is possible to create the section at the bottom just above the default section. This allows provider to create its own section on the top which will be always enforced first. The use case for this could be service network.

To force creation of the section at the bottom the firewall must be enabled with API call with ?append=true at the end.

Example: 

POST https://vcloud.fojta.com/network/firewall/vdc/be0f2baa-d36f-47f0-8443-3c5cac231ba5?append=true

Org VDC Section Appended at the Bottom

As tenants could have overlapping IPs all rules in the section are scoped to a security group with dynamic membership of tenant Org VDC resource pools and thus will be applied only to VMs in the Org VDC.

nsx-dfw-section
Org VDC section in NSX DFW
org-vdc-security-group
Org VDC Security Group

Tenants can create layer 3 (IP based) or layer 2 (MAC based) rules while using the following objects when defining them:

  •  IP address, IP/MAC sets
  • Virtual Machine
  • Org VDC Network
  • Org VDC

Note that using L3 non-IP based rules requires NSX to learn IP address(es) of the guest VM. One of the following mechanism must be enabled:

  • VMware Tools installed in guest VM
  • DHCP Snooping IP Detection Type
  • ARP Snooping IP Detection Type

IP Detection Type is configured in NSX at Cluster Level in Host Preparation tab.

host-preparation

ip-detection-type

Scope for each rule can be defined in Applied To column. As mentioned before by default it is set to the Org VDC, however tenant can further limit the scope of the rule to a particular VM, or Org VDC network (note that vApp network cannot be used). It is also possible to apply the rule to Org VDC Edge Gateway, in such case the rule is actually created and enforced on the Edge Gateway as pre-rule which has precedence over all other firewall rules defined at that Edge Gateway.

DFW Rule Applied to Edge GW
DFW Rule Applied to Edge GW

Tenant can enable logging of a specific firewall rule with API by editing <rule … logged=”true|false”> element. NSX then logs the first session packet matching the rule to ESXi host log with tenant specific tag (Org VDC UUID subset string). The provider can then filter such logs and forward them to tenants with its own syslog solution.

logging
NSX DFW Rule Tenant Tag

vCloud Director 8.20: Orchestrated Upgrade

vCloud Director architecture consist of multiple cells that share common database. The upgrade process involves shutting down services on all cells, upgrading them, upgrading the database and starting the cells. In large environments where there are three or more cells this can be quite labor intensive.

vCloud Director 8.20 brings new feature – an orchestrated upgrade. All cells and vCloud database can be upgraded with a single command from the primary cell VM. This brings two advantages. Simplicity – it is no longer needed to login to each cell VM, upload binaries and execute upgrade process manually. Availability – downtime during the upgrade maintenance window is reduced.

Prerequisites

Set up ssh private key login from the primary cell to all other cells in the vCloud Director instance for user vcloud.

  1. On the primary cell generate private/public key (with no passphrase):

    ssh-keygen -t rsa -f $VCLOUD_HOME/etc/id_rsa
    chown vcloud:vcloud $VCLOUD_HOME/etc/id_rsa
    chmod 600 /opt/vmware/vcloud-director/etc/id_rsa
     

  2. Copy public key to each additional cell in the instance to authorized_keys file. This can be done with one line command ran from the primary cell or with this ssh-copy-id. Use IP/FQDN it is registered with in VCD

    cat $VCLOUD_HOME/etc/id_rsa.pub | ssh root@<cell-IP> “mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys” 

  3. Verify that login with private key works for each secondary cell in the environment

    sudo -u vcloud ssh -i $VCLOUD_HOME/etc/id_rsa root@<cell IP/FQDN>

Multi-cell Installation

Upload vCloud Director binary to the primary cell and make it executable. Execute the file with private-key-path option pointing to the private key.

/root/vmware-vcloud-director-distribution-8.20.0-5070903.bin –private-key-path $VCLOUD_HOME/etc/id_rsa

 

Optionally a maintenance cell can be specified with –maintenance-cell option.

For troubleshooting, the upgrade log is located on the primary cell in  $VCLOUD_HOME/logs/upgrade-<date and time>.log

For no-prompt execution you can add –unattended-upgrade option.

Workflow

This is the workflow that is automatically executed:

  1. Quiesce, shutdown and upgrade of the primary cell. Does not start the cell.
  2. If maintenance cell was specified, it is put into maintenance mode.
  3. Quiescing and shut down of all the other cells.
  4. Upgrade of the vCloud Database (a prompt for backup)
  5. Upgrade and start of all other cells (except the maintenance cell)
  6. If maintenance cell was specified, it is upgraded and started.
  7. Start of the primary cell

What is the difference between a quiesced cell and a cell in the maintenance mode?

Quiesced cell:

  • finishes existing long running operations
  • answers to new requests and queues them
  • does not dequeue any operations (they will stay in the queue)
  • VC lister keeps running
  • Console proxy keeps running

Cell in maintenance mode

  • waits for finish of long running but fails all queued operations
  • answer to most requests with HTTP Error code 504 (unavailable)
  • still issues auth token for /api/sessions login requests
  • No VC listener
  • No Console proxy

Interoperability with vCloud Availability

vCloud Availability uses Cloud Proxies to terminate replication tunnels from the internet. Cloud Proxies are essentially stripped down vCloud Director cells and are therefore treated as regular cells during the orchestrated upgrade.

Quiesced Cloud Proxy has no impact on replication operations and traffic. Cloud Proxy in the maintenance mode still preserves existing replications however new replications cannot be established.

2/27/2017: Multiple edits based on feedback from engineering. Thank you Matthew Frost!

VCDNI to VXLAN Migration

vCloud Network Isolation (VCDNI or VCNI) is legacy mechanism to create overlay logical networks independently from physical networking underlay. It was originally used in VMware vCenter Lab Manager (where it was known as Cross Host Fencing). vCloud Director offers it as one of many mechanisms for creation of logical networks (next to VXLAN, VLAN and port group backings). VCDNI uses VMware proprietary MAC-in-MAC encapsulation done by vCloud Agent running in ESXi host vmkernel.

It has been for some time superseded by VXLAN technology which is much more scalable, provides better performance and is industry standard technology. VXLAN network pools have been available in vCloud Director since version 5.1.

VCDNI is consumed by manual creation of a vCloud Network Isolation backed Network Pool that is mapped to an underlay VLAN network with up to 1000 logical networks for each pool (VLAN).

As a deprecated and obsolete technology it is no longer supported in vSphere 6.5 and vCloud Director 8.20 is the last release that will support such network pools. vCloud Director 8.20 also provides simple mechanism to perform low-disruption migrations for Org VDC and vApp networks to VXLAN backed networks. Such migration must be done before upgrade to vSphere 6.5 (see more in KB 2148381).

The migration can be performed via UI or API by system administrator with Org VDC granularity.

Migration via UI

  1. For an Org VDC using VCDNI network pool open in the System tab – Manager & Monitor, Org VDC properties (note that doing the same from Org tab will not work).
    org-vdc
  2. Go to Network Pool & Services tab and change VCDNI backed network pool to VXLAN backed one and click OK.
    network-pool
  3. Again open Network Pool & Services tab of the Org VDC. Migrate to VXLAN button will now appear.
    migrate-to-vxlan
  4. Click the button, confirm the message and start the migration.
    confirmation
  5. After while the Org VDC status will change from busy to ready and the migration is finished. Details (and possible errors) can be reviewed in the Recent Tasks of the Audit Log.
    audit-log

Migration with vCloud API

Org VDC network migration is triggered by single API POST call at the Org VDC level.

POST /api/admin/vdc/<org VDC UUID>/migrateVcdniToVxlan
Content Type: application/vnd.vmware.admin.vdcnitovxlanmigration+xml

The Process

The following happens in the background when migration is triggered for each VCDNI backed network in an Org VDC:

  1. ‘Dummy’ VXLAN logical switch is created
  2. All VMs connected to VCDNI network are reconnected to the new VXLAN logical switch
  3. Edge Gateways connected to VCDNI network are connected to the new VXLAN logical switch
  4. Org VDC/vApp network backing is changed in vCloud DB to use the new VXLAN logical switch
  5. Original VCDNI port group is deleted

Small network disruption is expected during VM and Edge Gateway reconnections. The following Recent Tasks picture from vSphere Client shows what is happening at vCenter Server level and how much time each task could take. In the example there was one Org VDC network and one vApp network migrated with VM1 and Edge Gateway ACME-GW2 involved.

vc-recent-tasks