vCloud Director 9.7 Appliance Tips

About half a year ago I published a blog post with a similar title about the vCloud Director 9.5 appliance. The changes between appliance versions 9.5 and 9.7 are so significant that I am dedicating a whole new article to the new appliance.

Introduction

The main difference compared to version 9.5 is that vCloud Director 9.7 now comes with an embedded PostgreSQL database option that supports replication with manually triggered, semi-automated failover. An external database is no longer supported with the appliance. Service providers can still use the Linux installable version of vCloud Director with an external PostgreSQL or Microsoft SQL database.

The appliance is provided as a single OVA file that contains 5 different configurations (flavors): primary node (small and large), standby node (small and large) and vCloud Director cell application node.

All node configurations include the vCloud Director cell application services; the primary and standby nodes also include the database and the replication manager binaries. It is possible to deploy a non-DB HA architecture with just the primary and cell nodes, however for production DB HA is recommended and requires a minimum of 3 nodes: one primary and two standbys. The reason two standbys are needed is that once replication is configured, the PostgreSQL database will not process any write requests unless it can synchronously replicate them to at least one standby node, so with a single standby its failure would block all writes. This also has implications for how nodes are removed from the cluster, which I will get to.

I should also mention that primary and standby nodes, once deployed, are equivalent from the appliance perspective, so a standby node can become primary and vice versa. There is always only one primary DB node in the cluster.

An NFS transfer share is required and is crucial for sharing information about the cluster topology among the nodes. In the appliance-nodes folder on the transfer share you will find data from each node (name, IP addresses, ssh keys) that is used to automate operations across the cluster.

Contrary to other HA database solutions, there is no network load balancing or single floating IP used here. Instead, all vCloud Director cells always point to the eth1 IP address of the (current) primary node for database access. During failover the cells are dynamically repointed to the IP of the new node that takes over the primary role.

Speaking of network interfaces, the appliance has two: eth0 and eth1. Both must be used and must be on different subnets. The first one (eth0) is primarily used for the vCloud Director services (http: ports 80, 443; console proxy: port 8443; jmx: ports 61611, 61616), while the primary role of the second one (eth1) is database communication (port 5432). You can use both interfaces for other purposes (ssh, management, ntp, monitoring, communication with vSphere / NSX, ...). Make sure you follow the correct order when configuring them, as it is easy to mix up the subnets or port groups.

Appliance Deployment

Before you start deploying the appliance(s), make sure the NFS transfer share is prepared and empty. Yes, it must be empty. When the primary node is deployed, responses.properties and other files are stored on the share and used to bootstrap the other appliances in the server group and the database cluster.

The process always starts with the primary node (small or large). I would recommend large for production and small for everything else. Quite a lot of data must be provided in the form of OVF properties (transfer share path, networking, appliance and DB passwords, vCloud Director initial configuration data). As it is easy to make a mistake, I recommend snapshotting the VM before the first power-on so you can always revert and fix whatever was wrong (the inputs can be changed in vCenter Flex UI, VM Edit Settings, vApp Options).

To see whether the deployment succeeded or why it failed, examine the following log files on the appliance:

firstboot: /opt/vmware/var/log/firstboot
vcd setup: /opt/vmware/var/log/vcd/setupvcd.log

The configuration data can be checked in: /opt/vmware/etc/vami/ovfEnv.xml
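
A quick way to scan both logs for failures is a small shell helper. This is only a sketch: the function name scan_deploy_logs and the case-insensitive "error|fail" pattern are assumptions made for illustration, not something the appliance ships with.

```shell
#!/bin/sh
# Hypothetical helper: print lines that look like failures from the given log files.
# The "error|fail" pattern is an assumption; adjust it to what your logs contain.
scan_deploy_logs() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            # -H prefixes each match with the file name, -n with the line number
            grep -H -n -i -E 'error|fail' "$log" || true
        else
            echo "skipped (not readable): $log"
        fi
    done
}

# On the appliance:
# scan_deploy_logs /opt/vmware/var/log/firstboot /opt/vmware/var/log/vcd/setupvcd.log
```

An empty result does not prove the deployment succeeded; it only means nothing matched the pattern, so still read the end of setupvcd.log.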

A successful deployment of the primary node results in a single-node vCloud Director instance with a non-replicated DB running on the same node, and with the responses.properties file saved to the transfer share, ready for other nodes. The file contains database connection information, certificate keystore information and the secret to decrypt encrypted passwords. Needless to say, this is pretty sensitive information, so make sure access to the NFS share is restricted.

Note about certificates: the appliance generates its own self-signed certificates for the vCloud Director UI/API endpoints (http) and console proxy access and stores them in the user certificates.ks keystore in /opt/vmware/vcloud-director, which is protected with the same password as the initial appliance root password. This is important because the encrypted keystore password in the responses.properties file will be used for the configuration of all other appliances, and thus you must deploy them with the same appliance root password. If not, you might end up with a half-working node where the database works but the vcd service does not, due to failed access to the certificates.ks keystore.

To deploy additional appliance nodes, use the standby or pure VCD cell node configurations. For HA DB you need at least two standbys. As these nodes all run the VCD service, deploying additional pure VCD cell nodes is needed only for large environments. The size of the primary and the standbys should always be the same.

Database Cluster Operations

Update 2019/06/14: The official documentation has been updated to include this information.

The database appliances currently provide a very simple UI on port 5480 that shows the cluster state. Its only operation is promoting a standby node, and only if the primary has failed (you cannot promote a standby in the UI while the primary is running).

Here is a cheat sheet of other database related operations you might need to do through CLI:

  • Start, stop and reload configuration of database on a particular node:
    systemctl start vpostgres.service
    systemctl stop vpostgres.service
    systemctl reload vpostgres.service
  • Show cluster status as seen by particular node:
    sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf cluster show
  • Planned DB failover (for example for a node maintenance). On the standby cell run:
    sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr standby switchover -f /opt/vmware/vpostgres/current/etc/repmgr.conf --siblings-follow
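
Since every repmgr call repeats the same sudo prefix and config path, a tiny wrapper can save some typing. The following is a sketch that assumes the /opt/vmware/vpostgres/current paths used above; the names vcd_repmgr, VPG_HOME and DRY_RUN are made up for this example.

```shell
#!/bin/sh
VPG_HOME=/opt/vmware/vpostgres/current   # path as used in the commands above

# Hypothetical wrapper: runs repmgr as the postgres user with the appliance config.
# With DRY_RUN=1 it only prints the command instead of executing it.
vcd_repmgr() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo -i -u postgres $VPG_HOME/bin/repmgr -f $VPG_HOME/etc/repmgr.conf $*"
    else
        sudo -i -u postgres "$VPG_HOME/bin/repmgr" -f "$VPG_HOME/etc/repmgr.conf" "$@"
    fi
}

# Examples:
# vcd_repmgr cluster show
# vcd_repmgr standby switchover --siblings-follow
```

Running it once with DRY_RUN=1 is a cheap way to confirm the exact command line before touching the cluster.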

Location of important database related files:
psql (DB CLI client): /opt/vmware/vpostgres/current/bin/psql
configuration, logs and data files: /var/vmware/vpostgres/current/pgdata

How to Rejoin Failed Database Node to the Cluster

The only supported way is to deploy a new node. You should deploy it as a standby node and, as mentioned in the deployment chapter, it will automatically bootstrap and replicate the database content. That can take some time depending on the database size. You will need to clean up the old failed VCD cell in the vCloud Director Admin UI, in the Cloud Cells section.

There is an unsupported way to rejoin a failed node without redeploying it, but use it at your own risk. All commands are run on the failed node:

Stop the DB service:
systemctl stop vpostgres.service

Delete stale DB data:
rm -rf /var/vmware/vpostgres/current/pgdata

Clone DB from the primary (use its eth1 IP):
sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby clone

Start the DB service:
systemctl start vpostgres.service

Add the node to repmgr cluster:
sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby register --force
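
The five steps above can be strung together in one function. This is a sketch of the same unsupported procedure, not a vetted tool: the names run, rejoin_standby, VPG and VPG_DATA are hypothetical, and it defaults to a dry run that only prints the commands because one of the steps deletes the data directory.

```shell
#!/bin/sh
# Sketch of the unsupported rejoin steps, to be run on the FAILED node.
# Defaults to a dry run that only prints the commands; set DRY_RUN=0 to execute.
VPG=/opt/vmware/vpostgres/current
VPG_DATA=/var/vmware/vpostgres/current/pgdata

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rejoin_standby() {
    primary_ip=${1:-}    # eth1 IP of the current primary
    if [ -z "$primary_ip" ]; then
        echo "usage: rejoin_standby <primary_database_IP>" >&2
        return 1
    fi
    # Stop at the first failing step instead of continuing blindly.
    run systemctl stop vpostgres.service || return 1
    run rm -rf "$VPG_DATA" || return 1
    run sudo -i -u postgres "$VPG/bin/repmgr" -h "$primary_ip" -U repmgr -d repmgr \
        -f "$VPG/etc/repmgr.conf" standby clone || return 1
    run systemctl start vpostgres.service || return 1
    run sudo -i -u postgres "$VPG/bin/repmgr" -h "$primary_ip" -U repmgr -d repmgr \
        -f "$VPG/etc/repmgr.conf" standby register --force || return 1
}

# Print the plan first, then execute it:
# rejoin_standby 10.0.4.62
# DRY_RUN=0 rejoin_standby 10.0.4.62
```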

How to Remove Failed Standby Node from the Cluster

On the primary node find the failed node ID via the repmgr cluster status command:
sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf cluster show

Now unregister failed node by providing its ID (e.g. 13416):
sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf standby unregister --node-id=13416

Clean up failed VCD cell in Cloud Cells VCD Admin UI.

How to Revert from DB Cluster to Single DB Node Deployment

As mentioned in the introduction, if you shut down both (all) standby nodes, your primary database will stop serving write I/O requests. So how do you get out of this pickle?

First, unregister both (deleted) standbys via the previously mentioned commands:

sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf cluster show
sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf standby unregister --node-id=<id1>
sudo -i -u postgres /opt/vmware/vpostgres/10/bin/repmgr -f /opt/vmware/vpostgres/10/etc/repmgr.conf standby unregister --node-id=<id2>

Delete appliance-nodes subfolders on the transfer share corresponding to these nodes. Use grep -R standby /opt/vmware/vcloud-director/data/transfer/appliance-nodes to find out which folders should be deleted.

For example:
rm -Rf /opt/vmware/vcloud-director/data/transfer/appliance-nodes/node-38037bcd-1545-49fc-86f2-d0187b4e9768

And finally edit postgresql.conf and change the synchronous_standby_names line to synchronous_standby_names = ''. This disables waiting for each transaction commit to reach at least one standby.

vi /var/vmware/vpostgres/current/pgdata/postgresql.conf

Reload the DB config with systemctl reload vpostgres.service and the database should start serving write I/O requests again.
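
If you prefer not to edit the file by hand, the postgresql.conf change can be scripted. A minimal sketch, assuming the setting appears as an active (uncommented) line; the function name disable_sync_standby is hypothetical and the .bak copy is my own precaution, not part of any documented procedure.

```shell
#!/bin/sh
# Sketch: clear synchronous_standby_names in a postgresql.conf file.
# Keeps a copy of the original next to it with a .bak suffix.
disable_sync_standby() {
    conf=${1:-/var/vmware/vpostgres/current/pgdata/postgresql.conf}
    cp -p "$conf" "$conf.bak" || return 1
    # Rewrite only an active (uncommented) setting; commented lines are left alone.
    # Writing back into the existing file keeps its owner and permissions.
    sed "s/^[[:space:]]*synchronous_standby_names[[:space:]]*=.*/synchronous_standby_names = ''/" \
        "$conf.bak" > "$conf"
}

# On the primary, as root:
# disable_sync_standby && systemctl reload vpostgres.service
```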

Upgrade and Migration to Appliance

Moving from either Linux cells or the 9.5 appliance to the 9.7 appliance with embedded DB requires a migration. Unfortunately, it is not possible to just upgrade the 9.5 appliance to 9.7 due to the embedded database design.

The way to get to the 9.7 appliance is to first upgrade the existing environment to 9.7, then deploy a brand new 9.7 appliance-based environment and transplant the old database content into it.

It is not a simple process. I recommend testing it up front on a production clone so you are not surprised during the actual migration maintenance window. The procedure is documented in the official docs; I will provide only the high-level process and my notes.

  • Upgrade the existing setup to version 9.7(.0.x). Shut down the VCD service and back up the database, global.properties, responses.properties and certificate files. Shut down the nodes if we are going to reuse their IPs.
  • Prepare a clean NFS share and deploy a single-node appliance-based VCD instance. I prefer to do the migration on a single-node instance and then expand it to multi-node HA when the transplant is done.
  • Shut down the vcd service on the appliance and delete its vcloud database so we can start with the transplant.
  • We will restore the database (if the source is MS SQL we will use cell-management-tool migration) and overwrite the global.properties and responses.properties files. Do not overwrite the user certificates.ks file.
  • Now we will run the configure script to finalize the transplant. At this point, on the 9.7.0.1 appliance, I hit a bug related to SSL DB communication. If your global.properties file contains a vcloud.ssl.truststore.password line, comment it out and run the configure script with SSL disabled. This is my example:
    /opt/vmware/vcloud-director/bin/configure --unattended-installation --database-type postgres --database-user vcloud \
    --database-password 'VMware1!' --database-host 10.0.4.62 --database-port 5432 \
    --database-name vcloud --database-ssl false --uuid --keystore /opt/vmware/vcloud-director/certificates.ks \
    --keystore-password 'VMware1!' --primary-ip 10.0.1.62 \
    --console-proxy-ip 10.0.1.62 --console-proxy-port-https 8443
  • Update 2019/05/24: The correct way to resolve the bug is to also copy the truststore file from the source (if the file does not exist, which can happen if the source was freshly upgraded to 9.7.0.1 or later, start the vmware-vcd service at least once). The official docs will be updated shortly. The configure script can then be run with SSL set to true:
    /opt/vmware/vcloud-director/bin/configure --unattended-installation --database-type postgres --database-user vcloud \
    --database-password 'VMware1!' --database-host 10.0.4.62 --database-port 5432 \
    --database-name vcloud --database-ssl true --uuid --keystore /opt/vmware/vcloud-director/certificates.ks \
    --keystore-password 'VMware1!' --primary-ip 10.0.1.62 \
    --console-proxy-ip 10.0.1.62 --console-proxy-port-https 8443
    Note that the keystore password is the initial appliance root password! We are still reusing the appliance's autogenerated self-signed certs at this point.
  • If this went right, start the vcd service and deploy additional nodes as needed.
  • On each node replace the self-signed certificate with a CA-signed one.

Backup and Restore

Backing up the appliance is very easy, restoring it less so. The backup is triggered from the primary node with the command:

/opt/vmware/appliance/bin/create-db-backup

It creates a single tar file with the database content and the additional data needed to fully restore the vCloud Director instance. The problem is that partial restores (that would reuse existing nodes) are nearly impossible (at least in the HA DB cluster scenario), and the restore involves basically the same procedure as the migration.

CA Certificate Replacement

There are probably many ways to accomplish this. You can create your own keystore and import certificates from it into the existing appliance /opt/vmware/vcloud-director/certificates.ks keystore with the cell-management-tool certificates command. Or you can replace the appliance certificates.ks file and re-run the configure command. See here for a deep dive.

Note that the appliance UI (on port 5480) uses different certificates. These are stored in /opt/vmware/appliance/etc/ssl. I will update this post with the procedure once it is available.

External DB Access

In case you need to access the vCloud Director database externally, you must add the IP address or subnet of the external host to the pg_hba.conf file. However, the pg_hba.conf file is dynamically generated and any manual changes will be quickly overwritten. The correct procedure is to create a new file (with any name) on the DB appliance node in the /opt/vmware/appliance/etc/pg_hba.d folder with a line similar to this:

host all all 10.0.2.0/24 md5

This means that any host from the 10.0.2.0/24 subnet will be able to log in with password authentication as any database user and access any database.
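
Creating the drop-in file can be scripted as well. The sketch below only writes the one-line rule shown above; the function name add_pg_hba_entry and the default file name external-access are arbitrary choices for this example.

```shell
#!/bin/sh
# Sketch: write a pg_hba drop-in file allowing password (md5) logins from one subnet.
# The default file name "external-access" is arbitrary; any name works.
add_pg_hba_entry() {
    cidr=${1:-}
    name=${2:-external-access}
    dir=${3:-/opt/vmware/appliance/etc/pg_hba.d}
    if [ -z "$cidr" ]; then
        echo "usage: add_pg_hba_entry <cidr> [file_name] [directory]" >&2
        return 1
    fi
    # Same fields as the example line: type, database, user, address, auth method
    printf 'host all all %s md5\n' "$cidr" > "$dir/$name"
}

# On the DB appliance node, as root:
# add_pg_hba_entry 10.0.2.0/24
```

Afterwards, check that the line shows up in the generated pg_hba.conf before testing the connection from the external host.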

There is currently no easy way to use a network load balancer that always points to the primary node. This is planned for the next vCloud Director release.

Postgres User Time Bomb

Both the vCloud Director 9.7 and 9.7.0.1 appliance versions have an unfortunate time bomb issue where the postgres user account expires 60 days after the appliance was created (not deployed). When that happens, the repmgr commands triggered via ssh stop working. So, for example, UI-initiated failover with the promote button will not work.

The 9.7 appliance postgres user expires on May 25 2019, the 9.7.0.1 appliance postgres user on July 9 2019. The fix is to run the following command as root on each DB appliance (see KB 70332):
chage -M -1 -d 1 postgres

You can check the postgres account status with:
chage -l postgres
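
When checking several nodes it helps to pull out just the expiry field. A small sketch, assuming the usual English-language chage -l output where the relevant line starts with "Password expires"; the function name password_expiry is made up.

```shell
#!/bin/sh
# Sketch: extract the "Password expires" field from `chage -l` output on stdin.
# Assumes the English label; the output is localized on some systems.
password_expiry() {
    awk -F':' '/^Password expires/ {
        sub(/^[ \t]+/, "", $2)   # trim the padding after the colon
        print $2
    }'
}

# On each DB node, as root:
# chage -l postgres | password_expiry
```

After the fix has been applied, the expectation is that this prints never.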

 

 


32 thoughts on “vCloud Director 9.7 Appliance Tips”

  1. *sorry for the double comment – I realised I commented on the wrong post*
    I’ve hit a nasty issue when I’ve attempted to finish the migration from the 9.5 appliance with sql dB to the 9.7.0.1 appliance with embedded dB.
    After copying over all the files to the new appliance and doing the dB migrate it all looks ok. But when I run the configuration command – I get an error back saying that my keystore file has been tampered with or password is wrong. I traced in the logs that it appears to be referring to the truststore file – which I noticed was a new instruction in the VMware docs to copy over – and something from your notes too.
    My previous appliance was working fine, and I’ve ensured that the right files were copied over which included the truststore password. I’ve redeployed the appliance about 4 times to test various things and make sure I’ve followed the docs to the T but still get to the same spot.
    I’ve also tried it with the default truststore, password and still get the same issue.
    I’ve got a case open with VMware – but wonder if this is something nobody has come across yet? I literally can’t find anything about the truststore bar some loose references on your blog, and some referring to the vcac appliance which doesn’t apply.

      1. As a side note – I have this fixed to a degree now.

        I looked back over your notes about your issues with the truststore and decided to test that out. I removed the truststore password line from my global properties, and deleted my truststore file completely so it didn’t exist.

        Then re-ran the configuration command with the database SSL set to false – it went straight through no issues, regenerated a new truststore and added a new password line. I was then able to re-run with my copied over keystore again and it was also fine.

        After bringing it up and testing it out – it all appears to be OK and working fine – I haven’t found anything missing/wrong/etc (in fact it’s actually slightly better as my vCD Cell doesn’t show as unknown anymore in the web UI).

        I wondered about re-running the command again now I have a fresh truststore file and re-enabling database SSL to see if it works – thoughts?

        Gist being – so far it looks like your original fix was actually what worked for me – copying the truststore file as well caused me a day’s worth of head banging on a brick wall.

      2. And finally to further add to this – after a few emails with VMWare support – I shutdown the cell and re-ran the config command with Database SSL true instead – and it also worked fine.

        It looks like the main issue was copying the original truststore file over and leaving the password line in the global.properties file.

        I suspect that your original fix is actually correct – however VMWare’s fix after that can still be applied as well – but you have to do it in that order. If you try to just follow the straight VMWare guide word for word – I would suspect others are likely going to hit this as well.

        Unless I just had some really bad luck with my truststore file? But I can’t see how given the global.props file had the right PW in, and the truststore appeared intact. *shrug*.

        At least I’ve got my production cutover to 9.7 with the embedded DB now. Would just love for RabbitMQ and an NFS share to be replicated between nodes like the DB is – then I don’t need an external server for my vCD Appliance 🙂

        One thing that should be said – is that the new appliance definitely boots and gets the Web UI up a tonne faster than the old appliance.

  2. hello

    Are we allowed to update hardware version for vCloud director appliance. Is it something supported. When appliance is deployed it is deployed with HW version 10 and Guest OS version is shown as “other Linux 3.x (64 bit)” instead of Photon OS which is not correct. When hardware version is updated e.g 14 Guest OS starts correctly showing Photon OS.

    I could see hardware version update is not allowed for VCSA however could not find similar information for vCD appliance
    https://kb.vmware.com/s/article/1010675

  3. We have been trying to deploy the primary node of this appliance for a green field deployment and have been having a ton of problems getting connectivity to both eth0 and eth1. We have tried deploying this on the same as well as separate subnets, but we are never able to ping or connect to the management UI from both IPs. Have you seen this problem before? We need to know whether this will become a problem once we start deploying additional cell applications and/or standby nodes.

  4. Hi Tom,
    thanks for your blogs!

    We deployed 3 cells using ovftool. We are testing how postgresHA works, powering the cells on and off, and also deleting and re-deploying them.
    The postgresHA works well, but we got to a point where we have cells displayed at: https:///cloud/#/multiCellPage? that we cannot delete from this page. These cells do not exist, or were powered off.

    Do you have any idea how to delete these cells from this page (and/or from the database)?

    The delete option is disabled and all cell have a green ok status, in that page.

    Regards,
    Leonardo.

  5. Hi,
    I deployed Primary, Standby and Cell application. Started vcd service on all and getting below error during start
    Error Starting Application: Unable to decrypt encrypted property: “database.password”.
    Do we have to give database information in global.properties?? these fields are currently empty in all three (primary, standby and vcd cell)
    database.jdbcurl =
    database.username =
    database.password =
    I am assuming we only created database on primary during configuration. do we need to give that information here? or i can remove these fields?

      1. yes. I see “appliance-nodes” directory created on all three cells (primary, standby and cell application)

      1. There was NFS issue during firstboot. I fixed the issue, mounted the NFS share manually and restarted vmware-vcd service

        I am now restarting all the nodes to see if it helps.

  6. Sure. I will clean NFS and redeploy. Just want to confirm if the following configuration will be good

    Primary Node (Large)
    Standby Node (Small)
    Cell Application

  7. Thanks. I deployed a total of 3 nodes (primary + 2 standbys). The vcd application is running on the primary node.

    I need help on creating HA cluster. Can you provide me step by step configuration?

      1. vpostgres is running on all 3 nodes and vmware-vcd only on primary. I am able to access the application.

        Problem is when I open primaryip:5480, it says “no nodes found in cluster, this likely postgres sql is not running on this node.”

        I am not able to see available databases there.

  8. Hello, What does it mean “keystore file has been tampered with or password is wrong”? Tried to migrate Linux with MS SQL to Appliance

  9. Hi Tomas,

    I may be missing something, but…

    If migrating from a two cell 9.7 with external Postgres to the 9.7 appliance, what happens when you deploy the third appliance? specifically, step 14 on the guide outlines to backup, copy and replace global.properties, response.properties, and certs, proxycerts and truststore – but what do you replace them with? The backup of these files (when wanting to use same IP) came from the original two nodes- which I’ve used for the new appliances -do I still have the run step 14 a,b,c,d?

    Thanks in advance

    1. 3rd or 4th, etc. appliances are automatically added to the cell server group when deployed (they get all the necessary information from the transfer share). All you need is to replace self signed certificates (if necessary).

  10. Hi,

    I have my vApp created with Org VDC Network. All VMs of vApp are connected to Org VDC network (192.168.100.x)

    Edge is deployed which is connected to External Network (10.x.x.x). Edge interfaces have 192.168.100.1 IP and the external IPs list (10.x.x.x) configured.

    I have fenced the vApp Org VDC network, created SNAT rules to translate the Org VDC IP to External Network IP. But I am still not able to connect to Internet from my vApp.

    Any thoughts?

    Raghav
    raghavsachdeva29@gmail.com
