vmware – CyberNils.net

VMware Cloud Foundation 5.0.0.1 to 5.1 Upgrade Notes

March 26, 2024Nils KristiansenLeave a comment

I recently upgraded a customer from VMware Cloud Foundation (VCF) 5.0.0.1 to 5.1. The upgrade went well in the end, but I had some issues along the way that I would like to share in this quick post.

The first issue I ran into was that I was unable to select 5.1 as target version and an error message saying “not interopable: ESX_HOST 8.0.2-22380479 -> SDDC_MANAGER 5.0.0.1-22485660“. I found VMware KB95286 which resolved this problem.

After the SDDC Manager was upgraded to 5.1, I got the following error message when going to the Updates tab for my Management WLD :

Retrieving update patches bundles failed. Unable to retrieve aggregated LCM bundles: Encountered error requesting http://127.0.0.1/v1/upgrades api - Encountered error requesting http://127.0.0.1/v1/upgrades api: 500 - "{\"errorCode\":\"VCF_ERROR_INTERNAL_SERVER_ERROR\",\"arguments\":[],\"message\":\"A problem has occurred on the server. Please retry or contact the service provider and provide the reference token.\",\"causes\":[{\"type\":\"com.vmware.evo.sddc.lcm.model.error.LcmException\"},{\"type\":\"java.lang.IllegalArgumentException\",\"message\":\"No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE\"}],\"referenceToken\":\"H0IKSH\"}"
Scheduling immediate update of bundle failed. Something went wrong. Please retry or contact the service provider and provide the reference token.

Going to Bundle Management in SDDC Manager gave me the following error message:

Retrieving available bundles failed. Unable to retrieve aggregated domains upgrade status: Encountered fetching http://127.0.0.1/lcm/inventory/upgrades api - No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE Retrieving all applicable bundles failed. Encountered fetching http://127.0.0.1/lcm/inventory/upgrades api - No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE

Fortunately my colleague Erik G. Raassum had blogged about this issue the day before: https://blog.graa.dev/vCF510-Upgrade

The solution was to follow VMware KB94760 and Delete all obsolete bundles.

Next up was that the NSX Precheck failed with the following error message:

NSX Manager upgrade dry run failed. Do not proceed with the upgrade. Please collect the support bundle and contact VMWare GS. Failed migrations: Starting parallel Corfu Exception during Manager dry-run : Traceback (most recent call last): File "/repository/4.1.2.1.0.22667789/Manager/dry-run/dry_run.py", line 263, in main start_parallel_corfu(dry_run_path) File "/repository/4.1.2.1.0.22667789/Manager/dry-run/dry_run.py", line 150, in start_parallel_corfu subprocess.check_output([str(fullcmd)], File "/usr/lib/python3.8/subprocess.py", line 415, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/usr/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['python3 /repository/4.1.2.1.0.22667789/Manager/dry-run/setup_parallel_corfu.py']' returned non-zero exit status 255.

I digged into the logs without finding anything helpful. I started thinking what the technical geniuses at The IT Crowd would do, so I rebooted all NSX Manager nodes and tried the upgrade again. This time the precheck succeeded for NSX Manager, but it failed for the NSX Edge Nodes with the following error message:

nkk-c01-ec01 - Edge group upgrade status is FAILED for group 3373386e-5c41-4851-806d-76f0841a5a7d nkk-c01-en01 : [Edge 4.1.2.1.0.22667789/Edge/nub/VMware-NSX-edge-4.1.2.1.0.22667799.nub download OS task failed on edge TransportNode aa134203-1446-4c65-b17f-41c60e325d55: clientType EDGE , target edge fabric node id aa134203-1446-4c65-b17f-41c60e325d55, return status download_os execution failed with msg: Exception during OS download: Command ['/usr/bin/python3', '/opt/vmware/nsx-common/python/nsx_utils/curl_wrapper', '--show-error', '--retry', '6', '--output', '/image/VMware-NSX-edge-4.1.2.1.0.22667799/files/target.vmdk', '--thumbprint', '7aa5bae4a6eddf034c42d0fb77613e9212fec19a7855a8db0af37ed71c3fe7f6', 'https://nkk-c01-nsx01a.cybernils.net/repository/4.1.2.1.0.22667789/Edge/ovf/nsx-edge.vmdk'] returned non-zero code 28: b'curl_wrapper: (28) Failed to connect to nkk-c01-nsx01a.cybernils.net port 443: Connection timed out\n' .].

A quick status check of all the Edge Nodes didn’t help, so I went ahead and rebooted them all and tried the upgrade again. This time all prechecks went well and the upgrade was also successful without any further issues.

I also ran into a few issues while upgrading the Aria Suite. After upgrading VMware Aria Suite Lifecycle from version 8.12 to 8.14.1, the Build and Version numbers were not updated even though the upgrade was successful. This was resolved by following VMware KB95231.

When trying to upgrade Aria Operations for Logs to version 8.14.1, I got the following error message:

Error Code: LCMVRLICONFIG40004 Invalid hostname provided for VMware Aria Operations for Logs. Invalid hostname provided for VMware Aria Operations for Logs import.

This was fixed by removing the SHA1 based algorithms and SSH-RSA based keys usage from the SSH service on VMware Aria Operations for Logs following VMware KB95974.

After upgrading Aria Operations to version 8.14.1, I kept getting the following error message over and over again:

Client API limit has exceeded the allowed limit.

Following VMware KB82721 and setting CLIENT_API_RATE_LIMIT to 30 solved this.

Quite a troublesome upgrade, but at least most of the problems were fixed quickly by either turning something off and on again, or following a KB.

Upgrade VMware Cloud Director using Cloud Provider Lifecycle Manager

March 26, 2024March 26, 2024Nils KristiansenLeave a comment

I wanted to test the NSX Advanced Load Balancer Self-service WAF which came with Cloud Director 10.5.1. My lab was running 10.5.0 so I needed to upgrade. First step was to download VMware_Cloud_Director_10.5.1.10593-22821417_update.tar.gz from VMware and copy it to /cplcmrepo/vcd/10.5.1/update on the Cloud Provider Lifecycle Manager appliance. Then I chose Manage and Upgrade on my deployment, and selected 10.5.1 as the version to upgrade to.

The Upgrade Task kicked off and I could follow the process in the user interface until it was done.

After the upgrade was successfully completed, I could log in to Cloud Director and manage WAF on my virtual services.

Upgrading my Cloud Director cluster using Cloud Provider Lifecycle Manager worked very well and I am sure it is a lot easier and faster than doing it manually. The new WAF self-service feature also looks good and I am sure my customers will be happy for it.

Upgrading to VMware Cloud Foundation 4.3 in my Lab

September 1, 2021September 2, 2021Nils KristiansenLeave a comment

VMware just released VMware Cloud Foundation (VCF) 4.3 and I have several customers planning to upgrade in the near future, so I decided to upgrade my lab to get some experience. I also have two customers planning to deploy VCF 4.3 on new hardware so I will also soon deploy it from scratch to see what’s new with the bring-up procedure. VCF 4.3 comes with a lot of fixes and new features that you can read about in the Release Notes.

My current VCF lab is running version 4.2 and consist of one Management Workload Domain (WLD) with one Stretched Cluster. That is two Availability Zones with four ESXi hosts in each and a vSAN Witness running in a third independent site. In addition, I have one VI Workload Domain (WLD) containing three ESXi hosts in a non-stretched Cluster. Currently I don’t run vRealize Suite, Tanzu or anything else than what is included in the VCF base-platform. Everything is deployed using VLC.

I started by reading the Release Notes and the Upgrading to VMware Cloud Foundation 4.3 docs, as well as a few blog posts about what is new in this release.

The following steps were then performed to upgrade VCF to version 4.3. Note that all images are clickable to make them bigger.

First I did a quick health check of my environment by logging into the vSphere Client and SDDC Manager and looked for any alarms or warnings. It was surprisingly healthy.

Then I checked that I was actually running on version 4.2, and verified that there was an update available for the Management WLD. I also selected to download both required upgrade bundles.

Ran an Update Precheck to ensure that my environment was ready for the update. It passed successfully, but I had already implemented a fix to skip vSAN HCL Checks for ESXi Upgrades since I am running on nested ESXi hosts, or else it would have failed.

Installed the VCF 4.3 update.

Went back to the Patches/Updates section for the Management WLD and found that the NSX-T 3.1.3 update was available, so I chose to download and install that.

I chose to upgrade both my Edge Clusters and my Host Clusters in one go, but there is an option to upgrade them separately starting with the Edge Clusters. You can also choose between parallell or sequential upgrades, and I went for the default which is parallell even though it wouldn’t matter in my case since I only have one cluster of each type.

When the update starts, you can see status on each component it is updating.

You can also select VIEW UPDATE ACTIVITY to get more details on what it is doing.

Next available update was for vCenter Server so I downloaded and installed that.

When vCenter was done upgrading, ESXi 7.0 Update 2a was downloaded and installed.

I selected to enable Quick Boot to speed up the upgrade of ESXi. Note that your hardware must support this feature if you are running on bare-metal instead of nested ESXi like I do.

The ESXi update got cancelled for some reason, so I retried to install it, but it got cancelled again.

I ran a new Precheck and found that VUM had problems uploading the patch files to the four ESXi hosts in AZ2.

Looking at the logs on one of the hosts showed me that it didn’t have enough memory. These four hosts only had 16 GB RAM each, so I increased this to 64 GB to make them equal to the hosts in AZ1.

I ran the Precheck again and this time it succeeded.

I tried to install the ESXi update again, but it got cancelled this time too. Rerunning the Precheck now showed that NTP was out of sync between my PSC and my SDDC Manager. However, when manually checking I found that this was not the case. The error didn’t specify which PSC so I started suspecting it could be due to my VI WLD vCenter appliance was down. After starting that up again, this NTP out of sync error disappeared and the Precheck went through all green. It would be nice if the Precheck was able to tell me which PSC it was complaining about, and also tell me that NTP wasn’t the problem, but that it didn’t have connectivity to it at all.

I tried to install the ESXi upgrade again, but it still got cancelled without giving me any reason. Digging through the /var/log/vmware/vcf/lcm/lcm.log file on SDDC Manager gave me this hint:

2021-09-01T09:36:55.684+0000 WARN  [vcf_lcm,801535c71a337889,d768] [c.v.evo.sddc.lcm.orch.Orchestrator,pool-7-thread-6] Cannot start upgrades since there are pending or, failed workflows

I looked into my Tasks list in SDDC Manager and found several failed tasks, but one stood out as not being resolved. SDDC Manager had tried to rotate the passwords, but were unable to do so on my VI WLD vCenter and NSX Manager since they were turned off temporarily. I went into Password Management and found an error there saying the same thing, and hitting retry solved this issue without problems since all appliances were back up running now.

I went back to Updates/Patches for my Management WLD and retried to install the ESXi update, and this time it started running. So even though the Precheck is all green you can still have issues causing the upgrade to be cancelled without any useful message in the user interface. The Upgrade Prerequisites tells us to “Ensure that there are no failed workflows in your system”, but in my lab there is usually a few failed tasks which are stuck without blocking an upgrade. It is also not a good idea to shut down appliances in other WLDs to save resources during an upgrade.

None of these problems would happen in a production environment since they were all caused by lack of resources in my nested lab.

Checked the VCF version again, and this time it said 4.3!

The last two things to update was the vSAN Disk format version and the ESXi version on my vSAN Witness Appliance, which SDCC Manager doesn’t care about upgrading, which is a bit disappointing. I used vSphere Lifecycle Manager to patch the vSAN Witness Appliance to the same build as my ESXi hosts. vSAN Disk format is also upgraded in the vSphere Client.

I must admit that upgrading VCF in my lab usually gives me some trouble along they way, but most of the time it is caused by some component lacking resources. It tends to be fixed by increasing CPU, memory or disk resources for either vCenter or NSX-T Manager appliances. I have also had issues were ESXi hosts were unable to enter/exit maintenance mode, caused by admission control or a blocking VM.

If time permits, I will soon post about how I deploy VCF 4.3 from scratch with focus on what is new regarding the bring-up, as well as presenting some of the new features in VCF 4.3.

NSX-T Federation in my VMware Cloud Foundation (VCF) Lab

June 28, 2021Nils KristiansenLeave a comment

VCF 4.2 introduced support for NSX-T Federation which provides the ability to manage, control and synchronize multiple NSX-T deployments across different VCF instances which could be in a single region or deployed across regions. You can stretch Tier-0 and Tier-1 Gateways, Segments, and Groups used for Firewalling. Requirements between sites are maximum round-trip time of 150 ms, and 1500 bytes MTU is supported, however not recommended for best performance. NSX Managers and Edge Nodes need connectivity between them, but ESXi hosts don’t require connectivity across sites. Configuration is done on a new NSX Manager role called Global Manager and pushed out to the local NSX Managers in each site, but you can still also connect directly to the Local Managers in case you have a requirement not supported by the Global Manager.

This is not a detailed review of NSX-T Federation, but I will focus on showing you how I got NSX-T Federation working between two VFC instances in my lab. Sorry for the lack of a proper naming convention, but hopefully you are able to follow along.

Configuration Overview

Hostname	Role	VCF Instance	Location	Region
vcenter-mgmt.vcf.sddc.lab	vCenter Server	1	BGO	A
sddc-manager.vcf.sddc.lab	SDDC Manager	1	BGO	A
nsx-mgmt-1.vcf.sddc.lab	NSX-T Local Manager	1	BGO	A
nsx-global-mgmt.vcf.sddc.lab	NSX-T Global Manager	1	BGO	A
vcenter-mgmt.vcf.nils.lab	vCenter Server	2	OSL	B
sddc-manager.vcf.nils.lab	SDDC Manager	2	OSL	B
nsx-mgmt-1.vcf.nils.lab	NSX-T Local Manager	2	OSL	B
nsx-global-mgmt.vcf.nils.lab	NSX-T Global Manager	2	OSL	B

Steps Performed

Note that all images are clickable to make them bigger.

1. Deployed an NSX-T Global Manager appliance in VCF instance 1 (BGO). This is simply done by deploying the nsx-unified-appliance ova and selecting “NSX Global Manager” as Rolename. In a production environment I would also replace the certificate and deploy two additional appliances to create an NSX-T Global Manager Cluster. In my lab I was happy with deploying a single appliance.

2. Added vCenter Server in VCF instance 1 (BGO) as a Compute Manager.

3. Created an IP Pool for Remote Tunnel Endpoints in the Local NSX-T Manager in VCF instance 1 (BGO).

4. Set the NSX-T Global Manager to Active.

5. Obtained the Certificate Thumbprint for the existing NSX-T Manager in VCF Instance 1 (BGO). This can be done by SSH to vCenter and run the following command:

echo -n | openssl s_client -connect nsx-mgmt-1.vcf.sddc.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

6. Enabled NSX-T Federation by adding the existing NSX-T Manager in VCF Instance 1 (BGO) as a location to the NSX-T Global Manager. Then it became a Local NSX-T Manager.

First attempt gave me this error message:

NSX-T Federation requires NSX Data Center Enterprise Plus license, so after upgrading my license it worked fine.

7. Configured networking for the NSX-T Local Manager node in VCF Instance 1 (BGO).

8. Imported the NSX-T Local Manager configurations for VCF Instance 1 (BGO) to the NSX-T Global Manager.

9. Created a Tier-1 Gateway to be stretched between both VCF instances.

10. Connected the existing Cross-Region Segment to the stretched Tier-1 Gateway.

11. Deployed an NSX-T Global Manager appliance in VCF instance 2 (OSL). This is simply done by deploying the nsx-unified-appliance ova and selecting “NSX Global Manager” as Rolename. In a production environment I would also replace the certificate and deploy two additional appliances to create an NSX-T Global Manager Cluster. In my lab I was happy with deploying a single appliance.

12. Connected the new NSX-T Global Manager Node to the vCenter Server in VCF instance 2 (OSL).

13. Created an IP Pool for Remote Tunnel Endpoints in NSX-T Data Center in VCF Instance 2 (OSL).

14. Obtained the Certificate Thumbprint for the existing NSX-T Manager in VCF Instance 2. This can be done by SSH to vCenter and run the following command:

echo -n | openssl s_client -connect nsx-mgmt-1.vcf.nils.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

15. Deleted the existing Cross-Region Segment from the NSX-T Manager in VCF Instance 2 (OSL) since we will stretch the one deployed in VCF Instance 1 (BGO). The name in the image below contains “VXLAN”, but this name has stuck in the lab since VCF 3.x where NSX-V was used. It is in fact a regular NSX-T Overlay Segment.

16. Enabled NSX-T Federation by adding the existing NSX-T Manager in VCF Instance 2 (OSL) as a location to the NSX-T Global Manager. Then it became a Local NSX-T Manager. Note that this is done in the NSX-T Global Manager in VCF Instance 1 (BGO), which is the Active one.

17. Configured networking for the NSX-T Local Manager node in VCF Instance 2 (OSL).

Remote Tunnel Endpoints in OSL looking good.

Remote Tunnel Endpoints in BGO also looking good.

18. Imported the NSX-T Local Manager configuration in VCF Instance 2 (OSL) to the NSX-T Global Manager.

19. Deleted the Existing Tier-0 Gateway for the Management Domain in VCF Instance 2 (OSL). First I had to disconnect the Tier-1 Gateway from the Tier-0 Gateway.

20. Reconfigured the Tier-0 Gateway in VCF Instance 1 (BGO) to stretch the network between VCF Instance 1 (BGO) and VCF Instance 2 (OSL). Added OSL as a Location to existing bgo-mgmt-domain-tier0-gateway.

21. Set interfaces for VCF Instance 2 (OSL) on the Tier-0 Gateway.

22. Configured BGP neighbors for VCF Instance 2 (OSL).

23. Configured an Any IP Prefix in the Tier-0 Gateway.

24. Created a Route Map for No Export Traffic in the Tier-0 Gateway.

25. Configured Route Filters and Route Redistribution for BGP. Repeated for all four BGP neighbourships.

26. Configured route redistribution for VCF Instance 2 (OSL) on the Tier-0 Gateway.

27. Connected the Tier-1 Gateway in VCF Instance 2 (OSL) to the stretched Tier-0 Gateway.

28. Deleted VCF Instance 1 (BGO) as a Location for this Tier-1 Gateway since this is a local only Tier-1 Gateway.

29. Added VCF Instance 2 (OSL) as a Location in the stretched Tier-1 Gateway (mgmt-domain-stretched-t1-gw01).

30. Set the NSX-T Global Manager in VCF Instance 2 (OSL) as Standby for the NSX-T Global Manager in VCF Instance 1 (BGO). This provides high availability of the active NSX-T Global Manager.

First step was to retreive the SHA-256 thumbprint of the NSX-T Global Manager certificate in VCF Instance 2 (OSL) using this command from the vCenter Server:

echo -n | openssl s_client -connect nsx-global-mgmt.vcf.nils.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

Then I added the NSX-T Global Manager in VCF Instance 2 (OSL) as standby.

That’s it! I now have NSX-T Federation between my two VCF Instances which I find very useful. I haven’t done everything required in the VVD to call my instances Region A and Region B, but I can still simulate a lot of cool use cases. Hopefully you found this useful and please let me know if you have any comments or questions.

References

Introducing NSX-T Federation support in VMware Cloud Foundation

Deploy NSX-T Federation for the Management Domain in the Dual-Region SDDC

vCenter Server blocked by NSX firewall

March 29, 2021May 23, 2021Nils Kristiansen3 Comments

Recently I had a customer calling me with panic in his voice. He had managed to create a rule in NSX where sources and destinations were both any, and action was set to drop. This rule was added high up in the rule set so almost all their workloads were blocked from the network, including their vCenter Server. This environment was still running NSX for vSphere (NSX-V) where firewall rules are managed using the NSX plugin in vCenter Server, so he couldn’t fix the rule.

Since I have been working with NSX for many years, I am aware of this risk and knew exactly how to solve it. VMware has a KB (2079620) addressing this issue so we followed that and got the problem fixed in a few minutes. We used a REST API client and ran a call against their NSX Manager to roll back the distributed firewall to its default firewall rule set. This means one default Layer3 section with three default allow rules and one default Layer2 section with one default allow rule. This restored access to the network for all workloads including the vCenter Server appliances. Then we simply logged into vCenter Server and loaded an autosaved firewall configuration from a time before they made the error. We also made sure to add their vCenter Server appliances to the Exclusion List in NSX to avoid getting into this situation again in the future. The NSX Manager appliance is added to the Exclusion List automatically, but you can’t log in directly to NSX Manager GUI in NSX-V to edit the firewall configuration. Note that it may be a good idea to keep vCenter Server off the Exclusion List to be able to secure it with the firewall, but then you need to make sure you don’t make the same mistake as this customer did.

It is possible to retrieve the existing firewall configuration using the following API call:

GET /api/4.0/firewall/globalroot-0/config

This can be useful if you don’t trust that you have a valid autosaved firewall configuration to restore after resetting it. You can also use this to fix the exact rule locking you out instead of resetting the entire configuration, but I will not go into details on how to do that here.

This problem could also happen with NSX-T, but vCenter Server is not where you manage firewall rules in NSX-T, that is done directly in NSX Manager. According to VMware, NSX-T automatically adds NSX Manager and NSX Edge Node virtual machines to the firewall exclusion list. I have been checking all my NSX Managers, currently three separate instances, and none of them display the NSX Managers in the System Excluded VMs list, only the Edge Nodes like you can see in the screen shot below.

Exclusion List
User Excluded Groups
bgo-lab-edge-01
bgo-lab-edge-02
bgo-lab-tkgi-edge-01
osl-lab-edge-01
osl-lab-edge-02
osl-lab-edge-03
OSI-lab-edge-05
osl-lab-edge-07
System Excluded V Ms
bgo-lab-esx-OS.nolab.local
bgo-lab-esx-04.nolab.local
bgo-ldb-esx-01 .nolab.local
osl-mgmt-esx-02.nolab.local
osl-mgmt-esx-03.nolab.local
osl-mgmt-esx-01.nolab.local
osl-mgmt-esx-02.nolab.local
osl-mgmt-esx-02.nolab.local
Tags
Operating System
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Filter by Name. Path and more
Running
Running
Running
Running
Running
Running
Running
Running

I have been trying to retrieve the exclusion list from the REST API, to see if the Managers are listed there, but so far, I have not been successful. My API calls keeps getting an empty list every time, so I am still investigating how to do this.

I also tried the following CLI command on the NSX Managers, but it lists the same content as the GUI:

get firewall exclude-list

I have been able to confirm that none of the NSX Manager VMs have any firewall rules applied by using the following commands on the ESXi hosts running the VMs, so they seem to be excluded, but I think it would be nice to actually see them on the list.

This is how we can verify if a VM is excluded from the distributed firewall. As you can see my NSX Manager appliance VM has no rules applied.

[root@bgo-mgmt-esx-01:~] summarize-dvfilter | grep -A 3 vmm
world 2130640 vmm0:bgo-mgmt-nsxmgr-01 vcUuid:'50 2b fe 43 98 6f d5 be-fe fd e3 eb 36 3e 17 1d'
 port 33554441 bgo-mgmt-nsxmgr-01.eth0
  vNic slot 2
   name: nic-2130640-eth0-vmware-sfw.2
--
world 4700303 vmm0:bgo-vrops-arc-01 vcUuid:'50 2b 40 6d 17 22 e0 48-d1 5b 31 c7 d6 30 48 04'
 port 33554442 bgo-vrops-arc-01.eth0
  vNic slot 2
   name: nic-4700303-eth0-vmware-sfw.2
--
world 8752832 vmm0:bgo-runecast-01 vcUuid:'50 2b 60 41 6b 35 e9 ca-e5 10 a6 57 95 2e f9 f7'
 port 33554443 bgo-runecast-01.eth0
  vNic slot 2
   name: nic-8752832-eth0-vmware-sfw.2
[root@bgo-mgmt-esx-01:~] vsipioctl getrules -f nic-2130640-eth0-vmware-sfw.2
No rules.
[root@bgo-mgmt-esx-01:~]

For comparison, this is how it looks like for a VM not being on the exclusion list:

[root@esxi-1:~] vsipioctl getrules -f nic-2105799-eth0-vmware-sfw.2
ruleset mainrs {
  # generation number: 0
  # realization time : 2021-03-11T12:58:27
  # FILTER (APP Category) rules
  rule 3 at 1 inout inet6 protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 3 at 2 inout inet6 protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 4 at 3 inout protocol udp from any to any port {67, 68} accept;
  rule 2 at 4 inout protocol any from any to any accept;
}

ruleset mainrs_L2 {
  # generation number: 0
  # realization time : 2021-03-11T12:58:27
  # FILTER rules
  rule 1 at 1 inout ethertype any stateless from any to any accept;
}

Since I have been talking about both NSX-V and NSX-T here I would like to remind you that NSX-V has end of general support 2022-01-16. It can be complex and time consuming to migrate from NSX-V to NSX-T so start planning today.

Thanks for reading.

VMware Cloud Foundation in a Lab

Featured3 Comments

VMware Cloud Foundation (VCF) is basically a package containing vSphere, vSAN, NSX-T, and vRealize Suite elegantly managed by something called SDDC Manager. Everything is installed, configured and upgraded automatically without much user intervention. VCF is based on VMware Validated Design, so you get a well-designed, thoroughly tested and consistent deployment. Upgrading is also a lot easier as you don’t have to check interoperability matrices and upgrade order of the individual components – Just click on the upgrade button when a bundle is available. For someone who has implemented all these products manually many times, VCF is a blessing. Tanzu and Horizon are also supported to run on VCF, and almost everything else you can run on vSphere. Many cloud providers are powered by VCF, for instance VMware Cloud on AWS.

VCF requires at least four big vSAN ReadyNodes and 10 gigabit networking with multiple VLANs and routing, so how can you deploy this is in a lab without investing in a lot of hardware? VMware Cloud Foundation Lab Constructor (VLC) to the rescue! VLC is a script that deploys a complete nested VCF environment onto a single physical host. It can even set up a DHCP server, DNS server, NTP server and a router running BGP. It is also very easy to use, with a GUI and excellent support from its creators and other users in their Slack workspace. It is created by Ben Sier and Heath Johnson.

Here is a nice overview taken from the VLC Install Guide:

VLC requires a single physical host with 12 CPU cores, 128 GB RAM, and 2 TB of SSD. I am lucky enough to have a host with dual Xeon CPUs (20 cores) and 768 GB RAM. I don’t use directly attached SSD, but run it on an NFS Datastore on a NetApp FAS2240-4 over 10 gig networking. I can deploy VCF 4.2 with 7 nested ESXi hosts in 3 hours and 10 minutes on this host.

VLC lets you choose between three modes: Automated, Manual and Expansion Pack. Automated will deploy VCF including all dependencies, while Manual will deploy VCF, but you will have to provide DNS, DHCP, NTP and BGP. Expansion Pack can be used to add additional ESXi hosts to your deployment after you have installed VCF, for instance when you want to create more clusters or expand existing ones.

This is what the VLC GUI looks like:

So far, I have only used the Automated and the Expansion Pack modes, and they both worked flawlessly without any issues. Just make sure you have added valid licenses to the json file like the documentation tells you to do. Some people also mess up the networking requirements, so please spend some extra time studying that in the Installation Guide and reach out if you have any questions regarding that.

It can also be challenging for some to get the nested VCF environment to access the Internet. This is essential to be able to download software bundles used to upgrade the deployment, or to install software like vRealize Suite. Since VLC already requires a Windows jump host which is connected to both my Management network as well as the VCF network, I chose to install “Routing and Remote Access” which is included in Windows Server. Then I set the additional IP address 10.0.0.1 on the VCF network adapter. This IP is used as the default gateway for the router deployed in VCF if you also typed it into the “Ext GW” field in VLC GUI. The last step was to configure NAT in “Routing and Remote Access” to give all VCF nodes access to the Internet. I could then connect SDDC Manager to My VMware Account and start downloading software bundles.

Here are some of the things I have used VLC to do:

Deployed VCF 3.10, 4.0, 4.1 and 4.2 with up to 11 ESXi hosts

Being able to deploy earlier versions of VCF has been very useful to test something on the same version my customers are running in production. Many customers don’t have proper lab gear to run VCF. It has also been great to be able to test upgrading VCF from one version to another.

Experimented with the Cloud Foundation Bring-Up Process using both json and Excel files

The bring-up process is automated, but it requires the configuration, like host names, cluster names, IP addresses and so on, to be provided in an Excel or json file. All required details can be found in the Planning and Preparation Workbook.

Stretched a cluster between two Availability Zones

All my VCF customers are running stretched clusters so beings able to run this in my lab is very useful. This requires at least 8 vSAN nodes, 4 per availability zone. Currently this must be configured using the VCF API, but it is not that difficult, and SDDC Manager includes a built in API explorer which can be used to do this directly in the GUI if you want to.

Created additional Clusters and Workload Domains

Creating more clusters and workload domains will be required by most large customers and also by some smaller ones. It is supported to run regular production workloads in the management workload domain, but this is only recommended for smaller deployments and special use cases.

Commissioned and decommissioned hosts in VCF

Adding and removing ESXi hosts in VCF requires us to follow specific procedures called commissioning and decommissioning. The procedures validate that the hosts meet the criteria to be used in VCF so that it is less likely that you run into problems later. I have had some issues decommissioning hosts from my Stretched Cluster, and VMware has filed a bug to engineering to get this resolved in a future release. The problem was that the task failed at “Remove local user in ESXi host”, which makes sense since the host went up in flames. Workaround was to deploy a new host with the same name and IP, then decommissioning worked. Not a great solution. It is also possible to remove the host directly from the VCF database, but that is not supported. If you run into this issue in production, please call VMware Support.

Expanded and shrunk Clusters – including Stretched Clusters

Adding ESXi hosts to existing clusters, or removing hosts, requires you to follow specific procedures. Again, stretched clusters must be expanded and shrunk using the VCF API.

Upgraded all VCF components using the built-in Lifecycle Management feature

Upgrading VCF is a fun experience for someone used to upgrade all the individual VMware products manually. The process is highly automated, and you don’t have to plan the upgrade order or check which product version is compatible with the others. This is taken care of by SDDC Manager. I have successfully upgraded all the products in VCF including the vRealize Suite.

Tested the Password and Certificate Management features

VCF can automate changing the passwords on all its components. This includes root passwords on ESXi hosts, vCenter SSO accounts and administrative users for the various appliances. You can choose to set your own password or have VCF set random passwords. All passwords are stored in SDDC Manager and you can look them up using the API or from the command line. This requires that you know SDDC Manager’s root password and a special privileged user name and the privileged password. These are obviously not rotated by SDDC Manager.

Changing SSL certificates is a daunting task, especially when you have many products and appliances like you do in VCF. SDDC Manager has the option to replace these for you automatically. You can connect SDDC Manager directly to a Microsoft Certificate Authority or you can use an OpenSSL CA which is built in. If you don’t want to use either of those, there is also support for any third-part CA, but then you have to generate CSR files, copy those over to the CA, generate the certificate files, copy those back and install them. This also requires all the files to be present in a very specific folder structure inside a tar.gz file, so it can be a bit cumbersome to get it right. Also note that all the methods seems to generate the CSR for NSX-T Manager without a SAN, so unless you force your CA to include a SAN, the certificate for NSX-T will not be trusted by your web browser. This has been an issue for several years, so I am puzzled that it still hasn’t been resolved. When generating CSRs for NSX-T in environments without VCF, I never use the CSR generator in NSX-T Manager to avoid this issue. vSphere Certificate Manager in VCSA works fine for this purpose.

Tested the NSX-T Edge Cluster deployment feature

SDDC Manager has a wizard to assist in deploying NSX-T Edge Clusters including the Edge Transport Nodes and the Tier-1 and Tier-0 Gateways required to provide north-south routing and network services. The wizard makes sure you fulfil all the prerequisites, then it will ask you to provide all the required settings like names, MTU values, passwords, IP addresses and so on. This helps you to get a consistent Edge Cluster configuration. Note that VCF is not forcing you to deploy all NSX-T Edge Clusters using this wizard, so please reach out if you want to discuss alternative designs.

Deployed vRealize Suite on Application Virtual Networks (AVN)

All the vRealize Suite products are downloaded in SDDC Manager like any VCF software bundle. You then have to deploy vRealize Suite Lifecycle Manager, which will be integrated with SDDC Manager. VMware Workspace ONE Access must then be installed before you can deploy any of the vRealize Suite products. It is used to provide identity and access management services. It is downloaded as an install bundle in SDDC Manager, but it is actually deployed from vRealize Suite Lifecycle Manager, same as the rest of the products like vRealize Log Insight, vRealize Operations and vRealize Automation. Application Virtual Networks (AVN) is just NSX-T Overlay networks designed and automatically deployed for running the vRealize Suite. This gives you all the NSX-T benefits like load balancing, mobility, improved security and disaster recovery. AVN is optional as you can choose to deploy the vRealize Suite on VLAN backed networks as well.

Deployed Workload Management and Tanzu Kubernetes Cluster

Deploying Tanzu in VCF is not an automated process, but there is a wizard helping you to fulfil the following prerequisites:

Proper vSphere for Kubernetes licensing to support Workload Management
An NSX-T based workload domain deployed
At least one NSX-T Edge cluster
IP addresses for pod networking, Services, Ingress and Egress traffic
At least one Content Library

You have to select an NSX-T based, non-vLCM enabled workload domain, and the wizard will then search for any compatible clusters in this domain. It will then validate the cluster, and if it is ok you are directed to complete the deployment in the vSphere Client manually. The VCF docs have specific instructions on how to do this.

VLC has been very helpful when troubleshooting certain issues for my VCF customers, and when preparing for the VMware Cloud Foundation Specialist exam.

You can download the latest version of VLC, which is 4.2, from here.

Please make sure to read the Install Guide included in the zip file.

It is also possible to download earlier versions of VLC, which can be really useful for testing upgrades, or if you want to simulate a customer’s environment.

VLC Version	Download Link
4.10	https://tiny.cc/getVLC410bits
4.0.1	https://tiny.cc/getVLC401bits
4.0	http://tiny.cc/getVLC40bits
3.91-3.10	http://tiny.cc/getVLC310bits
3.8.1-3.9	http://tiny.cc/getVLC38bits

If you give VLC a go and successfully deploy a VCF instance, please send a screen shot of your installation to SDDC Commander in the VLC Support Slack workspace, and he will send you some awesome stickers!

I highly recommend the following articles for more information about VLC:

Deep dive into VMware Cloud Foundation – Part 1 Building a Nested Lab

Deep dive into VMware Cloud Foundation – Part 2 Nested Lab deployment

If you don’t have licenses for VCF, I recommend signing up for a VMUG Advantage membership which gives you a 365 days evaluation license, and a lot more.

Cheers.