nsx – CyberNils.net

VMware Cloud Foundation 5.0.0.1 to 5.1 Upgrade Notes

March 26, 2024Nils KristiansenLeave a comment

I recently upgraded a customer from VMware Cloud Foundation (VCF) 5.0.0.1 to 5.1. The upgrade went well in the end, but I had some issues along the way that I would like to share in this quick post.

The first issue I ran into was that I was unable to select 5.1 as target version and an error message saying “not interopable: ESX_HOST 8.0.2-22380479 -> SDDC_MANAGER 5.0.0.1-22485660“. I found VMware KB95286 which resolved this problem.

After the SDDC Manager was upgraded to 5.1, I got the following error message when going to the Updates tab for my Management WLD :

Retrieving update patches bundles failed. Unable to retrieve aggregated LCM bundles: Encountered error requesting http://127.0.0.1/v1/upgrades api - Encountered error requesting http://127.0.0.1/v1/upgrades api: 500 - "{\"errorCode\":\"VCF_ERROR_INTERNAL_SERVER_ERROR\",\"arguments\":[],\"message\":\"A problem has occurred on the server. Please retry or contact the service provider and provide the reference token.\",\"causes\":[{\"type\":\"com.vmware.evo.sddc.lcm.model.error.LcmException\"},{\"type\":\"java.lang.IllegalArgumentException\",\"message\":\"No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE\"}],\"referenceToken\":\"H0IKSH\"}"
Scheduling immediate update of bundle failed. Something went wrong. Please retry or contact the service provider and provide the reference token.

Going to Bundle Management in SDDC Manager gave me the following error message:

Retrieving available bundles failed. Unable to retrieve aggregated domains upgrade status: Encountered fetching http://127.0.0.1/lcm/inventory/upgrades api - No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE Retrieving all applicable bundles failed. Encountered fetching http://127.0.0.1/lcm/inventory/upgrades api - No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE

Fortunately my colleague Erik G. Raassum had blogged about this issue the day before: https://blog.graa.dev/vCF510-Upgrade

The solution was to follow VMware KB94760 and Delete all obsolete bundles.

Next up was that the NSX Precheck failed with the following error message:

NSX Manager upgrade dry run failed. Do not proceed with the upgrade. Please collect the support bundle and contact VMWare GS. Failed migrations: Starting parallel Corfu Exception during Manager dry-run : Traceback (most recent call last): File "/repository/4.1.2.1.0.22667789/Manager/dry-run/dry_run.py", line 263, in main start_parallel_corfu(dry_run_path) File "/repository/4.1.2.1.0.22667789/Manager/dry-run/dry_run.py", line 150, in start_parallel_corfu subprocess.check_output([str(fullcmd)], File "/usr/lib/python3.8/subprocess.py", line 415, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/usr/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['python3 /repository/4.1.2.1.0.22667789/Manager/dry-run/setup_parallel_corfu.py']' returned non-zero exit status 255.

I digged into the logs without finding anything helpful. I started thinking what the technical geniuses at The IT Crowd would do, so I rebooted all NSX Manager nodes and tried the upgrade again. This time the precheck succeeded for NSX Manager, but it failed for the NSX Edge Nodes with the following error message:

nkk-c01-ec01 - Edge group upgrade status is FAILED for group 3373386e-5c41-4851-806d-76f0841a5a7d nkk-c01-en01 : [Edge 4.1.2.1.0.22667789/Edge/nub/VMware-NSX-edge-4.1.2.1.0.22667799.nub download OS task failed on edge TransportNode aa134203-1446-4c65-b17f-41c60e325d55: clientType EDGE , target edge fabric node id aa134203-1446-4c65-b17f-41c60e325d55, return status download_os execution failed with msg: Exception during OS download: Command ['/usr/bin/python3', '/opt/vmware/nsx-common/python/nsx_utils/curl_wrapper', '--show-error', '--retry', '6', '--output', '/image/VMware-NSX-edge-4.1.2.1.0.22667799/files/target.vmdk', '--thumbprint', '7aa5bae4a6eddf034c42d0fb77613e9212fec19a7855a8db0af37ed71c3fe7f6', 'https://nkk-c01-nsx01a.cybernils.net/repository/4.1.2.1.0.22667789/Edge/ovf/nsx-edge.vmdk'] returned non-zero code 28: b'curl_wrapper: (28) Failed to connect to nkk-c01-nsx01a.cybernils.net port 443: Connection timed out\n' .].

A quick status check of all the Edge Nodes didn’t help, so I went ahead and rebooted them all and tried the upgrade again. This time all prechecks went well and the upgrade was also successful without any further issues.

I also ran into a few issues while upgrading the Aria Suite. After upgrading VMware Aria Suite Lifecycle from version 8.12 to 8.14.1, the Build and Version numbers were not updated even though the upgrade was successful. This was resolved by following VMware KB95231.

When trying to upgrade Aria Operations for Logs to version 8.14.1, I got the following error message:

Error Code: LCMVRLICONFIG40004 Invalid hostname provided for VMware Aria Operations for Logs. Invalid hostname provided for VMware Aria Operations for Logs import.

This was fixed by removing the SHA1 based algorithms and SSH-RSA based keys usage from the SSH service on VMware Aria Operations for Logs following VMware KB95974.

After upgrading Aria Operations to version 8.14.1, I kept getting the following error message over and over again:

Client API limit has exceeded the allowed limit.

Following VMware KB82721 and setting CLIENT_API_RATE_LIMIT to 30 solved this.

Quite a troublesome upgrade, but at least most of the problems were fixed quickly by either turning something off and on again, or following a KB.

NSX-v to NSX-T – End-to-End Migration #1

February 13, 2022February 13, 2022Nils KristiansenLeave a comment

The End of General Support for VMware NSX Data Center for vSphere (NSX-v) was January 16, 2022. More details can be found here: https://kb.vmware.com/s/article/85706

If you are still running NSX-v, please start planning your migration to NSX-T as soon as possible as this may be a complex and time consuming job.

I have done several NSX-v to NSX-T migrations lately and thought I should share some of my experiences, starting with my latest End-to-End Migration using Migration Coordinator. This was a VMware Validated Design implementation containing a Management Workload Domain and one VI Workload Domain. I will not go into all the details involved, but rather focus on the challenges we faced and how we resolved them.

Challenge 1

The first error we got in Migration Coordinator was this:

Config translation failed [Reason: TOPOLOGY failed with ”NoneType’ object has no attribute ‘get”]

After some investigation we tried to reboot the NSX-V Manager appliance, but this did not seem to resolve it. We then noticed that EAM status was in a Starting state on NSX Manager, but never changed to Up. We tried to figure out why, and after a while we found that the “Universal Synchronization Service” was stopped so we started it manually, and this made the EAM status change to Up. Not sure how this is related really, but we never saw the error again.

Challenge 2

The next error we got in Migration Coordinator:

Config migration failed [Reason: HTTP Error: 400: Active list is empty in the UplinkHostSwitchProfile. for url: http://localhost:7440/nsxapi/api/v1/host-switch-profiles%5D

We went through the logs trying to figure out what was causing this, but never found any clues. We ended up going through every Portgroup on the DVS and found one unused Portgroup with no Active Uplinks. We deleted this Portgroup since it was not in use and this resolved the problem. If the Portgroup had been in use, we could have added an Uplink to it.

Challenge 3

Migration Coordinator failed to remove NSX-v VIBs from the ESXi host. At first we didn’t figure out why, but we tried to manually remove the VIBs using these commands:

esxcli software vib get -n esx-nsxv

Showed that the VIB was installed.

esxcli software vib remove -n esx-nsxv

Failed with “No VIB matching VIB search specification ‘esx-nsxv'”

After rebooting the host, the above command was successfully removing the NSX-v VIBs.

Challenge 4

NSX-T VIBs fail to install/upgrade, due to insufficient space in bootbank on ESXi host.

This was a common problem a few years back, but I hadn’t seen this in a while.

The following KB has more details and helped us figure out what to do: https://kb.vmware.com/s/article/74864

After removing many unused VIBs we were able to make enough room to install NSX-T. When we did this in advance on other hosts, we also got rid of Challenge 3.

Challenge 5

During the “Migrate NSX Data Center for vSphere Hosts” step we noticed the following in the doc:

“You must disable IPFIX and reboot the ESXi hosts before migrating them.”

We had already disabled IPFIX, but we hadn’t rebooted the hosts, so we decided to do that, however this caused all VMs to lose network connectivity. NSX-v is running in CDO mode, so I am not sure why this happened, but probably got to do with the fact that the control plane is down at this point in the process. We had a maintenance window scheduled so the customer didn’t care, but next time I would make sure to do this in advance.

Challenge 6

The customer were using LAG and when checking the Host Transport Nodes in NSX-T Manager, they all had PNIC/Bond Status Degraded. Since we had migrated all PNICs and VMKs to N-VDS, the hosts still had a VDS connected which had no PNICs attached. Removing the VDS solved this problem.

Challenge 7

Since we were migrating from NSX-v to NSX-T in a Management Cluster, we would end up migrating NSX-T Manager from a VDS to an N-VDS.

The NSX-T Data Center Installation Guide has the following guidance as well as details on how to configure this:

“In a single cluster configuration, management components are hosted on an N-VDS switch as VMs. The N-VDS port to which the management component connects to by default is initialized as a blocked port due to security considerations. If there is a power failure requiring all the four hosts to reboot, the management VM port will be initialized in a blocked state. To avoid circular dependencies, it is recommended to create a port on N-VDS in the unblocked state. An unblocked port ensures that when the cluster is rebooted, the NSX-T management component can communicate with N-VDS to resume normal function.”

Hopefully this post will help you avoid some of these challenges when migrating from NSX-v to NSX-T.

NSX-T Federation in my VMware Cloud Foundation (VCF) Lab

June 28, 2021Nils KristiansenLeave a comment

VCF 4.2 introduced support for NSX-T Federation which provides the ability to manage, control and synchronize multiple NSX-T deployments across different VCF instances which could be in a single region or deployed across regions. You can stretch Tier-0 and Tier-1 Gateways, Segments, and Groups used for Firewalling. Requirements between sites are maximum round-trip time of 150 ms, and 1500 bytes MTU is supported, however not recommended for best performance. NSX Managers and Edge Nodes need connectivity between them, but ESXi hosts don’t require connectivity across sites. Configuration is done on a new NSX Manager role called Global Manager and pushed out to the local NSX Managers in each site, but you can still also connect directly to the Local Managers in case you have a requirement not supported by the Global Manager.

This is not a detailed review of NSX-T Federation, but I will focus on showing you how I got NSX-T Federation working between two VFC instances in my lab. Sorry for the lack of a proper naming convention, but hopefully you are able to follow along.

Configuration Overview

Hostname	Role	VCF Instance	Location	Region
vcenter-mgmt.vcf.sddc.lab	vCenter Server	1	BGO	A
sddc-manager.vcf.sddc.lab	SDDC Manager	1	BGO	A
nsx-mgmt-1.vcf.sddc.lab	NSX-T Local Manager	1	BGO	A
nsx-global-mgmt.vcf.sddc.lab	NSX-T Global Manager	1	BGO	A
vcenter-mgmt.vcf.nils.lab	vCenter Server	2	OSL	B
sddc-manager.vcf.nils.lab	SDDC Manager	2	OSL	B
nsx-mgmt-1.vcf.nils.lab	NSX-T Local Manager	2	OSL	B
nsx-global-mgmt.vcf.nils.lab	NSX-T Global Manager	2	OSL	B

Steps Performed

Note that all images are clickable to make them bigger.

1. Deployed an NSX-T Global Manager appliance in VCF instance 1 (BGO). This is simply done by deploying the nsx-unified-appliance ova and selecting “NSX Global Manager” as Rolename. In a production environment I would also replace the certificate and deploy two additional appliances to create an NSX-T Global Manager Cluster. In my lab I was happy with deploying a single appliance.

2. Added vCenter Server in VCF instance 1 (BGO) as a Compute Manager.

3. Created an IP Pool for Remote Tunnel Endpoints in the Local NSX-T Manager in VCF instance 1 (BGO).

4. Set the NSX-T Global Manager to Active.

5. Obtained the Certificate Thumbprint for the existing NSX-T Manager in VCF Instance 1 (BGO). This can be done by SSH to vCenter and run the following command:

echo -n | openssl s_client -connect nsx-mgmt-1.vcf.sddc.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

6. Enabled NSX-T Federation by adding the existing NSX-T Manager in VCF Instance 1 (BGO) as a location to the NSX-T Global Manager. Then it became a Local NSX-T Manager.

First attempt gave me this error message:

NSX-T Federation requires NSX Data Center Enterprise Plus license, so after upgrading my license it worked fine.

7. Configured networking for the NSX-T Local Manager node in VCF Instance 1 (BGO).

8. Imported the NSX-T Local Manager configurations for VCF Instance 1 (BGO) to the NSX-T Global Manager.

9. Created a Tier-1 Gateway to be stretched between both VCF instances.

10. Connected the existing Cross-Region Segment to the stretched Tier-1 Gateway.

11. Deployed an NSX-T Global Manager appliance in VCF instance 2 (OSL). This is simply done by deploying the nsx-unified-appliance ova and selecting “NSX Global Manager” as Rolename. In a production environment I would also replace the certificate and deploy two additional appliances to create an NSX-T Global Manager Cluster. In my lab I was happy with deploying a single appliance.

12. Connected the new NSX-T Global Manager Node to the vCenter Server in VCF instance 2 (OSL).

13. Created an IP Pool for Remote Tunnel Endpoints in NSX-T Data Center in VCF Instance 2 (OSL).

14. Obtained the Certificate Thumbprint for the existing NSX-T Manager in VCF Instance 2. This can be done by SSH to vCenter and run the following command:

echo -n | openssl s_client -connect nsx-mgmt-1.vcf.nils.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

15. Deleted the existing Cross-Region Segment from the NSX-T Manager in VCF Instance 2 (OSL) since we will stretch the one deployed in VCF Instance 1 (BGO). The name in the image below contains “VXLAN”, but this name has stuck in the lab since VCF 3.x where NSX-V was used. It is in fact a regular NSX-T Overlay Segment.

16. Enabled NSX-T Federation by adding the existing NSX-T Manager in VCF Instance 2 (OSL) as a location to the NSX-T Global Manager. Then it became a Local NSX-T Manager. Note that this is done in the NSX-T Global Manager in VCF Instance 1 (BGO), which is the Active one.

17. Configured networking for the NSX-T Local Manager node in VCF Instance 2 (OSL).

Remote Tunnel Endpoints in OSL looking good.

Remote Tunnel Endpoints in BGO also looking good.

18. Imported the NSX-T Local Manager configuration in VCF Instance 2 (OSL) to the NSX-T Global Manager.

19. Deleted the Existing Tier-0 Gateway for the Management Domain in VCF Instance 2 (OSL). First I had to disconnect the Tier-1 Gateway from the Tier-0 Gateway.

20. Reconfigured the Tier-0 Gateway in VCF Instance 1 (BGO) to stretch the network between VCF Instance 1 (BGO) and VCF Instance 2 (OSL). Added OSL as a Location to existing bgo-mgmt-domain-tier0-gateway.

21. Set interfaces for VCF Instance 2 (OSL) on the Tier-0 Gateway.

22. Configured BGP neighbors for VCF Instance 2 (OSL).

23. Configured an Any IP Prefix in the Tier-0 Gateway.

24. Created a Route Map for No Export Traffic in the Tier-0 Gateway.

25. Configured Route Filters and Route Redistribution for BGP. Repeated for all four BGP neighbourships.

26. Configured route redistribution for VCF Instance 2 (OSL) on the Tier-0 Gateway.

27. Connected the Tier-1 Gateway in VCF Instance 2 (OSL) to the stretched Tier-0 Gateway.

28. Deleted VCF Instance 1 (BGO) as a Location for this Tier-1 Gateway since this is a local only Tier-1 Gateway.

29. Added VCF Instance 2 (OSL) as a Location in the stretched Tier-1 Gateway (mgmt-domain-stretched-t1-gw01).

30. Set the NSX-T Global Manager in VCF Instance 2 (OSL) as Standby for the NSX-T Global Manager in VCF Instance 1 (BGO). This provides high availability of the active NSX-T Global Manager.

First step was to retreive the SHA-256 thumbprint of the NSX-T Global Manager certificate in VCF Instance 2 (OSL) using this command from the vCenter Server:

echo -n | openssl s_client -connect nsx-global-mgmt.vcf.nils.lab:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

Then I added the NSX-T Global Manager in VCF Instance 2 (OSL) as standby.

That’s it! I now have NSX-T Federation between my two VCF Instances which I find very useful. I haven’t done everything required in the VVD to call my instances Region A and Region B, but I can still simulate a lot of cool use cases. Hopefully you found this useful and please let me know if you have any comments or questions.

References

Introducing NSX-T Federation support in VMware Cloud Foundation

Deploy NSX-T Federation for the Management Domain in the Dual-Region SDDC

Avoid Packet Loss in NSX-T

April 19, 2021May 23, 2021Nils Kristiansen2 Comments

I have been working a lot with NSX-T the last few years and I have come across a misconfiguration that may cause massive packet loss for the workloads connected to Overlay segments. Since NSX-T 2.5, the recommended Edge Node design has been the “Single N-VDS – Multi-TEP” design which looks like this:

Image from NSX-T Reference Design Guide 3.0.

What people, and VCF, sometimes get wrong when implementing this design, is that they configure Trunk1 PG and Trunk2 PG with a Teaming and failover policy of Active/Unused instead of Active/Standby. Note that there are two TEP-IPs, each using a separate vNIC, Trunk PG and physical NIC. When one of the physical NICs or one of the Top of Rack (ToR) switches fail, the TEP-IP using that connection will go offline instead of failing over. This causes long lasting packet loss for any VM connected to a Segment that is using that TEP. I thought the Host Transport Nodes eventually would stop using the failed Edge Node TEP IP after some time, but I waited 20 minutes without any correction.

This is what the Teaming and failover Policy should look like:

Trunk1 PG

What if you have a fully collapsed cluster with only two physical NICs per host? Meaning NSX Manager, Host Transport Nodes, and NSX Edge VMs are running on a single cluster. You don’t have any regular Trunk PGs on a VDS since you run everything on a single N-VDS. Then you have to create Trunk Segments in NSX-T instead and configure them with an Active/Standby Teaming Policy like this:

I recommend using meaningful names for the Teamings so that you can easily see on the Segments what policy will be used. Note that the opposite uplink is Standby for each Active uplink.

If you run everything on a single VDS 7.0 you may have a mix of regular Trunk PGs and NSX-T Segments on the same VDS. Same rules still apply. In NSX-T 3.1 and later you can use the same VLAN ID for both your Edge Node TEPs and your Host TEPs, but then you need to use Trunk Segments in NSX-T. So there are several options and easy to get it wrong.

One of the reasons people mess this up, is because they want to achieve deterministic peering for their uplink interfaces, meaning they want to peer with ToR Left using physical NIC 1 and peer with ToR Right using physical NIC 2, and they misunderstand how to achieve that. Named Teaming Policies in the Edge Nodes Uplink Profile will handle that and I will link to a document and a few blog posts below that will show you step-by-step how to do this, so don’t worry if you are more confused than ever 🙂

The Edge Nodes Uplink Policy should look similar to this:

Note that there are no Standby Uplinks for the Named Teamings.

VCF 4.x and VVD 6.x also use this design, but it is documented a bit vague, so people still get it wrong. The wording has been improved in VVD 6.2 after me complaining about it, so kudos to VMware for actually reading the feedback given on docs.vmware.com and updating accordingly.

What about VCF where all of this is deployed automatically? Unfortunately, VCF 4.0 also got this wrong, but it was fixed in VCF 4.1, but only when installing it from scratch. If you upgrade an existing VCF 4.0 environment to VCF 4.1 or 4.2, the error remains. Ouch! So, if you have any VCF 4.x installations, please verify the teaming policy before it’s too late. The fix is to manually change the Teaming Policy on both Port Groups.

Simulating a physical NIC failure without involving your Networking team can be done like this:

[root@bgo-lab-esx-01:~] esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  
------  ------------  ------  ------------  -----------  -----  
vmnic0  0000:06:00.0  nenic   Up            Up           10000  
vmnic1  0000:07:00.0  nenic   Up            Up           10000  
vmnic2  0000:08:00.0  nenic   Up            Up           10000  
vmnic3  0000:09:00.0  nenic   Up            Up           10000  
[root@bgo-lab-esx-01:~] esxcli network nic down -n vmnic2
[root@bgo-lab-esx-01:~] esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  
------  ------------  ------  ------------  -----------  -----  
vmnic0  0000:06:00.0  nenic   Up            Up           10000  
vmnic1  0000:07:00.0  nenic   Up            Up           10000  
vmnic2  0000:08:00.0  nenic   Down          Down             0  
vmnic3  0000:09:00.0  nenic   Up            Up           10000  
[root@bgo-lab-esx-01:~] esxcli network nic up -n vmnic2
[root@bgo-lab-esx-01:~] esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  
------  ------------  ------  ------------  -----------  -----  
vmnic0  0000:06:00.0  nenic   Up            Up           10000  
vmnic1  0000:07:00.0  nenic   Up            Up           10000  
vmnic2  0000:08:00.0  nenic   Up            Up           10000  
vmnic3  0000:09:00.0  nenic   Up            Up           10000

Esxtop can be used to see if the Edge Nodes vNIC fails over or not when taking down vmnic2.

Here it shows eth1 not failing over to vmnic3 when having an Active/Unused Teaming Policy:

Here it shows eth1 failed over to vmnic3 when having an Active/Standby Policy:

Here we can see that eth1 failed back to vmnic2 when taking vmnic2 back up:

While speaking of failover and failback testing, I would like to mention another issue that keeps coming up. When taking down one Top of Rack switch, everything fails over to the other physical NIC and we usually see one lost ping. When the switch is taken back up, everything fails back to the recovered physical NIC again, but this time we get a huge amount of packet loss. Why? Because when the switch brings the link back up, ESXi starts failing back after 100 ms, but the switch isn’t ready to forward traffic. How long this takes varies depending on vendor and switch type. We can change the network teaming failback delay to avoid this problem. Normally we change it to 30 000 or 40 000 ms.

To modify the TeamPolicyUpDelay, from the vSphere Client go to each ESXi host, Configure > Advanced System Settings > Edit. Change Net.TeamPolicyUpDelay to 30 000 and test again to see if it works better in your environment.

I hope this post was more helpful than confusing and thanks for reading.

Useful links for more information

NSX-T Reference Design Guide 3.0

NSX-T 3.0 – Edge Design Step-by-Step UI WorkFlow

Network Design for the NSX-T Edge Nodes in VMware Validated Design 6.2

Single N-VDS per Edge VM
NSX-T Single NVDS Multi-TEP Edge VM Deployment & Configuration on vSphere DVS
NSX-T Single NVDS Multi-TEP Edge VM Deployment & Configuration on Host NVDS Networking
Achieving Deterministic Peering using NSX-T Named Teaming Policies

vCenter Server blocked by NSX firewall

March 29, 2021May 23, 2021Nils Kristiansen3 Comments

Recently I had a customer calling me with panic in his voice. He had managed to create a rule in NSX where sources and destinations were both any, and action was set to drop. This rule was added high up in the rule set so almost all their workloads were blocked from the network, including their vCenter Server. This environment was still running NSX for vSphere (NSX-V) where firewall rules are managed using the NSX plugin in vCenter Server, so he couldn’t fix the rule.

Since I have been working with NSX for many years, I am aware of this risk and knew exactly how to solve it. VMware has a KB (2079620) addressing this issue so we followed that and got the problem fixed in a few minutes. We used a REST API client and ran a call against their NSX Manager to roll back the distributed firewall to its default firewall rule set. This means one default Layer3 section with three default allow rules and one default Layer2 section with one default allow rule. This restored access to the network for all workloads including the vCenter Server appliances. Then we simply logged into vCenter Server and loaded an autosaved firewall configuration from a time before they made the error. We also made sure to add their vCenter Server appliances to the Exclusion List in NSX to avoid getting into this situation again in the future. The NSX Manager appliance is added to the Exclusion List automatically, but you can’t log in directly to NSX Manager GUI in NSX-V to edit the firewall configuration. Note that it may be a good idea to keep vCenter Server off the Exclusion List to be able to secure it with the firewall, but then you need to make sure you don’t make the same mistake as this customer did.

It is possible to retrieve the existing firewall configuration using the following API call:

GET /api/4.0/firewall/globalroot-0/config

This can be useful if you don’t trust that you have a valid autosaved firewall configuration to restore after resetting it. You can also use this to fix the exact rule locking you out instead of resetting the entire configuration, but I will not go into details on how to do that here.

This problem could also happen with NSX-T, but vCenter Server is not where you manage firewall rules in NSX-T, that is done directly in NSX Manager. According to VMware, NSX-T automatically adds NSX Manager and NSX Edge Node virtual machines to the firewall exclusion list. I have been checking all my NSX Managers, currently three separate instances, and none of them display the NSX Managers in the System Excluded VMs list, only the Edge Nodes like you can see in the screen shot below.

Exclusion List
User Excluded Groups
bgo-lab-edge-01
bgo-lab-edge-02
bgo-lab-tkgi-edge-01
osl-lab-edge-01
osl-lab-edge-02
osl-lab-edge-03
OSI-lab-edge-05
osl-lab-edge-07
System Excluded V Ms
bgo-lab-esx-OS.nolab.local
bgo-lab-esx-04.nolab.local
bgo-ldb-esx-01 .nolab.local
osl-mgmt-esx-02.nolab.local
osl-mgmt-esx-03.nolab.local
osl-mgmt-esx-01.nolab.local
osl-mgmt-esx-02.nolab.local
osl-mgmt-esx-02.nolab.local
Tags
Operating System
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Ubuntu Linux (64-bit)
Filter by Name. Path and more
Running
Running
Running
Running
Running
Running
Running
Running

I have been trying to retrieve the exclusion list from the REST API, to see if the Managers are listed there, but so far, I have not been successful. My API calls keeps getting an empty list every time, so I am still investigating how to do this.

I also tried the following CLI command on the NSX Managers, but it lists the same content as the GUI:

get firewall exclude-list

I have been able to confirm that none of the NSX Manager VMs have any firewall rules applied by using the following commands on the ESXi hosts running the VMs, so they seem to be excluded, but I think it would be nice to actually see them on the list.

This is how we can verify if a VM is excluded from the distributed firewall. As you can see my NSX Manager appliance VM has no rules applied.

[root@bgo-mgmt-esx-01:~] summarize-dvfilter | grep -A 3 vmm
world 2130640 vmm0:bgo-mgmt-nsxmgr-01 vcUuid:'50 2b fe 43 98 6f d5 be-fe fd e3 eb 36 3e 17 1d'
 port 33554441 bgo-mgmt-nsxmgr-01.eth0
  vNic slot 2
   name: nic-2130640-eth0-vmware-sfw.2
--
world 4700303 vmm0:bgo-vrops-arc-01 vcUuid:'50 2b 40 6d 17 22 e0 48-d1 5b 31 c7 d6 30 48 04'
 port 33554442 bgo-vrops-arc-01.eth0
  vNic slot 2
   name: nic-4700303-eth0-vmware-sfw.2
--
world 8752832 vmm0:bgo-runecast-01 vcUuid:'50 2b 60 41 6b 35 e9 ca-e5 10 a6 57 95 2e f9 f7'
 port 33554443 bgo-runecast-01.eth0
  vNic slot 2
   name: nic-8752832-eth0-vmware-sfw.2
[root@bgo-mgmt-esx-01:~] vsipioctl getrules -f nic-2130640-eth0-vmware-sfw.2
No rules.
[root@bgo-mgmt-esx-01:~]

For comparison, this is how it looks like for a VM not being on the exclusion list:

[root@esxi-1:~] vsipioctl getrules -f nic-2105799-eth0-vmware-sfw.2
ruleset mainrs {
  # generation number: 0
  # realization time : 2021-03-11T12:58:27
  # FILTER (APP Category) rules
  rule 3 at 1 inout inet6 protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 3 at 2 inout inet6 protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 4 at 3 inout protocol udp from any to any port {67, 68} accept;
  rule 2 at 4 inout protocol any from any to any accept;
}

ruleset mainrs_L2 {
  # generation number: 0
  # realization time : 2021-03-11T12:58:27
  # FILTER rules
  rule 1 at 1 inout ethertype any stateless from any to any accept;
}

Since I have been talking about both NSX-V and NSX-T here I would like to remind you that NSX-V has end of general support 2022-01-16. It can be complex and time consuming to migrate from NSX-V to NSX-T so start planning today.

Thanks for reading.