ESXi Temporary Directory Exist Error on VCF Upgrade Precheck

I helped a customer upgrading from VCF 4.3 to 4.4 and encountered the following error message during the Upgrade Precheck:

“ESXi Temporary Directory Exist”

Expanding the error message gave me the following error description with several remediation suggestions:

“Directory /locker on host does not support the upgrade for resource, Please check”

None of the suggestions worked, so I started investigating log files and found that the ESXi hosts had an inaccessible Datastore which the customer’s backup solution had left behind after a failed backup job. Unmounting the Datastore from the hosts resolved the issue.

Convert vSphere Lifecycle Manager Mode in VCF

I recently had a customer who was unable to upgrade the Management Workload Domain in their VCF 4.3 deployment. After digging around for a while, I found that the cluster was configured with vSphere Lifecycle Manager Images, which are not compatible with the following features they were using:

               – Stretched Cluster

               – Management Workload Domain

The Management Workload Domain must use vSphere Lifecycle Manager Baselines. The same applies to any stretched cluster.

I couldn’t find any procedure to change from vSphere Lifecycle Manager Images to vSphere Lifecycle Manager Baselines, and the customer had production workloads running in this cluster, so a redeploy was not a good option. VMware Support gave me the following procedure which did the trick:

The “lifecycleManaged” property needs to be changed to “false”. Manipulation of the lifecycleManaged property is provided via VMODL1 APIs:

               1. Access the internal API through mob access using the URL: https://<vc_ip>/mob/&vmodl=

               2. Navigate to the “content” property

               3. Navigate to the “rootFolder” property

               4. Navigate to the datacenter where the cluster is, using the datacenter list provided in the “childEntity” property

               5. Navigate to the “hostFolder” property. This should list all the hosts and clusters within that datacenter you chose in Step 4

               6. Navigate to the cluster whose property you wish to reset, and check whether the URL still contains “&vmodl=1” at the end. The internal APIs are listed only when “&vmodl=1” is appended to the URL. At this stage, verify that the lifecycleManaged property is currently set to true

               7. You should see the two methods below. Invoke the disableLifecycleManagement method and refresh to check that lifecycleManaged is now set to false

RETURN TYPE                     NAME
void                            disableLifecycleManagement
vim.LifecycleOperationResult    enableLifecycleManagement

               8. If the cluster is vSphere HA enabled, please restart updatemgr and vpxd using below commands on the vCenter appliance:

vmon-cli -r updatemgr

vmon-cli -r vpxd
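Step 6 is easy to get wrong because the “&vmodl=1” suffix tends to disappear while clicking through the MOB. A small helper like the one below can make sure the suffix is present before you open a URL. This is only a sketch: the function name is hypothetical, and the example URL with a moid of domain-c8 is a made-up value for illustration.

```python
VMODL_FLAG = "&vmodl=1"


def ensure_vmodl_flag(url: str) -> str:
    """Return url ending in '&vmodl=1', appending the flag only if it is missing."""
    if url.endswith(VMODL_FLAG):
        return url
    return url + VMODL_FLAG


# Hypothetical cluster URL; the moid value differs in every environment.
print(ensure_vmodl_flag("https://vc.example.com/mob/?moid=domain-c8"))
```

The helper is idempotent, so it is safe to run on a URL that already carries the flag.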

I strongly recommend doing the above fix together with VMware Support if you are working on a production environment.

DHCP is crucial for VMware Cloud Foundation (VCF)

When deploying VMware Cloud Foundation (VCF) one can choose between configuring NSX-T Host Overlay using a static IP Pool or DHCP. This is specified in the vcf-ems-deployment-parameter spreadsheet which is used by the Cloud Builder appliance to bring up the Management Workload Domain as you can see in the following images:

If No is chosen, the TEPs get their addresses from DHCP, so it is crucial that you provide a highly available DHCP server in your NSX-T Host Overlay VLAN. In the past I have had several customers who broke either their DHCP server or their DHCP Relay/DHCP Helper, leaving the TEPs unable to renew their IP configuration. When that happens, NSX-T configures APIPA (Automatic Private IP Addressing) addresses on these interfaces. The IP address range for APIPA is 169.254.0.1-169.254.255.254, with a subnet mask of 255.255.0.0.

The following image shows this issue in the NSX-T GUI:

You can also see this in the vSphere Client by checking the IP configuration on vmk10 and vmk11.
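If you collect the TEP addresses in a script, testing for the APIPA range is a quick way to spot hosts that have lost their DHCP lease. The sketch below uses Python's standard ipaddress module. The function names and the sample addresses are made up for illustration, and how you gather the addresses from your hosts is left out.

```python
import ipaddress


def is_apipa(address: str) -> bool:
    """True if address is an IPv4 link-local (APIPA) address in 169.254.0.0/16."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


def teps_without_lease(tep_addresses: dict) -> list:
    """Return the sorted interface names whose address is an APIPA address."""
    return sorted(name for name, addr in tep_addresses.items() if is_apipa(addr))


# Made-up sample: vmk10 has fallen back to APIPA, vmk11 still has a lease.
print(teps_without_lease({"vmk10": "169.254.12.7", "vmk11": "172.16.14.22"}))
```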

The impact is that the Geneve tunnels will be down and nothing on Overlay Segments will be able to communicate. This is critical, so keep your DHCP servers up or migrate to using a Static IP Pool.

To renew the DHCP lease on vmk10 and vmk11 you can run the following commands from ESXi Shell:


esxcli network ip interface set -e false -i vmk10 ; esxcli network ip interface set -e true -i vmk10


esxcli network ip interface set -e false -i vmk11 ; esxcli network ip interface set -e true -i vmk11
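Hosts with more than two TEPs have more VMkernel interfaces to bounce, so it can be handy to generate the command pairs instead of typing them. The sketch below only builds the same command strings shown above. The function name is hypothetical, and it does not run anything on a host.

```python
def dhcp_renew_commands(interfaces=("vmk10", "vmk11")):
    """Build the disable/enable esxcli command pair for each VMkernel interface."""
    commands = []
    for vmk in interfaces:
        disable = f"esxcli network ip interface set -e false -i {vmk}"
        enable = f"esxcli network ip interface set -e true -i {vmk}"
        commands.append(f"{disable} ; {enable}")
    return commands


for line in dhcp_renew_commands():
    print(line)
```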

Most customers choose to configure NSX-T Host Overlay using DHCP because that used to be the only option. Using a static IP Pool is also still not compatible with some deployment topologies of VMware Cloud Foundation, such as multiple availability zones (stretched clusters) or clusters which span Layer 3 domains.

It is possible to migrate to Static IP pool from DHCP following this KB:

Day N migration of NSX-T TEPS to Static IP pool from DHCP on VCF 4.2 and above (84194)

Migrating from a static IP pool to DHCP after deployment is not possible.

vCenter 7.0 requires LACP v2

When upgrading to vCenter 7.0, one of the prerequisites to check is the version of the existing vSphere Distributed Switches (VDS) in the environment. They need to be at version 6.0 or higher. Recently I was helping a customer upgrade to vCenter 7.0. All their VDSs were at version 6.6.0 and all other prerequisites were met as well, so we went ahead with the upgrade.

vCenter Upgrade Stage 1 completed successfully but early in vCenter Upgrade Stage 2 we got the following error message:

“Source vCenter Server has instance(s) of Distributed Virtual Switch at unsupported lacpApiVersion.”

I found this strange since the customer was running vCenter 6.7, and basic LACP (LACPv1) is only supported on vSphere versions 6.5 or below. Somehow this customer had upgraded to vCenter 6.7 and VDS 6.6.0 without upgrading to LACP v2. When you upgrade a vSphere Distributed Switch from version 5.1 to version 6.5, the LACP support is enhanced automatically. However, if basic LACP support was enabled on the distributed switch before the upgrade, the LACP support should be enhanced manually. The customer was no longer using LACP, but they had used it in the past.

The following KB provides the prerequisites and steps to convert to the Enhanced Link Aggregation Control Protocol (LACP v2) support mode on a vSphere Distributed Switch in vSphere:

Converting to Enhanced LACP Support on a vSphere Distributed Switch- “Source vCenter Server has instance(s) of Distributed Virtual Switch at unsupported lacpApiVersion” (2051311)

Unable to query vSphere health information. Check vSphere Client logs for details.

I have been helping quite a few customers with upgrading to vSphere 7.0 lately. In this post I will share an issue we encountered when upgrading from vCenter 6.7 to 7.0 and how we resolved it.

Prior to starting the upgrade I always do a thorough health check of the existing environment. The following error message quickly got my attention when trying to run Skyline Health:

“Unable to query vSphere health information. Check vSphere Client logs for details.”

Rebooting the vCenter Appliance didn’t resolve this, and we always try to turn it off and on again first, right? I started looking into the logs and searching the web for others who had seen the same error message. I found a few cases where the issue was resolved by starting the VMware Analytics Service, but in our case this service was already running fine.

After digging further through the logs I found that the following KB may be exactly what we needed:

vCenter Server certificate validation error for external solutions in environments with Embedded Platform Services Controller (2121689)

The resolution in the KB solved this problem and we could move on with the upgrade.

Running vSAN Witness in The Cloud

vSAN Stretched Cluster requires three independent sites; two data sites and one witness site. If you don’t have a third site, you can run the vSAN Witness Appliance in the cloud. This post will show you how I deployed a vSAN Witness Appliance in Proact Hybrid Cloud (PHC) which runs VMware Cloud Director.

I started with deploying the vSAN Witness Appliance in PHC by selecting Add vApp from OVF.

When prompted to Select Source I browsed to VMware-VirtualSAN-Witness-7.0U3c-19193900.ova which I had already downloaded from VMware. Please make sure you download the version matching your environment, or upgrade it after deployment. I continued through the wizard supplying all the regular details like hostname, IP configuration and so on, and my vApp including the vSAN Witness Appliance VM was deployed.

The next step was to configure VPN between my on-premises lab in Oslo and PHC in Sweden. Both environments run NSX-T which makes it easy, but most third-party hardware routers or VPN gateways that support IPSec are supported. I started with configuring it in PHC by going to Networking, Edge Gateways, and clicking on the name of my Edge Gateway to open the configuration page. I then selected IPSec VPN under Services in the menu and clicked on New to bring up the Add IPSec VPN Tunnel wizard. I provided a name, selected the default security profile, and entered a Pre-Shared Key for the Authentication Mode, although using a certificate is also an option. Endpoint Configuration was then configured like this:

Local Endpoint IP Address was set to the public IP assigned to my Edge Gateway in PHC, and Remote Endpoint IP Address was set to the public IP assigned to the firewall in my on-premises lab environment. Since we are using NAT, Remote ID was set to the internal private IP configured on the Local Endpoint. If you are not using NAT, Remote ID should be set to the public IP being used. The 10.10.10.0/24 network is where the vSAN Witness Appliance is running in PHC, and the 192.168.224.0/24 network is where I run the vSAN Nodes in my on-premises lab.
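Overlapping networks on the two sides of a tunnel generally cause routing problems, so it can be worth confirming that the local and remote networks are distinct before building the tunnel. The sketch below does that with Python's standard ipaddress module, using the two networks from this setup. The function name is hypothetical.

```python
import ipaddress


def networks_overlap(local_cidr: str, remote_cidr: str) -> bool:
    """True if the two networks share any addresses."""
    local = ipaddress.ip_network(local_cidr)
    remote = ipaddress.ip_network(remote_cidr)
    return local.overlaps(remote)


# Witness network in PHC versus the vSAN node network in the on-premises lab.
print(networks_overlap("10.10.10.0/24", "192.168.224.0/24"))
```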

After verifying that the VPN tunnel was working, I was able to add the vSAN Witness Appliance as an ESXi host in my on-premises vCenter:

I configured my vSAN nodes for Witness Traffic Separation so that they would communicate with the witness host via their Management network (vmk0). This removed the need to route the vSAN data network to the Witness site, and since I am using vmk0 I also didn’t need to add any static routes or override the default gateway. When configuring vSAN on the cluster in my on-premises lab, I could simply select the witness host running in PHC:

Checking the Fault Domains configuration on my vSAN Cluster after deployment shows that the witness in PHC is being used by the cluster.

Please don’t forget the importance of making sure that the Preferred site and the Secondary site are connected to the Witness site independently of each other. This means that the Secondary site should not connect to the Witness site via the Preferred site and vice versa. The reason for this is that if you lose one of your data sites, the surviving data site still needs to talk to the witness for the cluster to be operational.

For more information about vSAN Stretched Cluster I recommend reading the vSAN Stretched Cluster Guide.

New Undocumented Feature in NSX-T 3.2 MC

Migration Coordinator (MC) is VMware’s tool to migrate from NSX Data Center for vSphere (NSX-V) to NSX-T Data Center (NSX-T). Last week I was helping a customer migrating their Distributed Firewall Configuration from NSX-V to NSX-T 3.2 and we were presented with a new step in the process called “Prepare Infrastructure”. I have never seen this step in NSX-T 3.1 and earlier, and was surprised to see it now since I had checked the documentation for any changes to the process.

The new step looks like this:

The documentation said that no changes would be done to NSX-V during the migration, so I was trying to find anyone able to tell me what this step would do. Finally someone at VMware could tell me that this step would create temporary IP Sets in NSX-V to maintain security during the migration. When you migrate a VM from one vCenter (NSX-V) to another vCenter (NSX-T), the VM will no longer be included in the Security Groups in NSX-V since the object is no longer present there. Before NSX-T 3.2 we had to create these IP Sets manually in NSX-V, so this is a welcome feature in NSX-T 3.2 MC. MC has already been creating temporary IP sets in NSX-T for some time. More details on this can be found here.

The latest version of the NSX-T 3.2 docs has now been updated with the missing information:

“In the Prepare Infrastructure step, temporary IP sets will be added to NSX-V if the NSX-V security groups are used in a distributed firewall rule. This is required to maintain security while the VMs are migrated from NSX-V to NSX-T. After the migration, during the finalize infrastructure phase, the temporary IP sets will be deleted.
You can skip the Prepare Infrastructure step. However, doing so may compromise security until the finalize infrastructure phase is complete.”

Deploy Nested VCF to NSX-T Overlay

I have used VLC to deploy nested VCF for a long time and I am quite happy with how it works. VLC is usually deployed to a VLAN Trunk Port Group. This requires the VLANs used in the nested VCF to be configured on the physical switches in the environment. This does not scale well, and it is hard to automate. By following the steps below we are able to deploy VLC to an NSX-T Overlay Segment, which allows each VCF instance to be isolated on its own layer 2 network. NSX-T Overlay Segments can be deployed automatically and they don’t require any changes to the physical network. This also allows us to use overlapping IP addressing between them. I have not yet tested connecting my Segment to a Tier-1 Gateway so that the nested VCF can reach external networks, but I plan to do this soon and update this post.

NSX-T Configuration

The following configuration needs to be done on the hosting NSX-T environment.

IP Discovery Profile

Name: vcf-nested-ip-profile
Duplicate IP Detection: Disabled
ARP Snooping: Enabled
ARP Binding Limit: 256
ND Snooping: Disabled
ND Snooping Limit: 3
ARP ND Binding Limit Timeout: 10
Trust on First Use: Enabled
DHCP Snooping: Disabled
DHCP Snooping – IPv6: Disabled
VMware Tools: Enabled
VMware Tools – IPv6: Disabled

MAC Discovery Profile

Name: vcf-nested-mac-profile
MAC Change: Enabled
MAC Learning Aging Time: 600
MAC Learning: Enabled
MAC Limit: 4096
MAC Limit Policy: Allow
Unknown Unicast Flooding: Enabled

Segment Security Profile

Name: vcf-nested-security-profile
BPDU Filter: Disabled
BPDU Filter Allow List: Not Set
Server Block: Disabled
Server Block – IPv6: Disabled
Non-IP Traffic Block: Disabled
Rate Limits: Disabled
Receive Broadcast: 0
Receive Multicast: 0
Client Block: Disabled
Client Block – IPv6: Disabled
RA Guard: Enabled
Transmit Broadcast: 0
Transmit Multicast: 0

Segments

Name: vcf-nested-trunk-01
Transport Zone: overlay-tz
Connected Gateway: None
Subnet: None
Profiles: vcf-nested-ip-profile, vcf-nested-mac-profile, vcf-nested-security-profile
VLAN: 0-4094
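Profile settings like these are easy to get subtly wrong, so it can help to keep the intended values as data and compare them with what is actually configured. The sketch below does this for the MAC Discovery Profile. The keys are the UI labels from the list above, not NSX-T API field names, the function name is hypothetical, and reading the actual values from NSX-T is left out.

```python
# Intended MAC Discovery Profile settings, keyed by UI label (not API field names).
DESIRED_MAC_PROFILE = {
    "MAC Change": "Enabled",
    "MAC Learning": "Enabled",
    "MAC Learning Aging Time": "600",
    "MAC Limit": "4096",
    "MAC Limit Policy": "Allow",
    "Unknown Unicast Flooding": "Enabled",
}


def settings_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired value, actual value)} for settings that differ.

    A setting missing from 'actual' is reported with an actual value of None.
    """
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


# Made-up example: a profile where MAC Learning was left disabled.
actual = dict(DESIRED_MAC_PROFILE, **{"MAC Learning": "Disabled"})
print(settings_drift(DESIRED_MAC_PROFILE, actual))
```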

Jump Host Configuration

To deploy VLC we need a jump host with two network adapters, one connected to your management network so that we can access it with RDP, and one connected to the nested environment so that we can connect to the nested appliances there. More details on this can be found in the VLC installation guide.

Jump Host 01

Name: jumpy-01
OS: Windows Server 2019
NIC1:
  Portgroup: pg-management
  Driver: VMXNET3
NIC2:
  Portgroup: vcf-nested-trunk-01
  GW: None
  VLAN: 10 (Tagged in Guest OS)
  Driver: VMXNET3
Software:
  PowerShell 5.1+
  PowerCLI 12.1+
  OVFTool 4.4+
  .Net Framework
Static Routes: For example: route ADD 10.50.0.0 MASK 255.255.255.0 10.0.0.221
Windows Firewall: On
PowerShell Policy: Set-ExecutionPolicy Unrestricted

VLC Configuration

Location: C:\VLC
Bringup Configuration: C:\VLC\NOLIC-44-TMM-vcf-ems-public.json
ESXi Host Configuration: C:\VLC\conf\default_mgmt_hosthw.json
Licenses: Must be added to NOLIC-44-TMM-vcf-ems-public.json
Cloud Builder: C:\VLC\VMware-Cloud-Builder-4.4.0.0-19312029_OVF10.ova
MTU: 1700 (can probably be increased to 8800 or more)

VLCGui.ps1 Configuration

The following changes need to be done to the default VLCGui.ps1 to make it work with NSX-T.

Changed the following to be able to select NSX-T Segments:

From
If ($isSecSet.AllowPromiscuous.Value -and $isSecSet.ForgedTransmits.Value -and $isSecSet.MacChanges.Value){   
To
If(-not ($isSecSet.AllowPromiscuous.Value -and $isSecSet.ForgedTransmits.Value -and $isSecSet.MacChanges.Value)){  

Changed the following to get 1500 bytes MTU on vSwitch0:

From
$kscfg+="esxcli network vswitch standard set -v vSwitch0 -m 9000`n"
To
$kscfg+="esxcli network vswitch standard set -v vSwitch0 -m 1500`n"

Added the following to recreate vmk0 so that it gets a unique MAC address:

$kscfg+="esxcfg-vmknic --del vmk0 -p `"Management Network`"`n"
$kscfg+="esxcfg-vmknic --add vmk0 --portgroup `"Management Network`" --ip `${IPADDR} --netmask `${SUBNET} --mtu 1500`n"
$kscfg+="esxcfg-route -a default `${IPGW}`n"

Change the MAC Address of NSX-T Virtual Distributed Router

You must change the default MAC address of the NSX-T virtual distributed router in the nested VCF deployment so that it does not use the same MAC address that is used by the hosting NSX-T virtual distributed router.

Change the MAC Address of NSX-T Virtual Distributed Router

An alternative is to configure the hosting NSX-T environment’s Overlay Transport Zone with the nested_nsx property set to true, but this has to be done when creating the Transport Zone.
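For reference, a request body for creating such a Transport Zone might look roughly like the output of the sketch below. Only the nested_nsx property comes from the text above. The display_name and transport_type fields are my assumption of how the NSX-T Manager API describes a transport zone, and the zone name and function name are made up, so check the API reference for your NSX-T version before using it.

```python
import json


def overlay_transport_zone_payload(name: str) -> str:
    """Build a JSON body for an overlay Transport Zone with nested_nsx enabled."""
    body = {
        "display_name": name,         # assumed field name
        "transport_type": "OVERLAY",  # assumed field name and value
        "nested_nsx": True,           # property named in the text above
    }
    return json.dumps(body, indent=2)


print(overlay_transport_zone_payload("overlay-tz-nested"))
```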

Thanks to Ben Sier for helping me get this to work.

Upgrading to VMware Cloud Foundation 4.4 in my Lab

VMware Cloud Foundation 4.4 was just released so I wanted to check out what was new and upgrade my lab. Going into SDDC Manager and selecting Lifecycle Management and Release Versions gave me an overview of what is new:

  • Flexible vRealize Suite product upgrades: Starting with VMware Cloud Foundation 4.4 and vRealize Lifecycle Manager 8.6.2, upgrade and deployment of the vRealize Suite products is managed by vRealize Suite Lifecycle Manager. You can upgrade vRealize Suite products as new versions become available in your vRealize Suite Lifecycle Manager. vRealize Suite Lifecycle Manager will only allow upgrades to compatible and supported versions of vRealize Suite products. Specific vRealize Automation, vRealize Operations, vRealize Log Insight, and Workspace ONE Access versions will no longer be listed in the VMware Cloud Foundation BOM.
  • Improvements to upgrade prechecks: Upgrade prechecks have been expanded to verify filesystem capacity and passwords. These improved prechecks help identify issues that you need to resolve to ensure a smooth upgrade.
  • SSH disabled on ESXi hosts: This release disables the SSH service on ESXi hosts by default, following the vSphere security configuration guide recommendation. This applies to new and upgraded VMware Cloud Foundation 4.4 deployments.
  • User Activity Logging: New activity logs capture all the VMware Cloud Foundation API invocation calls, along with user context. The new logs will also capture user logins and logouts to the SDDC Manager UI.
  • SDDC Manager UI workflow to manage DNS and NTP configurations: This feature provides a guided workflow to validate and apply DNS and NTP configuration changes to all components in a VMware Cloud Foundation deployment.
  • 2-node vSphere clusters are supported when using external storage like NFS or FC as the principal storage for the cluster: This feature does not apply when using vSAN as principal storage or when using vSphere Lifecycle Manager baselines for updates.
  • Security fixes: This release includes fixes for the following security vulnerabilities:
  • Multi-Instance Management is deprecated: The Multi-Instance Management Dashboard is no longer available in the SDDC Manager UI.
  • BOM updates: Updated Bill of Materials with new product versions.

Going to my Management Workload Domain showed that the upgrade was available for download:

I did a precheck to verify that my environment was ready to be upgraded:

I checked my current versions:

I downloaded and installed all the update bundles in the order dictated by SDDC Manager, and everything went well except for the first ESXi host upgrade:

The first ESXi host did not exit Maintenance Mode after being upgraded, hence the post check failed:

Message: VUM Remediation (installation) of an ESXi host failed.
Remediation Message: High: VUM Remediation (installation) of an ESXi host failed. Manual intervention needed as upgrade failed during install stage. Check for errors in the lcm log files located on SDDC Manager under /var/log/vmware/vcf/lcm. Please retry the upgrade once the upgrade is available again.

Health check failed on vSAN enabled cluster while exiting maintenance mode on the host: vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.controlleronhcl failed. The host is currently in maintenance mode.

The following KB addresses this issue, and I chose workaround number 3 which was to exit maintenance mode manually: https://kb.vmware.com/s/article/87698

Retrying the upgrade successfully upgraded the rest of my ESXi hosts.

NSX-v to NSX-T – End-to-End Migration #1

The End of General Support for VMware NSX Data Center for vSphere (NSX-v) was January 16, 2022. More details can be found here: https://kb.vmware.com/s/article/85706

If you are still running NSX-v, please start planning your migration to NSX-T as soon as possible as this may be a complex and time consuming job.

I have done several NSX-v to NSX-T migrations lately and thought I should share some of my experiences, starting with my latest End-to-End Migration using Migration Coordinator. This was a VMware Validated Design implementation containing a Management Workload Domain and one VI Workload Domain. I will not go into all the details involved, but rather focus on the challenges we faced and how we resolved them.

Challenge 1

The first error we got in Migration Coordinator was this:

Config translation failed [Reason: TOPOLOGY failed with "'NoneType' object has no attribute 'get'"]

After some investigation we tried to reboot the NSX-V Manager appliance, but this did not seem to resolve it. We then noticed that EAM status was in a Starting state on NSX Manager, but never changed to Up. We tried to figure out why, and after a while we found that the “Universal Synchronization Service” was stopped so we started it manually, and this made the EAM status change to Up. Not sure how this is related really, but we never saw the error again.

Challenge 2

The next error we got in Migration Coordinator:

Config migration failed [Reason: HTTP Error: 400: Active list is empty in the UplinkHostSwitchProfile. for url: http://localhost:7440/nsxapi/api/v1/host-switch-profiles]

We went through the logs trying to figure out what was causing this, but never found any clues. We ended up going through every Portgroup on the DVS and found one unused Portgroup with no Active Uplinks. We deleted this Portgroup since it was not in use and this resolved the problem. If the Portgroup had been in use, we could have added an Uplink to it.

Challenge 3

Migration Coordinator failed to remove NSX-v VIBs from the ESXi host. At first we didn’t figure out why, but we tried to manually remove the VIBs using these commands:

esxcli software vib get -n esx-nsxv

Showed that the VIB was installed.

esxcli software vib remove -n esx-nsxv

Failed with “No VIB matching VIB search specification ‘esx-nsxv'”

After rebooting the host, the same remove command successfully removed the NSX-v VIBs.

Challenge 4

NSX-T VIBs fail to install/upgrade, due to insufficient space in bootbank on ESXi host.

This was a common problem a few years back, but I hadn’t seen this in a while.

The following KB has more details and helped us figure out what to do: https://kb.vmware.com/s/article/74864

After removing many unused VIBs we were able to make enough room to install NSX-T. When we did this in advance on other hosts, we also got rid of Challenge 3.

Challenge 5

During the “Migrate NSX Data Center for vSphere Hosts” step we noticed the following in the doc:

“You must disable IPFIX and reboot the ESXi hosts before migrating them.”

We had already disabled IPFIX, but we hadn’t rebooted the hosts, so we decided to do that. However, this caused all VMs to lose network connectivity. NSX-v was running in CDO mode, so I am not sure why this happened, but it probably has to do with the fact that the control plane is down at this point in the process. We had a maintenance window scheduled so the customer didn’t mind, but next time I would make sure to reboot the hosts in advance.

Challenge 6

The customer was using LAG, and when checking the Host Transport Nodes in NSX-T Manager, they all had PNIC/Bond Status Degraded. Even though we had migrated all PNICs and VMKs to N-VDS, the hosts still had a VDS connected which had no PNICs attached. Removing the VDS solved this problem.

Challenge 7

Since we were migrating from NSX-v to NSX-T in a Management Cluster, we would end up migrating NSX-T Manager from a VDS to an N-VDS.

The NSX-T Data Center Installation Guide has the following guidance as well as details on how to configure this:

“In a single cluster configuration, management components are hosted on an N-VDS switch as VMs. The N-VDS port to which the management component connects to by default is initialized as a blocked port due to security considerations. If there is a power failure requiring all the four hosts to reboot, the management VM port will be initialized in a blocked state. To avoid circular dependencies, it is recommended to create a port on N-VDS in the unblocked state. An unblocked port ensures that when the cluster is rebooted, the NSX-T management component can communicate with N-VDS to resume normal function.”

Hopefully this post will help you avoid some of these challenges when migrating from NSX-v to NSX-T.