ESXi 6.5: Host fails with PSOD when IPv6 is disabled

I have a habit of reading all new KB articles published by VMware every week. Not only does it give visibility of the current issues in VMware products, but it also helps me to be proactive: to learn about known behaviours and workarounds, and to be prepared to remediate issues if required.

So, having written a few blog posts about vCenter 6.5 and IPv6 here and here, I noticed that ESXi 6.5 hosts can also fail with a Purple Screen of Death (PSOD) when IPv6 is disabled.

VMware has published KB 2150794, which explains this behaviour.

The only workaround at this moment is to re-enable IPv6 on all hosts in your environment.

ESXi65-IPv6-PSOD
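To find out which hosts are affected, you can inspect the output of esxcli network ip get on each host. Below is a minimal sketch: it assumes the output contains a line of the form "IPv6 Enabled: true", and the helper name ipv6_enabled is my own, not something from the KB.

```shell
# Reads `esxcli network ip get` output on stdin and succeeds only when
# it reports IPv6 as enabled.
# Assumption: the output contains a line such as "IPv6 Enabled: true".
ipv6_enabled() {
    grep -qi 'IPv6 Enabled:[[:space:]]*true'
}

# Usage on an ESXi host (shown as comments, not executed here):
#   esxcli network ip get | ipv6_enabled || echo "IPv6 is disabled on this host"
#   esxcli network ip set --ipv6-enabled=true
```

As far as I know, the host has to be rebooted before a change to the IPv6 setting takes effect, so plan a maintenance window for it.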

VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 1

IPv6 support was introduced to VMware’s virtual platform in vSphere 4.1. It is enabled in the vCenter Server Appliance by default and can be controlled in VCSA 6.0 and 6.5 from the Direct Console User Interface (Customize System > Configure Management Network > IPv6 Configuration).

IPv6-Issue-01

To my surprise, disabling IPv6 can cause some problems with the VCSA updates. I will explain this statement and provide a workaround in the paragraphs below.

Imagine your security team requires IPv6 to be turned off on vCenter Server. Following this request, you make the configuration change in the DCUI.

IPv6-Issue-02

After rebooting the virtual machine, everything should work fine. Now it is time to update the virtual appliance to a newer version. You download a patch file, attach it to the VM, and start the update process from the VMware vSphere Appliance Management Interface.

When the server reboots, you will notice that the Appliance Management User Interface is no longer accessible. To troubleshoot this issue further, we need to open an SSH session with the appliance and enable Shell mode.

First, we run the netstat command to see whether any service is listening on TCP port 5480. The command output does not show anything.

IPv6-Issue-03
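The port check can be scripted so that it returns a clear yes or no. This is only a sketch: the exact netstat flags I used are in the screenshot, so the -tln combination and the helper name port_listening are assumptions made for illustration.

```shell
# Reads `netstat -tln` style output on stdin and succeeds if any socket
# in the LISTEN state is bound to the given TCP port.
# Assumption: column 4 is the local address and column 6 is the state.
port_listening() {
    awk -v p="$1" '
        $6 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
        END { exit !found }
    '
}

# Usage on the appliance (shown as comments, not executed here):
#   netstat -tln | port_listening 5480 || echo "nothing is listening on 5480"
```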

The next step is to identify the service which provides the Appliance MUI and check its current status. Fortunately, I noticed a related error message while the operating system was booting up.

IPv6-Issue-04

Querying the vami-lighttp.service status shows the following results.

IPv6-Issue-05

So a duplicate server.use-ipv6 parameter in the configuration file was causing this behaviour. To find this file, I used a combination of the rpm and egrep commands to filter the output.

IPv6-Issue-06

A quick search in /opt/vmware/etc/lighttpd/lighttpd.conf shows that there are two identical lines with IPv6 settings as follows:

IPv6-Issue-07

To fix this issue, I removed one of the lines, started the vami-lighttp.service and checked that the service works as expected.

IPv6-Issue-08
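If you prefer not to edit the file by hand, the same fix can be sketched with two small helpers. The function names count_directive and drop_repeats are my own, and the sketch assumes the directive starts its line (ignoring leading whitespace); it prints the cleaned file to stdout rather than changing anything in place.

```shell
# Prints how many lines of a config file set the given directive.
# $1 = config file, $2 = directive name (matched as a literal prefix).
count_directive() {
    awk -v d="$2" '
        { line = $0; sub(/^[[:space:]]+/, "", line) }
        index(line, d) == 1 { n++ }
        END { print n + 0 }
    ' "$1"
}

# Prints the file with every repeat of the directive dropped, keeping
# only its first occurrence. The original file is left untouched.
drop_repeats() {
    awk -v d="$2" '
        { line = $0; sub(/^[[:space:]]+/, "", line) }
        index(line, d) == 1 && seen++ { next }
        { print }
    ' "$1"
}

# Usage on the appliance (shown as comments, not executed here):
#   conf=/opt/vmware/etc/lighttpd/lighttpd.conf
#   count_directive "$conf" server.use-ipv6
#   cp "$conf" "$conf.bak" && drop_repeats "$conf.bak" server.use-ipv6 > "$conf"
#   systemctl start vami-lighttp.service && systemctl status vami-lighttp.service
```

Keeping the .bak copy makes it easy to roll back if the service still refuses to start.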

To be continued…

vSphere 6.x: The beauty and ugliness of the Content Library – Part 1

The title of this blog post seems to be a bit provocative, and this has been done for a reason.

I believe many VMware engineers, including myself, were really excited about the Content Library feature introduced in vSphere 6.0. The product itself is not completely new for VMware, as it merges code from the content management feature of vCloud Director.

In the What’s New in the VMware vSphere 6.0 Platform whitepaper, VMware states the following:

“The Content Library… centrally manages virtual machine templates, ISO images, and scripts, and it performs the content delivery of associated data from the published catalog to the subscribed catalog at other sites.”

Sounds really cool! Now we can centralise all objects that were previously residing on different datastores in one place, and manage them from vSphere Web Client.

In vSphere 6.5, VMware continues improving and polishing this feature:

“Administrators can now mount an ISO directly from the Content Library, apply a guest OS customization specification during VM deployment, and update existing templates.”

However, this article is not only about embracing the tool provided. 🙂 I would like to share with you three specific examples when it doesn’t work as expected, and possible workarounds.

Issue #1 – Provisioning a virtual machine template with the advanced parameters

Affected platform: vSphere 6.0 prior to Update 3.

It was a great surprise to learn that provisioning a virtual machine from a VM template which has advanced parameters set can cause problems in vSphere 6.0. Although the provisioning operation starts as expected, it ends with the error message “Failed to deploy OVF package”.

CL-Issue01-01

Unfortunately, the Error Report in vSphere Web Client does not clarify the root cause of this event.

CL-Issue01-02

After contacting VMware GSS about this issue (SR # 16255562909) in late 2016, I was advised that this bug would be addressed in vSphere 6.0 Update 3.

In March 2017 I updated my environment to this version and tested the feature: VM creation worked smoothly. So it took VMware almost two years from the general availability of the Content Library feature to fix this.

Fortunately, vSphere 6.5 does not have this problem at all.

Resolution: Update your environment to vSphere 6.0 Update 3 or a newer version.

Issue #2 – Provisioning a virtual machine from the Content Library on the vSAN datastore

Affected platform: vSphere 6.5 Standard.

The issue is not directly related to the Content Library, but rather to OVA/OVF provisioning. For some reason, when you create a new VM from a template in vSphere 6.5, it triggers the “Call DRS for cross vMotion placement recommendations” task.

If you use vSphere 6.5 Standard, for which the DRS feature is not available, it causes this task to fail with the error message “The operation failed due to The operation is not allowed in the current state.”

CL-Issue02-01

CL-Issue02-02

The Error Report in vSphere Web Client looks similar to the one in the picture below.

CL-Issue02-03

In the Known Issues in VMware vSAN 6.6 Release Notes, the vendor states the following:

VM OVF deploy fails if DRS is disabled
If you deploy an OVF template on the vSAN cluster, the operation fails if DRS is disabled on the vSAN cluster. You might see a message similar to the following: The operation is not allowed in the current state.

Workaround: Enable DRS on the vSAN cluster before you deploy an OVF template.

After doing some troubleshooting and trying different scenarios, the only difference in the provisioning task I was able to identify was the VM storage policy. Regardless of how the VM creation was initiated (from an OVA/OVF file or a Content Library template), it was the Virtual SAN Default Storage Policy that triggered the call to DRS to perform a cross vMotion check.

For example, if you set the VM storage policy in the Select storage dialogue box to “None”, the OVA/OVF file can be provisioned on the vSAN datastore.

CL-Issue02-04

The same happens for the VM template from the Subscribed Content Library when the VM storage policy is “None”.

Unfortunately, this trick doesn’t work with the templates in the Local Content Library.

So I decided to dig a bit deeper into the Content Library structure to see if anything could be done there.

The Content Library keeps its data in the contentlib-GUID folder. Each template has its own subfolder with a unique name. Inside the subfolder, there are a few files: a descriptor (*.ovf) and one or more data files (*.vmdk).

In vSphere 6.0 those files are named as descriptor_GUID.ovf and disk-vdcs-Disk_Number_GUID.vmdk.

With vSphere 6.5 the files are self-explanatory: Template_Name_GUID.ovf and Template_Name-Disk_Number_GUID.vmdk.

CL-Issue02-05

I compared the descriptor files for the VM templates in the Local and Subscribed Content Libraries, and found they had different vmw:name values in the StorageGroupSection. For the Local Content Library it was “Virtual SAN Default Storage Policy”, and for the subscribed one it was different.

CL-Issue02-06
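To compare descriptors without opening each one in an editor, the policy name can be pulled out with grep. This is a rough sketch which assumes that vmw:name appears in the descriptor as an XML attribute (vmw:name="..."); the helper name storage_policy_names is hypothetical.

```shell
# Prints the value of every vmw:name="..." attribute found in an OVF
# descriptor, one per line.
# Assumption: the storage policy name is stored as an XML attribute.
storage_policy_names() {
    grep -o 'vmw:name="[^"]*"' "$1" | sed 's/^vmw:name="//; s/"$//'
}

# Usage (shown as comments, not executed here), run against a copy of
# the descriptor from the template subfolder:
#   storage_policy_names ./Template_Name_GUID.ovf
```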

It all led me to the idea of changing this descriptor for the VM template in the Local Content Library. So I could provision the VMs using one of the workarounds below.

Workarounds:

  • When provisioning from an OVA/OVF file, set the VM storage policy in the Select storage dialogue box to “None”.
  • You can provision from the Subscribed Content Library if its VM templates have a VM storage policy other than the “Virtual SAN Default Storage Policy”. Set the VM storage policy in the Select storage dialogue box to “None”.
  • You can provision from the Local Content Library if you edit the descriptor file for the VM template and replace the “Virtual SAN Default Storage Policy” with something else. Set the VM storage policy in the Select storage dialogue box to “None”.

Resolution: A support case has been opened, and I am waiting for VMware to resolve this issue. The fix is expected in vSphere 6.5 Update 1 (please refer to SR # 17393663302 when contacting VMware GSS for future updates).

To be continued

VMware Log Insight 4.0 and a slow login with the domain user credentials

Recently I was spinning up one more instance of the VMware Log Insight 4.0 appliance in a branch office.

After enabling authentication against Active Directory, I noticed that logging on to the Log Insight web interface was relatively slow. Moreover, when I pointed the Authentication Configuration to the local domain controllers, the connection test always failed.

li-ad-integration-02

I did not have enough time to troubleshoot this issue. So I decided to continue with this task later on.

A few days later the situation became even worse: domain users could not log on to the appliance at all, and only a spinning wheel appeared after pressing the login button.

li-ad-integration-01

Fortunately, I am not the first customer to come across this issue. VMware has published an article, “Unable to Log In Using Active Directory Credentials”, which helps to locate the cause of this behaviour.

As suggested by the vendor, I looked through the records in the /storage/var/loginsight/runtime.log file and found the following:

[com.vmware.loginsight.aaa.krb5.KrbAuthenticator] [Attempting Kerberos login: [[ user=XXXXX ], [ domain=XXXXX ]]]

[com.vmware.loginsight.aaa.krb5.KrbAuthenticator] [Kerberos login in 270817ms]

jsonResult: {"result":"Cannot reach kerberos servers through TCP."}

suggestion Please verify that your firewall settings allow TCP ports for active directory and kerberos.

I should mention here that Active Directory has a hub-and-spoke topology, with domain controllers in both the local and central sites being available to the clients.

By default, Log Insight can be pointed at specific domain controllers, but not at specific Kerberos servers. As a result, the Kerberos client uses auto-discovery to contact any server listed in the _ldap._tcp.dc._msdcs.[domain_name] namespace, which delays it in reaching the ones that are actually available. To illustrate this, you can execute the following command from the Log Insight CLI:

~# netstat -A inet --program | egrep -i "kerberos"

It should show you all active UDP sessions which were initiated by the Kerberos client.

The next step is to find a way to narrow down the list of domain controllers to those which are available to the client. VMware helps us with this task by providing “advanced options for Active Directory integration in Log Insight beyond what is available in the administrative user interface”.

The problem can be resolved with the following steps:

  1. Open https://loginsight_hostname_or_ipaddress/internal/config web-page.
  2. Add krb-domain-servers option with the appropriate values for the available domain controllers to the advanced configuration and save those changes.
  3. Restart Log Insight server.

After all those changes are completed, you should be able to log on quickly to Log Insight with the domain account:

[com.vmware.loginsight.aaa.krb5.KrbAuthenticator] [Attempting Kerberos login: [[ user=XXXXX ], [ domain=XXXXX ]]]

[com.vmware.loginsight.aaa.krb5.KrbAuthenticator] [Kerberos login in 22ms]
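To compare login times before and after the change, the durations can be extracted from runtime.log. The sketch below relies only on the “Kerberos login in Nms” format shown above; the helper names and the 5000 ms default threshold are arbitrary choices of mine.

```shell
# Prints the duration (in milliseconds) of every Kerberos login recorded
# in the given log file, one value per line.
kerberos_login_ms() {
    sed -n 's/.*\[Kerberos login in \([0-9][0-9]*\)ms\].*/\1/p' "$1"
}

# Prints only the durations above a threshold in milliseconds.
# $2 is optional; 5000 is an arbitrary illustrative default.
slow_kerberos_logins() {
    kerberos_login_ms "$1" | awk -v limit="${2:-5000}" '$1 + 0 > limit + 0'
}

# Usage on the appliance (shown as comments, not executed here):
#   slow_kerberos_logins /storage/var/loginsight/runtime.log
```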

03/03/2017 – Update 1: With the release of vRealize Log Insight 4.3 the issue has been resolved. Please see the release notes for more details.

 

vCenter Support Assistant 6.5: This type of network adapter is not supported by {0}Other Linux (64-bit)

VMware has just released a new version of vCenter Support Assistant 6.5 which officially supports vSphere 6.5 and has a few noticeable improvements compared to the previous release.

In this appliance, SUSE Linux has been replaced with Photon OS. The shift looks quite logical, as VMware pushes its own Linux flavour to more and more new products. Not only does it help to maintain a holistic approach when distributing virtual appliances, but it also promises improved performance of the operating system, as VMware has invested heavily in making it lightweight and fast.

However, when I completed provisioning vSA 6.5 in my environment and checked the virtual machine settings, to my surprise I found the warning message shown in the screenshot below.

vsa-issue-01

It is not difficult to understand the root cause of this issue and eliminate it completely.

To keep backwards compatibility with earlier versions of vCenter Server, the VM hardware was set to version 8 (ESXi 5.0 and later).

vsa-issue-02

This choice of the OS is entirely unexpected, as ‘Other Linux (64-bit)’ is classified as a Legacy operating system by the vendor.

vsa-issue-03

It is not until VM hardware version 10 that the guest operating system can be changed to ‘Other 3.x or later Linux (64-bit)’ to resolve the problem. So the workaround is to upgrade the VM to at least hardware version 10, and then choose the compatible OS type.
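If you have several appliances to check, the hardware version can also be read from a copy of the VM’s .vmx file. This is a sketch under the assumption that the file stores it as a line like virtualHW.version = "8"; both helper names are made up for this example.

```shell
# Prints the virtual hardware version recorded in a .vmx file.
# Assumption: the file contains a line such as: virtualHW.version = "8"
vmx_hw_version() {
    sed -n 's/^[[:space:]]*virtualHW\.version[[:space:]]*=[[:space:]]*"\([0-9][0-9]*\)".*/\1/p' "$1"
}

# Succeeds when the hardware version is at least 10, the minimum that
# offers 'Other 3.x or later Linux (64-bit)' according to the text above.
supports_newer_linux_guest() {
    ver=$(vmx_hw_version "$1")
    [ -n "$ver" ] && [ "$ver" -ge 10 ]
}

# Usage (shown as comments, not executed here):
#   supports_newer_linux_guest ./appliance.vmx || echo "upgrade VM hardware first"
```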

My suggestion to VMware would be to introduce a new Guest OS version called ‘Linux / Photon OS’ with the compatible hardware profile to prevent similar warnings in the future.

vSphere 6.0: Available storage for /storage/log reached warning threshold – less than 30 % available space

If you run vCenter Server Appliance with an External Platform Services Controller, you might notice a warning message in the Services Health area of the Administration -> System Configuration -> Summary tab.

VMware Syslog Service reports a warning message as soon as /storage/log has less than 30 percent of free space, similar to what is in the picture below.

syslog-service-issue-01

syslog-service-issue-02
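You can check how close the volume is to the threshold from an SSH session. The sketch below derives the free percentage from the Capacity column of df -P output; note that df rounds this value, so it may differ slightly from what the Syslog Service health check reports. The helper names are my own.

```shell
# Reads `df -P <mountpoint>` output on stdin and prints the free space
# as a whole percentage (100 minus the Capacity column of the data row).
free_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeeds when the free percentage ($1) is below the threshold
# ($2, defaulting to the 30 percent mentioned above).
below_threshold() {
    [ "$1" -lt "${2:-30}" ]
}

# Usage on the PSC (shown as comments, not executed here):
#   free=$(df -P /storage/log | free_percent)
#   below_threshold "$free" && echo "/storage/log has only ${free}% free"
```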

The problem appears to be with the VMDK disk for the /storage/log mount point. On the PSC, it has a default size of 5 GB and fills up quickly with the SSO log files.

syslog-service-issue-03

VMware has two possible solutions to resolve this issue, as follows:

The second option is preferable, as it eliminates the need to monitor changes in the log4j.properties file after a system update. However, the commands in VMware KB 2126276 do not apply to the Platform Services Controller appliance, because it doesn’t have a vpxd_servicecfg script to automate the volume extension.

Fortunately, Florian Grehl has documented a workaround for PSC, which requires us to extend the VMDK5 using the vSphere Web Client and execute the following commands in an SSH session on the affected server:

1. Rescan the SCSI Bus to make Linux aware of the resized virtual disk

# rescan-scsi-bus.sh -w --forcerescan

2. Change the size of the Volume Group by using the Disk Device from the table above

# pvresize /dev/sde

3. Resize the Logical Volume by using the name from the table above

# lvresize --resizefs -l +100%FREE /dev/log_vg/log
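Before running pvresize, it is worth confirming which disk device actually backs the log volume group on your appliance. A minimal sketch, assuming the LVM pvs tool is available on the PSC; the helper name pv_for_vg is hypothetical.

```shell
# Reads `pvs --noheadings -o pv_name,vg_name` output on stdin and prints
# the physical volume(s) that belong to the given volume group.
pv_for_vg() {
    awk -v vg="$1" '$2 == vg { print $1 }'
}

# Usage on the PSC (shown as comments, not executed here):
#   pvs --noheadings -o pv_name,vg_name | pv_for_vg log_vg
#   df -h /storage/log    # check the new size after lvresize completes
```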

After completing the commands and verifying the volume size, we should restart the VMware Syslog Service to refresh its state. This can be done from the same SSH session or using vSphere Web Client.

syslog-service-issue-04

And this is how things are back to normal 🙂