ESXi 6.5: Host fails with PSOD when IPv6 is disabled

I have a habit of reading all new KB articles published by VMware every week. Not only does this give visibility of the current issues with VMware products, but it also helps me learn about behaviours and workarounds proactively, so I am prepared to remediate them if required.

Therefore, after writing a few blog posts about vCenter 6.5 and IPv6 here and here, it caught my eye that ESXi 6.5 hosts can also fail with a Purple Screen of Death when IPv6 is disabled.

VMware has published KB 2150794 that explains this behaviour.

The only workaround at the moment is to re-enable IPv6 on all hosts in your environment.
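
If you prefer the command line to the DCUI, a minimal sketch of re-enabling IPv6 from the ESXi Shell looks like this (the hostname is hypothetical, and the change only takes effect after a reboot):

[root@esxi-01:~] esxcli network ip set --ipv6-enabled=true
[root@esxi-01:~] reboot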

[Image: ESXi65-IPv6-PSOD]

VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 2

In Part 1 of this mini-series, I wrote about the issue with the Appliance Management User Interface. However, the dependency on the IPv6 protocol in VCSA 6.5 can cause unexpected behaviour with the vSphere ESXi Dump Collector service as well. Let’s look into this one now.

In an environment with many ESXi hosts, it is vital to have their diagnostic data available for troubleshooting. By default, each host has a diagnostic coredump partition on local storage. The hypervisor can preserve diagnostic information in one or more pre-configured locations: a local partition, a file on a VMFS datastore, or a network dump server on vCenter Server.

[Image: ESXi-dump-collection-06]
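
Each of these targets can be inspected from the ESXi Shell. A quick sketch, assuming an ESXi 6.x host (the hostname is hypothetical):

[root@esxi-01:~] esxcli system coredump partition get
[root@esxi-01:~] esxcli system coredump file list
[root@esxi-01:~] esxcli system coredump network get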

In case of a critical failure, when the host gets into the Purple Screen of Death (PSOD) state, the hypervisor generates a set of diagnostic data archived in a coredump. In my opinion, it is more efficient to have this information stored in a centralised location, and this is where the vSphere ESXi Dump Collector service can be useful.

Initially, the vSphere ESXi Dump Collector service is disabled on the vCenter Server Appliance.

[Image: ESXi-Dump-Collector-01]

The setup process is straightforward: select a startup type for the service (by default, it is set to Manual) and click the Start button to enable it.

[Image: ESXi-Dump-Collector-02]

Depending on the network requirements and the number of ESXi hosts, you might need to change the Coredump Server UDP port (6500 by default) or increase the repository maximum size (2 GB by default). Both settings require a restart of the vSphere ESXi Dump Collector service.
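
For completeness, each ESXi host also needs to be pointed at the collector. A minimal sketch from the ESXi Shell, assuming vmk0 as the management interface and 192.168.1.50 as a hypothetical VCSA address:

[root@esxi-01:~] esxcli system coredump network set --interface-name vmk0 --server-ipv4 192.168.1.50 --server-port 6500
[root@esxi-01:~] esxcli system coredump network set --enable true
[root@esxi-01:~] esxcli system coredump network check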

This process becomes a little more complicated when IPv6 is disabled on VCSA. An attempt to start the vSphere ESXi Dump Collector service generates the following error message in the vSphere Web Client:

[Image: ESXi-Dump-Collector-03]

If we connect to the virtual appliance and start the netdumper service from a console session, it shows us more information:

root@n-vcsa-01 [ ~ ]# service-control --start netdumper
Perform start operation. vmon_profile=None, svc_names=['netdumper'], include_coreossvcs=False, include_leafossvcs=False
2017-07-04T10:15:32.179Z Service netdumper state STOPPED
Error executing start on service netdumper. Details {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Service-control failed. Error {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}

The next step in troubleshooting this issue is to look into the vSphere ESXi Dump Collector service log file (/var/log/vmware/netdumper/netdumper.log). It reports that the address is already in use:

root@n-vcsa-01 [ ~ ]# cat /var/log/vmware/netdumper/netdumper.log
2017-07-04T10:19:32.121Z| netdumper| I125: Log for vmware-netdumper pid=8347 version=XXX build=build-5318154 option=Release
2017-07-04T10:19:32.121Z| netdumper| I125: The process is 64-bit.
2017-07-04T10:19:32.121Z| netdumper| I125: Host codepage=UTF-8 encoding=UTF-8
2017-07-04T10:19:32.121Z| netdumper| I125: Host is Linux 4.4.8 VMware Photon 1.0 Photon VMware Photon 1.0

2017-07-04T10:19:32.123Z| netdumper| I125: Configured to handle 1024 clients in parallel.
2017-07-04T10:19:32.123Z| netdumper| I125: Configuring /var/core/netdumps as the directory to store the cores
2017-07-04T10:19:32.123Z| netdumper| I125: Configured to use wildcard [::0/0.0.0.0]:6500 as IP address:port
2017-07-04T10:19:32.123Z| netdumper| I125: Using /var/log/vmware/netdumper/netdumper.log as the logfile.
2017-07-04T10:19:32.123Z| netdumper| I125: Nothing to post process
2017-07-04T10:19:32.123Z| netdumper| I125: Couldn't bind socket to port 6500: 98 Address already in use
2017-07-04T10:19:32.123Z| netdumper| I125:

Playing a bit with Linux commands gave me some clues:

root@n-vcsa-01 [ ~ ]# netstat -lup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 *:kerberos *:* 1489/vmdird
udp 0 0 *:sunrpc *:* 1062/rpcbind
udp 0 0 n-vcsa-01.testorg.l:ntp *:* 1249/ntpd
udp 0 0 photon-machine:ntp *:* 1249/ntpd
udp 0 0 *:ntp *:* 1249/ntpd
udp 0 0 *:epmap *:* 1388/dcerpcd
udp 0 0 *:syslog *:* 2229/rsyslogd
udp 0 0 *:794 *:* 1062/rpcbind
udp 0 0 *:ideafarm-door *:* 3905/vpxd
udp 0 0 *:llmnr *:* 1223/systemd-resolv
udp6 0 0 [::]:tftp [::]:* 1/systemd
udp6 0 0 [::]:sunrpc [::]:* 1062/rpcbind
udp6 0 0 [::]:ntp [::]:* 1249/ntpd
udp6 0 0 [::]:syslog [::]:* 2229/rsyslogd
udp6 0 0 [::]:794 [::]:* 1062/rpcbind
udp6 0 0 [::]:boks [::]:* 17377/vmware-netdum

root@n-vcsa-01 [ ~ ]# ps -p 17377
PID TTY TIME CMD
17377 ? 00:00:00 vmware-netdumpe

root@n-vcsa-01 [ ~ ]# cat /proc/17377/cmdline
/usr/sbin/vmware-netdumper-d/var/core/netdumps-o6500-l/var/log/vmware/netdumper/netdumper.log
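
The arguments appear fused together because /proc/<pid>/cmdline separates them with NUL bytes; a generic Linux trick to render them readable is to pipe the file through tr:

root@n-vcsa-01 [ ~ ]# tr '\0' ' ' < /proc/17377/cmdline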

Even though it reports an error at startup, the vSphere ESXi Dump Collector service is (partially) running on the virtual appliance.

Thanks to Michael, who shared a detailed guide, I was able to test this assumption quickly.
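
The usual way to test a dump collector is to crash a lab host deliberately. I will not claim this is exactly what the guide does, but a commonly used method (on a test host only, never in production) is the following vsish call, which immediately triggers a PSOD:

[root@esxi-01:~] vsish -e set /reliability/crashMe/Panic 1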

[Image: ESXi-Dump-Collector-04]

[Image: ESXi-Dump-Collector-05]

The coredump was successfully transferred from the ESXi host to the /var/core/netdumps/ folder on the VCSA appliance. However, there were no records of this operation in netdumper.log.

This issue has been reported to VMware GSS (SR # 17385781602) and should be resolved in future updates to VCSA 6.5.

vSphere 6.0 issue: the VMware Client Integration Plugin has updated its SSL certificate in Firefox

I have noticed that with the recent releases of Mozilla Firefox and Google Chrome, the ability to launch the VMware Client Integration Plugin is broken again. vSphere Web Client 6.0 keeps showing a pop-up message as follows:

[Image: CIP Issue - 01]

This happens because both browsers have removed support for NPAPI plugins. As a result, some operations in the Web Client break, such as deploying OVF or OVA templates and transferring files with the datastore browser.

The only workable solution I have found for this issue is to use Firefox 52 Extended Support Release (32-bit version), which will support NPAPI plugins until early 2018.

Alternatively, vCenter Server Appliance can be upgraded to version 6.5, where “the VMware Enhanced Authentication Plug-in replaces the Client Integration Plug-in from vSphere 6.0 releases and earlier”, which no longer requires NPAPI support.

21/06/2017 – Update 1: This message also pops up when the web browser is configured to use a proxy server. Switching to ‘no proxy’ mode stops it from appearing.

A few notable changes to vROps 6.6 and Log Insight 4.5

With the recent releases of vRealize Operations Manager 6.6 and Log Insight 4.5, VMware has made some notable changes to these products:

  • The vRealize Operations Manager Plugin for vSphere Web Client should be removed after upgrading to vROps 6.6.
    [Image: vROps Plugin - 01]
    In KB 2150394, VMware provides detailed instructions on how to do this for both vCenter for Windows and the vCenter Server Appliance.
  • Native support for Active Directory in vRealize Log Insight is now deprecated.
    [Image: Log Insight - AD Integration - 01]
    As per KB 2148976, VMware Identity Manager (VIDM) should be configured as an alternative. It has not been confirmed yet whether VIDM is available for free for Log Insight users; more information will be posted on this thread on VMware Technology Network.
    I understand that this change widens the business scenarios for the product. However, for those of us who use Log Insight purely for collecting and analysing vSphere logs, it would be great to have Active Directory replaced with vCenter Single Sign-On (vCenter SSO), which sounds more logical.
    You can vote for Log Insight integration with vCenter SSO via this link; it would be great if more people requested this feature.

21/06/2017 – Update 1: VMware Identity Manager for Log Insight has been officially released, and it is free! The VIDM virtual appliance is available on the Log Insight 4.5 download page.

27/06/2017 – Update 2: VMware has changed its policy and continues to provide support for the Active Directory integration in Log Insight 4.5. However, “it may be removed in a future version” (and probably will be).

VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 1

IPv6 support was introduced to VMware’s virtual platform in vSphere 4.1. It is enabled by default in the vCenter Server Appliance and can be controlled in VCSA 6.0 and 6.5 from the Direct Console User Interface (Customize System > Configure Management Network > IPv6 Configuration).

[Image: IPv6-Issue-01]

To my surprise, disabling IPv6 can cause problems with VCSA updates. I will explain this statement and provide a workaround in the paragraphs below.

Imagine your security team requires IPv6 to be turned off on vCenter Server. Following this request, you proceed with the configuration change in the DCUI.

[Image: IPv6-Issue-02]

After rebooting the virtual machine, everything should work fine. Now, it is time to update the virtual appliance to a newer version. You download a patch file, attach it to the VM, and start the update process from the VMware vSphere Appliance Management Interface.

When the server reboots, you will notice that the Appliance Management User Interface is not accessible anymore. To troubleshoot this issue further, we need to open an SSH session to the appliance and enable Shell mode.

First, we use the netstat command to see if any service is listening on TCP port 5480. The command output shows nothing.
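
The check itself is a one-liner; the exact invocation in the screenshot below may differ, but something like this does the job:

root@n-vcsa-01 [ ~ ]# netstat -tlnp | grep 5480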

[Image: IPv6-Issue-03]

The next step is to identify the service that provides the Appliance MUI and its current status. Fortunately, I noticed an error message related to the problem while the operating system was booting up.

[Image: IPv6-Issue-04]

Querying the vami-lighttp.service status shows the following results.
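
VCSA 6.5 is built on Photon OS with systemd, so the query is presumably a standard systemctl call along these lines:

root@n-vcsa-01 [ ~ ]# systemctl status vami-lighttp.service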

[Image: IPv6-Issue-05]

So it was a duplicate server.use-ipv6 parameter in the configuration file that was causing this behaviour. To find this file, I used a combination of the rpm and egrep commands to filter the output.
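
A sketch of that search (the exact filter I used may have differed slightly):

root@n-vcsa-01 [ ~ ]# rpm -qal | egrep 'lighttpd.*\.conf'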

[Image: IPv6-Issue-06]

A quick search in /opt/vmware/etc/lighttpd/lighttpd.conf shows that there are two identical lines with IPv6 settings as follows:
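
For example, grep makes the duplicate easy to spot:

root@n-vcsa-01 [ ~ ]# grep -n 'use-ipv6' /opt/vmware/etc/lighttpd/lighttpd.conf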

[Image: IPv6-Issue-07]

To fix this issue, I removed one of the lines, started vami-lighttp.service, and checked that the service worked as expected.
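
A minimal sketch of the restart and verification steps, after manually removing the duplicate line from lighttpd.conf (the curl call is just a quick probe that the MUI answers on port 5480 again):

root@n-vcsa-01 [ ~ ]# systemctl start vami-lighttp.service
root@n-vcsa-01 [ ~ ]# systemctl status vami-lighttp.service
root@n-vcsa-01 [ ~ ]# curl -k -I https://localhost:5480/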

[Image: IPv6-Issue-08]

To be continued…

vSphere 6.x: The beauty and ugliness of the Content Library – Part 2

In Part 1 of this mini-series, I wrote about some technical problems that I had had with the Content Library and provided workable solutions for them.

Here I am going to touch on the security aspect of this technology. Fortunately, there are no insurmountable complications with restricting virtual machine provisioning; it is just not as straightforward as I, or some of the readers, would expect.

Issue #3 – Preventing users from provisioning virtual machines from the Content Library

Affected platform: vSphere 6.0 and 6.5, all versions.

In vCenter, permissions are assigned to objects in the object hierarchy called the vSphere Inventory Hierarchy. The individual permissions are called privileges. They are combined into roles, which are then allocated to users.

The Content Library has its own set of privileges under All Privileges > Content Library. They are designed to manage different settings related to the configuration of the object. There is a predefined role in vSphere called Content Library Administrator; its primary purpose is to give a user privileges to monitor and manage a library and its contents.

However, if you would like to restrict VM provisioning from the Content Library, a look at the long list shows there is no privilege that can help to achieve this task.

After doing some testing and discussing this subject with VMware GSS, the only solution we were able to come up with was removing all Content Library privileges from the role and assigning it to the users at the vCenter Server level. In this case, users won’t be able to access the items in the Content Library. I was a bit frustrated with this limitation and even contacted the engineering team at VMware directly about the issue.

Coincidentally, I was working on restricting VM provisioning from other sources: vApps and OVA/OVF templates. It was then I realised it was actually possible to implement a complete solution to my problem.

As you might know, the Content Library keeps the VM template objects in OVF format.

[Image: CL-02]

So I decided to play with the privileges that control the deployment process from OVF templates. Surprisingly, it was the vApp import privilege that helped me achieve my goal. Happy days!

[Image: CL-03]

Resolution: Remove the All Privileges > vApp > Import privilege from the user role, as described in VMware KB 2105932.