VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 2

In Part 1 of this mini-series, I was writing about the issue with the Appliance Management User Interface. However, a dependency on the IPv6 protocol in VCSA 6.5 can cause an unexpected behaviour with the vSphere ESXi Dump Collector service as well. Let’s look into this one now.

In the environment with many ESXi hosts, it is vital to have their logs available for troubleshooting. By default, each host has a diagnostic coredump partition available on the local storage. The hypervisor can preserve diagnostic information in one or more pre-configured locations such as the local partition, a file located on VMFS datastore, or a network dump server on vCenter Server.

ESXi-dump-collection-06

In a case of the critical failure with the host, when the system gets into the Purple Screen of Death (PSOD) state, the hypervisor generates a set of diagnostic data archived in a coredump. In my opinion, it is more efficient to have this information stored in the centralised location, and this is where vSphere ESXi Dump Collector service can be useful.

Initially, the vSphere ESXi Dump Collector service is disabled on the vCenter Server Appliance.

ESXi-Dump-Collector-01

The setup process is straightforward: you should select a startup type of this service (by default, it is set to Manual) and click on a Start button to enable it.

ESXi-Dump-Collector-02

Depending on the network requirements and the number of ESXi hosts, you might need changing the Coredump Server UDP Port (6500) and increasing the Repository max size (2GB). Both settings require restarting the vSphere ESXi Dump Collector service.

This process becomes a little bit complicated when IPv6 is disabled on VCSA. An attempt to start the vSphere ESXi Dump Collector service generates an error message in vSphere Web Client as follows:

ESXi-Dump-Collector-03

If we remote to the virtual appliance and run the netdumper service from the console session, it will show us more information:

root@n-vcsa-01 [ ~ ]# service-control –start netdumper
Perform start operation. vmon_profile=None, svc_names=[‘netdumper’], include_coreossvcs=False, include_leafossvcs=False
2017-07-04T10:15:32.179Z Service netdumper state STOPPED
Error executing start on service netdumper. Details {
“resolution”: null,
“detail”: [
{
“args”: [
“netdumper”
],
“id”: “install.ciscommon.service.failstart”,
“localized”: “An error occurred while starting service ‘netdumper'”,
“translatable”: “An error occurred while starting service ‘%(0)s'”
}
],
“componentKey”: null,
“problemId”: null
}
Service-control failed. Error {
“resolution”: null,
“detail”: [
{
“args”: [
“netdumper”
],
“id”: “install.ciscommon.service.failstart”,
“localized”: “An error occurred while starting service ‘netdumper'”,
“translatable”: “An error occurred while starting service ‘%(0)s'”
}
],
“componentKey”: null,
“problemId”: null
}

The next step to troubleshoot this issue is to look into the vSphere ESXi Dump Collector service log file (/var/log/vmware/netdumper/netdumper.log). It reports that the address is already in use:

root@n-vcsa-01 [ ~ ]# cat /var/log/vmware/netdumper/netdumper.log
2017-07-04T10:19:32.121Z| netdumper| I125: Log for vmware-netdumper pid=8347 version=XXX build=build-5318154 option=Release
2017-07-04T10:19:32.121Z| netdumper| I125: The process is 64-bit.
2017-07-04T10:19:32.121Z| netdumper| I125: Host codepage=UTF-8 encoding=UTF-8
2017-07-04T10:19:32.121Z| netdumper| I125: Host is Linux 4.4.8 VMware Photon 1.0 Photon VMware Photon 1.0

2017-07-04T10:19:32.123Z| netdumper| I125: Configured to handle 1024 clients in parallel.
2017-07-04T10:19:32.123Z| netdumper| I125: Configuring /var/core/netdumps as the directory to store the cores
2017-07-04T10:19:32.123Z| netdumper| I125: Configured to use wildcard [::0/0.0.0.0]:6500 as IP address:port
2017-07-04T10:19:32.123Z| netdumper| I125: Using /var/log/vmware/netdumper/netdumper.log as the logfile.
2017-07-04T10:19:32.123Z| netdumper| I125: Nothing to post process
2017-07-04T10:19:32.123Z| netdumper| I125: Couldn’t bind socket to port 6500: 98 Address already in use
2017-07-04T10:19:32.123Z| netdumper| I125:

Playing a bit with the Linux commands gave me some clues:

root@n-vcsa-01 [ ~ ]# netstat -lup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 *:kerberos *:* 1489/vmdird
udp 0 0 *:sunrpc *:* 1062/rpcbind
udp 0 0 n-vcsa-01.testorg.l:ntp *:* 1249/ntpd
udp 0 0 photon-machine:ntp *:* 1249/ntpd
udp 0 0 *:ntp *:* 1249/ntpd
udp 0 0 *:epmap *:* 1388/dcerpcd
udp 0 0 *:syslog *:* 2229/rsyslogd
udp 0 0 *:794 *:* 1062/rpcbind
udp 0 0 *:ideafarm-door *:* 3905/vpxd
udp 0 0 *:llmnr *:* 1223/systemd-resolv
udp6 0 0 [::]:tftp [::]:* 1/systemd
udp6 0 0 [::]:sunrpc [::]:* 1062/rpcbind
udp6 0 0 [::]:ntp [::]:* 1249/ntpd
udp6 0 0 [::]:syslog [::]:* 2229/rsyslogd
udp6 0 0 [::]:794 [::]:* 1062/rpcbind
udp6 0 0 [::]:boks [::]:* 17377/vmware-netdum

root@n-vcsa-01 [ ~ ]# ps -p 17377
PID TTY TIME CMD
17377 ? 00:00:00 vmware-netdumpe

root@n-vcsa-01 [ ~ ]# cat /proc/17377/cmdline
/usr/sbin/vmware-netdumper-d/var/core/netdumps-o6500-l/var/log/vmware/netdumper/netdumper.log

Even if it reports an error at startup, the vSphere ESXi Dump Collector service is running (partially) on the virtual appliance.

Thanks to Michael (for sharing a detailed guide), I was able to test this assumption quickly.

ESXi-Dump-Collector-04

ESXi-Dump-Collector-05

The coredump was successfully transferred from the ESXi host to the /var/core/netdumps/ folder on the VCSA appliance. However, there were no records about this operation in the netdumper.log.

This issue has been reported to VMware GSS (SR # 17385781602) and should be resolved in the future updates to VCSA 6.5.