VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 2

In Part 1 of this mini-series, I wrote about an issue with the Appliance Management User Interface. However, the dependency on the IPv6 protocol in VCSA 6.5 can also cause unexpected behaviour with the vSphere ESXi Dump Collector service. Let's look into this one now.

In an environment with many ESXi hosts, it is vital to have their diagnostic data available for troubleshooting. By default, each host has a diagnostic coredump partition on local storage. The hypervisor can preserve diagnostic information in one or more pre-configured locations, such as the local partition, a file on a VMFS datastore, or a network dump server on vCenter Server (the commands after the diagram show how to check the current configuration).

ESXi-dump-collection-06
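
Before enabling the collector, it can be useful to check which dump locations a host currently uses. A quick sketch with esxcli, run on the ESXi host (output omitted):

# Show the active and configured diagnostic coredump partition
esxcli system coredump partition get
# List file-based coredumps configured on VMFS datastores
esxcli system coredump file list
# Show the current network dump server configuration
esxcli system coredump network get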

In the case of a critical host failure, when the system gets into the Purple Screen of Death (PSOD) state, the hypervisor generates a set of diagnostic data archived in a coredump. In my opinion, it is more efficient to have this information stored in a centralised location, and this is where the vSphere ESXi Dump Collector service can be useful.

Initially, the vSphere ESXi Dump Collector service is disabled on the vCenter Server Appliance.

ESXi-Dump-Collector-01

The setup process is straightforward: select a startup type for the service (by default, it is set to Manual) and click the Start button to enable it.

ESXi-Dump-Collector-02

Depending on the network requirements and the number of ESXi hosts, you might need to change the Coredump Server UDP port (6500) or increase the Repository max size (2 GB). Both settings require a restart of the vSphere ESXi Dump Collector service.
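
Once the collector is running, each ESXi host has to be pointed at it. A minimal sketch with esxcli, assuming vmk0 as the management interface and 192.168.1.10 as the VCSA address (both values are placeholders for your environment):

# Point the host at the network dump collector
esxcli system coredump network set --interface-name vmk0 --server-ipv4 192.168.1.10 --server-port 6500
# Enable the network coredump configuration
esxcli system coredump network set --enable true
# Verify that the host can reach the configured dump server
esxcli system coredump network check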

This process becomes a little more complicated when IPv6 is disabled on the VCSA. An attempt to start the vSphere ESXi Dump Collector service generates the following error message in the vSphere Web Client:

ESXi-Dump-Collector-03

If we connect to the virtual appliance and start the netdumper service from a console session, it shows us more information:

root@n-vcsa-01 [ ~ ]# service-control --start netdumper
Perform start operation. vmon_profile=None, svc_names=['netdumper'], include_coreossvcs=False, include_leafossvcs=False
2017-07-04T10:15:32.179Z Service netdumper state STOPPED
Error executing start on service netdumper. Details {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Service-control failed. Error {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}

The next step in troubleshooting this issue is to look into the vSphere ESXi Dump Collector service log file (/var/log/vmware/netdumper/netdumper.log). It reports that the address is already in use:

root@n-vcsa-01 [ ~ ]# cat /var/log/vmware/netdumper/netdumper.log
2017-07-04T10:19:32.121Z| netdumper| I125: Log for vmware-netdumper pid=8347 version=XXX build=build-5318154 option=Release
2017-07-04T10:19:32.121Z| netdumper| I125: The process is 64-bit.
2017-07-04T10:19:32.121Z| netdumper| I125: Host codepage=UTF-8 encoding=UTF-8
2017-07-04T10:19:32.121Z| netdumper| I125: Host is Linux 4.4.8 VMware Photon 1.0 Photon VMware Photon 1.0

2017-07-04T10:19:32.123Z| netdumper| I125: Configured to handle 1024 clients in parallel.
2017-07-04T10:19:32.123Z| netdumper| I125: Configuring /var/core/netdumps as the directory to store the cores
2017-07-04T10:19:32.123Z| netdumper| I125: Configured to use wildcard [::0/0.0.0.0]:6500 as IP address:port
2017-07-04T10:19:32.123Z| netdumper| I125: Using /var/log/vmware/netdumper/netdumper.log as the logfile.
2017-07-04T10:19:32.123Z| netdumper| I125: Nothing to post process
2017-07-04T10:19:32.123Z| netdumper| I125: Couldn't bind socket to port 6500: 98 Address already in use
2017-07-04T10:19:32.123Z| netdumper| I125:

Playing around with a few Linux commands gave me some clues:

root@n-vcsa-01 [ ~ ]# netstat -lup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 *:kerberos *:* 1489/vmdird
udp 0 0 *:sunrpc *:* 1062/rpcbind
udp 0 0 n-vcsa-01.testorg.l:ntp *:* 1249/ntpd
udp 0 0 photon-machine:ntp *:* 1249/ntpd
udp 0 0 *:ntp *:* 1249/ntpd
udp 0 0 *:epmap *:* 1388/dcerpcd
udp 0 0 *:syslog *:* 2229/rsyslogd
udp 0 0 *:794 *:* 1062/rpcbind
udp 0 0 *:ideafarm-door *:* 3905/vpxd
udp 0 0 *:llmnr *:* 1223/systemd-resolv
udp6 0 0 [::]:tftp [::]:* 1/systemd
udp6 0 0 [::]:sunrpc [::]:* 1062/rpcbind
udp6 0 0 [::]:ntp [::]:* 1249/ntpd
udp6 0 0 [::]:syslog [::]:* 2229/rsyslogd
udp6 0 0 [::]:794 [::]:* 1062/rpcbind
udp6 0 0 [::]:boks [::]:* 17377/vmware-netdum

root@n-vcsa-01 [ ~ ]# ps -p 17377
PID TTY TIME CMD
17377 ? 00:00:00 vmware-netdumpe

root@n-vcsa-01 [ ~ ]# cat /proc/17377/cmdline
/usr/sbin/vmware-netdumper-d/var/core/netdumps-o6500-l/var/log/vmware/netdumper/netdumper.log

Even though it reports an error at startup, the vSphere ESXi Dump Collector service is (partially) running on the virtual appliance. Note that netstat resolves port 6500/udp to the service name boks, and vmware-netdumper has bound it on the IPv6 wildcard address ([::]) only; the arguments in /proc/<pid>/cmdline are null-separated, which is why they appear concatenated.

Thanks to Michael, who shared a detailed guide, I was able to test this assumption quickly.
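
For reference only: one common way to force a test PSOD in a lab is the vsish command below. I am assuming this matches what Michael's guide describes; it intentionally crashes the host, so never run it in production.

# WARNING: intentionally crashes the ESXi host with a PSOD. Lab use only.
vsish -e set /reliability/crashMe/Panic 1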

ESXi-Dump-Collector-04

ESXi-Dump-Collector-05

The coredump was successfully transferred from the ESXi host to the /var/core/netdumps/ folder on the VCSA. However, there were no records of this operation in netdumper.log.
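
To double-check that the dump actually arrived, listing the repository on the appliance is enough (the subfolder layout may vary):

root@n-vcsa-01 [ ~ ]# ls -lR /var/core/netdumps/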

This issue has been reported to VMware GSS (SR # 17385781602) and should be resolved in a future update to VCSA 6.5.

vSphere 6.0 issue: the VMware Client Integration Plugin has updated its SSL certificate in Firefox

I have noticed that with the recent releases of Mozilla Firefox and Google Chrome, the ability to launch the VMware Client Integration Plugin is broken again. vSphere Web Client 6.0 keeps showing a pop-up message as follows:

CIP Issue - 01

It happens because both browsers have removed support for NPAPI plugins. This breaks some operations in the Web Client, such as deploying OVF or OVA templates and transferring files with the datastore browser.

The only workable solution I found for this issue is to use Firefox 52 Extended Support Release (32-bit), which will keep supporting NPAPI plugins until early 2018.

Alternatively, the vCenter Server Appliance can be upgraded to version 6.5, where "the VMware Enhanced Authentication Plug-in replaces the Client Integration Plug-in from vSphere 6.0 releases and earlier", which does not require NPAPI support.

21/06/2017 – Update 1: This message also pops up when the web browser is configured to use a proxy server. Switching to 'no proxy' mode stops it from appearing.

“The device cannot start. (Code 10)” for Microsoft ISATAP and Microsoft Teredo Tunneling adapters

I was checking the system settings of a Windows 2008 R2 virtual machine that had recently been provisioned from a template when I ran into this issue.

Both the Microsoft ISATAP Adapter and the Microsoft Teredo Tunneling Adapter had warning icons in Device Manager.

ipv6-issue-01

ipv6-issue-02

Even though it is a minor obstacle, I prefer to resolve any problems with the operating system before installing and configuring applications.

After searching the Microsoft website, I came across a forum thread where a user named Dork Man pointed to the DisabledComponents registry value in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\TCPIP6\Parameters registry hive.

In my case, it had been set to a value of 0xfffffff for some reason.

ipv6-issue-03


I found Microsoft KB # 929852, which explains this parameter. In this article, Microsoft states the following:

…system startup will be delayed for 5 seconds if IPv6 is disabled by incorrectly setting the DisabledComponents registry setting to a value of 0xfffffff.

Microsoft supports only the following values for configuring the IPv6 protocol (the commands after the list show how to check and reset the value):

  • 0 – re-enables all IPv6 components (Windows default setting)
  • 0xff – disables all IPv6 components except the IPv6 loopback interface
  • 0x20 – makes IPv4 preferable over IPv6 by changing entries in the prefix policy table
  • 0x10 – disables IPv6 on all non-tunnel interfaces (both LAN and PPP)
  • 0x01 – disables IPv6 on all tunnel interfaces
  • 0x11 – disables all IPv6 interfaces except for the IPv6 loopback interface.
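
Both checking and resetting the value can be done from an elevated command prompt. A minimal sketch with reg.exe (my illustration, not taken from the forum thread; a reboot is required afterwards):

rem Check the current value of DisabledComponents
reg query HKLM\SYSTEM\CurrentControlSet\services\TCPIP6\Parameters /v DisabledComponents
rem Reset it to the Windows default (0), then reboot the server
reg add HKLM\SYSTEM\CurrentControlSet\services\TCPIP6\Parameters /v DisabledComponents /t REG_DWORD /d 0 /f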

I didn't have any specific requirements for the setting, so changing the DisabledComponents registry value to 0 and rebooting the server resolved the problem completely.

vCenter Server 6.0: vmware-dataservice-sca and vsphere-client status change from green to yellow

vsphere-client-issue-01

Some of you might be aware of the behaviour in vCenter Server 6.0 where the vmware-dataservice-sca and vsphere-client statuses continually change from green to yellow; VMware KB # 2144950 provides a workaround for this one.

However, two questions that should be covered in the knowledge base article above seem to be missing.

The first question is the current memory usage of the vsphere-client service. Knowing this helps to choose the correct value for the vSphere Web Client's maximum heap size when setting it manually. Fortunately, William Lam has a great article that explains the dynamic memory reconfiguration process when vCenter Server is booting up. One of William's suggestions was to use the cloudvm-ram-size CLI utility to monitor memory usage.

cloudvm-ram-size -S | grep -e Service-* -e Linux* -e OS -e vsphere-client -e TOTAL

The picture below shows the output of the command, with memory usage in MB, for a vCenter Server with an external PSC in a small environment.

vsphere-client-issue-02

As a rule of thumb, to determine the maximum heap size for the vsphere-client service, I round the MaxMB value up to the nearest gigabyte and add an extra 512 MB as a reserve. In this example, that gives 2,048 MB + 512 MB = 2,560 MB.

Then, I set this parameter manually and restarted the vSphere Web Client service using the commands below.

cloudvm-ram-size -C 2560 vsphere-client
service vsphere-client restart

The dynamic memory algorithm adjusts the value of this setting automatically. After a few minutes, the service reinitialises, and the memory allocation looks much better.

vsphere-client-issue-03

On rare occasions, you might notice that the vAPI Endpoint service generates error messages after restarting vsphere-client. Restarting this service resolves the problem:

service vmware-vapi-endpoint restart

Now we come to the second question: does this setting change survive vCenter Server reboots? The answer is yes! And this is great news.

08/12/2016 – Update 1: You should reapply this setting after VCSA has been updated.

19/07/2017 – Update 2: VMware released KB 2150757 to guide you through the process of manually changing the heap memory of vCenter Server components in vCenter 6.x.

EMC VNX POOL LUNS + VMWARE VSPHERE + VAAI = STORAGE DEATH v2

Recently, one of my friends was testing a greenfield vSphere environment and came across an issue with Storage vMotion being slow. It took almost an hour to copy a VM with a 510 GB VMDK (thick provisioned, eager zeroed) across two LUNs within the same physical array.

vm-disk-properties

In this case, it was an EMC VNX5200 with the following firmware versions:

  • OE for Block – 05.33.009.5.155
  • OE for File – 8.1.9-155.

The multipathing policies were left at the storage defaults, with the SATP set to VMW_SATP_ALUA_CX and the PSP set to VMW_PSP_RR.

According to the EMC and VMware HCLs, this storage should offload XCOPY operations using the VAAI feature in ESXi 6.0.
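
To confirm what the host itself thinks about hardware acceleration, esxcli can report the per-device VAAI status (the naa identifier below is a placeholder):

# Show ATS/Clone/Zero/Delete support for a specific device
esxcli storage core device vaai status get -d naa.xxxxxxxxxxxxxxxx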

The ESXi hosts were connected through an 8 Gb SAN, all with firmware and driver versions supported by VMware.

Even more interesting, he noticed that the hosts had been logging warning messages as follows:

Device naa. performance has deteriorated. I/O latency increased from average value of XXXXX microseconds to XXXXXX microsecond

VMware KB article # 2007236 states that possible root causes for this behaviour include changes made on the target, disk or media failures, overload conditions on the device, and failover. The storage system had not reported any hardware failures in the past, so, most probably, it was the result of a misconfiguration or a software fault.
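
These warnings are written to the vmkernel log, so they are easy to find on an affected host (assuming the default ESXi log location):

grep -i "performance has deteriorated" /var/log/vmkernel.log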

A quick search on the Internet led me to a blog post by Neal Dolson, published in 2014, that described a similar problem. Using the same methodology as the author, we got the same results.

esxtop-vaai-enabled

Esxtop showed high storage device command latency and constant path switches between vmhba3 and vmhba4.

ua-vaai-enabled

On the storage side, the response time in Unisphere Analyser went up from a few milliseconds to 850-900 milliseconds. In the graph above, VMFS_05 is the LUN from which the data was being migrated.

Neal's article suggested contacting the vendor and upgrading the storage firmware. EMC released a fix for this particular problem in version 05.32.000.5.217 of the VNX OE for Block (page 17 of the release notes). However, it applies only to the first generation of VNX:

Platforms:
VNX5100 VNX5150 VNX5300 VNX-VSS100 VNX5500 VNX5700 VNX7500

Severity:
Medium

Frequency of occurrence:
Always under a specific set of circumstances

Tracking number:
61525078/624886

Slow performance was seen on a storage system when running VMware ESX operations that use the VAAI (vStorage APIs for Array Integration) data move primitive (xcopy), such as cloning virtual machines or templates, migrating virtual machines with storage vmotion, and deploying virtual machines from template.

This software has multiple enhancements to improve latency, as well as new code efficiencies to greatly improve cloning and vmotion.

KnowledgeBase ID:
None

Fixed in version:
05.32.000.5.217

I looked at the latest release notes of the VNX Operating Environment for Block for the VNX5200 and couldn't find similar information there.

As a workaround, we disabled the DataMover.HardwareAcceleratedMove option in Advanced System Settings on all hosts using this simple PowerCLI command:

Get-VMHost | Get-AdvancedSetting -Name DataMover.HardwareAcceleratedMove | Set-AdvancedSetting -Value 0

This change is not destructive and can be made online (even while Storage vMotion operations are running).
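
If PowerCLI is not at hand, the same change can be made per host with esxcli (a sketch; repeat it on every host connected to the LUN):

# Disable the VAAI XCOPY (hardware accelerated move) primitive
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
# Verify the new value
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove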

The next step is to log a case with VMware and wait for a resolution.

If you had a similar problem, feel free to share your experience in the comments.

I will keep updating this post when more information is available.

23/09/2016 – Update 1: VMware GSS confirmed that the system had been configured correctly and suggested contacting the storage vendor about the matter.

16/03/2017 – Update 2: A workaround for this issue is to follow the recommendations from EMC and increase the value of the DataMover.MaxHWTransferSize parameter to 16384 on each host connected to the LUN.
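
A sketch of applying that recommendation per host with esxcli (verify EMC's guidance for your array before changing it):

esxcli system settings advanced set -o /DataMover/MaxHWTransferSize -i 16384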