vSAN 6.6.1: vSAN Build Recommendation Engine Health issue [RESOLVED]

In my previous post about the vSAN Build Recommendation Engine Health test, I concluded that a bug in vSAN 6.6.1 prevented the vSAN Health service from properly connecting to the Internet via a proxy.

With the release of vCenter Server Appliance 6.5 Update 1d, I noticed that one of the two warning messages had disappeared from the vSphere Web Client, leaving the test in the ‘Unexpected vSphere Update Manager (VUM) baseline creation failure’ state.

After checking the vSAN configuration once more, I concluded the following:

  • Internet connectivity for automatic updates of the HCL database has been set up properly (vSAN_Cluster > Configure > vSAN > General):

vSAN-BRE-01

  • The HCL database is up-to-date and CEIP is enabled (vSAN_Cluster > Configure > vSAN > Health and Performance):

vSAN-BRE-02

vSAN-BRE-03

  • Update Manager has proxy settings configured and working (vSAN_Cluster > Update Manager > Go to Admin View > Manage > Settings > Download Settings):

vSAN-BRE-04

vSAN-BRE-05

At the same time, the proxy server replaces SSL certificates with its own, signed by the corporate CA, when establishing an HTTPS connection with a remote peer.

As a result, the vSAN Build Recommendation Engine Health task fails with the following error (extract from vmware-vsan-health-service.log):

INFO vsan-health[Thread-49] [VsanVumConnection::RemediateVsanClusterInVum] build = {u'release': {u'baselineName': u'VMware ESXi 6.5.0 U1 (build 5969303)', u'isoDisplayName': u'VMware ESXi Release 6.5.0, Build 5969303', u'bldnum': 5969303, u'vcVersion': [u'6.5.0'], u'patchids': [u'ESXi650-Update01'], u'patchDisplayName': u'VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303)'}}

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (keys: [40], hash: None)…

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch vSAN recommended patch to be applied on top of ESXi 6.5 U1: ESXi650-201712401-BG (keys: [], hash: None)…

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster 'vim.ClusterComputeResource:domain-c61'

Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 1061, in RemediateAllClusters
    performScan = performScan)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 876, in RemediateVsanClusterInVum
    patchName, patchMap[chosenRelease])
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 373, in CreateBaselineFromOfficialPatches
    baseline = self._LookupPatchBaseline(name, keys)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 411, in _LookupPatchBaseline
    result = bm.QueryBaselinesForUpdate(update = updateKeys)
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 557, in <lambda>
    self.f(*(self.args + (obj,) + args), **kwargs)
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 362, in _InvokeMethod
    list(map(CheckField, info.params, args))
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 883, in CheckField
    raise TypeError('Required field "%s" not provided (not @optional)' % info.name)
TypeError: Required field "update" not provided (not @optional)

INFO vsan-health[Thread-49] [VsanVumSystemUtil::AddConfigIssue] Add config issue createBaselineFailed

INFO vsan-health[Thread-49] [VsanVumConnection::_DeleteUnusedBaselines] Deleting baseline VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (id 424) because it is unused

INFO vsan-health[Thread-49] [VsanVumSystemUtil::VumRemediateAllClusters_DoWork] Complete VUM check for clusters ['vim.ClusterComputeResource:domain-c61']

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster
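
When the log is long, it can help to filter it down to warnings and errors first. The following is a minimal sketch that assumes log lines follow the `LEVEL vsan-health[THREAD] [COMPONENT] message` layout seen in the extract above. The real file may prefix each line with a timestamp, which is why the pattern is searched for rather than anchored to the start of the line. The function name `find_problems` is made up for this illustration.

```python
import re

# Layout assumed from the extract above: LEVEL vsan-health[THREAD] [COMPONENT] message
LOG_LINE = re.compile(
    r"(?P<level>INFO|WARNING|ERROR) vsan-health\[(?P<thread>[^\]]+)\] "
    r"\[(?P<component>[^\]]+)\] (?P<message>.*)$"
)

def find_problems(lines, levels=("WARNING", "ERROR")):
    """Return (level, component, message) tuples for lines at the given levels."""
    problems = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match and match.group("level") in levels:
            problems.append(
                (match.group("level"), match.group("component"), match.group("message"))
            )
    return problems

sample = [
    "INFO vsan-health[Thread-49] [VsanVumSystemUtil::AddConfigIssue] Add config issue createBaselineFailed",
    "ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster",
    'TypeError: Required field "update" not provided (not @optional)',
]

for level, component, message in find_problems(sample):
    print(level, component, message)
```

Traceback lines do not match the pattern and are skipped, so the output only lists the entries that mark where a failure was logged.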

Following the community advice, I decided to add the Root CA and subordinate CA certificates (in *.pem format) to the local keystore on the vCenter Server Appliance. After copying the certificates to /etc/ssl/certs and running the c_rehash command, I added the proxy servers to /etc/sysconfig/proxy and rebooted the server.

vSAN-BRE-07

To test that the new configuration worked, I used the wget command, and everything seemed to work smoothly.

vSAN-BRE-06
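
A similar check can be made from Python, which is closer to how the vSAN Health service itself opens connections (the tracebacks show it going through a urllib2 proxy opener). This is only a sketch: the proxy address `proxy.corp.example:3128` and the helper names are hypothetical, and whether Python trusts the certificates added to /etc/ssl/certs depends on how its OpenSSL library locates the default trust store.

```python
import urllib.request

# Hypothetical proxy address; replace with the value used in /etc/sysconfig/proxy.
PROXY_URL = "http://proxy.corp.example:3128"

def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def fetch_status(opener, url, timeout=10):
    """Open url through the proxy and return the HTTP status code."""
    with opener.open(url, timeout=timeout) as response:
        return response.status

opener = build_proxy_opener(PROXY_URL)
# Example call (needs network access, so it is left commented out):
# print(fetch_status(opener, "https://vcsa.vmware.com"))
```

A certificate problem would surface here as a `urllib.error.URLError` wrapping an SSL verification error, while a proxy authentication problem would typically show up as an HTTP error code.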

Despite all these changes, I still got error messages from the vSAN Build Recommendation Engine Health test, but this time they looked a bit different:

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=ServiceInstance, info=content

WARNING vsan-health[Thread-11125] [VsanPhoneHomeWrapperImpl::_try_connect] Cannot connect to VUM. Will retry connection

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

ERROR vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread exception: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py", line 511, in getResponse
    resp = proxyOpener.open(*args, **kwargs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required
Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 353, in run
    self._sendCloudHealthData(clusterUuid, data=data)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 321, in _sendCloudHealthData
    objectId=clusterUuid, additionalUrlParams=additionalUrlParams)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 156, in send
    dataType=dataType, pluginType=pluginType, url=postUrl)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 139, in sendRawData
    raise ex
VsanCloudHealthHTTPException: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py", line 511, in getResponse
    resp = proxyOpener.open(*args, **kwargs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required

INFO vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread done.

vsan-health[Thread-9] [VsanCloudHealthDaemon::_sendExceptionsToPhoneHome] Exceptions for collection/sending exceptions

I thought that the vSAN Health service might be trying to contact vSphere Update Manager directly, and that the proxy settings set at the OS level were redirecting this request to the Internet proxy instead.

I added the local domain to the exception list in /etc/sysconfig/proxy and rebooted the server again.

vSAN-BRE-08
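
To reason about which hosts such an exception list would cover, here is a simplified sketch of suffix matching. It assumes the list is a comma-separated set of host names and domain suffixes, and it does not handle wildcards, ports, or IP ranges, which real clients treat differently. The function name and the `corp.example` domain are made up for the example.

```python
def bypasses_proxy(host, no_proxy):
    """Return True if host matches an entry in a comma-separated exception list.

    Simplified: an entry matches the host itself or any of its subdomains.
    """
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if host == entry or host.endswith("." + entry):
            return True
    return False

# Hypothetical exception list with a local domain added.
exceptions = "localhost, 127.0.0.1, .corp.example"

print(bypasses_proxy("vcsa01.corp.example", exceptions))  # local host, goes direct
print(bypasses_proxy("vcsa.vmware.com", exceptions))      # external host, uses the proxy
```

With the local domain in the list, requests to vCenter and Update Manager should stay inside the network, while requests to VMware's online services still go through the proxy.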

After reading about ‘HTTP Error 411’, the only idea I had left was to add a domain service account and its password to the HTTP_PROXY and HTTPS_PROXY lines in /etc/sysconfig/proxy. If the password contains special characters, they should be percent-encoded (for example, %40 for @) to work correctly.

To my great surprise, all communication issues were resolved, and the vSAN Health service was able to synchronise data with vSphere Update Manager and online services correctly.
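
Special characters in the user name or password of a proxy URL generally need to be percent-encoded, otherwise characters such as @, : or / are read as URL delimiters. The sketch below shows one way to build such a value; the host, port, account name, and password are all hypothetical.

```python
from urllib.parse import quote

def build_proxy_url(host, port, user, password, scheme="http"):
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"{scheme}://{user_enc}:{password_enc}@{host}:{port}/"

# Hypothetical service account and password.
print(build_proxy_url("proxy.corp.example", 3128, "svc_vcsa", "P@ss:w/rd#1"))
# → http://svc_vcsa:P%40ss%3Aw%2Frd%231@proxy.corp.example:3128/
```

The resulting string is what would go into the HTTP_PROXY and HTTPS_PROXY lines. Keep in mind that the password then sits in that file in readable form.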

vSAN-BRE-09

vSAN-BRE-11

A few minutes later, vSAN system baselines and baseline groups appeared in vSphere Update Manager.

vSAN-BRE-10

Of course, these modifications to Photon OS configuration files are not supported by VMware and could be overwritten by future updates. Still, I hope engineers and developers are working on better integration between vSAN Health and vSphere Update Manager for cases where vCenter sits behind a proxy.

23/02/2018 – Update 1: Per VMware documentation, a starting point to troubleshoot connectivity to the CEIP web server is to make sure the following prerequisites are met:

vSphere: Response to Meltdown and Spectre vulnerabilities

meltdown-spectre-logos

For those who responded quickly to Meltdown and Spectre by applying security patches to their ESXi environment, it can be a bit frustrating to learn that VMware pulled those packages a few days after they were released.

This is related to a reboot issue in the recent CPU microcode updates released by Intel, and both vendors ask for some time to provide a revised version of the firmware.

Currently, VMware urges customers to apply the latest patches (released on January 9, 2018) to vCenter Server and VCSA as follows:

More information (and possibly updates) will come next week.

Meanwhile, I would leave here a few more articles that are worth reading:

25/01/2018 – Update 1: Two more articles that seem to be quite helpful are as follows:

09/02/2018 – Update 2: VMware released a new security advisory (VMSA-2018-0007) on mitigating CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754 in VMware Virtual Appliances.

12/02/2018 – Update 3: Another great summary on the subject: Meltdown and Spectre: far from the solution?

vSAN 6.5: Virtual Machine with more than 64GB memory fails to Storage vMotion to vSAN cluster

VMware has just posted an article on the Virtual Blocks blog that describes this behaviour. It occurs only when you try to Storage vMotion a virtual machine with a swap file larger than 64GB to a vSAN datastore.

The task fails and generates the following error messages:

SvMotion-Issue-01

There are two possible workarounds available: either increase the swap file maximum size on the destination ESXi host or set a memory reservation on the virtual machine. The former is preferable, as it does not require a host reboot.

VMware provides KB 2150316 with “more log samples and specifics for identifying the issue as a cause of a migration failure”.
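
For the reservation workaround, the arithmetic is simple: the swap file of a virtual machine is sized as its configured memory minus its memory reservation. The sketch below estimates the smallest reservation that keeps the swap file at or below the 64GB mark mentioned above. The function names are hypothetical, and the threshold is taken from this post rather than from the KB.

```python
def swap_file_size_gb(configured_memory_gb, reservation_gb=0):
    """Swap file size: configured memory minus memory reservation."""
    if not 0 <= reservation_gb <= configured_memory_gb:
        raise ValueError("reservation must be between 0 and the configured memory")
    return configured_memory_gb - reservation_gb

def min_reservation_gb(configured_memory_gb, limit_gb=64):
    """Smallest reservation that keeps the swap file at or below limit_gb."""
    return max(0, configured_memory_gb - limit_gb)

# Example: a VM with 96GB of memory and no reservation has a 96GB swap file.
print(swap_file_size_gb(96))       # 96
print(min_reservation_gb(96))      # 32
print(swap_file_size_gb(96, 32))   # 64
```

A virtual machine with 64GB of memory or less needs no reservation under this assumption.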

vSphere Web Client 6.x: ‘Shockwave Flash has crashed’ issue

It was a great surprise for many virtualisation specialists to see an error message saying ‘Shockwave Flash has crashed’ immediately after authenticating in the vSphere Web Client 6.x earlier this week.

Flash-Issue-01

Most of the reports came from those who were using the latest version of Google Chrome (61.0.3163.100). However, there were similar issues with other web browsers and Adobe Flash version 27.0.0.170.

William Lam wrote a special post on his blog about this issue, and he keeps it updated with a number of hacks.

Fortunately, VMware was quick to publish KB 2151945, which tracks information about the problem and provides some workarounds as well. Thanks to Dennis Lu for pointing to this article!

This is a classic example of how dependency on a third-party technology can affect your solution. I hope that VMware accelerates the development of vSphere Client (HTML5) and provides feature parity between it and the Flash one.

ESXi 6.5: Host fails with PSOD after upgrading to 6.5 Update 1 [RESOLVED]

For those who plan to upgrade their environment from vSphere 6.0 to 6.5 Update 1, I would suggest postponing this until VMware resolves the issue documented in KB 2151749.

ESXi650-2151749

Hosts will be affected if they are equipped with 10 Gbps NICs.

The only workaround that the vendor has at the moment is to downgrade ESXi to 6.0 Update 2.

17/10/2017 – Update 1: According to VMware GSS, this issue is going to be “resolved in ESXi 6.5 Patch 02, which is schedule to release this month (The release date may change without notice).” Please refer to the SR #17599111410 when contacting GSS for more information.

08/02/2018 – Update 2: This issue is resolved in VMware ESXi 6.5 P02 (ESXi-6.5.0-20171204001-standard).

vSAN 6.5-6.6.1: An urgent hotfix ESXi650-201710401

VMware has just released a new hotfix for ESXi and vSAN (KB 2151081), urging customers with all-flash configurations and deduplication enabled to upgrade their environment as soon as possible. This patch resolves a data corruption issue that might appear in rare circumstances.

ESXi650-201710401

The affected versions of vSAN include 6.5, 6.6, and 6.6.1.

06-10-2017 – Update 1: As listed in KB 2151042, a similar issue has been fixed for ESXi 6.0.

vSAN 6.6.1: vSAN Build Recommendation Engine Health fails

As you might already know, vSAN 6.6.1 is the first release with automated build recommendations for vSAN clusters for vSphere Update Manager, which should help to keep your hardware in a supported state by comparing information from the VMware Compatibility Guide and vSAN Release Catalog with information about the installed ESXi releases.

Obviously, this feature requires vSAN to have Internet access to update release metadata, as well as valid My VMware credentials to download ISO images for upgrades.

To help customers enable vSAN build recommendations, VMware embedded some health checks into vSAN 6.6.x that help resolve configuration issues. The build recommendation engine health check detects the following states:

  • Internet access is unavailable.
  • vSphere Update Manager (VUM) is disabled or is not installed.
  • VUM is not responsive.
  • vSAN release metadata is outdated.
  • My VMware login credentials are not set.
  • My VMware authentication failed.
  • Unexpected VUM baseline creation failure.

If the virtual environment sits behind a proxy, you should configure proxy settings in the Internet Connectivity option in vSAN_Cluster > Configure > vSAN > General.

vSAN Health Engine Issue - 02

Those parameters are kept in /etc/vmware-vsan-health/config.conf. Be careful with the user password, as it is added to this file without any encryption.

To test access through the proxy, you can click the Get latest version online button in vSAN_Cluster > Configure > vSAN > Health and Performance to update the HCL Database. If everything is set up correctly, it will generate the following lines in /var/log/vmware/vsan-health/vmware-vsan-health-service.log:

INFO vsan-health[ID] [<user_name> op=UpdateHclDbFromWeb obj=VsanHealthService] Update HCL database from Web
INFO vsan-health[ID] [VsanHclUtil::_getHttpResponse] Download via proxy
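
If you prefer to confirm this from a script rather than by eye, a small sketch like the one below scans the log for that pair of entries. It assumes the two marker strings appear exactly as in the lines above, and the function name is made up for the illustration.

```python
def hcl_update_used_proxy(lines):
    """Return True if an HCL update entry is later followed by 'Download via proxy'."""
    update_seen = False
    for line in lines:
        if "op=UpdateHclDbFromWeb" in line:
            update_seen = True
        elif update_seen and "Download via proxy" in line:
            return True
    return False

sample_log = [
    "INFO vsan-health[ID] [<user_name> op=UpdateHclDbFromWeb obj=VsanHealthService] Update HCL database from Web",
    "INFO vsan-health[ID] [VsanHclUtil::_getHttpResponse] Download via proxy",
]

print(hcl_update_used_proxy(sample_log))
```

To run it against the real file, pass an open file object for /var/log/vmware/vsan-health/vmware-vsan-health-service.log instead of the sample list.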

However, even if the Internet connection works, the vSAN Build Recommendation Engine Health test will produce a warning message as follows:

vSAN Health Engine Issue - 01

In the log file you will see lines like these:

WARNING vsan-health[healthThread-c3ad57ea-a3f1-11e7] [VsanCloudHealthUtil::checkNetworkConnection] Internet is not connected.

  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 337, in run
    profiler=self.profiler):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 279, in collectedResults
    VsanCloudHealthCollector.updateManifestWithPerCluster(serviceInstance)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 230, in updateManifestWithPerCluster
    cls._updateManifest()
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 190, in _updateManifest
    manifestVersion = cls._queryManifestVersion()
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 174, in _queryManifestVersion
    dataType='manifest_version', objectId=MANIFEST_VERSION_UUID)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 209, in getClusterHealth
    maxRetries=maxRetries, waitInSec=waitInSec)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 247, in getObject
    responseBody = self._getPhoneHomeResultsWithRetries(urlParams)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 279, in _getPhoneHomeResultsWithRetries
    raise e
VsanCloudHealthConnectionException: <urlopen error [Errno 110] Connection timed out>

Apparently, this is a bug in the current version of vSAN, documented in VMware KB 2151692. Neither a fix nor a workaround is available at the time of writing this blog post.

07/02/2018 – Update 1: A workaround to resolve this issue has been found.