vSAN 6.6.1: Replacing a faulty NIC on Dell PowerEdge server

Not long ago I noticed a 10G network interface flapping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating this issue. Interestingly, the device was one of the ports on the server’s integrated NIC, and it was the only one generating connection errors.

vSAN-NIC-issue-02

After consulting with the network team, we found the following:

  • The interface was up,
  • The switch port carried no traffic from the server (the switch could send to the server, but received nothing from it),
  • No errors were recorded,
  • SFP signals were good.

The plan of attack was to replace the SFPs – first on the server and, if that didn’t help, on the switch. During this operation we found that the SFP on the server side was unusually warm. Unfortunately, replacing the SFPs didn’t help: after approximately 15 minutes of complete silence, the disconnects resumed.

The next move was to contact the vendor. In our case, it was Dell EMC.

We lodged a support request and sent them the SupportAssist Collection. Support’s response was to replace the embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.

However, thanks to the new algorithm which assigns device names for I/O devices beginning in ESXi 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hadn’t changed. As a result, the hypervisor assigned the same aliases to the onboard ports.

ESXi initialised the new ports, and the vSAN configuration was updated successfully without any human interaction.

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust modern platforms from both Dell EMC and VMware are.

ESXi 6.5: Retrieve IPMI SEL request to host failed [FIXED BY VENDOR]

From time to time you might want to check the host hardware health manually in Monitor>Hardware Health (vSphere Client) or Monitor>Hardware Status (vSphere Web Client).

For many months this functionality was broken for ESXi 6.5 on Dell EMC servers.

vSphere Web Client - IPMI Error

When opening the Sensors page, vpxd.log shows the following message:

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — BEGIN task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog

error vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] No Content-Length header, WSMan IPMI SEL operation failed

info vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] WSMan Msg size 59: part:401 Unauthorized
–> WWW-Authenticate: Basic realm=”OPENWSMAN”)l▒\x7f

warning vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] Closing Response processing in unexpected state: 3

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — FINISH task-35318

info vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — ERROR task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog: vmodl.fault.SystemError:
–> Result:
–> (vmodl.fault.SystemError) {
–> faultCause = (vmodl.MethodFault) null,
–> faultMessage = <unset>,
–> reason = “Retrieve IPMI SEL request to host failed”
–> msg = “”
–> }
–> Args:
–>

Many people were pointing to vpxa.cfg (here and here) as a source of the error:

<log>
<level>verbose</level>
<maxFileNum>10</maxFileNum>
<maxFileSize>1048576</maxFileSize>
<memoryLevel>verbose</memoryLevel>
<outputToConsole>false</outputToConsole>
<outputToFiles>false</outputToFiles>
<outputToSyslog>true</outputToSyslog>
<syslog>
<facility>local4</facility>
<ident>Vpxa</ident>
<logHeaderFile>/var/run/vmware/vpxaLogHeader.txt</logHeaderFile>
</syslog>
</log>

It was not the end of the world, and I didn’t want to edit default log levels manually. So the issue was ignored for a while.

To my great surprise, it all went back to normal after updating the hypervisor to the latest version using the Dell EMC customised VMware ESXi 6.5 U1 A10 image.

Now, we can see multiple events in vpxd.log generated by VpxLRO:

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — BEGIN lro-490638 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — FINISH lro-490638

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — BEGIN lro-490639 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — FINISH lro-490639

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — BEGIN lro-490640 — HostProfileManager — vim.profile.ProfileManager.findAssociatedProfile

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — FINISH lro-490640

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — BEGIN lro-490641 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — FINISH lro-490641
2018-04-12T14:02:41.702+08:00 info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — BEGIN lro-490642 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — FINISH lro-490642

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — BEGIN lro-490643 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — FINISH lro-490643

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — BEGIN task-35322 — healthStatusSystem-28 — vim.host.HealthStatusSystem.FetchSystemEventLog

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — FINISH task-35322

As a result, the ‘Refresh hardware IPMI System Event Log’ task completes successfully.
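
The same check can be made from the log instead of the UI. Below is a minimal Python sketch that scans vpxd.log lines for operations that logged an ERROR marker. It assumes the `[VpxLRO] -- BEGIN|FINISH|ERROR <id>` layout seen in the excerpts above; the function name and the sample lines are illustrative, not part of any VMware tooling.

```python
import re

# The separator after "[VpxLRO]" is "--" in a raw log, but may become a
# typographic dash after copy/paste, so any non-space token is accepted there.
LRO_MARKER = re.compile(
    r"\[VpxLRO\]\s*\S+\s*(BEGIN|FINISH|ERROR)\s+((?:task|lro)-\d+)"
)


def failed_operations(lines):
    """Return IDs of operations that logged an ERROR marker, in first-seen order."""
    failed = []
    for line in lines:
        match = LRO_MARKER.search(line)
        if match and match.group(1) == "ERROR" and match.group(2) not in failed:
            failed.append(match.group(2))
    return failed


if __name__ == "__main__":
    # Hypothetical, shortened sample lines in the same layout as vpxd.log.
    sample = [
        "info vpxd[1] [VpxLRO] -- BEGIN task-35318 -- healthStatusSystem-34",
        "info vpxd[1] [VpxLRO] -- FINISH task-35318",
        "info vpxd[1] [VpxLRO] -- ERROR task-35318 -- healthStatusSystem-34",
        "info vpxd[1] [VpxLRO] -- BEGIN task-35322 -- healthStatusSystem-28",
        "info vpxd[1] [VpxLRO] -- FINISH task-35322",
    ]
    print(failed_operations(sample))
```

An empty result for the FetchSystemEventLog task after the update would match what the Web Client shows.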

vSphere Web Client - IPMI Success

vCenter 6.0: VMware Common Logging Service Health Alarm [RESOLVED]

A few days ago I noticed a warning message appearing in vCenter Server pointing to some issues with the VMware Common Logging Service.

VCLS Issue - 01

The service status showed that the disk had been filling up steadily and had reached the 30% warning threshold.

VCLS Issue - 00

VCLS Issue - 02

Considering the infrastructure had not experienced significant changes, I decided to postpone disk space extension and try to find the root cause of this problem.

With the help of the du command, it became clear that some of the subfolders in /var/log/vmware were quite large.
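
For readers who prefer a script over du, here is a rough Python equivalent that totals file sizes per immediate subfolder. It is only a sketch: it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ slightly from du, and the function names are made up for illustration.

```python
import os


def directory_sizes(root):
    """Return {subfolder_name: total_bytes} for each immediate subfolder of root."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
                except OSError:
                    # The file was rotated or removed while we were scanning.
                    continue
        sizes[entry.name] = total
    return sizes


def largest(sizes, count=5):
    """Return the `count` biggest (name, bytes) pairs, biggest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    # The path is the one discussed above; adjust as needed.
    for name, size in largest(directory_sizes("/var/log/vmware")):
        print("%-30s %8.1f MiB" % (name, size / (1024.0 * 1024.0)))
```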

VCLS Issue - 03

That wouldn’t be a problem if log rotation were working and old data were removed from the disk. However, the /storage/log/vmware/cloudvm/cloudvm-ram-size.log file was 1.4G in size, and it seemed to be growing without any log rotation.

VCLS Issue - 04

Searching for information about the cloudvm-ram-size.log file led me to an article that William Lam wrote in early 2015 – apparently, the file logs the activity of a built-in dynamic memory reconfiguration process called cloudvm-ram-size.

The issue is documented in ‘Log rotation of the cloudvm-ram-size.log file is not working (2147261)’ in the VMware Knowledge Base.

As per that article, the problem is fixed in vSphere 6.5. For vSphere 6.0, you need to configure log rotation for the cloudvm-ram-size.log file and run the logrotate command manually to archive it to a cloudvm-ram-size.log-xxxxxxxxx-.bz2 file.

VCLS Issue - 05

VMware recommends periodically cleaning up older .bz2 files in the /storage/log/vmware/cloudvm location. This can be done by adding a rotate parameter to the configuration file as follows:

/storage/log/vmware/cloudvm/cloudvm-ram-size.log{
missingok
notifempty
compress
size 20k
monthly
create 0660 root cis
rotate 7
}
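
With the rotate parameter in place, logrotate takes care of the cleanup itself. To preview what a retention of seven archives would remove before relying on it, something like the following Python sketch could be used. It assumes archives are named cloudvm-ram-size.log-*.bz2 and uses modification time as a stand-in for rotation order; the function name is hypothetical, and nothing is deleted.

```python
import glob
import os


def archives_to_prune(directory, keep=7, pattern="cloudvm-ram-size.log-*.bz2"):
    """Return archive paths that fall outside the `keep` newest ones.

    Files are ordered by modification time; this only reports candidates
    and never removes anything.
    """
    archives = glob.glob(os.path.join(directory, pattern))
    archives.sort(key=os.path.getmtime, reverse=True)  # newest first
    return archives[keep:]


if __name__ == "__main__":
    for path in archives_to_prune("/storage/log/vmware/cloudvm"):
        print("would remove:", path)
```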

It was a quick fix!

vSAN 6.6.1: vSAN Build Recommendation Engine Health issue [RESOLVED]

In my previous post about the vSAN Build Recommendation Engine Health test, I concluded that a bug in vSAN 6.6.1 prevented the vSAN Health service from properly connecting to the Internet via a proxy.

With the vCenter Server Appliance 6.5 Update 1d release, I noticed that one of the two warning messages had disappeared from the vSphere Web Client, leaving that task in the ‘Unexpected vSphere Update Manager (VUM) baseline creation failure’ state.

After checking the vSAN configuration once more, I concluded the following:

  • Internet connectivity for automatic updates of the HCL database has been set up properly (vSAN_Cluster > Configure > vSAN > General):

vSAN-BRE-01

  • The HCL database is up-to-date and CEIP is enabled (vSAN_Cluster > Configure > vSAN > Health and Performance):

vSAN-BRE-02

vSAN-BRE-03

  • Update Manager has proxy settings configured and working (vSAN_Cluster > Update Manager > Go to Admin View > Manage > Settings > Download Settings):

vSAN-BRE-04

vSAN-BRE-05

At the same time, the proxy server replaces SSL certificates with its own, signed by the corporate CA, when establishing an HTTPS connection with a remote peer.

This causes the following error in the vSAN Build Recommendation Engine Health task (extract from vmware-vsan-health-service.log):

INFO vsan-health[Thread-49] [VsanVumConnection::RemediateVsanClusterInVum] build = {u’release’: {u’baselineName’: u’VMware ESXi 6.5.0 U1 (build 5969303)’, u’isoDisplayName’: u’VMware ESXi Release 6.5.0, Build 5969303′, u’bldnum’: 5969303, u’vcVersion’: [u’6.5.0′], u’patchids’: [u’ESXi650-Update01′], u’patchDisplayName’: u’VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303)’}}

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (keys: [40], hash: None)…

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch vSAN recommended patch to be applied on top of ESXi 6.5 U1: ESXi650-201712401-BG (keys: [], hash: None)…

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster ‘vim.ClusterComputeResource:domain-c61’

Traceback (most recent call last):
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py”, line 1061, in RemediateAllClusters
performScan = performScan)
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py”, line 876, in RemediateVsanClusterInVum
patchName, patchMap[chosenRelease])
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py”, line 373, in CreateBaselineFromOfficialPatches
baseline = self._LookupPatchBaseline(name, keys)
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py”, line 411, in _LookupPatchBaseline
result = bm.QueryBaselinesForUpdate(update = updateKeys)
File “/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py”, line 557, in <lambda>
self.f(*(self.args + (obj,) + args), **kwargs)
File “/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py”, line 362, in _InvokeMethod
list(map(CheckField, info.params, args))
File “/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py”, line 883, in CheckField
raise TypeError(‘Required field “%s” not provided (not @optional)’ % info.name)
TypeError: Required field “update” not provided (not @optional)

INFO vsan-health[Thread-49] [VsanVumSystemUtil::AddConfigIssue] Add config issue createBaselineFailed

INFO vsan-health[Thread-49] [VsanVumConnection::_DeleteUnusedBaselines] Deleting baseline VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (id 424) because it is unused

INFO vsan-health[Thread-49] [VsanVumSystemUtil::VumRemediateAllClusters_DoWork] Complete VUM check for clusters [‘vim.ClusterComputeResource:domain-c61’]

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster

Following the community advice, I decided to add the root CA and subordinate CA certificates (in *.pem format) to the local keystore on the vCenter Server Appliance. After copying the certificates to /etc/ssl/certs and running the c_rehash command, I added the proxy servers to /etc/sysconfig/proxy and rebooted the server.

vSAN-BRE-07

To test that the new configuration worked, I used the wget command, and it all seemed to work smoothly.

vSAN-BRE-06

Despite all those changes, I still got error messages from the vSAN Build Recommendation Engine Health test, but this time they looked a bit different:

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=ServiceInstance, info=content

WARNING vsan-health[Thread-11125] [VsanPhoneHomeWrapperImpl::_try_connect] Cannot connect to VUM. Will retry connection

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

ERROR vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread exception: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py”, line 511, in getResponse
resp = proxyOpener.open(*args, **kwargs)
File “/usr/lib/python2.7/urllib2.py”, line 435, in open
response = meth(req, response)
File “/usr/lib/python2.7/urllib2.py”, line 548, in http_response
‘http’, request, response, code, msg, hdrs)
File “/usr/lib/python2.7/urllib2.py”, line 473, in error
return self._call_chain(*args)
File “/usr/lib/python2.7/urllib2.py”, line 407, in _call_chain
result = func(*args)
File “/usr/lib/python2.7/urllib2.py”, line 556, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required
Traceback (most recent call last):
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py”, line 353, in run
self._sendCloudHealthData(clusterUuid, data=data)
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py”, line 321, in _sendCloudHealthData
objectId=clusterUuid, additionalUrlParams=additionalUrlParams)
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py”, line 156, in send
dataType=dataType, pluginType=pluginType, url=postUrl)
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py”, line 139, in sendRawData
raise ex
VsanCloudHealthHTTPException: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
File “/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py”, line 511, in getResponse
resp = proxyOpener.open(*args, **kwargs)
File “/usr/lib/python2.7/urllib2.py”, line 435, in open
response = meth(req, response)
File “/usr/lib/python2.7/urllib2.py”, line 548, in http_response
‘http’, request, response, code, msg, hdrs)
File “/usr/lib/python2.7/urllib2.py”, line 473, in error
return self._call_chain(*args)
File “/usr/lib/python2.7/urllib2.py”, line 407, in _call_chain
result = func(*args)
File “/usr/lib/python2.7/urllib2.py”, line 556, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required

INFO vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread done.

vsan-health[Thread-9] [VsanCloudHealthDaemon::_sendExceptionsToPhoneHome] Exceptions for collection/sending exceptions

I thought that the vSAN Health service might try to contact vSphere Update Manager directly, and the proxy settings set on the OS level redirected this request to the Internet proxy instead.

I added the local domain to the exception list in /etc/sysconfig/proxy and rebooted the server again.

vSAN-BRE-08

After reading about ‘HTTP Error 411’, the only idea left was to add a domain service account and its password to the HTTP_PROXY and HTTPS_PROXY lines in /etc/sysconfig/proxy. If the password contains special characters, they must be percent-encoded (written as their ASCII hex codes, for example %40 for @) to work correctly.

To my great surprise, all communication issues were resolved, and the vSAN Health service was able to synchronise data with vSphere Update Manager and the online services correctly.
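
Encoding the special characters by hand is error-prone, so here is a small Python 3 sketch that builds such a proxy URL with the credentials percent-encoded. The account name, password, proxy host and port are placeholders, not values from my environment, and the helper name is made up.

```python
from urllib.parse import quote


def proxy_url(user, password, host, port, scheme="http"):
    """Build scheme://user:password@host:port with credentials percent-encoded."""
    # safe="" makes quote() also encode "/", ":" and "@", which would
    # otherwise be read as URL delimiters.
    return "%s://%s:%s@%s:%d" % (
        scheme,
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
    )


if __name__ == "__main__":
    # Placeholder credentials and proxy address, for illustration only.
    print(proxy_url("svc_vcsa", "P@ss:w/rd!", "proxy.example.local", 3128))
    # prints http://svc_vcsa:P%40ss%3Aw%2Frd%21@proxy.example.local:3128
```

The resulting string is what goes into the HTTP_PROXY and HTTPS_PROXY lines.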

vSAN-BRE-09

vSAN-BRE-11

A few minutes later vSAN system baselines and baseline groups appeared in vSphere Update Manager.

vSAN-BRE-10

Of course, those modifications to Photon OS configuration files are not supported by VMware and could be overwritten by future updates. Still, I hope engineers and developers are working on better integration between vSAN Health and vSphere Update Manager when vCenter resides behind a proxy.

23/02/2018 – Update 1: Per VMware documentation, a starting point to troubleshoot connectivity to the CEIP web server is to make sure the following prerequisites are met:

vSphere: Response to Meltdown and Spectre vulnerabilities

meltdown-spectre-logos

For those who responded quickly to Meltdown and Spectre by applying security patches to their ESXi environment, it can be a bit frustrating to know that VMware pulled those packages a few days after they were released.

This is related to a reboot issue in the recent CPU microcode updates released by Intel, and both vendors ask for some time to provide a revised version of the firmware.

Currently, VMware urges customers to apply the latest patches (released on January 9, 2018) to vCenter Server and VCSA as follows:

More information (and possibly updates) will come next week.

Meanwhile, I would leave here a few more articles that are worth reading:

25/01/2018 – Update 1: Two more articles that seem to be quite helpful are as follows:

09/02/2018 – Update 2: VMware released a new security advisory (VMSA-2018-0007) on mitigating CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754 in VMware Virtual Appliances.

12/02/2018 – Update 3: Another excellent summary of the subject: Meltdown and Spectre: far from the solution?

25/02/2018 – Update 4: Over the last week Dell EMC released new BIOS versions for 13G and 14G server platforms. Still, it will take some time for VMware to update their HCL with the supported configurations. Meanwhile, it is recommended to apply Photon OS security patches to VCSA 6.5 as per the following article: https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vcenter-server-appliance-photonos-security-patches.html.

 

vSAN 6.5: Virtual Machine with more than 64GB memory fails to Storage vMotion to vSAN cluster

VMware has just posted an article in the Virtual Blocks blog which describes this behaviour. It happens only when trying to Storage vMotion a virtual machine with a swap file larger than 64GB to the vSAN datastore.

The task fails and generates the following error messages:

SvMotion-Issue-01

There are two possible workarounds: either increase the swap file maximum size on the destination ESXi host or set a memory reservation on the virtual machine. The former is preferable, as it does not require a host reboot.

VMware provides KB 2150316 with “more log samples and specifics for identifying the issue as a cause of a migration failure”.
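
To see why a memory reservation helps, recall that a VM’s swap file is sized as its configured memory minus its memory reservation. The Python sketch below works out the smallest reservation that keeps the swap file within the 64GB mentioned above; the function names are illustrative, and the limit is taken from this post rather than queried from a host.

```python
def swap_file_gb(configured_memory_gb, reservation_gb=0):
    """Swap file size: configured memory minus the memory reservation."""
    return max(configured_memory_gb - reservation_gb, 0)


def reservation_needed_gb(configured_memory_gb, swap_limit_gb=64):
    """Smallest reservation that keeps the swap file at or below the limit."""
    return max(configured_memory_gb - swap_limit_gb, 0)


if __name__ == "__main__":
    memory_gb = 96  # example VM size, chosen for illustration
    needed = reservation_needed_gb(memory_gb)
    print("reservation needed: %d GB" % needed)
    print("resulting swap file: %d GB" % swap_file_gb(memory_gb, needed))
```

For a 96GB virtual machine this gives a 32GB reservation and a 64GB swap file.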

vSphere Web Client 6.x: ‘Shockwave Flash has crashed’ issue

It was a great surprise for many virtualisation specialists to see an error message saying ‘Shockwave Flash has crashed’ immediately after authenticating in the vSphere Web Client 6.x earlier this week.

Flash-Issue-01

Most of the reports came from those who were using the latest version of Google Chrome (61.0.3163.100). However, there were similar issues with other web browsers and Adobe Flash version 27.0.0.170.

William Lam wrote a special post on his blog about this issue, and he keeps it updated with a number of workarounds.

Thankfully, VMware was quick to publish KB 2151945, which tracks information about the problem and provides some workarounds as well. Thanks to Dennis Lu for pointing to this article!

This is a classic example of how dependency on a third-party technology can affect your solution. I hope that VMware accelerates the development of vSphere Client (HTML5) and provides feature parity between it and the Flash one.