vSAN 6.6.1: Replacing a faulty NIC on Dell PowerEdge server

Not long ago I have noticed a 10G network interface flipping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating this issue. An interesting thing was that this device was part of the integrated NIC on the server, and only it was generating connection errors.

vSAN-NIC-issue-02

After consulting with the Networks, we found the following:

  • The interface was up,
  • The port connecting to the switch had no traffic (can send to the server, but not receiving from the server),
  • No errors were recorded,
  • SFP signals were good.

The plan of attack was to replace SFPs – first on the server and, if it didn’t help, on the switch. During this operation we’ve found out the SFP on the server side was unusually warm. Unfortunately, replacing SFPs didn’t help and after approximately 15 minutes of complete silence, disconnects continued.

The next move was to contact a vendor. In our case, it was Dell EMC.

We’ve lodged a support request and sent the SupportAssist Collection to them. The response from the Support was to replace an embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.

However, thanks to the new algorithm which assigns device names for I/O devices beginning in ESXi 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hasn’t changed. As a result, hypervisor assigned the same aliases to the onboard ports.

ESXi initialised new ports and vSAN configuration was updated successfully without any human interaction.

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust modern platforms both from Dell EMC and VMware.

ESXi 6.5: Retrieve IPMI SEL request to host failed [FIXED BY VENDOR]

From time to time you might want to check the host hardware health manually in Monitor>Hardware Health (vSphere Client) or Monitor>Hardware Status (vSphere Web Client).

For many months this functionality has been broken for ESXi 6.5 on DellEMC servers.

vSphere Web Client - IPMI Error

When opening the Sensors page, vpxd.log shows the following message:

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — BEGIN task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog

error vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] No Content-Length header, WSMan IPMI SEL operation failed

info vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] WSMan Msg size 59: part:401 Unauthorized
–> WWW-Authenticate: Basic realm=”OPENWSMAN”)l▒\x7f

warning vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] Closing Response processing in unexpected state: 3

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — FINISH task-35318

info vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — ERROR task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog: vmodl.fault.SystemError:
–> Result:
–> (vmodl.fault.SystemError) {
–> faultCause = (vmodl.MethodFault) null,
–> faultMessage = <unset>,
–> reason = “Retrieve IPMI SEL request to host failed”
–> msg = “”
–> }
–> Args:
–>

Many people were pointing to vpxa.cfg (here and here) as a source of the error:

<log>
<level>verbose</level>
<maxFileNum>10</maxFileNum>
<maxFileSize>1048576</maxFileSize>
<memoryLevel>verbose</memoryLevel>
<outputToConsole>false</outputToConsole>
<outputToFiles>false</outputToFiles>
<outputToSyslog>true</outputToSyslog>
<syslog>
<facility>local4</facility>
<ident>Vpxa</ident>
<logHeaderFile>/var/run/vmware/vpxaLogHeader.txt</logHeaderFile>
</syslog>
</log>

It was not the end of the world, and I didn’t want to edit default log levels manually. So the issue was ignored for a while.

To my great surprise, it all went back to normal after updating hypervisor to the latest version using Dell EMC customised VMware ESXi 6.5 U1 A10 image.

Now, we can see multiple events in vpxd.log generated by VpxLRO:

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — BEGIN lro-490638 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — FINISH lro-490638

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — BEGIN lro-490639 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — FINISH lro-490639

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — BEGIN lro-490640 — HostProfileManager — vim.profile.ProfileManager.findAssociatedProfile

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — FINISH lro-490640

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — BEGIN lro-490641 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — FINISH lro-490641
2018-04-12T14:02:41.702+08:00 info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — BEGIN lro-490642 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — FINISH lro-490642

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — BEGIN lro-490643 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — FINISH lro-490643

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — BEGIN task-35322 — healthStatusSystem-28 — vim.host.HealthStatusSystem.FetchSystemEventLog

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — FINISH task-35322

As a result, the ‘Refresh hardware IPMI System Event Log’ task completes successfully.

vSphere Web Client - IPMI Success

ESXi 6.5: Host fails with PSOD after upgrading to 6.5 Update 1 [RESOLVED]

For those who have plans upgrading their environment from vSphere 6.0 to 6.5 Update 1, I would suggest postponing this until VMware resolves issue documented in KB 2151749.

ESXi650-2151749

Hosts will be affected if they equipped with 10 Gbps NICs.

The only workaround that the vendor has at the moment is to downgrade ESXi to 6.0 Update 2.

17/10/2017 – Update 1: According to VMware GSS, this issue is going to be “resolved in ESXi 6.5 Patch 02, which is schedule to release this month (The release date may change without notice).” Please refer to the SR #17599111410 when contacting GSS for more information.

08/02/2018 – Update 2: This issue is resolved in VMware ESXi 6.5 P02 (ESXi-6.5.0-20171204001-standard).

vSAN 6.5-6.6.1: An urgent hotfix ESXi650-201710401

VMware has just released a new hotfix for ESXi and vSAN (KB 2151081) urging customers with all-flash configuration with deduplication enabled to upgrade their environment as soon as possible. This patch resolves data corruption issue which might appear in rare circumstances.

ESXi650-201710401

The affected versions of vSAN include 6.5, 6.6, and 6.6.1.

06-10-2017 – Update 1: As listed in KB 2151042, similar issue has been fixed for ESXi 6.0.

ESXi 6.5: Host fails with PSOD when IPV6 is disabled

I have a habit of reading all new KB articles published by VMware every week. Not only is it give a visibility of the current issues that VMware products have, but it helps to be proactive with learning some behaviour and workarounds and prepared to remediate them if required.

Therefore, after writing a few blog posts about vCenter 6.5 and IPv6 here and here, it caught my eye that ESXi 6.5 Hosts could also fail with a Purple Screen of Death when IPV6 is disabled.

VMware has published a KB 2150794 that explained this behaviour.

The only workaround at this moment is to re-enable IPv6 on all hosts in your environment.

ESXi65-IPv6-PSOD