[URGENT] vSAN 6.6.1: Potential data loss due to resynchronisation mixed with object expansion

Last week VMware released an urgent hotfix to remediate potential data loss in vSAN 6.6.1 due to resynchronisation mixed with object expansion.

This is a known issue affecting earlier versions of ESXi 6.5 Express Patch 9. The vendor states that a sequence of the following operations might cause it:

  1. vSAN initiates resynchronisation to maintain data availability.
  2. You expand a virtual machine disk (VMDK).
  3. vSAN initiates another resync after the VMDK expansion.

Detailed information about this problem is available in KB 60299.

If you are a vSAN customer, additional considerations are required before applying this hotfix:

  • If hosts have already been upgraded to ESXi650-201810001, you can proceed with this upgrade,
  • If hosts have not been upgraded to ESXi650-201810001, and if an expansion of a VMDK is likely, the in-place expansion should be disabled on all of them by setting the VSAN.ClomEnableInplaceExpansion advanced configuration option to ‘0‘.

The VSAN.ClomEnableInplaceExpansion advanced configuration option is not available in vSphere Client. I use the following one-liner scrips to determine and change its value via PowerCLI:

# To check the current status
Get-VMHost | Get-AdvancedSetting -Name “VSAN.ClomEnableInplaceExpansion” | select Entity, Name, Value | Format-Table -AutoSize

# To disable the in-place expansion
Get-VMHost | Get-AdvancedSetting -Name “VSAN.ClomEnableInplaceExpansion” | ? {$_.Value -eq “1”} | Set-AdvancedSetting -Value “0”

Note: No reboot is required after the change.

After hosts were upgraded to ESXi650-201810001 or ESXi650-201811002, you can set VSAN.ClomEnableInplaceExpansion back to ‘1‘ to enable the in-place expansion.

vSAN 6.6.1: Replacing a faulty NIC on Dell PowerEdge server

Not long ago I have noticed a 10G network interface flipping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating this issue. An interesting thing was that this device was part of the integrated NIC on the server, and only it was generating connection errors.

vSAN-NIC-issue-02

After consulting with the Networks, we found the following:

  • The interface was up,
  • The port connecting to the switch had no traffic (can send to the server, but not receiving from the server),
  • No errors were recorded,
  • SFP signals were good.

The plan of attack was to replace SFPs – first on the server and, if it didn’t help, on the switch. During this operation we’ve found out the SFP on the server side was unusually warm. Unfortunately, replacing SFPs didn’t help and after approximately 15 minutes of complete silence, disconnects continued.

The next move was to contact a vendor. In our case, it was Dell EMC.

We’ve lodged a support request and sent the SupportAssist Collection to them. The response from the Support was to replace an embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.

However, thanks to the new algorithm which assigns device names for I/O devices beginning in ESXi 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hasn’t changed. As a result, hypervisor assigned the same aliases to the onboard ports.

ESXi initialised new ports and vSAN configuration was updated successfully without any human interaction.

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust modern platforms both from Dell EMC and VMware.

VMware: StorageHub Portal Refresh

For those of us who have been interested in getting explicit information about VMware vSAN, Site Recovery Manager, and vSphere storage in general, VMware StorageHub was a unique source of technical documentation.

It is great to see the vendor working on improving this portal with the design and user interface refresh.

SorageHub-01

Now it is possible to choose between English (US) and Mandarine languages for some of the articles.

SorageHub-02

All seems quite logical, and I personally like navigation and how fast search works.

SorageHub-03

Well done, VMware!