[URGENT] vSAN 6.6.1: Potential data loss due to resynchronisation mixed with object expansion

Last week VMware released an urgent hotfix to remediate potential data loss in vSAN 6.6.1 due to resynchronisation mixed with object expansion.

This is a known issue affecting ESXi 6.5 builds earlier than Express Patch 9. The vendor states that the following sequence of operations might cause it:

  1. vSAN initiates resynchronisation to maintain data availability.
  2. You expand a virtual machine disk (VMDK).
  3. vSAN initiates another resync after the VMDK expansion.

Detailed information about this problem is available in KB 60299.

If you are a vSAN customer, additional considerations are required before applying this hotfix:

  • If hosts have already been upgraded to ESXi650-201810001, you can proceed with this upgrade.
  • If hosts have not been upgraded to ESXi650-201810001, and a VMDK expansion is likely, in-place expansion should be disabled on all of them by setting the VSAN.ClomEnableInplaceExpansion advanced configuration option to ‘0’.

The VSAN.ClomEnableInplaceExpansion advanced configuration option is not available in vSphere Client. I use the following one-liners to check and change its value via PowerCLI:

# To check the current status
Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomEnableInplaceExpansion" | Select-Object Entity, Name, Value | Format-Table -AutoSize

# To disable the in-place expansion
Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomEnableInplaceExpansion" | Where-Object {$_.Value -eq "1"} | Set-AdvancedSetting -Value "0"

Note: No reboot is required after the change.

After the hosts have been upgraded to ESXi650-201810001 or ESXi650-201811002, you can set VSAN.ClomEnableInplaceExpansion back to ‘1’ to re-enable in-place expansion.
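The re-enable step can be sketched as the mirror image of the disable one-liner above. This is a minimal sketch, assuming an active Connect-VIServer session and that every host returned by Get-VMHost has already been patched; in a mixed environment, narrow the host list first (for example with Get-Cluster).

```powershell
# Sketch only: re-enable in-place expansion on hosts where it is still disabled.
# Assumes all hosts returned by Get-VMHost are already on a fixed build.
Get-VMHost |
    Get-AdvancedSetting -Name 'VSAN.ClomEnableInplaceExpansion' |
    Where-Object { $_.Value -eq 0 } |
    Set-AdvancedSetting -Value 1 -Confirm:$false
```

The -Confirm:$false switch suppresses the per-host confirmation prompt; drop it if you prefer to approve each change.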

Windows Installer: MSI installation fails with the error status 1603

While distributing an MSI package to remote Windows Server 2012 R2 hosts via the Start-Process cmdlet, I ran into some interesting behaviour. In some cases, the MSI package installed without any issues; in others, it failed silently, generating event ID 10837 in the Application log.

With the verbose logging enabled, the following error message was observed in the MSI log file:

Installation success or error status: 1603.
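For reference, a verbose log like the one quoted here can be requested with the msiexec /L*v switch. The snippet below is a minimal local sketch rather than the exact distribution script: both paths are hypothetical placeholders, and the installation is only attempted when the package is actually present.

```powershell
# Hypothetical paths; replace with the real package and log locations.
$msiPath = 'C:\Temp\installer.msi'
$logPath = 'C:\Temp\installer-install.log'

# /i = install, /qn = no UI, /L*v = write a verbose log to the given file.
$msiArgs = @('/i', "`"$msiPath`"", '/qn', '/L*v', "`"$logPath`"")

if (Test-Path -Path $msiPath) {
    # -Wait blocks until msiexec exits; -PassThru returns the process object.
    $process = Start-Process -FilePath 'msiexec.exe' -ArgumentList $msiArgs -Wait -PassThru
    # 0 = success, 3010 = success with reboot required, 1603 = fatal error during installation.
    "msiexec exit code: $($process.ExitCode)"
} else {
    "Package not found: $msiPath"
}
```

Checking ExitCode matters here because a quiet (/qn) installation shows no dialog when it fails.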

The error status 1603 is documented on Microsoft TechNet. However, none of the scenarios listed in the article applied to my case. I was able to install the MSI package locally with no issues, and the error appeared seemingly at random when installing via PowerShell.

With more testing, I realised that the issue occurred only when the user account under which the script was running had never previously logged on to the target system.

I asked one of my colleagues, who has a better understanding of how Windows Installer works, to help with this case. After a thorough investigation, he pointed me to the following lines in the MSI log file:

MSI (s) (2C:C4) [02:22:15:584]: SECREPAIR: New Hash Database creation complete.
MSI (s) (2C:C4) [02:22:15:651]: SECREPAIR: CryptAcquireContext: Could not create the default key container
MSI (s) (2C:C4) [02:22:15:651]: SECREPAIR: Crypt Provider not initialized. Error:-2146892987

MSI (s) (2C:C4) [02:22:15:651]: SECUREREPAIR: Failed to CreateContentHash of the file: installer.msi: for computing its hash. Error: -2146892987
MSI (s) (2C:C4) [02:22:15:651]: SECREPAIR: Failed to create hash for the install source files
MSI (s) (2C:C4) [02:22:15:651]: Note: 1: 2262 2: SourceHash 3: -2147287038
MSI (s) (2C:C4) [02:22:15:651]: SECUREREPAIR: SecureRepair Failed. Error code: 8009034524E29A18
Action start 2:22:15: ProcessComponents.
The requested operation cannot be completed. The computer must be trusted for delegation and the current user account must be configured to allow delegation.
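The negative numbers in the log are signed 32-bit error codes, which are easier to search for in hexadecimal. The helper below is a small illustrative sketch (ConvertTo-HResultHex is a made-up name, not a built-in cmdlet) that prints the two's-complement hex form of such a value.

```powershell
function ConvertTo-HResultHex {
    param([Parameter(Mandatory = $true)][int]$Code)
    # An Int32 formatted with 'X8' is rendered as its two's-complement hex value.
    '0x{0:X8}' -f $Code
}

ConvertTo-HResultHex -Code (-2146892987)   # 0x80090345
ConvertTo-HResultHex -Code (-2147287038)   # 0x80030002
```

0x80090345 is SEC_E_DELEGATION_REQUIRED, whose message text is the delegation error on the last line of the log, and the same eight hex digits open the code on the ‘SecureRepair Failed’ line. 0x80030002 is STG_E_FILENOTFOUND.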

Apparently, in 2014 Microsoft released security bulletin MS14-049, containing a patch for a vulnerability in the Windows Installer service. However, installing this security update can break MSI package installation. This is documented as ‘Known issue 1’ in the bulletin and explained in more detail here.

To resolve this issue, Microsoft recommends installing update 3000988.

Another option, documented in the same bulletin under the ‘Known issue 2’ section, is to opt the affected programs out by using registry settings. However, this workaround involves more manual work and removes the defence-in-depth security feature for those programs.

I have tested both options and can confirm that they work. I hope this article saves you some time when troubleshooting a similar problem.

vSphere 6.x: SEsparse snapshot may cause guest OS file system corruption

Earlier this month, VMware published KB 59216, titled ‘Virtual Machines running on a SEsparse snapshot may report guest data inconsistencies’.

As per the vendor’s documentation, ‘SEsparse is a snapshot format introduced in vSphere 5.5 for large disks, and is the preferred format for all snapshots in vSphere 6.5 and above with VMFS-6’. On VMFS-5 and NFS datastores, the SEsparse format is used for virtual disks that are 2 TB or larger, whereas on VMFS-6, SEsparse is the default format for all snapshots.

The knowledge base article states that the issue affects vSphere 5.5 and later versions. As of today, it has been fixed only in VMware ESXi 6.7 Update 1, with the Express Patches pending for VMware ESXi 6.0 and 6.5.

How is this related to your production environment? Well, it depends…

For example, when backup software creates a snapshot while the operating system (OS) is experiencing ‘a burst of non-contiguous write IO in a very short period of time’, this can potentially trigger data corruption. There might be other scenarios too, such as when a snapshot is used during OS or software upgrades.

While waiting for a permanent solution, VMware provides a workaround that requires disabling SEsparse IO coalescing on each affected host. The advanced setting that controls IO Coalescing (COW.COWEnableIOCoalescing) is not available through the vSphere Client:

[Image: ESXi-SEspare-Issue-01]

However, you can still check and change its value via PowerCLI:

Get-VMHost | Get-AdvancedSetting -Name "COW.COWEnableIOCoalescing" | Select-Object Entity, Name, Value | Format-Table -AutoSize

Get-VMHost | Get-AdvancedSetting -Name "COW.COWEnableIOCoalescing" | Where-Object {$_.Value -eq "1"} | Set-AdvancedSetting -Value "0"

Note: After disabling IO coalescing, all virtual machines residing on that host ‘must be power-cycled or migrated (vMotion) to other hosts that have the config option set’.

VMware states there will be a performance penalty when disabling IO coalescing and ‘the extent of degradation depends on the individual virtual machine workload’.

Note: ‘After patches are released, the workaround needs to be rolled back to regain performance benefits of IO coalescing’.
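The rollback mentioned in the note can be sketched by reversing the disable one-liner. This is a minimal sketch, assuming an active Connect-VIServer session and that the hosts returned by Get-VMHost are already running a fixed build.

```powershell
# Sketch only: turn IO coalescing back on where the workaround is still in place.
# Assumes every host returned by Get-VMHost has already been patched.
Get-VMHost |
    Get-AdvancedSetting -Name 'COW.COWEnableIOCoalescing' |
    Where-Object { $_.Value -eq 0 } |
    Set-AdvancedSetting -Value 1 -Confirm:$false
```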

24/01/2019 – Update 1: This issue has been resolved with the following set of patches:

[IMPORTANT] VMware ESXi 6.x: Denial-of-service vulnerability in 3D-acceleration feature

This week VMware published security advisory VMSA-2018-0025 about a denial-of-service vulnerability in the 3D-acceleration feature of VMware ESXi, Workstation, and Fusion.

[Image: VM3DSupport-Issue-01]

It affects all versions of those products if the 3D-acceleration feature is enabled for virtual machines (VMs). This is the default setting for all VMs on VMware Workstation and Fusion, and it might be an issue for VMs managed by VMware Horizon.

More information about this issue can be found here.

At the time of writing, VMware has not provided any patches or updates to mitigate this problem, so the workaround is to disable the 3D-acceleration feature on affected systems.

To identify the VMs that have the 3D-acceleration feature enabled, I wrote the following PowerCLI script:
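A minimal sketch of such a check is shown here; it is an illustration and not necessarily identical to the original script. It assumes an active Connect-VIServer session and relies on the Enable3DSupport property of the VM's video card device as exposed through ExtensionData.

```powershell
# Sketch only: list VMs whose virtual video card has 3D support enabled.
Get-VM | ForEach-Object {
    $vm = $_
    $vm.ExtensionData.Config.Hardware.Device |
        Where-Object { $_ -is [VMware.Vim.VirtualMachineVideoCard] -and $_.Enable3DSupport } |
        ForEach-Object {
            [pscustomobject]@{
                VM              = $vm.Name
                VMHost          = $vm.VMHost.Name
                Enable3DSupport = $_.Enable3DSupport
            }
        }
}
```

Piping the result to Format-Table -AutoSize or Export-Csv gives a list that can be reviewed before changing any VM settings.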

As soon as the vendor provides a permanent solution, I will update this blog post with more information.

URGENT: VMDKs residing on vSAN 6.6 and later that have been extended may encounter data inconsistencies [RESOLVED]

Last week VMware published KB 58715, reporting that virtual machine disks residing on vSAN 6.6 and later that have been extended may encounter data inconsistencies. Those subscribed to VMware email communications recently received the following message.

[Image: vSAN66-Issue-01]

As stated in the article, this issue occurs only rarely. Still, VMware encourages its customers to check the value of the advanced setting VSAN.ClomEnableInplaceExpansion on all ESXi hosts that are part of the vSAN cluster. If it is set to the default value of “1”, the vendor recommends changing it to “0” immediately. This can be done using the following PowerCLI command:

Foreach ($VMHost in (Get-Cluster -Name (Read-Host "Cluster Name") | Get-VMHost)) {
    Get-AdvancedSetting -Entity $VMHost -Name VSAN.ClomEnableInplaceExpansion |
        Where-Object {$_.Value -ne '0'} |
        Set-AdvancedSetting -Value '0' -Confirm:$false
}

Fortunately, no reboot or service restart is required for this change to take effect, and it will become effective within 60 seconds.

It is good to see how much effort the vendor puts into supporting vSAN and proactively informing users about any problems. Great service, VMware!

04/10/2018 – Update 1: VMware has released patches for both vSAN 6.6 and 6.7 that remediate this issue. Please read the resolution section in KB 58715 for more information.