[URGENT] vSAN 6.6.1: Potential data loss due to resynchronisation mixed with object expansion

Last week VMware released an urgent hotfix to remediate potential data loss in vSAN 6.6.1 due to resynchronisation mixed with object expansion.

This is a known issue affecting ESXi 6.5 builds earlier than Express Patch 9. The vendor states that the following sequence of operations might trigger it:

  1. vSAN initiates resynchronisation to maintain data availability.
  2. You expand a virtual machine disk (VMDK).
  3. vSAN initiates another resync after the VMDK expansion.

Detailed information about this problem is available in KB 60299.

If you are a vSAN customer, additional considerations are required before applying this hotfix:

  • If hosts have already been upgraded to ESXi650-201810001, you can proceed with this upgrade;
  • If hosts have not been upgraded to ESXi650-201810001, and an expansion of a VMDK is likely, disable in-place expansion on all of them by setting the VSAN.ClomEnableInplaceExpansion advanced configuration option to '0'.

The VSAN.ClomEnableInplaceExpansion advanced configuration option is not available in vSphere Client. I use the following one-liner scripts to check and change its value via PowerCLI:

# To check the current status
Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomEnableInplaceExpansion" | Select-Object Entity, Name, Value | Format-Table -AutoSize

# To disable the in-place expansion
Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomEnableInplaceExpansion" | Where-Object {$_.Value -eq "1"} | Set-AdvancedSetting -Value "0"

Note: No reboot is required after the change.

After hosts have been upgraded to ESXi650-201810001 or ESXi650-201811002, you can set VSAN.ClomEnableInplaceExpansion back to '1' to re-enable in-place expansion.
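Following the same pattern as the one-liners above, a sketch of the corresponding re-enable command (assuming a live PowerCLI connection to vCenter; Set-AdvancedSetting prompts for confirmation unless -Confirm:$false is supplied):

```powershell
# To re-enable the in-place expansion after the hosts are patched
Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomEnableInplaceExpansion" | Where-Object {$_.Value -eq "0"} | Set-AdvancedSetting -Value "1" -Confirm:$false
```

As with the disable step, no reboot is required after the change.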

vSAN 6.6.1: Replacing a faulty NIC on Dell PowerEdge server

Not long ago, I noticed a 10G network interface flapping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating. Interestingly, the device was one of the ports on the server's integrated NIC, and it was the only port generating connection errors.

vSAN-NIC-issue-02

After consulting with the network team, we found the following:

  • The interface was up;
  • The switch port saw no inbound traffic (it could send to the server, but was not receiving from it);
  • No errors were recorded;
  • SFP signals were good.

The plan of attack was to replace the SFPs: first on the server and, if that didn't help, on the switch. During this operation we found that the SFP on the server side was unusually warm. Unfortunately, replacing the SFPs didn't help, and after approximately 15 minutes of complete silence the disconnects continued.

The next move was to contact a vendor. In our case, it was Dell EMC.

We lodged a support request and sent them the SupportAssist Collection. The response from Support was to replace the embedded NIC with a new one. Considering the server was in use, that sounded tricky to me.

However, thanks to the algorithm for assigning device names to I/O devices introduced in ESXi 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hadn't changed. As a result, the hypervisor assigned the same aliases to the onboard ports.

ESXi initialised the new ports, and the vSAN configuration was updated successfully without any human interaction.
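If you want to confirm the aliases after a card replacement, a quick check from the ESXi shell (assuming SSH or console access to the host) is:

```
esxcli network nic list
```

The vmnicN aliases in the output should match the pre-replacement configuration, with the new MAC addresses of the replacement card.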

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust modern platforms from both Dell EMC and VMware are.

VMware: StorageHub Portal Refresh

For those of us who have been interested in getting explicit information about VMware vSAN, Site Recovery Manager, and vSphere storage in general, VMware StorageHub was a unique source of technical documentation.

It is great to see the vendor working on improving this portal with the design and user interface refresh.

StorageHub-01

Now it is possible to choose between English (US) and Mandarin for some of the articles.

StorageHub-02

It all seems quite logical, and I personally like the navigation and how fast the search works.

StorageHub-03

Well done, VMware!

vSphere 6.x: The beauty and ugliness of the Content Library – Part 1

The title of this blog post seems to be a bit provocative, and this has been done for a reason.

I believe many VMware engineers, including myself, were really excited about the Content Library feature introduced in vSphere 6.0. The product itself is not completely new for VMware, as it merges code from the content management feature of vCloud Director.

In What’s New in the VMware vSphere 6.0 Platform whitepaper, VMware states the following:

“The Content Library… centrally manages virtual machine templates, ISO images, and scripts, and it performs the content delivery of associated data from the published catalog to the subscribed catalog at other sites.”

Sounds really cool! Now we can centralise all objects that were previously scattered across different datastores in one place, and manage them from vSphere Web Client.

In vSphere 6.5, VMware continues improving and polishing this feature:

“Administrators can now mount an ISO directly from the Content Library, apply a guest OS customization specification during VM deployment, and update existing templates.”

However, this article is not only about embracing the tool. 🙂 I would like to share three specific examples where it doesn't work as expected, along with possible workarounds.

Issue #1 – Provisioning a virtual machine template with the advanced parameters

Affected platform: vSphere 6.0 prior to Update 3.

It was a great surprise to learn that provisioning a virtual machine from a VM template with advanced parameters set could cause problems in vSphere 6.0. Although the provisioning operation starts as expected, it fails with the error message "Failed to deploy OVF package".

CL-Issue01-01

Unfortunately, the Error Report in vSphere Web Client does not clarify the root cause of this event.

CL-Issue01-02

After contacting VMware GSS about this issue (SR # 16255562909) in late 2016, I was advised that the bug would be addressed in vSphere 6.0 Update 3.

In March 2017 I updated my environment to this version and tested the feature; VM creation worked smoothly. So it took VMware almost two years after the Content Library feature became generally available to fix it.

Happily, vSphere 6.5 does not have this problem at all.

Resolution: Update your environment to vSphere 6.0 Update 3 or a newer version.

Issue #2 – Provisioning a virtual machine from the Content Library on the vSAN datastore

Affected platform: vSphere 6.5 Standard.

The issue is not related to the Content Library directly, but rather to OVA/OVF provisioning. For some reason, when you create a new VM from a template in vSphere 6.5, it triggers a "Call DRS for cross vMotion placement recommendations" task.

If you use vSphere 6.5 Standard, where the DRS feature is not available, this task fails with the error message "The operation failed due to The operation is not allowed in the current state."

CL-Issue02-01

CL-Issue02-02

The Error Report in vSphere Web Client looks similar to the one in the picture below.

CL-Issue02-03

In the Known Issues in VMware vSAN 6.6 Release Notes, the vendor states the following:

VM OVF deploy fails if DRS is disabled
If you deploy an OVF template on the vSAN cluster, the operation fails if DRS is disabled on the vSAN cluster. You might see a message similar to the following: The operation is not allowed in the current state.

Workaround: Enable DRS on the vSAN cluster before you deploy an OVF template.

After doing some troubleshooting and trying different scenarios, the only difference I was able to identify in the provisioning task was the VM storage policy. Regardless of how the VM creation was initiated (from an OVA/OVF file or a Content Library template), it was the Virtual SAN Default Storage Policy that triggered the DRS cross vMotion check.

For example, if you set the VM storage policy in the Select storage dialogue box to "None", the OVA/OVF file can be provisioned on the vSAN datastore.

CL-Issue02-04

The same happens for the VM template from the Subscribed Content Library when the VM storage policy is "None".

Unfortunately, this trick doesn’t work with the templates in the Local Content Library.

So I decided to dig a bit deeper into the Content Library structure to see if anything could be done there.

The Content Library keeps its data in the contentlib-GUID folder. Each template has its own subfolder with a unique name. Inside the subfolder there are a few files: a descriptor (*.ovf) and one or more data files (*.vmdk).

In vSphere 6.0 those files are named as descriptor_GUID.ovf and disk-vdcs-Disk_Number_GUID.vmdk.

With vSphere 6.5 the files are self-explanatory: Template_Name_GUID.ovf and Template_Name-Disk_Number_GUID.vmdk.

CL-Issue02-05

I compared the descriptor files for the VM templates in the Local and Subscribed Content Libraries and found they had different vmw:name values in the StorageGroupSection. For the Local Content Library it was "Virtual SAN Default Storage Policy"; for the subscribed one it was different.
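As an illustration, a minimal sketch of what the relevant part of the descriptor looks like (the element layout and attributes here are reconstructed from memory and may differ slightly between builds; only the vmw:name value is the point of interest):

```xml
<!-- hypothetical excerpt from a Template_Name_GUID.ovf descriptor -->
<vmw:StorageGroupSection ovf:required="false" vmw:id="group1"
    vmw:name="Virtual SAN Default Storage Policy">
  <Info>Storage policy group for the template disks</Info>
</vmw:StorageGroupSection>
```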

CL-Issue02-06

That led me to the idea of changing the descriptor for the VM template in the Local Content Library, so I could provision VMs using one of the workarounds below.

Workarounds:

  • When provisioning from an OVA/OVF file, set the VM storage policy in the Select storage dialogue box to "None";
  • You can provision from the Subscribed Content Library if it holds VM templates with a VM storage policy different from the "Virtual SAN Default Storage Policy". Set the VM storage policy in the Select storage dialogue box to "None";
  • You can provision from the Local Content Library if you edit the descriptor file for the VM template and replace the "Virtual SAN Default Storage Policy" with something else. Set the VM storage policy in the Select storage dialogue box to "None".
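For the third workaround, the edit can be as simple as a text substitution over the descriptor. The snippet below demonstrates the substitution on a scratch stand-in file (the real descriptor name contains a GUID and lives in the contentlib-GUID folder; back up the original .ovf before touching it):

```shell
# Demo on a stand-in file; in reality you would edit the
# Template_Name_GUID.ovf descriptor inside the contentlib-GUID folder.
printf '%s\n' '<vmw:StorageGroupSection vmw:name="Virtual SAN Default Storage Policy">' > /tmp/demo.ovf

# Replace the policy name so provisioning no longer maps the template
# to the Virtual SAN Default Storage Policy.
sed -i 's/Virtual SAN Default Storage Policy/Some Other Policy/' /tmp/demo.ovf
cat /tmp/demo.ovf
```

Keep in mind this is a sketch of the idea, not a supported procedure.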

Resolution: A support case has been opened, and I am waiting for VMware to resolve this issue. The fix is expected in vSphere 6.5 Update 1 (please refer to SR # 17393663302 when contacting VMware GSS for future updates).

To be continued

Configuring PERC H730/730p cards for VMware vSAN 6.x

One of the necessary steps to create a new VMware vSAN cluster is to configure the RAID controller.

I found Joe's post about setting up Dell PERC H730 cards very informative and easy to follow. However, the latest generation of Dell PowerEdge servers has a slightly different configuration interface, so I decided to document the configuration process using the BIOS graphical interface.

You can get into it either by pressing the F2 key during server boot or by choosing the BIOS Setup option in the Next Boot drop-down menu of the iDRAC Virtual Console.

step-00

The next step is to click Device Settings and select the RAID controller from the list of available devices.

step-01

step-02

There are two configuration pages of interest:

  • Controller Management > Advanced Controller Management
  • Controller Management > Advanced Controller Properties.

The former gives us the ability to switch from RAID mode to HBA mode.

step-03

The latter allows disabling the controller caching and setting the BIOS Boot mode.

step-04

Please note that a system reboot is required for the changes to take effect. It is always a good idea to double-check that the parameters above were set up correctly.
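One way to sanity-check the result from the hypervisor side after the reboot (assuming shell access to the ESXi host) is to list the storage adapters and confirm the controller is presented as expected:

```
esxcli storage core adapter list
```

With the controller in HBA mode, the disks should be passed through to ESXi individually, which is what vSAN expects.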