EMC VNX POOL LUNS + VMWARE VSPHERE + VAAI = STORAGE DEATH v2

Recently a friend of mine was testing a greenfield vSphere environment and ran into slow Storage vMotion. It took almost an hour to copy a VM with a 510GB VMDK (thick provision eager zeroed) between two LUNs on the same physical array.

[Screenshot: VM disk properties]

In this case, it was EMC VNX5200 with the following firmware versions:

  • OE for Block – 05.33.009.5.155
  • OE for File – 8.1.9-155.

The multipathing policies were left at the defaults for this array, with SATP set to VMW_SATP_ALUA_CX and PSP set to VMW_PSP_RR.
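To confirm which SATP and PSP each device actually uses, something along these lines should work from PowerCLI. It is only a sketch: it assumes an existing Connect-VIServer session, and the property names (StorageArrayType, PathSelectionPolicy) are what I expect "esxcli storage nmp device list" to return through Get-EsxCli, so check them against your own output.

```powershell
# Sketch only: assumes you are already connected with Connect-VIServer.
# Wraps "esxcli storage nmp device list" for every host.
foreach ($vmhost in Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.storage.nmp.device.list.Invoke() |
        Select-Object @{N = 'Host'; E = { $vmhost.Name }},
                      Device,
                      StorageArrayType,     # SATP, e.g. VMW_SATP_ALUA_CX (assumed property name)
                      PathSelectionPolicy   # PSP, e.g. VMW_PSP_RR (assumed property name)
}
```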

According to the EMC and VMware HCLs, this array should offload XCOPY operations using the VAAI feature in ESXi 6.0.

The ESXi hosts were connected through an 8Gb SAN, all with firmware and driver versions supported by VMware.

More interestingly, he noticed warning messages like the following on the host:
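Whether a host really sees a LUN as XCOPY-capable can be checked per device. The sketch below wraps "esxcli storage core device vaai status get"; XCOPY corresponds to the Clone status. It assumes an existing Connect-VIServer session, and the property names are my assumption about how Get-EsxCli exposes the esxcli output fields.

```powershell
# Sketch only: assumes you are already connected with Connect-VIServer.
# Lists the VAAI primitive status per device on every host.
foreach ($vmhost in Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.storage.core.device.vaai.status.get.Invoke() |
        Select-Object @{N = 'Host'; E = { $vmhost.Name }},
                      Device,
                      CloneStatus,   # XCOPY / hardware accelerated move (assumed property name)
                      ATSStatus,
                      ZeroStatus,
                      DeleteStatus
}
```

A Clone status of "supported" only means the host will try to offload the copy; it says nothing about how well the array handles it.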

Device naa. performance has deteriorated. I/O latency increased from average value of XXXXX microseconds to XXXXXX microseconds.

VMware KB article 2007236 lists the possible root causes for this behaviour: changes made on the target, disk or media failures, overload conditions on the device, and failover. The storage system had not reported any hardware failures, so the most likely cause was a misconfiguration or a software fault.

A quick search on the Internet led me to Neal Dolson’s blog post from 2014, which described a similar problem. Following the same methodology as the author, we got the same results.

[Screenshot: esxtop with VAAI enabled]

esxtop showed high storage device command latency and constant switching between vmhba3 and vmhba4.

[Screenshot: Unisphere Analyzer with VAAI enabled]

On the storage side, the response time in Unisphere Analyzer went up from a few milliseconds to 850-900 milliseconds. In the graph above, VMFS_05 is the LUN the data was being migrated from.

Neal’s article suggested contacting the vendor and upgrading the storage firmware. EMC released a fix for this particular problem in version 05.32.000.5.217 of VNX OE for Block (page 17 of the document). However, it applies only to the first generation of VNX:

Platforms: VNX5100, VNX5150, VNX5300, VNX-VSS100, VNX5500, VNX5700, VNX7500
Severity: Medium
Frequency of occurrence: Always under a specific set of circumstances
Tracking number: 61525078/624886

Slow performance was seen on a storage system when running VMware ESX operations that use the VAAI (vStorage APIs for Array Integration) data move primitive (xcopy), such as cloning virtual machines or templates, migrating virtual machines with storage vmotion, and deploying virtual machines from template.

This software has multiple enhancements to improve latency, as well as new code efficiencies to greatly improve cloning and vmotion.

KnowledgeBase ID: None
Fixed in version: 05.32.000.5.217

I looked through the latest release notes for VNX Operating Environment for Block for the VNX5200 and couldn’t find a similar fix listed there.

As a workaround, we disabled the “DataMover.HardwareAcceleratedMove” option in Advanced System Settings on all hosts using this simple PowerCLI command:

Get-VMHost | Get-AdvancedSetting -Name DataMover.HardwareAcceleratedMove | Set-AdvancedSetting -Value 0

This change is not destructive and can be done online (even if you have Storage vMotion running).
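To double-check that the setting was applied everywhere, the same cmdlets can be used read-only. This is just a sketch and assumes you are still connected to vCenter with Connect-VIServer.

```powershell
# Sketch only: read back the current value on every host (0 = XCOPY offload disabled).
Get-VMHost |
    Get-AdvancedSetting -Name DataMover.HardwareAcceleratedMove |
    Select-Object Entity, Name, Value
```

Re-enabling the offload later is the same Set-AdvancedSetting command with -Value 1.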

The next step is to log the case with VMware and wait for the resolution.

If you had a similar problem, feel free to share your experience in the comments.

I will keep updating this post as more information becomes available.

23/09/2016 – Update 1: VMware GSS confirmed that the system had been configured correctly and suggested contacting the storage vendor about the matter.

16/03/2017 – Update 2: A workaround for this issue is to follow the recommendations from EMC and increase the value of the DataMover.MaxHwTransferSize parameter to “16384” on each host connected to the LUN.
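In PowerCLI the change could look like the sketch below. It assumes an existing Connect-VIServer session and that every host returned by Get-VMHost is connected to the affected LUNs; narrow the Get-VMHost call if that is not the case. The value is in KB, and as far as I know the ESXi default is 4096.

```powershell
# Sketch only: raise the XCOPY transfer size to 16384 KB on every host.
# -Confirm:$false suppresses the per-host confirmation prompt.
Get-VMHost |
    Get-AdvancedSetting -Name DataMover.MaxHwTransferSize |
    Set-AdvancedSetting -Value 16384 -Confirm:$false
```

The larger transfer size only matters when XCOPY offload is in use, so presumably DataMover.HardwareAcceleratedMove has to be set back to 1 for it to take effect; that part is my assumption and not something stated in the recommendation.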