[IMPORTANT] VMware ESXi 6.x: Denial-of-service vulnerability in 3D-acceleration feature

This week VMware published a security advisory VMSA-2018-0025 about the denial-of-service vulnerability in the 3D-acceleration feature in VMware ESXi, Workstation, and Fusion.

VM3DSupport-Issue-01

It affects all versions of those products if 3D-acceleration feature is enabled for virtual machines (VMs). This is a default setting for all VMs on VMware Workstation and Fusion and might be an issue for the VMs managed by VMware Horizon.

More information about this issue can be found here.

At the moment of writing this article, there were no patches or updates provided by VMware to mitigate this problem. So a workaround would be to disable the 3D-acceleration feature for affected systems.

To identify the VMs that have the 3D-acceleration feature enabled, I wrote the following PowerCLI script:

As soon as the permanent solution provided by the vendor, I will update this blog post with more information.

vSphere 6.5: Additional considerations when migrating to VMFS-6 – Part 1

For those who use the Virtual Machine File System (VMFS) datastores, one of the steps when upgrading to vSphere 6.5 is to migrate them to VMFS-6.

VMFS6-01

VMware provides a detailed overview of VMFS-6 on the StorageHub, as well as an example of how the migration from VMFS-5 can be automated using PowerCLI.

However, there are three edge cases that require extra steps to continue with the migration. They are as follows:

All those objects, if they exist, prevent the ESXi host from unmounting the datastore, and they need to be moved to a new location before migration continues. The required steps to relocate them will be reviewed in the paragraphs below.

Relocating the system swap

The system swap location can be checked and set via vSphere Client in Configure > System > System Swap settings of the ESXi host.

VMFS6-02

Alternatively, the system swap settings can be retrieved via PowerCLI:

The script above can be modified to create the system swap files on a new datastore:

Note: The host reboot is not required to apply this change.

Moving the persistent scratch location

A persistent scratch location helps when investigating the host failures. It preserves the host log files on a shared datastore. So they can be reachable for troubleshooting, even if the host experienced the Purple Screen of Death (PSOD) or went down.

To identify the persistent scratch location, filter the key column by the ‘scratch’ word in Settings > System > Advanced System Settings of the ESXi host in vSphere Client.

VMFS6-03

You only need to point the ScratchConfig.ConfiguredScratchLocation setting to a new location and reboot the host for this change to take effect.

Note: Before doing any changes, make sure that the .locker folder (should be unique for each configured host to avoid data mixing or overwrites) has been created on the desired datastore. Otherwise, the persistent scratch location remains the same.

To review and modify advanced host parameters including the persistent scratch location via PowerCLI, look for two cmdlets named Get-AdvancedSetting and Set-AdvancedSetting. This procedure is well-documented in KB 1033696.

An information about how to automate the diagnostic coredump file relocation will be covered in Part 2 or this series later this month. Keep you posted!

URGENT: VMware Tools 10.3.0 was recalled

VMware has just announced that VMware Tools 10.3.0 was recalled due to a functional issue with 10.3.0 in ESXi 6.5.

VMware-Tools-1030-Issue

As per KB 57796, the VMXNET3 driver released with VMware Tools 10.3.0 can result in a Purple Diagnostic Screen (PSOD) or guest network connectivity loss in certain configurations. Those configurations include:

  • VMware ESXi 6.5 hosts
  • VM Hardware version 13
  • Windows 8/Windows Server 2012 or higher guest operating system (OS).

As a workaround, VMware recommends uninstalling VMware Tools 10.3.0 and then reinstalling VMware Tools 10.2.5 for affected systems.

The vendor will be releasing a revised version of the VMware Tools 10.3 family at some point in the future.

More information is available in VMSA-2018-0017.

25/09/2018 – Update 1: VMware Tools were updated to version 10.3.2 to resolve the issue with VMXNET3 driver. VMware recommends to install VMware Tools 10.3.2, or VMware Tools 10.2.5 or an earlier version of VMware Tools.

vSphere 6.5: Switching to Native Drivers in ESXi 6.5

The Native Device Driver architecture is not something new. Since its introduction more than five years ago, VMware encourages their hardware ecosystem partners to work on developing native drivers. A list of supported hardware is growing with every major release of ESXi, with the company’s aim to deprecate the vmkLinux APIs and associated driver ecosystem completely in the future releases of vSphere.

The benefits of using the native drivers are as follows:

  • It removes the complexity of developing and maintaining Linux derived drivers,
  • It improves the system performance,
  • It frees from the functional limitations of Linux derived drivers,
  • It increases the stability and reliability of the hypervisor, as native drivers are designed specifically for VMware ESXi.

Saying that one of the steps when upgrading to a new version of vSphere is to check that the hardware supports native drivers. By default, if ESXi identifies a native driver for a device it will be loaded instead of Linux derived driver. However, it is not always a case, and you need to check whether native drivers are in use after the system upgrade.

Following steps in KB 1031534 and KB 1034674, you can pinpoint PCI devices and corresponding drivers loaded for each of them:

  • To identify a storage HBA (such as a fibre card or RAID controller), run this command:

# esxcfg-scsidevs -a

  • To identify a network card, run this command:

# esxcfg-nics -l

  • To list device state and note the hardware IDs, run this command:

# vmkchdev -l

The /etc/vmware/default.map.d/ folder on ESXi host contains a full list of map files referring to the native drivers available for your system.

ESXi-Native-Driver-01

To quickly identify the driver version, you can run this command:

# esxcli software vib list | grep <native_driver_name>

In addition, information about available vSphere Installation Bundles (VIBs) in vSphere 6.5 can be found via the web client or PowerCLI session:

  • To view all installed VIBs in vSphere Client (HTML5), open Configure > System > Packages tab in the host settings:

ESXi-Native-Driver-02

  • To view all installed VIBs in VMware Host Client, open Manage > Packages tab in the host settings:

ESXi-Native-Driver-03

  • To list all installed VIBs in PowerCLI, run this command:

# (Get-VMHost -Name ‘<host_name>‘ | Get-EsxCli).software.vib.list() | select Name,Vendor,Version | sort Name

Comparing findings above with information in the IO Devices section in VMware Hardware Compatibility List, you would be able to find out whether native drivers available for your devices, as well as the recommended combination of the driver and firmware, tested and supported by VMware.

It worth reading the release notes for the corresponding drivers and search any reference to it on VMware and the third-party vendors’ websites, in case there are any known issues or limitations that might affect how device function.

If everything seems good, it is time to enable the native driver following steps in KB 2147565:

# esxcli system module set –enabled=true –module=<native_driver_name>

This change requires a host reboot and a thorough testing afterwards. The following commands can be quite helpful when troubleshooting native drivers:

  • To get the driver supported module parameters, run this command:

# esxcfg-module -i <native_driver_name>

  • To get the driver info, run this command:

# esxcli network nic get -n <vmnic_name>

  • To get an uplink stats, run this command:

# esxcli network nic stats -n <vmnic_name>

31/08/2018 – Update 1: After some feedback provided, I have decided to list well-known issues with the native drivers that exist currently. They are as follows:

  • The Mellanox ConnectX-4/ConnectX-5 native ESXi driver might exhibit performance degradation when its Default Queue Receive Side Scaling (DRSS) feature is turned on (Reference: vSphere 6.7 Release Notes),
  • Native software FCoE adapters configured on an ESXi host might disappear when the host is rebooted (Reference: vSphere 6.7 Release Notes),
  • HP host with QFLE3 Driver Version 1.0.60.0 experienced a PSOD or stuck at “Shutting down device drivers…” shutdown or restart (Reference: KB 55088),
  • ESXi 6.5 Storage Performance Issues and Fix (Reference: Anthony Spiteri’s blog).

VMware ESXi 6.0-6.5: Low network receive throughput for VMXNET3 on Windows VM

VMware has just released a new KB 57358 named ‘Low receive throughput when receive checksum offload is disabled and Receive Side Coalescing is enabled on Windows VM‘. This requires attention when configuring the VMXNET3 adapter on Windows operating systems (OS). However, it only affects virtual environments with VMware ESXi 6.0 and ESXi 6.5 only.

VMware states that it happens when the following conditions are met:

  • Guest OS is Windows 2012 / Windows 8 or later
  • VM hardware version 11 or later
  • Virtual network adapter is VMXNET3
  • Receive Side Coalescing (RSC) is enabled on the VMXNET3 driver on the guest OS
  • Some or all of following receive checksum offloads have value Disabled or only Tx Enabled on the VMXNET3 driver on the guest operating system:
    • IPv4 Checksum Offload
    • TCP Checksum Offload (IPv4)
    • TCP Checksum Offload (IPv6)
    • UDP Checksum Offload (IPv4)
    • UDP Checksum Offload (IPv6).

This shouldn’t be a problem if the VMXNET3 driver has the default settings.

VMXNET3-RSC-Issue-00

For example, copying 3.4 gigabytes of data to the test VM via the 1Gbps link took me seconds.

VMXNET3-RSC-Issue-01

With TCP Checksum Offload (IPv4) set to Tx Enabled on the VMXNET3 driver the same data takes ages to transfer.

VMXNET3-RSC-Issue-02

VMware provides a workaround for this issue: you either need to disable RSC, if any of receive checksum offloads is disabled, or manually enable receive checksum offloads. The knowledge base includes the PowerShell commands that help to automate the latter.

vSAN 6.6.1: Replacing a faulty NIC on Dell PowerEdge server

Not long ago I have noticed a 10G network interface flipping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating this issue. An interesting thing was that this device was part of the integrated NIC on the server, and only it was generating connection errors.

vSAN-NIC-issue-02

After consulting with the Networks, we found the following:

  • The interface was up,
  • The port connecting to the switch had no traffic (can send to the server, but not receiving from the server),
  • No errors were recorded,
  • SFP signals were good.

The plan of attack was to replace SFPs – first on the server and, if it didn’t help, on the switch. During this operation we’ve found out the SFP on the server side was unusually warm. Unfortunately, replacing SFPs didn’t help and after approximately 15 minutes of complete silence, disconnects continued.

The next move was to contact a vendor. In our case, it was Dell EMC.

We’ve lodged a support request and sent the SupportAssist Collection to them. The response from the Support was to replace an embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.

However, thanks to the new algorithm which assigns device names for I/O devices beginning in ESXi 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hasn’t changed. As a result, hypervisor assigned the same aliases to the onboard ports.

ESXi initialised new ports and vSAN configuration was updated successfully without any human interaction.

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust modern platforms both from Dell EMC and VMware.

ESXi 6.5: Retrieve IPMI SEL request to host failed [FIXED BY VENDOR]

From time to time you might want to check the host hardware health manually in Monitor>Hardware Health (vSphere Client) or Monitor>Hardware Status (vSphere Web Client).

For many months this functionality has been broken for ESXi 6.5 on DellEMC servers.

vSphere Web Client - IPMI Error

When opening the Sensors page, vpxd.log shows the following message:

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — BEGIN task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog

error vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] No Content-Length header, WSMan IPMI SEL operation failed

info vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] WSMan Msg size 59: part:401 Unauthorized
–> WWW-Authenticate: Basic realm=”OPENWSMAN”)l▒\x7f

warning vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] Closing Response processing in unexpected state: 3

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — FINISH task-35318

info vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] — ERROR task-35318 — healthStatusSystem-34 — vim.host.HealthStatusSystem.FetchSystemEventLog: vmodl.fault.SystemError:
–> Result:
–> (vmodl.fault.SystemError) {
–> faultCause = (vmodl.MethodFault) null,
–> faultMessage = <unset>,
–> reason = “Retrieve IPMI SEL request to host failed”
–> msg = “”
–> }
–> Args:
–>

Many people were pointing to vpxa.cfg (here and here) as a source of the error:

<log>
<level>verbose</level>
<maxFileNum>10</maxFileNum>
<maxFileSize>1048576</maxFileSize>
<memoryLevel>verbose</memoryLevel>
<outputToConsole>false</outputToConsole>
<outputToFiles>false</outputToFiles>
<outputToSyslog>true</outputToSyslog>
<syslog>
<facility>local4</facility>
<ident>Vpxa</ident>
<logHeaderFile>/var/run/vmware/vpxaLogHeader.txt</logHeaderFile>
</syslog>
</log>

It was not the end of the world, and I didn’t want to edit default log levels manually. So the issue was ignored for a while.

To my great surprise, it all went back to normal after updating hypervisor to the latest version using Dell EMC customised VMware ESXi 6.5 U1 A10 image.

Now, we can see multiple events in vpxd.log generated by VpxLRO:

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — BEGIN lro-490638 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] — FINISH lro-490638

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — BEGIN lro-490639 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] — FINISH lro-490639

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — BEGIN lro-490640 — HostProfileManager — vim.profile.ProfileManager.findAssociatedProfile

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] — FINISH lro-490640

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — BEGIN lro-490641 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] — FINISH lro-490641
2018-04-12T14:02:41.702+08:00 info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — BEGIN lro-490642 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] — FINISH lro-490642

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — BEGIN lro-490643 — ResourceModel — cis.data.provider.ResourceModel.query

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] — FINISH lro-490643

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — BEGIN task-35322 — healthStatusSystem-28 — vim.host.HealthStatusSystem.FetchSystemEventLog

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] — FINISH task-35322

As a result, the ‘Refresh hardware IPMI System Event Log’ task completes successfully.

vSphere Web Client - IPMI Success