vSAN 6.6.1: Replacing a faulty NIC on a Dell PowerEdge server

Not long ago I noticed a 10G network interface flapping on one of the vSAN nodes.

vSAN-NIC-issue-01

I immediately started investigating this issue. Interestingly, the port was part of the server’s integrated NIC, and it was the only port generating connection errors.

vSAN-NIC-issue-02

After consulting with the network team, we found the following:

  • The interface was up,
  • The switch port had traffic in one direction only (the switch could send to the server but received nothing from it),
  • No errors were recorded,
  • SFP signals were good.

The plan of attack was to replace the SFPs: first on the server and, if that didn’t help, on the switch. During this operation we found that the SFP on the server side was unusually warm. Unfortunately, replacing the SFPs didn’t help, and after approximately 15 minutes of complete silence the disconnects resumed.

The next move was to contact the vendor. In our case, it was Dell EMC.

We lodged a support request and sent them the SupportAssist Collection. Support recommended replacing the embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.

However, thanks to the algorithm that ESXi has used to assign names to I/O devices since version 5.5, it all went smoothly. VMware states the following:

vSAN-NIC-issue-03

The number of ports on the embedded NIC hadn’t changed. As a result, the hypervisor assigned the same aliases to the onboard ports.

ESXi initialised the new ports, and the vSAN configuration was updated successfully without any human intervention.

As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.

vSAN-NIC-issue-04

All in all, I am impressed by how robust the modern platforms from both Dell EMC and VMware are.

ESXi 6.5: Retrieve IPMI SEL request to host failed [FIXED BY VENDOR]

From time to time you might want to check the host hardware health manually in Monitor>Hardware Health (vSphere Client) or Monitor>Hardware Status (vSphere Web Client).

For many months this functionality was broken for ESXi 6.5 on Dell EMC servers.

vSphere Web Client - IPMI Error

When opening the Sensors page, vpxd.log shows the following message:

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] -- BEGIN task-35318 -- healthStatusSystem-34 -- vim.host.HealthStatusSystem.FetchSystemEventLog

error vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] No Content-Length header, WSMan IPMI SEL operation failed

info vpxd[7FBE59924700] [Originator@6876 sub=MoHost opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] WSMan Msg size 59: part:401 Unauthorized
--> WWW-Authenticate: Basic realm="OPENWSMAN")l▒\x7f

warning vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] Closing Response processing in unexpected state: 3

info vpxd[7FBE59924700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] -- FINISH task-35318

info vpxd[7FBE59924700] [Originator@6876 sub=Default opID=dam-auto-generated: HardwareStatusViewMediator:dr-425:CimMonitorPropertyProvider:200359:14133-31991-ngc:70004153-e9] [VpxLRO] -- ERROR task-35318 -- healthStatusSystem-34 -- vim.host.HealthStatusSystem.FetchSystemEventLog: vmodl.fault.SystemError:
--> Result:
--> (vmodl.fault.SystemError) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> reason = "Retrieve IPMI SEL request to host failed"
--> msg = ""
--> }
--> Args:
-->

Many people were pointing to vpxa.cfg (here and here) as a source of the error:

<log>
  <level>verbose</level>
  <maxFileNum>10</maxFileNum>
  <maxFileSize>1048576</maxFileSize>
  <memoryLevel>verbose</memoryLevel>
  <outputToConsole>false</outputToConsole>
  <outputToFiles>false</outputToFiles>
  <outputToSyslog>true</outputToSyslog>
  <syslog>
    <facility>local4</facility>
    <ident>Vpxa</ident>
    <logHeaderFile>/var/run/vmware/vpxaLogHeader.txt</logHeaderFile>
  </syslog>
</log>
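
To check these values without opening the file by hand, a minimal Python sketch along the following lines could be used. It only reads the `<log>` section shown above and changes nothing; the sample text is embedded in the script, and the function name `read_log_settings` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample copied from the <log> section shown above. On a real host the text
# would come from vpxa.cfg instead (its location is not assumed here).
SAMPLE_LOG_SECTION = """
<log>
  <level>verbose</level>
  <maxFileNum>10</maxFileNum>
  <maxFileSize>1048576</maxFileSize>
  <memoryLevel>verbose</memoryLevel>
  <outputToConsole>false</outputToConsole>
  <outputToFiles>false</outputToFiles>
  <outputToSyslog>true</outputToSyslog>
  <syslog>
    <facility>local4</facility>
    <ident>Vpxa</ident>
    <logHeaderFile>/var/run/vmware/vpxaLogHeader.txt</logHeaderFile>
  </syslog>
</log>
"""


def read_log_settings(xml_text):
    """Return the simple (non-nested) settings of a <log> element as a dict."""
    root = ET.fromstring(xml_text.strip())
    # Keep only leaf elements directly under <log>; nested ones such as
    # <syslog> are skipped.
    return {child.tag: (child.text or "").strip()
            for child in root if len(child) == 0}


if __name__ == "__main__":
    settings = read_log_settings(SAMPLE_LOG_SECTION)
    for key in ("level", "memoryLevel", "outputToSyslog"):
        print(f"{key}: {settings[key]}")
```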

It was not the end of the world, and I didn’t want to edit the default log levels manually, so the issue was ignored for a while.

To my great surprise, it all went back to normal after updating the hypervisor to the latest version using the Dell EMC customised VMware ESXi 6.5 U1 A10 image.

Now, we can see multiple events in vpxd.log generated by VpxLRO:

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] -- BEGIN lro-490638 -- ResourceModel -- cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectTabbedViewMediator:dr-519,dam-auto-generated: ObjectPropertyFilter:dr-521):01-e6] [VpxLRO] -- FINISH lro-490638

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] -- BEGIN lro-490639 -- ResourceModel -- cis.data.provider.ResourceModel.query

info vpxd[7FBE58B08700] [Originator@6876 sub=vpxLro opID=combined(dam-auto-generated: ObjectPropertyFilter:dr-529,dam-auto-generated: ObjectPropertyFilter:dr-533):01-86] [VpxLRO] -- FINISH lro-490639

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] -- BEGIN lro-490640 -- HostProfileManager -- vim.profile.ProfileManager.findAssociatedProfile

info vpxd[7FBE5B45A700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: ObjectPropertyFilter:dr-529:AssociationHostSystemAdapter:200359:14388-32550-ngc:70004210-ce] [VpxLRO] -- FINISH lro-490640

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] -- BEGIN lro-490641 -- ResourceModel -- cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: RelatedItemsManager:dr-535:01-78] [VpxLRO] -- FINISH lro-490641

2018-04-12T14:02:41.702+08:00 info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] -- BEGIN lro-490642 -- ResourceModel -- cis.data.provider.ResourceModel.query

info vpxd[7FBE5A236700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:01-d9] [VpxLRO] -- FINISH lro-490642

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] -- BEGIN lro-490643 -- ResourceModel -- cis.data.provider.ResourceModel.query

info vpxd[7FBE5ACCB700] [Originator@6876 sub=vpxLro opID=urn:vmomi:HostSystem:host-28:9a78adfb-4c75-4b84-8d9a-65ab2cc71e51.properties:01-c1] [VpxLRO] -- FINISH lro-490643

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] -- BEGIN task-35322 -- healthStatusSystem-28 -- vim.host.HealthStatusSystem.FetchSystemEventLog

info vpxd[7FBE5A53C700] [Originator@6876 sub=vpxLro opID=dam-auto-generated: HardwareStatusViewMediator:dr-545:CimMonitorPropertyProvider:200359:14395-32555-ngc:70004212-2b] [VpxLRO] -- FINISH task-35322

As a result, the ‘Refresh hardware IPMI System Event Log’ task completes successfully.

vSphere Web Client - IPMI Success
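
The difference between the failing and the successful vpxd.log excerpts above can also be checked mechanically. The following is a rough Python sketch, not a VMware tool: it looks for the `[VpxLRO]` BEGIN, FINISH and ERROR markers and reports an outcome per task. The sample lines are shortened versions of the excerpts in this post, and the names `MARKER` and `task_outcomes` are invented for this example.

```python
import re

# Matches markers such as "[VpxLRO] -- BEGIN task-35318". A long dash
# (\u2014) is accepted as well as "--", in case the log text went through
# a blog engine that replaced the double hyphen.
MARKER = re.compile(
    r"\[VpxLRO\]\s+(?:--|\u2014)\s+(BEGIN|FINISH|ERROR)\s+((?:task|lro)-\d+)"
)


def task_outcomes(lines):
    """Map each task/lro id to 'error', 'ok' or 'pending'."""
    seen = {}
    for line in lines:
        match = MARKER.search(line)
        if match:
            event, task_id = match.groups()
            seen.setdefault(task_id, set()).add(event)

    outcomes = {}
    for task_id, events in seen.items():
        if "ERROR" in events:
            # In the failing excerpt an ERROR marker follows FINISH,
            # so ERROR takes precedence.
            outcomes[task_id] = "error"
        elif "FINISH" in events:
            outcomes[task_id] = "ok"
        else:
            outcomes[task_id] = "pending"
    return outcomes


# Shortened sample lines based on the two excerpts in this post.
SAMPLE = [
    "info vpxd[7FBE59924700] [VpxLRO] -- BEGIN task-35318 -- healthStatusSystem-34 -- vim.host.HealthStatusSystem.FetchSystemEventLog",
    "info vpxd[7FBE59924700] [VpxLRO] -- FINISH task-35318",
    "info vpxd[7FBE59924700] [VpxLRO] -- ERROR task-35318 -- healthStatusSystem-34 -- vim.host.HealthStatusSystem.FetchSystemEventLog: vmodl.fault.SystemError:",
    "info vpxd[7FBE5A53C700] [VpxLRO] -- BEGIN task-35322 -- healthStatusSystem-28 -- vim.host.HealthStatusSystem.FetchSystemEventLog",
    "info vpxd[7FBE5A53C700] [VpxLRO] -- FINISH task-35322",
]

if __name__ == "__main__":
    for task_id, outcome in sorted(task_outcomes(SAMPLE).items()):
        print(task_id, outcome)
```

Run against a real vpxd.log, the same function could be fed the lines of the file, although other log variants may need a different pattern.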

VMware Tools 10.2.5: Changes to VMXNET3 driver settings

Last week VMware released a new version of VMware Tools.

VMware Tools 10.2.5

It might look like a minor upgrade. However, it includes important changes to the Receive Side Scaling (RSS) and Receive Throttle options in the VMXNET3 driver, which require attention and careful planning when implemented.

According to the vendor:

RSS is a mechanism which allows the network driver to spread incoming TCP traffic across multiple CPUs, resulting in increased multi-core efficiency and processor cache utilization. If the driver or the operating system is not capable of using RSS, or if RSS is disabled, all incoming network traffic is handled by only one CPU. In this situation, a single CPU can be the bottleneck for the network while other CPUs might remain idle.
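
To make the idea concrete, here is a deliberately simplified Python sketch of how flows could be mapped to CPUs. It is an illustration only: CRC32 stands in for whatever hash a real NIC or driver uses, and the function name `pick_cpu`, the addresses and the CPU count are all made up for this example.

```python
import zlib


def pick_cpu(src_ip, src_port, dst_ip, dst_port, cpu_count, rss_enabled=True):
    """Choose the CPU that handles a TCP flow.

    Illustration only: CRC32 is a stand-in hash, not the one used by
    the vmxnet3 driver or by Windows.
    """
    if not rss_enabled or cpu_count <= 1:
        return 0  # without RSS every flow lands on a single CPU
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    # The same flow always hashes to the same CPU, so packets of one
    # connection stay on one core.
    return zlib.crc32(flow) % cpu_count


if __name__ == "__main__":
    # Eight hypothetical client connections to one server address.
    flows = [(f"10.0.0.{i}", 40000 + i, "10.0.1.10", 443) for i in range(1, 9)]
    for enabled in (False, True):
        cpus = [pick_cpu(*flow, cpu_count=4, rss_enabled=enabled) for flow in flows]
        print("RSS on: " if enabled else "RSS off:", cpus)
```

With RSS off, every flow maps to CPU 0, which is the single-CPU bottleneck the vendor describes.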

Despite all its benefits, this technology was disabled on Windows 8 and Windows Server 2012 or later due to an issue with the vmxnet3 driver that affects Windows guest operating systems with VMware Tools 9.4.15 and later.

The issue was finally resolved in mid-2017 with the release of VMware Tools 10.1.7. However, only vmxnet3 driver version 1.7.3.7 in VMware Tools 10.2.0 was recommended by VMware for Windows and Microsoft Business Critical applications.

A few months later, VMware introduced the following changes in vmxnet3 driver version 1.7.3.8:

  • Receive Side Scaling is enabled by default,
  • The default value of the Receive Throttle is set to 30.

If you install VMware Tools 10.2.5 on a new virtual machine running Windows 8, Windows Server 2012 or later, those settings apply automatically; when VMware Tools is upgraded on an existing machine, they stay as they were before.

To check the current status of RSS and the Receive Throttle, you can execute the following PowerShell script inside the VM:

Get-NetAdapter | Where-Object { $_.InterfaceDescription -like "vmxnet3*" } | Get-NetAdapterAdvancedProperty | Where-Object { $_.RegistryKeyword -like "*RSS" -or $_.RegistryKeyword -like "RxThrottle" } | Format-Table -AutoSize

If you would like to edit those advanced options for all VMXNET3 NICs inside the VM, it can be done with the following two lines:

Get-NetAdapter | Where-Object { $_.InterfaceDescription -like "vmxnet3*" } | Set-NetAdapterAdvancedProperty -DisplayName "Receive Side Scaling" -DisplayValue "Enabled" -NoRestart
Get-NetAdapter | Where-Object { $_.InterfaceDescription -like "vmxnet3*" } | Set-NetAdapterAdvancedProperty -DisplayName "Receive Throttle" -DisplayValue "30" -NoRestart

Remember that the virtual machine should be rebooted after applying those settings. After the reboot, the output will look similar to this:

VMXNET3-RSS

The only thing left is to perform thorough testing. Some ideas on how to do it can be found here.

25/04/2018 – Update 1: VMware released a knowledge base article about Windows 7 and 2008 virtual machines losing network connectivity on VMware Tools 10.2.0. To resolve this issue, they recommend upgrading to VMware Tools 10.2.5.

01/05/2018 – Update 2: VMware released VMware Tools 10.2.1. This minor update resolves an issue where ‘network ports are exhausted on Guest VM after a few days when using VMware Tools 10.2.0’.

VMware: A few exciting events this week

Hi All,

For those of you who are based in Sydney or Melbourne, the VMware User Group is holding its most prominent annual community event this week: VMUG UserCon 2018. I’ve been participating in VMUG UserCon in Sydney for a few years, and it has always exceeded my expectations, whether meeting technology experts and rising startups or socialising with peers.

VMUG UserCon

According to the event’s agenda, this year the primary focus is going to be on solutions for building a hybrid VMware Cloud on AWS, providing automation and analysis for virtual machines and containerised workloads, securing the environment with VMware NSX, and more.

In addition to local VMUG leaders and stars, there will be industry pioneers like Bruce Davie (NSX) and Cormac Hogan (Storage and Availability). So if you would like to learn about modern trends in the virtual space, have some questions for VMware or the event sponsors, or just want to mingle with the crowd – the registration is still open here and here.

Another piece of news I would like to share with you is related to hyper-converged infrastructure (HCI). VMware is planning to hold a virtual event called ‘Deploy, Manage and Scale vSAN and HCI with vRealize Operations’ on March 21 and 22, 10 am PDT. The first webinar will focus on the business needs that accelerate HCI adoption, whereas the second one promises to be a technical deep dive into the components that comprise the HCI architecture. It looks like a fascinating subject to explore!

vCenter 6.0: VMware Common Logging Service Health Alarm [RESOLVED]

A few days ago I noticed a warning message appearing in vCenter Server pointing to some issues with the VMware Common Logging Service.

VCLS Issue - 01

The service status showed that the disk had been filling up steadily, reaching the 30% warning threshold.

VCLS Issue - 00

VCLS Issue - 02

Considering the infrastructure had not experienced significant changes, I decided to postpone disk space extension and try to find the root cause of this problem.

With the help of the du command, it became clear that some of the subfolders in /var/log/vmware were quite large.

VCLS Issue - 03
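
Where du is not at hand, a similar ranking of subfolders can be approximated with a short Python sketch like the one below. It is an approximation under stated assumptions: it sums apparent file sizes rather than the disk blocks that du reports, and the function names `dir_size` and `largest_subfolders` are made up for this example.

```python
import os
import sys


def dir_size(path):
    """Total size in bytes of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable, skip it
    return total


def largest_subfolders(base, top=5):
    """Return (size_bytes, name) for the biggest immediate subfolders of base."""
    sizes = []
    with os.scandir(base) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Pass the folder to inspect as the first argument, for example
    # /var/log/vmware; the current directory is used otherwise.
    base = sys.argv[1] if len(sys.argv) > 1 else "."
    for size, name in largest_subfolders(base):
        print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```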

This wouldn’t be a problem if log rotation took place and old data were removed from the disk. However, the /storage/log/vmware/cloudvm/cloudvm-ram-size.log file was 1.4G in size, and it seemed to be growing without any log rotation.

VCLS Issue - 04

Searching for information about the cloudvm-ram-size.log file led me to an article that William Lam wrote in early 2015. Apparently, the file logs the activity of a built-in dynamic memory reconfiguration process called cloudvm-ram-size.

The issue is documented in ‘Log rotation of the cloudvm-ram-size.log file is not working (2147261)’ in the VMware Knowledge Base.

As per that article, the problem is fixed in vSphere 6.5. For vSphere 6.0, you need to configure log rotation for the cloudvm-ram-size.log file and run the logrotate command manually to archive it to a cloudvm-ram-size.log-xxxxxxxxx-.bz2 file.

VCLS Issue - 05

VMware recommends periodically cleaning up older .bz2 files in the /storage/log/vmware/cloudvm location. This can be done by adding a rotate parameter to the configuration file as follows:

/storage/log/vmware/cloudvm/cloudvm-ram-size.log {
    missingok
    notifempty
    compress
    size 20k
    monthly
    create 0660 root cis
    rotate 7
}

It was a quick fix!
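
For anyone who prefers to review the archives before anything is deleted, the retention that `rotate 7` asks for can be pictured with the Python sketch below. This is not how logrotate works internally; it is a hypothetical helper (`prune_archives` is a name invented here) that keeps the newest seven files matching an assumed `cloudvm-ram-size.log-*.bz2` pattern and, by default, only reports what it would remove.

```python
import glob
import os


def prune_archives(directory, keep=7,
                   pattern="cloudvm-ram-size.log-*.bz2", dry_run=True):
    """Keep the `keep` newest archives matching pattern; return the rest.

    With dry_run=True (the default) nothing is deleted and the function
    only reports which files are older than the newest `keep`.
    """
    archives = sorted(glob.glob(os.path.join(directory, pattern)),
                      key=os.path.getmtime, reverse=True)
    stale = archives[keep:]
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


if __name__ == "__main__":
    # Directory taken from the post; the run below is a dry run only.
    for path in prune_archives("/storage/log/vmware/cloudvm"):
        print("would remove", path)
```

Keeping the dry run as the default is a deliberate choice, since removing logs on a vCenter Server is hard to undo.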

VMware: StorageHub Portal Refresh

For those of us who have been interested in detailed information about VMware vSAN, Site Recovery Manager, and vSphere storage in general, VMware StorageHub has been a unique source of technical documentation.

It is great to see the vendor improving this portal with a design and user interface refresh.

SorageHub-01

Now it is possible to choose between English (US) and Mandarin for some of the articles.

SorageHub-02

It all seems quite logical, and I personally like the navigation and how fast the search works.

SorageHub-03

Well done, VMware!