vCenter 6.7 Update 2: Error in creating a backup schedule

One of the improvements in vCenter 6.7 Update 2 is Samba (SMB) protocol support for the built-in File-Based Backup and Restore. Excited about the news, I decided to test this functionality and back up data to a Windows share.

I filled in the backup schedule parameters in the vCenter Server Appliance Management Interface (VAMI) and pressed the Create button, at which point the following error message appeared: Error in method invocation module 'util.Messages' has no attribute 'ScheduleLocationDoesNotExist'.

Puzzled with this message and not knowing which log file to inspect, I ran the following command in the local console session on the
vCenter Server Appliance (VCSA):

grep -i 'ScheduleLocationDoesNotExist' $(find /var/log/vmware/ -type f -name '*.log')

The search results led me to /var/log/vmware/applmgmt/applmgmt.log where I found another clue:

2019-04-30T01:25:24.111 [2476]ERROR:vmware.appliance.backup_restore.schedule_impl:Failed to mount the cifs share //fileserver.company.local/Archive/VMware at /storage/remote/backup/cifs/fileserver.company.local/D4Ji3vNM/fmuCEc6m; Err: rc=32, stdOut:, stdErr: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

At first, after some reading, I thought it was related to the SMB protocol version or the wrong security type for the server. So I decided to look for any security events on the file server.

In Windows Event Log, I saw the following:

After double-checking the NTFS and share permissions for the network share, I was confident that the user had permissions to access it and write data into it.

Running out of ideas, I combed through the official documentation and some blog posts to see if I was missing something. What struck me was that the backup server credentials in the Create Backup Schedule wizard contained no reference to the domain name, neither in UPN format nor as a sAMAccountName.

It was easy to test whether skipping the domain name would make any difference, and it did! The backup job worked like a charm and completed successfully.
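For illustration, here is a minimal sketch (with hypothetical account names) of the credential normalisation that made the wizard happy: strip either the DOMAIN\ prefix or the @domain suffix and supply the bare account name.

```python
def to_bare_username(user: str) -> str:
    """Normalise 'COMPANY\\backup_svc' or 'backup_svc@company.local'
    to the bare form the VAMI wizard accepted ('backup_svc')."""
    user = user.split("\\")[-1]   # drop a DOMAIN\ prefix (sAMAccountName form)
    user = user.split("@")[0]     # drop a @domain suffix (UPN form)
    return user

print(to_bare_username("COMPANY\\backup_svc"))       # backup_svc
print(to_bare_username("backup_svc@company.local"))  # backup_svc
```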

A tip when using the vCenter Server Converge Tool

As you might know, VMware is dropping support for the external Platform Services Controller (PSC) deployment model with the next major release of vSphere.

To make a smooth transition, you have to use the command-line interface of the vCenter Server Converge Tool to migrate to an embedded PSC. This functionality is available starting from vCenter Server 6.7 Update 1, and also in vCenter Server 6.5 Update 2d and onward.

With the upcoming release of vSphere 6.7 Update 2, there will be an option to complete the whole migration using the vSphere Client – super easy!

Meanwhile, the process of moving from an external PSC deployment to an embedded one using the CLI consists of two manual steps: converge and decommission. Detailed instructions on how to prepare for and execute each of those steps are documented in David Stamen's post 'Understanding the vCenter Server Converge Tool'.

What I found tricky was running the converge step when the external PSC had previously been joined to a child domain in Active Directory. In this case, the vCenter Server Converge Tool precheck run with the default parameters generates the following error message in vcsa-converge-cli.log:

2019-04-06 03:08:15,979 - vCSACliConvergeLogger - ERROR - AD Identity store present on the PSC:root.domain.com
2019-04-06 03:08:15,979 - vCSACliConvergeLogger - INFO - ================ [FAILED] Task: PrecheckSameDomainTask: Running PrecheckSameDomainTask execution failed at 03:08:15 ================
2019-04-06 03:08:15,980 - vCSACliConvergeLogger - DEBUG - Task 'PrecheckSameDomainTask: Running PrecheckSameDomainTask' execution failed because [ERROR: Template AD info not providded.], possible resolution is [Refer to the log for details]
2019-04-06 03:08:15,980 - vCSACliConvergeLogger - INFO - =============================================================
2019-04-06 03:08:16,104 - vCSACliConvergeLogger - ERROR - Error occurred. See logs for details.
2019-04-06 03:08:16,105 - vCSACliConvergeLogger - DEBUG - Error message: com.vmware.vcsa.installer.converge.prechecksamedomain: ERROR: Template AD info not providded.

In this example, root.domain.com refers to the root domain, whereas the computer object for the PSC is in the child domain.
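To make the failure easier to picture, here is an illustrative sketch (not the tool's actual code, and with hypothetical host names) of the comparison the PrecheckSameDomainTask effectively performs: the PSC's DNS domain is a child domain, while the AD identity store reports the root domain, so the check fails.

```python
def same_ad_domain(machine_fqdn: str, identity_domain: str) -> bool:
    """Illustrative only: compare the machine's DNS domain with the
    AD identity source domain, as the converge precheck effectively does."""
    machine_domain = machine_fqdn.split(".", 1)[1].lower()
    return machine_domain == identity_domain.lower()

# PSC joined to a child domain, identity store reported as the root domain:
print(same_ad_domain("psc01.child.root.domain.com", "root.domain.com"))  # False
# PSC joined directly to the root domain would pass:
print(same_ad_domain("psc01.root.domain.com", "root.domain.com"))        # True
```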

To work around this issue, I had to use the --skip-domain-handling flag to skip the AD domain-related handling in both the precheck and the actual converge.

With this flag, the vCenter Server Appliance should be joined to the correct AD domain manually after the converge succeeds and before the external PSC is decommissioned.

vCSA 6.x: WinSCP fails with the error ‘Received too large SFTP packet’

Back to basics… When you try connecting to the vCenter Server Virtual Appliance 6.x (vCSA) using WinSCP, the error message 'Received too large (1433299822 B) SFTP packet' might appear.

vCSA6x-WinSCP-01

This is due to the configuration of the vCSA, where the default shell for the root account is set to the Appliance Shell.

To fix this issue, VMware recommends switching vCSA 6.x to the Bash shell. This can be done in an SSH session with the following command:

chsh -s /bin/bash root

Note: You need to log out of the Appliance Shell and log back in for the change to take effect.
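To confirm which login shell root currently has (before and after running chsh), you can query the passwd database; a small sketch using Python's pwd module, which reads the same record chsh modifies:

```python
import pwd

# Look up root's login shell from /etc/passwd; on a vCSA still in Appliance
# Shell mode this would show /bin/appliancesh, after chsh it should show /bin/bash.
shell = pwd.getpwnam("root").pw_shell
print(shell)
```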

vCenter 6.0: VMware Common Logging Service Health Alarm [RESOLVED]

A few days ago I noticed a warning message appearing in vCenter Server pointing to some issues with the VMware Common Logging Service.

VCLS Issue - 01

The service status showed that disk space had been filling up steadily, reaching the 30% warning threshold.

VCLS Issue - 00

VCLS Issue - 02

Considering the infrastructure had not experienced significant changes, I decided to postpone disk space extension and try to find the root cause of this problem.

With the help of the du command, it became clear that some of the subfolders in /var/log/vmware were quite large.

VCLS Issue - 03

This shouldn't be a problem as long as log rotation happens and old data is removed from the disk. However, the /storage/log/vmware/cloudvm/cloudvm-ram-size.log file was 1.4 GB in size and seemed to keep growing without any rotation.

VCLS Issue - 04

An attempt to find out more about the cloudvm-ram-size.log file pointed me to an article William Lam wrote in early 2015: apparently, it logs the activity of a built-in dynamic memory reconfiguration process called cloudvm-ram-size.

The issue is documented in ‘Log rotation of the cloudvm-ram-size.log file is not working (2147261)‘ in VMware Knowledge Base.

As per that article, the problem is fixed in vSphere 6.5. For vSphere 6.0, you need to configure log rotation for the cloudvm-ram-size.log file and run the logrotate command manually to archive it to cloudvm-ram-size.log-xxxxxxxxx-.bz2 file.

VCLS Issue - 05

VMware also recommends periodically cleaning up older .bz2 files in the /storage/log/vmware/cloudvm location. This can be done by adding a rotate parameter to the configuration file as follows:

/storage/log/vmware/cloudvm/cloudvm-ram-size.log {
    missingok
    notifempty
    compress
    size 20k
    monthly
    create 0660 root cis
    rotate 7
}

It was a quick fix!
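With rotate 7 in place, logrotate itself caps the number of archives. If you prefer an explicit clean-up job on top of that, something like the following sketch would do; the 90-day threshold is my assumption, and the demo runs against a temporary directory rather than the real /storage/log/vmware/cloudvm path on the appliance.

```python
import os
import time
import tempfile
from pathlib import Path

def purge_old_archives(directory: Path, days: int = 90) -> list:
    """Delete rotated .bz2 archives older than `days`; return what was removed."""
    cutoff = time.time() - days * 86400
    removed = []
    for f in directory.glob("cloudvm-ram-size.log-*.bz2"):
        if f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f.name)
    return removed

# Demonstrate against a temporary directory instead of the appliance path.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp, "cloudvm-ram-size.log-20170101.bz2")
    new = Path(tmp, "cloudvm-ram-size.log-today.bz2")
    old.touch(); new.touch()
    os.utime(old, (0, 0))  # pretend the first archive is ancient
    print(purge_old_archives(Path(tmp)))  # only the stale archive goes
```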

vSAN 6.6.1: vSAN Build Recommendation Engine Health issue [RESOLVED]

In my previous post about the vSAN Build Recommendation Engine Health test, I concluded that a bug in vSAN 6.6.1 prevented the vSAN Health service from properly connecting to the Internet via a proxy.

With the vCenter Server Appliance 6.5 Update 1d release, I noticed that one of the two warning messages had disappeared from the vSphere Web Client, leaving that task in the 'Unexpected vSphere Update Manager (VUM) baseline creation failure' state.

After checking the vSAN configuration one more time, I concluded the following:

  • Internet connectivity for automatic updates of the HCL database has been set up properly (vSAN_Cluster > Configure > vSAN > General):

vSAN-BRE-01

  • The HCL database is up-to-date and CEIP is enabled (vSAN_Cluster > Configure > vSAN > Health and Performance):

vSAN-BRE-02

vSAN-BRE-03

  • Update Manager has proxy settings configured and working (vSAN_Cluster > Update Manager > Go to Admin View > Manage > Settings > Download Settings):

vSAN-BRE-04

vSAN-BRE-05

At the same time, the proxy server replaces SSL certificates with its own certificate, signed by the corporate CA, when establishing an HTTPS connection with the remote peer.

As a result, it causes an error message for the vSAN Build Recommendation Engine Health task as follows (extract from vmware-vsan-health-service.log):

INFO vsan-health[Thread-49] [VsanVumConnection::RemediateVsanClusterInVum] build = {u'release': {u'baselineName': u'VMware ESXi 6.5.0 U1 (build 5969303)', u'isoDisplayName': u'VMware ESXi Release 6.5.0, Build 5969303', u'bldnum': 5969303, u'vcVersion': [u'6.5.0'], u'patchids': [u'ESXi650-Update01'], u'patchDisplayName': u'VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303)'}}

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (keys: [40], hash: None)…

INFO vsan-health[Thread-49] [VsanVumConnection::_LookupPatchBaseline] Looking up baseline for patch vSAN recommended patch to be applied on top of ESXi 6.5 U1: ESXi650-201712401-BG (keys: [], hash: None)…

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster 'vim.ClusterComputeResource:domain-c61'

Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 1061, in RemediateAllClusters
    performScan = performScan)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 876, in RemediateVsanClusterInVum
    patchName, patchMap[chosenRelease])
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 373, in CreateBaselineFromOfficialPatches
    baseline = self._LookupPatchBaseline(name, keys)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanVumConnection.py", line 411, in _LookupPatchBaseline
    result = bm.QueryBaselinesForUpdate(update = updateKeys)
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 557, in <lambda>
    self.f(*(self.args + (obj,) + args), **kwargs)
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 362, in _InvokeMethod
    list(map(CheckField, info.params, args))
  File "/usr/lib/vmware-vpx/pyJack/pyVmomi/VmomiSupport.py", line 883, in CheckField
    raise TypeError('Required field "%s" not provided (not @optional)' % info.name)
TypeError: Required field "update" not provided (not @optional)

INFO vsan-health[Thread-49] [VsanVumSystemUtil::AddConfigIssue] Add config issue createBaselineFailed

INFO vsan-health[Thread-49] [VsanVumConnection::_DeleteUnusedBaselines] Deleting baseline VMware ESXi 6.5.0 U1 (vSAN 6.6.1, build 5969303) (id 424) because it is unused

INFO vsan-health[Thread-49] [VsanVumSystemUtil::VumRemediateAllClusters_DoWork] Complete VUM check for clusters ['vim.ClusterComputeResource:domain-c61']

ERROR vsan-health[Thread-49] [VsanVumConnection::RemediateAllClusters] Failed to remediate cluster

Following the community's advice, I decided to add the Root CA and subordinate CA certificates (in *.pem format) to the local certificate store on the vCenter Server Appliance. After copying the certificates to /etc/ssl/certs and running the c_rehash command, I added the proxy servers to /etc/sysconfig/proxy and rebooted the server.

vSAN-BRE-07

To test that the new configuration worked, I used the wget command, and everything seemed to work smoothly.

vSAN-BRE-06

Despite all those changes, I still got error messages from the vSAN Build Recommendation Engine Health test, but this time they looked a bit different:

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=ServiceInstance, info=content

WARNING vsan-health[Thread-11125] [VsanPhoneHomeWrapperImpl::_try_connect] Cannot connect to VUM. Will retry connection

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

INFO vsan-health[Thread-11125] [VsanPyVmomiProfiler::InvokeAccessor] Invoke: mo=group-d1, info=name

ERROR vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread exception: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py", line 511, in getResponse
    resp = proxyOpener.open(*args, **kwargs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required
Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 353, in run
    self._sendCloudHealthData(clusterUuid, data=data)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthDaemon.py", line 321, in _sendCloudHealthData
    objectId=clusterUuid, additionalUrlParams=additionalUrlParams)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 156, in send
    dataType=dataType, pluginType=pluginType, url=postUrl)
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthConnector.py", line 139, in sendRawData
    raise ex
VsanCloudHealthHTTPException: Exception: HTTP Error 411: Length Required, Url: https://vcsa.vmware.com/ph/api/dataapp/send?_v=1.0&_c=VsanCloudHealth.6_5&_i=<Support_Tag>, Traceback: Traceback (most recent call last):
  File "/usr/lib/vmware-vpx/vsan-health/pyMoVsan/VsanCloudHealthUtil.py", line 511, in getResponse
    resp = proxyOpener.open(*args, **kwargs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 411: Length Required

INFO vsan-health[Thread-11125] [VsanCloudHealthDaemon::run] VsanCloudHealthSenderThread done.

vsan-health[Thread-9] [VsanCloudHealthDaemon::_sendExceptionsToPhoneHome] Exceptions for collection/sending exceptions

My theory was that the vSAN Health service was trying to contact vSphere Update Manager directly, and the proxy settings applied at the OS level redirected this request to the Internet proxy instead.

I added the local domain to the exception list in /etc/sysconfig/proxy and rebooted the server again.

vSAN-BRE-08
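For reference, the resulting /etc/sysconfig/proxy would look roughly like this (host name, port, and domain are placeholders, not values from my environment):

```
PROXY_ENABLED="yes"
HTTP_PROXY="http://proxy.company.local:3128"
HTTPS_PROXY="http://proxy.company.local:3128"
# Local destinations that must bypass the Internet proxy:
NO_PROXY="localhost, 127.0.0.1, company.local"
```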

After reading about 'HTTP Error 411', the only remaining idea was to add a domain service account and its password to the HTTP_PROXY and HTTPS_PROXY lines in /etc/sysconfig/proxy. If the password contains special characters, they should be escaped using their ASCII codes (URL-encoded) for authentication to work correctly.
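This escaping is percent-encoding: each special character is replaced with % followed by its ASCII code in hex. A quick sketch with Python's urllib (the account name and password are made up):

```python
from urllib.parse import quote

password = "P@ss:w0rd!"
encoded = quote(password, safe="")   # '@' -> %40, ':' -> %3A, '!' -> %21
print(encoded)                       # P%40ss%3Aw0rd%21

# The encoded password then goes into the proxy URL:
proxy_line = 'HTTPS_PROXY="http://svc_proxy:' + encoded + '@proxy.company.local:3128"'
print(proxy_line)
```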

To my great surprise, all communication issues were resolved, and the vSAN Health service was able to synchronise data with vSphere Update Manager and the online services correctly.

vSAN-BRE-09

vSAN-BRE-11

A few minutes later vSAN system baselines and baseline groups appeared in vSphere Update Manager.

vSAN-BRE-10

Of course, those modifications to the Photon OS configuration files are not supported by VMware and could be overwritten by future updates. Still, I hope engineers and developers are working on better integration between vSAN Health and vSphere Update Manager for deployments where vCenter resides behind a proxy.

23/02/2018 – Update 1: Per VMware documentation, a starting point to troubleshoot connectivity to the CEIP web server is to make sure the following prerequisites are met:

vSphere 6.0: Templates are shown as ‘Unknown’ in the local Content Library

Another day, another case… This time, I was surprised to see an empty list when provisioning a new virtual machine from a Content Library.

CL Issue - 01

I went to check the Content Library status and found all templates were shown as ‘Unknown’ in there.

CL Issue - 02

Funny enough, this behaviour was happening only with the local Content Library. A subscribed one didn’t have any issues at all, and the synchronisation between those two was still working.

CL Issue - 03

More interestingly, the objects of other types were not affected at all.

There is not much information about how to troubleshoot the Content Library in vSphere 6.0. Some diagnostic files can be found in the /var/log/vmware/vdcs directory on the vCenter Server Appliance (VCSA). Unfortunately, they are not that informative.

So I opened a case with VMware GSS (SR # 17504701707), and the response was that "this issue is occurring as there is a corrupted or stale PID for the content library service which has not been cleared from the previous running state."

VMware is working on a resolution, but there is no ETA at the moment.

A workaround provided by VMware:

  1. Connect to the vCenter Server Appliance using SSH and root credentials.
  2. Navigate to /var/log/vmware/vdcs.
  3. Create a new folder to move the PID file to.
  4. Move the vmware-vdcs.pid file to the folder created in step 3.
  5. Reboot the vCenter Server Appliance (In case of external PSC, reboot the PSC first and then the vCenter).

I personally found that restarting the VCSA resolves this issue; however, it reappears after some time.

VCSA 6.5: The mysterious dependency on the IPv6 protocol – Part 2

In Part 1 of this mini-series, I wrote about the issue with the Appliance Management User Interface. However, the dependency on the IPv6 protocol in VCSA 6.5 can cause unexpected behaviour with the vSphere ESXi Dump Collector service as well. Let's look into this one now.

In an environment with many ESXi hosts, it is vital to have their logs available for troubleshooting. By default, each host has a diagnostic coredump partition on its local storage. The hypervisor can preserve diagnostic information in one or more pre-configured locations, such as the local partition, a file on a VMFS datastore, or a network dump server on vCenter Server.

ESXi-dump-collection-06

In the case of a critical host failure, when the system gets into the Purple Screen of Death (PSOD) state, the hypervisor generates a set of diagnostic data archived in a coredump. In my opinion, it is more efficient to have this information stored in a centralised location, and this is where the vSphere ESXi Dump Collector service can be useful.

Initially, the vSphere ESXi Dump Collector service is disabled on the vCenter Server Appliance.

ESXi-Dump-Collector-01

The setup process is straightforward: select a startup type for the service (by default, it is set to Manual) and click the Start button to enable it.

ESXi-Dump-Collector-02

Depending on the network requirements and the number of ESXi hosts, you might need to change the Coredump Server UDP port (6500) and increase the repository maximum size (2 GB). Both settings require restarting the vSphere ESXi Dump Collector service.

This process becomes a little more complicated when IPv6 is disabled on the VCSA. An attempt to start the vSphere ESXi Dump Collector service generates an error message in the vSphere Web Client as follows:

ESXi-Dump-Collector-03

If we connect to the virtual appliance and try to start the netdumper service from a console session, it gives us more information:

root@n-vcsa-01 [ ~ ]# service-control --start netdumper
Perform start operation. vmon_profile=None, svc_names=['netdumper'], include_coreossvcs=False, include_leafossvcs=False
2017-07-04T10:15:32.179Z Service netdumper state STOPPED
Error executing start on service netdumper. Details {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Service-control failed. Error {
    "resolution": null,
    "detail": [
        {
            "args": [
                "netdumper"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'netdumper'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}

The next step to troubleshoot this issue is to look into the vSphere ESXi Dump Collector service log file (/var/log/vmware/netdumper/netdumper.log). It reports that the address is already in use:

root@n-vcsa-01 [ ~ ]# cat /var/log/vmware/netdumper/netdumper.log
2017-07-04T10:19:32.121Z| netdumper| I125: Log for vmware-netdumper pid=8347 version=XXX build=build-5318154 option=Release
2017-07-04T10:19:32.121Z| netdumper| I125: The process is 64-bit.
2017-07-04T10:19:32.121Z| netdumper| I125: Host codepage=UTF-8 encoding=UTF-8
2017-07-04T10:19:32.121Z| netdumper| I125: Host is Linux 4.4.8 VMware Photon 1.0 Photon VMware Photon 1.0

2017-07-04T10:19:32.123Z| netdumper| I125: Configured to handle 1024 clients in parallel.
2017-07-04T10:19:32.123Z| netdumper| I125: Configuring /var/core/netdumps as the directory to store the cores
2017-07-04T10:19:32.123Z| netdumper| I125: Configured to use wildcard [::0/0.0.0.0]:6500 as IP address:port
2017-07-04T10:19:32.123Z| netdumper| I125: Using /var/log/vmware/netdumper/netdumper.log as the logfile.
2017-07-04T10:19:32.123Z| netdumper| I125: Nothing to post process
2017-07-04T10:19:32.123Z| netdumper| I125: Couldn't bind socket to port 6500: 98 Address already in use
2017-07-04T10:19:32.123Z| netdumper| I125:

Playing a bit with Linux commands gave me some clues:

root@n-vcsa-01 [ ~ ]# netstat -lup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 *:kerberos *:* 1489/vmdird
udp 0 0 *:sunrpc *:* 1062/rpcbind
udp 0 0 n-vcsa-01.testorg.l:ntp *:* 1249/ntpd
udp 0 0 photon-machine:ntp *:* 1249/ntpd
udp 0 0 *:ntp *:* 1249/ntpd
udp 0 0 *:epmap *:* 1388/dcerpcd
udp 0 0 *:syslog *:* 2229/rsyslogd
udp 0 0 *:794 *:* 1062/rpcbind
udp 0 0 *:ideafarm-door *:* 3905/vpxd
udp 0 0 *:llmnr *:* 1223/systemd-resolv
udp6 0 0 [::]:tftp [::]:* 1/systemd
udp6 0 0 [::]:sunrpc [::]:* 1062/rpcbind
udp6 0 0 [::]:ntp [::]:* 1249/ntpd
udp6 0 0 [::]:syslog [::]:* 2229/rsyslogd
udp6 0 0 [::]:794 [::]:* 1062/rpcbind
udp6 0 0 [::]:boks [::]:* 17377/vmware-netdum

root@n-vcsa-01 [ ~ ]# ps -p 17377
PID TTY TIME CMD
17377 ? 00:00:00 vmware-netdumpe

root@n-vcsa-01 [ ~ ]# cat /proc/17377/cmdline
/usr/sbin/vmware-netdumper-d/var/core/netdumps-o6500-l/var/log/vmware/netdumper/netdumper.log

Even though it reports an error at startup, the vSphere ESXi Dump Collector service is (partially) running on the virtual appliance.
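The 'Address already in use' message in netdumper.log is ordinary socket behaviour: a second bind to a UDP port that is already taken fails with EADDRINUSE (errno 98 on Linux), exactly what happens when a leftover vmware-netdumper process already holds port 6500. A minimal repro, on an arbitrary free port and with nothing VMware-specific:

```python
import errno
import socket

# First listener takes the UDP port (port 0 lets the OS pick a free one).
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.bind(("127.0.0.1", 0))
port = a.getsockname()[1]

# A second bind to the same address:port fails just like netdumper did.
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    b.bind(("127.0.0.1", port))
except OSError as e:
    print(e.errno == errno.EADDRINUSE, e.strerror)
finally:
    a.close(); b.close()
```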

Thanks to Michael, who shared a detailed guide, I was able to test this assumption quickly.

ESXi-Dump-Collector-04

ESXi-Dump-Collector-05

The coredump was successfully transferred from the ESXi host to the /var/core/netdumps/ folder on the VCSA appliance. However, there were no records about this operation in the netdumper.log.

This issue has been reported to VMware GSS (SR # 17385781602) and should be resolved in future updates to VCSA 6.5.