Solving critical Windows services restart during Puppet agent upgrades
In this blog post, we share the journey we went on to solve a not-so-easy customer problem: critical Windows services restarting during Puppet agent upgrades. As with most software fixes, it started with a customer ticket: the DHCP Server service restarts after an upgrade. The troubleshooting went from analyzing Windows Installer logs, to using undocumented Windows API calls, and back to the Windows Installer logs, until we eventually found a solution.
On Puppet Enterprise (PE) deployments, each Windows agent runs a service to allow execution of various actions on remote nodes — the pxp-agent. The pxp-agent is registered as a Windows service with the help of NSSM.
NSSM allows the user — in this case, us — to easily register/unregister an application as a Windows service and customize many related options. Because NSSM also registers itself as an Event Log message source, upgrading the Puppet agent (which contains pxp-agent and NSSM) while Event Viewer is open triggers a restart of critical Windows services (DHCP Client/Server), leading to unresponsive remote hosts.
You may not have Event Viewer open very often, but the same issue occurs whenever a log collector service is running, which makes the problem far more likely to show up in practice.
The symptoms vary between Windows versions, but the cause is the same.
The puppet-agent package upgrades the nssm.exe file while that file is loaded in the Event Log service process image.
This leads to an Event Log service restart that also restarts other services registered as Event Log message sources, including critical services like DHCP server, DHCP client, etc.
The critical Windows services restart is caused by the following chain of events:
nssm upgrade -> Event Log service restart -> critical Windows services restart
Our first step was therefore to stop registering nssm.exe as an Event Log message source, so that NSSM upgrades would no longer trigger a restart of the Event Log service.
This solved the issue for newer deployments of the Puppet agent, but not existing ones. When nssm.exe was already a registered Event Log message source, the problem persisted. See our initial NSSM correction.
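To tell whether an existing deployment is affected, one option is to look at the Event Log source registrations in the registry. The sketch below is not taken from our fix. It assumes the standard layout in which each source is a subkey of HKLM\SYSTEM\CurrentControlSet\Services\EventLog\Application with an EventMessageFile value, and the function names are hypothetical.

```python
import sys

EVENTLOG_APPLICATION_KEY = r"SYSTEM\CurrentControlSet\Services\EventLog\Application"


def sources_using(message_files, exe_name):
    """Return the event sources whose EventMessageFile points at exe_name.

    message_files maps source name -> EventMessageFile string, which may
    hold several paths separated by semicolons.
    """
    wanted = exe_name.lower()
    hits = []
    for source, value in message_files.items():
        paths = [p.strip().strip('"') for p in value.split(";")]
        base_names = [p.lower().replace("/", "\\").rsplit("\\", 1)[-1] for p in paths]
        if wanted in base_names:
            hits.append(source)
    return sorted(hits)


def read_application_sources():
    """Read source -> EventMessageFile from the live registry (Windows only)."""
    import winreg

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, EVENTLOG_APPLICATION_KEY) as log_key:
        subkey_count = winreg.QueryInfoKey(log_key)[0]
        for index in range(subkey_count):
            source = winreg.EnumKey(log_key, index)
            try:
                with winreg.OpenKey(log_key, source) as source_key:
                    value, _type = winreg.QueryValueEx(source_key, "EventMessageFile")
            except OSError:
                continue  # no message file registered, or key not readable
            result[source] = str(value)
    return result


if __name__ == "__main__" and sys.platform == "win32":
    print(sources_using(read_application_sources(), "nssm.exe"))
```

A non-empty result on a node means nssm.exe is still a registered message source there, so the node falls into the "existing deployment" case described above.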
Open source and undocumented Windows API
Another solution we investigated was to detect whether nssm.exe was already loaded in the EventLog.exe process image and, if so, unload it the way ProcessHacker does. It was a fun time reading the ProcessHacker source code and experimenting with undocumented Windows APIs.
While getting close to a stable version, we realized that we could not guarantee the atomicity of the check/install steps — nssm.exe could still load into EventLog.exe between our check and the installation. We archived this solution and continued our journey.
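The atomicity problem is a classic check-then-act race. The toy simulation below only illustrates that reasoning and is not our actual code: FakeEventLog and upgrade are hypothetical stand-ins, and the "install" step is reduced to reporting whether the file was in use at the moment it would have been replaced.

```python
class FakeEventLog:
    """Stand-in for the Event Log service: tracks which files it has loaded."""

    def __init__(self):
        self.loaded = set()

    def load(self, path):
        self.loaded.add(path)

    def unload(self, path):
        self.loaded.discard(path)


def upgrade(event_log, path, between_check_and_install=None):
    """Unload, check, then install. Returns True if the race was hit."""
    event_log.unload(path)                    # step 1: free the file
    looked_safe = path not in event_log.loaded  # step 2: check
    if between_check_and_install:
        between_check_and_install()           # e.g. someone opens Event Viewer
    # step 3: install, trusting the possibly stale result of step 2
    return looked_safe and path in event_log.loaded


log = FakeEventLog()
print(upgrade(log, "nssm.exe"))
print(upgrade(log, "nssm.exe", lambda: log.load("nssm.exe")))
```

The second call shows the gap: the check passes, the file gets loaded again, and the install proceeds against a file that is in use.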
Back to the installer
Stepping back for a bird's-eye view of the problem, we realized that renaming the nssm.exe file to nssm-pxp-agent.exe avoids the conflict, because the new package no longer replaces nssm.exe. Still, even with this change, the package validation failed.
It took us some time to realize that it was not the validation of the new package that failed. It was the validation performed by the previous package's uninstaller, which tried to uninstall the nssm.exe that was still loaded in EventLog.exe.
To trick the uninstaller, we tested two approaches: renaming the file, and removing the file reference from the HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components registry key. Both get past the previous package's uninstaller validation and complete the upgrade without restarting any critical service. Of the two, we implemented the registry change because it is more stable in case of install/uninstall failures.
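To see which entries such a registry change would touch, a read-only scan is a safe starting point. The sketch below is an illustration rather than the code from our fix. It assumes each subkey of the Components key holds values whose data is a file path, and the function names are hypothetical. It only lists matches and deletes nothing.

```python
import sys

COMPONENTS_KEY = (
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Installer"
    r"\UserData\S-1-5-18\Components"
)


def references_to(components, file_name):
    """Return sorted (component, value_name) pairs whose data ends in file_name.

    components maps component subkey name -> {value name: value data}.
    """
    suffix = "\\" + file_name.lower()
    return sorted(
        (component, name)
        for component, values in components.items()
        for name, data in values.items()
        if isinstance(data, str) and data.lower().endswith(suffix)
    )


def read_components():
    """Snapshot the Components key from the live registry (Windows only)."""
    import winreg

    snapshot = {}
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, COMPONENTS_KEY, 0, access) as root:
        for i in range(winreg.QueryInfoKey(root)[0]):
            component = winreg.EnumKey(root, i)
            try:
                with winreg.OpenKey(root, component, 0, access) as key:
                    value_count = winreg.QueryInfoKey(key)[1]
                    entries = [winreg.EnumValue(key, j) for j in range(value_count)]
            except OSError:
                continue  # skip subkeys we are not allowed to read
            snapshot[component] = {name: data for name, data, _type in entries}
    return snapshot


if __name__ == "__main__" and sys.platform == "win32":
    for component, value_name in references_to(read_components(), "nssm.exe"):
        print(component, value_name)
```

Matching on the trailing `\nssm.exe` keeps a renamed file such as nssm-pxp-agent.exe out of the results.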
As we were already using the WiX Toolset to create the installer package, we implemented the registry update using the WixQuietExecCmdLine custom action right before the RemoveExistingProducts step. See our [Puppet Agent correction](https://github.com/puppetlabs/puppet-agent/pull/1912/files).
The "final" solution
After a bumpy ride, the solution to our customer problem was to:
- No longer register nssm.exe as an Event Log message source, because that registration causes critical Windows services to restart during upgrades.
- If packages with nssm.exe registered as an Event Log message source are already delivered:
  a. Rename nssm.exe in the newer packages to a name specific to the application. For example, nssm-myapp.exe.
  b. Remove any nssm.exe references from the HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components registry key during the new package installation.
- The Puppet agent is a collection of software that is required for Puppet and its dependencies to run. This includes Puppet, Facter, and other Puppet software, as well as vendored dependencies like Ruby, curl, OpenSSL, and more: https://github.com/puppetlabs/puppet-agent
- NSSM is a service helper that allows you to run standard executables and scripts as Windows services.
- Process Hacker is a free, powerful, multi-purpose tool that helps you monitor system resources, debug software, and detect malware.
- Windows Installer XML Toolset is a free software toolset that builds Windows Installer packages from XML.