Monitor Puppet infrastructure with the puppet_metrics_dashboard module
Hi! I’m Erik Hansen, one of the senior support engineers here at Puppet. As someone who helps some of our largest Puppet Enterprise customers address issues of performance and scale, I want every person in charge of a Puppet infrastructure to be able to view and monitor metrics from Puppet services.
Our support team has developed a few great tools to help us diagnose and respond to PE performance issues, and I’d like to introduce one that I’ve worked on personally, the puppet_metrics_dashboard module!
If you administer a PE or Puppet installation, and you expect that installation to grow (or even if you just geek out on observability and metrics), I highly recommend this module. Installing the module is as simple as downloading it from the Puppet Forge and applying the main class with a few parameters (I recommend you use an agent other than your master for this). For example:
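A minimal sketch of such a manifest, based on the parameters in the module’s README (hostnames here are placeholders, and the exact parameter set may vary by module version):

```puppet
# Apply this on a dedicated agent node, not on the master itself.
class { 'puppet_metrics_dashboard':
  add_dashboard_examples => true,                        # install the sample Grafana dashboards
  master_list            => ['master01.example.com'],    # masters to poll for metrics
  puppetdb_list          => ['puppetdb01.example.com'],  # PuppetDB instances to poll
}
```

To poll additional masters or PuppetDB instances, add their hostnames to the respective arrays.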
Applying the configuration above sets up a service called telegraf that begins polling your master for metrics. The metrics are stored in another service called influxdb, and rendered in a visualization tool (grafana) that you can access in a browser. You can configure telegraf to poll multiple masters and multiple PuppetDB instances; just add them to the master_list and puppetdb_list arrays above.
I really encourage folks who run Puppet infrastructure to get this up and running before you see a performance issue. It’s great to have a baseline set of metrics to compare against, just in case something changes and you start to notice a problem. This can help pinpoint when the problem first occurred and also what triggered it!
Here’s an example from working with a recent customer. The initial report was that agent runs were taking longer than usual. Once I had collected and graphed metrics from the master, I was presented with this graph:
Here we see some of their Puppet master’s key indicators around JRuby availability. JRubies are what a Puppet master uses to perform work. When an agent requests a catalog from a master, JRubies execute all the Ruby code involved in compiling and serving the catalog, and in processing the report. Each catalog could involve hundreds of requests for a JRuby. In this case, the blue line is the average wait time for a JRuby to become available. At the point highlighted here, this metric is about nine seconds (not great at all).
What’s even more indicative of an issue, though, is the average number of free JRubies. I’ve removed the blue line below so you can see this more clearly:
Average free JRubies (the green line) represents the number of JRubies that are ready to perform work at any given time. Within this set of metrics, there is never a point where this number is above zero. This means that agents are always waiting for a JRuby to become available before anything can happen during a Puppet run. The issue compounds, because a single agent run could involve hundreds of JRuby requests over the course of compiling and receiving a complete catalog.
In this case, our customer was able to increase the number of JRuby threads available to the master, which increased the overall throughput. With this fix in place, the slow agent runs quickly subsided!
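In PE, the size of the JRuby pool is typically raised through Hiera data (or the PE console). A sketch of the relevant setting, assuming a recent PE version; the value shown is illustrative, and the right number depends on the master’s CPU cores and memory:

```yaml
# Hiera data for the master node.
# Raises the JRuby pool from its default; size it to the
# master's available cores and heap (value is illustrative).
puppet_enterprise::master::puppetserver::jruby_max_active_instances: 6
```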
The example dashboards included with the module shorten the time it takes to diagnose and remediate the majority of the performance-related issues we see in Support. If you’re a PE customer and you need to contact support about a performance issue, this tool is likely to greatly reduce the time it takes to diagnose and mitigate your problem!
Erik Hansen is a senior support engineer at Puppet.