Preventative maintenance for PuppetDB: node-purge-ttl
Almost three years ago, I wrote a blog post on how to maintain your Puppet Enterprise console. That was a long time ago, and the Puppet Enterprise console has changed a lot since then. The biggest change is that we removed the console database, and now all the data in the console user interface (UI) is pulled from the PuppetDB database. As a result, points 1, 2, and 3 in the older blog post no longer apply to versions of PE since 2015.2, and maintaining your PuppetDB database is much more important now than it was back then.
Despite the improvements, the support team at Puppet still gets a lot of questions about how to maintain the databases, and we often tune a number of settings while diagnosing performance issues with the console UI. I’d like to discuss one of those settings today, the node-purge-ttl setting for PuppetDB.
The node-purge-ttl setting defaults to off. This means that when nodes are deactivated via the command line, or simply do not check in for the node-ttl period (seven days by default), the node, its facts, and its most recent catalog stay in the database, taking up space. Leaving these nodes in place can slow down the queries that always exclude deactivated and expired nodes.
Before we go off and enable the node-purge-ttl setting, we want to make sure we understand how many deactivated nodes are already in your PuppetDB database.
Determine how many expired nodes are in PuppetDB
Paste the following into /tmp/count_expired_nodes.sql:
\set day_interval '\'14 days\''
WITH limited_certnames AS (
  SELECT certname
  FROM certnames
  WHERE expired < NOW() - INTERVAL :day_interval
     OR deactivated < NOW() - INTERVAL :day_interval
)
SELECT COUNT(*)
FROM certnames
WHERE certname IN ( SELECT certname FROM limited_certnames );
We have a variable at the top that indicates we want to see all expired or deactivated nodes older than 14 days. We use 14 days as the default because this is the default for report-ttl. If report-ttl is more than 14 days, then increase the day_interval variable in the script to match report-ttl. It is best to not delete nodes from PuppetDB while they still have reports, as it takes longer than if those reports were already deleted by the report-ttl mechanism.
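To keep day_interval in step with report-ttl, it can help to translate the TTL notation used in PuppetDB's configuration (a number followed by a unit suffix such as d, h, m, s, or ms) into the wording PostgreSQL expects in an INTERVAL literal. The helper below is a minimal sketch of that conversion; the function name ttl_to_interval is hypothetical and is not part of PuppetDB or psql.

```shell
# Hypothetical helper: convert a PuppetDB-style TTL ("14d", "48h", "300ms")
# into a PostgreSQL interval string ("14 days", "48 hours", "300 milliseconds").
ttl_to_interval() {
  ttl="$1"
  case "$ttl" in
    *ms) num="${ttl%ms}"; unit="milliseconds" ;;  # must be tested before *s and *m
    *d)  num="${ttl%d}";  unit="days" ;;
    *h)  num="${ttl%h}";  unit="hours" ;;
    *m)  num="${ttl%m}";  unit="minutes" ;;
    *s)  num="${ttl%s}";  unit="seconds" ;;
    *)   num="" ;;
  esac
  # Reject anything that is not "<digits><unit>".
  case "$num" in
    ''|*[!0-9]*) echo "unrecognized TTL: $ttl" >&2; return 1 ;;
  esac
  echo "$num $unit"
}

# Example: ttl_to_interval 30d    # prints: 30 days
```

The printed value is what goes between the escaped quotes on the \set day_interval line of the script.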
We can run the script now with the following command:
su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb -f /tmp/count_expired_nodes.sql"
Delete expired nodes from PuppetDB
After inspecting the results of the query, we may find there are a few hundred nodes that can be deleted from the database. Save the following script to /tmp/delete_nodes.sql, and make sure to set day_interval as we did in the previous script.
\set day_interval '\'14 days\''
WITH limited_certnames AS (
  SELECT certname
  FROM certnames
  WHERE expired < NOW() - INTERVAL :day_interval
     OR deactivated < NOW() - INTERVAL :day_interval
  LIMIT 100
)
DELETE FROM certnames
WHERE certname IN ( SELECT certname FROM limited_certnames );
You can delete nodes with the following:
su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb -f /tmp/delete_nodes.sql"
After executing the query, we will see that it deleted 100 nodes. We can expect the command queue to grow while the delete runs; this is visible in the PuppetDB performance dashboard.
If there are more than 100 nodes to delete, keep executing the query until it returns that it has deleted 0 nodes.
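The repetition described above can be scripted. The following is a sketch, not a supported tool: it reruns /tmp/delete_nodes.sql, reads the row count from the "DELETE <n>" line that psql prints, and pauses between batches so that queued facts and catalogs have a chance to be stored. The function names and the PAUSE_SECONDS variable are hypothetical, and the 60-second default is an arbitrary assumption to tune against your own command queue.

```shell
#!/bin/sh
# Hypothetical wrapper around the delete script shown above.
# PAUSE_SECONDS is an assumed tuning knob, not a PuppetDB setting.
PAUSE_SECONDS="${PAUSE_SECONDS:-60}"

# Run one batch; psql is expected to print a line such as "DELETE 100".
run_delete_batch() {
  su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb -f /tmp/delete_nodes.sql"
}

# Read psql output on stdin and print the number from the last "DELETE <n>" line.
deleted_count() {
  sed -n 's/^DELETE \([0-9][0-9]*\)$/\1/p' | tail -n 1
}

# Repeat batches until a batch deletes 0 nodes.
purge_until_empty() {
  while :; do
    count=$(run_delete_batch | deleted_count)
    # Stop instead of looping forever if the output could not be parsed.
    if [ -z "$count" ]; then
      echo "could not parse psql output; stopping" >&2
      return 1
    fi
    echo "deleted $count nodes"
    if [ "$count" -eq 0 ]; then
      return 0
    fi
    # Give PuppetDB time to work through queued facts and catalogs.
    sleep "$PAUSE_SECONDS"
  done
}
```

Calling purge_until_empty as root starts the loop. Watching the command queue in the performance dashboard while it runs is a reasonable way to judge whether the pause is long enough.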
Now that we’ve completed this manual inspection and remediation, we can set node-purge-ttl to whatever value report-ttl is set to, and PuppetDB will automatically delete nodes that no longer exist in your infrastructure. This will save you space and improve the performance of your database, and consequently the performance of the console UI.
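Before changing the value, it can be useful to see what the TTLs are currently set to. The sketch below assumes the settings live in the [database] section of /etc/puppetlabs/puppetdb/conf.d/database.ini, which is the usual package layout. That path and the show_ttls function name are assumptions, and in Puppet Enterprise this file is typically managed by Puppet, so a lasting change belongs in your PE configuration rather than in a hand edit.

```shell
# Hypothetical helper: print the TTL settings found in a PuppetDB config file.
# The default path is an assumption; pass another path as the first argument.
show_ttls() {
  conf="${1:-/etc/puppetlabs/puppetdb/conf.d/database.ini}"
  # A setting that is absent from the file is simply not printed.
  grep -E '^[[:space:]]*(node-ttl|node-purge-ttl|report-ttl)[[:space:]]*=' "$conf"
}

# Example: show_ttls
# Example: show_ttls /path/to/another/database.ini
```

If node-purge-ttl and report-ttl both appear in the output, their values should match once the change described above is in place.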
Why did we have to do all of this manual checking and deleting of nodes? We know that node-purge-ttl will delete all nodes that are past the TTL, right? In order to delete a node, PuppetDB has to delete all of the facts, catalog, and report information for that node. So it can take a while to delete nodes, and the more you delete at one time, the longer you block incoming facts and catalogs from being stored. It is best to block for a little bit, let the facts and catalogs be stored, then block for a bit again. You really don't want to block for a long time and then have a backlog of facts and catalogs that takes hours for PuppetDB to work down.
Notes about the future
We would not normally recommend using queries directly in the PuppetDB database to make changes to PuppetDB. However, there is currently no way to limit the number of nodes purged by PuppetDB’s GC process. You can watch the following tickets; these should resolve that issue in future releases.
- PuppetDB node-purge should allow limiting the number of nodes to delete at one time - If the API allowed limiting the number of nodes to be purged, we could replace the database queries we used with API calls.
- Allow setting the interval for each of the garbage collection queries individually
- Change the node-purge-ttl default from 0 (don't delete deactivated nodes) to the same as report-ttl (defaults to 14 days) - If we default node-purge-ttl to the same value as report-ttl, we would presumably have to make sure that when users upgrade, the node purging doesn’t cause immediate performance issues.
Nick Walker is a principal customer support engineer at Puppet.