We’re excited to announce the release of PuppetDB 1.6. This release focuses on performance.
TL;DR: Most users of PuppetDB should see faster queries, faster catalog and fact updates, and a lighter load on the database server.
We have taken a close look at PuppetDB and tuned all aspects of the system, from how we determine whether or not a catalog has changed to more efficiently storing catalog and fact-related changes. Incoming and outgoing data is now compressed when the client supports it, and we have reduced our memory usage through streaming of database queries.
In case you aren’t familiar with it, PuppetDB is the centralized storage service for Puppet Enterprise; you can read about how it fits in with Puppet Enterprise 3.x.
Optimizing PuppetDB Storage
Most Puppet runs do not produce real catalog changes; typically a user-initiated action, such as adding or removing a user from a node, is required.
A typical install has changes on roughly 10 percent of Puppet agent runs. PuppetDB avoids storing catalogs that have not changed by storing a catalog’s hash and comparing a new catalog’s hash with that node’s last known hash. When these hashes don’t match, an update needs to be stored.
Prior to PuppetDB 1.6, a new catalog was stored and the previous version of the catalog was deleted later, as a background process.
Better performance through improved deduplication
We have improved how we structure the catalog data before hashing, so there are significantly fewer false positives. A false positive in this case is when two catalogs hash differently but have the same content. Unimportant differences, such as the order of keys in the JSON resource parameters map, could cause one of these false positives.
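PuppetDB itself is written in Clojure, but the canonicalize-before-hashing idea is easy to sketch in a few lines of Python (function names here are illustrative, not PuppetDB's own):

```python
import hashlib
import json

def naive_hash(catalog_json):
    # Hashing the raw JSON text is sensitive to key order:
    # the same content serialized differently hashes differently.
    return hashlib.sha1(catalog_json.encode()).hexdigest()

def canonical_hash(catalog):
    # Serialize with sorted keys and fixed separators so that
    # semantically identical catalogs always hash the same.
    canonical = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode()).hexdigest()

# Two representations of the same resource parameters, keys reordered.
a = '{"ensure": "present", "mode": "0644"}'
b = '{"mode": "0644", "ensure": "present"}'

assert naive_hash(a) != naive_hash(b)  # false positive: looks "changed"
assert canonical_hash(json.loads(a)) == canonical_hash(json.loads(b))
```

With a canonical form, only genuine content changes produce a new hash, which is what drives the improved deduplication rates.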
Early reports from the field have shown users who previously had deduplication rates in the 0-10 percent range jumping up to the 60-70 percent range. This has a massive impact on performance, as the fastest way to persist data is to already have it persisted! PuppetDB users with multiple puppetmasters will see the biggest improvements with this change.
Better performance from optimized database storage
In previous versions of PuppetDB, when a catalog change was detected, the catalog was persisted in its entirety as a new catalog. Catalog sizes vary, but typically this is several thousand resources, and potentially more edges. Even a small single resource addition to a catalog would result in persisting several thousand rows to the database.
In 1.6, we have traded more expensive database writes for cheaper database reads. We now analyze the incoming data, compare it to the previously stored data for the node, and store only the incremental differences. This results in a massive reduction in IO utilization when deduplication fails. We have observed a reduction in database writes of over 95 percent in the common case described above. In terms of wall clock time, persisting a catalog in 1.6 takes roughly half the time it did in 1.5.2.
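The idea of storing only incremental differences can be sketched as a set comparison between the stored and incoming catalogs, keyed by resource reference (this is a simplified Python illustration, not PuppetDB's actual implementation):

```python
def catalog_delta(old, new):
    """Given two catalogs as {resource_ref: params} maps, return the
    minimal inserts, deletes, and updates needed to persist the new one."""
    inserts = {ref: new[ref] for ref in new.keys() - old.keys()}
    deletes = set(old.keys() - new.keys())
    updates = {ref: new[ref] for ref in new.keys() & old.keys()
               if new[ref] != old[ref]}
    return inserts, deletes, updates

old = {"File[/etc/motd]": {"ensure": "present"},
       "User[alice]": {"ensure": "present"}}
new = {"File[/etc/motd]": {"ensure": "present"},
       "User[alice]": {"ensure": "present"},
       "User[bob]": {"ensure": "present"}}

inserts, deletes, updates = catalog_delta(old, new)
# Only the single added resource is written; the unchanged
# thousands of rows in a real catalog are left untouched.
```

The extra reads to compute the delta are cheap compared with rewriting every row of a multi-thousand-resource catalog.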
For facts, a similar common case is that a few facts change on every facter run. Facts like “uptime_seconds” and “uptime_hours,” by their nature, change constantly. Previous versions of PuppetDB would always delete and re-insert all facts if a change was detected. With PuppetDB 1.6, we have observed a 98 percent reduction in database writes of facts for this common case.
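To see why the savings are so large for facts, compare the write counts of the two strategies on a node where only one fact value changed (fact names here are examples):

```python
stored = {"osfamily": "Debian", "processorcount": "4",
          "uptime_seconds": "12345", "uptime_hours": "3"}
incoming = dict(stored, uptime_seconds="12405")

# Old approach: delete every stored fact, then insert every incoming fact.
old_writes = len(stored) + len(incoming)

# New approach: write only the facts whose values actually changed.
changed = {k: v for k, v in incoming.items() if stored.get(k) != v}
new_writes = len(changed)

print(old_writes, new_writes)  # prints: 8 1
```

A real node has dozens of facts, so eliminating the delete-and-reinsert cycle removes nearly all fact writes on a typical run.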
Fewer database writes mean less load on the database, lower disk usage, and faster persistence of catalogs and facts.
Better debugging of deduplication issues
Low catalog deduplication rates can result in poor database and PuppetDB performance. Although catalog changes are less costly with the newly optimized database storage, a low deduplication rate can still indicate a problem. To help diagnose these problems, we now let users configure a directory where we can dump diagnostic files whenever deduplication isn't happening as often as expected. With these files and standard Unix tools like diff, users can see exactly what is changing from one Puppet run to the next. For more information on this feature, see the PuppetDB docs.
Other Database Performance Improvements
Better performance from supporting separate read-only databases
Postgres supports [Hot Standby](http://wiki.postgresql.org/wiki/Hot_Standby), one technique for scaling Postgres: one database handles writes while a separate database handles reads. PuppetDB now supports this setup by allowing read-only queries to be directed at the hot standby, resulting in improved IO throughput for writes and faster reads. For more information on configuring a read database for PuppetDB, see the docs.
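A configuration along these lines points writes at the primary and reads at the standby (hostnames are illustrative; consult the PuppetDB docs for the authoritative section and option names):

```ini
[database]
classname = org.postgresql.Driver
subprotocol = postgresql
subname = //primary.example.com:5432/puppetdb

[read-database]
classname = org.postgresql.Driver
subprotocol = postgresql
subname = //standby.example.com:5432/puppetdb
```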
Better performance from streaming queries
Previously, we'd load all rows from a resource or fact query into RAM, then do a bunch of sorting and aggregation to transform them into the format that clients expect. That has obvious problems with RAM usage for large result sets. Furthermore, all of the query work was done up front: if a client disconnected, the query continued to tax the database until it completed. And lastly, nothing could be sent to the client until all the query results had been paged into RAM.
New streaming support in 1.6 massively reduces RAM usage and time-to-first-result. This is achieved by converting the results from the database to JSON, row by row, while the results are still being returned from the database. As each row is converted to JSON, it’s immediately sent to the client via HTTP. Results will not be explicitly sorted unless the client has requested it.
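The row-by-row conversion can be sketched with a Python generator standing in for a database cursor (PuppetDB's real implementation is Clojure over JDBC; names here are illustrative):

```python
import json

def stream_rows(cursor_rows):
    """Lazily serialize result rows to a JSON array, yielding each
    chunk as soon as its row is available instead of buffering the
    entire result set in memory."""
    yield "["
    for i, row in enumerate(cursor_rows):
        prefix = "," if i else ""
        yield prefix + json.dumps(row)
    yield "]"

# Simulated cursor: rows arrive one at a time from the database.
rows = ({"certname": "node%d" % i, "value": i} for i in range(3))

# An HTTP layer would write each chunk straight to the client;
# here we just collect them to show the final body.
body = "".join(stream_rows(rows))
```

Because each chunk is emitted as soon as its row arrives, memory stays flat regardless of result-set size and the client sees its first bytes almost immediately.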
Better performance through compression
We now compress the output of our APIs with GZIP when clients request it. This compression trades somewhat higher CPU usage for greatly reduced bandwidth requirements. For more information on GZIP compression over HTTP, see How gzip compression works.
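Because API responses are repetitive JSON, they compress extremely well. A quick Python check of the trade-off (the payload shape is made up for illustration):

```python
import gzip
import json

# A response-like payload: many structurally similar JSON records.
payload = json.dumps(
    [{"certname": "node1", "facts": {"osfamily": "Debian"}}] * 100
).encode()

compressed = gzip.compress(payload)

# The gzipped body is a small fraction of the original size.
print(len(payload), len(compressed))
```

A client opts in by sending an `Accept-Encoding: gzip` request header, and decompression on receipt recovers the identical bytes.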
If you’re noticing any other differences in performance with the new PuppetDB release, we’d like to hear about it. Please add your comments below.