published on 8 July 2014

PuppetDB is the storage service for Puppet-generated data. Designed to empower Puppet deployments, and built on technologies chosen for their high performance, PuppetDB is highly parallel, making full use of available resources. It stores data asynchronously, freeing up the master to compile more catalogs.

For our latest release, PuppetDB 2.1.0, we've significantly streamlined our HTTP handling, completely refactored our query subsystem, and beefed up report storage.

As a result of this work, we’re able to provide some new benefits for you:

  • New query engine for v4. Introduces broader operator, subquery and field query support across our HTTP endpoints.
  • Streaming support on all collections. Reduces memory consumption and the risk of out-of-memory errors when querying large data sets.
  • Report status. Failed reports, and report status in general, are finally stored alongside successful reports.

I’ll dive into the report status change first.

Report Status

When Puppet runs on a system, it generates reports showing the success or failure of each event as Puppet is applied. If something goes wrong — for example, the catalog fails to compile — then the report shows failed beside that event.

In the past, PuppetDB would not store a failed event. To see these, you had to go to the Puppet master log, where failed events were logged as warnings.

The new version of PuppetDB supports submitting reports of failed events. The status field can be populated with failed, changed or unchanged. Now you can query for failed reports, and quickly distinguish between changed and unchanged ones.

As an example, here we are filtering to see only reports with an unchanged status field:

# curl 'http://localhost:8080/v4/reports?query=\["=","status","unchanged"\]'
[ {
  "hash" : "a00e322780b38a9b5916d8c324afadd8fa29e7d0",
  "puppet-version" : "3.6.2",
  "receive-time" : "2014-06-30T13:22:47.994Z",
  "report-format" : 4,
  "start-time" : "2014-06-30T13:22:44.942Z",
  "end-time" : "2014-06-30T13:22:45.761Z",
  "transaction-uuid" : "f1dc670a-ec77-4525-922b-82ebcaa5e409",
  "status" : "unchanged",
  "environment" : "production",
  "configuration-version" : "1404134567",
  "certname" : "kb.local"
} ]
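
Querying for failed reports works the same way; just swap the status value (output omitted here):

# curl 'http://localhost:8080/v4/reports?query=\["=","status","failed"\]'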

New Query Engine for v4

Inconsistencies in common code paths can cause a lot of hassle when you’re trying to develop new features or fix bugs in any piece of software. In our prior release, we extended the query capabilities within PuppetDB to add environment awareness. During this work, we realized that feature support in the PuppetDB query language was mildly disjointed in our code: trying to do something seemingly simple, like adding a new endpoint or adding support for querying timestamps and other field types, was not always trivial for our developers.

To clean up this inconsistency problem (and to save our sanity), we decided to invert the logic internally. Instead of having to add support for a particular feature to a particular endpoint, we’ve written a new “type aware” query engine that allows all operators supported for a particular field type to be used on any endpoint. The engine now supports a more declarative way of defining the PuppetDB schema, including endpoints, fields and operators.

This query engine is exclusive to the v4 endpoint, and will be the way forward from here on. Older versions of the query API will continue to use the old engine for now.

Because everything now goes through a common query engine, we’ll also be able to add new endpoints and fields more rapidly as we add new features. So this really is an exciting change for us all, one that will pay dividends going forward.

For more details on what is now supported in version 4 of the query API, I recommend reading the formal documentation here: http://docs.puppetlabs.com/puppetdb/2.1/api/query/v4/query.html

Nodes

Previously, we supported querying only on the name and ["fact", <name>] fields. Now that we’ve added proper handling for the timestamp type, you can query on the catalog-timestamp, facts-timestamp and report-timestamp fields, including constraining the results to a time range with operators such as < and >.

For example, if you wanted to see only nodes that have had their latest reports submitted before a particular time, you could provide a query such as:

# curl 'http://localhost:8080/v4/nodes?query=\["<","report-timestamp","2014-02-27T16:34:15.800Z"\]'
[ {
  "deactivated" : null,
  "report-environment" : "foo",
  "certname" : "node-0",
  "facts-timestamp" : "2014-07-01T14:06:33.835Z",
  "facts-environment" : "foo",
  "catalog-timestamp" : null,
  "report-timestamp" : "2014-02-27T16:34:15.781Z",
  "catalog-environment" : null
} ]
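
You can also combine the comparison operators with and to constrain results to a bounded time range; the timestamps here are just placeholders:

# curl 'http://localhost:8080/v4/nodes?query=\["and",\[">","report-timestamp","2014-02-01T00:00:00.000Z"\],\["<","report-timestamp","2014-02-27T16:34:15.800Z"\]\]'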

Reports

We now support range querying (using the < and > style operators) on end-time, receive-time and other date fields. We also support the normal operators on all of the other fields, such as hash, report-format and puppet-version.

# curl 'http://localhost:8080/v4/reports?query=\["<","receive-time","2014-06-18T21:23:01.980Z"\]'
[ {
  "hash" : "e2b4cb9339bc14f04bc2978b60999702bc5985dc",
  "puppet-version" : "3.6.1",
  "receive-time" : "2014-06-18T17:49:40.429Z",
  "report-format" : 4,
  "start-time" : "2014-06-18T17:49:37.293Z",
  "end-time" : "2014-06-18T17:49:37.690Z",
  "transaction-uuid" : "219f5fb0-63a0-40c0-a86e-5463556d822d",
  "status" : null,
  "environment" : null,
  "configuration-version" : "1403113780",
  "certname" : "kb.local"
} ]
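
The same applies to those other fields; for example, an equality query on puppet-version (result omitted):

# curl 'http://localhost:8080/v4/reports?query=\["=","puppet-version","3.6.1"\]'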

Resources

The resources endpoint has also received an improvement. In the past, you couldn’t do regular expression queries on resource parameters, but now, with the new query engine, you can. For example, if you wanted to search for all resources whose name parameter matches the regular expression ma.*, you could do the following:

# curl 'http://localhost:8080/v4/resources?query=\["~",\["parameter","name"\],"ma.*"\]'
[ {
  "tags" : [ "stage" ],
  "file" : null,
  "type" : "Stage",
  "title" : "main",
  "line" : null,
  "resource" : "19f758a34555e7897f89df3f6f96f3c4a8185616",
  "environment" : "production",
  "certname" : "kb.local",
  "parameters" : {
    "alias" : [ "main" ],
    "name" : "main"
  },
  "exported" : false
}, {
  "tags" : [ "class" ],
  "file" : null,
  "type" : "Class",
  "title" : "main",
  "line" : null,
  "resource" : "02eedf3d0c02cba430c3e9b147a69d9fa5d15c60",
  "environment" : "production",
  "certname" : "kb.local",
  "parameters" : {
    "alias" : [ "main" ],
    "name" : "main"
  },
  "exported" : false
} ]
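
A parameter match can also be combined with other fields; for example, restricting the same regular expression to Class resources only (a variation on the query above):

# curl 'http://localhost:8080/v4/resources?query=\["and",\["=","type","Class"\],\["~",\["parameter","name"\],"ma.*"\]\]'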

Subquery: select-nodes

The changes in PuppetDB 2.1.0 also extend subquery support to other logical types. For now we’ve added select-nodes. Since adding these is now simpler, we hope to eventually provide full subquery support for all endpoints.

For example, let’s say you wanted the memorytotal fact for any nodes that had reported earlier than a particular timestamp:

# curl 'http://localhost:8080/v4/facts?query=\["and",\["=","name","memorytotal"\],\["in","certname",\["extract","certname",\["select-nodes",\["<","report-timestamp","2014-02-27T16:34:15.800Z"\]\]\]\]\]'
[ {
  "certname" : "node-0",
  "name" : "memorytotal",
  "value" : "16.00 GB",
  "environment" : "foo"
} ]

Streaming Support on All Collections

It’s often buried in the release notes, but over time we have been working hard to reduce PuppetDB’s memory consumption during normal usage. I want to explain more about our efforts on this front, in this particular case the completion of streaming responses for all endpoints.

One of the major causes of memory consumption has been the need to do a full query and JSON serialization in memory before returning results. For example, if you wanted to query all facts, the entire fact list across your enterprise would have had to be temporarily stored in memory, before being returned to the client.

This not only caused a very obvious pause before any results were returned on the HTTP request, but also increased the chance of an “out of memory” JVM error. In addition, the extra memory pressure made heavier garbage collection runs more likely, slowing down the overall performance of PuppetDB.

Previously, we had switched the facts and events endpoints over to streaming JSON responses: upon receiving an HTTP request, a cursor was opened to the database in another thread, and results were serialized and delivered directly to the user as a JSON stream. This meant that, instead of the full result set plus the full serialized JSON string being held in memory, only the part being processed during streaming used any memory.

Now we’ve finally switched this on for all endpoints that return collections, which will help the general effort to reduce needless RAM consumption. Not only that, the code for each endpoint is now internally the same, removing the special casing that previously existed only for the facts and events endpoints. Because all endpoints now share the same code, any development work we perform on endpoints going forward should be simpler.
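
There’s nothing special a client needs to do to benefit. As a rough illustration, the first bytes of a large collection now arrive almost immediately, rather than after the whole result set has been serialized; for example, with curl’s own buffering disabled via -N:

# curl -sN 'http://localhost:8080/v4/facts' | head -c 200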

Going Forward

As we learn more about how PuppetDB data is used, we often find that people want more querying capabilities surfaced. The changes we’ve introduced are a step toward providing just that. We believe that when querying PuppetDB, you shouldn’t have to worry about missing support for particular field types and operators, and that is what we’re working to ensure. While all this work isn’t yet complete, we’ve laid a good foundation for how we will continue developing this part of the software.

From the entire PuppetDB engineering team, many thanks to all of you for your continued support. We hope that you enjoy this new release as much as we enjoyed coding it.

Ken Barber is a software engineer at Puppet Labs.
