[prefix]:
The AST (abstract syntax tree) query language for PuppetDB is a language that presents itself as a raw AST format. It can be used to provide complex querying via REST on each of PuppetDB’s query endpoints.
This document outlines the operator syntax for this query language.
An easier to use alternative to this query language is the Puppet query language, which is largely based on the AST query language.
An AST query string passed to the query URL parameter of a REST endpoint must be a URL-encoded JSON array, which may contain scalar data types (usually strings) and additional arrays, that describes a complex comparison operation in prefix notation with an operator first and its arguments following.
That is, before being URL-encoded, all AST query strings follow this form:
[ "<OPERATOR>", "<ARGUMENT>", (..."<ARGUMENT>"...) ]
Different operators may take different numbers (and types) of arguments.
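As a rough illustration of the encoding step, the sketch below serializes a query to JSON and URL-encodes it into the query parameter using only Python's standard library. The base URL and the helper name build_query_url are assumptions made for this example; only the /pdb/query/v4/nodes path appears in this document.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL; adjust the host and port for your own PuppetDB instance.
BASE_URL = "http://localhost:8080"


def build_query_url(endpoint, ast_query):
    """Return a URL whose `query` parameter holds the URL-encoded JSON form of ast_query."""
    params = urlencode({"query": json.dumps(ast_query)})
    return f"{BASE_URL}{endpoint}?{params}"


url = build_query_url("/pdb/query/v4/nodes", ["=", "certname", "foo.com"])
print(url)

# Decoding the parameter again recovers the original JSON array.
decoded = json.loads(parse_qs(urlsplit(url).query)["query"][0])
print(decoded)
```

Building the query as a native list and serializing it last avoids hand-written quoting mistakes in the JSON.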
Each of these operators accepts two arguments: a field and a value. These operators are non-transitive, which means that their syntax must always be:
["<OPERATOR>", "<FIELD>", "<VALUE>"]
The available fields for each endpoint are listed in that endpoint’s documentation.
= (equality)
Works with: strings, numbers, timestamps, Booleans, arrays, multi, path.
Matches if: the field’s actual value is exactly the same as the provided value.
> (greater than)
Works with: numbers, timestamps, multi.
Matches if: the field is greater than the provided value.
< (less than)
Works with: numbers, timestamps, multi.
Matches if: the field is less than the provided value.
>= (greater than or equal to)
Works with: numbers, timestamps, multi.
Matches if: the field is greater than or equal to the provided value.
<= (less than or equal to)
Works with: numbers, timestamps, multi.
Matches if: the field is less than or equal to the provided value.
~ (regexp match)
Works with: strings, multi.
Matches if: the field’s actual value matches the provided regular expression. The provided value must be a regular expression represented as a JSON string:
- The string may not be surrounded by the slashes (/regexp/) that delimit regexps in many languages.
- Every backslash character must be escaped with an additional backslash. Thus, a sequence like \d would be represented as \\d, and a literal backslash (represented in a regexp as a double-backslash \\) would be represented as a quadruple-backslash (\\\\).
The following example would match if the certname field’s actual value resembled something like www03.example.com:
["~", "certname", "www\\d+\\.example\\.com"]
Note: Regular expression matching is performed by the database backend, so the available regexp features are determined by PostgreSQL. For best results, use the simplest and most common features that can accomplish your task.
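The backslash rule is easy to get wrong by hand. One way to sidestep it, sketched here in Python under the assumption that you build queries programmatically, is to write the regexp once with single backslashes and let a JSON serializer add the extra escaping.

```python
import json

# The regexp written with single backslashes; the raw string keeps them literal.
pattern = r"www\d+\.example\.com"

query = ["~", "certname", pattern]

# json.dumps doubles every backslash, producing the JSON form shown above.
encoded = json.dumps(query)
print(encoded)
```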
~> (regexp array match)
Works with: paths.
Matches if: the array matches using the regular expressions provided in each element. Array indexes are coerced to strings.
The following example would match any network interface names starting with “eth”:
["~>", "path", ["networking", "eth.*", "macaddress"]]
If you want to match any index for an array path element, you can use regular expressions, as the element acts like a string:
["~>", "path", ["array_fact", ".*"]]
null? (is null)
Works with: fields that may be null.
Matches if: the field’s value is null (when the second argument is true), or the field is not null, that is, has a real value (when the second argument is false).
The following example would return events that do not have an associated line number:
["null?", "line", true]
Similarly, the below query would return events that do have a specified line number:
["null?", "line", false]
Every argument of these operators should be a complete query string in its own right. These operators are transitive: the order of their arguments does not matter.
and
Matches if: all of its arguments would match. Accepts any number of query strings as its arguments.
or
Matches if: at least one of its arguments would match. Accepts any number of query strings as its arguments.
not
Matches if: its argument would not match. Accepts a single query string as its argument.
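Because every argument of a Boolean operator is itself a complete query, nested queries can be assembled from small helper functions. The following is a minimal sketch; the helper names and_, or_, and not_ are hypothetical and exist only for this example.

```python
import json


def and_(*queries):
    """Matches only if every sub-query matches."""
    return ["and", *queries]


def or_(*queries):
    """Matches if at least one sub-query matches."""
    return ["or", *queries]


def not_(query):
    """Matches if the single sub-query does not match."""
    return ["not", query]


# Events for either of two certnames that do have a line number.
query = and_(
    or_(["=", "certname", "foo.local"], ["=", "certname", "bar.local"]),
    not_(["null?", "line", True]),
)
print(json.dumps(query))
```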
extract
To reduce the keypairs returned for each result in the response, you can use extract:
["extract", ["hash", "certname", "transaction_uuid"],
["=", "certname", "foo.com"]]
When only extracting a single column, the [] are optional:
["extract", "transaction_uuid",
["=", "certname", "foo.com"]]
When applying an aggregate function over a group_by clause, an extract statement takes the form:
["extract", [["function", "count"], "status"],
["=", "certname", "foo.com"],
["group_by", "status"]]
Extract can also be used with a standalone function application:
["extract", [["function", "count"]], ["~", "certname", ".*.com"]]
or
["extract", [["function", "count"]]]
function
The function operator is used to call a function on the result of a subquery. Supported functions are described below.
avg, sum, min, max
These functions operate on any numeric column and take the column name as an argument, for example ["function", "avg", "value"].
count
The count function can be used with or without a column. When no column is supplied, it will return the number of results in the associated subquery. Using the function with a column will return the number of results where the specified column is not null.
to_string
The to_string function operates on timestamps and integers, allowing them to be formatted in a user-defined manner before being returned from PuppetDB. Available formats are the same as those documented for PostgreSQL’s to_char function.
group_by
The group_by operator must be applied as the last argument of an extract, and takes one or more column names as arguments. For instance, to get event counts for active nodes grouped by status and certname, you can query the events endpoint with:
["extract", [["function", "count"], "status", "certname"],
["group_by", "status", "certname"]]
To get the average uptime for your nodes:
["extract", [["function", "avg", "value"]], ["=", "name", "uptime_seconds"]]
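The extract, function, and group_by forms shown in this section follow a regular shape, so they can be generated rather than typed. The sketch below uses two hypothetical helpers, function and extract, that are assumptions of this example and not part of PuppetDB.

```python
def function(name, *args):
    """Build a function call such as ["function", "count"] or ["function", "avg", "value"]."""
    return ["function", name, *args]


def extract(columns, query=None, group_by=None):
    """Build an extract statement; group_by, when given, is appended as the last argument."""
    statement = ["extract", columns]
    if query is not None:
        statement.append(query)
    if group_by:
        statement.append(["group_by", *group_by])
    return statement


# Count of results per status for one certname.
per_status = extract(
    [function("count"), "status"],
    ["=", "certname", "foo.com"],
    group_by=["status"],
)

# Average of the uptime_seconds fact value.
average_uptime = extract([function("avg", "value")], ["=", "name", "uptime_seconds"])

print(per_status)
print(average_uptime)
```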
Note: Dot notation for hash descendence is under development. Currently it has full support on the facts and trusted response keys of the inventory endpoint, and partial support on the parameters column of the resources endpoint. It may be expanded to other endpoints in the future based on demand.
Certain types of JSON data returned by PuppetDB can be queried in a structured way using dot notation. The rules for dot notation are:
- Hash descendence is represented by a period-separated sequence of key names.
- Array indexing (inventory only) is represented with brackets ([]) on the end of a key.
- Regular expression matching (inventory only) is represented with the match keyword.
For example, given the inventory response
{
"certname" : "mbp.local",
"timestamp" : "2016-07-11T20:02:33.190Z",
"environment" : "production",
"facts" : {
"kernel" : "Darwin",
"operatingsystem" : "Darwin",
"macaddress_p2p0" : "0e:15:c2:d6:f8:4e",
"system_uptime" : {
"days" : 0,
"hours" : 1,
"uptime" : "1:52 hours",
"seconds" : 6733
},
"macaddress_awdl0" : "6e:31:ef:e6:36:54",
"processors": {
"models": [
"Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz",
"Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz",
"Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz",
"Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"],
"count": 4,
"physicalcount": 1
},
...
},
"trusted" : {
"domain" : "local",
"certname" : "mbp.local",
"hostname" : "mbp",
"extensions" : { },
"authenticated" : "remote"
}
}
valid queries would include
["=", "facts.kernel", "Darwin"]
["=", "facts.system_uptime.days", 0]
[">", "facts.system_uptime.hours", 0]
["~", "facts.processors.models[0]", "Intel.*"]
["=", "partitions.match(\"sda.*\").mount", "/home"]
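To make the addressing rules concrete, here is a purely local sketch that resolves a dotted path against a Python dictionary shaped like the inventory response. It only illustrates how a path maps onto nested data; it says nothing about how PuppetDB evaluates these queries, it does not handle the match keyword, and the resolve helper is hypothetical.

```python
import re

# One path segment: a key name, optionally followed by an array index in brackets.
_SEGMENT = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def resolve(document, path):
    """Follow a dotted path such as "facts.processors.models[0]" through nested data."""
    current = document
    for segment in path.split("."):
        parsed = _SEGMENT.match(segment)
        if parsed is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        current = current[parsed.group("key")]
        if parsed.group("index") is not None:
            current = current[int(parsed.group("index"))]
    return current


# A trimmed copy of the inventory response shown above.
inventory = {
    "facts": {
        "kernel": "Darwin",
        "system_uptime": {"days": 0, "hours": 1},
        "processors": {"models": ["Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"]},
    }
}

print(resolve(inventory, "facts.kernel"))
print(resolve(inventory, "facts.system_uptime.days"))
print(resolve(inventory, "facts.processors.models[0]"))
```

Splitting on periods means this sketch cannot address keys that themselves contain a period.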
Note: Setting the context at the top of the query is only supported on the root endpoint.
Setting context in a query allows you to choose the entity you are querying on. This augments the endpoint support we have today, whereby the endpoint decides the context. For example, /pdb/query/v4/nodes sets the context of the query to nodes.
from
The from operator allows you to choose the entity that you want to query and provide optional query and paging clauses to filter those results. This operator can be used at the top-level context of a query:
["from", "nodes", ["=", "certname", "myserver"]]
The from operator can also be used in a subquery for setting the context when using the in operator.
When querying a particular endpoint, such as /pdb/query/v4/nodes, the endpoint provides the context for the query. Querying the root endpoint requires specifying a context explicitly.
Paging operators (limit, offset, order_by)
PuppetDB allows specification of paging clauses within a “from” clause in a query or subquery. The limit and offset operators both accept an integer-valued argument, and order_by accepts a vector of either column names or vector pairs containing a column name and an ordering of “asc” or “desc”.
For example,
["limit", 1]
["offset", 1]
["order_by", ["certname"]]
["order_by", ["certname", ["timestamp", "desc"]]]
When no ordering is explicitly specified, as in the case of “certname” in the example above, ascending order is assumed. Here are a few examples of queries using paging operators:
Return the most recent ten reports for a certname:
["from", "reports",
["=", "certname", "myserver"],
["order_by", [["timestamp", "desc"]]],
["limit", 10]]
Return the next page of ten reports:
["from", "reports",
["=", "certname", "myserver"],
["order_by", [["timestamp", "desc"]]],
["limit", 10],
["offset", 10]]
Return the most recent ten reports for any certname:
["from", "reports",
["order_by", [["timestamp", "desc"]]],
["limit", 10]]
Return the nodes represented in the ten most recent reports:
["from", "nodes",
["in", "certname",
["from", "reports",
["extract", "certname"],
["limit", 10],
["order_by", [["certname", "desc"]]]]]]
The order in which paging operators are supplied does not matter.
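Paging through results amounts to keeping the filter and ordering fixed while advancing the offset. The sketch below generates the from query for a given zero-based page; the paged_query helper and its parameter names are assumptions of this example.

```python
def paged_query(entity, query, order_by, page, page_size):
    """Build a "from" query for one page of results (pages are numbered from zero)."""
    clauses = [["order_by", order_by], ["limit", page_size]]
    if page > 0:
        # Skip the rows that belong to the earlier pages.
        clauses.append(["offset", page * page_size])
    return ["from", entity, query, *clauses]


newest_first = [["timestamp", "desc"]]
first_page = paged_query("reports", ["=", "certname", "myserver"], newest_first, 0, 10)
second_page = paged_query("reports", ["=", "certname", "myserver"], newest_first, 1, 10)

print(first_page)
print(second_page)
```

A stable order_by is what makes consecutive pages line up, so the sketch always emits one.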
Subqueries allow you to correlate data from multiple sources or multiple rows. For instance, a query such as “fetch the IP addresses of all nodes with Class[Apache]” would have to use both facts and resources to return a list of facts.
There are two forms of subqueries, implicit and explicit, and both forms work the same under the hood. Note, however, that the implicit form only requires you to specify the related entity, while the explicit form requires you to specify exactly how data should be joined during the subquery.
subquery (implicit subqueries)
Implicit subqueries work like most operators, and simply require you to specify the related entity and the query to use:
["subquery", "<ENTITY>", <SUBQUERY STATEMENT>]
The <ENTITY> is the particular entity you are subquerying on. However, not all entities are implicitly relatable to all other entities, as not every relationship makes sense. Consult the documentation for the chosen <ENTITY> for details on what implicit relationships are supported.
In PuppetDB, we keep a map of how different entities relate to each other, and therefore no data beyond the entity is needed in this case. This is different from explicit subqueries, where you must specify how two entities are related. Implicit subqueries can be used to join any two entities that have a certname field. Additional relationships are described in the endpoint-specific documentation as applicable.
A query string like the following on the nodes endpoint will return the list of all nodes with the Package[Tomcat] resource in their catalog, and a certname starting with web1:
["and",
["~", "certname", "^web1"],
["subquery", "resources",
["and",
["=", "type", "Package"],
["=", "title", "Tomcat"]]]]
To display the entire networking fact for hosts that have an interface with a certain MAC address, you can do the following on the facts endpoint:
["and",
["=", "name", "networking"],
["subquery", "fact_contents",
["and",
["~>", "path", ["networking", ".*", "macaddresses", ".*"]],
["=", "value", "aa:bb:cc:dd:ee:00"]]]]
While implicit subqueries can make your syntax succinct, not all relationships are mapped internally. For these more advanced subqueries, you need to specify exactly the fields that a subquery should join on. This is where an explicit subquery can be useful.
Explicit subqueries are unlike the other operators listed above. They always appear together in one of the following forms:
["in", ["<FIELDS>"], ["extract", ["<FIELDS>"], <SUBQUERY STATEMENT>] ]
The second, newer form uses from to set the context, and looks like this:
["in", ["<FIELDS>"], ["from", <ENTITY>, ["extract", ["<FIELDS>"], <SUBQUERY>] ] ]
That is:
- The in operator results in a complete query string. The extract operator and the subqueries do not.
- An in statement must contain one or more fields and an extract statement.
- An extract statement must contain one or more fields and a subquery statement.
These statements work together as follows (working “outward” and starting with the subquery):
- The subquery collects a group of PuppetDB objects, each of which has many fields.
- The extract statement collects the value of one or more fields across every object returned by the subquery.
- The in statement matches if its field values are present in the list returned by the extract statement.

Subquery | Extract | In |
---|---|---|
Every resource whose type is “Class” and title is “Apache.” (Note that all resource objects have a certname field, among other fields.) | Every certname field from the results of the subquery. | Match if the certname field is present in the list from the extract statement. |
The complete in statement described in the table above would match any object that shares a certname with a node that has Class[Apache]. This could be combined with a Boolean operator to get a specific fact from every node that matches the in statement.
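The subquery, extract, and in steps can be mimicked over in-memory data to see what each one contributes. The sketch below uses made-up sample records, not real PuppetDB output, and plain Python comprehensions instead of any PuppetDB API.

```python
# Made-up sample data standing in for PuppetDB resources and facts.
resources = [
    {"certname": "web1.example.com", "type": "Class", "title": "Apache"},
    {"certname": "db1.example.com", "type": "Class", "title": "Postgresql"},
    {"certname": "web2.example.com", "type": "Class", "title": "Apache"},
]
facts = [
    {"certname": "web1.example.com", "name": "ipaddress", "value": "10.0.0.1"},
    {"certname": "db1.example.com", "name": "ipaddress", "value": "10.0.0.2"},
    {"certname": "web2.example.com", "name": "ipaddress", "value": "10.0.0.3"},
    {"certname": "web1.example.com", "name": "kernel", "value": "Linux"},
]

# Subquery: every resource whose type is "Class" and title is "Apache".
subquery_results = [
    r for r in resources if r["type"] == "Class" and r["title"] == "Apache"
]

# Extract: the certname field from each result of the subquery.
extracted = {r["certname"] for r in subquery_results}

# In, combined with a Boolean "and": ipaddress facts whose certname is in the list.
matches = [
    f for f in facts if f["name"] == "ipaddress" and f["certname"] in extracted
]
addresses = [f["value"] for f in matches]
print(addresses)
```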
in
An in statement constitutes a full query string, which can be used alone or as an argument for a Boolean operator.
“In” statements are non-transitive and take two arguments:
- The first argument must consist of one or more fields for the endpoint or entity being queried. This is a string or vector of strings.
- The second argument must be one of:
  - an extract statement, which acts as a list of fields to extract during the subquery for matching against the fields in the in clause.
  - a from statement, which sets the context, and allows for an extract statement to be provided.
  - an array statement, which acts as a list of values to match against the field in the in clause.
Matches if: the field values are included in the list of values created by the extract or from statement.
array
An in statement also accepts an array statement as a second argument. “Array” statements take a single vector argument of values to match the first argument of in against.
The following query filters for the nodes foo.local, bar.local, and baz.local:
["in", "certname",
["array",
["foo.local",
"bar.local",
"baz.local"]]]
which is equivalent to the following query:
["or",
["=","certname","foo.local"],
["=","certname","bar.local"],
["=","certname","baz.local"]]
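The equivalence between in with array and a chain of = comparisons joined by or can be expressed as a small rewrite. The following sketch assumes the simple single-field form used in the example above; the expand_in_array helper is hypothetical.

```python
def expand_in_array(query):
    """Rewrite ["in", field, ["array", values]] as an "or" of "=" comparisons."""
    operator, field, second = query
    if operator != "in" or second[0] != "array":
        raise ValueError("expected an in statement with an array statement")
    values = second[1]
    return ["or", *(["=", field, value] for value in values)]


in_query = ["in", "certname", ["array", ["foo.local", "bar.local", "baz.local"]]]
or_query = expand_in_array(in_query)
print(or_query)
```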
The in-array operators support much of the same syntax as the = operator. For example, the following query on the /nodes endpoint is valid:
["in", ["fact", "uptime_seconds"],
["array",
[20000.0,
150.0,
30000.0]]]
from
This statement works like the top-level from operator, and expects an entity as the first argument and an optional query as the second argument. However, when used within an in clause, an extract statement is expected to choose the fields:
["in", "certname",
["from", "facts",
["extract", "certname",
[<QUERY>]]]]
extract
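“Extract” statements are non-transitive and take two arguments:
- The first argument must be a valid set of fields for the endpoint being subqueried (see the second argument). This is a string or vector of strings.
- The second argument must contain a subquery statement or, when used with the from operator, may contain an optional query.
As the second argument of an in statement, an extract statement acts as a list of possible values. This list is compiled by extracting the value of the requested field from every result of the subquery.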
select_<ENTITY> subquery statements
A subquery statement does not constitute a full query string. It may only be used as the second argument of an extract statement.
Subquery statements are non-transitive and take two arguments:
- The first argument must be the name of one of the available subqueries (listed below).
- The second argument must be a full query string that makes sense for the endpoint being subqueried.
As the second argument of an extract statement, a subquery statement acts as a collection of PuppetDB objects. Each of the objects returned by the subquery has many fields; the extract statement takes the value of one field from each of those objects, and passes that list of values to the in statement that contains it.
Each subquery acts as a normal query to one of the PuppetDB endpoints. For info on constructing useful queries, see the docs page for the endpoint matching the subquery:
select_catalogs
select_edges
select_environments
select_events
select_facts
select_fact_contents
select_fact_paths
select_nodes
select_producers
select_reports
select_resources
This query string queries the /facts endpoint for the IP address of all nodes with Class[Apache]:
["and",
["=", "name", "ipaddress"],
["in", "certname",
["extract", "certname",
["select_resources",
["and",
["=", "type", "Class"],
["=", "title", "Apache"]]]]]]
This query string queries the /nodes endpoint for all nodes with Class[Apache]:
["in", "certname",
["extract", "certname",
["select_resources",
["and",
["=", "type", "Class"],
["=", "title", "Apache"]]]]]
This query string queries the /facts endpoint for the IP address of all Debian nodes:
["and",
["=", "name", "ipaddress"],
["in", "certname",
["extract", "certname",
["select_facts",
["and",
["=", "name", "operatingsystem"],
["=", "value", "Debian"]]]]]]
This query string queries the /facts endpoint for uptime_hours of all nodes with facts_environment production:
["and",
["=", "name", "uptime_hours"],
["in", "certname",
["extract", "certname",
["select_nodes",
["=", "facts_environment", "production"]]]]]
To find node information for a host that has a macaddress of aa:bb:cc:dd:ee:00 as its first macaddress on the interface eth0, you could use this query on /nodes:
["in", "certname",
["extract", "certname",
["select_fact_contents",
["and",
["=", "path", ["networking", "eth0", "macaddresses", 0]],
["=", "value", "aa:bb:cc:dd:ee:00"]]]]]
To exhibit a subquery using multiple fields, you could use the following on ‘/facts’ to list all top-level facts containing fact contents with paths starting with “up” and value less than 100:
["in", ["certname", "name"],
["extract", ["certname", "name"],
["select_fact_contents",
["and",
["~>", "path", ["up.*"]],
["<", "value", 100]]]]]
Queries are restricted to active nodes by default; to make this explicit, the special “node_state” field may be queried using the values “active”, “inactive”, or “any”. For example, to list all catalogs from inactive nodes, use this on the /catalogs endpoint:
["=", "node_state", "inactive"]
This expands internally into comparisons against each node’s deactivation and expiration time; a node is considered inactive if either field is set.
Subquery examples (from operator)
Additions to the query language in support of PQL introduced new ways to express subqueries using the from operator. For example, a query such as this:
["and",
["=", "name", "ipaddress"],
["in", "certname",
["extract", "certname",
["select_resources",
["and",
["=", "type", "Class"],
["=", "title", "Apache"]]]]]]
will now look like this:
["and",
["=", "name", "ipaddress"],
["in", "certname",
["from", "resources",
["extract", "certname",
["and",
["=", "type", "Class"],
["=", "title", "Apache"]]]]]]
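The change between the two forms is mechanical: the entity name moves out of select_<ENTITY> into a from statement, and the inner query moves into the extract. The sketch below automates that rewrite for the shapes shown in this document only. It is not an official migration tool, and the modernize helper is hypothetical.

```python
def modernize(query):
    """Rewrite ["extract", fields, ["select_<entity>", q]] as
    ["from", "<entity>", ["extract", fields, q]], recursing through nested queries."""
    if not isinstance(query, list):
        return query
    if (
        len(query) == 3
        and query[0] == "extract"
        and isinstance(query[2], list)
        and query[2]
        and isinstance(query[2][0], str)
        and query[2][0].startswith("select_")
    ):
        select = query[2]
        entity = select[0][len("select_"):]
        inner = [modernize(part) for part in select[1:]]
        return ["from", entity, ["extract", query[1], *inner]]
    return [modernize(part) for part in query]


old_style = [
    "and",
    ["=", "name", "ipaddress"],
    ["in", "certname",
     ["extract", "certname",
      ["select_resources",
       ["and",
        ["=", "type", "Class"],
        ["=", "title", "Apache"]]]]],
]
new_style = modernize(old_style)
print(new_style)
```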
Executing this query on the /facts endpoint would filter for uptime_hours for all nodes with facts_environment set to production:
["and",
["=", "name", "uptime_hours"],
["in", "certname",
["from", "nodes",
["extract", "certname",
["=", "facts_environment", "production"]]]]]
To find node information for a host that has a macaddress of aa:bb:cc:dd:ee:00 as its first macaddress on the interface eth0, you could use this query on /nodes:
["in", "certname",
["from", "fact_contents",
["extract", "certname",
["and",
["=", "path", ["networking", "eth0", "macaddresses", 0]],
["=", "value", "aa:bb:cc:dd:ee:00"]]]]]