Event:

Presenter:
Deepak Giridharagopal
Company:
Puppet Labs

PuppetDB gives users fast, robust, centralized storage for Puppet-produced data. The 1.0 version landed at PuppetConf 2012, and now we're one year older and one year wiser. It's been deployed in thousands of sites, people have written libraries and tools on top of it, and there's been plenty of activity in the past year. We've tightly integrated it into Puppet Enterprise. We've added new features like report storage, event querying, import/export, better HTTP endpoints, and unified querying. And though we've added features, we've also made PuppetDB faster and made it consume less disk space. This talk will cover what's happened in the PuppetDB world between PuppetConf 2012 and now. We'll go into the new features, talk about performance and correctness, and discuss lessons learned.

Presenter Bio:

Deepak is Director of Engineering at Puppet Labs, one of the authors of PuppetDB, and a many-times-over PuppetConf veteran. Prior to joining Puppet Labs, he was Principal Engineer at Dell/MessageOne, using Puppet to manage thousands of production systems.

Learn more:
  • Ready to dig in? The Learning Puppet VM provides you with a safe, convenient virtual environment to get started.
  • Just getting started with Puppet open source or Puppet Enterprise? All our docs are available as downloadable PDFs for easy reference.
  • PuppetConf 2014 will be back in San Francisco. Developer Day will be held September 22, PuppetConf will be held September 23-24. Save the date!
Transcript:

Deepak Giridharagopal: So this is me, [email protected], if you have any further questions or want to sign me up for awesome email forwards about discount shoes or cat pictures. That’s me on Twitter. How many people are actually already using PuppetDB? Alright, that’s good, it’s okay, it’s okay, we need to talk. We need to talk about data. You wouldn’t believe how long I spent trying to find that, by the way.

So it turns out that Puppet actually generates a pretty large and useful abundance of information. So you know you have got your agent, you have got your master. An agent starts up, collects some facts about itself, so what are those, what do those look like? So facts are basically you know a bunch of key value pairs about all kinds of interesting you know metadata about the system that you are actually running on. You know everything from how many processors do you have? You know am I virtual? How much RAM? You know etc, etc, that’s a lot of very interesting thoughtful information that the agent sends to the master, the master takes that and turns it into a catalog and just throws it away, because that sounds smart.
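
As a rough illustration of the kind of key/value data a fact set carries, here is a small made-up example; the node name and the fact values are invented, though the fact names are ones Facter commonly reports:

    # Hypothetical example of the key/value fact data an agent reports to the master.
    # Node name and values are invented for illustration.
    facts = {
        "name": "web01.example.com",
        "values": {
            "processorcount": "4",
            "is_virtual": "true",
            "memorysize": "8.00 GB",
            "operatingsystem": "Debian",
            "ipaddress": "10.0.0.12",
        },
    }

    for fact, value in facts["values"].items():
        print(fact, "=", value)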

The master generates a catalog for you, so what’s a catalog? A catalog is something that takes your Puppet source code, you know, I have a file, tmp/foo, this is the content of that file, “This is a test”, and that’s going to get turned into this YAML representation. No one should use YAML for anything, by the way, because it’s basically the worst, except for slides; slides are totally legit. So there is a lot of information about this resource: you have, you know, what file it’s in, what line number it is, any parameters to it, all the tags, all kinds of interesting stuff, as well as, you know, things like relationships that resource has to other resources.

Taken in summation this is called the graph, so this is an example of a graph from an actual Puppet Enterprise installation; I did a dump of the resource graph for a node that only has Puppet Enterprise on it. And this is a small snapshot of it; we can zoom out, and it’s actually a bigger graph. We can zoom out again, and that’s actually a pretty gigantic graph.

So there is a lot of useful information in there, right, I mean it’s every single thing that Puppet is managing on every single one of your systems, how they relate to each other in every possible way, all available. So of course what the Puppet master does is it doesn’t store a copy of it, and it happily hands that back to the agent to run, to configure the system, and then the agent just trashes it, of course, again very thoughtful. The agent generates a report of what happened, so what’s in a report? A report, again YAML, tells you what actually happened; the catalog contains what you want the universe to look like. You know, I want this file to have this content, you know, that’s the way things should be. The report is what dashes all of your hopes and dreams and tells you how the world actually looks. And, you know, hopefully most of the time it tells you that it has actually configured the system to look the way that you want it to look, but in case it didn’t, it at least tells you what went wrong.

You know, so the agent will take that report and send it over to the master, which may be able to toss it on the disk, but you know, basically I consider that if you want to look through reports and you have to crack open grep and then go through a directory like that, you may as well just be throwing them away; that’s unstructured data. What century is this? Come on man, we can do better than that.

So what PuppetDB brings to the table is kind of a slight variation on this: you know, an agent generates some facts, facts go to the master, facts get sent to PuppetDB which does not throw them away; it’s such a simple idea. Why didn’t we think about this before? The master generates a catalog; it can spool off a copy to PuppetDB for storage at the same time that it hands that copy back to the agent, which is also great; notice again it’s not being thrown away. Very, very handy. The agent generates a report, the report comes all the way back to the master, the master needs to figure out what to do with it, and the master sends it to PuppetDB where it is kept. Okay, so that’s a really brief introduction to where all the data goes: it goes into the big green cylinder, obviously. Not included with every installation of PuppetDB.

So this is not a new idea, we have had this before in Puppet, for any of you that have used store configs in the olden days; I like to call them the slow configs or shitty configs. It’s based on a library in Ruby called Active Record, and basically what we tried to do using that system is exactly the same kind of setup: we tried to capture as much of the state as we actually can in a structured way in a database so people can actually query it and do interesting things.

It turns out though that this is actually -- this didn’t work very well, and it’s a little hard to, it seems great at first, right, like I have some catalogs, I am sending some catalogs over, everything is great, how are you guys doing? Everything alright? You know, Puppet deployments are going okay, and then all of a sudden it’s like --, oh my shit, it’s on fire and everything is ruined, because it’s like the most inefficient stuff that we ever had in the universe. But it’s okay, because, you know, I mean the worst possible case scenario is that your database is kind of hosed, and that’s totally fine, because you at least retain the ability to configure all -- oh my god, oh my god, this is all ruined, it’s all ruined, because of course if you take down the process that’s trying to store all of your stuff, then that process can no longer process any new catalogs, which is a bit unfortunate. But at least you have all of your agents. Surely nothing, possibly -- oh man this is, this is bad times. Darkness washed across the land.

You know, in addition to it just routinely going up in a ball of fire, which is always hilarious, this isn’t really a way to programmatically get access to any of this information, because ultimately this is a SQL database, and unless you are comfortable writing a bunch of SQL statements, you are really not going to get much interesting stuff out of it. And there are a lot of questions people want to ask, right, like I have got all these resource graphs for everything; maybe I just want to know which of my boxes are, you know, running nginx. You know, how many servers are running a vulnerable version of Rails? Spoiler alert: every version is vulnerable. What are the IP addresses of my web servers? Which users have sudo access? These are all perfectly reasonable questions that anyone would want to pose to kind of a data warehouse of all of their Puppety knowledge, right.

So, you know, what the old system used to do is it would basically think about it, it’s like, okay, that’s pretty good, maybe, I don’t know, I will wait a little bit longer, think about it a little more. I got, I don’t know, it’s been a really long time. It’s like, that’s unfortunate. That’s literally what would happen, I would get like a look of disapproval. Son, I am disappoint. You know, but -- it was just so slow, so slow. Alright, so I am not going to shit on Active Record anymore, I will let it do that by itself.

Okay, so now for something completely different, the new world. You know, so I want to ask PuppetDB a question, like I need to know all the machines that have nginx running; I could just do a query, /resources/Service/nginx, you know, and it happily gives that back to me and you have a smiley face. That’s how you know it’s better. And you could do this for, you know, a whole bunch of different queries that you would like to do. I need to find, you know, all the versions of Rails that are actually running somewhere; you know, I can get that, I happily get my resources back and again, smiles, smiles all around. Smile-oriented development, I think, is what I was really going for. I want to find all the records, you know, all the parameters for setting up my own user account across my entire infrastructure, or actually just on the node foo.com, you know, so /nodes/foo.com/resources/User/me; that should be really, really straightforward, right.
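
For a rough sense of what that looks like from a script, here is a minimal sketch against the resource endpoints mentioned above; the hostname, port, and API version prefix are placeholders to adjust for a real installation:

    # Sketch only: query PuppetDB's resources endpoints over HTTP.
    # Host, port, and API version prefix are assumptions -- adjust for your installation.
    import json
    import urllib.request

    base = "http://localhost:8080/v2"

    def get_json(path):
        with urllib.request.urlopen(base + path) as resp:
            return json.load(resp)

    # Every Service[nginx] resource, across every node PuppetDB knows about.
    for resource in get_json("/resources/Service/nginx"):
        print(resource["certname"], resource["type"], resource["title"])

    # The same idea scoped to one node, mirroring /nodes/foo.com/resources/User/me.
    for resource in get_json("/nodes/foo.com/resources/User/me"):
        print(resource["parameters"])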

So briefly, let me try and actually show you guys PuppetDB actually running, I will try to mirror, well alright. You guys, you don’t really need to read it but trust me it’s actually doing stuff. Okay so PuppetDB is a daemon, so unfortunately there is not a lot of like, if you are actually, if you actually care about what it looks like when it’s running, I have probably done something really, really wrong, right. I mean you don’t want to know, I don’t want you to know. I am going to watch it open up, not that badly.

Okay, so PuppetDB handily includes within its own source tree a benchmark tool. So what I am going to do is run the benchmark against my PuppetDB installation; I am going to be handing it a bunch of catalogs that I actually got from a snapshot from our operations team. I am going to simulate 1800 nodes, and this is running on my laptop, by the way. And 1% of the time I am going to randomize resources. So we will let that spin up in a second, buffering, buffering, buffering, it’s got to think about it, you know. Alright, so it’s spinning catalogs and getting information there. Alright, so the reason why this is happening at this cadence is if you have a 30 minute run interval and 1800 nodes, if you do the math, this is America, but trust me, if you do the math that turns out to about 1 catalog every second. So what I could do is bring up the dashboard and you can actually see it, you know, running and doing things.

So this particular database I have, you know, there are about 700,000 nodes in it, I have got some pretty high duplication measures that I will actually talk about a little bit. There is an internal queue, you can see the depth of it, and there are a bunch of other interesting metrics on here. This is all just to say that it’s instrumented out of the ass and that it’s actually fast enough to handle 1800 nodes’ worth of catalogs on this bitterly antiquated MacBook Pro Retina; it’s pretty sweet actually, you should get it.

So anyway that’s what it does, I don’t need to belabor that point so I am going to kill this and go back to here zoom.

Okay, so in a nutshell, what that benchmark was doing is it’s taking all these catalogs from our operations team, it’s simulating 1800 nodes’ worth of data, firing them off at PuppetDB, and then what you were seeing, if you recall the numbers on that dashboard, are like, you know, a couple of hundred milliseconds to actually store that entire catalog. That’s the entire graph, every single resource, every single parameter, every single relationship, all the edges, all the facts, reports, everything; that all gets stored in a couple of hundred milliseconds.

In order to make that happen -- and for reference, I mean the old store configs implementation would happily work up to about maybe 50 nodes, maybe 100 nodes, and then, you know, you would be ruined.

So we had to build something that was pretty different, I think, in order to make this work. And we did a couple of, at least internal to Puppet Labs engineering, we did a couple of what I would say are some fairly unconventional things, relative to how the rest of Puppet or Puppet Labs software is constructed. #1) We baked asynchronous operation into the core of how this stuff works. Really, what PuppetDB is supposed to do is provide two basic features: 1) One is it has to store stuff, and 2) Two, it has to let you query it. It turns out that there is an architectural pattern within software called CQRS, which kind of describes those patterns pretty nicely; it’s called Command Query Responsibility Segregation, but in a nutshell what that means is you want to use a different model to update information than the model you use to read your information. So effectively that means that you can think of your application as two separate pipelines.

There is like a pipeline for how you want to make modifications to all of your data and there is a completely separate pipeline for how you want to actually consume that information, right. So they are going to be different; most software is not designed in this way. So what that lets us do is, the write pipeline we made asynchronous from the beginning, it’s heavily parallelized, it uses an internal MQ that you guys never have to actually maintain because it’s just embedded inside of the software itself, and we do things like automatic retry.

So when I say write pipeline, what I mean is you are not doing an HTTP POST. You know, when I first did this I think it was somewhat controversial, because people are like, well, if I want to store a catalog, why can’t I just do an HTTP POST to /catalogs, and here’s my 18 megabytes’ worth of catalog, and then you just wait and then you hand me a 200 OK when it’s been stored. I am like, that’s fantastic until you run out of available threads or you run out of available cores or any of a number of things that, you know, pose real scalability limits on what you can do.

So this is what a command looks like. All the write requests to PuppetDB are formulated like this: there is a command name, like I want to replace the catalog; every command has a version; there is a documented wire format for this protocol; and it has a payload. And if you want to know what the payload looks like you can go to docs.puppetlabs.com; Nick Fagerlund did a really great job at documenting all of this stuff. So if you wanted to have your own system that pumps data into PuppetDB, that’s completely open, it’s completely supported, versioned, backwards compatible, all that good stuff.
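
As an illustration of that envelope shape (a command name, a version, and a payload), here is a small sketch; the version number and payload contents are placeholders, and the exact mechanics for submitting it to the /commands endpoint are spelled out in the documented wire format rather than here:

    # Sketch of the command envelope: name + version + payload.
    # The version number and catalog body are placeholders; see the wire format
    # docs for what a real "replace catalog" payload contains and how to POST it
    # to the /commands endpoint.
    import json

    command = {
        "command": "replace catalog",
        "version": 1,
        "payload": {
            "certname": "foo.example.com",
            "resources": [],
            "edges": [],
        },
    }

    print(json.dumps(command, indent=2))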

Internally the pipeline looks something like this: commands come in, they go into this embedded message queue that we have (it’s actually ActiveMQ, but like I said you never have to actually mess with it, it’s completely internal). We take the command, we parse it, slice and dice it, do some validation and transformation, and then finally we process it. So if we get a replace catalog command, the processing step is actually going to, you know, translate that big object into database insert statements, it will do deduplication, all kinds of other interesting stuff.

What’s novel about this is, what happens if I want to do an HTTP POST, here are all my catalogs, and I am waiting for my 200 OK, but say my database is down, like what do you do? If you were using, I mean, most of the software you guys probably use, this is going to throw you an error, which sucks, it sucks. Why do people do that?

Instead, what we can do is, because we already have the baked-in assumption that writes are going to be asynchronous, we can just queue up that same catalog for retry later on when the database is back up again. So we can toss it back into the MQ. All my magic tricks come from, like, how SMTP worked in like the late 80s: if you can’t relay a message to someone and it doesn’t work, what does your mail server do, does it just drop your message on the floor or does it hand you back a bounce? No, I mean it will try again, with exponential back off. That is literally what we do: we will queue it again for a second later, two seconds, four seconds, eight seconds, two to the whatever, all the way up to like 9 hours or something stuff can bounce around.
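
The retry idea can be sketched in a few lines; the delays, the cap, and the helper names below are invented for illustration and are not PuppetDB's actual implementation:

    # Toy sketch of retry with exponential backoff, capped, then giving up to a
    # dead-letter directory. Delays, cap, and the fake processing step are invented.
    import time

    MAX_DELAY = 9 * 60 * 60  # cap the backoff at roughly nine hours

    def process(command):
        raise IOError("database is down")  # stand-in for the real storage step

    def spool_to_dead_letter_office(command):
        print("giving up, spooling a bit-for-bit copy to disk:", command)

    def handle(command):
        delay = 1
        while True:
            try:
                process(command)
                return
            except IOError:
                if delay > MAX_DELAY:
                    spool_to_dead_letter_office(command)
                    return
                time.sleep(delay)
                delay *= 2  # 1s, 2s, 4s, 8s, two to the whatever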

And then finally at the end, if none of that works, we have this directory on disk called the Dead Letter Office that we will spool all the data to, and it’s actually a bit-for-bit copy of what you sent us over the wire, which is really nice for me as a developer, because that way if you guys in the field are using PuppetDB and you run into problems or errors, you can actually take the file from that directory, send it to me, and I could replay it through my own PuppetDB system and reproduce your error exactly, because there is no other mysterious information in play. Awesome, why don’t people do that?

And you need to basically, you need to expect failure because it’s going to happen, you know it’s a complicated system, I am getting new data, I have to dump it into a database like there is a lot of stuff involved there. You know even if at any point in time you know things go sour you know you need to be able to deal with it. Failures like I don’t know, if a database explodes.

In fact let me see if I can explode my database. Okay, shut that down, please work, it will work, it will work. It’s a sound architectural principle, you know, so of course it’s going to work. So what you should see in theory is the command queue actually going up, yeah, there you go. Yeah, of course you are not getting information about how many nodes or resources you have, because I actually have to have the database in order to figure that out, but that’s the key. So notice that the benchmark is still submitting all the catalogs that it can, about 1 every second, but, you know, we are happily queuing things up. So now in theory, wait for it, wait for it, alright, so stuff starts to go down. That is how software should work. So take your database down for maintenance, it’s going to be okay; everything in Puppet should work that way, really. Oh, also did you notice I am actually using more than one core? Just thought I would throw that out there. Things Ruby can’t do? Is it too soon? It’s too soon, too soon.

The other unconventional thing we did, relative to all other Puppet Labs software: we used a new language runtime, notice, things Ruby can’t do. Stuff I knew I needed, we needed stuff, we needed a runtime that was fast, we needed one that was free, because it has to be installable everywhere, one that’s portable, you know, it has to work on Windows, it has to work on, you know, all the platforms that we ever hope to support or have PuppetDB run on. It has to be multi-core, because again it’s 2013 and my phone has more than one core. Try running Ruby on that, it’s sad, you can’t, a Ruby app can’t max out my phone. Like a Lego Mindstorms kit, come on man, what century is this, who is the President? Ronald Reagan, the actor? And it needs to be something popular, like something with an actual ecosystem whose libraries I could read. It turns out that the most boring of all possible choices, the JVM, actually has all of these things in spades. Haters gonna hate! Fuck those guys.

Actually it turns out that there were very few people that complained; I could probably count them on one hand, I think, and then punch them with it. But, you know, it turns out no one really cares. I mean, at the end of the day, no one cares; I mean, Hacker News amateur architecture aside, nobody gives two shits about how your stuff is actually built. You know, as a user, what I want is I want my stuff to work. You know, if it’s easy to install, it’s efficient, and it actually works, it could be a bunch of PHP scripts cobbled together, it could be a Visual Basic app, like, I don’t know, maybe I would care about VB, but you know, it doesn’t really matter.

So one of the really nice things immediately that we got from using a new runtime is that there are tons and tons of high quality libraries that are available. For example, I can embed an MQ inside of PuppetDB; you know, ActiveMQ is like a 700 kilobyte jar file and I can pull it in there and I can start using it as a library, and you never need to know. We include our own web server. When I started up PuppetDB, notice what I did not have to do: install Apache, install Passenger, tune a bunch of random crap that nobody cares about. Instead I can just do, like, here’s, you know, java -jar PuppetDB, and then it starts; it automatically figures out how many cores I have, it automatically allocates threads, it has a web server built in, it has an MQ built in. We even include an embedded database if you are, like, so morally bankrupt you don’t want to set up Postgres with like the two commands that it needs to do it. You could set it up and that actually works totally fine; it turns out about like 30% of users out there run the embedded database, despite every talk I give telling them please don’t do that, don’t do that.

It’s pretty fast actually, that’s because it loads up literally all of your data into RAM at once, go figure. And there is a ton of debugging and profiling tools available.

It’s just been, this was something that was controversial at first and now seems really stupid that we even debated this internally. You know, for example, I was helping a customer last week; they were having, you know, some issues where PuppetDB would start and then it would crash and they didn’t know why, and I was like, are you getting an out of memory error here, and they were like, “Oh, yeah, we are, can you fix that?” and I am like, “Yes I can. If you look in this directory there is a heap snapshot, because we set it up with the init scripts to automatically dump the heap whenever we have an out of memory error. You can send that to us and I could load it up in a profiler and triangulate exactly what line of code is leaking memory. And I could fix it.” And in fact, I will talk about this secret alien technology we use: I could actually, if they have a port open, I could telnet into their PuppetDB instance (they have to do it, they have network access) and you could actually hot patch the server live to include a code fix without taking it down. These are things you can do with a real runtime.

We can ship an uberjar; in Java parlance that’s basically, you take all of your dependencies and, essentially, it’s a structured way of vendoring everything, which sounds terrifying, but what that means is I can hand anybody, like, a 12 or 13 megabyte file and that is PuppetDB; that’s literally all you need. You don’t need any other dependencies except the JDK itself, which is usually an apt-get away anyway. And it just makes deployment really, really simple, and there are no moving pieces; again, you don’t need to set up a web server or any of the other bananas stuff in order to get this working.

And it’s plenty fast; like I said, it’s a couple of hundred milliseconds to actually store a catalog, but think about what that means, right. And we have done performance statistics, because in order to get the data to PuppetDB we have Ruby code that plugs into the Puppet master that takes the catalog, takes your facts, serializes it to the wire, and sends it over. I will put it this way: it takes more time on the Ruby side just to get a catalog, serialize it to JSON, and send it over the network than it does for PuppetDB to wake up an embedded web server thread, spool all that stuff into memory, parse the command, do data transformation and validation, put it onto the MQ, wait for a transaction acknowledgement that it’s been accepted by the MQ, wait for a completely different thread to actually pop that off the front of the queue, again do data sanitizing and validation, translate that into a bunch of SQL statements, open the transaction to the database, serialize it, transaction commit, transaction commit to the MQ, and return an acknowledgment to the user. In about 60% of the time it takes for the Ruby code to do that one small task, the PuppetDB code has already persisted it to disk. So what the hell, man? And like I said, ultimately nobody cares what runtime we use, they just want our stuff to work, which makes sense.

The other controversial thing we did: the query language is basically an abstract syntax tree, which really freaked a lot of people out, I think, when I first did it. This is one of the things that took the longest time, I think, for me to convince people that this was not a terrible idea. Queries are expressed in their own kind of miniature language, right; like, I knew I didn’t want to have people send me raw SQL statements over the wire. SQL injection as a service, like, what could go wrong? I guess Wordpress already exists so I don’t need to, too soon, too soon.

So we basically came up with a domain-specific kind of language, and queries look weird, like this; like, if you want to find a user where the title is me, I formulate it like that. For any of the computer scientists out there, I mean, this is an abstract syntax tree; like, if you parse a language or you parse a grammar, that’s effectively the data structure that you get.
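
Because a query is just nested data, building one in a script and sending it over HTTP is straightforward; here is a minimal sketch, with the host, port, and endpoint path as assumptions to adjust for a real installation:

    # A PuppetDB query is an abstract syntax tree expressed as nested lists.
    # Building it as data means no string concatenation and no injection worries.
    # Host, port, and API version prefix are assumptions.
    import json
    import urllib.parse
    import urllib.request

    # "Find the User resource whose title is 'me'", roughly the query from the slide.
    query = ["and", ["=", "type", "User"],
                    ["=", "title", "me"]]

    url = ("http://localhost:8080/v2/resources?query="
           + urllib.parse.quote(json.dumps(query)))

    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))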

If I want to find all the Debian systems that have an uptime of, you know, more than 10,000 seconds, like, I could, you know, do it like that. We have subquery support; this is an example: if I want to select all the resources that have the Apache class, then for each of those records I get back, I want to yank out the certname, the node, what node they are on, and then I want to get the IP address fact for each of those nodes that have Apache on them. That’s a really simple thing: I just want all the IP addresses of all the nodes that have the Apache class applied to them, right.
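
Roughly, that subquery has a nested select/extract/in shape; the sketch below approximates it, and the exact operator names, endpoint, and class-title capitalization are assumptions to check against the query docs for your version:

    # Approximate shape of the subquery described above: the ipaddress fact for
    # every node that has the Apache class. Operator names and capitalization
    # are assumptions to verify against the query documentation.
    fact_query = ["and",
                  ["=", "name", "ipaddress"],
                  ["in", "certname",
                   ["extract", "certname",
                    ["select-resources",
                     ["and", ["=", "type", "Class"],
                             ["=", "title", "Apache"]]]]]]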

Because it’s an abstract syntax tree, we can potentially start doing things like optimizations: someone could submit a query that looks like this and I could translate that into, say, a more efficient SQL statement, right, like an “in” operation where, instead of having a bunch of “or” clauses chained together, I could actually collapse that part of the tree and do something more interesting. So in a nutshell, we walk that tree compiling it to the most efficient SQL that we can possibly come up with, so we don’t use any object-relational mapper or anything like that.
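
As a toy illustration of that kind of rewrite (not PuppetDB's actual compiler), collapsing a chain of "or" equality clauses on one field into a single parameterized SQL IN could look like this:

    # Toy illustration of an AST-level optimization: a chain of "or" equality
    # tests on the same field collapses into one parameterized SQL IN clause.
    def compile_or_to_in(ast):
        op, *clauses = ast
        if (op == "or"
                and all(c[0] == "=" for c in clauses)
                and len({c[1] for c in clauses}) == 1):
            field = clauses[0][1]
            values = [c[2] for c in clauses]
            placeholders = ", ".join(["?"] * len(values))
            return f"{field} IN ({placeholders})", values
        raise ValueError("pattern not handled in this toy example")

    sql, params = compile_or_to_in(
        ["or", ["=", "certname", "a.example.com"],
               ["=", "certname", "b.example.com"],
               ["=", "certname", "c.example.com"]])
    print(sql, params)   # certname IN (?, ?, ?) ['a.example.com', ...]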

People, you know, you look at that and you are like, really, I have got to do that, that’s ridiculous. There really is a benefit to this and the benefit is twofold. #1 is I was not at all confident in my ability to come up with a query syntax that I think everybody would like; that’s hard to do, I mean it’s hard. Especially one that needs to be able to support all the features that we care about: it needs to be able to support subqueries, you need to be able to correlate stuff between resources, facts, nodes, regular expressions, equality, greater than, less than, I mean there is a lot of stuff that we would need to do, and if I just got an opaque string over the wire, that’s just really, really unfortunate.

What I was confident of is that, if you write a library and you want to spit out queries or generate queries to integrate with someone’s stuff, you want an interface like this. Anybody who has ever written a script that smashes together strings in order to produce a SQL statement, like, A) I am sorry and B) I am doubly sorry, because that’s really unfortunate. What’s way easier to do is actually assemble this with constructs in your code, and that’s exactly what a syntax tree lets you do.

So people that are smarter than me, like Eric from Spotify, who is also giving a talk (it may have already happened, I apologize, Eric), you know, he wrote a query language that looks like this. It’s much more natural language, much more friendly for a user, and it compiles down to this. You should look at the code that actually implements serializing this to a PuppetDB query, and it’s really tight, because you don’t have to do any crazy string concatenation madness in order to make that actually work. Having it be based on the syntax tree lets us manipulate it in code.

So, you know, in PuppetDB 1.2 I think, we rolled out changes to our APIs so that by default we will only search nodes that are currently active, that are actively getting catalogs and actively sending facts. Now, in the normal world, what you would have to do is take that string that someone was passing in as a query, parse it into the tree, and then manipulate the tree to add another clause on top of it that says “and the node is active” along with the rest of what you asked for. But we can do that all in code in just a couple of lines, and that has allowed us to evolve that query API much more rapidly than we were able to do before.
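
A sketch of that wrapping step as plain data manipulation; the exact form of the active-node clause is an assumption about the historical API, but the point is that the change is a couple of lines of tree surgery rather than string parsing:

    # Sketch: wrap whatever the caller asked for in an extra "node is active"
    # condition. The precise active-node clause shown here is an assumption.
    def only_active_nodes(user_query):
        active_clause = ["=", ["node", "active"], True]
        if user_query is None:
            return active_clause
        return ["and", active_clause, user_query]

    print(only_active_nodes(["=", "type", "User"]))
    # ['and', ['=', ['node', 'active'], True], ['=', 'type', 'User']]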

Like I said, because this turns out to be really easy, to generate queries, that means it’s kind of wonderful for integrators, I think, and you end up with people writing really interesting dashboards and stuff like that to talk to our stuff. This is from Danny from the Puppet freenode channel; he made Puppetboard, which is really, it’s really cool. Actually, the other day he sat down with myself and Nick Lewis and we hacked together some more pages to display real-time metrics out of PuppetDB, so it’s pretty nice.

The Puppet Enterprise Event Inspector that you saw during the keynote this morning is actually based on those exact same APIs and the exact same interfaces. We don’t cheat: PuppetDB exposes these APIs, they are versioned, they are documented, and if our commercial products need something out of it, then we know we need to evolve that API to make it work, so everybody kind of benefits. But everything that you see these tools do, you could do from the command line, from a script; you could use it to integrate with other tools, whatever it is that you want.

There is a bunch of -- so really, okay, the proof is in the pudding; my conjecture is that having these APIs designed the way they are is conducive to people, you know, creating more interesting and novel integrations with our stuff. CERN produced a Foreman integration; there are two additional web UI projects in varying states of complexity and completeness, the bottom one I know uses the Play Framework and Scala. There are a handful of different Ruby integration libraries; Eric from Spotify’s is probably the most popular one, which is puppetdbquery, which actually provides in-language support: in your Puppet DSL code you can just write queries against PuppetDB, you know, like rapidly spit out, you know, get me all the services everywhere, and I am going to take that, throw it into a template, and then spit out Nagios checks in like 100 milliseconds instead of taking like 150 seconds or whatever the default Nagios types take. He also wrote a library, for the Ruby programmers out there that use a library called DataMapper, which is kind of an object-relational mapper for Ruby; it is sort of a backend for that that talks directly to PuppetDB. So in your Ruby code you can leverage that stuff exactly like, you know, you leverage anything else.

Oh yeah, no, that’s right, he wrote a Hiera backend so you could do Hiera lookups and that data is actually yanked straight out of PuppetDB, which is awesome. R.I. Pienaar wrote another library that has its own query language that he uses for MCollective integration. So on the MCollective command line you can actually talk to PuppetDB and do discovery. So if you have 1000 hosts or 100,000 hosts and you want to say, I need to tell all of my Debian machines that are running Apache to restart Apache: normally with MCollective you do like a discovery that’s going to send a ping out to everybody, and you have to wait for that to propagate over the network and get responses back. But you don’t need to do that, because all that data is already in PuppetDB, and that takes an operation that would take a couple of seconds and collapses it to tens of milliseconds, which is awesome.

There are at least 3 different Python libraries. The one that’s used in Puppetboard is actually the one on the top; it’s pretty good, I highly recommend it. There are definitely some other ones and some ancillary tools, like grep programs, that are also written as well. There are bindings for Java, for Go, for Scala; someone wrote a Gist that has some CoffeeScript code that talks to it; there is a Node.js implementation of some of the protocol side of things that we have got. It’s crazy. There is an MCollective discovery plug-in that you could use today; there are two different orchestration plug-ins for Rundeck. Dan Bode’s OpenStack libraries leverage talking to PuppetDB in order to do orchestration between, you know, master nodes, slave nodes, things like that. If you look on GitHub and do a search for, like, Vagrant setups to set up a Puppet environment, they all include PuppetDB now, or they are automated using some of these libraries. There is even a dude who uses it as the backend of his PowerDNS setup: he actually talks to PuppetDB to dynamically figure out exactly where, basically, to determine routes. I thought that was crazy but awesome at the same time; I was like, why wouldn’t you do that, that’s pretty awesome.

The other controversial thing, which shouldn’t be controversial, is that we chose to use boring technology when everyone else is not using boring technology. In particular, we opted to have a relational database as the backend. You can either have an embedded one or we support Postgres, because it turns out you need to do ad hoc queries, which is what people want to do. You know, infrastructure is data, right: if infrastructure is code, then infrastructure is data, and then you want to actually be able to query that data to do interesting things. So if you want to do that, it turns out databases are actually pretty good. I love this comic: this guy is like, so how do I query the database? It’s not a database, it’s a key value store. Oh, it’s not a database; okay, how do I query it? Well, you just write a distributed map-reduce function in Erlang. Did you just tell me to go fuck myself? I believe I did, Bob.

Yeah, but imagine, you know, imagine where I tell the user, I am like, “Yeah, well, you can search all your catalogs and everything and it’s going to be great. Oh, by the way, you have to set up like a 28-node Cassandra cluster in order for it to work.” They are like, did you just tell me to go fuck myself? Depends, are you a commercial customer or not, I guess.

We also use a lot of weird alien technology; in particular, even though it runs on a JVM we didn’t write it in Java, we actually wrote it in a functional programming language called Clojure, which I would highly recommend. There are a couple of novel properties to it: it’s a functional language, so it takes that very seriously, everything in the language is immutable; you know, you don’t have variables as used traditionally, think of it as you have constants for stuff. And that basically means any object that I am manipulating in the language at any point in time, I could hand to literally as many other threads as I want and I don’t have to do any locking. So you kind of have to wrap your brain around it in order to write your code that way, but among every single other Puppet Labs project we have literally the lowest incoming defect rate. There have been zero bugs related to deadlocks, zero bugs relating to in-VM state management, so it’s worked out pretty well.

And of course, I understand you can write any software in anything else; this guy did a perfectly adequate job of reproducing Starry Night on an Etch A Sketch. I just wouldn’t do it. I wouldn’t do it.

So across all the thousands of deployments that we have, I mean, there are hundreds of threads that we have going on per installation, and yeah, we have no deadlocks, no bugs involving any mutable state. And because we ship around code that plugs into the Puppet master that’s written in Ruby, the terminus code, if you look in Redmine that actually has about 10x the defect rate of the stuff that we are doing on the server. And the entire PuppetDB source tree, if you exclude all the test cases, is about 7000 lines of code. For reference, if you look in the Puppet source tree, type.rb and parser.rb added together are almost 7000 lines of code. I am just saying.

I had a lot of conjectures about performance and some of the optimizations that I wanted to do. I had a theory that a resource often exists across multiple hosts; you know, my user account probably exists on a bunch of systems.

So we implemented a feature where we would only store that resource once, because if it’s the same resource across a thousand nodes, why would I store it a thousand times? That’s 999x stupid. And it turns out, you know, that was a useful assumption, I think.

We had an additional theory that for a given host, we will often receive the same catalog for that host again and again and again, because if you have a web server and Puppet runs every 30 minutes, are you really pushing out a change to that web server’s config every 30 minutes? Like, probably not; you are probably only pushing it out relatively infrequently. So if I get the same catalog every time, why would I store the whole thing again? I should take a fast path and not store it at all. So we do single-instance catalog storage.
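
The mechanics boil down to hashing a canonical form of the catalog and skipping the expensive storage path when the hash is unchanged; a simplified sketch of the idea, not the actual implementation:

    # Simplified sketch of single-instance catalog storage: hash a canonical
    # JSON rendering of the catalog and only do the expensive store when the
    # hash for that node has changed.
    import hashlib
    import json

    last_seen = {}  # certname -> hash of the last catalog we stored

    def catalog_hash(catalog):
        canonical = json.dumps(catalog, sort_keys=True)
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def store_catalog(certname, catalog):
        digest = catalog_hash(catalog)
        if last_seen.get(certname) == digest:
            return "skipped (duplicate)"
        last_seen[certname] = digest
        # ... real work would translate resources and edges into database inserts ...
        return "stored"

    catalog = {"resources": [{"type": "File", "title": "/tmp/foo"}], "edges": []}
    print(store_catalog("web01.example.com", catalog))   # stored
    print(store_catalog("web01.example.com", catalog))   # skipped (duplicate)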

This was theoretical right at the beginning, because we didn’t know if that was actually the case, and it turns out in the field that we see resource and catalog duplication rates above 85%, which basically means that we cut down, we basically do no work in PuppetDB about 85% of the time, which is good, so I think that proves out the theory. And what’s nice is you can load up the PuppetDB dashboard, look at these statistics, and you should be able to get a good measure of roughly how volatile your environment is. Like, if your resource duplication is really, really low, that means you have a lot of resources that only exist on one system. So it’s kind of like your snowflake quotient, right, and you don’t want that to be particularly high. If your catalog dup rate is really low, then similarly that means that you are pushing out a lot of changes with regularity; it’s really, really volatile, and maybe that’s okay, maybe you are in the middle of an application deployment or something like that, but if you are not and it’s steady state, things should be pretty good.

And ultimately, I think what this means is that monitoring and instrumentation is a really, really big deal. I mean, if you present users information like this, they use it; everybody loves numbers, everybody loves analytics, everybody loves at least consuming monitoring, maybe not setting it up or actually using it, but numbers are really, really powerful. Being on a JVM runtime means that we can get all kinds of information about the way that PuppetDB is operating and expose that to the universe.

Like this dashboard itself, it’s getting all, oh, I kind of wish I had a higher res version. Well, when you install PuppetDB, which I know all of you will, you can bring this up yourself. And send me your screenshots, because I collect them; I am going to make a quilt or something. But all of this stuff, all these data points, are going over HTTP to PuppetDB, and anything you see displayed here you can get via the command line or plug into Nagios. And in fact people have: Jason Hancock, who is in the front row, wrote Nagios PuppetDB, which is awesome. There is an external implementation of Naginator, so instead of using the built-in Puppet Nagios types you can actually talk to PuppetDB directly to find all of your hosts, resources, whatever services, and spit out checks. Two different implementations of Munin bindings, collectd. I mean, it really just turns out that people appreciate all the details that I think we took care of when, you know, when we built this stuff.

You know, in particular, so this is a time lapse of when we released PuppetDB 1.0; I change the color whenever we release a new version and you can kind of see its spread over time. Last year I gave people a lot of shit because we had every continent covered except for Africa, and lo and behold, now we have Africa. So thank you, whoever did that.

The center of the world, right underneath Africa? I don’t know. Oh that must have been when we released 1.2 I think or maybe 1.3 and then we will see more stuff kind of come in anyway, hopefully while you are staring at this and these aren’t actually downloads, these are actual like deployments because if you bring up the PuppetDB dashboard we can tell you if there is a new version available. So we use that information to basically figure out like what versions are deployed in the field, which helps me as a developer like really you know make the software better. So you could see the browser slowdown because there are so many like SVG elements going on everywhere.

Okay so that’s a lot of stuff, I don’t know what’s going on down here either that’s I don’t know deployment in rapture or something perhaps.

So yeah, there are literally thousands of production deployments, you know, that I know of, and it’s anything from small shops that just have a couple of dozen hosts to the largest installation I know about, which is in the tens of thousands of nodes, you know, with a database cluster and multiple PuppetDB daemons. People run standalone, and people run with a clustered database or multiple daemons working together in concert behind a load balancer. There is approximately a new deployment of PuppetDB every 15 minutes, so by the time this talk is done, 3 new people will have installed it and actually have it running.

So since the last time I actually spoke to you guys about this, which was a year ago, there have been a number of changes which I can blast through pretty quickly, because I don’t want to just stand up here and be like a human changelog, because it turns out that there are robotic changelogs that are much more effective.

The biggest one is that this is available in Puppet Enterprise 3. It’s on by default, it’s fully supported, so if you run into any problems, you know, you can actually, like, talk to me, which may be great or it may suck depending on how much you like me. Give me a chance though, like me, cool dude, yeah, like, it’s cool software, cool languages, so the cardigan --.

Most importantly, I mean, PuppetDB in Puppet Enterprise 3.x is going to be the basis for all the upcoming reporting and analytics features they are going to be putting in. A lot of changes performance-wise since the one that was released a year ago: storage is approximately 20% faster; we made a lot of improvements to how we are doing caching internally; again, one of the nice things with a runtime that actually has threads, so I can cache information that one thread needs and another thread can actually access it, crazy town.

We also did some database optimizations, getting rid of some indexes that we don’t actually use, and that ended up saving a lot of disk space, which is really, really nice. We made some, actually this is a community contributed one, some improvements to the terminus code that we have, where we actually hoisted some logic to a higher level outside of an inner, of a tight loop, which actually drops the serialization time for an example catalog we had with 10,000 resources from about 80 seconds down to about 6 seconds, which again is still more than a couple of hundred milliseconds, but it’s Ruby, so you know, you have got to take what you can get.

We added a lot of features in terms of resilience. Nobody likes, one of the downsides, it’s not all roses, you know. One of the downsides of running on a JVM is, you know, basic things that you would take for granted in other languages, like interop with SSL certificates and things like that, are done in Java’s own brain-dead kind of way where you have to load up certificates into a keystore or a truststore; if any of you have used Jenkins or any of those other things, you have got to do that kind of stuff. And this was by far the most, I mean you could talk to Ken Barber, who has been handling this, he actually wrote the fixes for this, he worked with Chris Price and it’s amazing work, but we actually now automatically will create an in-memory keystore and truststore for you, and all you have to supply is the path to your certificate files. So you never have to worry about that ever again. This was the largest source of problems, like on IRC, you know, people complaining to us that they couldn’t get stuff running.

HTTPS is now much more configurable; in particular, you can configure the set of cipher suites and SSL protocols that you actually care about. And you could do that in order to match whatever your company security policy needs are, like if you only want this particular version of TLS, or you don’t want to use a certain cipher, or you only want to use, like, Diffie-Hellman key exchange or something like that. This is important for installations that have, like, security standards that they need to meet, like government installations, military stuff, I think. Use my software for good, not evil, even though evil will win, because good is dumb.

A lot of things are more automatic now. We automatically recover from message queue corruption, which was never really serious, but it would prevent you from starting up unless you manually cleared out a directory, and I was like, well, you know what, if I was a user and there was an error and the remediation was to rm this directory, why don’t I just automatically do that when I detect the error on startup? So you don’t have to worry about that anymore.

Anything that goes into the dead letter office is now automatically compressed. We can now automatically purge information from the database about nodes that are no longer active. We can now automatically recycle connections to a database in case you are running your database connections through, like, a load balancer, where you want to round-robin new connections, leave persistent connections alone, but send a new one to a different part of the database cluster; that’s really useful, and that was contributed by Chuck Wiser from State Farm.

We now have backup and restore, which is really, really nice; it’s integrated into the daemon, you don’t have to download anything extra, and you could actually restore to a PuppetDB instance while it’s running. The restore actually uses that asynchronous write pipeline, and what it does under the hood is it just does HTTP POSTs to that /commands endpoint, which is nice because we kind of eat our own dog food; by building the system in that way we can have more elaborate features that leverage that stuff.

There are some interesting query changes. The biggest one is that there is now a V2 API, which you guys may be familiar with if you have been using it. Some changes there are that you no longer need to ask for only active nodes. Fact queries are very, very fast, and you can do full fact queries; in V1 all you could do is say, give me all the facts for this node, or, for this fact, give me just the list of nodes that it’s on, which worked for the Puppet inventory service API but not really for much else. Now you can do, like, full fact inspection: find me all of the facts, find me all the IP addresses that start with 10-dot, you know, that kind of stuff.
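
For instance, the "10-dot" style of fact query can be expressed with the regexp match operator; here is a sketch, with the host, port, path, and operator spelling as assumptions to check against the query docs:

    # Sketch: full fact inspection in the V2 API -- every ipaddress fact whose
    # value starts with "10.". Host, port, path, and the "~" operator spelling
    # are assumptions to verify for your version.
    import json
    import urllib.parse
    import urllib.request

    query = ["and", ["=", "name", "ipaddress"],
                    ["~", "value", "^10\\."]]

    url = ("http://localhost:8080/v2/facts?query="
           + urllib.parse.quote(json.dumps(query)))

    with urllib.request.urlopen(url) as resp:
        for fact in json.load(resp):
            print(fact["certname"], fact["value"])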

Node metadata is much richer and it’s exploration friendly: all those URLs I had in the slides earlier, where I want to find my user account and it’s like /resources/User/myusername, that didn’t exist in V1 and now it exists in V2, which means that for the vast majority of cases you probably never need to write one of those wacky queries, you could probably just use standard APIs to do that. I am not going to have time to demo it, unfortunately, but if you go to docs.puppetlabs.com we actually have a query guide that Nick Lewis wrote that walks you through the construction of these queries and what you can do.

A personal pet peeve of mine, and this is also a community contribution: we used to require that you give an explicit Accept header of application/json anytime you wanted to do any query for anything, which really sucks on the command line. Now we accept a wildcard, so you can just do this and it’s less typing; this will give you all the nodes.

We now have subquery support: you can now correlate data from resource queries with fact queries with node queries. This is the example where I want the IP address, which is a fact query, of all my nodes running the Apache service; that combines information queried about resources with facts, and that’s something you couldn’t do before.

A big one is we now have report storage. We have had that under, it’s currently under /experimental, but that will probably be moved out of experimental relatively soon. We come with a report processing plugin; you guys can start using this now. We store report-level metadata, we can do queries on events that span reports, and all this stuff is the basis for the Event Inspector demo that you saw, but you guys can use it for your own purposes if you want.

One of the things that just landed in master the other day, I apologize because I actually broke all our acceptance tests, but I fixed it, I fixed it. It’s streaming queries, streaming queries: you can basically stream results to clients on the fly, as they come in from the database. So what that means is you guys basically get massively lower latency for the first response. You know, so in a nutshell, if I say I just want to get literally every single resource I have, what it’s going to do is it’s going to open up a cursor to the database and just start siphoning the data to you as soon as it possibly can, and then a smiley face.
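
From the client's side nothing special is required; the results simply start arriving sooner, so a consumer can read the response incrementally instead of waiting for the whole body. A sketch, with the host, port, path, and query as placeholders:

    # Sketch: consume a large query response incrementally instead of waiting
    # for the entire body to download. Host, port, and path are placeholders.
    import json
    import urllib.parse
    import urllib.request

    query = ["=", "type", "Package"]            # "all my packages", as in the demo below
    url = ("http://localhost:8080/v2/resources?query="
           + urllib.parse.quote(json.dumps(query)))

    with urllib.request.urlopen(url) as resp:
        first = resp.read(64 * 1024)            # the first bytes arrive almost immediately
        print("first", len(first), "bytes received")
        total = len(first)
        while chunk := resp.read(64 * 1024):    # keep draining as the server streams
            total += len(chunk)
        print("finished after", total, "bytes")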

I will show this really quickly. Okay, so I have Version 1 of PuppetDB running on port 9080 and I want to get all of my packages. How’s it going, everybody? Doing okay? You look beautiful. There once was a man from Nantucket, oh, alright, there we go, sorry, I can’t finish that limerick.

So yeah it took a while before I could actually get all this information however if I do that exact same query, against the new version, bam there you go, so it shows up instantly.

In fact, how big is this, 6 megs, 9 megs, 11 megs, 14 megs, okay, so you could wait for all this data to actually come in, or you could just get the first result as fast as you possibly can. How can you possibly do this, Deepak, like, don’t you need something fancy like space-age Node.js technology in order to do all this asynchronous stuff? It’s like, no, it turns out, if you have a runtime that has actual threads, you can, like, do something in the background while you are talking to someone else at the same time. Threads: state-of-the-art for 1985.

Stuff that’s coming up, this is the last stuff I have got. Last year I said that there were three things that we were going to be working on: we were going to be working on historical catalog storage, we were going to be working on report storage, and we were going to be working on something else I don’t remember, but I am pretty sure we got 2 out of 3, so victory? I think that is passing.

Coming up next, I want to start developing tools that allow for replication; this is a really common request that we have from people. I think the first MVP, like the foray into this that I think is the simplest thing that will likely work, is basically an out-of-band tool that will actually do diff-based mirroring. The reason why this works with PuppetDB, and it may not work with other tools, is because effectively PuppetDB is kind of like git: it’s basically an object store, only the objects that it’s storing are catalogs, facts and reports, and every single one of those has a stable hash that we compute, because we need that hash in order to figure out how to de-duplicate things; if we get two things with the same hash, we don’t store it again. So because we have those hashes, I can say things like, you know, list all the hashes here, list all the hashes on the slave, diff them, you know, do a set difference, and then copy over all the things that are missing. And I could do that.
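
Since every stored object has a stable hash, the mirroring step is just a set difference; a toy sketch of the idea, where the data layout and helper functions are hypothetical and stand in for real storage and transport:

    # Toy sketch of hash-based mirroring: compare the sets of object hashes on
    # the master and the replica, then copy over whatever the replica is missing.
    # The data layout and helpers are hypothetical, not a real PuppetDB API.
    def list_hashes(node):
        return set(node["objects"])

    def copy_object(source, target, digest):
        target["objects"][digest] = source["objects"][digest]

    def mirror(master, replica):
        missing = list_hashes(master) - list_hashes(replica)
        for digest in missing:
            copy_object(master, replica, digest)
        return missing

    master = {"objects": {"abc123": "catalog-1", "def456": "facts-1"}}
    replica = {"objects": {"abc123": "catalog-1"}}
    print(mirror(master, replica))   # {'def456'}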

Prototypes that we have, on a laptop, can handle basically the volatility for, I think I was trying it with maybe 2000 or 3000 nodes’ worth of data, and having it just run in the background, like, constantly, it kept up just fine. The nice thing about it is the master actually does all the deduplication effort and all that CPU work, so the diff, it already knows, it’s precomputed, what’s missing. So it’s actually less work to send it over that way. By developing a tool like this we can actually combine diff-based mirroring in more interesting ways, because the concept is very simple: it’s like an rsync, right, and you could rsync both ways to basically do, effectively, like a master/master kind of replication scenario; take everything from here, send it over there, don’t delete stuff that’s missing though, and then do the same the other way around. And then there you go, you end up with data that’s a unification, that’s the union of all the data that you have available.

And of course, later on, if you really need to, we can optimize this: instead of an asynchronous diff-based mirroring setup that runs out of cron, or a persistent daemon or whatever that wakes up and runs when it needs to, because we have those embedded MQs we can actually just hook them up directly over, you know, TCP protected with SSL, something like the TLS protocol. And that way we would reduce the latency to basically nothing.

The last thing I want to say is that we are working on more flexible routing. We have had a lot of requests, especially from really absurdly huge installations: they have database clusters and they want to do read/write splits. So they want Puppet to be able to send catalogs, facts and reports to one PuppetDB daemon to store them, but they want to do reads from a completely different one, maybe like a hot standby or something like that. So it turns out that’s not hard to do and we will be working on it.

And the last thing that people actually want is a soft failure mode: so basically, if for whatever reason PuppetDB is not available and you can’t write to it, rather than catastrophically failing the compilation, we can just emit a warning; if the user wants, the user can set this up so that we just emit a warning and then Puppet will just continue doing its business. The downside is there is going to be a gap in your data, but when Puppet runs again 30 minutes later, you know, it will get populated. But I mean, it’s up to the user; availability is something every user has different needs around, and I think this is something we need to get some better control over.

So anyway, this stuff is documented here, it’s packaged, or if you use PE you don’t have to worry about it because it’s already taken care of for you. You can install all of it using the Puppet Labs PuppetDB module, which is just a couple of lines and it takes care of everything for you. It is open source, so I encourage contributions. This is me. Alright, it’s a little late and I know there is a party to go to, but if anyone has any questions, I am happy to take them now or after I am mildly hammered.

So the question is, why do we need to design replication when instead we could do it at the database level? There are a couple of reasons. Straight up, people right now are doing it at the database level and it works totally fine using off-the-shelf stuff, pgpool or streaming replication in Postgres 9. However, there are a couple of issues with that. I think, number 1, at the database level it’s pretty low level, right, so it’s going to be replicating at a block level: like, any rows that have changed, it’s going to be sending, you know, those row deltas over the wire to someone else.

However PuppetDB the application layer has more intelligence about what’s going on like it doesn’t necessarily have to say replicate every single block, what it can instead say is okay, here is a catalog that you are missing like store it the way you normally would and if you actually can dedup things you can go ahead and dedup them and actually result in less I/O.

So basically, in a nutshell, we can take advantage of some of the assumptions and semantics that PuppetDB the application has, that Postgres doesn’t have, because it’s just the database. So we can basically be smarter.

And I think it’s simpler to do; for example, setting up multi-master replication in Postgres is actually extremely difficult, because there are a lot of transaction issues you have to think about. I don’t know why MySQL did it, because I think multi-master replication for relational databases is a colossally fucking stupid idea.

Yeah, it’s not really multi-master ultimately, right, I mean it’s kind of anybody’s guess what happens if you have a conflict. But in PuppetDB, because we know that it’s, here is the secret sauce in PuppetDB: we don’t update any of the data in place, we always insert new stuff and then clean up the old crap later. So because we don’t do that, we don’t have to worry about conflicts. You know, if two things have the same hash it’s fine, we don’t need to store it again, whereas at the database layer all you are talking about is rows, and it doesn’t have that higher level intelligence. But that is an excellent question.

Yes sir.

Audience: Hi, you had an example where you were querying both facts and resources, and then, like, based on the query of one you could get some of the other, or even a little bit of both. That’s not very well documented on the web docs right now, and I wonder if I could bug you to add more examples, especially, like, not just documentation as, here are the principles, go figure out how to put it together, but rather, like, straight-up examples; or even maybe I will just come pick your brain later and, like, write them for you [overlapping] [00:53:25] oh my God, this is exactly what I need and I can’t figure out how to put it together?

Deepak Giridharagopal: So the question, or the comment, was basically about the relative paucity of documentation on doing these subqueries between facts and resources. Yeah, we don’t have many examples of that; I would absolutely welcome more. Actually, all of our documentation is stored in our git repo for PuppetDB itself, so you can actually just write examples and then submit a pull request and we will totally merge them. Yeah, I don’t know, I will buy you a beer or send you a T-shirt or something like that; I will staple a $5 bill to the inside of the T-shirt, I don’t know, grease the wheels a little bit. We have VMware money now, I am going to make it, make it rain. Anyone else?

Deepak Giridharagopal: The question was does anyone use PuppetDB to store things other than catalogs? Not that I am aware of, people have put in catalog looking things that haven’t necessarily come from Puppet itself. People have definitely put in facts that don’t come from Puppet itself.

A really good example is we had some prototypes floating around, and this will probably land in a very soon upcoming version of Razor; Razor is our bare metal provisioning tool. It actually boots a machine over PXE with, like, a microkernel, and then the microkernel actually has Facter on it and runs Facter to figure out info about the machine, and then it sends it over the wire. And it’s trivial to actually modify that to send it to PuppetDB, which means, if you are using Puppet Enterprise or anything else that uses the inventory service API, you can actually start doing inventory service queries for systems that don’t even have an operating system on them yet, or don’t even have Puppet on them yet. And again, that’s the power of actually having documented, versioned, like, proper APIs that aren’t, like, a freaking SQL query. That’s the best example that I can think of; I haven’t heard of any others.

The downside of that CQRS approach is that the database schema is hyper-optimized for the exact kinds of queries that Puppet users want to do. The tradeoff there is that it’s not generic; you know, it’s specific, and we get speed out of it, whereas if it was generic it would be slower. So it’s kind of, it’s a mixed bag, but it’s the truth.

Anyone else, go home and download my stuff, let me know. Thanks.