Puppet Labs videographer Kent Bye interviewed UX designer Joe Wagner about the Puppet Enterprise event inspector, a new feature rolled out in Puppet Enterprise 3.1. Take a look at this quick video demo from Joe, then read about the design decisions behind the event inspector.
Kent Bye: Why is the new event inspector in Puppet Enterprise 3.1 important?
Joe Wagner: The event inspector gives you a reporting capability that was previously unavailable in the Puppet Enterprise console, or elsewhere. It gives you three different views into your infrastructure.
Generally, Puppet Enterprise reporting tends to be node-centric. Nodes are important. They tell you where something has happened. They give you an idea of the impact when a change, or a failure to change, has occurred. But classes and resources give you additional information: classes tell you what has been affected, and resources give you granular detail about how it occurred.
So really, one of the great benefits of using the event inspector is being able to get that immediate, broad perspective about what is happening within your infrastructure. And if something is of concern or interest, then it gives you the ability to quickly drill into any of those three perspectives for more information.
Where sysadmins may typically scour logs to try to understand what's happened, we've produced an opinionated reporting tool to understand what has happened and to give you enough information to take action.
KB: In your PuppetConf presentation about the event inspector, you talked about making the design decision to not include historical reporting in this initial release. Can you talk about why you focused on providing data to take action on rather than showing historical trends?
JW: Sysadmins don't have time on their side. Sysadmins are generally overworked and are constantly context-shifting. What we really want to do with event inspector is get sysadmins in and out of the tool as quickly as possible. We do this by giving them the right information at the right time.
A single glance at the event inspector will tell you whether you've had failures or not. If you haven't, then that may tell you everything you need to know, and you're out of the event inspector and onto the next thing. If you do see that you've had failures, then you have robust and nimble tools for actually understanding what has occurred there. The idea is that a single glance tells you whether something's wrong. And if something is wrong, then let's take as little of your time as possible to actually get you to the source of that issue.
KB: Can you talk about the debugging information that is made available to help isolate exactly where in the code failures may be occurring?
JW: Event inspector is called "event inspector" because it allows you to ultimately … inspect events. When you first enter the tool you get the class, node, and resource perspectives in a summary rollup. But within a couple of clicks, you can drill from that high-level perspective all the way down to a single resource event that's occurred.
So, if you have a file resource Puppet was expecting to change but wasn't able to for some reason, within a couple of clicks, you can actually learn about what exactly occurred with that file resource. Why did it fail to change? Is there an issue in the code? What file and line number should I take a look at? And also, when you're looking at an event, you're still able to see which node that event occurred on and the class that contains it. So it really gives you a pretty comprehensive view of what's occurred there, and all of the context you need to know what to do next.
KB: Why should Puppet users consider adopting Puppet Enterprise in order to have access to the event inspector?
JW: Let's take a look at using Puppet without this tool. Let's say you have a large-scale environment and are managing hundreds or even thousands of nodes, or maybe you're planning to get there some day. Being able to actually assess what's happening across all those nodes is challenging given the tools that are currently available.
A sysadmin might look at log files to actually try to understand what's happening. But that's sifting through an awful lot of information that hasn't been synthesized and rolled up for the user in any kind of consumable way.
There are tools out there that provide reporting on a node basis, but again, if you have hundreds or even thousands of nodes, you're looking through a very long list of nodes trying to understand what happened. And if dozens or hundreds of those have had issues, trying to understand where the source is and what was affected can be pretty challenging.
Now imagine using event inspector where you have hundreds of nodes and many of those nodes have experienced a failure. But you can trace this back to a single class or to a resource or two that have changed or failed to change. Suddenly, you have a much better, immediate understanding of what may have occurred. You're not digging into a node, jumping back out and into another one, back out and into another one. What we're really saying is, "Hey. You should look here." We believe this is going to save sysadmins a lot of time.
KB: It sounds like what you're saying is that Puppet reporting has traditionally been very node-centric, and that the event inspector is adding new ways to drill down into what is happening in Puppet classes and all the way down to specific resources within those classes. If you have failures across many nodes, then can you talk about how the event inspector is helping users to more quickly determine what may have gone wrong within their system?
JW: Imagine you're managing a set of nodes, and a sizable portion of those are experiencing failure.
If there's just a single source, or maybe a couple of sources of errors, then looking at your classes may clue you in immediately that there is a single class that's experienced failures and that this is clearly affecting all of the nodes that have failures. Or looking at the resources you might see that you have a service that's failing, and "Bingo!" That's the culprit. And of course, the converse could be true as well. You could have a single node that's having failures, but you could have many resources failing, many classes.
So these three views really work together in that the classes are telling you WHAT in your system has failed. They clue you in to the services or applications that may be experiencing issues. And then the nodes are telling you WHERE this has occurred. And the resources are telling you HOW these failures have occurred. And so, by using these three together you immediately understand WHAT, WHERE and HOW. You immediately have a story with perspective about what's happening in your infrastructure, so you immediately have actionable information that you can use to troubleshoot and resolve the issue. And then of course, you can come back to event inspector and validate that you've in fact resolved the issue.
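Editor's illustration: the WHAT, WHERE and HOW rollup Joe describes can be sketched in a few lines of Python. This is only a rough sketch of the idea, not how event inspector is implemented. The event records, their field names ("node", "class", "resource", "status") and the rollup_failures helper are all hypothetical and are not taken from Puppet's report format or any Puppet API.

```python
from collections import Counter

# Hypothetical event records. The field names and values are made up for
# illustration and do not reflect Puppet's actual report schema.
events = [
    {"node": "web01", "class": "Apache", "resource": "Service[httpd]", "status": "failure"},
    {"node": "web02", "class": "Apache", "resource": "Service[httpd]", "status": "failure"},
    {"node": "web03", "class": "Apache", "resource": "Service[httpd]", "status": "failure"},
    {"node": "db01", "class": "Ntp", "resource": "File[/etc/ntp.conf]", "status": "success"},
]


def rollup_failures(events):
    """Count failed events from each of the three perspectives."""
    failed = [e for e in events if e["status"] == "failure"]
    return {
        view: Counter(e[view] for e in failed)
        for view in ("class", "node", "resource")
    }


summary = rollup_failures(events)
for view, counts in summary.items():
    # One line per perspective: classes (WHAT), nodes (WHERE), resources (HOW).
    print(view, dict(counts))
```

In this made-up data, three different nodes report a failure, but the class and resource counts each collapse to a single entry. That is the "you should look here" effect described above: a long node list is traced back to one class and one resource.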