This week we’re bringing readers an interview with Eric Zounes, a member of our SysOps team. Eric helps build and support Puppet Labs’ infrastructure, using Puppet. Here, Eric shares insights about how he manages our internal logging and monitoring.
1. Please introduce yourself. What do you do here at Puppet Labs? (And where did you work prior to joining Puppet?)
My name is Eric Zounes and I've worked as a SysOps engineer here at Puppet Labs for just over a year. While some parts of the job are that of a traditional systems administrator, I also spend a fair amount of time writing code and, of course, using Puppet.
In addition to using Puppet in production, I'm also fairly involved in testing new features and giving feedback. Previously I interned at Mozilla as a Web Operations Engineer where I maintained infrastructure for their web properties and worked on a project to improve log analysis.
2. What are some of the significant initiatives you’ve worked on at Puppet Labs? What business reasons have driven the initiatives?
In my role at Puppet Labs I started as more of a generalist, as did most of my team members. Over the past year, rapid growth has brought so much change that we have started to specialize and own individual services that we had previously managed collectively.
I took on managing the logging and monitoring infrastructure as part of this change. My co-workers immediately became customers of these services. My focus shifted from thinking about how to monitor services to managing monitoring as a service. This meant that I was now tasked with building and maintaining a workflow around monitoring.
My primary concerns were to provide a convenient way for users to add and remove service checks and to alert interested parties when these checks fail. Users would need a way to test their checks without paging whoever is on call. Since different service owners would be adding checks, we needed clear documentation on actionable alerts.
3. What kind of technology changes have you and your team had to go through to accomplish the change or initiative?
To address these new concerns, I turned to configuration management. We had already been managing monitoring through Puppet, but we didn't have much policy around how alerts were defined. By writing more rigorous Puppet types for service checks, I was able to make them more tunable and enforce rules, such as requiring documentation for anything that escalates to on-call. Using these strict types allowed me to create "monitoring environments" in order to fully test new service checks.
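As a rough illustration of the kind of rule mentioned above, here is a minimal Python sketch of a check definition that refuses to escalate to on-call without documentation. It is not the Puppet types Eric describes; the class name `ServiceCheck` and every field on it (`command`, `escalates_to_oncall`, `doc_url`, `environment`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCheck:
    """Hypothetical service check definition (all field names are assumptions)."""

    name: str
    command: str
    escalates_to_oncall: bool = False
    doc_url: str = ""
    environment: str = "test"  # e.g. a separate "monitoring environment"

    def __post_init__(self):
        # Policy: anything that pages on-call must link to documentation.
        if self.escalates_to_oncall and not self.doc_url.strip():
            raise ValueError(
                f"check {self.name!r} escalates to on-call but has no documentation"
            )
```

Because the rule lives in the definition itself, an undocumented paging check fails when it is declared, before it ever reaches a production monitoring service.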
Testing checks in these environments greatly reduced the amount of noise from the production monitoring service because we weren't adding alerts that hadn't run in a real environment. The other piece of creating a robust workflow was having the ability to get metrics on historical alert data. For example, we have dashboards that can show us how many pages we received in any given week, which ones were the noisiest, and much more.
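The two dashboard questions mentioned above (pages per week, and which checks are noisiest) come down to simple aggregation over historical alerts. The Python sketch below assumes each alert is recorded as a `(date, check_name)` pair; that record shape and the function names are assumptions made for illustration, not a description of the actual dashboards.

```python
from collections import Counter


def pages_per_week(alerts):
    """Count pages per ISO (year, week) from (date, check_name) pairs."""
    counts = Counter()
    for day, _check in alerts:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def noisiest_checks(alerts, top=3):
    """Return the `top` most frequently paging checks with their counts."""
    return Counter(check for _day, check in alerts).most_common(top)
```

Grouping by ISO week keeps weeks that straddle a year boundary together, which matters when comparing page volume from one week to the next.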
4. What hurdles did you have to overcome?
Improving the monitoring pipeline wasn't an easy process. The first issue I encountered was building monitoring requirements for each organization. Each organization wanted to fine-tune how checks were carried out and how they received alerts.
I also had to support a number of different operating systems for our test environments, which made automation more difficult. As a result, the requirements became more of a matrix than a flat list.
Using Puppet I was able to abstract the implementation of these checks so end users really only need to think about how they want to route the alerts.
5. What are some of the outcomes of the work you’ve done?
The outcomes of improving our monitoring workflow are less alert fatigue, a friendlier user experience, better insight into troublesome services, and clearer definitions of SLAs. Checks become more reliable when they can be tested.
Those who are on-call have a much better experience when they have documentation to explain failure scenarios for services they might not be familiar with.
Historical alert data allows us to identify trending service issues and take appropriate action.
Regardless of whether an organization uses Puppet, with configuration management you can enforce process and policy around monitoring to provide consistent, reliable, and actionable alerts.