How To Sell Configuration Management to Your Boss
From its earliest days, Puppet has been first and foremost a tool written for practitioners. The original engineering intent behind the software was to build a better configuration management tool based on research and hands-on experience with the problems and pain points present in early tools in the space. Consequently, Puppet has always found firm traction with system administrators directly responsible for implementation and management of diverse, proliferative and/or complex collections of systems.
Hands-on experience with older means of systems management — legacy tooling such as BMC’s BladeLogic platform, succeeding generations of CFEngine, or in-house mixes of manual procedure and proprietary scripting — gives system administrators a natural appreciation for automated, standard, repeatable configuration management.
If you need to sell the value of configuration management to the business higher up the stack, however, you’ll need to be prepared to lay out exactly what benefits configuration management can bring to the organization. If you’re at this point, then you are probably in a position to evaluate new technologies yourself, and have a clear channel of communication open so you can contribute to technology adoption decisions.
But how do you go about isolating and nailing down exactly what configuration management does bring to an organization, not in terms of sysadmin happiness but in terms of direct value to the business? The specifics (how to frame presentations and reports, which numbers and metrics are needed, mandatory coversheet elements, and so on) vary from org to org, but in principle the core values laid out below apply broadly and can help describe, in a boss-friendly context, why you and your organization should embrace configuration management.
Benefits of Configuration Management
From this point forward I’ll refer to Configuration Management as CM, as the former is quite the mouthful.
Most of the effort involved in system administration workflows is not about the final deployment step, but about the configuration. Duplicating a perfect golden image across tens or hundreds of instances is arguably a very efficient way to rapidly deploy a new configuration, but benchmarking the deployment time alone doesn’t capture the whole story. The process that led to the creation of the golden image can’t be discounted.
Knowing what configuration needs to happen to take a vanilla distribution (Red Hat, Debian, Solaris, Windows Server 2012) to a system with relevant application(s) installed, security policies implemented, networking configured and the metaphorical “all systems go” state is the responsibility of system administrators. Traditionally, server deployment from a pristine state has involved drawn-out, manual procedures. Once authorization has been issued, it can take days, weeks, or even months to go through the handoff and verification steps needed to deliver a new instance of a server or application stack. Mistakes made while manually editing files, changing settings and installing software can necessitate repeated rollback and repetition of steps, if the mistakes are caught at all. If mistakes aren’t caught before the system goes live, the result can be inconsistent systems with differing “personalities.”
Golden images, once touted as the solution to this problem, in conjunction with virtualization, don’t address requirements around the minor configuration differences from one stack instance to the next, so imperfections introduced during manual tweaking can still arise. Golden images also don’t obviate the need for a process to build and modify the pristine state which the image is intended to preserve. Images tend to spawn tree-like structures of minor variations, datacenter-specific editions and multiplying lineages and version numbers, as development of the image progresses. Working with large binary data objects is slow and difficult to manage well.
Sysadmins working with images can be likened to a Hollywood production crew trying to produce a movie. Every scene is going to require multiple takes, and if an actor laughs at the wrong time or a boom mic drifts down into the picture, the whole scene needs to be reset, beginning positions re-assumed and everything started again from the top. The process is manual and expensive. That said, the end result is certainly worth working for. A finished film can be mass produced, distributed, and simultaneously entertain a huge audience. It won’t be perfect and won’t work for everyone. But with some more manual work things like subtitles and dubbing can expand the audience to new locations, editing can open it up to more restrictive markets, and global success can eventually be achieved.
Sysadmins, however, have an added requirement that Hollywood directors don’t have to deal with: almost the second the film is finished, they will be asked to go back in and add scenes, change dialog, adjust special effects and then replicate those changes across every edition and language in which the film has been released. Continuously. Because of the medium being worked, making these changes across a stable of golden images manually or with limited tooling is always going to be an inefficient and labor-intensive process.
Modern CM is “infrastructure as code.” Changes to code are cheap. Changes to code are easy to revision control. Changes to code can be validated in a continuous integration pipeline, statically analyzed for errors, unit tests run to check assertions, acceptance tests run to verify functionality — all orchestrated by an automated process. Once verified, changes can be deployed simultaneously to a target set of instances.
CM eliminates manual variation and error. It boils away image management to cheap, easy, blank OS images that are fed in as input, and the myriad editions, languages, and special-cut configured and deployed variations are dropped out the hopper at the other end. That’s assuming a product image is even necessary, as CM can be used directly on every node or system as it comes up, always starting with a blank OS base. Since configuration management isn’t limited to new deployments, the same process by which deployments are defined becomes the process by which changes are rolled out, eliminating variance and deviation between generations of servers.
Visibility into current infrastructure state is like a smartphone. If you’ve never had it before it’s easy to live without, but as soon as you get one, the part of your life when you could do without is over.
A good CM solution is not just about running a series of scripts the first time a machine is deployed. It’s also about being able to confidently say to yourself, your sysadmins, and your auditors, “this is how it is.” Present tense. With regard not only to pristine systems just deployed, but also systems deployed months ago, systems that have undergone patching, application updates, policy changes and more.
In the context of systems management, visibility means knowing whether a particular setting has been set, a bit has been toggled, a resource has been reserved for a particular application, or a service is currently deployed. This is not the same as monitoring, since monitoring is really about letting you know if things are working and if impending events require your attention. CM visibility is about knowing if the more static configuration and settings assertions still hold true.
For any environment where users or admins have manual login privileges or are known to make changes themselves on live servers, this is a big deal. Say an application developed some instability on a couple of servers and the on-call admin logged in as root to investigate. Maybe the admin needed to turn on a debug interface to trace down the problem. And BOOM, we’re immediately opening the door to all kinds of human error. After the issue is resolved and the app stabilized, did the debug interface get turned off? We don’t know! Did any other changes get made to the app during the debugging process? We don’t know! Can we guarantee that on the servers investigated, the app continues to behave in accordance with known guidelines? We don’t know.
What CM brings to the table is a means to obtain this knowledge. Unlike custom scripts, CM should be running continuously, checking and verifying at a granular level a set of specific assertions — such as ensuring the debug interface is turned off — which in aggregate comprise a general assurance that the app configuration matches a known good state. CM will assert and report on this state for every system under management.
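As a rough illustration of what "a set of specific assertions" could look like, here is a minimal Python sketch. It is not how any particular CM tool works internally; the `Assertion` class, the `check` function, and the setting names and values are all hypothetical, chosen to mirror the debug-interface scenario above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Assertion:
    """One granular statement about how a managed system should look."""
    setting: str
    expected: Any


def check(assertions: list[Assertion], actual: dict[str, Any]) -> list[str]:
    """Return one human-readable finding per assertion that does not hold."""
    findings = []
    for a in assertions:
        observed = actual.get(a.setting, "<unset>")
        if observed != a.expected:
            findings.append(
                f"{a.setting}: expected {a.expected!r}, found {observed!r}"
            )
    return findings


# Hypothetical known-good state for the app (illustrative values only).
KNOWN_GOOD = [
    Assertion("debug_interface", "off"),
    Assertion("app_version", "2.4.1"),
]

# Hypothetical snapshot of a server after the on-call debugging session.
server_state = {"debug_interface": "on", "app_version": "2.4.1"}

for finding in check(KNOWN_GOOD, server_state):
    print(finding)
```

An empty list of findings is the "general assurance" in aggregate: every individual assertion held. A real CM tool would gather the actual state from the system itself rather than from a hand-written dictionary.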
The benefits CM offers through this kind of visibility are huge. In brief, visibility grants you:
- Confidence. When a change is made, you know it is being made to a well-understood field of systems.
- Auditability. You are able to, in effect, audit a configuration state and verify an assertion against a current record without having to create a custom check script (with possible bugs) to be run on a target machine.
- Predictability. When it's time to roll out a new application version or submit to the mercy of an auditor, having visibility into current configuration state removes surprises, reduces unexpected downtime and speeds up the audit process.
Unlike efficiency, which is typically seen as a “bigger, better, faster” kind of benefit, the level of visibility and auditability offered by mature CM is often something brand-new to organizations adopting a CM solution for the first time. It doesn’t mean visibility is any more important than efficiency or other benefits listed, but since you’re trying to sell the benefits of CM to other people, it’s worth pointing out.
Configuration Drift Remediation
A corollary of visibility in the context of CM is exposure and remediation of configuration drift. Configuration drift is what happens whenever a change is made to a piece of configuration on one server but not on another server that is supposed to share it. Common configuration should usually stay common between servers. For example, two servers on the same network running the same application stack should usually both be using the same NTP server.
Notwithstanding extraordinary circumstances, two identical server instances in the same app pool running the same app should define the same file descriptor limits for the app users, the same Java heap size for the app, and the same version of the app should be running. Configuration drift typically compounds over time as a lingering debt of natural human error. Servers develop personalities. Cluster A starts to behave differently than Cluster B, even though when initially deployed, they were identical. Individual machines start to be more or less reliable depending on who made changes, when, and how.
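To make the idea of drift concrete, the following Python sketch compares the configuration reported by each server in a pool and flags any setting the pool does not agree on. The `find_drift` function, the server names, and the setting values are assumptions made up for illustration, not output from a real tool.

```python
from collections import defaultdict


def find_drift(pool: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Map each drifted setting to {value: [servers reporting that value]}."""
    # Consider every setting seen anywhere, so a setting that is
    # missing on one server also counts as drift.
    settings = {name for config in pool.values() for name in config}
    drift = {}
    for setting in sorted(settings):
        servers_by_value = defaultdict(list)
        for server in sorted(pool):
            value = pool[server].get(setting, "<missing>")
            servers_by_value[value].append(server)
        # The setting has drifted if the pool reports more than one value.
        if len(servers_by_value) > 1:
            drift[setting] = dict(servers_by_value)
    return drift


# Hypothetical app pool: two servers that were identical when deployed.
app_pool = {
    "app01": {"ntp_server": "ntp1.example.com", "nofile_limit": "65536", "java_heap": "2g"},
    "app02": {"ntp_server": "ntp1.example.com", "nofile_limit": "65536", "java_heap": "4g"},
}

for setting, values in find_drift(app_pool).items():
    print(setting, values)
```

Note that this only shows that the servers disagree with each other. Deciding which value is correct still requires a defined known-good state, which is what CM supplies.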
Besides exposing configuration drift through improved visibility into real-time configuration state, CM offers the option of forcibly eliminating it. If a sysadmin manually edits a managed registry key, CM will put it back. If an admin changes the password on a local account, CM will come in behind and undo the change. Taking the hard line like this gives a CM tool the ability to require that if changes are made, they are made uniformly and consistently by gating access at the CM level.
Taking a softer line, CM can alternatively alert admins to differences in the configuration on their systems and the defined “known good” configuration under management, and allow them to make the decision whether to forcibly revert the drift identified, or instead incorporate it into the known good state and ensure that it is consistently applied everywhere. In either mode of operation, good CM gives you a means to anchor configuration drift, prevent ad-hoc configuration changes and keep your environment, as a whole, aligned with a single source of truth.
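The two modes of operation can be sketched with a single function and a flag. This is a simplified Python illustration under assumed names (`reconcile`, `enforce`, and the example settings are all hypothetical); a real CM tool manages far richer resources than a flat dictionary.

```python
from typing import Optional


def reconcile(
    desired: dict[str, str],
    actual: dict[str, str],
    enforce: bool,
) -> tuple[dict[str, str], list[tuple[str, Optional[str], str]]]:
    """Compare actual state to desired state.

    Returns (resulting_state, differences), where each difference is
    (setting, found_value, desired_value). With enforce=True the hard
    line is taken and drift is reverted; with enforce=False the state
    is left untouched and the differences are only reported.
    """
    result = dict(actual)  # never mutate the caller's snapshot
    differences = []
    for setting, want in desired.items():
        have = actual.get(setting)
        if have != want:
            differences.append((setting, have, want))
            if enforce:
                result[setting] = want
    return result, differences


# Hypothetical known-good state and a server that has drifted from it.
desired = {"ntp_server": "ntp1.example.com", "debug_interface": "off"}
actual = {"ntp_server": "ntp1.example.com", "debug_interface": "on"}

# Softer line: report the drift and let an admin decide.
_, report = reconcile(desired, actual, enforce=False)
print("report only:", report)

# Hard line: put the managed setting back.
fixed, _ = reconcile(desired, actual, enforce=True)
print("after enforcement:", fixed)
```

Running the enforcing mode a second time against the corrected state reports no differences, which is the property that lets the same definition serve both for initial deployment and for ongoing drift control.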
Back to the “source of truth” analysis, sysadmins are no better than developers at keeping documentation up to date, especially for rapidly iterating systems and components. The general form of a procedure might stand the test of time very well, but even when it does, the details will be constantly shifting. CM should be a living document that provides machine information for implementing the infrastructure it describes while simultaneously remaining easy to use and understand.
Like an RFC or a product requirements document, CM should not be about implementation. The living documentation should be a description of what the infrastructure needs to look like, and it should cleanly separate that description out from specific implementation components that will drive application of the virtual blueprint. As system administrators, we want first and foremost to be architects, not electricians. We have the ability and the skill necessary to go in and fix the wiring if we have to, but our time is better spent designing layouts and rooms, ensuring structural integrity, and talking to our customers to make sure we are meeting their needs. Painting, carpeting, plumbing — all these can be handled by a subcontractor, and once the statement of work has been drawn up, shouldn’t require our attention on a day-to-day basis.
While not sufficient documentation in and of itself, CM can fulfill the role of a blueprint in outlining in an organized, standardized fashion everything necessary to understand how an infrastructure is put together. Like a blueprint, CM intrinsically provides low-level documentation of what configuration on any given machine in your infrastructure looks like. Combine that with the earlier assertion that modern CM is “infrastructure as code,” and we mix in an intent log. Like code, CM should be kept in a revision-controlled repository, and every change to the blueprint should be commented on by the administrator who commits the change. In the absence of up-to-date documentation in a non-CM situation, when RTFM fails, you have to go straight to debugging live systems to figure out why something isn’t working or how to replicate it on a new machine. With CM, on the other hand, after RTFM fails, we have RTFS to fall back to before needing to do the sysadmin equivalent of RTFB: completely start over, rediscover, repeat, and recapture previously expended configuration effort.
(See here for a brief explanation of the terms RTFM, RTFS, and RTFB.)
Speaking of which, at the end of the day when you go home and get hit by a beer truck, what does your boss have left? Back in the BOFH epoch the answer would have been “not much.” The procedural innovation of documenting infrastructure means that even shops without CM will have something left to show for all those years of service when the suds finally get hosed away. But collections of custom Perl scripts tying in to in-house apps with implementation-defined APIs, no matter how powerful, can be likened to Willy Wonka’s Wonkavator in that yes, it can do what it needs to do, but it’s a one-of-a-kind system that doesn’t drive itself, and pushing buttons at random is probably a really bad idea.
The documentation that has been left behind will be helpful if it’s up to date. Wordy, a lot of information to consume, but helpful. Unfortunately, system administrators are historically, er, lenient when it comes to the dual task of implementing a system and documenting it, updating the system and updating the documentation, upgrading the system and re-writing the documentation. It’s the latter half of this dual-wielding feat that tends to suffer the greatest neglect, and it’s the latter half that will be hardest to rebuild in the wake of the beer truck incident.
The same capabilities that make CM living documentation and allow it to eliminate configuration drift mean that every time an admin makes a tweak, fixes a bug, updates an application, or otherwise expends configuration effort, that effort is repeatable. It’s captured in the code, and accessible to co-workers and successors. The key difference between capturing work done in a CM solution, versus an in-house script, is standardization. That’s especially important in factoring in how long it will take a successor to ramp up and understand how to use the system.
Telling the CM Story
Practitioners have a natural attraction towards being a wizard. Magically, an app stack is deployed. Magically, a deployed application is updated across hundreds of instances. We’re attracted to methodologies and technologies that enable this kind of wizardry.
There’s distinct danger, though, in blindly chasing after new magic purely for the “ooh shiny!” factor. That’s why it’s important to shift one’s perspective from the benefit gained by the practitioner (newer, shinier magic) to the benefit gained by the business.
Talking about gains in efficiency, visibility, configuration drift remediation, documentation and captured effort, and other benefits helps give context to what configuration management can bring to the table. But the real gain from reading these stories comes from finding in them an allegory for your own organization’s IT business pain points. If provisioning and configuring a new app stack takes two weeks, that’s not just sysadmin time, that’s business time. Errors and inconsistencies across application instances aren’t just headaches for the sysadmins — they impact customer satisfaction and business success. The common theme of benefits such as faster updates, faster audits and reduced avoidable downtime is that they all contribute to improving the speed of business. Exposition of how configuration management adds value comes from mapping the cost of dealing with pain points to the savings achieved by automating them away. At the end of the day, that is what will sell the idea of configuration management up the stack.
Reid Vandewiele is a technical solutions engineer at Puppet Labs.