Practicing DevOps at Puppet
Our mission at Puppet is to empower people through software, and by “people” we mean “IT professionals whose job is to deliver applications and services to internal and external end-users.” The users of our software work at over 37,000 organizations worldwide, and have traditionally had titles like sysadmin or operations engineer. But this has been changing: Over the last few years, we’ve seen a transition in how organizations deliver applications and services, moving from functional teams (Development, QA, and Operations) with a linear software delivery lifecycle, into cross-functional teams that rapidly iterate to produce end-user value faster. As an industry, we’ve come to embrace the term DevOps to describe this transition.
Puppet has been part of the DevOps movement since its inception. We’ve surveyed thousands of organizations over the last seven years on the acceleration they’ve experienced by incorporating DevOps practices, publishing the results in our annual State of DevOps Reports. We’ve seen our own customers experience incredible success utilizing our software to automate their delivery processes and become more agile. In summary, we are true believers in DevOps.
We’ve always challenged ourselves to incorporate the practices we preach into our own software delivery lifecycle, and yet there’s always more to do. I’d like to describe some recent lessons from our own drive towards improving how we build software in hopes that they are useful to our customers, users, and the broader DevOps community.
Practicing continuous improvement
The core precept of a DevOps culture is the commitment to continuous improvement. This is the acknowledgement that we are on a never-ending journey to learn and get more efficient at delivering products and services.
The most important process we practice is the retrospective. We run retros for everything — from the traditional retros that feature teams run at the end of each sprint, to learning from production incidents via “blameless retrospectives,” to improving rhythms of the business (like our monthly products review and our annual kickoff), and even extending to lessons we’ve learned in HR processes such as our employee review processes.
Having a culture that enables feedback to flow from the people who are part of a system is a necessary condition for driving improvement in that system. This commitment to continuous improvement is what has enabled us to identify bottlenecks and issues in our engineering practices and address them.
Going from “water-scrum-fall” to Agile
Agile practices and DevOps go hand in hand. But for many organizations, exercising a set of Agile rituals like sprint planning, daily standups, end-of-sprint demos, and retrospectives can obscure the spirit of the Agile manifesto. In Puppet’s case, we found there was more we could do to put Agile principles into practice. We explicitly recommitted to the principles espoused in the Agile manifesto, and brought in an Agile coach to refresh our knowledge and build support across all levels of the organization.
One of the most important changes we’ve instituted is the notion of always being ready to ship. The output of every sprint needs to be deployable. This in turn drives many other good patterns: reducing the size of each batch of work; getting all of the disciplines (Dev, QA, PM, UX, Ops, Docs) to collaborate earlier in the cycle than they typically would; and being able to deliver the results of any sprint to our customers, thereby increasing our agility and allowing us to learn more quickly. Always being ready to ship is a great theoretical goal, but how do we take advantage of it in practice?
Shipping every sprint
Unlike the software-as-a-service (SaaS) world, the vast majority of our customers operate Puppet in their own environments and control their upgrade lifecycle. Most of these customers have told us they can’t absorb more than two releases a year, so our release cadence for Puppet Enterprise is semi-annual. This can produce a temptation to batch some “endgame” work at the end of a cycle, a process we used to call “hardening.” One of the most important practices we’ve incorporated into our latest release (2017.3, shipped in October 2017) is to avoid this temptation and release every sprint to Operations.
We operate a sizable Puppet deployment in-house, and by deploying every sprint into our Site Reliability Engineering (SRE) team’s staging environment, we were able to “shift left” and find quality issues and regressions much earlier in the release lifecycle. More concretely, our SRE team identified 15 ship-stoppers in early sprints that we would have previously found only very late in our cycle. With the upcoming release of Puppet Enterprise (due in the first half of 2018), we’re deploying every sprint into Puppet’s SRE production environment, increasing that benefit even further. Any regression now becomes an “Andon cord pull” moment.
As another example, our new product, Puppet Discovery, was built from the ground up to be self-updatable. In fact, since we shipped the tech preview in mid-October 2017 at PuppetConf, we’ve been able to update our customers’ bits three times, delivering new features. We’ve done this by packaging our roles as container images and having the software function like a virtual appliance, checking for updates periodically and upgrading roles when new container images become available. In this way, the Puppet Discovery team is able to function just like a service team would, even though the product is delivered as an on-premises appliance.
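To make the update check concrete, here is a minimal sketch of the comparison step such an appliance might run on each polling cycle. It is an illustration only, not Puppet Discovery’s actual implementation: the function name, the role names, and the idea of identifying images by digest strings are all assumptions made for the example.

```python
def roles_to_upgrade(running, available):
    """Return the running roles whose published image differs from the one in use.

    Both arguments map a role name to an image digest (hypothetical format).
    Roles that are published but not currently running are ignored.
    """
    return sorted(
        role
        for role, digest in running.items()
        if role in available and available[role] != digest
    )


# Hypothetical state: what the appliance runs now vs. what has been published.
running = {"api": "sha256:aaa", "ui": "sha256:bbb"}
available = {"api": "sha256:aaa", "ui": "sha256:ccc", "reports": "sha256:ddd"}

for role in roles_to_upgrade(running, available):
    # A real appliance would pull the new image and restart the role here.
    print(f"upgrade needed: {role}")
```

Comparing identifiers rather than version numbers keeps the check simple: any newly published image for a running role triggers an upgrade, whatever the release naming scheme.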
Finally, our most recent product, Puppet Pipelines, is delivered first as a SaaS offering and also as an on-premises product. The Pipelines team automatically deploys every change into its staging environment, and does production deployments a few times a week.
Regardless of whether you can deliver features to end-users with every sprint, deploying software to production on a regular basis is a great way to ensure that you’re always ready to ship. But how did we get the Puppet Enterprise product from being ready to ship at the end of the release cycle to being ready to ship at the end of every sprint?
Reducing feedback time for developer check-ins
Puppet produces software that runs on over a hundred different platform targets, and we’ve found there to be a difficult trade-off between running a complete regression test suite on every pull request across every platform and OS combination that we support, and delivering feedback to developers on the quality/safety of their check-ins in a reasonable amount of time. Some of our engineers have reported feedback cycles measured in hours, and in worst-case scenarios, even more than a day! This is simply not good enough for a modern software delivery organization. When an engineer is confronted with that kind of lag time, they tend to batch up work so that they can amortize the overhead. This increases the risk that some of that work produces a regression when it is eventually checked in — an anti-pattern that undermines the principle of always being ready to ship. This is what causes the emergence of an end-game and a hardening period to begin with! The best way to support engineers in following the best practice of checking in small batches of work is to reduce the feedback time for each check-in.
Over the last few months, we’ve invested considerable effort in refactoring and tiering tests, applying the same automation patterns we have for common platforms to the legacy platforms we support, and moving code that doesn’t need to be in the core platform into external modules. Doing that kind of work required us to deliberately create the time and space for it.
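As a rough illustration of what tiering tests can look like, the sketch below plans which suite and platform combinations run at each stage, so that a pull request gets fast feedback while the full matrix still runs later. The stage names, tier numbers, platform names, and policy table are hypothetical and chosen for the example; they are not a description of Puppet’s actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    name: str
    tier: int  # 1 = fast smoke tests, 2 = integration, 3 = full regression


@dataclass(frozen=True)
class Platform:
    name: str
    common: bool  # False marks a legacy or rarely used target


# Hypothetical policy: (highest tier to run, include legacy platforms?) per stage.
STAGE_POLICY = {
    "pull_request": (1, False),
    "merge": (2, False),
    "nightly": (3, True),
}


def plan_jobs(stage, suites, platforms):
    """Return the (suite, platform) pairs to run for the given stage."""
    max_tier, include_legacy = STAGE_POLICY[stage]
    return [
        (suite.name, platform.name)
        for suite in suites
        if suite.tier <= max_tier
        for platform in platforms
        if include_legacy or platform.common
    ]


suites = [Suite("smoke", 1), Suite("integration", 2), Suite("regression", 3)]
platforms = [Platform("common-linux", True), Platform("legacy-unix", False)]

for stage in STAGE_POLICY:
    print(stage, len(plan_jobs(stage, suites, platforms)), "jobs")
```

The point of the design is that the cheapest, most informative checks gate every check-in, while the expensive combinations run less often and off the developer’s critical path.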
Budgeting time for engineering systems improvements
Fast-moving organizations tend to focus all of their energy on delivering customer value. But focusing entirely on user-facing features invariably leads to more cruft, which reduces velocity and puts more pressure on teams to deliver more features faster, leaving even less time for “hygiene” work. This happens even with organizations that don’t accumulate the typical types of debt — test debt, documentation debt, and so on. It’s a simple outcome of the precept that engineering systems that work at a certain level of scale (number of developers, number of concurrent users, number of nodes under management) will rarely scale to the next order of magnitude without a refactor or rearchitecture.
Over the last few months we’ve explicitly asked our teams to budget time (and even whole sprints) for engineering systems health, and the results have allowed us to increase our velocity in the long run. And the best way to identify the biggest bottlenecks (and therefore the biggest potential gains) is to get the engineering disciplines (Dev, QA, Ops) working hand in hand to identify them.
Bringing Dev and QA together organizationally
DevOps is about getting Dev, QA, UX, PM, Ops, and Docs all working together. We believe in cross-functional delivery teams, but extending that notion to reporting structure involves a trade-off. One option is to put everyone on a cross-functional team under a common manager, who may only be deep in one of those disciplines. The other is to stay with functional management structures, which encourage discipline excellence and growth, while creating the right incentives for delivering features in an end-to-end fashion.
When we looked at the role breakdown and handoff processes between Dev and QA, we felt we could get better results if we removed the bright lines between these disciplines. We now have a single title, Engineer, with the expectation that all engineers own both writing code and ensuring its quality. Engineering managers have engineers that bring both the Dev and QA perspectives, and they work together to optimize the velocity of merging features into our master branch. The teams that operate services (for example, the Forge) extend this principle to Operations as well.
One last benefit of practicing DevOps at Puppet has been clarifying what we value in our product teams. First and foremost, it’s about delivering value to our users, and working together as a team to optimize around that goal. To best accomplish that, we focus on building long-lived, autonomous teams. To quote the Agile manifesto, we value individuals and interactions over processes and tools. We gain leverage by removing silos and working towards shared responsibility across teams. And we empower team members to act, while providing clear lines of ownership and accountability.
In summary, I’m blessed to work with a world-class product development organization with fantastic, hard-working individuals. But no matter how sophisticated our organization is, there’s always more we can do to become the engineering organization that we aspire to be. By no means have we addressed all of our challenges, but we’ve made significant improvements over the last few months that have enabled us to ship more value to our customers on a more predictable basis. Best of all, our team members are feeling more productive than they have in the past, and the more productive team members are, the more fired up they are to deliver more value to their customers. And that’s a huge win in my book!
Omri Gazitt is the chief product officer at Puppet.