One of the core principles of DevOps is that responsibilities should be shared across teams and there should be no silos. Silos can have a number of detrimental effects on product development, from making communication across teams less efficient to creating an “us v. them” mentality. One of the most insidious impacts of team silos is the “chuck it over the fence” anti-pattern, where teams complete their part of a project and then relinquish responsibility to the next downstream team. With this mode of operation, ownership is hard to establish until, ultimately, you get to the end of the pipeline and the downstream teams are forced to own the remaining, often tedious, tasks required to ship code.
As we evolved our practices for building and shipping our flagship product, Puppet Enterprise, we found some of these operating modes entrenched in the way we work.
Integration as a team
As a young company, early in our development, we struggled to release our software both predictably and with the scope that we had planned. Some of our problems seemed to stem from a lack of ownership: We had continuous integration (CI) testing but it lacked a single owner, it was often red, and cutting a release was an exercise in herding cats.
In looking to assign ownership for these areas, we decided to establish an integration team as part of a reorganization of our engineering department. One of the core responsibilities of this new team was getting our continuous integration pipelines green in preparation for any release in progress, a key requirement for shipping our product on schedule. The integration team also guaranteed that new features could be integrated into our installer and that existing customers could upgrade to new versions (with those new features) without breaking their deployments.
The thinking was that by explicitly assigning ownership for these important responsibilities we could make sure a single team was accountable for addressing integration issues blocking release. However, it doesn’t take a crystal ball to see how the creation of this team led to silos. DevOps stresses the importance of systems thinking, and clearly we had developed a siloed system that pushed a core responsibility — namely code integration into our product — to a downstream team.
All I see is red
At Puppet, we have a standard that we don’t integrate new code into builds when our CI tests are red. Nothing revolutionary there. For the most part we were pretty good at adhering to that standard for component-level tests (i.e., tests for the individual components that comprise our stack). However, the integration of these components into our product build was a different story. Our product-level CI tests were not only often red, they were often failing in multiple areas. Worse, we continued to integrate new features and new development on existing components on top of the already broken builds.
How did we get to this state? Silos.
Our engineering teams were pretty good about keeping their component-level tests green, but because they didn’t own the product-level tests, it was easy to pass ownership downstream to the integration team. The result: We needed 6+ weeks of a “hardening period,” where we froze all development on our release branch to address our open issues in order to get our product shipped.
While having a 6-week freeze period is inefficient, we had worse problems. Because we continued to merge code on top of already failing tests, we eliminated much of the “early warning system” value from our CI systems, and reduced visibility into whether new development was integrating successfully with our existing code base. The backlog of issues that accumulated was so large that we needed to temporarily assign engineers from across the organization to help the integration team burn through it. Not only was this a morale killer for the people being moved, it also violated another principle we try to adhere to: maintaining long-lived teams.
Breaking down the silos
As with most issues encountered in product development, the solution usually requires both a technological and a cultural component. This challenge was no different.
On the technological side, we developed integration “smoke” tests that provided engineers with quicker and more comprehensive feedback on whether their component code would successfully integrate with the product build. We also spent a great deal of time optimizing our CI system to avoid transient errors that resulted from scheduling jobs despite a lack of compute resources, reworking poorly constructed tests, and moving some complex acceptance tests to the unit/component level.
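To make the scheduling point concrete, here is a minimal sketch of one way a CI dispatcher could avoid starting jobs it has no capacity for, leaving them queued instead of letting them fail transiently. This is an illustration only, not a description of Puppet's actual CI system: the `Job` type, the `dispatch` function, and the CPU-based capacity model are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str   # hypothetical job label
    cpus: int   # hypothetical resource requirement


def dispatch(queue, free_cpus):
    """Start queued jobs in FIFO order while capacity remains.

    Jobs that do not fit stay in the queue rather than being started
    on an overcommitted worker. Returns the names of the jobs started
    and the capacity left over.
    """
    started = []
    while queue and queue[0].cpus <= free_cpus:
        job = queue.popleft()
        free_cpus -= job.cpus
        started.append(job.name)
    return started, free_cpus
```

With four free CPUs and a queue of a 2-CPU smoke job followed by a 4-CPU acceptance job, only the smoke job starts. The acceptance job waits for capacity instead of producing a red result that says nothing about the code under test.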
On their own, though, these improvements wouldn’t have succeeded without addressing the cultural aspect. The fundamental change we needed to undertake as an organization was a shift to our scrum teams owning integration, and a change in how new code gets successfully integrated into the mainline product. This change manifested in a number of ways: integrating code earlier in the cycle, immediately reverting code that broke the build (as opposed to trying to fix it in the mainline branch, which left CI red for extended periods), and clear ownership by the scrum teams of failing integration tests. No longer was it okay to promote code into mainline and wait for the integration team to address any issues with integration.
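The revert-first rule and team ownership described above can be sketched as a small decision function. This is a hypothetical illustration rather than Puppet's tooling: the `Merge` record, the `build_is_green` callback, and the team names in the usage note are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Merge:
    sha: str    # hypothetical commit identifier
    team: str   # hypothetical owning scrum team


def first_bad_merge(merges, build_is_green):
    """Return the earliest merge after which the build is red, or None.

    `build_is_green` is an assumed callback that takes the list of
    merges applied so far and reports whether integration tests pass.
    """
    for i, merge in enumerate(merges):
        if not build_is_green(merges[: i + 1]):
            return merge
    return None


def plan_action(merges, build_is_green):
    """Prefer an immediate revert over fixing forward on mainline."""
    bad = first_bad_merge(merges, build_is_green)
    if bad is None:
        return "mainline green: nothing to do"
    return f"revert {bad.sha} and notify team {bad.team}"
```

Given three merges from made-up teams, where the second one breaks the build, `plan_action` returns a revert instruction addressed to the team that owns that merge. The failure goes back to the team that introduced it, not to a downstream group.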
Continuous delivery to boot
Once we had addressed many of the technological and cultural barriers that led us to establish an integration team, we gained some additional key benefits. With our scrum teams owning integration and the focus on keeping our build green, we found that we were able to continually deliver our software — and find critical bugs earlier in our process.
Achieving continuous delivery (CD) was an initiative in and of itself, but getting to CD would not have been possible with our siloed approach to integration. Today, integration occurs early and is owned by the scrum teams, continuous integration systems remain green (and it’s all hands on deck when failures occur), and we’re able to release our software after every sprint, all thanks to eliminating silos.
What’s next?
While we’ve made great strides over the last couple of years in improving our engineering process, there are still areas where we can reduce the time required for developers to receive feedback on the code they merge and the features they are developing. Again, it’s pretty clear that we’ll need both cultural and technical advancements: improvements to our tooling to provide quicker feedback on whether code is integrating and tests are passing, and faster feedback on new features to ensure we continue to solve our customers’ most challenging problems. Stay tuned!
AJ Johnson is a director of engineering at Puppet.