Published on 17 April 2017

GitHub has over 21 million users and hosts more than 52 million projects. People and companies rely on GitHub's service to be available, so it’s critical to avoid causing — or even risking — user-facing outages. This post tells the success story of how GitHub accomplished the Puppet 4 upgrade and migration to Puppet Server with no internal or external downtime, and reduced deployment risk in the process.

Here's some background first. GitHub uses Puppet to manage many thousands of nodes in multiple physical data centers and the public cloud. Puppet was first used within GitHub soon after the company was founded in 2008. Since then, over 225 unique contributors have committed code to the Puppet repository, including well over half of all software engineers currently employed at the company. The code base has hundreds of thousands of lines managing over 100 unique Puppet roles, and is the second most active repository in GitHub's organization (behind the ongoing development of GitHub's core product).

Preparing to upgrade

Puppet's guide "Puppet 3.x to 4.x: Get upgrade-ready" provides the blueprint for upgrading to Puppet 4. GitHub was already running Puppet 3.8 with directory environments and PuppetDB, so we were ready to get started.

The next step was to check for deprecated features. Unfortunately, we quickly found many deprecated patterns in the code base, some dating back several years. Scouring thousands of manifests and hundreds of thousands of lines of code in hopes of finding them all would have been time-prohibitive, and it was too risky simply to enable the future parser and wait to see what broke. This situation called for a novel approach, and fortunately another tool was under development to overcome this roadblock.

Catalog difference testing

Around the same time as the Puppet 4 upgrade was being planned, GitHub was re-examining the workflow used to test and deploy Puppet code. At the time, the development workflow included rspec-puppet tests followed by no-op deployments to representative hosts. This was infuriatingly time-consuming, because each rspec-puppet continuous-integration (CI) run took around 5 minutes, and the test deployments often needed to go to 100-plus hosts for full coverage. It was also risky, because human log review was required to determine whether the test deployments had succeeded.

Ultimately, GitHub built a new development and testing tool, which was open-sourced at PuppetConf 2016 as octocatalog-diff. This tool uses the facts from the most recent Puppet run on a given host to compile the catalogs for the original (master) branch and the new development branch. It then displays the differences between the two catalogs. This is accomplished without putting load on the Puppet servers, or even touching any of the actual systems.
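The core idea of comparing two compiled catalogs can be sketched in a few lines of Python. This is only an illustration, not octocatalog-diff's actual implementation. It assumes each catalog has already been compiled and loaded as a dictionary with a `resources` list whose entries carry `type`, `title`, and `parameters` keys. The function names `index_resources` and `diff_catalogs`, and the sample data, are hypothetical.

```python
def index_resources(catalog):
    """Map (type, title) -> parameters for every resource in a catalog."""
    return {
        (res["type"], res["title"]): res.get("parameters", {})
        for res in catalog["resources"]
    }


def diff_catalogs(old_catalog, new_catalog):
    """Return a list of (change, resource, detail) tuples.

    "-" marks a resource only in the old catalog, "+" one only in the
    new catalog, and "~" a parameter whose value differs between them.
    """
    old = index_resources(old_catalog)
    new = index_resources(new_catalog)
    changes = []
    for key in sorted(old.keys() | new.keys()):
        name = f"{key[0]}[{key[1]}]"
        if key not in new:
            changes.append(("-", name, None))
        elif key not in old:
            changes.append(("+", name, None))
        else:
            for param in sorted(old[key].keys() | new[key].keys()):
                before = old[key].get(param)
                after = new[key].get(param)
                if before != after:
                    changes.append(("~", name, (param, before, after)))
    return changes


# Hypothetical catalogs for the master branch and a development branch.
master = {"resources": [
    {"type": "File", "title": "/etc/motd", "parameters": {"mode": "0644"}},
    {"type": "Package", "title": "ntp", "parameters": {"ensure": "present"}},
]}
branch = {"resources": [
    {"type": "File", "title": "/etc/motd", "parameters": {"mode": "0600"}},
    {"type": "Service", "title": "ntp", "parameters": {"ensure": "running"}},
]}

for change in diff_catalogs(master, branch):
    print(change)
```

Because both catalogs are compiled from the same facts, any entry in the result is attributable to the code change alone, which is what makes the comparison useful as a review aid.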

Using the GitHub CI environment, a developer can run this analysis for all 100-plus unique roles in under four minutes.

[Screenshot: output from the octocatalog-diff CI job, showing catalog changes that arise from the change of a resource parameter]

To assist with the Puppet 4 upgrade, we leveraged the ability of octocatalog-diff to enable the future parser selectively. That four-minute CI job now produced a list of all the catalog differences between the default parser and the future parser for each role, without actually enabling the future parser anywhere.

At the beginning, the future parser produced a huge number of catalog differences, and many catalogs simply failed to compile. Each difference or failure was a broken design or deprecated feature revealing itself! As developers tracked down and fixed the problems, the list of failures and differences began to shrink. Once the CI job showed no differences, it was safe to turn on the future parser for all of our Puppet servers.
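The "no differences means safe to enable" rule described above can be expressed as a simple gate over per-role results. The following is a minimal sketch under stated assumptions: the result format (a list of differences per role, or `None` when the catalog failed to compile), the role names, and the functions `summarize` and `safe_to_enable` are all hypothetical and are not part of octocatalog-diff.

```python
def summarize(results):
    """Group roles by outcome: failed to compile, changed, or identical."""
    summary = {"failed": [], "changed": [], "clean": []}
    for role, outcome in sorted(results.items()):
        if outcome is None:      # assumption: None means compilation failed
            summary["failed"].append(role)
        elif outcome:            # non-empty list of differences
            summary["changed"].append(role)
        else:                    # empty list: catalogs are identical
            summary["clean"].append(role)
    return summary


def safe_to_enable(summary):
    """True only when every role compiled and produced no differences."""
    return not summary["failed"] and not summary["changed"]


# Hypothetical results of comparing default-parser and future-parser catalogs.
results = {
    "role::web": [],
    "role::db": ["File[/etc/my.cnf] content"],
    "role::dns": None,
}
summary = summarize(results)
print(summary)
print("safe to enable:", safe_to_enable(summary))
```

Treating compile failures separately from differences mirrors the two kinds of problems the text describes, and gives developers a shrinking work list rather than a single pass/fail signal.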

Puppet 4 and Puppet Server 2.x

With the future parser enabled on all version 3.8 servers, it was time to push forward to Puppet 4. Catalog difference testing was crucial to this portion of the upgrade as well. Compiling the same catalog with different Puppet versions revealed a few subtle differences. As before, this list of differences created a clear roadmap for updating the areas of the Puppet code base that gave rise to them. When all differences were resolved, the code base was ready for Puppet 4.

This was also the best time to switch from the Rack-based Puppet master to Puppet Server 2.x. We leveraged octocatalog-diff's ability to compare catalogs generated by two separate Puppet servers to confirm that the catalogs produced by Puppet master and Puppet Server 2.x matched, giving confidence in the new Puppet Server deployment. Once all nodes were pointed at newly provisioned Puppet Server instances, the legacy Puppet masters could be retired. (It is important to note that validating catalog equivalence is just one part of a properly planned upgrade. To achieve full confidence, one must also confirm adequate performance and high availability, and validate backup strategies.)

GitHub continues to use octocatalog-diff to validate Puppet Server upgrades. When a new version of Puppet or Puppet Server is released, we compare the catalogs between servers running the old and new versions. If all of the catalogs match, it's safe to try out and ultimately switch over to the new version with no unexpected effects on the fleet.

Before GitHub released octocatalog-diff (but after we started developing it), Puppet released the catalog_preview module. This module also performs catalog difference testing and provides specific tips for migrations. Had catalog_preview and octocatalog-diff both been available at the time of our upgrade, we would have used them together to get the most comprehensive results.

Conclusion

Catalog difference testing remains a key component of GitHub's Puppet development workflow and upgrade process. Implementing this approach has substantially improved speed and reduced deployment risk for day-to-day development, and allowed us to complete a complicated upgrade with no downtime and minimal risk.

Kevin Paulisse is a site reliability engineer at GitHub.
