How Walmart scaled Puppet to 55,000 nodes… and beyond
*Editor's note: This post was originally published on Walmart's TechBetter site, under the title Walmart Scales to 55,000-Plus Nodes on Puppet, and is republished with the kind permission of the TechBetter editors. Since scaling to 55,000 nodes with open source Puppet, Walmart has adopted Puppet Enterprise and expanded beyond its Linux estate to its Windows infrastructure. We're embedding here the video of Martin's talk last month at PuppetConf 2016, Collaboration and Empowerment Driving Change in Infrastructure with Culture.*
A little more than three years ago, we embarked on a journey. The goal was simple: to reduce our delivery times and improve our build quality by using automation. Thanks to great team members as well as some awesome open source software, we’ve made major strides toward achieving that goal.
The whole story begins with trust and empowerment. A small team was formed to start automating our server-build processes with the eventual hope of bringing automation to our entire distributed compute environment, including servers in our corporate data centers and public clouds, as well as in our distribution centers and stores. To achieve this, we chose to work with open source Puppet 3. Management endorsed our decision, and so we got to work.
In the beginning, we didn’t know how many nodes we could run Puppet on. What we did know was that we wanted to unify and centralize the function, to have one process and one point of contact for as many of our build processes as possible. A major turning point occurred when we began building greenfield data center servers with Puppet around the summer of 2014. Management sent down a major challenge that in retrospect was surprisingly non-directive: “How many nodes can you get Puppet on before holiday?”
We came back with 2,000, which we thought was pretty aggressive. A small coalition of engineers from various infrastructure disciplines stood in our VP’s office, feeling confident that 2,000 nodes would make a statement and could help us make a difference. The VP looked at the number 2,000, written on his dry-erase board, calmly rubbed out the “2” and replaced it with a “7.” Our jaws actually dropped.
We had been talking about taking a new technology from zero to about 500 stores in less than two months. That alone would have more than doubled our Puppet footprint at the time. Would the system even scale that big? We had no idea. And now the plan was to take that doubling and triple it.
Our VP said, “We have over 10,000 stores. 500 won’t make that big of a difference. It won’t be enough to change how the sysadmin function works.” One of the engineers in the room, who had worked on store systems nearly his whole career, said, “Well, if we want to shoot for 7,000, why not do the whole chain?” That meant 30,000 nodes. The VP said, “Alright! That sounds like you’re dreaming big.”
We filed out of that office, not sure whether we could do it or not, but together we were going to try. If we failed, it wouldn’t be because we hadn’t given it our all.
Over the next two months, we deployed nodes, often 500 or 2,000 at a time. The first time we tried to go to 500, we overwhelmed our Puppet servers, so we learned to add splay and retry logic to our installation and bootstrap process. Several times, we broke our Puppet infrastructure and had to build more Puppet servers or classifiers, or make the ones we had bigger.
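For readers unfamiliar with the terms, here is a minimal sketch of what splay and retry logic in a bootstrap script can look like. It is an illustration under stated assumptions, not the code we ran: the function names, the node name, the 30-minute window and the delay values are all hypothetical. The idea is that each node waits a different, stable amount of time before its first check-in, and backs off with jitter when the server is too busy to answer.

```python
import hashlib
import random
import time


def splay_delay(node_id: str, window_seconds: int) -> int:
    """Return a stable per-node delay in [0, window_seconds).

    Hashing the node name spreads first check-ins across the window
    instead of letting every node hit the servers at the same moment.
    (Hypothetical helper; names and values are illustrative.)
    """
    digest = hashlib.sha256(node_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % window_seconds


def run_with_retry(action, max_attempts=5, base_delay=2.0, max_delay=300.0,
                   sleep=time.sleep, jitter=random.random):
    """Call action(); on ConnectionError, retry with capped exponential backoff.

    The wait before each retry is a random fraction of the current
    ceiling, so failed nodes do not all come back at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * jitter())
```

A bootstrap script built this way would sleep for something like `splay_delay("store-0001-pos-01", 1800)` seconds and then wrap its first registration call in `run_with_retry`. With 500 nodes and a 30-minute window, that averages out to roughly one new check-in every few seconds rather than 500 at once.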
At any point, our management could have said, “Thanks for trying, but it looks like this isn’t going to go the distance.” But they gave us the chance to work through the problems, and that made all the difference. The system held up. Then, right around the middle of October — two weeks before our deadline — we finished it. We had more than 30,000 nodes reporting into a single, load-balanced infrastructure.
Once we got Puppet into the store chain, it was fairly easy to get it onto the rest of our brownfield server footprint. We had answered the question of whether or not Puppet could scale. But much more importantly, we had proven to ourselves and to the organization as a whole that we could use Puppet as a system to manage infrastructure change with both speed and quality.
We upgraded the agent software, and as we learned the ins and outs of how Puppet worked, we used it to manage more and change more. And we are still expanding what we manage with the tooling today. Success with the store chain rollout gave us the confidence to grow our environment even larger.
Once Puppet 4 came out and support for Puppet on Windows improved, our Windows teams decided they wanted to use Puppet as well. Today, we have more than 55,000 nodes reporting into a single administrative Puppet instance. We’ve upgraded the infrastructure three times, and we’re running the latest versions of all the tools.
But it all began with our leadership making the conscious decision to trust us. That led to an amazing level of pride and ownership in the solution we built, and that in turn led to some amazing results, for us and for our company.
Martin Jackson is an enterprise technical expert at Walmart.