In his keynote, Google site reliability manager Gordon Rowell said he identifies with sysadmins and the weird turns their culture can take:
"I’m one of those weird people who find load balancing fun."
Jokes aside, his work involves building infrastructure that can scale to levels nobody is even considering today, and building it well enough that it could reach that scale next week. "I internalize our external IT, so we act like our own customers," he said. Teams at Google follow a handful of rules to create a stable, scalable product.
1. Make the Service Scale
"We scale," said Gordon. "We build to scale. We don’t deploy 'a server.' I’m not going to let that out the door. That’s not a reliable system. I can’t let a system go out that if a disk goes, that system will break."
Scale doesn't mean "double or triple," either. It means growing by a factor of 10 or 100. "You should be ready for that to happen next week," he said.
2. Make Deployments Consistent
This is where Puppet is important to Google: It ensures the system behaves the same way every time.
"Diversity is good for people - not platforms," he said. "If you have to hand-tweak something, you will get it wrong if there are 50 of them. You’ll make mistakes."
3. Understand Every Layer
Slow disks affect the performance of the machine, which affects the performance of the database, which affects the performance of the application. Until you understand all the layers, you don’t understand your system, because any part can fail, and then the whole system will fail. "And things will fail," he said.
4. Monitor Everything
It’s nice to know your web server is up. Most people know that much. But how many requests is it handling? How long is each request taking? Which are expensive? Which involve a cross-system call?
Here, Gordon touched on something near and dear to operations teams everywhere:
"This is a callout to the dev side of community," he said, "please put monitoring inflection points in everything you write. Put in something we can look at. I want to know how many requests went into the database at a particular point, and how many came out. I want to know when we can’t scale anymore and we have to rearchitect. Because those points come."
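As a rough illustration of what such monitoring points might look like in application code, here is a minimal Python sketch. It is not something shown in the keynote: the `instrumented` decorator, the in-memory `counters` and `latencies` stores, and the `query_database` stand-in are all hypothetical names made up for this example. The idea is simply to count requests going in and coming out of a call, and to record how long each one took.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store. A real service would export these
# numbers to whatever monitoring system it already uses.
counters = defaultdict(int)
latencies = defaultdict(list)


def instrumented(name):
    """Count calls into and out of a function and record each call's duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            counters[name + ".in"] += 1
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                # The request went in but did not come out cleanly.
                counters[name + ".errors"] += 1
                raise
            else:
                counters[name + ".out"] += 1
                return result
            finally:
                # Runs on both success and failure, so every call is timed.
                latencies[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("db.query")
def query_database(sql):
    # Stand-in for a real database call (hypothetical).
    if "bad" in sql:
        raise ValueError("query failed")
    return ["row"]
```

After a few calls to `query_database`, comparing `counters["db.query.in"]` with `counters["db.query.out"]` shows how many requests went in and how many came out, and `latencies["db.query"]` shows which calls were expensive.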
5. Plan for Failure
"We take systems and break them," he said. "We pull disks out. Do we degrade gracefully? Do we fall in a heap?"
"Take the system apart. Slow the database down. Put it on slow disks and see what happens. Put in a random cron job and see what that does to your system. If it’s still running, your database is probably now performing badly."
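The same "break it and see" idea can be tried in miniature in code. The following Python sketch is an assumption-laden illustration, not a description of Google's tooling: `FlakyDependency` and `fetch_profile` are hypothetical names, and the stale-cache fallback is just one possible way to degrade gracefully rather than fall in a heap.

```python
import random


class FlakyDependency:
    """Wrap a callable and make a fraction of its calls fail (fault injection)."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def fetch_profile(user_id, backend, cache):
    """Ask the backend first; if it fails, fall back to stale cached data."""
    try:
        profile = backend(user_id)
        cache[user_id] = profile
        return profile, "fresh"
    except ConnectionError:
        if user_id in cache:
            return cache[user_id], "stale"
        return None, "unavailable"
```

Wrapping a healthy backend in `FlakyDependency` with a `failure_rate` of `1.0` simulates pulling the disk out entirely. A caller that returns `"stale"` data in that situation is degrading gracefully, while one that raises an unhandled exception is falling in a heap.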