First, let us introduce ourselves. We (Igor and David) have both worked in software since 1999, Igor mostly on the ops side of things, David mostly on the dev side. From our different experiences, we have both come to the opinion that infrastructure as code needs:
- to be tested like code
- and monitored like infrastructure.
Let's look at why.
Testing has a number of uses: from development, where it can be used to gain insight into the impact of changes, and explore special cases, to documenting edge cases and their behavior, to building shared confidence in the code base. Testing traditionally can't provide feedback from users, though, since it always runs offline.
There are methods in monitoring that can give us that much-needed feedback from users on whether our software is delivering the value they need.
Sometimes you don't really need monitoring, because your users will call you quicker than the monitoring could tell you something is down. But imagine if your monitoring could predict something before it happened.
There are methods in testing that can give us that early warning.
What's the simplest check that could work?
Many programming languages provide standard tools to assess the general quality of the code. These are called compilers, linters or style checkers. They provide quick feedback on whether the machine understands your code, and point out glaring formatting mistakes, so your colleagues don't have to suffer through them. Passing such tests is necessary for the code to be useful and readable, but can't give you much confidence in its business value. Computers are patient — and stupid. The code could still be very dangerous — or just deliver little to no value. Luckily, the tests are very cheap and can run all the time during development, without anyone noticing.
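As a minimal sketch of such a cheap check, the snippet below uses Python's standard `ast` module to ask only whether the machine can parse a piece of code. The function name `syntax_ok` and the sample snippets are our own illustration, not part of any particular linter.

```python
import ast


def syntax_ok(source: str) -> bool:
    """Return True if the source parses as Python, False otherwise."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Parses fine, yet would divide by zero when called:
# a static check cannot tell us that.
dangerous_but_valid = "def price(total): return total / 0"

# Missing colon: the parser rejects it immediately.
broken = "def price(total) return total"

print(syntax_ok(dangerous_but_valid))
print(syntax_ok(broken))
```

The first snippet passes even though it can never return a useful result, which is the limitation described above: passing is necessary, not sufficient.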
Static analysis compares nicely to its monitoring equivalent: a basic check such as whether the process behind a given process identifier (PID) is still running. That is a necessary condition for usefulness, but again, without intimate knowledge of an application, its programming environment and frameworks, the (distributed) context in which it runs, and more, black box testing, like other basic health checks, cannot determine whether actual business value is being provided.
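A basic liveness check of this kind might look like the sketch below. It assumes a POSIX system, where sending signal 0 only performs the existence and permission checks without delivering a signal; the function name `pid_is_running` is hypothetical.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Our own process is certainly alive.
print(pid_is_running(os.getpid()))
```

Note how little this tells us: a process that exists may still be deadlocked, misconfigured, or serving errors.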
The PID-is-down example is a good way to show how we can use methods of root-cause analysis to find a better monitoring check:
- Why is the PID gone?
- Why didn't launchd, upstart, smf or systemd respawn it?
- If they did, why has it still not started?
- What, exactly, does the error log say?
These questions get you to something that is not just actionable, but also, in a complex organisation, routable to the correct team.
Lifting the hood
The magician needs to know how the rabbit goes into the hat.
On the code side, unit tests in their simplest form just check that the code does the same things today that it did yesterday. This can be very helpful if I just made a change and want to know what I broke. Unit tests exercise each component in isolation from the rest of the system. While this isolation allows unit tests to be very quick and specific, the lack of a real system under test also means that it's impossible to test interactions across components.
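To illustrate, here is a small sketch using Python's standard `unittest` module. The function under test, `render_vhost`, and its output format are invented for this example; the tests simply pin down today's behavior so that tomorrow's change shows up as a failure.

```python
import unittest


def render_vhost(name: str, port: int = 80) -> str:
    """Hypothetical function under test: render a one-line vhost entry."""
    return f"server {name} listen {port};"


class RenderVhostTest(unittest.TestCase):
    def test_default_port(self):
        self.assertEqual(
            render_vhost("example.com"), "server example.com listen 80;"
        )

    def test_custom_port(self):
        self.assertEqual(
            render_vhost("example.com", 8080), "server example.com listen 8080;"
        )


# Run the tests explicitly so the sketch works both as a script and when imported.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RenderVhostTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Nothing here starts a web server or touches a real configuration file, which is why it runs in milliseconds and why it cannot say whether the rendered entry actually works.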
With only basic knowledge of an application's inner workings, we can observe its performance. CPU load, memory usage and I/Os per second can tell a story about what can be improved, or where the system behaves differently today than it did yesterday. Like unit tests, this kind of monitoring is very cheap, and can provide good pointers to hot spots. But again, like unit testing, performance monitoring doesn't give you any insight into the reasons behind these metrics. If the database has twice the load it had yesterday, that could be due to an attack, a new version of the code, or just a Black Friday sale.
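The "differently today than yesterday" comparison can be sketched as a relative deviation from a baseline. The function names and the 50 percent tolerance are assumptions chosen for illustration; real thresholds depend on the system.

```python
def deviation(yesterday: float, today: float) -> float:
    """Relative change of today's value against yesterday's baseline."""
    if yesterday == 0:
        raise ValueError("baseline must be non-zero")
    return (today - yesterday) / yesterday


def needs_attention(yesterday: float, today: float, tolerance: float = 0.5) -> bool:
    """Flag a metric that moved more than `tolerance` in either direction."""
    return abs(deviation(yesterday, today)) > tolerance


# Database load doubled overnight: flagged, but the check cannot say why.
print(needs_attention(yesterday=100.0, today=200.0))
```

The check flags the doubled load, yet it cannot distinguish an attack from a new release or a sale; that still takes knowledge of the application.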
Bolting everything together
Throwing more resources at the problem.
Full system tests
To validate that the plane actually flies, we need to put someone in the pilot's seat and put it through its paces. In software engineering, this means putting together a number of typical scenarios, taking the code through its paces in these scenarios, and checking whether the results match our expectations.
The art here is to make the expectations robust enough to survive irrelevant changes in the system, while still sensitive enough to detect actual problems with the code. By running the tests on real systems, you can determine whether your application responded correctly in a realistic configuration. This makes the tests much more expensive to run than unit tests, since each scenario needs systems to run on. It also makes the tests more valuable, as the service really did start when they passed. The trade-off is precision: while a failing unit test always points to a specific location in the code that does not match expectations, full-system tests tend to fail on more generic issues, like “service is not available.”
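As a rough sketch of a robust expectation, the example below starts a stand-in service on a local port and checks only what matters. The `StatusHandler` service, its `/status` path and its JSON fields are all hypothetical; in a real full-system test the service would be the actual application deployed on real systems.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical service: reports its status as JSON."""

    def do_GET(self):
        body = json.dumps(
            {"status": "ok", "version": "1.4.2", "uptime_s": 17}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the test output quiet


def check_service(host: str, port: int) -> bool:
    """Assert on HTTP status and the 'status' field only.

    Version and uptime change all the time and are deliberately ignored.
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/status")
        response = conn.getresponse()
        payload = json.loads(response.read())
        return response.status == 200 and payload.get("status") == "ok"
    finally:
        conn.close()


def run_scenario() -> bool:
    server = HTTPServer(("127.0.0.1", 0), StatusHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return check_service("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()


print(run_scenario())
```

Had the check compared the whole response body, every version bump would break the test without any actual problem in the code.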
Classically, an integration test involves ops people installing a complex, distributed system into a production-like environment, and then letting testers loose to test if the thing actually works. Here is where the methodologies start to converge. This is exactly the same as a full system test, just seen from the operations side. Ops experience can help here by creating realistic scenarios, deploying them quickly and repeatably, and providing success metrics.
The software is finished, so after it compiles, the developer's job is done, and ops can put it on the web, right? Well, no. After the code leaves the strict confines of the dev team, things start to get messy. Users actually want to use it, and they will find all the sharp edges … but the developer will never know unless there is a feedback mechanism. What that looks like very much depends on the kind of software you’re building, but one thing is certain: depending on your users to report bugs is not a path to success.
We cannot know a system before we have observed it in its real-world environment. A system under load behaves differently than it behaves in a controlled (and often, unimaginative) test environment. Some edge cases can be observed only under high load, or when used at high scale by diverse users.
After a system has passed an integration test, we can roll it into production. If we have performance monitoring in place, we can use a green/blue deployment strategy to slowly roll out a system across more instances or to more customers. This type of deployment reduces risk by rolling out a new version to only a part of the production environment. You designate one part of the environment (call it green) for normal use, while the blue part is already running your new changes. As each change is observed in the wild, it can be judged on the metrics coming back from the monitoring. Once you are confident in the change, you can roll it out to the whole environment. Note that a roll-out can be as simple as enabling a feature flag. The customer-centric version of this approach is called A/B testing.
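One way to roll a change out to a growing share of customers behind a feature flag is deterministic bucketing, sketched below. This is an illustration under our own assumptions (the function name `in_rollout`, 100 buckets, SHA-256 as the hash), not a description of any specific deployment tool.

```python
import hashlib


def in_rollout(customer_id: str, percent: int) -> bool:
    """Deterministically place a customer in one of 100 buckets.

    Customers in buckets below `percent` see the new version.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


customers = [f"customer-{n}" for n in range(1000)]
exposed = sum(in_rollout(c, 10) for c in customers)
print(f"{exposed} of {len(customers)} customers see the new version")
```

Because the bucket depends only on the customer identifier, raising the percentage adds customers without switching anyone back, which keeps the metrics for each group comparable over time.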
Instrumenting your systems so you can measure the success of green/blue deployments and A/B tests can provide you with much more in-depth feedback than just relying on bug reports.
The investment in this type of testing is up front, and lies almost entirely in the system under test. It's important to realize this is a strategic investment: The invested developer time needs to be justified by gains in confidence, flexibility and stability from the improved processes.
Whether we use feature flags, or simply roll forward and back, the system must be able to handle it. The database schema, APIs and consumers of APIs, for example, must all be compatible for at least two versions up and down. This capability reduces risk by making it possible to back out a release.
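On the consumer side, this kind of compatibility can be as simple as a reader that accepts both the old and the new shape of a record. The field names below (`name`, `first_name`, `last_name`) are a made-up schema change for illustration.

```python
def read_user(record: dict) -> dict:
    """Accept both the old schema ('name') and the new one
    ('first_name' and 'last_name'), returning one normalized shape."""
    if "first_name" in record:
        full_name = f"{record['first_name']} {record['last_name']}"
    else:
        full_name = record["name"]
    return {"id": record["id"], "full_name": full_name}


old_record = {"id": 1, "name": "Ada Lovelace"}
new_record = {"id": 1, "first_name": "Ada", "last_name": "Lovelace"}

# Both versions normalize to the same result, so rolling the producer
# forward or back does not break this consumer.
print(read_user(old_record) == read_user(new_record))
```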
Some things can be done only in production.
Some issues cannot be debugged anywhere but in production, because it's impossible to reproduce them anywhere but in production, for any number of reasons.
It could be that a production environment uses different hardware, or simply a physically different machine, or a different RAID configuration. But often it's simply the sheer scale of production that we cannot reproduce, or the fact that we don't exactly know how an issue is triggered.
There's more to this issue than simply providing developers with sudo and ssh to production. Ops people usually know a different set of debugging tools. Again, depending on the programming environment in question, the debugging tools enable active intervention in the running system, making you a first-class citizen. Erlang’s monitor is a good example here: it allows us to directly observe and influence the system in real time. Other systems, such as the JVM, allow us to become at least a close observer through JMX. Tools like DTrace can provide deep introspection into the system, but they will only make sense to someone with deep knowledge of the system.
Monitoring the business
With deep knowledge of an application's inner workings and runtime environment, we can perform complex monitoring of the application and everything in it. This can give us an insight into the business benefits the application provides in terms of KPIs.
Exposing a method via JMX is relatively easy, and doing so can reveal deep details about the application. The additional investment for business monitoring can be minimal, as we can use the same tools we use for infrastructure monitoring, but there's a much bigger investment on the application side. On the one hand, this investment depends heavily on the programming environment. On the other hand, we have to know which values we want to expose that tell us something about the business value. This is why working together across team boundaries is so important.
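JMX is specific to the JVM, but the idea of exposing business values carries over to other environments. The sketch below is a plain Python analogue, not JMX: a small in-process registry that counts business events and renders them as text for a monitoring system to collect. The class `BusinessMetrics` and the metric names are hypothetical.

```python
from collections import Counter


class BusinessMetrics:
    """Hypothetical in-process registry of business counters."""

    def __init__(self):
        self._counts = Counter()

    def record(self, name: str, amount: int = 1) -> None:
        """Count one or more occurrences of a business event."""
        self._counts[name] += amount

    def render(self) -> str:
        """Render all counters as 'name value' lines, sorted by name."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._counts.items())
        )


metrics = BusinessMetrics()
metrics.record("orders_completed")
metrics.record("orders_completed")
metrics.record("signups")
print(metrics.render())
```

The code is trivial; deciding that completed orders and signups are the values worth counting is the part that needs developers, ops and the business in the same conversation.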
We think that DevOps means working together, talking and sharing, and that's exactly what we want you to do. The issues we face in complex systems require a diverse range of skills to conquer. Together, we are more — and better.
We promise it's mostly harmless.
David Schmitt is a software engineer at Puppet Labs. His first project was creating ISP users from a database, because there was no one who could have done it manually. Igor Galić is a freelancer and a seasoned open source contributor who knows that fixing the issue in production is only the beginning.