July 28, 2022

Sleep Through the Night With Self-Healing Infrastructure

With self-healing infrastructure, DevOps engineers can rest assured that their operating systems will stay up and running around the clock. Less manual work, more peace of mind.

So, what is self-healing infrastructure, and how can you get started? This article covers both, along with a few pitfalls to avoid.

What is Self-Healing Infrastructure?
How to Get Started with Self-Healing Infrastructure
What NOT to Do with Self-Healing Infrastructure
A Self-Healing Infrastructure Example
With Self-Healing Infrastructure, Let Your Systems Work For You

What is Self-Healing Infrastructure?

Self-healing infrastructure is an automation methodology that allows systems to identify and repair errors and misconfigurations without any human action.

Self-healing infrastructure is the next natural progression in automation. IT automation has become necessary in today’s environment, where expectations of IT are high, demand for talent far outpaces supply, and threat remediation requires immediate action. Self-healing infrastructure is also the basis of AIOps. Without the ability to automatically remediate errors and misconfigurations, an IT group won’t be able to scale.

The reason for implementing self-healing infrastructure should be to remove interruptions from your workday (and your nights). The goal should be to adopt a well-scheduled work cycle where interruptions are the exception, not the norm.

When I was a DevOps engineer, I had a list of projects I was going to complete “when things slowed down.” That list served me well for many years, and I passed the list on, sepia-toned and faded, to my replacement when I moved on. Imagine a world where those projects get worked on, implemented, and checked off… and the phone isn’t ringing at 3 a.m. because a disk drive is full.

How to Get Started with Self-Healing Infrastructure

✅ DO Start Small

Identify common, repetitive support issues and compliance-breaking configuration changes. Automating those five-minute fixes, which take very little effort each time but add up over time, is a great way to cut down on tickets!

✅ DO Keep It Simple

A complex, multi-step remediation process isn't the place to start.

✅ DO Set Guidelines and Be Transparent

Pick your tools, enforce their use, and store the automation in a code repository so the solutions are readily available. As your self-healing library expands, there will be more and more opportunity to reuse the code, keeping things simple and standardized.

What NOT to Do with Self-Healing Infrastructure

🚫 DON'T Automate Tasks That Change Frequently

Don’t exchange being interrupted by errors and outages for being interrupted by debugging automation.

🚫 DON'T Tackle Complex, Multi-Step Solutions

Monolithic self-healing routines are brittle and high maintenance. It’s better to make the self-healing responses modular and, if appropriate, put them in a toolchain.
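To make "modular" a little more concrete, here is a minimal Python sketch of a toolchain built from small, single-purpose steps. Everything in it is hypothetical: the step names, the simulated `state` dictionary, and the 80% health threshold are assumptions made for illustration, not part of any particular tool.

```python
from typing import Callable, Iterable, List

Step = Callable[[], None]


def run_until_healthy(is_healthy: Callable[[], bool], steps: Iterable[Step]) -> List[str]:
    """Run small remediation steps in order, stopping once the health check passes."""
    applied = []
    for step in steps:
        if is_healthy():
            break
        step()
        applied.append(step.__name__)
    return applied


# Simulated system state so the sketch runs anywhere (hypothetical numbers).
state = {"used_pct": 95}


def purge_tmp() -> None:
    state["used_pct"] -= 10


def compress_logs() -> None:
    state["used_pct"] -= 10


def extend_volume() -> None:
    state["used_pct"] -= 30


applied = run_until_healthy(
    lambda: state["used_pct"] < 80,
    [purge_tmp, compress_logs, extend_volume],
)
print(applied)  # ['purge_tmp', 'compress_logs']
```

Because each step does one thing, it can be tested and reused on its own, and the most disruptive step (extending the volume) only runs if the cheaper ones were not enough.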

🚫 DON'T Ignore the ROI

Spending days scripting and testing automation for a task that takes you five minutes once a month isn’t freeing up your time. Target quick hits and repeat offenders.
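A quick back-of-the-envelope calculation makes the point. The sketch below is only an illustration: the function name is made up, and the 16 hours (roughly two working days) and the 40-times-a-month figure are assumed numbers.

```python
def break_even_months(build_hours: float, minutes_per_fix: float, fixes_per_month: float) -> float:
    """Months until the time saved equals the time spent building the automation."""
    saved_minutes_per_month = minutes_per_fix * fixes_per_month
    if saved_minutes_per_month <= 0:
        raise ValueError("the task must take some time and happen at least occasionally")
    return build_hours * 60 / saved_minutes_per_month


# Assumed: 16 hours of scripting for a five-minute fix that happens once a month.
print(break_even_months(16, 5, 1))   # 192.0 months, i.e. 16 years

# Assumed: the same five-minute fix, but it happens 40 times a month.
print(break_even_months(16, 5, 40))  # 4.8 months
```

The second case is the kind of repeat offender worth targeting first.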

A Self-Healing Infrastructure Example

So what is a practical example of a task that makes sense for self-healing infrastructure? Think of ways you can complete this sentence in your environment:

Every time X happens, I have to Y.

It looks like there are two variables to consider here, but there are actually three parts to this self-healing equation. You need to identify the incident that arises (X) and the action taken every time to remediate it (Y), but also the signal that makes you aware of the incident in the first place. More often than not, that signal is an alert from your monitoring system. So the manual, human-based workflow today is 1) I get an alert, 2) the alert tells me (or I divine from the message and the symptoms) what is going on, and 3) I take action to resolve it.

When broken down like this, it becomes a bit easier to conceptualize how to automate these three actions into a self-healing situation. Identify the message or alert that indicates an incident has occurred, then tie an automated action to the incident. Optionally, add confirmation that the incident has, in fact, been resolved, and notify your monitoring platform to acknowledge and resolve the alert.
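The flow described above (alert, automated action, confirmation, acknowledgement) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any monitoring product's API: the `Alert` and `Remediation` types, the alert names, and the `acknowledge` and `escalate` callbacks are all hypothetical stand-ins. Escalating to a human when there is no matching remediation, or when verification fails, is also an assumption added for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    name: str  # hypothetical alert identifier, e.g. "disk_full"
    host: str


@dataclass
class Remediation:
    action: Callable[[Alert], None]  # the "Y" in "every time X happens, I have to Y"
    verify: Callable[[Alert], bool]  # confirms the incident is actually resolved


REMEDIATIONS: Dict[str, Remediation] = {}


def handle(alert: Alert,
           acknowledge: Callable[[Alert], None],
           escalate: Callable[[Alert], None]) -> str:
    """Tie an alert to its automated action, verify the result, then report back."""
    remediation = REMEDIATIONS.get(alert.name)
    if remediation is None:
        escalate(alert)  # unknown incident: hand it to a human
        return "escalated"
    remediation.action(alert)
    if remediation.verify(alert):
        acknowledge(alert)  # tell the monitoring platform the alert is resolved
        return "resolved"
    escalate(alert)  # the fix did not work: hand it to a human
    return "escalated"


# Simulated environment so the sketch runs anywhere.
disk = {"web01": 95}


def free_space(alert: Alert) -> None:
    disk[alert.host] = 60


def space_ok(alert: Alert) -> bool:
    return disk[alert.host] < 90


REMEDIATIONS["disk_full"] = Remediation(action=free_space, verify=space_ok)

acknowledged, escalated = [], []
print(handle(Alert("disk_full", "web01"), acknowledged.append, escalated.append))     # resolved
print(handle(Alert("cert_expired", "web01"), acknowledged.append, escalated.append))  # escalated
```

Keeping the registry as plain data means a new "every time X happens" case is one more entry, not a change to the handler.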

A very common example is disk space utilization. Frequently, a human will be woken in the middle of the night by an alert that a disk is 90% full. Then the human will log into a system to delete old files, clean up large binaries left over from other users, compress old logs, or take some other space-saving measures to resolve the alert. A more permanent solution would be to also extend the size of a volume and filesystem to meet increasing demand. An important follow-up action for the next day is to contact the owner of the system and remind them (for the fourth time this week) that their system is filling up filesystems and can they please do something about it.
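As a rough sketch of what automating the log-compression part of that cleanup might look like, the Python below uses only the standard library. The function names, the `*.log` pattern, the seven-day cutoff, and the 90% threshold default are assumptions chosen for illustration; a real environment would want its own rules about what is safe to touch.

```python
import gzip
import shutil
import time
from pathlib import Path


def used_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def compress_old_logs(log_dir: str, older_than_days: int = 7) -> int:
    """Gzip *.log files not modified within the cutoff; return how many were compressed."""
    cutoff = time.time() - older_than_days * 86400
    compressed = 0
    # Materialize the listing first, since files are removed during the loop.
    for log in list(Path(log_dir).glob("*.log")):
        if log.stat().st_mtime < cutoff:
            with open(log, "rb") as src, gzip.open(f"{log}.gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            log.unlink()
            compressed += 1
    return compressed


def heal_disk(mount: str, log_dir: str, threshold: float = 90.0) -> bool:
    """Return True if usage is below the threshold, compressing old logs first if needed."""
    if used_percent(mount) < threshold:
        return True
    compress_old_logs(log_dir)
    return used_percent(mount) < threshold
```

A call such as `heal_disk("/var", "/var/log/myapp")` (both paths hypothetical) returns `False` when compression alone was not enough, which is the cue to try the next measure or wake a human after all.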

Tying together monitoring, ChatOps, and configuration management into a self-healing system can seem daunting, but if you start small, create reusable pieces, and gradually move toward a solution, you can start to move the humans from performing the healing to managing the self-healing process itself.

With Self-Healing Infrastructure, Let Your Systems Work For You

Hopefully, we’ve been able to pique your interest in self-healing automation for your infrastructure. This is a very basic outline of the theory; the practice itself can vary widely depending on your environment and your needs.

A good plan and a goal can help guide you in designing your self-healing infrastructure and keep it where it should be: a tool that improves your life and enhances the reliability of your IT estate.

Puppet Makes IT Automation Easy

Puppet Enterprise is the leading solution for streamlining and automating your IT operations. It helps you keep your infrastructure up and running efficiently and securely, allowing your team to focus on mission-critical tasks. With Puppet, you can easily automate a wide range of tasks and processes, such as provisioning, configuration management, application deployment, and more.

Download a free trial and take advantage of its powerful automation capabilities today!

Learn More

Read Intention-as-Code: Making Self-Healing Infrastructure Work
Download the white paper: The Business Case for IT Automation
Request the State of DevOps Salary Report
How to use Onceover for repository testing
Discover more of the benefits of IT automation
