Chaos Engineering In Practice
Chaos engineering is all about running chaos experiments in systems to determine how the systems behave under certain circumstances. The idea is not merely to test the system. You’ll also want to learn how the system reacts when a failure happens. For instance, you might have a redundant architecture in place. But have you ever tested that the system is indeed capable of tolerating faults? You might be surprised to observe failures you didn’t expect to see.
To understand how the discipline of chaos engineering works, it helps to see what it looks like in practice. According to the principles of chaos engineering, a chaos experiment follows four steps: defining a “steady state,” building a hypothesis, introducing real-world events, and observing the results.
In this post, we’ll see how all of these principles look in practice.
Build a Hypothesis
Let’s say that you want to put chaos engineering into practice in a learning platform that offers free and paid courses. The system runs in a cloud environment where you can automate many of the operations needed to manage it. Based on these assumptions, let’s start thinking about one of the system’s most important features: availability. People using the system expect the platform to be available whenever they’re studying. This is the “steady state.” What could happen that might put the system’s availability at risk? Well, as Werner Vogels of Amazon says, “everything fails all the time.” You should expect that a server might go down all of a sudden.
So the hypothesis should be something like, “When a server goes down, the system’s availability shouldn’t be affected.” This could mean that the load balancer in front of the system can retry a failed call on a healthy server. Or it could mean that the system doesn’t depend on sticky sessions. But the details are not important at this stage, only the hypothesis.
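A hypothesis is easier to test when the steady state is expressed as a number. The sketch below is one possible way to do that, assuming that “available” means the share of requests that don’t fail with a server error. The names SteadyState, availability, and hypothesis_holds, as well as the threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """Assumed definition of 'available': the share of successful requests."""
    min_availability: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability(status_codes: Sequence[int]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not status_codes:
        raise ValueError("need at least one observed request")
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def hypothesis_holds(steady: SteadyState, observed: Sequence[int]) -> bool:
    """True if availability during the experiment stays within the steady state."""
    return availability(observed) >= steady.min_availability
```

In a real system, the status codes would come from load balancer logs or a monitoring tool rather than from a list in memory.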
Vary Real-World Events
In this example, the system is running on a cloud provider. In Amazon Web Services, for instance, you could have configured an auto-scaling group that takes care of reprovisioning a server in case of a failure. And you could have integrated a load balancer that distributes the load only to healthy servers. However, you might still need to implement a retry policy within your services, because the load balancer can keep routing users to an unhealthy server for a short time (perhaps the failure has just happened and hasn’t been detected yet).
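To make the idea of a retry policy concrete, here is a minimal sketch of one common approach: retrying with exponential backoff. It is not tied to any particular cloud provider or library. The function call_with_retry and its parameters are hypothetical, and it assumes that a failed call surfaces as Python’s built-in ConnectionError.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep parameter exists only so that the waiting can be replaced in tests. A production policy would usually also add jitter and retry only operations that are safe to repeat.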
What I’ve described is not just a theoretical risk. It’s something that could really happen. Therefore, you could start introducing these types of events to see how the system behaves and, at the same time, test the hypothesis you built previously. One way of putting this into practice is to start terminating servers in the cloud provider. You can then observe how the system behaves when a failure happens. The auto-scaling group and load balancer combo should automatically make your system available again, but will your users notice the failure? Perhaps, but you won’t know until you try. The best way to find out is to run the experiment.
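Before terminating real servers, it can help to reason about the experiment with a toy model. The sketch below is a simulation, not a real cloud API: LoadBalancer and run_experiment are hypothetical names, the balancer is a simple round-robin one, and the model covers only the window in which the balancer hasn’t yet noticed that a server is down.

```python
import random
from typing import List, Set


class LoadBalancer:
    """Toy round-robin balancer; `retries` is how many extra servers to try."""

    def __init__(self, servers: List[str], retries: int = 0) -> None:
        self.servers = servers
        self.retries = retries
        self._next = 0

    def handle(self, down: Set[str]) -> int:
        """Return 200 if any attempted server is up, else 503."""
        for _ in range(self.retries + 1):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server not in down:
                return 200
        return 503


def run_experiment(lb: LoadBalancer, requests: int, seed: int = 0) -> List[int]:
    """Terminate one randomly chosen server, then send traffic and record results."""
    victim = random.Random(seed).choice(lb.servers)
    return [lb.handle({victim}) for _ in range(requests)]


servers = ["web-1", "web-2", "web-3"]
without_retry = run_experiment(LoadBalancer(servers, retries=0), requests=30)
with_retry = run_experiment(LoadBalancer(servers, retries=1), requests=30)
print(without_retry.count(503), with_retry.count(503))
```

In this model, a balancer without retries sends one in three requests to the terminated server, while a single retry hides the failure from users. A real experiment would replace the simulation with actual server termination and real traffic measurements.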
Minimize the Blast Radius
Once you have the hypothesis in place and know how you’ll vary real-world events, the next step is to minimize the blast radius. You’re running an experiment, and you don’t want to cause so much damage that the system can’t recover from it. So, when you’re designing and running your experiment, make sure you always have a way to limit its impact. For instance, in a system that has free and paid users, you could start by running experiments that affect only the free-tier users. Doing so requires you to separate the two groups, for example by serving paid users from a premium infrastructure and free users from a cheaper, less available one. This may not be something you would have put in place if you hadn’t planned a chaos experiment like the one I described.
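One way to enforce a small blast radius is to restrict which servers an experiment is allowed to touch. The following sketch assumes each server is labeled with a tier. The Server class, the “free” and “paid” labels, and the pick_targets function are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    name: str
    tier: str  # "free" or "paid" (assumed labels)


def pick_targets(servers: List[Server], tier: str = "free",
                 max_fraction: float = 0.25) -> List[Server]:
    """Select at most `max_fraction` of the servers in `tier`, never the last one."""
    pool = sorted((s for s in servers if s.tier == tier), key=lambda s: s.name)
    if len(pool) < 2:
        return []  # terminating the only server would take the whole tier down
    limit = max(1, math.floor(len(pool) * max_fraction))
    # Always leave at least one server in the tier running.
    return pool[:min(limit, len(pool) - 1)]


fleet = [Server(f"free-{i}", "free") for i in range(1, 5)]
fleet += [Server(f"paid-{i}", "paid") for i in range(1, 3)]
targets = pick_targets(fleet)
print([s.name for s in targets])
```

The selection here is sorted by name so that runs are reproducible. A real experiment would more likely choose targets at random within the same limits.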
Conclusion
Putting a discipline like chaos engineering into practice varies from system to system. As the system’s complexity grows, you might have to look at different places and different ways of injecting chaos. What surprises me the most is that you always find new behaviors in the system, often ones you weren’t expecting to observe when you ran the experiment. You’ll always have failures in systems, so it’s better to prepare in advance and run experiments continuously, because systems are not static. They’re always changing and evolving.
Read part 1 of this blog series, Chaos Engineering: What It Is and Isn’t.