How to Measure System Reliability
As businesses grow, teams face new requirements. The technology ecosystem becomes ever more complex, and it is important to understand each change and how it affects both the overall system and the service provided to users. Users have high expectations: they expect systems to be up, responsive, fast, consistent, and reliable.
Reliability means that a system is doing what its users need it to do. In effect, the reliability of a system is a measure of how happy its customers are, and a happy customer is better for business. If we accept that reliability is one of the most important requirements of any service, that users determine this reliability, and that it’s okay not to be perfect all the time, then we need a way of thinking that can encompass these truths. That way of thinking also has to acknowledge that we have limited resources to spend, be they financial, human, or political.
Service Level Indicators
A metric is a measurement of something about a system. Good examples are the amount of memory a server is using, the time it takes for an HTTP request to be fulfilled, or whether an HTTP response was an error. Metrics are essential for understanding a system and for developing the higher-level constructs that help us assess its reliability.
Since we can’t measure user happiness directly, Service Level Indicators (SLIs) are proxies to help us gauge user satisfaction. SLIs are a quantifiable measure of service reliability, calculated as:
SLI = (Good events / Valid events) * 100%
What, then, is a good event? This can be challenging to define. For example, measuring the latency of a service is not enough: what counts as good latency needs to be defined for the SLI to be useful. An SLI forces each event into a binary state, good or not good, even if the underlying metric is not binary.
A good example of an SLI would be:
Requests to a service will be responded to within 200ms.
Good events would be all requests that are responded to within 200ms. This SLI tells us whether the service is available and also whether it is responsive. Even a simple SLI gives us a lot of information about the state of the service. The trick is to keep it simple and iterate. A good way to go about this is to choose the important features the service provides and measure what its users value.
SLIs are foundational to assessing the reliability of a system, so it is very important that they are chosen meaningfully. If an SLI isn’t good, the concepts built on it, like SLOs and Error Budgets, aren’t either. Why is this important? Reliability is always measured from the user’s perspective, where a user can be a human or a machine. If we choose an indicator that doesn’t measure user happiness, all SLOs and Error Budgets would be useless, since users could be unhappy whether or not we are achieving the objective.
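As a rough illustration of the formula applied to this example, the sketch below computes a latency SLI from a list of request latencies. The function name latency_sli, the sample data, and the choice to count a request at exactly 200ms as good are assumptions made for illustration; it also assumes every recorded request is a valid event.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """Return the percentage of valid requests answered within threshold_ms.

    Assumption: every entry in latencies_ms is a valid event, and a request
    taking exactly threshold_ms still counts as good.
    """
    valid = len(latencies_ms)
    if valid == 0:
        # With no valid events the ratio is undefined, so fail loudly.
        raise ValueError("no valid events in the measurement window")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / valid * 100.0


# Made-up sample: 6 of these 8 requests finished within 200ms.
sample_latencies_ms = [120.0, 95.0, 310.0, 180.0, 200.0, 45.0, 750.0, 199.0]
print(f"SLI: {latency_sli(sample_latencies_ms):.1f}%")  # prints "SLI: 75.0%"
```

Note how the threshold turns a continuous metric (latency) into a binary outcome per request, which is what makes the ratio possible.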
Service Level Objectives
With an SLI defined, the next step is to set a goal for it. In other words, an SLI defines what matters to users, while a Service Level Objective (SLO) defines how often it has to be achieved for users to be happy with the service.
In a perfect world, 100% would be the target. Unfortunately, in the real world, reliability is expensive, and achieving 100% means no failure can ever occur. But the internet is complex. There are switches, cables, routers, ISPs, CDNs, etc., and all of that sits between systems and the users using them. At any point in time, any of those can fail. If a service is slightly less than 100% reliable, a user cannot distinguish between a failure of the system and a failure of the internet’s infrastructure. Systems should be built to be reliable enough to make the users happy, no more, no less, and allow teams to invest time in providing business value.
SLOs are agreed-upon targets for what it means to be reliable enough for the service being provided over a period of time. The more reliable a system has to be, the more expensive it becomes.
A good example of an SLO would be:
99% of requests to a service need to be responded to within 200ms over a 30-day period.
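To make the relationship between the indicator and the objective concrete, here is a minimal sketch that checks whether the good and valid event counts for a window satisfy an SLO. The function name meets_slo and the request counts are hypothetical; the counts are assumed to have been collected over the 30-day period.

```python
def meets_slo(good_events, valid_events, slo_percent=99.0):
    """Return True if good/valid over the window reaches the SLO target.

    The comparison is cross-multiplied so that a result sitting exactly on
    the target is not lost to floating-point division.
    """
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events * 100 >= slo_percent * valid_events


# Hypothetical 30-day totals for a service with a 99% objective.
print(meets_slo(992_500, 1_000_000))  # 99.25% of requests were good -> True
print(meets_slo(985_000, 1_000_000))  # 98.50% of requests were good -> False
```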
Error Budgets
What is left over from an SLO is called an Error Budget. For example, for an SLO of 99%, the Error Budget would be 1%. Error Budgets are calculated as:
Error Budget = 100% – SLO
For the previous example, when the measurement period starts, the full 1% Error Budget is available. During the 30-day period, each incident burns part of that budget. The remaining Error Budget is effectively the amount of unreliability the service can still tolerate in the period, and it helps teams make educated, risk-based decisions, for example on whether or not to release a new feature. Error Budgets also help make sure the operability process (e.g. incident response) is appropriate to the budget available for the service being provided.
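The arithmetic behind burning an Error Budget can be sketched as follows. This is one possible way to track it, not a prescribed method: the function name error_budget_report, the expected request volume, and the number of failed requests are all assumptions chosen for illustration.

```python
def error_budget_report(expected_events, bad_events, slo_percent=99.0):
    """Summarize how much of the Error Budget has been burned so far.

    expected_events is an assumed estimate of valid events for the whole
    measurement period; bad_events is the count of events that were not good.
    """
    budget_percent = 100.0 - slo_percent  # Error Budget = 100% - SLO
    allowed_bad = expected_events * budget_percent / 100.0
    if allowed_bad <= 0:
        raise ValueError("this SLO and volume leave no Error Budget to spend")
    consumed_percent = bad_events / allowed_bad * 100.0
    return {
        "budget_percent": budget_percent,
        "allowed_bad_events": allowed_bad,
        "consumed_percent": consumed_percent,
        "remaining_percent": 100.0 - consumed_percent,
    }


# Hypothetical: 1,000,000 requests expected in 30 days, 2,500 failed so far.
report = error_budget_report(expected_events=1_000_000, bad_events=2_500)
print(report["allowed_bad_events"])  # 10000.0 bad requests allowed in total
print(report["remaining_percent"])   # 75.0 percent of the budget is left
```

A team could compare the remaining budget against the time left in the period to judge whether a risky release is affordable.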
SLIs, SLOs, and Error Budgets form a framework that makes it possible to define what reliability means for services. It allows reliability to be continuously measured and provides a platform for educated decisions to be made about how to prioritize features against reliability work. It allows development and operations teams to agree on what reliability means for a service, supported by data and avoiding unnecessary confrontation. It allows teams to collaborate, as well as focus on their specific work, by making sure user satisfaction is measured and assessed.