System Monitoring in the Age of Site Reliability Engineering
Google's Site Reliability Engineering book came out a few years ago. Since then, the Site Reliability Engineering movement has taken hold in many organizations searching for better reliability and observability. This post looks at a small portion of the wisdom and experience shared in the book: system monitoring.
Now, we all know that monitoring is important, but what makes for good monitoring? What makes for bad monitoring? And how can we tell the difference? Let’s review some of the basic concepts. We’ll start with the benefits of system monitoring.
Why Do We Need Monitoring?
Monitoring is a core part of Site Reliability Engineering; without it, many Site Reliability Engineering practices don't make sense. But let's look at some specifics. Why do we need monitoring, and what benefits can we look forward to? Here are some of them:
- Tracking long-term trends: Long-term trends have many uses. Beyond system statistics like service degradation or disk space consumption, they can also provide business value: track the number of customers and sales orders over various time periods, or learn when your customers use the system the most.
- Testing performance hypotheses: Instead of wondering if that new database index or additional instance of a service will improve performance, track the performance effects with monitoring.
- Alerting: Without monitoring, we can’t be notified when something goes wrong. Well, actually, we could be notified by our customers—but we should know when something breaks before they do.
- Providing business analytics: See how the new sales copy affects sales, and report how the latest application layout affects conversions.
- Improving debugging: See which metrics change together. A spike in traffic on one part of your application might increase latency in another.
What Should We Monitor?
Before we get into specifics, let’s look at different types of monitoring.
White-Box Monitoring
When we talk about white-box monitoring, we mean monitoring based on metrics that we've exposed ourselves. Here, the system being monitored exposes its inner workings: logs, profiling interfaces, and metric endpoints or emitters. White-box metrics include things like CPU, memory, and dependency latency.
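To make this concrete, here's a minimal sketch of what exposing white-box metrics from a Python service might look like, assuming a Prometheus-style setup. The metric names, the port, and the simulated database call are all illustrative, not something the book prescribes:

```python
# A minimal white-box monitoring sketch using the Prometheus Python client.
# Metric names, labels, and the port below are illustrative placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
DB_LATENCY = Histogram("app_db_call_seconds", "Latency of database (dependency) calls")

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    with DB_LATENCY.time():                      # times the dependency call
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for a real database call

if __name__ == "__main__":
    start_http_server(8000)  # metrics become scrapable at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

Once the endpoint is up, a scraper (or a plain curl against /metrics) can see the counters and histogram buckets the service is emitting.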
Black-Box Monitoring
Alternatively, when we talk about black-box monitoring, we talk about externally visible behavior. This includes metrics around latency from the customer’s point of view, error responses, and correct business behavior. It might not identify underlying issues, because they could be masked by retry logic or load balancers.
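To contrast with the white-box sketch above, here's a rough black-box probe using only the Python standard library. The URL and timeout are placeholders; the point is that it measures the system from the outside, exactly as a customer would:

```python
# A rough black-box probe: request a public endpoint from outside the system
# and record what a customer would see. URL and timeout are placeholders.
import time
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code      # the server answered, but with an error status
    except OSError:
        status = None        # unreachable, DNS failure, or timed out
    return {"url": url, "status": status, "latency_s": time.monotonic() - start}

print(probe("https://example.com/health"))
```

Run on a schedule from outside your network, a probe like this catches the failures your customers actually see, even when internal metrics look healthy.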
The Four Golden Signals
The four golden signals shouldn't be considered the be-all and end-all of your monitoring. Rather, they represent the starting point of your monitoring and should always be included.
Latency
Latency measures the time it takes to successfully process a request. Yup, that's right. I said "successful." Failed authentication requests that respond within milliseconds don't count toward your overall latency number. Neither do validation errors that require the client to send a different request. This metric is typically measured in milliseconds.
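As a quick illustration, here's a small Python sketch that records latency only for successful responses. Treating any 2xx status as "successful" and using a nearest-rank percentile are simplifying assumptions for the example:

```python
# Latency tracking sketch: only successful (2xx) responses count toward latency.
successful_latencies_ms: list[float] = []

def record(status_code: int, duration_ms: float) -> None:
    if 200 <= status_code < 300:
        successful_latencies_ms.append(duration_ms)
    # fast failures (auth errors, validation errors, 500s) are deliberately excluded

def percentile(values: list[float], pct: float) -> float:
    ordered = sorted(values)                     # nearest-rank approximation
    idx = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[idx]

record(200, 42.0)
record(401, 3.0)    # fast authentication failure: ignored
record(200, 55.0)
print(f"p99 latency: {percentile(successful_latencies_ms, 99):.1f} ms")
```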
Traffic
Traffic indicates how much demand exists on your system. For example, in web services, this metric measures things like requests per second (RPS). Or, for back-end remote procedure calls or database calls, this may be transactions per second (TPS). The way it’s measured varies based on the type of system being tracked.
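Here's a tiny illustration in Python. Traffic is usually derived rather than read directly: sample a monotonically increasing request counter twice and divide by the elapsed time. The counter values below are made up:

```python
# Deriving requests per second (RPS) from two samples of a request counter.
# The counts and the 10-second window are made-up example values.
import time

def requests_per_second(count_then: int, t_then: float,
                        count_now: int, t_now: float) -> float:
    return (count_now - count_then) / max(t_now - t_then, 1e-9)

t0 = time.monotonic()
t1 = t0 + 10.0                                   # pretend 10 seconds elapsed
print(requests_per_second(0, t0, 1500, t1))      # -> 150.0 RPS
```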
Errors
A software system wouldn't be truly complete without error tracking. This golden signal tracks the rate of requests or transactions that fail. That includes blatant errors like HTTP 500s as well as responses that don't provide the correct data. It's also helpful to report client errors and server errors separately so that you're aware of client-side problems too.
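A minimal Python sketch of that separation might look like the following; the 4xx/5xx buckets follow the usual HTTP convention, and the sample status codes are made up:

```python
# Error-rate sketch that keeps client (4xx) and server (5xx) errors separate.
from collections import Counter

status_counts: Counter[int] = Counter()

def record_status(status_code: int) -> None:
    status_counts[status_code] += 1

def error_rates() -> dict:
    total = sum(status_counts.values()) or 1     # avoid dividing by zero
    client = sum(c for code, c in status_counts.items() if 400 <= code < 500)
    server = sum(c for code, c in status_counts.items() if 500 <= code < 600)
    return {"client_error_rate": client / total, "server_error_rate": server / total}

for code in (200, 200, 404, 500, 200):           # made-up sample traffic
    record_status(code)
print(error_rates())   # {'client_error_rate': 0.2, 'server_error_rate': 0.2}
```

Note that a response with a 200 status but the wrong data also belongs in the error bucket; catching that usually requires validating responses rather than relying on status codes alone.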
Saturation
Saturation is one of the more complicated golden signals. It tracks how much of your capacity is being used. For example, saturation can include specific metrics on CPU, memory, and disk space utilization. It works as an early warning indicator of system failures or slowdowns.
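If you want a quick way to sample those numbers on a host, here's a sketch using the third-party psutil package (pip install psutil). The 80 percent warning threshold is an assumption for the example, not a recommendation from the book:

```python
# Saturation snapshot of a single host using psutil.
# The 80% threshold is an illustrative assumption; tune it for your system.
import psutil

def saturation_snapshot() -> dict:
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),      # CPU utilization over 1s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,     # root volume fullness
    }

snapshot = saturation_snapshot()
warnings = [name for name, value in snapshot.items() if value > 80]
print(snapshot, "warnings:", warnings)
```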
So now that we have various metrics to track, what do we do with them?
How Should We Monitor?
It's not enough to have monitoring on a system in any form or fashion. We also need an easily digestible format for that monitoring. One common way of providing visibility into metrics and monitoring is a dashboard: a UI that displays a service's metrics. Dashboards are commonly shown on large screens or monitors in the development team's workspace, either displaying a static set of graphs or rotating through the most important metrics.
Build a Proper Dashboard
Building a proper dashboard depends on displaying the correct data. As mentioned just above, we want to focus on the most important metrics. As engineers, we often geek out over all the different metrics we can gather and the creative ways we can display the data. However, if the metrics we're displaying don't provide value, then we shouldn't display them at all.
That’s because there’s a limited amount of real estate on dashboards. Although you can continue to get bigger monitors or screens, there’s also a limited amount of attention we can give to the data being broadcast. Our dashboards should show a high-level view of our system’s health.
Additionally, even if you have a lot of real estate for your monitors and can display lots of information, you're still limited by the engineer's attention span. We can't look at 20 to 30 graphs and pick out the most important activity. We need focused, summarized data that helps us monitor our application's reliability.
Differentiate Symptoms from Causes
One section of the Site Reliability Engineering book from Google discusses how dashboards can answer two questions: what’s broken, and why? It might not seem like a big deal to some. If you have latency increasing on part of an application, isn’t it the engineer’s job to figure out why? Well, perhaps. We still want to provide the best possible information available.
For example, if your customers receive an increased number of HTTP 500 errors on your site, that’s the “what,” and it should be available on the dashboard. The “why” would tell you that one of your dependencies crashed or a database is refusing connections. It’s not the complete story, but it helps your engineers get to the root of the problem faster.
Getting as close as we can to the cause of reliability issues improves resolution time and makes it easier to share the relevant root causes.
How Much Effort Will Monitoring Take?
Although many tools are available for application and infrastructure monitoring, it still takes a fair amount of work to set monitoring up and keep it up to date. You must set realistic expectations: this isn't a "do it once and you're done" task. Monitoring needs evolve as your architecture changes, and dashboards need updates based on feedback from engineers and on recent production incidents. For example, if your team has been wrestling with an issue around queues filling up, detailed graphs of queue depth in each environment may take up the biggest portion of your dashboard. But once things resolve, you may need only a high-level graph lurking in a corner.
Is That It?
The best time to start monitoring your system is now. And the best place to start is with the golden signals. But as all architectures and systems evolve, so too should your monitoring capabilities. When growing your monitoring capabilities, think about long-term goals. Consider what works best for your team and product while also providing a first-class experience for your customers.
Looking for formal training on Site Reliability Engineering practices? Take a look at our course: Implementing Site Reliability Engineering. This three-day boot camp will teach you how to successfully implement site reliability engineering in your organization.