Transforming Culture With Site Reliability Engineering (SRE)
Site Reliability Engineering (SRE) offers many advantages for distributed systems. It improves infrastructure automation, increases reliability, and transforms incident management. However, an often-overlooked benefit of Site Reliability Engineering is culture transformation. When reading Google’s Site Reliability Engineering book, you’ll see culture mentioned in many chapters. But it’s not talked about as often as skills and processes.
Why do we overlook the cultural changes that come from Site Reliability Engineering? Well, oftentimes we’ve been conditioned to respect technical knowledge over people skills and culture. This is especially true in IT, where we still refer to skilled engineers as rock stars and ninjas. As a result, we end up with a culture that puts individuals at the center instead of the team. Just as fragile and brittle software is bad, so is a brittle organization that relies on the actions of a few.
So, what’s a solution to this problem? For starters, in addition to driving change in processes, Site Reliability Engineering drives change in culture. It embraces risk, talks about hard problems, and learns from failure without letting egos get in the way. And how does that sort of culture work? Let’s take a look.
Embracing Risk
A big part of Site Reliability Engineering culture is embracing risk, but this isn’t always a natural inclination for teams and organizations. Even many teams that say they embrace risk don’t fully accept the possibility of failure. Where does the need to embrace risk come from?
It comes from the inherent risk of distributed systems. As the authors of Google’s Site Reliability Engineering book note, the goal isn’t to have 100% reliable services. In fact, it’s often too expensive and not worth it to increase reliability past the point that it’s needed. Additionally, users don’t typically notice the difference—they’re used to calls occasionally dropping and cell service that’s not 100%.
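To make the trade-off concrete, here is a minimal sketch of the arithmetic behind an availability target: how much downtime a given target permits over a period. The function name, the 30-day window, and the example targets are illustrative assumptions, not figures from the book.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of downtime a target permits over the window.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


if __name__ == "__main__":
    # Hypothetical targets; each extra "nine" cuts the allowance tenfold.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} target allows about {minutes:.1f} minutes per 30 days")
```

A 99.9% target leaves roughly 43 minutes of downtime per 30 days, while 99.99% leaves only about 4. Seeing the numbers side by side helps a team judge whether the next increment of reliability is worth its cost.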
Therefore, Site Reliability Engineering embraces the risk that systems will go down. The follow-up to embracing the risk is managing the risk. Once we know things will break and go down, what do we do to protect our customers and ourselves? What automated processes do we need to add to make sure we can still deliver value?
That’s what embracing risk is about.
Reducing IT Dogma
In the past, companies relied on systems administrators to run their systems and infrastructure. Though repeatable, many tasks were performed one at a time, manually, and with little thought to automation. Problems should be solved with technical solutions that work. But they should also be solved with culture in mind.
Today’s organizations need pragmatic engineers who are willing to change processes and procedures. They need to look beyond the playbook and do what’s best for their software ecosystem.
For example, many organizations have a governance process for allowing new software or infrastructure tooling to be used in production. This process purports to safeguard the organization against restrictive licenses, inefficient tools, or security threats. However, this process actually creates a bottleneck for teams.
Instead of being able to pick the appropriate tool for the problem, teams waste time fitting square pegs in round holes because they need to use a preapproved tool. Or the process increases security risk, because teams can’t quickly green-light patched versions of libraries with known vulnerabilities. The pragmatic engineer looks at the original problem and works to automate a better solution. She’ll look at ways of automating scans of licenses, software repositories, and infrastructure patches.
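As one possible illustration of automating part of that governance step, the sketch below checks a list of dependencies against a license allowlist. Everything here is hypothetical: the `Dependency` type, the allowlist contents, and the sample packages are assumptions for illustration, and a real pipeline would read this data from a package manifest or a scanning tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Dependency:
    """A hypothetical record describing one third-party dependency."""
    name: str
    version: str
    license: str


# Assumed policy: licenses the organization has already approved.
ALLOWED_LICENSES: FrozenSet[str] = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"})


def find_license_violations(
    deps: Iterable[Dependency],
    allowed: FrozenSet[str] = ALLOWED_LICENSES,
) -> List[Dependency]:
    """Return the dependencies whose license is not on the allowlist."""
    return [dep for dep in deps if dep.license not in allowed]


if __name__ == "__main__":
    # Made-up sample data standing in for a parsed manifest.
    sample = [
        Dependency("fastjsonlib", "2.1.0", "MIT"),
        Dependency("netqueue", "0.9.3", "GPL-3.0-only"),
        Dependency("metricskit", "1.4.2", "Apache-2.0"),
    ]
    for dep in find_license_violations(sample):
        print(f"Needs review: {dep.name} {dep.version} ({dep.license})")
```

Run in a build pipeline, a check like this lets compliant dependencies through automatically and sends only the exceptions to a human reviewer, so the governance goal stays in place without the manual queue.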
The Site Reliability Engineering culture emphasizes change, automation, and questioning of the dogmatic processes that many hold dear.
Learning From Failure
In the past, teams have felt the need to hide failures. After all, their next bonus or compensation bump could rely on showing more successes than failures. Unfortunately, this cultural stigma has resulted in a fear of showing weakness or exposing failure.
However, in a Site Reliability Engineering culture, we encourage each other to learn from our failures. We share our failures publicly and with transparency. The following practices help build that learning culture:
- Hold blameless postmortems: Dissect incidents and outages as a team to find what automation and processes can prevent or fix the issue in the future. Attack the problem, not the person.
- Share postmortems with other teams: Review them and make them visible to all so everyone can learn.
- Role-play potential disasters: Improve the team’s problem-solving skills instead of relying on siloed experts.
- Host lunch-and-learns that review past incidents: Allow teams from around the company to learn from each other.
Hire Team Players
Many companies struggle with hiring. It’s not easy. Often, we fall back on hiring people with the right technical skills over the right team skills. That may help in the short term, but over time it could lead to negativity and difficult working environments.
That’s why a Site Reliability Engineering culture requires a hiring practice that looks for team players and collaborators. We need candidates who leave their egos behind and work together for the betterment of the product. So, when looking at potential hires, look at more than just technical skills. Make sure your candidates have collaborative abilities. Look for humility, a willingness to learn, and a sense of empathy.
Educate Your Hires
Once you’ve hired your latest engineers, what’s next? Traditionally, there has been a “trial-by-fire” approach to training system administration and operation teams. But in the Site Reliability Engineering culture, you need a more deliberate touch.
What does this look like? Let’s take a look at just a few of the onboarding ideas that Site Reliability Engineering culture brings to the table.
For one, you should consider creating a sequential learning experience for your new hires that sets them up for success. This will prepare them better than menial ticket triage and alerting and will show them the respect that they should expect from and extend to others.
Another practice you can take on is encouraging your new hires to work through problems using reverse engineering and fundamentals. This gives the engineers a deeper understanding of the systems and issues. It also reduces their reliance on dogmatic procedures and checklists.
And, finally, there’s an old-school practice that we should stop: giving new hires only grunt work. Historically, we give our new hires trivial and easy work until they really prove themselves. But if we want them to feel a sense of ownership early on, we should also sprinkle in nontrivial and complicated work. There’s a lot to learn there.
Many other ideas and revelations exist in Google’s Site Reliability Engineering book.
Practice and Role-Play Your Way to Success
The ideas behind programming katas and deliberate practice aren’t new. However, Site Reliability Engineering culture brings new life to them through various practices. For example, we’ve talked a bit about having blameless postmortems. Taking that a step further, we can also create repeatable exercises from past incidents. Here, we can gather a small group to discuss failures and incidents from the past and role-play what actions the group would take to identify and resolve the issue. A site reliability engineer who was involved in the original incident can facilitate and let the group know if they’re headed in the right direction.
Another great custom involves practice fire drills. Similar to fire drills at work or school, the team practices what would occur during a real outage or incident by walking through and debugging issues. Using principles from chaos engineering, we can make sure our teams are ready to stop, drop, and roll our way to reliability.
Interrupt Yourself
Interruptions are a part of life for the Site Reliability Engineer. So what can we do to fix this? Frankly, not much.
However, what we can do is change our mindset around interruptions. For example, when we’re on call, we should see the interruptions as our primary focus and work, and our project work as the interruption. We should get lost in the flow of it. Handling an incident is an investigative quest: we find the problem, develop short- and long-term solutions, and run experiments to confirm our hypotheses. On-call Site Reliability Engineers shouldn’t be expected to make progress on projects.
Conclusion
So, what do we need to do to change our culture? It’s not enough to say you want to change it. And it can’t be a top-down command to change our ways. Luckily, we can take some Site Reliability Engineering ideas and start implementing Site Reliability Engineering practices in our teams and organizations. These ideas will drive your culture to the next level of reliability.
Looking for formal training on Site Reliability Engineering practices? Take a look at our course: Implementing Site Reliability Engineering. This three-day boot camp will teach you how to successfully implement site reliability engineering in your organization.