Learn SRE best practices from an industry expert and begin automating, monitoring, troubleshooting, and improving your systems.
Courses fill fast—register now and start mastering the skills that move business forward.
Standard Delivery: 21 hours of instruction over 3 days
The site reliability engineer (SRE) role has been around for over 15 years. As distributed systems become more widespread, demand for the role will continue to grow. However, many companies and technologists have had little exposure to the tenets of SRE, and the role is often misunderstood. Unlike traditional operations roles, the site reliability engineer puts additional focus on reducing human intervention by designing and implementing automation.
This three-day course will walk through the book Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. During the course, you will learn about Google's approach to service management, gain an understanding of the basics of site reliability engineering, and get an introduction to advanced topics.
You'll look at real-world examples and code samples that show how companies use SRE to ensure that their services are exactly as reliable as they need to be. Finally, we'll cover the cultural and human aspects of site reliability that drive successful implementation.
Have a group of 5 or more students? Cprime also provides specialist private training with exclusive discounts for tailored, high-impact learning.
1. Introduction
2. The Production Environment at Google, From the Viewpoint of an SRE
3. Exercise: Mapping Your Production Environment
1. Embracing Risk
2. Service-Level Objectives
3. Eliminating Toil
4. Monitoring Distributed Systems
5. The Evolution of Automation at Google
6. Release Engineering
7. Simplicity
1. Practical Alerting
2. Being On-Call
3. Effective Troubleshooting
4. Emergency Response
5. Managing Incidents
6. Postmortem Culture: Learning from Failure
7. Tracking Outages
8. Testing for Reliability
9. Software Engineering in SRE
10. Load Balancing at the Front End
11. Load Balancing in the Datacenter
12. Handling Overload
13. Addressing Cascading Failures
14. Managing Critical State: Distributed Consensus for Reliability
15. Distributed Periodic Scheduling with Cron
16. Data Processing Pipelines
17. Data Integrity: What You Read Is What You Wrote
18. Reliable Product Launches at Scale
1. Accelerating SREs to On-Call and Beyond
2. Dealing with Interrupts
3. Embedding an SRE to Recover from Operational Overload
4. Communication and Collaboration in SRE
5. The Evolving SRE Engagement Model
1. Lessons Learned From Other Industries
2. Conclusion
This site reliability engineering training course is perfect for anyone in the IT/SDLC field looking to implement SRE teams and practices in their organization.
Professionals who may benefit include: