In this exciting episode of Cloud Dialogues, we are joined by Liz Fong-Jones, Field CTO at Honeycomb and former Google SRE, to explore the fascinating world of Site Reliability Engineering (SRE)—a game-changer for scaling and automating large systems.
What We Covered:
1. Meet Liz Fong-Jones: Liz brings over a decade of SRE experience from her time at Google and Honeycomb, helping companies revolutionize how they manage reliability and automation.
2. The Origin Story: SRE actually predates the cloud! Born at Google in the early 2000s, SRE started as a way to automate manual system administration tasks and has since evolved into its own discipline, running parallel to DevOps.
3. SRE at Its Core: - Minimize repetitive work (aka "toil") by automating everything you can. - Use Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and maintain reliability.
4. Different SRE Models: There are different ways to implement SRE: - Tools-based within platform teams - Consultative SREs parachuting in to help teams - Embedded SREs integrated within every team
5. The SRE Mindset: Curiosity and empathy are essential for SREs. Teams need a culture of psychological safety where concerns can be raised without fear.
6. The Magic of SLOs and SLIs: SLOs set reliability targets (like aiming for 99.5% uptime), while SLIs measure performance against those targets. Together, they ensure your systems are running smoothly.
7. FinOps Meets SRE: Liz explains how SREs can help balance reliability, performance, and costs using SLOs to allocate resources more efficiently.
8. Disaster Testing: Want proof SREs are ready for anything? Honeycomb regularly tests its disaster recovery by taking down an entire availability zone—on purpose!
9. Pro Tips for Executives: Thinking about implementing SRE at your company? Liz suggests starting with your biggest challenges, offering executive support, and setting clear, achievable SLOs.
10. Why Observability Matters: Observability is the backbone of SRE. Having real-time, actionable data is key for setting and managing effective SLOs.
Plus, Liz gives covers off on her favorite ARM processors (for cost and environmental savings) and shares insights from her book Observability Engineering.
This episode is a deep dive into SRE, filled with actionable insights and strategies for leaders looking to supercharge their reliability game. You won’t want to miss it!