The Art of SLOs with Alex Bramley

Google Cloud Platform Podcast

Episode | Podcast

Date: Wed, 25 Mar 2020 00:00:00 +0000

<p>Today on the podcast, <a href="https://twitter.com/syntxerror1">Jon Foust</a> is back with <a href="https://twitter.com/markmirch">Mark Mirchandani</a> as we talk about SLOs and the importance of measuring service reliability with Alex Bramley. Alex is a member of the Google SRE team, where he and his coworkers help customers run their services optimally on Google Cloud. They collaborate with the client, weighing client needs and user needs to develop a plan that is affordable, efficient, and has the highest reliability for the user. Recently, they’ve been working to automate functions such as detection of outages, so that Google and the customer can work together quickly to get everything working smoothly again.</p> <p>Later, Alex describes the steps developers go through at his workshop, The Art of SLOs, which was designed to help companies measure and improve reliability. At this workshop, attendees are encouraged to set SLO targets and error budgets. They are given theoretical reliability problems to solve, allowing them to practice without the added pressure of messy, real-world problems. The Art of SLOs helps developers understand which measurements are beneficial and why, and the best way to implement projects that can take those measurements accurately. Alex was able to make the materials for the workshop free online!</p> <h5 id="alex-bramley">Alex Bramley</h5> <p><a href="https://cre.page.link/art-of-slos-help">Alex Bramley</a> joined Google in January 2010 as the first Mobile SRE in London, after IBM bought the <a href="https://www.software.ac.uk/blog/2016-09-30-heroes-software-engineering-men-and-women-transitive"> startup he enjoyed working for</a> and made it much less fun. 
He spent around 7½ years in various reincarnations of Mobile/Android/Play SRE, looking after the infrastructure that makes phones smart, keeps them up to date, and provides them with countless distracting apps.</p> <p>CRE offered an interesting opportunity to do something different and learn from a bunch of very smart senior people, and Alex has not regretted taking the leap into the unknown. Much of his time recently has been spent rethinking how people teach customers, partners and the general public about SLOs. He helped create the Coursera course on <a href="https://cre.page.link/coursera">measuring and managing reliability</a> and developed what became the Art of SLOs for <a href="http://twitter.com/lizthegrey">Liz Fong-Jones</a> to deliver with other Google SREs at <a href="https://www.usenix.org/conference/srecon18europe/presentation/fong-jones-0"> SREcon EMEA’18</a>.</p> <p>Alex works four days a week so he can (suffer) enjoy looking after his children on Wednesdays, listen to <a href="https://www.mixcloud.com/kleinerbrain/luna-cs-mirror-mix/">cheerful music</a>, and waste a <a href="http://www.zachtronics.com/">lot</a> <a href="https://www.egosoft.com/games/x4/info_en.php">of</a> <a href="https://www.feed-the-beast.com/">time</a> <a href="http://www.factorio.com/">playing</a> <a href="https://www.klei.com/games/oxygen-not-included">computer</a> <a href="https://www.zelda.com/breath-of-the-wild/">games</a> and occasionally <a href="http://github.com/fluffle/">writing code</a>.</p> <h5 id="cool-things-of-the-week">Cool things of the week</h5> <ul> <li>Postponing Google Cloud Next ’20: Digital Connect <a href="https://cloud.google.com/blog/topics/inside-google-cloud/postponing-google-cloud-next20-digital-connect"> blog</a></li> <li>New research: How effective is basic account hygiene at preventing hijacking <a href="https://security.googleblog.com/2019/05/new-research-how-effective-is-basic.html"> blog</a></li> <li>Simplified global game management: 
Introducing Game Servers <a href="https://cloud.google.com/blog/products/gaming/introducing-google-cloud-game-servers"> blog</a></li> </ul> <h5 id="interview">Interview</h5> <ul> <li>The Art of SLOs <a href="https://cre.page.link/art-of-slos">site</a></li> <li>CRE Life Lessons <a href="https://cloud.google.com/blog/topics/cre-life-lessons">blog</a></li> <li>Putting customers first with SLIs and SLOs <a href="https://medium.com/the-telegraph-engineering/putting-customers-first-with-slis-and-slos-15352f9b6cbc"> blog</a></li> <li>Putting customers first with SLIs and SLOs (Part 2) <a href="https://medium.com/the-telegraph-engineering/putting-customers-first-with-slis-and-slos-part-2-6b5c2452aecd"> blog</a></li> <li>Measuring and Managing Reliability <a href="https://cre.page.link/coursera">course</a></li> <li>Site Reliability Engineering <a href="https://landing.google.com/sre/books/">books</a></li> </ul> <h5 id="question-of-the-week">Question of the week</h5> <p>How do I get started with Google Cloud Game Servers (GCGS)? <a href="https://cloud.google.com/game-servers/docs">docs</a></p> <ul> <li>Google for Games Developer Summit Keynote <a href="https://www.youtube.com/watch?v=2haNNRU1Gxs">video</a></li> <li>Google for Games Developer Summit Playlists <a href="https://www.youtube.com/user/GoogleDevelopers/playlists?view=50&amp;sort=dd&amp;shelf_id=88"> videos</a></li> </ul> <h5 id="where-can-you-find-us-next">Where can you find us next?</h5> <p>Jon will be working on an Open Match sample project for the developer community.</p> <p>Mark will be making more videos like <a href="https://www.youtube.com/watch?v=wcEL6ES0dAI">Error Reporting and error logging - Stack Doctor</a>.</p>