SRE Conversation @ Dropbox

with:

  • Betsy Beyer
  • Rob Ewaschuk
  • Liz Fong-Jones

Where is SRE now?

more than 2,000 engineers in SRE today at Google
outside of Google? sort of embraced – definitely in flux
note: the SREcon conference is coming up

SRE – what is it?

“you know it when you see it”

  • Creative autonomous engineering
  • wisdom of operations
  • SWE and systems engineering are equally valued skills

  • what is systems engineering? how do you think about breaking large systems down? how do you define interfaces? if things break, how do you troubleshoot?
  • how do we automate / make things more robust?

Making of the book

  • articles in journals that were copyright-friendly
  • stitched it together at an offsite, in a building without engineers

Interesting notes

  • writing the book led to the discovery that they had multiple opinions on what SLOs are
  • configuration management chapter got skipped because opinions were too varied
  • at Google, whitepapers float around and mature before becoming external

Q&A

Q: how did you come to common tone?
A: This was Betsy’s job. Tech writers were in/out on various chapters.

Q: insights as Google has scaled? SRE is currently very mature.
A:
what has increased is the technical maturity of automation
Rob:
started in Gmail; crafted own automation in Python
now: down to three config management solutions at Google, and two of those are going away

Liz:
Scaling by changing incentives:
Google used to promote people based on how complex the product they made was (this leads to multiple competing solutions).
Now: reward convergence and solving complex problems with simple solutions
Better to have three standards than twenty

Rob:
What to page on/etc
Growing culture/techniques over time
Operational maturity
Ben (original VP of SRE) applies capability maturity models to teams

Liz: He asks “Where are you on the scale? Are you improving? Why not?”

Rob:
Management’s view of SRE time
Project work vs. ops work, strategic vs. tactical; that’s how we got to the 50/50 split
support from management came in when a team was not meeting the 50% target

Liz: it must be engineering; let some things break so there is time to do engineering.

Rob:
An SLO on mean Bigtable latency was paging us all the time, even at the 75th percentile. We dropped the promise quite a bit, worked on engineering, set up monitoring, and two years later we had a 99th-percentile SLO.
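
A minimal sketch of the measurement difference behind that story, assuming hypothetical sample data, thresholds, and a hand-rolled PercentileLatencyMs helper (this is not Google's monitoring stack): a mean-latency target gets dragged around by outliers, while a percentile target states the tail promise directly.

    // Mean vs. percentile latency over one sample window (illustrative only).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Nearest-rank style percentile over a window of latency samples.
    double PercentileLatencyMs(std::vector<double> samples, double pct) {
      std::size_t idx =
          static_cast<std::size_t>(std::ceil(pct * (samples.size() - 1)));
      std::nth_element(samples.begin(), samples.begin() + idx, samples.end());
      return samples[idx];
    }

    int main() {
      // Mostly fast requests plus one slow outlier.
      std::vector<double> window_ms = {4, 5, 5, 6, 6, 7, 7, 8, 9, 950};

      double mean = std::accumulate(window_ms.begin(), window_ms.end(), 0.0) /
                    window_ms.size();
      double p99 = PercentileLatencyMs(window_ms, 0.99);

      // The mean is skewed by the single outlier and trips an assumed 50 ms
      // target; the 99th percentile states the tail behavior directly and can
      // be tightened as the engineering work lands.
      std::printf("mean = %.1f ms, p99 = %.1f ms\n", mean, p99);
      std::printf("mean over 50 ms target?  %s\n", mean > 50.0 ? "yes" : "no");
      std::printf("p99 over 1000 ms target? %s\n", p99 > 1000.0 ? "yes" : "no");
      return 0;
    }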

Betsy:
;login: has an article on Liz’s interrupts:
https://research.google.com/pubs/pub45764.html

Q: Follow-up – there were multiple steps here; can you break down the intermediate targets?
A:
Liz: we didn’t have a long-term roadmap. Automate the largest fire, then go to the next problem. E.g., get rid of resizing jobs by hand first, then standardize job sizes next.

Rob: “At Google there’s no such thing as a five year roadmap that doesn’t change in a year (hardware exempt)”

Q: you mentioned automation; Amazon had an interesting postmortem that discussed too-powerful config management tools
A:
Liz: we love to write postmortems, we love to make post-mortem tickets, and we occasionally work on those items.

safety rails are the model for this type of problem.

“should you be able to turn down a unit of processing power below its level of utilization? probably not”
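
A minimal sketch of that kind of safety rail, with a hypothetical Cluster struct and ResizeCluster helper (not a real API): the automation rejects a resize that would drop capacity below what monitoring says is currently in use, instead of trusting whoever typed the number.

    // Guard rail: refuse to shrink a cluster below its current utilization.
    #include <cstdio>
    #include <stdexcept>

    struct Cluster {
      const char* name;
      double provisioned_cores;
      double used_cores;  // current utilization, as reported by monitoring
    };

    void ResizeCluster(Cluster& c, double new_cores) {
      const double headroom = 1.2;  // assumed safety margin above current usage
      if (new_cores < c.used_cores * headroom) {
        throw std::runtime_error("resize rejected: target below current utilization");
      }
      c.provisioned_cores = new_cores;
      std::printf("%s resized to %.0f cores\n", c.name, new_cores);
    }

    int main() {
      Cluster c{"cluster-a", 1000, 800};
      ResizeCluster(c, 1200);  // fine: above usage plus headroom
      try {
        ResizeCluster(c, 500);  // below utilization: the guard rail fires
      } catch (const std::exception& e) {
        std::printf("%s\n", e.what());
      }
      return 0;
    }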

Rob:
Norms that allow us to avoid some things like this
Changes have to be cluster by cluster

Liz:
if you make it easy to do incremental rollouts (via automation), people will not write “update all” shell scripts – “Shell scripts sink ships”
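
A minimal sketch of that norm as automation rather than an “update all” shell script; PushConfig, ClusterHealthy, and the cluster names are hypothetical stand-ins, not Google tooling.

    // Cluster-by-cluster rollout: push, verify, and stop at the first bad sign.
    #include <cstdio>
    #include <string>
    #include <vector>

    // Stand-ins for the real push and verification steps.
    bool PushConfig(const std::string& cluster, const std::string& version) {
      std::printf("pushing %s to %s\n", version.c_str(), cluster.c_str());
      return true;
    }
    bool ClusterHealthy(const std::string& cluster) {
      std::printf("health check on %s passed\n", cluster.c_str());
      return true;
    }

    int main() {
      const std::vector<std::string> clusters = {"us-east", "us-west", "eu-west"};
      const std::string version = "config-v42";

      // One cluster at a time, so a bad change never reaches everything at once.
      for (const auto& cluster : clusters) {
        if (!PushConfig(cluster, version) || !ClusterHealthy(cluster)) {
          std::printf("halting rollout at %s\n", cluster.c_str());
          return 1;
        }
      }
      std::printf("rollout complete\n");
      return 0;
    }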

Q: per the book: making processes visible really helps

Liz:
Our mission for the next few years is to make a framework with guardrails built in
the status port page should come free
resize check should come free
default to safe / platform building is one of SRE’s main objectives

Rob:
Should your server prevent you from touching every instance?

Liz:
Easy undo buttons
if you have these, then problems are no big deal

Betsy:
“What is ideal, what is actuality”
SREs would like to be involved in design; sometimes they still come to a service when it is a hot mess

Rob:
lots of Google code has assertions that crash rather than corrupt mutable data.
sometimes engineers add assertions where they don’t know what to do, even though it is not a dangerous situation
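
A minimal sketch of the crash-rather-than-corrupt pattern; the CheckOrDie helper and the account example are illustrative, not Google’s actual check macros.

    // Crash immediately on a violated invariant instead of writing through.
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    void CheckOrDie(bool ok, const char* what) {
      if (!ok) {
        std::fprintf(stderr, "invariant failed: %s\n", what);
        std::abort();
      }
    }

    // Applies a delta to an account balance. Crashing here is recoverable (the
    // task restarts); silently corrupting the balance is not.
    void ApplyDelta(std::vector<long>& balances, std::size_t account, long delta) {
      CheckOrDie(account < balances.size(), "account index in range");
      CheckOrDie(balances[account] + delta >= 0, "balance stays non-negative");
      balances[account] += delta;
    }

    int main() {
      std::vector<long> balances = {100, 250};
      ApplyDelta(balances, 0, -50);   // fine
      ApplyDelta(balances, 1, -300);  // violated invariant: the process aborts here
      std::printf("not reached if the second call aborted\n");
      return 0;
    }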

Betsy
Re: postmortem action items not getting addressed:
there’s a long tail of action items (AIs) that never get closed out.
;login: has an article about how to make an AI list that actually gets closed out.

Google does not do things perfectly, we are trying to improve and we’re talking to people about how to do better

Q: creative commons license – how did that all go?
A:
Betsy: I didn’t have to deal with the legal side of this.
in lieu of profits, we got the Creative Commons version. O’Reilly didn’t want it to come out at the same time.

we could not use O’Reilly’s formatting or images.

Liz: O’Reilly seems more open to this in the future

Q: you mentioned building a framework for building in reliability (e.g. baking in status ports and other features)
many times you start with your own version, and by then it’s too big to switch to open-source projects

is there a way to build tooling that is generically useful like the book?

A:
Liz:
example: gRPC is open source and useful
there may not be utility in releasing tools around logging, because our logging tools are specialized for Google’s needs.

ease of use thing: no one should be writing a “main” function in C

“init google” as a C++ function that pulls in lots of helpers
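
A minimal sketch of that idea, with a hypothetical infra::InitServer standing in for Google’s actual init function: each binary’s main() stays tiny, and the shared init pulls in flag parsing, logging, the status port, and the other helpers every server should get for free.

    // Shared init instead of every service hand-rolling its own main() setup.
    #include <cstdio>

    namespace infra {
    // One place where flags, logging, monitoring exports, and other platform
    // helpers get wired up for every binary that links this library.
    void InitServer(int* argc, char*** argv) {
      (void)argc;
      (void)argv;
      std::printf("flags parsed, logging configured, status port registered\n");
    }
    }  // namespace infra

    int main(int argc, char** argv) {
      infra::InitServer(&argc, &argv);
      // Service-specific logic starts here; the boilerplate came for free.
      std::printf("serving\n");
      return 0;
    }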

Rob:
Agree that not all infra tools are shareable because systems are bespoke
hard to imagine how to open-source many of those things (which is sad)

Liz:
another example: the Go language and its debugging tools

Q: what sorts of orgs are resistant to postmortems?
A: Liz: government contracting is very blameful
admitting responsibility is hard in those environments