Updates from March, 2017

  • @alexlo03 11:08 on 2017/03/27

    Making Ansible Network Security 2-3x Faster 

    At Flatiron Health we use Ansible to configure AWS network security groups (see blog).  Over time I noticed more and more timeouts while asserting that the network security state was where we thought it should be.  Digging into the code, I found this confusing block:

    [Screenshot: the ec2_group module source around L322–L326, with the expensive call highlighted]

    The timeout happened on the highlighted line.  Why was checking a group fetching all ec2 instances?  They aren’t even used unless the target description doesn’t match the existing description.  We could fetch the ec2 instances lazily, only when that case actually arises.

    Digging deeper, the error condition on L326 has the intent that if the group is not in use, then maybe the description can be updated.  Presumably that update would be done by deleting the security group and recreating it, since security group descriptions are immutable.  The update never happens in the module, so this is clearly a relic.  (Side note: the public-ssh group assumption on L322 is another funny relic.)

    My recent PR/commit against this code cleaned it up a fair bit: a mismatch between the target description and the existing description is now simply an error, with no check of whether any ec2 instances are using the existing group.
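
    In rough outline, the change looks like this (a paraphrased sketch against the old boto2-style API, not the actual module source; the function and variable names are mine):

    # Before (paraphrased): every check fetched all ec2 instances up front,
    # even though they were only used when the descriptions differed.
    def check_description_old(connection, group, target_desc):
        reservations = connection.get_all_instances()  # expensive: O(account size)
        if group.description != target_desc:
            in_use = any(group.id in [g.id for g in instance.groups]
                         for r in reservations for instance in r.instances)
            if in_use:
                raise Exception("Group is in use; description cannot be updated")

    # After (paraphrased): descriptions are immutable, so a mismatch is
    # simply an error. No instance scan is needed at all.
    def check_description_new(group, target_desc):
        if group.description != target_desc:
            raise Exception("Group description attribute cannot be changed")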

    Impact

    How expensive is getting all ec2 instances?  It depends on how large your AWS account is.  For us the return value was in the ballpark of 1 MB (tested via aws ec2 describe-instances).
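
    If you want to gauge this for your own account, here is a minimal sketch using boto3 (I measured with the AWS CLI; this is just an equivalent programmatic way to do it):

    import json
    import boto3

    # Roughly how many bytes does a full instance scan return?
    # Same spirit as: aws ec2 describe-instances | wc -c
    ec2 = boto3.client("ec2")
    total = 0
    for page in ec2.get_paginator("describe_instances").paginate():
        total += len(json.dumps(page, default=str))  # default=str handles datetimes
    print("describe-instances payload: ~%.1f MB" % (total / 1e6))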

    Before

    # ansible==2.1.3.0
    time ansible-playbook sg-update.yml --check
    ...
    real 8m23.103s
    user 2m3.358s
    sys 0m46.586s

    After (running Ansible at the commit containing the change)

    time ansible-playbook sg-update.yml --check
    ...
    real 3m5.069s
    user 1m0.551s
    sys 0m38.873s

    Going from 503 seconds to 185 seconds is an appreciable speed-up (2.7x faster).  This speed-up should apply any time the security group is already present, whether in --check mode or not.

    I’m looking forward to the next release of Ansible, when we can realize these savings.  (I’m not sure if this will be in 2.3, which was just cut, or if we’ll have to wait for 2.4.)

    Thanks to the reviewers/maintainers of Ansible for the review and getting this merged.

     
  • @alexlo03 21:02 on 2017/03/02

    SRE Conv @ Dropbox 

    with:

    • Betsy Beyer
    • Rob Ewaschuk
    • Liz Fong-Jones

    Where is SRE now?

    more than 2000 engineers in SRE today at Google
    outside of Google? sort of embraced – definitely in flux
    note: SREcon is a conference coming up

    SRE – what is it?

    “you know it when you see it”

    • Creative autonomous engineering
    • wisdom of operations
    • SWE and systems engineering are equally valued skills

     * what is systems engineering? how do you think about breaking large systems down? define interfaces? if things break, how do you troubleshoot?
     * how do we automate / make more robust?

    Making of the book

    • articles in journals that were copyright friendly
    • stitch it together offsite in a building without engineers

    Interesting notes

    • writing the book led to the discovery that they had multiple opinions on what SLOs are
    • configuration management chapter got skipped because opinions were too varied
    • at Google, whitepapers float around and mature before becoming external

    Q&A

    Q: how did you come to common tone?
    A: This was Betsy’s job. Tech writers were in/out on various chapters.

    Q: insights as google has scaled? currently SRE is very mature.
    A:
    what has increased is technical maturity of automation
    Rob:
    started in Gmail, crafted own automation in Python
    now: down to three CM solutions at Google, two are going away

    Liz:
    Scaling by changing incentives:
    It used to be that people were promoted based on how complex the product they made was. (This leads to multiple competing solutions.)
    Now: reward convergence and solving complex problems with simple solutions
    Better to have three standards than twenty

    Rob:
    What to page on/etc
    Growing culture/techniques over time
    Operational maturity
    Ben (original VP of SRE) does apply capability maturity models to teams

    Liz: He asks “Where are you on the scale? Are you improving? Why not?”

    Rob:
    Management and view of SRE time
    Project work vs. operational work, strategic vs. tactical; the target was a 50/50 split
    support from management came in when teams weren’t meeting the 50%

    Liz: it must be engineering.  Let some breakage happen in order to do engineering.

    Rob:
    An SLO on mean latency of Bigtable was paging us all the time, even at the 75th percentile.  We dropped the promise quite a bit, worked on engineering, set up monitoring, and two years later we had a 99th-percentile SLO
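
    A toy illustration of the difference between a mean-latency SLO and a percentile SLO (my own sketch; the distribution and threshold are made up):

    import random
    from statistics import mean, quantiles

    # Hypothetical latency samples (ms): mostly fast, with a heavy tail.
    samples = [random.lognormvariate(3, 0.8) for _ in range(10_000)]
    slo_ms = 100  # made-up threshold

    # A mean-based SLO gets dragged around by the tail...
    print("mean latency: %.1f ms" % mean(samples))

    # ...while a percentile SLO promises what fraction of requests are fast.
    p99 = quantiles(samples, n=100)[98]  # 99th percentile
    print("p99 latency: %.1f ms -> SLO %s" % (p99, "met" if p99 <= slo_ms else "missed"))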

    Betsy:
    ;login: has an article on Liz’s interrupts
    https://research.google.com/pubs/pub45764.html


    Q: Follow up – there were multiple steps here, can you break down the in between targets?
    A:
    Liz: didn’t have a long-term roadmap.  Automate the largest fire, then go to the next problem.  E.g. get rid of resizing jobs by hand first, then standardize the size of jobs next.

    Rob: “At Google there’s no such thing as a five year roadmap that doesn’t change in a year (hardware exempt)”

    Q: you mention automation, Amazon had an interesting postmortem that discussed too-powerful config management tools
    A:
    Liz: we love to write postmortems, we love to make post-mortem tickets, and we occasionally work on those items.

    Safety rails are the model for this type of problem.

    “should you be able to turn down a unit of processing power below its level of utilization? probably not”

    Rob:
    Norms that allow us to avoid some things like this
    Changes have to be cluster by cluster

    Liz:
    if you make it easy to do incremental rollouts (via automation), people will not write “update all” shell scripts – “Shell scripts sink ships”
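
    A minimal sketch of what that kind of automation might look like (entirely hypothetical names and timings; the point is that the incremental path becomes the easy path):

    import time

    CLUSTERS = ["us-east-1", "us-west-1", "eu-west-1"]  # hypothetical

    def deploy_to(cluster, version):
        print("deploying %s to %s" % (version, cluster))  # stand-in for a real deploy

    def healthy(cluster):
        return True  # stand-in for a real health/SLO check

    def rollout(version):
        # One cluster at a time, with a health gate between steps,
        # instead of an "update all" shell script.
        for cluster in CLUSTERS:
            deploy_to(cluster, version)
            time.sleep(1)  # bake time (much longer in reality)
            if not healthy(cluster):
                raise RuntimeError("%s unhealthy; halting rollout" % cluster)

    rollout("v1.2.3")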

    Q: per the book: making processes visible really helps

    Liz:
    Our mission for the next few years is to make a framework with guardrails built in
    the status port page should come for free
    a resize check should come for free
    defaulting to safe / platform building is one of SRE’s main objectives
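
    A toy sketch of what “the status port page should come for free” could mean: a hypothetical framework base class where every service gets a status endpoint without writing any code for it (my own construction, not Google’s actual framework):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class FrameworkHandler(BaseHTTPRequestHandler):
        # Hypothetical framework base: every service gets /statusz for free.
        def do_GET(self):
            if self.path == "/statusz":
                self.respond(200, b"ok")         # built-in status page
            else:
                self.handle_app()                # service-specific logic

        def handle_app(self):
            self.respond(404, b"not found")      # overridden by each service

        def respond(self, code, body):
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # A service author subclasses FrameworkHandler and overrides handle_app;
    # monitoring via /statusz comes for free:
    # HTTPServer(("", 8080), FrameworkHandler).serve_forever()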

    Rob:
    Should your server prevent you from touching every instance?

    Liz:
    Easy undo buttons
    if you have these, then problems are no big deal

    Betsy:
    “What is ideal, what is actuality”
    SREs would like to be involved in design; sometimes they still come to a service when it is a hot mess

    Rob:
    lots of Google code has assertions that crash rather than corrupt mutable data.
    sometimes engineers add assertions where they don’t know what to do, even though it is not a dangerous situation
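
    A tiny illustration of the crash-rather-than-corrupt pattern (my own sketch, not Google code):

    def apply_transfer(ledger, src, dst, amount):
        # Fail fast: crashing here is better than silently corrupting state.
        assert amount > 0, "non-positive transfer amount"
        assert ledger[src] >= amount, "overdraft would break the ledger invariant"
        ledger[src] -= amount
        ledger[dst] += amount

    ledger = {"a": 100, "b": 0}
    apply_transfer(ledger, "a", "b", 25)
    print(ledger)  # {'a': 75, 'b': 25}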

    Betsy:
    Re: postmortem action items not getting addressed:
    there is a long tail of action items (AIs) that never get closed out.
    in ;login: there’s an article about how to make an AI list that actually gets closed out.

    Google does not do things perfectly; we are trying to improve, and we’re talking to people about how to do better

    Q: creative commons license – how did that all go?
    A:
    Betsy: I didn’t have to deal with the legal side of this
    in lieu of profits, we got the creative commons version. O’Reilly didn’t want it to come out at the same time

    could not use O’Reilly’s formatting or images.

    Liz: O’Reilly seems more open to this in the future

    Q: you mentioned building a framework for building in reliability (e.g. baking in status ports and other features)
    many times you start with your own version, and then it’s too big to switch to open source projects

    is there a way to build tooling that is generically useful like the book?

    A:
    Liz:
    example: grpc is open source and useful
    there may not be utility in releasing tools around logging, because our logging tools are specialized for Google’s needs.

    ease of use thing: no one should be writing a “main” function in C

    “init google” as a C++ function that pulls in lots of helpers

    Rob:
    Agree that not all infra tools are sharable because systems are bespoke
    hard to imagine how to open source many of those things (which is sad)

    Liz:
    another example: Go lang and debugging tools

    Q: what sorts of orgs are resistant to postmortems?
    A: Liz: Gov contracting is very blameful
    admitting responsibility is hard in those environments

     