Updates from March, 2017

  • @alexlo03 11:08 on 2017/03/27

    Making Ansible Network Security 2-3x Faster 

    At Flatiron Health we use Ansible to configure AWS network security groups (see blog).  Over time I noticed more and more timeouts while asserting that the network security state was where we thought it should be.  Digging into the code, I found this confusing block:

    [Screenshot: the ec2_group module source around L322–L326, with the expensive call highlighted]

    The timeout happened on the highlighted line.  Why was checking a group fetching all ec2 instances?  They aren’t even used unless the target description doesn’t match the existing description.  We could fetch the ec2 instances lazily, only when that case actually arises.

    Digging deeper, the error condition on L326 has the intent that if the group is not in use, then maybe the description can be updated.  Presumably that update would be done by deleting the security group and recreating it, since security group descriptions are immutable.  The update never happens in the module, so this is clearly a relic.  (Side note: the public-ssh group assumption on L322 is another funny relic.)

    My recent PR/commit against this code cleaned it up a fair bit: a mismatch between the target description and the existing description is now simply an error, with no check of whether any ec2 instances are using the existing group.
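
    In rough outline, the change looks like this (a paraphrased sketch against the old boto2-style API, not the actual module source; the function and variable names are mine):

    # Before (paraphrased): every check fetched all ec2 instances up front,
    # even though they were only used when the descriptions differed.
    def check_description_old(connection, group, target_desc):
        reservations = connection.get_all_instances()  # expensive: O(account size)
        if group.description != target_desc:
            in_use = any(group.id in [g.id for g in instance.groups]
                         for r in reservations for instance in r.instances)
            if in_use:
                raise Exception("Group is in use; description cannot be updated")

    # After (paraphrased): descriptions are immutable, so a mismatch is
    # simply an error. No instance scan is needed at all.
    def check_description_new(group, target_desc):
        if group.description != target_desc:
            raise Exception("Group description attribute cannot be changed")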

    Impact

    How expensive is getting all ec2 instances?  It depends on how large your AWS account is.  For us the return value was in the ballpark of 1 MB (tested via aws ec2 describe-instances).
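
    If you want to gauge this for your own account, here is a minimal sketch using boto3 (I measured with the AWS CLI; this is just an equivalent programmatic way to do it):

    import json
    import boto3

    # Roughly how many bytes does a full instance scan return?
    # Same spirit as: aws ec2 describe-instances | wc -c
    ec2 = boto3.client("ec2")
    total = 0
    for page in ec2.get_paginator("describe_instances").paginate():
        total += len(json.dumps(page, default=str))  # default=str handles datetimes
    print("describe-instances payload: ~%.1f MB" % (total / 1e6))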

    Before

    # ansible==2.1.3.0
    time ansible-playbook sg-update.yml --check
    ...
    real 8m23.103s
    user 2m3.358s
    sys 0m46.586s

    After (running Ansible at the commit containing the change)

    time ansible-playbook sg-update.yml --check
    ...
    real 3m5.069s
    user 1m0.551s
    sys 0m38.873s

    Going from 503 seconds to 185 seconds is an appreciable speed-up (2.7x faster).  This speed-up should apply any time the security group is already present, whether in --check mode or not.

    I’m looking forward to the next release of Ansible, when we can realize these savings.  (I’m not sure if this will be in 2.3, which was just cut, or if we’ll have to wait for 2.4.)

    Thanks to the reviewers/maintainers of Ansible for the review and getting this merged.

     
  • @alexlo03 21:02 on 2017/03/02

    SRE Conv @ Dropbox 

    with:

    • Betsy Beyer
    • Rob Ewaschuk
    • Liz Fong-Jones

    Where is SRE now?

    more than 2000 engineers in SRE today at Google
    outside of Google? sort of embraced – definitely in flux
    note: SREcon is a conference coming up

    SRE – what is it?

    “you know it when you see it”

    • Creative autonomous engineering
    • wisdom of operations
    • SWE and systems engineering are equally valued skills

     * what is systems engineering? how do you think about breaking large systems down? define interfaces? if things break, how do you troubleshoot?
     * how do we automate / make more robust?

    Making of the book

    • articles in journals that were copyright friendly
    • stitch it together offsite in a building without engineers

    Interesting notes

    • writing the book led to the discovery that they had multiple opinions on what SLOs are
    • configuration management chapter got skipped because opinions were too varied
    • at Google, whitepapers float around and mature before becoming external

    Q&A

    Q: how did you come to common tone?
    A: This was Betsy’s job. Tech writers were in/out on various chapters.

    Q: insights as google has scaled? currently SRE is very mature.
    A:
    what has increased is technical maturity of automation
    Rob:
    started in Gmail, crafted own automation in Python
    now: down to three CM solutions at Google, two are going away

    Liz:
    Scaling by changing incentives:
    It used to be that people were promoted based on how complex the product they made was. (This leads to multiple competing solutions.)
    Now: reward convergence and solving complex problems with simple solutions
    Better to have three standards than twenty

    Rob:
    What to page on/etc
    Growing culture/techniques over time
    Operational maturity
    Ben (original VP of SRE) does apply capability maturity models to teams

    Liz: He asks “Where are you on the scale? Are you improving? Why not?”

    Rob:
    Management and view of SRE time
    Project work vs. operational work, strategic vs. tactical; the target was a 50/50 split
    support from management came in when teams weren’t meeting the 50%

    Liz: it must be engineering.  Let some breakage happen in order to do engineering.

    Rob:
    An SLO on mean latency of Bigtable was paging us all the time, even at the 75th percentile.  We dropped the promise quite a bit, worked on engineering, set up monitoring, and two years later we had a 99th-percentile SLO
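
    A toy illustration of the difference between a mean-latency SLO and a percentile SLO (my own sketch; the distribution and threshold are made up):

    import random
    from statistics import mean, quantiles

    # Hypothetical latency samples (ms): mostly fast, with a heavy tail.
    samples = [random.lognormvariate(3, 0.8) for _ in range(10_000)]
    slo_ms = 100  # made-up threshold

    # A mean-based SLO gets dragged around by the tail...
    print("mean latency: %.1f ms" % mean(samples))

    # ...while a percentile SLO promises what fraction of requests are fast.
    p99 = quantiles(samples, n=100)[98]  # 99th percentile
    print("p99 latency: %.1f ms -> SLO %s" % (p99, "met" if p99 <= slo_ms else "missed"))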

    Betsy:
    ;login: has an article on Liz’s interrupts
    https://research.google.com/pubs/pub45764.html


    Q: Follow up – there were multiple steps here, can you break down the in between targets?
    A:
    Liz: didn’t have a long-term roadmap.  Automate the largest fire, then go to the next problem.  E.g. get rid of resizing jobs by hand first, then standardize the size of jobs next.

    Rob: “At Google there’s no such thing as a five year roadmap that doesn’t change in a year (hardware exempt)”

    Q: you mention automation, Amazon had an interesting postmortem that discussed too-powerful config management tools
    A:
    Liz: we love to write postmortems, we love to make post-mortem tickets, and we occasionally work on those items.

    Safety rails are the model for this type of problem.

    “should you be able to turn down a unit of processing power below its level of utilization? probably not”

    Rob:
    Norms that allow us to avoid some things like this
    Changes have to be cluster by cluster

    Liz:
    if you make it easy to do incremental rollouts (via automation), people will not write “update all” shell scripts – “Shell scripts sink ships”
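
    A minimal sketch of what that kind of automation might look like (entirely hypothetical names and timings; the point is that the incremental path becomes the easy path):

    import time

    CLUSTERS = ["us-east-1", "us-west-1", "eu-west-1"]  # hypothetical

    def deploy_to(cluster, version):
        print("deploying %s to %s" % (version, cluster))  # stand-in for a real deploy

    def healthy(cluster):
        return True  # stand-in for a real health/SLO check

    def rollout(version):
        # One cluster at a time, with a health gate between steps,
        # instead of an "update all" shell script.
        for cluster in CLUSTERS:
            deploy_to(cluster, version)
            time.sleep(1)  # bake time (much longer in reality)
            if not healthy(cluster):
                raise RuntimeError("%s unhealthy; halting rollout" % cluster)

    rollout("v1.2.3")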

    Q: per the book: making processes visible really helps

    Liz:
    Our mission for the next few years is to make a framework with guardrails built in
    the status port page should come for free
    a resize check should come for free
    defaulting to safe / platform building is one of SRE’s main objectives
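
    A toy sketch of what “the status port page should come for free” could mean: a hypothetical framework base class where every service gets a status endpoint without writing any code for it (my own construction, not Google’s actual framework):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class FrameworkHandler(BaseHTTPRequestHandler):
        # Hypothetical framework base: every service gets /statusz for free.
        def do_GET(self):
            if self.path == "/statusz":
                self.respond(200, b"ok")         # built-in status page
            else:
                self.handle_app()                # service-specific logic

        def handle_app(self):
            self.respond(404, b"not found")      # overridden by each service

        def respond(self, code, body):
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # A service author subclasses FrameworkHandler and overrides handle_app;
    # monitoring via /statusz comes for free:
    # HTTPServer(("", 8080), FrameworkHandler).serve_forever()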

    Rob:
    Should your server prevent you from touching every instance?

    Liz:
    Easy undo buttons
    if you have these, then problems are no big deal

    Betsy:
    “What is ideal, what is actuality”
    SREs would like to be involved in design; sometimes they still come to a service when it is a hot mess

    Rob:
    lots of Google code has assertions that crash rather than corrupt mutable data.
    sometimes engineers add assertions where they don’t know what to do, even though it is not a dangerous situation
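
    A tiny illustration of the crash-rather-than-corrupt pattern (my own sketch, not Google code):

    def apply_transfer(ledger, src, dst, amount):
        # Fail fast: crashing here is better than silently corrupting state.
        assert amount > 0, "non-positive transfer amount"
        assert ledger[src] >= amount, "overdraft would break the ledger invariant"
        ledger[src] -= amount
        ledger[dst] += amount

    ledger = {"a": 100, "b": 0}
    apply_transfer(ledger, "a", "b", 25)
    print(ledger)  # {'a': 75, 'b': 25}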

    Betsy:
    Re: postmortem action items not getting addressed:
    there is a long tail of action items (AIs) that never get closed out.
    in ;login: there’s an article about how to make an AI list that actually gets closed out.

    Google does not do things perfectly; we are trying to improve, and we’re talking to people about how to do better

    Q: creative commons license – how did that all go?
    A:
    Betsy: I didn’t have to deal with the legal side of this
    in lieu of profits, we got the creative commons version. O’Reilly didn’t want it to come out at the same time

    could not use O’Reilly’s formatting or images.

    Liz: O’Reilly seems more open to this in the future

    Q: you mentioned building a framework for building in reliability (e.g. baking in status ports and other features)
    many times you start with your own version, and then it’s too big to switch to open source projects

    is there a way to build tooling that is generically useful like the book?

    A:
    Liz:
    example: grpc is open source and useful
    there may not be utility in releasing tools around logging, because our logging tools are specialized for Google’s needs.

    ease of use thing: no one should be writing a “main” function in C

    “init google” as a C++ function that pulls in lots of helpers

    Rob:
    Agree that not all infra tools are sharable because systems are bespoke
    hard to imagine how to open source many of those things (which is sad)

    Liz:
    another example: Go lang and debugging tools

    Q: what sorts of orgs are resistant to postmortems?
    A: Liz: Gov contracting is very blameful
    admitting responsibility is hard in those environments

     