Updates from @alexlo03

  • @alexlo03 08:35 on 2022/12/08

    COVID: Top risk on the board 

    I am not a doctor or health professional.

    I believe that for many people COVID is the top threat on the board to long term health and happiness.

    COVID is why health systems are overrun, people are missing more work/school, and people are dying at higher rates (excess mortality).

    Source

    Source

    COVID represents a new kind of danger in the world; there is no good analogue threat. For example, the idea that COVID is “like influenza” is a popular take – you can get influenza multiple times like COVID, and symptoms can be similar – but it is NOT like influenza: it is much deadlier and carries a long-term damage/cumulative risk component.

    COVID increases risk for other bad health outcomes (stroke, etc). Each COVID infection increases risk. It is not a binary event that you’ve had COVID.

    This chart has the Hazard Ratio (“1” means no increased risk – this is the dotted line, “2” means “at double risk vs controls”). Note the large jump from “infected once” to “infected twice”.

    Source

    COVID represents 100% downside risk. There is no upside.

    Presently humans are unable to acquire durable immunity to Covid either by vaccination or infection. The only viable strategy is avoidance.

    Misc

    “Immunity Debt” article

     
  • @alexlo03 06:54 on 2017/10/24

    Long Term Stock Exchange Fallacy? 

    The new idea for the “start up / owner friendly” Long Term Stock Exchange offers “Tenure Voting” which means that owning a share for ten years gives you 10x the voting power of a new share owner. (1, 2, 3)

    This seems backwards to me: the longest holders don’t necessarily have the longest forward view of the company.  Don’t you want to offer the most voting power to those who plan on holding the stock the longest prospectively?  So, for example, at voting time you could lock in your shares for X years and get the voting leverage there.  This is harder, because if you commit to X years and lose on some vote, then you’re locked in to a board/decision that you’re not into.  One solution is “only winning vote commitments stick” – an incentive not to vote or commit – this seems good: if you care and commit you get leverage, if you don’t you get freedom.  Another solution to accidental long commitments is to codify exit terms.  Doing both would probably work.
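
    Here is a toy sketch (my own illustration, not from the LTSE proposals) of how a prospective-commitment tally could work, where voting power scales with the years you commit to hold and only the winning side's commitments stick:

    # Toy model of prospective "tenure voting" (hypothetical names/weights)
    def tally(ballots):
        # ballots: list of (choice, shares, committed_years)
        power = {}
        for choice, shares, years in ballots:
            power[choice] = power.get(choice, 0) + shares * max(years, 1)
        winner = max(power, key=power.get)
        # "only winning vote commitments stick": losers stay liquid
        locked = [b for b in ballots if b[0] == winner]
        return winner, locked

    # 100 shares committed for 10 years outvote 500 uncommitted shares
    print(tally([("yes", 100, 10), ("no", 500, 1)]))  # winner: "yes"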

     
  • @alexlo03 15:46 on 2017/10/14

    Learning Curve 

    [Figure: boxed-curve] The curve y = 2·ln(x) + 3 – note the boxes (A, B, C) are all the same height.

    I’ve been thinking about this curve.  It represents learning over time, on average, in jobs/projects I’ve had.  At first (A) you know nothing, so doubling your knowledge happens every day; once you start being useful (B) you’re still learning a bunch as you apply previous learning; and eventually learning slows down (C).
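
    A quick check of the equal-height-boxes property (my own illustration): for y = 2·ln(x) + 3, the height gained over any interval [a, k·a] is 2·ln(k), no matter how far along the curve you are.

    import math

    def y(x):
        return 2 * math.log(x) + 3

    # height of a "box" spanning [a, 4a] is the same everywhere: 2*ln(4)
    for a in (1, 10, 100):
        print(round(y(4 * a) - y(a), 6))  # prints 2.772589 three times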

    I would like to always be able to throw myself back into A. I enjoy continuing to learn and grow. I feel fortunate that this is necessary and encouraged in software engineering as it is still a growing discipline.  I’m also a little afraid that the Steamroller will fucking get me.

    C can be seductive.  I’ve spent time there.  The longer I spent there the more I grew afraid of change.  I now consider it a professional and personal hazard to spend time in C.

    That feeling of getting into C?  When’s the last time you learned something new?  Have you surprised yourself?  Have you had the feeling of inventing yourself?

    How do you pivot back to A/B?  To me the biggest step is to decide you’re ok with discomfort for growth.  Once that is settled I think the rest is pretty straightforward.  One can change jobs, change roles, change teams, start a new initiative, or just gravitate to a new work stream.

     
  • @alexlo03 11:08 on 2017/03/27

    Making Ansible Network Security 2-3x Faster 

    At Flatiron Health we use Ansible to configure AWS Network Security groups (see blog).  Over time I noticed more and more timeouts while asserting that the network security state was where we thought it should be.  Digging into the code I found this confusing block of code:

    [Screenshot: the confusing block of the ec2_group module, with the expensive call highlighted]

    The timeout happened on the highlighted line.  Why was checking a group fetching all EC2 instances?  It doesn’t even use them unless the target description doesn’t match the existing description.  We could fetch EC2 instances lazily, only in that case.

    Digging deeper, the error condition listed on L326 has the intent that if the group is not being used, then maybe the description can be updated.  Presumably that update would be done via deleting the security group and recreating it since security group descriptions are immutable.  This update never happens in the module, so clearly this is a relic.  (side note: the public-ssh group assumption on L322 is another funny relic)

    My recent PR/commit against this code cleaned this up a fair bit and just made it an error if the target description does not equal the existing description, without checking whether any EC2 instances are using the existing group.
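
    For illustration, the shape of the change looks roughly like this (a simplified sketch, not the actual module code; `module` and the boto calls are stand-ins):

    # Before: fetched every instance in the account just to decide whether
    # a description mismatch might be "fixable" (it never was -- descriptions
    # are immutable, and the update branch was dead code).
    def check_description_old(ec2, group, target_description, module):
        if group.description != target_description:
            instances = ec2.get_only_instances()  # expensive: ~1MB for us
            ...  # relic logic inspecting whether the group is in use
            module.fail_json(msg="description mismatch")

    # After: fail fast on mismatch, no instance fetch at all.
    def check_description_new(group, target_description, module):
        if group.description != target_description:
            module.fail_json(msg="Group description attribute cannot be changed")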

    Impact

    How expensive is getting all ec2 instances?  Well it depends on how large your AWS account is.  For us the return value was in the ballpark of 1MB (tested via aws ec2 describe-instances).

    Before

    # ansible==2.1.3.0
    time ansible-playbook sg-update.yml --check
    ...
    real 8m23.103s
    user 2m3.358s
    sys 0m46.586s

    After (running Ansible at commit of change)

    time ansible-playbook sg-update.yml --check
    ...
    real 3m5.069s
    user 1m0.551s
    sys 0m38.873s

    From 503 seconds to 185 is an appreciable speed up (2.7x faster).  This speedup should apply any time the security group is already present whether in --check mode or not.

    I’m looking forward to the next release of Ansible when we can realize these savings.  (I’m not sure if this will be in 2.3, which was just cut, or we’ll have to wait for 2.4)

    Thanks to the reviewers/maintainers of Ansible for the review and getting this merged.

     
  • @alexlo03 21:02 on 2017/03/02

    SRE Conv @ Dropbox 

    with:

    • Betsy Beyer
    • Rob Ewaschuk
    • Liz Fong-Jones

    Where is SRE now?

    more than 2000 engineers in SRE today at Google
    outside of Google? sort of embraced – definitely in flux
    note: SREcon is a conference coming up

    SRE – what is it?

    “you know it when you see it”

    • Creative autonomous engineering
    • wisdom of operations
    • SWE and systems engineering are equally valued skills

     * what is systems engineering? how do you think about breaking large systems down? define interfaces? if things break, how do you troubleshoot?
     * how do we automate / make more robust?

    Making of the book

    • articles in journals that were copyright friendly
    • stitch it together offsite in a building without engineers

    Interesting notes

    • writing the book led to the discovery that they had multiple opinions on what SLOs are
    • configuration management chapter got skipped because opinions were too varied
    • at Google, whitepapers float around and mature before becoming external

    Q&A

    Q: how did you come to common tone?
    A: This was Betsy’s job. Tech writers were in/out on various chapters.

    Q: insights as google has scaled? currently SRE is very mature.
    A:
    what has increased is technical maturity of automation
    Rob:
    started in Gmail, crafted own automation in Python
    now: down to three CM solutions at Google, two are going away

    Liz:
    Scaling by changing incentives:
    It used to be that people were promoted based on how complex a product they made. (This leads to multiple competing solutions)
    Now: reward convergence and solving complex problems with simple solutions
    Better to have three standards than twenty

    Rob:
    What to page on/etc
    Growing culture/techniques over time
    Operational maturity
    Ben (orig VP of SRE) does apply capability maturity models to teams

    Liz: He asks “Where are you on the scale? Are you improving? Why not?”

    Rob:
    Management and view of SRE time
    Project work vs op work, strategic vs tactical, targeting the 50/50 split
    support and management came in when not meeting 50%

    Liz: must be engineering.  let some breaks happen to do engineering.

    Rob:
    SLO on mean latency of Bigtable was paging us all the time, even at the 75th percentile. Dropped the promise quite a bit. Worked on engineering, set up monitoring; two years later we had a 99th-percentile SLO

    Betsy:
    ;login: has an article on Liz’s interrupts
    https://research.google.com/pubs/pub45764.html


    Q: Follow up – there were multiple steps here, can you break down the in between targets?
    A:
    Liz: didn’t have a long term roadmap.  Automate the largest fire, then go to the next problem.  EG get rid of re-sizing jobs by hand, then standardize the size of jobs.

    Rob: “At Google there’s no such thing as a five year roadmap that doesn’t change in a year (hardware exempt)”

    Q: you mention automation, Amazon had an interesting postmortem that discussed too-powerful config management tools
    A:
    Liz: we love to write postmortems, we love to make post-mortem tickets, and we occasionally work on those items.

    safety rails is the model for this type of problem.

    “should you be able to turn down a unit of processing power below its level of utilization? probably not”

    Rob:
    Norms that allow us to avoid some things like this
    Changes have to be cluster by cluster

    Liz:
    if you make it easy to do incremental rollouts (via automation), people will not write “update all” shell scripts – “Shell scripts sink ships”

    Q: per the book: making processes visible really helps

    Liz:
    Our mission the next few years is to make a framework with guardrails built in
    the status port page should come free
    resize check should come free
    default to safe / platform building is one of SRE’s main objectives

    Rob:
    Should your server prevent you from touching every instance

    Liz:
    Easy undo buttons
    if you have these then problems are nbd

    Betsy:
    “What is ideal, what is actuality”
    SREs would like to be involved in design, sometimes still come to a service when it is a hot mess

    Rob:
    lots of google code has assertions that crash rather than corrupt mutable data.
    sometimes engineers have assertions where they don’t know what to do, though it is not a dangerous situation

    Betsy
    Re: postmortem action items not getting addressed:
    long tail of AIs that never get closed out.
    In ;login: there’s an article about how to make an AI list that actually gets closed out.

    Google does not do things perfectly, we are trying to improve and we’re talking to people about how to do better

    Q: creative commons license – how did that all go?
    A:
    Betsy: I didn’t have to deal with the legal side of this
    in lieu of profits, we got the Creative Commons version. O’Reilly didn’t want it to come out at the same time

    could not use O’Reilly’s formatting or images.

    Liz: O’Reilly seems more open to this in the future

    Q: you mentioned building a framework for building in reliability (eg baking in status ports, other features)
    many times you start with your own version, and then by the time it matters it’s too big to switch to open source projects

    is there a way to build tooling that is generically useful like the book?

    A:
    Liz:
    example: grpc is open source and useful
    there may not be utility in releasing tools around logging, because our tools around logging are special for google needs.

    ease of use thing: no one should be writing a “main” function in C

    “init google” as a C++ function that pulls in lots of helpers

    Rob:
    Agree that not all infra tools are sharable because systems are bespoke
    hard to imagine how to open source many of those things (which is sad)

    Liz:
    another example: Go lang and debugging tools

    Q: what sorts of orgs are resistant to postmortems?
    A:  Liz: Gov contracting is very blameful
    admitting responsibility is hard in those envs

     
  • @alexlo03 20:15 on 2017/02/23

    Martin Fowler – The Many Meanings of Event-Driven Architecture 

    “The Many Meanings of Event-Driven Architecture”
    @martinfowler

    ## Intro

    This talk is based on https://martinfowler.com/articles/201701-event-driven.html

    ## Typical event scenario

    user => (changes address) Customer Management => (Get Requote) Insurance Quoting => (Send Email) Communications

    ## Request/Response style

    Implied problem: dependencies NOT inverted.

    Customer Management should not have to know about the Insurance Quoting System.

    We'd like to invert the dependency. How to do that?

    • Insurance Quoting polling constantly?
    • To get a more timely response we emit an "address changed" message and Insurance Quoting subscribes

    Classic way to invert dependency is to issue events; can be done at small level and large level. Call this "Event Notification"
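
    A minimal sketch of the inversion (my own toy code, not from the talk):

    # Customer Management publishes; it knows nothing about its consumers.
    subscribers = []

    def subscribe(handler):
        subscribers.append(handler)

    def emit(event):
        for handler in subscribers:
            handler(event)

    # Insurance Quoting subscribes to the event it cares about:
    subscribe(lambda e: print("requote customer", e["customer"]))

    emit({"type": "address_changed", "customer": "c-42"})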

    ### Note on terminology: Events or Commands?

    Command = tell callee what to do

    Event = more open ended

    Although they are the same mechanics, the way you name them affects the way we treat them/think about them

    Some events may be phrased as passive-aggressive commands

    ## Event Notifications

    People love that pub/sub allows multiple consumers. Easy.

    Great property of this pattern but also a problem because we lose the flow of the system and it is hard to reason about.

    Note: GUI MVC relies on events.

    Event notification:

    • Pro: Decouples receiver from sender
    • Con: No statement of overall behavior.
      • Trolling through logs to track a distributed transaction.
      • Harder to make changes safely

    ## Event-Based State Transfer ("Event-Carried" in the original article)

    Can't be sure when your client needs more information. If your clients want to follow up for more information, you may be allowing your clients to DDoS you.

    Pros: even more decoupled from source, reduced load on supplier

    Cons: replicated data (costly), eventual consistency problem (you really hate it when you notice it - you have to think about it/manage it)

    -- alex note: seems like DB replication (though in db log shipping there is a closer relationship between primary/secondary?)
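
    Sketch of the pattern (my own toy): the event carries the changed data, so the consumer keeps a local copy instead of calling back to the source.

    customer_cache = {}  # consumer-side replica, eventually consistent

    def on_address_changed(event):
        # no follow-up request to Customer Management needed
        customer_cache[event["customer"]] = event["new_address"]

    on_address_changed({"customer": "c-42", "new_address": "1 Main St"})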

    ## Event Sourcing

    Ex: Change my address. Dispatch a function to a Person Domain Object. Alternative: first thing is to create an event object and stick it somewhere. Then process the event.

    True test of Event Sourcing: you can toss all application state and rebuild it from events. EG Git: "Example I like to give people, well programmers not normal people, it doesn't work on them"

    Other example: bank account (your balance is application state)
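
    A bare-bones version of the bank-account example (my own sketch): the balance is derived state, and the event log is the source of truth.

    events = []  # the durable log; everything else can be rebuilt

    def record(kind, amount):
        events.append((kind, amount))

    def balance():  # application state, recomputed by replay
        total = 0
        for kind, amount in events:
            total += amount if kind == "deposit" else -amount
        return total

    record("deposit", 100)
    record("withdraw", 30)
    assert balance() == 70  # toss the state, replay, same answer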

    Pros:

    • Audit
    • Debugging (can replay through an error scenario)
    • Historic State
    • Alternative State
    • Memory Image

    Cons:

    • Unfamiliar
    • External Systems
    • Event Schema (changing schemas is a pain)
    • Identifiers (if regenerating)
    • ? Asynchrony - merging event streams is async (a commit is not)
    • ? Versioning

    Implementation wrinkle: manage snapshots as you go (recomputing the entire event log is too expensive)

    ### Ex: Widget purchase system

    Input event: buy 15 widgets

    output event: 15 widgets, total $33

    internal event: buy 15 widgets: price $30, discount $3, shipping $5, total $33

    You then have to save the internal events in case the internal calculation ever changes; you lose some flexibility. Keep both internal events and output events?

    Doing a refactoring: EG renaming a function. Touching lots of files.

    Note: there was a bug in the $33 calc. Now what?

    -- alex note: see recent crypto-currency problems- https://zcoin.io/language/en/zcoins-zerocoin-bug-explained-in-detail/

    Note that all three patterns are distinct so far

    ## CQRS (Command Query Responsibility Segregation)

    Query model and command model different

    Completely different services, maybe even different models

    Martin hears more complaints about this than any other pattern. "it is twice as much work"

    Why is it problematic: Inherent problems? Done badly? Misused?

    Treat with caution. An awkward tool to use.

    ### How different is it to have an operational and reporting database?

    With CQRS: expecting sync reads - with reporting database there is expected lag

    with CQRS: don't allow read from operational DB (not the typical op db/report db setup)
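
    A toy sketch of the separation (my own illustration, not Martin's code): commands mutate the write model and update a projection; queries only ever touch the read model.

    write_model = {}  # authoritative state, commands only
    read_model = {}   # denormalized projection, queries only

    def handle_change_address(customer_id, address):  # command side
        write_model[customer_id] = address
        # project into the shape queries want
        read_model[customer_id] = {"address": address, "quote_stale": True}

    def get_customer(customer_id):  # query side; never reads write_model
        return read_model.get(customer_id)

    handle_change_address("c-42", "1 Main St")
    print(get_customer("c-42"))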

     
  • @alexlo03 21:15 on 2017/01/02

    Steps for making a change to the Ansible AWS Security Group Module 

    (On OSX)

    1) fork & clone ansible repo

    commentary: ansible has sensibly merged ansible-modules and ansible-modules-extras back in rather than continue using submodules. If you had cloned in the submodules, go ahead and `rm -rf extras` and `core` from ansible/lib/ansible/modules or your global finds will be very confusing!

    2) get `ansible-playbook` to use your repo (rather than your brew or python installed official release of ansible)

    ansible/ $ . hacking/env-setup
    ansible/ $ make

    note that now `ansible-playbook --version` should print info about your current githash/branch

    3) Try to get integration tests working locally
    3A) setup AWS IAM for cloud integration tests.

    a) set up an IAM user (“tester”) with the AmazonEC2FullAccess managed policy attached and keep the key around
    https://console.aws.amazon.com/iam/home?region=us-east-1#/policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess
    b) get your own aws credentials out of the way `mv ~/.aws/credentials ~/.aws/credentials.hide`
    c) set the test credentials in your environment:

    EC2_ACCESS_KEY=xxx
    EC2_SECRET_KEY=yyy
    EC2_REGION=us-east-1

    3B) setup credentials.yml
    copy credentials.template into credentials.yml, fill in the AWS related credentials

    3C) strip down amazon.yml to the tests I care about

    - hosts: amazon
      gather_facts: true
      roles:
        - role: test_ec2_group

    3D) run `ansible/test/integration $ make amazon`

    OPTIONAL: have it fail due to boto

    fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "boto required for this module"}

    Remediation:

    1) verify boto in your global python env

    pip freeze | grep boto

    if not `pip install boto`

    2) if the integration tests still fail – there may be a difference between the site-packages (pip packages) of SYSTEM python (/usr/bin/python) and BREW python (/usr/local/bin/python).
    Ansible uses SYSTEM python whereas you mostly use BREW python

    useful diagnostic line: `which -a python`
    should show the local version first if your path is set up correctly.

    Fix from homebrew issue

    ==> Caveats
    If you need Python to find the installed site-packages:
    mkdir -p ~/Library/Python/2.7/lib/python/site-packages
    echo '/usr/local/lib/python2.7/site-packages' > ~/Library/Python/2.7/lib/python/site-packages/homebrew.pth

    OPTIONAL: fix integration tests or module not working in integration tests

    4) add breaking integration test
    5) fix module
    6) PR
    7) remove “tester” IAM user
    8) DONE

    other useful tools:
    audit what your test suite is doing in CloudTrail (must be enabled)

     
  • @alexlo03 01:22 on 2016/12/07

    Exploring 'htop explained' locally 

    I greatly enjoyed htop explained – it helped me explore Linux internals safely.  In the post the author explores htop by spinning up a virtual machine in DigitalOcean – here are some quick instructions for getting your own Ubuntu 16.04 playground locally.

    install vagrant [1] [2]
    install vbox

    vagrant box add ubuntu/xenial64
    mkdir ~/htopfun 
    cd ~/htopfun
    vagrant init
    # edit Vagrantfile - set `config.vm.box = "ubuntu/xenial64"`
    vagrant up
    vagrant ssh
    
    # within the virtual machine
    sudo apt install htop
    
    # when done, outside of the virtual machine
    vagrant suspend
    # alternatively teardown
     
    • Kevin Risden 11:53 on 2017/01/03

      “vagrant init” can take a box name and automatically set up the Vagrantfile for you. When “vagrant up” is run, if the box doesn’t exist locally it will download it for you. This looks something like:

      mkdir ~/htopfun
      cd ~/htopfun
      vagrant init ubuntu/xenial64
      vagrant up

      • @alexlo03 12:03 on 2017/01/03

        Thanks Kevin! My first time using vagrant outside of old docker setups.

  • @alexlo03 12:34 on 2016/10/26

    DataDog summit 10/26 

    https://www.eventbrite.com/e/datadog-summit-tickets-27691767823

    # Keynote - Alexis Le-Quoc - CTO
    Observability / curiosity / control
    alerting/anomaly detection designed with these wants in mind.
    
    outlier/anomaly - expect announcement tomorrow
    "notebooks" feature - "curiosity" feature - EG metrics explorer
    
    # Cory Watson @gphat cory@stripe.com
    Building a culture of observability
    
    ## starting point
    no clear ownership -> broken windows
    lack of confidence/vision for future (how will things get better)
    very reactive
    
    ## stripe
    550 employees (how many eng? asked: top secret!)
    ~230 services, 1000s of aws vms
    Obs systems: DD, Splunk, Sentry, PagerDuty, "core dashboards"
    Obs team: 5+intern+1 on loan
    
    ## how to make a change
    * give a shit about your users
    * follow up on feedback
    * trend towards a bright future
    * measure your progress
    ** HOW?
    
    ## start over, kinda
    * spend time w/ tools
    * improve if possible
    * replace if not
    * leverage past knowledge
    EG "what about that part about grafana you don't like"
    social archeology
    
    ## why DD?
    * general purpose / simple interfaces
    * velocity of improvement of DD platform
    * OSS
    * friendly helpful staff
    
    ## empathy and respect
    * people are not generally evil, they are just busy
    * being a hater is lazy
    * help people be great at their jobs
    
    ## replacing existing system
    * overcoming the momentum is hard - adds work
    * declaring bankruptcy w/ statsd (the dotted naming scheme does not translate
    into DD verbiage)
    * saved us ops headaches (won't have the statsd drop rate - no more UDP - was
    dropping up to 50% of metrics)
    * still ongoing
    
    ## getting change rolling: Nemawashi
    Japanese - let the tree come to you - you can't show up at a meeting and introduce
    a brand new concept
    * start small - guinea-pig yourself
    * quietly lay foundation and gather feedback
    * asking how you can improve, follow up
    * engage the discontent - most to learn from them
    
    ## identify power users
    * find interested parties - empower them to help others - levers to move the
    org (training, adoption, etc)
    
    ## value
    * what are you improving?
    * how can you measure it?
    * is this the best way?
    
    what metrics?  MTTD?  MTTR?
    
    feedback loop: engineer -> system (add sensor that feeds back to eng)
    
    # flat org - how to improve observability w/o mandate
    * not having a mandate
    * stigmergy - https://en.wikipedia.org/wiki/Stigmergy
    ** eg grind or hustle
    * strike when good opportunities (eg incidents)
    
    ## advertise
    * promote team accomplishments, accomplishments of others
    * ask to help - then learn
    * observability team branded as "bees"
    
    ## make it easy&good
    * hard to make email exciting
    * make it easy/automatic to do things right
    
    ## automated monitors
    * baseline monitors - common problems/solutions
    * users have no state, are surprised
    * people care when you show them failure and how to fix it.
    
    ## features:
    ### Automatic ticket creation w/ labels/tags
    * can find links to previous ticket resolutions
    * can find all active tickets of typeX, can close if they are false alarms
    * feedback via google forms
    
    ## tracking toil
    * find all pagerduty info
    * input into redshift + looker (app)
    
    ## usage
    * >100% growth in metrics, monitors/dashboards
    * 7.5k metrics (w/ tags)
    
    ## problems
    * metric/naming, cardinality
    * monitor "blame",
    * what metrics are available to me (service owner)
    * metrics or logs? traces?
    ** splunk or DD?
    
    
    # Algorithmic Alerting
    Homin Lee
    @hominprovement
    
    being released today/tomorrow
    
    Anomaly detection - monitoring a metric through time
    Outlier detection - monitoring a metric through space
    
    if you have a trending down metric that you want to alert on - you'd have to
    reset your thresholds often
    
    seasonal metrics (big ups/downs throughout the day) - thresholding does not work -
    can use "change alerts" - this is a problem because large changes in one
    direction are typically OK (memory usage going low for a while is not a problem,
    it going high can be)
    
    what if you have a trending and seasonal metric?
    anomaly detection: predict range of values that looks normal
    
    algorithms for anomaly detection:
    "Basic"
    "Not-basic" algorithms
    * "robust"  - decompose history into trend component and seasonable component
    * "agile" - look at previous time yesterday/last week
    * "adaptive" - if the behavior changing over time - requires less information
    over time
    
    single parameter: tolerance
    
    aggregation time frame can lead to false positives
    
    Outlier detection
    DBSCAN algo
    MAD algo - median absolute deviation from the median
    
    w/ anomaly detection and outlier detection:
    don't apply them to everything
    outlier detection should be applied to things that ought to be strongly related
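
    For a feel for the MAD approach mentioned above, a toy version (my own sketch, not DD's implementation):

    # Flag hosts whose metric sits more than `threshold` MADs from the
    # group median (median absolute deviation from the median).
    def mad_outliers(values, threshold=3.0):
        s = sorted(values)
        median = s[len(s) // 2]
        devs = sorted(abs(v - median) for v in values)
        mad = devs[len(devs) // 2]
        if mad == 0:
            return []
        return [v for v in values if abs(v - median) / mad > threshold]

    print(mad_outliers([10, 11, 9, 10, 42]))  # -> [42]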
    
    # Airbnb - Ben Hughes
    * just turned off graphite
    
    ## background
    * hired lots of product engineers, not lots of SREs
    * product engineers started learning / helping with pager side of things
    * dubbed sysops - set up lots of trainings
    * 50 people on rotation, 30% of eng has attended trainings
    * w/ 50 people on volunteer rotation: only on call a few times a year, don't
    know which pages can be ignored, you've probably never seen the alert before
    * therefore pager alerts have to be very certain to have a problem
    
    ## monitoring as code - configure dd alerts:
    https://github.com/airbnb/interferon
    * pros: code working ecosystem (git, grep, etc)
    * pros: automation = good
    * cons: also causes messes
    * pros outweigh cons
    
    * can script monitor creation by getting inventory from the AWS API
    * pull requests on alerts include information on what incident / background
    caused creation of alert
    * 730 alert specs that turn into 11k monitors
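
    The monitors-from-inventory idea in toy form (my own sketch in Python; interferon's real alert specs are a Ruby DSL, and the metric/query here is made up):

    # Generate one monitor definition per host found in inventory,
    # so new hosts get coverage automatically.
    def monitors_from_inventory(hosts):
        return [
            {
                "name": f"high CPU on {h}",
                "query": f"avg(last_5m):avg:system.cpu.user{{host:{h}}} > 90",
            }
            for h in hosts
        ]

    print(monitors_from_inventory(["web-1", "web-2"]))  # 2 specs -> 2 monitors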
    
    ## reduce alert noise
    * difficult problem
    * email gets filtered, paged alerts will eventually get fixed due to annoyance
    * requires ownership
    
    * when adding new alerts - keep old ones around while proving out new one (add,
    don't modify?)
    
    ==
    
    # GrubHub
    
    ## why dd?
    * single pane of glass in context of multiple datacenters
    * alerting built in
    * apis
    * statsd/graphite compatible
    * advanced options, increasing features
    
    ## background
    * many services, many problems
    * new teams coming in, easy to miss things
    * lots of different application frameworks/etc
    
    ## monitor all the services
    * define common metric names at framework level (important for dataviz)
    * provide basic metric set for all services
    * service discovery to apply monitoring to all services
    * ensure all monitors have links to logs,runbooks,etc
    * run same monitoring in pre-production but w/o pages
    * store everything in source control
    * devs own monitoring as much as sres
    
    ## viz
    * heavy use of templated dashboards
    * operations "summary" dashboards and developer focused dashboards
    * store dashboard defs in source code
    * should help provide context to monitoring
    
    ## metrics
    * start with sane, non-product specific, metric names
    * careful of metric counts
    
    # Note from Darren:
    TODO can have ansible results go to DD
    https://www.datadoghq.com/blog/ansible-datadog-monitor-your-automation-automate-your-monitoring/
    could use this on ansible-pull quickly, would need to standardize ansible push
    env to use more broadly
    
    # Tracing Code matt@datadog.com - EG APM
    * multiple services that publish metrics as part of a whole service
    * can scope graphs/etc by dimension (EG endpoint, hostname)
    * can drill into specific reqs or categories of reqs
    * qs: what are most frequent queries/slow queries/etc
    
    * can do distributed trace - can connect RPC calls
    * lots of common integrations w/ normal services postgres/etc
    * currently integrates with python, ruby, go (more soon)
    * currently in private beta
    
    QQ: high security mode?
    
     
  • @alexlo03 19:32 on 2016/06/01

    Flow.io 

    http://www.meetup.com/ContinuousDeliveryNYC/events/230740814/
    Michael Bryzek
    @mbryzek
    
    @Yodle - http://www.yodletechblog.com/
    
    CD system: Delta
    https://github.com/flowcommerce/delta
    
    CD is audit/transparency
    not an opinionated way to do things
    (when tests happen is an opinion)
    
    Deploys are triggered by git tags
    merging PRs creates tags
    deploys automatically
    
    microservice architecture
    
    Travis for CI
    debate: should we require Travis to succeed before allowing merge?
    no: because in an emergency we still want to use the same flow
    CI is part of our process. do not expect CD tool to enforce
    
    authn/z via github
    webhooks w/ github (ignores payload from GH)
    
    application dashboard: shows desired state vs last state 
    (version/number of instances)
    
    merge PR: 
    * slack alert
    * do it:
    ** sync repo + tags
    ** create tag
    ** set desired state to new version
    ** build docker image in dockerhub (`docker build .`)
    ** pull state using Scala/Akka every 30 seconds until docker image ready
    ** scale: 
    *** create new cluster
    *** once healthy move traffic
    *** scale down old version
    
    webapp speaks to API, API speaks to PGSQL in RDS
    docker instances on ECS
    
    "rollback is antipattern, roll forward"
    "don't like 'deploy', see Martin Fowler separate deploy from sending traffic to"
    "always scale up, assert healthy, before scale down"
    
    delta config - yaml
    initial number of instances: only used once
    -> later uses current number of instances
    
    5 engineers
    >1500 releases, weekly >100, 20 releases/eng/week
    
    roadmap
    * deploy other than master branch
    * dependency support w/in repo
    * smarter traffic management
    * better healthchecks
    * more settings (enable/disable build in UI)
    * UI/UX improve
    
    CD very nice. 
    every Friday we update all projects to latest versions of every library
    delta deploys itself
    
    Q/A
    All data stores are append-only
    What is the cost of failure vs cost of inaction
    (a credit card system is a possible example where failure is very costly)
    Set up culture/process/alerting for when things break in prod
    Healthcheck verifies DB connectivity, presence of env vars, etc
     