Updates from @alexlo03

  • @alexlo03 08:35 on 2022/12/08

    COVID: Top risk on the board 

    I am not a doctor or health professional.

    I believe that for many people COVID is the top threat on the board to long term health and happiness.

    COVID is why health systems are overrun, people are missing more work/school, and people are dying at higher rates (excess mortality).

    Source

    Source

    COVID represents a new kind of danger in the world; there is no good analogue threat. For example, the idea that COVID is “like influenza” is a popular take – you can get influenza multiple times like COVID, and symptoms can be similar – but it is NOT like influenza: it is much deadlier and carries a long-term damage/cumulative risk component.

    COVID increases risk for other bad health outcomes (stroke, etc). Each COVID infection increases risk. It is not a binary event that you’ve had COVID.

    This chart has the Hazard Ratio (“1” means no increased risk – this is the dotted line, “2” means “at double risk vs controls”). Note the large jump from “infected once” to “infected twice”.

    Source

    COVID represents 100% downside risk. There is no upside.

    Presently humans are unable to acquire durable immunity to Covid either by vaccination or infection. The only viable strategy is avoidance.

    Misc

    “Immunity Debt” article

     
  • @alexlo03 06:54 on 2017/10/24

    Long Term Stock Exchange Fallacy? 

    The new idea for the “start up / owner friendly” Long Term Stock Exchange offers “Tenure Voting” which means that owning a share for ten years gives you 10x the voting power of a new share owner. (1, 2, 3)

    This seems backwards to me: the longest holders don’t necessarily have the longest forward view of the company.  Don’t you want to offer the most voting power to those who plan on holding the stock the longest prospectively?  So, for example, at voting time you could lock in your shares for X years and get the voting leverage there.  This is harder, because if you commit to X years and lose on some vote, then you’re locked in to a board/decision that you’re not into.  One solution is “only winning vote commitments stick” – an incentive not to vote or commit – this seems good: if you care and commit you get leverage, if you don’t you get freedom.  Another solution to accidental long commitments is to codify exit terms.  Doing both would probably work.
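
    Here is a toy sketch (my own illustration, not from the LTSE proposals) of how a prospective-commitment tally could work, where voting power scales with the years you commit to hold and only the winning side's commitments stick:

    # Toy model of prospective "tenure voting" (hypothetical names/weights)
    def tally(ballots):
        # ballots: list of (choice, shares, committed_years)
        power = {}
        for choice, shares, years in ballots:
            power[choice] = power.get(choice, 0) + shares * max(years, 1)
        winner = max(power, key=power.get)
        # "only winning vote commitments stick": losers stay liquid
        locked = [b for b in ballots if b[0] == winner]
        return winner, locked

    # 100 shares committed for 10 years outvote 500 uncommitted shares
    print(tally([("yes", 100, 10), ("no", 500, 1)]))  # winner: "yes"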

     
  • @alexlo03 15:46 on 2017/10/14

    Learning Curve 

    [Figure: boxed-curve] The curve y = 2·ln(x) + 3 – note the boxes (A, B, C) are all the same height.

    I’ve been thinking about this curve.  It represents learning over time, on average, in jobs/projects I’ve had.  At first (A) you know nothing, so doubling your knowledge happens every day; once you start being useful (B) you’re still learning a bunch as you apply previous learning; and eventually learning slows down (C).
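
    A quick check of the equal-height-boxes property (my own illustration): for y = 2·ln(x) + 3, the height gained over any interval [a, k·a] is 2·ln(k), no matter how far along the curve you are.

    import math

    def y(x):
        return 2 * math.log(x) + 3

    # height of a "box" spanning [a, 4a] is the same everywhere: 2*ln(4)
    for a in (1, 10, 100):
        print(round(y(4 * a) - y(a), 6))  # prints 2.772589 three times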

    I would like to always be able to throw myself back into A. I enjoy continuing to learn and grow. I feel fortunate that this is necessary and encouraged in software engineering as it is still a growing discipline.  I’m also a little afraid that the Steamroller will fucking get me.

    C can be seductive.  I’ve spent time there.  The longer I spent there the more I grew afraid of change.  I now consider it a professional and personal hazard to spend time in C.

    That feeling of getting into C?  When’s the last time you learned something new?  Have you surprised yourself?  Have you had the feeling of inventing yourself?

    How do you pivot back to A/B?  To me the biggest step is to decide you’re ok with discomfort for growth.  Once that is settled I think the rest is pretty straightforward.  One can change jobs, change roles, change teams, start a new initiative, or just gravitate to a new work stream.

     
  • @alexlo03 11:08 on 2017/03/27

    Making Ansible Network Security 2-3x Faster 

    At Flatiron Health we use Ansible to configure AWS Network Security groups (see blog).  Over time I noticed more and more timeouts while asserting that the network security state was where we thought it should be.  Digging into the code I found this confusing block of code:

    [Screenshot: the confusing block of the ec2_group module, with the expensive call highlighted]

    The timeout happened on the highlighted line.  Why was checking a group fetching all EC2 instances?  It doesn’t even use them unless the target description doesn’t match the existing description.  We could fetch EC2 instances lazily, only in that case.

    Digging deeper, the error condition listed on L326 has the intent that if the group is not being used, then maybe the description can be updated.  Presumably that update would be done via deleting the security group and recreating it since security group descriptions are immutable.  This update never happens in the module, so clearly this is a relic.  (side note: the public-ssh group assumption on L322 is another funny relic)

    My recent PR/commit against this code cleaned this up a fair bit and just made it an error if the target description does not equal the existing description, without checking whether any EC2 instances are using the existing group.
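
    For illustration, the shape of the change looks roughly like this (a simplified sketch, not the actual module code; `module` and the boto calls are stand-ins):

    # Before: fetched every instance in the account just to decide whether
    # a description mismatch might be "fixable" (it never was -- descriptions
    # are immutable, and the update branch was dead code).
    def check_description_old(ec2, group, target_description, module):
        if group.description != target_description:
            instances = ec2.get_only_instances()  # expensive: ~1MB for us
            ...  # relic logic inspecting whether the group is in use
            module.fail_json(msg="description mismatch")

    # After: fail fast on mismatch, no instance fetch at all.
    def check_description_new(group, target_description, module):
        if group.description != target_description:
            module.fail_json(msg="Group description attribute cannot be changed")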

    Impact

    How expensive is getting all ec2 instances?  Well it depends on how large your AWS account is.  For us the return value was in the ballpark of 1MB (tested via aws ec2 describe-instances).

    Before

    # ansible==2.1.3.0
    time ansible-playbook sg-update.yml --check
    ...
    real 8m23.103s
    user 2m3.358s
    sys 0m46.586s

    After (running Ansible at commit of change)

    time ansible-playbook sg-update.yml --check
    ...
    real 3m5.069s
    user 1m0.551s
    sys 0m38.873s

    From 503 seconds to 185 is an appreciable speed up (2.7x faster).  This speedup should apply any time the security group is already present whether in --check mode or not.

    I’m looking forward to the next release of Ansible when we can realize these savings.  (I’m not sure if this will be in 2.3, which was just cut, or we’ll have to wait for 2.4)

    Thanks to the reviewers/maintainers of Ansible for the review and getting this merged.

     
  • @alexlo03 21:02 on 2017/03/02

    SRE Conv @ Dropbox 

    with:

    • Betsy Beyer
    • Rob Ewaschuk
    • Liz Fong-Jones

    Where is SRE now?

    more than 2000 engineers in SRE today at Google
    outside of Google? sort of embraced – definitely in flux
    note: SREcon is a conference coming up

    SRE – what is it?

    “you know it when you see it”

    • Creative autonomous engineering
    • wisdom of operations
    • SWE and systems engineering are equally valued skills

     * what is systems engineering? how do you think about breaking large systems down? define interfaces? if things break, how do you troubleshoot?
     * how do we automate / make more robust?

    Making of the book

    • articles in journals that were copyright friendly
    • stitch it together offsite in a building without engineers

    Interesting notes

    • writing the book led to the discovery that they had multiple opinions on what SLOs are
    • configuration management chapter got skipped because opinions were too varied
    • at Google, whitepapers float around and mature before becoming external

    Q&A

    Q: how did you come to common tone?
    A: This was Betsy’s job. Tech writers were in/out on various chapters.

    Q: insights as google has scaled? currently SRE is very mature.
    A:
    what has increased is technical maturity of automation
    Rob:
    started in Gmail, crafted own automation in Python
    now: down to three CM solutions at Google, two are going away

    Liz:
    Scaling by changing incentives:
    It used to be that people were promoted based on how complex a product they made. (This leads to multiple competing solutions)
    Now: reward convergence and solving complex problems with simple solutions
    Better to have three standards than twenty

    Rob:
    What to page on/etc
    Growing culture/techniques over time
    Operational maturity
    Ben (orig VP of SRE) does apply capability maturity models to teams

    Liz: He asks “Where are you on the scale? Are you improving? Why not?”

    Rob:
    Management and view of SRE time
    Project work vs op work, strategic vs tactical, targeting the 50/50 split
    support and management came in when not meeting 50%

    Liz: must be engineering.  let some breaks happen to do engineering.

    Rob:
    SLO on mean latency of Bigtable was paging us all the time, even at the 75th percentile. Dropped the promise quite a bit. Worked on engineering, set up monitoring; two years later we had a 99th-percentile SLO

    Betsy:
    ;login: has an article on Liz’s interrupts
    https://research.google.com/pubs/pub45764.html


    Q: Follow up – there were multiple steps here, can you break down the in between targets?
    A:
    Liz: didn’t have a long term roadmap.  Automate the largest fire, then go to the next problem.  EG get rid of re-sizing jobs by hand, then standardize the size of jobs.

    Rob: “At Google there’s no such thing as a five year roadmap that doesn’t change in a year (hardware exempt)”

    Q: you mention automation, Amazon had an interesting postmortem that discussed too-powerful config management tools
    A:
    Liz: we love to write postmortems, we love to make post-mortem tickets, and we occasionally work on those items.

    safety rails is the model for this type of problem.

    “should you be able to turn down a unit of processing power below its level of utilization? probably not”

    Rob:
    Norms that allow us to avoid some things like this
    Changes have to be cluster by cluster

    Liz:
    if you make it easy to do incremental rollouts (via automation), people will not write “update all” shell scripts – “Shell scripts sink ships”

    Q: per the book: making processes visible really helps

    Liz:
    Our mission the next few years is to make a framework with guardrails built in
    the status port page should come free
    resize check should come free
    default to safe / platform building is one of SRE’s main objectives

    Rob:
    Should your server prevent you from touching every instance

    Liz:
    Easy undo buttons
    if you have these then problems are nbd

    Betsy:
    “What is ideal, what is actuality”
    SREs would like to be involved in design, sometimes still come to a service when it is a hot mess

    Rob:
    lots of google code has assertions that crash rather than corrupt mutable data.
    sometimes engineers have assertions where they don’t know what to do, though it is not a dangerous situation

    Betsy
    Re: postmortem action items not getting addressed:
    long tail of AIs that never get closed out.
    In ;login: there’s an article about how to make an AI list that actually gets closed out.

    Google does not do things perfectly, we are trying to improve and we’re talking to people about how to do better

    Q: creative commons license – how did that all go?
    A:
    Betsy: I didn’t have to deal with the legal side of this
    in lieu of profits, we got the Creative Commons version. O’Reilly didn’t want it to come out at the same time

    could not use O’Reilly’s formatting or images.

    Liz: O’Reilly seems more open to this in the future

    Q: you mentioned building a framework for building in reliability (eg baking in status ports, other features)
    many times you start with your own version, and then by the time it matters it’s too big to switch to open source projects

    is there a way to build tooling that is generically useful like the book?

    A:
    Liz:
    example: grpc is open source and useful
    there may not be utility in releasing tools around logging, because our tools around logging are special for google needs.

    ease of use thing: no one should be writing a “main” function in C

    “init google” as a C++ function that pulls in lots of helpers

    Rob:
    Agree that not all infra tools are sharable because systems are bespoke
    hard to imagine how to open source many of those things (which is sad)

    Liz:
    another example: Go lang and debugging tools

    Q: what sorts of orgs are resistant to postmortems?
    A:  Liz: Gov contracting is very blameful
    admitting responsibility is hard in those envs

     
  • @alexlo03 20:15 on 2017/02/23

    Martin Fowler – The Many Meanings of Event-Driven Architecture 

    “The Many Meanings of Event-Driven Architecture”
    @martinfowler

    ## Intro

    This talk is based on https://martinfowler.com/articles/201701-event-driven.html

    ## Typical event scenario

    user => (changes address) Customer Management => (Get Requote) Insurance Quoting => (Send Email) Communications

    ## Request/Response style

    Implied problem: dependencies NOT inverted.

    Customer Management should not have to know about the Insurance Quoting System.

    We'd like to invert the dependency. How to do that?

    • Insurance Quoting polling constantly?
    • To get a more timely response we emit an "address changed" message and Insurance Quoting subscribes

    Classic way to invert dependency is to issue events; can be done at small level and large level. Call this "Event Notification"
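
    A minimal sketch of the inversion (my own toy code, not from the talk):

    # Customer Management publishes; it knows nothing about its consumers.
    subscribers = []

    def subscribe(handler):
        subscribers.append(handler)

    def emit(event):
        for handler in subscribers:
            handler(event)

    # Insurance Quoting subscribes to the event it cares about:
    subscribe(lambda e: print("requote customer", e["customer"]))

    emit({"type": "address_changed", "customer": "c-42"})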

    ### Note on terminology: Events or Commands?

    Command = tell callee what to do

    Event = more open ended

    Although they are the same mechanics, the way you name them affects the way we treat them/think about them

    Some events may be phrased as passive-aggressive commands

    ## Event Notifications

    People love that pub/sub allows multiple consumers. Easy.

    Great property of this pattern but also a problem because we lose the flow of the system and it is hard to reason about.

    Note: GUI MVC relies on events.

    Event notification:

    • Pro: Decouples receiver from sender
    • Con: No statement of overall behavior.
      • Trolling through logs to track a distributed transaction.
      • Harder to make changes safely

    ## Event-Based State Transfer ("Event-Carried" in the original article)

    Can't be sure when your client needs more information. If your clients want to follow up for more information, you may be allowing your clients to DDoS you.

    Pros: even more decoupled from source, reduced load on supplier

    Cons: replicated data (costly), eventual consistency problem (you really hate it when you notice it - you have to think about it/manage it)

    -- alex note: seems like DB replication (though in db log shipping there is a closer relationship between primary/secondary?)
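
    Sketch of the pattern (my own toy): the event carries the changed data, so the consumer keeps a local copy instead of calling back to the source.

    customer_cache = {}  # consumer-side replica, eventually consistent

    def on_address_changed(event):
        # no follow-up request to Customer Management needed
        customer_cache[event["customer"]] = event["new_address"]

    on_address_changed({"customer": "c-42", "new_address": "1 Main St"})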

    ## Event Sourcing

    Ex: Change my address. Dispatch a function to a Person Domain Object. Alternative: first thing is to create an event object and stick it somewhere. Then process the event.

    True test of Event Sourcing: you can toss all application state and rebuild it from events. EG Git: "Example I like to give people, well programmers not normal people, it doesn't work on them"

    Other example: bank account (your balance is application state)
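
    A bare-bones version of the bank-account example (my own sketch): the balance is derived state, and the event log is the source of truth.

    events = []  # the durable log; everything else can be rebuilt

    def record(kind, amount):
        events.append((kind, amount))

    def balance():  # application state, recomputed by replay
        total = 0
        for kind, amount in events:
            total += amount if kind == "deposit" else -amount
        return total

    record("deposit", 100)
    record("withdraw", 30)
    assert balance() == 70  # toss the state, replay, same answer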

    Pros:

    • Audit
    • Debugging (can replay through an error scenario)
    • Historic State
    • Alternative State
    • Memory Image

    Cons:

    • Unfamiliar
    • External Systems
    • Event Schema (changing schemas is a pain)
    • Identifiers (if regenerating)
    • ? Asynchrony - merging event streams is async (a commit is not)
    • ? Versioning

    Implementation wrinkle: manage snapshots as you go (recomputing the entire event log is too expensive)

    ### Ex: Widget purchase system

    Input event: buy 15 widgets

    output event: 15 widgets, total $33

    internal event: buy 15 widgets: price $30, discount $3, shipping $5, total $33

    You then have to save the internal events in case the internal calculation ever changes; you lose some flexibility. Keep both internal events and output events?

    Doing a refactoring: EG renaming a function. Touching lots of files.

    Note: there was a bug in the $33 calc. Now what?

    -- alex note: see recent crypto-currency problems- https://zcoin.io/language/en/zcoins-zerocoin-bug-explained-in-detail/

    Note that all three patterns are distinct so far

    ## CQRS (Command Query Responsibility Segregation)

    Query model and command model different

    Completely different services, maybe even different models

    Martin hears more complaints about this than any other pattern. "it is twice as much work"

    Why is it problematic: Inherent problems? Done badly? Misused?

    Treat with caution. An awkward tool to use.

    ### How different is it to have an operational and reporting database?

    With CQRS: expecting sync reads - with reporting database there is expected lag

    with CQRS: don't allow read from operational DB (not the typical op db/report db setup)
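
    A toy sketch of the separation (my own illustration, not Martin's code): commands mutate the write model and update a projection; queries only ever touch the read model.

    write_model = {}  # authoritative state, commands only
    read_model = {}   # denormalized projection, queries only

    def handle_change_address(customer_id, address):  # command side
        write_model[customer_id] = address
        # project into the shape queries want
        read_model[customer_id] = {"address": address, "quote_stale": True}

    def get_customer(customer_id):  # query side; never reads write_model
        return read_model.get(customer_id)

    handle_change_address("c-42", "1 Main St")
    print(get_customer("c-42"))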

     
  • @alexlo03 21:15 on 2017/01/02

    Steps for making a change to the Ansible AWS Security Group Module 

    (On OSX)

    1) fork & clone ansible repo

    commentary: ansible has sensibly merged ansible-modules and ansible-modules-extras back in rather than continue using submodules. If you had cloned in the submodules, go ahead and `rm -rf extras` and `core` from ansible/lib/ansible/modules or your global finds will be very confusing!

    2) get `ansible-playbook` to use your repo (rather than your brew or python installed official release of ansible)

    ansible/ $ . hacking/env-setup
    ansible/ $ make

    note that now `ansible-playbook --version` should print info about your current githash/branch

    3) Try to get integration tests working locally
    3A) setup AWS IAM for cloud integration tests.

    a) set up an IAM user (“tester”) with the AmazonEC2FullAccess managed policy attached and keep the key around
    https://console.aws.amazon.com/iam/home?region=us-east-1#/policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess
    b) get your own aws credentials out of the way `mv ~/.aws/credentials ~/.aws/credentials.hide`
    c) set the test credentials in your environment:

    EC2_ACCESS_KEY=xxx
    EC2_SECRET_KEY=yyy
    EC2_REGION=us-east-1

    3B) setup credentials.yml
    copy credentials.template into credentials.yml, fill in the AWS related credentials

    3C) strip down amazon.yml to the tests I care about

    - hosts: amazon
      gather_facts: true
      roles:
        - role: test_ec2_group

    3D) run `ansible/test/integration $ make amazon`

    OPTIONAL: have it fail due to boto

    fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "boto required for this module"}

    Remediation:

    1) verify boto in your global python env

    pip freeze | grep boto

    if not `pip install boto`

    2) if the integration tests still fail – there may be a difference between the site-packages (pip packages) of SYSTEM python (/usr/bin/python) and BREW python (/usr/local/bin/python).
    Ansible uses SYSTEM python whereas you mostly use BREW python

    useful diagnostic line: `which -a python`
    should show the local version first if your path is set up correctly.

    Fix from homebrew issue

    ==> Caveats
    If you need Python to find the installed site-packages:
    mkdir -p ~/Library/Python/2.7/lib/python/site-packages
    echo '/usr/local/lib/python2.7/site-packages' > ~/Library/Python/2.7/lib/python/site-packages/homebrew.pth

    OPTIONAL: fix integration tests or module not working in integration tests

    4) add breaking integration test
    5) fix module
    6) PR
    7) remove “tester” IAM user
    8) DONE

    other useful tools:
    audit what your test suite is doing in CloudTrail (must be enabled)

     
  • @alexlo03 01:22 on 2016/12/07

    Exploring 'htop explained' locally 

    I greatly enjoyed htop explained – it helped me explore Linux internals safely.  In the post the author explores htop by spinning up a virtual machine in DigitalOcean – here are some quick instructions for getting your own Ubuntu 16.04 playground locally.

    install vagrant [1] [2]
    install vbox

    vagrant box add ubuntu/xenial64
    mkdir ~/htopfun 
    cd ~/htopfun
    vagrant init
    # edit Vagrantfile - set `config.vm.box = "ubuntu/xenial64"`
    vagrant up
    vagrant ssh
    
    # within the virtual machine
    sudo apt install htop
    
    # when done, outside of the virtual machine
    vagrant suspend
    # alternatively teardown
     
    • Kevin Risden 11:53 on 2017/01/03

      “vagrant init” can take a box name and automatically set up the Vagrantfile for you. When “vagrant up” is run, if the box doesn’t exist locally it will download it for you. This looks something like:

      mkdir ~/htopfun
      cd ~/htopfun
      vagrant init ubuntu/xenial64
      vagrant up

      • @alexlo03 12:03 on 2017/01/03

        Thanks Kevin! My first time using vagrant outside of old docker setups.

  • @alexlo03 12:34 on 2016/10/26

    DataDog summit 10/26 

    https://www.eventbrite.com/e/datadog-summit-tickets-27691767823

    # Keynote - Alexis Le-Quoc - CTO
    Observability / curiosity / control
    alerting/anomaly detection designed with these wants in mind.
    
    outlier/anomaly - expect announcement tomorrow
    "notebooks" feature - "curiosity" feature - EG metrics explorer
    
    # Cory Watson @gphat cory@stripe.com
    Building a culture of observability
    
    ## starting point
    no clear ownership -> broken windows
    lack of confidence/vision for future (how will things get better)
    very reactive
    
    ## stripe
    550 employees (how many eng? asked: top secret!)
    ~230 services, 1000s of aws vms
    Obs systems: DD, Splunk, Sentry, PagerDuty, "core dashboards"
    Obs team: 5+intern+1 on loan
    
    ## how to make a change
    * give a shit about your users
    * follow up on feedback
    * trend towards a bright future
    * measure your progress
    ** HOW?
    
    ## start over, kinda
    * spend time w/ tools
    * improve if possible
    * replace if not
    * leverage past knowledge
    EG "what about that part about grafana you don't like"
    social archeology
    
    ## why DD?
    * general purpose / simple interfaces
    * velocity of improvement of DD platform
    * OSS
    * friendly helpful staff
    
    ## empathy and respect
    * people are not generally evil, they are just busy
    * being a hater is lazy
    * help people be great at their jobs
    
    ## replacing existing system
    * overcoming the momentum is hard - adds work
    * declaring bankruptcy w/ statsd (the dotted naming scheme does not translate
    into DD verbiage)
    * saved us ops headaches (won't have the statsd drop rate - no more UDP - was
    dropping up to 50% of metrics)
    * still ongoing
    
    ## getting change rolling: Nemawashi
    Japanese - let the tree come to you - you can't show up at a meeting and introduce
    a brand new concept
    * start small - guinea-pig yourself
    * quietly lay foundation and gather feedback
    * asking how you can improve, follow up
    * engage the discontent - most to learn from them
    
    ## identify power users
    * find interested parties - empower them to help others - levers to move the
    org (training, adoption, etc)
    
    ## value
    * what are you improving?
    * how can you measure it?
    * is this the best way?
    
    what metrics?  MTTD?  MTTR?
    
    feedback loop: engineer -> system (add sensor that feeds back to eng)
    
    # flat org - how to improve observability w/o mandate
    * not having a mandate
    * stigmergy - https://en.wikipedia.org/wiki/Stigmergy
    ** eg grind or hustle
    * strike when good opportunities (eg incidents)
    
    ## advertise
    * promote team accomplishments, accomplishments of others
    * ask to help - then learn
    * observability team branded as "bees"
    
    ## make it easy&good
    * hard to make email exciting
    * make it easy/automatic to do things right
    
    ## automated monitors
    * baseline monitors - common problems/solutions
    * users have no state, are surprised
    * people care when you show them failure and how to fix it.
    
    ## features:
    ### Automatic ticket creation w/ labels/tags
    * can find links to previous ticket resolutions
    * can find all active tickets of typeX, can close if they are false alarms
    * feedback via google forms
    
    ## tracking toil
    * find all pagerduty info
    * input into redshift + looker (app)
    
    ## usage
    * >100% growth in metrics, monitors/dashboards
    * 7.5k metrics (w/ tags)
    
    ## problems
    * metric/naming, cardinality
    * monitor "blame",
    * what metrics are available to me (service owner)
    * metrics or logs? traces?
    ** splunk or DD?
    
    
    # Algorithmic Alerting
    Homin Lee
    @hominprovement
    
    being released today/tomorrow
    
    Anomaly detection - monitoring a metric through time
    Outlier detection - monitoring a metric through space
    
    if you have a trending down metric that you want to alert on - you'd have to
    reset your thresholds often
    
    seasonal metrics (big ups/downs throughout the day) - thresholding does not work -
    can use "change alerts" - this is a problem because large changes in one
    direction are typically OK (memory usage going low for a while is not a problem,
    it going high can be)
    
    what if you have a trending and seasonal metric?
    anomaly detection: predict range of values that looks normal
    
    algorithms for anomaly detection:
    "Basic"
    "Not-basic" algorithms
    * "robust"  - decompose history into trend component and seasonable component
    * "agile" - look at previous time yesterday/last week
    * "adaptive" - if the behavior changing over time - requires less information
    over time
    
    single parameter: tolerance
    
    aggregation time frame can lead to false positives
    
    Outlier detection
    DBSCAN algo
    MAD algo - median absolute deviation from the median
    
    w/ anomaly detection and outlier detection:
    don't apply them to everything
    outlier detection should be applied to things that ought to be strongly related
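
    For a feel for the MAD approach mentioned above, a toy version (my own sketch, not DD's implementation):

    # Flag hosts whose metric sits more than `threshold` MADs from the
    # group median (median absolute deviation from the median).
    def mad_outliers(values, threshold=3.0):
        s = sorted(values)
        median = s[len(s) // 2]
        devs = sorted(abs(v - median) for v in values)
        mad = devs[len(devs) // 2]
        if mad == 0:
            return []
        return [v for v in values if abs(v - median) / mad > threshold]

    print(mad_outliers([10, 11, 9, 10, 42]))  # -> [42]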
    
    # Airbnb - Ben Hughes
    * just turned off graphite
    
    ## background
    * hired lots of product engineers, not lots of SREs
    * product engineers started learning / helping with pager side of things
    * dubbed sysops - set up lots of trainings
    * 50 people on rotation, 30% of eng has attended trainings
    * w/ 50 people on volunteer rotation: only on call a few times a year, don't
    know which pages can be ignored, you've probably never seen the alert before
    * therefore pager alerts have to be very certain to have a problem
    
    ## monitoring as code - configure dd alerts:
    https://github.com/airbnb/interferon
    * pros: code working ecosystem (git, grep, etc)
    * pros: automation = good
    * cons: also causes messes
    * pros outweigh cons
    
    * can script monitor creation by getting inventory from the AWS API
    * pull requests on alerts include information on what incident / background
    caused creation of alert
    * 730 alert specs that turn into 11k monitors
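
    The monitors-from-inventory idea in toy form (my own sketch in Python; interferon's real alert specs are a Ruby DSL, and the metric/query here is made up):

    # Generate one monitor definition per host found in inventory,
    # so new hosts get coverage automatically.
    def monitors_from_inventory(hosts):
        return [
            {
                "name": f"high CPU on {h}",
                "query": f"avg(last_5m):avg:system.cpu.user{{host:{h}}} > 90",
            }
            for h in hosts
        ]

    print(monitors_from_inventory(["web-1", "web-2"]))  # 2 specs -> 2 monitors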
    
    ## reduce alert noise
    * difficult problem
    * email gets filtered, paged alerts will eventually get fixed due to annoyance
    * requires ownership
    
    * when adding new alerts - keep old ones around while proving out new one (add,
    don't modify?)
    
    ==
    
    # GrubHub
    
    ## why dd?
    * single pane of glass in context of multiple datacenters
    * alerting built in
    * apis
    * statsd/graphite compatible
    * advanced options, increasing features
    
    ## background
    * many services, many problems
    * new teams coming in, easy to miss things
    * lots of different application frameworks/etc
    
    ## monitor all the services
    * define common metric names at framework level (important for dataviz)
    * provide basic metric set for all services
    * service discovery to apply monitoring to all services
    * ensure all monitors have links to logs,runbooks,etc
    * run same monitoring in pre-production but w/o pages
    * store everything in source control
    * devs own monitoring as much as sres
    
    ## viz
    * heavy use of templated dashboards
    * operations "summary" dashboards and developer focused dashboards
    * store dashboard defs in source code
    * should help provide context to monitoring
    
    ## metrics
    * start with sane, non-product specific, metric names
    * careful of metric counts
    
    # Note from Darren:
    TODO can have ansible results go to DD
    https://www.datadoghq.com/blog/ansible-datadog-monitor-your-automation-automate-your-monitoring/
    could use this on ansible-pull quickly, would need to standardize ansible push
    env to use more broadly
    
    # Tracing Code matt@datadog.com - EG APM
    * multiple services that publish metrics as part of a whole service
    * can scope graphs/etc by dimension (EG endpoint, hostname)
    * can drill into specific reqs or categories of reqs
    * qs: what are most frequent queries/slow queries/etc
    
    * can do distributed trace - can connect RPC calls
    * lots of common integrations w/ normal services postgres/etc
    * currently integrates with python, ruby, go (more soon)
    * currently in private beta
    
    QQ: high security mode?
    
     
  • @alexlo03 19:32 on 2016/06/01

    Flow.io 

    http://www.meetup.com/ContinuousDeliveryNYC/events/230740814/
    Michael Bryzek
    @mbryzek
    
    @Yodle - http://www.yodletechblog.com/
    
    CD system: Delta
    https://github.com/flowcommerce/delta
    
    CD is audit/transparency
    not an opinionated way to do things
    (when tests happen is an opinion)
    
    Deploys are triggered by git tags
    merging PRs creates tags
    deploys automatically
    
    microservice architecture
    
    Travis for CI
    debate: should we require Travis to succeed before allowing merge?
    no: because in an emergency we still want to use the same flow
    CI is part of our process. do not expect CD tool to enforce
    
    authn/z via github
    webhooks w/ github (ignores payload from GH)
    
    application dashboard: shows desired state vs last state 
    (version/number of instances)
    
    merge PR: 
    * slack alert
    * do it:
    ** sync repo + tags
    ** create tag
    ** set desired state to new version
    ** build docker image in dockerhub (`docker build .`)
    ** pull state using Scala/Akka every 30 seconds until docker image ready
    ** scale: 
    *** create new cluster
    *** once healthy move traffic
    *** scale down old version
    
    webapp speaks to API, API speaks to PGSQL in RDS
    docker instances on ECS
    
    "rollback is antipattern, roll forward"
    "don't like 'deploy', see Martin Fowler separate deploy from sending traffic to"
    "always scale up, assert healthy, before scale down"
    
    delta config - yaml
    initial number of instances: only used once
    -> later uses current number of instances
    
    5 engineers
    >1500 releases, weekly >100, 20 releases/eng/week
    
    roadmap
    * deploy other than master branch
    * dependency support w/in repo
    * smarter traffic management
    * better healthchecks
    * more settings (enable/disable build in UI)
    * UI/UX improve
    
    CD very nice. 
    every Friday we update all projects to latest versions of every library
    delta deploys itself
    
    Q/A
    All data stores are append-only
    What is the cost of failure vs cost of inaction
    (a credit card system is a possible example where failure is very costly)
    Set up culture/process/alerting for when things break in prod
    Healthcheck verifies DB connectivity, presence of env vars, etc
     