# DataDog Summit - 10/26
https://www.eventbrite.com/e/datadog-summit-tickets-27691767823
# Keynote - Alexis Le-Quoc, CTO

* observability / curiosity / control
* alerting/anomaly detection designed with these wants in mind
* outlier/anomaly detection - expect an announcement tomorrow
* "notebooks" feature - a "curiosity" feature, e.g. metrics explorer

# Building a Culture of Observability - Cory Watson (@gphat, cory@stripe.com)

## starting point

* no clear ownership -> broken windows
* lack of confidence/vision for the future (how will things get better?)
* very reactive

## stripe

* 550 employees (how many eng? asked: top secret!)
* ~230 services, 1000s of AWS VMs
* obs systems: DD, Splunk, Sentry, PagerDuty, "core dashboards"
* obs team: 5 + intern + 1 on loan

## how to make a change

* give a shit about your users
* follow up on feedback
* trend towards a bright future
* measure your progress - HOW?

## start over, kinda

* spend time w/ the tools
* improve if possible, replace if not
* leverage past knowledge, e.g. "what about that part of Grafana you don't like?" - social archeology

## why DD?

* general purpose / simple interfaces
* velocity of improvement of the DD platform
* OSS
* friendly, helpful staff

## empathy and respect

* people are not generally evil, they are just busy
* being a hater is lazy
* help people be great at their jobs

## replacing an existing system

* overcoming the momentum is hard - adds work
* declaring bankruptcy w/ statsd (the dotted naming scheme does not translate into DD verbiage)
* saved us ops headaches (no more statsd drop rate - no more UDP - was dropping up to 50% of metrics)
* still ongoing

## getting change rolling: nemawashi

Japanese - let the tree come to you - you can't show up at a meeting and introduce a brand-new concept.

* start small - guinea-pig yourself
* quietly lay the foundation and gather feedback
* ask how you can improve, follow up
* engage the discontented - most to learn from them

## identify power users

* find interested parties - empower them to help others - levers to move the org (training, adoption, etc.)

## value

* what are you improving?
* how can you measure it? is this the best way? what metrics - MTTD? MTTR?
* feedback loop: engineer -> system (add a sensor that feeds back to the engineer)

## flat org - how to improve observability w/o a mandate

* stigmergy - https://en.wikipedia.org/wiki/Stigmergy - e.g. grind or hustle
* strike when good opportunities arise (e.g. incidents)

## advertise

* promote team accomplishments, and the accomplishments of others
* ask to help - then learn
* observability team branded as "bees"

## make it easy & good

* hard to make email exciting
* make it easy/automatic to do things right

## automated monitors

* baseline monitors - common problems/solutions
* users have no state, are surprised
* people care when you show them failure and how to fix it

## features: automatic ticket creation w/ labels/tags

* can find links to previous ticket resolutions
* can find all active tickets of type X, can close them if they are false alarms
* feedback via Google Forms

## tracking toil

* pull all PagerDuty info
* input into Redshift + Looker (app)

## usage

* >100% growth in metrics, monitors/dashboards
* 7.5k metrics (w/ tags)

## problems

* metric naming, cardinality
* monitor "blame"
* what metrics are available to me (as a service owner)?
* metrics or logs? traces? Splunk or DD?

# Algorithmic Alerting - Homin Lee (@hominprovement)

Being released today/tomorrow.

* anomaly detection - monitoring a metric through time
* outlier detection - monitoring a metric through space
* if you have a trending-down metric that you want to alert on, you'd have to reset your thresholds often
* seasonal metrics (big ups/downs throughout the day) - thresholding does not work
* can use "change alerts" - but this is a problem because a large change in one direction is typically OK (memory usage going low for a while is not a problem; it going high can be)
* what if you have a metric that is both trending and seasonal?
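A toy illustration of the seasonal-baseline idea behind anomaly detection (comparing each point to the same time in the previous period rather than to a fixed threshold). This is a sketch of the concept only, not DataDog's actual algorithm; the function name and tolerance parameter are invented for the example.

```python
def seasonal_anomalies(series, period, tolerance=0.2):
    """Flag indices whose value deviates from the value one `period`
    earlier by more than `tolerance` (as a fraction of the baseline).

    Toy sketch of seasonal-baseline anomaly detection; not DataDog's
    actual implementation.
    """
    flagged = []
    for i in range(period, len(series)):
        baseline = series[i - period]
        if baseline == 0:
            continue  # skip zero baselines to avoid division by zero
        if abs(series[i] - baseline) / abs(baseline) > tolerance:
            flagged.append(i)
    return flagged

# Two "days" of a seasonal metric; day 2 has one anomalous midday spike.
day1 = [10, 50, 90, 50, 10]
day2 = [11, 52, 200, 49, 10]
print(seasonal_anomalies(day1 + day2, period=5))  # -> [7]
```

A fixed threshold high enough to tolerate the normal midday peak (90) would miss subtler problems at night; comparing against the same time yesterday catches the spike without per-hour thresholds.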
## anomaly detection

Predict the range of values that looks normal.

Algorithms for anomaly detection: "basic" and "not-basic":

* "robust" - decompose history into a trend component and a seasonal component
* "agile" - look at the same time yesterday / last week
* "adaptive" - for behavior that changes over time - requires less history over time

Single parameter: tolerance. The aggregation time frame can lead to false positives.

## outlier detection

* DBSCAN algorithm
* MAD algorithm - median absolute deviation from the median

Anomaly detection and outlier detection don't apply to everything; outlier detection should only be applied to things that ought to be strongly related.

# Airbnb - Ben Hughes

* just turned off Graphite

## background

* hired lots of product engineers, not lots of SREs
* product engineers started learning / helping with the pager side of things
* dubbed "sysops" - set up lots of trainings
* 50 people on rotation; 30% of eng has attended trainings
* w/ 50 people on a volunteer rotation: you're only on call a few times a year, you don't know which pages can be ignored, and you've probably never seen the alert before
* therefore pager alerts have to be very certain there is a real problem

## monitoring as code

* configure DD alerts: https://github.com/airbnb/interferon
* pros: code works with the existing ecosystem (git, grep, etc.); automation = good
* cons: also causes messes
* pros outweigh cons
* can script monitor creation by getting inventory from the AWS API
* pull requests on alerts include information on what incident/background caused the alert's creation
* 730 alert specs that turn into 11k monitors

## reduce alert noise

* difficult problem
* email gets filtered; paged alerts will eventually get fixed due to annoyance
* requires ownership
* when adding new alerts, keep the old ones around while proving out the new one (add, don't modify?)

# GrubHub

## why DD?
* single pane of glass in the context of multiple datacenters
* alerting built in
* APIs
* statsd/graphite compatible
* advanced options, increasing features

## background

* many services, many problems
* new teams coming in - easy to miss things
* lots of different application frameworks, etc.

## monitor all the services

* define common metric names at the framework level (important for dataviz)
* provide a basic metric set for all services
* service discovery to apply monitoring to all services
* ensure all monitors have links to logs, runbooks, etc.
* run the same monitoring in pre-production, but w/o pages
* store everything in source control
* devs own monitoring as much as SREs do

## viz

* heavy use of templated dashboards
* operations "summary" dashboards and developer-focused dashboards
* store dashboard defs in source control
* should help provide context for monitoring

## metrics

* start with sane, non-product-specific metric names
* careful of metric counts

# Note from Darren

TODO: can have Ansible results go to DD - https://www.datadoghq.com/blog/ansible-datadog-monitor-your-automation-automate-your-monitoring/

Could use this on ansible-pull quickly; would need to standardize the Ansible push env to use it more broadly.

# Tracing Code - Matt (matt@datadog.com) - e.g. APM

* multiple services that publish metrics as part of a whole service
* can scope graphs/etc. by dimension (e.g. endpoint, hostname)
* can drill into specific requests or categories of requests
* questions: what are the most frequent queries / slow queries / etc.?
* can do distributed traces - can connect RPC calls
* lots of common integrations w/ standard services (Postgres, etc.)
* currently integrates with Python, Ruby, Go (more soon)
* currently in private beta

QQ: high-security mode?
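The "can connect RPC calls" point works because every span in a distributed trace carries a shared trace ID plus a pointer to its parent span. A minimal conceptual sketch of that propagation (invented `Span` class for illustration; this is not the DataDog tracing client API):

```python
# Conceptual sketch of distributed-trace context propagation:
# all spans in one request share a trace_id, and each span records
# its parent, so RPC calls across services can be stitched together.
# Not the DataDog client library; class and fields are illustrative.
import uuid


class Span:
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.span_id = uuid.uuid4().hex[:8]
        self.trace_id = trace_id or uuid.uuid4().hex[:8]
        self.parent_id = parent_id  # None for the root span

    def child(self, name):
        # A downstream RPC inherits the trace_id and points back at
        # this span, which is what links services into one trace.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)


# Simulate a web service calling a database.
root = Span("web.request")
db = root.child("postgres.query")

assert db.trace_id == root.trace_id   # same trace across services
assert db.parent_id == root.span_id   # call graph is recoverable
```

In a real tracer the (trace_id, parent span_id) pair is serialized into RPC headers so the downstream service can continue the same trace.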