# DataDog Summit - 10/26
https://www.eventbrite.com/e/datadog-summit-tickets-27691767823
# Keynote - Alexis Le-Quoc, CTO

* observability / curiosity / control
* alerting/anomaly detection designed with these wants in mind
* outlier/anomaly detection - expect an announcement tomorrow
* "notebooks" feature - a "curiosity" feature, e.g. metrics explorer

# Building a Culture of Observability - Cory Watson (@gphat, cory@stripe.com)

## starting point

* no clear ownership -> broken windows
* lack of confidence/vision for the future (how will things get better?)
* very reactive

## stripe

* 550 employees (how many eng? asked: top secret!)
* ~230 services, 1000s of AWS VMs
* obs systems: DD, Splunk, Sentry, PagerDuty, "core dashboards"
* obs team: 5 + intern + 1 on loan

## how to make a change

* give a shit about your users
* follow up on feedback
* trend towards a bright future
* measure your progress - HOW?

## start over, kinda

* spend time w/ the tools
* improve if possible, replace if not
* leverage past knowledge, e.g. "what about that part of Grafana you don't like?" - social archeology

## why DD?

* general purpose / simple interfaces
* velocity of improvement of the DD platform
* OSS
* friendly, helpful staff

## empathy and respect

* people are not generally evil, they are just busy
* being a hater is lazy
* help people be great at their jobs

## replacing an existing system

* overcoming the momentum is hard - adds work
* declaring bankruptcy w/ statsd (the dotted naming scheme does not translate into DD verbiage)
* saved us ops headaches (no more statsd drop rate - no more UDP - was dropping up to 50% of metrics)
* still ongoing

## getting change rolling: nemawashi

Japanese - let the tree come to you - you can't show up at a meeting and introduce a brand-new concept.

* start small - guinea-pig yourself
* quietly lay the foundation and gather feedback
* ask how you can improve, follow up
* engage the discontented - most to learn from them

## identify power users

* find interested parties - empower them to help others - levers to move the org (training, adoption, etc.)

## value

* what are you improving?
* how can you measure it? is this the best way? what metrics - MTTD? MTTR?
* feedback loop: engineer -> system (add a sensor that feeds back to the engineer)

## flat org - how to improve observability w/o a mandate

* stigmergy - https://en.wikipedia.org/wiki/Stigmergy - e.g. grind or hustle
* strike when good opportunities arise (e.g. incidents)

## advertise

* promote team accomplishments, and the accomplishments of others
* ask to help - then learn
* observability team branded as "bees"

## make it easy & good

* hard to make email exciting
* make it easy/automatic to do things right

## automated monitors

* baseline monitors - common problems/solutions
* users have no state, are surprised
* people care when you show them failure and how to fix it

## features: automatic ticket creation w/ labels/tags

* can find links to previous ticket resolutions
* can find all active tickets of type X, can close them if they are false alarms
* feedback via Google Forms

## tracking toil

* pull all PagerDuty info
* input into Redshift + Looker (app)

## usage

* >100% growth in metrics, monitors/dashboards
* 7.5k metrics (w/ tags)

## problems

* metric naming, cardinality
* monitor "blame"
* what metrics are available to me (as a service owner)?
* metrics or logs? traces? Splunk or DD?

# Algorithmic Alerting - Homin Lee (@hominprovement)

Being released today/tomorrow.

* anomaly detection - monitoring a metric through time
* outlier detection - monitoring a metric through space
* if you have a trending-down metric that you want to alert on, you'd have to reset your thresholds often
* seasonal metrics (big ups/downs throughout the day) - thresholding does not work
* can use "change alerts" - but this is a problem because a large change in one direction is typically OK (memory usage going low for a while is not a problem; it going high can be)
* what if you have a metric that is both trending and seasonal?
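A toy illustration of the seasonal-baseline idea behind anomaly detection (comparing each point to the same time in the previous period rather than to a fixed threshold). This is a sketch of the concept only, not DataDog's actual algorithm; the function name and tolerance parameter are invented for the example.

```python
def seasonal_anomalies(series, period, tolerance=0.2):
    """Flag indices whose value deviates from the value one `period`
    earlier by more than `tolerance` (as a fraction of the baseline).

    Toy sketch of seasonal-baseline anomaly detection; not DataDog's
    actual implementation.
    """
    flagged = []
    for i in range(period, len(series)):
        baseline = series[i - period]
        if baseline == 0:
            continue  # skip zero baselines to avoid division by zero
        if abs(series[i] - baseline) / abs(baseline) > tolerance:
            flagged.append(i)
    return flagged

# Two "days" of a seasonal metric; day 2 has one anomalous midday spike.
day1 = [10, 50, 90, 50, 10]
day2 = [11, 52, 200, 49, 10]
print(seasonal_anomalies(day1 + day2, period=5))  # -> [7]
```

A fixed threshold high enough to tolerate the normal midday peak (90) would miss subtler problems at night; comparing against the same time yesterday catches the spike without per-hour thresholds.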
## anomaly detection

Predict the range of values that looks normal.

Algorithms for anomaly detection: "basic" and "not-basic":

* "robust" - decompose history into a trend component and a seasonal component
* "agile" - look at the same time yesterday / last week
* "adaptive" - for behavior that changes over time - requires less history over time

Single parameter: tolerance. The aggregation time frame can lead to false positives.

## outlier detection

* DBSCAN algorithm
* MAD algorithm - median absolute deviation from the median

Anomaly detection and outlier detection don't apply to everything; outlier detection should only be applied to things that ought to be strongly related.

# Airbnb - Ben Hughes

* just turned off Graphite

## background

* hired lots of product engineers, not lots of SREs
* product engineers started learning / helping with the pager side of things
* dubbed "sysops" - set up lots of trainings
* 50 people on rotation; 30% of eng has attended trainings
* w/ 50 people on a volunteer rotation: you're only on call a few times a year, you don't know which pages can be ignored, and you've probably never seen the alert before
* therefore pager alerts have to be very certain there is a real problem

## monitoring as code

* configure DD alerts: https://github.com/airbnb/interferon
* pros: code works with the existing ecosystem (git, grep, etc.); automation = good
* cons: also causes messes
* pros outweigh cons
* can script monitor creation by getting inventory from the AWS API
* pull requests on alerts include information on what incident/background caused the alert's creation
* 730 alert specs that turn into 11k monitors

## reduce alert noise

* difficult problem
* email gets filtered; paged alerts will eventually get fixed due to annoyance
* requires ownership
* when adding new alerts, keep the old ones around while proving out the new one (add, don't modify?)

# GrubHub

## why DD?
* single pane of glass in the context of multiple datacenters
* alerting built in
* APIs
* statsd/graphite compatible
* advanced options, increasing features

## background

* many services, many problems
* new teams coming in - easy to miss things
* lots of different application frameworks, etc.

## monitor all the services

* define common metric names at the framework level (important for dataviz)
* provide a basic metric set for all services
* service discovery to apply monitoring to all services
* ensure all monitors have links to logs, runbooks, etc.
* run the same monitoring in pre-production, but w/o pages
* store everything in source control
* devs own monitoring as much as SREs do

## viz

* heavy use of templated dashboards
* operations "summary" dashboards and developer-focused dashboards
* store dashboard defs in source control
* should help provide context for monitoring

## metrics

* start with sane, non-product-specific metric names
* careful of metric counts

# Note from Darren

TODO: can have Ansible results go to DD - https://www.datadoghq.com/blog/ansible-datadog-monitor-your-automation-automate-your-monitoring/

Could use this on ansible-pull quickly; would need to standardize the Ansible push env to use it more broadly.

# Tracing Code - Matt (matt@datadog.com) - e.g. APM

* multiple services that publish metrics as part of a whole service
* can scope graphs/etc. by dimension (e.g. endpoint, hostname)
* can drill into specific requests or categories of requests
* questions: what are the most frequent queries / slow queries / etc.?
* can do distributed traces - can connect RPC calls
* lots of common integrations w/ standard services (Postgres, etc.)
* currently integrates with Python, Ruby, Go (more soon)
* currently in private beta

QQ: high-security mode?
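The "can connect RPC calls" point works because every span in a distributed trace carries a shared trace ID plus a pointer to its parent span. A minimal conceptual sketch of that propagation (invented `Span` class for illustration; this is not the DataDog tracing client API):

```python
# Conceptual sketch of distributed-trace context propagation:
# all spans in one request share a trace_id, and each span records
# its parent, so RPC calls across services can be stitched together.
# Not the DataDog client library; class and fields are illustrative.
import uuid


class Span:
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.span_id = uuid.uuid4().hex[:8]
        self.trace_id = trace_id or uuid.uuid4().hex[:8]
        self.parent_id = parent_id  # None for the root span

    def child(self, name):
        # A downstream RPC inherits the trace_id and points back at
        # this span, which is what links services into one trace.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)


# Simulate a web service calling a database.
root = Span("web.request")
db = root.child("postgres.query")

assert db.trace_id == root.trace_id   # same trace across services
assert db.parent_id == root.span_id   # call graph is recoverable
```

In a real tracer the (trace_id, parent span_id) pair is serialized into RPC headers so the downstream service can continue the same trace.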