- April 2018 SF Metrics Meetup
Doctor Graphite or How I Learned to Stop Worrying and Love the Dots
Brad Lhotsky (https://edgeofsanity.net/), Systems & Security Administrator @ Craigslist
Graphite revolutionized monitoring and our expectations of monitoring tools. Criticized for being too slow, clunky, or awkward, most folks are moving away from Graphite to other tools promising horizontal scaling, SQL-ish syntax, and fancy, shiny moving parts. We'll discuss why Graphite's simple Whisper file backend might be all you need, and we'll try squeezing all the performance we can from the official Graphite applications. The talk will cover some features in the new 1.1.x branch and how you can use Graphite to be more successful.
---
Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
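One reason Whisper can be "all you need": it pre-allocates a fixed-size file per metric, so capacity planning is plain arithmetic. Below is a rough sketch of that arithmetic; the 16-byte metadata header, 12 bytes per archive header, and 12 bytes per point reflect my understanding of the Whisper format, so verify against the project's own whisper-calculator before relying on exact numbers.

```python
# Estimate the on-disk size of one Whisper file from its retention policy.
def whisper_file_size(retentions):
    """retentions: list of (seconds_per_point, seconds_kept) archives,
    e.g. Graphite's '10s:1d,1m:30d' becomes [(10, 86400), (60, 2592000)]."""
    points = sum(kept // step for step, kept in retentions)
    # 16-byte metadata header + 12 bytes per archive header + 12 bytes per point
    return 16 + 12 * len(retentions) + 12 * points

# 10-second resolution for a day, then 1-minute resolution for 30 days:
size = whisper_file_size([(10, 86400), (60, 30 * 86400)])
print(size, "bytes per metric")  # roughly 600 KiB per metric
```

Multiplying that per-metric figure by your metric count gives a quick sanity check on whether a single box of disks can hold your retention policy.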
- March '18 SF Metrics Meetup
Talks:
1. Observability Isn't Monitoring - Baron Schwartz, CEO @ VividCortex (https://twitter.com/xaprb)
2. PostgreSQL as Persistent Storage for Prometheus Metrics - Mike Freedman, Co-Founder & CTO @ TimescaleDB (https://www.timescale.com, Twitter: @TimescaleDB)
----
Observability Isn't Monitoring
What is observability? It's not monitoring, that's what it is. But seriously, let's talk about the seven golden signals, the relationships and differences between concepts like observability, monitoring, telemetry, and instrumentation, and how to build highly observable services.
----
PostgreSQL as Persistent Storage for Prometheus Metrics
Time-series data is now everywhere and increasingly used to power core applications. Yet it creates a number of technical challenges: ingesting high volumes of data, supporting complex queries over recent and historical time intervals, and performing time-centric analysis and data management. Meanwhile, Prometheus and Grafana have emerged as a popular duo for collecting, querying, and graphing metrics. But while Prometheus has its own time-series storage subsystem for metrics monitoring, users sometimes need richer time-series analysis, as well as the ability to join such data against other relational data, to answer key business questions.
In this talk, we take a somewhat heretical stance in the monitoring world and describe why and how we view an enhanced Postgres as an effective Prometheus backend to support those complex questions (and get a proper SQL interface). We present pg_prometheus, a new native Prometheus datatype for Postgres, as well as a remote storage adapter that allows Prometheus to write directly to a Postgres database. We also describe our work on TimescaleDB, a new open-source database designed for time-series workloads and engineered as a PostgreSQL extension. In transforming PostgreSQL into a scalable time-series database, TimescaleDB serves as a powerful persistent store for Prometheus.
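To make the remote-storage-adapter idea concrete, here is a minimal sketch of its write path: parse a Prometheus sample and turn it into a row for a relational table. The (time, name, value, labels) schema and function names here are hypothetical for illustration, not the actual pg_prometheus layout.

```python
import json
import re

# One sample in Prometheus's text exposition format:
#   metric_name{label="value",...} <value> [<timestamp_ms>]
SAMPLE_RE = re.compile(
    r'(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {k="v",...} label set
    r'\s+(?P<value>\S+)'                    # sample value
    r'(?:\s+(?P<ts>\d+))?$'                 # optional timestamp (ms)
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def sample_to_row(line):
    """Parse one exposition-format sample into (name, labels, value, ts_ms)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError("unparseable sample: %r" % line)
    labels = dict(LABEL_RE.findall(m.group('labels') or ''))
    ts = int(m.group('ts')) if m.group('ts') else None
    return m.group('name'), labels, float(m.group('value')), ts

def row_to_insert(name, labels, value, ts_ms):
    """Render a parameterized INSERT for a hypothetical metrics table."""
    sql = ("INSERT INTO metrics (time, name, value, labels) "
           "VALUES (to_timestamp(%s / 1000.0), %s, %s, %s::jsonb)")
    return sql, (ts_ms, name, value, json.dumps(labels, sort_keys=True))
```

Storing labels as jsonb is what makes the "join against other relational data" story possible: a plain SQL query can filter on `labels->>'method'` alongside any business table.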
----
Doors open at 6:30pm. Catch up with other #monitoringlove folk over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
- October '17 SF Metrics Meetup
Talks:
• LinkedIn: Monitoring Evolution - Richard Waid
• Sample your traffic but keep the good stuff! - Ben Hartshorne
----
LinkedIn: Monitoring Evolution
Richard Waid (https://www.linkedin.com/in/richardwaid/), Director of Monitoring Infrastructure, LinkedIn
In the past 6 years, the Monitoring team at LinkedIn has dealt with an explosive change in scale, from 12k to 850M individual metrics, as well as a migration from NOC-based escalation to direct remediation and escalation. This is a brief overview of how we accomplished that, how we fit into the overall engineering ecosystem, and what we're doing for the next major evolution in our journey. Along the way I'll cover a few of our major lessons: protecting the ecosystem against well-meaning users, planning for explosive scaling, and the global vs. local optima challenge of self-service tooling.
----
Sample your traffic but keep the good stuff!
Ben Hartshorne (https://twitter.com/maplebed), Software Engineer, Honeycomb
(Organizer's note: This is a sneak peek of Ben's talk at LISA17 on Nov 3rd, 2017)
The two main methods of reducing high-volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery. We'll start by laying down the basic ground rules for sampling - what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility. Once we have the basics of what it means to sample, we'll look at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it?
Given the patterns of traffic in a modern web infrastructure, there are some solid methods to change how you think about sampling in a way that lets you keep visibility into the most important parts of your infrastructure while transmitting only a portion of your volume to your instrumentation service. Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to adjust your sampling rates. Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most. I'll finish with some examples of dynamic sampling in our own infrastructure and talk about how it lets us see individual events of interest while keeping only 1/1000th of the overall traffic.
----
Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
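The per-key dynamic sampling idea can be sketched in a few lines: high-volume keys (say, healthy GET requests) get sampled aggressively, while rare keys (say, error responses) are kept in full. The class names and the even-budget rate formula below are illustrative assumptions, not Honeycomb's actual algorithm.

```python
import random
from collections import Counter

def compute_rates(counts, target_total):
    """Given event counts per key from the last window, assign each key a
    sample rate ("keep 1 in N") so that roughly target_total events are kept
    overall, spreading the budget evenly across keys."""
    budget_per_key = max(1.0, target_total / max(1, len(counts)))
    return {key: max(1, round(n / budget_per_key)) for key, n in counts.items()}

class DynamicSampler:
    def __init__(self, rates):
        self.rates = rates  # key -> keep 1 out of every rates[key] events

    def should_keep(self, key):
        rate = self.rates.get(key, 1)  # unseen keys are always kept
        return rate <= 1 or random.random() < 1.0 / rate

# 100k successes vs. 50 errors in the last window, budget of ~1000 kept events:
counts = Counter({"GET /home 200": 100000, "GET /home 500": 50})
rates = compute_rates(counts, target_total=1000)
print(rates)
```

Each kept event is then recorded with its sample rate attached, so the backend can multiply counts back up and keep aggregate numbers honest while the raw error events all survive.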
- September '17 SF Metrics Meetup
Talks:
• OpenTracing and Metrics: Measure Twice, Instrument Once - Ted Young
• IRONdb - Solving the technical challenges of TSDBs at scale - Fred Moyer
----
OpenTracing and Metrics: Measure Twice, Instrument Once
Ted Young (https://twitter.com/tedsuo), Director of Open Source Development, LightStep
As systems become highly distributed, mechanisms for correlating diagnostic information become a necessity. In this talk, we will discuss how the need for correlation is beginning to blur the lines between the formerly separate domains of metrics, logging, and tracing. We will present separating observation from aggregation - using a single, neutral instrumentation API that can push data into multiple types of monitoring systems - as one approach to this problem.
----
IRONdb - Solving the technical challenges of TSDBs at scale
Fred Moyer (https://twitter.com/phredmoyer), Developer Evangelist, Circonus (IRONdb)
Time-series databases are optimized for handling sets of data indexed by time. Data storage, data safety, and the IOPS problem are challenges that all TSDBs face at scale. In this talk, we'll look at how IRONdb solves these technical problems, or avoids them entirely. IRONdb is a commercial time-series database developed by Circonus, and is a Graphite-compatible drop-in replacement for Whisper.
----
Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
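A toy illustration of "separating observation from aggregation": application code makes one neutral record() call, and pluggable backends decide whether the measurement becomes a metric, a log line, or a trace annotation. The Recorder and backend names below are invented for this sketch; they are not the OpenTracing API.

```python
class CounterBackend:
    """Aggregates observations into running totals, metrics-style."""
    def __init__(self):
        self.totals = {}

    def handle(self, name, value, tags):
        self.totals[name] = self.totals.get(name, 0) + value

class LogBackend:
    """Keeps every observation as a discrete line, logging-style."""
    def __init__(self):
        self.lines = []

    def handle(self, name, value, tags):
        self.lines.append("%s=%s %s" % (name, value, sorted(tags.items())))

class Recorder:
    """The single instrumentation point the application sees."""
    def __init__(self, *backends):
        self.backends = backends

    def record(self, name, value=1, **tags):
        for backend in self.backends:  # fan out to every monitoring system
            backend.handle(name, value, tags)

metrics, logs = CounterBackend(), LogBackend()
rec = Recorder(metrics, logs)
rec.record("http.requests", endpoint="/home", status=200)
rec.record("http.requests", endpoint="/home", status=200)
print(metrics.totals)
```

The point is that the application is instrumented once, and the choice of monitoring systems becomes a deployment decision rather than a code change.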
- August '17 SF Metrics Meetup
Talks:
• Evaluating a log analysis platform the wrong way - Amy Nguyen
• How to watch TV at work and get away with it - Mathieu Frappier
----
Evaluating a log analysis platform the wrong way
Amy Nguyen (https://twitter.com/amyngyn), Infrastructure Engineer, Observability Team at Stripe
I was recently asked to investigate whether my team should switch from running our own ELK stack to paying for a SaaS logging vendor. Eventually, I concluded that we should switch, and so we did - but not without encountering significant pushback and unexpected difficulties along the way. In this talk, I'll explain the criteria we started out with for switching, what we did during the evaluation period, and what I wish we had done instead. We'll cover actionable lessons such as how to evaluate security, the right way to ask for feedback, and what you might not have thought to ask about in a vendor trial.
----
How to watch TV at work and get away with it
Mathieu Frappier (https://www.linkedin.com/in/matfra/), Site Reliability Engineer at Yelp
We have close to 100 TVs in our office, and we also run hundreds of microservices. What connects the two? Service owners need good visibility into the health of their services. We use SignalFx and Terraform to generate nice-looking, TV-ready dashboards to supplement our regular monitoring and alerting. This talk will cover our experience moving from monitoring our monolith to monitoring our services, and how we build dashboards for them.
----
Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
- June '17 SF Metrics Meetup
This Meetup features talks by J Paul Reed and James Cunningham. Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
----
Detecting Whispers in Chaos
J Paul Reed (https://twitter.com/jpaulreed), Managing Partner at Release Engineering Approaches (http://release-approaches.com/)
In this talk, we'll look at what decades of research in the safety sciences have to say about humans interacting with and operating complex socio-technical systems, including what aircraft carriers have to do with Internet infrastructure operations, how resilience engineering can help us, and the use of heuristics in incident response. All of these provide insight into ways we can improve one of the most advanced - and most effective - monitoring tools we have available to keep those systems running: ourselves. Learn more about Paul here: http://jpaulreed.com/
----
Vetting your Pager
James Cunningham (https://twitter.com/JTCunning), Operations Engineer at Sentry (https://sentry.io)
Sentry receives a million requests a minute to process and store crashes from all around the world. It's the Operations Team's responsibility to make sure everything goes right, but it's also their responsibility to not burn themselves out when things go wrong. Sentry collects fifty thousand custom metrics inside of DataDog, but alerts on fewer than fifty of them. James leads Sentry's observability initiative, creating and maintaining those alerts. Learn about the lifecycle of an alert at Sentry, including:
• How a variety of metrics are collected efficiently
• How Sentry justifies a metric's degree of accuracy
• Why a metric's logical purpose is defined
• How alerts evolve from metrics, articulating their existence
• When an Engineer actually gets paged and what they're instructed to do
- May '17 SF Metrics Meetup
- September '16 SF Metrics Meetup
- July '16 SF Metrics Meetup
Housekeeping: Arrive around 6:30pm to catch up with other quantifiers over food and drinks; talks start at 7:00pm. Space is limited, please RSVP.
Featured Talks:
----
Turnkey Distributed Tracing with OpenTracing
Ben Sigelman (https://www.linkedin.com/in/bensigelman), Cofounder of Lightstep (http://lightstep.com/)
This talk describes why distributed tracing is important, why its instrumentation presents uncommon standardization problems, and how OpenTracing addresses those problems. It's been 12 years since Google started using Dapper internally, and Zipkin was open-sourced over 4 years ago. This stuff is not new! Yet if you operate a complex services architecture, deploying a distributed tracing system today requires person-years of engineering effort, monkey-patched communication packages, and countless inconsistencies across platforms. If distributed tracing is so valuable, why doesn't everyone do it already? Because tracing instrumentation has been broken until now. That brings us to the OpenTracing project. OpenTracing is a new, open distributed tracing standard for applications and OSS packages. I will describe how OpenTracing integrates with application code and OSS libraries, how it interoperates with Zipkin, Appdash, LightStep, and other tracing backends, and where the project is headed. We will end with a deep dive into some OpenTracing libraries and show a few demos.
----
Druid: Realtime Analytics for Metrics Data
Gian Merlino (https://www.linkedin.com/in/gianmerlino), Cofounder and CTO of Imply (http://imply.io/)
Druid is an open source, distributed data store designed to analyze event data. Druid powers user-facing data applications, provides fast queries on data in Hadoop, and helps you glean insights from streaming data. Its architecture unifies historical and real-time data and enables fast, flexible OLAP analytics at scale. We will cover Druid's design and architecture, and how Druid can be used to monitor metrics data.
---- Legal Drinking Age Required: We will be serving beer at the Meetup. We will not serve alcohol to persons under the age of 21.