  • April '18 SF Metrics Meetup

    Heavybit, Inc.

    Doctor Graphite or How I Learned to Stop Worrying and Love the Dots

    Brad Lhotsky (https://edgeofsanity.net/), Systems & Security Administrator @ Craigslist

    Graphite revolutionized monitoring and our expectations of monitoring tools. Criticized for being too slow, clunky, or awkward, most folks are moving away from Graphite to other tools promising horizontal scaling, SQL-ish syntax, and fancy, shiny moving parts. We'll discuss why Graphite's simple Whisper file backend might be all you need, and we'll try squeezing all the performance we can from the official Graphite applications. The talk will also cover some features in the new 1.1.x branch and how you can use Graphite to be more successful.

    ---

    Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
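    Organizer's note: since the talk's thesis leans on Whisper's simple fixed-retention storage model, here is a minimal storage-schemas.conf sketch for readers unfamiliar with it. The patterns and retention values are illustrative assumptions, not settings from the talk:

        # /opt/graphite/conf/storage-schemas.conf (illustrative values)
        # Carbon applies the first section whose pattern matches the metric name;
        # each archive is precision:duration, with coarser archives aggregating finer ones.
        [carbon]
        pattern = ^carbon\.
        retentions = 60s:90d

        [default]
        pattern = .*
        retentions = 10s:6h,1m:7d,10m:5y

    Because each Whisper file is pre-allocated from its retention schedule, reads and writes are fixed-offset seeks into a fixed-size file - much of the simplicity and predictability the talk alludes to.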

  • March '18 SF Metrics Meetup

    Heavybit, Inc.

    Talks:
    1. Observability Isn't Monitoring - Baron Schwartz, CEO @ VividCortex (https://twitter.com/xaprb)
    2. PostgreSQL as Persistent Storage for Prometheus Metrics - Mike Freedman, Co-Founder & CTO @ TimescaleDB (https://www.timescale.com, Twitter: @TimescaleDB)

    ----

    Observability Isn't Monitoring

    What is observability? It's not monitoring, that's what it is. But seriously, let's talk about the seven golden signals; the relationships and differences between concepts like observability, monitoring, telemetry, and instrumentation; and how to build highly observable services.

    ----

    PostgreSQL as Persistent Storage for Prometheus Metrics

    Time-series data is now everywhere and increasingly used to power core applications. Yet it creates a number of technical challenges around ingesting high volumes of data, supporting complex queries for recent and historical time intervals, and performing time-centric analysis and data management. Meanwhile, Prometheus and Grafana have emerged as a popular duo for collecting, querying, and graphing metrics. But while Prometheus has its own time-series storage subsystem for metrics monitoring, users sometimes need richer time-series analysis, as well as the ability to join such data against other relational data to answer key business questions.

    In this talk, we take a somewhat heretical stance in the monitoring world and describe why and how we view an enhanced Postgres as an effective Prometheus backend to support those complex questions (and get a proper SQL interface). We present pg_prometheus, a new native Prometheus datatype for Postgres, as well as a remote storage adapter that allows Prometheus to write directly to a Postgres database. We also describe our work with TimescaleDB, a new open-source database designed for time-series workloads, engineered as a PostgreSQL extension. In transforming PostgreSQL into a scalable time-series database, TimescaleDB serves as a powerful persistent store for Prometheus.

    ----

    Doors open at 6:30pm. Catch up with other #monitoringlove folk over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
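    Organizer's note: as a concrete illustration of the remote-storage hookup the abstract describes, a Prometheus server is pointed at such an adapter with a few lines of configuration. A minimal sketch, assuming the adapter listens locally on port 9201 (an illustrative choice; check the adapter's documentation for the real endpoint):

        # prometheus.yml (illustrative adapter endpoint)
        remote_write:
          - url: "http://localhost:9201/write"
        remote_read:
          - url: "http://localhost:9201/read"

    With both directions configured, Prometheus scrapes and alerts as usual while samples are mirrored into Postgres, where they can be joined against other relational data in plain SQL.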

  • October '17 SF Metrics Meetup

    Heavybit, Inc.

    Talks:
    • LinkedIn: Monitoring Evolution - Richard Waid
    • Sample your traffic but keep the good stuff! - Ben Hartshorne

    LinkedIn: Monitoring Evolution

    Richard Waid (https://www.linkedin.com/in/richardwaid/), Director of Monitoring Infrastructure, LinkedIn

    In the past 6 years, the Monitoring team at LinkedIn has dealt with an explosive change in scale, from 12k to 850M individual metrics, as well as a migration from NOC-based escalation to direct remediation and escalation. This is a brief overview of how we accomplished that, how we fit into the overall engineering ecosystem, and what we're doing for the next major evolution in our journey. Along the way I'll cover a few of our major learnings: protecting the ecosystem against well-meaning users, planning for explosive scaling, and the global vs. local optima challenge of self-service tooling.

    Sample your traffic but keep the good stuff!

    Ben Hartshorne (https://twitter.com/maplebed), Software Engineer, Honeycomb
    (Organizer's note: This is a sneak peek of Ben's talk at LISA17 on Nov 3rd, 2017.)

    The two main methods of reducing high-volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery. We'll start by laying down the basic ground rules for sampling: what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility.

    Once we have the basics of what it means to sample, we'll look at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it? Given the patterns of traffic in a modern web infrastructure, there are some solid methods to change how you think about sampling in a way that lets you keep visibility into the most important parts of your infrastructure while maintaining the benefits of transmitting only a portion of your volume to your instrumentation service.

    Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to affect your sampling rates! Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most. I'll finish with some examples of dynamic sampling in our own infrastructure and how they let us see individual events of interest while keeping only 1/1000th of the overall traffic.

    Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
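    Organizer's note: dynamic sampling is the technical heart of Ben's abstract, so here is a minimal Python sketch of one common variant: keep rare traffic in full, thin hot keys in proportion to their volume, and record the sample rate with every kept event so aggregate counts can be reconstructed later. The class, threshold, and key naming are illustrative assumptions, not Honeycomb's implementation:

        import random
        from collections import Counter

        class DynamicSampler:
            """Keep rare traffic intact; thin hot keys in proportion to their volume."""

            def __init__(self, keep_all_below=100):
                self.keep_all_below = keep_all_below  # per-window volume kept in full
                self.counts = Counter()               # per-key traffic in the current
                                                      # window (periodic reset omitted)

            def sample_rate_for(self, key):
                # Below the threshold the rate is 1 (keep everything); past it,
                # the key is thinned in proportion to its volume.
                return max(1, self.counts[key] // self.keep_all_below)

            def should_keep(self, key):
                self.counts[key] += 1
                rate = self.sample_rate_for(key)
                # Record `rate` with the kept event: it stands in for `rate` real events.
                return random.random() < 1.0 / rate, rate

        sampler = DynamicSampler()
        kept, rate = sampler.should_keep("GET /health")  # hot paths end up heavily thinned

    Weighting each kept event by its recorded sample rate is what preserves accurate aggregates even at 1/1000th of the original volume.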

  • September '17 SF Metrics Meetup

    Heavybit, Inc.

    Talks:
    • OpenTracing and Metrics: Measure Twice, Instrument Once - Ted Young
    • IRONdb - Solving the technical challenges of TSDBs at scale - Fred Moyer

    OpenTracing and Metrics: Measure Twice, Instrument Once

    Ted Young (https://twitter.com/tedsuo), Director of Open Source Development, LightStep

    As systems become highly distributed, mechanisms for correlating diagnostic information become a necessity. In this talk, we will discuss how the need for correlation is beginning to blur the lines between the formerly separate domains of metrics, logging, and tracing. We will present separating observation from aggregation - using a single, neutral instrumentation API that can push data into multiple types of monitoring systems - as one approach to this problem.

    IRONdb - Solving the technical challenges of TSDBs at scale

    Fred Moyer (https://twitter.com/phredmoyer), Developer Evangelist, Circonus (IRONdb)

    Time series databases are optimized for handling sets of data indexed by time. Aspects of data storage, data safety, and the IOPS problem are challenges that all TSDBs face at scale. In this talk, we'll look at how IRONdb solves these technical problems, or avoids them entirely. IRONdb is a commercial time series database developed by Circonus, and is a Graphite-compatible drop-in replacement for Whisper.

    Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).
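    Organizer's note: to make "instrument once" concrete, here is a minimal sketch against the neutral OpenTracing API (assuming the 2.x opentracing Python package); the span name and tag are invented for illustration. The application code never names a backend - a concrete tracer (Zipkin, LightStep, Jaeger, ...) is registered once at startup, and nothing else changes:

        import opentracing

        # With no backend registered this resolves to a no-op tracer, so the
        # instrumented code runs unchanged whether or not tracing is enabled.
        tracer = opentracing.global_tracer()

        def handle_request(user_id):
            # Written once against the neutral API; the emitting backend is swappable.
            with tracer.start_active_span("handle_request") as scope:
                scope.span.set_tag("user.id", user_id)
                # ... real work happens here ...

        handle_request(42)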

  • August '17 SF Metrics Meetup

    Heavybit, Inc.

    Talks:
    • Evaluating a log analysis platform the wrong way - Amy Nguyen
    • How to watch TV at work and get away with it - Mathieu Frappier

    Evaluating a log analysis platform the wrong way

    Amy Nguyen (https://twitter.com/amyngyn), Infrastructure Engineer, Observability Team at Stripe

    I was recently asked to investigate whether my team should switch from running our own ELK stack to paying for a SaaS logging vendor. Eventually, I concluded that we should switch, and so we did - but not without encountering significant pushback and unexpected difficulties along the way. In this talk, I'll explain the criteria we started out with for switching, what we did during the evaluation period, and what I wish we had done instead. We'll cover actionable lessons such as how to evaluate security, the right way to ask for feedback, and what you might not have thought to ask about in a vendor trial.

    How to watch TV at work and get away with it

    Mathieu Frappier (https://www.linkedin.com/in/matfra/), Site Reliability Engineer at Yelp

    We have close to 100 TVs in our office, and we also run hundreds of microservices. What connects the two? Service owners need good visibility into the health of their services, so we use SignalFx and Terraform to generate nice-looking, TV-ready dashboards that supplement our regular monitoring and alerting. This talk will cover our experience moving from monitoring our monolith to monitoring our services, and how we build dashboards for them.

    Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).

  • June '17 SF Metrics Meetup

    Heavybit, Inc.

    This Meetup features talks by J Paul Reed and James Cunningham. Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP. This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).

    Detecting Whispers in Chaos

    J Paul Reed (https://twitter.com/@jpaulreed), Managing Partner at Release Engineering Approaches (http://release-approaches.com/)

    In this talk, we'll look at what decades of research in the safety sciences have to say about humans interacting with and operating complex socio-technical systems, including what aircraft carriers have to do with Internet infrastructure operations, how resilience engineering can help us, and the use of heuristics in incident response. All of these provide insight into ways we can improve one of the most advanced - and most effective - monitoring tools we have available to keep those systems running: ourselves. Learn more about Paul here: http://jpaulreed.com/

    Vetting your Pager

    James Cunningham (https://twitter.com/JTCunning), Operations Engineer at Sentry (https://sentry.io)

    Sentry (sentry.io) receives a million requests a minute to process and store crashes from all around the world. It's the Operations Team's responsibility to make sure everything goes right, but it's also their responsibility not to burn themselves out when things go wrong. Sentry collects fifty thousand custom metrics inside of DataDog, but alerts on fewer than fifty of them. James leads Sentry's observability initiative, creating and maintaining those alerts. Learn about the lifecycle of an alert at Sentry, including:
    • How a variety of metrics are collected efficiently
    • How Sentry justifies a metric's degree of accuracy
    • Why a metric's logical purpose is defined
    • How alerts evolve from metrics, articulating their existence
    • When an engineer actually gets paged and what they're instructed to do
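    Organizer's note: as a hedged illustration of where that lifecycle ends up, here is a sketch using the datadog Python client to define a metric alert. The metric name, threshold, and message are invented; this is not Sentry's actual configuration:

        from datadog import initialize, api

        initialize(api_key="...", app_key="...")  # credentials elided

        # Page only when a symptom users would feel persists, in the spirit of
        # the talk's "collect fifty thousand metrics, alert on fifty" stance.
        api.Monitor.create(
            type="metric alert",
            query="avg(last_10m):avg:app.events.processed{*} < 100",
            name="[hypothetical] Event processing throughput below floor",
            message="Event processing has stalled for 10 minutes. @pagerduty",
            tags=["team:ops"],
        )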

  • May '17 SF Metrics Meetup

    Heavybit, Inc.

    This Meetup features talks by Emily Nakashima and Megan Kanne. Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP.

    What Your JavaScript Does When You're Not Around

    Emily Nakashima (https://twitter.com/eanakashima), Lead Front End Engineer at Bugsnag

    When we shipped a big rewrite of Bugsnag's customer-facing dashboards as a 40,000-line JavaScript app, first we celebrated ... and then we looked at our client-side monitoring and did a lot of head-scratching. We'd tested in multiple browsers and thought through dozens of use cases, but when we looked at the kinds of real errors coming back to us from the wild, the reality was so much weirder than we'd expected.

    As client-side app frameworks like React and Ember keep growing more popular, we're shipping more and more application logic out to users' browsers. But we don't always know much about what happens to it after we send it out to the client. This talk will take you on a fast-paced tour of all the strange cases we've looked at since shipping our new dashboard, from overseas proxy sites to rogue browser extensions to out-and-out clones of our UI. Finally, I'll talk about how to cut the noise and focus on monitoring and mitigating the cases that really matter to your users' experience.

    Twitter's Next-Gen Alerting System: Reliable Alerting at Scale

    Megan Kanne (https://twitter.com/megankanne), Senior Software Engineer at Twitter (Observability Team)

    Twitter's Observability team provides Twitter developers with monitoring infrastructure, including real-time dashboards and alerting for their services. Twitter has grown by orders of magnitude since it debuted at SXSW, and the monitoring infrastructure has followed suit, ingesting orders of magnitude more metrics - from millions to billions. As the company grew, so did the challenges of providing an always-available alerting system. Design decisions made when Twitter had only one datacenter and a monolithic architecture created unresolvable scale and reliability issues for its alerting service. In this talk I'll walk you through these challenges and describe our solution: a next-gen alerting system that provides reliable, realtime, multi-zone alerting at scale.

  • September '16 SF Metrics Meetup

    Heavybit, Inc.

    This Meetup features talks by Justin Reynolds and Alex Newman. Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8pm. Space is limited, please RSVP.

    Intuition Engineering at Netflix

    Justin Reynolds (https://www.linkedin.com/in/justinmreynolds), Senior Software Engineer, Traffic & Chaos at Netflix (https://www.netflix.com/)

    Netflix runs on hundreds of interconnected microservices that, taken together, form a system too complex for any one person to completely understand. We are developing a tool, Vizceral, that helps distill the most useful bits of information and present them to the user in a way that lets them 'feel' the state of the system without needing to understand all the moving parts. There is still a lot of work to be done to build more and more intuition about the system, but the tool has already proven vital internally and has been open-sourced.

    A Next-Generation Telemetry and Logging Solution for Any Application

    Alex Newman (https://twitter.com/posix4e), Pink Ranger at Planet (https://www.planet.com/)

    Rust-metrics is an early project that wants to be your one-stop shop for telemetry and alert logging. Rust-metrics supports a variety of metrics and alerting tools, but goes even further by allowing users to mix and match multiple reporters simultaneously, with different telemetry thresholds for alerting. Rust-metrics provides telemetry and alerts for carbon/Graphite, local CSV, syslog, and Prometheus. Not only can rust-metrics be used from Rust, but soon from C, Python, C++, Java, Go, or even JavaScript. Rust-metrics is a work in progress, but versions of it are already in production. Come learn why I designed rust-metrics, a bit about the Google project that inspired it, and a bit of the design of rust-metrics. We will also cover the future of rust-metrics, what projects it is being integrated into, and interesting use cases people have for it.

  • July '16 SF Metrics Meetup

    Heavybit, Inc.

    Housekeeping: Arrive around 6:30pm to catch up with other quantifiers over food and drinks; talks start at 7:00pm. Space is limited, please RSVP.

    Featured Talks:

    Turnkey Distributed Tracing with OpenTracing

    Ben Sigelman (https://www.linkedin.com/in/bensigelman), Cofounder of LightStep (http://lightstep.com/)

    This talk describes why distributed tracing is important, why its instrumentation presents uncommon standardization problems, and the way that OpenTracing addresses these problems.

    It's been 12 years since Google started using Dapper internally. Zipkin was open-sourced over 4 years ago. This stuff is not new! Yet if you operate a complex services architecture, deploying a distributed tracing system today requires person-years of engineering effort, monkey-patched communication packages, and countless inconsistencies across platforms. If distributed tracing is so valuable, why doesn't everyone do it already? Because tracing instrumentation has been broken until now.

    That brings us to the OpenTracing project. OpenTracing is a new, open distributed tracing standard for applications and OSS packages. I will describe how OpenTracing integrates with application code and OSS libraries, how it interoperates with Zipkin, Appdash, LightStep, and other tracing backends, and where the project is headed. We will end with a deep dive into some OpenTracing libraries and show a few demos.

    Druid: Realtime Analytics for Metrics Data

    Gian Merlino (https://www.linkedin.com/in/gianmerlino), Cofounder and CTO of Imply (http://imply.io/)

    Druid is an open source, distributed data store designed to analyze event data. Druid powers user-facing data applications, provides fast queries on data in Hadoop, and helps you glean insights from streaming data. The architecture unifies historical and real-time data and enables fast, flexible OLAP analytics at scale. We will cover Druid's design and architecture, and how Druid can be used to monitor metrics data.

    ----

    Legal Drinking Age Required: We will be serving beer at the Meetup. We will not serve alcohol to persons under the age of 21.
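    Organizer's note: for readers new to Druid, here is a flavor of how querying it looks. Native Druid queries are JSON documents posted to the broker; the sketch below assumes a broker on its default port 8082 and an invented "metrics" datasource:

        import requests

        # Native timeseries query: per-minute event counts over one day.
        query = {
            "queryType": "timeseries",
            "dataSource": "metrics",            # illustrative datasource name
            "granularity": "minute",
            "intervals": ["2016-07-01/2016-07-02"],
            "aggregations": [
                {"type": "longSum", "name": "events", "fieldName": "count"}
            ],
        }

        resp = requests.post("http://localhost:8082/druid/v2", json=query)
        print(resp.json())  # [{"timestamp": ..., "result": {"events": ...}}, ...]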
