• LinkedIn: Monitoring Evolution - Richard Waid
• Sample your traffic but keep the good stuff! - Ben Hartshorne
LinkedIn: Monitoring Evolution
Richard Waid (https://www.linkedin.com/in/richardwaid/), Director of Monitoring Infrastructure, LinkedIn
In the past 6 years, the Monitoring team at LinkedIn has dealt with an explosive change in scale from 12k to 850M individual metrics, as well as a migration from NOC-based escalation to direct remediation and escalation. This is a brief overview of how we accomplished that, how we fit into the overall engineering ecosystem, and what we're doing for the next major evolution in our journey. Along the way I'll cover a few of our major learnings: protecting the ecosystem against well-meaning users, planning for explosive scaling, and the global vs. local optima challenge of self-service tooling.
Sample your traffic but keep the good stuff!
Ben Hartshorne (https://twitter.com/maplebed), Software Engineer, Honeycomb
(Organizer’s note: This is a sneak peek of Ben’s talk at LISA17 on Nov 3rd, 2017)
The two main methods of reducing high-volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery.
We'll start by laying down the basic ground rules for sampling: what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility. Once we have the basics of what it means to sample, we'll look at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it?
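To make "the simplest methods" concrete, here is a minimal sketch of constant-rate sampling. It is an illustration only, not code from the talk: the function names, the `sample_rate` field, and the event shape are all assumptions. Each event is kept with probability 1/N, and every kept event records the rate so that totals can be estimated later.

```python
import random


def sample_events(events, sample_rate, rng=None):
    """Keep roughly 1 in `sample_rate` events (hypothetical helper).

    Each kept event is tagged with the rate it was sampled at, so a
    downstream consumer knows it stands in for `sample_rate` originals.
    """
    if rng is None:
        rng = random.Random()
    kept = []
    for event in events:
        # randrange(n) returns 0 with probability 1/n
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_total(kept):
    """Estimate the original event count by summing the recorded rates."""
    return sum(event["sample_rate"] for event in kept)
```

With a single fixed rate, rare events are dropped as readily as common ones: at a rate of 1000, a handful of errors among millions of successful requests will most likely never be kept. That is the loss of visibility the questions above point at.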
Given the traffic patterns of modern web infrastructure, there are solid ways to rethink sampling so that you keep visibility into the most important parts of your infrastructure while still gaining the benefits of transmitting only a portion of your volume to your instrumentation service.
Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to affect your sampling rates! Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can give you the ability to reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most.
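One way such a feedback loop could look is sketched below. This is an assumed scheme for illustration, not the algorithm presented in the talk or used by Honeycomb: the `PerKeySampler` class, its `target_per_key` parameter, and the choice of key (for example, an HTTP status code) are all hypothetical. The sampler counts events per key during a window, then sets each key's rate for the next window so that frequent keys are sampled heavily and rare keys are kept in full.

```python
import random
from collections import Counter


class PerKeySampler:
    """Hypothetical sampler whose per-key rates follow observed volume."""

    def __init__(self, target_per_key=10, rng=None):
        self.target_per_key = target_per_key  # events to keep per key per window
        self.rng = rng if rng is not None else random.Random()
        self.current_counts = Counter()  # volume seen in the current window
        self.rates = {}                  # rates derived from the previous window

    def rate_for(self, key):
        # Keys not seen in the previous window are kept in full (rate 1).
        return self.rates.get(key, 1)

    def sample(self, key):
        """Count the event and return (keep, rate) for it."""
        self.current_counts[key] += 1
        rate = self.rate_for(key)
        keep = self.rng.randrange(rate) == 0
        return keep, rate

    def end_window(self):
        """Feed the observed volume back into the rates for the next window."""
        self.rates = {
            key: max(1, count // self.target_per_key)
            for key, count in self.current_counts.items()
        }
        self.current_counts = Counter()
```

For example, if one window sees 10,000 events keyed `"200"` and 5 keyed `"500"`, then with `target_per_key=10` the next window samples `"200"` at 1 in 1000 while every `"500"` is kept. As in the constant-rate case, the returned rate should travel with each kept event so counts can be reconstructed.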
I'll finish by bringing up some examples of dynamic sampling in our own infrastructure and talk about how it lets us see individual events of interest while keeping only 1/1000th of the overall traffic.
Doors open at 6:30pm. Catch up with other quantifiers over food and drinks. Talks start at 7:00pm and end at 8:00pm. Space is limited, so please RSVP.
This event will be live-streamed at heavybit.com/live/ (http://www.heavybit.com/live/).