Come join us for Travis Mattera presenting on...
Not every Datadog monitor needs to be created from scratch. We build many of our services using the same resources, particularly when it comes to cloud-provider components. Those common resources in turn have certain key metrics that best indicate the health of the resource and of the application using it. The focus here is on the four golden signals, which form the standard baseline for operational observability.
You don't always have to reinvent the wheel when it comes to baseline monitors. For a given combination of a resource and metric, there's likely a general best practice for a Datadog monitor's settings. You can leverage the wisdom of your organization by starting with these settings and later tweaking your monitors to suit your service-specific needs. In turn, operational lessons can feed back into revisions of those baseline defaults.
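As a rough illustration of the "baseline plus tweaks" idea, here is a minimal Python sketch. It is not MonitorMgr's actual implementation: the `BASELINES` catalog, the `build_monitor` helper, the metric name `example.lb.latency`, and all threshold values are hypothetical and chosen only for illustration. The returned dictionary loosely follows the shape of a Datadog monitor definition (name, type, query, thresholds), which is also an assumption about how such a tool might emit its output.

```python
from copy import deepcopy

# Hypothetical organization-wide defaults, keyed by (resource, metric).
# The metric name and numbers are placeholders, not recommendations.
BASELINES = {
    ("load_balancer", "latency"): {
        "metric": "example.lb.latency",
        "window": "last_5m",
        "comparator": ">",
        "critical": 1.0,
        "warning": 0.5,
    },
}


def build_monitor(resource, metric, service, overrides=None):
    """Start from the shared baseline, then apply service-specific tweaks."""
    # Copy so that one service's overrides never leak into the shared baseline.
    settings = deepcopy(BASELINES[(resource, metric)])
    settings.update(overrides or {})
    query = (
        "avg({window}):avg:{metric}{{service:{service}}} "
        "{comparator} {critical}"
    ).format(service=service, **settings)
    return {
        "name": f"[{service}] {resource} {metric}",
        "type": "metric alert",
        "query": query,
        "options": {
            "thresholds": {
                "critical": settings["critical"],
                "warning": settings["warning"],
            }
        },
    }


default_monitor = build_monitor("load_balancer", "latency", "checkout")
tuned_monitor = build_monitor(
    "load_balancer", "latency", "checkout", overrides={"critical": 2.5}
)
```

Keeping the defaults in one catalog means that a lesson learned in operations can be applied by editing a single baseline entry, while each service still carries only its own small set of overrides.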
Like documentation, observability often comes after app launch or feature deployment. Anything that helps fill in blind spots with minimal effort will increase an organization's situational awareness, aiding detection & diagnosis. Though fears of rampant false positives may arise, recall we are basing these monitors on peer-reviewed baseline definitions. And as with any monitor, careful tuning is always part of the life-cycle.
Service Level Objectives (SLO)
Though services may use the same foundational resources, frameworks and even codebases, their service level agreements (SLAs) may differ significantly. SLAs are maintained by measuring service level objectives (SLOs), which is where monitors come in. Every useful monitor of a service's component will roll up to an SLA for that service. With diligent engineering, the degree to which an SLO breach impacts the SLA can be quantified, and the SLO's monitor can thus be weighted in terms of severity.
The Human Element
Business processes can run on machines, but there's always a group of people tasked with caring for those processes. The fleet of monitors for a service should contain SLO context, team contact info and paging integrations, and those bits of info should be able to change as rapidly as the organization does, but with less effort and human touch time.
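One way to picture this is to keep ownership details in a single record and render every monitor's message from it, so a team change is made once rather than per monitor. The Python sketch below is only an assumption about how that could look: the `Ownership` class, the `render_message` and `render_fleet` helpers, the contact address, and the `@pagerduty-payments` handle are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ownership:
    """Who cares for a service (all values here are illustrative)."""
    team: str
    contact: str
    pager_handle: str  # notification handle placed in the monitor message


def render_message(slo_summary: str, owner: Ownership) -> str:
    """Build a monitor message carrying SLO context and ownership info."""
    return "\n".join([
        f"SLO: {slo_summary}",
        f"Owner: {owner.team} ({owner.contact})",
        f"Page: {owner.pager_handle}",
    ])


def render_fleet(slos: dict, owner: Ownership) -> dict:
    """Render the message for every monitor in a service's fleet."""
    return {name: render_message(text, owner) for name, text in slos.items()}


owner = Ownership("Payments", "payments@example.com", "@pagerduty-payments")
messages = render_fleet(
    {
        "latency": "99% of requests complete in under 300 ms",
        "errors": "error rate stays below 0.1%",
    },
    owner,
)
```

If the service moves to another team, only the `Ownership` record changes and the whole fleet can be re-rendered from it.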
Not the End
Many OSS tools for Datadog monitor management enshrine the monitor definitions in code. MonitorMgr is designed to be a monitor wizard that rapidly generates a proportionally large number of monitors from a minimum of user input. I envision MonitorMgr reaching its maximum potential as a catalyst of systems engineering discipline for an organization via black-box monitoring of all services' SLAs.
Directions & Location:
This month we're gathering at the Dell (EMC) Office in the Pioneer Square neighborhood.
6:30pm - We'll arrive and mingle a bit.
6:59pm - We'll start the talk from Travis Mattera "Datadog, OSS, SLAs, and Enterprise Monitoring".
7:55pm - Wrap up.
8:05pm - Vote time for either Altstadt Bierhalle, J & M, or Collins Pub for post-meetup conversations and a round on the meetup!