Cloud scale incident detection across metrics & logs

Hosted By
Salim L. and May C.

Details

Topic: Cloud scale incident detection across metrics & logs

Abstract:
Zebrium uses machine learning to automatically detect software incidents and show their root cause. The foundation of the technology looks for hotspots of abnormally correlated anomalous log events. We wanted to extend this to also include correlated Prometheus metric anomalies. This imposed a few unusual requirements: minimal bandwidth between the client and our SaaS backend, near-real-time metric updates, and the ability to match metrics with logs collected from the same container/source.

We will discuss our journey using the Prometheus server/scraper, give a live demo, and explain why we built and open-sourced a forked instance to meet our needs. The end result: over 500x bandwidth reduction (under 0.7 bytes per sample vs. 391 bytes in the raw form) and near-real-time updates, while still being able to correlate anomalies across both logs and metrics to automatically catch software incidents and show their root cause.

Presented by Rod Bagg – Co-founder and VP, Engineering @ Zebrium

Rod is known as the pioneer of using data science and analytics to analyze logs and metrics. Prior to Zebrium, he joined Nimble Storage as an early employee and created the InfoSight predictive analytics platform. Prior to Nimble, he co-founded Glassbeam as VP Engineering. Rod began his career in this field in 1999, when he built the NetApp Support Automation team using AutoSupport product telemetry.

Zoom Link: https://us02web.zoom.us/j/82270952134

SF Prometheus Meetup Group