
Alois Reitbauer: The Dark Art of Building a Production Incident System

  • Jun 18, 2014 · 6:30 PM
  • Wayfair Offices - 4 Copley Place

Alois Reitbauer is Chief Evangelist of ruxit and a key contributor to a number of performance management solutions.

Incident management and alerting are the spine of operations monitoring. Getting them to work properly remains a great challenge, even today. Besides in-depth knowledge of how IT infrastructure and applications work, you also need a fair amount of mathematical skill if you don’t want to define tens of thousands of alerting thresholds manually.

Our challenge was even tougher. We needed to develop a reliable incident system that works consistently across more than 1000 applications without requiring any manual configuration. This talk will cover how we did it.

Even if you are not a statistics maniac, you will get in-depth insight into how to build a better incident system. We will cover a wide variety of topics, including less common questions like:

·  Which metrics should you track, and which shouldn’t you?

·  Why should you differentiate between violations and incidents?

·  What makes a metric suitable for baselining and violation detection?

·  What are the top reasons for too many or too few incidents?

·  Why might your incident system trigger too fast or too slow?

·  What is percentile drift detection, and how does it help improve your alerts?

·  How does the structure of your infrastructure and applications help reduce the number of incidents you get?

Whether you are responsible for setting up incident management or are just a consumer of it, this talk is for you. It will help you develop more trust in your incidents and give you the skills to improve them.


Our Sponsors

  • Wayfair

    Wayfair hosts our meetups, and is looking to hire great engineers!

  • Akamai

    Any experience. Any device. Anywhere.

  • Instart Logic

    Cloud application delivery

  • Catchpoint

    10% off to Members + "Complete Web Monitoring" ebooks


    Web and mobile app test automation solution.

  • O'Reilly

    O'Reilly provides discounts to all members, and free books

  • AppDynamics

    Next generation application performance management

  • Yottaa

    Yottaa provides a selection of books on Web Performance for each Meetup
