Alois Reitbauer is Chief Evangelist of ruxit and a key contributor to a number of performance management solutions.
Incident management and alerting are the backbone of operations monitoring. Getting them to work properly is a great challenge, even today. Besides in-depth knowledge of how IT infrastructure and applications work, you also need a fair amount of mathematical skill if you don’t want to define tens of thousands of alerting thresholds manually.
Our challenge was even tougher. We needed to develop a reliable incident system that works consistently across more than 1000 applications without requiring any manual configuration. This talk will cover how we did it.
Even if you are not a statistics maniac, you will gain in-depth insight into how to build a better incident system. We will cover a wide variety of topics and address less common questions such as:
· Which metrics should you track, and which shouldn’t you?
· Why should you differentiate between violations and incidents?
· What makes a metric suitable for baselining and violation detection?
· What are the top reasons for too many or too few incidents?
· Why might your incident system trigger too quickly or too slowly?
· What is percentile drift detection, and how does it help improve your alerts?
· How does the structure of your infrastructure and applications help reduce the number of incidents you get?
Whether you are responsible for setting up incident management or are just a consumer of it, this talk is for you. It will help you develop more trust in your incidents and give you the skills to improve them.