We'd like to share our experiences building scalable monitoring and alerting solutions using Graphite, Grafana, Collectd, Nagios, Logstash, Elasticsearch and Kibana, among others.
We believe that it is easy to collect metrics from any one system, and to define alerts on single metrics; we already have this capability in place. However, in complex systems, the real operational challenges arise from the way system components interact. Some of these components live inside our data center, some live outside it, and all are updated on different timelines. The functionality and performance of every component have the potential to change every day.
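To illustrate just how easy the single-metric case is, here is a minimal sketch using Graphite's plaintext protocol (one line per metric: path, value, epoch timestamp, conventionally sent to a Carbon listener on port 2003). The host, metric name, and threshold are hypothetical, and a real single-metric alert would come from Nagios rather than hand-rolled code:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(host, path, value, port=2003):
    """Send a single metric to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))

def threshold_alert(value, limit):
    """A single-metric alert: fire when the latest sample exceeds a fixed limit."""
    return value > limit
```

This is the easy part: one producer, one series, one fixed threshold. None of it captures how components interact.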
Our challenge is to identify patterns and correlations across the multiple systems in our stack. We need to integrate top-down and bottom-up analysis, so we can see, for example, that trial subscription signups (a user metric) fell off at the same time that an internal API call began to fail (an application metric), because a database host fell offline (a system metric).
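One way to surface such cross-layer relationships, sketched below with made-up numbers, is to compute the correlation between time-aligned series from different layers; a strong correlation between a user metric and an application metric is a hint that the same incident is visible in both. This is an illustration of the idea, not our production tooling:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, time-aligned metric series (-1..1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-minute samples; both drop at the same point in time.
signups = [30, 31, 29, 12, 10, 11]   # user metric: trial subscription signups
api_ok  = [99, 98, 99, 40, 38, 41]   # application metric: internal API success rate

# pearson(signups, api_ok) is close to 1.0 here: the signup drop and the
# API failures move together, pointing at a shared underlying cause.
```

The correlation only says the two series moved together; tying them to the database host that fell offline still requires the system-level metric in the same view.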
When collecting so much data, there is a risk of being overwhelmed and unable to make sense of it all: in essence, a risk of collecting data without producing intelligence. We combat this risk by converting our accumulated data into the most visually information-dense format available: graphs. Then we make those graphs easy to compare and easy to share, informative at a glance, and easy for the team to keep watching. Finally, once we are regularly identifying patterns across our graphs, we want an automated way to "watch the graphs" in our absence. This is not AI or a pattern-recognition "black box"; it simply automates patterns that humans have first validated to be meaningful.