
Introspective Monitoring: Self-Analyzing Monitoring for Higher Availability

Hosted By
Jason

Details

Most operations and engineering shops have written scripts, tools and applications that perform and automate large amounts of work, yet these are not always optimally monitored. Some of them don't display external behavior that can be probed easily (or at all). Others do, but the tests don't necessarily exercise the entire internal stack of the running instance. Still others run periodically, and thus may require strict synchronization between the monitoring server and the server running the workload, leaving unmonitored gaps between runs. Ascertaining the health of this entire infrastructure ends up being a puzzle with somewhat convoluted solutions, yielding a potentially incomplete picture of the actual health of each component. Finally, reporting and acting on status can also be problematic. Thus, we sometimes resort to active checks that test side effects, running a myriad of checks against the same component, abusing log files and syslog servers, or sending out mail reports at a given frequency that need to be either parsed or visually inspected. At scale, this situation becomes unmanageable.

Introspective monitoring (also known as passive monitoring) provides the means to perform extremely accurate monitoring in a lightweight fashion, improving response and recovery times, and generally easing the troubleshooting work needed to reach resolution. As the name implies, introspective monitoring is built into the script, tool or application, and thus has access to the running environment, which enables it to keep track of state as events take place in real time. Furthermore, by its very nature, it can report faults when they happen (instead of having to wait for the next check cycle) and provides keep-alive capabilities (for those times when we forget to re-enable a cron job).
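To make the idea concrete, here is a minimal Python sketch of a job that reports on itself. It is an illustration only, not code from the talk: the `IntrospectiveReporter` class, its method names and the `sink` callable are all hypothetical, and the numeric status codes simply follow the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import time
from typing import Callable, List, Tuple

# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2

# A report is (timestamp, check name, status, message). Hypothetical shape.
Report = Tuple[float, str, int, str]


class IntrospectiveReporter:
    """Hypothetical helper: the job itself pushes status to a sink."""

    def __init__(self, sink: Callable[[Report], None]) -> None:
        # The sink is whatever delivers reports to the monitoring server.
        self.sink = sink

    def report(self, name: str, status: int, message: str) -> None:
        self.sink((time.time(), name, status, message))

    def heartbeat(self, name: str) -> None:
        # Keep-alive: if these stop arriving, the job is no longer running.
        self.report(name, OK, "alive")

    def run(self, name: str, task: Callable[[], object]) -> object:
        # Report the fault the moment it happens, then re-raise it.
        try:
            result = task()
        except Exception as exc:
            self.report(name, CRITICAL, f"{type(exc).__name__}: {exc}")
            raise
        self.report(name, OK, "completed")
        return result


if __name__ == "__main__":
    collected: List[Report] = []          # stand-in for a real delivery channel
    reporter = IntrospectiveReporter(collected.append)
    reporter.heartbeat("nightly_sync")    # hypothetical check name
    reporter.run("nightly_sync", lambda: sum(range(10)))
    for _, name, status, message in collected:
        print(name, status, message)
```

Because the reporter lives inside the job, a failure is pushed out as soon as the exception is raised, and a monitoring server that stops receiving heartbeats can treat the silence itself as a fault.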

In this talk, led by Gerir Lopez-Fernandez, we will learn about introspective monitoring and review a number of examples where it is actively used in production at both Ning and Mogwee. At Ning (http://www.ning.com), we monitor NetApp filers with a proxied version of introspective monitoring using a tool called Theia, and watch over file system replication processes with tools such as Snapbaby and Zettabee. We will also see how it is used by BFM, a Ning-specific monitoring tool that tests a number of Ning internal components in a transaction-oriented fashion. Mogwee is essentially built from the ground up with introspective monitoring in place, which was extremely useful during the first few weeks after going live: engineering made tweaks to the server components, added new tests, and was able to report on health status without having operations intervene on an ongoing basis. We will explore the configuration and code necessary to make it work in a Nagios-centric environment using several languages, and see real-world output from our current use of it.
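For readers unfamiliar with how a program hands a result to Nagios, one standard route is the `PROCESS_SERVICE_CHECK_RESULT` external command, a single line written to the Nagios external command file (results can also be relayed from remote hosts, for example with NSCA). The sketch below only builds that line. It is an assumption about one possible mechanism, not a description of what Theia, Snapbaby, Zettabee or BFM actually do, and the function name, host name and service name are hypothetical.

```python
import time
from typing import Optional


def format_passive_result(host: str, service: str, return_code: int,
                          output: str, timestamp: Optional[float] = None) -> str:
    """Build a Nagios PROCESS_SERVICE_CHECK_RESULT external command line.

    Format: [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<code>;<output>
    """
    if return_code not in (0, 1, 2, 3):   # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"invalid return code: {return_code}")
    ts = int(time.time() if timestamp is None else timestamp)
    # The command must fit on one line, so fold any newlines in the output.
    clean = " ".join(output.splitlines())
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean}"


if __name__ == "__main__":
    # Hypothetical host and service names, for illustration only.
    print(format_passive_result("filer01", "snap_replication", 2,
                                "replication lag above threshold"))
```

On the Nagios side, the matching service would be defined to accept passive checks. Nagios can also run freshness checking on such a service, which is what turns a missing report into an alert and gives the keep-alive behavior described earlier.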

Pizza, beer and other refreshments will be served. We'll begin with a quick series of informal "Lightning Talks" -- guests can present active projects or interests they're working on. If you'd like to present, there's a spot to propose your topic when you RSVP. There will also be time to ask Ning's Engineering and Ops teams any burning questions you have! We look forward to seeing you at Ning HQ (http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=285+Hamilton+Ave,+Palo+Alto,+CA+94301&aq=0&sll=37.0625,-95.677068&sspn=76.360484,74.882813&ie=UTF8&hq=&hnear=285+Hamilton+Ave,+Palo+Alto,+Santa+Clara,+California+94301&z=17) in downtown Palo Alto!

To learn more about the Engineering ninja skills at Ning (http://www.ning.com), check out our Engineering blog, Ning Code (http://code.ning.com/).

ABOUT GERIR LOPEZ-FERNANDEZ

Gerir is a Senior Architect on Ning's Operations team. His experience in large-scale Internet operations focuses on systems architecture, design, implementation and management, with particular emphasis on life-cycle automation, monitoring and storage.

#NingTechTalks (http://twitter.com/#search?q=%23NingTechTalks)

FIND NING ON

twitter.com/ning (http://www.twitter.com/ning)
facebook.com/ning (http://www.facebook.com/ning)
youtube.com/ning (http://www.youtube.com/ning)
scribd.com/ning (http://www.scribd.com/ning)

Ning Tech Talks in Palo Alto
Ning HQ
285 Hamilton Avenue, Suite 400 · Palo Alto, CA