Remediation for Real: Stories and Demos from Netflix and Mirantis


Hi fellow automators, and welcome back to the meetup series. Summer is almost over, and we are starting the new season.
We have two speakers with really interesting topics:
Mirantis, who will share how they auto-remediate a 1,000-node OpenStack cloud at Symantec using the StackStorm platform.
Netflix, who will present (and demo!) Winston Studio. You might have read about it on the Netflix tech blog (http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html); now you will hear about it from the source and actually see it.
The full details on the talks and the speakers are coming up...
We are happy to welcome a new sponsor, Neptune.io: the folks who have been pushing auto-remediation in the cloud, and who are bringing pizza and soda this time around. The venue this time is a "Theater" at Brocade HQ. We hope to record the talks, but we cannot be sure, so please come in person and enjoy the company.
As usual, there will be a good crowd of devops folks and fellow automators. We are looking forward to reconnecting and welcoming new members.
==============
Diagnostics and Remediation Platform @Netflix
presented by Sayli Karmarkar - Senior Software Engineer at Netflix, Diagnostics and Remediation Engineering (DaRE)
During the last meetup (https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/225104291/), we talked about the motivation, initial design, and use cases of our Runbook Automation Platform, Winston. In this talk, we will dive deeper into the architecture and deployment of Winston. We will also see a demo of Winston Studio, a development studio for creating and managing Winston automations. Attendees will see Winston in action as Tier-1 support for developers, who can outsource their repeatable diagnostic and remediation tasks and have them run automatically in response to events. We will also discuss the use cases, the customer impact of our platform, the adoption challenges we have faced, and how the architecture of the platform evolved as we learned along the way.
Sleep Better at Night: OpenStack Cloud Auto-Healing
presented by Alexander Sakhnov, Senior Software Engineer, and Mykyta Gubenko, Deployment Engineer, in the Services department at Mirantis
Software-defined everything is a new trend. How about software-defined outage prevention and remediation?
You have your cloud up and running. You monitor it through StackLight, Zabbix, Nagios, or some other tool. But what happens when one of the services is unresponsive or your free disk space is low? How quickly will you be able to resolve the issue? Do you have any debugging information or logs gathered before you actually start digging into the issue?
We introduce a “robosysadmin” for our production OpenStack cloud that reacts to alerts and outages and helps us reduce mean time to repair by gathering debug information and trying to fix issues automatically using predefined workflows. It’s a kind of Tier 0 support: it troubleshoots, fixes known problems, escalates to humans when necessary, and provides detailed information on what it has discovered.
We will show:
- How we monitor our multi-DC production cloud at Symantec
- How we approached the problem of cloud auto-healing
- StackStorm and alternatives for automating prevention and remediation of outages
- OpenStack auto-healing workflows we created
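For readers who want a concrete picture before the talk, the Tier 0 flow described in the abstract (gather debug information, try a predefined workflow, escalate to a human if that does not help) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the speakers' implementation and not StackStorm's API: the names `Alert`, `Outcome`, `handle_alert`, the check name `disk_space_low`, and the callables passed in are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    host: str
    check: str  # hypothetical check name, e.g. "disk_space_low"


@dataclass
class Outcome:
    alert: Alert
    diagnostics: List[str] = field(default_factory=list)
    fixed: bool = False
    escalated: bool = False


# A workflow takes an alert and returns True if the remediation succeeded.
Workflow = Callable[[Alert], bool]


def handle_alert(
    alert: Alert,
    workflows: Dict[str, Workflow],
    gather_diagnostics: Callable[[Alert], List[str]],
    escalate: Callable[[Outcome], None],
) -> Outcome:
    """Gather debug info, try a known fix, and escalate if it did not work."""
    outcome = Outcome(alert=alert)

    # Step 1: always collect debug information before touching anything.
    outcome.diagnostics = list(gather_diagnostics(alert))

    # Step 2: run the predefined workflow for this check, if one exists.
    workflow = workflows.get(alert.check)
    if workflow is not None:
        try:
            outcome.fixed = workflow(alert)
        except Exception as exc:  # a failed workflow must not hide the alert
            outcome.diagnostics.append(f"workflow failed: {exc}")

    # Step 3: hand over to a human, with the diagnostics attached.
    if not outcome.fixed:
        escalate(outcome)
        outcome.escalated = True
    return outcome


if __name__ == "__main__":
    escalations: List[Outcome] = []
    known_fixes = {"disk_space_low": lambda alert: True}  # stand-in workflow
    result = handle_alert(
        Alert(host="node-1", check="disk_space_low"),
        known_fixes,
        gather_diagnostics=lambda alert: ["df -h output (placeholder)"],
        escalate=escalations.append,
    )
    print(result.fixed, result.escalated)
```

Collecting diagnostics before any fix is attempted means that whoever receives an escalation sees the state of the system as it was when the alert fired.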

