Incidents aren't fun, but they happen and are often great learning experiences. This time, speakers from Stripe, Fastly, SendGrid, and SigOpt will talk about the lifecycles of incidents. Learn how you can be better prepared, from educating your engineers, to handling the fires, to the review process afterwards!
6:30-7:00 Doors Open & Food & Drinks
8:20-9:00 Social & Networking Time
Amy Nguyen (Infrastructure Engineer at Stripe, https://twitter.com/amyngyn)
Talk: Big Red Button: How Stripe Automates Incident Management
When an incident starts, ten different things need to happen at once. You need to get an incident commander, you need to get all the right people in the room, you need to mitigate the incident, and you need to stay organized. At Stripe, we've built a tool for automating as many of the routine tasks as possible so responders can focus on what humans do best. In this talk, I'll show you the Big Red Button, a web form that sends emails, creates JIRA tickets, opens Slack channels, sends pages, and more. We'll talk about the unique constraints of this tool (for example, how much incident metadata do you ask for up front?) and how our incident management philosophy influenced our design.
Alexandra Johnson (Tech Lead at SigOpt, https://twitter.com/alexandraj777)
Talk: How to Lead a Disaster Recovery Exercise For Your On Call Team
On-call teams at startups have three big problems: they're small, they cover a wide breadth of infrastructure, and, as a consequence of those first two points, they usually lack the bandwidth to write and maintain documentation for a suite of devops tools. At SigOpt, our on-call team tackles these challenges with a biannual "disaster recovery exercise", a simulated outage that provides a crash course in finding and using the tools we need to keep our service up and running. In this talk, I'll cover what a disaster recovery exercise is, how we plan ours, and what the benefits are for our team.
Connie-Lynne Villani (Director of Incident Management at Fastly, https://twitter.com/clynnexx)
Talk: Who are these people, anyway: roles and expectations during an incident.
Lots of people want to help during an incident, but aren't sure what to do. Clearly defining roles and their jobs before anything goes wrong helps people organize quickly, reducing mitigation time and streamlining effort. At Fastly, we train our teams in incident response ahead of time, and we match incident-specific jobs to daily work. In this talk, I'll discuss who does what during an incident, and why we put managers on call.
Sue Pomeroy (Program Manager for Information Services at SendGrid, https://twitter.com/sueallspaw)
Talk: Incident Management is Made of People
Post-incident review is a story about people. This talk is a story about how we talk about incidents: when we’re seeing indicators of an issue, when we’re generating and investigating hypotheses about what’s happening in our systems, and when we’re learning from the way we responded to it. Incidents are all about uncertainty, and in order to learn from them, we need to understand how people adapt in the middle of the fray.