Leading companies such as Google, Heroku, and PagerDuty have developed successful incident management practices based on the public safety world's Incident Command System (ICS).
Brent Chapman will be presenting a talk that offers key lessons learned by those organizations, along with a few war stories.
The key lessons that will be discussed include:
* Incident response is a critical SRE capability
* We need to explicitly distinguish between "normal" and "emergency" operations
* ICS principles apply well to our incidents
* Simple modifications to public safety ICS practices make them better suited to our needs
* Certain communications tools are more effective for incident response
* Checklists are very powerful and under-appreciated tools
* Blameless postmortems are a key to improving incident response
* Senior managers can inadvertently disrupt incident response just by showing up
Brent Chapman is an expert in emergency management and in helping organizations prepare for and learn from emergencies, drawing on a strong background in IT infrastructure and site reliability engineering (SRE).
While Brent was part of the legendary SRE organization at Google, he created and launched the Incident Management at Google (IMAG) system that is now used throughout the company for emergency management. Brent is also a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.
Our meetings are scheduled for 7:30pm on the third Thursday of each month.
BayLISA includes system and network administrators across a range of skill levels. BayLISA meets to discuss topics of interest to system administrators and managers. The meetings are free and open to the public.
We always welcome presentation topics and volunteer speakers. Use the "Contact us" link on this page to get in touch with BayLISA's directors.