
PWL #69: Metastable Failures in Distributed Systems

Hosted by Max P. and 2 others

Details

• What we'll do

Main Event: Metastable Failures in Distributed Systems with David Murray! (https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf)

We describe metastable failures—a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework.

We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad-hoc techniques developed post-factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.

(bonus paper: "Those found responsible have been sacked" - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.623.5749&rep=rep1&type=pdf)

• Important to know

As a chapter of Papers We Love, we abide by and enforce the PWL Code of Conduct (https://github.com/papers-we-love/seattle/blob/master/code-of-conduct.md) at our events. Please give it a read, plan on acting like an adult, and involve one of the organizers if you need help.

Stop slacking and join us in the #seattle channel at https://papersweloveslack.herokuapp.com!

If you have a paper you'd like to present, or even just a mini, please hit up one of the organizers :) We're always looking for more presenters.

Papers We Love @ Seattle