Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

Come have lunch with Papers We Love and enjoy good food and geeky talk. We'll be discussing the paper "Gray Failure: The Achilles’ Heel of Cloud-Scale Systems" from Microsoft Research. In it, the authors argue that the major breakdowns in cloud-scale systems tend to come not from hard (fail-stop) failures but from soft failures, which may be only partially detectable and are much harder to compensate for:

"Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences."

You can get a free copy of the paper here:

As always, please feel free to attend if the subject matter interests you, whether or not you have read the paper ahead of time. Reading it in advance is helpful, but don't let not having read it stop you from coming!

