How Facebook Architects Around Silent Data Corruption

Seems like Facebook’s computation facilities have become so massive that it is starting to see intermittent errors in places you wouldn’t expect <https://www.nextplatform.com/2021/03/01/facebook-architects-around-silent-data-corruption/>:

    Engineers found that many of the cascading errors are the result of CPUs in production but not always due to the “soft errors” of radiation or synthetic fault injection. Rather, they find these can happen randomly on CPUs in repeatable ways. Although ECC is useful, this is focused on problems in SRAM but other elements are susceptible. The Facebook engineering team that reported on these problems finds that CPU silent data corruptions are actually orders of magnitude higher than soft-errors due to a lack of error correction in other blocks.

The article mentions examples like a computation of 2×3 sometimes returning an answer other than 6, or a file size intermittently (and incorrectly) being computed as zero, leading to the file data not being copied. Such occurrences may be extremely rare, but when you are doing things on the scale that Facebook does, they start to become all too common.

Facebook has published a paper (linked in the report) on their research into the problem. They have worked out some techniques to try to mitigate these issues, but of course they come at a cost.
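To make the cost of mitigation concrete, here is a minimal Python sketch of one generic approach: run the same computation several times and accept only a majority answer. This is an illustration only, not the technique described in Facebook's paper; the names run_redundantly and multiply are made up for this example.

```python
from collections import Counter


def run_redundantly(func, args, copies=3):
    """Run func(*args) `copies` times and return (majority_value, unanimous).

    Results must be hashable so they can be counted.
    Raises RuntimeError if no value is returned by a strict majority.
    """
    results = [func(*args) for _ in range(copies)]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= copies:
        raise RuntimeError(f"no majority among results: {results}")
    # unanimous is False when at least one copy disagreed with the majority
    return value, count == copies


def multiply(a, b):
    # Hypothetical stand-in for any computation we want to cross-check.
    return a * b


value, unanimous = run_redundantly(multiply, (2, 3))
```

Even this toy version triples the work for every checked call. Since the quoted report says the faults can be repeatable on a given CPU, a real check would also need to place the copies on different cores or machines, which this sketch does not attempt.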

On Thu, 4 Mar 2021 12:56:59 +1300, I wrote:
    Seems like Facebook’s computation facilities have become so massive that it is starting to see intermittent errors in places you wouldn’t expect <https://www.nextplatform.com/2021/03/01/facebook-architects-around-silent-data-corruption/>:

Not just Facebook, but Google too <https://www.theregister.com/2021/06/04/google_chip_flaws/>. In their paper, Google's engineers suggest that these intermittent errors are becoming more frequent as ever-more-complex chip designs push closer to the limits of what is physically possible. It seems that some particular chips are more prone to this than others, but identifying them requires more extensive testing than manufacturers normally do.
Lawrence D'Oliveiro