
Instructive to read this notice <https://rimuhosting.com/maintenance.jsp?server_maint_oid=5241959612&sn_oid=4781959612> about progress in fixing an issue in the Dallas co-lo centre for Rimu hosting.

As a customer, I wasn’t affected (though I could have been), and I have no complaints with the way this was handled--it was great that they kept customers like me apprised of developments like this.

The interesting part is how a single initial failure can lead to a cascade of other failures. From the description, the original problem was a breakdown of a transformer feeding power from the grid. To allow it to be repaired, the power was switched to a backup generator. The generator then shut down with a belt problem. So power went to UPS batteries. These ran flat before the generator could be brought back into operation. So various parts of the facility were left without power.

After power was restored, it was discovered that some (network?) switches had failed to come up. So a bunch of physical boxes holding customers’ VMs had to be physically moved to a different location where they could get connectivity again. From that point, the disruption was (mostly) over as far as customers were concerned, though there was obviously still work to be done to get the infrastructure back to its normal state.

I think that a lot of theoretical statistical modelling of the odds of system failures tends to assume that failures of individual components will be independent of each other. Yet time and time again in the real world, we discover that this is not the case.
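
To put rough numbers on the independence point, here is a minimal sketch of the transformer -> generator -> UPS chain. Every probability in it is a made-up illustrative assumption, not anything taken from the Rimu notice, and the function and variable names are my own. It only shows how far apart the two styles of estimate can end up: one multiplies the standalone failure odds of each component, the other uses odds conditioned on the previous stage already having failed.

```python
def chain_failure_probability(p_first, p_second, p_third):
    """Probability that all three stages of the power chain fail.

    The arithmetic is the same in both models; what differs is whether the
    second and third arguments are standalone (marginal) odds or odds
    conditioned on the previous stage having already failed.
    """
    return p_first * p_second * p_third


# --- Illustrative assumptions only (not measured data) ---
P_TRANSFORMER = 0.01         # transformer fails in some given period
P_GENERATOR_ALONE = 0.01     # generator fails, considered in isolation
P_UPS_ALONE = 0.01           # UPS fails to carry the load, in isolation

# Conditional odds: a generator run for hours under full load is more likely
# to break, and batteries sized for minutes are quite likely to run flat
# once the generator is already down.
P_GENERATOR_GIVEN_TRANSFORMER_DOWN = 0.05
P_UPS_FLAT_GIVEN_GENERATOR_DOWN = 0.5

independent_estimate = chain_failure_probability(
    P_TRANSFORMER, P_GENERATOR_ALONE, P_UPS_ALONE
)
dependent_estimate = chain_failure_probability(
    P_TRANSFORMER,
    P_GENERATOR_GIVEN_TRANSFORMER_DOWN,
    P_UPS_FLAT_GIVEN_GENERATOR_DOWN,
)

print(f"independent model: {independent_estimate:.2e}")
print(f"dependent model:   {dependent_estimate:.2e}")
print(f"ratio:             {dependent_estimate / independent_estimate:.0f}x")
```

With these assumed inputs the dependent estimate comes out a couple of orders of magnitude worse than the independent one. The exact figures mean nothing; the point is that the backups only get exercised when something upstream has already gone wrong, so their failures are not independent events.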
Lawrence D'Oliveiro