It Never Rains But It Pours

24 Jul 2024

It is instructive to read this notice
<https://rimuhosting.com/maintenance.jsp?server_maint_oid=5241959612&sn_oid=4781959612>
about progress in fixing an issue at the Dallas co-lo centre used by
RimuHosting.

As a customer, I wasn’t affected (though I could have been), and I have
no complaints about the way this was handled; it was great that they
kept customers like me apprised of developments.

The interesting part is how a single initial failure can lead to a
cascade of other failures. From the description, the original problem
was a breakdown of a transformer feeding power from the grid. To allow
it to be repaired, the power was switched to a backup generator. The
generator then shut down with a belt problem. So power went to UPS
batteries. These ran flat before the generator could be brought back to
operation. So various parts of the facility were left without power.

After power was restored, it was discovered that some (network?)
switches had failed to come up. So a bunch of physical boxes holding
customers’ VMs had to be physically moved to a different location where
they could get connectivity again. From that point, the disruption was
(mostly) over as far as customers were concerned, though there was
obviously still work to be done to get the infrastructure back to its
normal state.

I think that a lot of theoretical statistical modelling of the odds of
system failures tends to assume that failures of individual components
will be independent of each other. Yet time and time again in the real
world, we discover that this is not the case.
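
To put rough numbers on that, here is a toy calculation in Python. It is
only a sketch under assumptions of my own: the three layers (grid feed,
generator, UPS), the idea of a shared "stressed" state (say, during
emergency repair work) and every probability in it are made up for
illustration, and none of them come from the notice. Both estimates use
the same per-component failure rates; the only difference is whether the
failures are treated as independent.

```python
import math

# All figures below are hypothetical, chosen only for illustration.
P_STRESS = 0.01                     # chance the site is in a "stressed" state
P_FAIL_NORMAL = [0.01, 0.01, 0.01]  # grid feed, generator, UPS in normal state
P_FAIL_STRESSED = [0.5, 0.5, 0.5]   # the same three layers in stressed state


def marginal_failure_probs(p_stress, normal, stressed):
    """Overall failure probability of each layer, averaged over both states."""
    return [p_stress * s + (1 - p_stress) * n for n, s in zip(normal, stressed)]


def outage_if_independent(marginals):
    """Chance that every layer fails, assuming the failures are independent."""
    return math.prod(marginals)


def outage_with_common_cause(p_stress, normal, stressed):
    """Chance that every layer fails when all layers share one site state."""
    return (p_stress * math.prod(stressed)
            + (1 - p_stress) * math.prod(normal))


marginals = marginal_failure_probs(P_STRESS, P_FAIL_NORMAL, P_FAIL_STRESSED)
independent = outage_if_independent(marginals)
correlated = outage_with_common_cause(P_STRESS, P_FAIL_NORMAL, P_FAIL_STRESSED)

print(f"independence assumed: {independent:.3e}")
print(f"shared cause:         {correlated:.3e}")
print(f"ratio:                {correlated / independent:.0f}x")
```

With these invented figures, the estimate that assumes independence comes
out a few hundred times lower than the one with a shared cause, even
though each component fails equally often in both models.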

Lawrence D'Oliveiro
