When A Safety Check Becomes A Point Of Failure...

All inbound networking for GCE instances, load balancers and VPN tunnels enter via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated. The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase. In other words, the safety system for validating configuration updates before allowing deployment suffered a failure which blocked all configuration updates thereafter. I guess the answer is to add a watchdog for the canaries ... how many more metaphors can I mix in? <http://www.theregister.co.uk/2017/02/10/google_cloud_outage_explanation/>
participants (1)
-
Lawrence D'Oliveiro