Learning from the Facebook Global Outage Caused by Misconfiguration

In a recent blog post, Facebook revealed that its hours-long global outage was caused by faulty configuration changes on its routers.

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

Cloudflare also has an interesting blog post describing what they saw from their perspective: Understanding How Facebook Disappeared from the Internet.

What confounds me is how often misconfiguration errors still bite us. Even large organizations with seemingly plentiful resources and processes are as prone to these incidents as the rest of us.

How did the change process break down?
Was it a planned change or an ad-hoc change?
Was the change request not reviewed, or did it lack sufficient detail?
Was a post-implementation review carried out?
Were the configuration changes captured for post-implementation review?

More often than not, the change process in most organizations is simply a rubber stamp process:

“Does this planned change make sense? Go ahead, and don’t mess it up.”

For me, what really saves a lot of headaches is making sure the last two questions above are covered:

always have someone else peer-review the changes
capture, identify and highlight the changes made so it’s easy to provide real feedback and what-ifs

Developers have been dealing with configuration and change management since the dawn of programming. As software development has matured, so have the processes and tooling that help minimize human error. Specifically:

peer code review
software version control to automatically capture and highlight the changes made

Both of these go hand-in-hand. Without an easy and automated way to capture and identify configuration changes, peer code review is difficult to do well, and more importantly, becomes very time-consuming.
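
As a rough illustration of what "capture and highlight the changes" can look like, here is a minimal sketch that diffs two configuration snapshots with Python's standard difflib module. The device name and the router configuration snippets are made up for the example; they are not taken from the Facebook incident.

```python
import difflib


def config_diff(before: str, after: str, name: str = "device") -> str:
    """Return a unified diff between two configuration snapshots."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical configuration snapshots, for illustration only.
BEFORE = """\
router bgp 64500
 neighbor 192.0.2.1 remote-as 64501
 network 198.51.100.0 mask 255.255.255.0
"""

AFTER = """\
router bgp 64500
 neighbor 192.0.2.1 remote-as 64501
"""

if __name__ == "__main__":
    # The removed "network" line shows up prefixed with "-",
    # which is exactly what a reviewer needs to see.
    print(config_diff(BEFORE, AFTER, name="edge-router-1"))
```

A reviewer looking at this diff sees one removed line rather than two full configurations, which is what keeps peer review quick enough to actually happen.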

In operations, your infrastructure configuration is your code.

Do you know when your “code” is getting changed?
Do you know what is getting changed? What was it before?
Do you know who is making the changes?
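
To make these three questions concrete, the sketch below compares two sets of configuration snapshots and records which devices changed, when the change was detected, and who is believed to have made it. It is only an illustration under stated assumptions, not a description of how any particular product works: the ChangeRecord type, the detect_changes function and the changed_by value are hypothetical, and in practice the "who" would have to come from something like a change ticket or authentication logs.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    """Hypothetical record answering: what changed, when, and by whom."""
    device: str
    changed_by: str
    detected_at: datetime
    old_digest: str
    new_digest: str


def digest(config: str) -> str:
    """Fingerprint a configuration so snapshots can be compared cheaply."""
    return hashlib.sha256(config.encode("utf-8")).hexdigest()


def detect_changes(previous: dict[str, str],
                   current: dict[str, str],
                   changed_by: str) -> list[ChangeRecord]:
    """Compare two {device: config} snapshots and report changed devices.

    A device missing from one snapshot is treated as having an empty
    configuration, so added and removed devices are reported as changes.
    """
    now = datetime.now(timezone.utc)
    records = []
    for device in sorted(set(previous) | set(current)):
        old = digest(previous.get(device, ""))
        new = digest(current.get(device, ""))
        if old != new:
            records.append(ChangeRecord(device, changed_by, now, old, new))
    return records


if __name__ == "__main__":
    # Made-up snapshots for illustration.
    before = {"router-1": "hostname r1", "router-2": "hostname r2"}
    after = {"router-1": "hostname r1", "router-2": "hostname r2-new"}
    for record in detect_changes(before, after, changed_by="alice"):
        print(record.device, record.changed_by, record.detected_at.isoformat())
```

The digests only answer whether a device changed; showing what it was before still requires keeping the full snapshots so they can be diffed.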

And when a planned change is made, do you automatically capture the actual changes across all the devices and systems in your environment, so that you can do an adequate peer review and, hopefully, prevent human errors like this one?

This is what we do at SIFF.IO.

To learn more about how SIFF can help empower your configuration and change management, watch our 3-minute video to find out “What the #%&$ changed?!”