{"id":2951,"date":"2021-10-05T10:37:42","date_gmt":"2021-10-05T17:37:42","guid":{"rendered":"https:\/\/44.203.207.232\/?p=2951"},"modified":"2021-10-05T10:37:44","modified_gmt":"2021-10-05T17:37:44","slug":"learning-facebook-instragram-outage-mis-configuration","status":"publish","type":"post","link":"https:\/\/webdev.siff.io\/learning-facebook-instragram-outage-mis-configuration\/","title":{"rendered":"Learning from Facebook Global Outage Caused by Mis-Configuration"},"content":{"rendered":"\n
In a recent blog post<\/strong><\/a>, Facebook revealed that its hours-long global outage was caused by faulty configuration changes on its backbone routers.<\/p>\n\n\n\n \u201cOur engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.<\/em>\u201d<\/p>\n\n\n\n Cloudflare also has an interesting blog post describing what they saw from their perspective: Understanding How Facebook Disappeared from the Internet<\/a>.<\/p>\n\n\n\n What confounds me is how often misconfiguration errors still bite us. Even large organizations with seemingly plentiful resources and processes are as prone to incidents as the rest of us.<\/p>\n\n\n\n More often than not, the change process is simply a rubber stamp:<\/p>\n\n\n\n \u201cDoes this planned change make sense? Go ahead, and don\u2019t mess it up.\u201d<\/p>\n\n\n\n For me, what really saves a lot of headaches is ensuring that the last two questions above are covered:<\/p>\n\n\n\n Developers have been dealing with configuration and change management since the dawn of programming. As software development has matured, so have the processes and tooling that help minimize human error. Specifically,<\/p>\n\n\n\n Both of these go hand in hand. Without an easy, automated way to capture and identify configuration changes, peer code review is difficult to do well and, more importantly, becomes very time-consuming.<\/p>\n\n\n\n In operations, your infrastructure configuration is your code. 
<\/p>\n\n\n\n And when a planned change is made, do you automatically capture the actual changes across all the devices and systems in your environment so that you can do an adequate peer review and, hopefully, prevent human errors like this one?<\/p>\n\n\n\n This is what we do at SIFF.IO<\/strong><\/a>.\u00a0<\/p>\n\n\n\n