Configuration Outages That We Should Be Learning From

In this week’s blog, we look back at some major infrastructure outages in the news and at the cost of human error and misconfiguration. We will look at the damage, the root causes, and the remedies that SIFF could have provided, in addition to company-instituted best practices, to prevent future outages.

Outages occur all the time, and unfortunately we have become desensitized to the realities behind each incident. From time to time, without raising too much alarm or causing undue panic, it’s important to put a magnifying glass to what actually befalls the corporations and customers hit by outages. Today, we explore one outage and its effects on CenturyLink, a communications and network services company.

I have an elderly parent with cancer, who is barely mobile. The other day, he was lying on his back on the floor of his bedroom, resting and chatting with me while I was visiting him from out of town. When he was done “stretching” and relaxing, he couldn’t lift himself back up, and I needed to assist him. Imagine for a second if this had happened when no one was around. He’d have to call 911. Now imagine if over 800 of these desperate calls did not go through. Sadly, this did occur, and much worse. Not only did at least 886 calls to 911 go undelivered, but 17 million customers across 29 states lacked reliable access to 911. Sources do not tell us the human toll of this outage, but one can only imagine!

According to the FCC, the 37-hour outage at CenturyLink began on December 27, 2018, and was caused by an equipment failure that was exacerbated by a network configuration error. CenturyLink estimates that more than 12.1 million phone calls on its network were blocked or degraded due to the incident.

The problems began when “a switching module in CenturyLink’s Denver, Colorado node spontaneously generated four malformed management packets,” the FCC report said. Malformed packets “are usually discarded immediately due to characteristics that indicate that the packets are invalid,” but that didn’t happen in this case. The switching module sent these malformed packets “as network management instructions to a line module,” and the packets “were delivered to all connected nodes,” the FCC said. Each node that received the packet then “retransmitted the packet to all its connected nodes.” The company “identified and removed the module that had generated the malformed packets.” But the outage continued because “the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node.”
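To make the "echo" effect concrete, here is a toy simulation of the retransmission pattern the FCC describes. It is only a sketch under stated assumptions: the three-node topology, the `flood` function, and the `hop_limit` parameter are all hypothetical, and the hop limit is our own illustrative safeguard, not something the quoted report attributes to CenturyLink's equipment.

```python
from collections import deque

# Hypothetical three-node mesh; the real network topology is not described in the source.
TRIANGLE = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}


def flood(topology, origin, hop_limit=None, max_events=10_000):
    """Count transmissions when every node re-sends a packet to all its neighbors.

    hop_limit=None models a packet that never expires. A number models a
    counter that is decremented at each hop (an assumed safeguard).
    max_events stops the simulation so the unbounded case terminates.
    """
    # The origin sends the packet once and never sends again.
    queue = deque((neighbor, hop_limit) for neighbor in topology[origin])
    transmissions = len(queue)
    while queue and transmissions < max_events:
        node, hops_left = queue.popleft()
        if hops_left is not None:
            hops_left -= 1
            if hops_left <= 0:
                continue  # packet expires here and is not retransmitted
        for neighbor in topology[node]:
            queue.append((neighbor, hops_left))
            transmissions += 1
    return transmissions


print("no expiry:", flood(TRIANGLE, "A", max_events=1_000))
print("hop limit 3:", flood(TRIANGLE, "A", hop_limit=3))
```

Even though the origin transmits only once, the unbounded run keeps generating traffic until the simulation cap is reached, which mirrors why pulling the faulty module did not end the outage. With a hop limit, the flood dies out on its own.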

To remedy the costly outage, CenturyLink said that it “has taken a variety of steps to help prevent the issue from recurring, including disabling the communication channel these malformed packets traversed during the event, and enhancing network monitoring.” However, the FCC report said that several best practices could have prevented the outage or lessened its negative effects. For example, the FCC said that CenturyLink and other network operators should disable system features that are not in use.
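The "disable what you don't use" recommendation is easy to state and easy to forget, so it helps to express it as a check that can run automatically. The sketch below is a minimal illustration, not SIFF's implementation: the `Feature` record, the feature names, and the idea that usage is known from an inventory are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    enabled: bool
    in_use: bool  # assumption: usage is known from an inventory or traffic data


def enabled_but_unused(features):
    """Return the names of features that are switched on but serve no purpose."""
    return sorted(f.name for f in features if f.enabled and not f.in_use)


# Hypothetical feature inventory for a single network node.
node_features = [
    Feature("proprietary_mgmt_channel", enabled=True, in_use=False),
    Feature("snmp_monitoring", enabled=True, in_use=True),
    Feature("legacy_telnet", enabled=False, in_use=False),
]

findings = enabled_but_unused(node_features)
for name in findings:
    print(f"VIOLATION: {name} is enabled but not in use")
```

Run on a schedule against every node, a check like this turns a written recommendation into something that is continuously verified.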

Source: https://www.toolbox.com/tech/networking/news/major-network-outages-in-2020-what-could-have-prevented-them/

What becomes paramount is ensuring that best practices and recommendations are actually implemented and continuously monitored. Anything less is simply documentation sitting on a table or a passing verbal commitment. Proactive steps need to be taken, monitored, and regularly updated. One wouldn’t benefit from a prescribed medication if it’s simply sitting in the medicine cabinet. One needs to actually take the medicine!

If we look back at other recent outages (below), misconfiguration is a constant source of major incidents. Now imagine the number of incidents inside an organization that are never publicly visible, and how much time and how many resources are wasted on repeated problems that could be avoided by monitoring configurations for compliance with best practices and recommendations.

No.	Name	Date	Description
1	CenturyLink (communication, network and related services)	December 2018	22 million subscribers in 39 states were affected by an outage and 17 million customers across 29 US states were unable to reach emergency 911 services and at least 886 calls to 911 were not delivered. These subscribers and others in the UK and Singapore lost connectivity for two days. Additionally, customers could not make ATM withdrawals, access sensitive patient healthcare records, and more. The outage was attributed to equipment failure exacerbated by a network configuration error; redundant systems did not take over. (In 2015, CenturyLink was fined $16m for a six-hour 911 outage.)
2	Facebook	March 2019	Facebook’s first, but not only, outage of 2019 lasted 14 hours and was reportedly the result of a “server configuration change.” Things happen, but why didn’t redundant systems take over?
3	AeroData (weight and balance calculations for flight planning)	April 2019	A “mere” 40-minute outage delayed close to 3,000 flights. Affected airlines included Southwest, SkyWest, United, Delta, United Continental, JetBlue and Alaska Airlines. The outage was referred to as a “technical issue.” Although recovery was fairly quick, damage was significant. Was the outage due to a misconfiguration? In any case, redundant systems did not kick in.
4	Microsoft Azure	May 2019	A nearly three-hour global outage affecting core Microsoft cloud services, including compute, storage, an application development platform, Active Directory and SQL database services. Cloud-based applications, including Microsoft 365, Dynamics and Azure DevOps, were also impacted. Microsoft stated that the outage was caused by “a nameserver delegation change affecting DNS resolution, harming the downstream services” and occurred “during the migration of a legacy DNS system to Azure DNS.”
5	Salesforce	May 2019	A 15-hour global outage stemmed from a permissions failure that allowed users of Salesforce’s Pardot marketing automation software to see and edit the entirety of their company’s data on the system. Salesforce cut access twice to the larger Salesforce Marketing Cloud to stop exposure of sensitive information and to handle what was discovered to be a database script error. Sales agents and marketers around the world lost access to customer information. Restoration of permissions was not simple: customers’ admins had to set up permissions again. Some could do so automatically, while others had to restore them manually.
6	Google Cloud Platform	June 2019	A four-hour outage affecting many millions of users including tech brands that use Google Cloud as well as Google’s own services such as YouTube, Gmail, Google Search, G Suite, Google Drive, and Google Docs. The problem occurred during maintenance operations and was caused by a software bug combined with two misconfigurations, which Google characterized as “normally-benign” misconfigurations. In our experience misconfigurations are never benign. If they don’t immediately cause an outage, they eventually will.
7	Verizon	June 2019	A roughly three-hour worldwide outage of major websites like Google, Amazon, and Reddit did not originate with Verizon, but it was the company that allowed the fault to propagate. A misconfiguration in the routing optimization software of a small internet service provider led to incorrect routes that were eventually taken up by Verizon, which did not have software in place to block and filter them. These faulty routes caused massive volumes of traffic to be directed through small networks not equipped to deal with it, leading to packet loss, unavailability, and disruption of services at major websites.

Source: https://www.continuitysoftware.com/blog/it-resilience/19-of-the-worst-it-outages-in-2019-a-recap-of-being-let-down/

Having a configuration monitoring and compliance solution that can collect from all sources (networks, apps, servers, storage, VMs, containers, cloud) allows you to improve the governance of these services and devices in a consolidated and consistent manner. All the lessons learned can be implemented, consolidated, and improved over time, rather than checked once during an annual, time-consuming audit.
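As a rough illustration of what continuous checking across sources can look like, the sketch below compares the latest collected settings for each source against an approved baseline and reports only what has drifted. Everything in it is hypothetical: the source names, the setting keys, and the `config_drift` and `drift_report` helpers are invented for the example and do not describe SIFF's actual API.

```python
MISSING = "<missing>"  # placeholder for a setting present on only one side


def config_drift(baseline, current):
    """Return {setting: (expected, actual)} for every setting that differs."""
    drift = {}
    for key in sorted(baseline.keys() | current.keys()):
        expected = baseline.get(key, MISSING)
        actual = current.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def drift_report(baselines, snapshots):
    """Check every source with a baseline and keep only those that drifted."""
    report = {}
    for source, baseline in baselines.items():
        drift = config_drift(baseline, snapshots.get(source, {}))
        if drift:
            report[source] = drift
    return report


# Hypothetical approved baselines and latest collected snapshots.
baselines = {
    "core-router-1": {"route_filtering": "enabled", "unused_mgmt_channel": "disabled"},
    "dns-primary": {"delegation": "ns1.example.net"},
}
snapshots = {
    "core-router-1": {"route_filtering": "disabled", "unused_mgmt_channel": "disabled"},
    "dns-primary": {"delegation": "ns1.example.net"},
}

report = drift_report(baselines, snapshots)
for source, changes in report.items():
    for setting, (expected, actual) in changes.items():
        print(f"{source}: {setting} expected {expected!r}, found {actual!r}")
```

Because the same comparison applies to a router, a DNS server, or a cloud service, one report can cover every source, and it can run after each collection instead of once a year.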

Best practices and recommendations are only actionable if they are automated.

To learn more about how SIFF can help:

troubleshoot complex outages by identifying config changes related to the incident
reduce and prevent incidents by improving the change process
continuously analyze configuration for policy compliance and governance

Visit us at https://siff.io