Emergency Support for IT and Ops

Imagine going to the ER and the attending doctor is unable to ask you what happened. All he has to work with is a battery of tests he can run to narrow down the problem. This is the environment IT and Operations Support teams frequently work in when they lack visibility into configuration changes in their environment.


MAGIC is often the perception of a desirable outcome without an understanding or appreciation of the underlying work required to produce it. We all want MAGIC without having to do the work. But without tackling the fundamental challenges, MAGIC and all the amazing things that come from it cannot happen. Understanding your IT and Operations infrastructure configuration is one of those fundamentals.


To learn more about how SIFF can help empower your configuration and change management, watch our 3-minute video to find out “What the #%&$ changed?!”

Log4Shell: Finding Where You Are Vulnerable

I’m sure by now you are well aware of the Log4j 2 vulnerability which is putting an unprecedented number of companies at risk. In case you haven’t heard, here are a couple of quick links to get you up-to-date and advise you on how to mitigate it:

The big question for those directly responsible for the security of your company, or indirectly responsible as an application owner or operational support, is: where are the vulnerabilities located? Which applications? Which servers? Which tools are susceptible to Log4Shell? And, more importantly, how confident are you that you have found every instance of it?

Using the SIFF configuration monitoring platform, you can quickly discover the location of the Log4j vulnerability: a SIFF Service Definition (SD) discovers and identifies the Java processes that are using Apache Log4j, and a SIFF Policy Definition (PD) then validates whether each instance is compliant (any instance running Log4j version 2.14.1 or earlier is flagged as vulnerable). Violations are flagged, users are notified, and the platform can be configured to trigger automated remediation actions.
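
For illustration only, here is a minimal stand-alone sketch of the same idea; it is not the actual SIFF SD/PD. It simply walks a filesystem looking for log4j-core jars by filename and flags any version at or below 2.14.1, so unlike the SD/PD approach it will miss shaded or nested jars and anything only visible inside a running Java process.

```python
# Conceptual stand-alone sketch (not the actual SIFF SD/PD): walk a filesystem,
# find log4j-core jars by filename, and flag versions at or below 2.14.1.
import os
import re

VULNERABLE_MAX = (2, 14, 1)  # CVE-2021-44228 affects Log4j 2 up to and including 2.14.1
JAR_PATTERN = re.compile(r"log4j-core-(\d+)\.(\d+)\.(\d+)\.jar$")

def find_vulnerable_log4j(root):
    """Yield (path, version string) for every vulnerable log4j-core jar under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = JAR_PATTERN.search(name)
            if match:
                version = tuple(int(part) for part in match.groups())
                if version <= VULNERABLE_MAX:
                    yield os.path.join(dirpath, name), ".".join(match.groups())

if __name__ == "__main__":
    for path, version in find_vulnerable_log4j("/opt"):
        print(f"VULNERABLE: {path} (log4j-core {version})")
```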

To make this easy, we have created this SD/PD pair and included it in the built-in SIFF community library. You simply activate these definitions, and they will automatically examine any SIFF-managed device for the Log4Shell vulnerability and notify you.

If you are interested in learning more about using SIFF to ensure security and configuration compliance as well as how SIFF can help monitor configuration changes in your environment, learn more here.

Network Automation != Network Compliance

In a recent study by EMA, The State of Network Automation: Configuration Management Obstacles are Universal, the report indicated significant dissatisfaction with the current state of Configuration Management, especially among large network operators. It revealed that 3 out of 4 IT organizations worry that configuration changes are likely to lead to performance problems and security issues. These errors can impact any organization, even those with a leading reputation for network operations such as Facebook, which suffered a global outage in October 2021 and publicly attributed it to a bad network configuration change.

The study goes on to prescribe that Network Automation is the key path towards improving Network Compliance and Audits. Although automation tools do help provide more consistency and reduce human configuration errors, this path ignores critical attributes of network operations in the real world. Specifically:

  • No networks are fully automated. Most have people making manual configuration changes to the infrastructure. 
  • A large volume of planned vs unplanned changes. Ideally, configuration changes follow the change management process; in most organizations, however, a large number of changes are made directly, bypassing the process for various reasons.
  • Authorized vs unauthorized changes. This includes changes due to security intrusions/hackers as well as internal personnel making changes that are implicitly “allowed”.
  • Multiple automation tools. Most environments have multiple tools used by different functional groups that make configuration changes including vendor-specific management tools and Element Management Systems (EMS). 

The real issue is that Network Compliance and by extension, configuration monitoring, should not be conflated with Network Automation. They serve different purposes. Network Compliance and Audits need to ensure the correct configuration on actual devices and not just “golden configs” defined in automation tools. In other words,

The “configuration truths” are on the actual devices, not in a CMDB or an inventory system, and not in network management or network automation tools.

The Network Compliance policies and audits must validate what is on the actual devices and verify all changes made to those devices, regardless of whether they are manual, automated, or, in the worst case, made by an intruder.

At SIFF.IO, this is the approach we use to ensure Network Compliance:

  • SIFF collects and monitors every configuration change, whether it is a manual change or one initiated by Network Automation.
  • SIFF applies Compliance Policies so that any misconfiguration is immediately flagged and users are notified. This includes checking existing configs as well as newly detected configuration changes, which allows new vulnerabilities to be identified in existing configs (a simplified sketch of this kind of policy check follows this list).
  • SIFF integrates with one or multiple change management systems used by different functional groups to identify planned vs unauthorized changes.
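
As a rough illustration of the second point, here is a simplified sketch of what applying compliance policies to detected configuration changes can look like. The event fields and the two example rules are assumptions made for the example; they are not SIFF’s actual data model or policy language.

```python
# Hypothetical illustration of applying compliance policies to detected
# configuration changes; the event fields and rules are simplified examples.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChangeEvent:
    device: str
    setting: str
    new_value: str
    source: str          # "automation", "manual", or "unknown"
    change_request: str  # empty if no change request is associated

# A policy is just a named predicate over a change event.
POLICIES: list[tuple[str, Callable[[ChangeEvent], bool]]] = [
    ("telnet must stay disabled",
     lambda e: not (e.setting == "service.telnet" and e.new_value == "enabled")),
    ("every change needs a change request",
     lambda e: bool(e.change_request)),
]

def evaluate(event: ChangeEvent) -> list[str]:
    """Return the names of all policies the change event violates."""
    return [name for name, rule in POLICIES if not rule(event)]

event = ChangeEvent("core-sw-01", "service.telnet", "enabled", "manual", "")
for violation in evaluate(event):
    print(f"{event.device}: VIOLATION - {violation}")
```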

With SIFF, you have visibility into all configuration changes across all sources (networks, servers, apps, cloud, VMs, and containers) to meet your security compliance and audit requirements. This change visibility is not limited to those that are planned or automated.

Configuration Monitoring and Compliance is different from Configuration Automation. There is certainly overlap, but the two should not be confused. To learn more about how SIFF.IO can help monitor all infrastructure configuration changes and ensure policy compliance,

Visit SIFF.IO to find out “What the #%&$ changed?!”

Configuration Change Instrumental to Addressing Fastly’s June 8 Outage

Here is another example of a very visible outage, this time experienced by Fastly. In this case, we highlight how identifying the triggering customer configuration change was instrumental in their ability to quickly identify, isolate, and restore their service.

On June 8, the CDN provider Fastly experienced an outage that disrupted 85% of their network service. The incident was triggered by a customer configuration change that exposed a bug introduced by an earlier software deployment on May 12.

The good news is that the outage was detected by the monitoring system within one minute; however, it still took 40 minutes to identify the configuration change. With complex systems designed with resiliency in mind, problems often do not present themselves immediately. Instead, they build up gradually over time and catch the operational support team unaware. Troubleshooting these kinds of problems can be extremely difficult, especially if you lack visibility into configuration changes throughout your environment, and specifically into the changes corresponding to completed change requests. Like most engineers, I can barely remember what I did yesterday, let alone a week ago.

The change and operational support processes can be greatly improved by implementing Configuration Change Monitoring to provide the necessary visibility into all configuration changes across the infrastructure: networks, servers, services, applications, virtualization, and cloud.

Configuration Change Monitoring is an invaluable tool to help identify and isolate the root cause of complex incidents by providing the details for all changes, whether they’re planned, unplanned or unauthorized.

Configuration Change Monitoring, used during the post-implementation review step, can also significantly reduce the number of incidents by catching errors early, before they become outages. By automatically capturing the actual configuration changes and making them available for peer review, human and automation errors can be greatly reduced.

If you’re responsible for your company’s infrastructure operations and interested in improving your incident response,
visit SIFF.IO to find out “What the #%&$ changed?!”

Salesforce Outage Linked to DNS Configuration Change

Yes, another example of an incident: in this case, a very visible global outage due to an infrastructure misconfiguration. If you are not familiar with the Salesforce outage, read more.

Before you sound the trumpet for more DevOps automation (of which I am a strong supporter), what I would like to highlight is the need for visibility into the configuration changes, or lack thereof, for completed change requests.

Visibility into all configuration changes across your entire infrastructure (networks, servers, apps, containers, and cloud) is essential for effective incident response, to quickly narrow down and isolate the root cause. Change monitoring, however, can also be used as a proactive tool to prevent incidents: by correlating configuration changes with planned change requests, it becomes easier to see exactly what configuration was changed by the completed work.

During the post-implementation review stage of the change process, reviewers can examine whether the resulting configuration changes contain errors or are missing entirely. This may provide just the chance needed to prevent an incident or outage from occurring.

Regarding the Salesforce DNS misconfiguration outage, the change was indeed automated, but a bug in the automation caused the problem. Making it easy to review the resulting configuration changes enables reviewers to check for errors, especially for manual changes.

If you’re responsible for your company’s infrastructure operations, come visit us at SIFF.IO to find out “What the #%&$ changed?!”


Configuration Outages That We Should Be Learning From

In this week’s blog, we look back at some major infrastructure outage news and the cost of human error and misconfiguration. We will look at the damage, the root causes, and the remedies that SIFF, in addition to company-instituted best practices, could have provided to prevent future outages.

Outages occur all the time, and unfortunately we have become desensitized to the realities behind these incidents. It’s important from time to time, without raising too much alarm or causing undue panic, to put a magnifying glass to the actual realities that befall corporations and customers hit by outages. Today, we explore one outage and its effects on CenturyLink, a communications and network services provider.

I have an elderly parent with cancer who is barely mobile. The other day, he was lying on his back on the floor of his bedroom, resting and chatting with me as I visited him from out of town. When he was done “stretching” and relaxing, he couldn’t lift himself back up, and I needed to assist him. Imagine for a second if this had happened when no one was around: he’d have to call 911. Now imagine if over 800 of these desperate calls did not go through. Sadly, this did occur, and much worse. Not only did 886 calls to 911 go undelivered, but 17 million customers across 29 states lacked reliable access to 911. Sources do not tell us the human toll of this outage, but one can only imagine!

According to the FCC, the 37-hour outage at CenturyLink began on December 27 and was caused by an equipment failure that was exacerbated by a network configuration error. CenturyLink estimates that more than 12.1 million phone calls on its network were blocked or degraded due to the incident. 

The problems began when “a switching module in CenturyLink’s Denver, Colorado node spontaneously generated four malformed management packets,” the FCC report said. Malformed packets “are usually discarded immediately due to characteristics that indicate that the packets are invalid,” but that didn’t happen in this case. The switching module sent these malformed packets “as network management instructions to a line module,” and the packets “were delivered to all connected nodes,” the FCC said. Each node that received the packet then “retransmitted the packet to all its connected nodes.” The company “identified and removed the module that had generated the malformed packets,” but the outage continued because “the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node.”

To prevent a recurrence of the costly outage, CenturyLink said that it “has taken a variety of steps to help prevent the issue from recurring, including disabling the communication channel these malformed packets traversed during the event, and enhancing network monitoring.” However, the FCC report said that several best practices could have prevented the outage or lessened its negative effects. For example, the FCC said that CenturyLink and other network operators should disable system features that are not in use.

Source: https://www.toolbox.com/tech/networking/news/major-network-outages-in-2020-what-could-have-prevented-them/

What becomes paramount is ensuring that best practices and recommendations are actually implemented and continuously monitored. Anything less is simply documentation sitting on a table or a passing verbal commitment. Proactive steps need to be taken, monitored, and regularly updated. One wouldn’t benefit from a prescribed medication if it’s simply sitting in the medicine cabinet. One needs to actually take the medicine!

If we look back at other recent outages (below), misconfiguration is a constant source of major incidents and outages. Now imagine the number of incidents that are not publicly visible but occur within an organization, and how much time and how many resources are wasted on repeated problems that could be avoided by monitoring configuration compliance against best practices and recommendations.

  1. CenturyLink (communication, network and related services), December 2018: 22 million subscribers in 39 states were affected by an outage, 17 million customers across 29 US states were unable to reach emergency 911 services, and at least 886 calls to 911 were not delivered. These subscribers and others in the UK and Singapore lost connectivity for two days. Additionally, customers could not make ATM withdrawals, access sensitive patient healthcare records, and more. The outage was attributed to equipment failure exacerbated by a network configuration error; redundant systems did not take over. (In 2015, CenturyLink was fined $16M for a six-hour 911 outage.)
  2. Facebook, March 2019: Facebook’s first, but not only, outage of 2019 lasted 14 hours and was reportedly the result of a “server configuration change.” Things happen, but why didn’t redundant systems take over?
  3. AeroData (weight and balance calculations for flight planning), April 2019: A “mere” 40-minute outage delayed close to 3,000 flights. Affected airlines included Southwest, SkyWest, United, Delta, United Continental, JetBlue and Alaska Airlines. The outage was referred to as a “technical issue.” Although recovery was fairly quick, the damage was significant. Was the outage due to a misconfiguration? In any case, redundant systems did not kick in.
  4. Microsoft Azure, May 2019: A nearly three-hour global outage affecting core Microsoft cloud services, including compute, storage, an application development platform, Active Directory, and SQL database services. Cloud-based applications, including Microsoft 365, Dynamics, and Azure DevOps, were also impacted. Microsoft stated that the outage was caused by “a nameserver delegation change affecting DNS resolution, harming the downstream services” and occurred “during the migration of a legacy DNS system to Azure DNS.”
  5. Salesforce, May 2019: A 15-hour global outage due to a permissions failure allowed users of Salesforce’s Pardot marketing automation software to see and edit the entirety of their company’s data on the system. Salesforce cut access twice to the larger Salesforce Marketing Cloud to stop the exposure of sensitive information and to handle what was discovered to be a database script error. Sales agents and marketers around the world lost access to customer information. Restoration of permissions was not simple; customers’ admins had to set up permissions again, some automatically, some manually.
  6. Google Cloud Platform, June 2019: A four-hour outage affecting many millions of users, including tech brands that use Google Cloud as well as Google’s own services such as YouTube, Gmail, Google Search, G Suite, Google Drive, and Google Docs. The problem occurred during maintenance operations and was caused by a software bug combined with two misconfigurations, which Google characterized as “normally-benign.” In our experience, misconfigurations are never benign: if they don’t immediately cause an outage, they eventually will.
  7. Verizon, June 2019: A roughly three-hour worldwide outage of major websites like Google, Amazon, and Reddit did not originate with Verizon, but it was the company that allowed the fault to propagate. A misconfiguration in the routing optimization software of a small internet service provider led to incorrect routes that were eventually taken up by Verizon, which did not have software in place to block and filter them. These faulty routes directed massive volumes of traffic through small networks not equipped to deal with it, leading to packet loss, unavailability, and disruption of services at major websites.

Source: https://www.continuitysoftware.com/blog/it-resilience/19-of-the-worst-it-outages-in-2019-a-recap-of-being-let-down/

Having a configuration monitoring and compliance solution that is able to collect from all sources (networks, apps, servers, storage, VMs, containers, cloud) allows you to improve the governance of these services and devices in a consolidated and consistent manner. All the lessons learned can be implemented, consolidated, and improved over time, rather than through one-time checks during an annual, time-consuming audit.

Best practices and recommendations are only actionable if they are automated. 

To learn more about how SIFF can help:

  • troubleshoot complex outages by identifying config changes related to the incident
  • reduce and prevent incidents by improving the change process
  • continuously analyze configuration for policy compliance and governance

Visit us at https://siff.io

Bringing Change to Incident Response

Imagine for a minute going to the ER in this panic-stricken, pandemic era, with symptoms akin to Covid but caused by a peanut allergy. After waiting for what seems like an eternity in a long queue of patients awaiting admittance, the ER doctor finally makes his hurried visit and diagnosis.

The doctor fires off a series of probing questions, but sadly the patient is unable to speak. Without this communication, the doctor’s diagnosis is reduced to guesswork based on symptoms. Without situational awareness of what could have caused those symptoms, a lot of time will likely be wasted exploring various other possible maladies.

Across many industries, miscommunication, unfocused communication, or a complete lack of situational awareness can be the Achilles’ heel of a company’s performance, public perception, and brand. We see it in healthcare, we see it in government, we see it in human interactions and relationships. We certainly see it in today’s complex IT environment, where the repercussions include the invaluable loss of customer trust, loss of revenue, internal disruption, and diminished employee morale. Lack of situational awareness impacts every level of business, from global corporations such as Google, Capital One, and Amazon Web Services, to small businesses managed by outsourced MSPs.

“More than 80% of all incidents are caused by planned and unplanned changes” -Gartner

Sometimes the changes remain unnoticed for hours, if not days, before a major outage occurs, and by that time even Sherlock Holmes would be challenged to unravel what went wrong! The time required to identify and remedy the infrastructure configuration changes that caused the incident in the first place significantly impacts the business, while the damages and losses continue to compound.

But what if one could have a real-time activity stream of the configuration changes that occur throughout the environment?

At a minimum, this would enable incident triage to quickly surface all relevant changes and narrow down the scope of the problem. The technician at hand could easily see how changes to the firewall rules, which belong to the network team, directly impact the application team. In doing so, the technician also saves a lot of time for many other people by avoiding the unnecessary, dreaded emergency conference bridge that has all too often become commonplace.

Perhaps, if the change data is leveraged early enough, by capturing the configuration changes related to the Change Request and using them during post-implementation reviews, outages may be prevented altogether.

Why don’t most infrastructure operations monitor for configuration changes today?

Most likely, companies assume they have comprehensive coverage with their existing myriad of configuration management tools, with each silo (networks, servers, apps, cloud, VMs, containers, etc.) covering its own specific domain. While this may be true, none of these independent tools brings all the configuration changes together in one place, and certainly none correlates the configuration data across domains or ties it to Change Requests.

The configuration change monitoring solution needs to be specifically focused on helping with incident response, incident prevention, and configuration assurance.

Just as our ER patient would have greatly benefited if the doctor had been able to communicate with him and, more importantly, had been made aware of his peanut allergy, so too can IT infrastructure and operational support greatly benefit from having visibility into the configuration changes constantly being made throughout their environment, whether planned or unplanned. Finding out “what’s changed” shouldn’t require everyone getting on an emergency conference bridge.

Responsive Infrastructure Operations

Bridging the Two Worlds, Part 2

In Part 1 of the series, we provided an overview of how organizations have a tough time tracking ongoing configuration changes in their IT and network infrastructure. Many have implemented change management processes to provide some approval and accountability for changes; however, many config changes, whether planned or unplanned, still go undetected, which can lead to disastrous outages, customer impact, and threats to business continuity.

The ability of companies to innovate, improve service uptime, and implement rapid change is critical to remaining competitive. Knowing that 80% of all incidents are caused by configuration changes, many companies create an overabundance of overhead and a risk-averse culture that hobbles change: they fear making mistakes and slow the process down to make sure every t is crossed and every i is dotted. Death by a million CABs and approvals.

Innovative companies view their ability to be agile and change as a competitive differentiator. They strive for ways to improve their ability to make changes while minimizing risk. Abstraction through the use of cloud services and automation through the use of new DevOps release automation tools are a couple of good examples of how to minimize change risk. Cloud services reduce the number of moving parts and enable us to reliably spin up new services quickly and easily, while automation tools enable consistent, repeatable change.

However, abstraction and automation still do not prevent bad configuration changes from occurring; they simply shift where the configuration changes are performed. More importantly, when incidents or problems occur (and they still will), abstraction and automation do not eliminate accountability or the need to determine root cause. Awareness of all planned, and also unplanned, configuration changes is critical for effective analysis and remediation of infrastructure and security incidents. Having this visibility and awareness of change is key to mitigating risk while still providing operational efficiency.

The adoption of Change Monitoring technologies can significantly reduce the risk of infrastructure changes. It provides accountability for all changes, whether they are planned, unplanned, unauthorized, or the result of security intrusions, and it can quickly isolate and help identify the root cause of complex incidents.

Operational Change Maturity Levels

Level 1 – Unaware

Level 1 – Unaware was covered in the previous blog post (Part 1). Most organizations fit into this category, as they are simply unaware of configuration changes made to their environment. Not only does this pose a significant security risk, it also reinforces bad behavior: users often skip the prescribed change management process because there simply is no visibility into, or accountability for, infrastructure changes.

Level 2 – Responsive

A “Responsive” operational support organization is different from a traditional support team. It uses configuration change events to quickly narrow down and isolate the scope of an incident and identify the potential root cause before spending precious time chasing and diving down the various “rabbit holes” that can easily consume time and expensive resources. Ideally, the support team should be able to work directly from Change Request records and analyze the resulting configuration changes made to the infrastructure, as well as the impacted dependent services.
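
As a rough sketch of that triage workflow, the following example filters a change-event stream down to the changes made on an affected service and its dependencies in the hours before an incident. The event shape and the dependency map are simplifying assumptions for illustration, not SIFF’s data model.

```python
# Illustrative triage helper: given an incident time and the affected service,
# list recent configuration changes on that service and its dependencies.
from datetime import datetime, timedelta

dependencies = {"orders-app": ["db-01", "edge-fw"]}  # service -> components it relies on

change_events = [
    {"at": datetime(2021, 6, 8, 9, 40), "component": "edge-fw", "summary": "ACL rule updated"},
    {"at": datetime(2021, 6, 7, 22, 5), "component": "db-01", "summary": "max_connections lowered"},
    {"at": datetime(2021, 6, 8, 9, 55), "component": "backup-srv", "summary": "cron schedule changed"},
]

def relevant_changes(service, incident_at, lookback_hours=24):
    """Return change events within the lookback window that touch the service or its dependencies."""
    scope = {service, *dependencies.get(service, [])}
    window_start = incident_at - timedelta(hours=lookback_hours)
    return [e for e in change_events
            if e["component"] in scope and window_start <= e["at"] <= incident_at]

for event in relevant_changes("orders-app", datetime(2021, 6, 8, 10, 0)):
    print(event["at"], event["component"], "-", event["summary"])
```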

“Level 2 – Responsive” is the foundational layer of the Operational Change Maturity Model, from which all the capabilities and objectives of the higher change-assurance levels are derived:

  • Greatly improved ability to troubleshoot incidents, especially complex outages
  • Prevent and reduce incidents caused by human error or by a lack of understanding of the underlying service impact and dependencies
  • Identify and reduce unplanned changes to ensure a consistent process and review are followed
  • Detect unauthorized changes and security intrusions
  • Implement and ensure security policy and compliance

Although the primary focus of the Responsive operational support team is to monitor, diagnose, and repair as quickly as possible, the team also plays an important role in providing the necessary feedback to the change and security teams to ensure that consistent processes are followed to minimize risk and unnecessary incidents. One of the key requirements of “Level 2 – Responsive” is to provide visibility into, and accountability for, ALL infrastructure changes. The support team needs to be able to easily distinguish planned changes from unauthorized changes or security intrusions, and to provide feedback to the change and security teams to reinforce the correct behavior.

What about existing tools and approaches?

Configuration Management Database (CMDB)

The CMDB is a foundational idea, but implementations are often impractical and provide limited value. A good test of whether the CMDB provides any useful configuration information for troubleshooting complex incidents is whether the operational support team actually uses it.

The horror stories of CMDB endeavors are enough to scare off any ambition of building an actual configuration change repository, as opposed to what most CMDB implementations are today. Today’s CMDBs are a shell of what they promised. Most simply provide an inventory of systems, devices, and software that are used as references, or Configuration Items (CIs), by Incident, Change, and Problem records. Some CMDBs may provide service dependency information, but the actual configuration details of the CIs themselves are very limited.

For example, CMDBs would be challenged to answer questions like these (a toy illustration of the first appears after the list):

  • Which MySQL databases have a MAX_CONNECTIONS setting greater than 500?
  • Which Cisco devices have the vulnerable Smart Install feature enabled?
  • What configurations were recently changed on this device?
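
For contrast, here is a toy illustration of the first question, answered against collected configuration snapshots rather than a CMDB. The snapshot structure is an assumption made for the example, not a real schema.

```python
# Hypothetical example of the kind of query a configuration repository should
# be able to answer; the snapshot structure is an illustrative assumption.
snapshots = [
    {"host": "db-01", "service": "mysql", "settings": {"max_connections": "800"}},
    {"host": "db-02", "service": "mysql", "settings": {"max_connections": "151"}},
    {"host": "web-01", "service": "nginx", "settings": {"worker_connections": "1024"}},
]

# "Which MySQL databases have a MAX_CONNECTIONS setting greater than 500?"
over_limit = [
    snap["host"]
    for snap in snapshots
    if snap["service"] == "mysql"
    and int(snap["settings"].get("max_connections", 0)) > 500
]
print(over_limit)  # ['db-01']
```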

The difficulty of collecting all the appropriate data in the CMDB, with its many assets and CI categories, is best highlighted by Gartner’s recommendation that only 10-15% of assets be cataloged in the CMDB.

Domain-specific Configuration Management Tools

There are many existing Configuration Management tools that collect and deploy configurations to networks, servers, applications, and cloud environments. The challenge with these tools is that they are limited to their specific domains, e.g., only networks, only specific applications, or only cloud.

The ability to see changes across ALL technologies is essential to troubleshooting complex incidents and understanding which services are impacted by the configuration changes. For example, a simple firewall rule change by the network security team can directly impact the availability of an application managed by the application team. As services become increasingly interconnected through microservices, containers, virtualization, and software-defined storage and networks, the ability to correlate change events across domains is critical.

Being Responsive with Change Monitoring

Having a Change Monitoring tool enables organizations to quickly determine what has changed in the environment, whether the changes are planned or unplanned. Change Monitoring allows operational teams to quickly isolate and troubleshoot complex issues and become more responsive. The following are some of the key elements of Change Monitoring that should be considered:

  • discovery and inventory of devices to manage
  • discovery and classification of devices and services
  • retrieve the various forms of configuration information
  • search indexing and parsing
  • policy and action
  • service dependency
  • user workflow and interaction
  • API and reporting
  • process and workflow integration

Device Discovery and Inventory

Dynamically add new elements to be managed. This includes networks, servers, containers, applications, cloud services, etc. You need to be able to define rules to include as well as exclude elements. Integration with other inventory sources, such as CMDBs, is also helpful.
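
A minimal sketch of include/exclude discovery rules, assuming IP-range-based rules for simplicity; the rule format here is illustrative, not a SIFF feature specification.

```python
# Illustrative sketch of include/exclude discovery rules based on IP ranges.
import ipaddress

INCLUDE = [ipaddress.ip_network("10.0.0.0/16"), ipaddress.ip_network("192.168.10.0/24")]
EXCLUDE = [ipaddress.ip_network("10.0.99.0/24")]  # e.g. a lab subnet to skip

def should_manage(address):
    """Return True if the address falls in an include range and no exclude range."""
    ip = ipaddress.ip_address(address)
    included = any(ip in net for net in INCLUDE)
    excluded = any(ip in net for net in EXCLUDE)
    return included and not excluded

print(should_manage("10.0.5.20"))   # True
print(should_manage("10.0.99.7"))   # False - explicitly excluded
print(should_manage("172.16.0.1"))  # False - not in any include range
```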

Device and Service Classification

Examine devices in detail to understand what services are relevant and what configuration to collect and monitor for changes.
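
A toy sketch of classification: mapping a few facts gathered from a device (installed packages, open ports) to the services worth monitoring and the configuration to collect for each. The facts and the mapping are illustrative assumptions, not how SIFF classifies devices.

```python
# Toy classification sketch: simple facts from a device -> services to monitor
# and the configuration items to collect for each (illustrative mapping only).
CLASSIFIERS = {
    "nginx": {"match": lambda facts: "nginx" in facts["packages"],
              "collect": ["/etc/nginx/nginx.conf"]},
    "mysql": {"match": lambda facts: 3306 in facts["open_ports"],
              "collect": ["/etc/mysql/my.cnf", "SHOW VARIABLES"]},
}

def classify(facts):
    """Return service name -> configuration items to collect for this device."""
    return {name: rule["collect"]
            for name, rule in CLASSIFIERS.items() if rule["match"](facts)}

device_facts = {"packages": ["nginx", "openssl"], "open_ports": [22, 80, 3306]}
print(classify(device_facts))
# {'nginx': ['/etc/nginx/nginx.conf'], 'mysql': ['/etc/mysql/my.cnf', 'SHOW VARIABLES']}
```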

Configuration Collection

Configuration information exists in many different forms, e.g. files, command line output, registry keys, database entries, and APIs. The Change Monitoring system should be able to collect all of these types of configuration information and normalize the data. Being able to collect the proper information based on the type of application, system, or device is critical to monitoring changes in an environment.
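
As a simplified sketch, the following normalizes two very different sources, a file and a command’s output, into one common record shape so changes can be detected and searched consistently. The record fields are assumptions for the example, and it assumes a Unix-like host.

```python
# Sketch of normalizing configuration from different sources into one record shape.
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def normalize(device, source, content):
    """Wrap raw configuration content in a common, comparable record."""
    return {
        "device": device,
        "source": source,  # file path, command, registry key, API endpoint...
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "content": content.strip(),
    }

def collect_file(device, path):
    return normalize(device, f"file:{path}", Path(path).read_text())

def collect_command(device, command):
    output = subprocess.run(command, capture_output=True, text=True, check=True).stdout
    return normalize(device, "cmd:" + " ".join(command), output)

# Two very different sources end up in the same shape.
records = [
    collect_file("web-01", "/etc/hosts"),
    collect_command("web-01", ["uname", "-r"]),
]
for record in records:
    print(record["source"], "->", len(record["content"]), "bytes")
```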

Search Index and Parsing

Once the configuration information is collected and consolidated, being able to index and analyze the configuration data in an IT-centric way ensures the right information can be surfaced to IT professionals. For example, IP address syntax, special symbols, and the like have special meaning that should not be lost in the indexing process, so that searching remains intuitive for infrastructure management.
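
A toy example of why this matters: a naive word splitter breaks an IPv4 address apart, while an IT-aware tokenizer keeps it whole so a search for the address still finds the line.

```python
# Toy example of IT-aware tokenization for configuration text.
import re

LINE = "permit tcp host 10.1.2.3 eq 443"

naive_tokens = re.split(r"[^A-Za-z0-9]+", LINE)
print(naive_tokens)    # ['permit', 'tcp', 'host', '10', '1', '2', '3', 'eq', '443']

# Match whole IPv4 addresses first, then fall back to ordinary word characters.
it_aware = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}|\w+", LINE)
print(it_aware)        # ['permit', 'tcp', 'host', '10.1.2.3', 'eq', '443']
```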

Policy and Action

Configuration and monitoring policies can be defined to notify on policy non-compliance. Why is this important? Because continuously checking both existing configurations and newly detected changes against policy catches misconfigurations as soon as they appear, before they turn into incidents or audit findings.

Service Dependency

Analysis of the configuration data to determine dependencies between service components. This is the foundation for impact analysis and service visualization, and it enables related systems and information to be automatically collected and presented in the context of the affected service.
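
A simplified illustration of the idea: if one component’s configuration mentions another component’s name, record a dependency edge. The sample configs are assumptions for the example; real dependency analysis is considerably more involved.

```python
# Simplified illustration of deriving service dependencies from configuration content.
configs = {
    "orders-app":  "DB_HOST=db-01\nCACHE_HOST=redis-01",
    "billing-app": "DB_HOST=db-01",
    "db-01":       "port=5432",
    "redis-01":    "maxmemory 2gb",
}

def dependencies(configs):
    """Return (consumer, provider) pairs wherever a config mentions another component."""
    edges = []
    for consumer, text in configs.items():
        for provider in configs:
            if provider != consumer and provider in text:
                edges.append((consumer, provider))
    return edges

print(dependencies(configs))
# [('orders-app', 'db-01'), ('orders-app', 'redis-01'), ('billing-app', 'db-01')]
```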

User Workflow and Interaction

What the user is trying to accomplish should determine the workflow and interaction required to retrieve the configuration change information. Troubleshooting and searching for the root-cause of an incident is very different from searching for configuration information about a specific service.

API and Reporting

The searching and filtering capability to retrieve the configuration data should be easily accessible via the API for reporting and integration with external systems.
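
A hypothetical example of what that could look like from a reporting script, using Python’s requests library. The endpoint, parameters, and response fields here are illustrative assumptions, not a documented SIFF API.

```python
# Hypothetical example of pulling change events over a REST API for reporting;
# the endpoint, parameters, and response fields are illustrative assumptions.
import requests

BASE_URL = "https://config-monitor.example.com/api/v1"
API_TOKEN = "…"  # normally read from a secrets store, not hard-coded

response = requests.get(
    f"{BASE_URL}/changes",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"device": "core-sw-01", "since": "2021-06-01T00:00:00Z"},
    timeout=30,
)
response.raise_for_status()

for change in response.json():
    print(change["timestamp"], change["device"], change["setting"])
```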

Process and Workflow Integration

The configuration change information should be integrated with the Change Request workflow to automatically audit the underlying infrastructure configuration changes related to the change request.
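
A minimal sketch of that correlation: tag each detected configuration change as planned when it falls inside an approved change request window for the same device, and unplanned otherwise. The record fields are simplified assumptions for illustration.

```python
# Sketch of correlating detected configuration changes with change request (CR) windows.
from datetime import datetime

change_requests = [
    {"id": "CR-1042", "device": "core-sw-01",
     "start": datetime(2021, 6, 8, 1, 0), "end": datetime(2021, 6, 8, 3, 0)},
]

detected_changes = [
    {"device": "core-sw-01", "at": datetime(2021, 6, 8, 1, 25), "setting": "acl.edge-in"},
    {"device": "core-sw-01", "at": datetime(2021, 6, 8, 14, 10), "setting": "ntp.server"},
]

def label(change):
    """Tag a detected change as planned (inside a CR window) or unplanned."""
    for cr in change_requests:
        if cr["device"] == change["device"] and cr["start"] <= change["at"] <= cr["end"]:
            return f"planned ({cr['id']})"
    return "unplanned"

for change in detected_changes:
    print(change["at"], change["device"], change["setting"], "->", label(change))
```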

Conclusion

A Change Monitoring solution does not require much effort to deploy. It can quickly and easily provide additional insights to the operational support team, enabling them to troubleshoot complex incidents more efficiently. The configuration data is also invaluable to the security team and provides the auditing necessary for many compliance requirements.

Next in the series, we will cover “Level 3 – Proactive,” where we take strategic, proactive, and preventative measures that reduce unnecessary incidents and minimize change risk.