Responsive Infrastructure Operations

Bridging the Two Worlds, Part 2

In part 1 of the series, we provided an overview of how organizations struggle to track ongoing configuration changes in their IT and network infrastructure. Many have implemented change management processes to provide some approval and accountability for changes. However, many configuration changes, whether planned or unplanned, still go undetected, which can lead to disastrous outages and harm both customers and business continuity.

The ability of companies to innovate, improve service uptime, and implement rapid change is critical to remaining competitive. Knowing that 80% of all incidents are caused by configuration changes, many companies create excessive overhead and a risk-averse culture that hobbles change – they fear making mistakes and slow down the process to ensure they cross all the T’s and dot all the i’s – death by a million CABs and approvals.

Innovative companies view their ability to be agile and to change as a competitive differentiator. They look for ways to make changes faster while minimizing risk. Abstraction through cloud services and automation through new DevOps release automation tools are two good examples of how to minimize change risk. Cloud services reduce the number of moving parts and let us spin up new services quickly and reliably, while automation tools enable consistent, repeatable change.

However, abstraction and automation still do not prevent bad configuration changes from occurring; they simply shift where the configuration changes are performed. More importantly, when incidents or problems occur (and they still will), abstraction and automation do not eliminate the need for accountability or for determining the root cause. Awareness of all configuration changes, planned and unplanned, is critical for effective analysis and remediation of infrastructure and security incidents. This visibility is key to mitigating risk while still providing operational efficiency.

The adoption of Change Monitoring technologies can significantly reduce the risk of infrastructure changes. Change Monitoring provides accountability for all changes, whether they are planned or unplanned, and exposes unauthorized changes and security intrusions. It can also quickly isolate and help identify the root cause of complex incidents.

Level 1 – Unaware

“Level 1 – Unaware” was covered in the previous blog post (part 1). Most organizations fit into this category because they are simply unaware of the configuration changes made to their environment. Not only does this pose a significant security risk, it also reinforces bad behavior: users often skip the prescribed change management process because there is no visibility into, or accountability for, infrastructure changes.

Level 2 – Responsive

A “Responsive” operational support organization is different from a traditional support team. It uses configuration change events to quickly narrow the scope of an incident and identify the potential root cause, before spending precious time diving down various “rabbit holes” that can easily consume time and expensive resources. Ideally, the support team should be able to work directly from Change Request records and analyze the resulting configuration changes made to the infrastructure, as well as the dependent services they impact.

“Level 2 – Responsive” of the Operational Change Maturity Model is the foundational layer from which all the capabilities and objectives of the higher change assurance levels are derived. The objectives at this level are to:

Greatly improve the ability to troubleshoot incidents, especially complex outages
Prevent and reduce incidents caused by human error or by a lack of understanding of the underlying service impact and dependencies
Identify and reduce unplanned changes to ensure a consistent process and review are followed
Detect unauthorized changes and security intrusions
Implement and ensure security policy and compliance

Although the primary focus of the Responsive operational support team is to monitor, diagnose and repair as quickly as possible, the team also plays an important role in giving the change and security teams the feedback they need to ensure that consistent processes are followed and that risk and unnecessary incidents are minimized. One of the key requirements of “Level 2 – Responsive” is to provide visibility into, and accountability for, ALL infrastructure changes. The support team needs to be able to easily distinguish planned changes from unauthorized changes or security intrusions, so that its feedback to the change and security teams reinforces the correct behavior.

What about existing tools and approaches?

Configuration Management Database (CMDB)

The CMDB is a foundational idea, but its implementation is often impractical and provides limited value. A good test of whether a CMDB provides useful configuration information for troubleshooting complex incidents is whether the operational support team actually uses it.

The horror stories of CMDB endeavours are enough to scare off any ambition of building an actual configuration change repository, as opposed to what most CMDB implementations are today. CMDBs today are a shell of what they promised. Most simply provide an inventory of systems, devices and software, which are referenced as Configuration Items (CIs) by Incident, Change and Problem records. Some CMDBs may provide some service dependency information, but the actual configuration details of the CIs themselves are very limited.

For example, CMDBs would be challenged to answer questions like:

Which MySQL databases have a MAX_CONNECTIONS configuration setting greater than 500?
Which Cisco devices have the vulnerable Smart Install feature enabled?
What configurations were recently changed on this device?
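To make the first question concrete, here is a minimal sketch of the kind of query that becomes possible once per-device settings are actually collected. The ConfigRecord schema, the device names and the sample values are all invented for illustration; they do not describe any particular CMDB or Change Monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigRecord:
    """One collected setting for one device (hypothetical schema)."""
    device: str
    kind: str   # e.g. "mysql" or "cisco-ios"
    key: str
    value: str


def mysql_over_connection_limit(records, limit=500):
    """Return the MySQL hosts whose max_connections setting exceeds `limit`."""
    hits = []
    for record in records:
        # Setting names are compared case-insensitively.
        if record.kind == "mysql" and record.key.lower() == "max_connections":
            if int(record.value) > limit:
                hits.append(record.device)
    return sorted(hits)


# Sample data is invented for illustration only.
SAMPLE = [
    ConfigRecord("db-01", "mysql", "max_connections", "151"),
    ConfigRecord("db-02", "mysql", "max_connections", "800"),
    ConfigRecord("db-03", "mysql", "MAX_CONNECTIONS", "1000"),
    ConfigRecord("sw-01", "cisco-ios", "hostname", "sw-01"),
]

print(mysql_over_connection_limit(SAMPLE))
```

The point is not the query itself, which is trivial, but that it needs setting-level detail that an inventory-only CMDB typically does not hold.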

The difficulty of collecting all the appropriate data in the CMDB, given the number of assets and CI categories, is best highlighted by Gartner’s recommendation that only 10-15% of assets be cataloged in the CMDB.

Domain-specific Configuration Management Tools

There are many existing Configuration Management tools that collect and deploy configurations to networks, servers, applications and cloud environments. The challenge with these tools is that each is limited to its specific domain, e.g. only networks, only specific applications, or only cloud.

The ability to see changes across ALL technologies is essential for troubleshooting complex incidents and for understanding which services are impacted by a configuration change. For example, a simple firewall rule change by the network security team can directly impact the availability of an application managed by the application team. As services grow increasingly connected through microservices, containers, virtualization and software-defined storage and networks, the ability to correlate change events across domains is critical.
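The firewall example can be sketched as a simple cross-domain lookup: given an incident time and the elements a service relies on, list the recent changes from any domain, newest first. The ChangeEvent fields, element names and timestamps below are assumptions made up for this sketch, not a real tool's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A detected configuration change (hypothetical schema)."""
    when: datetime
    domain: str    # "network", "application", ...
    element: str
    summary: str


def changes_before_incident(events, incident_time, related_elements,
                            window=timedelta(hours=2)):
    """Return changes to related elements within `window` before the incident."""
    start = incident_time - window
    hits = [
        event for event in events
        if start <= event.when <= incident_time
        and event.element in related_elements
    ]
    # Newest first: the most recent change is usually the first suspect.
    return sorted(hits, key=lambda event: event.when, reverse=True)


# Invented sample: one firewall change, one old release, one unrelated switch.
EVENTS = [
    ChangeEvent(datetime(2024, 5, 1, 9, 40), "network", "fw-edge-1",
                "firewall rule changed"),
    ChangeEvent(datetime(2024, 4, 30, 16, 0), "application", "orders-app",
                "new release deployed"),
    ChangeEvent(datetime(2024, 5, 1, 9, 50), "network", "sw-lab-3",
                "VLAN added"),
]
INCIDENT = datetime(2024, 5, 1, 10, 0)
RELATED = {"fw-edge-1", "orders-app"}

for event in changes_before_incident(EVENTS, INCIDENT, RELATED):
    print(event.when, event.domain, event.element, event.summary)
```

With the default two-hour window, only the firewall change is reported for the application incident, even though it was made by a different team in a different domain.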

Being Responsive with Change Monitoring

A Change Monitoring tool enables organizations to quickly determine what has changed in the environment, whether the changes were planned or unplanned. Change Monitoring allows operational teams to quickly isolate and troubleshoot complex issues and become more responsive. The following are some of the key elements of Change Monitoring that should be considered:

discovery and inventory of devices to manage
discovery and classification of devices and services
retrieval of the various forms of configuration information
search indexing and parsing
policy and action
service dependency
user workflow and interaction
API and reporting
process and workflow integration

Device Discovery and Inventory

The system should dynamically add new elements to be managed, including networks, servers, containers, applications and cloud services. It needs to support rules that include as well as exclude elements. Integration with other inventory sources, such as CMDBs, is also helpful.
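One simple way to express include and exclude rules is with shell-style name patterns. The sketch below assumes elements are identified by name and that an exclude rule always wins over an include rule; both the naming scheme and that precedence are illustrative choices, not a description of any specific product.

```python
from fnmatch import fnmatchcase


def select_elements(names, include, exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    selected = []
    for name in names:
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        # Assumption for this sketch: exclude rules take precedence.
        if included and not excluded:
            selected.append(name)
    return selected


# Invented names for illustration.
DISCOVERED = ["web-01", "web-02-lab", "db-01", "printer-7"]

print(select_elements(DISCOVERED, include=["web-*", "db-*"], exclude=["*-lab"]))
```

fnmatchcase is used rather than fnmatch so that matching is case-sensitive on every platform.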

Device and Service Classification

Examine devices in detail to understand what services are relevant and what configuration to collect and monitor for changes.

Configuration Collection

Configuration information exists in many different forms, e.g. files, command line output, registry keys, database entries, APIs. The Change Monitoring system should be able to collect all of these types of configuration information and normalize the data. Being able to collect the proper information based on the type of application, system or device is critical to monitor the changes in an environment.
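As a rough illustration of normalization, the sketch below flattens two of these forms, a key = value file and a JSON document such as an API might return, into the same flat dictionary shape, and then compares two snapshots. The function names and sample settings are hypothetical; a real collector would need to handle many more formats and edge cases.

```python
import json


def normalize_ini(text):
    """Flatten 'key = value' lines; comments and section headers are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def normalize_json(text):
    """Flatten nested JSON objects into dotted keys with string values."""
    flat = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else str(key))
        else:
            # Lists and scalars are stored as their string form.
            flat[path.lower()] = str(node)

    walk(json.loads(text), "")
    return flat


def diff_settings(old, new):
    """Return {key: (before, after)} for every added, removed or changed key."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Invented snapshots for illustration.
BEFORE = normalize_ini("[mysqld]\nmax_connections = 151\nport = 3306\n")
AFTER = normalize_ini("[mysqld]\nmax_connections = 800\nport = 3306\n")

print(diff_settings(BEFORE, AFTER))
print(normalize_json('{"listener": {"port": 443, "tls": true}}'))
```

Once every source is reduced to the same shape, change detection becomes a comparison of two dictionaries, regardless of where the configuration came from.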

Search Index and Parsing

Once the configuration information is collected and consolidated, it should be indexed and analyzed in an IT-centric way so that IT professionals get the information they expect. For example, IP address syntax, special symbols, etc. have special meaning that should not be lost in the indexing process, so that searching remains intuitive for infrastructure management.
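To show why this matters, the sketch below contrasts a generic word tokenizer, which breaks an IP address into meaningless fragments, with a tokenizer that keeps addresses and interface-style names whole. The regular expression is a simplified assumption for illustration (IPv4 only, no validation of octet ranges), not a complete grammar for configuration text.

```python
import re

# IPv4 address with optional /prefix, or words joined by - . / :
TOKEN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}(?:/\d{1,2})?"
    r"|[A-Za-z0-9_]+(?:[-./:][A-Za-z0-9_]+)*"
)


def naive_tokenize(line):
    """Generic text tokenizer: splits on anything that is not a word character."""
    return re.findall(r"\w+", line.lower())


def tokenize(line):
    """IT-aware tokenizer: keeps addresses and interface names intact."""
    return [token.lower() for token in TOKEN.findall(line)]


def build_index(lines):
    """Map each token to the set of 1-based line numbers it appears on."""
    index = {}
    for number, line in enumerate(lines, start=1):
        for token in tokenize(line):
            index.setdefault(token, set()).add(number)
    return index


# Invented configuration lines for illustration.
CONFIG = [
    "permit tcp 10.1.2.0/24 host 192.168.0.5 eq 443",
    "deny ip any 10.1.2.0/24",
]

print(naive_tokenize("10.1.2.0/24"))
print(tokenize("10.1.2.0/24"))
print(build_index(CONFIG)["10.1.2.0/24"])
```

With the IT-aware tokenizer, a search for 10.1.2.0/24 finds both rules directly instead of matching every line that happens to contain the number 10 or 24.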

Policy and Action

Configuration and monitoring policies can be defined to notify on policy non-compliance. Why is this important?

Service Dependency

The configuration data is analyzed to determine dependencies between service components. This is the foundation for impact analysis and service visualization, and it enables related systems and information to be automatically collected and presented in the context of the affected service.
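Impact analysis over such dependencies can be sketched as a graph traversal: starting from the changed component, walk "depends on" edges in reverse to find every service that relies on it, directly or indirectly. The dependency map below is invented for illustration, and deriving it from real configuration data is the hard part that this sketch does not attempt.

```python
from collections import deque


def impacted_services(depends_on, changed):
    """Return every service that directly or indirectly depends on `changed`."""
    # Invert "service -> its dependencies" into "component -> its dependents".
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    seen = set()
    queue = deque([changed])
    while queue:  # breadth-first walk up the dependency chain
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(changed)  # guard against cycles leading back to the start
    return sorted(seen)


# Hypothetical dependency map: service -> components it depends on.
DEPENDS_ON = {
    "orders-app": ["orders-db", "fw-edge-1"],
    "storefront": ["orders-app"],
    "reports": ["orders-db"],
}

print(impacted_services(DEPENDS_ON, "fw-edge-1"))
```

Here a change on the firewall is shown to affect the storefront as well, even though the storefront has no direct dependency on it.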

User Workflow and Interaction

What the user is trying to accomplish should determine the workflow and interaction required to retrieve the configuration change information. Troubleshooting and searching for the root-cause of an incident is very different from searching for configuration information about a specific service.

API and Reporting

The searching and filtering capability to retrieve the configuration data should be easily accessible via the API for reporting and integration with external systems.

Process and Workflow Integration

The configuration change information should be integrated with the Change Request workflow to automatically audit the underlying infrastructure configuration changes related to the change request.
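A minimal version of that audit might match each detected change against the approved Change Requests by element and time window, and flag anything without a match as unplanned. The record fields, the request ID and the matching rule are assumptions for this sketch; a real integration would depend on the Change Request system's own data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeRequest:
    """An approved change with its scope and window (hypothetical schema)."""
    request_id: str
    elements: frozenset
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class DetectedChange:
    """A configuration change observed by monitoring (hypothetical schema)."""
    element: str
    when: datetime
    summary: str


def audit(changes, requests):
    """Split detected changes into planned (matched to a request) and unplanned."""
    report = {"planned": [], "unplanned": []}
    for change in changes:
        match = next(
            (request for request in requests
             if change.element in request.elements
             and request.window_start <= change.when <= request.window_end),
            None,
        )
        if match is not None:
            report["planned"].append((match.request_id, change))
        else:
            report["unplanned"].append(change)
    return report


# Invented sample data for illustration.
REQUESTS = [
    ChangeRequest("CR-1001", frozenset({"fw-edge-1"}),
                  datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 11, 0)),
]
CHANGES = [
    DetectedChange("fw-edge-1", datetime(2024, 5, 1, 9, 40), "rule added"),
    DetectedChange("fw-edge-1", datetime(2024, 5, 1, 23, 15), "rule removed"),
    DetectedChange("db-02", datetime(2024, 5, 1, 9, 45), "setting changed"),
]

result = audit(CHANGES, REQUESTS)
print(len(result["planned"]), "planned,", len(result["unplanned"]), "unplanned")
```

In this sample, the second firewall change falls outside the approved window and the database change has no request at all, so both are reported as unplanned for follow-up.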

Conclusion

A Change Monitoring solution does not require much effort to deploy. It can quickly and easily provide additional insights to the operational support team, enabling them to troubleshoot complex incidents more efficiently. The configuration data is also invaluable to the security team and provides the auditing needed for many compliance requirements.

Next in the series, we will cover “Level 3 – Proactive”, where we take strategic, proactive and preventative measures to reduce unnecessary incidents and minimize change risk.