Bringing Change to Incident Response

Imagine, for a minute, going to the ER in this panic-stricken pandemic era with Covid-like symptoms that are actually caused by a peanut allergy. After what seems like an eternity in a long admissions queue, the ER doctor finally makes a hurried visit to reach a diagnosis.

The doctor fires off a series of probing questions, but sadly the patient is unable to speak. Without that communication, the doctor’s diagnosis is reduced to guesswork based on symptoms alone. With no situational awareness of what could have caused those symptoms, a lot of time will likely be wasted exploring other possible maladies.

Across many industries, miscommunication, unfocused communication, or a complete lack of situational awareness can be the Achilles heel of a company’s performance, public perception, and brand. We see it in healthcare, we see it in government, and we see it in human interactions and relationships. We certainly see it in today’s complex IT environments, where the repercussions include the loss of invaluable customer trust, loss of revenue, internal disruption, and diminished employee morale. Lack of situational awareness affects businesses of every size, from global corporations such as Google, Capital One, and Amazon Web Services to small businesses whose IT is managed by outsourced MSPs.

“More than 80% of all incidents are caused by planned and unplanned changes” -Gartner

Sometimes the changes go unnoticed for hours, if not days, before a major outage occurs, and by that time even Sherlock Holmes would be challenged to unravel what went wrong! The time required to identify and remedy the infrastructure configuration changes that caused the incident significantly impacts the business, and all the while the damages and losses continue to compound.

But what if one could have a real-time activity stream of the configuration changes that occur throughout the environment?

At a minimum, this would enable incident triage to surface all relevant changes, quickly narrowing the scope of the problem. The technician on hand could easily see how changes to the firewall rules, which belong to the network team, directly impact the application team. In doing so, the technician would also save many other people a lot of time by avoiding the dreaded emergency conference bridge, which has become all too commonplace.
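As a rough illustration of the triage idea above, the following Python sketch merges changes from several domains into one list and keeps only those made shortly before an incident. The `ConfigChange` record, the `changes_before_incident` helper, the lookback window, and all sample data are hypothetical and do not reflect any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConfigChange:
    """One entry in a hypothetical cross-domain change stream."""
    timestamp: datetime
    domain: str    # owning silo, e.g. "network", "server", "app"
    resource: str  # the device or service that was changed
    summary: str


def changes_before_incident(stream, incident_time, lookback=timedelta(hours=24)):
    """Return changes from every domain inside the lookback window, newest first."""
    window_start = incident_time - lookback
    relevant = [c for c in stream if window_start <= c.timestamp <= incident_time]
    return sorted(relevant, key=lambda c: c.timestamp, reverse=True)


# Hypothetical sample data spanning three teams.
stream = [
    ConfigChange(datetime(2021, 3, 1, 9, 0), "server", "db-01", "kernel patch applied"),
    ConfigChange(datetime(2021, 3, 2, 14, 30), "network", "fw-edge-01", "firewall rule modified"),
    ConfigChange(datetime(2021, 3, 2, 15, 5), "app", "checkout-api", "connection pool size changed"),
]

incident_time = datetime(2021, 3, 2, 16, 0)
suspects = changes_before_incident(stream, incident_time, lookback=timedelta(hours=6))

for change in suspects:
    print(f"{change.timestamp:%Y-%m-%d %H:%M} [{change.domain}] {change.resource}: {change.summary}")
```

With this sample data, the firewall change owned by the network team appears next to the application change in a single view, which is the kind of cross-team visibility the technician needs during triage.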

Perhaps, if the change data is leveraged early enough, by capturing the configuration changes related to each Change Request and using them during post-implementation reviews, outages may be prevented altogether.
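One way to picture relating configuration changes to Change Requests is sketched below in Python. It assumes a deliberately simple matching rule, same resource and a timestamp inside the approved window, and every name here (`ChangeRequest`, `ObservedChange`, `review`, the CR identifier) is hypothetical; a real matching policy would likely be more involved.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeRequest:
    """A hypothetical approved Change Request with its maintenance window."""
    cr_id: str
    resource: str
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class ObservedChange:
    """A configuration change actually detected in the environment."""
    timestamp: datetime
    resource: str
    summary: str


def match_change_request(change, requests):
    """Return the id of the first CR covering this change, or None if unplanned."""
    for cr in requests:
        same_resource = cr.resource == change.resource
        in_window = cr.window_start <= change.timestamp <= cr.window_end
        if same_resource and in_window:
            return cr.cr_id
    return None


def review(changes, requests):
    """Split observed changes into planned (grouped by CR id) and unplanned."""
    planned, unplanned = {}, []
    for change in changes:
        cr_id = match_change_request(change, requests)
        if cr_id is None:
            unplanned.append(change)
        else:
            planned.setdefault(cr_id, []).append(change)
    return planned, unplanned


# Hypothetical sample data.
requests = [
    ChangeRequest("CR-1001", "fw-edge-01",
                  datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 2, 15, 0)),
]
observed = [
    ObservedChange(datetime(2021, 3, 2, 14, 30), "fw-edge-01", "firewall rule modified"),
    ObservedChange(datetime(2021, 3, 2, 15, 5), "checkout-api", "connection pool size changed"),
]

planned, unplanned = review(observed, requests)
print("Planned:", {cr: [c.summary for c in cs] for cr, cs in planned.items()})
print("Unplanned:", [c.summary for c in unplanned])
```

In a post-implementation review, the planned group shows what was actually changed under each request, while the unplanned group highlights changes that nobody approved.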

Why don’t most infrastructure operations teams monitor for configuration changes today?

Most likely, companies assume they have comprehensive coverage with their existing array of configuration management tools, with each silo (networks, servers, apps, cloud, VMs, containers, etc.) covering its own specific domain. While this may be true, none of these independent tools brings all the configuration changes together in one place, and certainly none relates the configuration data across domains or to Change Requests.

A configuration change monitoring solution needs to focus specifically on incident response, incident prevention, and configuration assurance.

Just as our ER patient would have greatly benefited if the doctor had been able to communicate with him and, more importantly, had been made aware of his peanut allergy, IT infrastructure and operational support teams can greatly benefit from visibility into the configuration changes that are constantly made throughout their environment, whether planned or unplanned. Finding out “what’s changed” shouldn’t require everyone getting on an emergency conference bridge.