{"id":2228,"date":"2020-03-01T06:26:13","date_gmt":"2020-03-01T05:26:13","guid":{"rendered":"http:\/\/siff.io\/?p=2228"},"modified":"2020-05-20T20:36:52","modified_gmt":"2020-05-20T18:36:52","slug":"change-management-infrastructure-operations","status":"publish","type":"post","link":"https:\/\/webdev.siff.io\/change-management-infrastructure-operations\/","title":{"rendered":"Change Management & Infrastructure Operations"},"content":{"rendered":"\t\t
This is the first in a series of articles exploring the opportunities that open up when we connect two related, yet often disconnected, worlds: change management and infrastructure operations. The series follows the Operational Change Management Maturity Levels to provide a pathway toward reducing incidents and outages, and toward avoiding unnecessary self-inflicted pain.
The series does not cover the change management process itself; there is plenty of content available that explores the change management discipline. Instead, we will explore what keeps the benefits promised by the change management process from directly helping infrastructure operations, and how those challenges can be overcome.
Of all the wild claims we hear from industry analysts, the claim that most outages are caused by changes is one that rings true. We could debate the exact percentage, but the core of the statement is accurate. There is a reason why industries that depend heavily on their IT infrastructure and services, such as financial services and retail, go into a "lock-down" mode during their seasonal peaks: making changes breaks things.

But aren't we supposed to "fail fast" and "break things" now?

Amazon, Google, and Netflix dominate in large part because of their ability to innovate quickly. They adopted the "fail fast" mentality and built a culture that does not punish mistakes made in the pursuit of innovation. This is often measured by the number of successful change requests; in some ways, the ability to promote changes and improvements directly correlates with business agility.

But how do you do this at scale? How does an organization reduce the risk and make more frequent changes manageable? Many look toward infrastructure as code or software-defined X as the goal. However, as many developers can attest, just because it is "code" and releases are automated does not mean the underlying issues disappear. Organizations need to revisit what managing changes means to them, specifically the discipline and process required to reduce risk. There are lightweight strategies teams can adopt to help avoid causing pain.

The good news is that software developers have been dealing with quality assurance for a very long time. The opportunity lies in applying those lessons to infrastructure operations.

What does "Level 1 – Unaware" mean?

Here's the typical scenario: a service degrades, an alert fires, a ticket is opened, and a technician starts troubleshooting.

But what's missing? In this scenario, we rarely see the technician search Change Requests (CRs) to check whether a recent change may have caused the problem. The technician is busy looking at symptoms and trying to guess the cause rather than assessing relevant changes that have recently occurred. Both approaches are needed; however, understanding recent changes can dramatically narrow the search space and reduce the time to repair.

Change Requests often go unused here because they lack the actual, detailed configurations that resulted from the work. CRs describe the work to be carried out and can provide detailed instructions on how to perform it, but they do not capture the resulting configurations that the infrastructure monitoring team needs to troubleshoot and repair problems. This is the disconnect between those implementing changes to the infrastructure and those monitoring it and keeping everything up and running.

Additionally, many organizations still struggle to adopt a consistent change management process. Changes are frequently made directly to systems and devices without following the change request process, and these unplanned changes are often the source of outages. Without visibility into, and accountability for, these ad-hoc changes, it is very difficult to instill process discipline.

Lastly, unauthorized changes and security breaches, both internal and external, are becoming commonplace. It's not a matter of if but when they will occur. How do you know if you're compromised if you don't even know what changes are going on?

These symptoms highlight a couple of key limitations in managing change: change records do not capture what actually changed on the infrastructure, and unplanned or unauthorized changes are invisible to the process altogether.
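To make that gap more concrete, here is a minimal sketch in Python. The record layouts, field names, and the sample devices are entirely hypothetical (this is not any particular ITSM or monitoring product's API); the idea is simply to reconcile configuration changes observed on the infrastructure against approved change request windows, so that anything without a matching CR surfaces as an unplanned, and possibly unauthorized, change.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical records: changes observed on devices (e.g. from config snapshots
# or syslog) and approved change requests exported from an ITSM tool.

@dataclass
class ObservedChange:
    device: str
    detected_at: datetime
    summary: str

@dataclass
class ChangeRequest:
    cr_id: str
    targets: set[str]          # devices/systems the CR is approved to touch
    window_start: datetime
    window_end: datetime

def classify_changes(observed, approved, slack=timedelta(minutes=30)):
    """Split observed configuration changes into planned vs. unplanned."""
    planned, unplanned = [], []
    for change in observed:
        covering = [
            cr for cr in approved
            if change.device in cr.targets
            and cr.window_start - slack <= change.detected_at <= cr.window_end + slack
        ]
        (planned if covering else unplanned).append((change, covering))
    return planned, unplanned

if __name__ == "__main__":
    now = datetime(2020, 3, 1, 2, 0)
    approved = [ChangeRequest("CR-1042", {"core-router-1"},
                              now - timedelta(hours=1), now + timedelta(hours=1))]
    observed = [
        ObservedChange("core-router-1", now, "BGP policy updated"),
        ObservedChange("edge-fw-3", now, "NAT rule added"),  # no CR covers this device
    ]
    _, unplanned = classify_changes(observed, approved)
    for change, _ in unplanned:
        print(f"UNPLANNED: {change.device} {change.detected_at:%Y-%m-%d %H:%M} {change.summary}")
```

Even a rough reconciliation like this gives a troubleshooting technician a short list of "what changed recently and was anyone supposed to change it", instead of starting from symptoms alone.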
What's Changed?

How do operations answer this question today? The answer is … complicated.

First, it depends on which functional domain (aka silo) you're referring to. Is it network configuration changes, servers, applications, storage, cloud, or security? Each of these is likely to have its own management tool for configuring its elements, and some have several: one application team uses Ansible while another prefers Puppet.

All of these disparate tools show only part of the picture, which makes it difficult to understand how independent changes in each functional group can impact the others. Hence the conference bridges assembled to isolate the root cause of complex outages have become commonplace, and no wonder we frequently hear of outrageous hourly outage costs.

According to Gartner, the average cost of IT downtime is $5,600 per minute, or roughly $336,000 per hour. Because businesses operate so differently, downtime can cost as little as $140,000 per hour at the low end, about $300,000 per hour on average, and as much as $540,000 per hour at the high end.

A critical requirement for improving change management in infrastructure operations is "change monitoring": the ability to quickly review configuration changes across all IT infrastructure and easily search for relevant configurations and changes, all in a single place.

The change monitoring tool must be flexible. Depending on your environment, it may need to support a plethora of complex services and devices as well as the nuances of legacy systems. For a Communications Service Provider (CSP), this can range from fiber equipment and 5G Radio Access Network (RAN) equipment to new SDN virtual devices. For a SaaS provider, it needs to be cloud-aware (AWS, Azure, GCP) and support containers and orchestration. For enterprises, hypervisors, SANs, and all of the above.

All of the above is just getting the data. Making the needed information easy for operators to find at their fingertips is another challenge entirely.
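As a sketch of what the core of "change monitoring" could look like, here is a small Python example under stated assumptions: the class, source names, and sample snapshots are illustrative, not a real product's API, and a real deployment would feed it from vendor- and cloud-specific collectors. The idea is to snapshot each source, diff against the previous snapshot, and keep the resulting change records in one searchable timeline.

```python
import difflib
from datetime import datetime

class ChangeTimeline:
    """A single, searchable record of configuration changes across silos."""

    def __init__(self):
        self._last = {}      # source name -> most recently seen config text
        self.changes = []    # accumulated change records

    def record_snapshot(self, source: str, config_text: str, when: datetime = None):
        """Diff a new snapshot against the previous one and record any change."""
        when = when or datetime.utcnow()
        previous = self._last.get(source, "")
        if config_text != previous:
            diff = "\n".join(difflib.unified_diff(
                previous.splitlines(), config_text.splitlines(),
                fromfile=f"{source}/before", tofile=f"{source}/after", lineterm=""))
            self.changes.append({"source": source, "detected_at": when, "diff": diff})
        self._last[source] = config_text

    def search(self, term: str, since: datetime = None):
        """Find changes whose diff mentions a term (hostname, ACL, interface, ...)."""
        return [c for c in self.changes
                if term.lower() in c["diff"].lower()
                and (since is None or c["detected_at"] >= since)]

# Example: snapshots from two different silos land in the same timeline.
timeline = ChangeTimeline()
timeline.record_snapshot("core-router-1", "interface eth0\n mtu 1500")
timeline.record_snapshot("aws:sg-0a1b", "ingress tcp 443 from 0.0.0.0/0")
timeline.record_snapshot("core-router-1", "interface eth0\n mtu 9000")  # later change
for change in timeline.search("mtu"):
    print(change["source"], change["detected_at"], "\n", change["diff"])
```

During an incident, an operator could call something like timeline.search("eth0", since=incident_start) to pull up every recorded change that touched the affected element, regardless of which team or tool made it.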
Conclusion

Change management is more than just managing the approval process and the execution of changes. It must also involve monitoring changes, to bring visibility and accountability to planned, unplanned, and unauthorized changes. Once that is in place, you can start taking actionable steps toward improving infrastructure operations.

Next in the series we will cover "Level 2 – Responsive" and discuss in greater detail how configuration change monitoring can help accelerate root-cause analysis of complex incidents.