The SIFF Story
Our mission is to help manage change in the agile world.
“What the $%&! changed!”
Throughout the decades of infrastructure operations and software development, one of the constantly recurring themes has been “outages” or “incidents” as a result of configuration change.
Recent major public outages at Cloudflare, CenturyLink, Google, and AWS were all caused by configuration errors. Analysts frequently claim that more than 80% of all incidents are caused by planned and unplanned changes.
The risk associated with making configuration changes is the reason why most organizations “freeze” changes to the infrastructure or code. This is particularly evident during critical times, such as the holiday season in retail organizations or the run-up to a major release for software companies.
The general perception that some large organizations are slow to innovate can largely be attributed to their aversion to the risk of making changes. This lack of confidence towards change permeates the organizational culture over time. Conversely, agile organizations are the ones that believe their ability to manage change and its associated risks is essential to their success and to differentiating themselves from their competitors.
In software engineering, developers learn early on that the version control system is a critical process technology, supporting the constant changes occurring in the code base. It not only provides the “backup” for rolling back to previously working revisions; modern tools like GitHub also allow one to quickly and easily see what is changing (“diffs”). This is very helpful when troubleshooting, but it is transformative when used proactively to review code changes, allowing human errors to be identified and addressed before they are introduced to the production environment.
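To make the idea of a “diff” concrete, here is a minimal sketch using Python's standard `difflib` module to compare two revisions of a configuration. The config contents and the `device.conf` name are invented for illustration; this is not how GitHub or SIFF computes diffs.

```python
import difflib


def config_diff(old: str, new: str, name: str = "device.conf") -> list[str]:
    """Return unified-diff lines between two revisions of a config."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"{name}@previous",
            tofile=f"{name}@current",
            lineterm="",
        )
    )


# Hypothetical revisions of the same (made-up) device config.
previous = "hostname edge-1\nntp server 10.0.0.1\nsnmp community public\n"
current = "hostname edge-1\nntp server 10.0.0.2\nsnmp community public\n"

diff_lines = config_diff(previous, current)
print("\n".join(diff_lines))
```

Lines prefixed with `-` were removed and lines prefixed with `+` were added, so a reviewer sees only what actually changed rather than the whole file.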
In IT infrastructure management, the “configuration” across network devices, systems, applications, storage, VMs, containers and cloud, is the code of the infrastructure. At SIFF, we believe that it is essential to have visibility over ALL changes, whether planned or unplanned, especially unauthorized changes and/or security intrusions.
The change data can be used to help narrow down the root cause of incidents when troubleshooting, and in post-mortem analysis to find out what went wrong. More importantly, just as in software engineering, the change data can be used to improve the change process by enabling configuration reviews that prevent errors, either through automated configuration analysis or manual peer reviews.
Our mission at SIFF is to introduce these software engineering advantages to infrastructure operations, in order to help accelerate incident response, incident prevention, and governance.
All Config Changes. Searchable. In One Place
SIFF’s first goal is to collect all config data into a single place, where one can search and audit changes, and apply processes and policies to the change data, in a consistent manner.
Many existing tools already perform “Configuration Management”: for example, SolarWinds NCM for network configs; Ansible, Puppet, or Chef for server and application configs; Terraform for cloud configs; and many more. These tools are focused on applying configuration changes for a specific domain. However, in today’s highly interconnected environment, where changes made by one team are likely to impact other stakeholders, configuration change monitoring must not be limited to a specific domain. The operational support team must be able to see changes across functional groups, enabling them to “connect the dots” and improve their effectiveness at troubleshooting complex incidents.
Troubleshooting and Root-Cause Analysis
At SIFF, our focus is to make it easy for support engineers to quickly identify changes that may be related to an incident. SIFF provides a change activity stream, where users can easily search for config changes around a point in time, see the related components that make up a business application, or simply look up the current configuration for a specific device or service.
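The point-in-time search described above can be pictured with a small sketch. The `ChangeEvent` record, the `changes_around` function, and the sample data are all hypothetical names chosen for illustration; they are not SIFF's API or data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of one detected configuration change."""
    timestamp: datetime
    component: str
    summary: str


def changes_around(events, when, window=timedelta(hours=1), components=None):
    """Return changes within +/- window of `when`, closest in time first.

    If `components` is given, only changes to those components are kept.
    """
    lo, hi = when - window, when + window
    hits = [
        e for e in events
        if lo <= e.timestamp <= hi
        and (components is None or e.component in components)
    ]
    return sorted(hits, key=lambda e: abs(e.timestamp - when))


# Made-up change stream and incident time.
events = [
    ChangeEvent(datetime(2024, 5, 1, 3, 0), "web-1", "package upgrade"),
    ChangeEvent(datetime(2024, 5, 1, 10, 20), "db-1", "max_connections changed"),
    ChangeEvent(datetime(2024, 5, 1, 9, 50), "edge-router-1", "ACL updated"),
]
incident_time = datetime(2024, 5, 1, 10, 0)

nearby = changes_around(events, incident_time, window=timedelta(minutes=30))
for e in nearby:
    print(e.timestamp, e.component, e.summary)
```

Passing `components={"db-1"}` would restrict the search to the parts of a business application the engineer already suspects.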
Configuration change data works particularly well in conjunction with existing event and performance management systems. There is a direct correlation between the number of change events and the number of incidents. Analysis of the change data can quickly narrow the problem search space. In other words, SIFF helps one look in the right places.
Additionally, with a seamless integration to the Change Management process, SIFF can help identify configuration changes resulting from completed Change Requests. Here, planned changes versus unplanned changes are clearly identified, which also helps expedite root-cause analysis in the event of an incident.
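One simple way to picture the planned-versus-unplanned distinction is to match each detected change against the approved window of a Change Request for the same component. The sketch below assumes exactly that rule; the `ChangeRequest` fields, the `attribute_change` function, and the request ID are hypothetical, and the source does not describe how SIFF actually performs this matching.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical approved change window for one component."""
    request_id: str
    component: str
    start: datetime
    end: datetime


def attribute_change(component: str, changed_at: datetime,
                     requests: list[ChangeRequest]) -> Optional[str]:
    """Return the ID of the Change Request covering this change.

    Returns None when no request matches, i.e. the change is unplanned.
    """
    for cr in requests:
        if cr.component == component and cr.start <= changed_at <= cr.end:
            return cr.request_id
    return None


# Made-up Change Request with a two-hour window on one component.
requests = [
    ChangeRequest("CR-1001", "edge-router-1",
                  datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 11, 0)),
]

planned = attribute_change("edge-router-1", datetime(2024, 5, 1, 9, 50), requests)
unplanned = attribute_change("db-1", datetime(2024, 5, 1, 10, 20), requests)
print(planned, unplanned)
```

A change that returns `None` is a candidate unplanned or unauthorized change and is usually the first place to look during an incident.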
Preventing Incidents
Incident prevention is not easy. It requires a consistent process and the discipline to improve quality and avoid mistakes. Strict Change Management processes and Change Advisory Boards (CAB) can help provide the foundational process structure; however, most organizations do not have the resources or time to sustain such a disciplined commitment.
In most organizations, change management is often simply an approval process: “Does this change make sense? Go ahead.” The change then relies on the assigned engineer not making any mistakes. Given that we are all human, incidents and errors are going to happen. The question is, how does one minimize these incidents?
At SIFF, we believe there are two critical elements in preventing incidents:
- Visibility and Accountability – Tools like GitHub bring accountability to software development. Users can easily see what has changed when code is checked in. SIFF provides the same functionality for infrastructure changes. From the Change Request, users can see what infrastructure changes resulted from that work. This is particularly helpful when troubleshooting complex incidents.
- Implementation Review – Peer review goes hand-in-hand with accountability. There is no accountability if no one is going to check the work. The simple fact that someone is going to review the work helps improve quality. You’d be surprised how many errors are caught by a quick review of the “actual” changes, rather than simply reviewing what is “planned” to be changed, which is what happens in most organizations. The review does not have to be an exhaustive exercise. Most organizations do not do implementation reviews because gathering all the underlying config changes for a change request is very time consuming. SIFF helps by automating the collection of infrastructure changes and their attribution to Change Requests, which is what accountability and peer review require.
Additionally, incidents can be further prevented through proactive, automated configuration policies and compliance monitoring.
Governance and Compliance
A key part of “constantly improving” is learning from mistakes and preventing them from happening again. One of the benefits of having all configs in one place is that you can apply processes to the data in a consistent manner.
Most organizations are required to do a security review and audit at least once a year. Some of these IT security and compliance controls require attestation that all changes are audited. SIFF already provides this for you.
Our goal is to go further, to enable you to define and automate complex policies that ensure configuration changes are consistently following best practices and policy guidelines.
After a security review, guidelines and recommendations are often defined and communicated to the stakeholders, to ensure that they are implemented. For example, a policy may require that all network communications between application components be encrypted (e.g. https turned on), that a database password be changed each month, that a server have no listening ports other than those on a specified list, and so on.
Most organizations do not have the resources to perform frequent IT audits. They simply assume that these guidelines and recommendations are implemented. Often, however, these false assumptions are exposed as soon as something bad happens.
SIFF builds on its ability to collect all configs in one place, and allows policy rules to analyze the configuration data so that any misconfigurations or violations are reported.
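As a rough illustration of what such policy rules might look like, the sketch below checks a collected configuration against the three example guidelines mentioned earlier: encryption enabled, monthly database password rotation, and an allow-list of listening ports. The config keys, the `ALLOWED_PORTS` set, the 31-day threshold, and the `check_policies` function are all assumptions made for this example, not SIFF's rule format.

```python
from datetime import date

# Assumed allow-list of listening ports for this example.
ALLOWED_PORTS = {22, 443}
# Assumed interpretation of "changed each month".
MAX_PASSWORD_AGE_DAYS = 31


def check_policies(config: dict, today: date) -> list[str]:
    """Return a list of human-readable policy violations (empty if compliant)."""
    violations = []

    # Rule 1: traffic between components must be encrypted.
    if not config.get("https_enabled", False):
        violations.append("https is not enabled")

    # Rule 2: the database password must have been changed within the last month.
    age = (today - config["db_password_changed"]).days
    if age > MAX_PASSWORD_AGE_DAYS:
        violations.append(f"database password is {age} days old")

    # Rule 3: no listening ports outside the allow-list.
    extra = sorted(set(config.get("listening_ports", [])) - ALLOWED_PORTS)
    if extra:
        violations.append(f"unexpected listening ports: {extra}")

    return violations


# Made-up collected configuration for one server.
server_config = {
    "https_enabled": False,
    "db_password_changed": date(2024, 1, 1),
    "listening_ports": [22, 443, 8080],
}

violations = check_policies(server_config, today=date(2024, 3, 1))
for v in violations:
    print(v)
```

Running the same rules each time a configuration change is collected would turn a yearly audit assumption into a continuous check.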
Managing Change in the Agile World
We hope that our mission, to help manage the increasing demand for and complexity of infrastructure changes, is something that you are also passionate about. We believe the ability to manage configuration changes, reduce incidents, and improve configuration governance is key to your innovation and competitive differentiation.