One approach to improve just about anything is to learn from a mishap and do something to prevent similar mishaps from taking place. In the IT world, stuff happens frequently enough that we have the tendency to just fix things up and move on to the next incident or crisis. Having a disciplined approach to do root cause analysis (RCA) on incidents and putting permanent solutions in place, or Problem Management another word, can only help.
ITIL Service Operation handbook already provides an excellent overview of the Problem Management process with a suggested model and various analysis techniques, there is no need for me to reiterate. I am proposing a periodic Major Incident Review process that takes place after incidents have been resolve with service restored. People and tools aside, I think there are three things that can contribute to the effectiveness of this review process.
- Have a well-defined scope of incident you plan to review. Depending how an organization defines the impact and urgency of the incidents, the scope of incidents that get reviewed can vary. Some shops will choose to review only the most critical incidents with visible business impacts. Other might choose to review all incidents that took place. Many organizations will probably fall somewhere in between. My recommendation is to figure out what types of incidents the process stakeholders care about. We will discuss more about the stakeholder shortly.
- Have a clear exit plan on what to do with the root cause discovered and the permanent solutions. I think the situation to avoid is having the root causes identified but getting stuck on what to do next. The root cause to a number of incidents is a simple break-down of technology. For those straight-forward incidents, you identify what broke down, fix it, and move on. Sometimes a permanent solution could take a significant amount of time, people, or financial resource to achieve. For all incidents, a decision should be made to either
- Follow up using the same Major Incident Review process up to some point.
- Get this item off the Major Incident Review process but to use another process/procedure to track and to follow up on making sure the permanent solutions get implemented
- Decide nothing further will or can be done. Capture the lesson learned in a known error database or some knowledge management repository.
- Perform the process periodically, without exception. The frequency of the review can vary from one organization to another, monthly or bi-weekly for some or weekly for those busy shops. RCA activities can have a tendency to take a while to do because they often do not register at the same level of criticality as the incidents. By having this review process on a periodic basis, we are taking a position of … “Don’t procrastinate. Let’s figure out what went wrong. What were the causes? Decide what we plan to do about it and move on.”
In this post, I will discuss what you need in order to put a Major Incident Review process together for your organization. In addition to the three success factors that need to take into account, here are some additional elements to consider.
- Who will participate and who are the stakeholders? This review process will involve three groups of people at a minimum, the Incident Management team, the Problem Management team, and the technology/application support team. The actual organizational functions that perform the incident and problem management processes can vary. Frequently, the Service Desk is the owner for the Incident Management process, while the ownership for Problem Management rests somewhere outside of the Service Desk. Also, the IT/corporate management and the business users will likely become another set of stakeholders, since the results of the view often get shared with those constituents.
- Who will own this process? Since the review process will occur after incident resolution and spend a great deal time on RCA activities, the Problem Management process owner is the logical owner of this process. Even though the name of the process is called Major “Incident” Review, calling this process a Major “Problem” Review just sounds awkward. The convention you use in your organization should determine what is the most easily understood name you will use.
- What technology or tools will the process require? This process is straight forward enough where a spreadsheet tool should be sufficient. In addition, most organization will have some type of incident tracking tools that can feed the incident information into this review process. The output of this process could also feed into a Problem Management tool if you have one.
Here is an example of the review process flow. I have used a bi-weekly schedule for the review process. Depending on your organization’s requirement, this schedule could expand or contract. In part two, I will describe this process flow more in detail with a template you can use to capture the incident review details.
Links to other posts in the series