Major Incident Review Process Design – Part Two

This post is part two (and the concluding part) of a series in which we discuss the Major Incident Review process and how to put one together. Previously, we discussed the elements and considerations that should go into the process design and illustrated them with a sample process flow. Here, we describe the process activities further, along with a reporting template you can use to implement the process.

Sample Incident Report Template

Sample Process Design Document

The process design document provides a detailed description of the fields within the report template, so I will not repeat it here. I think there are two factors to keep in mind when undertaking such a process. First, don’t do the process just for the sake of doing it. Do it because your organization genuinely wants to improve service by eliminating as many of these incidents as it can over the long term. If the organization chooses not to implement a certain solution for some reason (cost, technical complexity, longevity of the technology, regulatory/compliance concerns, or anything else), at least document the discussion. That way, the record shows that the organization understood the risks and chose to accept them.

Second, perform meaningful measurements and, again, use the statistics to improve service. For example, if the majority of the incidents are reported by end users, perhaps that is a clue that we should be more proactive and beef up the automated monitoring? If a particular technology area has been experiencing more major incidents than other areas, perhaps we should figure out what ills are plaguing the area and fix what is broken? If a particular business unit or segment has been experiencing more major incidents than other segments, perhaps we owe it to the business community to figure out what we can do to make things better? The business impact information we capture will enhance our understanding of the incidents and help us formulate solutions that make sense for the business.
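
As a rough illustration of the kind of statistics described above, here is a minimal Python sketch that tallies major incidents by who detected them, by technology area, and by business unit. The field names (`detected_by`, `technology_area`, `business_unit`) and the sample records are hypothetical; substitute whatever your own report template captures.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": "INC-1", "detected_by": "end_user", "technology_area": "network", "business_unit": "sales"},
    {"id": "INC-2", "detected_by": "monitoring", "technology_area": "database", "business_unit": "finance"},
    {"id": "INC-3", "detected_by": "end_user", "technology_area": "network", "business_unit": "sales"},
    {"id": "INC-4", "detected_by": "end_user", "technology_area": "email", "business_unit": "finance"},
]


def breakdown(records, field):
    """Return (value, count, share) tuples for one field, most frequent first."""
    counts = Counter(record[field] for record in records)
    total = len(records)
    return [(value, n, n / total) for value, n in counts.most_common()]


if __name__ == "__main__":
    for field in ("detected_by", "technology_area", "business_unit"):
        print(field)
        for value, n, share in breakdown(incidents, field):
            print(f"  {value}: {n} ({share:.0%})")
```

In this made-up sample, three of the four incidents were reported by end users, which is the sort of signal that might prompt a closer look at automated monitoring.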

Most organizations I know practice some type of incident review process, so I hope the information presented so far has been helpful. Please feel free to suggest other approaches that have worked for your organization.

Links to other posts in the series

Major Incident Review Process Design – Part One

One approach to improving just about anything is to learn from a mishap and do something to prevent similar mishaps from taking place. In the IT world, stuff happens frequently enough that we have a tendency to just fix things up and move on to the next incident or crisis. Having a disciplined approach to root cause analysis (RCA) on incidents and to putting permanent solutions in place (in other words, Problem Management) can only help.

The ITIL Service Operation handbook already provides an excellent overview of the Problem Management process, with a suggested model and various analysis techniques, so there is no need for me to reiterate it. I am proposing a periodic Major Incident Review process that takes place after incidents have been resolved and service has been restored. People and tools aside, I think there are three things that can contribute to the effectiveness of this review process.

  1. Have a well-defined scope of incidents you plan to review. Depending on how an organization defines the impact and urgency of incidents, the scope of incidents that get reviewed can vary. Some shops will choose to review only the most critical incidents with visible business impact. Others might choose to review all incidents that took place. Many organizations will probably fall somewhere in between. My recommendation is to figure out what types of incidents the process stakeholders care about. We will discuss the stakeholders shortly.
  2. Have a clear exit plan for what to do with the root causes discovered and the permanent solutions. I think the situation to avoid is having the root causes identified but getting stuck on what to do next. The root cause of a number of incidents is a simple breakdown of technology. For those straightforward incidents, you identify what broke down, fix it, and move on. Sometimes a permanent solution could take a significant amount of time, people, or financial resources to achieve. For every incident, a decision should be made to do one of the following:
    1. Follow up using the same Major Incident Review process up to some point.
    2. Take this item off the Major Incident Review process and use another process/procedure to track it and follow up until the permanent solutions are implemented.
    3. Decide nothing further will or can be done. Capture the lesson learned in a known error database or some knowledge management repository.
  3. Perform the process periodically, without exception. The frequency of the review can vary from one organization to another: monthly or bi-weekly for some, weekly for busier shops. RCA activities tend to take a while because they often do not register at the same level of criticality as the incidents themselves. By holding this review on a periodic basis, we are taking the position of “Don’t procrastinate. Let’s figure out what went wrong and what the causes were, decide what we plan to do about it, and move on.”
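
To make the exit plan in point 2 concrete, here is a minimal Python sketch that records one of the three decisions against each reviewed incident and flags any item still waiting for a decision. The class, field, and function names are my own illustrative assumptions, not part of any ITIL model or tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Disposition(Enum):
    """The three exit decisions listed above (names are illustrative)."""
    CONTINUE_REVIEW = "keep following up in the Major Incident Review"
    HAND_OFF = "track the permanent solution in another process"
    CLOSE_AS_KNOWN_ERROR = "nothing further; capture the lesson learned"


@dataclass
class ReviewItem:
    incident_id: str
    root_cause: Optional[str] = None
    disposition: Optional[Disposition] = None


def items_without_exit_plan(items: List[ReviewItem]) -> List[ReviewItem]:
    """Return the reviewed items that have not been given a decision yet."""
    return [item for item in items if item.disposition is None]


if __name__ == "__main__":
    # Hypothetical review agenda.
    agenda = [
        ReviewItem("INC-1", "failed disk", Disposition.CLOSE_AS_KNOWN_ERROR),
        ReviewItem("INC-2", "capacity shortfall"),
        ReviewItem("INC-3", "aging platform", Disposition.HAND_OFF),
    ]
    for item in items_without_exit_plan(agenda):
        print(f"{item.incident_id} still needs a decision")
```

Running a check like this at the end of each review is one way to make sure no item leaves the meeting with a root cause but no next step.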

In this post, I will discuss what you need in order to put a Major Incident Review process together for your organization. In addition to the three success factors that need to be taken into account, here are some additional elements to consider.

  1. Who will participate and who are the stakeholders? This review process will involve three groups of people at a minimum: the Incident Management team, the Problem Management team, and the technology/application support team. The actual organizational functions that perform the incident and problem management processes can vary. Frequently, the Service Desk is the owner of the Incident Management process, while the ownership of Problem Management rests somewhere outside of the Service Desk. Also, IT/corporate management and the business users will likely become another set of stakeholders, since the results of the review often get shared with those constituents.
  2. Who will own this process? Since the review process will occur after incident resolution and spend a great deal of time on RCA activities, the Problem Management process owner is the logical owner of this process. The process is named Major “Incident” Review only because calling it a Major “Problem” Review just sounds awkward. Let the conventions of your organization determine which name will be most easily understood.
  3. What technology or tools will the process require? This process is straightforward enough that a spreadsheet tool should be sufficient. In addition, most organizations will have some type of incident tracking tool that can feed incident information into this review process. The output of this process could also feed into a Problem Management tool if you have one.

Sample Major Incident Review Process Flow

Here is an example of the review process flow. I have used a bi-weekly schedule for the review process. Depending on your organization’s requirements, this schedule could expand or contract. In part two, I will describe this process flow in more detail, with a template you can use to capture the incident review details.


Event Management Process Design – Part One

This post is the first part of a series where we discuss the Event Management process and how to put one together. According to ITIL (quoted directly), Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exceptional conditions. In other words, Event Management picks up the alerts and events generated by the devices and applications, figures out what to do with them, and follows up afterward to make sure they get due attention and are addressed properly. To begin putting together an Event Management process for your organization, here are some elements to think about.

  1. What events and alerts do you plan to trap and process? It may be a noble goal to design a process that can trap 100% of the alerts from the environment and process them all, but it is not always possible. Some events can be trapped and processed automatically by the tools you have on hand, and some alerts will require manual intervention. Where will the alerts/events be captured from, and where will they be recorded? ITIL suggests centralizing the event management process as much as possible, and it makes sense. If the alerts need to come from different technology stacks or devices, which they often do, can you at least centralize the location where the recording and processing activities take place? Determine the scope, what you can and cannot do, and have a clear idea of what you hope to get out of the process.
  2. Once you determine the set of events or alerts that can be picked up and fed through the process, you will need a set of rules on what to do with those events. The rules need to be explicit so there is little room for guessing or personal interpretation by those carrying out the process. The rules will determine which conditions, once they meet or exceed some threshold, will trigger an event. For example, you may have a rule that says when server ABC’s CPU utilization reaches 90% and stays there consistently for over 10 minutes during business hours (6am to 6pm), an alert will be triggered. The rule will further stipulate what actions will be taken when the event is triggered. For example, you may have a rule that says the CPU alert will be escalated or handed over to the systems admin team for further evaluation via email or phone call. The rule will also call out what acknowledgement or interaction constitutes a successful escalation or hand-off.
  3. You should have a classification scheme for the incoming alerts/events. Not all alerts require the same handling actions. ITIL’s suggestion of classifying alerts as Informational, Warning, or Exception is a good starting point and more than sufficient for most organizations anyway. For example, informational alerts usually get recorded for historical purposes and are not escalated anywhere else; only the warning and exception alerts get escalated further. Warning and exception alerts may in turn be escalated differently, to different teams and with different timing considerations. Furthermore, once the alert is escalated, the job of Event Management is not 100% done. We also need a standard rule or approach for how to follow up while the alert condition is being addressed and how to close out the alerts once certain conditions are met (incident resolved or alerts cease to repeat within a 24- or 48-hour time frame).
  4. As you can see, determining what to do with an alert, making sure the alerts are handled correctly and efficiently, and following up to close the alerts properly takes some up-front thought and planning. The number of alerts monitored in a moderately complex IT environment can grow very quickly. Therefore, having heavily customized, individual alerts is not recommended, and really not necessary. My suggestion is to have a default event handling procedure that will work for 90-95% of the events you anticipate processing. For the remaining 5-10%, use the default handling procedure as the foundation, with some customized procedure on top so the events can be handled correctly.
  5. Who will be on point as the process owner, and who will be responsible for carrying out the Event Management process? If you are lucky enough to have a team in your organization whose primary responsibility is to monitor the environment and process the events, that team can be both accountable as the process owner and responsible for doing the work. If a dedicated team is not an option and multiple people/teams will be carrying out the process, at least designate one single process owner and have a consistent process in place for everyone else to follow.
  6. How will the process be measured for efficiency and effectiveness? What measurements does your organization care about? What actions will result from analyzing the measurement data? Measurements will mean very little if they are not acted upon to further improve the performance of the process.
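
To show how explicit a rule can be, here is a minimal Python sketch of the CPU example from point 2, together with a default handling procedure keyed by the three ITIL classifications and a small customization layered on top, as suggested in points 3 and 4. All class names, function names, step names, and the `server-abc` override are hypothetical, and it is my assumption, not the rule's, that only samples taken during business hours count toward the 10 minutes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ThresholdRule:
    threshold: float = 90.0                      # percent CPU, from the example
    sustain: timedelta = timedelta(minutes=10)   # "for over 10 minutes"
    start_hour: int = 6                          # business hours: 6am to 6pm
    end_hour: int = 18


def in_business_hours(rule, ts):
    return rule.start_hour <= ts.hour < rule.end_hour


def should_alert(rule, samples):
    """samples: (timestamp, cpu_percent) pairs in time order.

    True when utilization stays at or above the threshold for longer than
    the sustain period, counting only business-hours samples (assumption).
    """
    breach_start = None
    for ts, value in samples:
        if value >= rule.threshold and in_business_hours(rule, ts):
            if breach_start is None:
                breach_start = ts
            if ts - breach_start > rule.sustain:
                return True
        else:
            breach_start = None  # condition broken; start counting again
    return False


# Default procedure per classification; informational alerts are only recorded.
DEFAULT_STEPS = {
    "informational": ["record"],
    "warning": ["record", "escalate"],
    "exception": ["record", "escalate"],
}

# Hypothetical customization for the few sources that need extra handling.
EXTRA_STEPS = {"server-abc": ["phone systems admin team"]}


def handling_steps(source, classification):
    """Default steps for the classification plus any source-specific extras."""
    return DEFAULT_STEPS[classification] + EXTRA_STEPS.get(source, [])


if __name__ == "__main__":
    start = datetime(2024, 1, 8, 9, 0)
    samples = [(start + timedelta(minutes=i), 95.0) for i in range(12)]
    if should_alert(ThresholdRule(), samples):
        print(handling_steps("server-abc", "warning"))
```

Keeping the defaults in one table and the exceptions in another mirrors the idea of one procedure for most events with a thin layer of customization for the rest.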

That is a lot to think about for now. In part two, I will provide a sample list of Event Management process design requirements and a sample process flow for further discussion.
