In IT, incidents as a result of technology failure or human error can strike at any moment. Occasionally, we can have an incident that has a wide impact and poses serious risks to the business operations. Those major incidents need to be handled swiftly, so the IT service can be restored quickly with useful information captured that can be used for the root cause analysis afterward. If you have business critical services or applications under your management, having an organized approach to handling major incidents can save a lot of time and improve productivity. If you need to put a process together for your organization, here are some elements to take into consideration.
- Scope and Criteria: What characteristics would qualify an incident as being a “Major” Incident? This is very organization specific but generally there are two basic elements to consider, impact and urgency. Many organizations use the combination of those two elements to classify the priority level assigned to an incident, and that is a good starting point. Any incident that possesses a high degree of impact and high degree of urgency should probably be considered “major” and get the utmost attention. You may have other characteristics you want to define. For example, the outage of a particular application or for a particular line of business may trigger a “major” incident automatically. Since mobilizing the people and logistic necessary to handle a major incident is never a trivial exercise, clearly defined and agreed upon scope and criteria are mandatory.
- Roles and Responsibilities: Who will declare a major incident is in motion and own the process execution end-to-end? Since we are talking about major incidents, the Incident Management process owner in your organization will likely own this process as well. Will you have a person or a team designated as the “Major Incident Manager?” Will you rotate such role from individual to individual or from team to team? Depending on the nature of the technology failure or breakdown, how will the major incident manager find the appropriate technical resources to get involved? Will the major incident manager someone who is on stand-by waiting for the occasions to spring into action or will she have another “day-job” and wear the major incident manager hat when necessary? This will again depend on how your organization feels about this role. One thing I am certain of is that this role will require someone with the appropriate skills, environment know-how, and leadership experience to pull people together and execute the agreed-upon process. Another word, I do not believe this is a simple service desk phone dispatch type of role.
- Logistic and Facility: Everyone needs to know exactly what to do when the major incident process gets initiated. Will you have a dedicated meeting space or war room type of set up? Will people know what teleconference number to use in order to call in and to provide updates or to receive updates? Will you have a separate teleconference number to work through the technology aspect of incident recovery without cluttering with other non-technical discussions? Who will manage the conference call? What criteria determine when the conference calls start and end? In addition to the conference call, will you hold some kind of web meeting or online collaboration setup where people can share things on screen? Will you have some type of continual update via web or email, so people can stay informed? All these finer details should be planned upfront.
- Escalation and Communication: How will you define the communication interval and who will receive what communication at what point in time? How will the incident be escalated up the chain of command as long as the incident remains open? For example, you may define something simple as follow:
- At Hour 0: Major incident declared and the technical team contacted by phone. Director of the technical team and VP of IT notified via email.
- At Minute 30: Director of the technical team notified again via email with updates.
- At Hour 1: Major Incident Manager asks the Director of the technical team to join the conference in person. Another email update goes to the VP of IT.
- At Hour 2: Major Incident Manager asks the VP of IT asked to join the conference call for updates.
- At Hour 4: Major Incident Manager asks the business customer to join the conference call for updates and to discuss other recovery options.
- Other Considerations: How will this process connect with a downstream process such as Problem Management? Will you have the problem manager on the call as the incident progresses? What documentation or deliverables will the major incident process produce? Simple log of incident chronology, who participated the call when, important details shared at various point of the incident, official updates communicated, reasons for the incident closure, and other pertinent information about the incident probably should be documented at a minimum.
One thing for sure, all these considerations are too important not to get agreed upon beforehand. When the agreed upon details are not in place, it is simply not productive for everyone involved to try to figure out the process details during the heat of the battle. When that happens, most people have a tendency to go into the “headless chicken” mode – responsibility-dodging and finger-pointing start to spawn shortly afterward. In the next post, I will provide a sample process flow for further discussion.
Links to other posts in the series