We talked about how analytics can be the next logical step beyond simple measurements and metrics. Analytics can help by highlighting potential correlations between data points, giving us more opportunities to connect the dots and reach insights.
While analytics can move things in a positive direction, reaching an insight is never guaranteed.
Often, human decisions can be complex. Analytics is not the be-all and end-all mechanism for decision-making.
Machine learning tools can handle many data wrangling, cleaning, transformation, and hyperparameter tuning tasks that are complex but not particularly creative.
Still, it would be foolish to throw tons of data at the algorithms and simply do whatever the machine learning application tells you to do.
This means leaders and managers need to understand how their organizations really work, from both quantitative and qualitative perspectives.
The qualitative perspective often provides additional context for the quantitative perspective.
Insights with the balanced quantitative and qualitative perspectives will likely be the most actionable.
For many organizations, measurements are just data. Metrics are mostly ratios or simple manipulations of the measurements. Measurements and metrics may say something about what has happened, but they do not explain why.
For example, suppose a popular service desk (SD) metric, First Call Resolution, dropped 5 percentage points this month versus last. Was it because…
Did we have a major outage, so the call volume went way up?
Did the end users suddenly become more sophisticated and ask harder questions?
Did the SD analysts become lazier or less capable?
IT is a complex business, and there are often multiple factors in play when a metric moves in a certain direction.
This means we need a more sophisticated mechanism than just simple metrics to help analyze and evaluate root causes.
Analytics can help.
Descriptive analytics, even with basic statistics-based techniques, can help point out correlations that might exist within the data. Correlations, as we all know, do not prove causation, but they can still be helpful.
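As a minimal sketch of that idea, the following computes a Pearson correlation coefficient between two hypothetical service-desk series: monthly call volume and First Call Resolution percentage. Both data series are invented purely for illustration.

```python
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical monthly data: call volume vs. First Call Resolution %
call_volume = [1200, 1350, 1100, 1800, 2100, 1950]
fcr_percent = [78, 75, 80, 71, 65, 69]

print(round(pearson(call_volume, fcr_percent), 2))  # prints -0.99
```

A strongly negative coefficient like this would suggest that FCR tends to drop when call volume spikes, which is a lead worth investigating, not proof of a cause.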
Predictive analytics, along with machine learning techniques, can help create models that might point out future behaviors. Predictive models are only as good as the quality and quantity of the data used to train them. Still, the models are a lot more actionable than relying on gut feel alone.
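In the same spirit, here is a bare-bones predictive sketch: fitting a least-squares trend line to hypothetical monthly ticket volumes and extrapolating one month ahead. The data and the linear-trend assumption are both illustrative; real forecasting would need far more care.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical ticket volumes for months 1-6
months = [1, 2, 3, 4, 5, 6]
volumes = [100, 120, 140, 160, 180, 200]

slope, intercept = fit_line(months, volumes)
forecast = slope * 7 + intercept  # naive extrapolation for month 7
print(forecast)  # prints 220.0
```

Even a crude trend line like this gives a concrete number to plan capacity around, which beats gut feel alone, as long as its limits are understood.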
Moving forward, finding the necessary data and applying the analytics will be essential for managing IT.
“If You Can’t Measure It, You Can’t Manage It” is frequently quoted when emphasizing the importance of managing with data, not just with intuition alone.
Some organizations took the quote and ran with metric-gathering exercises, accumulating tons of data in the hope of finding that key insight into what the organization ought to be doing next.
Perhaps the question to ask first is this: What outcomes do we want to see, and in what context or reality do we need to work?
After determining the outcome or process to manage, ask what data is needed and whether the cost of collecting that data might outstrip the benefit.
Ask the hard questions first. The resulting data collected will be a lot more meaningful.
This post is the first part of a series where we discuss the Event Management process and how to put one together. According to ITIL, Event Management is the process that monitors all events occurring throughout the IT infrastructure, to allow for normal operation and to detect and escalate exception conditions. In other words, Event Management picks up the alerts and events generated by devices and applications, figures out what to do with them, and follows up afterward to make sure they receive due attention and are addressed properly. To begin putting together an Event Management process for your organization, here are some elements to think about.
- What events and alerts do you plan to trap and process? It may be a noble goal to design a process that traps 100% of the alerts from the environment and processes them all, but that is not always possible. Some events can be trapped and processed automatically by the tools you have on hand, and some alerts will require manual intervention. Where will the alerts/events be captured from, and where will they be recorded? ITIL suggests centralizing the event management process as much as possible, and that makes sense. If the alerts come from different technology stacks or devices, which they often do, can you at least centralize the location where the recording and processing activities take place? Determine the scope, what you can and cannot do, and have a clear idea of what you hope to get out of the process.
- Once you determine the set of events or alerts that can be picked up and fed through the process, you will need a set of rules on what to do with those events. The rules need to be explicit, leaving little room for guessing or personal interpretation by those carrying out the process. The rules determine what conditions, once a threshold is met or exceeded, will trigger an event. For example, you may have a rule that says when server ABC's CPU utilization reaches 90% and stays there consistently for over 10 minutes during business hours (6am to 6pm), an alert will be triggered. The rule will further stipulate what actions to take when the event is triggered. For example, the rule may say the CPU alert will be escalated or handed over to the systems admin team for further evaluation via email or phone call. The rule will also call out what acknowledgement or interaction constitutes a successful escalation or hand-off.
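The CPU rule above can be sketched in code. This is an illustration of the logic only, not a production monitor; the sample format, threshold, sustained window, and business-hours definition come straight from the example in the text.

```python
from datetime import datetime, timedelta

BUSINESS_START, BUSINESS_END = 6, 18   # business hours: 6am to 6pm
CPU_THRESHOLD = 90.0                   # percent
SUSTAINED_FOR = timedelta(minutes=10)  # breach must persist this long

def should_alert(samples):
    """samples: list of (timestamp, cpu_percent) tuples, oldest first.
    Trigger when CPU stays at or above the threshold for the full
    sustained window, entirely within business hours."""
    breach_start = None
    for ts, cpu in samples:
        in_hours = BUSINESS_START <= ts.hour < BUSINESS_END
        if cpu >= CPU_THRESHOLD and in_hours:
            if breach_start is None:
                breach_start = ts       # breach just began
            if ts - breach_start >= SUSTAINED_FOR:
                return True             # sustained long enough: alert
        else:
            breach_start = None         # condition cleared; reset
    return False
```

For example, twelve one-minute samples at 95% starting at 9:00am would trigger the alert, while the same samples at 2:00am, or a five-minute spike, would not.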
- You should have a classification scheme for the incoming alerts/events. Not all alerts require the same handling actions. ITIL's suggestion of classifying alerts as Informational, Warning, or Exception is a good starting point and more than sufficient for most organizations anyway. For example, informational alerts usually get recorded for historical purposes and are not escalated anywhere else; only the warning and exception alerts get escalated further. Warning and exception alerts may themselves be escalated differently, to different teams, with different timing considerations. Furthermore, once an alert is escalated, the job of Event Management is not 100% done. We also need a standard rule or approach for following up while the alert condition is being addressed, and for closing out the alerts once certain conditions are met (the incident is resolved, or the alert ceases to repeat within a 24- or 48-hour time frame).
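One way to express such a classification and routing scheme is a severity enum plus a routing table. The destination names here are invented for illustration; the three severity classes are ITIL's.

```python
from enum import Enum

class Severity(Enum):
    INFORMATIONAL = 1
    WARNING = 2
    EXCEPTION = 3

# Hypothetical routing table: where each class of alert goes next.
ROUTING = {
    Severity.INFORMATIONAL: "log-only",       # recorded, not escalated
    Severity.WARNING: "systems-admin-queue",  # escalated, normal priority
    Severity.EXCEPTION: "on-call-page",       # escalated immediately
}

def route(severity):
    """Return the escalation destination for a classified alert."""
    return ROUTING[severity]
```

Keeping the routing in one table makes the escalation paths explicit and easy to review, which supports the "little room for personal interpretation" goal above.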
- As you can see, determining what to do with an alert, making sure alerts are handled correctly and efficiently, and following up to close them properly take some up-front thought and planning. The number of alerts monitored in a moderately complex IT environment can grow very quickly. Therefore, having heavily customized, individual alerts is not recommended, and really not necessary. My suggestion is to have a default event handling procedure that will work for 90-95% of the events you anticipate processing. For the remaining 5-10%, use the default handling procedure as the foundation and layer some customized steps on top so those events can be handled correctly.
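The default-plus-override idea can be sketched as a handler lookup with a fallback. The event types and handler behavior below are hypothetical, meant only to show the structure.

```python
def default_handler(event):
    # Hypothetical default: record everything, escalate anything
    # that is not merely informational.
    return {"recorded": True,
            "escalated": event["severity"] != "informational"}

def db_replication_lag_handler(event):
    # Customized handling: start from the default outcome,
    # then layer the extra step on top.
    outcome = default_handler(event)
    outcome["notify"] = "dba-team"
    return outcome

# A small override table covers the ~5-10% of events needing custom
# handling; everything else falls through to the default.
HANDLERS = {"db-replication-lag": db_replication_lag_handler}

def handle(event):
    handler = HANDLERS.get(event["type"], default_handler)
    return handler(event)
```

Because every custom handler builds on the default, the common behavior stays in one place and the override table stays small.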
- Who will be on point as the process owner, and who will be responsible for carrying out the Event Management process? If you are lucky enough to have a team in your organization whose primary responsibility is monitoring the environment and processing the events, that team can be both accountable as the process owner and responsible for the day-to-day work. If a dedicated team is not an option and multiple people or teams will be carrying out the process, at least designate a single process owner and have a consistent process in place for everyone to follow.
- How will the process be measured for efficiency and effectiveness? What measurements does your organization care about? What actions will result from analyzing the measurement data? Measurements will mean very little if they are not acted upon to further improve the performance of the process.
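As one concrete example of an efficiency measurement, mean time to acknowledge (MTTA) can be computed from event records. The records and timestamps below are invented for illustration; the point is that the measurement should feed a decision, such as whether escalation hand-offs are fast enough.

```python
from datetime import datetime

# Hypothetical event records: when each alert was raised vs. acknowledged.
events = [
    {"raised": datetime(2024, 1, 1, 9, 0),  "acked": datetime(2024, 1, 1, 9, 4)},
    {"raised": datetime(2024, 1, 1, 11, 0), "acked": datetime(2024, 1, 1, 11, 10)},
    {"raised": datetime(2024, 1, 2, 14, 0), "acked": datetime(2024, 1, 2, 14, 7)},
]

# Acknowledgement delay in minutes for each event, then the mean.
minutes = [(e["acked"] - e["raised"]).total_seconds() / 60 for e in events]
mtta = sum(minutes) / len(minutes)
print(f"Mean time to acknowledge: {mtta:.1f} minutes")  # prints 7.0 minutes
```

Trending a number like this over time, and acting when it drifts, is what turns a measurement into process improvement.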
That is a lot to think about for now. In part two, I will provide a sample list of Event Management process design requirements and a sample process flow for further discussion.
Links to other posts in the series