Problem Management Process Design – Part 3

This post is the third and final installment of the series on designing a problem management (PM) process for your organization. Previously, we discussed the elements and considerations that should go into designing a problem management process. We elaborated on those considerations with a sample list of process requirements, a sample process flow, and a sample RCA form. In this post, we will assemble all the information into one process design document that can be used to implement the process.

Problem Management Process Design Example

In addition to the process requirements and the process flow, I believe a process design document should call out the following information pertinent to the implementation of the process. For example…

  • The Policy section outlines what policy statements (IT or corporate) govern the process and what expectations the organization wants to set for the process.
  • The Scope section specifies which incidents or events will generate a problem record. Your organization may have a pre-defined set of criteria on how problems are triggered, and those criteria can go into this section. Some organizations may also choose to group a series of closely related incidents and trigger a problem record for those incidents.
  • The Roles and Responsibilities section outlines the roles that will be involved in the process and their corresponding activities and responsibilities.
  • The Artifacts and Communication section describes what documentation methods will be utilized by the PM process. It provides the procedural information necessary to carry out the PM process. The communication protocols section describes the recommended communication methods and their frequencies.
  • The SLA and Metrics section describes the metrics that will be used to measure the process performance. The tutorial document has outlined some examples. Develop and measure the metrics that you can capture reliably and that your organization also cares about.

To reiterate, the primary goal of the Problem Management process is to identify the problems in the IT environment, so we can eliminate them by performing root cause analysis on the problems. As a capable IT organization, we should be able to correctly diagnose the root causes of just about everything that goes wrong within our IT environment and to implement solutions so similar problems or incidents will not recur. With proper documentation, the Problem Management database is a great learning tool. Another benefit of a well-run problem management process is the ability to review organizational decisions made about addressing a particular problem. Known errors do not need to be purely technical. They could also be the documented decisions about how we plan to address certain problems. The root causes, solutions (proposed or implemented), and workarounds documented as part of the Problem Management process will benefit the Incident Management process immensely when similar incidents surface due to the recurrence of a problem.

I hope the information presented so far has been helpful. Please feel free to suggest options or other approaches that have worked for your organization.

Problem Management Process Design – Part 2

This post is part two of a series where we discuss the Problem Management process and how to put one together. In the previous post, I presented some design elements for consideration. In this follow-up post, I will illustrate the design activities further with the following, additional elements.

Problem Management Process Requirements

Problem Management Process Flow

Problem Management RCA Form

The first document contains a list of sample process requirements. The purpose of the requirement document is to capture all considerations that need to be factored into the process design. You will need to decide what activities or requirements will be considered a critical part of the Problem Management process. For example, if row #12 “Categorize the root causes to facilitate further analysis” is important to your organization, make sure that particular requirement is documented, so your process design will incorporate a method of categorizing the root causes.

What can you do if you need some extra help on knowing what to look for in designing your Problem Management process? I would suggest using the following documents as your starting point:

  • ITIL: Problem Management in the Service Operation manual, section 4.4.
  • COBIT 5: Enabling process DSS03 – Manage Problems
  • ISO/IEC 20000: Problem Management in Section 8.2

By using ISO/IEC 20000 as the base, I have derived some sample requirements for your reference. As you can see from the document, the sample requirements outlined in the document are pretty rudimentary and generic. You need to tailor your version of the document with the actual requirements from your organization. Do not select a particular requirement just because it looks good on paper or in theory. Craft or select the requirements to include in your process design only when they make sense for your organization. Also, if you plan to implement a tool or have an existing tool that will be used to support the Problem Management process, the tool-specific considerations should be captured in the requirement document as well.

The second document contains a sample process flow. The process flow shows who is doing what and during what stage of the Problem Management process. Once you have determined what your requirements are for the process, the process flow attempts to match and support the process requirements.

The third document is a sample data entry form for the root cause analysis exercise. The form illustrates what data you may want to capture with the process, and the data should be consistent with the requirements you have captured from working with your organization. The data you want to capture from the process should also be consistent with what the tools you plan to use can support. Normally, we don’t want to have the tool’s capability drive the process design decisions. However, if you have an existing ITSM tool that you would like to use for the Problem Management process, now is the time to factor the tool into the design and make sure the design can be supported by it.
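As a rough illustration of the kind of data an RCA form might capture, the sketch below models one problem record as a Python dataclass. The field names, the ID formats, and the list of required fields are hypothetical and are not taken from the sample form; substitute whatever your own requirements and tools call for.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RcaRecord:
    """One root cause analysis entry. All field names are illustrative assumptions."""
    problem_id: str
    summary: str
    related_incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    root_cause_category: Optional[str] = None  # e.g. "hardware", "change", "human error"
    workaround: Optional[str] = None
    proposed_solution: Optional[str] = None
    closed_on: Optional[date] = None

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        # Which fields are required is an assumption for this sketch.
        required = ["root_cause", "root_cause_category", "proposed_solution"]
        return [name for name in required if getattr(self, name) is None]


record = RcaRecord(
    problem_id="PRB-001",
    summary="Nightly batch job fails intermittently",
    related_incident_ids=["INC-101", "INC-117"],
    root_cause="Disk volume fills up during log rotation",
)
print(record.missing_fields())
```

A check like missing_fields() is one way to make sure a requirement such as "categorize the root causes" is actually enforced before a problem record is closed.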

In part three of the post, we will combine everything we have done and produce one final process design document. The process design document will include not only the requirements, the flow, and the roles, but also other information pertinent to the process such as the policy statement, a RACI chart, and the process metrics. The final process design document can then be used as the foundation to implement the actual Problem Management process within your organization.

 

Problem Management Process Design – Part 1

This is the first post of a series in which we walk through a tutorial and take a deeper dive into problem management (PM) process design. In this post, we will go over some of the process design considerations, such as the goals and purposes, the intended scope, problem prioritization, and roles and responsibilities. In the subsequent posts, we will go into more of the execution topics such as data capturing, process flows, and metrics and measurements.

Goal and Purpose of Problem Management

When designing an ITSM process, one of the most fundamental questions to ask is whether you need a particular process for your organization. By ITIL’s definition, a ‘problem’ is the cause of one or more incidents. By managing problems, we are attempting to manage how we document, diagnose, and learn from the root causes after handling the incidents. Do most organizations need a problem management process of their own? I believe so. Even though most organizations may not have a formal problem management process defined, the act of diagnosing and finding root causes is practiced universally. Having a well-thought-out and documented process for root cause analysis can only help to strengthen the organization’s learning and knowledge management effort.

Scope and Policy Implication

In defining your problem management process, it will be useful to define a few scope or policy related items upfront. For example:

  • What organizational boundaries will the PM process be applicable to? Who can initiate, undertake, and/or authorize the PM activities? As with most ITSM processes, the benefits compound when everyone adopts a unified approach and vocabulary. If you need to share the root cause data between organizations, it will be important to define the process scope beforehand.
  • Will all incidents receive the PM treatment? If you don’t plan to run all incidents through the PM process, what criteria will you use to decide which incidents to focus the PM effort on? Depending on the number of incidents you receive, practicing PM on every single incident may not be feasible, so you may need to be selective. Some organizations will initiate the PM process for incidents that meet certain criteria based on the incident priority (impact vs. urgency), the nature or category of the incident, the business segment affected, or some other factors.
  • It will also be useful to define which processes connect to PM. Incidents, problems, and changes are typically closely tied to one another. What processes will trigger the PM process from upstream or receive the PM output downstream? Will your organization perform PM without having an incident? It is possible if you practice some type of proactive PM. Will all changes related to a problem be required to go through the change management process? Will the incident tickets, problem records, known errors, and requests for change be linked in some fashion? These are some governance-related questions that will affect how you design the PM process.
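The selection criteria described above can be written down as a simple rule, which also makes them easy to review with stakeholders. The following is a minimal sketch only: the priority scale, the threshold, and the category and segment names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str
    priority: int          # assumed scale: 1 = highest priority
    category: str
    business_segment: str


# Hypothetical policy values; replace them with your organization's own criteria.
PRIORITY_THRESHOLD = 2
ALWAYS_REVIEW_CATEGORIES = {"security", "data loss"}
ALWAYS_REVIEW_SEGMENTS = {"payments"}


def needs_problem_record(incident: Incident) -> bool:
    """Return True when the incident meets any criterion for opening a problem record."""
    return (
        incident.priority <= PRIORITY_THRESHOLD
        or incident.category in ALWAYS_REVIEW_CATEGORIES
        or incident.business_segment in ALWAYS_REVIEW_SEGMENTS
    )


print(needs_problem_record(Incident("INC-1", 1, "network", "sales")))
print(needs_problem_record(Incident("INC-2", 4, "printing", "sales")))
```

Keeping the criteria in one place, rather than scattered across procedures, makes it easier to adjust them as incident volume changes.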

Problem Categorization and Prioritization

When designing a categorization scheme for problems, I recommend using the same categorization for PM and for Incident Management. Having a consistent categorization for both problems and incidents will make designing, generating, and analyzing reports much easier. Some organizations use two separate categorization schemes for incidents and problems, a decision sometimes influenced by the tools. I personally think that makes things more difficult than they need to be.

Prioritizing problems can help you focus your RCA efforts on the problems that need the most attention. When prioritizing incidents, many organizations take the impact to the business community and the urgency into consideration. For problems, I believe those two considerations are essential, and I would suggest adding two more to your problem prioritization matrix. The first is the frequency of the related incidents: the more frequently the incidents occur, the higher the priority that should be assigned to the problem. The second is the potential risk of not addressing the problem.
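One way to turn the four considerations into a repeatable decision is a small scoring function. The sketch below is only an illustration: the 1-to-3 rating scale, the equal weighting, and the cut-off values are assumptions, and your organization may prefer a lookup matrix or different weights.

```python
def problem_priority(impact: int, urgency: int, frequency: int, risk: int) -> str:
    """Combine four ratings (1 = low, 2 = medium, 3 = high) into a priority band.

    The scale, the equal weighting, and the cut-offs are illustrative assumptions.
    """
    ratings = (impact, urgency, frequency, risk)
    if any(rating not in (1, 2, 3) for rating in ratings):
        raise ValueError("each rating must be 1, 2, or 3")
    score = sum(ratings)  # ranges from 4 to 12
    if score >= 10:
        return "high"
    if score >= 7:
        return "medium"
    return "low"


print(problem_priority(impact=3, urgency=3, frequency=3, risk=2))
print(problem_priority(impact=1, urgency=1, frequency=1, risk=1))
```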

Roles & Responsibilities

A PM process can involve a number of participants. Here are some typical roles to be factored into the design.

  • Requester: Who can initiate a PM exercise? How will the requester participate in the overall PM process?
  • Problem Management Process Owner: The process owner ensures that the process is defined, documented, maintained, and communicated at all levels within the organization. The process owner is not necessarily the one doing the actual work but the process ownership comes with the accountability of ensuring a certain level of quality for the process execution.
  • Problem Manager: The problem manager is the main actor in the PM process and has the overall responsibility of implementing the PM process end to end, according to the process laid out by the process owner. The problem manager is also responsible for meeting the service level targets and reporting the metrics to the process owner for quality assurance purposes.
  • Problem Assignee: The problem assignee role is often played by the subject matter experts who do the actual RCA work and determine what the final root cause is. The problem assignee can also be assigned to ensure all changes get properly executed through the Change Management process.
  • Stakeholder: There could be several different types of stakeholders involved in a PM exercise. At a minimum, the PM process needs one key stakeholder who can approve the handling of the problem records and the closure of the problems. The stakeholder could also act in a governing or mediation capacity when conflicts arise.

In summary, we just went over some of the planning elements for the PM process. We talked about why we want to do PM in the first place, the scope of the process, how we categorize and prioritize problems, and the essential roles for executing the process. In the next post, we will go over the process flow and spell out more details for the PM activities.

Major Incident Handling Process Design – Part Two

This post is part two (and the concluding part) of a series where we discuss the Major Incident Handling process and how to put one together. Previously we discussed the elements and considerations that should go into the process design. In this post, I elaborate on some of those considerations with a sample process flow and a corresponding process design.

Sample Major Incident Handling Process Flow

Sample Major Incident Handling Process Design

A major incident generally has a higher impact and requires special attention to resolve. To summarize, I think an effective Major Incident Handling process design should clearly define at least the following who-does-what-by-when-and-how elements:

  • What constitutes a major incident in your organization? What criteria do you use to quickly and effectively determine and declare a major incident?
  • Who is accountable for coordinating and controlling the activities during a major incident exercise? The Major Incident Manager role can be fulfilled by a person or by a team, and whoever fills it needs the proper authority to direct the activities and the people who are involved.
  • How will the resolution efforts be coordinated and conducted? The exact details may vary from one organization to another, or even from one incident to another. The general approach should be worked out beforehand, and the Major Incident Manager should be trained to apply the approach as consistently as possible.
  • What escalation or communication approach will be used during and after the Major Incident?
  • What metrics will be used to measure the effectiveness of the process? Keep them simple, easily understood, and reasonably painless to collect.
  • What format of communication and reporting will be used for the major incident? Who will get what type of information? Try to keep the contents appropriate for the intended audience.
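To make the first question in the list concrete, the criteria for declaring a major incident can be captured as a short rule that anyone on the service desk could apply the same way. This is a sketch under assumptions: the three-level scale, the rule that both impact and urgency must be high, and the service name in the automatic-trigger list are all hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical list of services whose outage is always treated as a major incident.
AUTO_MAJOR_SERVICES = {"order-entry"}


def is_major_incident(impact: Level, urgency: Level, service: str) -> bool:
    """Declare a major incident for designated services or for high impact AND high urgency."""
    if service in AUTO_MAJOR_SERVICES:
        return True
    return impact == Level.HIGH and urgency == Level.HIGH


print(is_major_incident(Level.HIGH, Level.HIGH, "email"))
print(is_major_incident(Level.MEDIUM, Level.HIGH, "email"))
print(is_major_incident(Level.LOW, Level.LOW, "order-entry"))
```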

I hope the information presented so far has been helpful. Please feel free to suggest options or other approaches that have worked for your organization.

Major Incident Handling Process Design – Part One

In IT, incidents as a result of technology failure or human error can strike at any moment. Occasionally, we can have an incident that has a wide impact and poses serious risks to the business operations. Those major incidents need to be handled swiftly, so the IT service can be restored quickly with useful information captured that can be used for the root cause analysis afterward. If you have business critical services or applications under your management, having an organized approach to handling major incidents can save a lot of time and improve productivity. If you need to put a process together for your organization, here are some elements to take into consideration.

  1. Scope and Criteria: What characteristics would qualify an incident as being a “Major” Incident? This is very organization-specific, but generally there are two basic elements to consider: impact and urgency. Many organizations use the combination of those two elements to classify the priority level assigned to an incident, and that is a good starting point. Any incident that possesses a high degree of impact and a high degree of urgency should probably be considered “major” and get the utmost attention. You may have other characteristics you want to define. For example, the outage of a particular application or of a particular line of business may trigger a “major” incident automatically. Since mobilizing the people and logistics necessary to handle a major incident is never a trivial exercise, clearly defined and agreed-upon scope and criteria are mandatory.
  2. Roles and Responsibilities: Who will declare that a major incident is in motion and own the process execution end-to-end? Since we are talking about major incidents, the Incident Management process owner in your organization will likely own this process as well. Will you have a person or a team designated as the “Major Incident Manager”? Will you rotate such a role from individual to individual or from team to team? Depending on the nature of the technology failure or breakdown, how will the major incident manager find the appropriate technical resources to get involved? Will the major incident manager be someone who is on stand-by waiting for the occasion to spring into action, or will she have another “day job” and wear the major incident manager hat when necessary? This will again depend on how your organization feels about this role. One thing I am certain of is that this role will require someone with the appropriate skills, environment know-how, and leadership experience to pull people together and execute the agreed-upon process. In other words, I do not believe this is a simple service desk phone dispatch type of role.
  3. Logistics and Facilities: Everyone needs to know exactly what to do when the major incident process gets initiated. Will you have a dedicated meeting space or war room type of setup? Will people know what teleconference number to use in order to call in and to provide or receive updates? Will you have a separate teleconference number to work through the technology aspect of incident recovery without cluttering it with other non-technical discussions? Who will manage the conference call? What criteria determine when the conference calls start and end? In addition to the conference call, will you hold some kind of web meeting or online collaboration setup where people can share things on screen? Will you have some type of continual update via web or email, so people can stay informed? All these finer details should be planned upfront.
  4. Escalation and Communication: How will you define the communication interval, and who will receive what communication at what point in time? How will the incident be escalated up the chain of command as long as the incident remains open? For example, you may define something as simple as the following:
    1. At Hour 0: Major incident declared and the technical team contacted by phone. Director of the technical team and VP of IT notified via email.
    2. At Minute 30: Director of the technical team notified again via email with updates.
    3. At Hour 1: Major Incident Manager asks the Director of the technical team to join the conference in person. Another email update goes to the VP of IT.
    4. At Hour 2: Major Incident Manager asks the VP of IT to join the conference call for updates.
    5. At Hour 4: Major Incident Manager asks the business customer to join the conference call for updates and to discuss other recovery options.
  5. Other Considerations: How will this process connect with a downstream process such as Problem Management? Will you have the problem manager on the call as the incident progresses? What documentation or deliverables will the major incident process produce? At a minimum, the process should probably document a simple log of the incident chronology, who participated in the call and when, important details shared at various points of the incident, official updates communicated, reasons for the incident closure, and other pertinent information about the incident.
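The sample escalation timeline under item 4 lends itself to being stored as data, so the Major Incident Manager can look up what should already have happened at any point during the incident. The sketch below encodes that example timeline; the function name and the data layout are assumptions made for illustration.

```python
from typing import List, Tuple

# (minutes since the major incident was declared, action), following the sample timeline above.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "Declare major incident; phone the technical team; email the Director and the VP of IT"),
    (30, "Email an update to the Director of the technical team"),
    (60, "Ask the Director to join the conference; email an update to the VP of IT"),
    (120, "Ask the VP of IT to join the conference call"),
    (240, "Ask the business customer to join the conference call"),
]


def steps_due(elapsed_minutes: int) -> List[str]:
    """Return every escalation action that should have happened by the given time."""
    return [action for minute, action in ESCALATION_STEPS if minute <= elapsed_minutes]


for action in steps_due(45):
    print(action)
```

Calling steps_due(45) lists the first two actions, since only the Hour 0 and Minute 30 steps are due 45 minutes into the incident.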

One thing is for sure: all these considerations are too important not to be agreed upon beforehand. When the agreed-upon details are not in place, it is simply not productive for everyone involved to try to figure out the process details during the heat of the battle. When that happens, most people have a tendency to go into “headless chicken” mode, and responsibility-dodging and finger-pointing start to spawn shortly afterward. In the next post, I will provide a sample process flow for further discussion.

Major Incident Review Process Design – Part Two

This post is part two (and the concluding part) of a series where we discuss the Major Incident Review process and how to put one together. Previously we discussed the elements and considerations that should go into the process design. We elaborated on those considerations with a sample process flow. In this post, we will describe the process activities further, along with a reporting template you can use to implement the process.

Sample Incident Report Template

Sample Process Design Document

The process design document provides a detailed description of the fields within the report template, so I do not plan to repeat them here. I think there are two factors to keep in mind when undertaking such a process. First, don’t do the process just for the sake of doing it. Do it because your organization genuinely wants to improve service by eliminating as many of these incidents as it can over the long term. If the organization chooses not to implement a certain solution for some reason (cost, technical complexity, longevity of the technology, regulatory or compliance constraints, or anything else), at least document the discussion. That way, it shows that the organization understood the risks and chose to accept them.

Second, perform meaningful measurements and, again, use the statistics to improve service. For example, if the majority of the incidents are reported by the end users, perhaps that is a clue that we should be more proactive and beef up the automated monitoring. If a particular technology area has been experiencing more major incidents than the other areas, perhaps we should figure out what ills are plaguing the area and fix what is broken. If a particular business unit or segment has been experiencing more major incidents than the other segments, perhaps we owe it to the business communities to figure out what we can do to make things better. The business impact information we capture will enhance our understanding of the incidents and help us formulate solutions that make sense for the business.
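The breakdowns mentioned above are straightforward to compute once the review records are in a structured form. Here is a minimal sketch using made-up records; the field names and values are hypothetical and would come from your own incident report template.

```python
from collections import Counter
from typing import Dict, Iterable

# Made-up review records for illustration only.
incidents = [
    {"reported_by": "end user", "technology_area": "network", "business_unit": "sales"},
    {"reported_by": "monitoring", "technology_area": "database", "business_unit": "finance"},
    {"reported_by": "end user", "technology_area": "network", "business_unit": "sales"},
    {"reported_by": "end user", "technology_area": "email", "business_unit": "finance"},
]


def breakdown(records: Iterable[Dict[str, str]], key: str) -> Counter:
    """Count major incidents by the value of one field."""
    return Counter(record[key] for record in records)


for field_name in ("reported_by", "technology_area", "business_unit"):
    print(field_name, breakdown(incidents, field_name).most_common())
```

With this sample data, three of the four incidents were reported by end users, which is the kind of signal that might prompt a closer look at automated monitoring.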

Most organizations I know practice some type of incident review process, so I hope the information presented so far has been helpful. Please feel free to suggest other approaches that have worked for your organization.

Major Incident Review Process Design – Part One

One approach to improving just about anything is to learn from a mishap and do something to prevent similar mishaps from taking place. In the IT world, stuff happens frequently enough that we have the tendency to just fix things up and move on to the next incident or crisis. Having a disciplined approach to doing root cause analysis (RCA) on incidents and putting permanent solutions in place (in other words, Problem Management) can only help.

The ITIL Service Operation handbook already provides an excellent overview of the Problem Management process with a suggested model and various analysis techniques, so there is no need for me to reiterate it. Instead, I am proposing a periodic Major Incident Review process that takes place after incidents have been resolved and service has been restored. People and tools aside, I think there are three things that can contribute to the effectiveness of this review process.

  1. Have a well-defined scope of incidents you plan to review. Depending on how an organization defines the impact and urgency of the incidents, the scope of incidents that get reviewed can vary. Some shops will choose to review only the most critical incidents with visible business impacts. Others might choose to review all incidents that took place. Many organizations will probably fall somewhere in between. My recommendation is to figure out what types of incidents the process stakeholders care about. We will discuss the stakeholders in more detail shortly.
  2. Have a clear exit plan for what to do with the root causes discovered and the permanent solutions. I think the situation to avoid is having the root causes identified but getting stuck on what to do next. The root cause of a number of incidents is a simple breakdown of technology. For those straightforward incidents, you identify what broke down, fix it, and move on. Sometimes a permanent solution could take a significant amount of time, people, or financial resources to achieve. For all incidents, a decision should be made to do one of the following:
    1. Follow up using the same Major Incident Review process up to some point.
    2. Take the item off the Major Incident Review process, but use another process or procedure to track it and make sure the permanent solutions get implemented.
    3. Decide nothing further will or can be done. Capture the lesson learned in a known error database or some knowledge management repository.
  3. Perform the process periodically, without exception. The frequency of the review can vary from one organization to another, monthly or bi-weekly for some or weekly for those busy shops. RCA activities can have a tendency to take a while to do because they often do not register at the same level of criticality as the incidents. By having this review process on a periodic basis, we are taking a position of … “Don’t procrastinate. Let’s figure out what went wrong. What were the causes? Decide what we plan to do about it and move on.”

In this post, I will discuss what you need in order to put a Major Incident Review process together for your organization. In addition to the three success factors that need to be taken into account, here are some additional elements to consider.

  1. Who will participate and who are the stakeholders? This review process will involve three groups of people at a minimum: the Incident Management team, the Problem Management team, and the technology/application support team. The actual organizational functions that perform the incident and problem management processes can vary. Frequently, the Service Desk is the owner of the Incident Management process, while the ownership of Problem Management rests somewhere outside of the Service Desk. Also, IT/corporate management and the business users will likely become another set of stakeholders, since the results of the review often get shared with those constituents.
  2. Who will own this process? Since the review process will occur after incident resolution and spend a great deal of time on RCA activities, the Problem Management process owner is the logical owner of this process. The process is called Major “Incident” Review only because calling it a Major “Problem” Review just sounds awkward. The convention used in your organization should determine the most easily understood name.
  3. What technology or tools will the process require? This process is straightforward enough that a spreadsheet tool should be sufficient. In addition, most organizations will have some type of incident tracking tool that can feed the incident information into this review process. The output of this process could also feed into a Problem Management tool if you have one.

Sample Major Incident Review Process Flow

Here is an example of the review process flow. I have used a bi-weekly schedule for the review process. Depending on your organization’s requirements, this schedule could expand or contract. In part two, I will describe this process flow in more detail, with a template you can use to capture the incident review details.

Event Management Process Design – Part Three

This post is part three (and the concluding part) of a series where we discuss the Event Management process and how to put one together. Previously we discussed the elements and considerations that should go into the process design. We elaborated on those considerations with a sample list of process requirements and the corresponding process flow. In this post, we will assemble all the information into one process design document that can be used to implement the process.

In addition to the process requirements and the process flow, which are two key ingredients, I believe a process design document needs to call out additional information pertinent to the implementation of the process. For example…

Sample Process Design Document

Policy statement: The policy statement calls out the governing points behind the process. Under what circumstances does the process apply or not apply? What are some high-level expectations the organization has for the implementation of the process?

RACI Chart: Simply put, who does what? Sure, the process flow calls out the major roles and describes the interactions between the activities and the roles, but having accountability clearly defined is also necessary. Some activities may involve several roles interacting with one another in various capacities, depending on the complexity of the activity, so it is a good thing to identify those finer details as well.
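A RACI chart is essentially a table of activities against roles, and a quick consistency check can catch gaps in accountability before the document goes out for review. The sketch below applies the common convention that each activity has exactly one Accountable role; the activities and role names are hypothetical examples, not taken from the sample design document.

```python
from typing import Dict, List

# Hypothetical chart: activity -> {role: RACI code}.
raci: Dict[str, Dict[str, str]] = {
    "Detect event": {"Monitoring Team": "R", "Process Owner": "A", "Service Desk": "I"},
    "Notify business users": {"Service Desk": "R", "Process Owner": "A"},
    "Review event thresholds": {"Monitoring Team": "R", "Service Desk": "C"},
}


def activities_without_single_accountable(chart: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the activities that do not have exactly one Accountable (A) role."""
    flagged = []
    for activity, assignments in chart.items():
        accountable = [role for role, code in assignments.items() if code == "A"]
        if len(accountable) != 1:
            flagged.append(activity)
    return flagged


print(activities_without_single_accountable(raci))
```

Here the check flags "Review event thresholds", because no role has been marked as accountable for that activity.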

Process Metrics: How would the process owner or the organization measure the performance of the process? What metrics can be collected for analysis? It will be hard to improve the process over time if the process owner has little idea on how the process is doing at any given point. Or better yet, what metrics will be meaningful to measure because the organization cares enough about them?

Interfaces: How will the process interact with other processes already in place? For example, Event Management process will often interact with the Incident Management process. It will be useful to document what interactions exist between those two processes and what input/output should be taken into account.

Other supporting procedures: Most organizations will have other supporting procedures that further describe how things work together. For example, we described an activity within the process where the Service Desk notifies the business user communities and keeps them informed of the incident status. That is still pretty high-level, so exactly how will the notification to the end users be carried out? Again, every organization will have its own approaches, so it will be helpful if those details can be incorporated into the process design somewhere or at least referenced.

Depending on the discipline required by your organization, your process design may or may not contain these additional elements, or maybe different ones. In any case, I think a good process design document should spell out all the information and references anyone will need to implement the process fully, just like a specification of some sort used to construct something. Hopefully, the process design you come up with will also have been vetted by the necessary stakeholders of the process, so you will have the support you need to implement. The process design document is also a living document that will require periodic care-and-feeding in terms of reviewing for accuracy and fine-tuning over time.

So what is the point of having a functional Event Management practice in your organization? For a technology service provider to the organization, the Event Management process can help IT stay on top of potential service interruptions or outages. As a capable IT organization, we should be the first to know what is going on within our own environment and not depend on the end users to let us know when something has become unavailable. Technology and gadgets break down all the time, and that is the nature of the business. The IT organization should be the first voice to let people know when something has gone wrong within its domain. Having a well-designed Event Management process is the first step in getting a better handle on what is going on within your environment.

I hope the information presented so far has been helpful. Please feel free to suggest options or other approaches that have worked for your organization.

Links to other posts of the series

Event Management Process Design – Part Two

This post is part two of a series where we discuss the Event Management process and how to put one together. In the previous post, I presented some design elements to consider. As a follow-up, I will present two documents to illustrate the design process further.

Sample Process Requirements Document

The first document contains a list of sample process requirements. As with engineering software or systems, the purpose of the requirements document is to capture all considerations that need to be factored into the process design. What activities will be carried out as part of the process, and how will one activity flow to another? What information or data points get fed into which activities, and what outputs are expected? Who will perform what activities and when? We need to define some roles, so we know who will do what and when. If you plan to implement a tool to support some portion of the process, some tool-specific considerations should be captured in the requirements document as well. The sample activities outlined in the document are pretty rudimentary and simplistic. You need to tailor your document with requirements from your organization.

Sample Process Flow Document

The second document contains a sample process flow. The process flow shows who is doing what and the timing of the activities. The flow document attempts to describe the process pictorially, while the requirements document tries to convey as much of the description as possible in text. The process flow should be consistent with the requirements outlined, and, in fact, both the process requirements and process flow documents should convey the same information about the process. Some organizations combine the information from both documents into one requirement/design document, and that is perfectly fine.

In part three of the post, we will combine everything we have done and produce one final process design document. The process design document will include not only the requirements, the flow, and the roles, but also other information pertinent to the process such as the policy statement, a RACI chart, and the process metrics. The final process design document can then be used as the foundation to implement the actual Event Management process within your organization.

Links to other posts of the series

Event Management Process Design – Part One

This post is the first part of a series where we discuss the Event Management process and how to put one together. According to ITIL (quoted directly), Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exceptional conditions. In other words, Event Management picks up the alerts and events generated from the devices and applications, figures out what to do with those alerts and events, and follows up afterward to make sure the alerts and events get due attention and are addressed properly. To begin putting together an Event Management process for your organization, here are some elements to think about.

  1. What events and alerts do you plan to trap and process? It may be a noble goal to design a process that can trap 100% of the alerts from the environment and process them all, but that is not always possible. Some events can be trapped and processed automatically by the tools you have on hand, and some alerts will require manual intervention. Where will the alerts/events be captured from, and where will they be recorded? ITIL suggests centralizing the event management process as much as possible, and it makes sense. If the alerts need to come from different technology stacks or devices, which they often do, can you at least centralize the location where the recording and processing activities take place? Determine the scope, what you can and cannot do, and have a clear idea of what you hope to get out of the process.
  2. Once you determine the set of events or alerts that can be picked up and fed through the process, you will need a set of rules on what to do with those events. The rules need to be explicit so there is little room for guessing or personal interpretation by those carrying out the process. The rules will determine which conditions, once certain thresholds are met or exceeded, will trigger an event. For example, you may have a rule that says when server ABC’s CPU utilization reaches 90% and stays there consistently for over 10 minutes during business hours (6am to 6pm), an alert will be triggered. The rule will further stipulate what actions will be taken when the event is triggered. For example, you may have a rule that says the CPU alert will be escalated or handed over to the systems admin team for further evaluation via email or phone call. The rule will also call out what acknowledgement or interaction will constitute a successful escalation or hand-off.
  3. You should have a classification scheme for the incoming alerts/events. Not all alerts require the same handling actions. ITIL’s suggestion of classifying alerts as Informational, Warning, or Exception is a good starting point and more than sufficient for most organizations anyway. For example, informational alerts usually get recorded for historical purposes and are not escalated anywhere else; only the warning and exception alerts get escalated further. The warning and exception alerts may in turn get escalated differently, to different teams and with different timing considerations. Furthermore, once the alert is escalated, the job of Event Management is not 100% done. We also need to have a standard rule or approach on how to follow up while the alert condition is being addressed and to close out the alerts once certain conditions are met (incident resolved or alerts cease to repeat within a 24- or 48-hour time frame).
  4. As you can see, determining what to do with an alert, making sure the alerts are handled correctly and efficiently, and following up to close the alerts properly takes some up-front thought and planning. The number of alerts monitored in a moderately complex IT environment can grow very quickly. Therefore, having heavily customized, individual alerts is not recommended, and really not necessary. My suggestion is to have a default event handling procedure that will work for 90-95% of the events you anticipate processing. For the remaining 5-10%, use the default handling procedure as the foundation but with some customized procedure on top so the events can be handled correctly.
  5. Who will be the process owner for, and responsible for carrying out, the Event Management process? If you are lucky enough to have a team in your organization whose primary responsibility is to monitor the environment and process the events, that team can be both accountable as the process owner and responsible for doing the work. If a dedicated team is not an option and multiple people/teams will be carrying out the process, at least designate one single process owner and have a consistent process in place for everyone else to follow.
  6. How will the process be measured for efficiency and effectiveness? What measurements does your organization care about? What actions will result from analyzing the measurement data? Measurements will mean very little if they are not acted upon to further improve the performance of the process.
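
To make the idea of an explicit rule and a default handling procedure a bit more concrete, here is a minimal sketch in Python. It encodes the CPU example from item 2 (90% sustained for over 10 minutes between 6am and 6pm) and the three-way classification from item 3. The names ThresholdRule, Severity, and default_handling are hypothetical and exist only for illustration, and the mapping of warnings to email and exceptions to a phone call is my assumption rather than anything ITIL prescribes. Your monitoring tool will have its own way of expressing the same thing.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: a metric stays at or above a threshold for a sustained period."""
    threshold_pct: float = 90.0
    sustained_for: timedelta = timedelta(minutes=10)
    window_start: time = time(6, 0)   # business hours begin at 6am
    window_end: time = time(18, 0)    # business hours end at 6pm
    severity: Severity = Severity.WARNING

    def in_business_hours(self, ts: datetime) -> bool:
        return self.window_start <= ts.time() < self.window_end

    def is_triggered(self, samples) -> bool:
        """samples: list of (timestamp, cpu_pct) tuples, oldest first."""
        breach_started = None
        for ts, cpu_pct in samples:
            if cpu_pct >= self.threshold_pct and self.in_business_hours(ts):
                if breach_started is None:
                    breach_started = ts
                # "Over 10 minutes" is read here as strictly longer than the period.
                if ts - breach_started > self.sustained_for:
                    return True
            else:
                # Any dip below the threshold, or a sample outside hours, resets the clock.
                breach_started = None
        return False


def default_handling(severity: Severity) -> str:
    """Default procedure: record everything, escalate only warnings and exceptions.

    The email/phone split is an assumption made for this sketch.
    """
    if severity is Severity.INFORMATIONAL:
        return "record only"
    if severity is Severity.WARNING:
        return "record and email the owning team"
    return "record and phone the owning team"


# Usage: one CPU sample per minute at 95%, starting at 9:00am.
start = datetime(2024, 1, 8, 9, 0)
samples = [(start + timedelta(minutes=m), 95.0) for m in range(12)]
rule = ThresholdRule()
if rule.is_triggered(samples):
    print(default_handling(rule.severity))
```

Writing the rule down this way forces the questions that tend to get skipped: does "over 10 minutes" include exactly 10, and does a single dip below the threshold reset the clock? Those are the gaps where personal interpretation creeps in.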

That is a lot to think about for now. In part two, I will provide a sample list of Event Management process design requirements and a sample process flow for further discussion.

Links to other posts of the series