INCIDENT MANAGEMENT - Part One
Here we are with one of the most important topics in IT
Operations Management. Incident Management is the component most visible to the
business.
Let us define an Incident first.
An Incident can be defined as an unplanned disruption to a
Service or a reduction in the quality of a Service. Any failure of a Configuration Item
that has not yet impacted the service is also categorized as an Incident. The process
of dealing with Incidents is Incident Management.
The Incident Management process aims to:
- Restore normal service as fast as possible.
- Minimize the negative impact on business operations.
The lines above use a few important terms that
need to be understood.
What is a Configuration Item
A Configuration Item (CI) is a component of a system that is
treated as a self-contained unit for
the purpose of identification and change management. A CI may be a primitive system building block (e.g.
a code module) or an aggregate of other CIs. For example, a PC may be
designated as a CI, but in a support environment that requires more control,
different parts of the PC such as the monitor, HDD, etc. may be designated as CIs.
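The idea that a CI is either a primitive block or an aggregate of other CIs can be sketched as a small tree structure. This is only an illustration: the class name `ConfigurationItem`, its fields, and the identifier `PC-0001` are assumptions made for the example and do not come from any particular CMDB product.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ConfigurationItem:
    """A CI: either a primitive building block or an aggregate of other CIs."""
    name: str
    components: List["ConfigurationItem"] = field(default_factory=list)

    def is_primitive(self) -> bool:
        # A primitive CI has no child CIs tracked separately beneath it.
        return not self.components

    def walk(self) -> Iterator["ConfigurationItem"]:
        # Yield this CI and every CI nested beneath it, depth first.
        yield self
        for component in self.components:
            yield from component.walk()


# Coarse control: the whole PC is tracked as a single CI.
pc_simple = ConfigurationItem("PC-0001")

# Finer control: the same PC is an aggregate of separately tracked parts.
pc_detailed = ConfigurationItem(
    "PC-0001",
    components=[ConfigurationItem("Monitor"), ConfigurationItem("HDD")],
)

print([ci.name for ci in pc_detailed.walk()])  # ['PC-0001', 'Monitor', 'HDD']
```

How deep the tree goes is a design decision: the more control the support environment needs, the more parts are promoted to CIs of their own.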
What is Normal Service
Normal Service Operation may be defined as Service Operation
that is within the limits of the SLA.
Incident Management should be designed so that it offers the following value to the business:
- Ability to detect and resolve Incidents, which results in higher availability and lower downtime.
- Ability to align IT activities with business priorities, so that IT can identify those priorities and allocate resources accordingly.
- Ability to identify potential service improvements.
- Ability to identify other requirements, such as additional services or training, by analyzing the incidents.
Important points to consider while designing Incident Management:
Know Your Services
List all the services for which a disruption, or potential
disruption, shall qualify as an Incident. For example, the failure
of a server that is not in production needs to be considered for Incident Management.
Know the Criticality & Impact of Services
Every service has a different level of criticality and impact
on the business, and hence these must be defined clearly. E.g. the failure of a VIP's PC
may have high criticality but impacts only one user, while the failure of a server may
impact a large number of users. The latter incident clearly needs faster
recovery than the former.
Define the Severity
The severity of an incident is defined on the basis of its criticality
and impact. A sample matrix of criticality, impact and severity is shown below.
Criticality | Impact | Severity
High        | High   | Major Incident
High        | Medium | High or Sev 1
High        | Low    | Medium or Sev 2
Medium      | High   | High or Sev 1
Medium      | Medium | Medium or Sev 2
Medium      | Low    | Low or Sev 3
Low         | High   | Medium or Sev 2
Low         | Medium | Low or Sev 3
Low         | Low    | Low or Sev 3
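The sample matrix above can be encoded as a simple lookup table. The sketch below is one possible way to do it. The names `SEVERITY_MATRIX` and `severity_for` are made up for this example, and the severity labels are shortened to "Sev 1", "Sev 2" and "Sev 3".

```python
# Sample matrix from the text: (criticality, impact) -> severity.
SEVERITY_MATRIX = {
    ("High", "High"): "Major Incident",
    ("High", "Medium"): "Sev 1",
    ("High", "Low"): "Sev 2",
    ("Medium", "High"): "Sev 1",
    ("Medium", "Medium"): "Sev 2",
    ("Medium", "Low"): "Sev 3",
    ("Low", "High"): "Sev 2",
    ("Low", "Medium"): "Sev 3",
    ("Low", "Low"): "Sev 3",
}


def severity_for(criticality: str, impact: str) -> str:
    """Return the severity for a criticality/impact pair (case-insensitive)."""
    key = (criticality.strip().capitalize(), impact.strip().capitalize())
    try:
        return SEVERITY_MATRIX[key]
    except KeyError:
        # Reject anything outside High/Medium/Low instead of guessing.
        raise ValueError(f"unknown criticality/impact pair: {key}") from None


print(severity_for("High", "Low"))   # Sev 2
print(severity_for("high", "HIGH"))  # Major Incident
```

Keeping the matrix as data rather than as nested if/else statements makes it easy to review with the business and to change when priorities change.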
Define Timelines
Now that we have categorized the different levels of
incident, each severity category must have a target time for resolution. These
time targets must be realistic and aligned to business requirements. A shorter
time frame means that more resources with higher capabilities (which
generally means higher cost too) need to be assigned. If timelines are not
defined properly, the result may be inefficient services.
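A target time per severity category can be kept in a small table and used to check whether an incident has run past its target. The sketch below is illustrative only: the hour values are hypothetical placeholders (real targets come from what is agreed with the business), the names `RESOLUTION_TARGETS`, `resolution_deadline` and `is_breached` are made up for the example, and it counts plain clock time rather than business hours.

```python
from datetime import datetime, timedelta

# Hypothetical targets, for illustration only.
RESOLUTION_TARGETS = {
    "Major Incident": timedelta(hours=1),
    "Sev 1": timedelta(hours=4),
    "Sev 2": timedelta(hours=8),
    "Sev 3": timedelta(hours=24),
}


def resolution_deadline(severity: str, opened_at: datetime) -> datetime:
    """Time by which an incident of this severity should be resolved."""
    return opened_at + RESOLUTION_TARGETS[severity]


def is_breached(severity: str, opened_at: datetime, now: datetime) -> bool:
    """True if the incident is still open past its target resolution time."""
    return now > resolution_deadline(severity, opened_at)


opened = datetime(2024, 1, 1, 9, 0)
print(resolution_deadline("Sev 1", opened))                        # 2024-01-01 13:00:00
print(is_breached("Sev 1", opened, datetime(2024, 1, 1, 14, 0)))   # True
print(is_breached("Sev 3", opened, datetime(2024, 1, 1, 14, 0)))   # False
```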
For example, take a case where the timeline for End User PC resolution
is 30 minutes. In almost every environment, the number of low severity incidents is
generally high compared to high severity cases. If the timelines for such
cases are short, we'll end up employing a larger number of resources. If the money
saved by the shorter timeline is, say, $10, but we end up
spending $20 to ensure that timeline, it is definitely an inefficient use of resources. The same is true for
the reverse situation.
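The arithmetic behind this example is a simple comparison of what a timeline saves against what it costs to guarantee. A minimal sketch, using the $10 and $20 figures from the example above (the function names are made up for illustration):

```python
def net_benefit(saving: float, cost: float) -> float:
    """Money saved by a resolution target minus the cost of meeting it."""
    return saving - cost


def is_target_worthwhile(saving: float, cost: float) -> bool:
    # A target pays off only if it saves at least as much as it costs.
    return net_benefit(saving, cost) >= 0


print(net_benefit(10, 20))           # -10
print(is_target_worthwhile(10, 20))  # False
```

A negative result means the target costs more to guarantee than it saves, which is the inefficient case described above.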
Keep reading for more information on Incident management.