Tuesday, 5 March 2013

INCIDENT MANAGEMENT- Part One


Here we are with one of the most important topics in IT Operations Management. Incident management is the component most visible to the business.

Let us first define an incident.

An incident can be defined as an unplanned disruption to a service or a reduction in the quality of a service. Any failure of a Configuration Item that has not yet impacted the service is also categorized as an incident. The process of dealing with incidents is Incident Management.

The Incident Management process aims to:
  • Restore normal service as fast as possible.
  • Minimize the negative impact on business operations.

The lines above use a few important terms that need to be understood.

What is a Configuration Item

A Configuration Item (CI) is a component of a system that is treated as a self-contained unit for the purposes of identification and change management. A CI may be a primitive system building block (e.g. a code module) or an aggregate of other CIs. For example, a PC may be designated as a single CI, but in a support environment that requires more control, individual parts of the PC such as the monitor and the HDD may each be designated as CIs.
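The idea that a CI can be either a primitive block or an aggregate of other CIs can be sketched as a small tree structure. This is only an illustration: the class name `ConfigurationItem` and its methods are hypothetical and do not come from any particular CMDB tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConfigurationItem:
    """Hypothetical CI model: a primitive block or an aggregate of other CIs."""
    name: str
    components: List["ConfigurationItem"] = field(default_factory=list)

    def is_primitive(self) -> bool:
        # A CI with no nested components is a primitive building block.
        return not self.components

    def flatten(self) -> List[str]:
        """Return the names of this CI and every CI nested under it."""
        names = [self.name]
        for component in self.components:
            names.extend(component.flatten())
        return names


# Coarse-grained control: the whole PC is tracked as a single CI.
pc_single = ConfigurationItem("PC")

# Finer control: the same PC, with its parts tracked as CIs of their own.
pc_detailed = ConfigurationItem(
    "PC",
    components=[ConfigurationItem("Monitor"), ConfigurationItem("HDD")],
)
```

With the finer-grained model, a monitor failure can be recorded against the monitor itself instead of against the PC as a whole.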

What is Normal Service

Normal Service Operation may be defined as service operation that is within the limits of the Service Level Agreement (SLA).

Incident Management should be designed so that it offers the following value to the business:

  • Ability to detect and resolve incidents, resulting in higher availability and lower downtime.
  • Alignment of IT activities with business priorities, so that IT can identify those priorities and allocate resources accordingly.
  • Ability to identify potential service improvements.
  • Identification of other requirements, such as additional services or training, by analyzing incidents.

Important points to be considered while designing Incident management:

Know Your Services

List all the services for which a disruption, or a potential disruption, qualifies as an incident. For example, the failure of a server that is not in production needs to be considered for incident management.

Know the Criticality & Impact of services

Every service has a different level of criticality and impact on the business, so both must be defined clearly. For example, the failure of a VIP's PC may have high criticality but impacts only one user, while the failure of a server may impact a large number of users. The latter incident clearly needs faster recovery than the former.

Define the Severity

The severity of an incident is defined on the basis of criticality and impact. A sample matrix of criticality, impact and severity is shown below.

Criticality    Impact    Severity
High           High      Major Incident
High           Medium    High or Sev 1
High           Low       Medium or Sev 2
Medium         High      High or Sev 1
Medium         Medium    Medium or Sev 2
Medium         Low       Low or Sev 3
Low            High      Medium or Sev 2
Low            Medium    Low or Sev 3
Low            Low       Low or Sev 3
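The sample matrix above can also be expressed as a simple lookup, which is roughly how a ticketing tool might assign severity automatically. This is a minimal sketch: the names `SEVERITY_MATRIX` and `classify` are hypothetical, and the values are copied directly from the sample matrix.

```python
# Sample matrix from above: (criticality, impact) -> severity.
SEVERITY_MATRIX = {
    ("high", "high"): "Major Incident",
    ("high", "medium"): "Sev 1",
    ("high", "low"): "Sev 2",
    ("medium", "high"): "Sev 1",
    ("medium", "medium"): "Sev 2",
    ("medium", "low"): "Sev 3",
    ("low", "high"): "Sev 2",
    ("low", "medium"): "Sev 3",
    ("low", "low"): "Sev 3",
}


def classify(criticality: str, impact: str) -> str:
    """Look up the severity for a criticality/impact pair (case-insensitive)."""
    key = (criticality.strip().lower(), impact.strip().lower())
    if key not in SEVERITY_MATRIX:
        raise ValueError(f"Unknown criticality/impact combination: {key}")
    return SEVERITY_MATRIX[key]


print(classify("High", "Low"))     # Sev 2
print(classify("Medium", "High"))  # Sev 1
```

Keeping the matrix as data rather than as nested if/else statements makes it easy to adjust when the business changes its definitions.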

Define Timelines

Now that we have categorized the different levels of incident, each severity category must have a target time for resolution. These time targets must be realistic and aligned to business requirements. A shorter time frame means that more resources, with higher capabilities, need to be assigned (which generally means higher cost too). If timelines are not defined properly, the result may be an inefficient service.

For example, take a case where the resolution timeline for end-user PC incidents is 30 minutes. In almost every environment, low-severity incidents generally outnumber high-severity ones. If the timelines for such cases are short, we end up employing more resources. If the shorter timeline saves, say, $10 but we spend $20 to meet it, that is clearly an inefficient use of resources. The same is true of the reverse situation.
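The arithmetic behind this example can be written down as a quick check. This is only a sketch: the function names `net_benefit` and `is_efficient` are hypothetical, and the figures are the illustrative $10 and $20 from the example above, not real data.

```python
def net_benefit(money_saved: float, cost_to_meet_target: float) -> float:
    """Money saved by a faster resolution minus what it costs to achieve it."""
    return money_saved - cost_to_meet_target


def is_efficient(money_saved: float, cost_to_meet_target: float) -> bool:
    """A resolution target pays off only if it does not cost more than it saves."""
    return net_benefit(money_saved, cost_to_meet_target) >= 0


# Figures from the example above: $10 saved, $20 spent to meet the target.
print(net_benefit(10, 20))   # -10
print(is_efficient(10, 20))  # False
```

A negative result means the target time is tighter than the business case justifies.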

Keep reading for more information on Incident management.