Major Incident Process Guidelines

General Guidelines

The Service Desk (Tier1) must be aware of any Notifications sent to the user community. As such, ONLY the Service Desk, Crisis Managers and Incident Managers can publish a Major Incident.
The Major Incident Process must be followed during business hours, after-hours, weekends and holidays
Application/Service Owners, or their designated representative, must validate all incidents deemed to be a Major Incident
Application/Service Owners, or designated representative, own the Problem ticket
The Crisis Manager or Service Desk owns the Notification record
A new Problem ticket must be created for all Major Incidents
The Notification, whether MIN or SIA, is created from a new Problem ticket. The Notification should never be created from an existing ticket. Any existing tickets that contain pertinent information should be childed to (made children of) the new Major Incident Problem ticket. This ensures proper reporting of all Major Incidents.

Notification Types

Notifications are used to communicate with the community accurately and in a timely manner. When an interruption in service occurs in the environment, it is important to consistently and accurately determine the severity of the interruption and the impact to users. The goal of these notifications is to get the right technical resources involved to restore service, and to communicate the service interruption to leadership and the community.

The Type of Notification issued is determined by the Priority of the Major Incident.

SIA - Service Impact Advisory: Used to communicate an unplanned P2, P3, or P4 service interruption that results in a service not performing as designed. Such issues could be a service degradation, feature outage or system hiccup.
MIN - Major Incident Notification: Used to communicate a Critical, P1, service interruption. Such issues could involve a total service outage or campus-wide impact.
Critical MIN - Critical Major Incident Notification (Crisis): Used to communicate a Critical, P1, service interruption that impacts a Critical Business Application or Core Infrastructure Service (KB04806*). This type of Notification follows the Crisis Management Protocol.

* login required
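
The mapping from Priority to Notification type described above can be sketched as a small lookup. This is an illustrative sketch only, not part of any ServiceNow configuration; the names `Notification`, `notification_type` and the `critical_service` flag are hypothetical and stand in for "impacts a Critical Business Application or Core Infrastructure Service (KB04806)".

```python
from enum import Enum


class Notification(Enum):
    SIA = "Service Impact Advisory"
    MIN = "Major Incident Notification"
    CRITICAL_MIN = "Critical Major Incident Notification (Crisis)"


def notification_type(priority: int, critical_service: bool = False) -> Notification:
    """Map an incident priority (1-4) to a notification type.

    critical_service is a hypothetical flag: True when the affected service is
    a Critical Business Application or Core Infrastructure Service.
    """
    if priority not in (1, 2, 3, 4):
        raise ValueError(f"priority must be 1-4, got {priority!r}")
    if priority == 1:
        # A P1 on a critical service follows the Crisis Management Protocol.
        return Notification.CRITICAL_MIN if critical_service else Notification.MIN
    # P2, P3 and P4 interruptions are communicated with an SIA.
    return Notification.SIA
```

For example, `notification_type(1, critical_service=True)` yields `Notification.CRITICAL_MIN`, while any P2, P3 or P4 yields `Notification.SIA` regardless of the flag.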

Remember! All significant disruptions to the business are considered major incidents, regardless of the type of notification (MIN or SIA).

Determining Priority and Notification Type

The following questions will assist in determining the correct Impact and Urgency so that the appropriate Priority is assigned. Each Application/Service Owner should consider these questions and be prepared to provide the necessary information to the Service Desk.

Is the service completely unavailable?
How many users are impacted? Are any VIPs?
Is the impact widespread, significant or local?
Are there financial implications?
Are any Critical Business Applications or Core Infrastructure Services impacted (KB04806*)?

* login required

If the answer to ANY question is YES, initiate the appropriate Notification (MIN for P1; SIA for P2, P3 or P4).

Information to Give the Service Desk when Reporting a Major Incident

What is the Service Name?
What is the MAC or IP Address?
Describe the impact to customers
Does this affect all customers?
Does the incident affect specific locations?
Date/Time of initial interruption
Is the Service completely unavailable?
Any technical details?
Does this affect University, Healthcare or both?
Suppress the notification? (Yes, if service is already restored at the time of the call)
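
As a rough illustration, the checklist above could be captured as a single record so that nothing is missed when the call is made. This is a sketch under assumptions: the class name `MajorIncidentReport` and every field name are hypothetical, and which items are optional is a guess rather than something the process states.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

VALID_ENTITIES = ("University", "Healthcare", "Both")


@dataclass
class MajorIncidentReport:
    """Hypothetical intake record mirroring the Service Desk checklist."""

    service_name: str
    impact_description: str
    interruption_start: datetime
    affects_all_customers: bool
    service_completely_unavailable: bool
    affected_entity: str  # "University", "Healthcare" or "Both"
    mac_or_ip_address: Optional[str] = None
    affected_locations: Optional[str] = None
    technical_details: Optional[str] = None
    service_already_restored: bool = False

    def __post_init__(self) -> None:
        if self.affected_entity not in VALID_ENTITIES:
            raise ValueError(f"affected_entity must be one of {VALID_ENTITIES}")

    @property
    def suppress_notification(self) -> bool:
        # The checklist answers "Yes" when service is already restored
        # at the time of the call.
        return self.service_already_restored
```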

Major Incident Detection and Escalation

Major Incident detection can originate from multiple sources. Once detected, the following steps should be taken from the originating source:

University Service Desk Detects Major Incident

Once three (3) to five (5) calls are received on the same issue, or one (1) call for a Critical Business Application or Core Infrastructure Service (KB04806*), perform the following:

Contact the appropriate Application/Service Owner to make them aware that service is not working as designed
After-hours Escalation Using the ServiceNow On-Call Calendar*:
1. Page the On-Call Technician; if there is no response after 10 minutes, go to Step 2
2. Page the Application/Service Owner; if there is no response after 10 minutes, go to Step 3
3. Page the Crisis Manager; if there is no response after 10 minutes, go to Step 4
4. Page the Director; if there is no response after 10 minutes, go to Step 5
5. Page the Deputy CIO
Validate with the Service/Application Owner that an issue exists
Issue the appropriate notification as determined by the Application/Service Owner, based on Impact and Urgency
If Incident Priority is 1-Critical, and impacts a Critical Business Application or Core Infrastructure Service (KB04806*), page the Crisis Manager, via ServiceNow, and use the verbiage "Major Incident in progress please join Zoom bridge"
Update the front end message

The Service Desk will receive updates from the Application/Service Owner, or Crisis Manager, as communicated, until the problem is resolved. Once the problem is resolved, resolve the associated outage record.

* login required
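
The after-hours escalation order above amounts to walking a fixed list of roles with a 10-minute response window at each step. The sketch below illustrates that logic only; it does not call ServiceNow. The `page` callback is hypothetical and stands in for whatever paging mechanism is used. The process does not state a wait time for the final step, so the sketch reuses the same window there as an assumption.

```python
from typing import Callable, Optional

# After-hours paging order, taken from the numbered steps above.
ESCALATION_CHAIN = [
    "On-Call Technician",
    "Application/Service Owner",
    "Crisis Manager",
    "Director",
    "Deputy CIO",
]
RESPONSE_WINDOW_MINUTES = 10


def escalate(page: Callable[[str, int], bool]) -> Optional[str]:
    """Page each role in order until one responds.

    page(role, wait_minutes) is a hypothetical callback that pages the role
    and returns True if a response arrives within wait_minutes.
    Returns the role that responded, or None if nobody did.
    """
    for role in ESCALATION_CHAIN:
        if page(role, RESPONSE_WINDOW_MINUTES):
            return role
    return None
```

Keeping the chain as data rather than nested conditions means a change to the paging order only touches the list.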

Healthcare Service Desk or Local Support Desk Detects Major Incident

Once three (3) to five (5) calls are received on the same issue, or one (1) call for a Critical Business Application or Core Infrastructure Service (KB04806*), perform the following:

Contact the OIT Network Operations Center (NOC) to make them aware of a service not working as designed
- Business Hours Escalation: Monday - Friday, 8:00am - 5:00pm, contact the NOC at 404-727-7667
- After-Hours Escalation: Call the operator at 404-686-1000 and request the On-Call person for the appropriate group, OR access EHConnect* and enter the Calendar ID of the appropriate group
After the initial page has been sent, please allow the technician 15 minutes to respond. If the On-Call technician fails to respond within 15 minutes, contact the operator for further assistance.
Provide the following information to the person contacted:
- Callback Number
- Customer Impacted
- Remedy Ticket Number
- Short Description of incident

* login required
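
The call threshold that opens the Service Desk procedures (three to five calls on the same issue, or a single call for a Critical Business Application or Core Infrastructure Service) can be expressed as a simple check. This is a minimal sketch: the function name and parameters are hypothetical, and because the process gives a range rather than one number, the sketch assumes the lower bound of three as the default trigger.

```python
def should_begin_escalation(
    call_count: int,
    critical_service: bool,
    threshold: int = 3,
) -> bool:
    """Return True when the call volume meets the detection threshold.

    critical_service is a hypothetical flag for a Critical Business
    Application or Core Infrastructure Service, where one call is enough.
    threshold defaults to 3, the low end of the stated 3-5 range (assumption).
    """
    if critical_service:
        return call_count >= 1
    return call_count >= threshold
```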

Network Operations Center (NOC) Detects Major Incident

Once a NetIQ or SMARTS Alert is received that has immediate potential to become a Major Incident, is displaying symptoms of one, or clearly meets the criteria of one, perform the following:

Incident is automatically logged via a monitoring notification (i.e. NetIQ/SMARTS)
Triage the Alert
Prioritize the Incident
Troubleshoot and validate the Alert
Contact the Application/Service Owner or On-Call Technician
For special situations, on behalf of the Application/Service Owner, contact the Service Desk via the Alert/Bat Line
Call in to the Bridge line to assist as needed

The Service Desk will publish the appropriate Notification based on criticality of incident.

Application/Service Owner or Technician Detects Major Incident

Once a Major Incident is identified, either through monitoring or other means, perform the following:

Create Incident ticket in ServiceNow*. The Short Description should contain the affected CI, and a brief description of the impact to users (written in customer-friendly language)
- The Application/Service Owner can complete the Incident record and then contact the Service Desk to create/issue the appropriate Major Incident Notification, or they can contact the Service Desk to create the Incident record and create the Major Incident Notification record.
- The Application/Service Owner is responsible for the content in the Incident record, communicating the proper information:
  - Short Description
  - Categorization
  - Configuration Item (CI)
  - Impact and Urgency
  - Who is impacted (Healthcare, University or both)
  - Which features of the Service are impacted
Update the Incident record with troubleshooting activities until resolved. When updating, do not change Assignment Group as this impacts reporting.
Submit an Emergency Change Request in ServiceNow, if a production system requires reboot or other modification
Once the Incident has been resolved, the Application/Service Owner or Crisis Manager will:
- Enter the resolution notes into the Incident record
- Contact the Service Desk to update the front end message and to resolve the outage record

* login required

User Detects Major Incident

If you experience a failure or interruption of an IT Service that directly affects normal business operations, report the issue to the University Service Desk by:

Calling 404.727.7777 OR Creating a ticket in ServiceNow*

If it is determined, by technical staff, that the issue is classified as a Major Incident, the following actions will be taken to update the community:

The University Service Desk front end message will be updated to let callers know that an issue has been identified
Status updates will be posted to the IT System Status Page* (help.emory.edu)
If Priority is a P1, an email will be sent to those subscribed to Major Incident Notifications.

* login required

To subscribe to Major Incident Notifications, see knowledge article KB02946

In any of the above cases, it is the Application/Service Owner, or designated representative, who has the authority to request that a Notification be created and published. It is the responsibility of the Application/Service Owner to confirm that the Incident is Major.

IMPORTANT: Regardless of origination source, an Incident record must be logged. A new Major Incident record must be created to initiate the Major Incident process. If an incident record already exists for the outage, it must be childed to the new Major Incident record.
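
The rule above (always open a new Major Incident record, then child any existing records to it) can be illustrated with a small data-structure sketch. This is not ServiceNow code; the `Record` class, the `open_major_incident` function and the record numbers used with them are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical ticket record with a parent/child relationship."""

    number: str
    # Excluded from repr and comparison to avoid parent/child recursion.
    parent: Optional["Record"] = field(default=None, repr=False, compare=False)
    children: List["Record"] = field(default_factory=list)


def open_major_incident(new_number: str, existing: List[Record]) -> Record:
    """Create a NEW Major Incident record and child existing records to it.

    An existing record is never promoted to be the Major Incident itself.
    """
    major = Record(number=new_number)
    for record in existing:
        record.parent = major
        major.children.append(record)
    return major
```

For instance, calling `open_major_incident("MI-NEW", [earlier_incident])` returns a fresh record whose `children` list contains `earlier_incident`, which now points back to it through `parent`.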