Difference between revisions of "Operations and Support"

From EITBOK

Revision as of 23:34, 22 September 2015

1 Introduction

In information technology, operations and support coordinate and carry out the activities and processes required to deliver and manage services provided by assets to business users and customers, at agreed levels. This knowledge area is all about maintaining an operation normal state in the EIT environments, meaning that systems and processes execute according to expectations, with no deviation. Operation normal states can be changed in two ways:

  1. Through proper channels via transition. Operation normal is suspended during a change, and then when the change is implemented, there is a new operation normal state. Changes can come to transition via construction or acquisition, sustainment, or retirement.
  2. Through incidents (ITIL), which interrupt the normal operational services. Incidents can be resolved with or without creating a problem (ITIL) to be handled through sustainment or strategy and governance. Incident resolution relies heavily on disaster preparedness.

There are three main elements involved in this function: assets (what), users (who), and services (how). Each of these has an inventory, catalog, or user list. Each of these also needs an overall function to manage and monitor the performance, regardless of asset, service, or user details. This chapter addresses how that overall function works.

Figure 1. Assets, Services, and Users ***add image***

Overall management of assets, services, and users is composed of two main parts: operations (assets and services) and support (users).

Operations manages the assets that deliver the services business processes rely on. It does so through technology that provides the following functions: event management, incident management, request fulfillment, problem management, and access management. This technology is itself categorized as an asset that provides services.

Support manages the users, and may include technology that provides the following functions: service desk, technical administration, EIT operations monitoring, and application configuration. Although these processes and functions are associated with operations, most of them have activities that take place across multiple stages of the service life cycle (based on the definition of service operations in the ITIL glossary).

This chapter covers topics in the Service Operation (SO) column, as shown in the diagram below.

Figure 2. ITIL Core Topics ***add image***

2 Goals and Principles

The main goals of operations and support are to maximize the use of resources and minimize the negative impact of changes to the environment. All organizations have an EIT environment, a sort of ecosystem where things do tasks for people; in other words, assets provide services for users. The goals (listed on the context diagram) translate into the following four lists:

  1. What things (assets) and which people (users) require support.
  2. What tasks (services) each thing (asset) performs, which people (users) expect those tasks (services) to happen, and what support each thing (asset) needs to keep functioning over time.
  3. What each thing (asset), task (service), and person (user) is actually doing.
  4. What actions to take to manage changes to the things (assets), tasks (services), and people (users), so the business processes continue operating properly.

Each of these can be measured quantitatively and objectively.

2.1 Guiding Principles

DO

  • DO have standards for technology and data, in order to minimize confusion and rework, and reduce the opportunity for technology- or task-specific positions or services.
  • DO share resources where possible and practical.
  • DO make sure that backups and restores work by testing periodically.
  • DO make sure that all changes are properly approved, and scheduled to be implemented to cause the least disruption to business processes.
  • DO have procedures documented for when incidents occur that guide the efforts to resolve the incident and restore services as quickly as possible.

DO NOT

  • DO NOT panic when incidents occur.
  • DO NOT allow emergency changes that have not been tested to enter your environment.
  • DO NOT allow changes to systems (patches) that do not use the normal application interface without very high level clearance. If necessary, make a backup of the system before running any patch.

3 Context Diagram

Figure 3. Context Diagram for Operations and Support ***add image***

4 Asset Management

What cannot be measured cannot be managed. The first requirement for any system is knowledge of the system itself. It is critical that a centralized or federated authority exists that is responsible for asset management within the organization, ideally with a separate budget to handle asset needs outside of projects. Manage assets to enable business processes to occur predictably.

Many organizations have suffered from project-only EIT organizations, letting hardware and software age out of support due to lack of funding. Operations and support organizations have a stake in ensuring assets are up-to-date, and that maintenance assistance from vendors is available for emergencies. See Chapter 12, Sustainment, for more on this important topic.

Some organizations may outsource support; however, the organization has invested heavily in the assets and depends on the services, so knowing the asset and service inventory is critical, regardless of which organization provides the support.

4.1 Asset and Facility Administration

Assets fall into four main types. The support teams that manage those assets vary depending on the asset types and the organization’s needs.

4.1.1 Types of Assets

There are four main types of assets to manage: facilities, hardware, software, and data. The first three types exist in order to create and use the fourth.

  • Hardware and Facilities
    • Hardware is the physical machinery installed in a facility which enables software to run and data to be stored and moved. These include (but are not limited to) desktop/laptop workstations, servers (individual and farms), disks, disk arrays, solid state storage, modems, routers, network cards and appliances, backup media, and wiring. These types of assets may break down over time as they are used, and need to be repaired or replaced.
    • Facilities contain and protect the hardware, and include buildings, security systems, environment conditioning, and uninterruptible power supplies. These types of assets may break down over time as they are used, and need to be repaired or replaced. Facilities may also become insufficient as needs grow and may require expansion, either in place or with the addition of other facilities, which then requires network connections to join the facilities. Or, company needs may change and facility contents may need to be moved to new facilities.
  • Software — Software comprises all digital assets that are loaded onto the machinery to provide a service. These include (but are not limited to) operating systems, utilities (backup, restore, compression, search, etc.), daemons, drivers, compilers and kernels, email and calendar applications, office productivity applications, hosted third-party applications, software development environments, database management systems, database access systems, network connection software, user interfaces, browsers, browser interfaces, client/server applications, reporting and analysis applications, and asset management applications. These types of assets don’t break down as they are used. However, they are patched or upgraded based on new hardware capabilities or programming techniques.
    Software does not always install and execute perfectly with no input. Most software requires some input or change in order to operate as expected. The table below outlines the differences between configuration, customization, and maintenance.
    | Feature | Configuration | Customization | Maintenance |
    |---|---|---|---|
    | Frequency | Installation; based on analysis or requirement changes | Based on development team release schedules | Based on maintenance schedules; vendor releases |
    | Changes to vendor-delivered files | No change to source code or executable files | Changes to source code and executable files (see NOTE) | No change to source code or executable files |
    | Limits | Changes execution in pre-determined ways | No change limits | Changes execution in pre-determined ways |
    | Implementation | Application data entry screens (wizards); local file with defined structure (config files) | Software development toolkits (SDK); source code modification and re-compilation | Follow vendor-supplied instructions; follow vendor-supplied update application |
    | Content | Parameters (data); selected options | Code (instructions) | Executable files; scripts |
    | Skills required | Ability to answer questions or follow directions | Familiarity with software used to create the application | Ability to answer questions or follow directions |

    NOTE: Customization may make installing patches or upgrades from the vendor difficult or impossible to do without losing the customization.
  • Data — This type of asset is most often overlooked. This is the content that all the applications use and the hardware stores. Data is loaded and stored on hardware, and moved over networks. Data is read, written, copied, manipulated, displayed, added, deleted, backed up, restored, archived, and sent via many protocols between systems. This type of asset does not break down over time; instead the detail becomes less relevant as time passes. This asset is managed through policies determining what kind of data needs to be available in specified timeframes, and when data should be archived (put on slower/older/less expensive access media) and purged (erased from media permanently). Never patch data through the database management system alone! All data changes should occur through an application interface to reduce the chance of corruption.
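The archive-and-purge policy described for data can be sketched in a few lines of Python. This is a minimal illustration only: the function name `retention_action` and the one-year and seven-year cutoffs are assumptions made for the example, not values from this chapter or from any regulation.

```python
from datetime import date, timedelta

# Hypothetical policy values chosen for illustration only.
ARCHIVE_AFTER = timedelta(days=365)      # move to slower, cheaper media
PURGE_AFTER = timedelta(days=365 * 7)    # erase from media permanently


def retention_action(last_used: date, today: date) -> str:
    """Return 'keep', 'archive', or 'purge' for a data set based on its age."""
    age = today - last_used
    if age >= PURGE_AFTER:
        return "purge"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"
```

In practice the cutoffs would come from the organization's data policies, and would likely differ by kind of data.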

4.1.2 Types of Support Teams

There are four main types of support team organizations: in-house local, embedded, in-house remote, and offshore. Larger organizations may have sub-teams of each type, or combinations of these types, depending on the assets, business needs, and costs.

  • In-house Local — In-house local support has the support personnel co-located with the assets, usually in a datacenter. These support staff are part of an EIT organization. Business users may or may not be in the same location, depending on the organization.
    Having support teams local to the assets means that incidents involving hardware can be resolved sooner, because no travel time is needed for remote staff to get to the site.
  • Embedded Support — Embedded support means EIT support staff is co-located with the business unit (BU) directly, not with the asset location. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards. If there are Enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs.
    Shadow EIT is a common term for BU-owned technical support and development. These organizational structures grow for a variety of reasons. The main drawbacks are below.
    • Asset life-cycle maintenance — Routine maintenance activities may not be performed consistently, if at all, resulting in assets going out of support over time, even though license and maintenance costs are still being incurred.
    • Future support — If in the future, support needs to come from EIT rather than the BU, non-standard assets may be more difficult or expensive to support, or must be converted to use standard assets, which may be costly.
    • Skills — Technical support staff that is part of a business unit may not receive the same technical training and cross training that being in an EIT support organization would provide. It may be that someone in the BU learned on-the-job instead of having formal technical training. When an incident (see Support Incidents and Recovery) occurs (something unexpected happens or an action performed without technical review has unintended consequences), they may not be prepared to resolve it without formal EIT assistance.
  • In-house Remote — EIT organizations provide in-house remote support outside of the corporate headquarters or business offices. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary. This support model can be useful if there is little need for personal contact with the assets (such as when using a vendor-owned cloud implementation), so the support employees can work from home, or at a separate support center location.
  • Offshore — Offshore support occurs in another country. Many companies specialize in remote technical support. Additionally, offshore support can occur during work hours in that location, which is overnight in the datacenter.

4.1.3 Support Requirements

Each organization is different, and the needs for all assets in the environment are different as well. Finish defining the support requirements before looking for solutions (technical or otherwise).

  1. Importance — Each organization varies in how important each asset is to the organization, depending on the industry and the maturity of the organization. There may be several installations of the same asset in an environment (such as in a server farm) — each instance is treated as a separate asset. This includes production environments, and any failover, testing, sandbox, or development environments. These questions may help guide your requirements design.
    • Which assets are mission critical?
    • Which assets are critical for teams to be effective?
    • Which assets can be non-functional in an emergency?
    • How long can each asset be non-functional during an incident, without significantly affecting normal business processes?
  2. Type — Each organization varies in the types of assets needing support. Below are two ends of the levels of asset management that can exist.
    1. If the organization is using cloud computing, the need may be for only monitoring of applications and data, as the vendor handles all the rest of the monitoring needs according to contractual agreements.
    2. In a traditional EIT organization with an internal datacenter, the need would be for data about machines, networks, building features, etc.
  3. Documentation — Each organization varies in what level of detail in documentation is necessary to ensure adequate support of the assets. In mature organizations, part of the turnover exercise is a review of all relevant documentation with the support team. Below are two ends of the levels of documentation that can exist.
    1. HIGH — If the organization has an internal support staff with great expertise, the systems are not vendor supported (home grown or highly customized), the level of detail required may be high enough to include information that allows staff to make significant changes to the internal workings of the asset — for example: software development toolkits (SDK) or parts lists.
    2. LOW — If the organization has an agreement for support with an off-site vendor, the level of detail may be low — for example, support phone numbers for the vendor and minimal instructions for issue troubleshooting.
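One way to capture the answers to the importance questions above is a simple record per asset instance. The sketch below is an assumption about how such a record might look; the `Asset` fields and the `support_priority` function are hypothetical names introduced for illustration, not part of any particular asset management tool.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    environment: str           # e.g. "production", "failover", "test"
    mission_critical: bool
    max_downtime_hours: float  # how long the asset can be non-functional


def support_priority(assets):
    """Order assets so mission-critical, least outage-tolerant ones come first."""
    return sorted(
        assets,
        key=lambda a: (not a.mission_critical, a.max_downtime_hours),
    )
```

Each installation (production, failover, sandbox, and so on) gets its own record, matching the point that every instance is treated as a separate asset.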

4.2 Support Processes

There are three main types of support processes for assets: monitoring, auditing, and scheduling. Define these processes before looking for solutions (technical or otherwise) as well; they provide requirement information that can be used in tool selection later.

  1. Monitoring — The kinds of processes needed to view the execution of the services operating on or using the assets vary by organization. Below are two ends of the ranges of monitoring that can exist.
    1. Simple monitoring and alerting — The systems operate and only kick out alarms when certain thresholds are met or exceeded.
    2. Complex monitoring and real-time management — The systems are complex and interdependent, so support team members may need to actually see messages or monitoring screens updating in order to keep the system operating within acceptable parameters.
  2. Auditing — The level and frequency of auditing or review corresponds to the maturity of the EIT organization and the industry requirements. This may be important for certain organizations dealing with healthcare or finances, or parts of organizations dealing with human resources information.
    1. Simple annual review — The review of processes occurs annually as a checkpoint for owner or shareholder reports. There are no regulatory or legal requirements for reviews.
    2. Complex periodic auditing — This includes published certifications of compliance with laws, regulations, and internal process rules, acknowledgements from employees that all required processes are executed correctly, and evidence of acknowledgement from customers that they understand their rights.
  3. Scheduling — Support schedules vary by each organization’s needs. There are two main support schedules: monitoring and maintenance.
    1. Monitoring — Different types of monitoring can be required depending on the time of day, day of the week, or the day of the month or year. All monitoring should have instructions for what situations to look for and what to do in those situations, such as whom to inform if a situation arises, or a list of expected anomalies and how to handle them. Monitoring can also be automated to the point of sending alerts only when expected thresholds are crossed.
    2. Maintenance windows — This schedule includes standard and emergency maintenance preferences (for example, nightly for standard maintenance, and during lunchtime in an emergency to reduce work impact when possible). Plan each maintenance session with a minimum and maximum number of tasks; the minimum prevents unnecessary shutdowns for tasks that can wait until there is enough work to justify one. Each window should have a list of the processes, and the assets they run on, that need attention during that window (such as monthly cleanup of temporary storage areas). Questions to answer include: How many tasks can be scheduled? How much of the system needs to be taken offline during this window? If one part is taken offline, what happens to systems downstream from or linked to that part? When are the preferred times for emergency maintenance if there is a choice?
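The simple monitoring-and-alerting case described above, where alarms are raised only when thresholds are met or exceeded, can be sketched as follows. The metric names and limits are invented for the example, and `check_thresholds` is a hypothetical helper rather than the interface of any real monitoring product.

```python
def check_thresholds(readings, thresholds):
    """Return alert messages for metrics that meet or exceed their threshold."""
    alerts = []
    for metric, value in readings.items():
        limit = thresholds.get(metric)
        # Metrics without a configured threshold are ignored.
        if limit is not None and value >= limit:
            alerts.append(f"ALERT {metric}: {value} >= {limit}")
    return alerts


# Example: only CPU usage has crossed its limit.
alerts = check_thresholds(
    {"cpu_pct": 95, "disk_pct": 40},
    {"cpu_pct": 90, "disk_pct": 80},
)
```

A real deployment would also attach the instructions mentioned above, such as whom to inform for each alert.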

4.2.1 Support Levels

Define requirements for how to structure the support staff to handle the varying levels of support the assets require, especially depending on the support schedule. The monitoring staff would call for support when a system is not performing as expected or within defined thresholds.

  1. Hierarchy — Most organizations have multiple levels of support, each increasing in technical expertise. Larger organizations can provide support for a majority of issues with cheaper staff in the lower levels.
    1. Self-service (lowest) — This level of support consists of documentation websites or interactive voice response (IVR) systems with instructions for resolving common situations.
    2. Service representatives — This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    3. Technical support — This level of support has familiarity with the systems and can work with the caller using their expertise and access to monitoring tools. This level of support is sometimes called Tier 2. In larger organizations, there may be more than one technical support layer, up to and including the development organization.
    4. On-site technician — This level of support is the highest level, and sends a technician to the site when all lower levels have failed to resolve the caller’s issue.
  2. Escalation — In organizations with multiple support levels, a process for moving an issue to a higher level is necessary, to minimize time spent on resolving the issue.
    1. When to escalate:
      1. All options at a support level have been exhausted without being able to successfully resolve the issue.
      2. The issue is on a list of known issues that require higher level of support expertise.
      3. Upon request from the caller.
      4. Upon request from the support staff.
    2. How to escalate:
      1. Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
      2. Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  3. Specialization — In larger organizations, support organizations can be specialized to certain subject areas or applications.
    1. Hardware support
    2. Software support by application or business area
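The hierarchy and escalation rules above lend themselves to a small sketch. The tier names follow the list in this section; the function names `should_escalate` and `escalate` are hypothetical, and a real ticket management system would have its own way of reassigning issues.

```python
# Support levels from lowest to highest, as listed in this section.
TIERS = [
    "self-service",
    "service representative",
    "technical support",
    "on-site technician",
]


def should_escalate(options_exhausted, known_higher_level_issue,
                    caller_requested, staff_requested):
    """Escalate when any one of the four listed conditions holds."""
    return any([options_exhausted, known_higher_level_issue,
                caller_requested, staff_requested])


def escalate(current_tier):
    """Return the next higher support level, or the top level if already there."""
    index = TIERS.index(current_tier)
    return TIERS[min(index + 1, len(TIERS) - 1)]
```

For example, `escalate("service representative")` returns `"technical support"`.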

4.2.2 Support Tools

Support tools come in two categories: asset management (what exists) and asset monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the assets and services provided. Only communicating with the business in technical terms may not be helpful or understandable when notifying business users of outages or the need for upgrades.

Figure 4. Business View of Operations ***add image***

  • Asset Management Tools — Many software packages exist to manage asset information. The hard work lies in populating the package’s database with the organization’s asset catalog, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization’s data loaded.
    Each tool may have one or more sweet spots in the following areas: depth of monitoring (system detail), breadth of monitoring (system and network scans), or user interface (information display and reporting/graphical output).
    1. Asset catalog detail — Some packages include options for managing data for each asset (hardware, software, services, and data) such as configuration settings, license and compliance information, contract and contact information, cost and depreciation tracking, requisition and procurement information, and support agreement information. Some of this data must be entered or uploaded as it does not exist on the asset itself.
      Some packages also allow for business views of the services offered, so assets can be tied to business functions. Each EIT service uses one or more assets (or components), and should map to one or more business processes, which then feeds into importance rankings and service level agreement parameters for each EIT service.
    2. Asset scanning capabilities — Some packages focus on gathering information from the systems themselves, although there is usually some sort of processing cost while it runs. This can be a way to quickly get asset data into the tool’s database.
      While system scans can replace some data entry effort up front, there is a lot of work to do making sure that the tool has the proper access to all the systems in order to run the scans, and each tool may not work with every system component or network connection in the organization. In those cases, some other tool can be used to dump data into a format that the asset management tool can import, but of course, that is a manual process with risks of errors or omissions.
      When the assets have been entered into the tool’s database, review the data to ensure it meets the expectations and has read the system data correctly. Some configuration is required to set up frequency of updates and re-scan functions, as well as notification and alert thresholds and contacts.
    3. User interface and reporting/analysis — Some packages focus on the ability to track and manage support issues for asset items, and then analyzing for trends in incidents or problems with certain assets. These tools commonly also support tracking change requests and enhancement requests. See Chapter 11, Transition for more on this topic.
      User interfaces for lookup and reporting can be either installed clients or web-based. Client-installed versions can enable data entry and other auditable activities for authorized users, and may have more intuitive user interfaces for interacting with the tool in order to enter or modify asset data. However, web-based interfaces are much easier to deploy and manage for large numbers of users.
      Some packages also include options for impact and dependency analysis to help determine if a change would have adverse or unintended effects, as well as alarms and notifications for license expiration or version updates from vendors.
  • Monitoring Tools
    • Monitoring capabilities — Some packages focus on being able to monitor components to very great detail. However, there may be a limited list of systems that the tool can function on to that level of detail.
    • User interface — Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors.
      These also may have more intuitive user interfaces for interacting with the tool in order to enter or modify system data or schedules.
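The impact and dependency analysis mentioned above amounts to walking a dependency map to find everything downstream of a component. The sketch below assumes the asset catalog can be exported as a dictionary mapping each component to the components it depends on; that format and the name `downstream_impact` are assumptions for illustration.

```python
from collections import deque


def downstream_impact(depends_on, offline):
    """Return every component affected, directly or indirectly, when
    `offline` goes down. `depends_on` maps a component to the list of
    components it depends on."""
    # Invert the map: for each component, who depends on it?
    dependents = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(component)

    affected = set()
    queue = deque([offline])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With a map in which a payroll application depends on a database server and a reporting service depends on the payroll application, taking the database server offline flags both the payroll application and the reporting service.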

4.3 Support Incidents and Recovery

Support organizations exist not only to keep the systems operating normally but also to handle exceptions to normal operations: directly, by calling another team within the organization, or by calling in outside support from vendors.

4.3.1 Incidents versus Problems

Incidents (ITIL) are interruptions in normal processing on production systems. If a scheduled process does not start when expected, unexpectedly stops while executing, or executes longer than expected and somehow prevents something else from executing at all, it is an incident. Incidents causing incidents in other processes are considered related. ITIL also includes as incidents the failure of an asset that has not yet caused an interruption, but there is some disagreement there.

It is possible to resolve an incident with a workaround, which creates a problem.

Problems are situations where processes are executing, but not as normally expected, and the difference is tolerated or is a workaround. Problems are turned into either defects submitted to sustainment or enhancement requests submitted to the change request queue.

Incidents can result in problems, but problems are not incidents. One way to understand the difference is to consider a traffic accident.

  1. A jack-knifed semi-truck across all lanes of traffic is an incident. Traffic is stopped.
  2. Incident management (police, firefighters, tow truck) is called. Traffic is routed to the shoulder (workaround). Incident resolved. Traffic movement is restored, but not as much or as fast as normally expected.
  3. There are now two problems.
    1. The first problem is that some lanes are still blocked. Getting the truck moved to the side helps, and removing the truck entirely restores traffic to the pre-incident state.
    2. The second problem is the reason why the truck jack-knifed in the first place: the weather, the road, the traffic, the truck, the driver, or some combination.
      1. If caused by the weather (bad conditions), then resolve by communicating that road conditions are bad while salting the road (preventing future incidents).
      2. If caused by the road, patch the bad area and schedule for repair.
      3. If the level of traffic or the traffic pattern is the reason, examine how the traffic deteriorated to cause the accident, and recommend changes to the traffic pattern, such as adding turn signals or turn lanes.
      4. If the truck is the reason, fix the truck.
      5. If the driver is the reason, educate the driver.

Incident management is the process of restoring normal operations as quickly as possible with minimal business impact. This is an emergency, where getting processes working now is more important than making the system permanently stable. A problem occurred, causing an incident.

Problem management is the process of removing causes of incidents, or preventing incidents from occurring in the first place. Some problems may be so rare that a permanent remedy would cost more than the incidents they cause, so problem management also includes prioritization. Problem management is best handled under sustainment, as it is targeted toward long-term solutions, rather than emergencies.
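The relationship between incidents and problems can be illustrated with two small record types, using the traffic accident example. This is a sketch under assumptions: the class and field names are hypothetical and do not reflect ITIL data models or any ticketing tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    description: str
    resolved: bool = False


@dataclass
class Incident:
    description: str
    resolved: bool = False
    problems: List[Problem] = field(default_factory=list)

    def resolve(self, workaround: Optional[str] = None) -> None:
        """Restore service now; a workaround leaves a problem for later."""
        self.resolved = True
        if workaround:
            self.problems.append(Problem(f"Replace workaround: {workaround}"))


# The jack-knifed truck: the incident is resolved by a workaround,
# which leaves a problem open for longer-term handling.
accident = Incident("Truck blocking all lanes")
accident.resolve(workaround="route traffic to the shoulder")
```

The incident closes as soon as traffic moves again, while the problem it created stays open until the lanes are fully cleared.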

4.3.2 Incident Notification or Detection

Incidents are detected automatically by monitoring processes, or are reported by users. Support teams may have standard email addresses or phone numbers for support calls. A survey taken by Everbridge [1] includes the following findings.

  • Organizations averaged 150 EIT incidents a year, each taking an average of 2.25 hours to resolve.
  • Reporting outages by phone alone is insufficient.
  • One third of respondents reported difficulty quickly connecting to the right on-call support person to handle the outage.
  • Almost unanimously, respondents reported incidents of late or no response from the assigned support team.
  • Of the five most reported incident types (hardware failure, application outage, datacenter network outage or performance issues, or connectivity issues between sites and offices), network connectivity of some kind seems to be the most common issue. Consistent with this, the survey also reported that network administrators were the most common type of support staff pulled into incident resolution efforts.
  • Two thirds said that incidents have caused staff personal stress or increased workloads.

Mature organizations have incident reporting processes, and automated responses for every step in the process of resolving the incident. These reporting mechanisms can be as simple as an email sent to a team mailbox, or as complex as using a standard ticket assignment tool. Both of these can provide logs of the issue for future reference.

It is very important to keep the person reporting the incident updated, as well as anyone affected by the outage. Lack of information leads to bad impressions of the support team, no matter how well they resolve the problem.

4.3.3 Incident Resolution

By nature, incidents are unpredictable, and therefore uncomfortable for all involved. Mature organizations schedule a good mix of skilled staff to be available at all times, and fund additional staffing when needed so that support staff members are not overwhelmed during a major incident.

Link known incidents to standard remedies to help support staff resolve expected situations efficiently. Below are examples from both ends of the range of incidents that could occur:

  1. Minor incident — Process does not execute as expected. An expected file is late in arriving from a vendor, preventing a process from executing. The problem is a file arrival issue.
    1. If the cause is that there was no data to send, then the incident is resolved by removing the requirement for the file for this one instance. The problem is that there may be no data for multiple execution instances, which requires a resolution that handles the situation automatically without causing an incident, such as the vendor sending an empty file or the application's wait timing out.
    2. If the cause is a processing delay on the vendor's side, then the incident is resolved only when the file arrives. The problem is then that the SLA between the vendor and the application owner is not being met, which requires a resolution of some kind (revisiting the SLA and adjusting the schedule, removing the dependency on the file arriving on time, or some other solution).
    3. If this incident occurs frequently, the problem importance can be elevated to a higher priority for resolution.
  2. Major incident — An upgrade or enhancement implementation experienced an unexpected issue, causing an unexpected system restart. All processing stops without notice.
    1. Automatic restart — If the system processing restarts automatically, there could be problems with the data, depending on where in the processing the halt occurred, and how robust the processing is for handling exceptions like unexpected restarts. It could be that incidents or problems occur for processes and data unrelated to the upgrade or enhancement due to co-existence on the same system.
    2. Unsuccessful restart — If the system processing does not successfully restart, or the system does not restart successfully at all, the incident continues until either a restore occurs from a backup taken prior to the change, or the system restarts successfully after some other action.
  3. Major incident — Power outage occurs in the datacenter (see Chapter 7, Disaster Recovery, for more on this topic).
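
The minor-incident example above mentions two automatic remedies for a late vendor file: the vendor sends an empty file, or the application's wait times out. A minimal sketch of a wait that handles both cases without raising an incident might look like the following. The function name, the `FileStatus` values, and the file path are hypothetical, and the sketch assumes the vendor delivers the file atomically (for example, by renaming it into place), so a file that exists is complete.

```python
import os
import time
from enum import Enum


class FileStatus(Enum):
    READY = "ready"          # file arrived and contains data
    EMPTY = "empty"          # vendor sent an empty file: nothing to process
    TIMED_OUT = "timed_out"  # file did not arrive within the wait window


def wait_for_vendor_file(path: str, timeout_s: float, poll_s: float = 1.0) -> FileStatus:
    """Poll for the file until it appears or the wait window closes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return FileStatus.EMPTY if os.path.getsize(path) == 0 else FileStatus.READY
        if time.monotonic() >= deadline:
            return FileStatus.TIMED_OUT
        time.sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical path; a zero timeout checks once and returns immediately.
    status = wait_for_vendor_file("incoming/vendor_feed.csv", timeout_s=0)
    print(status.value)
```

The caller can then skip the run on `EMPTY`, process the data on `READY`, and record a `TIMED_OUT` result for problem management rather than paging the support team each time.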

Ideally, each asset has documentation on normal operations expectations, remedial steps for known exception occurrences, and a guide for troubleshooting when unexpected exceptions to normal operations occur.

4.3.4 Resolution Notification

In most organizations, whenever an incident occurs, all users affected by the incident receive a notice that includes what happened, what actions are occurring to resolve the incident, and an expected time of return to normal operations.

It is important to communicate with the users at least every hour during an incident, even if the communication is only that work continues to determine how to restore operations as quickly as possible. Assign one person to handle user notifications during the incident to prevent confusing or conflicting messages to the users. Some service level agreements can specify the frequency of communications during an incident (see Service Level Agreements).

When the incident is resolved, communicate to the users the status and next steps for determining the cause and resolution of any problems detected.
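
The notice contents and hourly cadence described in this section can be sketched as a small helper. This is only an illustration: the class and function names are hypothetical, and the one-hour interval is an assumption that a service level agreement may override.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed default of "at least every hour"; an SLA may specify another value.
UPDATE_INTERVAL = timedelta(hours=1)


@dataclass
class IncidentNotice:
    what_happened: str
    current_actions: str
    expected_restore: Optional[datetime]  # None while no estimate exists


def format_notice(notice: IncidentNotice, sent_at: datetime) -> str:
    """Build the user-facing text, including when the next update is due."""
    fmt = "%Y-%m-%d %H:%M"
    eta = notice.expected_restore.strftime(fmt) if notice.expected_restore else "not yet known"
    return "\n".join([
        f"What happened: {notice.what_happened}",
        f"Current actions: {notice.current_actions}",
        f"Expected return to normal operations: {eta}",
        f"Next update by: {(sent_at + UPDATE_INTERVAL).strftime(fmt)}",
    ])


def update_due(last_sent: datetime, now: datetime) -> bool:
    """True when the interval has elapsed, even if there is nothing new to report."""
    return now - last_sent >= UPDATE_INTERVAL


sent = datetime(2024, 1, 1, 9, 0)
notice = IncidentNotice("Order entry is unavailable.", "Restarting the application servers.", None)
print(format_notice(notice, sent))
print(update_due(sent, datetime(2024, 1, 1, 10, 0)))  # True
```

Routing every notice through one function mirrors the advice to have a single person own user communication, which keeps the messages consistent.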

4.3.5 Closed-loop Analysis

Monitoring of assets is not sufficient by itself; monitoring of support instances and resolutions is necessary to identify trends, and then develop improved remediation actions for frequently occurring issues. Asset support teams can track the frequency of issues, and reduce time spent resolving issues in one of these ways:

  1. Reduce the time spent on resolution.
    1. Develop and deploy better self-service instructions to reduce the Tier 1 calls.
    2. Modify the Tier 1 scripts to identify if the issue needs to be escalated immediately, rather than trying other solutions first.
    3. Modify the Tier 1 scripts to use the most effective solution, based on analysis of prior instances.
  2. Reduce the frequency of the issue.
    1. Create a problem ticket or enhancement request that prevents the problem from occurring.
    2. Create a problem ticket or enhancement request that handles the issue automatically, avoiding support calls.
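
A minimal sketch of the trend analysis described above: it groups closed tickets by issue type, then ranks them by total time spent so the support team can see where better scripts or a problem ticket would pay off most. The ticket data, function names, and threshold are hypothetical. A real ticketing tool would supply this data through its own export or reporting features.

```python
from collections import defaultdict

# Hypothetical closed tickets: (issue type, minutes spent on resolution)
TICKETS = [
    ("password reset", 10),
    ("password reset", 12),
    ("vpn failure", 45),
    ("password reset", 8),
    ("disk full", 90),
    ("vpn failure", 55),
]


def summarize(tickets):
    """Return (issue, count, total_minutes) rows, largest total time first."""
    minutes_by_issue = defaultdict(list)
    for issue, minutes in tickets:
        minutes_by_issue[issue].append(minutes)
    rows = [(issue, len(m), sum(m)) for issue, m in minutes_by_issue.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def frequent_issues(tickets, min_count):
    """Issues seen at least min_count times: candidates for a problem ticket."""
    return [issue for issue, count, _ in summarize(tickets) if count >= min_count]


for issue, count, total in summarize(TICKETS):
    print(f"{issue}: {count} tickets, {total} minutes")
print(frequent_issues(TICKETS, min_count=2))
```

Ranking by total time rather than by count alone highlights issues that are infrequent but expensive, which a pure frequency report would miss.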

5 User Management

Users use assets. Asset management enables business processes to occur predictably. Therefore, manage users to enable controlled access and prevent chaos. Depending on the organization, funding for user support may come from EIT as a provided service; otherwise, funding comes from the departments on a usage basis.

5.1 User and Security Administration

Users require access to systems and applications to do work. Systems and applications require information to authenticate users and enable access to authorized activity. Both sides need to agree. The most common method is to have profiles for users, and security authentication on the system side.

5.1.1 User Profiles

Profiles are collections of users with similar or identical qualities. Create user profiles by job or department, by usage patterns, by level of experience, by location, and so on. There are three common ways to assign profiles to users: user-based, job-based, and role-based.

  1. User-based — Each user has an individual profile with assigned rights and privileges to various systems. Mass changes are difficult and time-consuming, but this method has the most flexibility.
  2. Job-based — Each defined job has rights and privileges assigned to it. Link users to jobs so they can inherit the rights and privileges. When job responsibilities change, all users linked to that job inherit the changes. Mass changes are easier, but some changes may not apply, and this method requires a link to a system maintaining the job definitions.
  3. Role-based — Each system has defined roles with access rights to certain parts of the system or application. Some systems may have identical roles, so one role can share access to multiple applications for the same rights. Roles attach to users, and if job functions or access needs change, roles are attached to or detached from the users. Mass changes are simply adding or removing users from roles.
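
The role-based method above can be sketched in a few lines. This is an illustrative model only: the class name, role names, and rights are hypothetical, and real directory products expose their own interfaces for the same idea.

```python
from collections import defaultdict


class RoleDirectory:
    """Roles carry access rights; users gain or lose rights only through roles."""

    def __init__(self):
        self.role_rights = {}                # role -> set of (application, right)
        self.user_roles = defaultdict(set)   # user -> set of role names

    def define_role(self, role, rights):
        self.role_rights[role] = set(rights)

    def attach(self, user, role):
        if role not in self.role_rights:
            raise KeyError(f"undefined role: {role}")
        self.user_roles[user].add(role)

    def detach(self, user, role):
        self.user_roles[user].discard(role)

    def can(self, user, application, right):
        return any((application, right) in self.role_rights[role]
                   for role in self.user_roles.get(user, ()))


directory = RoleDirectory()
# One role can grant the same right in more than one application.
directory.define_role("report_reader", {("payroll", "read"), ("billing", "read")})

# A mass change is just attaching many users to the role.
for user in ("alice", "bob"):
    directory.attach(user, "report_reader")

print(directory.can("alice", "billing", "read"))   # True
directory.detach("alice", "report_reader")
print(directory.can("alice", "billing", "read"))   # False
```

Because rights live on the role rather than on each user, changing what a job function may do is a single edit to the role definition.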

5.1.1.1 Onboarding and Offboarding

When users enter the organization, create a user profile for each. When users leave the organization, promptly disable their profiles to prevent misuse of their access.

5.1.1.2 Multiple Profiles

Some users may require multiple profiles for different situations, such as one for normal work and another for performing maintenance or recovery operations.

5.1.2 User Security

Applications have access points where users can connect to the system. There are commonly three parts to enabling a secure connection between a user and a system:

  1. Authentication — Each access request must verify the identity of the user. This can be done with a user ID and password, a certificate, a token, or a notification or pass-through from another security system like Active Directory or LDAP.
  2. Authority — Each user has authority to perform functions within the application. Functions can be enabled or disabled based on the user’s assigned authority.
  3. Auditing or accounting — Every access request to the application is logged for review, regardless of success.
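
The three parts listed above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not a production design: the `AccessPoint` class and its methods are hypothetical, password checking stands in for whichever authentication method the organization uses, and the iteration count is an arbitrary example value.

```python
import hashlib
import hmac
import os
from datetime import datetime, timezone


class AccessPoint:
    def __init__(self):
        self._users = {}      # user_id -> (salt, password_hash, allowed functions)
        self.audit_log = []   # one entry per access request

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        # Salted key derivation, so stored values are not plain passwords.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def register(self, user_id, password, functions):
        salt = os.urandom(16)
        self._users[user_id] = (salt, self._hash(password, salt), set(functions))

    def request(self, user_id, password, function) -> bool:
        record = self._users.get(user_id)
        # 1. Authentication: is this user who they claim to be?
        authenticated = record is not None and hmac.compare_digest(
            record[1], self._hash(password, record[0]))
        # 2. Authority: is this function enabled for the user?
        authorized = authenticated and function in record[2]
        # 3. Auditing: log every request, whether or not it succeeds.
        self.audit_log.append({
            "time": datetime.now(timezone.utc),
            "user": user_id,
            "function": function,
            "authenticated": authenticated,
            "authorized": authorized,
        })
        return authorized


ap = AccessPoint()
ap.register("alice", "correct horse", {"view_report"})
print(ap.request("alice", "correct horse", "view_report"))    # True
print(ap.request("alice", "correct horse", "delete_report"))  # False
print(ap.request("alice", "wrong password", "view_report"))   # False
print(len(ap.audit_log))                                      # 3
```

Keeping the audit entry outside the success path is the point of the third step: failed and denied requests are often the most useful ones to review.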

5.1.3 Types of Support Teams

5.1.3.1 Embedded Support

Embedded support means EIT support staff is co-located with the business unit directly. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards, such as using a ticketing system consistently. If there are enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs.

5.1.3.2 In-house Remote

EIT organizations provide this type of support in a separate location from the users. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary.

5.1.3.3 Offshore

Offshore support occurs in another country. Many companies specialize in remote user support.

5.1.4 Support Requirements

Each organization is different, and the needs of the users within an organization differ as well. Define the support requirements before looking for solutions (technical or otherwise).

  1. Task complexity — Organizations vary in how complex their user support tasks are. Some user support tasks may be simple and frequent, like password resets. Other tasks may require specialized access from certain teams to accomplish. These questions may help guide your requirements design.
    • Which support tasks are common?
    • Which support tasks are deliverable through self-service functions?
    • Which support tasks can be disabled during an emergency?
    • What is the expected task turnaround time?
  2. User skill level — Organizations vary in the types of users needing support. Below are the two ends of the range of user skill levels.
    1. Some users are very familiar with the applications and can perform some support tasks themselves.
    2. Some users are very uncomfortable with computers and may need additional attention to make them comfortable and successful.

5.2 Support Processes

There are two main types of support processes for users: help desks and self-service portals. Each of these support processes requires monitoring and auditing to ensure proper performance.

  1. Monitoring — Help desks can be monitored through side-by-side call monitoring, or after-call reviews. Monitoring ensures compliance with standard policies and that standard activities are being executed properly.
  2. Auditing — The level and frequency of auditing or review corresponds to the maturity of the EIT organization and the industry requirements. This may be important for certain organizations dealing with healthcare or finances, or parts of organizations dealing with human resources information.

5.2.1 Help Desk

Define requirements for how to structure the support staff to handle the varying levels of support the users require, especially depending on the support schedule. Help desk staff call for support when a system is not performing as expected or is operating outside defined thresholds.

  1. Hierarchy — Most organizations have multiple levels of support, each increasing in authority. Larger organizations can provide support for a majority of issues with cheaper staff in the lower levels.
    1. Service representatives — This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    2. Technical support — This level of support has familiarity with the systems and user support tasks. Staff at this level work with the caller using their expertise and access to additional functions and tools. This level of support is sometimes called Tier 2. In larger organizations, there may be more than one technical support layer, up to and including the development organization.
    3. On-site technician — This level of support is the highest level, and is dispatched to the site when all lower levels have failed to resolve the caller’s issue.
  2. Escalation — In organizations with multiple support levels, a process for moving an issue to a higher level is necessary, to ensure that time spent on resolving the issue is minimized.
    1. When to escalate:
      1. All options at a support level have been exhausted without being able to successfully resolve the issue.
      2. The issue is on a list of known issues that require a higher level of support expertise.
      3. Upon request from the caller.
      4. Upon request from the help desk staff.
    2. How to escalate:
      1. Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
      2. Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  3. Specialization — In larger organizations, support organizations can be specialized to certain subject areas or applications.
    1. Security access — Specialized teams evaluate requests for privileged access.
    2. Application deployment — Specialized teams deploy applications to users upon request.
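
The escalation rules above ("when to escalate" and "how to escalate") can be sketched as a small decision function over a ticket. The tier names follow the hierarchy in this section. The `Ticket` fields, the known-issue table, and the example issue are hypothetical, and reassignment in a real ticket management system would go through that tool's own interface.

```python
from dataclasses import dataclass

TIERS = ["Tier 1", "Tier 2", "On-site technician"]

# Hypothetical list of known issues and the lowest tier able to handle each.
KNOWN_ISSUE_MIN_TIER = {"database corruption": 1}


@dataclass
class Ticket:
    issue: str
    tier: int = 0                   # index into TIERS
    options_exhausted: bool = False
    caller_requested: bool = False
    staff_requested: bool = False


def should_escalate(ticket: Ticket) -> bool:
    """Apply the four 'when to escalate' conditions."""
    needs_higher_tier = ticket.tier < KNOWN_ISSUE_MIN_TIER.get(ticket.issue, 0)
    return (ticket.options_exhausted or needs_higher_tier
            or ticket.caller_requested or ticket.staff_requested)


def escalate(ticket: Ticket) -> str:
    """Reassign the ticket to the next tier, if one exists, and return its tier."""
    if should_escalate(ticket) and ticket.tier < len(TIERS) - 1:
        ticket.tier += 1
        # The next level starts with a clean slate.
        ticket.options_exhausted = False
        ticket.caller_requested = False
        ticket.staff_requested = False
    return TIERS[ticket.tier]


print(escalate(Ticket("password reset")))                          # Tier 1
print(escalate(Ticket("database corruption")))                     # Tier 2
print(escalate(Ticket("printer offline", options_exhausted=True))) # Tier 2
```

Checking the known-issue list first lets Tier 1 hand off immediately instead of working through scripts that are already known not to help.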

Some organizations may outsource user support. However, the organization remains heavily invested in its users and depends on the services, so knowing the user profiles is critical regardless of where the support is provided and by which organization.

5.2.2 Self-service Support

Examples of self-service functions are password resets or standardized software downloads and installations. Self-service functions are commonly delivered through interactive voice response (IVR) systems or websites, where users request the support service without support staff involvement.

5.2.3 Support Tools

Support tools come in two categories: user management (what exists) and help desk monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the users and services consumed.

  • User Management Tools — Many software packages exist to manage user information and profile assignments. The hard work lies in populating the package’s database with the organization’s users, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization’s data loaded.
    Each tool may have one or more strengths in the following areas: profile management (user detail), remote connectivity to user workstations, or user interface and reporting (information display and reporting/graphical output).
    • Profile management — Some packages include options for managing data for each user and profile such as permission and authorization settings, contact information, and support agreement information. Some of this data must be entered or uploaded as it does not exist on the asset itself.
    • Remote connectivity — Being able to log directly into the user’s workstation enables support staff to see what the user sees to determine and resolve the user’s issue.
    • User interface and reporting/analysis — Some packages focus on the ability to track and manage support issues, and then analyze trends in incidents or problems with certain assets. These tools commonly also support tracking change requests and enhancement requests. See Chapter 11, Transition, for more on this topic.
    • User interfaces for lookup and reporting can be either installed clients or web-based. Client-installed versions can enable data entry and other auditable activities for authorized users, and may have more intuitive user interfaces for interacting with the tool in order to enter or modify asset data. However, web-based interfaces are much easier to deploy and manage for large numbers of users.

  • Help Desk Monitoring Tools
    • Monitoring capabilities — Many tools handle monitoring help desk or call center activity. This type of monitoring is not specific to EIT operations and support help desks.
    • User interface — Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors.

6 Service and Operational Level Agreements

  • Negotiation
  • Notifications
  • Rewards and Penalties

6.1 Service Level Agreements

***no content***

7 Monitoring performance thresholds and SLAs

***no content***

7.1 Escalation Procedures and Mechanisms

***no content***

8 Change Request Management

  • Enhancement requests
  • Service requests

There are three types of transitions (see Chapter 11, Transition, for more information):

  1. Business transitions impacting the whole organization (or a key part of it)
  2. Application transitions related to application and database software platforms
  3. Infrastructure transitions related to servers, storage, and network devices

Figure 5. Change Life Cycle ***add image***

9 Supportability and Operational Suitability (SeBOK 1.2 Logistics)

Operation of the system is from the perspective of the user.

The needs of the customer should be the primary concern of the IT system support staff. That being said, sometimes the best path forward is to remove a system from service for maintenance and corrective action. The most important aspect of this concept is communication. The user community needs to be informed as quickly as possible, with as much information as possible on the timeframe of the outage and service restoration. Ideally, processes are already in place for all types of incidents, planned or unplanned, including a schedule and communication protocols approved by the business.

10 References

[1] http://www.cioinsight.com/it-strategy/infrastructure/slideshows/how-mishandled-it-incidents-spiral-out-of-control.html taken from http://go.everbridge.com/ITCommunicationeBook-web.html.