Maintenance and Control

From EITBOK

 

1 Introduction

Maintenance and control are two sides of the same process. Maintenance activities preserve existing function: they ensure that a system remains operational and does not degrade over time. Control of a system manages the approval process for requested changes to a system, including defect fixes, evolution of third-party components, and in-house enhancements. Control activities evaluate and determine approval and schedules for changes to existing function, which are then implemented through the transition of a solution to operations (see Transition into Operation).

Considerable research over the past several decades has shown that the majority of expenses (75-80 percent) reported as maintenance costs are actually due to changes to systems in response to enhancement requests.[3] [4] Changes that add functionality almost always add value, and therefore, should not properly be called "maintenance." They are instead properly seen as stages in the system's evolution.

1.1 Maintenance

An EIT system (asset or solution) is maintained by activities performed to ensure that the system continues to be operational over time. [1] The maintenance team is responsible for two main functions:

  • The maintenance team plans for, designs, and directs the maintenance necessary to prevent the deterioration and failure of a system, which can be due to defects, obsolescence, or environmental conditions.
  • The maintenance team recommends changes to ensure continuing environmental compatibility of a system due to evolution of its hardware and software as provided by vendors.

Maintenance is different from operations; however, the understanding of a system's maintenance needs is closely connected to its operational function.

  • Operations and support activities ensure that production systems operate consistently in a steady state of defined functionality. Operational support focuses outwardly on preserving execution of systems providing service to users.
  • Maintenance evaluates the system's operations over time in comparison to changing environment component standards (upgrades from vendors) and increasing age of components (failures due to normal use). Maintenance focuses inwardly on proactively preserving a system's ability to provide defined functionality services to users over time.

Maintenance is different from evolution or enhancement. In short, evolutionary development and enhancement change the functionality of a system or add functionality to a system. Maintenance does not.

The table below shows a comparison of maintenance with operational support and evolutionary development and enhancement.

| Aspect | Operational Support | Maintenance | Evolutionary Development and Enhancement |
|---|---|---|---|
| Functionality effect | Provides consistent functionality and recovers from failures | Predicts activities necessary to preserve functionality or prevent failures | Changes or adds functionality |
| Focus | Activity | Process | Process and activity |
| User effect | Low to none | Low to none | High |
| Active/passive | Passive | Active | Active |
| Frequency | Continual | Regularly scheduled | Scheduled for inclusion in development projects |
| Standard activities | Executes maintenance processes | Creates and implements maintenance processes | N/A |

To illustrate the differences, consider these concepts as they relate to a hot air balloon.

  • Maintenance is an important part of risk management. It scans for and analyzes changes in current safety regulations, wind and weather conditions, and the condition of the heater, basket, and balloon fabric, which will determine what parts need to be replenished, repaired, or replaced. Maintenance determines the schedule and defines the standard checklist of pre- and post-flight tasks, as well as the schedule and criteria for performing occasional maintenance tasks, such as examining parts for signs of wear or failure, resulting in activities to patch the basket or balloon, or replace hoses, straps, heating elements, or ropes before they reach end of life.
  • Operational support executes the scheduled standard tasks from maintenance that keep the balloon safely airborne when in use, such as pre-flight unpacking, fuel tank loading, fuel consumption, equipment inventory, and post-flight packing.
  • Evolution or enhancement adds new features, or changes out the heater, basket, or balloon to enable longer flights, more passengers, or more comfort during flight.

1.2 Control

Control is the process of ensuring that only expected and approved changes are implemented in any system. Change requests may come through defect reports, external drivers (such as patches or revisions from third-party software providers), changes to relevant laws or regulations (such as for tax or payroll systems, or privacy protection), or through feature-enhancement requests. Change requests are collected and reviewed by a team of stakeholders in the organization, including members of the operations, finance, and architecture teams, other business users, and software portfolio managers. Changes may be deferred (to be done later when more feasible), rejected (will never be done), or approved and prioritized for implementation (see Figure 2).

Approved changes are then developed (see Construction) or acquired (see Acquisition), and placed into operations through the transition process.

2 Goals and Principles

The basic goal for all of Enterprise IT (EIT) is to keep systems operating to provide value to the organization, despite defects discovered after installation, changes in laws, and advances in technology. The main goal for maintenance and control is to preserve operations over time through asset lifecycle management and control of changes to assets. This goal has three parts:

  • Ensure service levels and stability through standard maintenance activities to prevent service disruption.
  • Preserve service levels and functionality through approved changes to assets as provided by suppliers, required by law or regulation, or to repair a defect.
  • Reduce risk to assets by designing achievable maintenance activities, removing obsolete assets, reviewing all requested changes, implementing only stakeholder-approved changes, and ensuring that changes occur through defined standard construction, acquisition, and transition processes.

EIT risk management responsibilities include coordinating with disaster recovery planning, testing, and evaluation. These responsibilities are especially important for EIT, because the business of the enterprise almost certainly cannot be conducted without EIT systems operation.

2.1 Guiding Principles

DO:

  • Design maintenance activities to be simple, straightforward, and consistent across systems to minimize the need for specialized skills.
  • Have good relationships with suppliers to enable open and honest discussions about any offered upgrades or fixes, and to ensure that no side effects or unintended consequences occur.
  • Determine how much risk the organization is willing to take on when considering an upgrade—early adopters might get considerations for any defects they discover, while late adopters usually have fewer implementation issues.
  • Ensure that the business stakeholders have a presence in change request reviews.
  • Ensure that change request reviews include prioritization based upon the value to the business.

DO NOT:

  • Change more than necessary to reduce risk of failure, fix a defect, or meet a requirement.
  • Approve changes to data that are not implemented through existing applications or interfaces.

3 Context Diagram

Figure 1. Context Diagram for Maintenance and Control

4 Maintenance Responsibilities

Maintenance is defined as activities required to keep a system operational and responsive after it is accepted and placed into production. The maintenance of EIT systems includes preventive actions (risk reduction) and corrective actions (fixes) that preserve consistent operations. In EIT systems, maintenance can be performed on hardware, software, and data.

As part of portfolio management, EIT and the enterprise are expected to set policies about support levels for operational systems. The support level for any given system is determined by weighing the value of the service provided against the cost to support it. The goal is to not spend more on a system than its value to the enterprise.

The capability to be maintainable must be designed into all systems. This includes the system's architecture, and by extension, its maintainability requirements. The ability to maintain a system is determined by the processes required (a function of the system's design) and the availability of resources to execute those processes. [2] Maintenance processes must be monitored and measured for continuous improvement.

System evaluations must include maintenance tools and processes. As part of the package, systems should include expected maintenance activities (much like regular oil changes on cars), tools to monitor system behavior, and possibly tools to perform maintenance activities.

4.1 Define Maintenance Activities

There are four types of maintenance: corrective, preventive, adaptive, and perfective:

  • Corrective maintenance can be either unscheduled (emergency) or scheduled.
    • In an incident (ITIL) or emergency situation, maintenance activities occur to recover and bring a system back into operation. These occurrences can be reduced by proper implementation and execution of the other types of maintenance, as well as proper control, which reduces the risk of emergencies. [5]
    • Scheduled corrective maintenance occurs to remove existing defects in a system, which are related to a problem, or are due to an issue with a change applied to the system.
  • Preventive maintenance is scheduled. It is set up based upon analysis of similar systems to find patterns of flaws and to replace components before they fail. This type of maintenance may be required by vendors, regulations, or laws, especially in safety-related systems. Even when not required, preventive maintenance keeps track of such things as aging of components, vis-à-vis their expected useful lives, and inspects wires and connections for signs of wear. This is an important part of risk management. By extension, preventive maintenance activities require certain interactions with facilities management, for example, with regard to provisions for power backups for EIT systems in case of power failures.
  • Adaptive maintenance is less common and occurs when EIT changes one system to adapt to changes in another system. This is actually a type of enhancement, because the entire environment is enhanced when one part is upgraded. An adaptive maintenance task can be as simple as changing a configuration in one system to adapt to an upgrade in another system, using a different driver to connect databases because the other system's database software was upgraded, or increasing data capacity via a parameter change. On the other hand, it could be a complex set of operations such as those that would enable increasing the number of concurrent users.
  • Perfective maintenance is a misnomer and the term is used less often. It is defined as the process of improving or evolving a system in some manner, which is actually enhancement, not maintenance (see Evolutionary Responsibilities).

In summary:

Corrective:
  • Correcting code errors (patches)
  • Correcting errors in install scripts

Preventive:
  • Replacing worn parts
  • Standard purging of temp areas and protection logging spaces, and standard cleanup of FTP/SFTP sites for old files

Adaptive:
  • Required by changes in laws, tax schedules, etc.
  • Required to run on a new operating system, or to integrate with or connect to another upgraded system

Perfective (Enhancements):
  • Changes to existing features
  • Adding features
  • Any change that reflects a change in original requirements

Defining maintenance activities depends on:

  • The type of system and component
    Activities required vary based on the system, and the component type within the system. Maintenance of (for example) disk drives may vary depending on whether they are installed in individual servers, or a storage cluster.
  • Expected support levels and system priorities
    As assigned by the organization, some systems are designated as "mission-critical" and therefore are maintained to minimize the risk of failure, whereas non-production systems may be on a slower maintenance schedule, or have lower priority for resource assignment during times of peak production usage or risk. A high-priority system may be assigned maintenance activities that cycle out components that have a predictable lifecycle, while a low-priority system may be assigned a support level that allows only mandatory support activities in response to component failure.
  • System economics
    The economics of a system can be described as the difference between the benefits the service is bringing to the business versus the cost to maintain and support it. Some measurements that determine the benefits of a solution are:
    • Criticality of the business processes supported
    • Number of users accessing the system
    • Number of new transactions being processed
    • Amount of time saved by using the functions (e.g., versus manual procedures)
    These measurements should be compared with the total cost of maintenance and support of the system, which includes:

    • Vendor support costs (maintenance, subscription or licensing, or leasing)
    • Infrastructure costs (server, storage, rack space, power, cooling)
    • Technical stack costs (operating system, utilities, printing, data transfer, reporting, and so on)
    • FTE costs of the support, maintenance, and operations

    When the cost becomes greater than the benefits, it is time to retire or replace the service. Aging out, and therefore eliminating utility/maintenance/licensing costs, should reduce EIT costs, as long as the functional replacement does not result in an added maintenance burden for EIT staff, or provide a reduced benefit to the business (see Transition into Operation for the planning and processes for these functions).
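
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the class name, field names, and all figures are hypothetical, and a real assessment would also weigh the non-monetary benefit measures listed earlier, such as criticality and number of users.

```python
from dataclasses import dataclass


@dataclass
class SystemEconomics:
    """Annual figures for one system; every value used below is an assumption."""
    name: str
    annual_benefit: float    # estimated business value delivered per year
    vendor_support: float    # maintenance, subscription, licensing, or leasing
    infrastructure: float    # server, storage, rack space, power, cooling
    technical_stack: float   # operating system, utilities, reporting, and so on
    staff: float             # FTE cost of support, maintenance, and operations

    def total_cost(self) -> float:
        return (self.vendor_support + self.infrastructure
                + self.technical_stack + self.staff)

    def net_value(self) -> float:
        return self.annual_benefit - self.total_cost()

    def should_review_for_retirement(self) -> bool:
        # Cost exceeding benefit is the signal to consider retiring or replacing.
        return self.total_cost() > self.annual_benefit


legacy = SystemEconomics("legacy-reporting", 120_000, 40_000, 35_000, 20_000, 45_000)
payroll = SystemEconomics("payroll", 900_000, 150_000, 80_000, 40_000, 200_000)

for system in (legacy, payroll):
    print(system.name, system.net_value(), system.should_review_for_retirement())
```

With these made-up numbers, the legacy reporting system costs more than it returns and is flagged for review, while the payroll system is not.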

Generally, maintenance activity design should take into account:

  • The importance of the system to the organization (priority)—All systems vary in importance to the organization, depending on the function the system provides, and whether the organization can continue to transact business without that system operational. Mission-critical systems have a higher priority for recovery from failure, and therefore have more monitoring and maintenance activities performed to prevent any failures from occurring at all.
  • Maintenance requirements due to regulations and laws—Some industries have monitoring and maintenance regulations, rather than letting each organization define their own. In these cases, there may be reporting of monitoring and maintenance activities to an outside organization as well, to manage compliance.
  • The risk and business cost of a failure occurring—
    • Some components may have a low risk of failure, but when a failure occurs, the business cost and recovery costs are high. Older systems may need parts that become scarce or expensive over time. Technical debt (the additional cost of maintaining and/or upgrading systems that lag behind current releases or technology) increases as components age, which may make the cost of a system failure catastrophic.
    • Some components may have a high risk of failure, but recovery is cheap and quick, such as by swapping out drives or cards in arrays, such that the system overall has a low risk of failure, even though components are replaced frequently.
  • Maintenance recommendations from vendors—Any system purchased or leased has vendor-recommended activities to keep the system operational. Of course, the vendor probably errs on the side of more frequent and expensive activities, so each organization must determine its own needs. Cloud systems remove this responsibility from the lessee as part of the platform as a service (PaaS), although maintenance activity costs are built into the contract.
  • The expected life of each component (lifecycle) under normal use—All components have an expected life under normal conditions. Some components are less durable than others or consumable, and therefore need more frequent maintenance.
  • The probability of each component's failure at or before scheduled maintenance—A function of the expected life is a growing probability that a failure will occur over time, or after extraordinary use or strain.
  • The cost of the maintenance activity in both parts and labor—A balance must be found between overprotection: continuous monitoring and replacement at the first sign of trouble (costly) and negligence: inadequate attention to monitoring or delayed maintenance (which leads to technical debt).
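
One rough way to balance overprotection against negligence is to compare the cost of a preventive activity with the loss it is expected to avoid. The Python sketch below assumes that the activity fully prevents the failure and that a single probability estimate is available for each component. Both are simplifications, and the component names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failure_probability: float  # chance of failure before the next maintenance window (0 to 1)
    failure_cost: float         # business cost plus recovery cost if it fails
    maintenance_cost: float     # parts and labor for the preventive activity


def expected_failure_cost(c: Component) -> float:
    return c.failure_probability * c.failure_cost


def worth_maintaining(c: Component) -> bool:
    """Preventive work pays off when it costs less than the loss it is expected to avoid."""
    return c.maintenance_cost < expected_failure_cost(c)


components = [
    Component("array-drive", 0.30, 500.0, 200.0),              # fails often, cheap to swap
    Component("legacy-controller", 0.02, 400_000.0, 3_000.0),  # rarely fails, very costly
]

# Rank by expected loss so the riskiest components are considered first.
ranked = sorted(components, key=expected_failure_cost, reverse=True)
for c in ranked:
    print(c.name, expected_failure_cost(c), worth_maintaining(c))
```

In this example the frequently failing drive is cheaper to replace on failure than to maintain proactively, while the rarely failing controller justifies preventive work, which mirrors the two cases described in the risk bullet above.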

4.2 Define Maintenance Schedules

Maintenance activities almost always interfere with normal processing. Components must be made unavailable or incur additional strain from both production and maintenance activities occurring simultaneously. Only in extreme situations should maintenance activities occur on online components. All systems must be designed to enable offline maintenance, even if infrequent, as that ability is also used in disaster (Incident) situations (see [5] and Disaster Recovery).

Maintenance activities occur as scheduled over time or occur due to an event. Some components may recommend that maintenance activities occur both on a schedule and due to an event. Both kinds may be automated so that an activity runs at a set time or in response to an event.

  • Scheduled maintenance activities have defined time periods for activities, and each activity is placed on the schedule according to the organization's needs. This type of maintenance is proactive.
  • Event-based maintenance occurs when a monitoring threshold is reached (time to clean out the shared temp area), or a component signals a need for attention. This type of maintenance is reactive.

Almost all vendors provide suggested maintenance schedules or monitoring thresholds for the components they support. Schedules are in terms of activities to be performed per time period (week, month, etc.). Otherwise a list of events and suggested actions is provided.
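
The two trigger types can be illustrated with a minimal Python sketch. The function names, the weekly interval, and the 80 percent threshold are assumptions chosen for the example, not vendor recommendations.

```python
from datetime import datetime, timedelta


def scheduled_task_due(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """Time-based trigger: the activity is due once its interval has elapsed."""
    return now - last_run >= interval


def event_task_due(used_fraction: float, threshold: float) -> bool:
    """Event-based trigger: the activity is due once a monitored value reaches its threshold."""
    return used_fraction >= threshold


now = datetime(2024, 1, 8, 2, 0)

# A weekly activity last run exactly seven days ago is due again.
weekly_due = scheduled_task_due(datetime(2024, 1, 1, 2, 0), timedelta(weeks=1), now)

# A shared temp area at 72% full has not yet reached an 80% cleanup threshold.
temp_cleanup_due = event_task_due(used_fraction=0.72, threshold=0.80)

print(weekly_due, temp_cleanup_due)
```

A scheduler or monitoring tool would evaluate checks like these and start the corresponding maintenance activity automatically.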

In distributed (failover) or high-availability systems, maintenance on components may occur while the system is online (even if the component is not), as other components take over processing from the ones undergoing maintenance. So maintenance activities may occur during business hours as an option.

For systems without failover, plan the most intrusive maintenance activities to occur during non-peak times, and include options for taking components offline for a time to perform activities that may adversely affect business processing. Most maintenance activities occur during times of lower business processing, such as overnight or on weekends, although with the advances in distributed systems and networks, these activities require less downtime and lower frequency.

4.3 Design and Implement Standard Maintenance Processes

Maintenance processes should be designed to be simple, easy-to-perform, repeatable activities that occur based on a schedule or on an event, and ideally, can be automated as much as possible (see above). In many cases, those performing manual maintenance activities are not knowledgeable about the component or system; instead, those operators follow the instructions provided.

Manual maintenance activities that are onerous, intricate, difficult, or not clearly documented (so not understood by the operator) are less likely to be performed correctly without monitoring. Design each task to be simple, including breaking up complex tasks into simpler parts, clearly document the activity, automate as much as possible, and train the operators on any manual tasks to increase success.

For example, one important maintenance activity is the ongoing standard purging of temp data areas, such as shared databases and standard cleanup of old files and logs.

  • An automatic activity can be scheduled to regularly remove data older than a specific date. The maintenance team develops and tests the scripts that do the cleanup, and then submits them to the operations and support teams to run on a regular basis (in a smaller shop this may be the same team).
  • If DBAs report that temporary work areas are running out of space, maintenance activities need to identify acceptable remedial actions, such as by temporarily adding space, temporarily restricting access to the shared space, or automatically dropping temporary data objects based on previously identified criteria.
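
As an illustration of the first case, a cleanup script might look like the following Python sketch. The function name, the seven-day retention period, and the dry-run default are assumptions. A real script would also need logging and error handling, and should be tested before being handed to operations.

```python
import os
import tempfile
import time
from pathlib import Path


def purge_old_files(directory: Path, max_age_days: float, dry_run: bool = True) -> list[Path]:
    """Return files older than max_age_days; delete them unless dry_run is set."""
    cutoff = time.time() - max_age_days * 86_400
    matched = []
    for path in directory.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return sorted(matched)


# Demonstration in a throwaway directory so that no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    old_file = tmp_dir / "old.log"
    new_file = tmp_dir / "new.log"
    old_file.write_text("stale")
    new_file.write_text("fresh")
    ten_days_ago = time.time() - 10 * 86_400
    os.utime(old_file, (ten_days_ago, ten_days_ago))  # backdate the modification time

    candidates = purge_old_files(tmp_dir, max_age_days=7)             # report only
    purged = purge_old_files(tmp_dir, max_age_days=7, dry_run=False)  # actually delete
    remaining = sorted(p.name for p in tmp_dir.iterdir())
```

Defaulting to a dry run lets the maintenance team review what would be removed before operations schedules the script to delete anything.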

4.4 Design Standard Alert Thresholds, Reports, and Forecasts

Over time, infrastructure resources such as storage, CPUs, network bandwidth, and data transfer rates grow. This capacity growth should be monitored and trends reported to help determine the future capacity needs of the system.

Commonly, system capacity is designed to meet an initial workload, and to handle projected usage changes going forward. There are two main usage patterns:

  • Slow steady growth—For certain industries (healthcare for example), there are few times when the average usage spikes either up or down greater than a standard deviation. Capacity increases can then be planned in advance to be ahead of the usage slope.
  • Peaks and valleys—For certain industries (such as retail), there are standard times when the capacity needs to handle several times the average usage. Capacity can be planned to either always be able to handle the highest peak (which means most of the year there is unused capacity being supported), or to handle an average capacity with the ability to temporarily expand capacity during peak periods (which can also be expensive if the capacity is leased from a vendor).
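
The trade-off in the peaks-and-valleys case can be shown with simple arithmetic. The Python sketch below uses a deliberately simplified cost model with hypothetical unit costs. Real pricing usually has more terms, such as setup fees or minimum commitments.

```python
def annual_cost_fixed_peak(peak_units: float, owned_unit_cost: float) -> float:
    """Own enough capacity to handle the highest peak all year."""
    return peak_units * owned_unit_cost


def annual_cost_burst(average_units: float, peak_units: float, owned_unit_cost: float,
                      leased_unit_cost_per_month: float, peak_months: float) -> float:
    """Own average capacity and lease the difference only during peak months."""
    extra_units = max(0.0, peak_units - average_units)
    return (average_units * owned_unit_cost
            + extra_units * leased_unit_cost_per_month * peak_months)


# Hypothetical retailer: the peak is four times the average load.
fixed = annual_cost_fixed_peak(peak_units=400, owned_unit_cost=1_000)
burst_short = annual_cost_burst(100, 400, 1_000, 250, peak_months=2)
burst_long = annual_cost_burst(100, 400, 1_000, 250, peak_months=6)

print(fixed, burst_short, burst_long)
```

With these figures, leasing extra capacity is cheaper when the peak lasts two months but more expensive than owning peak capacity when it lasts six, which is why the length of the peak period matters as much as its height.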

Capacity limits affect availability: a system that cannot allocate storage or CPU power when needed begins to experience downtime. This in turn influences the amount of resources that must be allocated to the system to meet future business demands.

  • Thresholds must provide enough headroom or lead time to allow for analysis before taking action.
  • Reports must provide enough information to enable appropriate decisions.
  • Forecasts must be based on enough data to rule out anomalies, which would otherwise result in either excessive or inadequate capacity growth, both of which are costly.
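
As a rough illustration of how a threshold and a forecast work together, the Python sketch below fits a straight line to daily usage readings and estimates the lead time before an alert threshold and the hard limit are reached. The linear model, the 80 percent threshold, and the sample data are all assumptions. Real usage is rarely this regular, and five samples would not be enough to rule out anomalies.

```python
from typing import Optional


def linear_trend(samples: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for evenly spaced samples (one per day)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples: list[float], limit: float) -> Optional[float]:
    """Days from the latest sample until the trend reaches limit; None if usage is not growing."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    latest_fit = intercept + slope * (len(samples) - 1)
    return max(0.0, (limit - latest_fit) / slope)


usage_gb = [500, 510, 520, 530, 540]    # hypothetical daily storage readings
capacity_gb = 1000
alert_threshold_gb = 0.8 * capacity_gb  # alert early so there is lead time to act

days_to_alert = days_until(usage_gb, alert_threshold_gb)
days_to_full = days_until(usage_gb, capacity_gb)
print(days_to_alert, days_to_full)
```

The gap between the two results is the lead time available for analysis and procurement after the alert fires.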

Excessive capacity may have unintended consequences such as:

  • More time needed for backups, or other scheduled maintenance
  • Increased power needed for equipment
  • Need to upgrade computer chassis to support capacity increases
  • Increased floor space/footprint for equipment
  • Increased cooling needs to maintain preferred datacenter temperatures
  • Changes in UPS needs
  • Assumptions by users that space is infinite and therefore efficiency in storage and processing is unnecessary

Inadequate capacity may result in more frequent system additions, which may also incur costs through lost bulk discounts and more frequent maintenance events to add the capacity.

Today's technologies are providing great advancements in on-demand capacity allocation in CPU processing (compute capacity) and storage capacity. This capability is available both for in-house delivered infrastructure, as well as with cloud computing.

5 Evolutionary Responsibilities

Requests to add functionality or to change the way existing functionality works are enhancement requests (ERs). Acting on enhancement requests without sufficient analysis can be dangerous to the overall health of the system. In fact, enhancement requests, done in isolation, contribute to the problem of spaghetti code often encountered in legacy systems. For that reason, the standard practice is now to recognize that enhancement activities evolve systems. In other words, evolution is not the same as simply maintaining a system. Enhancement requests should be collected and addressed in groups, within development projects. See Construction for tools and techniques.

An EIT organization can submit ERs to third-party vendors, which may or may not be acted upon. Vendors have their own internal systems for evaluating ERs, whether from customers or generated internally. Thus, vendor-provided components evolve independently from any customer using those components. When vendors notify organizations of upgrades (new versions or patches), the maintenance team must ensure that all changes to the component—and their potential impacts—are well understood before recommending installation of an upgrade, and going through the Transition process (see Transition into Operation).

If a component has been customized by the organization (not just configured, but significantly changed from the off-the-shelf installed version), it can become increasingly difficult to retain those customizations in the component as new versions become available. This leads to components falling behind, which increases technical debt both in opportunity cost from the inability to take advantage of functional improvements (for example, security improvements), and in increasing the eventual cost when an upgrade is unavoidable. Often, local modifications of a third-party system make it difficult to accept new versions, because so much work would be required to carry those modifications forward to the new version, and the vendor may not be inclined (without significant cost) to include the customizations in their base product.

Evolution is a continuous change from a lesser, simpler, or worse state to a higher or better state. However, acting on vendor notifications without sufficient analysis can be dangerous to the overall health of the system. No organization only has components provided by a single vendor, so evaluation of the entire environment must be made to assess the impact an upgrade to a component may have on other systems (ripple effect). Some upgrades may require other systems to change how they interface or connect to the component being upgraded (adaptive maintenance).

It is also often the case that one component may have an upgrade available, but other connected components may not be compatible with the upgrade until a later time. Careful evaluation of the entire component inventory is essential to prevent an upgrade causing an incident in another system, disrupting business, and requiring remediation, such as blackout (see Transition into Operation).
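
A component inventory makes this kind of evaluation mechanical. The Python sketch below assumes a hypothetical inventory that records which components connect to which, and which upstream versions each connected component has been certified against. It checks direct connections only and does not follow longer ripple chains.

```python
# Hypothetical inventory: component -> components that connect to it.
connections = {
    "database": ["billing-app", "reporting"],
    "billing-app": [],
    "reporting": [],
}

# Hypothetical certification records: (dependent, upstream) -> supported upstream versions.
certified_versions = {
    ("billing-app", "database"): {"12", "13"},
    ("reporting", "database"): {"12"},
}


def blocked_by(component: str, new_version: str) -> list[str]:
    """Connected components not yet certified against the proposed version."""
    return sorted(
        dependent
        for dependent in connections.get(component, [])
        if new_version not in certified_versions.get((dependent, component), set())
    )


blockers = blocked_by("database", "13")
print(blockers)
```

Here the database upgrade would be held back until the reporting component is certified against the new version, or the reporting component would be scheduled for adaptive maintenance first.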

Both EIT management and the business product owners have a responsibility to ensure that a solution does not fall behind in either service currency (i.e., meeting the business need) or product currency (i.e., vendor support and maintenance). A system lapsing from supportability by vendors is negligence on the organization's part, unless the component is placed in "sunset status" with a defined retirement date. In this case, there may be little point in upgrading to the latest version.

Keeping a system around only to be used for historical reference is a waste of resources—convert the data to a currently readable archive and disconnect the system.

6 Change Control Systems and Processes

The maintenance function has the responsibility for establishing change control mechanisms for any and all types of changes requested for installed systems. While operations and support use the change management system for recording and tracking, the maintenance function ensures the orderly progression of requests through to resolution.

6.1 Define and Implement a Standard Change Management Process

A change management (CM) system is a set of processes that defines, at a high level, how subsystems can be introduced or changed. The CM tracking system includes a change request process and a defect handling process. These processes are generic across all component types. For example, a change request to add new hardware is treated exactly the same as a change request to enhance functionality (i.e., both CRs are assigned, approved, etc.).

In order for this process to work effectively, a number of change management mechanisms must be established and consistently used. First and foremost, a change control authority (a change control board (CCB) or change advisory board (CAB)) has to be defined and established. It is typically chaired by the maintenance function, and includes representatives of all stakeholders: product owners, developers, testers, users, operations, and support. In addition, for some types of requests, specialists (such as enterprise architects and the original business analysts) may be called in.

In order to support and facilitate the functioning of the board, specific mechanisms need to be in place. These include things such as:

  • A numbering scheme for defect reports, enhancement requests, adaptive requests, and prevention requests
  • A scheme for categorizing, assessing risk, and prioritizing requests, taking into account how severe an incident is and how many users it might affect
  • A scheme for siphoning off enhancement requests into queues for bundling requests into development projects
  • Ensuring that all requests are entered and tracked in a change management system
  • Defining a closed-loop change management process so that:
    • All requests are tracked through resolution (deferred, rejected, approved, change made, change tested, change released).
    • A clear path of request reporting, reviews, approvals, and resolution is defined.
    • All requests come through the same system that the operations help desk (aka support) uses.
    • Tracking and reporting provides trend analysis for such things as error-prone areas or modules, and volatility of change requests (especially defect reports).
    • Action on requests is reflected in the operations CM database system and the development CM system.
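
The numbering and prioritization mechanisms listed above might be sketched as follows in Python. The prefixes, the severity scale, and the scoring formula are all illustrative assumptions. Each organization defines its own scheme.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical prefixes for the four request types named above.
PREFIXES = {"defect": "DR", "enhancement": "ER", "adaptive": "AR", "preventive": "PR"}
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}  # assumed scale

_counters = {kind: count(1) for kind in PREFIXES}


def next_request_id(kind: str) -> str:
    """Issue sequential identifiers such as DR-0001, one sequence per request type."""
    return f"{PREFIXES[kind]}-{next(_counters[kind]):04d}"


@dataclass
class Request:
    request_id: str
    kind: str
    severity: str
    users_affected: int

    def priority_score(self) -> int:
        # Assumed formula: severity weight scaled by the number of users affected.
        return SEVERITY_WEIGHT[self.severity] * self.users_affected


queue = [
    Request(next_request_id("defect"), "defect", "critical", 40),
    Request(next_request_id("defect"), "defect", "low", 30),
    Request(next_request_id("enhancement"), "enhancement", "medium", 25),
]

by_priority = sorted(queue, key=Request.priority_score, reverse=True)
for request in by_priority:
    print(request.request_id, request.priority_score())
```

Keeping a separate number sequence per request type makes it easy to siphon enhancement requests into their own queue while still tracking every request in one system.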

6.2 Define Request Intake and Evaluation Process

A typical change control process looks something like this:

  1. Operations receives software change requests via its help desk function and enters them into the incident-tracking system. Operations does not make changes to software.
  2. Defects and adaptive requests are automatically sent to the responsible development team (or vendor relationship manager for acquired components or systems), where they are assigned by the appropriate manager according to relative priority.
  3. Approved enhancement requests are periodically reviewed by the Product Manager for inclusion in later releases of the system. In EIT organizations that do not have a Product Manager function, a suitable user representative is tapped for this role in the CCB.
  4. Preventive requests are reviewed by the Product Manager and the CCB.
  5. The status of all changes to a system is reviewed by the CCB prior to release.
  6. The CCB is comprised of representatives of all stakeholders (development, testing, documentation and training, operations, and product management).
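
The routing rules in steps 2 through 4 can be expressed as a small lookup. This Python sketch is only an illustration of the flow described above. The function name, the request-type strings, and the reviewer labels are assumptions.

```python
def route_request(kind: str, acquired_component: bool = False) -> list[str]:
    """Return who handles a request after the help desk has logged it."""
    if kind in ("defect", "adaptive"):
        # Step 2: sent straight to whoever is responsible for the component.
        if acquired_component:
            return ["vendor relationship manager"]
        return ["development team"]
    if kind == "enhancement":
        # Step 3: reviewed periodically for inclusion in a later release.
        return ["product manager"]
    if kind == "preventive":
        # Step 4: reviewed by both the Product Manager and the CCB.
        return ["product manager", "CCB"]
    raise ValueError(f"unknown request type: {kind}")


print(route_request("defect"))
print(route_request("adaptive", acquired_component=True))
print(route_request("preventive"))
```

Rejecting unknown request types explicitly helps keep every request inside the defined process instead of letting it bypass review.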

6.3 Change Request Processing and Approval Flow

Configuration management (CM) is the foundation of a software project. It is the management of change to components and systems. Without it, no matter how talented the staff, how large the budget, how robust the development and test processes, or how technically superior the development tools, project discipline will collapse and success will be left to chance.

When a change request (CR) is assigned and approved, its "owner" manages the necessary change via the defined CM process. Procedures may differ depending on the type of change. For example, a developer may be required to apply an operating system patch. An operating system patch will be applied differently than a system release.

At various stages of the configuration control process, the owner can identify where the change is in the high-level CM process. For example, if an operating system patch is ready to be applied in a test environment, the CR should be marked as Ready_for_testing. When the patch has been successfully tested, the CR should be marked Tested.

The CM process and its supporting mechanisms should provide a clear, documented trail of change requests, their disposition, and changes introduced into the system, enabling better team communication and the collection of meaningful project metrics. The request itself, the requestor, the approvers, and all actions taken in response to the CR should be available through the CM process and tools to everyone on the project.

The generic approval flow defines a generic CR. The CR may be used to represent and track defects, enhancements, greenfield development, documentation, etc. The change control board (CCB) is a central control mechanism to ensure that every change request is properly considered, authorized, and coordinated. The full CCB should meet on a regular basis, typically once a week. Emergency meetings can be called as necessary. All decisions made by the CCB should be documented in the CM system.

A CCB member is the top level of the change management hierarchy and can also act as every role defined lower in the hierarchy. For example, if a team leader is not present, a CCB member can act on behalf of the team leader.

The CCB includes the following members:

  • Configuration process manager
  • CM system administrator
  • Respective system/component development managers
  • Key stakeholders, such as operations and support, and user representatives

ChangeRequestLifeCycle.png

Figure 2. Change Request Lifecycle

This table describes basic actions performed on a change request.

Action | Description | Role
Submit CR | Any stakeholder on the project can submit a change request (CR). This logs the CR in the CM system, places it into the CCB review queue, and sets its state to Submitted. | Submitter
Review CR | This CCB action reviews the submitted change requests. The CR's content is initially reviewed in the CCB review meeting to determine whether it is a valid request. If it is, the CCB determines whether the change is in or out of scope for the current releases, based on priority, schedule, resources, level of effort, risk, severity, and any other relevant criteria as determined by the group. The state of a valid CR is set to Assigned or Postponed, accordingly. | CCB
Confirm duplicate or reject | If a CR is suspected of being a duplicate or an invalid request (e.g., operator error, not reproducible, working as designed, etc.), a delegate of the CCB is assigned to confirm the duplicate or rejected CR and to gather more information from the submitter, if necessary. The CR state is set to Duplicate or Closed, as appropriate. | CCB Delegate
Re-open | If more information is needed, or if a CR is rejected at any point in the process, the submitter is notified and may update the CR with new information. The updated CR is then re-submitted to the CCB review queue for consideration of the new data. | Submitter
Open and work on | When a CR is assigned by the CCB, the project lead assigns the work to the appropriate user, depending on the type of request (enhancement request, defect, documentation change, test defect, etc.), and makes any needed updates to the project schedule. The CR state is set to Opened. | Configuration Manager
Resolve | The assigned worker performs the set of activities defined within the appropriate section of the process (e.g., requirements, analysis and design, implementation, produce user-support materials, design test, etc.) to make the changes requested. These activities include all normal review and unit test activities as described within the normal development process. The CR is then marked as Resolved. | Assigned user
Validate | After the changes are resolved by the assigned user (analyst, developer, tester, etc.), the changes are placed into a test queue to be assigned to a tester and validated in a test build of the product. | Tester
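The state changes named in the table can be sketched as a small state machine. The transition map below is an assumption read from the table text only (Figure 2 is not reproduced here), the action labels are hypothetical, and no end state is modeled for Validate because the table does not name one.

```python
# Allowed (state, action) -> next state pairs, inferred from the action table.
# Illustrative assumption only; not a transcription of Figure 2.
TRANSITIONS = {
    ("Submitted", "review_assign"): "Assigned",
    ("Submitted", "review_postpone"): "Postponed",
    ("Submitted", "confirm_duplicate"): "Duplicate",
    ("Submitted", "confirm_reject"): "Closed",
    ("Closed", "reopen"): "Submitted",
    ("Assigned", "open"): "Opened",
    ("Opened", "resolve"): "Resolved",
}


class ChangeRequest:
    """A change request that records every action taken on it."""

    def __init__(self, title, submitter):
        self.title = title
        self.submitter = submitter
        self.state = "Submitted"  # Submit CR sets the initial state
        self.history = [("submit", "Submitted")]

    def apply(self, action):
        """Apply an action, or raise ValueError if it is not allowed."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(
                f"action {action!r} is not allowed in state {self.state!r}"
            )
        self.state = TRANSITIONS[key]
        self.history.append((action, self.state))
        return self.state


cr = ChangeRequest("Fix login timeout", "help desk operator")
for action in ("review_assign", "open", "resolve"):
    cr.apply(action)
print(cr.state)
print(cr.history)
```

Keeping the history on the CR itself is one simple way to retain the documented trail of actions that the CM process calls for.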

7 Types of System Change Releases

7.1 Patch Release (Patch Only)

A patch is a relatively small change, generally to source code, to fix a defect. However, data fixes may be required to rectify invalid data that has been created by bad code or user error. Either patch type, although small, can still have widespread impact on the system, especially if the source code change affects a critical component of a system or the data fix changes millions of data records. Therefore, all patches applied must be fully tested before being scheduled for production implementation.

7.2 Full Release

Full release generally means that most or all system components are packaged on a release medium. This is usually called a version upgrade.

7.3 Traceability and Auditability

Processes and systems that assist in the management of patch and version upgrades are a part of configuration management. They track past, current, and future versions of software and infrastructure components (e.g., databases, utilities, hardware) that have been, or will be, implemented. For large systems, like ERPs, with thousands of modules, manual processes become error-prone and unmanageable, so automated tracking is required to ensure that major downtime is not experienced due to user error.
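As a rough illustration of automated version tracking, the sketch below compares the versions installed in an environment against an approved baseline and reports any differences. The component names, version strings, and the idea of a dictionary baseline are assumptions for this example; a real CM system would hold this information in its own repository.

```python
def find_version_drift(baseline, installed):
    """Return components whose installed version differs from the baseline.

    Each result maps a component name to (expected, actual); None marks a
    component that is missing on one side.
    """
    drift = {}
    for component in sorted(set(baseline) | set(installed)):
        expected = baseline.get(component)
        actual = installed.get(component)
        if expected != actual:
            drift[component] = (expected, actual)
    return drift


# Hypothetical approved baseline and observed environment.
baseline = {"database": "12.4", "app-server": "3.1.0", "backup-utility": "2.0"}
installed = {"database": "12.4", "app-server": "3.0.9", "report-module": "1.2"}

for name, (expected, actual) in find_version_drift(baseline, installed).items():
    print(f"{name}: expected {expected}, found {actual}")
```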

A configuration management system can provide a thorough audit trail for implemented changes; however, in most cases this is not the only tracking necessary. Many companies today must comply with accounting and other regulations or standards, such as Sarbanes-Oxley (SOX), or with internal audit control functions. Both internal and external EIT auditors use this history to ensure that control processes are followed by EIT. Specific EIT staff are allocated responsibility and oversight for these control processes, and are accountable for ensuring not only that the defined processes are followed, but also that the audits are accurate and completed on a timely basis.

8 Summary

Systems should be designed and built to be easily maintained. Maintenance is the responsibility of EIT, and should be an auditable process, with mechanisms for tracking and reporting. Systems need to be monitored, measured, and validated to ensure that this happens.

9 Key Maturity Frameworks

Capability maturity for EIT refers to its ability to reliably perform. Maturity is measured by an organization's readiness and capability expressed through its people, processes, data, technologies, and the consistent measurement practices that are in place. See Appendix F for additional information about maturity frameworks.

Many specialized frameworks have been developed since the original Capability Maturity Model (CMM) that was developed by the Software Engineering Institute in the late 1980s. This section describes how some of those apply to the activities described in this chapter.

9.1 IT-Capability Maturity Framework (IT-CMF)

The IT-CMF was developed by the Innovation Value Institute in Ireland. This framework helps organizations to measure, develop, and monitor their EIT capability maturity progression. It consists of 35 EIT management capabilities that are organized into four macro capabilities:

  • Managing EIT like a business
  • Managing the EIT budget
  • Managing the EIT capability
  • Managing EIT for business value

The three most relevant critical capabilities are technical infrastructure management (TIM), service provisioning (SRP), and business planning (BP).

9.1.1 Technical Infrastructure Management Maturity

The following statements provide a high-level overview of the technical infrastructure management (TIM) capability at successive levels of maturity.

Level 1: Management of the EIT infrastructure is reactive or ad hoc.
Level 2: Documented policies are emerging relating to the management of a limited number of infrastructure components. Predominantly manual procedures are used for EIT infrastructure management. Visibility of capacity and utilization across infrastructure components is emerging.
Level 3: Management of infrastructure components is increasingly supported by standardized tool sets that are partly integrated, resulting in decreased execution times and improving infrastructure utilization.
Level 4: Policies related to EIT infrastructure management are implemented automatically, promoting execution agility and achievement of infrastructure utilization targets.
Level 5: The EIT infrastructure is continually reviewed so that it remains modular, agile, lean, and sustainable.

9.1.2 Service Provisioning Maturity

The following statements provide a high-level overview of the service provisioning (SRP) capability at successive levels of maturity.

Level 1: The service provisioning processes are ad hoc, resulting in unpredictable EIT service quality.
Level 2: Service provisioning processes are increasingly defined and documented, but execution is dependent on individual interpretation of the documentation. Service level agreements (SLAs) are typically defined at the technical operational level only.
Level 3: Service provisioning is supported by standardized tools for most EIT services, but may not yet be adequately integrated. SLAs are typically defined at the business operational level.
Level 4: Customers have access to services on demand. Management and troubleshooting of services are highly automated.
Level 5: Customers experience zero downtime or delays, and service provisioning is fully automated.

9.1.3 Business Planning Maturity

The following statements provide a high-level overview of the business planning (BP) capability at successive levels of maturity.

Level 1: The EIT business plan is developed only for the purpose of budget acquisition, and offers little value beyond this to the organization.
Level 2: The EIT business plan typically covers the resource requirements for a limited number of key areas that contribute to the objectives of the EIT strategy.
Level 3: The EIT business plan includes standardized details regarding required resources and identifies some of the ways in which the planned activities will contribute to the objectives of the EIT strategy. Some input from some other business units is considered.
Level 4: The EIT business plan is comprehensively validated by the EIT function and all other business units, and identifies all required resources and the expected benefits.
Level 5: Relevancy of the EIT business plan is continually reviewed, with regular input from relevant business ecosystem partners, to identify opportunities for organization-wide benefits.

10 Key Competence Frameworks

While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected three major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.

Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100 percent correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.

10.1 Skills Framework for the Information Age

The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes seven levels of competency that can be applied to each skill. However, not all skills cover all seven levels. Some reach only partially up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. SFIA is used in nearly 200 countries, from Britain to South Africa, South America, the Pacific Rim, and the United States. (http://www.sfia-online.org)

Skill | Skill Description | Competency Levels
Application support | The provision of application maintenance and support services, either directly to users of the systems or to service delivery functions. Support typically includes investigation and resolution of issues and may also include performance monitoring. Issues may be resolved by providing advice or training to users, devising corrections (permanent or temporary) for faults, making general or site-specific modifications, updating documentation, manipulating data, or defining enhancements. Support often involves close collaboration with the system's developers or with colleagues specializing in different areas, such as database administration or network support. | 2-5
Business risk management | The planning and implementation of organization-wide processes and procedures for the management of risk to the success or integrity of the business, especially those arising from the use of information technology, reduction or non-availability of energy supply, or inappropriate disposal of materials, hardware, or data. | 4-7
Capacity management | The management of the capability, functionality, and sustainability of service components (including hardware, software, network resources, and software/infrastructure as a service) to meet current and forecast needs in a cost-efficient manner aligned to the business. This includes predicting both long-term changes and short-term variations in the level of capacity required to execute the service, and deployment, where appropriate, of techniques to control the demand for a particular resource or service. | 4-6
Conformance review | The independent assessment of the conformity of any activity, process, deliverable, product, or service to the criteria of specified standards, best practice, or other documented requirements. May relate to, for example, asset management, network security tools, firewalls and Internet security, sustainability, real-time systems, application design, and specific certifications. | 3-6
Customer service support | The management and operation of one or more customer service or service desk functions. Acting as a point of contact to support service users and customers reporting issues, requesting information, access, or other services. | 1-6
Database administration | The installation, configuration, upgrade, administration, monitoring, and maintenance of databases. | 2-5
Digital forensics | The collection, processing, preserving, analyzing, and presenting of computer-related evidence in support of security vulnerability mitigation and criminal, fraud, counterintelligence, or law enforcement investigations. | 4-6
Facilities management | The planning, control, and management of all the facilities which, collectively, make up the EIT estate. This involves provision and management of the physical environment, including space and power allocation, and environmental monitoring to provide statistics on energy usage. Encompasses physical access control, and adherence to all mandatory policies and regulations concerning health and safety at work. | 3-6
Financial management | The overall financial management, control, and stewardship of the EIT assets and resources used in the provision of EIT services, including the identification of materials and energy costs, ensuring compliance with all governance, legal, and regulatory requirements. | 4-6
Incident management | The processing and coordination of appropriate and timely responses to incident reports, including channeling requests for help to appropriate functions for resolution, monitoring resolution activity, and keeping clients apprised of progress towards service restoration. | 2-5
EIT infrastructure | The operation and control of the EIT infrastructure (typically hardware, software, data stored on various media, and all equipment within wide and local area networks) required to deliver and support EIT services and products to meet the needs of a business. Includes preparation for new or changed services, operation of the change process, the maintenance of regulatory, legal, and professional standards, the building and management of systems and components in virtualized computing environments, and the monitoring of performance of systems and services in relation to their contribution to business performance, their security, and their sustainability. | 1-4
EIT management | The management of the EIT infrastructure and resources required to plan for, develop, deliver, and support EIT services and products to meet the needs of a business. The preparation for new or changed services, management of the change process, and the maintenance of regulatory, legal, and professional standards. The management of performance of systems and services in terms of their contribution to business performance and their financial costs and sustainability. The management of bought-in services. The development of continual service improvement plans to ensure the EIT infrastructure adequately supports business needs. | 5-7
Network support | The provision of network maintenance and support services. Support may be provided both to users of the systems and to service delivery functions. Support typically takes the form of investigating and resolving problems and providing information about the systems. It may also include monitoring their performance. Problems may be resolved by providing advice or training to users about the network's functionality, correct operation or constraints, by devising work-arounds, correcting faults, or making general or site-specific modifications. | 2-5
Problem management | The resolution (both reactive and proactive) of problems throughout the information system lifecycle, including classification, prioritization, and initiation of action, documentation of root causes, and implementation of remedies to prevent future incidents. | 3-5
Security administration | The provision of operational security management and administrative services. This typically includes the authorization and monitoring of access to EIT facilities or infrastructure, the investigation of unauthorized access, and compliance with relevant legislation. | 1-6
Storage management | The planning, implementation, configuration, and tuning of storage hardware and software covering online, offline, remote, and offsite data storage (backup, archiving, and recovery) and ensuring compliance with regulatory and security requirements. | 3-6
System software | The provision of specialist expertise to facilitate and execute the installation and maintenance of system software such as operating systems, data management products, office automation products, and other utility software. | 3-5

10.2 European Competency Framework

The European Union's European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers, and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.

The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance have been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Versions 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, see http://www.ecompetences.eu.

e-CF Dimension 2 | e-CF Dimension 3
C.3. Service Delivery (RUN): Ensures service delivery in accordance with established service level agreements (SLAs). Takes proactive action to ensure stable and secure applications and ICT infrastructure to avoid potential service disruptions, attending to capacity planning and to information security. Updates operational document library and logs all service incidents. Maintains monitoring and management tools (i.e., scripts, procedures). Maintains IS services. Takes proactive measures. | Level 1-3
C.4. Problem Management (RUN): Identifies and resolves the root cause of incidents. Takes a proactive approach to avoidance or identification of root cause of ICT problems. Deploys a knowledge system based on recurrence of common errors. Resolves or escalates incidents. Optimizes system or component performance. | Level 2-4
E.3. Risk Management (MANAGE): Implements the management of risk across information systems through the application of the enterprise-defined risk management policy and procedure. Assesses risk to the organization's business, including web, cloud, and mobile resources. Documents potential risk and containment plans. | Level 2-4
E.4. Relationship Management (MANAGE): Establishes and maintains positive business relationships between stakeholders (internal or external) deploying and complying with organizational processes. Maintains regular communication with customer/partner/supplier, and addresses needs through empathy with their environment and managing supply chain communications. Ensures that stakeholder needs, concerns, or complaints are understood and addressed in accordance with organizational policy. | Level 3-4
E.8. Information Security Management (MANAGE): Implements information security policy. Monitors and takes action against intrusion, fraud, and security breaches or leaks. Ensures that security risks are analyzed and managed with respect to enterprise data and information. Reviews security incidents, and makes recommendations for security policy and strategy to ensure continuous improvement of security provision. | Level 2-4

10.3 i Competency Dictionary

The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD) and translated it into English; it is described at https://www.ipa.go.jp/english/humandev/icd.html. The iCD is an extensive skills and tasks database, used in Japan and Southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest scale national examinations in Japan, with approximately 600,000 applicants each year.

The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a "Task x Skill" table. (See Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.

The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that corresponds to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.

Task Dictionary: Task Layer 1 (Task Layer 2) | Skill Dictionary: Skill Classification | Skill Dictionary: Skill Item
System operation design (operation design) | System maintenance, operation, and evaluation | System operations management requirements definition

Skill Dictionary: Associated Knowledge Items
  • Definition of incident management, problem management, and change control processes
  • Design of incident management, problem management, and change control processes
  • Design of service-level management
  • Definition of service-level management
  • Design of shell scripts
  • Job net management system
  • Definition of job net management method
  • Backup and recovery methods
  • Log rotation, backup, switching, and reference schemes
  • Monitoring systems (operation monitoring, failure monitoring, performance monitoring, threshold monitoring)
  • Design of operation methods and operation flow in normal and abnormal cases

11 Key Roles

Both SFIA and the e-CF have described profiles (similar to roles) for providing examples of skill sets (skill combinations) for various roles. The iCD has described tasks performed in EIT and associated those with skills in the IPA database.

The following roles are common to ITSM:

  • 1st, 2nd, 3rd Level Support
  • Access Manager
  • Facilities Manager
  • Incident Manager
  • EIT Operations Manager
  • EIT Operator
  • Major Incident Team
  • Problem Manager
  • Service Request Fulfillment

Other key roles are:

  • Product Owner
  • Solution Architect
  • Solution Manager
  • Technology Architect

12 Standards

  • ISO/IEC 24765:2016 Systems and software engineering—Vocabulary (also available free online as SEVOCAB at https://pascal.computer.org/)
  • ISO 20000 series
  • IEEE Std 14764-2006, ISO/IEC 14764-2006—International Standard for Software Engineering—Software Lifecycle Processes—Maintenance
  • IEEE Std 982.1-2005—Standard Dictionary of Measures of the Software Aspects of Dependability

13 References

[1] (2010). SEVOCAB at https://pascal.computer.org/

[2] Retrieved from Guide to the Systems Engineering Body of Knowledge (SEBoK), version 1.3.: http://www.sebokwiki.org/w/index.php?title=Logistics&oldid=48199

[3] Pigoski, Thomas M., 1997. Practical software maintenance: Best practices for managing your software investment. Wiley Computer Pub. (New York).

[4] Lehman, M. M., 1980. Programs, Life Cycles, and Laws of Software Evolution. In Proceedings of the IEEE, 68, 9, 1060-1076.

[5] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010. p. 52; p. 215