Maintenance and Control

From EITBOK
Revision as of 05:56, 23 August 2016 by Jclayton (Talk | contribs)

1 Introduction

Maintenance and Control are two sides of the same process. Maintenance activities ensure that a system remains operational and does not degrade over time. Maintenance activities preserve existing function. Control of a system manages the approval process for requested changes to a system, including defect fixes, evolution of third-party components, and in-house enhancements. Control activities evaluate and determine approval and schedules for changes to existing function, which are then implemented through Transition (See Transition chapter).

Considerable research over the past several decades has shown that the majority of expenses (75-80%) reported as “maintenance” costs are actually due to changes to systems in response to enhancement requests. Changes that add functionality almost always add value, and therefore, should not properly be called “maintenance.” They are instead properly seen as stages in the system’s evolution.

1.1 Maintenance

An EIT system (asset) is maintained by activities performed to ensure that the system continues to be operational over time [1]. The maintenance team is responsible for two main functions:

  • The maintenance team plans for, designs, and directs the maintenance necessary to prevent the deterioration and failure of a system, which can be due to defects, obsolescence, or environmental conditions [2].
  • The maintenance team recommends changes to assure continuing environmental compatibility of a system due to evolution of its hardware and software as provided by vendors.

Maintenance is different from Operations. However, the understanding of a system’s maintenance needs is closely connected to its operational function. [2]

  • Operations & Support activities ensure that production systems operate consistently in a steady state of defined functionality. Operational support focuses on preserving execution of systems providing service to users. (outward)
  • Maintenance evaluates the system’s operations over time in comparison to changing environment component standards (upgrades from vendors) and increasing age of components (failures due to normal use). Maintenance focuses on proactively preserving a system’s ability to provide defined functionality services to users over time. (inward)

Maintenance is different from evolution or enhancement (see Table 1). Evolution and enhancement change the functionality of a system or create additional functionality in it. Maintenance does not.

|  | Operational Support | Maintenance | Evolutionary Development and Enhancement |
| Functionality effect | Provides consistent functionality and recovers from failures | Predicts activities necessary to preserve functionality and/or prevent failures | Changes or adds functionality |
| Focus | Activity | Process | Process and Activity |
| User effect | Low to none | Low to none | High |
| Active/Passive | Passive | Active | Active |
| Frequency | Continual | Regularly scheduled | Scheduled for inclusion in development projects |
| Standard activities | Executes maintenance processes | Creates and implements maintenance processes | N/A |

Table 1. Comparison of Maintenance with Operational Support and Evolutionary Development and Enhancement

To illustrate the difference, imagine a hot air balloon.

  • Maintenance is an important part of risk management. It scans for and analyzes changes in current safety regulations, wind and weather conditions, and heater, basket, and balloon fabric condition, which may determine that parts need to be replenished, repaired, or replaced. Maintenance determines the schedule and defines the standard checklist of pre- and post-flight tasks, as well as the schedule and criteria for performing occasional maintenance tasks, such as examining parts for signs of wear or failure, resulting in activities to patch the basket or balloon, or replace hoses, straps, heating elements, or ropes before they reach end-of-life.
  • Operational support executes the scheduled standard tasks from Maintenance that keep the balloon safely airborne when in use, such as pre-flight unpacking, fuel tank loading, fuel consumption, equipment inventory, and post-flight packing.
  • Evolution or Enhancement adds new features or changes out the heater, basket, and/or balloon to enable longer flights, more passengers, or more comfort during flight.

1.2 Control

Control is the process of ensuring only expected and approved changes are implemented in any system. Change requests may come through defect reports, external drivers such as patches or revisions from third party software providers, changes to relevant laws or regulations (such as for tax or payroll systems or privacy protection), or through feature enhancement requests. Change requests are collected and reviewed by a team of stakeholders in the organization, including Operations, Finance, Architecture, and other business users, as well as software portfolio managers. Changes may be deferred (to be done later when more feasible), rejected (will never be done), or approved and prioritized for implementation. See Figure 2.

Approved changes are then developed (see Construction) or acquired (see Acquisition), and placed into Operation through Transition.

[Figure 2: Change Request Lifecycle. Graphic not included.]
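The review outcomes described in this section (deferred, rejected, or approved and prioritized) can be sketched as a small data model. This is an illustrative sketch only: the names `ChangeRequest`, `Source`, `Decision`, `review`, and `prioritize`, the numeric `business_value` score, and the threshold rule are all assumptions made for the example and are not defined by the EITBOK. A real review board applies stakeholder judgment rather than a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    DEFECT_REPORT = "defect report"
    VENDOR_PATCH = "third-party patch or revision"
    REGULATION = "change in law or regulation"
    ENHANCEMENT = "feature enhancement request"


class Decision(Enum):
    DEFERRED = "deferred"    # to be done later when more feasible
    REJECTED = "rejected"    # will never be done
    APPROVED = "approved"    # approved and prioritized for implementation


@dataclass
class ChangeRequest:
    title: str
    source: Source
    business_value: int   # hypothetical 0-10 score assigned by reviewers
    feasible_now: bool    # hypothetical flag set during review


def review(cr: ChangeRequest, min_value: int = 3) -> Decision:
    """Hypothetical review rule, standing in for the stakeholder review."""
    if cr.business_value < min_value:
        return Decision.REJECTED
    if not cr.feasible_now:
        return Decision.DEFERRED
    return Decision.APPROVED


def prioritize(requests):
    """Return the approved requests, highest business value first."""
    approved = [cr for cr in requests if review(cr) is Decision.APPROVED]
    return sorted(approved, key=lambda cr: cr.business_value, reverse=True)
```

Only the requests returned by `prioritize` would move on to Construction or Acquisition and then to Transition; deferred requests stay in the queue for a later review.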

2 Goals & Principles

The basic goal for all of Enterprise IT (EIT) is to keep systems operating to provide value to the organization, despite defects discovered after installation, changes in laws, and advances in technology. The main goal for Maintenance and Control is to preserve operations over time through asset lifecycle management and control of changes to assets. This goal has three parts:

  • Ensure service levels and stability through standard maintenance activities to prevent service disruption.
  • Preserve service levels and functionality through approved changes to assets as provided by suppliers, required by law or regulation, or to repair a defect.
  • Reduce risk to assets by designing achievable maintenance activities, removing obsolete assets, reviewing all requested changes, implementing only stakeholder-approved changes, and ensuring changes occur through defined standard Construction, Acquisition, and Transition processes.

EIT risk management responsibilities include coordinating with Disaster Recovery planning, testing and evaluation. This responsibility is especially important for EIT, since the business of the enterprise almost certainly cannot be conducted without EIT systems operating.

2.1 Guiding Principles

DO:

  • Design maintenance activities to be simple and straightforward, and consistent across systems to minimize the need for specialized skills.
  • Maintain good relationships with suppliers so that any offered upgrades or fixes can be discussed openly and honestly, to ensure no side effects or unintended consequences occur.
  • Determine how much risk the organization is willing to take on when considering an upgrade – early adopters may get considerations for any defects they discover; late adopters will have fewer implementation issues.
  • Ensure the Business stakeholders have a presence in change request reviews.
  • Ensure that change request reviews include prioritization according to business value.

DO NOT:

  • Change more than necessary to reduce risk of failure, fix a defect, or meet a requirement.
  • Approve changes to data that are not implemented through existing applications or interfaces.


3 Context Diagram

[Figure: The Context Diagram for Maintenance and Control (14 Maintenance CD.png)]

4 Maintenance Responsibilities

Maintenance is defined as activities required to keep a system operational and responsive after it is accepted and placed into production. The maintenance of EIT systems includes preventive (risk reduction) and corrective (fixes) actions that preserve consistent operations. In EIT systems, maintenance can be performed on hardware, software, and data.

As part of Portfolio Management, EIT and the Enterprise are expected to set policies about support levels for operational systems. The support level for any given system is determined by weighing the value of the service provided against the cost to support it. The goal is to not spend more on a system than its value to the enterprise.

Maintainability must be designed into all systems. This includes the system’s architecture and, by extension, its maintainability requirements. The ability to maintain a system is determined by the processes required (a function of the system’s design) and the availability of resources to execute those processes. [2] Maintenance processes must be monitored and measured for continuous improvement, and system evaluations must include maintenance tools and processes. Systems should come with a set of expected maintenance activities (much like regular oil changes on cars) and tools to monitor system behavior, and sometimes tools to perform maintenance activities.

4.1 Define Maintenance Activities

Two main types of maintenance activities exist.

  • Corrective Maintenance: Corrective maintenance may occur as either unscheduled (emergency) or scheduled (to fix something).
    • In an Incident [ITIL] or emergency situation, maintenance activities occur to recover and bring a system back into operation. These occurrences can be reduced by proper implementation and execution of the other types of maintenance, as well as proper Control, which reduces the risk of emergencies.
    • Scheduled corrective maintenance occurs to remove existing defects in a system, which are related to a Problem [ITIL], or are due to an issue (bug) with a Change applied to the system.
  • Preventive Maintenance: Preventive maintenance is scheduled, and depends on analysis of similar systems to find patterns of flaws and to replace the potentially flawed components before they fail. This type of maintenance may be required by vendors, regulations, or laws, especially in safety-related systems. Even when not required by vendors, regulations, or laws, preventive maintenance keeps track of such things as aging of components vis-à-vis their expected useful lives and inspects wires, connections, etc., for signs of wear. This is an important part of risk management. By extension, preventive maintenance activities require certain interactions with facilities management, for example, with regard to provisions for power back-ups for EIT systems in case of power failures.

Other maintenance terms are Adaptive maintenance and Perfective maintenance.

  • Adaptive maintenance is changing one system to adapt to changes in another system, which is a type of enhancement because the entire environment is enhanced when one part is upgraded. It may be as simple as changing a configuration in one system to adapt to an upgrade in another system, using a different driver to connect databases because the other system’s database software was upgraded, or increasing data capacity via a parameter change, or it may be a more complex set of operations to increase the number of concurrent users.
  • Perfective maintenance is a misnomer – it is defined as improving or evolving a system in some manner, which is actually enhancement, not maintenance. See Evolution below.

| Corrective | Preventive | Adaptive | Perfective (Enhancements) |
| Correcting code errors (patches) | Replacing worn parts | Required by changes in laws, tax schedules, etc. |  |
| Correcting errors in install scripts | Standard purging of temp areas and protection logging spaces, and standard cleanup of FTP/SFTP sites for old files | Required to run on new O/S, or to integrate with or connect to another upgraded system | Adding features |
|  |  |  | Any change that reflects a change in original requirements |

Defining maintenance activities depends on

  • The type of system and component – activities required will vary based on the system, and the component type within the system. Maintenance of (for example) disk drives may vary depending on whether they are installed in individual servers, or a storage cluster.
  • Expected support levels and system priorities – as assigned by the organization, some systems will be designated as ‘mission-critical’ and therefore will be maintained to minimize the risk of failure, whereas non-production systems may be on a slower maintenance schedule, or have lower priority for resource assignment during times of peak production usage or risk. A high priority system may be assigned maintenance activities that cycle out components that have a predictable lifecycle, while a low priority system may be assigned a support level that allows only mandatory support activities in response to component failure.
  • System Economics – The economics of a system can be described as the difference between the benefits the service is bringing to the business versus the cost to maintain and support it. Some measurements that determine the benefits of a solution are:
    • Criticality of the business process(es) supported
    • Number of users accessing the system
    • Number of new transactions being processed
    • Amount of time saved by using the functions (e.g., versus manual procedures)
  • These measurements should be compared with the total cost of maintenance and support of the system, which includes:

    • The vendor support costs (maintenance, subscription or licensing, or leasing)
    • Infrastructure costs (server, storage, rack space, power, cooling)
    • Technical stack costs (operating system, utilities – printing, data transfer, reporting etc.)
    • FTE costs of the support, maintenance, and operations [3]

    When the cost becomes greater than the benefits, it is time to retire or replace the service. Aging out, and therefore eliminating utility/maintenance/licensing costs, should reduce EIT costs, as long as the functional replacement does not result in an added maintenance burden for EIT staff, or provide a reduced benefit to the business. See Chapter 14 Retirement for the planning and processes for these functions.
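The comparison of benefits against total cost can be sketched as a short calculation. This is a minimal sketch under stated assumptions: the class name `SystemEconomics` and its field names are hypothetical, and it assumes the benefit measurements listed above have already been converted into a single annual monetary estimate, a step the text does not prescribe.

```python
from dataclasses import dataclass


@dataclass
class SystemEconomics:
    # Assumption: benefits have been expressed as one annual monetary figure.
    annual_benefit: float
    vendor_support: float    # maintenance, subscription or licensing, or leasing
    infrastructure: float    # server, storage, rack space, power, cooling
    technical_stack: float   # operating system, utilities
    fte: float               # support, maintenance, and operations staff

    def total_cost(self) -> float:
        """Sum of the four cost categories listed in the text."""
        return (self.vendor_support + self.infrastructure
                + self.technical_stack + self.fte)

    def net_value(self) -> float:
        """Benefit minus total cost; negative means cost exceeds benefit."""
        return self.annual_benefit - self.total_cost()

    def should_retire_or_replace(self) -> bool:
        """True when the cost has become greater than the benefits."""
        return self.total_cost() > self.annual_benefit


# Hypothetical figures for an aging system.
legacy = SystemEconomics(
    annual_benefit=120_000,
    vendor_support=60_000,
    infrastructure=30_000,
    technical_stack=15_000,
    fte=40_000,
)
print(legacy.total_cost(), legacy.net_value(), legacy.should_retire_or_replace())
```

With these made-up figures the total cost (145,000) exceeds the benefit (120,000), so the system is a candidate for retirement or replacement, subject to the caveats above about the replacement's own maintenance burden.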

Generally, maintenance activity design should take into account:

  • The importance of the system to the organization (priority) – systems vary in importance to the organization, depending on the function each provides and whether the organization can continue to transact business without that system operational. Mission-critical systems will have a higher priority for recovery from failure, and therefore will have more monitoring and maintenance activities performed to prevent failures from occurring at all.
  • Maintenance requirements due to regulations and laws – Some industries have monitoring and maintenance regulations, rather than letting each organization define its own. In these cases, monitoring and maintenance activities may also have to be reported to an outside organization to manage compliance.
  • The risk and business cost of a failure occurring –
    • Some components may have a low risk of failure, but once a failure occurs, the business cost and recovery costs are high. Older systems may need parts that become scarce or expensive over time. Technical debt (the additional cost of maintaining and/or upgrading systems that lag behind current releases or technology) increases as components age, which may make the cost of a system failure catastrophic.
    • Some components may have a high risk of failure, but recovery is cheap and quick, such as by swapping out drives or cards in arrays, such that the system overall has a low risk of failure, even though components are replaced frequently.
  • Maintenance recommendations from vendors – Any system purchased or leased will have vendor-recommended activities to keep the system operational. Of course, the vendor will probably err on the side of more frequent and/or expensive activities, so each organization must determine its own needs. Cloud systems remove this responsibility from the lessee as part of Platform-as-a-Service, although maintenance activity costs are built into the contract.
  • The expected life of each component (lifecycle) under normal use – all components will have an expected life under normal conditions. Some components are less durable than others or consumable, and therefore will need more frequent maintenance.
  • The probability of each component’s failure at or before scheduled maintenance – the probability that a failure will occur grows over a component’s expected life, and after extraordinary use or strain.
  • The cost of the maintenance activity in both parts and labor – a balance must be found between overprotection: continuous monitoring and replacement at the first sign of trouble (costly) and negligence: inadequate attention to monitoring or delayed maintenance (which leads to Technical Debt).
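The trade-off between failure risk and maintenance cost in the list above can be made concrete with a simple expected-cost comparison. This is one possible sketch, not a method given in the text: it assumes the expected cost of a failure is the failure probability multiplied by the sum of business cost and recovery cost, and the function names and figures are hypothetical.

```python
def expected_failure_cost(p_failure: float, business_cost: float,
                          recovery_cost: float) -> float:
    """Assumed model: probability of failure times its total cost."""
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be between 0 and 1")
    return p_failure * (business_cost + recovery_cost)


def maintenance_is_justified(p_failure: float, business_cost: float,
                             recovery_cost: float,
                             maintenance_cost: float) -> bool:
    """True when the activity costs less than the failure it guards against."""
    return maintenance_cost < expected_failure_cost(
        p_failure, business_cost, recovery_cost)


# Low risk of failure, but a failure is expensive: maintenance pays off.
print(maintenance_is_justified(0.02, 500_000, 100_000, maintenance_cost=5_000))

# High risk of failure, but recovery is cheap and quick (swap a drive):
# an expensive preventive activity is not worth it.
print(maintenance_is_justified(0.5, 200, 300, maintenance_cost=1_000))
```

The two calls mirror the two component profiles described in the list: the first favors preventive maintenance, the second favors replacing the component when it fails.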
4.2 Define Maintenance Schedules

Maintenance activities almost always interfere with normal processing. Components must be made unavailable or incur additional strain from both production and maintenance activities occurring simultaneously. Only in extreme situations should maintenance activities occur on online components. All systems must be designed to enable offline maintenance, even if infrequent, as that ability will also be used in disaster (Incident) situations. (See ITIL, Disaster Recovery chapter)

Maintenance activities occur as scheduled over time or occur due to an event. Some components may call for maintenance both on a schedule and in response to events. Either kind may also be automated, so that a maintenance activity runs at a set time or when an event occurs.

  • Scheduled maintenance activities have defined time periods for activities, and each activity is placed on the schedule according to the organization’s needs. This type of maintenance is proactive.
  • Event-based maintenance occurs when a monitoring threshold is reached (for example, when it is time to clean out the shared temp area), or a component signals a need for attention. This type of maintenance is reactive.
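The two trigger types can be sketched as follows. This is an illustrative sketch only: the class names `ScheduledTask` and `EventTask`, the metric name `temp_used_pct`, and the thresholds are hypothetical, and a real deployment would rely on its own scheduler and monitoring tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScheduledTask:
    name: str
    interval: timedelta   # defined time period between runs
    last_run: datetime

    def is_due(self, now: datetime) -> bool:
        return now - self.last_run >= self.interval


@dataclass
class EventTask:
    name: str
    metric: str           # hypothetical metric name reported by monitoring
    threshold: float

    def is_due(self, metrics: dict) -> bool:
        # A missing metric is treated as "no signal", not as a trigger.
        return metrics.get(self.metric, 0.0) >= self.threshold


def due_tasks(scheduled, event_based, now, metrics):
    """Names of tasks due now: time-based first, then event-based."""
    due = [t.name for t in scheduled if t.is_due(now)]
    due += [t.name for t in event_based if t.is_due(metrics)]
    return due


now = datetime(2016, 8, 23, 6, 0)
scheduled = [
    ScheduledTask("weekly log rotation", timedelta(days=7), now - timedelta(days=8)),
    ScheduledTask("monthly patch review", timedelta(days=30), now - timedelta(days=3)),
]
event_based = [EventTask("clean shared temp area", "temp_used_pct", 80.0)]
print(due_tasks(scheduled, event_based, now, {"temp_used_pct": 91.5}))
```

Keeping both kinds of task behind the same `is_due` idea lets one automated job evaluate time-based and threshold-based maintenance together.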

Almost all vendors provide suggested maintenance schedules or monitoring thresholds for the components they support. Schedules are expressed as activities to be performed per time period (week, month, etc.). Otherwise, a list of events and suggested actions is provided.

In distributed (failover) or high-availability systems, maintenance on components may occur while the system is online (even if the component is not), as other components will take over processing from the ones undergoing maintenance. Maintenance activities may therefore occur during business hours as an option.

For systems without failover, plan the most intrusive maintenance activities to occur during non-peak times, and include options for taking components offline for a time to perform activities that may adversely affect business processing. Most maintenance activities occur during times of lower business processing, such as overnight or on weekends, although with the advances in distributed systems and networks, these activities require less downtime and lower frequency.
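Choosing a non-peak time can be sketched as a search over observed load. This is a minimal sketch under stated assumptions: it assumes 24 hourly load figures are available from monitoring, and the function name `quietest_window` and the sample data are hypothetical.

```python
def quietest_window(hourly_load, window_hours):
    """Return the start hour (0-23) of the contiguous window with the
    lowest total load. The window may wrap past midnight; ties resolve
    to the earliest start hour."""
    if len(hourly_load) != 24:
        raise ValueError("expected 24 hourly load values")
    if not 1 <= window_hours <= 24:
        raise ValueError("window_hours must be between 1 and 24")

    def window_total(start):
        return sum(hourly_load[(start + i) % 24] for i in range(window_hours))

    return min(range(24), key=window_total)


# Hypothetical load profile: quiet overnight, busy during business hours.
load = [5, 3, 2, 2, 4, 10, 30, 60, 80, 90, 95, 90,
        85, 90, 92, 88, 80, 70, 50, 40, 30, 20, 12, 8]
print(quietest_window(load, window_hours=3))
```

For the sample profile the three-hour window starting at 01:00 carries the least load, which matches the overnight scheduling the text describes.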


5 Evolutionary Responsibilities

6 Change Control Systems and Processes

7 Types of System Change Releases

8 Summary

9 Key Competence Frameworks

While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected 3 major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.

Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100% correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.

9.1 Skills Framework for the Information Age

The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes 7 levels of competency which can be applied to each skill. Not all skills, however, cover all seven levels. Some reach only partially up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. It is used in nearly 200 countries, from Britain to South Africa, South America, the Pacific Rim, and the United States. (http://www.sfia-online.org)

SFIA skills have not yet been defined for this chapter.


9.2 European Competency Framework

The European Union’s European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.

The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance have been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Versions 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, please see http://www.ecompetences.eu.


| e-CF Dimension 2 | e-CF Dimension 3 |
| C.3. Service Delivery (RUN): Ensures service delivery in accordance with established service level agreements (SLA's). Takes proactive action to ensure stable and secure applications and ICT infrastructure to avoid potential service disruptions, attending to capacity planning and to information security. Updates operational document library and logs all service incidents. Maintains monitoring and management tools (i.e. scripts, procedures). Maintains IS services. Takes proactive measures. | Level 1-3 |
| C.4. Problem Management (RUN): Identifies and resolves the root cause of incidents. Takes a proactive approach to avoidance or identification of root cause of ICT problems. Deploys a knowledge system based on recurrence of common errors. Resolves or escalates incidents. Optimises system or component performance. | Level 2-4 |
| E.3. Risk Management (MANAGE): Implements the management of risk across information systems through the application of the enterprise defined risk management policy and procedure. Assesses risk to the organisation’s business, including web, cloud and mobile resources. Documents potential risk and containment plans. | Level 2-4 |
| E.4. Relationship Management (MANAGE): Establishes and maintains positive business relationships between stakeholders (internal or external) deploying and complying with organisational processes. Maintains regular communication with customer / partner / supplier, and addresses needs through empathy with their environment and managing supply chain communications. Ensures that stakeholder needs, concerns or complaints are understood and addressed in accordance with organisational policy. | Level 3-4 |
| E.8. Information Security Management (MANAGE): Implements information security policy. Monitors and takes action against intrusion, fraud and security breaches or leaks. Ensures that security risks are analysed and managed with respect to enterprise data and information. Reviews security incidents, makes recommendations for security policy and strategy to ensure continuous improvement of security provision. | Level 2-4 |


9.3 i Competency Dictionary

The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD), translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. It is an extensive skills and tasks database, used in Japan and Southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest scale national examinations in Japan, with approximately 600,000 applicants each year.

The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a “Task x Skill” table. (Please see Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.

The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that corresponds to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.

The information is not available yet.


10 Key Roles

Both SFIA and the e-CF have described profiles (similar to roles) for providing examples of skill sets (skill combinations) for various roles. The iCD has described tasks performed in EIT and associated those with skills in the IPA database.

The following roles are common to ITSM.

  • 1st, 2nd, 3rd Level Support
  • Access Manager
  • Facilities Manager
  • Incident Manager
  • IT Operations Manager
  • IT Operator
  • Major Incident Team
  • Problem Manager
  • Service Request Fulfillment

11 Standards

  • ISO/IEC 24765:2016 Systems and software engineering - Vocabulary (also available online free of charge as SEVOCAB at https://pascal.computer.org/)
  • ISO 20000 series
  • IEEE Std 14764-2006, ISO/IEC 14764-2006 - International Standard for Software Engineering - Software Life Cycle Processes - Maintenance
  • IEEE Std 982.1-2005 - Standard Dictionary of Measures of the Software Aspects of Dependability

12 References

[1] SEVOCAB (2010). https://pascal.computer.org/

[2] Guide to the Systems Engineering Body of Knowledge (SEBoK), version 1.3. Retrieved from http://www.sebokwiki.org/w/index.php?title=Logistics&oldid=48199

[3] Ibid., Section 2.3 Maintenance Cost Estimation

[4] International Institute of Business Analysis. (2009). A Guide to the Business Analysis Body of Knowledge® Version 2.0. IIBA. [IIBA-BABOK 7.5.2, p. 134]

[5] Ibid., 7.6.2, p. 137

[7] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010. p. 52; p. 215