Poor alarm management is one of the leading causes of unplanned downtime, contributing to over $20B in lost production every year, and of major industrial incidents such as the one in Texas City. Developing good alarm management practices is not a discrete activity, but more of a continuous process (i.e., it is more of a journey than a destination). This paper will describe the new ISA-18.2 standard, “Management of Alarm Systems for the Process Industries.” This standard provides a framework and methodology for the successful design, implementation, operation and management of alarm systems and will allow end-users to address one of the fundamental conclusions of Bransby and Jenkinson that “Poor performance costs money in lost production and plant damage and weakens a very important line of defense against hazards to people.” Following a lifecycle model will help users systematically address all phases of the journey to good alarm management. This paper will provide an overview of the new standard and the key activities that are contained in each step of the lifecycle.
Get a Life(cycle)! Connecting Alarm Management and Safety Instrumented Systems
Alarms and operator response are one of the first layers of defense in preventing a plant upset from escalating into an abnormal situation. The new ISA-18.2 standard on alarm management recommends following a lifecycle approach similar to the existing ISA-84/IEC 61511 standard on functional safety. This paper will highlight where these lifecycles interact and overlap, as well as how to address them holistically. Specific examples within ISA-18.2 will illustrate where the output of one lifecycle is used as input to the other, such as when alarms identified as safeguards during a process hazards analysis (PHA) are used as an input to alarm identification and rationalization. The paper will also provide recommendations on how to integrate the safety and alarm management lifecycles.
Implement an Effective Alarm Management Program
Apply the ISA-18.2 Standard on Alarm Management to design, implement, and maintain an effective alarm system.
Make some Alarming Moves
Tackle distractions that impair operator performance and process efficiency.
Managing Alarms to Support Operational Discipline
Process alarms, coupled with operator action, are frequently cited as a safeguard in a Process Hazard Analysis (PHA) and as an Independent Protection Layer (IPL) in a Layer of Protection Analysis (LOPA), but does the alarm management system really support the safeguard/IPL?
According to ISA-18.2 / IEC 62682, an alarm must indicate an equipment malfunction, process deviation, or abnormal condition that requires a timely operator action. If no action is taken, then the alarm is either invalid or the operator is not doing their job. Both scenarios represent a breakdown in operational discipline for alarm management, as does the presence of nuisance alarms and alarm floods. This breakdown in operational discipline for alarms has been cited as a contributing factor in many significant safety incidents, some of which will be analyzed in this paper. If operational discipline for alarms is lacking, then it is very possible that the desired risk reduction for a process alarm used as an IPL will not be achieved and the probability of an ineffective operator response will increase.
As systems have evolved from hardwired to computer control, alarms have become easier and less expensive to implement, leading to more numerous and less purposeful alarms. Operators must contend with multiple alarms at one time with only their experience to determine priority. Alarms may be added to or removed from a control system without proper management of change. Systems may include alarms for which there is no possible action, or inadequate action time. What can an organization do to take control of their process alarms and improve operational discipline?
Maximizing the Reliability of Operator Response to Alarms
Layers of protection for abnormal event management can be modeled as slices of Swiss cheese, according to James Reason. An operator’s response to an alarm is one of the first layers of protection to prevent a hazard from escalating to an incident. This paper will present best practices for maximizing the operator’s reliability for understanding and responding to abnormal situations as adapted from the alarm management standards ANSI/ISA-18.2-2016 and IEC 62682. Examples include alarm rationalization to ensure all alarms are meaningful and to capture “tribal knowledge”, prioritization to help operators determine which alarms are most critical, and creation of alarm response procedures. The treatment of safety alarms, which are those that are deemed critical to process safety or to the protection of human life or the environment, will be specifically highlighted.
The paper will also discuss key human factors considerations for maximizing operator situation awareness (SA) by preventing SA “demons” such as developing an errant mental model of the process, attention tunneling, data overload, and misplaced salience. Accordingly, the resolution of issues that inhibit operator performance, such as nuisance alarms and alarm floods, will also be discussed.
Plug the Holes in the Swiss Cheese Model
Stop using operator error as an excuse. Apply human factors considerations to improve your alarm system and help operators respond to alarms effectively.
Alarms play a significant role in maintaining plant safety by notifying operators of an equipment malfunction, process deviation, or abnormal condition that requires a timely response. Alarms are one of the first layers of protection for preventing a hazard from escalating to an incident or accident. They work in conjunction with other independent protection layers (IPLs) such as relief valves, dikes, and safety instrumented systems (SIS).
Saved by the Bell: Using Alarm Management to make Your Plant Safer
Recent industrial accidents at Texas City, Buncefield (UK) and Institute, WV have highlighted the connection between poor alarm management and process safety incidents. At Texas City, key level alarms failed to notify the operator of the unsafe and abnormal conditions that existed within the tower and blowdown drum. The resulting explosion and fire killed 15 people and injured 180 more.1 The tank overflow and resultant fire at the Buncefield Oil Depot resulted in a £1 billion (1.6 billion USD) loss. It could have been prevented if the tank’s high level safety switch had, as designed, notified the operator of the high level condition or automatically shut off the incoming flow.2 At the Bayer facility (Institute, WV), improper procedures, worker fatigue, and lack of operator training on a new control system caused the residue treater to be overcharged with Methomyl, leading to an explosion and chemical release.
Tips for Starting an Alarm Management Program
Using the ISA-18.2 standard can help process engineers understand, simplify, and implement a sustainable alarm management program.
Congratulations. You’ve been assigned the task of establishing an alarm management program for your facility. So where and how do you begin? This article presents four practical tips for starting an effective and sustainable alarm management program that conforms to the tenets of a relatively new process industry standard for alarm management published by ISA.
Using Alarms as a Layer of Protection
When Good Alarms Go Bad: Learnings from Incidents
Some of the most significant process industry incidents have involved overflowing vessels, including those at BP Texas City and Buncefield. In many overflow incidents, alarms were designed to signal the need for operator intervention. These alarms may have been identified as safeguards or layers of protection, but they did not succeed in preventing the incident. This paper reviews several overflow incidents to consider the alarm management and human factors elements of the failures.
You Asked: Alarm Management
Setting a new Standard for Performance, Safety, and Reliability with ISA-18.2
Alarm Management affects both the bottom line and plant safety. A well-functioning alarm system can help a process run closer to its ideal operating point – leading to higher yields, reduced production costs, increased throughput, and higher quality, all of which add up to higher profits. Poor alarm management, on the other hand, is one of the leading causes of unplanned downtime and has been a major contributor to some of the worst industrial safety accidents on record.
Cybersecurity (IEC 62443) Certification:
The exida 61508 / Cybersecurity Certification Program FAQ
The exida IEC 61508 Certification Program was established in 2005 in response to demand primarily from end users in the process industries and manufacturers of instrumentation products. There was a need to provide a higher quality of technical expertise with effective and responsive service.
exida is an accredited Certification Body (CB) authorized to perform product certification by the American National Standards Institute (ANSI) in the technical fields of functional safety and cybersecurity. ANSI is the Accreditation Body (AB) for IEC standards in the United States. They are a member of the International Accreditation Forum (IAF). Most countries in the world have an AB which is a member of the IAF (www.iaf.nu). IAF members have agreed to the Multilateral Recognition Agreement, recognizing the equivalence of other members’ accreditations. Thus, IAF member accreditations are valid in most countries of the world.
The exida IEC 61508 Certification Program offers the most comprehensive product review of any Certification Body (CB) resulting in products that are safer, more secure, easier to use, and more reliable.
Cybersecurity (IEC 62443) Lifecycle:
Integrating Cybersecurity Risk Assessments Into the Process Safety Management Work Process
Cybersecurity is rapidly becoming something process safety can no longer ignore. It is part of the Chemical Facility Anti-Terrorism Standards (CFATS). In addition, the President’s Executive Order 13636, “Improving Critical Infrastructure Cybersecurity,” has drawn attention to the need for addressing cybersecurity in our plants, as it has been demonstrated that, in our new world, they are now a potential source of process safety incidents.
IEC 61508, “Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES)” now has a requirement to address cybersecurity in safety instrumented systems and ANSI/ISA 84.00.01, “Functional Safety: Safety Instrumented Systems for the Process Industry Sector” is looking to include this requirement in the next revision. Currently the industry is playing catch up as there tends to be a gap in understanding between information technologists, traditionally responsible for cybersecurity, and the process automation and process safety engineers responsible for keeping our plants safe with help from automated controls and safety instrumented systems. As a result, guidance is being developed, but much of it continues to be a work in progress.
The 7 Steps to ICS and SCADA System Security
The past two years have been a wakeup call for the industrial automation industry. It has been the target of sophisticated cyber attacks like Stuxnet, Night Dragon and Duqu. An unprecedented number of security vulnerabilities have been exposed in industrial control products and regulatory agencies are demanding compliance with complex and confusing regulations. Cyber security has quickly become a serious issue for professionals in the process and critical infrastructure industries.
If you are a process control engineer, an IT professional in a company with an automation division, or a business manager responsible for safety or security, you may be wondering how your organization can get moving on more robust cyber security practices. This white paper will give you the information you need to get started. It won’t make you a security expert, but it will put you on the right path in far less time than it would take if you were to begin on your own.
We began by condensing the material from numerous industry standards and best practice documents. Then we combined our experience in assessing the security of dozens of industrial control systems. The result is an easy-to-follow 7-step process:
Step 1 – Assess Existing Systems
Step 2 – Document Policies & Procedures
Step 3 – Train Personnel & Contractors
Step 4 – Segment the Control System Network
Step 5 – Control Access to the System
Step 6 – Harden the Components of the System
Step 7 – Monitor & Maintain System Security
The remainder of this white paper will walk through each of these steps, explaining the importance of each step and best practices for implementing it. We will also provide ample references for additional information.
The ICS Cybersecurity Lifecycle
With the ever-changing threats posed by cyber events of any nature, it has become critical to recognize these emerging threats, malicious or not, and identify the consequences they may have on the operation of an industrial control system (ICS). Cyber-attacks can take many forms over time and threaten not only industrial but also national security.
Saudi Aramco, the world’s largest exporter of crude oil, serves as a perfect example depicting how devastating a cyber-attack can truly be on an industrial manufacturer. In August 2012, Saudi Aramco (SA) had 30,000 personal computers on its network infected by a malware attack better known as the “Shamoon” virus. According to InformationWeek Security this was roughly 75 percent of the company’s workstations and took 10 days to complete clean-up efforts.
The seriousness of cyber-attacks with regard to national security was addressed by former United States Secretary of Defense Leon W. Panetta in a speech in October 2012. Panetta issued a strong warning to business executives about cybersecurity as it relates to national security: “A cyber-attack perpetrated by nation states [and] violent extremist groups could be as destructive as the terrorist attack on 9/11. Such a destructive cyber-terrorist attack could virtually paralyze the nation,” he stated. “For example, we know that foreign cyber actors are probing America’s critical infrastructure networks. They are targeting the computer control systems that operate chemical, electricity and water plants and those that guide transportation throughout this country.”
In addition to Panetta’s address, the U.S. Department of Homeland Security has issued several alerts about coordinated attacks on gas pipeline operators, according to a May 2012 report by ABC News.
This whitepaper will focus on the significance of cyber-attacks on industrial control systems (ICS) and how these attacks can be prevented by proper practice of the ICS Cybersecurity lifecycle.
Failure Rate Data:
Accurate Failure Metrics For Mechanical Instruments
Probabilistic calculations that are done to verify the integrity of a Safety Instrumented Function design require failure rate and failure mode data for all equipment, including the mechanical devices. For many devices, such data is only available in industry databases where only failure rates are presented; failure mode information is rare, if available at all. Many give up and simply assume 50% safe and 50% dangerous, thinking this is conservative. In some cases this is not a conservative assumption. In other cases it is overkill.
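To make the point concrete, here is a minimal sketch of how the assumed safe/dangerous split changes a simple PFDavg verification result. It uses the common 1oo1 approximation PFDavg ≈ λDU × TI / 2; the failure rate, the 80% dangerous fraction, and the test interval are all illustrative assumptions, not data from any database or vendor.

```python
# Sketch: effect of the assumed safe/dangerous split on a 1oo1 PFDavg
# estimate (PFDavg ~= lambda_DU * TI / 2). All numbers are illustrative.

HOURS_PER_YEAR = 8760.0

def pfd_avg(total_rate, dangerous_fraction, undetected_fraction, test_interval_yr):
    """1oo1 approximation: PFDavg = lambda_DU * TI / 2."""
    lam_du = total_rate * dangerous_fraction * undetected_fraction
    return lam_du * test_interval_yr * HOURS_PER_YEAR / 2.0

lam_total = 2.0e-6   # total failures per hour (illustrative)
ti = 1.0             # annual proof test

# Assumed 50/50 split vs. a hypothetical analysis showing 80% dangerous
pfd_5050 = pfd_avg(lam_total, 0.50, 1.0, ti)
pfd_80d  = pfd_avg(lam_total, 0.80, 1.0, ti)

print(f"PFDavg, 50/50 split:     {pfd_5050:.2e}")
print(f"PFDavg, 80% dangerous:   {pfd_80d:.2e}")
# If the device is mostly-dangerous, the 50/50 assumption is optimistic;
# if it is mostly-safe, 50/50 is pessimistic -- the split matters both ways.
```

The gap between the two results is exactly why failure mode data, not just a total failure rate, is needed for a defensible verification.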
Combining field failure data with new instrument design margins to predict failure rates
Comparing FMEDA Predicted Failure Rates to OREDA
Estimated Failure Rates for Sensor and Valve Assemblies
Failure rates predicted by Failure Modes Effects and Diagnostic Analysis (FMEDA) are compared to failure rates estimated from the Offshore Reliability Data (OREDA) project for sensor and valve assemblies. Because the two methods of data analysis are fundamentally different in nature, it may be surprising that, when appropriately compared, the results from the two methods are generally quite similar. The nature of the published data for FMEDA and OREDA is explored. The relative merits of each method are discussed.
Development of a Mechanical Component Failure Database
In this paper, we present a methodology to derive component failure rate and failure mode data for mechanical components used in automation systems based on warranty and field failure data as well as expert opinion. We describe a process for incorporating new component information into the database as it becomes available. The method emphasizes random mechanical component failures of importance in the world of safety analysis as opposed to the wear-out and aging mechanical failures that have dominated mechanical reliability analysis. The method provides a level of accuracy significantly better than warranty failure data analysis alone. The derived database has the same form as the electrical/electronics databases used in FMEDA analyses to show compliance with international performance-based safety standards. Thus, the mechanical database can be used in conjunction with existing electrical/electronics databases to perform required probabilistic safety analysis on automation systems comprised of both electrical and mechanical components.
Explaining the Differences in Mechanical Failure Rates: exida FMEDA Predictions & OREDA Estimations
This white paper describes the distinction between failure rate prediction and estimation methods in general and then gives an overview of the procedures used to obtain dangerous failure rates for certain mechanical equipment using exida FMEDA predictions and OREDA estimations. exida frequently compares field failure rate data from various sources to FMEDA results in order to validate the FMEDA component library. However, because OREDA and FMEDA methods are quite different, it is not possible to compare their results directly. A methodology is presented which creates predictions and estimations that are more comparable. The methodology is then applied to specific equipment combinations and the results are compared. When differences in the results exist between the two methods, plausible explanations for the differences are provided.
The comparisons show that the OREDA failure rates are well within the range of the exida FMEDA results. The comparisons also show that, with two exceptions, the average FMEDA predictions for dangerous failure rates are only slightly less than those of the OREDA estimations. In those two exceptions, FMEDA predictions are higher than OREDA. Therefore, it is reasonable to conclude that, when compared in an “apples-to-apples” fashion, for the equipment analyzed in this paper, the exida FMEDA predictions and OREDA estimations are quite comparable.
Field Failure Rates - The Good, The Bad, The Ugly
There are many benefits to a company that has access to good field failure data, and most of them translate into cost savings. At the same time, most of the expenditure required to get good failure data is already being spent. For the incremental cost of improving data collection quality and performing better data analysis, these benefits can be achieved.
Good high quality field failure data has often been described as the ultimate source of failure data. However, not all field failure studies are high quality. Some field studies simply do not have the needed information. Some field studies make unrealistic assumptions. The results can be quite different depending on methods and assumptions. Some methods produce optimistic results that can result in bad designs and unsafe processes.
This paper presents some common field failure analysis techniques, shows some of the limitations of the methods and describes important attributes of a good field failure data collection system.
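One of the limitations the paper alludes to can be shown with a minimal sketch: the same field data yields very different failure rates depending on how ambiguous reports are handled. The unit counts, service years, and report classifications below are hypothetical, chosen only to illustrate the bookkeeping effect.

```python
# Sketch: why assumptions drive field failure rate results.
# Point estimate: lambda = failures / cumulative operating hours.
# All counts are hypothetical.

units = 500
years_in_service = 4
hours = units * years_in_service * 8760  # cumulative operating hours

confirmed_failures = 6    # root-caused, clearly random failures
ambiguous_failures = 18   # "no fault found" / unclassified reports

lam_optimistic = confirmed_failures / hours                        # drops ambiguous reports
lam_conservative = (confirmed_failures + ambiguous_failures) / hours

print(f"optimistic:   {lam_optimistic:.2e} per hour")
print(f"conservative: {lam_conservative:.2e} per hour")
# A factor-of-4 gap from a single bookkeeping decision -- which is why
# documented methods and assumptions matter in field failure studies.
```

An optimistic choice here flows directly into optimistic SIF verification results, which is the "unsafe processes" risk the abstract describes.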
Improving Reliability & Safety Performance of Solenoid Valves by Stroke Testing
Solenoid valves integrated into the design of emergency shutdown (ESD) valves used in industrial process systems can tend to bind, i.e., to become stuck in one position, when not moved for long periods of time. This binding, also known as failure due to excessive stiction, has significant negative impacts on the valve’s reliability and safety performance. It is a serious and costly problem normally addressed by expensive and time-consuming manual proof tests which typically require a process shutdown to perform testing. This paper describes an effective, alternative in-service testing protocol, known as valve stroke testing, which verifies whether or not the solenoid valve is stuck in position. It recommends a best practice procedure for implementing the valve stroke test. It provides a quantitative example of how valve stroke testing significantly improves safety performance when performed frequently (at intervals of one week or less) or even infrequently (at intervals of three to six months).
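The quantitative effect can be sketched with the common partial-test approximation, PFDavg ≈ λDU × (C·Ts/2 + (1−C)·Tf/2), where C is the fraction of dangerous undetected failures (e.g., stiction) that the stroke test can reveal. The failure rate, the 70% coverage, and the 5-year full proof test interval below are assumptions for illustration, not figures from the paper.

```python
# Sketch: effect of in-service stroke testing on a 1oo1 valve's PFDavg.
#   PFDavg ~= lambda_DU * (C * T_stroke/2 + (1 - C) * T_full/2)
# All numbers are illustrative assumptions.

H = 8760.0  # hours per year

def pfd_avg(lam_du, coverage, t_stroke_yr, t_full_yr):
    return lam_du * (coverage * t_stroke_yr * H / 2.0 +
                     (1.0 - coverage) * t_full_yr * H / 2.0)

lam_du = 1.0e-6   # dangerous undetected failures per hour (illustrative)
t_full = 5.0      # full proof test at turnaround every 5 years
cov = 0.7         # assumed stroke-test coverage of stiction failures

no_stroke = pfd_avg(lam_du, 0.0, 0.0, t_full)
quarterly = pfd_avg(lam_du, cov, 0.25, t_full)
weekly    = pfd_avg(lam_du, cov, 1.0 / 52.0, t_full)

print(f"no stroke test: {no_stroke:.2e}")
print(f"quarterly:      {quarterly:.2e}")
print(f"weekly:         {weekly:.2e}")
```

Even with imperfect coverage, the covered portion of the failure rate is tested against a much shorter interval, which is why the improvement shows up at both weekly and quarterly frequencies.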
Mechanical Failure Rate Data for Low Demand Applications
The use of IEC 61508 and IEC 61511 has increased rapidly in the past several years. Along with the adoption of the standards has come an increase in the need for accurate reliability data for devices used in Safety Instrumented Systems (SIS), both electronic and mechanical. While the methodology of determining failure rates for electronic equipment is fairly well accepted and applied, the same cannot be said for mechanical equipment. Several methods are currently being utilized for generating failure rates for mechanical components. These methods vary in their approach and often lead to dramatically different failure rates which can lead to significant differences when calculating the reliability of a safety instrumented function (SIF). Some methods can result in dangerously optimistic failure rate numbers.
This paper reviews the methods utilized to determine mechanical reliability for components utilized in safety systems and provides a recommendation for the most appropriate methodology.
Random versus Systematic Failures – Issues and Solutions
Functional safety standards provide definitions of two different categories of failures: random failures and systematic failures. These were created during the standards committee discussions of failure types to be modeled in the probabilistic failure analysis. It was decided that random failures are counted in the probabilistic failure rate analysis and systematic failures are not counted.
Systematic failures were considered to be a direct result of some design or procedure problem. They occur when a set of circumstances happen to reveal the fault. The committee thinking was that systematic failures could be permanently “fixed” by a change in a design or a procedure. It was assumed that the fix would always be completely effective. After the fix, the failure would not happen again and therefore any such failures should not be counted.
Many companies establish programs to record and analyze failures. A failure rate analysis is performed to determine device failure rates. One problem observed while reviewing these studies is that many people have completely different interpretations of the definitions of random versus systematic failures. In some cases most failures are classified as systematic. This creates a dangerous bias in field failure rate analysis.
At some sites, those performing the analysis have realized that failures classified as systematic do prevent safety devices from performing their safety function and are therefore dangerous. These failures occur under conditions which seem to occur randomly and can be modeled with exactly the same probabilistic analysis. These failures impact the probability of dangerous failure and they certainly should be counted in any failure rate analysis.
This thinking is realistic, as systematic failures may not be effectively corrected even when changes to the design or the procedures are made. If a systematic failure is effectively corrected, then in future data collection the quantity of failure reports will decrease and reflect the change. If the change was not effective, the data will show that as well. Any updated field failure rate analysis will then reflect the improvement, or the lack of it. So most engineers now understand that to improve safety and achieve realistic measurement of safety:
- All failures must be counted in failure rate analysis and
- All failures must be reviewed to determine if the failure can be practically prevented in the future.
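The bias described above can be sketched numerically: excluding reports labeled "systematic" shrinks the computed dangerous failure rate even though each excluded failure would still have defeated the safety function. The report list and operating hours below are hypothetical.

```python
# Sketch: bias introduced when "systematic" failures are excluded from a
# field failure rate analysis. Counts and hours are hypothetical.

hours = 2_000_000.0  # cumulative device operating hours

reports = [
    ("corroded sensing line", "systematic"),  # classification by site reviewer
    ("stuck solenoid",        "random"),
    ("wrong calibration",     "systematic"),
    ("electronics fault",     "random"),
    ("plugged impulse line",  "systematic"),
]

lam_random_only = sum(1 for _, c in reports if c == "random") / hours
lam_all = len(reports) / hours

print(f"random-only counted: {lam_random_only:.2e} per hour")
print(f"all failures counted: {lam_all:.2e} per hour")
# Excluding the "systematic" reports understates the failure rate by 2.5x
# here, even though every excluded failure left the device unable to
# perform its safety function.
```

Counting everything, then separately reviewing each failure for preventability, implements both bullet points above without biasing the rate.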