Failure Mode Analysis Methods

# Fault Tree Analysis

A somewhat vintage yet still useful technique for assessing a system's failure modes is fault tree analysis. Fault tree analysis can be described simply as an analytical technique in which an undesired state of the system is specified (usually a state that is critical from a safety or reliability standpoint), and the system is then analyzed in the context of its environment and operation to find all credible ways in which the undesired event can occur. The fault tree itself is a graphic model of the various parallel and sequential combinations of faults that will result in the occurrence of the predefined undesired event. The faults can be events associated with component hardware failures, human errors, or any other pertinent events that can lead to the undesired event. A fault tree thus depicts the logical interrelationships of basic events that lead to the undesired event, which is the top event of the fault tree.

It is important to understand that a fault tree is not a model of all possible system failures or all possible causes of system failure. A fault tree is tailored to its top event, which corresponds to some particular system failure mode, and the fault tree thus includes only those faults that contribute to this top event. Moreover, these faults are not exhaustive; they cover only the most credible faults as assessed by the analyst.

It is also important to point out that a fault tree is not in itself a quantitative model. It is a qualitative model that can be evaluated quantitatively, and often is. This qualitative aspect, of course, is true of virtually all varieties of system models. The fact that a fault tree is a particularly convenient model to quantify does not change the qualitative nature of the model itself.

A fault tree is a complex of entities known as "gates" that serve to permit or inhibit the passage of fault logic up the tree.
The gates show the relationships of events needed for the occurrence of a "higher" event. The "higher" event is the "output" of the gate; the "lower" events are the "inputs" to the gate. The gate symbol denotes the type of relationship of the input events required for the output event. Thus, gates are somewhat analogous to switches in an electrical circuit or valves in a piping layout.

The operation of any system can be considered from two extremes of a spectrum: we can enumerate various ways for system success, or we can enumerate various ways for system failure.

![Failure/Solution Spectrum or Space](image421.png)

> [!Figure]
> _Failure/Solution Spectrum or Space_

It is interesting to note that certain identifiable points in the success space coincide with certain analogous points in the failure space (see figure above). For instance, "maximum anticipated success" in the success space can be thought of as coinciding with "minimum anticipated failure" in the failure space. Although our first inclination might be to select the optimistic view of our system (success) rather than the pessimistic one (failure), we shall see that this is not necessarily the most advantageous one.

From an analytical standpoint, several overriding advantages accrue to the failure space standpoint. First of all, it is generally easier to attain consensus on what constitutes failure than it is to agree on what constitutes success. We may desire an airplane that flies high, travels far without refueling, moves fast, and carries a big load. When the final version of this aircraft rolls off the production line, some of these features may have been compromised in the course of making the usual trade-offs. Whether the vehicle is a "success" or not may very well be a matter of controversy. On the other hand, if the airplane crashes in flames, there will be little argument that this event constitutes system failure.

"Success" tends to be associated with the efficiency of a system, the amount of output, the degree of usefulness, and production and marketing features. These characteristics are described by continuous variables that are not easily modeled in terms of simple discrete events, such as "valve does not open", which characterize the failure space. (Partial failures, such as a valve that opens only partially, are also difficult to model because of their continuous possibilities.) Thus, the event "failure", and in particular "complete failure", is generally easy to define, whereas the event "success" may be much more difficult to tie down. This fact makes the use of failure space in analysis much more valuable than the use of success space.

Another point in favor of the use of failure space is that, although theoretically the number of ways in which a system can fail and the number of ways in which a system can succeed are both infinite, from a practical standpoint there are generally more ways to success than there are to failure. Thus, purely from a practical point of view, the size of the population in failure space is less than the size of the population in success space.

Fault tree analysis is a deductive failure analysis that focuses on one particular undesired event and provides a method for determining the causes of this event. The undesired event constitutes the top event in a fault tree diagram constructed for the system and generally consists of a complete or catastrophic failure, as mentioned above. Careful choice of the top event is important to the success of the analysis. If it is too general, the analysis becomes unmanageable; if it is too specific, the analysis does not provide a sufficiently broad view of the system. Fault tree analysis can be an expensive and time-consuming exercise, and its cost must be measured against the cost associated with the occurrence of the relevant undesired event.
## Symbology: The Building Blocks of the Fault Tree

A typical fault tree is composed of several symbols, which are summarized in the figure below.

![Fault-tree symbology](image422.png)

> [!Figure]
> _Fault-tree symbology_

The OR-gate is used to show that the output fault occurs if at least one of the input faults occurs. It is important to understand that causality never passes through an OR-gate. That is, for an OR-gate, the input faults are never the causes of the output fault. Inputs to an OR-gate are identical to the output but are more specifically defined as to cause.

The AND-gate is used to show that the output fault occurs only if all the input faults occur. There may be any number of input faults to an AND-gate. Figure IV-5 shows a typical two-input AND-gate with input events A and B, and output event Q. Event Q occurs only if events A and B both occur. In contrast to the OR-gate, the AND-gate does specify a causal relationship between the inputs and the output, i.e., the input faults collectively represent the cause of the output fault. The AND-gate implies nothing whatsoever about the antecedents of the input faults.

## Performing a fault tree analysis

Performing a fault tree analysis is a complex process that involves seven key steps.

- Step 1: Define the undesired event

  Before running the analysis, one should clearly define the undesired event to analyze. This event should be specific and measurable, like a component failure or a system malfunction. It is also important to define the event in clear, consistent terms, since it serves as the starting point for the fault tree diagram.

- Step 2: Identify the contributing events and factors

  Once the undesired event is defined, one should start to identify the factors and events that might contribute to its occurrence. Contributing factors tend to fall into two broad categories: basic events and intermediate events.

  - Basic events (events that cannot be broken down further into simpler events) are the most fundamental events in a fault tree, representing the lowest level of events you can analyze.
    A basic event in a fault tree for a car accident, for example, might be "tire blows out".
  - Intermediate events are located between the lower-level basic events and the top event (the primary undesired event being analyzed). Intermediate events are caused by other events in the fault tree and, in turn, cause other events. They represent higher-level events that can be analyzed further. Using the same car accident as an example, an intermediate event in the fault tree might be "the driver loses control of the vehicle".

  One must consider both internal and external events, like component failures, human error, and environmental conditions. At this stage of the analysis, you might need to consult subject matter experts and review historical data, incident reports, and maintenance records.

- Step 3: Construct the fault tree

  Using the standard symbology presented above, one must construct a graphical representation of the relationships between the undesired (or output) event and its contributing factors (also called input events). The fault tree should be organized hierarchically, with the undesired event at the top and the contributing factors branching out below it. Laying out basic events is straightforward, since basic events cannot produce other events. Including intermediate events is more complex, as they require Boolean logic gates that indicate the relationships between top-level, intermediate, and basic events. Intermediate events can also include undeveloped events, which are events that are not fully understood or have not been fully analyzed. Using the various available gates helps create a comprehensive fault tree that captures the complex interactions between the events and factors that can precipitate the undesired event.

  Building a fault tree is an iterative process, so one can continue to break down contributing events into their sub-events until the events cannot be decomposed any further.
  As new information arrives or system conditions change, the fault tree may need several adjustments to refine it.

- Step 4: Gather failure data

  To quantify the risks associated with the undesired event, you need to gather failure data (from historical records, industry databases, expert opinions, etc.) for the basic events in the fault tree. The failure data should be expressed as failure probabilities or failure rates, depending on the type of analysis conducted.

- Step 5: Perform the analysis

  Once the fault tree is constructed and the failure data gathered, the analysis is performed to quantify the probability of the undesired event occurring and to identify the most critical contributing factors. Either a qualitative or a quantitative analysis method can be used.

  A qualitative analysis focuses on understanding the structure of the fault tree, the relationships between events, and the identification of critical paths and minimal cut sets (the smallest sets of basic events whose joint occurrence causes the undesired event). Qualitative analysis can help prioritize remedial actions and identify areas for further investigation.

  A quantitative analysis, on the other hand, involves calculating the numerical probability of the undesired event occurring, based on the failure probabilities of the basic events. Quantitative analysis can help inform risk management decisions and evaluate the effectiveness of proposed improvements.

- Step 6: Interpret the results

  After performing the analysis, it is time to interpret the results and communicate any relevant information to the necessary stakeholders. The results of a fault tree analysis depend on the quality of the input data and the assumptions made during the analysis. As such, you should view the results as a starting point for further investigation and validation, rather than a definitive conclusion.
- Step 7: Implement improvements and monitor progress

  Based on the findings of the fault tree analysis, you implement preventive measures and improvements as necessary to eliminate the undesired event or decrease its likelihood. Afterwards, be sure to monitor the performance of these improvements and continually update the fault tree to reflect any changes in system design, operating conditions, or component performance, so that the tree remains accurate and useful to your organization.

## Benefits of fault tree analysis

- FTA provides a visual depiction of contributing factors and events that can lead to a system failure, making it easier to understand complex interactions between system components.
- FTA allows the calculation of the probability of a failure event occurring, enabling better risk management and decision-making and helping teams be proactive about corrective actions.
- Since you can analyze only one output event at a time, fault tree analysis helps teams stay organized as they assess system levels and work through effects analyses methodically.
- Unlike other approaches such as failure mode and effects analysis (FMEA), FTA accounts for human error, which can help teams understand whether issues are related to deviations from standard operating procedures.
- FTA identifies which failures are likeliest to occur, helping teams decide which issues require urgent attention.

## Limitations of fault tree analysis

- The accuracy and effectiveness of FTA rely heavily on the expertise of the analysts, their ability to identify relevant causes of failure, and their understanding of the complexities of the fault tree itself.
- FTA is best suited for smaller system analyses. Large, complex systems typically require large, complex fault trees, making analysis time-consuming and challenging.
- Failure data availability and quality determine the precision of the calculated probabilities in a fault tree.
- Fault tree analysis allows you to examine only one top event at a time.

# Failure Modes and Effects Analysis (FMEA)

FMEA (Failure Modes and Effects Analysis) is an engineering method that helps to define, identify, prioritize, and eliminate known and potential failures before they reach the customer. These may be failures of the system itself, of the design process that develops the system, or of the manufacturing process that produces it. The goal of FMEA is to eliminate the failure modes or reduce their associated risks. FMEA also provides a structure for cross-functional critiques of a design or a process. It helps determine the effect and severity of failure modes, and it helps identify their causes and probability of occurrence. Ultimately, FMEA develops and documents action plans to reduce risk.

> [!warning]
> This section is under #development