Reliability Assessment Methods

# Reliability Assessment Methods

## Reliability Block Diagram (RBD)

The classic Reliability Block Diagram (RBD) method uses a set of blocks connected in a topology that reflects the architecture of the system. Each block represents a component of the system with a failure rate $\lambda_{i}$.

![An RBD diagram (note that the diagram may depict any network of blocks in the system: a mezzanine, a board, a unit, a subsystem, a rack, etc)](image409.png)

> [!Figure]
> _An RBD diagram (note that the diagram may depict any network of blocks in the system: a mezzanine, a board, a unit, a subsystem, a rack, etc)_

The primary purpose of an RBD is to illustrate the interconnections and dependencies between the components of a system and how these relationships affect the system's reliability. An RBD has two key elements: blocks and connections. Each block, usually drawn as a rectangle, represents a component or a subsystem of the system. The lines or arrows connecting the blocks depict the relationship between components. These connections can be in series or parallel, indicating how the failure of one component affects others.

In a series configuration, the system fails if any single component fails. This arrangement is used when each component is essential for the operation of the entire system. In a parallel configuration, the system continues to operate as long as at least one component is functioning. This is common in systems where redundancy is built in for increased reliability.

For very complex systems, RBDs can become hard to interpret. Also, RBDs do not typically account for dynamic behaviors of systems over time, such as wear and tear or changing environmental conditions.

## Physics of Failure

Physics of Failure (PoF) analysis presents a novel approach to reliability prediction. It shifts the analysis of the system equipment from a box of parts to a box of failure mechanisms.
It is still a substantial challenge to apply PoF to a complex system of hardware and software. Many failure mechanisms invoke a limited number of failure models, and not all model parameters are available to the reliability engineer.

### Board-Level Failure Mechanism

[[Printed Circuit Boards|Printed Circuit Board]] (PCB) level PoF-based reliability predictions are generally based on the fatigue of the materials, especially metals, that make up the PCB, its interconnections, and the board [[Physical Layer#Connectors|connectors]]. Fatigue occurs when stress is applied repeatedly, and the material under stress flexes, deforms and weakens (strains) until a fracture occurs. The PoF-based reliability at this level is generally expressed as either a "time to failure" or "cycles to failure" where the "cycles" are the repetitions of fatigue stress applications. Every board design must include a fatigue stress assessment of "time to failure" or "cycles to failure" if the environmental stresses that are expected in field application result in a fatigue life that is less than the required life of the item.

A fatigue life-based reliability prediction in general uses "Miner's Rule". Miner's rule operates on the hypothesis that the portion of useful fatigue life used up by a number of repeated stress cycles at a particular stress level is the ratio of that number of cycles to the total number of cycles in the fatigue life, as if that were the only stress level applied to the part. Miner's rule goes as follows:

$\large CDI\ = \ \frac{n_{1}}{N_{1}} + \frac{n_{2}}{N_{2}} + ... + \frac{n_{k}}{N_{k}}$

Where:

- $\text{CDI}$ is the Cumulative Damage Index
- $k$ is the number of distinct stress levels, with the subscripts $1, ..., k$ identifying each one
- $n_{i}$ is the number of cycles applied at stress level $i$
- $N_{i}$ is the number of cycles at that stress level required to produce fatigue failure
- $n_{i}/N_{i}$ is the fraction of the life consumed at that stress level

To calculate a reliability prediction, the fatigue failure should be assumed to have occurred when the CDI is equal to or greater than 1.0. Note that Miner's Rule is not accurate when trying to assess combined environments (thermal cycling + vibration, vibration + mechanical shock). Accelerated techniques that increase the cyclic stress (e.g., a wider thermal range than expected in normal use) could require the use of appropriate accelerated life models to supplement a Miner's Rule analysis.

#### Vibration Fatigue

Vibration is defined generally as an oscillating motion where some item moves back and forth in response to an input stimulus. When the vibration is periodic, with a single frequency, it is called "sinusoidal" vibration. "Random" vibration is made up of the continuum of frequencies within a given bandwidth. Random vibration is usually specified for vibration qualification or screening tests because it has been shown to better represent true usage environments that electronic equipment will experience.

Vibration fatigue, also called "high cycle fatigue", is the accumulation of damage due to vibration cycles. The number of fatigue cycles accumulated under sinusoidal vibration is proportional to the product of the frequency (cycles/second) and the time of exposure (seconds). For random vibration, the fatigue cycle exposure is expressed in terms of the power spectral density (PSD). To understand the vibration fatigue damage characteristics of a PCB, it may be tested at some known input PSD and the response measured by placing accelerometers at several places on the board.
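
As a rough illustration of how Miner's rule and the sinusoidal cycle count (frequency multiplied by exposure time) fit together, the following Python sketch accumulates a Cumulative Damage Index for a small duty profile. The function names and all numeric values (frequencies, durations, cycles-to-failure) are hypothetical and chosen only to make the arithmetic easy to follow; real $N$ values would come from S-N data for the material and joint in question.

```python
def cycles_from_sine(frequency_hz: float, exposure_s: float) -> float:
    """Cycles accumulated under sinusoidal vibration: frequency x exposure time."""
    return frequency_hz * exposure_s


def cumulative_damage_index(loads) -> float:
    """Miner's rule: sum of n_i / N_i over all stress levels.

    `loads` is an iterable of (n_applied, n_to_failure) pairs.
    """
    total = 0.0
    for n_applied, n_to_failure in loads:
        if n_to_failure <= 0:
            raise ValueError("cycles to failure must be positive")
        total += n_applied / n_to_failure
    return total


# Hypothetical duty profile: (cycles applied, cycles to failure at that level).
loads = [
    (cycles_from_sine(50.0, 3600.0), 1.0e6),  # 180,000 cycles at a low stress level
    (cycles_from_sine(120.0, 600.0), 2.0e5),  # 72,000 cycles at a higher stress level
    (1.0e4, 5.0e4),                           # 10,000 cycles counted directly
]

cdi = cumulative_damage_index(loads)
fatigue_failure_predicted = cdi >= 1.0  # failure is assumed once CDI reaches 1.0

print(f"CDI = {cdi:.2f}, failure predicted: {fatigue_failure_predicted}")
```

With these made-up numbers the three stress levels consume 0.18, 0.36 and 0.20 of the fatigue life, so the part is predicted to survive this profile. As noted above, this linear sum should not be trusted for combined environments.
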
Resonant frequencies and transmissibility may be determined via structural analysis using analysis tools such as finite element analysis (FEA). If the board is treated as a simple spring-mass system, then the maximum displacement of the individual components mounted on the board must be considered. Steinberg's empirical model for determining the maximum component displacement considers the dimensions of the board, a component factor that depends on the package type, the thickness of the board, the dimension of the component, and the position of the component relative to the board center. Steinberg's empirical model is an accepted and powerful approach for first-pass assessment but is limited to a simple board shape. In boards with more complex geometries, the relative position factor is not valid. For these boards, a better approach is to use finite element analysis to determine the curvature and displacement under critical components and then input this data into a modified Steinberg equation.

The models provided for vibration fatigue are for out-of-plane board effects and not for parallel or in-plane effects. Out-of-plane board curvature tends to be the dominant source of failure of component-to-printed-wiring-board interconnects under vibration and shock. While surface mount technology is traditionally at a higher risk from perpendicular displacements due to its low profile, traditional through-hole components and even newer surface mount technology-derived electrolytic capacitors (V-chip packaging) are still susceptible to movements that induce in-plane displacements. High standoff parts and large mass parts that have a center of mass significantly above the mounting surface can fail due to overstress and fatigue when subject to in-plane loading. These may require modeling, FEA simulation, or testing to determine overstress conditions and design margins.
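
To make the "simple spring-mass system" view more concrete, here is a minimal Python sketch that estimates the response of a board to random vibration. It uses Miles' equation, a common single-degree-of-freedom approximation that is not part of the text above and is swapped in here purely for illustration; it is not Steinberg's component displacement model. It assumes a flat input PSD around the resonance and a known transmissibility `q`. The natural frequency, `q`, and PSD values are hypothetical.

```python
import math

G0 = 9.80665  # standard gravity, m/s^2


def miles_grms(natural_freq_hz: float, q: float, psd_g2_per_hz: float) -> float:
    """RMS acceleration response (in g) of a single-degree-of-freedom system
    to a flat random input, using Miles' equation."""
    return math.sqrt(math.pi / 2.0 * natural_freq_hz * q * psd_g2_per_hz)


def rms_displacement_m(natural_freq_hz: float, grms: float) -> float:
    """RMS displacement at resonance, from acceleration / (2*pi*f)^2."""
    omega = 2.0 * math.pi * natural_freq_hz
    return grms * G0 / omega**2


# Hypothetical board: 200 Hz first mode, Q = 10, 0.04 g^2/Hz input at resonance.
grms = miles_grms(200.0, 10.0, 0.04)
z_rms = rms_displacement_m(200.0, grms)
z_3sigma = 3.0 * z_rms  # commonly used peak estimate for Gaussian random response

print(f"response = {grms:.1f} g RMS, 3-sigma displacement = {z_3sigma * 1e3:.3f} mm")
```

An estimate like this only gives a first-pass board-center displacement; comparing it against an allowable component displacement still requires Steinberg's model or an FEA-derived curvature, as described above.
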
#### Thermal Cycling Fatigue

Thermal cycling fatigue, also known as "low cycle fatigue," occurs when temperatures are varied between high and low extremes, either under test or in field usage. Examples of field conditions that subject electronics to thermal cycling fatigue are diurnal temperature variations for ground-based items or the temperature variations experienced between ground and airborne altitudes for platform electronics.

Temperature cycling can induce low cycle fatigue damage to solder joint interconnections of electronic components. Fatigue damage is introduced into the solder joint due to differential expansion between the component and the board to which it is attached. The larger the component and/or coefficient of thermal expansion (CTE) mismatch, the larger the strain developed in the solder joint. The most dominant influence in fatigue failure of solder interconnects is the stress arising from the mismatch of expansion of the package and board due to temperature changes.

#### Mechanical Shock

Mechanical shock is the rapid transfer of energy into a system, which results in a significant increase in stress, velocity, acceleration, or displacement within the system. Although traditional reliability metrics are related to common, randomly occurring, and nominal events, shock occurrence may not be a common or nominal condition in the field. Shock failure models provide design information related to surviving adverse conditions, such as mishandling during shipping, harsh conditions in mobile platform applications, or crash effects. The ability to survive mechanical shock and continue to operate for some period of time may be a key application requirement that requires analytic modeling, design mitigation features, and specific qualification testing.

PCB response to a shock pulse is to initially bend in the direction of the pulse, then resonate at its natural resonant frequencies.
Design considerations include providing sufficient clearance, not only for the displacements caused by the shock of one PCB but also for the flexure and displacement of adjacent PCBs. It is also desirable to design a PCB for natural frequencies that minimize the damage and limit the number of deflections due to the shock event. PCB reliability metrics of interest related to mechanical shock include the maximum displacement of the board during a shock event and the desired resonant frequency for the PCB to minimize damage due to the shock event.

#### Electrochemical Migration

Electrochemical migration is the growth of conductive metal filaments on a PCB due to the dissolving of ionic material in surface moisture, the motion of dissolved ions due to the presence of direct current (DC) voltage, and the electro-deposition of ionic material out of solution around a cathode. The filament formation can occur on the surface, at an internal interface, or through the bulk material of a composite or laminate layer.

Conductive anodic filament (CAF) failure due to electrochemical migration is the growth or electromigration of copper salts between two oppositely biased copper conductors. The filaments migrate along the resin-glass interfaces of PCB laminates in the presence of high temperature, high humidity, ionic impurities, and electrical bias. Degradation of board materials at high processing and use temperatures results in the loss of adhesion to glass weave fibers, allowing moisture to diffuse into the laminate. The moisture forms an aqueous medium that serves as a conduit for copper migration from anode (+) to cathode (-) biased traces, vias and plated through holes (PTH). CAF and electrochemical migration testing and modeling guidelines are provided in the IPC-9691-A standard.
#### Conductive Filament Formation

Conductive Filament Formation (CFF), or metallic electromigration, is an electrochemical process involving the transport of a metal through or across a nonmetallic medium under the influence of an applied electric field. CFF affects board materials such as multi-layer organic laminates, which develop a loss of insulation resistance between traces. A model for assessing time-to-failure due to CFF in multi-layer organic laminates was developed in 1994 by B. Rudra, M. Pecht and D. Jennings. They found that failures in multi-layer laminates due to CFF are highly dependent on moisture exposure and infiltration into the layers, and are affected by temperature, voltage, use of coatings, spacing between adjacent conductors, geometries, and material and fabrication defects. Conformal coating provides an effective mitigation measure.

##### Corrosion

Corrosion is an electrochemical reaction that results in a loss of material, causing physical damage and failure of metal parts. Moisture and contaminants contribute to corrosion. Corrosion rates depend on many factors such as moisture absorption rates, contaminant exposure, surface textures, and imperfections, so it can be difficult to quantify or predict reliability based on corrosion mechanisms. At the board and subsystem assembly levels, conformal coatings are used to mitigate corrosion. Standards like ANSI/VITA 47 specify conformal coating and moisture testing requirements for mitigating the effects of corrosion on electronic modules.

### Packaging Level Failure Mechanism

Here, "packaging" refers only to the encapsulation that contains a chip or multiple chips and the die attachments. Some examples are dual in-line packages (DIP), pin grid array (PGA), ball grid array (BGA), quad flat pack (QFP), leadless chip carrier (LCC), etc. We spoke a bit about those [[Semiconductors#Packages|here]].
The failure mechanisms at the package level are similar to the board-level mechanisms, and fatigue of materials will be governed by the same basic principles and similar equations as the fatigue of board interconnects. However, some of the board-level fatigue mechanisms do not scale to chip-level attachments because the smaller size is accompanied by smaller load-bearing areas, lower profiles, and reduced standoff heights, resulting in an increased stiffness and different strain and creep properties. The failure modes and effects also differ because they relate to the die attachment to a substrate, wire bonds within the package, encapsulant, and package leads and finishes. For example, a package-level failure mode may be an open trace, a cracked bond, or a short caused by dendrite growth.

#### Vibration Fatigue

Vibration fatigue at the packaging level follows the same basic principles as at the board level, but the failure mechanisms differ because different materials, tension points, attachments, and finishes create specific packaging-level issues. Vibration will result in fatigue of solder die attachments and wire bonds, potentially leading to brittle fracture and fatigue cracking. Substrates and substrate attachments potentially will experience brittle fracture, ductile fracture, yield failure, or fatigue cracking.

Modeling for vibration fatigue life at the packaging level can be accomplished using finite element analysis (FEA) modeling of the package, bonds, and interfaces. The variety of materials and bonds used in electronic packaging, and the sensitivity of the stress/strain results to manufacturing variation, make FEA modeling both challenging and less able to be generically applied to a family of components. Experimental testing can be used to characterize materials and verify mechanical behavior.
#### Thermal Cycling Fatigue

Thermal cycling fatigue at the packaging level is similar to thermal cycle fatigue at the board level in that it is also caused by differences in the coefficient of thermal expansion (CTE) between materials. The thermal cycling fatigue at the packaging level is caused by the CTE mismatch between the die and substrate, causing stresses and strains in the die attachments or in the wire bond to pad interface. The use of component termination finishes and platings that are compliant with the European Union (EU) directive 2002/95/EC Restriction of Hazardous Substances (RoHS) will affect the thermal cycling fatigue at the packaging level.

#### Tin Whisker Formation

The European Union (EU) directives 2002/95/EC Restriction of Hazardous Substances (RoHS) and 2002/96/EC Waste Electrical and Electronic Equipment (WEEE) resulted in the restriction or elimination of several hazardous materials as of July 2006. One material restricted was lead (Pb), widely used in electronic solder, piece part terminations, and surface finishes. The reintroduction of pure tin or other Pb-free solders and finishes has resulted in new reliability concerns in electronics due to differences in material properties and the introduction of tin whisker problems.

Tin whiskers are thin filament crystalline growths that sprout spontaneously from a tin surface. Tin whiskers are sometimes long enough to create shorts between features or leads and sometimes break off and create conductive contamination problems. The mechanisms behind tin whisker formation and growth are not well understood, and there are no known methods of predicting when, where, or how long they will grow. Consequently, it is difficult to define a recommended approach to predicting electronics failures due to tin whiskers. There are, however, methods to mitigate the reliability risk due to tin whiskers.
For any high-reliability application, the use of pure tin or Pb-free solders or finishes requires a plan for tin whisker risk mitigation. GEIA-STD-0005-2 provides methods for mitigating the effects of tin whiskers.

![Tin Whisker shown above growing between pure tin-plated hook terminals of an electromagnetic relay (credit: Andrew Pelham, NASA GSFC)](image410.jpg)

> [!Figure]
> _Tin Whisker shown above growing between pure tin-plated hook terminals of an electromagnetic relay (credit: Andrew Pelham, NASA GSFC)_

#### Moisture

Moisture penetration of packages can cause swelling of adhesives, causing tensile stress at the interface between the adhesive and the metallization, possibly resulting in contact failures. Electrical shorting caused by moisture penetration and corrosion of metal components have been major reliability concerns costing the industry billions of dollars annually. Moisture-induced damage to packages can also result in fatigue of molding compounds, interfacial delamination, increased thermal fatigue due to changing CTE of materials, and high initial stresses during manufacturing. For each of these factors, package design features such as geometry, materials, and loading contribute to the rate of failure.

Classifications of susceptible non-hermetic devices can be found in the Joint Industry Standard IPC/JEDEC J-STD-020D.1. Guidelines for handling, packing, shipping, and use of sensitive devices can be found in the Joint Industry Standard IPC/JEDEC J-STD-033B.1.

#### Corrosion

The rate of corrosion-induced failure of electronics packaging can be modeled using finite element analysis (FEA), testing under moisture stress, or some combination of both. The corrosion rate depends on:

- The rate of moisture penetration of the materials
- Exposure times
- The susceptibility of the materials to corrosion once moisture is present.
##### Popcorn Cracking

Popcorn cracking is the term applied to a particular failure induced by the combination of moisture ingress and applied heat. The moisture trapped inside the package will have been absorbed during poor storage conditions at some stage in the life of the device. When powered on, or during soldering, heat turns the trapped moisture to steam, and the resulting pressure causes the package to fail. A relieved area in the center of the molding may bulge and produce a loud "pop" sound in the manner of a popcorn kernel, or the casing may fail with a "crack" sound, starting at a stress raiser. Where the failure is less dramatic or not visible externally, internal damage is still done to the device.

Surface mount devices are more susceptible to popcorn cracking as the soldering temperature is higher and the soldering operation happens on the same side of the board as the device. Prevention is achieved by baking devices to remove moisture. It is also necessary to pay attention to the moisture sensitivity level (MSL) and to store and process devices accordingly. Popcorn cracking is often a latent defect that is difficult to detect using electrical testing but can be found using cross-sectioning or C-mode Scanning Acoustic Microscopy (C-SAM) inspection.

### Semiconductor-Level Failure Mechanisms

The semiconductor-level Physics-of-Failure approach to reliability assessment has become especially important for complementary metal-oxide semiconductor (CMOS) technology with feature sizes less than 130 nm. For these technologies, greater susceptibility to physical failure mechanisms driven by current and affected by temperature has led to an industry-wide experience of early wear-out failures. These wear-out failures are being seen in timeframes that previously were mostly considered the domain of random, constant failure rates or the flat part of the bathtub curve.
For this reason, wearout modeling is becoming an essential supplement to traditional handbook reliability predictions.

Semiconductor PoF models depend on an assessment of small-scale physical effects, such as the physical motion of ions and small amounts of material across feature boundaries in integrated circuits. The rate of occurrence is assessed based on well-established physics and chemical principles, and research projects during the past five years have validated several small-scale effect models with testing and analysis. Characterization of feature-level effects, such as Electromigration (EM), Time-Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI), and Negative Bias Temperature Instability (NBTI), results in a material, process, and application-specific assessment, which can support design or process improvement. These models can be combined to model logic cell-level corruption rates, which are then combined and rolled up into a component-level model of the time to failure. This approach can be used to model new integrated circuits (ICs) without field history. This is particularly useful when applying new technology COTS electronics to military and aerospace applications.

The semiconductor PoF models are provided in two forms: failure rate (λ) and acceleration factor (AF). The AF equation is used to scale between two environments, generally a test environment and the usage environment. The failure rate forms of the equations are a more traditional form that can be found in the references provided. The AF forms of the equations are translated into a form that makes the independent variables directly controllable by the IC user. For example, EM voltage is provided as a proxy for current density.

#### Electromigration

Electromigration (EM) is the movement of metal atoms in a conductor due to the momentum exchange between the conducting electrons in the applied electric field and the metal atoms that make up the conducting material.
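
To illustrate what an acceleration factor between a test environment and a usage environment can look like for electromigration, here is a minimal Python sketch based on Black's equation, a widely used EM lifetime model that is brought in here as an assumption and is not necessarily the exact AF form used in the references of this section. It is written in terms of current density rather than the voltage proxy mentioned above. The current density exponent `n`, the activation energy `ea_ev`, and all operating conditions are hypothetical placeholders that would normally come from process characterization.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_acceleration_factor(
    j_test: float,
    j_use: float,
    temp_test_k: float,
    temp_use_k: float,
    n: float = 2.0,      # current density exponent (assumed value)
    ea_ev: float = 0.7,  # activation energy in eV (assumed value)
) -> float:
    """Ratio of time-to-failure in use to time-to-failure in test, derived from
    Black's equation MTTF = A * J**(-n) * exp(Ea / (k * T))."""
    current_term = (j_test / j_use) ** n
    thermal_term = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_use_k - 1.0 / temp_test_k)
    )
    return current_term * thermal_term


# Hypothetical stress test: twice the use current density, 125 C versus 55 C in use.
af = em_acceleration_factor(
    j_test=2.0e6, j_use=1.0e6, temp_test_k=398.15, temp_use_k=328.15
)
print(f"One hour under test conditions represents roughly {af:.0f} hours in use")
```

Because the temperature enters through an exponential, the result is very sensitive to the assumed activation energy, which is one reason the model parameters need to be specific to the material and process.
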
Failure can occur when enough material has migrated to form voids where the material is missing and growths where the material is deposited. EM failures can occur whenever electric current flows through a medium, and the rate of occurrence depends on the material properties, current density, and temperature.

#### Hot Carrier Injection

Hot carrier injection (HCI) is an effect that causes a MOS transistor's switching characteristics to degrade. Carriers, such as electrons, become "hot" or gain extra energy as they move through the MOSFET toward the state interface barrier, where they may be energetic enough to jump across the barrier. This may lead to a weaker state difference between logic high and logic low levels.

#### Time-Dependent Dielectric Breakdown

Time-Dependent Dielectric Breakdown (TDDB) is a failure mechanism of CMOS devices in which the dielectric material breaks down, forming an electron current tunnel through the oxide. It is caused by voltage stress in the gate oxide and is often observed when device feature sizes are decreased without scaling of the supply voltage or when devices are operated close to or exceeding their rated voltage.

#### Negative Bias Temperature Instability

Negative Bias Temperature Instability (NBTI) is a failure mechanism primarily of P-type metal oxide semiconductor field effect transistors (PMOS) with negative gate voltage bias at high temperatures. Thermally activated holes are trapped within the interface between the silicon dioxide and the substrate, gaining sufficient energy to disrupt the drain region.

#### Radiation

Although we all are technically in space as we travel across interstellar regions while riding on this geoid we call Earth[^96], we tend to live in a sort of protected crystal bubble in terms of the coziness of this blue dot we live on. Space is a harsh place to be, at least compared to life here at the surface of the ground.
We happen to be protected by two huge shields: the magnetosphere, which captures and deflects particles of different energies that otherwise would be harmful to us, and by a thick layer of gas we call the atmosphere which captures and neutralizes space debris wanting to hit us in the head. And both shields complement each other well. Unlike Mercury, Venus, and Mars, Earth is surrounded by an immense magnetic field called the magnetosphere. The Earth has a magnetic field because it has a molten outer core of iron and nickel that is constantly in motion. The motion of the liquid outer core creates electrical currents, which in turn generate a magnetic field, as André-Marie Ampère stated in his eponymous circuital law. Our magnetosphere shields us from erosion of our atmosphere by the solar wind (charged particles the Sun continually spews at us), erosion and particle radiation from coronal mass ejections (massive clouds of energetic and magnetized solar plasma and radiation), and cosmic rays from deep space. The magnetosphere plays the role of gatekeeper, repelling this unwanted energy that's harmful to life on Earth, trapping most of it a safe distance from Earth's surface in doughnut-shaped zones called the Van Allen Belts. The inner Van Allen belt is located typically between 6000 and 12000 km (1 - 2 Earth radii[^97]) above Earth's surface, although it dips much closer over the South Atlantic Ocean. The outer radiation belt covers altitudes of approximately 25000 to 45000 km (4 to 7 Earth radii). Any semiconductor in space crossing these regions will not have the best time ever. Geostationary satellites must pierce through the inner belt on their way to their final orbits. 

![Van Allen radiation belts; crossing them is not the nicest ride for electronics on board of a satellite going somewhere (source: public domain)](image411.png)

> [!Figure]
> _Van Allen radiation belts; crossing them is not the nicest ride for electronics on board of a satellite going somewhere (source: public domain)_

Electronics exposed to space must be ready to withstand all aspects of the environment. This includes vacuum, thermal cycling, charged particle radiation, ultraviolet radiation, plasma effects, and atomic oxygen.

Radiation is generally classified as being either ionizing or non-ionizing. The basic dividing line between the two is the energy levels involved. Ionizing radiation has sufficient energy to strip electrons from atoms, thus creating ions, that is, atoms with charge. Remember the previous section about holes, electrons, impurities, and the delicate mechanism inside the silicon lattice? Now imagine such a fragile microscopic scenery being bombarded by highly energetic particles impacting the crystal structure, knocking electrons out of place and disrupting the charge balance across the place. This is what electronics on board every satellite flying over our heads is experiencing as we speak.

Examples of ionizing radiation are alpha and beta particles, protons, X-rays, and gamma rays. Neutrons are not directly ionizing, but the resulting radiation from their collisions with nuclei is ionizing. In contrast, non-ionizing radiation only has sufficient energy to change the energy state of electrons. Examples of non-ionizing radiation are visible and infrared light, microwaves, and radio waves. Non-ionizing radiation cannot induce upsets in electronic devices but can still create undesired effects.

Additionally, galactic cosmic rays (GCR), composed of high-energy particles, overwhelmingly protons, impact the Earth's atmosphere constantly.
These particles, when they collide with molecules in the Earth's atmosphere, produce a wide range (and a high number) of particles, primarily neutrons and protons. Neutrons are particularly troublesome because they can penetrate most man-made construction[^98].

The hard vacuum of space, with its pressures below $10^{-4}$ Pa (0.0001 pascals), causes some materials to outgas, which in turn affects any spacecraft component with a line-of-sight to the emitting material, principally optics sensitive to impurities in their lenses.

Another effect to suffer in space is thermal cycling. Thermal cycling occurs as the spacecraft moves through sunlight and shadow while in orbit or while maneuvering. Thermal cycling temperatures are dependent on the spacecraft component's thermo-optical properties, i.e., solar absorptance, or how much solar energy the material absorbs, and infrared emittance, or how much thermal energy can be emitted to space. The lower the ratio of absorptance to emittance, the cooler the temperature of the spacecraft's surface. Thermal cycling can cause cracking, delamination, and other mechanical problems, particularly in assemblies where there is a mismatch in the coefficient of thermal expansion between materials.

Radiation can also affect materials. Charged particle radiation, along with ultraviolet radiation, can cause cross-linking (hardening) and chain scission (weakening) of polymers, darkening, and color center formation in windows and optics. If all that was not enough, micrometeoroids and space debris particles may impact at high velocities. All of these may have significant effects on material properties.
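
The statement that a lower absorptance-to-emittance ratio gives a cooler surface can be checked with a simple radiative balance. The Python sketch below is only a first-order illustration under strong assumptions that are not in the text: a flat plate facing the Sun, insulated on its back side, with no albedo, no Earth infrared, and no internal heat dissipation, so that absorbed solar power equals emitted thermal power. The coating values are hypothetical.

```python
SOLAR_CONSTANT_W_M2 = 1361.0       # approximate solar flux at 1 AU
STEFAN_BOLTZMANN = 5.670374419e-8  # W / (m^2 K^4)


def equilibrium_temperature_k(
    absorptance: float,
    emittance: float,
    solar_flux_w_m2: float = SOLAR_CONSTANT_W_M2,
) -> float:
    """Equilibrium temperature of a sun-facing flat plate, from
    absorptance * S = emittance * sigma * T**4."""
    if absorptance < 0.0 or emittance <= 0.0:
        raise ValueError("absorptance must be >= 0 and emittance must be > 0")
    return (absorptance / emittance * solar_flux_w_m2 / STEFAN_BOLTZMANN) ** 0.25


# Hypothetical coatings: a low-ratio (white-paint-like) and a unity-ratio surface.
t_low_ratio = equilibrium_temperature_k(absorptance=0.2, emittance=0.8)
t_unity_ratio = equilibrium_temperature_k(absorptance=0.8, emittance=0.8)

print(f"ratio 0.25: {t_low_ratio:.0f} K, ratio 1.0: {t_unity_ratio:.0f} K")
```

Since temperature scales with the fourth root of the ratio, cutting the ratio by a factor of four lowers the equilibrium temperature by about 29 percent in this idealized case. A real thermal analysis would add the other heat sources and the actual geometry.
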

![[Pasted image 20250122183319.png]]

> [!Figure]
> _Magnetic field total strength at Earth's surface (nT) (source: https://geomag.bgs.ac.uk/education/earthmag.html)_

Satellites are designed to incorporate mitigation measures for two undesired effects of radiation: ionizing dose, which refers to the cumulative energy deposited in matter by ionizing radiation per unit mass (known as Total Ionizing Dose, or TID), and Single Event Effects (SEE), which are caused by single, highly energetic particles interacting with the atomic structure of semiconductors and altering their behavior, either destructively or non-destructively.

TID affects semiconductors in several ways, for example by modifying threshold voltages. The mechanism is as follows: holes trapped in the bulk of the semiconductor oxide cause a charge buildup. These charges will increase the gate oxide electric fields, leading to a change in the [[Semiconductors#MOSFETs#Behavior and Characteristic curves|current-voltage (I-V) characteristic]] of the device. The most prominent change is the shift of the power-ON (threshold) voltage which is negative for [[Semiconductors#MOSFETs|NMOS]] and positive for PMOS. As a result, a device might become unresponsive to some commands as it might get "stuck" in a specific state.

![[Pasted image 20250122182743.png]]

> [!Figure]
> _TID levels according to orbit altitude and inclination (credit: European Space Agency)_

Devices might also see an increase in their leakage current (remember the concept of leakage current at the beginning of this section). In NMOS transistors, charges might draw an image charge in the semiconductor, which can invert the interface and open leakage paths. These parasitic leakage currents cause degraded timings and increase power consumption.
In general, BJT transistors are more robust against radiation compared to MOS (Metal-Oxide-Semiconductor) transistors. This is because the operation of a [[Semiconductors#The Junction|BJT transistor]] is based on the physical movement of minority carriers, which are not as susceptible to radiation-induced damage as the oxide layer in MOS transistors. In contrast, MOS transistors rely on the formation of a thin oxide layer, which can be disrupted by ionizing radiation. Furthermore, MOS transistors are more prone to Single Event Effects (SEEs, see next section to learn more about these effects), which occur when a charged particle strikes the gate oxide and alters the state of the transistor.

TID can also cause amplifier gain degradation. This usually manifests as a reduction in gain with increasing total dose exposure. To compensate for this, more power needs to be supplied to the device. TID can also cause dark signals in camera sensors as a direct effect of the charging of gate oxides. This is manifested as an increased noise background and is observed in both CCD and CMOS technologies. As a consequence, the dynamic range of the imager is compromised. This is a major problem for star trackers, which could fail to locate reference stars.

For TID mitigation, that is, for the onboard electronics to maintain their electrical performance (timing, current consumption) throughout the mission lifetime while subject to ionizing radiation energy deposition, a typical measure is to add shielding, which basically consists of adding barriers made of certain materials. The effectiveness of a material in shielding radiation is determined by its half-value thickness, that is, the thickness of the material that reduces the radiation by half. This value is a function of the material itself and of the type and energy of ionizing radiation.
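
The half-value thickness idea translates directly into a short calculation: each additional half-value thickness of material halves what gets through. The Python sketch below assumes simple exponential attenuation, which is a reasonable first approximation for photon radiation in a narrow beam but only a rough guide for charged particles. The half-value thickness used in the example is a hypothetical number, not a property of any specific material.

```python
import math


def transmitted_fraction(thickness: float, half_value_thickness: float) -> float:
    """Fraction of the incident radiation remaining after a shield of the given
    thickness (same length unit as the half-value thickness)."""
    if thickness < 0.0 or half_value_thickness <= 0.0:
        raise ValueError("thickness must be >= 0 and half-value thickness > 0")
    return 0.5 ** (thickness / half_value_thickness)


def thickness_for_attenuation(
    reduction_factor: float, half_value_thickness: float
) -> float:
    """Shield thickness needed to reduce the radiation by `reduction_factor`
    (for example, 10.0 means one tenth gets through)."""
    if reduction_factor < 1.0 or half_value_thickness <= 0.0:
        raise ValueError("reduction factor must be >= 1 and half-value thickness > 0")
    return half_value_thickness * math.log2(reduction_factor)


# Hypothetical shield material with a half-value thickness of 1.0 cm.
fraction = transmitted_fraction(thickness=3.0, half_value_thickness=1.0)
needed_cm = thickness_for_attenuation(reduction_factor=10.0, half_value_thickness=1.0)

print(f"3 cm passes {fraction:.3f} of the radiation; 10x reduction needs {needed_cm:.2f} cm")
```

Three half-value thicknesses leave one eighth of the incident radiation, which shows why shielding mass grows quickly when large reduction factors are required.
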
As for single-event effects, the onboard electronics and onboard software shall be designed so that SEEs disrupt the nominal operations of the subsystem as little as possible. Given that SEEs can be destructive or non-destructive, different strategies will be defined for each case. Let's unpack SEEs in the next section.

##### Single Event Effects (SEEs)

There are different kinds of single-event effects, and different types of electronic devices are susceptible to SEEs in different ways. The table below summarizes how SEEs impact different types of devices, both for non-destructive and destructive single-event effects[^99].

![Susceptibility of SEE to different device types](image413.png)

> [!Figure]
> _Susceptibility of SEE to different device types (source: #ref/FAA )_

It can be seen that, for instance, analog and mixed-signal circuits tend to be more robust (immune to different types of SEEs), as opposed to memories and FPGAs, which tend to be highly susceptible to several kinds of SEEs. The SEEs listed above are split into two groups: non-destructive (that is, unable to cause permanent damage) and destructive (able to cause permanent damage). Let's unpack the acronyms.

###### Non-destructive Effects

- SEU: Single event upset
- MBU: Multiple bit upset
- MCU: Multiple cell upset
- SEFI: Single event functional interrupt
- SET: Single event transient
- SED: Single event disturb

Single Event Upset (SEU)

An SEU causes a change of state in a storage cell. The SEU affects memory devices, latches, registers, and sequential logic. Depending on the size of the deposition region and the amount of charge deposited, a single event can upset more than one storage cell, in which case the effect is called a multiple cell upset (MCU).

Multiple Bit Upset (MBU)

An MBU is defined as a single event that causes more than one bit to be upset during a single measurement.
During an MBU, multiple-bit errors in a single word can be introduced, as well as single-bit errors in multiple adjacent words.

Single Event Functional Interrupt (SEFI)

The loss of functionality (or interruption of normal operation) in complex integrated circuits due to perturbation of control registers or clocks is called a single event functional interrupt (SEFI). A SEFI can generate a burst of errors or long-duration loss of functionality (e.g., lockup). The functionality may be recovered either by cycling the power, resetting, or reloading a configuration register.

Single Event Transient (SET)

A single event transient (SET) is a short impulse generated in a gate resulting in the wrong logic state at the combinatorial logic output. The wrong logic state will propagate if it appears during the active clock edge.

Single Event Disturb (SED)

The transient unstable state of a static random-access memory (SRAM) cell is described as resulting from a single event disturb (SED). This unstable SRAM state will eventually reach a stable state and the characterization will fall under SEU. Because the unstable state of the cell can be long enough that read instructions can be performed and soft errors generated, SEDs are identified separately.

###### Destructive Effects

Destructive Single Event Effects are:

- SHE: Single Event Hard Error
- SEL: Single event latch-up
- SESB: Single Event Snap-Back
- SEB: Single Event Burnout
- SEGR: Single event gate rupture
- SEDR: Single event dielectric rupture

Single Event Hard Error (SHE)

A single event hard error (SHE) is used to highlight the fact that a neutron-induced upset (e.g., SEU, MBU) is not recoverable. For example, when a particle hit causes damage to the device substrate in addition to the flipping bit, an SHE is declared instead of an SEU.
Single event latch-up (SEL)

In a four-layer semiconductor device, an SEL occurs when the energized particle activates one of a pair of parasitic transistors, which combine into a circuit with large positive feedback. As a result, the circuit turns fully on and causes a short across the device until it burns up or the power is cycled. The effect of an electric short is potentially destructive when it results in overheating of the structure and localized metal fusion.

Single Event Snap-Back (SESB)

SESBs are a subtype of SEL and, like SEL, they exhibit a high-current condition in the affected device. When the energized particle hits near the drain, an avalanche multiplication of the charge carriers is created. The transistor turns on and remains so (hence the reference to a latch-up condition) until the power is cycled (the device snaps back).

Single Event Burnout (SEB)

A single-event burnout (SEB) is a condition that can cause device destruction due to a high current state in a power transistor, and the resulting failure is permanent. SEB susceptibility has been shown to decrease with increasing temperature. SEBs include burnout of power metal-oxide-semiconductor field effect transistors ([[Semiconductors#MOSFETs|MOSFET]]), gate rupture, frozen bits, and noise in charge-coupled devices.

Single Event Gate Rupture (SEGR)

A SEGR is caused by particle bombardment that creates a damaging ionization column between the gate oxide and drain in power components. It typically results in leakage currents at the gate and drain that exceed the normal leakage current of a non-exposed device. SEGRs may have destructive consequences.

Single Event Dielectric Rupture (SEDR)

The single-event dielectric rupture (SEDR) has been observed in testing but not in space-flight data. Therefore, it is currently considered mostly an academic curiosity. A SEDR is identified from a small permanent jump in the core power supply current.
##### Mitigation

Two distinct approaches are frequently used to mitigate SEEs: internal and external. Internal here means inside the integrated circuit (on-chip), also called rad-hard VLSI, which we will discuss next.

###### Internal (Radiation Hardened Semiconductors)

In high-radiation environments, MOS devices suffer from serious degradation and failures. Ionizing radiation effects in MOS transistors, such as a buildup of positive charge in the oxide layer and interface state production, lead to gate threshold voltage $V_{T}$ shifts and channel mobility degradation. These parameter shifts and radiation-induced leakage cause degradation of MOS circuit characteristics and failures. In addition to these total dose radiation effects, which cause permanent failures, transient ionizing radiation exposure produces photocurrents in every junction in integrated circuits. These photocurrents can cause logic upset or latch-up as the dose rate is increased. Latch-up is caused by a nonlinear, thyristor-like action due to parasitic bipolar transistors in bulk CMOS structures.

To design radiation-tolerant CMOS chips for high-reliability applications, it is necessary to prevent these radiation-induced characteristic degradations and circuit failures. Although it is well known that positive charge buildup in the oxide layer and interface state production, due to ionizing radiation effects, lead to threshold voltage shifts and channel mobility degradation, it was confirmed experimentally that total dose effects, ranging to 10E5 rad (Si), lead to negative threshold shifts in both NMOS and PMOS and to negligibly small mobility degradation. Furthermore, the threshold voltage shifts depend on the gate oxide bias condition during irradiation.
![Experimental and simulated CMOS inverter logic threshold shift as a function of gamma-ray total dose](image414.png)

> [!Figure]
> _Experimental and simulated CMOS inverter logic threshold shift as a function of gamma-ray total dose_

Threshold voltage shifts due to radiation effects strongly depend on oxidation temperature and post-oxidation process temperature. For processes above 950 °C, threshold voltage shifts are significant. However, threshold voltage shifts and their process temperature dependence are small for processes below 900 °C. Therefore, it is possible to suppress threshold voltage shifts due to radiation by lowering the gate oxidation temperature and post-oxidation process temperature below 900 °C. Furthermore, threshold voltage shifts depend strongly on the gate oxide thickness and are substantially greater for thicker oxide MOS transistors. Therefore, it is useful to introduce thin oxide transistors in radiation-hardened VLSI designs.

Structural hardening is useful to suppress NMOS field leakage, caused by large negative threshold voltage shifts in parasitic field MOS transistors, due to total dose radiation effects. There are two kinds of thick field oxide leakage. One is source/drain leakage under NMOS gate edges. The other is field leakage between NMOS transistors.

![Structurally radiation-hardened and conventional bulk NMOS transistor](image415.png)

> [!Figure]
> _Structurally radiation-hardened and conventional bulk NMOS transistor_

Thin field oxide between the source/drain diffusion layer and thick field oxide can be introduced to suppress the radiation-induced field leakage, as shown in the figure above. For example, with a boron implantation of 1E12/cm2 at 40 keV applied to the thin oxide region, the result is as in the figure below.
![Post radiation subthreshold characteristics for structurally radiation hardened and conventional bulk NMOS (total dose 3E5 rad (Si), W=10um, L=3um)](image416.png)

> [!Figure]
> _Post radiation subthreshold characteristics for structurally radiation hardened and conventional bulk NMOS (total dose 3E5 rad (Si), W=10um, L=3um)_

In CMOS logic, since negative threshold voltage shifts due to radiation reduce the post-radiation PMOS drive current, NAND circuits should be used instead of NOR circuits, in which PMOS transistors are connected in series. From the point of view of obtaining a maximum circuit noise margin for radiation tolerance, NAND and NOR CMOS logic circuits were studied by optimizing ratio and threshold voltage, based on the transistor parameter shift data due to radiation effects. The obtained results indicate that NAND is superior to NOR in noise immunity, packing density, and speed in radiation-hardened VLSI circuits.

Latch-Up Mitigation

Latch-up in CMOS VLSI design is a phenomenon where a parasitic structure, typically a PNPN thyristor, is inadvertently created within the CMOS integrated circuit. This can lead to a short circuit between the power supply and ground, causing the device to fail or even be permanently damaged. The issue arises because of the way CMOS technology is structured, with both P-type and N-type transistors placed closely together, which under certain conditions can form this unwanted thyristor effect.

To mitigate latch-up, several strategies are employed in the design and manufacturing processes of CMOS VLSI chips. One common approach is to increase the doping concentration in the substrate and well regions. This helps to increase the holding voltage of the parasitic thyristor, making it less likely to turn on unintentionally. Another technique involves the use of guard rings.
These are heavily doped regions that encircle sensitive areas of the circuit, acting as barriers to prevent the spreading of minority carriers that could trigger the latch-up. Guard rings are typically connected to the power supply or ground, providing a path for the carriers to be safely discharged.

The layout of the CMOS devices is also critical in minimizing latch-up risk. By spacing out the NMOS and PMOS transistors and designing the circuit layout to minimize the length of the parasitic thyristor paths, the susceptibility to latch-up can be reduced.

Additionally, the use of silicon-on-insulator (SOI) technology can significantly reduce latch-up risks. In SOI, the silicon layer in which the transistors are fabricated is separated from the bulk substrate by an insulating layer, typically silicon dioxide. This isolation helps to prevent the formation of the parasitic thyristor paths that lead to latch-up.

Finally, careful control of the manufacturing process, including the use of specific fabrication steps designed to reduce the susceptibility to latch-up, is key. This includes optimizing the thermal budget of the process to minimize diffusion of dopants and careful management of implantation steps.

SEU Mitigation

To mitigate single-event upsets (SEUs) at the CMOS cell design level, several strategies are integrated directly into the transistor and cell layout to improve resilience against radiation-induced errors. One effective approach is the design of hardened memory cells, such as the DICE (Dual Interlocked storage Cell) configuration. This design involves interlocking redundant nodes within a single cell to ensure that a single particle strike cannot flip the cell state; it requires multiple nodes to be affected simultaneously, which is far less likely.

Another technique involves increasing the critical charge of the cell, which is the minimum amount of charge needed to flip the state of a bit.
By designing transistors and cells that require a higher critical charge, the impact of ionizing particles is reduced since these particles may not generate enough charge to exceed this threshold.

Additionally, the use of p-wells or n-wells for transistors, depending on the substrate type, can be optimized to minimize the interaction of charge carriers generated by radiation with critical nodes. This involves careful planning of the dopant concentrations and well depths to ensure that any charge generated by radiation is quickly recombined or conducted away from sensitive areas.

The physical layout of the CMOS cells is also critical. By strategically placing sensitive nodes and using layout techniques that minimize the area exposed to potential radiation strikes, the probability of an SEU affecting critical parts of the cell is reduced. This can involve the use of compact layouts and shielding of sensitive areas with less sensitive circuit elements.

Incorporating feedback mechanisms within the cell design is another strategy. These mechanisms can detect the occurrence of an SEU and automatically restore the correct state. This can be achieved through the design of circuits that continuously monitor their own operation and can correct single-bit errors without external intervention.

Finally, the adoption of SOI (Silicon on Insulator) technology at the cell level can significantly reduce the sensitivity to SEUs. In SOI designs, the silicon layer in which the transistors are fabricated is isolated from the bulk substrate by an insulating layer. This reduces the volume of silicon in which charge can be generated by a passing ion, thereby decreasing the likelihood of an SEU.

###### External

External mitigation techniques are those that can be added at the board or subsystem level, i.e., outside the chip.
Current limiters, scheduled power cycling, memory scrubbing, and generous design margins are all techniques that do not require tampering with semiconductors and can be added by the system designers, with the penalty of increasing the complexity of the design, especially mass and power.

#### Electrical Overstress (EOS) and Electrostatic Discharge (ESD)

Electrical overstress (EOS) and electrostatic discharge (ESD) occur when large currents are generated by excessive applied fields caused by poor initial circuit design, mishandling, or voltage pulses. EOS and ESD issues can be mitigated by good design and handling practices, and ESD sensitivity can be measured or verified by testing. There are several EIA/JEDEC standards for ESD testing.

#### Burn Out

Burnout occurs when an electronic device is permanently damaged as a result of the energy absorbed by a radiation event. Burnout can also result from electrical overstress (EOS).

#### Corrosion

The time to corrosion failure for microelectronic die metallization depends on many factors, including the package type, corroding material, fabrication and assembly processes, and storage and usage environmental conditions. M. Pecht and W. Ko (CALCE, University of Maryland) developed a model for calculating time to failure for microelectronic die metallization based on electrochemical corrosion, including parameters for geometric, material, assembly, operation, and exposure environments. Their paper[^100] provides the derivation of the model, model parameters, a discussion of failure mechanisms, mitigation measures, a comparison of several different corrosion models, and an example of an application of the model.

### Connectors

An electrical [[Physical Layer#Connectors|connector]] undergoes varying types of environmental and operating load conditions depending on the nature of the application. The loads that can act on an electrical connector depend on a connector's state of operation.
These states are "in operation," "insertion," and "disconnection". For the "in operation" state, the connector is considered a mechanical assembly because the separable parts of the connector make contact and operate as one single assembly to provide continuity in the circuit. When an electrical connector is switched on, the current flows through the contact spring, contact interface, and connector pins. Resistive joule heating at the contact interface can result in the following failure mechanisms: creep, welding of joints, silver migration, tin whisker formation, fretting, and corrosion. The current flow can cause silver migration and corrosion. Vibration can lead to fretting action of the contacts, tin whisker formation on the connector pins, and wear of the surface layer on the contact pins. Humidity can cause corrosion of the contact material, silver migration, and fretting corrosion of contact material. On insertion of the contact spring into the connector housing, there can be wear at the surface of the connector and contact pin. The contact normal force can lead to surface layer cracking and bending of the connector pin due to misalignment in assembling or high insertion force. Because connectors that do not operate continuously will be inserted and disconnected multiple times, the contact spring may become fatigued. On disconnection of the contact spring from the connector housing, arcing can occur at the contact pin. Wiping on the contact surface can occur as a result of this motion and can lead to wear on the connector pin surface. At the connector housing, "stress corrosion cracking" can occur as a result of residual stress generation from the external environment, which may contain corrosive, hot contaminants or gases at high pressure. Differential heating, differential pressure, and/or vibrations at the connector housing can lead to fretting action or "micro-motion" of contacts at their common interface. 
If there is humidity or moisture present, the housing material can swell, which may lead to degradation of the electrical and mechanical performance.

##### Environmental Factors Affecting Connectors

External factors act on the connector housing. These factors are related to the environment surrounding the electrical connector. They can be high-temperature conditions due to hot environments or heat released from operating machines, humidity or contaminants in the air, or corrosive gases released from industries or vehicles. Vibration, including oscillation or displacement, can occur. External gases can also exert pressure on the connector. Air can seep in through the pores in the connector housing material. Because the loads of temperature, humidity, contamination, corrosive gases, and pressure can act inside the connector, it is usually the contact interface of the two contact materials where the forces act. However, they may also act at other locations inside the connector.

###### Thermal

The temperature conditions under which a connector operates will affect its performance and reliability. The temperature of the connector is affected by the ambient environmental conditions, by any nearby components that generate heat or provide thermal management of the product that the connector is used in (e.g., fans), and by the heat generated by the connector due to current flow.

###### Vibration

Vibration refers to the displacement or oscillation of a body from a stable state of equilibrium, where the displacements can be periodic, non-uniform, or transient in nature. The elements that lead to the vibration of a system include the mass or inertia element, which stores the kinetic energy, and the spring or elastic element, which stores the potential energy. As the mass of the oscillating body continues to oscillate about its mean position, the energy continually exchanges between the kinetic and potential energy of the system.
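
As a minimal worked sketch of this energy exchange, assume an ideal undamped mass-spring system (an idealization not stated in the text) with mass $m$, stiffness $k$, and oscillation amplitude $A$; these symbols are introduced here only for illustration:

$$
\begin{aligned}
x(t) &= A\cos(\omega_n t), \qquad \omega_n = \sqrt{k/m} \\
E_k(t) &= \tfrac{1}{2} m \dot{x}^2 = \tfrac{1}{2} k A^2 \sin^2(\omega_n t) \\
E_p(t) &= \tfrac{1}{2} k x^2 = \tfrac{1}{2} k A^2 \cos^2(\omega_n t) \\
E_k(t) + E_p(t) &= \tfrac{1}{2} k A^2
\end{aligned}
$$

The kinetic term peaks as the mass crosses its mean position and the potential term peaks at maximum displacement, while their sum stays constant. In a real connector, damping and external excitation make the picture less tidy, but the natural frequency $\omega_n$ remains a useful first indicator of which excitation frequencies are most likely to drive large relative motion.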
Sources of vibration include contact normal force, periodic oscillation of the material in response to external forces of excitation such as engine vibration or transportation, and micro-motions due to differential thermal expansion of materials in the electrical connectors. The variables of vibration include frequency, amplitude, and time. It is the contact point of two mating surfaces inside the mechanical system that will move relative to each other as a response to vibration. Vibration can cause large deformation or stresses in the contact material. Accordingly, it can lead to fatigue, wear, or improper operation of the associated mechanical assembly.

###### Humidity

When there is a high percentage of humidity in the atmosphere, water molecules can be adsorbed on contact surfaces. This adsorption could lead to galvanic corrosion, which is an electrochemical reaction that forms an oxidized layer. The presence of humidity degrades the performance of an electrical connector. Most traditional types of corrosion can occur by local electrolytic cell formation on an electrical contact surface. In the presence of dust particles, crevice corrosion may occur at the contact spring. Corrosion behavior depends on the connector pin finish material; for instance, the corrosion rate of silver is greater than that of bronze.

###### Contamination

One of the main reasons for electric contact failure by the mechanism of wear is a high level of contamination from dust in the surrounding environment. Contamination can be due to surface oxidation, dust deposition, or corrosion of the contact material. Dust particles on the contact surface act as an insulator, and consequently, their deposition increases electrical contact resistance.

##### Failure Modes

Electrical connector failures fall into three kinds: mechanical, electrical, chemical, or a combination of these.
A mechanical process refers to a degradation mechanism where a physical deterioration of the structural material occurs. Physical damage refers to the weakening of the mechanical properties at the surface or in the bulk material. This can be in the form of removal of the surface coating and subsequent damage to the surface finish layer and formation of dimples and pores, crack formation in the housing material, defects from manufacturing, generation of stresses, and spontaneous growth of structures from the contact pin surface.

Connectors can only be connected and disconnected a finite number of times before they start to fail. This limit is expressed in mating cycles; it is an important aspect of connector design and directly impacts reliability. A mating cycle is defined as one complete connection and disconnection of a connector pair. This is an important metric for assessing the durability and life expectancy of a connector. Connectors designed for high mating cycles are typically more robust and have better contact stability, which is important in applications where frequent connecting and disconnecting are expected.

The material and design of the connectors play a significant role in determining their mating cycle capability. For instance, connectors with gold-plated contacts may have higher mating cycle ratings due to the durability and corrosion resistance of gold. Similarly, the design of the contact interface, the presence of guiding features, and the overall mechanical construction contribute to mating cycle performance. Over time, connectors can experience wear and tear due to physical and electrical stress during each mating cycle. This can lead to degradation in performance, such as increased contact resistance or mechanical failure, which can compromise the reliability of the connection. Connectors are often tested for mating cycles as per industry standards.
These tests simulate the wear and tear a connector will experience over its lifetime, helping manufacturers to estimate the lifespan and reliability of their products. Different applications require different mating cycle capabilities. For instance, while a USB connector is rated for thousands of mating cycles, a VPX connector will withstand only a few hundred. This is because a USB connector is regularly connected and disconnected. On the other hand, a VPX connector is not made for routine usage and is rarely unmated. There is no standard mating cycle value that determines the quality of a product; it varies from one device to another. Some ruggedized connectors, for example the MultiGig RT3 used in VPX (see section ), offer some level of redundancy by employing a quad-redundant contact system.

An electrical failure mode refers to a mechanism where electric current flow or voltage drop across the electrical contacts leads to a failure mechanism in an electrical connector. The flow of electric current across the contacts can generate Joule heating. The heat load may remain constant for the time period of action in the case of electrical connectors that operate for long durations. This can result in contact joint welding, change in contact resistance, or change in surface properties of electrical contacts. Silver migration refers to the growth of dendrites on the surface of an insulator material in between two conductive lines as a result of the flow of electric current through the current lines and in the presence of humidity. Hot disconnection of an electrical connector forms an electric arc at the interface between the two contacts. "Hot disconnection" refers to a separation of the connector spring from its housing when the electrical connector is on.

A chemical failure mechanism is an alteration in the chemical behavior of the material. It refers to a chemical reaction leading to the formation of a corrosive product layer.
The chemical process usually follows a diffusion-controlled mechanism.

The figure below shows a fishbone diagram for failures in connectors. A fishbone diagram, also known as an Ishikawa or cause-and-effect diagram, is a visual tool used for systematic analysis and understanding of the potential causes of a specific problem or issue. It is called a fishbone diagram because of its shape, resembling the skeleton of a fish, with a central spine and several branches. At the head of the diagram is the problem statement, clearly defined and specific. Extending from the central spine are major cause categories, which are identified based on the problem context. These categories might include areas like Methods, Machines, People, Materials, Measurements, and Environment, but can vary depending on the situation. Each major category branch then further branches out into detailed factors that could potentially contribute to the problem. This branching out continues as more specific causes are identified. By systematically breaking down the causes in this way, the diagram helps in identifying root causes and visualizing the relationship between different factors and the problem.

![Fishbone diagram for failures in electrical connectors](image417.png)

> [!Figure]
> _Fishbone diagram for failures in electrical connectors_

## Reliability Graphs

Established methods focus on the reliability of individual components and use the series/parallel analysis we explored before to compute total reliability from basic formulas that combine individual intra-component reliabilities. A nuance to add here is that reliability also depends on the interaction of a component with neighboring components. That means a perfectly reliable component with an individual failure rate of zero could still fail due to the influence of a neighboring aggressor component.
We explored this idea of aggressor and victim when we discussed [[Physical Layer#Cross Talk and Ground Bounce|crosstalk]] and [[Physical Layer#Electromagnetic Interference (EMI)|electromagnetic interference]]. It is therefore useful to think of reliability and physics of failure in terms of networks, services, and the risk and implications of a potential denial of these services. The thesis is that a digital system is a network of networks, and every node in that aggregation of networks provides a certain service, which can be interrupted either permanently (damaged forever) or non-permanently (a glitch or a momentary interruption). A network, when observed through a sufficiently coarse lens, can be seen as a "system" that executes a function; it's only when we zoom in that we see the network of networks collaborating to provide such a function.

From this perspective, a PCB can be thought of as a network. This is hardly revolutionary: a PCB is undoubtedly a network of components hooked up together, following topologies the designer chose. Moreover, a PCB is also a network of networks because it might use, for example, mezzanines or Computer-on-Modules with internal circuitry, or Systems-on-Chip (SoCs) that internally contain complex interconnections of elements. The figure below illustrates this by showing the internal composition of the Zynq Ultrascale+ MPSoC from Xilinx, which showcases a network of peripherals and cores connected through different types of links (buses, switches, etc.).

![A simplified view of the Zynq Ultrascale+ architecture. Credit: Xilinx.](image418.png)

> [!Figure]
> _A simplified view of the Zynq Ultrascale+ architecture. Credit: Xilinx._

SoCs and PCBs are fixed networks, in the sense that rerouting the flow of services between nodes is not so straightforward because devices (both at the board level and on-chip) are glued to the boards, traces are etched, and everything is soldered together.
Note that here we are referring to the network of services. For instance, an FPGA can be considered reconfigurable, but an FPGA, as a device, provides a service ("provide programmable logic in a board") that cannot change.

We can borrow from graph theory and model our architectures as a hierarchical composition of networks made of nodes and edges. By identifying what service each node provides and assessing the implications of nodes being unable (momentarily or permanently) to provide their service, we can, for instance, analyze which nodes are too influential and therefore create alternative paths.

Building reliability networks can also be a mechanism to overcome secretive suppliers. Suppliers will most likely not provide us with schematic diagrams of their products and, in the worst-case scenario, not even failure rate information. But we may ask for the information needed to build a graph depicting the topology of their products, and they would not be revealing proprietary information. Then, we could input this into our general reliability network model and assess parameters like eigenvector centrality (how influential a node is in the network), degree centrality (how well connected it is), betweenness centrality (how often it lies on the shortest paths between other nodes), alternative paths, etc. This analysis should then feed our decisions when we add [[Fault-Tolerant Design Techniques|fault-tolerant design measures]].

![Eigenvector centrality in a network (yellow=higher, bluer=lower)](image419.png)

> [!Figure]
> _Eigenvector centrality in a network (yellow=higher, bluer=lower)_

![An undirected graph colored based on the betweenness centrality of each vertex from least (red) to greatest (blue)](image420.png)

> [!Figure]
> _An undirected graph colored based on the betweenness centrality of each vertex from least (red) to greatest (blue)_

> [!True Story]
> A company that was launching its first product (an autonomous cargo vehicle for indoor factory transportation) faced some issues after rolling out its first batch of vehicles.
> As soon as the wheeled robots were deployed in the field, the engineers observed that one of the control computers was consistently switching to ERROR MODE, and the vehicle refused to continue operating because it sensed a faulty reading from a critical sensor. After a thorough investigation, the engineers found that a cable coming out of a switching power supply was inducing noise in neighboring devices and jamming nearby equipment, including the speed sensor whose faulty readings caused the transition to ERROR MODE. As soon as the engineers commanded the computer to use an alternative set of sensors, the error disappeared.

This (real) story has an interesting moral: the power supply, the cable, and the affected sensor were all perfectly healthy when assessed individually. In this scenario, the sensor was constantly prevented from providing its service (measuring speed) by factors that no classic probability-based assessment could have caught. Moreover, the engineers realized the centrality of the speed sensor: as soon as the sensor was prevented from providing its service (giving a speed in a frame of reference), the whole system came to a halt. Admittedly, better verification should have flagged a power supply cable emitting beyond acceptable levels, but these things can go under the radar: we are always exposed to factors that we cannot test in the lab. Also, the offending cable could have moved during deployment, and perhaps no anechoic chamber test would have revealed the problem during the testing phase. Reliability and availability are broader concepts that must also include denial of function due to the interaction of the architecture with itself.
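The role the speed sensor played in the story can be made concrete with a small sketch. The Python snippet below models a service network as a directed graph and lists the nodes whose denial cuts every path between a source and a consumer of a service. It is only a reachability check, not a full centrality measure, and the node names and topology (`wheel_motion`, `speed_sensor`, `alt_sensor`, `control_computer`, `motor_driver`) are hypothetical stand-ins chosen for illustration, not the actual architecture of the vehicle.

```python
from collections import deque


def reachable(graph, start, denied=frozenset()):
    """Return the set of nodes reachable from `start`, skipping denied nodes."""
    if start in denied:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in denied and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_denial(graph, source, target):
    """Nodes (other than source and target) whose denial cuts every
    path from `source` to `target`."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    return sorted(
        n
        for n in nodes - {source, target}
        if target not in reachable(graph, source, denied={n})
    )


# Hypothetical service flow with a single speed sensor.
# An edge A -> B means "A provides a service that B depends on".
single_sensor = {
    "wheel_motion": ["speed_sensor"],
    "speed_sensor": ["control_computer"],
    "control_computer": ["motor_driver"],
}

# Same flow with an alternative sensor providing a second path.
redundant = {
    "wheel_motion": ["speed_sensor", "alt_sensor"],
    "speed_sensor": ["control_computer"],
    "alt_sensor": ["control_computer"],
    "control_computer": ["motor_driver"],
}

print(single_points_of_denial(single_sensor, "wheel_motion", "motor_driver"))
print(single_points_of_denial(redundant, "wheel_motion", "motor_driver"))
```

In the first graph both the speed sensor and the control computer are single points of denial. In the second, the alternative sensor removes the speed sensor from that list and leaves only the control computer, which is where the next fault-tolerance measure would be most useful in this toy model.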
If we approach the reliability/availability question from a networking and denial-of-service perspective, we can equip our architectures with the resilience needed to always provide a path that keeps the internal services running, and thus avoid downtime and disasters. The approach is more or less the same irrespective of whether we are talking about a whole subsystem, a unit, a board, or a system-on-chip.

[^96]: The name Earth is an English/German name that simply means the ground. It comes from the Old English words 'eor(th)e' and 'ertha'. In German it is 'erde'. The name Earth is at least 1000 years old.

[^97]: The Earth is almost, but not quite, a perfect sphere. Its equatorial radius is 6378 km, whereas its polar radius is 6357 km. A radius value of 6371 km is usually adopted.

[^98]: <https://www.microsemi.com/document-portal/doc_view/130760-neutron-seu-faq>

[^99]: FAA, "Single Event Effects Mitigation Techniques Report" <https://www.faa.gov/sites/faa.gov/files/aircraft/air_cert/design_approvals/air_software/TC-15-62.pdf>

[^100]: Pecht, M., and Ko, W., "A Corrosion Rate Equation For Microelectronic Die Metallization," The International Society of Hybrid Microelectronics, Volume 13, No. 2, pp. 41-52, June 1990