# Fault-tolerant design techniques

One of the remarkable aspects of engineering is how we can create fault-tolerant systems out of unreliable parts. There is no magic, though: the principle behind it is to design the system in such a way that it can recover from failures of its components, thereby continuing to operate correctly even when some parts are malfunctioning. The construction of fault-tolerant systems from unreliable components is achieved through various strategies, including but not limited to:

- Redundancy: This is one of the most common methods, where multiple copies of the same component are used. Redundancy can be implemented in several forms (in reality, most fault-tolerant systems employ a combination of the forms addressed below):
    - Hardware Redundancy: Involves using duplicate hardware components, such as processors, memory, or power supplies.
    - Software Redundancy: Involves running multiple copies of software processes, possibly on different hardware, to ensure that even if one fails, others can take over.
    - Information Redundancy: Involves adding extra bits for error detection and correction in data transmissions or storage.
- Diversity: Using different implementations of the same functionality can protect against a specific failure mode affecting all instances simultaneously. This can include using different algorithms, software versions, or even different hardware architectures to achieve the same end goal.
- Graceful Degradation: In some systems, it is acceptable for the system to offer a reduced level of service instead of complete failure. By carefully degrading functionality, the system can continue to provide essential services even when some components fail.
- Fault Detection and Correction: Systems must be able to detect when a fault occurs and take corrective action, which can include switching to a redundant component, restarting a failed process, or reconfiguring the system to exclude the faulty part.
- Predictive Maintenance: By monitoring the condition of components and predicting failures before they occur, systems can preemptively address potential issues, either by replacing components or by activating redundancy.

Building fault-tolerant systems from unreliable parts requires a deep understanding of the potential failure modes of the components, the criticality of different system functions, and the trade-offs between cost, complexity, and reliability. Therefore, with a thorough [[Failure Mode Analysis Methods|failure mode analysis]] to assess the potential failure mechanisms, we as designers should be reasonably aware of the weaknesses of our design. The obligatory question is: what is next? How do we equip our designs with mitigation measures to avert the potential failure mechanisms the system could fall into? Fortunately, there is a set of techniques we can apply as system architects. These techniques do not aim to rule out the possibility of faults and failures, but to have a plan for when they do happen.

As system designers, we are often forced to accept components as they are, without many possibilities of modifying their internal structure to make them more reliable or fault-tolerant. For instance, we may have no means to modify the semiconductors we use, which implies we must use those chips "as is" because there is no practical means for us to add fault-tolerant measures *inside* them.
Then, we are left with the task of:

- Selecting the devices accordingly
- Adding external measures to handle fault tolerance if requirements so dictate

## Using Fault-Tolerant Microprocessors

Of course, one option is to employ CPU cores that are purposely designed for high reliability. Fault-tolerant CPU microarchitectures are specifically designed to ensure continuous and correct operation in the presence of faults, and are often used in critical applications like space missions, nuclear facilities, and aviation. Examples of fault-tolerant CPU microarchitectures are:

- LEON3-FT: A fault-tolerant version of the LEON3 SPARC V8 processor. The LEON3-FT processor was developed for space applications and is designed to handle single-event upsets (SEUs) and other radiation-induced faults. See more details in the next section.
- RAD750: A radiation-hardened microprocessor designed for high-radiation environments such as space. Developed by BAE Systems, the RAD750 can withstand radiation doses that are fatal to standard processors and is used in various spacecraft and planetary rovers.
- IBM Z Mainframes: While not exclusively used in space or aviation, IBM Z mainframes are designed for critical enterprise applications requiring high availability. They feature fault-tolerant design elements in their processors and system architecture, allowing them to handle hardware failures without significant interruption.
- Intel Itanium 9700 Series: These processors, often used in enterprise servers, include features like a machine-check architecture that enhances reliability, availability, and serviceability by allowing the processor to work around many types of errors.
- HP NonStop Systems: These systems are designed for high-availability applications. They use a fault-tolerant architecture to ensure that they continue to operate in the face of hardware or software failures, and are commonly used in banking, stock exchanges, and other critical applications.
- Lockheed Martin Space Processor (LMSP): Another example of a radiation-hardened microprocessor, developed for use in spacecraft. It is designed to withstand the harsh conditions of space and can handle various types of radiation-induced faults.

#### LEON3-FT Overview

> [!note]
> LEON3/LEON3-FT models are developed and distributed by Aeroflex Gaisler (now Frontgrade Gaisler). More information regarding these models is available on their website https://www.gaisler.com/.

We will now use an example of a microprocessor purposely designed for high reliability. The LEON3-FT design is a fault-tolerant and SEU-proof version of the LEON3[^108], suitable for mission-critical applications such as space systems. The LEON3-FT core is provided at design (VHDL) level and does not require an [[Reliability Assessment Methods#Internal (Rad-Hard Semiconductors)|SEU-hard foundry process]], nor a custom cell library or special back-end tools.
The LEON3-FT model features most of the functionality of the LEON3 processor, and also includes the following fault-tolerance features:

- Register file SEU error-correction of up to 4 errors per 32-bit word
- Cache memory error-correction of up to 4 errors per tag or 32-bit word
- Autonomous and software-transparent error handling
- No timing or performance impact due to error detection and correction

It is relevant to note that it is possible to incorporate a LEON3FT as a [[Semiconductors#IP Cores|softcore]] into a custom design, or to obtain a LEON3FT CPU core inside a full-fledged System-on-Chip that Gaisler commercializes under the name UT699. The UT699 SoC is centered around the AMBA [[Semiconductors#Advanced Microcontroller Bus Architecture (AMBA)#AHB|Advanced High-speed Bus]] (AHB), to which the LEON3FT processor and other high-bandwidth units are connected. Low-bandwidth units are connected to the AMBA [[Semiconductors#Advanced Microcontroller Bus Architecture (AMBA)#APB|Advanced Peripheral Bus]] (APB), which is accessed through an AHB-to-APB bridge. The architecture is shown in the figure below.

![UT699 SoC architectural block diagram](image423.png)

> [!Figure]
> _UT699 SoC architectural block diagram_

The LEON3FT architecture includes the following peripheral blocks:

- LEON3 SPARC V8 integer unit with 8 kB instruction cache and 8 kB data cache
- IEEE-754 floating-point unit
- Debug support unit
- UART and JTAG debug links
- 8/16/32-bit memory controller with EDAC for external PROM and SRAM
- 32-bit SDRAM controller with EDAC for external SDRAM
- Timer unit with three 32-bit timers and watchdog
- Interrupt controller for 15 interrupts in two priority levels
- 16-bit general-purpose I/O port (GPIO), which can be used as external interrupt sources
- AMBA AHB status register
- Up to four SpaceWire links with RMAP
- Up to two CAN controllers
- [[High-Speed Standard Serial Interfaces#Ethernet|Ethernet]] with support for MII
- CompactPCI interface with 8-channel arbiter

If the design incorporates the CPU core as a soft core, the fault tolerance in LEON3FT is implemented using ECC coding of the on-chip RAM blocks in the target FPGA. The [[Fault-Tolerant Design Techniques#Error Correcting Codes|ECC codes]] are adapted to the type of RAM blocks that are available for a given FPGA technology, and to the type of data that is stored in the RAM blocks. The general scheme allows detecting and correcting up to four errors per 32-bit RAM word. In RAM blocks where the data is mirrored in a secondary memory area (e.g., cache memories), the ECC codes are tuned for error detection only. A correction cycle then consists of reloading the faulty data from the mirror location. In the cache memories, this equals an invalidation of the faulty cache line and a cache line reload from main memory. In RAM blocks where no secondary copy of the data is available (e.g., the register file), the ECC codes are tuned for both error detection and correction. The focus is placed on fast encoding/decoding times rather than on minimizing the number of ECC bits. This approach ensures that the FT logic does not affect the timing and performance of the processor and that LEON3FT can reach the same maximum frequency as the standard non-FT LEON3. The ECC encoding/decoding is done in the LEON3FT pipeline in parallel with normal operation, and a correction cycle is fully transparent to the software, without affecting the instruction timing. The ECC protection of RAM blocks is not limited to the LEON3FT processor.
In an SoC design based on LEON3FT, any IP core using block RAM will have the RAM protected similarly. This includes, for instance, the FIFOs in the SpaceWire IP core and the buffer RAM in the CAN-2.0 IP core.

#### Using Non-Fault-Tolerant CPUs In High-Reliability Designs

One sensible observation to make would be: what if we do not have a fault-tolerant CPU or SoC in a fault-tolerant design? Is it strictly necessary to have a fault-tolerant microprocessor in a high-reliability architecture? The answer is: no. As system designers, we have tools at hand to robustize our designs and alleviate the fact that certain parts of them may not be fault-tolerant by design. A fault-tolerant CPU or SoC only protects against one particular failure mechanism (radiation and [[Reliability Assessment Methods#Single Event Effects (SEEs)|single-event effects]]) but still does not guarantee fault-free operation for all the other failure mechanisms that can affect [[Reliability Assessment Methods#Physics of Failure#Semiconductor-Level Failure Mechanisms|semiconductor devices]]. The next section reviews techniques we have at hand when our systems combine fault-tolerant and non-fault-tolerant components.

## Redundant Design

Duplicating entities is a way of ensuring that, should a failure take one of the entities out of operation, there would be other entities ready to enter service and provide the functionality the failed entity just stopped providing. Note that we used the term "entity" without specifying whether it was a piece of hardware, software, or information. This is intentional, as there are different levels of the hierarchy where we can apply redundancy.

Hardware redundancy is provided by incorporating extra hardware into the design to either detect or override the effects of a failed component. For example, instead of having a single processor, we can use two or three processors, each performing the same function. By having two processors and comparing their results, we can detect the failure of a single processor; by having three, we can use the majority output to override the wrong output of a single faulty processor. This is an example of static hardware redundancy, the main objective of which is the immediate masking of a failure. A different form of hardware redundancy is dynamic redundancy, where spare components are activated upon the failure of a currently active component. A combination of static and dynamic redundancy techniques is also possible, leading to hybrid hardware redundancy. Hardware redundancy can thus range from simple duplication to complicated structures that switch in spare units when active ones become faulty. These forms of hardware redundancy incur high overheads, and their use is therefore normally reserved for critical systems where such overheads can be justified. In particular, substantial amounts of redundancy are required to protect against malicious faults.

The best-known form of information redundancy is error detection and correction coding (EDAC). Here, extra bits are added to the original data bits so that an error in the data bits can be detected or even corrected. The resulting error-detecting and error-correcting codes are widely used today in memory units, CPU microarchitectures, and various storage devices to protect against benign failures. Note that these error codes (like any other form of information redundancy) may require extra hardware to process the redundant data.
Error-detecting and error-correcting codes are also used to protect data streams transferred over [[Physical Layer#Signal Integrity|imperfect interconnects]]. Note that these channels can be the communication links among either widely separated devices or cores that form a local network inside a System-on-Chip. If the code used for data communication is capable of only detecting the faults that have occurred (but not correcting them), we can retransmit as necessary, thus employing time redundancy.

Software redundancy is used mainly against software failures. It is a reasonable guess that every large piece of software that has ever been produced has contained faults (bugs). Dealing with such faults can be expensive: one way is to independently produce two or more versions of that software (preferably by disjoint teams of programmers) in the hope that the different versions will not fail on the same input. The secondary version(s) can be based on simpler and less accurate algorithms (and, consequently, be less likely to have faults) to be used only upon the failure of the primary software to produce acceptable results. Just as for hardware redundancy, the multiple versions of the program can be executed either concurrently (requiring redundant hardware as well) or sequentially (requiring extra time, i.e., time redundancy) upon failure detection.

### Supervision and Reconfiguration (aka Failover)

A factor that is not always discussed enough is that where there is hardware redundancy, there must be failover management; that is, an entity in charge of performing the switchover between nominal and redundant units. Performing said switchovers requires a degree of awareness of the overall conditions, to decide when switching units over is necessary. Every complex digital system needs different levels of supervision and reconfiguration. One challenge that appears with distributed architectures is who is in charge of acting as the system's watchdog, and what action to perform should the watchdog not be kicked at the expected times.

> [!attention]
> A paradox appears: who watches the watchdog? But also, who watches the watchdog's watchdog?

In general, the solution to this conundrum is to design the supervision circuitry to be highly reliable and of minimal complexity. That means equipping the system's watchdog with internal redundancy and with multiple mechanisms to recover from off-nominal situations. To keep supervisors simple, they typically contain little or practically no software or firmware. FPGAs appear as good candidates for watchdogs since it is rather straightforward to create independent logic blocks with them, although care must be taken to mitigate all the issues FPGAs may suffer, notably upsets in their configuration memories. Anti-fuse FPGAs are better suited for high-reliability supervision circuitry than SRAM-based (reprogrammable) FPGAs, considering also that an FPGA destined for supervision purposes will most likely never be reconfigured during its lifetime.

The obligatory question designers face is: should the watchdog need to take action, what is this action going to look like? In general, system watchdogs will switch over between units. If a nominal unit fails to respond at the right time, the watchdog will assume the unit is no longer operative and perform the operations to power it off and power on the redundant one.
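To make the supervisor's decision logic concrete, below is a minimal sketch in C. It is illustrative only: as noted above, a real supervisor of this kind would typically be implemented as simple FPGA logic rather than as software, and the hardware-access helpers (`heartbeat_seen()`, `set_power()`) are hypothetical placeholders for registers or discrete control lines.

```C
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware-access helpers; in a real design these would be
   FPGA registers or discrete power-control lines. */
extern bool heartbeat_seen(int unit);      /* heartbeat received since last poll? */
extern void set_power(int unit, bool on);  /* drive the unit's power enable */

#define TIMEOUT_TICKS 5  /* missed polls tolerated before switchover */

enum { UNIT_NOMINAL = 0, UNIT_REDUNDANT = 1 };

static int active = UNIT_NOMINAL;
static uint32_t missed = 0;

/* Called periodically, e.g., from a timer tick. */
void supervisor_poll(void)
{
    if (heartbeat_seen(active)) {
        missed = 0;                      /* active unit is healthy */
        return;
    }
    if (++missed >= TIMEOUT_TICKS) {
        /* Declare the active unit dead and fail over to the other one. */
        set_power(active, false);
        active = (active == UNIT_NOMINAL) ? UNIT_REDUNDANT : UNIT_NOMINAL;
        set_power(active, true);
        missed = 0;
    }
}
```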
Of course, FPGAs cannot handle high-power loads, so the actual power switching is typically done either by commanding a power distribution unit or by driving relays through external circuitry that can handle the loads. Reconfiguration settings are typically stored in nonvolatile storage. Note that reconfiguration managers not only act upon the failure of a subsystem to kick a watchdog but also upon other alarms that might trigger reconfiguration. In remotely operated, mission-critical systems, watchdogs can be directly commanded by special commands sent by an operator, overriding some of the internal interfaces. This allows a remote operator to reconfigure systems in light of failures. Mission-critical systems keep logs of reconfiguration activity for subsequent analysis.

Supervision circuitry plays a critical role in high-availability systems. The time needed for a switchover between nominal and redundant units may directly impact the system's capability of providing a service. What is more, overly sensitive reconfiguration control logic may lead to unnecessarily frequent switchovers, impacting the system's uptime and increasing reliability risks due to wear-out of switching components such as relays.

#### FPGA Configuration Memory Scrubbing

Configurable devices like FPGAs must store their configuration data somewhere, and here lies the issue: what if the configuration memory gets corrupted? For instance, the configuration memory of SRAM-based FPGAs can be sensitive to the effects of radiation, which can create a permanent malfunction of the circuit programmed inside the FPGA by changing, for example, the nature or performance of a logical function implemented in a LUT or the type of an I/O port in use. In such cases, when the configuration memory can be altered by radiation, it is very important to periodically reload the configuration bitstream of the FPGA to overwrite the configuration bits with known-good values and to avoid the accumulation of faults in the configuration memory. This continuous reloading of the bitstream is popularly called "scrubbing". Scrubbing, as explained in Xilinx Application Notes 138[^110] and 151[^111], allows a system to repair bit-flips in the configuration memory (including the memory cells that configure the [[Semiconductors#Look-Up Tables (LUTs)|LUT]], the ones that control the routing, and the [[Semiconductors#Configurable Logic Block (CLB)|CLB]] customization) without disrupting its operation. Configuration scrubbing prevents the accumulation of configuration faults and reduces the time during which an invalid circuit configuration is allowed to operate.

For some Xilinx FPGAs, the whole configuration memory is divided into several frames representing the minimal amount of resources that can be configured. Such a structure allows reconfiguring either the full device (full scrubbing) or only a part of the design (partial scrubbing). The selection of the scrubbing mode mainly depends on the selected redundancy scheme. Due to the large sizes that the configuration memory can have compared to "user memory" (i.e., flip-flops and embedded memory cells), it is equally important to protect against faults both in the configuration memory and in the user logic. As an example, the configuration memory for the Xilinx Virtex-6 family is about four to eight times larger than the user memory.
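As an illustration of the concept, the following C sketch implements one readback-style scrubbing pass, under the assumption of a hypothetical frame-access interface (`read_frame()`, `write_frame()`, `golden_frame()`) and illustrative frame sizes; real scrubbing engines are device-specific and are usually implemented in dedicated logic or with vendor-provided IP.

```C
#include <stdint.h>
#include <string.h>

#define FRAME_WORDS 81   /* illustrative frame size, in 32-bit words */
#define NUM_FRAMES  1024 /* illustrative number of configuration frames */

/* Hypothetical access functions to the configuration interface and to a
   golden copy of the bitstream held in protected nonvolatile memory. */
extern void read_frame(uint32_t idx, uint32_t *buf);
extern void write_frame(uint32_t idx, const uint32_t *buf);
extern const uint32_t *golden_frame(uint32_t idx);

/* One full scrubbing pass: compare every frame against the golden copy
   and rewrite only the frames that differ (readback scrubbing). A
   "blind" scrubber would skip the comparison and rewrite every frame. */
uint32_t scrub_pass(void)
{
    uint32_t repaired = 0;
    uint32_t buf[FRAME_WORDS];

    for (uint32_t i = 0; i < NUM_FRAMES; i++) {
        read_frame(i, buf);
        const uint32_t *gold = golden_frame(i);
        if (memcmp(buf, gold, sizeof buf) != 0) {
            write_frame(i, gold);   /* overwrite with known-good values */
            repaired++;
        }
    }
    return repaired;  /* count of corrected frames; useful telemetry */
}
```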
It is important to note that scrubbing is not sufficient to completely protect SRAM-based FPGAs from particle effects, as it only avoids the accumulation of faults in the configuration memory. Indeed, faults can occur between two scrubbing cycles and provoke errors in the application until the next refresh of the configuration memory. Moreover, scrubbing will not correct faults in user registers or embedded RAM. Consequently, it is important to apply additional mitigation techniques as a complement to scrubbing.

### Cross-strapping

We cannot discuss failover and reconfiguration without discussing cross-strapping. Cross-strapping is the act of managing alternative paths for a given functionality, from source to sink. Cross-strapping means ensuring an element in the architecture can be used by a designated set of drivers. In architectures with one or more CPUs controlling a set of slave devices (sensors, actuators, storage, data-handling nodes, etc.), cross-strapping governs how the different slave devices can be accessed by either CPU or controller, depending on the operating stage of the system. In the most simplistic scenario—one with no actual cross-strapping—a set of slave devices is assigned to one specific CPU or controller. This is called an independent functional "chain", and if the main controller goes offline, its slave devices go offline as well (see figure below).

![Redundant chains without cross-strapping](image424.png)

> [!Figure]
> _Redundant chains without cross-strapping (note that the "enable" signal from the supervisor can be a low-voltage signal or it can be power rails)_

In some architectures, slave devices cannot be so easily duplicated; the reasons for such a limitation may vary, but it might be due to mass/weight, budget, or complexity constraints. In such cases, a mechanism must ensure that a slave device can be accessed by either controller (figure below). When this is the case, we usually cannot afford to just split the signal and send it to both the nominal and redundant controllers; this will create, in most cases, data conflicts on the data bus to the slave device. Therefore, a mechanism to route signals from either the nominal or the redundant controller must be put in place. This switch can be of many different kinds:

- An electromechanical relay
- A solid-state CMOS analog switch
- A network switch

Different switches may operate in different layers. For the physical layer (as in the case of the relay or the analog CMOS switch), the routing of signals happens purely at the electrical level. If switching must be done at the data link layer, an Ethernet switch might be needed. In any case, the switching device requires configuration and control signals that mandate which path the signals shall take depending on the operational scenario. Taking into account that the control signals are critical for the performance of the system, these signals typically come from the high-reliability supervisor. Additionally, switches must include failsafe mechanisms that still guarantee a consistent routing path in case of failure.

![Redundant system with cross-strapping](image425.png)

> [!Figure]
> _Redundant chains with cross-strapping (note that the "enable" signal from the supervisor can be a low-voltage signal or it can be power rails)_

Needless to say, adding cross-strapping capabilities brings reliability and fault-tolerance concerns. As soon as we need a switching device to act between a pair of redundant controllers and a remote device, we are inserting a new component with its own failure modes. The more complex the switching entity is, the more failure mechanisms it may contain. In a cross-strapping scheme like the one shown in the figure above, the switch is a single point of failure.
How to solve this will depend on the type of switching device we are talking about. For relays, it is usually done by means of redundant schemes. For Ethernet networks, redundant switches and routing schemes must be devised.

There is a case where the slave devices can also be redundant (see figure below). In this scenario, either controller can drive either slave device; therefore, the switching circuitry again appears as instrumental in ensuring all the combinatorial paths the architecture allows are implemented, and thus its reliability is critical. The switching circuitry must be designed to be fail-safe, in the sense that a potential failure would still ensure a functional path to a controller. ==The technology of the switching circuitry is intentionally left unspecified here: it can be a pair of relays, a backplane topology, or CMOS buffers with enable/disable signals.== Depending on the physical layers and protocols involved, this architecture may require that both controllers not be ON at the same time, forcing the design to be cold-redundant.

![](cross_strap2.png)

> [!Figure]
> _Cross-strapped redundant controllers with redundant slave devices. Any controller can reach any slave device_

One detail that is somewhat implicit in the figures above is the watchdog role of the supervisor. In most designs, the controllers must send some sort of "heartbeat" signal to the supervisor for it to understand that the managing controller is in good health. Should the heartbeat not be received in time, the supervisor will perform the necessary switchovers and reconfigurations for the system to continue working.

![](cross_strap3.png)

> [!Figure]
> _Controllers must send a heartbeat signal to the Supervisor; otherwise, a reconfiguration will take place_

As can be seen, implementing cross-strapping comes with its own set of challenges, particularly related to the management of signals between redundant components and the ownership of the control signals that define the functional paths. One significant challenge when switching or rerouting analog signals is ensuring signal integrity. In a cross-strapped system, high-speed signals must be accurately and reliably transmitted between components. Any degradation or distortion in these signals can lead to errors, reducing the system's reliability. Another challenge is the complexity and reliability of the control logic required to manage these signals. The system must be able to detect failures, reroute signals, and switch control to backup components without human intervention. This requires sophisticated and often custom-designed control algorithms, and the complexity increases with the number of redundant components and the diversity of functions they perform. There is also the challenge of synchronization between components. In systems where timing is critical, such as in data processing or communication systems, ensuring that all redundant components are perfectly synchronized is of great importance; any timing discrepancies can lead to data corruption or loss.

Also, cross-strapping creates transients in systems. Because controllers must switch between options, getting the new sensor or slave device up and running will, in general, take time, and the system might see an impact on performance because of this. For instance, in satellites, it is quite common to have cross-strapped star sensors.
When there is a switchover and a redundant sensor becomes operative, the sensor may take several seconds until it gets into "tracking" mode (i.e., being able to provide orientation). During the time the redundant sensor is not tracking, the satellite must rely on other sensors to keep its orientation, and it might drift. This is usually noticed as a glitch in the orientation telemetry.

### Triple-Modular Redundancy (TMR)

Triple Modular Redundancy (TMR) is a technique used to enhance the reliability and fault tolerance of digital systems, particularly in environments where the cost of failure is high, such as in aerospace, military, and nuclear applications. The fundamental concept of TMR is relatively straightforward: it involves replicating a critical piece of digital logic three times, hence 'triple'. Each of these three modules performs the same function independently. The outputs of the three modules are then fed into a majority-voting system. This voting system compares the outputs and selects the majority result as the correct output. In essence, if one of the modules fails and produces an erroneous output, the other two, assuming they are still functioning correctly, will override the incorrect output, ensuring the overall system continues to function correctly.

FPGAs are particularly well-suited for implementing TMR because of their reprogrammable nature and flexibility. An FPGA can be configured to contain multiple instances of the same logic circuit, which makes the replication part of TMR relatively straightforward. Additionally, the dynamic reconfiguration capabilities of FPGAs can be leveraged to implement more sophisticated TMR strategies, such as the ability to isolate and reconfigure faulty modules without disrupting the entire system. FPGA vendors provide tools to implement TMR[^109].

Implementing TMR in an FPGA, however, is not without challenges. One of the main issues is the increased resource utilization. Tripling every critical component of a system means a significant increase in the number of logic blocks, memory elements, and interconnects used, which can ramp up the consumption of the FPGA's resources. This can lead to larger, more expensive FPGAs being required, increasing the cost and power consumption of the system.

Another challenge is designing the voting mechanism. The voter must be highly reliable; just as with the watchdog, the recursive problem persists: who checks the health of the voting system? Any fault in the voting logic can lead to incorrect handling of the outputs from the three modules. The design of the voter often depends on the specific requirements of the application and the characteristics of the FPGA being used. In some cases, redundant voting logic may be employed to ensure the reliability of the voting process: multiple redundant voters can independently analyze the outputs of the redundant modules and select the most common output among themselves. Vendors offer more sophisticated alternatives to plain TMR in which they implement majority and minority voters.

Timing considerations are also critical in TMR systems. The replicated modules need to be synchronized carefully to ensure that their outputs are valid and comparable at the time of voting. This synchronization must account for the propagation delays through the different modules, which might vary due to slight differences in the FPGA's internal routing. There is also the added complexity of testing and validating a triple-modular design.
Ensuring that all three modules operate identically and that the voter correctly selects the majority output under all possible fault conditions can be a time-consuming process.

Below is a simple example of how Triple Modular Redundancy (TMR) could be implemented in Verilog. This example uses a basic digital logic function: a flip-flop. The TMR system consists of three identical flip-flops and a majority voter that decides the final output.

```Verilog
// Define a simple D flip-flop module
module DFlipFlop(input D, input clk, output reg Q);
    always @(posedge clk) begin
        Q <= D;
    end
endmodule

// Define a majority voter module: Y is the value held by at least two inputs
module MajorityVoter(input A, input B, input C, output Y);
    assign Y = (A & B) | (A & C) | (B & C);
endmodule

// Integrate the D flip-flops and the voter in a TMR system
module TMRSystem(input D, input clk, output Y);
    wire Q1, Q2, Q3;

    // Instantiate three flip-flops
    DFlipFlop FF1(D, clk, Q1);
    DFlipFlop FF2(D, clk, Q2);
    DFlipFlop FF3(D, clk, Q3);

    // Instantiate the voter
    MajorityVoter voter(Q1, Q2, Q3, Y);
endmodule
```

A testbench for this design is as follows. Fault injection is done with `force`/`release`, a simulation-only construct that hierarchically overrides the flip-flop outputs from the testbench so that the injected values cannot be overwritten by a clock edge while each check runs:

```Verilog
`timescale 1ns / 1ps

module TMRSystemTest;
    // Inputs
    reg D;
    reg clk;

    // Output
    wire Y;

    // Error bookkeeping so the final message is truthful
    integer errors = 0;

    // Instantiate the TMR System
    TMRSystem uut (
        .D(D),
        .clk(clk),
        .Y(Y)
    );

    // Clock generation (20 ns period)
    always #10 clk = ~clk;

    initial begin
        // Initialize inputs
        D = 0;
        clk = 0;

        // Test case 1: all flip-flops hold 0
        @(posedge clk); #1;
        if (Y !== 0) begin errors = errors + 1; $display("Test Case 1 Failed"); end

        // Test case 2: all flip-flops hold 1
        D = 1;
        @(posedge clk); #1;
        if (Y !== 1) begin errors = errors + 1; $display("Test Case 2 Failed"); end

        // Test case 3: two flip-flops hold 1, one holds 0 (single fault is masked)
        force uut.FF3.Q = 0; #1;
        if (Y !== 1) begin errors = errors + 1; $display("Test Case 3 Failed"); end
        release uut.FF3.Q;

        // Test case 4: two flip-flops hold 0, one holds 1 (majority is 0)
        force uut.FF1.Q = 0; force uut.FF2.Q = 0; #1;
        if (Y !== 0) begin errors = errors + 1; $display("Test Case 4 Failed"); end
        release uut.FF1.Q; release uut.FF2.Q;

        // Test case 5: one flip-flop holds 0, two hold 1
        @(posedge clk); #1; // reload all flip-flops with D = 1
        force uut.FF2.Q = 0; #1;
        if (Y !== 1) begin errors = errors + 1; $display("Test Case 5 Failed"); end
        release uut.FF2.Q;

        // Test case 6: one flip-flop holds 1, two hold 0
        D = 0;
        @(posedge clk); #1; // reload all flip-flops with D = 0
        force uut.FF2.Q = 1; #1;
        if (Y !== 0) begin errors = errors + 1; $display("Test Case 6 Failed"); end
        release uut.FF2.Q;

        if (errors == 0)
            $display("All tests passed!");
        $finish;
    end
endmodule
```

The result is shown below in GTKWave.

![Voting Logic in a triple-redundant flip-flop](image426.png)

> [!Figure]
> _Voting Logic in a triple-redundant flip-flop_

## Software-Implemented Fault Tolerance

When hardware redundancy is limited or not affordable at all, there are still some techniques we can implement in software. For instance, *temporal redundancy* can be a viable solution to deal with failures. The general idea is to execute certain parts of the application software several times on the same processing unit before comparing the results. The key points of this methodology are a limited hardware overhead and a significant time overhead. Needless to say, mitigation techniques of this kind imply modifications to the software used by the electronic system, although these modifications are not always suitable for all types of software (e.g., software that uses interrupts or dynamic memory allocation).

The term Software-Implemented Hardware Fault Tolerance (SIFT) refers to a set of techniques that allow a piece of software to detect and possibly correct faults affecting the hardware on which the software is running.
SIFT can be applied to Commercial Off-The-Shelf (COTS) processors, or to soft processors embedded in FPGAs, which either do not include any mitigation techniques for the radiation-induced faults of concern, or where not enough mitigation can be implemented in the hardware due to system requirements (e.g., power consumption or chip area occupation). SIFT provides support by implementing an active time redundancy scheme:

- The software running on the faulty hardware detects the occurrence of misbehaviors, with the possible support of an additional hardware module different from the one running the software (e.g., a watchdog timer implemented on a dedicated chip working in parallel with the processor running the SIFT-enabled software).
- Suitable actions are initiated for removing the fault from the hardware and bringing the system back to a healthy state (e.g., by rolling back the system state to a known good state previously saved).

The common denominator of all SIFT techniques is to insert into the original program code redundant instructions allowing fault detection. In our context, instructions are understood as individual instructions or groups or blocks of instructions. With transients and upsets as the types of faults considered, redundancy is obtained by selectively duplicating computations and by inserting consistency checks to detect differences among the computations. Duplication can be performed at different levels of granularity:

- Instruction-level redundancy applies to statements of the program source code and inserts consistency checks that work on the output of pairs of replicated statements.
- Task-level redundancy applies to each task composing the program, placing consistency checks that work on the output of pairs of replicated tasks.
- Application-level redundancy can be applied when the program source code is not available, as is often the case for third-party software such as special libraries or operating systems.

Software-level techniques are applied at a high level of abstraction. Consequently, they cannot determine the source of the errors; they can only notice their impact on the computation.
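As an illustration of the idea, the following C fragment sketches instruction-level duplication: each statement is executed on two independent copies of the data, and a consistency check flags any divergence. The recovery hook `fault_detected()` is a hypothetical placeholder. Note also that, in practice, such duplication is inserted by a compiler pass after optimization, since an optimizing compiler would otherwise merge the copies; the `volatile` qualifiers below discourage that in this hand-written sketch.

```C
#include <stdint.h>

/* Hypothetical recovery hook: rollback, reset, safe mode, etc. */
extern void fault_detected(void);

/* A computation duplicated at statement level. Independent copies of the
   intermediate results are kept so that a transient fault corrupting one
   copy shows up as a mismatch at the consistency check. */
int32_t scale_and_offset(int32_t x)
{
    /* volatile keeps the compiler from collapsing the duplicated copies */
    volatile int32_t a0, a1, b0, b1;

    a0 = x * 3;        /* primary copy */
    a1 = x * 3;        /* duplicated copy */

    b0 = a0 + 42;      /* primary copy */
    b1 = a1 + 42;      /* duplicated copy */

    /* Consistency check on the pair of replicated statements */
    if (b0 != b1)
        fault_detected();

    return b0;
}
```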
### Redundancy at Instruction Level

> [!Warning]
> This section is under #development

### Redundancy at Task Level

> [!Warning]
> This section is under #development

### Redundancy at Application Level

> [!Warning]
> This section is under #development

## Error Correcting Codes

Error-correcting codes (ECC), also called Forward Error Correction (FEC), are algorithms capable of detecting and/or correcting errors in data by adding some redundant data, or parity data, to the original data. When errors are both detected and corrected, the term EDAC (Error Detection and Correction) is used. This family of techniques protects the content of memory cells and data in transit: by adding redundant data, a piece of data can be recovered even when a number of errors (up to the capability of the code being used) occurs, either during transmission or in storage. Error-correcting codes are frequently used in lower-layer communication, as well as for reliable storage in media such as hard disks and RAMs, to reduce the Soft Error Rate (SER). When the original data is read, its consistency can be checked against the additional data.

Each ECC has its own characteristics in terms of fault detection and fault correction; however, they all impact the system by adding an area overhead to store the redundant data and a time overhead to compute these data and check the original data for consistency. There are two main families of ECC: block codes and convolutional codes. Convolutional codes are mainly used for data transfer such as digital video, mobile communication, and satellite communication, whereas block codes are rather used for the integrity of data storage. Consequently, the ECCs presented in this section are block codes, which can be classified into two groups according to whether they are limited to error detection or can achieve error detection and/or correction, depending on the amount of redundant data.

There is no single ECC that is the solution to every problem. Each application has its own requirements, and not every code can meet all of them. When several codes fit the conditions, the designer has to carefully examine each of them and make a choice. Some examples of applications are:

- Parity checking: slow, legacy data interfaces ([[Physical Layer#RS-232|RS232]])
- CRC: data interfaces, independent of their speed
- Hamming codes: data protection in computers (DRAM, hard drives, SCSI bus)
- Reed-Solomon: complex image transfer, data protection in computers (CD-ROM drives, and associated with the RAR compression format to rebuild missing data)
- Reed-Muller: used on special space missions
- Low-Density Parity codes: proposed for error correction in high-density memories

The table below lists the error detection and correction capabilities of some common ECC types.

| ECC | Error detection | Error correction |
| ---------------------------- | --------------- | ---------------- |
| **Parity check** | X | |
| **Cyclic Redundancy Check** | X | |
| **BCH codes** | X | X |
| **Hamming codes** | X | X |
| **Reed-Solomon codes** | X | X |
| **Low Density Parity codes** | X | X |

### Parity check

A parity bit is a bit that is added to ensure that the number of bits with the value "1" in a set of bits is even or odd. Parity bits are used as the simplest form of error-detecting code. There are two variants of parity bits: the even parity bit and the odd parity bit:

- With even parity, the parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is odd, making the number of ones in the entire set of bits (including the parity bit) even.
- With odd parity, the parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is even, keeping the number of ones in the entire set of bits (including the parity bit) odd.

In other words, an even parity bit is set to "1" if the number of 1s plus one is even, and an odd parity bit is set to "1" if the number of 1s plus one is odd. An even parity check is a special case of a Cyclic Redundancy Check (CRC), where the single-bit CRC is generated by the divisor $x+1$. Because of its simplicity, parity is used in many hardware applications where an operation can be repeated in case an error is detected, or where simply detecting the error is helpful. In asynchronous serial data transmission, a common format is 7 data bits, an even parity bit, and one or two stop bits. This format neatly accommodates all the 7-bit ASCII characters in a convenient 8-bit byte. Other formats are possible; 8 bits of data plus a parity bit can convey all 8-bit byte values.
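As a minimal sketch, the following C functions compute the parity bit by folding a byte onto itself with XOR:

```C
#include <assert.h>
#include <stdint.h>

/* Even parity over a byte: returns 1 when the number of 1-bits in `data`
   is odd, so that data plus parity bit carries an even number of ones. */
uint8_t even_parity(uint8_t data)
{
    data ^= data >> 4;  /* fold the upper nibble onto the lower nibble */
    data ^= data >> 2;
    data ^= data >> 1;
    return data & 1u;   /* XOR of all eight bits */
}

/* Odd parity is simply the complement of even parity. */
uint8_t odd_parity(uint8_t data)
{
    return even_parity(data) ^ 1u;
}

int main(void)
{
    assert(even_parity(0x00) == 0); /* zero ones: already even */
    assert(even_parity(0x07) == 1); /* three ones: parity bit completes to even */
    return 0;
}
```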
In serial communication contexts, parity is usually generated and checked by interface hardware (e.g., a UART), and, on reception, the result is made available to the CPU via a status bit in a hardware register of the interface. Recovery from the error is usually done by retransmitting the data as commanded by the CPU and its software. The parity check is a very simple ECC; it is limited to detecting an odd number of flipped bits. Note that an even number of bit-flips makes the parity bit appear correct even though the data is erroneous.

### Cyclic Redundancy Check (CRC)

A Cyclic Redundancy Check (CRC) is an error-detecting (not correcting) cyclic code and non-secure hash function designed to detect accidental changes to digital data in computer networks. It is characterized by the specification of a so-called generator polynomial, which is used as the divisor in a polynomial long division over a finite field, taking the input data as the dividend, and where the remainder becomes the result. Cyclic codes have favorable properties in that they are well suited for detecting burst errors (a burst error is a continuous sequence of data containing errors). CRCs are particularly easy to implement in hardware and are therefore commonly used in digital networks and storage devices such as hard disk drives. The table below provides some examples of commonly used CRCs and the applications they apply to.

| Name | Polynomial | Typically used in |
| ---------------- | ---------- | ----------------- |
| **CRC-1** | $x + 1$ = 0x3 | Parity check |
| **CRC-4-ITU** | $x^4 + x + 1$ = 0x13 | ITU-T G.704 standard |
| **CRC-8-CCITT** | $x^8 + x^2 + x + 1$ = 0x107 | ISDN Header Error Control |
| **CRC-16-CCITT** | $x^{16} + x^{12} + x^5 + 1$ = 0x1021 | HDLC, Bluetooth, SD memory cards |
| **CRC-32** | $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$ = 0x04C11DB7 | Ethernet, SATA, MPEG-2 |

The figure below gives an example of a CRC computation on the binary message "1101011011" using the CRC-4-ITU polynomial ("10011"). The first step is to append $n$ bits to the message, where $n$ is the order of the polynomial (the power of the highest non-zero coefficient); the order of the CRC-4-ITU polynomial is 4. Thus, the message becomes "1101011011**0000**". The following step consists of *XORing* the message and the polynomial:

![Example of a CRC computation on a binary message "1101011011"](image427.jpg)

> [!Figure]
> _Example of a CRC computation on a binary message "1101011011"_
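The division shown in the figure can be reproduced in a few lines of C. The sketch below implements a bit-by-bit CRC-4-ITU (generator $x^4 + x + 1$), assuming the plain, non-reflected form with a zero initial value used in the worked example:

```C
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit-by-bit CRC-4-ITU (polynomial x^4 + x + 1, i.e., 10011), MSB first,
   no reflection, zero initial value. XORing each incoming bit into the
   register's top is equivalent to appending four zero bits to the message
   and taking the remainder of the XOR long division. */
uint8_t crc4_itu(const uint8_t *bits, int nbits)
{
    uint8_t reg = 0;  /* 4-bit remainder register */

    for (int i = 0; i < nbits; i++) {
        /* A 1 at the top means the divisor 10011 is XORed at this position */
        uint8_t msb = ((reg >> 3) ^ bits[i]) & 1u;
        reg = (uint8_t)((reg << 1) & 0xFu);
        if (msb)
            reg ^= 0x3u;  /* low four bits of the polynomial 10011 */
    }
    return reg;
}

int main(void)
{
    /* The message from the worked example: 1101011011 */
    const uint8_t msg[10] = {1, 1, 0, 1, 0, 1, 1, 0, 1, 1};

    uint8_t crc = crc4_itu(msg, 10);
    printf("CRC-4-ITU of 1101011011 = 0x%X\n", crc);
    assert(crc == 0xE); /* 1110: remainder of 11010110110000 / 10011 */
    return 0;
}
```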
### BCH codes

BCH codes form a class of parameterized error-correcting codes that have been the subject of much academic attention in the last fifty years. BCH codes were invented in 1959 by Hocquenghem and, independently, in 1960 by Bose and Ray-Chaudhuri. The acronym BCH comprises the initials of the inventors' names. [[Fault-Tolerant Design Techniques#Error Correcting Codes#Reed-Solomon codes|Reed-Solomon codes]] are a special case of BCH codes. The principal advantage of BCH codes is the ease with which they can be decoded via an elegant algebraic method known as syndrome decoding (a highly efficient method of decoding a linear code over a noisy channel). This allows very simple electronic hardware to perform the task, obviating the need for a computer and meaning that a decoding device can be made small and low-powered. In technical terms, a BCH code is a multilevel cyclic variable-length digital error-correcting code used to correct multiple random error patterns.

### Hamming codes

Hamming codes were introduced by Richard W. Hamming in 1950. The code stemmed from his work as a theorist at Bell Telephone Laboratories in the 1940s; he invented the code in 1950 to provide error correction and reduce the waste of time and valuable computer resources. Today, "Hamming code" usually refers to the specific (7,4) code that encodes 4 bits of data into 7 bits by adding 3 parity bits; that is, it adds three check bits to every four data bits of the message. Hamming's (7,4) algorithm can correct any single-bit error or detect all single-bit and two-bit errors. In other words, the Hamming distance between the transmitted and received words must not be greater than one for the error to be correctable. This means that for transmission media where burst errors do not occur, Hamming's (7,4) code is effective (as the medium would have to be extremely noisy for 2 out of 7 bits to be flipped).

Hamming noticed the problems with flipping two or more bits and described this as the "distance" (now called the Hamming distance). Parity has a distance of 2, as any two bit-flips are not detectable. The (3,1) repetition code has a distance of 3, as at least three bits must be flipped in the same triplet to obtain another code word with no visible errors. A (4,1) repetition code (each bit is repeated four times) has a distance of 4, so flipping two bits can be detected but not corrected; when three bits flip in the same group, there can be situations where the code corrects towards the wrong code word. Hamming was interested in two problems at once: increasing the distance as much as possible, while at the same time increasing the code rate as much as possible. During the 1940s, he developed several encoding schemes that were dramatic improvements over existing codes. The key to all of his systems was to have the parity bits overlap, such that they managed to check each other as well as the data.
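As a minimal sketch, the following C code implements a Hamming (7,4) encoder and decoder using the classic layout with parity bits at codeword positions 1, 2, and 4, so that the syndrome value directly gives the position of a single flipped bit:

```C
#include <assert.h>
#include <stdint.h>

/* Encode 4 data bits into a 7-bit Hamming (7,4) codeword.
   Bit layout (LSB = position 1): p1 p2 d1 p3 d2 d3 d4 */
uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode: compute the syndrome, flip the bit it points at, extract data. */
uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s3 << 2));

    if (syndrome)                         /* nonzero: position of the error */
        cw ^= (uint8_t)(1u << (syndrome - 1));

    return (uint8_t)(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                     (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}

int main(void)
{
    uint8_t cw = hamming74_encode(0xB);          /* data 1011 */
    uint8_t corrupted = cw ^ (1u << 4);          /* flip bit at position 5 */
    assert(hamming74_decode(corrupted) == 0xB);  /* single error corrected */
    return 0;
}
```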
### SEC-DED codes

As a code that can only correct single errors is not satisfactory for many applications, SEC-DED (Single Error Correction, Double Error Detection) codes are the ones most often used in computer memories, as they can detect two errors and correct one. A plain Hamming code has a minimum distance of three: it can detect and correct a single error, but a double-bit error is indistinguishable from a different code word affected by a single-bit error, so double-bit errors can be detected but not corrected reliably. The Hamming code can be converted into a SEC-DED code by including an extra overall parity bit, which increases the minimum distance to 4. This gives the code the ability to detect and correct a single error and, at the same time, detect (but not correct) a double error. Alternatively, the same code can be used to detect up to three errors without correcting any.

### Reed-Solomon codes

Reed-Solomon (RS) codes are non-binary cyclic error-correcting codes invented by Reed and Solomon, who described a systematic way of building codes that can detect and correct multiple random errors. By adding $t$ check symbols to the data, an RS code can detect any combination of up to $t$ erroneous symbols and correct up to $t/2$ symbols. Furthermore, RS codes are suitable as multiple-burst bit-error-correcting codes, since a sequence of $b+1$ consecutive bit errors can affect at most two symbols of size $b$. The choice of $t$ is up to the designer of the code and can be selected within wide limits.

In Reed-Solomon coding, source symbols are viewed as coefficients of a polynomial $p(x)$ over a finite field. The original idea was to create $n$ code symbols from $k$ source symbols by oversampling $p(x)$ at $n > k$ distinct points, transmit the sampled points, and use interpolation techniques at the receiver to recover the original message. That is not how RS codes are used today. Instead, RS codes are viewed as cyclic BCH codes, where encoding symbols are derived from the coefficients of a polynomial constructed by multiplying $p(x)$ by a cyclic generator polynomial. This gives rise to an efficient decoding algorithm, which was discovered by Elwyn Berlekamp and James Massey and is known as the Berlekamp-Massey decoding algorithm.

Reed-Solomon codes are used in many different applications, from consumer electronics to satellite communication. They are prominently utilized in data transmission technologies such as DSL and WiMAX, and in broadcast systems such as DVB and ATSC. RS codes are also well known for their role in encoding pictures of Saturn and Neptune during the Voyager space missions^[https://destevez.net/2021/12/voyager-1-and-reed-solomon/]. RS codes are extensively documented in several technical documents from NASA^[https://ntrs.nasa.gov/api/citations/19900019023/downloads/19900019023.pdf].

### Arithmetic codes

Arithmetic codes are very useful when it is desired to check arithmetic operations such as additions, multiplications, and divisions. The data presented to the arithmetic operation is encoded before the operation is performed on both the original and the encoded operands in parallel. After completing the arithmetic operations, the resulting code words are checked to make sure that they are valid; if they are not, an error condition exists. Arithmetic codes are interesting for checking arithmetic operations because they are preserved under such operations. Indeed, they have the following property: $A(a * b) = A(a) * A(b)$, where $a$ and $b$ are operands, $A(x)$ is the arithmetic code of $x$, and $*$ is an operation such as addition, multiplication, or division.

Among the arithmetic codes, the so-called *separable codes* are the most practical. They are obtained by associating a check part, issued from a suitable generator, with an information part. The arithmetic operation is performed separately on both the original and the coded operands, and a comparison of the results allows detecting potential errors. The most common arithmetic codes are *residue codes*, defined by $R(N) = N \bmod m$. The figure below depicts an arithmetic function using an arithmetic code for error detection.

![An arithmetic function using an arithmetic code as an error detection mechanism](image428.png)

> [!Figure]
> _An arithmetic function using an arithmetic code as an error detection mechanism_

These codes have a specific application in designing arithmetic units that are self-checking. Arithmetic codes have, though, limited use in reliability measures such as SEU protection, since the area overhead applies to registers and to the combinatorial part, and they are not applicable for logic function protection.
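To make the residue-check idea concrete, here is a tiny C sketch of a separable residue check applied to an addition; the modulus (3) and the operand values are arbitrary illustrations:

```C
#include <stdint.h>
#include <stdio.h>

/* Residue code R(N) = N mod 3 */
static uint32_t residue3(uint32_t n) { return n % 3; }

int main(void)
{
    uint32_t a = 41, b = 17;

    uint32_t sum = a + b;                              /* main datapath   */
    uint32_t check = (residue3(a) + residue3(b)) % 3;  /* check datapath  */

    /* The residue of the result must match the residue computed from the
       operands' residues; a mismatch signals an arithmetic fault. */
    if (residue3(sum) != check)
        printf("arithmetic error detected\n");
    else
        printf("result %u passes the residue check\n", sum);
    return 0;
}
```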
### Low-Density Parity Codes

This type of code has been proposed to correct errors in high-density memory blocks implemented in deep submicron technologies, which can exhibit a high fault rate. Their high error-correction capability, when compared to other ECC codes like Hamming, can be adjusted to find a compromise between operation speed and power consumption. An advantage of LDPC codes over other high-correction-capability codes such as Reed-Solomon or BCH is their simpler algebraic principles, which make them easier to implement in programmable logic and less demanding in computation time, and therefore suitable as an error-correction strategy for high-speed memories.

## Failure Detection, Isolation, and Correction

Can a digital system diagnose itself and take action to correct an abnormal situation? To some extent, it can. It is quite typical in mission-critical designs to add software modules that monitor software. Yes, that's correct: software that looks after software, all running on the same CPU. This may sound a bit off at first; the supervising and the supervised software are in the same boat, so what is the value of having this in the first place? This technique assumes that the fault-tolerance strategy, as the layered activity it surely is, ensures that the more critical levels of the fault-handling capabilities are managed outside of the software; the failure detection routine is therefore assigned the task of supervising less critical things, or elements that are outside of the software. These software modules are meant to detect faults, isolate them (as in, avoid their spreading), and recover from them. A recovery requires the software to execute an action. Equipping digital systems with decision-making power during emergencies does not come without challenges. If anything, machines can be equipped with failure isolation capabilities which, at least, will minimize the probability of the situation worsening.

> [!warning]
> This section is under #development

### False Alarms

Systems will, sooner or later, generate telemetry values that are off-nominal, i.e., alarms. Because telemetry is basically a set of measurements, and measurements can be affected by a variety of factors, operators have to continuously discern true alarms from false alarms.

Shortly after 8 a.m. local time on January 13th, 2018, Hawaii residents received an emergency cell phone alert with an alarming message: "BALLISTIC MISSILE THREAT INBOUND TO HAWAII. SEEK IMMEDIATE SHELTER. THIS IS NOT A DRILL." The message, reportedly sent by the Hawaii Emergency Management Agency in error, would turn out to be a false alarm, officials said. Nevertheless, it would take 38 minutes for authorities to clear up the mistake with a follow-up alert.

On 26 September 1983, during the Cold War, the Soviet Union's nuclear early-warning system reported the launch of one intercontinental ballistic missile, with four more missiles behind it, from bases in the United States. Stanislav Petrov, an officer of the Soviet Air Defense Forces on duty at the command center of the early-warning system, decided to wait for corroborating evidence—of which none arrived. His decision is seen as having prevented a retaliatory nuclear attack against the United States and its allies, which would likely have resulted in an escalation to a full-scale nuclear war. Investigation of the satellite warning system later determined that the system had indeed malfunctioned.

False alarms start off as just an alarm. It requires a cognitive process to assess the context and the conditions around the alarm to flag it as either genuine or false. Under pressure, evaluating the authenticity of an alarm can be tricky: an inbound ICBM alarm today would be a different thing, cognitively speaking, from an ICBM alarm in 2018 or 1983.
Similarly to the tale of [the boy who cried wolf](https://en.wikipedia.org/wiki/The_Boy_Who_Cried_Wolf), when we are alarmed by unrealistic dangers until the cry grows stale and threadbare, we grow incapable of knowing when to guard ourselves against real ones. But the opposite can also happen: disregarding true, genuine alarms. Interacting with technical artifacts and systems requires understanding their status through measurements, symbols, and cues that can malfunction and mislead us, and whose interpretation heavily depends on our psychological biases. And we—fallible humans—may easily ignore obvious, even accurate signs about what's going on, due to cognitive pressures, lack of discipline, or both. At times, with tragic consequences: the pilots of flight AF447^[https://bea.aero/docspa/2009/f-cp090601.en/pdf/f-cp090601.en.pdf] ignored repeated alarms indicating a stall was approaching, although the alarm was a recorded voice loudly saying "stall!". The pilots of LAPA flight 3142^[https://en.wikipedia.org/wiki/LAPA_Flight_3142] ignored an alarm repeatedly stating there was a problem with the take-off configuration—the flaps were not in the correct position.

## Hardware Protections

One can also add protective circuitry to robustize systems; for instance, Latching Current Limiters (LCLs), which are active overload protections for power lines, typically used in mission-critical systems. These devices are placed at the power input of a subsystem. Their generic role is to provide overload protection without generating dangerous voltage transients. In applications sensitive to [[Reliability Assessment Methods#Radiation|radiation]], and in particular sensitive to Single-Event Latch-up (SEL), they are critical for detecting the phenomenon and rapidly recovering from it by switching off the power supply before devices get permanently damaged.

An LCL is based on a power [[Semiconductors#MOSFETs|MOSFET]] which is saturated during the ON condition, open during the OFF condition, and in linear mode during limitation. A low-ohmic sense resistor measures the input current; the small voltage observed across the resistor is then amplified to drive the power MOSFET. The reaction time of the limiter is as short as possible (<10 µs). Whenever the overload limitation is reached, the power MOSFET is switched off (see figure below).

![A block diagram of a latch current limiter (LCL)](image429.png)

> [!Figure]
> _A block diagram of a latching current limiter (LCL)_

Another interesting feature shown in the figure above is the ON/OFF command (CMD). This signal can be generated by a supervising entity to power on/off the circuit or system protected by the LCL. Operating the power MOSFET in linear mode during the subsystem's start-up allows limiting inrush current spikes. A current limiter is a critical reliability element for the system, and it is therefore important to pay special attention during the selection of its parts to ensure that they meet the specified radiation immunity for the mission. When setting the current threshold, the designer must take into account the supply-current increase caused by the total ionizing dose (TID). There is also a non-latching current-limiting variant in which the current limit decreases with the output voltage; these are called Foldback Current Limiters (FCLs).
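To summarize the behavior just described, the following C sketch models the state logic of an LCL (current limiting above a threshold, then latch-off if the overload persists beyond a trip time); the threshold and trip time are arbitrary illustrative values, not parameters of any specific device:

```C
#include <stdbool.h>

/* Illustrative behavioral model of a latching current limiter (LCL).
   All values are arbitrary examples. */
#define I_LIMIT_A   2.0    /* current limitation threshold [A] */
#define TRIP_TIME_S 0.001  /* overload duration tolerated before latching off [s] */

typedef struct {
    bool   latched_off;    /* MOSFET commanded open */
    double overload_time;  /* accumulated time spent in limitation [s] */
} lcl_state;

/* Called at every simulation step of length dt with the demanded current.
   Returns the current actually delivered to the load. */
double lcl_step(lcl_state *s, double i_demand, double dt, bool cmd_on)
{
    if (!cmd_on || s->latched_off)
        return 0.0;                 /* OFF condition: MOSFET open */

    if (i_demand <= I_LIMIT_A) {
        s->overload_time = 0.0;     /* nominal operation: MOSFET saturated */
        return i_demand;
    }

    /* Limitation: MOSFET in linear mode, current clamped at the threshold */
    s->overload_time += dt;
    if (s->overload_time >= TRIP_TIME_S)
        s->latched_off = true;      /* latch off until externally reset */
    return I_LIMIT_A;
}
```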
[^107]: https://www.gaisler.com/index.php/products/processors/leon3ft
[^108]: https://www.esa.int/Enabling_Support/Space_Engineering_Technology/Onboard_Computers_and_Data_Handling/Microprocessors
[^109]: One example is TMRTool from Xilinx (AMD): https://www.xilinx.com/content/dam/xilinx/support/documents/user_guides/ug156-tmrtool.pdf
[^110]: Xilinx, Application Note 138: Virtex FPGA Series Configuration and Readback, 2005.
[^111]: Xilinx, Application Note 151: Virtex Series Configuration Architecture User Guide, 2004.