Fault Tolerance and Security (Start Here)

# Fault Tolerance and Security

As if the challenges of transmitting signals through imperfect channels were not enough, we also need to consider that systems may fail at any time. Because systems are complex aggregations of many different parts, they acquire new failure mechanisms as they grow in complexity. Knowing those failure mechanisms and preparing the system to withstand them should they happen (and we must assume they will happen) is part of the design process.

In everyday language, the terms fault, failure, defect, bug, and error are used interchangeably. In fault-tolerant systems parlance, however, they have distinct meanings. Mind you, the exact definitions vary from one source to another. The most sensible definitions I have found are:

1. Errors. An error is a human mistake made when a designer develops or interprets the system design. Errors are typically due to misunderstood requirements or incorrectly implemented logic. They are introduced during development and can manifest in various forms, including logic flaws and incorrect assumptions. They can occur anywhere in the development cycle, including the requirements phase, and can propagate through the subsequent phases.
2. Defects. Defects, on the other hand, are the result of unanticipated interactions or behaviors that occur when the system executes its functions. They often stem from mistakes in design or from unexpected conditions that weren’t adequately accounted for. A defect is an unintended consequence or oversight that makes the system behave differently from its intended design.
3. Bugs. Bug is a more colloquial term, and bugs are often seen as a subset of defects, primarily when referring to issues discovered in a live environment. In other words, bugs are defects that have slipped past the testing phase and made their way into the final product. They are often more critical and require immediate attention. However, developers often use the word bug for any error or defect that occurs anywhere in the development cycle, including before the software reaches production.
4. Faults. A fault can be either a hardware defect or a software/programming mistake. Generally speaking, every failure traces back to a fault, but not every fault results in a failure.

Both faults and errors can spread through the system. For example, if a chip shorts power to ground, it may cause nearby chips to fail as well. Errors can spread when the output of one unit is used as input by other units. The highly coupled nature of the digital systems we design makes them vulnerable to [[Common-Mode Failure|common-mode failure]], so we need to design our architectures with mechanisms not only to detect faults but also to prevent them from spreading to other parts of the hierarchy.

Malfunction is undeniably linked with quality. We instinctively associate defects and errors with low quality, so the work we do to equip our systems with fault management capabilities affects how customers perceive our products. We discuss quality at length in a [[The Quality of Quality|dedicated section]].

==Ultimately, fault tolerance is about knowing and lowering the vulnerability of our designs.==

- [[Why Things Don't Fail More Often?]]
- [[Failures vs Near-Misses]]
- [[Dependability, Reliability, and Availability]]
- [[Risk and Uncertainty]]
- [[Reliability Assessment Methods]]
- [[Software Bugs, Glitches, and The Big Lie Behind Unit Testing]]
- [[Common-Mode Failure]]
- [[Failure Mode Analysis Methods]]
- [[Fault-Tolerant Design Techniques]]
- [[Security]]
- [[site/Resources/Fault Tolerance and Security/References|References]]