# Failures vs Near-Misses

Failure seldom happens instantaneously, except in the most catastrophic cases. Failure is, more generally, a trajectory: a sequence of steps toward an outcome that represents some type of loss. If the path to loss is not completed, a potential failure turns into a near miss. A near miss is a situation in which a loss-inducing event *almost* occurs but does not result in any harm, injury, or damage. It can be described as a "close call" or a narrowly averted accident. In other words, a near miss is a situation where the potential for disaster fully develops, but by chance or timely intervention the negative outcome is prevented.

A video was circulating on YouTube years ago[^93] showing a group of people gathering at a distance from an old clothing factory in the Czech Republic to watch it get demolished. Just as the charges are detonated, a stray chunk of masonry flies from the collapsing building, narrowly missing the group of boom-gawkers. Watching the slow-motion replay, it is obvious that had the chunk of concrete hit the onlookers, it would have killed them instantly. Luckily for these folks, it missed. Barely. See the video below.

![](https://www.youtube.com/watch?v=ztP4cDdy83o)

On the flip side, cellist Mike Edwards, 62, an early member of the Electric Light Orchestra, drives his van along the A381 in Halwell, Devon, United Kingdom, while, some hundreds of meters up a hill, a 600 kg bale of hay falls from a tractor and starts rolling downhill. The timing between Mr. Edwards' van and the hay couldn't have been more perfect (or imperfect, depending on how you want to see it), and the massive bale crushes his van, killing him instantly. For Mike Edwards, there was no near miss. He won the anti-lottery. Had Mr. Edwards had another slice of toast at breakfast or stopped to tie his shoelaces on the way out, he could've been telling this story over a pint in a pub.

Failure events are a combination of deterministic and stochastic processes. The deterministic side is in our control, and through ignorance or negligence we may open a path into the stochastic side, which we do not control. And once things are out of our control, it's a matter of probabilities (luck). The person witnessing the demolition could've stayed home watching a movie, but he decided to go to a place where explosive charges would be used to bring a massive building down. Still on the deterministic side of things, he (and others) purposely decided to stand at an insufficient distance from the event. He consciously put himself in the hands of probabilities. The fact that his head is still in one piece is the stochastic—read, lucky—combination of his position and the dynamics of the collapsing building, which created the piece of debris that could have decapitated him but didn't.

Near misses are puzzling because life appears to continue just as if nothing had happened. From a certain perspective, nothing has in fact happened: there are no injuries, no insurance claims, no blood, no hospitals or funeral homes involved. Still, all the mechanics of an accident were at play. Risks peaked, and it was mere chance at the end of the chain of events that determined whether we are talking about people getting hurt and property being damaged, or about nothing at all. Near misses are real-life, full-scale, overly realistic simulacra of accidents. Near misses are—or at least should be—transformative if you happen to witness one.
They broadcast a sense of fragility and show how the line between normal life and disaster is thin and redrawn all the time. Near misses force us to reflect on how lucky we were this time, and on how much more cautious and attentive we must be in the future about the things we can deterministically control if we want to prevent a repeat. A very wrong interpretation is to read near misses as some sort of "strength", a superpower, or a "guardian angel" making sure nothing would happen. Engineering is about probabilities, not about mythical winged creatures.

>[!attention]
>Every failure requires a driving action and a path for the loss to materialize. Failures also need time to build up.

Think about football for a moment, where team 1 plays against team 2. Any goal scored by team 1 represents a failure for team 2, and vice versa. From this perspective, the forcing action is the offensive pressure one team exerts on the other; team 1 continuously pushes team 2 to make a mistake (fail) until it eventually happens. For the defending team, preventing failure means adding as many barriers as possible, creating enough obstacles for the pressure the attacking team is exerting. In sports, offensive and defensive roles can easily swap.

When we operate complex systems, we can only adopt defensive roles against failure. It is the external environment that pushes the system to fail, and we cannot take any offensive action against the environment hoping to bring this forcing action to a stop: an aircraft cannot make the laws of physics that govern its flight any "nicer" to its functioning, nor can a satellite do anything to prevent high-energy particles from hitting its chips on board. We, as designers, can prepare for the effects, but we cannot prevent the causes. In short, failure management is a game where we can only play defense. Failures insist on happening, and we need to have a plan for that; our design needs to take that into account.

When we operate complex systems, we rely on measurements to construct an understanding for assessing potential faults and failures and for deciding courses of action. But what if the measurements were wrong? Then our understanding would be flawed, detached from reality and tainting our decisions, eventually opening an opportunity to make things worse and letting failure forces align and find a trajectory toward a loss.

Errors and failures are troublesome. They can cause tragic accidents, destroy value, waste resources, and damage reputations. Operating and using complex systems involves an intricate interaction between complex technical systems and humans. Reliability can only be understood as the holistic sum of reliable design of the technical systems and reliable operation of those systems by human operators. Any partial approach to reliability will be insufficient.

Organizations develop procedures to protect themselves from failure. But even highly reliable organizations are not immune to disaster, and prolonged periods of safe operation are punctuated by occasional major failures. Scholars of safety science label this the "paradox of almost totally safe systems", noting that systems that are very safe under normal conditions may be vulnerable under unusual ones. Organizations must therefore put in place different defenses to protect themselves from failure.
Such defenses can take the form of automatic safeguards on the technical side, or procedures and well-documented, respected processes on the human side. The idea that all organizational defenses have holes, and that accidents occur when these holes line up, often following a triggering event—the Swiss cheese[^94] model of failure—is well known.

In the Swiss cheese model, an organization's defenses against failure are modeled as a series of barriers, represented as slices of cheese. The holes in the slices represent weaknesses in individual parts of the system and continually vary in size and position across the slices. The system produces a failure when the holes in the slices momentarily align, permitting a trajectory of accident opportunity in which a hazard passes through all the barriers, leading to a loss.
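To make the "holes lining up" intuition concrete, here is a minimal sketch, assuming, purely for illustration, that each defensive layer independently lets a hazard through with some small probability. Under that assumption the chance of a complete breach is the product of the per-layer probabilities, which is why stacking even modest barriers pays off; the function name and the numbers are invented for the example.

```python
from math import prod

def breach_probability(layer_hole_probs):
    """Chance that a hazard finds an aligned hole in every layer,
    assuming (idealistically) that the layers fail independently."""
    return prod(layer_hole_probs)

# One barrier that stops 90% of hazards vs. three such barriers stacked.
print(breach_probability([0.1]))            # 0.1
print(breach_probability([0.1, 0.1, 0.1]))  # ~0.001
```

The caveat is the independence assumption itself: in real organizations the holes are often correlated (the same budget cut or schedule pressure thins several slices at once), so the actual breach probability can be far higher than this idealized product suggests.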
There is a "classic" view of automation which holds that its main purpose is to replace human manual control, planning, and problem-solving with automatic devices and computers. However, even highly automated systems, such as electric power networks, need human beings for supervision, adjustment, maintenance, expansion, and improvement. One can therefore draw the paradoxical conclusion that automated systems are still human-machine systems, for which both technical and human factors matter.

A considerable body of research on human factors in engineering highlights the irony that the more advanced a control system is, the more critical the contribution of the human operator becomes. It is sometimes implied that the operator's role is merely to monitor numbers and call an "expert" when something drifts outside safe boundaries. This is not true. Operators gain critical knowledge by working in real conditions, knowledge that no designer can gain during the development stage, and it must be captured and fed back to the designers continuously.

Automation never comes without challenges. An operator overseeing a highly automated system may gradually lose the ability to react manually when the system requires manual intervention. Because of this, a human operator of a complex system needs to remain well trained in manual operations in case the automation disengages for whatever reason^[https://pubsonline.informs.org/doi/full/10.1287/orsc.2017.1138]. If the operator blindly relies on automation and the automation stops working, the probability of making the situation even worse increases considerably. The right approach is to have human operators focus on decision-making rather than on systems management.

In summary:

- Reliability can only be achieved holistically, by combining reliably designed technical systems with reliable operations through human-machine interfaces that respect the particularities of such systems.
- All defenses against failure may have flaws. It is essential to prevent those flaws (holes in the cheese) from aligning and providing a trajectory to failure.
- Automation is beneficial for optimizing data-driven, repetitive tasks and for reducing human error, but running things manually must always remain an option, and operators must stay trained in how to do so.

# Byzantine Faults

A Byzantine fault refers to a failure mode in distributed systems where a component, such as a server or a node, behaves arbitrarily or maliciously. Unlike simple faults such as crashes or data loss, Byzantine faults can include lying, sending conflicting information to different parts of the system, or pretending to function correctly while sabotaging operations.

The term originates from the "Byzantine Generals Problem", a thought experiment introduced by Leslie Lamport, Robert Shostak, and Marshall Pease in the 1980s. In the scenario, a group of generals must agree on a common battle plan, but some of them may be traitors trying to prevent consensus by sending conflicting messages. The challenge is to find an algorithm that allows the loyal generals to reach agreement despite the presence of traitors.

In computer systems, this models the situation where some nodes might be compromised or malfunction in unpredictable ways, including colluding with each other. Traditional fault-tolerant systems assume crash faults or omission faults, where a node simply stops responding or drops messages. Byzantine faults are much harder to handle because the faulty node may appear to be working and may selectively provide incorrect data.

Solving the Byzantine fault tolerance problem involves building protocols that can reach reliable consensus even when some nodes are acting dishonestly. A classic result is that, in an asynchronous system, to tolerate _f_ Byzantine faults you need at least _3f + 1_ nodes in total. This threshold ensures that the honest majority can outvote the bad actors. Byzantine fault tolerance is foundational in the design of secure distributed systems and of systems where nodes may become erratic due to radiation or other harsh conditions.
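The arithmetic behind the _3f + 1_ bound is compact enough to sketch. The snippet below is an illustrative toy, not a consensus protocol; the function names and the reply-counting rule are assumptions made for this example, showing why a quorum of 2f + 1 nodes out of 3f + 1 always contains at least f + 1 honest ones, enough to outvote any f liars.

```python
from collections import Counter

def min_cluster_size(f):
    """Minimum number of nodes needed to tolerate f Byzantine nodes: 3f + 1."""
    return 3 * f + 1

def quorum_size(f):
    """A quorum of 2f + 1 out of 3f + 1 nodes: any two such quorums overlap
    in at least f + 1 nodes, so they always share at least one honest node."""
    return 2 * f + 1

def accept(replies, f):
    """Accept a value only if at least f + 1 nodes reported it identically;
    with at most f liars, at least one honest node must be among them."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None

f = 1                                            # budget: one arbitrarily faulty node
print(min_cluster_size(f))                       # 4
print(quorum_size(f))                            # 3
print(accept(["commit", "commit", "abort"], f))  # 'commit' (2 >= f + 1)
print(accept(["commit", "abort", "retry"], f))   # None: no value is credible yet
```

Real protocols in the PBFT family layer cryptographic signatures, multiple message rounds, and view changes on top of this counting argument; the sketch only shows where the population and quorum sizes come from.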
Byzantine faults can appear in distributed computing systems whenever individual nodes or communication links begin to behave unpredictably, inconsistently, or deceptively. These faults are especially insidious because they do not follow a simple failure pattern like a crash or a timeout; instead, they involve behaviors that can actively undermine the system's correctness or consensus.

One common cause is software bugs or state corruption. A node suffering memory corruption due to hardware issues, such as a cosmic-ray-induced bit flip in a high-altitude server or in a spacecraft computer, may start making invalid decisions or sending contradictory information. Since it is still "alive" and producing output, the rest of the system may believe it is functioning normally unless it carefully checks for consistency.

Malicious attacks are another major source. In adversarial environments—such as permissionless peer-to-peer networks, critical infrastructure, or military systems—a compromised node might be under the control of an attacker. The node might intentionally lie to different peers, forging timestamps, fabricating transaction logs, or selectively acknowledging some messages and not others. If the rest of the system doesn't have strong fault-tolerance mechanisms, it may be unable to detect or correct these inconsistencies.

Misconfiguration can also cause Byzantine-like behavior. Imagine a distributed database cluster where one replica is configured with a different schema or data model. It might still respond to queries or participate in consensus rounds, but with subtly incompatible logic. Similarly, software version mismatches can lead nodes to interpret the same message differently, producing divergent state updates.

Network-induced faults can also be Byzantine in nature when combined with certain application-level bugs. For instance, if the network delays messages or delivers them out of order, and some nodes fail to handle this properly, they may end up in inconsistent states and generate conflicting outputs. If these outputs propagate, the node can appear to be actively misbehaving. Concurrency issues—especially in multithreaded distributed components—can lead to race conditions that cause nondeterministic state transitions. One node might enter an invalid state and behave incorrectly from then on, again without visibly failing.

In practical systems, developers attempt to guard against Byzantine faults using redundancy, cryptographic signatures, quorum-based voting, and careful auditing of node behavior. Byzantine faults can emerge from hardware, software, human error, or adversarial behavior, and they require systems to treat every participant as potentially hostile or unreliable unless proven otherwise. This is why designing for Byzantine fault tolerance is so complex, so costly, and so essential in high-stakes distributed systems.

[^93]: https://www.youtube.com/watch?v=ztP4cDdy83o&ab_channel=Yanz
[^94]: https://en.wikipedia.org/wiki/Swiss_cheese_model#Failure_domains