# Why Don't Things Fail More Often?

There is no shortage of literature about how systems fail. This section alone is about analyzing failures. There are accident reports; there are detailed forensic analyses of famous disasters involving complex systems such as nuclear reactors, ships, rockets, and aircraft. There are millions of words written about why failures happen and how to prevent them. There is a science of failure. There isn't much written, though, on why systems work.

And yes, I am aware of the initial absurdity of the question. It is like asking medical science to write about "why are people healthy?". Granted, it makes more sense to write about diseases—the off-nominal situations—than about the normal scenarios. In the same way, my wife does not remember when I have done things right but vividly recalls every time I have screwed up. Unlike living organisms—including wives with very good memory—and medicine, the technical artifacts we make are not the result of selfless evolution and ethics; on the contrary, they are subject to market forces, production imperfections, and other unholy factors moved by selfish principles all the way from the ideation stage. Asking why artificial things work is not so absurd, as we shall see.

---

When I got my first job back in 2002, as an "all-rounder" in a now-extinct company that made automation and access control systems, I was fortunate to witness the full life cycle of a rather complex electronic board in charge of controlling personnel access to those old phone service cabinets that used to be spread around cities. As a young engineer wannabe, I managed to see this device go from a scribble on a whiteboard all the way to block diagrams, schematic drawings, embedded software, prototypes, and—finally—the actual thing sitting on a table. After the team fixed some initial mistakes here and there, the software was finally flashed onto the board and an LED started blinking, indicating "self-check good". Right then, a strange feeling invaded me, as if something magical had just happened. Sure, the board was not a Boeing 737 in terms of complexity, but it had its own density, with parts and components coming from different vendors across the world, from the USA to Asia, many different interfaces and protocols, and millions of lines of software dictating the behavior of the whole thing.

I had heard and witnessed all the heated meetings where different engineers argued about how they would have designed or implemented some part one way or another. I had seen the budgets fluctuate, expand, and shrink. And yet, it worked. And it didn't only work in the "well-behaved" lab benchtop environment; it also worked in the field: the company managed to install this new controller in hundreds, if not thousands, of phone service cabinets across the city.

A simple question has stayed in my head ever since, one that has been chasing me for more than two decades: how did that ever work? The same question assaults me every time I am put in front of any complex system that involves a significant number of components. How on earth do systems with hundreds, thousands, or millions of parts manage to turn on and run? I feel that, as engineers, we get too used to the fact that things work, and we do not stop to appreciate how the probabilities are mysteriously playing on our side. What is more, we tend to feed our egos with what is ultimately a fair dose of luck.
I somehow refuse to become unemotionally accustomed to seeing complex things work, although I can nervously enjoy it, as I just won't stop thinking about the underlying probabilities. I mean, sure, I know why systems are supposed to work in the design domain, where everything is ideal and perfect, where I control most of the variables, and where everything abides by deterministic physics and circuit theory laws. But complex systems working in production is a different story.

Think about it. Myriads of components, all with their own life spans, their variances, their bathtub curves, their intrinsic design imperfections, tolerances, bugs, and stochastic noises. A Boeing 737—with over 10,000 units produced worldwide—has an average of 600,000 (six hundred thousand) parts[^86][^87]; the Airbus A380 has around 4 million[^88]. If we can safely assume that no perfect system exists in the world, we can say the same for any subsystem, any sub-subsystem, and so on down to the most elementary part. Nothing is perfect. So, how can a reliable function emerge from a collection of imperfect elements hooked together?

I am not the only one asking this _a priori_ silly question. In a famous blog post[^89] and a video[^90], Richard I. Cook (MD), a systems researcher from the University of Chicago, makes the same observation:

>[!cite]
>_The surprise is not that there are so many accidents. The surprise is that there are so few._
>Richard I. Cook, MD.

![](https://www.youtube.com/watch?v=2S0k12uZR14&ab_channel=O%27Reilly)

The real world often offers surprises: the world is not well-behaved. Even so, a lot of operational settings achieve success. Because of our designs, or despite them? Dr. Cook goes on to define a divide between what he calls the system-as-imagined and the system-as-found: in short, the schism between how we imagine—i.e., design—things and how the actual instances of our designs evolve out in real settings.

When we talk about why systems do or do not work, we can't leave human error out of the equation. Although we live in an "age of autonomy" of sorts, we humans remain a critical piece of the puzzle. Therefore, human error is very relevant when it comes to observing complex system failures. For example, in the early days of flight, approximately 80 percent of accidents were caused by the machine and 20 percent by human error. Today that statistic has reversed: approximately 80 percent of airplane accidents are due to human error (pilots, air traffic controllers, mechanics, etc.) and 20 percent are due to machine failures^[https://www.boeing.com/commercial/aeromagazine/articles/qtr_2_07/article_03_2.html]. These statistics open a few more questions:

- Have we humans become sloppier operators over time?
- Have systems become more complicated to operate?
- Have systems become intrinsically more reliable?

Nothing indicates that we have become especially worse as operators, but systems have indeed become more complicated to operate, and there has certainly been an intrinsic increase in reliability thanks to better materials, better tools, and better software.

As said, no system in the history of humanity has been made perfect. There is always a non-zero number of flaws in every system ever made, or yet to be made. Every single plane you and I have boarded or will ever board has a non-zero number of software bugs. The same goes for any of the several hundred nuclear reactors currently operating around the world. The question is how capable those unavoidable flaws are of bringing the whole system down at once.
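To get a feel for the probabilities at play, here is a minimal back-of-the-envelope sketch in Python. It is a deliberate oversimplification, not a real reliability analysis: the failure model (independent parts, a strict series chain, a naive duplex-redundancy variant) and the per-part reliability are numbers made up for illustration; only the 600,000-part count echoes the figures cited above.

```python
# Back-of-the-envelope only: independent failures, a strict series
# chain ("every part must work"), and a naive duplex-redundancy model.
# The per-part reliability below is an invented, illustrative number.

def series_reliability(r_part: float, n_parts: int) -> float:
    """Probability that a chain of n_parts works when all of them must work."""
    return r_part ** n_parts

def duplex_reliability(r_part: float, n_pairs: int) -> float:
    """Each function is backed by a redundant pair; a pair fails only
    if both of its parts fail at the same time."""
    r_pair = 1.0 - (1.0 - r_part) ** 2
    return r_pair ** n_pairs

if __name__ == "__main__":
    r = 0.999999      # assume each part works 99.9999% of the time
    n = 600_000       # roughly a 737's part count, per the figures above

    print(f"strict series chain : {series_reliability(r, n):.4f}")   # ~0.55
    print(f"duplex redundancy   : {duplex_reliability(r, n):.6f}")   # ~0.999999
```

Even with parts that individually work 99.9999% of the time, a chain in which every single part must work succeeds only about 55% of the time, barely better than a coin toss, while the crude redundant arrangement is essentially certain to work. The reliability we actually observe only makes sense because flaws are masked, tolerated, or contained rather than allowed to propagate.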
All the systems we use, fly on, and operate show errors at almost any given time. Every existing system out there is experiencing, at this precise moment, checksum mismatches, data corruption, bit flips, material cracks, and bolts getting infinitesimally looser by the second. Be they planes, cars, nuclear reactors, or hair dryers. Some of those events may never get reported, perhaps not even noticed or logged, and may be obliviously corrected by routine maintenance. As for the external factors that can affect the system as a whole and bring it to shambles, we luckily tend to learn: pilots at some point realized that flying through storms is a *no-no*; transatlantic ship captains noted that icebergs are worth keeping an eye on.

One can criticize complexity *ad nauseam*. "Keep it simple, stupid", "less is more", and other platitudes are always mentioned. All good, but complexity happens. Complexity is something you can't fully avoid. You read everywhere that complexity is the enemy of reliability. But is it? Take the Bleriot XI (illustrated in the figure below), a systemically very simple artifact compared to, say, a Boeing 767. But is the Bleriot more reliable? You may think we are comparing apples and oranges, correctly pointing out that the difference in stakes is abysmal: a crash of a Bleriot can kill far fewer people than a crash of a 767. Ok, let me fix that a bit, at least in terms of potential casualties: let's take the cargo version of the B767, which carries a crew of 2, and compare it with the 2-seat version of the Bleriot. Again, which one is more complex? That's an easy one. But which one is more reliable? Which one would you choose for a flight on a windy and foggy night? Similarly, a network with many nodes is harder to bring down than a network with only one node[^91].

![The Bleriot XI, practically a flying bicycle](image406.jpg)

> [!Figure]
> _The Bleriot XI, practically a flying bicycle_

Coming back to Dr. Cook's valid question of why we don't see more accidents, I would rephrase it a bit differently. Systems are abundant in accidents, although very small ones. Systems coexist with imperfections and errors. They continuously sustain an organic amount of small internal mishaps and "benign" off-nominal behaviors they manage to live with. As we equip the artifacts we make with more parts and components, the path to systemic, global failure gets longer, more intricate, and combinatorially less likely. We also incorporate the lessons from the big, loud failures of others so they don't happen to us: every time a complex system fails badly, we all take notes. This loop has allowed for fewer *Titanics* and *Chernobyls*.

Is it that things work because they are complex, then? Is complexity a valid protection against failure? No. Here comes a third dimension, one that should temper the idea that over-engineering is the key to protecting yourself from a cataclysm. More complexity exerts more pressure on whoever has to deal with the additional states and transitions that complexity brings, be it a human operator or an algorithm. Any potential reliability increase brought by complexity is offset by an increased risk of human-made, operational error. You could get to fly the Bleriot XI or the Wright Flyer after a few tries, but good luck trying to fly an airliner. The complete answer on which one is more reliable greatly depends on who's driving; the operator becomes part of the system.

Ok, so why do systems work, then? It largely depends on the zoom level at which we look.
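Before zooming in, a small combinatorial sketch (again with invented numbers, and again only a sketch) helps make the "combinatorially less likely" claim above concrete: the more independent small faults that must coincide to produce a global failure, the more improbable that failure becomes.

```python
# Illustrative only: assume the system carries n latent minor faults,
# each independently "active" at any instant with probability p, and
# that a global failure needs at least k of them active at the same
# time. All numbers are made up; only the trend matters.
from math import comb

def p_at_least_k(n: int, p: float, k: int) -> float:
    """P(at least k of n independent faults are active simultaneously)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p = 1_000, 0.01    # 1,000 latent faults, each active 1% of the time
for k in (1, 5, 10, 20, 30):
    print(f"needs {k:>2} coincident faults -> P(global failure) = {p_at_least_k(n, p, k):.2e}")
```

With these made-up figures, needing a single active fault makes a global failure at any given instant a near certainty, while needing dozens of coincident faults pushes the probability toward the vanishingly small. That is the lottery the internal faults would have to win.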
To an external observer, systems "work" because their internal faults are not cooperating to "win" the disaster lottery. Under the magnifier, systems do not *fully* work; in fact, they are failing all the time[^92].

[^85]: For a passionate, somewhat trucker-mouthed review of AUTOSAR, see this gem: https://www.reddit.com/r/embedded/comments/leq366/comment/gmiq6d0/
[^86]: https://www.boeing.com/farnborough2014/pdf/BCA/fct%20-737%20Family%20Facts.pdf
[^87]: https://investors.boeing.com/investors/fact-sheets/default.aspx
[^88]: https://www.airbus.com/sites/g/files/jlcbta136/files/2021-12/EN-Airbus-A380-Facts-and-Figures-December-2021_0.pdf
[^89]: https://how.complexsystems.fail/
[^90]: https://www.youtube.com/watch?v=2S0k12uZR14&ab_channel=O%27Reilly
[^91]: https://ieeexplore.ieee.org/document/4273892
[^92]: See our world as perhaps the biggest "human-made" (bear with me) system we can think of. For a hypothetical observer sitting on the Moon, the world, more or less, works. Yet accidents—even very serious ones—are happening all the time.