During World War II, the US Army Air Forces was desperate to solve an urgent problem: how to improve the odds of a bomber making it home. The odds for bomber crews were grim, not much better than a coin toss. So the military brought the problem to statisticians and mathematicians, likely spurred by the growing interest in [operations research](https://en.wikipedia.org/wiki/Operations_research) during those hectic times.
Military engineers explained to the scientists—one of them being the brilliant mathematician [Abraham Wald](https://en.wikipedia.org/wiki/Abraham_Wald)—that they knew their bombers needed more armor, but they couldn’t just cover the planes like tanks, not if they wanted them to get off the ground. So, engineers thought, the key was to figure out the best places to add what little protection they could.
For this, the military provided some information they had collected from the bombers that had returned from enemy territory, recording where those planes had taken the most damage. The collected data seemed to show that the bullet holes tended to accumulate along the wings, around the tail gunner, and down the center of the body. Wings. Body. Tail gunner.
Considering this information, the commanders wanted to put the thicker protection where they could clearly see the most frequent damage, where the holes clustered. But Wald pointed out that this would be precisely the wrong decision. The mistake, which Wald saw instantly, was that the holes showed where the planes were the strongest. The holes showed where a bomber could be shot and still survive the flight home. After all, here they were, holes and all. It was the planes that didn’t make it home that needed extra protection, and they needed it in the places where the returning planes had not been hit. The holes in the surviving planes actually revealed the locations ==that needed the least additional armor==. Look at where the survivors are unharmed, he said, and that’s where these bombers are most vulnerable; that’s where the planes that didn’t make it back were hit. In short, the returning bombers provided little information on how to increase survivability. Sadly, the planes that did not come home not only left families devastated but also took invaluable information down with them.
==Survivorship bias is our tendency to plan for future success using evidence drawn only from the successes we can still see, while the failures that could teach us the most have been filtered out of view.==
Survivorship bias can significantly affect the way we design reliable systems by skewing our understanding of failure rates and patterns:
1. Survivorship bias may cause designers to underestimate the true failure rates of components or systems. If only successful instances are studied, the instances that failed and dropped out of the dataset are never examined, which can create a false sense of security regarding the reliability of certain components or systems (the simulation sketch after this list illustrates the effect).
2. Relying solely on successful examples may lead to misguided design decisions. For example, if a particular design feature appears in every successful system studied, designers may credit the success to that feature, even though the failed systems they never examined may have had it too and other factors may have mattered more. This can result in ineffective or even detrimental design elements being carried into new systems.
3. Survivorship bias can hinder a comprehensive understanding of risks associated with system failures. By focusing only on successful outcomes, designers may fail to identify and adequately address potential failure modes and their associated consequences. This can leave systems vulnerable to unexpected failures that were not accounted for in the design process.
4. Failure to account for potential failure modes due to survivorship bias may result in inadequate redundancy and resilience measures being implemented in system design. Without considering all possible failure scenarios, designers may overlook the need for backup systems or fail-safe mechanisms, leaving the system vulnerable to catastrophic failures.
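To make the first point concrete, here is a minimal simulation sketch in Python. Everything in it is a made-up assumption (the fleet composition, the failure probabilities, the function name `survivorship_demo`); the point is only that a failure rate measured on units that survived a selection process can badly understate the rate the full population actually experiences.

```python
import random

def survivorship_demo(n_units=100_000, weak_fraction=0.3,
                      weak_fail_prob=0.20, strong_fail_prob=0.01,
                      burn_in_missions=10, seed=1):
    """Sketch: a fleet mixes 'weak' and 'strong' units. Weak units tend
    to fail early and drop out, so a failure rate measured on the
    surviving fleet understates what a brand-new fleet will experience."""
    rng = random.Random(seed)

    # Per-mission failure probability of each unit in the full population.
    fleet = [weak_fail_prob if rng.random() < weak_fraction else strong_fail_prob
             for _ in range(n_units)]
    population_rate = sum(fleet) / len(fleet)

    # Burn-in: units that fail are lost and never show up in later data.
    survivors = list(fleet)
    for _ in range(burn_in_missions):
        survivors = [p for p in survivors if rng.random() >= p]

    # Survey: measure the failure rate of the surviving fleet only.
    failures = sum(rng.random() < p for p in survivors)
    survivor_rate = failures / len(survivors)

    print(f"failure rate of the full population: {population_rate:.3f}")
    print(f"failure rate measured on survivors:  {survivor_rate:.3f}")

survivorship_demo()
```

With these illustrative numbers, the full population fails several times as often as the surviving fleet suggests, which is exactly the false sense of security described above.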
To mitigate the impact of survivorship bias in designing fault-tolerant systems, designers should:
- Conduct thorough failure analysis, including studying both successful and failed instances.
- Incorporate failure data from similar systems or industries to gain a broader perspective.
- Implement robust testing procedures that simulate a wide range of failure scenarios (see the fault-injection sketch after this list).
- Utilize redundancy and resilience measures based on comprehensive risk assessments rather than relying solely on successful examples (see the redundancy sketch after this list).
- Encourage a culture of learning from failures and near-misses to continuously improve system reliability.
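For the testing bullet, here is one way such a procedure can look in practice: a tiny fault-injection harness in Python. The `FlakyService` dependency, the `fetch_with_fallback` routine, and the failure probabilities are all hypothetical; the idea is simply that failures are injected deliberately, including at rates far worse than anything observed so far, rather than inferred only from the incidents that happened to make it into the logs.

```python
import random

class FlakyService:
    """Hypothetical downstream dependency whose failure modes we inject
    deliberately instead of waiting to observe them in production."""
    def __init__(self, failure_prob, rng):
        self.failure_prob = failure_prob
        self.rng = rng

    def fetch(self, key):
        if self.rng.random() < self.failure_prob:
            raise TimeoutError(f"injected fault while fetching {key!r}")
        return f"value-for-{key}"

def fetch_with_fallback(service, cache, key):
    """System under test: try the service, fall back to the last cached value."""
    try:
        value = service.fetch(key)
        cache[key] = value
        return value
    except TimeoutError:
        # Degrade gracefully instead of failing the whole request.
        return cache.get(key, "default-value")

def run_fault_injection_test(failure_prob, requests=10_000, seed=7):
    rng = random.Random(seed)
    service = FlakyService(failure_prob, rng)
    cache = {}
    answered = sum(fetch_with_fallback(service, cache, f"key-{i % 50}") is not None
                   for i in range(requests))
    # The requirement: every request gets an answer even when the
    # dependency misbehaves far more often than it ever has before.
    assert answered == requests
    print(f"failure_prob={failure_prob:.0%}: all {requests} requests answered")

for p in (0.01, 0.25, 0.75):
    run_fault_injection_test(p)
```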
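And for the redundancy bullet, a toy calculation (all probabilities here are illustrative assumptions, not measurements): surviving systems tend to suggest a clean picture of independent component failures, but a comprehensive risk assessment also has to price in common-cause events, such as a shared power or network outage, that take out every replica at once and can easily dominate the naive estimate.

```python
def failure_probability(p_replica, n_replicas, p_common_cause):
    """Probability that the whole system is down: a naive estimate that
    assumes replicas fail independently, versus one that also includes
    a common-cause event which takes all replicas down together."""
    naive = p_replica ** n_replicas
    with_common_cause = p_common_cause + (1 - p_common_cause) * naive
    return naive, with_common_cause

naive, realistic = failure_probability(p_replica=0.01,
                                       n_replicas=3,
                                       p_common_cause=0.001)
print(f"assuming independent replicas:  {naive:.1e}")      # ~1.0e-06
print(f"including a common-cause event: {realistic:.1e}")  # ~1.0e-03
```

Under these assumptions, a single overlooked common-cause failure mode makes the real risk about a thousand times higher than the independent-replica estimate, which is why the assessment has to cover more than the failure modes the surviving examples happen to exhibit.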