# The Design Error Problem
A computer system may fail to perform as expected because a physical component fails or because a design error is uncovered. For a system to be both [ultra-reliable](https://shemesh.larc.nasa.gov/fm/fm-why-def-ultra.html) and safe, both of these potential causes of failure must be handled.
Established techniques exist for handling physical component failure; these [[Fault-Tolerant Design Techniques|techniques]] use [redundancy and voting](https://shemesh.larc.nasa.gov/fm/fm-why-def-redun.html). The reliability assessment problem in the presence of physical faults is based on [Markov modeling techniques](https://shemesh.larc.nasa.gov/fm/fm-why-def-markov.html) and is well understood.
The design error problem is a much greater [threat](https://shemesh.larc.nasa.gov/fm/fm-why-def-aeroerrors.html). Unfortunately, no scientifically justifiable defense against this threat is currently used in practice. Three basic strategies are advocated for dealing with design errors:
1. Testing (lots of it)
2. Design diversity (i.e., software fault tolerance: N-version programming, recovery blocks, etc.)
3. Fault avoidance (i.e., formal specification/verification, automatic program synthesis, reusable modules)
The problem with [life testing](https://shemesh.larc.nasa.gov/fm/fm-why-def-lifet.html) is that in order to measure ultra-reliability one must test for [exorbitant amounts of time](https://shemesh.larc.nasa.gov/fm/fm-why-def-long-time.html). For example, to measure a $10^{-9}$ probability of failure for a 1 hour mission one must test for more than 114,000 years.
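The arithmetic behind this example is straightforward: to demonstrate a $10^{-9}$ per-hour failure probability, one must accumulate on the order of $10^9$ failure-free test hours. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the life-testing example: demonstrating a
# 1e-9 probability of failure for a 1-hour mission requires on the order
# of 1/1e-9 = 1e9 device-hours of failure-free testing.
failure_prob_per_hour = 1e-9
hours_needed = 1 / failure_prob_per_hour      # 1e9 hours
years_needed = hours_needed / (24 * 365)      # convert hours to years
print(f"{years_needed:,.0f} years")           # more than 114,000 years
```

Even testing many units in parallel only divides this figure by the number of units, which does not bring it into a feasible range.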
Many advocate design diversity as a means to overcome the limitations of testing. The basic idea is to use separate design/implementation teams to produce [multiple versions](https://shemesh.larc.nasa.gov/fm/fm-why-def-multiplev.html) from the same specification. Then, non-exact threshold voters are used to mask the effect of a design error in one of the versions. The hope is that the design flaws will manifest errors [independently](https://shemesh.larc.nasa.gov/fm/fm-why-def-indep.html) or nearly so.
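A non-exact threshold voter can be sketched as follows. This is a minimal illustration, not any particular deployed design: the function name, the tolerance parameter, and the cluster-averaging step are all assumptions made for the example.

```python
def inexact_majority_vote(values, tolerance):
    """Hypothetical non-exact threshold voter: return a value that a
    majority of versions agree on (to within `tolerance`), masking a
    design error that corrupts a minority of the versions."""
    n = len(values)
    for v in values:
        # Collect every version output within `tolerance` of this one.
        agreeing = [w for w in values if abs(w - v) <= tolerance]
        if len(agreeing) > n // 2:
            # Average the agreeing cluster to smooth small numeric
            # differences between independently implemented versions.
            return sum(agreeing) / len(agreeing)
    raise RuntimeError("no majority within tolerance")

# One faulty version (5.0) is outvoted by the two agreeing versions.
print(inexact_majority_vote([1.00, 1.01, 5.0], tolerance=0.05))  # 1.005
```

Note that the voter masks a fault only when a majority of versions are correct on that input, which is exactly why the independence of version failures matters.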
By [assuming independence](https://shemesh.larc.nasa.gov/fm/fm-why-def-assumei.html) one can obtain ultra-reliable-level estimates of reliability even though the [individual versions](https://shemesh.larc.nasa.gov/fm/fm-why-def-ivers.html) have failure rates on the order of $10^{-4}$. Unfortunately, the independence assumption has been rejected at [the 99% confidence level](https://shemesh.larc.nasa.gov/fm/fm-why-def-reject.html) in several experiments for low-reliability software.
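To see how the independence assumption produces ultra-reliable-level estimates, consider a 2-of-3 voted triplex whose versions each fail with probability $10^{-4}$ per mission (the triplex configuration here is an illustrative assumption):

```python
# Under the (empirically rejected) independence assumption, a 2-of-3
# voted triplex fails only when at least two versions fail on the same
# input. With per-version failure probability p = 1e-4:
p = 1e-4
p_system = 3 * p**2 * (1 - p) + p**3   # exactly two fail, or all three
print(f"{p_system:.1e}")               # ~3.0e-8, far below 1e-4
```

The estimate collapses if version failures are positively correlated, which is precisely what the cited experiments observed.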
Furthermore, the independence assumption can never be validated for high-reliability software because of the exorbitant test times required. If one cannot assume independence, then one must [measure correlations](https://shemesh.larc.nasa.gov/fm/fm-why-def-mcorr.html). This is infeasible as well: it requires as much testing time as life-testing the system itself, because the correlations must lie in the ultra-reliable region for the system to be ultra-reliable. Therefore, [it is not possible](https://shemesh.larc.nasa.gov/fm/fm-why-def-notp.html), within feasible amounts of testing time, to establish that design diversity achieves ultra-reliability. Consequently, design diversity can create an "illusion" of ultra-reliability without actually providing it.
From this analysis, we conclude that formal methods currently offer the most promising method for handling the design fault problem. Because the often quoted $10^{-9}$ reliability is beyond the range of quantification, we have no choice but to develop [life-critical systems](https://shemesh.larc.nasa.gov/fm/fm-why-def-life-critical.html) in the most rigorous manner available to us, which is the use of formal methods.