4.4 INTERPRETATION AND USES OF THE FAILURE RATE FUNCTION – THE MODEL FAILURE RATE
4.4.2 Decreasing Failure Rates, Reliability Growth, Burn-in and the Bathtub Curve

Rationale for a Decreasing Failure Rate
Whereas the judgment that the failure rate of an item is increasing with time is easy to support, namely, that the item deteriorates with use, reasons for the judgment of a decreasing failure rate are more subtle. There are two possibilities. The first is based on the physics of failure, namely, that certain items exhibit an improvement (over time) in their ability to perform their intended tasks. Examples are reinforced concrete that strengthens with exposure, drill bits that sharpen with use, materials that experience work hardening, and immune systems which mature with time. The second, and probably the more commonly occurring reason, is a psychological one.
It comes into play when we change our opinion, for the better, about an item’s survivability.
For example, our opinion of the credibility of computer software that is thoroughly tested and debugged keeps improving with use; thus we may judge the software as having a predictive failure rate that is decreasing. Since the software does not deteriorate with use, its model failure
rate should be a constant; but since we do not know this constant, its predictive failure rate is decreasing (Singpurwalla and Wilson, 1999, p. 77). Mathematically, the above type of argument is captured by a celebrated theorem in reliability which says that scale mixtures of certain chance distributions result in distributions having a decreasing failure rate. For example, in Barlow and Proschan (1975, p. 103), it is shown that $\bar{F}(t) = \int_0^\infty e^{-\lambda t}\,dF(\lambda)$, which is a scale mixture of exponentials with a parameter $\lambda$, has a decreasing failure rate. In particular, if $\lambda$ above has a gamma distribution with a scale parameter $\beta$ and a shape parameter $\alpha$ (section 3.2.2), then $\bar{F}(t) = \left(\beta/(\beta+t)\right)^{\alpha}$, which is a Pareto distribution. Its density at $t$ is of the form $\alpha\beta^{\alpha}/(t+\beta)^{\alpha+1}$, $t \geq 0$, and its failure rate is $\alpha/(t+\beta)$, which is a decreasing function of $t$. The mean of this Pareto is $\beta/(\alpha-1)$; it exists only when $\alpha > 1$. Note that a scale mixture is merely the law of total probability of section 2.4.2. Scale mixtures of distributions can be motivated by the following two, related, scenarios.
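Before turning to these scenarios, it may help to verify the Pareto calculation just quoted. The following sketch (my addition, not part of the original text) carries out the mixing integral, using the gamma parameterization of section 3.2.2 in which the scale parameter $\beta$ multiplies $\lambda$ in the exponent:

```latex
% lambda has density pi(lambda) = beta^alpha lambda^(alpha-1) e^(-beta lambda) / Gamma(alpha).
% The scale mixture of exponentials is then
\begin{align*}
\bar{F}(t)
  &= \int_0^{\infty} e^{-\lambda t}\,
     \frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda
   = \frac{\beta^{\alpha}}{\Gamma(\alpha)}
     \int_0^{\infty}\lambda^{\alpha-1}e^{-(\beta+t)\lambda}\,d\lambda
   = \Bigl(\frac{\beta}{\beta+t}\Bigr)^{\alpha},\\[4pt]
f(t) &= -\frac{d}{dt}\,\bar{F}(t)
      = \frac{\alpha\beta^{\alpha}}{(\beta+t)^{\alpha+1}},
\qquad
r(t) = \frac{f(t)}{\bar{F}(t)} = \frac{\alpha}{\beta+t},
\end{align*}
% and alpha/(beta+t) is decreasing in t, as claimed.
```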
The first scenario is a tangible one, involving the physical act of putting together several items to form a batch, with each item having its own model failure rate $\lambda_i$. Suppose that the $\lambda_i$s are known. We are required to assess the failure rate of an item picked at random from the batch.
We assume that in forming the batch, the identity of each item is lost, so that we do not know the $\lambda_i$ for the selected item, whose predictive failure rate we are required to assess. Such physical mixtures arise in practice when items coming from different sources are stored in a common bin, and the proportion of each $\lambda_i$ determines the mixing distribution.
The second scenario under which scale mixtures arise is less natural, because it models the mathematics of our thought process. In the case of exponentially distributed lifelengths, if we are uncertain about $\lambda$ – as we will be – and if our uncertainty is described by $F(\lambda)$, then our unconditional distribution of lifelengths would be a scale mixture of exponentials. The mixing now goes on in our minds and is therefore purely psychological.
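A numerical illustration (mine, not the book’s) of the first scenario: a batch pooled from a few exponential sub-populations with known but hypothetical model failure rates $\lambda_i$. The predictive failure rate of an item drawn at random is decreasing, even though every component failure rate is constant; the same computation describes the second scenario, with the proportions read as weights of opinion.

```python
import numpy as np

# Hypothetical batch: three exponential sub-populations with known model
# failure rates lambda_i, pooled in proportions p_i.
lams = np.array([0.5, 1.0, 3.0])    # model failure rates lambda_i
probs = np.array([0.3, 0.4, 0.3])   # mixing proportions p_i

t = np.linspace(0.0, 5.0, 501)
# Predictive survival and density of a randomly drawn item: a (finite)
# scale mixture of exponentials, F_bar(t) = sum_i p_i exp(-lambda_i t).
surv = (probs * np.exp(-np.outer(t, lams))).sum(axis=1)
dens = (probs * lams * np.exp(-np.outer(t, lams))).sum(axis=1)
rate = dens / surv                  # predictive failure rate f(t)/F_bar(t)

# The rate starts at sum_i p_i lambda_i = 1.45 and falls toward
# min_i lambda_i = 0.5: decreasing, although each component is constant.
print(rate[0], rate[-1])
assert np.all(np.diff(rate) < 0.0)
```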
An intuitive explanation as to why a scale mixture of exponentials results in a predictive distribution with a decreasing failure rate is easy to see if we bear in mind the notion that the failure rate is an individual’s judgment, and that judgments change with added information.
In our case the judgment is made at some reference time $\tau$, and it pertains to the uncertainty about failure at a future time $t$, assuming some added knowledge. The added knowledge is the supposition that the item has not failed at $t$. It is important to note that when the failure rate is assessed at time $\tau$, for any future time $t$, we are not saying that in actuality the item is surviving at $t$; rather, we inquire as to how we would assess our probability were we to suppose that the item is surviving at $t$. Thus we may start off with a poor opinion of the item’s reliability and, upon the supposition of its continued survival, change our opinion for the better, resulting in a decreasing failure rate; more details are in Barlow (1985).
To summarize, there are two possible reasons for the judgment of a decreasing failure rate.
The first is motivated by the physics of failure and generally pertains to the model failure rate.
The second pertains to the predictive failure rate and is due to the psychology of altering our opinion about survivability; this is accomplished by mixtures of distributions.
Often reliability growth and burn-in have been given as reasons for assuming decreasing failure rates. This is incorrect, because both of these operations change the physical characteristics of an item, so that the changed item has a failure rate which may be different from its failure rate before the change.
Reliability Growth and Debugging
Reliability growth pertains to an enhancement of reliability due to changes in the design of the item or due to an elimination of its identified defects (cf. Singpurwalla, 1998); for example, debugging a piece of software. To describe reliability growth, we must talk about the concatenated (or adjoined) failure rates – model or predictive – of different versions of the same item, each version having a failure rate function that is dominated by the failure rate function of its preceding version.
Figure 4.3 Concatenated failure rate functions for items undergoing reliability growth: successive segments $r_1(t), r_2(t), r_3(t), r_4(t)$ over intervals marked by $T_1, T_2, T_3$; in panel (a) the segments are constant, in panel (b) they are decreasing.
An example is the reliability growth of computer software that undergoes several stages of testing and debugging (Chen and Singpurwalla, 1997; Al-Mutairi, Chen and Singpurwalla, 1998). In Figure 4.3(a) and (b), we see concatenated failure rate functions that are composed of constant and decreasing failure rate segments, respectively. Concatenated failure rate functions are often encountered in the literature on software reliability, and have been erroneously cited as examples of items having a decreasing failure rate.
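As a small sketch (mine; the version boundaries and segment forms are hypothetical), a concatenated failure rate of the kind in Figure 4.3 can be assembled piecewise from per-version failure rates. Note that neither result is a monotone decreasing function, which is the point of the remark above.

```python
import numpy as np

# Hypothetical version release times T1 < T2 < T3.
T = np.array([2.0, 4.0, 6.0])

def r_concat_constant(t):
    """Figure 4.3(a) style: a constant failure rate per version,
    with successive versions at lower levels (1.0 > 0.6 > 0.35 > 0.2)."""
    levels = [1.0, 0.6, 0.35, 0.2]
    return levels[int(np.searchsorted(T, t, side="right"))]

def r_concat_decreasing(t):
    """Figure 4.3(b) style: a failure rate that decreases within each
    version, with each version restarting lower than its predecessor did."""
    starts = [0.0, *T]
    scales = [1.0, 0.6, 0.35, 0.2]
    i = int(np.searchsorted(T, t, side="right"))
    return scales[i] / (1.0 + (t - starts[i]))  # decreasing on each segment

for t in (0.0, 1.9, 2.0, 3.9, 4.0, 7.0):
    print(t, r_concat_constant(t), r_concat_decreasing(t))
```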
Burn-in and Screening Tests
Burn-in pertains to the elimination of weak items by subjecting each item to a life test, called a screening test, for a specified period of time (for example, Block and Savits, 1997). The hope is that the weak and defective items will drop out due to early failures, so that what remains are items that have proven themselves in a life test. In practice, burn-in appears to be almost always done on one-of-a-kind items that are to be used in life-critical systems, such as spacecraft. The bathtub curve, discussed later, is often given as a reason for undertaking burn-in. Because of an accumulation of age during the screening test, the failure rate of an item that survives the test will be different from its failure rate before the test. Thus, like reliability growth, burn-in changes the physical characteristics of the item tested. In particular, for items that are judged to have an increasing failure rate, a burn-in test will make the surviving items inferior to what they were before the test. Thus, theoretically, burn-in is advantageous only when lifetimes are judged to have a decreasing predictive failure rate.

Why is it that in practice all life-critical items, even those that are known to deteriorate with use, are subjected to a burn-in? Are the engineers doing something they should not be doing, or is it that even items which deteriorate with use could be judged as having a failure rate which is initially decreasing and then increasing in time? It turns out that when we are uncertain about the precise form of an increasing model failure rate function for a deteriorating device, the predictive failure rate will initially decrease and then increase (Figure 4.4), and thus the engineer’s hunch to always burn-in makes sense. The intuition underlying this phenomenon is analogous to the one used to explain the decreasing failure rate of mixtures of exponentials; that is, mixtures of distributions are a mathematical description of the psychological process of altering one’s opinion about the reliability of an item. Figure 4.4 shows, for different values of $a$ and $b$, a U-shaped predictive failure rate of $T$ when its model failure rate is increasing and of the form $r(t) = \eta t + \lambda$, with $\eta$ assumed known and the uncertainty about $\lambda$ described by a gamma distribution having a shape parameter $a$ and a scale parameter $b$ (for details, see Lynn and Singpurwalla, 1997).
Figure 4.4 A U-shaped predictive failure rate $r(t)$ generated by increasing model failure rates, shown for $(a, b) = (2, 1)$, $(4, 2)$ and $(6, 3)$, together with a lower bound.
Gurland and Sethuraman (1995) describe an analogous phenomenon in which increasing model failure rates can result in monotonically decreasing predictive failure rates.
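A sketch of how the U-shape arises under the linear form $r(t) = \eta t + \lambda$ read above (my derivation, assuming $\eta$ known and $\lambda$ having a gamma distribution with shape $a$ and scale parameter $b$):

```latex
% Given lambda, the model survival function is exp(-(eta t^2/2 + lambda t)).
% Averaging e^{-lambda t} over the gamma distribution of lambda, as before,
\begin{align*}
\bar{F}(t) = e^{-\eta t^{2}/2}\Bigl(\frac{b}{b+t}\Bigr)^{a},
\qquad\text{so that}\qquad
r(t) = -\frac{d}{dt}\log\bar{F}(t) = \eta t + \frac{a}{b+t}.
\end{align*}
% The term a/(b+t) falls from a/b while eta*t grows from 0; when
% eta < a/b^2 the sum first decreases and then increases -- a U-shape,
% with its minimum at t* = sqrt(a/eta) - b. For (a,b) = (2,1), (4,2)
% and (6,3), the curves all start at r(0) = a/b = 2.
```

The finite minimum $t^{*}$ is one way to motivate a finite burn-in period, which is the subject of the next paragraph.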
Determining an optimum burn-in time is a problem in decision making under uncertainty, wherein we trade off one attribute for another (for example, Clarotti and Spizzichino, 1990).
There are two separate cases that need to be discussed. In the case of items whose failure rate is always decreasing, we are trading off the cost of burn-in against having an item with the lowest possible failure rate. The more time we spend under burn-in, the lower will be the failure rate of an item that survives the burn-in test. Thus, ideally, for items having a decreasing failure rate, the burn-in period should be indefinite; but then we would never commission a surviving item for use; thus the trade-off. In the case of items that are known to deteriorate with use, but whose model failure rate we are unable to specify, the predictive failure rate will be judged to initially decrease and then increase, and now we are trading off the cost of testing, plus the depletion of useful life, against the added knowledge about the specific form of the model failure rate. With this perspective on burn-in testing, the prevailing argument that the purpose of burn-in is the elimination of defective items can be justified if we view defective items as those having large values of the model failure rate.
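A minimal numerical sketch of the first trade-off (mine; the cost coefficients are entirely hypothetical). For the gamma-mixed exponential of this section, an item surviving a burn-in of length $s$ has, at commissioning, the predictive failure rate $\alpha/(\beta + s)$ – decreasing in $s$ – so it is only a cost on test time that makes the optimal burn-in finite:

```python
import numpy as np

# Pareto predictive lifelength from the gamma-mixed exponential:
# F_bar(t) = (beta / (beta + t))**alpha, with illustrative alpha, beta.
alpha, beta = 2.0, 1.0

def rate_after_burn_in(s):
    """Predictive failure rate at commissioning, given survival of a
    burn-in of length s: the residual survival F_bar(s+t)/F_bar(s)
    has failure rate alpha/(beta+s+t), which at t = 0 is alpha/(beta+s)."""
    return alpha / (beta + s)

# Hypothetical objective: linear cost of test time plus a penalty
# proportional to the commissioned item's failure rate.
c_test, c_risk = 0.05, 1.0
s = np.linspace(0.0, 20.0, 2001)
cost = c_test * s + c_risk * rate_after_burn_in(s)

print("optimal burn-in:", s[np.argmin(cost)])
# Analytic check: s* = sqrt(c_risk * alpha / c_test) - beta ~ 5.32.
```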
The Bathtub Curve
Many complex technological systems, and also humans, are judged to have a failure rate function which is in the form of a bathtub curve (Figure 4.5(a)). In fact, the bathtub curve has become one of the hallmarks of engineering reliability. Observe that there are three segments to the bathtub curve (which is assessed at a reference time). The initial decreasing segment is referred to as the infant mortality phase, the middle constant segment as the random phase, and the final increasing segment as the aging or wear-out phase. The rationale behind a choice of the bathtub curve for the failure rate function is as follows. Typically, a newly developed system may contain design and manufacturing faults, known as birth defects, that would trigger an early failure; thus the initial failure rate is described by a decreasing function of time. Were we to be told that the system does not experience a failure due to birth defects, then it is likely that its subsequent failure is due to causes that cannot be explained, and so the middle phase of the failure rate is constant over time. Should we suppose that the item survives this random phase, then its subsequent failure is predominantly due to deterioration or wear, and so the final phase of the failure rate is an increasing function of time.
Figure 4.5 The bathtub curve for humans: (a) the failure rate against age, with the infant mortality phase (ages 0 to 10), the random phase (10 to 30) and the wear-out phase (past 30); (b) the corresponding distribution function $F(t)$; (c) the corresponding density function $f(t)$.
The bathtub form of the failure rate is used by actuaries to establish insurance premiums. The infant mortality period used by actuaries is from birth to age 10 years, the random phase from 10 to 30 years, and the wear-out phase commences at age 30, when premiums increase. The failure rate for humans during the initial period of the infant mortality phase is very high; that is, there is a high risk of death immediately after birth. Consequently, many insurance policies become effective only 15 days after birth. The causes of death between the ages of 10 and 30 years are assumed to be random, due to events such as epidemics, wars, etc.;
aging is assumed to commence at age 30. It is important to bear in mind that the bathtub curve is a specific form of the failure rate function (specified at the reference time $\tau$), and like all other forms of the failure rate function it is the opinion or the judgment of an assessor about the survivability of a unit. Under our interpretation of probability, the bathtub curve does not have a physical reality, and could be chosen as a model for a single one-of-a-kind item, or for any member of a population of items. Figures 4.5(b) and (c) illustrate the nature of the distribution function $F(t)$ and its corresponding density function $f(t)$ for the bathtub curve of Figure 4.5(a).
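The connection between the three panels is the standard identity linking a failure rate to its distribution; it is sketched here (my addition) because it explains how Figures 4.5(b) and (c) are obtained from 4.5(a):

```latex
% From a (bathtub-shaped) failure rate r(t), specified at the reference
% time, the quantities plotted in Figure 4.5 are recovered as
\bar{F}(t) = \exp\Bigl(-\int_0^{t} r(u)\,du\Bigr),
\qquad
F(t) = 1 - \bar{F}(t),
\qquad
f(t) = r(t)\,\bar{F}(t).
% The high early r(t) makes f(t) large near the origin; the constant
% middle phase gives an exponential-like decay; and the increasing
% wear-out phase can produce a second mode in f(t) before it vanishes.
```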
Even though the bathtub form of the failure rate function is a reasonable idealization for the failure rate of humans, it may not be appropriate for certain engineering and biological systems.
For many systems, the judgment of a constant failure rate, which implies that the system neither improves nor deteriorates with use, may not be meaningful. For such systems, a U-shaped failure rate function may be a more reasonable description of their failure behavior (Figure 4.6(a)). For systems that do not experience wear, a strictly decreasing or an L-shaped failure rate function may be appropriate, whereas for systems that do not experience infant mortality, a failure rate
function that is strictly increasing, or is a constant, or is initially constant and then increasing, may be reasonable (Figure 4.6(b)).

Figure 4.6 U-shaped and L-shaped failure rate functions: two panels, (a) and (b), showing U-shaped, L-shaped and reverse L-shaped forms of $r(t)$.
Before closing this section it is desirable to comment on the infant mortality phase of the bathtub curve. We also need to comment on whether the failure rate function of a bathtub curve represents the model failure rate or the predictive failure rate. The decreasing form of the failure rate function does not necessarily imply that the system is indeed improving with use. As I have said before, this is a rare phenomenon, restricted to scenarios such as the setting of cement, or in the case of humans, the building-up of an immune system. Rather, the decreasing form of the failure rate function typically describes our improving opinion about the survivability of the item. In actuality, a system may experience a gradual wear and/or destruction due to randomly occurring events, as soon as it is put to use; but these are not judged as being the predominant causes of initial failure. It is our lack of precise knowledge about the presence or the absence of manufacturing (or birth) defects which causes us to judge a decreasing failure rate function during the initial phase of a system’s life. Thus it appears that in most instances the infant mortality phase of the bathtub curve should pertain to the predictive failure rate. Since the other two phases of the bathtub curve are a constant and an increasing function of time, it appears (in the light of our discussions on decreasing failure rates), that these two phases must pertain to the model failure rate. Thus, in most instances the bathtub curve is more likely to be a concatenation of a predictive and a model failure rate. If the bathtub curve is U-shaped, then it could possibly pertain to the predictive failure rate, but most likely it will be the concatenation of a predictive and a model failure rate.
To conclude, it is our view that the bathtub curve that is commonly discussed and used by practitioners has an interpretation that is not as elementary as it is made out to be. In the framework that has been put forth, it is a representation of the opinion of an individual, or a collection of individuals, and this could be a concatenation of predictive and model failure rate functions, or just a predictive failure rate function. More on the shape of the failure rate function can be found in Aalen and Gjessing (2001).