Qualitative and quantitative evaluation of human error in risk assessment
8.2 Human reliability analysis in risk assessment
8.2.1 Introduction to the risk analysis process
Since HRA is usually applied in the context of technical safety and risk assessment, sometimes referred to as Formal Safety Analysis (FSA), we will first provide a brief overview of this process. The overall purpose of technical safety analysis and risk assessment is the identification and management of risks so that they are reduced to an acceptable level. The stages of the process can be summed up as follows:
• Identification of the hazards. These are aspects of a system likely to cause harm to people (e.g. high temperatures, pressures, toxic substances, voltages, high velocities) or financial loss.
• Evaluation of scenarios or credible incidents. These are events or sequences of events that could release the hazards.
• Evaluation of consequences. This is concerned with the different ways in which the hazard could exert its effects or influences on people, company assets or the environment once released.
• Evaluation of the probability or frequency with which the hazard is likely to be released (e.g. once every 10,000 operations of the system, once every 10 years).
• Evaluation of the risk. The product of the severity of the consequences and the frequency of its occurrence is the risk (alternatively the product of the severity of consequences, the frequency of exposure and the probability of the incident leading to the release of the consequences).
• Assessment of whether the risk is acceptable, using risk criteria, or bands of acceptability.
• Modification of the system if the risk is deemed to be unacceptable.
These stages are represented in the flow diagram shown in Figure 8.1. Risk assessment is often used to perform cost effectiveness evaluations, to decide which of a possible set of interventions will achieve a required level of risk for the lowest cost. For analyses of this type, it is necessary to be able to assess both the severity and the probability of a particular set of consequences occurring as a function of a number of mitigation options, each of which may have different associated costs. In the context of aircraft accidents, for example, it may be possible to demonstrate that a particular type of accident is due to a certain type of skill deficiency. This could be remedied by extending the training period of pilots, but this will have implications for both costs and operational availability. In order to assess whether or not the training option is viable, it would be necessary to evaluate the probability of the accident type as a function of the degree of training.
[Figure 8.1 Overall risk analysis process. Flow diagram: Describe system → Identify hazards → Identify and select incidents → Evaluate consequences → Estimate frequencies/probabilities → Combine frequencies and consequences to calculate risk → Risk acceptable? If yes, end of analysis; if no, modify design and return to the start of the loop.]
The potential costs of accidents arising from this cause will then have to be compared with the increased costs of training and reduced operational availability. In order to assess the risk and perform the cost-effectiveness calculations, the different ways in which the failures occur will need to be modelled, and appropriate costs and probabilities inserted in this model. This will require the use of a range of tools and techniques, which will be discussed in the following sections of this document.
8.2.2 Modelling tools used in risk analysis
In technical risk assessments and safety analyses, typical failures of safety critical systems are modelled in the form of representations such as event trees and fault trees. The failure probabilities of various hardware components in the system are then combined together using the logic represented by these models to give the probability of the failure. Where there are a number of systems that could contribute to the mitigation of an accident sequence, the probabilities of failure of each of the individual systems (which have been evaluated using fault tree analysis) are then combined using an event tree. This process can be used for both hardware and human failure probabilities and will be illustrated later.
The structure of a typical fault tree is illustrated in Figure 8.2. The event being analysed, the ‘Total failure of car brakes’, is called the ‘top event’. If appropriate failure probabilities are inserted in such a tree for each of the items, the overall failure probability of the top event can be calculated from the logic of the fault tree model. Probabilities are multiplied at AND gates, added at OR gates, and the result propagated up the tree. It will be apparent that some of the failure probabilities in the fault tree could easily arise from human error. For example, although the wear-out of the brake linings is a property of the hardware, it could arise from a failure to carry out routine servicing correctly (a ‘latent’ failure, where the effects are delayed). The same considerations apply to the loss of brake fluid or the probability of the cable being broken.

[Figure 8.2 Fault tree for brake failure. The top event ‘Total failure of car brakes’ is an AND gate over ‘Hand-brake fails’ and ‘Foot-brake fails’. ‘Hand-brake fails’ is an OR gate over ‘Cable broken’ and ‘Rear linings worn out’; ‘Foot-brake fails’ is an OR gate over ‘Loss of brake fluid’ and ‘All linings worn out’, the latter an AND gate over ‘Rear linings worn out’ and ‘Front linings worn out’.]
Both of these latter failures could also arise from failures in maintenance activities.
Another way in which human activities could be involved in an apparently hardware-based system would be in the safety-related aspects of the design. For example, the linings may not have been designed for the types of braking encountered in the use of the vehicle. It might also be argued that some form of redundant system would have been appropriate, so that the loss of brake fluid did not fail the foot-brake. In addition, certain failures, in this case the failure of the rear brake linings, can affect more than one branch of the fault tree and may therefore have a greater impact than failures that make only one contribution to the top event. Failure probabilities for fault trees of this type are usually derived from databases of failure probabilities for components, which are obtained by observing failure rates in situ, or as a result of testing programmes.
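As a sketch of the gate arithmetic described above, the brake fault tree can be evaluated in a few lines of code. The tree structure and all probability values below are illustrative assumptions, not real component data, and the OR gates use the simple addition of probabilities (a rare-event approximation valid when the probabilities are small).

```python
# Illustrative fault-tree evaluation for the brake example.
# All basic-event probabilities below are invented for demonstration.

def and_gate(*probs):
    """Multiply probabilities (independent events that must all occur)."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def or_gate(*probs):
    """Add probabilities (rare-event approximation of 1 - prod(1 - p))."""
    return sum(probs)

# Hypothetical basic-event probabilities (per demand)
p_cable_broken = 1e-4
p_fluid_loss   = 5e-4
p_rear_worn    = 1e-3
p_front_worn   = 1e-3

p_all_linings = and_gate(p_rear_worn, p_front_worn)
p_hand_brake  = or_gate(p_cable_broken, p_rear_worn)  # rear linings appear here too
p_foot_brake  = or_gate(p_fluid_loss, p_all_linings)
# Naive top-gate multiplication: ignores that 'rear linings worn' is shared
# by both branches, so the two branch failures are not truly independent.
p_top = and_gate(p_hand_brake, p_foot_brake)

print(f"P(total brake failure) ~ {p_top:.2e}")
```

Note the comment at the top gate: because the rear-linings event feeds both branches, the naive multiplication understates the dependency discussed in the text, which is exactly why cut-set analysis (or careful modelling) is needed in practice.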
Failure rates can be expressed in terms of time, such as a mean time between failures, or as a failure probability per demand. These measures can be converted to one another if the failure distribution is known. It is important to emphasise that these probabilities are conditional probabilities, since they are dependent on the conditions under which the data was collected. Although the issue of the context within which a failure occurs is not insignificant for hardware, it is particularly important in the case of human failure. For both humans and hardware components, such as pumps, failure probabilities arise from the interaction between the person (or component) and the environment. However, whilst the failure probability of a pump can be largely predicted by its basic design and its level of use, human error probabilities are influenced by a much wider range of contextual factors, such as the quality of the training, the design of the equipment and the level of distractions.
These are sometimes referred to as Performance Shaping Factors (PSFs). The term
‘Performance Shaping’ originated in the early days of behavioural psychology, where various types of conditioning were used to shape the performance of simple tasks by animals under laboratory conditions. However, this context has little relevance to the performance of skilled people in technical systems, and hence the term Performance Influencing Factors (PIFs) will be used here to refer to the direct and indirect factors that influence the likelihood that a task will be performed successfully. Williams [1] uses the term ‘Error Producing Conditions’ with a similar meaning.
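The conversion between a time-based failure rate and a per-demand failure probability mentioned above can be sketched as follows, assuming an exponential failure distribution (i.e. a constant failure rate). The MTBF and mission length used here are invented for illustration.

```python
import math

def prob_of_failure(mtbf_hours, mission_hours):
    """Probability of at least one failure during a mission, assuming an
    exponential failure distribution with rate = 1 / MTBF."""
    rate = 1.0 / mtbf_hours          # failures per hour
    return 1.0 - math.exp(-rate * mission_hours)

# Hypothetical component: 10,000-hour MTBF, 24-hour mission
p = prob_of_failure(10_000, 24)
print(f"Failure probability per 24 h mission: {p:.4f}")
```

For small rate × time products this is approximately rate × time, which is why short missions on high-MTBF components are often quoted directly as a rate multiplied by exposure time.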
The event tree is used to model situations where a number of events need to occur (or be prevented) in sequence for an undesirable outcome to arise or be averted.
Typically, these events can be either initiating events (hardware/software or human), which start the accident sequence, or possible preventative actions (again hardware/software or human), which may prevent the sequence proceeding to the final undesirable consequence. Depending on which failure occurs and whether it can be recovered by subsequent actions, a range of paths through the event tree is possible.
The overall probability of the undesirable consequence, therefore, has to take into account all of these paths.
The event tree shown in Figure 8.3 shows the different types of event that could give rise to a signal passed at danger (SPAD) in a railway situation. This is based on the MARS approach (Model for Assessing and Reducing SPADs) developed by Embrey et al. [2]. The probability of each failure is evaluated by multiplying the probabilities along each of the routes that could be traversed. If S stands for successes at each node of the tree, and F for the corresponding failures, the probabilities of the
[Figure 8.3 Event tree for signals passed at danger (SPADs). Starting from a signal at red, the successive branches are: Detection (signal detected), Diagnosis (signal identified), Interpretation (signal interpreted) and Response (correct response made). Failure F1 at detection leads to no braking (SPAD 4); failure F2 at diagnosis to no braking (SPAD 3); failure F3 at interpretation to no braking (SPAD 2); failure F4 at response to insufficient braking (SPAD 1); successes S1 to S4 at all four stages mean the train stops before the signal.]
different types of SPAD are given by:
P(SPAD 1) = S1 × S2 × S3 × F4
P(SPAD 2) = S1 × S2 × F3
P(SPAD 3) = S1 × F2
P(SPAD 4) = F1

where P(X) = probability of SPAD type X occurring.
The overall combined probability of a SPAD arising from those modelled in the event tree is therefore:
P(SPAD 1) + P(SPAD 2) + P(SPAD 3) + P(SPAD 4)
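The SPAD calculation can be sketched directly from the expressions above. The success probabilities S1 to S4 used here are invented for illustration only.

```python
# Illustrative event-tree calculation for the SPAD example.
# S[i] = success probability at node i; F[i] = 1 - S[i] is the failure probability.
S = {1: 0.999, 2: 0.995, 3: 0.99, 4: 0.98}
F = {i: 1.0 - s for i, s in S.items()}

# One product of branch probabilities per path through the tree
p_spad = {
    1: S[1] * S[2] * S[3] * F[4],   # insufficient braking
    2: S[1] * S[2] * F[3],          # no braking
    3: S[1] * F[2],                 # no braking
    4: F[1],                        # no braking
}
p_total = sum(p_spad.values())
print(f"P(any SPAD) = {p_total:.4e}")
```

As a consistency check, the total equals 1 − S1 × S2 × S3 × S4, since the four SPAD outcomes together cover every path except complete success.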
Although mathematically equivalent to a fault tree, the event tree has certain useful features when used for modelling human performance. In particular, it is possible to take into account modifications in the probabilities of the event tree as a result of antecedent events. For example, if one event in the sequence failed, but was subsequently recovered, the time taken to recover could lead to less time to perform subsequent operations, raising the probability of failure for these operations. In the case of the SPAD event tree shown above, a signal aspect (red or green) could initially be misdiagnosed and then recovered. However, the time lost in correcting the initial failure could give the driver less time to brake. Therefore, the probabilities in the event tree are not necessarily independent, as is usually assumed to be the case in a fault tree.
The structure of the event tree often makes it easier to understand the nature of these dependencies. In practice, the probabilities in event trees are evaluated by fault trees that contain lower level events for which probabilities are more readily obtainable than at the event tree level. For example, the first failure event in the SPAD event tree in Figure 8.3 ‘Failure to detect signal’ could be decomposed into the events:
• Failure to maintain visual attention to the trackside.
• Lack of visibility due to poor contrast between signal and background.
• Poor visibility due to weather conditions.
Each of these events will have a corresponding probability. Since any one of them, or any combination (an ‘OR gate’ in fault tree terminology), could give rise to the signal detection failure at the level of the event tree, their probabilities can be added, provided they can be assumed to be independent.
Although the fault tree is the tool of choice in hardware reliability assessment, it has a number of disadvantages when applied to human reliability analysis. In particular, it is difficult to model interactions between contextual factors, e.g. poor visibility and a high level of workload, which in combination could have a greater effect on the probability of failure than either event occurring alone. The influence diagram (see Section 8.7.6) can be used to overcome some of these problems.
In conclusion, the analyst has to be aware that human reliability cannot be blindly assessed by feeding numbers into fault tree software tools without careful consideration of the subtle interactions between these probabilities. In particular, human error probabilities are frequently subject to ‘Common Causes’, i.e. factors operating globally across a number of tasks or task elements that may negate any assumptions of independence. For example, the probabilities of failure of two independent checks of an aviation maintenance re-assembly task may be naively multiplied together by the analyst on the assumption that both must fail for the re-assembly to be performed incorrectly. In reality, if both checkers are old friends and have a high opinion of each other’s capabilities, their checks may in fact be far from independent, and hence the actual failure probability may be that of the worst checker. Modelling these failure routes and interactions is often the most difficult but most useful aspect of a human reliability assessment. Once a realistic model of the ways in which a system can fail has been constructed, a significant portion of the benefits of HRA has been achieved.
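The checking example above can be made concrete with two invented miss probabilities, comparing the naive independence assumption with the fully dependent bound, in which the outcome is governed by the worst checker.

```python
# Hypothetical probabilities that each checker misses the assembly defect
p_checker_a = 0.01
p_checker_b = 0.03

# Naive assumption: both checks fail independently
p_independent = p_checker_a * p_checker_b

# Complete dependence (common cause): effectively one check,
# governed by the less reliable checker
p_dependent = max(p_checker_a, p_checker_b)

print(f"Independent assumption: {p_independent:.1e}")
print(f"Fully dependent bound:  {p_dependent:.1e}")
```

With these illustrative numbers the two assumptions differ by two orders of magnitude, which is why dependence modelling, rather than the arithmetic itself, dominates the accuracy of the final result.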
Unfortunately, HRA is often carried out as an engineering number-crunching exercise without the development of comprehensive and complete qualitative models.
Although a selection of techniques for generating the numerical data to populate the failure models will be discussed in the next section, it cannot be emphasised too strongly that applying these techniques to inadequate failure models can produce completely incorrect results. Developing accurate models of human failures in systems requires considerable experience, and hence it is recommended that professional advice be sought if the analyst lacks experience in these areas.
8.3 A systematic human interaction reliability assessment