2. The coupling effect. This effect relates to questions a tester might have about how well mutation testing can detect complex errors since the
changes made to the code are very simple. DeMillo has commented on that issue as far back as 1978 [10]. He states that test data that can distinguish all programs differing from a correct one only by simple errors are sensitive enough to distinguish it from programs with more complex errors.
Mutation testing starts with a code component, its associated test cases, and the test results. The original code component is modified in a simple way to provide a set of similar components that are called mutants.
Each mutant contains a fault as a result of the modification. The original test data is then run with the mutants. If the test data reveals the fault in the mutant (the result of the modification) by producing a different output as a result of execution, then the mutant is said to be killed. If the mutants do not produce outputs that differ from the original with the test data, then the test data are not capable of revealing such defects. The tests cannot distinguish the original from the mutant. The tester then must develop additional test data to reveal the fault and kill the mutants.
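The essence of this kill check can be sketched in a few lines of C. The code below is purely illustrative; it is not the component of Figure 5.2, and the function names and test values are invented. The original unit and a constant-replacement mutant are run on the same test data, and a difference in output on any test case kills the mutant.

#include <stdio.h>

/* Hypothetical unit under test (not the component of Figure 5.2). */
int original_version(int x) { return x + 1; }

/* Mutant: the constant 1 has been replaced by the constant 2. */
int mutant_version(int x) { return x + 2; }

int main(void) {
    int tests[] = { 0, 5, -3 };   /* the existing test data for the unit */
    int n = (int)(sizeof tests / sizeof tests[0]);
    int killed = 0;

    for (int i = 0; i < n; i++) {
        /* A differing output on any test case kills the mutant. */
        if (original_version(tests[i]) != mutant_version(tests[i]))
            killed = 1;
    }
    printf("mutant is %s\n", killed ? "killed" : "still live");
    return 0;
}

A mutation testing tool automates exactly this comparison, but over many mutants and over the full test set for the component.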
A test data adequacy criterion that is applicable here is the following [11]:
A test set T is said to be mutation adequate for program P provided that for every inequivalent mutant Pi of P there is an element t in T such that Pi(t) is not equal to P(t).
The term T represents the test set, and t is a test case in the test set. For the test data to be adequate according to this criterion, a correct program must behave correctly and all incorrect programs behave incorrectly for the given test data.
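The same requirement can be written symbolically; this is only a restatement of the criterion above, with the mutant Pi written as P_i:

\forall \text{ inequivalent mutants } P_i \text{ of } P, \quad \exists\, t \in T \ \text{ such that } \ P_i(t) \neq P(t)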
Mutations are simple changes in the original code component, for example: constant replacement, arithmetic operator replacement, data statement alteration, statement deletion, and logical operator replacement. There are existing tools that will easily generate mutants. Tool users need only to select a change operator. To illustrate the types of changes made in mutation testing we can make use of the code in Figure 5.2. A first mutation could be to change line 7 from
i = i + 1 to i = i + 2.
If we rerun the tests used for branch coverage as in Table 5.1 this mutant will be killed, that is, the output will be different than for the original code. Another change we could make is in line 5, from
if a[i] > 0 to if a[i] < 0.
This mutant would also be killed by the original test data. Therefore, we can assume that our original tests would have caught this type of defect.
However, if we made a change in line 5 to read if a[i] >= 0,
this mutant would not be killed by our original test data in Table 5.1.
Our inclination would be to augment the test data with a case that included a zero in the array elements, for example:
a = 0, 45, 3, SIZE = 3.
However, this test would not cause the mutant to be killed because adding a zero to the output variable sum does not change its final value. In this case it is not possible to kill the mutant. When this occurs, the mutant is said to be equivalent to the original program.
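Since the code of Figure 5.2 is not reproduced in this section, the sketch below assumes only what the discussion implies: that the loop sums the positive elements of the array a. Under that assumption, the fate of the three mutants discussed above can be annotated directly in the code.

/* A plausible reconstruction of the loop in Figure 5.2; it is assumed
   here to sum the positive elements of the array a. */
int sum_positive(int a[], int size) {
    int i = 0;
    int sum = 0;
    while (i < size) {
        /* Original: a[i] > 0.
           Mutant "a[i] < 0" is killed: it sums the negative elements
           instead of the positive ones.
           Mutant "a[i] >= 0" is equivalent: adding a zero element never
           changes the value of sum, so no test data can kill it. */
        if (a[i] > 0)
            sum = sum + a[i];
        /* Original: i = i + 1.
           Mutant "i = i + 2" is killed: it skips every other element, so
           the branch coverage tests of Table 5.1 produce a different sum. */
        i = i + 1;
    }
    return sum;
}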
To measure the mutation adequacy of a test set T for a program P we can use what is called a mutation score (MS), which is calculated as follows [12]:
MS(P,T) = (# of dead mutants) / (# of total mutants − # of equivalent mutants)
Equivalent mutants are discarded from the mutant set because they do not contribute to the adequacy of the test set.
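For illustration, with hypothetical numbers: if a tool generates 100 mutants for P, 4 of them turn out to be equivalent, and the test set T kills 90 of the remaining 96, then MS(P,T) = 90/(100 − 4) = 90/96 ≈ 0.94, so T is close to, but not yet, mutation adequate (adequacy requires a score of 1).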
Mutation testing is useful in that it can show that certain faults as represented in the mutants are not likely to be present since they would have been revealed by test data. It also helps the tester to generate hypotheses about the different types of possible faults in the code and to develop test cases to reveal them. As previously mentioned there are tools to support developers and testers with producing mutants. In fact, many hundreds of mutants can be produced easily. However, running the tests, analyzing results, and developing additional tests, if needed, to kill the mutants are all time consuming. For these reasons mutation testing is usually applied at the unit level. However, recent research in an area called interface mutation (the application of mutation testing to evaluate how well unit interfaces have been tested) has suggested that it can be applied effectively at the integration test level as well [12].
Mutation testing as described above is called strong mutation testing.
There are variations that reduce the number of mutants produced. One of these is called weak mutation testing, which focuses on specific code components and is described by Howden [13].
5.6 Evaluating Test Adequacy Criteria

Most of the white box testing approaches we have discussed so far are associated with the application of an adequacy criterion. Testers are often faced with the decision of which criterion to apply to a given item under test, given the nature of the item and the constraints of the test environment (time, costs, resources). One source of information the tester can use to select an appropriate criterion is the test adequacy criterion hierarchy shown in Figure 5.5, which describes a subsumes relationship among the criteria. Satisfying an adequacy criterion at the higher levels of the hierarchy implies a greater thoroughness in testing [1,14–16]. The criteria at the top of the hierarchy are said to subsume those at the lower levels.
For example, achieving all definition-use (def-use) path adequacy means the tester has also achieved both branch and statement adequacy. Note from the hierarchy that statement adequacy is the weakest of the test adequacy criteria. Unfortunately, in many organizations achieving a high level of statement coverage is not even included as a minimal testing goal.
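The gap between the weaker and stronger criteria can be seen in a small fragment. The code below is hypothetical (it does not appear in this chapter's figures): a single test case executes every statement, yet leaves one branch outcome untested.

/* Hypothetical fragment used only to contrast the two criteria. */
int credit(int balance, int bonus) {
    if (balance > 0)          /* the false outcome has no statement of its own */
        balance = balance + bonus;
    return balance;
}
/* The single test case (balance = 10, bonus = 5) executes every statement,
   so it is statement adequate, but it never takes the false outcome of the
   decision. Branch adequacy also demands a case such as balance = -1, which
   is one way of seeing why branch coverage subsumes statement coverage. */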
FIG. 5.5
A partial ordering for test adequacy criteria. (The criteria shown in the figure are: all paths; all def-use paths; all uses; all c-uses, some p-uses; all p-uses, some c-uses; all defs; all p-uses; all c-uses; all branches; all statements; strong mutation; and weak mutation.)
As a conscientious tester you might at first reason that your testing goal should be to develop tests that can satisfy the most stringent criterion.
However, you should consider that each adequacy criterion has both strengths and weaknesses. Each is effective in revealing certain types of defects. Application of the so-called “stronger” criteria usually requires more tester time and resources. This translates into higher testing costs.
Testing conditions and the nature of the software should guide your choice of a criterion.
Support for evaluating test adequacy criteria comes from a theoretical treatment developed by Weyuker [2]. She presents a set of axioms that allow testers to formalize properties which should be satisfied by any good program-based test data adequacy criterion. Testers can use the axioms to
• recognize both strong and weak adequacy criteria; a tester may decide to use a weak criterion, but should be aware of its weakness with respect to the properties described by the axioms;
• focus attention on the properties that an effective test data adequacy criterion should exhibit;
• select an appropriate criterion for the item under test;
• stimulate thought for the development of new criteria; the axioms are the framework with which to evaluate these new criteria.
The axioms are based on the following set of assumptions [2]:
(i) programs are written in a structured programming language;
(ii) programs are SESE (single entry/single exit);
(iii) all input statements appear at the beginning of the program;
(iv) all output statements appear at the end of the program.
The axioms/properties described by Weyuker are the following [2]:
1. Applicability Property
“For every program there exists an adequate test set.” What this axiom means is that for all programs we should be able to design an adequate test set that properly tests it. The test set may be very large, so the tester will want to select representative points of the specification domain to test it. If we test on all representative points, that is called an exhaustive test set. The exhaustive test set will surely be adequate since there will be no other test data that we can generate. However, in past discussions we have ruled out exhaustive testing because in most cases it is too expensive, time consuming, and impractical.
2. Nonexhaustive Applicability Property
“For a program P and a test set T, P is adequately tested by the test set T, and T is not an exhaustive test set.” To paraphrase, a tester does not need an exhaustive test set in order to adequately test a program.
3. Monotonicity Property
“If a test set T is adequate for program P, and if T is equal to, or a subset of T′, then T′ is adequate for program P.”
4. Inadequate Empty Set
“An empty test set is not an adequate test for any program.” If a program is not tested at all, a tester cannot claim it has been adequately tested!
Note that these first four axioms are very general and apply to all programs independent of programming language; they apply equally to both program-based and specification-based testing. For some of the next group of axioms this is not true.
5. Antiextensionality Property
“There are programs P and Q such that P is equivalent to Q, and T is adequate for P, but T is not adequate for Q.” We can interpret this axiom as saying that just because two programs are semantically equivalent (they may perform the same function) does not mean we should test them the same way. Their implementations (code structure) may be very different.
The reader should note that if programs have equivalent specifications then their test sets may coincide using black box testing techniques, but this axiom applies to program-based testing and it is the differences that may occur in program code that make it necessary to test P and Q with different test sets.
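A small, hypothetical pair of programs makes the point concrete; the functions below and the use of branch adequacy as the criterion are illustrative choices, not part of Weyuker's formulation.

/* P and Q are semantically equivalent: both return twice their argument. */
int P(int x) {
    return 2 * x;                 /* straight-line code, no decisions */
}

int Q(int x) {
    if (x < 0)                    /* same function, different structure */
        return x + x;
    return 2 * x;
}
/* The test set T = { x = 3 } is branch adequate for P, which has no
   branches to cover, but not for Q, whose true branch is never taken.
   Equivalent programs may therefore require different test sets. */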
6. General Multiple Change Property
“There are programs P and Q that have the same shape, and there is a test set T such that T is adequate for P, but is not adequate for Q.” Here Weyuker introduces the concept of shape to express a syntactic equivalence. She states that two programs are the same shape if one can be transformed into the other by applying the set of rules shown below any number of times:
(i) replace relational operator r1 in a predicate with relational operator r2;
(ii) replace constant c1 in a predicate or an assignment statement with constant c2;
(iii) replace arithmetic operator a1 in an assignment statement with arithmetic operator a2.
Axiom 5 says that semantic closeness is not sufficient to imply that two programs should be tested in the same way. Given this definition of shape, Axiom 6 says that even the syntactic closeness of two programs is not strong enough reason to imply they should be tested in the same way.
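Again with hypothetical code, and again using branch adequacy as the criterion: the two programs below have the same shape, since Q is obtained from P by replacing the constant 0 with 100 (rule ii above), yet a test set adequate for P is not adequate for Q.

/* P and Q have the same shape; they differ only in one constant. */
int P(int x) {
    if (x > 0)
        return 1;
    return 0;
}

int Q(int x) {
    if (x > 100)
        return 1;
    return 0;
}
/* The test set T = { x = 50, x = -50 } exercises both outcomes of the
   decision in P and is therefore branch adequate for P. In Q both test
   cases take the false branch, so T is not adequate for Q: syntactic
   closeness alone does not justify reusing a test set. */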
7. Antidecomposition Property
“There is a program P and a component Q such that T is adequate for P, T′ is the set of vectors of values that variables can assume on entrance to Q for some t in T, and T′ is not adequate for Q.” This axiom states that although an encompassing program has been adequately tested, it does not follow that each of its component parts has been properly tested. Implications for this axiom are:
1. a routine that has been adequately tested in one environment may