
A METHOD AGREEMENT AND MEASUREMENT ERROR CHECKLIST

Moreover, both measurement tools (not just the more convenient or cheaper alternative) should be appraised for test–retest measurement error. Only through such an analysis can a firm conclusion be made regarding the source of any disagreement between different methods of measurement (Bland and Altman, 2003; Atkinson et al., 2005a).

Some aspects of least-squares regression (LSR) should be used with caution when examining agreement between measurement methods relevant to exercise physiology. It is likely that both physiological measurement methods show approximately similar degrees of test–retest error, because the major component of this error is ubiquitous and biological in origin. This error, present when using either measurement method, means that an important assumption of LSR might be violated, leading to biased estimates of the LSR slope and intercept statistics (Ludbrook, 1997; Bland and Altman, 2003; Atkinson et al., 2005a). These statistics are conventionally used to make inferences about systematic differences between methods, but the slope and intercept of a LSR line are unbiased only if the ‘predictor’ method is associated with substantially lower levels of test–retest measurement error than the other method or is in fact a ‘fixed’ variable. Moreover, the prediction philosophy of regression does not sit well with the fact that most researchers wish to select, a priori, the best measurement tool to use throughout their investigations, rather than aiming to predict measurements made with one method from those made with another as part of their study.
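This attenuation can be demonstrated with a minimal simulation, a sketch rather than anything from the original checklist: two hypothetical methods measure the same true value with equal random error, yet ordinary LSR of one method on the other returns a slope biased below 1 by the factor var(true)/(var(true) + var(error)). All numbers and names are illustrative assumptions.

```python
# Minimal simulation of regression dilution: both methods contain the same
# random (biological) test-retest error, so the LSR slope is biased.
# All values are hypothetical illustrations, not values from the text.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
true_value = rng.normal(60, 8, n)                    # a 'true' physiological score

error_sd = 3.0                                       # equal random error in both methods
method_a = true_value + rng.normal(0, error_sd, n)   # 'predictor' method
method_b = true_value + rng.normal(0, error_sd, n)   # comparison method

# LSR of B on A: the slope is pulled below the true value of 1 because the
# predictor A is itself error-laden (the errors-in-variables problem).
slope, intercept = np.polyfit(method_a, method_b, 1)
print(f"LSR slope = {slope:.3f} (true 1.0), intercept = {intercept:.2f} (true 0.0)")
print(f"expected attenuation = {8**2 / (8**2 + error_sd**2):.3f}")
```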

Like all statistics, those used to describe error and agreement are population specific, since different populations may show different degrees of error due to biological sources. Different individuals sampled from the same population may also show different degrees of error; for example, individuals who record the highest physiological values in general might also show the greatest amount of measurement error. Therefore, whether error might differ for different individuals in the population, and whether the statistical precision of the sample error estimate is adequate, are important considerations.

There are other philosophical issues that underpin the statistical techniques used to appraise a physiological measurement tool. Our aim is not to discuss these issues here, since there are now several comprehensive reviews in which the background to the statistical analysis is explained (Atkinson and Nevill, 1998; Bland and Altman, 1999, 2003; Atkinson et al., 2005a). Instead, we aim to summarise the most important aspects of a measurement study in the form of a checklist for exercise physiologists.


This checklist should be relevant both to researchers planning a measurement study and to those appraising one submitted to a scientific journal. By ‘measurement study’, we mean an investigation into either the agreement between measurement methods (a method comparison study) or test–retest measurement error (a repeatability or reliability study). We have categorised the various important points into (1) Delimitations, (2) Systematic error examination, (3) Random error examination and (4) Statistical precision.

1 Delimitations

Ideally, the measurement study should involve at least 40 participants. If there are fewer than 40 participants, then scrutiny of confidence limits for the error statistics becomes even more important (see Section 4), since error estimates calculated on a small sample can be imprecise (Atkinson, 2003).

Try to match the characteristics of the measurement study to the planned uses of the measurement tool, that is, a similar population, a similar time between repeated measurements (for investigations into test–retest error) and a similar exercise protocol, as well as comparable resting conditions during measurements.

Select a priori an amount of error that is deemed acceptable between the methods or repeated tests. This delimitation may depend on whether one wishes to use the measurement tool predominantly for research purposes (i.e. on a sample of participants) or for making measurements on individuals (e.g. for health screening purposes or for sports science support work). Atkinson and Nevill (1998) termed these considerations ‘analytical goals’.

For research purposes, the analytical goal for measurement error is best set via a statistical power calculation. One could delimit an amount of test–retest error (described by the standard deviation of the differences, for example) on the basis of an acceptable statistical power to detect a given difference between groups or treatments with a feasible sample size (Atkinson and Nevill, 2001).

If a relatively large sample is feasible for future research, then a given amount of measurement error should have less impact on use of the measurement tool, and vice versa (Figure 5.1).
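A minimal sketch of such a power-based analytical goal follows. It assumes that the standard deviation of test–retest differences equals the typical-error CV multiplied by √2 and uses a two-sided paired t-test approximation; the function name and the rough small-sample correction are illustrative assumptions, not a prescribed procedure.

```python
# Sketch of setting an 'analytical goal' by power calculation: given a
# test-retest typical error (as a CV) and a worthwhile change, estimate the
# sample size for a pre-post (paired) design, in the spirit of Figure 5.1.
import math
from scipy import stats

def paired_sample_size(cv_percent, change_percent, power=0.90, alpha=0.05):
    sd_diff = cv_percent * math.sqrt(2)          # SD of differences from the CV
    effect = change_percent / sd_diff            # standardised worthwhile change
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return math.ceil((z / effect) ** 2 + 2)      # +2: rough t-distribution correction

# Figure 5.1 example: 5% CV, 5% worthwhile change, 90% power
print(paired_sample_size(5, 5))   # ~24 here; the nomogram reads roughly 20
```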

For use of the measurement tool on individuals, one might delimit the acceptable amount of error on the basis of the ‘worst scenario’ individual difference that would be allowable. This delimitation is related to the 95% limits of agreement (LOA) statistic (Bland and Altman, 1999, 2003) as well as applications of the standard error of measurement (SEM) statistic (Harvill, 1991). For example, a difference as large as 5 beats·min−1 between two repeated measurements of heart rate during exercise would probably still make little difference to the prescription of heart rate training ‘zones’ for individuals.

The use of arbitrary ‘rules of thumb’, such as accepting adequate agreement between methods or tests on the basis of a correlation coefficient being above 0.9 or a coefficient of variation (CV) being below 10%, is discouraged, since such generalisations make no link between error and the real uses of the measurement tool (Atkinson, 2003). Nevertheless, reviews in which error statistics are cited for various measurements (e.g. Hopkins, 2000) may be useful in establishing a ‘typical’ degree of acceptable error to select.

2 Systematic error examination

Compare the mean difference between methods/tests with the a priori defined acceptable level of agreement (see Figure 5.2 and Section 4 below on the use of confidence limits for interpretation of this mean difference).

If systematic error is present between repeated tests using the same method (i.e. a repeatability study) and no performance test has been administered, then be suspicious about the design of the repeatability study. Perhaps there have been carry-over effects from previous measurements being obtained too close in time to subsequent measurements. Such a scenario could occur with measurements of intra-aural temperature, for example (Atkinson et al., 2005b).

If a performance test is incorporated in the protocol, then systematic differences between test and retest(s) in a repeatability study may occur due to learning effects, for example. Such information is important for advising future researchers how many familiarisation sessions might be required prior to the formal recording of physiological values.


[Figure 5.1: sample size (0–200) plotted against ratio limits of agreement (1.0–2.0) and coefficient of variation (0–35%), with separate curves for worthwhile changes of 1%, 5%, 10%, 20% and 30%.]

Figure 5.1 A nomogram to estimate the effects of measurement repeatability error on whether ‘analytical goals’ are attainable in exercise physiology research. Statistical power is 90%. The different lines represent different worthwhile changes of 1%, 5%, 10%, 20% and 30% due to some hypothetical intervention. The measurement error statistics that can be utilised are the LOA and CV. For example, a physiological measurement tool that has a repeatability CV of 5% would allow detection of a 5% change in a pre-post design experiment (using a paired t-test for analysis of data) with a feasible sample size (~20 participants).

Source: Batterham, A.M. and Atkinson, G. (2005). How big does my sample need to be? A primer on the murky world of sample size estimation. Physical Therapy in Sport, 6, 153–163.

Examine whether the degree of systematic error alters over the measurement range. One can consult the Bland–Altman plot for this information (Figure 5.3). The presence of ‘proportional’ bias would be indicated if the points on the Bland–Altman plot show a pronounced downward or upward trend over the measurement range. If this characteristic is present, it means that the systematic difference between methods/tests differs for individuals at the low and high ends of the measurement range. Bland and Altman (1999) and Atkinson et al. (2005a) discuss how this proportional bias can be explored and modelled.
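A minimal sketch of these Bland–Altman quantities is given below, using hypothetical heart-rate data. Proportional bias is screened here by regressing the between-method differences on the pairwise means, one of the approaches discussed by Bland and Altman (1999); the data and variable names are illustrative assumptions.

```python
# Sketch: mean difference (systematic error) and a check for proportional
# bias on a Bland-Altman plot. Data are hypothetical exercise heart rates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_hr = rng.uniform(120, 180, 30)              # 'true' heart rates, beats/min
method_a = true_hr + rng.normal(0, 3, 30)
method_b = true_hr + 2 + rng.normal(0, 3, 30)    # method B reads ~2 beats higher

differences = method_a - method_b
means = (method_a + method_b) / 2

print(f"mean difference = {differences.mean():.1f} beats/min")

# A regression slope clearly different from zero would indicate that the
# systematic difference changes over the measurement range.
fit = stats.linregress(means, differences)
print(f"proportional-bias slope = {fit.slope:.3f} (p = {fit.pvalue:.2f})")
```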

3 Random error examination

Scrutinise the degree of random error between methods/tests that is present (Figure 5.3). Popular statistics used to describe random error include the SEM (Harvill, 1991), which is also known as the within-subjects standard deviation, the CV, the LOA and the standard deviation of the differences. Each of these statistics will differ, since they are based on different underlying philosophies. For example, the LOA statistic is rooted in clinical work and can be viewed as representing the ‘worst scenario’ error that one might observe for an individual person. The SEM statistic is popular amongst psychologists, who conceptualise the ‘true score’ for an individual as the average of many repeated measurements (Harvill, 1991).
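For a two-trial repeatability study, these statistics can be computed as in the sketch below. The test–retest heart rates are hypothetical, and the SEM is taken as the standard deviation of the differences divided by √2, as applies to two error-laden trials.

```python
# Sketch of common random error statistics for a test-retest (2-trial) study:
# SD of the differences, SEM (within-subjects SD) and 95% limits of agreement.
# The data are hypothetical heart rates, beats/min.
import numpy as np

test = np.array([152., 148., 161., 170., 155., 144., 166., 158., 149., 163.])
retest = np.array([150., 151., 158., 173., 152., 146., 169., 155., 151., 160.])

diff = retest - test
bias = diff.mean()                       # systematic error between trials
sd_diff = diff.std(ddof=1)               # SD of the differences

sem = sd_diff / np.sqrt(2)               # within-subjects SD for two trials
loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)

print(f"bias = {bias:.2f}, SD of differences = {sd_diff:.2f}")
print(f"SEM = {sem:.2f}, 95% LOA = {loa[0]:.2f} to {loa[1]:.2f}")
```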

[Figure 5.2: systematic error between methods (mmHg) on a scale from −20 to +20 mmHg, showing the confidence intervals for four example comparisons, A to D.]

Figure 5.2 Using a confidence interval and ‘region of equivalence’ (shaded area) for the mean difference between methods/tests. Using the agreement between two blood pressure monitors as an example, if both limits of the 95% CI fall inside an a priori selected ‘region of equivalence’ of ±5 mmHg for the mean difference between methods, as is the case with point A, then we can be reasonably certain that the true systematic difference between methods is not clinically or practically important. For point D, even the lower 95% confidence limit of 10.6 mmHg for the systematic difference between methods does not lie within the designated region of equivalence, so we can be reasonably certain that the degree of systematic bias would have practical impact. The widths of the CIs for points B and C suggest that the population mean bias might be practically important, but one would need more cases in the measurement study to be reasonably certain of the true magnitude or, in the case of point B, even the direction of the systematic bias for the population.


Explore the relationship between random error and the magnitude of the measured value (Atkinson and Nevill, 1998; Bland and Altman, 1999). If the random error does increase in proportion to the size of the measured value, then a ratio statistic should be employed to describe the measurement error (e.g. CV) (Nevill and Atkinson, 1997; Bland and Altman, 1999). If the error is ‘homoscedastic’ (uniform over the measurement range), then measurement error can be described in the particular units of measurement by calculating the LOA or SEM. Complicated relationships between the magnitude of error and the measured value can be analysed using non-parametric methods according to Bland and Altman (1999) or by calculating measurement error statistics for separate sub-samples within a population (Lord, 1984).
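A sketch of this exploration follows: heteroscedasticity is screened by correlating absolute test–retest differences with pairwise means, and a CV-type ratio error is then obtained from log-transformed data. The data and the choice of a Pearson correlation for the screen are illustrative assumptions, not the only options discussed in the cited sources.

```python
# Sketch: check whether random error grows with the measured value, and if
# so describe it as a ratio (CV) via log-transformation. Data are hypothetical.
import numpy as np
from scipy import stats

test = np.array([31.2, 44.5, 52.1, 60.3, 38.7, 47.9, 55.4, 41.0, 65.8, 58.2])
retest = np.array([30.1, 46.2, 50.0, 63.1, 39.5, 46.1, 57.9, 40.2, 62.3, 59.0])

abs_diff = np.abs(retest - test)
means = (test + retest) / 2

# A positive correlation suggests heteroscedastic (proportional) error.
r, p = stats.pearsonr(means, abs_diff)
print(f"heteroscedasticity screen: r = {r:.2f} (p = {p:.2f})")

# Typical error as a ratio, from the SD of the log differences:
log_diff = np.log(retest / test)
cv = np.exp(log_diff.std(ddof=1) / np.sqrt(2)) - 1
print(f"CV ~ {100 * cv:.1f}%")
```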

In a repeatability study that involves multiple retests, examine whether random error changes between the separate test and retest(s).

The researcher could explore whether random error reduces as more tests are administered (Nevill and Atkinson, 1998). If this is so, and in keeping with the advice above for the exploration of systematic error, the researcher should communicate this learning effect on random error so that future users know exactly how many familiarisation sessions are required for total error variance to be minimised for their research.

Figure 5.3 Various examples of relationships between systematic and random error and the size of the measured value, as shown on a Bland–Altman plot. (A) Proportional random error. (B) Systematic error present, which is uniform in nature; random errors also uniform. (C) Systematic error present, which is proportional to the size of the measured value; random errors uniform. (D) Proportional random error with no systematic error. (E) Uniform systematic error present with proportional random error. (F) Proportional systematic and random error.

Source: Atkinson, G., Davison, R.C.R. and Nevill, A.M. (2005). Performance characteristics of gas analysis systems: what we know and what we need to know. International Journal of Sports Medicine, 26 (Suppl. 1): S2–S10.




4 Statistical precision

Calculate the 95% confidence interval (CI) for the mean difference between methods/tests (Jones et al., 1996) and compare this CI to a ‘region of equivalence’ for the two methods of measurement (Figure 5.2). This CI is not the same as the LOA statistic. Scrutiny of the lower and upper limits of this CI should not change the conclusion that has been arrived at regarding the acceptability of systematic error between methods/tests (Figure 5.2). For example, one might observe a mean difference of 10 mmHg between blood pressure measuring devices, but the 95% CI might be −4 to +24 mmHg. This means that the population mean difference between methods could be as much as 4 mmHg in one direction or as much as 24 mmHg in the other direction. Only a narrower CI, mediated mostly by a greater sample size, would allow one to make a more conclusive statement regarding systematic error. Atkinson and Nevill (2001) and Atkinson et al. (2005a) discuss the use of CIs and limits further.
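The comparison can be sketched as follows, using hypothetical between-device differences and a hypothetical ±5 mmHg region of equivalence:

```python
# Sketch: 95% CI for the mean difference between two blood pressure devices,
# compared against an a priori region of equivalence (see Figure 5.2).
# The differences and the +/-5 mmHg region are hypothetical.
import numpy as np
from scipy import stats

diff = np.array([3., -1., 4., 6., 2., 0., 5., -2., 3., 1., 4., 2.])  # A - B, mmHg
n = diff.size
mean = diff.mean()
se = diff.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

region = 5.0   # region of equivalence: -5 to +5 mmHg
print(f"mean difference = {mean:.1f} mmHg, 95% CI = {lo:.1f} to {hi:.1f}")
print("within region" if -region < lo and hi < region else "not shown equivalent")
```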

Calculate confidence limits for the random error statistics. As above, scrutiny of the CI should not change the decision that has been made about acceptability of random error. For example, a physiological measurement method with a repeatability CV of 30% and associated CI of 25–35% indicates poor repeatability, even if the lower limit of the CI is taken into account. Bland and Altman (1999) provide details relevant to limits of agreement.

Hopkins (2000) shows how to calculate confidence limits for CV and Morrow and Jackson (1993) provide details for intraclass correlation.
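As one concrete possibility (a sketch under the standard normal-theory assumption, not a reproduction of the calculations in either source), exact confidence limits for a within-subjects SD can be obtained from the chi-squared distribution and then expressed as a CV:

```python
# Sketch: chi-squared confidence limits for a within-subjects SD (SEM),
# which can be expressed as a CV if divided by the grand mean.
# The inputs below are hypothetical.
import numpy as np
from scipy import stats

sd_within = 4.0   # within-subjects SD from a repeatability study
df = 30           # error degrees of freedom, e.g. subjects x (trials - 1)

lower = sd_within * np.sqrt(df / stats.chi2.ppf(0.975, df))
upper = sd_within * np.sqrt(df / stats.chi2.ppf(0.025, df))
print(f"95% CI for within-subjects SD: {lower:.2f} to {upper:.2f}")
```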