
5.6 Reliability statistics: General


As we have seen, the reliability coefficient of Equation 5.1 is defined in terms of variances: the variance of systematic person characteristics, σ²_T, and the variance of measurements across replications for a single person, σ²_E. There are several ways to estimate the variance ratio shown in Equation 5.1 [9], but one direct method is simply to estimate the separate variance components and then combine them in the form of Equation 5.1. Estimates of this sort are called intraclass correlations.
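For reference, Equation 5.1 (introduced earlier in the chapter) expresses the reliability coefficient as this variance ratio; in the notation just given it can be written as

```latex
R_X = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```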

Intraclass correlation is not a single statistic, but rather a family of statistics that can be used for estimating reliability. In this section we review several versions that can be used with a wide variety of variables. We focus here on the easiest part of reliability analysis: ‘point estimation’ of the statistic that summarises the reliability results. Although the topic is important, we do not have the space to present the methods that must be used to estimate 95% confidence intervals for the study results. The form of the interval estimators depends on the nature and distribution of the data, and new methods are being actively developed in the literature. For reviews of methods for confidence intervals see Shrout [7], Dunn [9] and Blackman and Koval [33]. It is important to note, however, that estimates of reliability are often less precise than we would like [34], and that this fact is made clear by the use of confidence intervals.

The intraclass correlation point estimates are derived from information summarised in the analysis of variance (ANOVA) of the data from the reliability study. The ANOVA treats each subject as a level of the SUBJECTS factor. Usually subjects are considered to be a random factor, because they are selected to be representative of a population of interest.

If the replicate measurements of the subjects are systematically obtained using a certain set of k raters or measuring devices, then the ANOVA might involve a two-way SUBJECTS by MEASURES design. If, on the other hand, the replicate measurements of each subject are obtained by randomly sampling k measures, then the analysis would use a one-way ANOVA.

5.6.1 One-way ANOVA analyses

Table 5.1 illustrates data that might be collected in a reliability study of relative informants. Each of N=10 probands is rated by k=3 distinct relatives.

Between-subject variation can be estimated using all k ratings, and within-subject variation is used to estimate the magnitude of the error variation. When the relationships of the relatives vary from proband to proband (e.g. siblings for one proband, parents for another, cousins for a third), there is no data-analytic structure linking informants across probands. Had there been such a structure, we might have considered a proband-by-relationship two-way ANOVA. In our analysis we will assume that the informants are essentially a random sample of possible informants for a given respondent.

Table 5.2 shows the layout of the one-way ANOVA, along with the numerical estimates obtained from the data in Table 5.1. The actual computation of the ANOVA results can be obtained from standard computer software, such as SPSS RELIABILITY [35]. The numerical example illustrates a pattern in which the between-subjects (probands) mean square is substantially larger than the within-subjects mean square. Consistent with an informal examination of the hypothetical data in Table 5.1, this pattern suggests that the differences between subjects’ mean ratings are larger than the disagreements among relatives regarding the subjects’ scores.

Table 5.1 Hypothetical data on functioning of 10 probands rated by three of their relatives.

Proband   Relative 1   Relative 2   Relative 3
 1        29           32           17
 2        23           33           28
 3        19           17           18
 4         6           10            5
 5        13           20           20
 6         0            0            2
 7        10           11           15
 8         5            1           15
 9        31           26           19
10        15           17           18

The reliability estimate for the one-way ANOVA is calculated using the first formula in Table 5.3A. This form of the intraclass correlation was called ICR(1,1) by Shrout and Fleiss [36], and we retain that designation. To illustrate the calculation with the numerical example from Table 5.1, we find

ICR(1,1) = (251.0 − 22.2)/(251.0 + 2 × 22.2) = 0.77

This result describes the reliability of a single randomly selected informant. About 77% of the variance of a single informant’s ratings is attributable to systematic differences between subjects. Although the stability of the result might be questioned because of the limited sample size, the result is encouraging: in this population, this rating appears to be made fairly reliably by a single informant.
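As a concrete sketch (not from the chapter), the ICR(1,1) computation is easy to script once the mean squares are in hand; here they are taken from Table 5.2, though in practice they would come from an ANOVA routine such as SPSS RELIABILITY or any general statistics package. The function name is ours:

```python
def icr_1_1(bms: float, wms: float, k: int) -> float:
    """ICR(1,1) from Table 5.3A: reliability of one randomly chosen rater,
    given between-subjects (BMS) and within-subjects (WMS) mean squares
    from a one-way ANOVA with k ratings per subject."""
    return (bms - wms) / (bms + (k - 1) * wms)

# Mean squares reported in Table 5.2 for the Table 5.1 data.
print(round(icr_1_1(bms=251.0, wms=22.2, k=3), 2))  # 0.77
```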

Suppose that it is possible to obtain three informant ratings for each subject in the survey. How much more reliable would the average of the three ratings be than an individual informant? The answer can be calculated using the Spearman–Brown formula (Equation 5.2), with k=3 and R_X=0.77.
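For convenience, Equation 5.2 (the Spearman–Brown formula) gives the reliability R_k of the average of k parallel ratings in terms of the single-rating reliability R_X:

```latex
R_k = \frac{k\,R_X}{1 + (k - 1)\,R_X}
```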

Table 5.2 Analysis of variance when replications are nested within subjects: one-way ANOVA.

Source of variation   df       Sums of squares   Mean squares         Table 5.1 example: MS on df
Between subjects      n−1      BSS               BMS = BSS/(n−1)      BMS = 251.0 on 9 df
Within subjects       n(k−1)   WSS               WMS = WSS/[n(k−1)]   WMS = 22.2 on 20 df

Table 5.3 Versions of intraclass correlation statistics useful for various reliability designs.

(A) Reliability of single rater

Type of reliability study design                 Raters fixed or random?   Version of intraclass correlation^a
Nested: n subjects rated by k different raters   Random                    ICR(1,1) = (BMS − WMS)/[BMS + (k − 1)WMS]
Subject by rater crossed design                  Random                    ICR(2,1) = (TMS − EMS)/[TMS + (k − 1)EMS + k(JMS − EMS)/n]
Subject by rater crossed design                  Fixed                     ICR(3,1) = (TMS − EMS)/[TMS + (k − 1)EMS]

(B) Reliability of the average of k ratings

Nested: n subjects rated by k different raters   Random                    ICR(1,k) = (BMS − WMS)/BMS
Subject by rater crossed design                  Random                    ICR(2,k) = (TMS − EMS)/[TMS + (JMS − EMS)/n]
Subject by rater crossed design                  Fixed                     ICR(3,k) = (TMS − EMS)/TMS

^a BMS and WMS refer to between-subject and within-subject mean squares from a one-way ANOVA. TMS, JMS and EMS refer to between-subjects (targets), between-measures (judges) and error mean squares from a two-way ANOVA based on n target subjects and k raters.

Alternatively, one can use the formula for ICR(1,k) shown in Table 5.3B. This formula is obtained by algebraically combining the expression for ICR(1,1) with the Spearman–Brown formula. In this case the answer is ICR(1,k)=0.91. About 91% of the variance of the average of three randomly chosen informants is attributable to systematic differences between subjects.
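A short check (again a sketch with our own function names) confirms that the two routes agree:

```python
def spearman_brown(r: float, k: int) -> float:
    """Equation 5.2: reliability of the average of k parallel ratings."""
    return k * r / (1 + (k - 1) * r)

def icr_1_k(bms: float, wms: float) -> float:
    """ICR(1,k) from Table 5.3B: reliability of the mean of k nested ratings."""
    return (bms - wms) / bms

print(round(spearman_brown(0.77, k=3), 2))     # 0.91
print(round(icr_1_k(bms=251.0, wms=22.2), 2))  # 0.91
```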

5.6.2 Two-way ANOVA analyses

Table 5.4 illustrates data that might be collected in a reliability study of two professional raters or interviewers. As a result of the interview by Interviewer 1 we have both a binary diagnosis (disorder present [X=1] vs. disorder absent [X=0]) and a quantitative score, such as a total functioning score (called Z in the table). Replicate scores and diagnoses are obtained by a second interviewer, called Interviewer 2. The hypothetical data on X1, X2, Z1 and Z2 are shown for 17 respondents. The layout of the two-way ANOVA is shown in Table 5.5, along with numerical results from the Table 5.4 examples.

Only two interviewers were used in the reliability study illustrated in Table 5.4, but we might consider the two to be a random sample from all possible interviewers for the study. If so, then they must not be selected on the basis of their special skills as interviewers, but rather should be selected to be representative. When the interviewers who are employed in the reliability study represent the population of potential interviewers, we say that they are random effects.

Table 5.4 Hypothetical data on assessment of depression and functioning. X1 and X2 represent test–retest diagnoses of major depression (X=1, present; X=0, not present), and Z1 and Z2 represent ratings of adaptive functioning.

Respondent   X1   X2   Z1   Z2
 1           0    1    17   11
 2           1    1    17   15
 3           0    0    26   25
 4           0    0    24   22
 5           0    0    19   14
 6           0    0    22   16
 7           0    0    17   18
 8           0    0    23   19
 9           1    0    19   16
10           0    1    18   12
11           0    0    21   18
12           1    1    13   11
13           0    0    21   23
14           0    0    22   17
15           0    0    15   12
16           0    0    20   18
17           0    0    21   20


In some cases we are interested in the ratings of specific interviewers rather than a population of interviewers. Suppose Interviewer 1 is a doctoral candidate who carried out her own data collection, and that Interviewer 2 is a colleague who is hired to document that the ratings are systematic. In this case we simply wish to describe the quality of data collected by the doctoral candidate, and we say that the interviewers are fixed effects.

Table 5.5 Analysis of variance when replications have structure: two-way ANOVA.

Source of variation          df           Sums of squares   Mean squares             Table 5.4 examples: mean square on df
Between subjects (targets)   n−1          TSS               TMS = TSS/(n−1)          Variable X: 0.254 on 16 df; Variable Z: 25.7 on 16 df
Between measures (judges)    k−1          JSS               JMS = JSS/(k−1)          Variable X: 0.029 on 1 df; Variable Z: 67.8 on 1 df
Residual (error)             (n−1)(k−1)   ESS               EMS = ESS/[(n−1)(k−1)]   Variable X: 0.092 on 16 df; Variable Z: 2.67 on 16 df


Depending on whether the raters are considered to be random or fixed, we use different versions of the intraclass correlation to estimate reliability.

When we wish to estimate the reliability of a randomly sampled interviewer, we use the expression for ICR(2,1) shown in Table 5.3A. This intraclass correlation is a function not only of the between-subjects mean squares and the error mean squares, but also of the between-measures (judges) mean squares.

If different raters are more or less liberal in assigning high scores, then the final variability of the ratings will be affected. ICR(2,1) takes this extra variation into account in estimating reliability.

Of the two examples in Table 5.4, one reveals a large between-measures effect and the other does not. From the numbers in Table 5.4 it can be seen that the Z2 ratings are usually smaller than the Z1 ratings: Rater 2 seems to believe that most subjects are functioning somewhat worse than Rater 1 perceives. Even with this rater difference, the reliability of Z is higher than the reliability of X, according to the data in Table 5.4. The ICR(2,1) for Z is calculated as

(25.7 − 2.67)/[25.7 + 1 × 2.67 + 2 × (67.8 − 2.67)/17] = 0.64

The ICR(2,1) for X is calculated as

(0.254 − 0.092)/[0.254 + 1 × 0.092 + 2 × (0.029 − 0.092)/17] = 0.48

For the rating of adaptive functioning we could consider averaging the two individual Z ratings to obtain a more reliable score. We can use either the Spearman–Brown formula or the expression for ICR(2,k) to calculate the reliability of the mean of two such ratings. In this case the result is 0.78, rather than 0.64.
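These random-raters formulas are easy to script from the Table 5.5 mean squares; the following sketch (with our own function names) reproduces the three results just quoted:

```python
def icr_2_1(tms: float, ems: float, jms: float, k: int, n: int) -> float:
    """ICR(2,1) from Table 5.3A: reliability of a single randomly sampled
    rater in a subject-by-rater crossed design."""
    return (tms - ems) / (tms + (k - 1) * ems + k * (jms - ems) / n)

def icr_2_k(tms: float, ems: float, jms: float, n: int) -> float:
    """ICR(2,k) from Table 5.3B: reliability of the average of the k ratings."""
    return (tms - ems) / (tms + (jms - ems) / n)

# Table 5.5 mean squares: n = 17 respondents, k = 2 interviewers.
print(round(icr_2_1(tms=25.7,  ems=2.67,  jms=67.8,  k=2, n=17), 2))  # Z: 0.64
print(round(icr_2_1(tms=0.254, ems=0.092, jms=0.029, k=2, n=17), 2))  # X: 0.48
print(round(icr_2_k(tms=25.7,  ems=2.67,  jms=67.8,  n=17), 2))       # Z: 0.78
```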

Although the reliability of the binary X variable is worse than that of the quantitative Z variable, it would not usually be meaningful to rely on an average diagnosis instead of a truly binary rating. For this reason the ICR(2,k) form of the intraclass correlation would not be applied to X in Table 5.4.

The calculations carried out so far have assumed that the two sets of ratings in Table 5.4 are representative of a host of possible interviewers. Now we turn our attention to the situation in which the two raters can be considered to be fixed. In this case we can either ignore systematic rater differences in mean ratings, or we can adjust for them.

The expression for ICR(3,1) in Table 5.3A is appropriate when we wish to describe the reliability of a single fixed rater. Unlike ICR(2,1), this version of the intraclass correlation is not affected by the between-rater mean squares. On average, ICR(3,1) will be larger in magnitude than ICR(2,1). By fixing the raters to certain persons, the extraneous variation due to sampling of raters is eliminated and the resulting reliability is usually higher.

This effect is especially obvious for Z, which had a large between-rater effect. The ICR(3,1) for Z is calculated as

(25.7 − 2.67)/(25.7 + 1 × 2.67) = 0.81

The ICR(3,1) for X is not much different from ICR(2,1), as the rater effects were small:

(0.254 − 0.092)/(0.254 + 1 × 0.092) = 0.47
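The fixed-raters version follows the same pattern; a sketch with our own function name:

```python
def icr_3_1(tms: float, ems: float, k: int) -> float:
    """ICR(3,1) from Table 5.3A: reliability of a single fixed rater;
    the between-rater mean squares (JMS) drop out of the formula."""
    return (tms - ems) / (tms + (k - 1) * ems)

print(round(icr_3_1(tms=25.7,  ems=2.67,  k=2), 2))  # Z: 0.81
print(round(icr_3_1(tms=0.254, ems=0.092, k=2), 2))  # X: 0.47
```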

5.6.3 The reliability of the average of k fixed measures: Cronbach’s alpha

Just as ICR(1,1) and ICR(2,1) can be used in the Spearman–Brown formula to determine how much reliability improves by using an average score, so can ICR(3,1) be used when an average measurement is of interest. In this case the reliability of the averaged measurement can be computed directly using ICR(3,k) from Table 5.3. For the quantitative Z variable, the reliability of the average is expected to be 0.90.

One common application of ICR(3,k) is to internal consistency analyses of psychometric scales. Items in self-report questionnaires are usually fixed in that the same items are used with all respondents. Suppose that n subjects are administered k scale items, and the results are analysed using the two-way ANOVA layout of Table 5.5. The estimate of the reliability of the sum or average of the k fixed items can be computed using ICR(3,k). The result is identical to Cronbach’s alpha [12], which we discussed in the first section. Alpha is computed directly by computer programs such as SPSS RELIABILITY [35].
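This identity can be checked numerically. In the sketch below, the first print uses the Table 5.5 mean squares for Z (giving the 0.90 quoted above); the 5 × 4 item matrix that follows is invented purely for illustration, and alpha from the classical item-variance formula matches ICR(3,k) from the two-way ANOVA of the same items:

```python
import numpy as np

def icr_3_k(tms: float, ems: float) -> float:
    """ICR(3,k) from Table 5.3B: reliability of the average of k fixed ratings."""
    return (tms - ems) / tms

# Table 5.5 mean squares for Z reproduce the 0.90 quoted in the text.
print(round(icr_3_k(tms=25.7, ems=2.67), 2))  # 0.90

# A made-up 5-respondent x 4-item matrix, purely for illustration.
items = np.array([
    [3, 4, 3, 4],
    [2, 2, 1, 2],
    [4, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
], dtype=float)
n, k = items.shape

# Cronbach's alpha via the classical item-variance formula.
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))

# ICR(3,k) via the two-way ANOVA mean squares of the same data.
grand = items.mean()
tss = k * ((items.mean(axis=1) - grand) ** 2).sum()
jss = n * ((items.mean(axis=0) - grand) ** 2).sum()
ess = ((items - grand) ** 2).sum() - tss - jss
tms = tss / (n - 1)
ems = ess / ((n - 1) * (k - 1))

print(round(alpha, 4), round(icr_3_k(tms, ems), 4))  # identical values
```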
