In the D-study, the researcher makes use of the information obtained in the G-study in order to make decisions regarding the final study design that she will employ when collecting data using the instrument. One major decision that will come from the D-study is the number of units for each facet that will be employed in the final data collection effort. In our case, this would be the number of raters that should be employed when rating the science projects. Using the variance components information from the G-study, we can estimate the reliability of scores for varying numbers of raters, assuming that the new raters are drawn from the same population as the raters used in the G-study. This last point is crucial to using GT, because it allows us to assume that the variation due to the rater effect will be the same with the new raters as it was with the original raters; i.e., raters are interchangeable.
In addition to the number in each facet, we must also consider whether we will be making absolute or relative decisions based on our scores. GT allows for different estimates of reliability depending upon the type of decision that we will be making. This decision is based upon whether our interest is in using the scores in a norm-referenced fashion to compare the projects with one another, or in a criterion-referenced fashion to compare the projects to an external standard. This decision will drive which of the reliability indices, discussed below, we would use. Finally, with the D-study we can make determinations regarding whether a particular source of variation should be included in the study at all. If, for example, we find that one of the facets included in the G-study accounts for essentially none of the variance in the observed scores, and the D-study confirms that changing the number of levels of this facet does not impact reliability, then we may conclude that it is not necessary to include it in the final data collection effort. In summary, then, the D-study will take information from the G-study and help us to determine which facets to include in future data collection using the instrument, how many levels of each facet are necessary, and what reliability we can expect for either a norm-referenced or criterion-referenced assessment.
G and ϕ Coefficients
In equation (5.3) we defined reliability as the ratio of the true score variance to the observed score variance, where the observed score variance was the sum of the true and error variances. Given this relationship, we can see that the smaller the error variance, the larger the reliability estimate. GT provides us with estimates of reliability that can be directly tied back to this theoretical definition. As we noted above, there are such estimates for both norm and criterion referenced decision making.
The generalizability coefficient is the reliability estimate for use in the norm-referenced context, and is defined as
$$E\rho^2 = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\delta^2} \qquad \text{(Equation 5.10)}$$

Where

$\sigma_P^2$ = variance due to persons
$\sigma_\delta^2$ = variance due to error $= \dfrac{\hat{\sigma}_{PR}^2}{N_R}$
$N_R$ = number of raters (or number of items).
This statistic is directly analogous to reliability as expressed in equation (5.3). In order to estimate $E\rho^2$ we will use the variance components from Table 5.1. Thus, we can see that the sample estimate of $\sigma_P^2$ is $\dfrac{MS_P - MS_{PR}}{N_R}$. Likewise, the estimate of $\sigma_\delta^2$ is

$$\hat{\sigma}_\delta^2 = \frac{MS_{PR}}{N_R} \qquad \text{(Equation 5.11)}$$

The value $\sigma_\delta^2$ is referred to as relative error (Brennan, 2001), and can be thought of as the difference between a person's observed deviation score and his or her universe deviation score.
Given these estimates, we can write the sample estimate of $E\rho^2$ as

$$E\hat{\rho}^2 = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{N_R}} \qquad \text{(Equation 5.12)}$$
When making use of equation (5.12) in a D-study, we will want to alter the value of $N_R$ in order to ascertain how the reliability of the scale might change given differing numbers of raters, or items.
We discussed a similar idea in Chapter 4 in the context of the Spearman-Brown prophecy formula.
Recall that with Spearman-Brown we were able to obtain values for what the scale reliability might be if we were to increase (or decrease) the number of items on the scale, assuming that any new items would be of equal quality to the existing items. Similarly, in a D-study we can obtain estimates of $E\hat{\rho}^2$ for differing numbers of elements in each facet. So in our example, we could determine what the reliability estimate for the science project scores would be if we used six raters to score each project, rather than four. Likewise, if raters are difficult to obtain, and we would like to use fewer of them in future science fairs, we could estimate $E\hat{\rho}^2$ with only two raters and determine whether that number would yield sufficient reliability. Using the results of the D-study, therefore, we can make a final determination regarding the optimal number of raters for our situation. Finally, we should note here that $E\hat{\rho}^2$ is a biased but consistent estimator of $E\rho^2$. In particular, Brennan (2001) notes that when the number of levels of a facet used in the D-study differs from the number used in the G-study there is potentially some bias, though it tends to be small, and as noted the estimates are consistent.
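To make these D-study calculations concrete, the following minimal Python sketch (our own illustration, not output from any GT package; the function name `g_coefficient` is ours) applies equation (5.12) for several candidate numbers of raters. Note that the person variance component is estimated once from the G-study mean squares, and only the divisor in the relative error term changes for the D-study; the mean square values used here are the ones that appear in How It Works 5.5 below.

```python
# A minimal sketch of a one-facet D-study for the G (generalizability)
# coefficient, based on equation (5.12). Illustrative values only.

def g_coefficient(ms_p, ms_pr, n_r_gstudy, n_r_dstudy):
    """Relative (norm-referenced) reliability for a crossed p x r design.

    ms_p       : mean square for persons from the G-study ANOVA
    ms_pr      : mean square for the person-by-rater interaction (error)
    n_r_gstudy : number of raters used in the G-study
    n_r_dstudy : number of raters planned for the D-study
    """
    # The person variance component is estimated once, from the G-study data.
    var_p = (ms_p - ms_pr) / n_r_gstudy
    # The relative error variance shrinks as more raters are averaged over.
    var_rel_error = ms_pr / n_r_dstudy
    return var_p / (var_p + var_rel_error)

if __name__ == "__main__":
    ms_p, ms_pr, n_r_g = 72.1, 11.5, 4  # mean squares from How It Works 5.5
    for n_r in (2, 4, 6):
        print(n_r, round(g_coefficient(ms_p, ms_pr, n_r_g, n_r), 3))
    # With four raters this reproduces the value of approximately 0.84
    # obtained in How It Works 5.5.
```

The same logic, applied over a range of rater numbers, is what produces the D-study tables discussed later in this chapter.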
In some instances, scores on an assessment will be compared to a standard, rather than to one another. For example, when scoring the science projects, the raters might have specific criteria for elements that need to be present to obtain a score of four. Thus, a rater’s determination of that score value will be based upon the extent to which those elements are included in a given project, rather than how that project might compare to other projects in the same science fair. When such criterion referenced or absolute decisions are being made, the estimate of reliability that we use in the D-study is ϕ (Phi), also known as the index of dependability. It is defined as:
$$\phi = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\Delta^2} \qquad \text{(Equation 5.13)}$$

Where

$$\sigma_\Delta^2 = \frac{\sigma_{PR}^2}{n_r} + \frac{\sigma_R^2}{n_r}$$
How It Works 5.5
We can use the variance components that we calculated in How It Works 5.3 to calculate the norm and criterion referenced reliability estimates from GT for the case of four raters.
$$E\hat{\rho}^2 = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{N_R}} = \frac{\dfrac{72.1 - 11.5}{4}}{\dfrac{72.1 - 11.5}{4} + \dfrac{11.5}{4}} = \frac{15.2}{15.2 + 2.9} = 0.84$$

$$\hat{\phi} = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{n_r} + \dfrac{MS_R - MS_{PR}}{N_P n_r}} = \frac{\dfrac{72.1 - 11.5}{4}}{\dfrac{72.1 - 11.5}{4} + \dfrac{11.5}{4} + \dfrac{25.8 - 11.5}{100(4)}} = \frac{15.2}{15.2 + 2.9 + 0.04} = 0.84$$
In this case, the criterion and norm-referenced coefficients were very close in value, because the variance component associated with the raters was so low. In other words, there was not much variation in the scores that could be attributed to differences in the raters. Now let us consider what happens when the raters do provide substantially different scores from one another, leading to a larger mean square associated with rater.
$$MS_P = 72.1 \qquad MS_R = 145.8 \qquad MS_{PR} = 11.5 \qquad N_R = 4 \qquad N_P = 100$$

$$E\hat{\rho}^2 = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{N_R}} = \frac{\dfrac{72.1 - 11.5}{4}}{\dfrac{72.1 - 11.5}{4} + \dfrac{11.5}{4}} = \frac{15.2}{15.2 + 2.9} = 0.84$$

$$\hat{\phi} = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{n_r} + \dfrac{MS_R - MS_{PR}}{N_P n_r}} = \frac{\dfrac{72.1 - 11.5}{4}}{\dfrac{72.1 - 11.5}{4} + \dfrac{11.5}{4} + \dfrac{145.8 - 11.5}{100(4)}} = \frac{15.2}{15.2 + 2.9 + 0.34} = 0.82$$
When the raters' scores differ from one another by a greater magnitude, the reliability estimate for the criterion-referenced condition declines somewhat, though in this example it is still certainly in the acceptable range. Also notice that the increase in variance attributable to the raters does not impact the norm-referenced reliability estimate at all.
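The two scenarios in How It Works 5.5 can be verified with a few lines of code. The Python sketch below is our own illustration (the function name is arbitrary, and this is not output from a GT package); it simply applies equations (5.12) and (5.14) to the mean squares given above.

```python
# Sketch: compare the norm-referenced (G) and criterion-referenced (phi)
# coefficients for the two rater scenarios in How It Works 5.5.

def coefficients(ms_p, ms_r, ms_pr, n_r, n_p):
    """Return (G coefficient, phi) for a fully crossed p x r design."""
    var_p = (ms_p - ms_pr) / n_r                           # person variance component
    rel_error = ms_pr / n_r                                # relative error variance
    abs_error = rel_error + (ms_r - ms_pr) / (n_p * n_r)   # adds the rater term
    g = var_p / (var_p + rel_error)
    phi = var_p / (var_p + abs_error)
    return g, phi

if __name__ == "__main__":
    # Raters fairly similar to one another (MS_R = 25.8).
    print(coefficients(72.1, 25.8, 11.5, n_r=4, n_p=100))   # roughly (0.84, 0.84)
    # Raters substantially different from one another (MS_R = 145.8).
    print(coefficients(72.1, 145.8, 11.5, n_r=4, n_p=100))  # roughly (0.84, 0.82)
```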
Brennan (2001) refers to $\sigma_\Delta^2$ as the absolute error variance, and defines it in terms of the difference between an individual's observed score and universe score; it can be thought of as the mean squared deviation for persons.
In order to obtain the estimate of ϕ, $\hat{\phi}$, we use the variance components in Table 5.1 in equation (5.14):

$$\hat{\phi} = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{n_r} + \dfrac{MS_R - MS_{PR}}{N_P n_r}} \qquad \text{(Equation 5.14)}$$
When comparing $E\hat{\rho}^2$ and $\hat{\phi}$, we can see that the numerators are identical, and the denominators share the term $\dfrac{MS_P - MS_{PR}}{N_R}$, reflecting variance associated with the persons, and the term $\dfrac{MS_{PR}}{n_r}$, providing information about the interaction of person and rater, or error. In addition, $\hat{\phi}$ includes in the denominator $\dfrac{MS_R - MS_{PR}}{N_P n_r}$, reflecting the variance associated with the raters. This additional term acknowledges the fact that with a criterion-referenced assessment the actual score given by the raters to each science project is important, and not merely the relative standing of the projects with respect to one another. Finally, we can see when comparing equations (5.12) and (5.14) that $\hat{\phi}$ will never be larger than $E\hat{\rho}^2$; the two are equal only when there is no rater variance. From an applied perspective, this fact means that there is more error associated with making absolute measurements than with making relative measurements. In the science project example, we would expect to have greater error associated with a decision regarding whether a student failed the assignment (absolute), compared to obtaining a ranking of students (relative) in terms of their performance.
Psychometrics in the Real World: Example 1
One-Facet Crossed Design
Now that we have covered the basic concepts underlying GT, let us see how we can use it in practice.
We note that we provide only a few examples here; more extensive study of GT can be undertaken if the reader is interested, and we refer the interested reader to Brennan (2001) for a comprehensive and technical treatment of the topic. We will start with the simplest application of GT, the one-facet crossed design, which corresponds to our science fair example. For the one-facet design, each individual is rated by the same set of judges or administered the same set of items. In turn, all judges rate all individuals, or all items are given to all examinees. The science project example is a classic one-facet crossed design, in that each of the four raters gives a score to each of the 100 science projects in the fair. Later, we will discuss alternatives to this simplest example. The data used in this example are provided in the eResources, along with computer examples for conducting these analyses. We do note that many software programs are available for conducting GT analyses, and that we only show a few examples. Alternatives include SAS and SPSS for obtaining variance components, as well as software specifically designed for GT, such as EDUG (Cardinet, Johnson, & Pini, 2011), mGENOVA, urGENOVA, and GENOVA (Brennan, 2001; Crick & Brennan, 1983). The Cardinet et al. text also provides a user-friendly introduction to GT and to the software for applying it.
First, we must conduct the G-study to obtain the variance components. These values appear in Table 5.2.
In addition to being used to estimate the reliability coefficients, the results in Table 5.2 also provide information regarding the relative sources of variability in ratings of the science fair projects.
For example, we can see that the largest source of variance is the raters (45.4%), followed by the interaction, or error (31.4%). A relatively small portion of variance comes from the individuals being rated (23%). We also, for clarity, show in the last column how the proportion is calculated, and that the total should add to 1.0 or 100%. These results suggest that the raters were relatively different from one another in terms of how they scored projects, and that there was a fair amount of error associated with the scores as well. This does not bode well for the reliability estimates associated with these ratings.
Next, we can obtain estimates of the relative and absolute errors associated with these measurements.

$$\hat{\sigma}_\delta^2 = \frac{MS_{PR}}{N_R} = \frac{0.234}{4} = 0.059$$

$$\hat{\sigma}_\Delta^2 = \frac{MS_{PR}}{N_R} + \frac{MS_R - MS_{PR}}{N_P n_r} = \frac{0.234}{4} + \frac{33.99 - 0.234}{100(4)} = 0.059 + 0.084 = 0.143$$
The $E\hat{\rho}^2$ and $\hat{\phi}$ values for this example are then calculated as follows.

$$E\hat{\rho}^2 = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{N_R}} = \frac{0.173}{0.173 + 0.059} = 0.746$$

$$\hat{\phi} = \frac{\dfrac{MS_P - MS_{PR}}{N_R}}{\dfrac{MS_P - MS_{PR}}{N_R} + \dfrac{MS_{PR}}{n_r} + \dfrac{MS_R - MS_{PR}}{N_P n_r}} = \frac{0.173}{0.173 + 0.059 + 0.084} = 0.547$$
Table 5.2 Mean Squares and Variance Component Estimates for Science Fair Projects

| Source of Variation | Mean Square | Variance Component | Proportion | How Proportion Is Calculated |
| --- | --- | --- | --- | --- |
| Person (P) | 0.925 | 0.173 | 0.232 | 0.173/0.745 |
| Rater (R) | 33.990 | 0.338 | 0.454 | 0.338/0.745 |
| Interaction (PR) | 0.234 | 0.234 | 0.314 | 0.234/0.745 |
| Total | | 0.745 | 1.0 (100%) | |
Thus, with four raters (as in the G-study), the reliability coefficient for making relative decisions using the science project scores with this sample is approximately 0.75. On the other hand, the reliability coefficient for making absolute decisions with this sample is approximately 0.55. Recall that absolute estimates will never be larger than relative estimates. Thus, if our primary goal is to rank the science projects relative to one another, having four raters provides us with reasonable consistency, at least for low consequence situations or for research purposes. However, if our goal is to make consistent decisions regarding the absolute level of performance represented by the science projects (e.g., the student receives a passing score), then the situation is not so good, with much lower reliability than for the relative decision making. This lower reliability appears to be largely a function of the relatively high proportion of variance in the scores due to the raters themselves. Again, this implies that the raters are scoring the projects quite differently from one another, thereby making it more difficult for us to get a good, consistent idea regarding the actual performance of any one project.
In other words, if the four raters provide very different scores to the same project, then it will not be easy for us to get a good sense for the actual level of performance represented by that project. Given this lack of consistency, our raters may need additional instructions or calibration before the next science fair. This example can also be projected to high consequence situations (e.g., essays scored for college admissions), in which high values of the absolute reliability index (i.e., > 0.90) would be required.
As we have discussed previously, in a D-study we use the variance components obtained through a G-study to get estimates of reliability for differing numbers of facets. In this example, we can vary the number of raters providing scores, in order to determine how many are necessary for us to achieve a pre-specified level of reliability (e.g., 0.8) based on the decision to be made. We can also use the D-study to determine at what point adding additional levels of a facet (additional raters) will not result in relatively large gains in reliability. Table 5.3 includes the relative and absolute errors, as well as the values for E ˆρ2 and φˆ for differing numbers of raters. Note that in the one-facet design, there is not another facet for which we can alter the number of units.
From these results, we can see that including more than four raters leads to increasingly diminished returns for $E\hat{\rho}^2$. The relative reliability index increases by 0.093 when we go from two to three raters, but only by 0.057 from three to four, and 0.041 from four to five. If we had a predetermined reliability goal of 0.8 for our ratings, then we would need six raters for $E\hat{\rho}^2$. With regard to reliability for absolute decision making, even having ten raters is unlikely to yield a $\hat{\phi}$ of 0.8. Indeed, just to get to 0.7, we will need a minimum of eight raters, based on these results. In summary, if we are primarily interested in the relative ranking of the science project scores, then we can have as few as four raters and be fairly certain of obtaining reliability of more than 0.7. However, if our primary interest is in the absolute scores assigned to the projects, then our reliability will be fairly low, unless we have a large number of raters (perhaps as many as eight or nine).
Table 5.3 D-Study Results for Relative Error, Absolute Error, $E\hat{\rho}^2$, and $\hat{\phi}$, by Number of Raters

| Raters | Relative Error | Absolute Error | $E\hat{\rho}^2$ | $\hat{\phi}$ |
| --- | --- | --- | --- | --- |
| 1 | 0.234 | 0.572 | 0.425 | 0.232 |
| 2 | 0.117 | 0.286 | 0.596 | 0.377 |
| 3 | 0.078 | 0.191 | 0.689 | 0.476 |
| 4 | 0.059 | 0.143 | 0.746 | 0.547 |
| 5 | 0.047 | 0.114 | 0.787 | 0.602 |
| 6 | 0.039 | 0.095 | 0.816 | 0.645 |
| 7 | 0.033 | 0.082 | 0.838 | 0.679 |
| 8 | 0.029 | 0.071 | 0.855 | 0.707 |
| 9 | 0.026 | 0.064 | 0.869 | 0.731 |
| 10 | 0.023 | 0.057 | 0.881 | 0.751 |
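Because the entries in Table 5.3 follow directly from the variance component estimates in Table 5.2, they can be reproduced with a short loop. The Python sketch below is our own illustration (not output from GENOVA, EDUG, or similar software); any small discrepancies with the table are due only to rounding.

```python
# Sketch: reproduce Table 5.3 from the Table 5.2 variance component estimates.

var_p, var_r, var_pr = 0.173, 0.338, 0.234  # person, rater, interaction/error

print("raters  rel_err  abs_err   E_rho2      phi")
for n_r in range(1, 11):
    rel_err = var_pr / n_r                 # relative error variance
    abs_err = (var_r + var_pr) / n_r       # absolute error adds the rater term
    e_rho2 = var_p / (var_p + rel_err)     # norm-referenced reliability
    phi = var_p / (var_p + abs_err)        # criterion-referenced reliability
    print(f"{n_r:6d}  {rel_err:7.3f}  {abs_err:7.3f}  {e_rho2:7.3f}  {phi:7.3f}")
```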
Psychometrics in the Real World: Example 2
Two-Facet Crossed Design
In many cases, we may be interested in situations where there exists more than one facet of interest.
For example, consider the situation in which each student produces two science projects during the school year, and each rater scores each of the projects at each occasion. This is an example of a two-facet crossed design, in which the facets are all random; i.e., the four raters are a sample taken from the universe of all possible raters, and the two science projects are taken from the universe of all possible projects that the students could have made. Table 5.4 includes the EMS and the corresponding mean squares for each source of variation.
The EMS for the facets can be used to construct the relative and absolute errors, as well as the reliability estimates. In the two-facet completely crossed case, the relative error is calculated as

$$\sigma_\delta^2 = \frac{\sigma_{PR}^2}{n_R} + \frac{\sigma_{PO}^2}{n_O} + \frac{\sigma_{PRO}^2}{n_R n_O} \qquad \text{(Equation 5.15)}$$
The variance components that are used in equation (5.15) are as defined above. The absolute error term that is used in calculating ϕ in the two-facet design is
$$\sigma_\Delta^2 = \frac{\sigma_R^2}{n_R} + \frac{\sigma_O^2}{n_O} + \frac{\sigma_{PR}^2}{n_R} + \frac{\sigma_{PO}^2}{n_O} + \frac{\sigma_{RO}^2}{n_R n_O} + \frac{\sigma_{PRO}^2}{n_R n_O} \qquad \text{(Equation 5.16)}$$
Table 5.4 Expected Mean Squares and Variance Component Estimates for the Two-Facet Crossed G-Study Design

| Source of Variation | EMS | Mean Square | Variance Component Estimate |
| --- | --- | --- | --- |
| Person (P) | $\sigma_{PRO}^2 + N_R N_O \sigma_P^2 + N_O \sigma_{PR}^2 + N_R \sigma_{PO}^2$ | $MS_P$ | $\hat{\sigma}_P^2$ |
| Rater (R) | $\sigma_{PRO}^2 + N_P N_O \sigma_R^2 + N_O \sigma_{PR}^2 + N_P \sigma_{RO}^2$ | $MS_R$ | $\hat{\sigma}_R^2$ |
| Occasion (O) | $\sigma_{PRO}^2 + N_P N_R \sigma_O^2 + N_R \sigma_{PO}^2 + N_P \sigma_{RO}^2$ | $MS_O$ | $\hat{\sigma}_O^2$ |
| P × R | $\sigma_{PRO}^2 + N_O \sigma_{PR}^2$ | $MS_{PR}$ | $\hat{\sigma}_{PR}^2$ |
| P × O | $\sigma_{PRO}^2 + N_R \sigma_{PO}^2$ | $MS_{PO}$ | $\hat{\sigma}_{PO}^2$ |
| R × O | $\sigma_{PRO}^2 + N_P \sigma_{RO}^2$ | $MS_{RO}$ | $\hat{\sigma}_{RO}^2$ |
| PRO | $\sigma_{PRO}^2$ | $MS_{PRO}$ | $\hat{\sigma}_{PRO}^2$ |
The reliability estimates for relative and absolute decisions are calculated using the results in equa- tions (5.15) and (5.16) in much the same way that they were for the simpler one-facet design.
$$E\rho^2 = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\delta^2} \qquad \text{(Equation 5.17)}$$

$$\phi = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\Delta^2} \qquad \text{(Equation 5.18)}$$
The estimates of the quantities in equations (5.17) and (5.18) can be obtained using the estimated variance components in Table 5.4.
Let's take our current example and apply these equations in order to obtain the variance component values and resulting reliability estimates. The variance component estimates appear in Table 5.5.
The results once again point to fairly large differences in scores given by the individual raters, with the rater term remaining the single greatest source of variation in the scores. The person being rated accounted for the second largest proportion of variance in the scores, and the occasion at which the scores were given was associated with very little variation. This last result suggests that raters gave similar scores to the two science projects produced by the same individual. The only interactions that accounted for more than 10% of the variance in scores were person by rater (14.9%) and person by rater by occasion (20.9%).
The relative and absolute errors associated with the number of raters and number of occasions for the D-study appear in Table 5.6.
From these results, we can see that the lowest error variances are associated with ten raters and four occasions, neither of which may be feasible in actual practice. Table 5.7 contains the reliability estimates for relative and absolute decisions, by number of raters and number of occasions.
If we are planning to use the ratings to compare the students’ science project performance with one another in a norm-referenced context, and our goal is to achieve a reliability of at least 0.8, then the results of the D-study would suggest that we need a minimum of four raters and three occasions.
On the other hand, if we would be satisfied with a reliability of 0.7 or higher, then we could either have two raters and four measurement occasions, or three raters and two measurement occasions.
The question for us then would be, which design is more feasible for use in actual practice? Can we have three teachers rate each of two science projects over the course of the school year, for each
Table 5.5 Mean Squares and Variance Component Estimates for Science Fair Projects Measured at Different Occasions

| Source of Variation | Mean Square | Variance Component | Proportion |
| --- | --- | --- | --- |
| Person (P) | 1.901 | 0.183 | 0.258 |
| Rater (R) | 48.308 | 0.227 | 0.321 |
| Occasion (O) | 2.101 | 0.000 | 0.000 |
| PR | 0.358 | 0.105 | 0.149 |
| PO | 0.228 | 0.020 | 0.028 |
| RO | 2.631 | 0.025 | 0.035 |
| PRO | 0.148 | 0.148 | 0.209 |
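To close this example, the two-facet D-study quantities defined in equations (5.15) through (5.18) can be computed directly from the variance components in Table 5.5. The Python sketch below is our own illustration; the specific rater and occasion combinations shown are arbitrary, and the resulting values should agree, up to rounding, with the entries in Tables 5.6 and 5.7.

```python
# Sketch: two-facet (p x r x o) D-study calculations based on equations
# (5.15)-(5.18), using the variance component estimates from Table 5.5.

VC = {  # variance components from Table 5.5
    "p": 0.183, "r": 0.227, "o": 0.000,
    "pr": 0.105, "po": 0.020, "ro": 0.025, "pro": 0.148,
}

def relative_error(vc, n_r, n_o):
    # Equation (5.15): only the interactions involving persons contribute.
    return vc["pr"] / n_r + vc["po"] / n_o + vc["pro"] / (n_r * n_o)

def absolute_error(vc, n_r, n_o):
    # Equation (5.16): adds the rater, occasion, and rater-by-occasion terms.
    return (vc["r"] / n_r + vc["o"] / n_o + vc["pr"] / n_r + vc["po"] / n_o
            + vc["ro"] / (n_r * n_o) + vc["pro"] / (n_r * n_o))

def g_and_phi(vc, n_r, n_o):
    # Equations (5.17) and (5.18).
    g = vc["p"] / (vc["p"] + relative_error(vc, n_r, n_o))
    phi = vc["p"] / (vc["p"] + absolute_error(vc, n_r, n_o))
    return g, phi

if __name__ == "__main__":
    for n_r in (2, 3, 4):
        for n_o in (1, 2, 3, 4):
            g, phi = g_and_phi(VC, n_r, n_o)
            print(f"raters={n_r} occasions={n_o}  E_rho2={g:.3f}  phi={phi:.3f}")
```

Running the loop confirms, for instance, that four raters and three occasions are needed before the relative coefficient reaches 0.8, consistent with the discussion above.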