Observed, True, and Error Scores
The true-score model has been used to address issues related to measurement error.
During the past century, this model has evolved to explain the concept of reliability and is called classical test theory (CTT). CTT is based on the idea that there is always a true score associated with an individual’s response on an instrument. The true score is the person’s real level of knowledge, strength of opinion or attitude, or measure of skill or behavior as obtained under ideal conditions. Because no measure is perfect,
the actual or observed scores on an instrument are assumed to include some error. The relationships among true, error, and observed scores are expressed in the following equation.

FIGURE 9.2. HISTOGRAM OF JOSH'S SCORES ON 100 REPEATED ATTEMPTS ON A HEALTH BEHAVIOR THEORY TEST (x-axis: scores on a health behavior theory test, high consistency; y-axis: frequency).
Equation 9.1
Observed score = True score + Error score,
where the observed score (O) reflects the actual score a respondent obtains on an instrument, the true score (T) reflects the actual level or degree of the attribute possessed by the respondent, and the error score (E) reflects the amount of error present in the observed score at the time of measurement.
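For example, if a respondent's true score on a quiz is 3 and the observed score on a particular attempt is 4 (as in the first of Josh's attempts in Table 9.1, discussed below), rearranging Equation 9.1 gives an error score of E = O – T = 4 – 3 = +1; an observed score of 2 on another attempt would give an error score of 2 – 3 = –1.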
Because the true and error scores are latent (that is, unobservable), a set of assumptions is necessary in order to use CTT to understand reliability. The first assumption addresses the mean of the error scores. Because error is random, sometimes inflating a score and sometimes decreasing a score, it is assumed that random errors will cancel out over multiple tests so that the sum and mean of the error scores are both 0. A second assumption is that the true and error scores are not correlated, and a third assumption is that the error scores across parallel tests are not correlated.

Definition of Error Score

In the context of CTT, error is not the same as the wrong answer on a test. If one knows the correct answer but selects the wrong answer, a CTT error has occurred. Likewise, if one does not know the correct answer but selects the correct one, a CTT error has occurred. A CTT error occurs when the answer given on a test (the opinion, belief, or feeling that is selected) is different from what one would have selected under ideal test conditions.

Definition of Parallel Tests

Parallel tests are tests that have the same mean, variance, and correlations with other tests. They also contain items from the same content domain and are theoretically identical, but they have different items. The Scholastic Aptitude Test (SAT) is a good example of a test with parallel forms. SAT tests are often administered in large rooms. Students sitting next to each other are given different forms of the SAT (Form A or Form B). Although the items on the two forms are different, it is assumed that the content assessed and the characteristics of the items are the same for the two forms.
Assumptions and Characteristics
The following example will help clarify the first assumption, that the sum and mean of error scores are equal to 0. Table 9.1 displays a set of scores for Josh, who takes the same research methods quiz ten times. Assume that each of his attempts is independent of all the others. In addition to the actual scores (observed scores) on the quiz, Table 9.1 presents Josh's true score on the quiz, or the score he would have obtained without errors. In the final column is the error score for each attempt on the quiz, or O – T. There are a few important points to be made about the information in Table 9.1. First, note that when the error scores are summed, they cancel each other out, and the total error score for this set of scores is 0. Thus, sometimes Josh made errors that resulted in an observed score that was higher than his true score, and at other times, he made errors that resulted in an observed score that was lower than his true score. Because errors are assumed to be random, we expect that the number of positive errors will be equal to the number of negative errors and that these will cancel each other out over multiple tests. Thus, both the total error score and the mean error score will equal 0. Second, note that although the mean error score is 0, there is variability, or variance, associated with the error scores.
TABLE 9.1. JOSH'S OBSERVED, TRUE, AND ERROR SCORES FOR TEN ATTEMPTS ON A RESEARCH METHODS QUIZ.

Quiz Attempt          Observed Scores   True Scores   Error Scores (O – T)
1                     4                 3             1
2                     2                 3             –1
3                     5                 3             2
4                     5                 3             2
5                     1                 3             –2
6                     3                 3             0
7                     3                 3             0
8                     2                 3             –1
9                     2                 3             –1
10                    3                 3             0
Mean                  3                 3             0
Variance              1.78              0             1.78
Standard Deviation    1.33              0             1.33

Table 9.1 also shows that the observed-score mean is equal to the true score. If we assume that the mean error score over multiple tests is 0, then the mean of the observed scores will always be equal to the true score. Thus, the best estimate of the true score is the mean of the observed scores. Note also that the variance of the observed scores is equal to the variance of the true scores plus the variance of the error scores. In this example, the variance of the true score is 0; thus, the variance in the observed scores is due completely to error variance. Note also that the standard deviation of the error scores is 1.33. Another name for the standard deviation of error scores is the standard error of measurement.
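Readers who wish to verify these values can do so with a short computation. The following Python sketch (offered only as an illustration; the variable names are arbitrary) reproduces the statistics reported in Table 9.1, using the sample (n – 1) formula for the variance, which yields the 1.78 shown in the table.

import statistics

# Josh's ten observed quiz scores from Table 9.1; his true score on every attempt is 3
observed = [4, 2, 5, 5, 1, 3, 3, 2, 2, 3]
true_score = 3
errors = [o - true_score for o in observed]        # E = O - T for each attempt

print(statistics.mean(observed))                   # 3: the mean observed score equals the true score
print(sum(errors), statistics.mean(errors))        # 0 0: the error scores cancel out
print(round(statistics.variance(observed), 2))     # 1.78: variance of the observed scores
print(round(statistics.variance(errors), 2))       # 1.78: all of it is error variance in this example
print(round(statistics.stdev(errors), 2))          # 1.33: the standard error of measurement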
When we think of reliability, we think of consistency of scores on repeated use of an instrument for groups of people. Thus, we need to shift our focus from the scores of one person taking a test numerous times to the use of an instrument to measure the knowledge, attitudes, opinions, or behaviors of many people. The assumptions of CTT are the same for a set of scores from different respondents as they are for a set of scores from one respondent: (1) the mean error score for a sample is expected to be 0, (2) true scores are not correlated with error scores, indicating that people who have different ability levels or have different opinions or attitudes are equally likely to make random errors, and (3) the error scores on parallel tests are not correlated with each other.
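In symbols, with E(·) denoting an expected value (a long-run average) and ρ denoting a correlation, the three assumptions can be summarized as follows:

E(Error) = 0 (random errors average to 0 across respondents or repeated measurements),
ρ(True, Error) = 0 (true scores are uncorrelated with error scores), and
ρ(Error 1, Error 2) = 0 for any pair of parallel tests (errors on parallel tests are uncorrelated).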
Table 9.2 presents the true, observed, and error scores on three different research methods tests taken by ten students. The first column of Table 9.2 shows the identification number (ID) for each student; the second column shows the students' true scores (a theoretical value whose exact value is never really known); the next three columns show each student's observed scores for the three tests (tests 1 through 3); and the final three columns show error scores for each of the three tests (errors 1 through 3). The error scores were obtained by subtracting the true score from each corresponding observed score. Notice that the mean of each set of observed scores is approximately equal to the mean of the true scores, and that the mean for each set of error scores is close to 0, supporting assumption 1. We present a small sample here; however, as the number of respondents increases, the mean of the observed scores will approach the mean of the true scores, and the mean of the error scores will approach 0.

TABLE 9.2. DESCRIPTIVE STATISTICS FOR TRUE SCORES, PARALLEL TEST SCORES, AND ERROR SCORES FOR A SAMPLE OF STUDENTS.

ID     True   Test 1   Test 2   Test 3   Error 1   Error 2   Error 3
1      95     100      90       93       +5        –5        –2
2      92     94       100      96       +2        +8        +4
3      90     88       98       86       –2        +8        –4
4      88     78       84       92       –10       –4        +4
5      87     90       84       92       +3        –3        +5
6      82     80       84       75       –2        +2        –7
7      76     82       70       74       +6        –6        –2
8      70     74       72       68       +4        +2        –2
9      65     62       68       69       –3        +3        +4
10     60     60       62       58       0         +2        –2
Mean   80.5   80.8     81.2     80.3     .3        .7        –.2
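As an illustration, the short Python sketch below (the list and variable names are arbitrary) reproduces the column means shown in Table 9.2; with a larger sample, the error means would move even closer to 0.

import statistics

# True scores and scores on the three parallel tests for the ten students in Table 9.2
true = [95, 92, 90, 88, 87, 82, 76, 70, 65, 60]
tests = {
    "Test 1": [100, 94, 88, 78, 90, 80, 82, 74, 62, 60],
    "Test 2": [90, 100, 98, 84, 84, 84, 70, 72, 68, 62],
    "Test 3": [93, 96, 86, 92, 92, 75, 74, 68, 69, 58],
}

print("True mean:", statistics.mean(true))              # 80.5
for name, scores in tests.items():
    errors = [o - t for o, t in zip(scores, true)]      # error score = observed - true
    print(name, statistics.mean(scores),                # 80.8, 81.2, 80.3
          "error mean:", statistics.mean(errors))       # 0.3, 0.7, -0.2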
Table 9.3 shows a correlation matrix for the correlations among the true, observed, and error scores presented in Table 9.2. These correlations help clarify the second and third assumptions of CTT: that true scores are not correlated with error scores and that errors are not correlated across parallel tests. First, notice the relationships between the true and error scores (1). The correlations, which range from –.059 to .089, are low and for all practical purposes can be considered negligible. This finding suggests that there is no association between students' knowledge of the content of the test and random errors, supporting assumption 2, that there is no relationship between true and error scores. Thus, students who know the content well make about the same number of random errors as those who are not as well informed. Second, examine the correlations among the error scores of the three parallel tests (2). These correlations range from –.205 to –.094, and although the correlations are not 0, as would be expected according to theory, they are close enough to 0 to be considered negligible. This finding indicates that errors from one test are not associated with errors on another test (assumption 3). In other words, the same respondents are not making the same errors on each test. One person might have had a headache and missed several items on the first test, but felt fine for the second test and answered those items correctly. Another person might have done well on the first test but made mistakes on the second test due to pressure to complete the exam quickly because of a pending appointment.

TABLE 9.3. CORRELATIONS AMONG TRUE SCORES, PARALLEL TEST SCORES, AND ERROR SCORES FOR A SAMPLE OF STUDENTS.

           True    Test 1   Test 2   Test 3   Error 1   Error 2   Error 3
True       1
Test 1     .930    1
Test 2     .920    .831     1
Test 3     .950    .860     .863     1
Error 1    –.009   .360     –.075    –.072    1
Error 2    –.059   –.118    .338     –.084    –.170     1
Error 3    .089    .007     .047     .396     –.205     –.094     1

Note: (1) correlations between true and error scores; (2) correlations among error scores; (3) correlations between true and observed scores; (4) correlations among observed scores.
In addition to demonstrating assumptions 2 and 3, Table 9.3 presents correlations showing the relationship between true and observed scores (3) and among the observed scores (4). The correlations between the true and observed scores range from .920 to .950, indicating a fairly high association between the true and observed scores. This means that respondents who really know the content well are also obtaining higher scores on the test, and those who really do not know the content are answering fewer items correctly. The correlations among the observed scores range from .831 to .863.
Although these correlations are not as high as those between true and observed scores, the relationships are strong, indicating that students who obtain high scores on one test also tend to score high on the others. We will return to the discussion of the correlations between observed scores in the next section.
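The entries in Table 9.3 can likewise be reproduced from the data in Table 9.2. The Python sketch below (requiring Python 3.10 or later for statistics.correlation; the variable names are arbitrary) computes every pairwise Pearson correlation among the true, observed, and error scores; for example, it returns .930 for True with Test 1 and –.170 for Error 1 with Error 2.

import statistics
from itertools import combinations

true = [95, 92, 90, 88, 87, 82, 76, 70, 65, 60]
tests = {
    "Test 1": [100, 94, 88, 78, 90, 80, 82, 74, 62, 60],
    "Test 2": [90, 100, 98, 84, 84, 84, 70, 72, 68, 62],
    "Test 3": [93, 96, 86, 92, 92, 75, 74, 68, 69, 58],
}
errors = {"Error " + name[-1]: [o - t for o, t in zip(scores, true)]
          for name, scores in tests.items()}

variables = {"True": true, **tests, **errors}

# Pairwise Pearson correlations among true, observed, and error scores (compare with Table 9.3)
for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
    print(f"{name_a} x {name_b}: {statistics.correlation(a, b):.3f}")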