
Evaluating the Quality of Educational Measures: Reliability and Validity


Reliability and validity are the two criteria used to judge the quality of all preestablished quantitative measures. Reliability refers to the consistency of scores, that is, an instrument's ability to produce approximately the same score for an individual over repeated testing or across different raters. For example, if we were to measure your intelligence (IQ) and you scored a 120, a reliable IQ test would produce a similar score if we measured your IQ again with the same test. Validity, on the other hand, focuses on ensuring that what the instrument "claims" to measure is truly what it is measuring. In other words, validity indicates the instrument's accuracy. Take, for example, a fourth-grade state math assessment that requires students to read extensive word problems in answering math questions. Some math teachers might argue that the assessment is not valid because students could know all the mathematical functions needed to answer the problems correctly but might be unable to answer the questions correctly because of their low reading ability. In other words, the assessment is really more of an assessment of students' reading skills than it is of math ability.

Validity and reliability are typically established by a team of experts as part of the process of developing a preestablished instrument. As part of the development process, the measurement team would work to establish sound reliability and validity for the instrument before it is packaged and sold to researchers and practitioners (see Figure 4.7).

If an instrument does not have sound reliability and validity, the instrument is of no value. Therefore, it is important that beginning researchers and educators in general have at least a basic understanding of issues surrounding reliability and validity to be able to select the most appropriate and accurate instruments to serve as measurement tools for their study.

FIGURE 4.7 OVERVIEW OF THE PROCESS IN DEVELOPING PREESTABLISHED INSTRUMENTS. (Flowchart: the measurement team develops the instrument for a company and establishes its reliability and validity; the instrument is then packaged and disseminated through a test publisher and distributors; a researcher or practitioner finds the instrument through the MMY or the test publisher's Web site, purchases it, and uses it to collect data. Note: MMY = Mental Measurement Yearbook.)

Reliability

Notice that in our initial explanation of reliability, we referred to consistency as meaning obtaining approximately the same score for an individual over repeated testing. Even the most reliable test would not produce the exact same score if given twice to the same individual. This difference is referred to as the measurement error. Error in measurement is due to temporary factors that affect a person’s score on any given instrument. Factors that influence reliability include

• Test takers’ personal characteristics (e.g., motivation, health, mood, fatigue)

• Variations in test setting (e.g., differences in the physical characteristics of the room)

• Variations in the administration and scoring of the test

• Variation in participant responses due to guessing (caused by poorly written items)

Because there is always some error associated with measurement, obtained scores (the actual score the individual received) are made up of two components: the true score plus some error. If a test were perfectly reliable, the true score would be equal to the observed score. Students and researchers alike understand that their score on a given test may not represent their true score (their real knowledge or abilities). In essence, this means that unless an instrument is perfectly reliable, you can never know a person's true score. However, the simple calculation of the standard error of measurement (SEM) allows one to estimate what the true score is likely to be. The SEM takes into account both the reliability coefficient of the test and the variability of the scores of the norm group, as calculated by the standard deviation. The formula for SEM is as follows:

SEM = SD √(1 − r)

where SD is the standard deviation of the test scores obtained by the norm group, and r is the reliability coefficient of the test calculated from a pilot group.

For example, consider the following situation. Test A has a reliability coefficient of 0.91. This suggests that the test is highly reliable, and therefore, a person's observed score is a relatively good estimate of his or her true score. The scores for test A produce a standard deviation of 9. Let us say that a student's observed score is 70. What would the SEM be for this example? Using our formula, it would be calculated as follows:




SEM = 9 √(1 − 0.91) = 9 √0.09 = 9 × 0.3 = 2.7

This means that the student who received a score of 70 might score 2.7 points higher (72.7) or lower (67.3) if she or he retook the test. All preestablished tests should report the SEM. Knowing the SEM is important when a measure is used to make decisions about students (e.g., placement or admission to programs). Decisions about students scoring near a cutoff point should be made cautiously because retesting might result in the student scoring differently with respect to the cutoff. Parents or teachers might consider asking that the student be retested or that additional information be considered.
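For readers who want the arithmetic in a reusable form, the short Python sketch below restates the formula above and computes the band around an observed score. The function name and the one-SEM band are illustrative choices, not part of any particular test manual.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where SD is the norm group's standard
    deviation and r is the test's reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

# Values from the worked example above: SD = 9, r = 0.91, observed score = 70.
sem = standard_error_of_measurement(sd=9, reliability=0.91)
observed = 70
print(f"SEM = {sem:.1f}")                                           # 2.7
print(f"Score band: {observed - sem:.1f} to {observed + sem:.1f}")  # 67.3 to 72.7
```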

Essentially, then, reliability is the extent to which the scores produced by a single test correlate. Reliability is a concept that is expressed in terms of correlations that are summarized in a reliability coefficient. The reliability coefficient can assume values from zero to +1.00. The closer the reliability coefficient is to +1.00, the more reliable the instrument. Presented below are several types of reliability. Keep in mind that not all of these types have to be conducted on a single instrument. Typically, at least one is used to establish reliability during an instrument's design and development.

Stability or Test-Retest Reliability (Consistency Across Time). One form of reliability is called stability or test-retest. The purpose of stability is to show that an instrument can obtain the same score for an individual if that person takes the instrument more than once. What is the importance of this, you might be asking yourself? Well, imagine an intelligence test that gives a different score for an individual each time it is administered. The first time a student takes it, he receives a score of 140, indicating that he is a genius. However, the second time the instrument is administered, the student receives a score of 104, which equates to about average intelligence. Which one is the correct score? How would a school psychologist or a researcher who is using this instrument interpret this information? The answer is that they couldn't, because the test is unreliable.

Stability of an instrument is also critical when one is trying to infer causality or that a change occurred in a study. Imagine a math assessment that was so unstable that a student received a score of 70 out of a possible 100 the first time the test was taken and a score of 90 out of a possible 100 the second time, even when no type of intervention or independent variable took place between the times of testing! Knowing the instrument's instability, how could you use the measure in a study to determine if changes in student achievement in mathematics were the result of a new math program you planned to implement? You would never really know if the change occurred because of the intervention, the unstable instrument, or both. Therefore, a researcher or a practitioner would want to avoid using an instrument that had poor test-retest reliability.

To establish stability, first a pilot sample of persons is selected. Pilot samples in instrument development are usually large, which allows for the greatest degree of generalizability. Once the pilot sample has been established, participants are given the instrument, and scores are obtained for every individual. Next, some time is allowed to pass. The amount of time varies based on the type of instrument that is being developed. The most important thing is that enough time passes so that participants are unable to recall items and perform better on the instrument because of this experience. An average wait time of four to six weeks between instrument administrations is usual. Following the second administration of the same test, the two scores are compared for each individual. Correlations are then calculated on the two sets of scores, which produces a correlation coefficient.
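A minimal sketch of that final step, assuming Python with NumPy and fabricated scores for a small pilot sample, is shown below; the numbers are invented purely to illustrate how the two administrations would be correlated.

```python
import numpy as np

# Hypothetical scores for eight pilot participants on two administrations
# of the same instrument, given several weeks apart.
first_administration = np.array([70, 85, 64, 90, 77, 58, 81, 73])
second_administration = np.array([72, 83, 66, 88, 75, 61, 80, 70])

# The test-retest reliability coefficient is the Pearson correlation
# between the two sets of scores.
r_test_retest = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability coefficient: {r_test_retest:.2f}")
```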


Depending on the instrument's purpose, correlation coefficients of 0.84 or higher would be considered very strong; however, correlations from 0.35 to 0.64 are more typical and are considered moderately strong and certainly acceptable. Coefficients of 0.24 to 0.34 are considered to show only a slight relationship and may be acceptable for instruments that are considered exploratory.

Equivalent-Form Reliability (Consistency Across Different Forms). Another type of reliability the measurement team might decide to establish for the instrument is equivalent-form reliability. This type of reliability is also referred to as alternative-form reliability. For this type of reliability, two forms of the same test are given to participants, and the scores for each are later correlated to show consistency across the two forms. Although they contain different questions, the two forms assess the exact same content and knowledge, and the norm groups have the same means and standard deviations.

Internal Consistency Reliability (Consistency Within the Instrument). Internal consistency refers to consistency within the instrument. The question addressed is whether the measure is consistently measuring the same trait or ability across all items on the test. The most common method to assess internal consistency is through split-half reliability. This is a form of reliability used when two forms of the same test are not available or it would be difficult to administer the same test twice to the sample group. In these cases, a split-half reliability would be conducted on the instrument. To assess split-half reliability, the instrument is given to a pilot sample, and following its administration, the items on the instrument are split in half! Yes, literally split in half: often, all the even- and all the odd-numbered items are grouped together, and a score for each half of the test is calculated. This produces two scores for every individual from the one test: one score for the even-numbered items and one score for the odd-numbered items. The Spearman-Brown prophecy formula is then applied to the data to correct for the fact that the correlations for internal consistency are being calculated on the two halves of the instrument and not on the whole original instrument. For the split-half approach to be appropriate, the instrument must be long enough to split in half and should measure a single trait or knowledge domain. In determining the acceptability of the coefficients, the research team would use the ranges discussed earlier for test-retest reliability.
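To make the split-half procedure concrete, here is a minimal Python sketch using a fabricated item-response matrix. The Spearman-Brown correction applied at the end is the standard two-halves form of the formula, r = 2r_half / (1 + r_half); everything else (sample size, item count, scoring) is an illustrative assumption.

```python
import numpy as np

# Fabricated item-level data: 30 pilot participants answering 20 items
# scored 0 (incorrect) or 1 (correct). A latent "ability" is built in so
# the two halves correlate positively, as they would on a real test.
rng = np.random.default_rng(0)
ability = rng.normal(size=(30, 1))
difficulty = rng.normal(size=(1, 20))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
items = (rng.random((30, 20)) < p_correct).astype(int)

# Split the instrument in half: odd-numbered items (1st, 3rd, ...) versus
# even-numbered items (2nd, 4th, ...), giving two scores per person.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown prophecy formula corrects for the fact that each half is
# only half as long as the full instrument: r_sb = 2r / (1 + r).
r_split_half = (2 * r_half) / (1 + r_half)
print(f"Half-test correlation: {r_half:.2f}")
print(f"Spearman-Brown corrected split-half reliability: {r_split_half:.2f}")
```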

Another approach used to establish internal consistency of a measure is to examine the correlations between each item and the overall score on the instrument. The question addressed here is whether a given score consistently measures the same amount of knowledge on the test. For example, consider scores on a 50-item history test taken by high school students. If all 50 items are worth one point, students might get 40 out of 50 points by missing the first 10 items, the last 10 items, or some other mixture of 10 items. Can you say that all of these scores represent the same amount of knowledge? This type of internal consistency is assessed using either the Kuder-Richardson or Cronbach's coefficient alpha formula and is often referred to simply as internal consistency.
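Coefficient alpha itself is a short computation over the persons-by-items score matrix. The sketch below, again using fabricated item data, shows one way it could be calculated; for dichotomously scored items like these, alpha is equivalent to the Kuder-Richardson (KR-20) formula.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's coefficient alpha for a persons-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_score_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)

# Fabricated 0/1 responses for 25 students on a 10-item quiz.
rng = np.random.default_rng(1)
ability = rng.normal(size=(25, 1))
items = (rng.random((25, 10)) < 1 / (1 + np.exp(-ability))).astype(int)
print(f"Internal consistency (coefficient alpha): {cronbach_alpha(items):.2f}")
```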

Validity

Along with the need to establish reliability for an instrument, the measurement team also has to address the issue of validity by constantly asking themselves the question: Does the instrument measure what it is designed to measure, or is it measuring something entirely different? To add to the confusion, keep in mind that an instrument can be extremely reliable but not valid in that it consistently obtains the same score for an individual but is not measuring what it intends to measure. For example, an educational research professor gives a midterm exam that she believes tests the content of the first half of her course. However, she inadvertently includes items from inferential statistics, which is covered in the second half of the course. This test would lack validity but would likely be reliable. That is, student performance would likely be fairly consistent over time. However, if you have a valid test that measures what it was intended to measure, the instrument will also have reliability. So when constructing a test or using a standardized instrument, validity is the single most important characteristic.

As with establishing reliability, validity is also established during the piloting of an instrument. In most cases, it is done simultaneously with the procedures to establish reliability. Validity is usually established through an in-depth review of the instrument, including an examination of the instrument's items to be certain that they are accurately measuring the content or objectives being tested, and by relating scores on the instrument to other measures.

Content Validity. Content validity comprises two types of validity: sampling validity and item validity. Both sampling validity and item validity involve having experts examine the items that make up the instrument. However, sampling validity is more focused on the breadth of those items, whereas item validity is interested in the depth of the items. Let us say, for example, that the measurement team was working on a fifth-grade math test for the purpose of assessing knowledge of the content at the end of the year. To establish sampling validity, the measurement team might select a sample of fifth-grade teachers who would review the instrument to ensure that the items on the test are representative of the content taught over the entire year of fifth-grade math. This would include items from the first part of the school year, from the middle of the year, and from the end of the school year.


Item validity is concerned with examining each individual item to determine whether it measures the content area taught. In other words, do the test items measure the content or objectives for the course or unit being taught? Think about the many times you have taken a test and wondered, "Where did this question come from? I did not read it in the text. I don't remember discussing it in class."

Well, it may be that you just forgot, but it could be that the test lacked content validity. Maybe the item just was not covered. To establish content validity, the measurement team would have an expert panel, usually teachers or experts in the field, examine the objectives and the items on the test to determine the degree of content validity.

Criterion-Related Validity. Essentially, criterion-related validity involves the examination of a test and its relationship with a second measure. It reflects the degree to which scores on two different measures are correlated. There are two forms of criterion-related validity. Concurrent validity examines the degree to which one test correlates with another taken in the same time frame. When might this be done? Let us say that you have developed a new instrument that is quicker and easier to use for measuring intelligence. To establish its validity, you might administer your new test along with an existing, already validated intelligence test to a pilot group. You would then correlate the two sets of scores from your pilot group, and if you found a high correlation between the two tests, you could then offer your "new fast and easy intelligence test" as a replacement for the existing intelligence test, saying that it has evidence of concurrent validity.

Another type of criterion-related validity is predictive validity. Predictive validity uses the score on a test to predict performance in a future situation. In this situation, a test is administered and scored. Let us use the American College Test (ACT) that is taken by many high school juniors and seniors as an example.

The ACT (the predictor) is used by colleges and universities around the country as a predictor of college success (the criterion). To determine its predictive validity, one would administer the ACT to a group of high school students and then allow some time to pass before measuring the criterion (college success, in this case defined as one's freshman grade point average). Once the first-year grade point averages are released, they can be correlated with the ACT scores to determine if it is a successful predictor. If the correlation is high between the ACT and freshman grade point average (and there are some studies that suggest that this is the case), then the ACT is considered to have criterion-related validity in predicting college success.
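As a minimal illustration of that final correlational step (with invented ACT scores and grade point averages, not real admissions data), the predictor and criterion would simply be correlated once the criterion becomes available:

```python
import numpy as np
from scipy.stats import pearsonr

# Fabricated data: ACT composite scores collected in high school (the
# predictor) and the same students' freshman grade point averages a year
# or more later (the criterion).
act_scores = np.array([21, 28, 33, 18, 25, 30, 23, 27, 35, 20])
freshman_gpa = np.array([2.6, 3.2, 3.8, 2.1, 2.9, 3.5, 2.8, 3.1, 3.9, 2.4])

r, p_value = pearsonr(act_scores, freshman_gpa)
print(f"Predictive validity coefficient: r = {r:.2f} (p = {p_value:.3f})")
```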

Construct Validity. Construct validity involves a search for evidence that an instrument is accurately measuring an abstract trait or ability. Remember that constructs are traits that are derived from variables and are nonobservable. They include complex traits such as integrity, intelligence, self-esteem, and creativity. In constructing an instrument to measure one of these complex constructs, one must first have well-developed ideas or a comprehensive theory about what the construct is and how people who vary in that construct would differ in their behaviors, abilities, feelings, or attitudes. Consider the construct of self-esteem. The Pictorial Scale of Perceived Competence for Children and Adolescents (Harter & Pike, 1984) grew out of its authors' research on the development of self-esteem. They conceptualized that self-esteem for children comprises four components: cognitive competence, peer acceptance, physical competence, and maternal acceptance. They believed that children have an overall sense of self-esteem that is composed of these separate components. Questions on the instrument reflect this overall theory of what self-esteem is. Examining the construct validity of this instrument means examining whether it really captures the full meaning of the self-esteem construct.

Construct validity is one of the most complex types of validity, in part because it is a composite of multiple validity approaches that are occurring simultaneously. This might include aspects of content, concurrent, and predictive validity. Therefore, many researchers consider construct validity to be a superordinate or overarching type of validity. Fully establishing construct validity also involves a lengthy process of collecting evidence from many studies over time. Below are some of the questions and procedures that might be involved in establishing construct validity.

• Does the measure clearly define the meaning of the construct? Evidence cited as support might include constructing a measure from a well-developed theory or examining whether its questions and parts correspond to research or theories related to the construct.

• Can the measure be analyzed into component parts or processes that are appropriate to the construct? In a test of creativity, the mental processes required by the tasks presented might be analyzed to see if they truly require creative thinking and not memory or reasoning abilities. Factor analysis is a statistical procedure used to see if clusters of questions identified in the statistical analysis match the groupings of questions identified by the test maker as the different parts of the test (a minimal code sketch of this check follows this list). So, for Harter and Pike's scale, one would expect the items on Physical Competence to be clustered together by the factor analysis.

• Is the instrument related to other measures of similar constructs, and is it not related to instruments measuring things that are different? Researchers want to be certain that the instrument actually measures the construct and not something else. For example, let us consider the construct of self-esteem. To validate their measure of self-esteem, Harter and Pike would correlate the self-esteem test with measures of similar constructs as identified in the literature. For example, if the literature says that self-esteem is related to body image and self-awareness, measures of these constructs should correlate with the self-esteem instrument. These correlations would produce confirming or convergent evidence. However, the researchers would also want to make sure that the measure did not correlate with measures of constructs that differ from self-esteem. So scores on the self-esteem measure might be correlated with scores on measures thought to be unrelated or only weakly related to self-esteem (e.g., attentiveness, emotionality, or physical activity level) to establish disconfirming evidence (constructs with which there was little or no correlation).

• Can the measure discriminate between (or identify separately) groups that are known to differ? Researchers will sometimes see if the measure yields different scores for two groups who are expected to differ in the construct. Harter and Pike (1984) demonstrated that the self-esteem measure did distinguish between children who had been held back in school and those who were not held back.
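The factor-analytic check mentioned in the list above could be sketched roughly as follows. This is a minimal illustration assuming Python with scikit-learn and a fabricated response matrix built to mimic four intended subscales; actual construct-validation work would use the real pilot data and typically more specialized exploratory or confirmatory factor-analysis software.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Fabricated responses: 200 children answering 16 items written to form
# four subscales (e.g., cognitive competence, peer acceptance, physical
# competence, maternal acceptance), with four items per subscale.
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 4))            # four underlying factors
true_loadings = np.zeros((16, 4))
for factor in range(4):                       # each item loads on one factor
    true_loadings[factor * 4:(factor + 1) * 4, factor] = 0.8
responses = latent @ true_loadings.T + rng.normal(scale=0.5, size=(200, 16))

# Do the items cluster into the groupings the test maker intended?
fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(responses)
estimated_loadings = fa.components_.T         # rows = items, columns = factors
print(np.round(estimated_loadings, 2))
```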

Perhaps you can see why it takes a long time to establish construct validity.

Finding Preestablished Measures

Researchers and educators often use preestablished instruments to collect their data. One source that many researchers use is the Mental Measurement Yearbook (MMY). University and college libraries and many large public libraries carry this reference book both in their bound collections and through their electronic databases. Essentially, the MMY is an overview of a wide variety of measurement instruments. Each instrument described in the MMY includes the following information:

• Complete name of the instrument

• The publisher, date of publication, and most recent edition

• A description of what the test measures

• The population for which the test was designed

• Procedures for administration and scoring

• A critique of the instrument, including its validity and reliability (terms described in detail in this chapter)

In most cases, the description provided by the MMY is detailed enough for the researcher to know whether the instrument might be appropriate for his or her study. The MMY also provides enough information to generate an instrument section for a research proposal. Although this general description of the instrument may suffice initially, it is strongly encouraged that researchers obtain and review the actual instrument and its supporting technical information before using it in a study.
