
Thư viện số Văn Lang: Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS

Nguyễn Gia Hào

Academic year: 2023


Full text

Conditional mean item scores (x̄_ik) are estimated from raw data on percentile groupings of the total test score. In item analysis, the total test score is almost always used as an approximation of the trait of interest.
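The grouping idea described above can be sketched in a few lines. This is an illustrative computation on simulated data (all item responses and group counts here are hypothetical), not the ETS procedure itself: examinees are ranked on total score, split into percentile groups, and the mean item score is taken within each group.

```python
# Sketch: conditional mean item scores x-bar_ik from percentile groupings
# of the total test score (hypothetical simulated 0/1 item responses).
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 500, 20
theta = rng.normal(size=n_examinees)               # latent trait (unknown in practice)
diffs = rng.normal(size=n_items)                   # item difficulties
prob = 1 / (1 + np.exp(-(theta[:, None] - diffs))) # P(correct) rises with theta
item_scores = (rng.random((n_examinees, n_items)) < prob).astype(int)
total = item_scores.sum(axis=1)                    # proxy for the trait of interest

# rank examinees on total score and split into five percentile groups
group = (np.argsort(np.argsort(total)) * 5) // n_examinees

# x-bar_ik: mean score on item i within total-score group k
cond_means = np.array([[item_scores[group == k, i].mean()
                        for k in range(5)] for i in range(n_items)])
print(cond_means.shape)   # one row per item, one column per score group
```

Because groups are ordered by total score, each item's conditional means tend to rise from the lowest to the highest group, which is what an item-analysis plot displays.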

Table 2.1 Summary key item analysis concepts

Conditional Average Item Scores

Relatively strong mathematical models such as normal curves and logistic functions have been found to be unsatisfactory both in theoretical discussions (i.e., the mean slope of all test items' conditional mean item scores does not follow the normal curve model; Lord 1965a) and in empirical investigations (Lord 1965b). The rationale of the kernel smoothing procedure is to smooth out sample irregularities by averaging adjacent x̄_ik values, while still tracing the general trends in x̄_ik by giving the largest weights to the x̄_ik values at y scores closest to y_k and at y scores with relatively large conditional sample sizes.
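The weighting rationale can be made concrete with a minimal Nadaraya-Watson-style smoother. This is a sketch on hypothetical data, not the exact ETS kernel procedure: each smoothed value combines a Gaussian kernel in the score distance with the conditional sample sizes n_k, so nearby and well-populated score points dominate.

```python
# Sketch of kernel smoothing of conditional mean item scores x-bar_k over
# total-score points y_k (hypothetical data and an illustrative kernel).
import numpy as np

rng = np.random.default_rng(1)
y = np.arange(0, 21)                          # total-score points y_k
n_k = rng.integers(5, 60, size=y.size)        # conditional sample sizes
trend = 1 / (1 + np.exp(-(y - 10) / 3))       # smooth underlying trend
xbar = np.clip(trend + rng.normal(0, 0.08, y.size), 0, 1)  # noisy x-bar_k

def kernel_smooth(y, xbar, n_k, h=2.0):
    """Weighted average: Gaussian kernel in score distance times n_k."""
    out = np.empty_like(xbar)
    for j, y0 in enumerate(y):
        w = n_k * np.exp(-0.5 * ((y - y0) / h) ** 2)
        out[j] = np.sum(w * xbar) / np.sum(w)
    return out

smoothed = kernel_smooth(y, xbar, n_k)
```

The smoothed curve preserves the general trend in x̄_k while showing less point-to-point wiggle than the raw conditional means.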

Visual Displays of Item Analysis Results

First, the proportions of correct response to the items administered to 3-year-old, 4-year-old, …, 14-year-old subjects were converted into indices similar to the delta index shown in Eq. 2.3 (i.e., 'equated deltas', Δ_E), along with the percentage of subjects who responded to the item (P_TOTAL), the percentage of subjects who responded correctly to the item (P+), and the biserial correlation (r_bis).
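Eq. 2.3 is not reproduced in this summary; the sketch below assumes the standard ETS delta transformation Δ = 13 + 4z, where z is the normal deviate corresponding to the proportion answering incorrectly, so that harder items receive larger deltas.

```python
# Illustrative delta index: maps proportion correct p to the ETS delta
# scale (assumed form: 13 + 4 * inverse-normal of 1 - p).
from statistics import NormalDist

def delta(p):
    """ETS-style delta index for proportion correct p (0 < p < 1)."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p)

for p in (0.2, 0.5, 0.8):
    print(f"p = {p:.1f} -> delta = {delta(p):.2f}")
# p = 0.5 maps to delta = 13 exactly; lower p (harder item) gives a larger delta
```

Unlike raw proportion correct, the delta scale is an interval-like difficulty metric, which is what makes "equating" deltas across age groups meaningful.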

Fig. 2.1 Thurstone’s (1925) Figure 5, which plots proportions of correct response (vertical axis) to selected items from the Binet–Simon test among children in successive age groups (horizontal axis)

Roles of Item Analysis in Psychometric Contexts: Differential Item Functioning and Item Response Theory

  • Subgroup Comparisons in Differential Item Functioning
  • Comparisons and Uses of Item Analysis and Item Response Theory
    • Similarities of Item Response Theory and Item Analysis
    • Comparisons and Contrasts in Assumptions of Invariance
    • Uses of Item Analysis in Fit Evaluations of Item Response Theory Models
  • Item Context and Order Effects
  • Analyses of Alternate Item Types and Scores

Item analysis methods were applied to compare an item's level of difficulty for different examinee subgroups. Some ETS researchers have suggested the use of item analysis to evaluate IRT model fit (Livingston and Dorans 2004; Wainer 1989).

An alternative definition of the ETS delta scale of item difficulty (Research Report No. RR-85-43). Properties of test results expressed as functions of item parameters (Research Bulletin No. RB-50-56).

Psychometric Contributions: Focus on Test Scores

Test Scores as Measurements

  • Foundational Developments for the Use of Test Scores as Measurements, Pre-ETS
  • Overview of ETS Contributions
  • ETS Contributions About σ_E|T
  • Intervals for True Score Inference
  • Studying Test Score Measurement Properties With Respect to Multiple Test Forms and Measures
    • Alternative Classical Test Theory Models
    • Reliability Estimation
    • Factor Analysis
  • Applications to Psychometric Test Assembly and Interpretation

Factor analysis models can be viewed as multivariate versions of the classical test theory model described in the preceding section.

Test Scores as Predictors in Correlational and Regression Relationships

  • Foundational Developments for the Use of Test Scores as Predictors, Pre-ETS
  • ETS Contributions to the Methodology of Correlations and Regressions and Their Application to the Study
    • Relationships of Tests in a Population’s Subsamples With Partially Missing Data
    • Using Test Scores to Adjust Groups for Preexisting Differences (in practice, correlations and regressions are often used to serve such interests)
    • Detecting Group Differences in Test and Criterion Regressions (with contributions from ETS scientists such as Schultz, Wilks, Cleary, Frederiksen, and Melville)
    • Using Test Correlations and Regressions as Bases for Test Construction

For example, Novick (1983) elaborated on the importance of making appropriate assumptions about a subpopulation of which individuals are interchangeable members, Holland and Rubin (1983) advised investigators to express their untestable assumptions about causal inferences, and Linn and Werts (1973) emphasized research designs that provide sufficient information about the measurement errors of the variables. Further extensions deal with the calculation of composite battery scores as the sum of unweighted test scores in the battery rather than on the basis of regression weights (Jackson and Novick 1970).
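The contrast mentioned at the end of this passage, composite battery scores formed as unweighted sums versus regression-weighted composites, can be illustrated on hypothetical data. The simulation below is a sketch of the trade-off, not a reproduction of Jackson and Novick's analysis.

```python
# Illustrative contrast: unweighted-sum composite vs. least-squares
# regression-weighted composite against a criterion (hypothetical data).
import numpy as np

rng = np.random.default_rng(2)
n = 300
tests = rng.normal(size=(n, 3))                      # three test scores
criterion = tests @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.5, n)

unweighted = tests.sum(axis=1)                       # simple battery composite

# regression weights via least squares (with an intercept column)
X = np.column_stack([np.ones(n), tests])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
weighted = X @ beta

r_unw = np.corrcoef(unweighted, criterion)[0, 1]
r_wtd = np.corrcoef(weighted, criterion)[0, 1]
print(f"r(unweighted sum, criterion)     = {r_unw:.3f}")
print(f"r(regression composite, criterion) = {r_wtd:.3f}")
```

In-sample, the regression composite can never correlate with the criterion less than the unweighted sum; the practical argument for unweighted sums rests on simplicity and the instability of estimated weights in new samples.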

Integrating Developments About Test Scores as Measurements and Test Scores as Predictors

Lord also showed that the reliability of observed change can be estimated directly (a result related to the Lord-McNemar estimate of true change; Haertel 2006). Comparisons of the prediction error variances from these equations involve the components' reliability and their moderate correlation with the total test score.

Discussion

Test bias: The validity of the Scholastic Aptitude Test for black and white students in integrated colleges (Research Bulletin No. RB-66-31). An empirical study of normality and independence of measurement errors in test scores.

Contributions to Score Linking Theory and Practice

Why Score Linking Is Important

When two or more tests measuring different constructs are administered to a common population, the results for each test can be transformed to have a common distribution in the target population of test takers (i.e., the reference population). In this way, all tests appear to have been taken by equivalent groups of test takers from the reference population.
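The transformation described above can be sketched with an equipercentile-style mapping. All data here are hypothetical and `to_common_scale` is an illustrative helper, not a named ETS procedure: each score is carried through its percentile rank in its own test to the matching quantile of a common reference distribution.

```python
# Sketch: transform scores from two different tests onto a common scale
# defined by a reference distribution (hypothetical data).
import numpy as np

rng = np.random.default_rng(3)
scores_a = rng.normal(50, 10, 1000)   # test A in the reference population
scores_b = rng.normal(30, 5, 1000)    # test B in the reference population
ref = rng.normal(0, 1, 1000)          # common reference scale

def to_common_scale(x, pop, ref):
    """Map score x to the common scale via its percentile rank (equipercentile idea)."""
    pct = np.mean(pop <= x)                       # percentile rank within own test
    return np.quantile(ref, np.clip(pct, 0, 1))   # matching reference quantile

print(to_common_scale(50.0, scores_a, ref))       # a mean score on test A
print(to_common_scale(30.0, scores_b, ref))       # a mean score on test B
```

A mean-level score on either test lands near the middle of the common scale, which is exactly the sense in which the transformed distributions agree for the reference population.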

Conceptual Frameworks for Score Linking

  • Score Linking Frameworks
  • Equating Frameworks

Most of the research described in the following pages has focused on this particular form of scale aligning, known as score equating. Many of the types of score linking listed by Mislevy (1992) and Dorans can be found in the broad category of scale aligning, including concordance, vertical linking, and calibration.

Data Collection Designs and Data Preparation

  • Data Collection
  • Data Preparation Activities
    • Sample Selection
    • Weighted Samples
    • Smoothing
    • Small Samples and Smoothing

Some studies have attempted to link scores on tests in the absence of either common test material or similar groups of test takers. Irregularities in score distributions can produce irregularities in the equipercentile equating function that may not generalize to different groups of test takers, because methods developed for continuous data are being applied to discrete data.
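A minimal presmoothing sketch shows why smoothing the score distribution helps. This is a rough stand-in on hypothetical data: a low-degree polynomial is fitted to the log score frequencies, whereas the log-linear smoothing used in operational equating is fitted by maximum likelihood and preserves score moments.

```python
# Sketch: polynomial fit to log frequencies irons out sampling
# irregularities in a discrete score distribution (hypothetical data).
import numpy as np

rng = np.random.default_rng(4)
scores = rng.binomial(40, 0.55, size=400)            # discrete test scores 0..40
counts = np.bincount(scores, minlength=41).astype(float)

x = np.arange(41)
mask = counts > 0                                    # fit only observed scores
coef = np.polyfit(x[mask], np.log(counts[mask]), deg=3)
fitted = np.zeros_like(counts)
fitted[mask] = np.exp(np.polyval(coef, x[mask]))
smoothed = fitted / fitted.sum() * counts.sum()      # preserve the total N

print(counts[18:23])
print(np.round(smoothed[18:23], 1))
```

Equipercentile functions computed from the smoothed frequencies inherit far less sampling noise than those computed from the raw counts.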

Score Equating and Score Linking Procedures

  • Early Equating Procedures
  • True-Score Linking
  • Kernel Equating and Linking With Continuous Exponential Families
  • Preequating
  • Small-Sample Procedures

Levine true-score equating equates true scores. An equipercentile version of the Levine linear observed-score equating function, based on assumptions about true scores, was introduced in 2007. The procedure worked reasonably well in the score range containing the middle 90% of the data, as did the IRT true-score equating procedure.
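The true-score equating idea can be sketched in a few lines. The 2PL item parameters below are hypothetical, and operational programs also estimate parameters and handle lower asymptotes: a form-X true score is inverted through form X's test characteristic curve (TCC) to a theta, and that theta is pushed through form Y's TCC.

```python
# Sketch of IRT true-score equating with hypothetical 2PL parameters.
import numpy as np

def tcc(theta, a, b):
    """Test characteristic curve: expected number-correct (true) score."""
    return np.sum(1 / (1 + np.exp(-a * (theta - b))))

a_x, b_x = np.array([1.0, 1.2, 0.8]), np.array([-0.5, 0.0, 0.5])  # form X
a_y, b_y = np.array([0.9, 1.1, 1.0]), np.array([-0.3, 0.2, 0.6])  # form Y

def equate_true_score(tau_x, lo=-8.0, hi=8.0):
    """Map a form-X true score to the form-Y scale via the common theta."""
    for _ in range(80):                    # bisection: invert form X's TCC
        mid = (lo + hi) / 2
        if tcc(mid, a_x, b_x) < tau_x:
            lo = mid
        else:
            hi = mid
    return tcc((lo + hi) / 2, a_y, b_y)

print(round(equate_true_score(1.5), 3))
```

Because both TCCs are strictly increasing in theta, the resulting equating function is monotone in the form-X true score.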

Evaluating Equatings

  • Sampling Stability of Linking Functions
    • The Standard Error of Equating
    • The Standard Error of Equating Difference Between Two Linking Functions
  • Measures of the Subpopulation Sensitivity of Score Linking Functions
  • Consistency of Scale Score Meaning

These SEE estimates and related measures are based on the delta method. Using the approximate normality of the estimate, the SEE can be used to form confidence intervals.
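A sketch of how an SEE is used: with approximate normality, a 95% confidence band around an equated score is the estimate plus or minus 1.96 times the SEE. The SEE below comes from a simple bootstrap over hypothetical samples, standing in for the delta-method formulas the text refers to.

```python
# Sketch: standard error of equating (SEE) and a normal-theory confidence
# interval for a mean-sigma linear equating (hypothetical data).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(50, 10, 200)        # form X scores
y = rng.normal(52, 9, 200)         # form Y scores

def linear_equate(score, x, y):
    """Mean-sigma linear equating of a form-X score onto the form-Y scale."""
    return y.mean() + y.std() / x.std() * (score - x.mean())

est = linear_equate(60.0, x, y)
boot = [linear_equate(60.0, x[rng.integers(0, 200, 200)],
                      y[rng.integers(0, 200, 200)]) for _ in range(500)]
see = np.std(boot)
print(f"equated score = {est:.2f}, SEE = {see:.2f}")
print(f"95% CI = ({est - 1.96 * see:.2f}, {est + 1.96 * see:.2f})")
```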

Comparative Studies

  • Different Data Collection Designs and Different Methods
  • The Role of the Anchor
  • Matched-Sample Equating
  • Item Response Theory True-Score Linking
  • Item Response Theory Preequating Research
  • Equating Tests With Constructed-Response Items
  • Subscores
  • Multidimensionality and Equating
  • A Caveat on Comparative Studies

Based on the results of their study, they suggested using more than two links. ETS researchers (e.g., Cook et al. 1985) examined the relationship between violations of the unidimensionality assumption and the quality of IRT true-score equating.

The Ebb and Flow of Equating Research at ETS

  • Prior to 1970
  • The Year 1970 to the Mid-1980s
  • The Mid-1980s to 2000
  • The Years 2002–2015

When they compared three equating methods (the frequency estimation equipercentile method, the chained equipercentile method, and the IRT observed-score method), each performed best on data consistent with its assumptions. Holland and Dorans provided a detailed framework for classes of score linking (Sect. 4.2.1) as a further response to calls for linkages between scores from a variety of sources.

Books and Chapters

The third part of the book discussed educational testing programs in a state of transition. In volume 26 of the Handbook of Statistics, devoted to psychometrics and edited by Rao and Sinharay (2007), Holland et al. provided an introduction to test score equating and to the data collection designs and methods used for equating.

Concluding Comment

The effect of repeaters on equating scores in a comprehensive licensing test (Research Report No. RR-09-27). The effect of different types of anchor tests on observed-score equating (Research Report No. RR-09-41).

Item Response Theory

Some Early Work Leading up to IRT (1940s and 1950s)

Green was one of the first two psychometric fellows in the joint doctoral program of ETS and Princeton University. He introduced and defined many of the now common IRT terms, such as item characteristic curves (ICCs), test characteristic curves (TCCs), and standard errors dependent on latent ability.
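The terms introduced here can be made concrete with a small sketch. The 3PL item parameters below are hypothetical: an ICC gives the probability of a correct response as a function of latent ability theta, and the TCC is the sum of a test's ICCs.

```python
# Sketch: item characteristic curves (ICCs) and the test characteristic
# curve (TCC) for three hypothetical 3PL items.
import numpy as np

def icc(theta, a, b, c):
    """3PL item characteristic curve: P(correct | theta)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.25), (0.8, 1.0, 0.2)]  # (a, b, c)
tcc = sum(icc(theta, a, b, c) for a, b, c in items)
print(np.round(tcc, 2))   # expected number-correct score, rising with theta
```

The standard error of ability estimation mentioned above likewise varies with theta, being smallest where the ICCs are steepest.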

More Complete Development of IRT (1960s and 1970s)

An important aspect of the ETS work in the 1960s was the development of software, notably by Wingersky, Lord, and Andersen (Andersen 1972; Lord 1968a; Lord and Wingersky 1973), which enabled practical applications of IRT. During this period, Erling Andersen visited ETS and during his stay developed one of the main works on goodness-of-fit testing for the Rasch model (Andersen 1973).

Broadening the Research and Application of IRT (the 1980s)

  • Further Developments and Evaluation of IRT Models
  • IRT Software Development and Evaluation
  • Explanation, Evaluation, and Application of IRT Models

With regard to IRT software, Mislevy and Stocking (1987) provided a guide to using the LOGIST and BILOG software programs that was very helpful for new users of IRT in applied settings. Kingston and Dorans (1982a) used IRT in the analysis of the effect of item position on test taker response behavior.

Advanced Item Response Modeling: The 1990s

  • IRT Software Development and Evaluation
  • Explanation, Evaluation, and Application of IRT Models

Also in the DIF area, Dorans and Holland (1992) produced a widely cited and widely used work on the Mantel-Haenszel (MH) and standardization methodologies, in which they also outlined the relationship of the MH approach to IRT models. The use of the model was illustrated with NAEP writing trend data, together with a discussion of item parameter drift.

IRT Contributions in the Twenty-First Century: Advances in the Development of Explanatory Models

Antal (2007) presented a coordinate-free approach to MIRT models, focusing on understanding these models as extensions of the univariate models. Hartz and Roussos (2008) wrote about the fusion model for skill diagnosis, indicating that the development of the model has yielded advances in modeling, parameter estimation, model fitting methods, and model fit evaluation procedures.

IRT Software Development and Evaluation

  • Explanation, Evaluation, and Application of IRT Models
  • The Signs of (IRT) Things to Come

In the same publication, Wendler and Walker (2006) discussed IRT scoring methods, and Davey and Pitoniak (2006) discussed computerized adaptive testing (CAT) design, including the use of IRT in scoring, calibration, and scaling. Other work from 2007 described Bayesian network models and their use in IRT-based modeling of cognitive diagnosis (CD). At the time of writing this chapter, history is still in the making; there are three other edited volumes, made possible by the contributions of ETS researchers, reporting on the use of IRT in a variety of applications.

Conclusion

An upper asymptote for the three-parameter logistic item-response model (Research Report No. RR-81-20). Sampling variance and covariance of parameter estimates in item response theory (Research Report No. RR-82-33).

Research on Statistics

Linear Models

  • Computation
  • Inference
  • Prediction
  • Latent Regression

From the beginning, researchers have been interested in the strength of the relationship between college admissions test scores and school performance, as measured by grades. A long-standing concern in studies of predictive validity, especially in the context of college admissions, is the nature of the criterion.

Bayesian Methods

  • Bayes for Classical Models
  • Later Bayes
  • Empirical Bayes

Comparisons of the ETS approach with so-called direct estimation methods were also carried out. Rubin (1979a) provided a Bayesian analysis of the bootstrap procedure proposed by Efron, which had already gained some fame.
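The Bayesian treatment of the bootstrap attributed to Rubin above has a compact modern formulation, the Bayesian bootstrap, which the sketch below illustrates on hypothetical data: instead of resampling with replacement, each posterior draw reweights the observed values with Dirichlet(1, …, 1) weights and recomputes the statistic.

```python
# Sketch: Bayesian bootstrap for the mean of a hypothetical sample.
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(10, 2, 50)                 # observed values

draws = []
for _ in range(1000):
    w = rng.dirichlet(np.ones(data.size))    # posterior weights on the data
    draws.append(np.sum(w * data))           # weighted mean under those weights

draws = np.array(draws)
print(f"posterior mean ~ {draws.mean():.2f}, posterior sd ~ {draws.std():.2f}")
```

The spread of the draws closely matches the frequentist bootstrap's sampling distribution of the mean, which is the connection the Bayesian analysis makes precise.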

Causal Inference

If this is the case, the observed mean differences between the matched treatment groups would be approximately unbiased estimates of the treatment effects. A comparison of the results of applying different VAMs to the same data was considered in Braun, Qu and Trapani (2008).

Missing Data

They showed that if there are multiple observations on the outcome, under certain stability assumptions it is possible to obtain estimates of the parameters controlling the unobserved binary variable and thus obtain a point estimate of the treatment effect in the extended model. A more expository account of the EM algorithm and its applications can be found in Little and Rubin (1983).
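A minimal EM sketch, in the spirit of the account cited above: estimate the mean of a N(mu, 1) sample in which some values are only known to exceed a cutoff. All data are hypothetical; the E-step replaces each censored value with its conditional expectation (a truncated-normal mean) and the M-step averages.

```python
# Sketch: EM for the mean of a normal sample with right-censored values
# (known unit variance, hypothetical data).
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
full = rng.normal(1.0, 1.0, 300)
cut = 2.0
obs = full[full <= cut]                 # fully observed values
n_cens = int(np.sum(full > cut))        # values known only to exceed `cut`

nd = NormalDist()
mu = obs.mean()                         # naive start, biased low
for _ in range(50):
    z = cut - mu
    # E-step: E[X | X > cut] for N(mu, 1) is mu + pdf(z) / (1 - cdf(z))
    e_cens = mu + nd.pdf(z) / (1 - nd.cdf(z))
    # M-step: refit mu with the imputed values standing in for the censored ones
    mu = (obs.sum() + n_cens * e_cens) / (obs.size + n_cens)

print(f"EM estimate of mu: {mu:.3f} (naive mean of observed: {obs.mean():.3f})")
```

The EM estimate climbs back toward the true mean, whereas the naive mean of the uncensored values stays biased low.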

Complex Samples

Regarding inference, Rubin investigated the conditions under which estimation in the presence of missing data would yield unbiased parameter estimates. Various strategies have been advanced for handling missing (or omitted) responses, especially for cognitive items.

Data Displays

Conclusion

Notes on the use of log-linear models for fitting discrete probability distributions (Research Report No. RR-87-31). Parameter recovery and subpopulation skill estimation in hierarchical latent regression models (Research Report No. RR-07-27).

Contributions to the Quantitative Assessment of Item, Test, and Score Fairness

Fair Prediction of a Criterion

Essentially, the Cleary model examines whether the regression of the criterion on the predictors is invariant across subpopulations. X and Y were measures of the same construct, but scaling test scores to grades (or vice versa) was acknowledged to be problematic.
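The invariance check at the heart of the Cleary model can be sketched directly. All data here are hypothetical: fit the criterion-on-predictor regression separately in two subpopulations and compare intercepts and slopes; invariance means a single line fits both groups.

```python
# Sketch of a Cleary-style fairness check: compare subgroup regressions
# of a criterion on a predictor (hypothetical data, no real groups).
import numpy as np

rng = np.random.default_rng(8)

def fit_line(x, y):
    slope, intercept = np.polyfit(x, y, 1)   # highest degree first
    return intercept, slope

x1 = rng.normal(0.0, 1, 400)
y1 = 1.0 + 0.5 * x1 + rng.normal(0, 0.4, 400)     # group 1
x2 = rng.normal(0.3, 1, 400)
y2 = 1.0 + 0.5 * x2 + rng.normal(0, 0.4, 400)     # group 2, same true line

(i1, s1), (i2, s2) = fit_line(x1, y1), fit_line(x2, y2)
print(f"group 1: intercept {i1:.2f}, slope {s1:.2f}")
print(f"group 2: intercept {i2:.2f}, slope {s2:.2f}")
# similar coefficients -> no evidence of differential prediction
```

In practice the comparison is made with formal tests on the coefficient differences rather than by inspection, but the quantity being examined is the same.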

Differential Item Functioning (DIF)

  • Differential Item Functioning (DIF) Methods
    • Early Developments: The Years Before Differential Item Functioning (DIF) Was Defined at ETS
    • Mantel-Haenszel (MH): Original Implementation at ETS
    • Subsequent Developments With the Mantel-Haenszel (MH) Approach
    • Standardization (STAND)
    • Item Response Theory (IRT)
    • SIBTEST
  • Matching Variable Issues

The inclusion of the studied item in the matching variable and the refinement (purification) of the matching criterion were discussed. Based on their study, they recommended including the studied item in the matching variable when the MH procedure is used to detect DIF.
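The MH computation itself is compact enough to sketch with hypothetical counts: within each matching-score level k, a 2-by-2 table of group (reference/focal) by item score (right/wrong) feeds the MH common odds ratio, which ETS reports on the delta scale as MH D-DIF = -2.35 ln(alpha_MH), negative values indicating an item relatively harder for the focal group.

```python
# Sketch: Mantel-Haenszel common odds ratio and MH D-DIF from
# hypothetical 2x2 tables, one per matching-score level k.
import math

# per level k: (ref right, ref wrong, focal right, focal wrong)
tables = [(30, 20, 20, 30), (40, 15, 30, 25), (50, 10, 45, 15)]

num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
alpha_mh = num / den                      # MH common odds ratio
mh_d_dif = -2.35 * math.log(alpha_mh)     # ETS delta-scale DIF index
print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```

Here alpha_MH > 1 (the reference group is more likely to answer correctly at the same matching score), so MH D-DIF comes out negative.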

Table 7.1 2-by-2-by-M contingency table for an item, viewed in a 2-by-2 slice of item score

Fig. 2.2 Thurstone’s (1925) Figure 6, which represents Binet–Simon test items’ average difficulty on an absolute scale
Fig. 2.3 Turnbull’s (1946) Figure 1, which reports a multiple-choice item’s normalized graph (right) and table (left) for all of its response options for six groupings of the total test score
