Educational Evaluation, Assessment, and Monitoring

The book concentrates on the application of educational evaluation, assessment and monitoring activities embedded in organizational, management and teaching processes. Its structure is built around a three-dimensional model that is used to distinguish different types of educational "M&E", as it is sometimes abbreviated.

Basic Concepts

Monitoring and Evaluation (M&E) in Education: Concepts, Functions and Context

Introduction
Why do we Need Monitoring and Evaluation in Education?
A Conceptual Framework to Distinguish Technical Options in Educational M&E
Pre-Conditions in Educational M&E
Conclusion: Why Speak of “Systemic Educational Evaluation”?

By crossing these three dimensions (see Table 1.1), the main forms of educational M&E can be characterized. What the example illustrates is that when it comes to taking concrete steps to establish or improve educational M&E, one cannot take "the political will" to do so for granted.

Basics of Educational Evaluation

Introduction

Basics of Evaluation Methodology

Evaluation objects, criteria and standards
Measurement of criteria and antecedent conditions
Measuring outcomes
Controlling for background variables (value added)
Design: answering the attribution question

Indicators of the context, for example of the school, can be judged according to whether they are favorable or unfavorable for the proper functioning of the school. External validity is threatened by selection biases (uncontrolled initial differences between treatment groups that interact with treatment conditions) and by artificial aspects of the experimental situation.

Important Distinctions in Evaluation Theory

Ideal-type stages in evaluation
Formative and summative roles
Accountability and improvement perspectives reconsidered

In this situation, evaluability assessment is described as an analytical activity that focuses on the structure and feasibility of the program to be evaluated. The key variables on which to collect data should be selected based on the determination of evaluation criteria and standards (ends) and the structure of the program (means)—see ad a) and ad b) above.

Introduction

Forms That are Based on Student Achievement Measurement

National assessment programs
International assessment programs: general description
School performance reporting: general description
Student monitoring systems: general description
Assessment-based school self-evaluation: general description
Examinations: general description

Main target groups and types of use of the information: these are more or less the same as in national assessment programs. By allowing the objective scoring of open-ended test items, the testing of more general cognitive skills, and "authentic testing," achievement testing methodology appears to be "moving up" in addressing these more complex aspects.

Forms That are Based on Education Statistics and Administrative Data

System level management information systems: general description
School management information systems: general description

MIS requires an Office of Education Statistics with a specialized unit for developing indicators in areas where traditional statistics do not fully cover all categories of the theoretical model. The information could be used for all kinds of corrective actions in school management.

Forms That are Based on Systematic Review, Observations and (Self)Perceptions

International review panels: general description
School inspection/supervision: general description
School self-evaluations, including teacher appraisal: general description
School audits: general description
Monitoring and evaluation as part of teaching

Finally, some of these forms of school self-evaluation can be combined and integrated with each other. School management and staff teams are the primary audience for school self-evaluation results.

Program Evaluation and Teacher Evaluation

Program evaluation
Teacher evaluation

Results of program evaluations can lead to political disputes when the results are critical, the stakes in the program are high and the credibility of the applied research methodology is less than optimal. This type of "input" control has long been one of the most important measures for quality care in education, especially when combined with another type of input control, namely centrally standardized curricula.

Theoretical Foundations of Systemic M&E

The Political and Organizational Context of Educational Evaluation

Introduction

Rationality Assumptions Concerning the Policy-Context of Evaluations

According to the third characteristic of the rationality model, planned programs are 'actually' implemented. In many cases, a clear exploration of the goals of the evaluation will help overcome resistance.

Gearing Evaluation Approach to Contextual Conditions; the Case of Educational Reform Programs

Phase models
Articulation of the decision-making context
Monitoring and evaluation in functionally decentralized education systems

The next question in the sequence is whether the intended direct results of the project have been achieved. The distinction between areas of decision-making in educational systems has some similarities with Bray's use of the term "functional decentralization," as cited by Rondinelli. Examining the location of decision-making in relation to domains and subdomains is one of the most interesting possibilities.

Figure 4.1 provides a schematic model of the progression of events.

Creating Pre-Conditions for M&E

Political will and resistance
Institutional capability for M&E
Organizational and technical capacity for M&E

But the rules of the game can also be less formal and depend on convention and implicit norms. Institutional capacity for M&E is most realistically addressed as an assessment activity to gain an idea of the general climate in which M&E activities will “land” in a country. In the case of gaps, several options should be considered: narrowing the M&E objectives or changing and improving current practices, e.g.

Conclusion: Matching Evaluation Approach to Characteristics of the Reform Program, Creating Pre-Conditions and Choosing an

The above conjectures all express a contingent approach: the appropriateness or efficiency in the choice of monitoring and evaluation strategy depends on the characteristics of the reform context. The fields of educational evaluation in the sense of measuring student performance on the one hand and educational evaluation in the sense of program evaluation on the other hand have developed as two relatively separate fields. In all reform programmes, where some kind of curriculum revision is at stake, it would also be possible to use assessments of student achievement in the particular curricular area as effect criteria.

Evaluation as a Tool for Planning and Management at School Level

Introduction
The Rationality Paradigm Reconsidered

Synoptic planning and bureaucratic structuring
Creating market mechanisms: alignment of individual and organizational rationality
The cybernetic principle: retroactive planning and the learning organization
The importance of the cybernetic principle
Retroactive planning

The Organizational Structural Dimension

Organizational learning in “learning organizations”
Management in the school as a “professional bureaucracy”
Educational leadership as a characteristic of “effective schools”
Schools as learning organizations?

Conclusion: The Centrality of External and Internal School Self-Evaluation in Learning and Adapting School Organizations

In the remaining sections, the focus will shift from procedural variations of the rationality model to organizational structures. Operational management is firmly in the hands of the professionals (teachers) in the operational core (classroom) of the organization. Second, "pedagogical management" is not entirely at odds with certain demands of the professional bureaucracy.

Figure 5.1 The complete cycle of choice, cited from March & Olsen (1976).

Assessment of Student Achievement

Basic Elements of Educational Measurement

Introduction

This includes the objective of the test (e.g., curriculum-based skills, cognitive or psychomotor abilities) and the type of decisions to be made (e.g., mastery decisions, pass/fail decisions, selection, prediction). Depending on the purpose of the test, the content area of the test and the level at which the content should be tested can be determined. Test scoring and analysis of tests and items, including conversion of scores into grades and evaluation of test quality as a measurement instrument.

Test Purposes

Construction of test materials, such as construction of multiple-choice items and open-ended items, or the construction of performance assessments. This chapter will conclude with some of these topics: assessment systems, item banking, optimal test construction, and computerized adaptive testing. Implications for future criterion setting concern knowledge, skills and affective goals that must persist in the period after the teaching has ended.

Quality Criteria for Assessments

A judgment of the relevance and representativeness of the content is based on a specification of the boundaries and structure of the domain to be tested. One can think of the properties of the test (for example, the length of the test) or the evaluation procedure (for example, rater effects). The purpose of a reliability analysis is to quantify the consistency and inconsistency of student performance on the test.
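One common way to quantify the consistency of student performance across the items of a test is an internal-consistency coefficient. The following is a minimal sketch using Cronbach's alpha, which the text above does not name; it is chosen here only as a familiar illustration. The function name and the data layout (one row of item scores per student) are assumptions.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Internal-consistency estimate for a students-by-items score table.

    scores: list of rows, one per student, each row a list of item scores.
    (Hypothetical layout, chosen for this illustration.)
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Variance of each item across students.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total (sum) score across students.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Four students, three dichotomous items that agree perfectly.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(cronbach_alpha(consistent))
```

Longer tests tend to yield higher values of such coefficients, which is one reason test length is mentioned above as a property that affects measurement quality.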

Test Specifications

Specification of test content
Specification of cognitive behavior level

The choice of test content involves a trade-off between the breadth of content coverage and the reliability of subscores. The main objective of the specification table is to ensure that the test is a valid reflection of the test domain and purpose. Table cell entries give the relative importance of a specific combination of content and cognitive level of behavior on the test.
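The specification table can be read as a set of relative weights that are converted into item counts once a test length is fixed. The sketch below shows one possible way to do this; the content areas, cognitive levels, weights and the largest-remainder rounding rule are all illustrative assumptions, not taken from the book.

```python
def allocate_items(weights, test_length):
    """Turn relative cell weights into whole item counts.

    weights: dict mapping (content_area, cognitive_level) to a relative weight.
    Uses largest-remainder rounding so the counts sum to test_length.
    """
    total = sum(weights.values())
    exact = {cell: test_length * w / total for cell, w in weights.items()}
    counts = {cell: int(x) for cell, x in exact.items()}
    leftover = test_length - sum(counts.values())
    # Hand the remaining items to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical specification table: weights are percentages of the test.
spec = {
    ("fractions", "knowledge"): 20,
    ("fractions", "application"): 30,
    ("geometry", "knowledge"): 10,
    ("geometry", "application"): 40,
}
for cell, n in allocate_items(spec, 40).items():
    print(cell, n)
```

Cells that receive only a handful of items illustrate the trade-off noted above: broad coverage leaves few items per cell, so subscores for those cells will be unreliable.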

Test Formats

Selected response formats
Constructed response formats
Performance assessments
Choosing a format

One of the main mistakes made in this format is that the wording of the statement closely matches the wording used in the instructional materials. The relationship between the number of items to be administered and the guessing probability will also be discussed in the next chapter. Fill-in items resemble multiple-choice items in that they can (in principle) be scored objectively and, because many of them can be administered in a given time, can provide good content coverage.

Table 6.3 Indication of Response Time per Item Type.

Test and Item Analysis

The expected value of the observed test score is equal to the true score. Unreliability of the scores suppresses (attenuates) the correlation between the observed scores. The dependence of reliability on the variance of the true scores can be abused.
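The suppression of correlations by unreliability can be made concrete with the classical attenuation relation: under classical test theory with uncorrelated errors, the correlation between two observed scores equals the correlation between the true scores multiplied by the square root of the product of the two reliabilities. The sketch below is only an illustration; the function and parameter names are assumptions.

```python
import math


def _check_reliability(value):
    if not 0 < value <= 1:
        raise ValueError("reliability must lie in (0, 1]")


def attenuated_correlation(true_corr, rel_x, rel_y):
    """Observed-score correlation implied by a true-score correlation."""
    _check_reliability(rel_x)
    _check_reliability(rel_y)
    return true_corr * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_corr, rel_x, rel_y):
    """Estimate of the true-score correlation from an observed correlation."""
    _check_reliability(rel_x)
    _check_reliability(rel_y)
    return observed_corr / math.sqrt(rel_x * rel_y)


# With hypothetical reliabilities of 0.7 and 0.8, a true-score correlation
# of 0.6 appears noticeably smaller in the observed scores.
print(attenuated_correlation(0.6, 0.7, 0.8))
```

Because reliability is a ratio of true-score variance to observed-score variance, the same test shows a higher reliability in a more heterogeneous group, which is how the dependence mentioned above can be abused.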

Assessment Systems

Item banking
Item construction
Item bank calibration
Optimal test assembly
Computer based testing
Adaptive testing

What the test constructor has done is change the definition of the population of interest. An overview of the use of computerized testing in psychological assessment can be found in Butcher (1987). Although a fitting IRT model can support construct validity, it does not imply that the test is reliable.

Figure 6.1 Overview of an assessment system.

Measurement Models in Assessment and Evaluation

Introduction

Unidimensional Models for Dichotomous Items

Parameter separation
The Rasch model
Two- and three-parameter models
Estimation procedures
Local and global reliability
Model fit

The item parameters are estimated simultaneously with the mean and standard deviation of the ability parameters. In the section on model fit, a test for the appropriateness of the ability distribution will be described. Note that the model violation did not result in a significantly biased estimate of the item parameters.
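For orientation, the Rasch model and the two- and three-parameter models listed above can be written as one logistic response function in their standard textbook form. The sketch below uses the common symbols a (discrimination), b (difficulty) and c (guessing); these symbols and the function name are assumptions, since the excerpt itself does not give the notation.

```python
import math


def prob_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: ability; b: item difficulty; a: discrimination; c: guessing.
    a=1 and c=0 gives the Rasch model; c=0 gives the two-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Rasch item of difficulty 0 for three ability levels.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(prob_correct(theta, b=0.0), 3))

# Hypothetical four-option multiple-choice item with guessing set to 0.25.
print(prob_correct(0.0, b=0.0, a=1.2, c=0.25))
```

In the Rasch case the probability depends only on the difference between ability and difficulty, which is what makes the parameter separation mentioned in the section list possible.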

Table 7.1 Data Matrix with Observed Scores.

Models for Polytomous Items

Introduction
Adjacent-category models
Continuation-ratio models
Cumulative probability models
Estimation and testing procedures

In the following, the answer to item k can be in one of the categories m=0,..., Mk. For dichotomous items, the response function was defined as the probability of a correct response as a function of the ability parameter θ. In this formulation, we define the item category function as the probability of scoring in a given item category as a function of the ability parameter θ.
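As an illustration of an item category function, the sketch below computes the category probabilities for one item with categories 0 to M under an adjacent-category model in the partial credit parameterization. That parameterization, the threshold symbols and the function name are assumptions made for the example; the book may use a different formulation.

```python
import math


def category_probabilities(theta, thresholds):
    """Category probabilities for a polytomous item with categories 0..M.

    thresholds: list of M step parameters b_1..b_M (hypothetical values).
    P(m | theta) is proportional to exp(sum over h <= m of (theta - b_h)),
    with the empty sum for m = 0 equal to zero.
    """
    sums = [0.0]
    for b in thresholds:
        sums.append(sums[-1] + (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(sums)
    weights = [math.exp(s - peak) for s in sums]
    total = sum(weights)
    return [w / total for w in weights]


# Item with three categories (M = 2) and step parameters -1 and 1.
for theta in (-2.0, 0.0, 2.0):
    probs = category_probabilities(theta, [-1.0, 1.0])
    print(theta, [round(p, 3) for p in probs])
```

With a single threshold the function reduces to the dichotomous response function of the Rasch model, so the polytomous case extends the definition given for dichotomous items.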

Figure 7.5 Response curves of a polytomously scored item.

Multidimensional Models

In general, however, these identification constraints will do little to provide an interpretation of the dimensions of ability. This approach is a generalization of the marginal maximum likelihood (MML) estimation procedure for unidimensional IRT models (see Bock & Aitkin, 1981), and has been implemented in TESTFACT (Wilson, Wood & Gibbons, 1991). In the framework of adjacent category models, the logistic versions of the probability of a response in category m can be written as.

Figure 7.6 Item response surface for a multidimensional IRT model (Reckase, 1977).

Multilevel IRT Model

Models for item parameters
Testlet models
Models for ratings

In the level 2 model, the values of the item parameters are considered as realizations of a random vector. It is assumed that the item parameters have a 3-variate normal distribution with mean µp and covariance matrix Σp. This approach discards part of the information in the item responses, which will lead to a certain loss of measurement precision.

Applications of Measurement Models

Test Equating and Linking of Assessments

Data collection designs
Multi-stage testing
Test equating

Note that this expectation depends only on the parameters of the items of the reference exam. In the example of Table 8.3, the cut-off point of the reference exam was 27; consequently, 28.0% failed the exam. In the bold row marked "English H" information is given about the results of the reference population doing the reference examination.

Multiple Populations in IRT

Differences between populations
Multilevel regression models on ability

The Bayesian approach deals with the posterior distribution of the parameters, for example p(θ, δ, β, µ, σ | y). At the student level, the variables were gender (0=male, 1=female), SES (with two indicators: father's and mother's education, scores ranging from 0 to 8) and IQ (range from 0 to 80). It can be seen that the magnitudes of the fixed effects in the MLIRT model were larger than the analogous estimates in the ML model.