
8.1 Chapter 3 marshalled evidence about the educational outcomes which appear to have flowed from the increases in expenditure on education since 1973. It began with a brief review of the difficulties of determining criteria by which to identify and evaluate outcomes and a comment on the relative lack of relevant data for such purposes in Australia. The chapter provided a range of evidence of benefits from the educational initiatives of the 1970s but this evidence is not systematic and certainly not exhaustive.

8.2 This chapter looks ahead. It is concerned with the question of whether more systematic evaluation procedures should be established to monitor either the general progress of education in Australia or the consequences of specific initiatives.

EVALUATING PROGRAMS

8.3 The specific purpose programs administered by the Commonwealth Schools Commission have had specific and relatively limited purposes. They have had more or less explicitly stated goals in terms of which their effectiveness might be judged. Such programs provide a better opportunity for evaluation than do more general efforts to improve the quality of schooling, such as increasing the general recurrent resources of school systems and schools. That is not to say, however, that specific initiatives are necessarily to be preferred to general ones. Relative ease of evaluation is not an adequate criterion for choosing among initiatives.

8.4 Even with the specific purpose programs of the Schools Commission the task of evaluation is complex. The adequacy and appropriateness of the programs themselves must be evaluated. The variations in implementation across the range of school systems must be monitored. Both intended and unintended consequences must be assessed and their importance judged. Evaluations of this type are summative, undertaken to offer a final summing up of what a program's impact has been and what it is worth (1).

8.5 Whether the magnitude of any net beneficial outcomes justifies the magnitude of the inputs which yielded them is to a large extent a matter of judgment. Similarly, to decide whether the same inputs put to some other purpose might have yielded more or less benefit requires a judgment of both the relative quality and quantity of benefit. For example, a given resource level in a music program might yield a particular increase in the number of students sufficiently skilled to move to professional careers as musicians. The same resources directed to instruction in spelling might yield a better general performance in the spelling of irregular English words. The choice between programs, even if the evaluative data were as clear as this illustration suggests, would be unlikely to be unanimous.

8.6 The complexity of the judgments involved in summative evaluations of this kind does not necessarily appear where formative evaluation is used. This involves neither the comparative judgment of one program against another nor absolute judgments about the ultimate benefits of a program. Instead, programs are evaluated to establish how well they are operating. In this process, program objectives are usually, but not always, taken as given and the evaluation tends to focus on identifying measures which might be implemented to improve program operation. In many respects, formative evaluation is little more than an extension of sound program administration, that is, the continuous monitoring and modification of a program's operation to ensure that it is working well in relation to set objectives.

8.7 Summative evaluations are more clearly feasible if a program is limited in its scope and relatively quick to yield noticeable consequences. Commonwealth specific purpose programs have tended to be broad thrusts with benefits more likely to be mid to long term than short term. The Schools Commission has commissioned evaluations of these programs at various times in the last decade and the evaluators, themselves independent of the programs, have tended to offer formative, not summative, evaluations. They have certainly identified successful and unsuccessful aspects of the programs but their purpose has been to suggest modifications which might improve program efficiency or effectiveness. As evaluators, they have generally not attempted to answer the question of whether the benefits a program has yielded justify its continuation or abandonment.

8.8 Much public debate about education, and the Committee's terms of reference, presume that questions about the value of programs can be answered adequately. In practice, the answers can seldom be straightforward. For example, the American Head Start Program referred to in paragraph 3.3 had its federal funds reduced on the strength of a major evaluation which reported that the program was not effective (2). New analyses, based on performances of the children involved ten years later, show that the program in fact produced at least some of the intended results. Significant gains were reported precisely where the funds were targeted: in reading and writing rather than mathematics or science; among disadvantaged groups and initially low achievers rather than others; among blacks more than whites; and in the southeast more than in other areas. The program appears to have worked in a way that a major early evaluation failed to detect because it was conducted too soon.

8.9 The Committee believes that the Commonwealth should evaluate its programs for schools and seek evidence of the outcomes arising from them. With the specific purpose programs there should be formative evaluation in terms of the programs' objectives, carried out by those who manage the programs and, periodically, through the appointment of external evaluators. These evaluators may also be used to undertake summative evaluation once a program is well established and longer term effects can be investigated. However, the final judgments about the worth of any particular program are likely to require assessments of it in comparison with other programs and a consideration of whether the objectives being pursued through it are themselves worthwhile. These are essentially judgments to be made following formal evaluations of either the formative or summative kind. In the Committee's opinion an agency like the Schools Commission can play a major role in informing judgments of this kind.

8.10 The question of evaluation of the more general Commonwealth initiatives is more complex. The objectives of the General Recurrent Grants Program are diffuse and, furthermore, the Program operates under the control of government and non-government systems and individual non-government school authorities. In some cases the funds are directed by the recipient authority to specific programs within a system or school. In others, they provide a relatively large proportion of the total recurrent expenditure by the school or school system. Elsewhere they provide general supplementation for programs already operating.

8.11 In many cases the Commonwealth funds form a small proportion of the total expenditure, for example, in government school systems and in those non-government schools eligible for only the lowest rates of Commonwealth subsidy. They are nevertheless important for the flexibility they provide. Evaluation of their impact is relevant and necessary but a precise demarcation between their effects and those of the resources provided by the school authority concerned is difficult to establish. The Commonwealth might oblige the recipient authorities to undertake and then report their own evaluations. Alternatively, it might set more prescriptive guidelines for the way in which its funds are to be used and seek evaluative information accordingly. It might also choose to reserve the right to conduct its own evaluations as a condition of its funding.

MONITORING THE EDUCATIONAL SYSTEM

8.12 The evaluation question most evident in public discussion of education is not concerned with the efficacy of specific programs but with the total effects of education, that is, the general consequences of all the separate programs which contribute to the education of a person in our society. Some of the criticisms of schooling expressed in public debate reflect an informal evaluation of the total educational system using student achievement as the criterion for judgments. Much of that criticism takes a narrow view of student achievement, focusing on basic skills or on the interpretation of public examination results, and uses personal experiences or others' anecdotes to judge the performance of the educational system.

Measurement of Student Achievement

8.13 Assessment of students is a routine activity in education. Teachers assess students' performances to diagnose misunderstandings and other weaknesses in order to plan further instruction. This type of diagnostic assessment is formative in its purpose. It is not intended to make some summary judgment of a student's progress but to make it clear how best to promote the student's further development. The assessment itself may involve the teacher's own observations of the student working in class or the use of more structured tasks in tests designed by the teacher or some external agency. Particularly useful in this regard are item banks, or stocks of test items from which teachers are able to construct tests appropriate to the purpose they have at a particular time or with a particular student. Developed by subject and test design experts, item banks have the advantage of allowing teachers to compare their students' attainment with an expected pattern and to devise remediation exercises where necessary. They are being used in most government school systems (see paragraph 3.33).
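The mechanics of an item bank can be illustrated with a short, hypothetical sketch in Python (the report itself contains no code, so the language, the Item structure, the field names and the sample items are all invented for illustration). It shows the operations the paragraph describes: storing expert-designed items tagged by topic with an expected facility, assembling a test to suit a particular purpose, and flagging where a class's attainment falls short of the expected pattern.

```python
# A minimal, hypothetical sketch of an item bank: expert-written test items
# tagged by topic and expected facility (the proportion of students an
# expert expects to answer correctly). All names and values are invented.
from dataclasses import dataclass
import random

@dataclass
class Item:
    topic: str                # curriculum area the item tests
    prompt: str               # the question put to the student
    expected_facility: float  # expert estimate of proportion answering correctly

BANK = [
    Item("fractions", "Write 3/4 as a decimal.", 0.70),
    Item("fractions", "Which is larger, 2/3 or 3/5?", 0.60),
    Item("spelling", "Spell the word meaning 'to receive'.", 0.55),
    Item("spelling", "Spell 'necessary'.", 0.45),
]

def build_test(topic: str, n_items: int) -> list[Item]:
    """Draw a test on one topic from the bank, as a teacher might."""
    pool = [item for item in BANK if item.topic == topic]
    return random.sample(pool, min(n_items, len(pool)))

def compare_with_expectation(item: Item, proportion_correct: float) -> str:
    """Flag items where the class falls short of the expected pattern."""
    if proportion_correct < item.expected_facility - 0.10:
        return "below expectation: consider remediation"
    return "within expectation"

test = build_test("fractions", 2)
for item in test:
    # In practice the proportion correct would come from marking the
    # class's actual responses; 0.50 here is a placeholder.
    print(item.prompt, "->", compare_with_expectation(item, 0.50))
```

In a real bank the expected facilities would come from trialling by the subject and test design experts the paragraph mentions, not from fixed guesses as here.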

8.14 Assessment of students may also be summative. External examinations at the end of Year 12 attempt to provide assessments of the levels students have reached in the subjects they have studied. These assessments have virtually no diagnostic function since they are seldom used by teachers in post-school education to decide what individual students next need to learn. The assessments are summary statements of students' achievements. Similar statements are made to parents on students' school reports, although sometimes those assessments and the reports are given a diagnostic orientation as well.

8.15 To interpret a summative assessment of a student's performance, some point of reference or comparison is needed. As indicated in Chapter 7, the reference is most commonly normative, allowing each student's performance to be compared with those of other students. Rank in class reported to parents gives them a basis for judging their child's performance in comparison with others in their child's class, but without some knowledge of how that class compares with other classes, the parents may seriously misinterpret the information.
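A small worked example (with invented scores) makes the caution concrete: the same rank can correspond to quite different levels of achievement in different classes, so rank alone tells parents little without some wider point of reference.

```python
# Hypothetical illustration: the same class rank can reflect very
# different levels of achievement in different classes.

def rank_in_class(scores, student_score):
    """Rank of a student's score within a class (1 = best)."""
    return 1 + sum(1 for s in scores if s > student_score)

strong_class = [95, 92, 90, 88, 85, 84, 82, 80, 78, 75]
weak_class   = [60, 55, 52, 50, 48, 45, 44, 42, 40, 35]

# A student scoring 85 in the strong class and one scoring 48 in the
# weak class hold the same rank (5th of 10) despite very different
# absolute attainment.
print(rank_in_class(strong_class, 85))  # -> 5
print(rank_in_class(weak_class, 48))    # -> 5
```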

8.16 Assessments at the end of Year 12 are used to produce an overall ranking of all students to help higher education institutions choose the students to whom they might offer places. By using the State-wide group of candidates a large reference group is obtained. Nevertheless, the assessments still give no indication of the actual levels of achievement. A student's performance is judged only against the performances of others in the same year. If standards of performance were actually rising or falling over time, the annual assessments of each new group would not reflect that in any systematic way. Chief examiners for the external examinations at Year 12 have from time to time sought to vary pass rates in order to reflect what they perceive to be shifts in overall levels of performance from previous years. A serious technical problem is that they cannot separately identify the causes of the perceived variations in levels of performance from year to year. One may be attributable to changes in the levels of students' ability in the subject concerned. Another may be a result of changes in the level of difficulty of the examination. If student performance in one year appears much better than that of the previous year it could be that the examination was easier, the students superior, or that the examiners' standards of evaluation changed.
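The identification problem can be illustrated with a toy simulation (all numbers invented): under a simple model in which a mark reflects ability minus examination difficulty plus noise, a stronger cohort on the same paper and the same cohort on an easier paper raise the mean mark by about the same amount, so the marks alone cannot distinguish the two explanations.

```python
# Hypothetical illustration of the confound: observed marks are a joint
# product of cohort ability and examination difficulty, so a change in
# the mean mark cannot, by itself, be attributed to either cause.
import random

random.seed(1)  # for a reproducible illustration

def mean_observed_mark(mean_ability, exam_difficulty, n=1000):
    """Toy model: mark = ability - difficulty + noise, clipped to 0-100."""
    marks = []
    for _ in range(n):
        ability = random.gauss(mean_ability, 10)
        mark = ability - exam_difficulty + random.gauss(0, 5)
        marks.append(max(0, min(100, mark)))
    return sum(marks) / n

baseline     = mean_observed_mark(mean_ability=60, exam_difficulty=10)
abler_cohort = mean_observed_mark(mean_ability=65, exam_difficulty=10)
easier_exam  = mean_observed_mark(mean_ability=60, exam_difficulty=5)

# Both changes raise the mean mark by roughly the same amount; an
# examiner seeing only the marks cannot tell which change occurred.
print(round(baseline), round(abler_cohort), round(easier_exam))
```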

8.17 To avoid the inadequacies of comparative assessments of this type two approaches have been adopted in the reports parents receive from teachers. Some teachers offer parents assessments of the extent to which their child is achieving to capacity. This does not, however, give parents an uncontaminated judgment because it depends on the teacher's initial judgment of what is the student's capacity. If that assessment is seriously wrong, quite inappropriate expectations of what the student can do will be developed and misleading reports of whether he or she is working to capacity will be produced. Furthermore, the basis on which capacity is judged is likely to be performance on school tasks, not some more general measure of ability, so that a student's early levels of achievement may set the limits on what is expected subsequently.

8.18 The other alternative to the use of norm referenced measurement is the use of criterion referenced measurement to allow comparison of students' performances with explicitly defined standards of performance. Some schools use curricula with such definitions and then seek to report to parents their child's level of performance in relation to them. The national assessments of performance in numeracy and literacy (3) attempted to define and use such standards as criteria for judging the performances of a national sample of 10 and 14 year olds. The Board of Secondary School Studies in Queensland is implementing this approach with Year 12 assessments (4). The Schools Year 12 and Tertiary Entrance Certificate in Victoria attempts it also for those students choosing that option for their final years of secondary education.

8.19 As stated in paragraph 7.7, defining criteria for performance and assessing actual performance in those terms is not without problems. The selection of the criteria itself requires some normative considerations. Defining the mathematical tasks a student in Year 7 ought to be able to complete satisfactorily, for example, depends on there being some expectations about what Year 7 students on average can do. Judgments of standards and the establishment of the criteria by which the judgments are to be made are inevitably related. The higher the criteria are set, the more likely it is that measured performance will be low. Furthermore, the more completely the criteria are defined the greater is the risk of reducing the goals of the educational task to those things which can most readily be stated as specific performance criteria. The less readily measurable and the more long term and diffuse goals may be lost from sight in the teaching as well as in the testing. Despite these difficulties, however, the Committee's preference is for criterion referenced tests because they place emphasis on absolute achievement, avoid the automatic attribution of poor results to many students and make the nature of the tests and the standards applied explicit.
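The contrast between the two approaches can be sketched in a few lines (the standards, cut-off scores and student names below are invented): a norm referenced report states only where a student stands relative to others, while a criterion referenced report states which explicitly defined standards the student has met, whatever the rest of the group does.

```python
# Hypothetical sketch contrasting norm referenced and criterion
# referenced reporting of the same raw scores.

scores = {"Ana": 82, "Ben": 74, "Cara": 74, "Dev": 61}

# Criterion referencing: performance judged against explicit, published
# standards, independent of how other students performed.
criteria = {"satisfactory": 60, "proficient": 75}

def criterion_report(score):
    met = [name for name, cutoff in criteria.items() if score >= cutoff]
    return met or ["below satisfactory"]

# Norm referencing: performance judged only against the reference group.
ordered = sorted(scores.values(), reverse=True)

def norm_report(score):
    return f"rank {1 + sum(1 for s in ordered if s > score)} of {len(ordered)}"

for name, score in scores.items():
    print(name, norm_report(score), "|", ", ".join(criterion_report(score)))
```

Under criterion referencing every student can, in principle, meet every standard; under norm referencing someone must always rank last, which is the "automatic attribution of poor results" the paragraph refers to.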

Student Assessments and System Performance

8.20 It is clear that none of the current regularly obtained measures of student achievement can provide a basis for monitoring levels of achievement over time. Whether the first system-wide attempt to use criterion referenced assessments at Year 12 in Queensland will provide a picture with sufficient comparability from year to year and sufficient public acceptance remains to be seen.

8.21 The only systematic attempts at monitoring Australia-wide levels of achievement in any educational domain were the surveys of literacy and numeracy levels of 10 and 14 year olds. The first was conducted in 1975 for the House of Representatives Select Committee on Specific Learning Difficulties (5) and the second was conducted in 1980 for the Australian Education Council (6). These and other attempts to monitor levels of achievement within single educational systems were discussed in Chapter 3.

8.22 The Australian Education Council decided in October 1979 that the second of the national surveys of literacy and numeracy would begin in 1980 and that it would be the first in a pattern of annual monitoring of achievements of students across a wider range of subject areas. Before this first assessment was conducted, plans for the regular series had been abandoned. The Directors-General of Education had been opposed to the plans and their advice was heeded at a subsequent meeting of the Council. Although the full system of regular monitoring had been dropped, it was too late to abandon the first testing scheduled for October 1980. Before that testing was conducted, the Australian Teachers' Federation announced a boycott which affected sample sizes, but not sufficiently to prevent statistically sound comparisons with performance levels achieved in the 1975 survey.

8.23 The objections of the Directors-General and the Australian Teachers' Federation were based primarily on the narrow range of performance to be monitored and the likelihood that a continuing regular focus on such restricted objectives would distort teaching in schools. Other objections were that intersectoral comparisons and those based on ethnicity would be socially divisive and that interstate comparisons were undesirable.

8.24 These objections had been fuelled by the press treatment of the results of the earlier 1975 survey of numeracy and literacy levels. That survey had reported the proportions of students able to complete satisfactorily each of the tasks used in the tests. Press reports had concentrated on the proportions which could not complete the tasks adequately. They claimed that these were sufficiently large to support a conclusion that Australian schools were failing to provide an adequate education. The data were not capable of showing whether the state of affairs reported was any worse than it had been before. Nevertheless, the press reports were clearly based on the assumption that it was and that the study had not only monitored levels of educational achievement but also had shown that a decline had occurred.

8.25 The 1980 survey of numeracy and literacy levels, as reported in Chapter 3, showed that levels of achievement had risen or remained constant but not declined. This result went largely unremarked in the press, as an evaluation of the survey revealed (7). The lack of interest by the media on this occasion confirmed for many the view that only what can be construed as bad news about education is extensively reported.

8.26 National monitoring is undertaken in some other countries. The National Assessment of Educational Progress Program has been conducted in the United States since the 1960s. The Assessment of Performance Unit in the National Foundation for Educational Research has monitored performance in England, Wales and Northern Ireland since 1976. Both programs were designed to monitor a wider range of subject areas than numeracy and literacy, and both were concerned with assessment over a wider range of ages than the two Australian studies of literacy and numeracy. In the United States, the testing pattern was designed to produce no school, district or State comparisons. Neither the United Kingdom nor the United States program could claim, however, to monitor more than a narrow range of the educational goals which schools pursue.

8.27 In the United States, the debate about whether educational standards are declining has revolved not around the National Assessment of Educational Progress Program but around analyses of scores on the Scholastic Aptitude Test taken at the end of Year 12 by potential college and university entrants. A new emphasis has recently been given to the National Assessment of Educational Progress Program. Comparative analyses are being used in the hope that differences in performance levels might be related to differences in schools and school systems in ways which will identify the more successful schools and systems, or at least their characteristics (8).

8.28 The use of national assessment data to provide long term evidence of the efficacy of the educational initiatives of the 1960s and early 1970s is an example of a productive use of such data (9). However, the data give only a general indication that there has been improvement for certain groups without showing which features of any particular programs yielded the gains. They are thus of limited value in the detailed shaping of any new policy initiatives.

8.29 The Committee does not believe that general monitoring of the type undertaken through national assessment programs could ever be sufficiently specific in relevant areas to provide the definitive measure of the quality of schooling. The range of purposes of