JOURNAL OF EDUCATION FOR BUSINESS, 88: 26–35, 2013
Copyright © Taylor & Francis Group, LLC
ISSN: 0883-2323 print / 1940-3356 online
DOI: 10.1080/08832323.2011.633580
Initial Impressions and the Student Evaluation
of Teaching
Dennis E. Clayson
University of Northern Iowa, Cedar Falls, Iowa, USA
Do first impressions influence the final evaluations given in a class? The author looked at the initial student perceptions and conditions of a class and compared these with conditions and evaluations 16 weeks later at the end of the term. It was found that the first perceptions of the instructor and the instructor's personality were significantly related to the evaluations made at the end of the semester. Implications for the validity and utilization of the student evaluation of instruction are discussed.
Keywords: confirmatory bias, faculty evaluation, initial perceptions, personality, student evaluation of teaching
What influence do the initial impressions that students form about an instructor have on the final evaluation of a class? The answer to this question would add valuable information to a long-lasting debate about the validity of the student evaluation of teaching (SET).
The utilization of a SET process has become almost universal in modern universities and colleges (Clayson, 2009). These instruments are not only utilized to improve instruction, but they are also extensively used to establish tenure, promotion, merit pay, and public reputations. Consequently, the SET process has been extensively debated and researched. Even though the first published article on the evaluations appeared almost 85 years ago (Remmers & Brandenburg, 1927, as cited in Kulik, 2001), little agreement has been reached about the validity of the instruments. This is primarily the result of two broad issues that have plagued SET research from the start. First, there has been no generally accepted definition of good or effective teaching. Institutions have utilized instruments to measure constructs that they have not definitively identified. Second, even without construct definitions, there are aspects of pedagogical practice that would generally be accepted as indicators of good instruction. For example, many would assume that good or effective instruction would lead to increased student learning, or that the personality of the instructors would be tangentially related to the process,
Correspondence should be addressed to Dennis E. Clayson, University of Northern Iowa, Department of Marketing, 344 CBB, Cedar Falls, IA 50614–0126, USA. E-mail: [email protected]
but not overwhelmingly so. However, the research findings on these issues have proven to be ambiguous.
These problems raise an interesting methodological issue. How could it be demonstrated that variables unrelated to teaching influence the evaluations if good and/or effective teaching has not been clearly defined? One solution would be to search for influences on SET that could not be logically connected to what happens in a setting, either physical or temporal, where instruction actually occurs. What influence, for example, would a brief initial exposure to an instructor have on the later evaluation of that instructor, especially if the first exposure was made before any instruction took place?
Specifically, I looked at perceptions of students before instruction actually began and compared them with the SET outcomes at the end of a 16-week term. No published literature could be found looking at this relationship so early in a term. If SET validly measures the quality of instruction, except for second order variables such as personality, initial impressions before instruction begins should be unrelated to the final evaluations.
REVIEW OF THE LITERATURE
SET research has been handicapped by a number of problems including strongly held opinions, statistical questions, and methodological issues arising from gathering data from anonymous sources. As suggested previously, however, the fundamental problem has been the failure to define the construct underlying the measurement process. There has been no widely accepted definition of what good or effective teaching is (Adams, 1997; Clayson, 2009; Kulik, 2001). Attempts to circumvent this failure have centered on several issues.
Learning
Many assume that any criteria to define a construct of good and/or effective teaching would include some measure of learning. Cohen (1981) affirmed, "Even though there is a lack of unanimity on a definition of good teaching, most researchers in this area agree that student learning is the most important criterion of teaching effectiveness" (p. 283). Although some early studies identified a negative relationship between learning and SET (Attiyeh & Lumsden, 1972; Rodin & Rodin, 1972), most studies found either no relationship or a positive association (Baird, 1987; Cohen, 1981; Dowell & Neal, 1982; Lundsten, 1986; Marlin & Niss, 1980). In the last 20 years, however, there has been a shift in the findings. A recent meta-analysis (Clayson, 2009) found no published findings after 1990 that contained a significant positive association between learning and the evaluations. Further, the relationship between SET and learning became increasingly neutral or negative as more statistical sophistication was utilized in studies, and measures of learning became more objective. The research concluded that while there was a relationship between SET and perceived learning, there was none between objective measures of learning and the evaluations.
If learning is seen as an improvement in subsequent performance, then findings suggest that learning may actually be negatively related to SET. In a study of accounting students, it was found that a significant negative relationship existed between student evaluations of their instructors in introductory classes and how well they performed in a subsequent class (Yunker & Yunker, 2003). V. E. Johnson (2003), utilizing a university-wide database, reported that "stringent grading is associated with higher levels of achievement in follow-up courses" (p. 161), but that stringent grading was strongly associated with lower evaluations. At the U.S. Air Force Academy, students in calculus classes in which learning can be objectively measured gave higher evaluations to instructors of classes in which they were getting higher grades, but lower evaluations to instructors who produced students who did well in subsequent calculus classes. The authors concluded, "the correlation between introductory calculus professor value added in the introductory and follow-on courses is negative. Students appear to reward contemporaneous course value added . . . but punish deep learning" (Carrell & West, 2010, p. 429). Consistent with this, they found that inexperienced instructors got better evaluations in introductory classes than did more seasoned instructors, who produced students who did better in subsequent classes.
Personality
A similar pattern of mixed findings has been found with the influence of the instructor's personality on SET. In general, researchers in colleges of education report few personality traits that correlate with student ratings (Boice, 1992; Braskamp & Ory, 1994; Centra, 1993). Yet, studies that manipulated actual classroom conditions found positive relationships (Naftulin, Ware, & Donnelly, 1973; Widmeyer & Loy, 1988). Other studies have found associations between personality variables and the evaluation outcomes that accounted for 50–75% of the total variance of the evaluations (Erdle, Murray, & Rushton, 1985; Feldman, 1986; Marks, 2000; Murray, Rushton, & Paunonen, 1990; Sherman & Blackburn, 1975).
In a study of business students, Clayson and Sheffet (2006) compared change in the students' perception of personality with change in the evaluations in the last six weeks of the term. Even after the midterm, changes in evaluations, negative and positive for individual instructors, were highly related to changes in the students' perception of personality, and in the same direction. The study ruled out the possibility that the personality–evaluation association was a statistical artifact resulting from insufficient control of secondary variables. Another earlier study of business students found that each standard deviation change in personality resulted in a 0.83 standard deviation change in the evaluations. Personality was found to be significantly related to every other factor in the study, including the students' perception of the instructor's knowledge and fairness. It was negatively related to rigor, and positively related to the students' perception of how much they had learned (Clayson & Haley, 1990).
Validity Inconsistency
These and other problems have led some researchers to question the validity of SET. After reviewing the results of a study of over 2,000 business students, Marks (2000) concluded that "student evaluations lack discriminant validity. No matter how reliable the measures, student evaluations are no more than perceptions and impressions" (p. 117). Greenwald and Gillmore (1997) previously pointed out that while evaluations of instructors have convergent validity, they lack discriminant validity. In other words, SET are correlated with attributes that a concept of good teaching would be expected to be related with, but they are also neutral toward, or correlated with, numerous attributes with which they should not be related. It is also claimed that the instruments lack divergent and outcome validity (Onwuegbuzie, Daniel, & Collins, 2009; Sproule, 2002). This partially results from, and is complicated by, a considerable halo effect (Orsini, 1988). Convergent validity and discriminant and divergent invalidity would be expected if the evaluations were measuring a global construct that the students have a tendency to apply to whatever question is addressed (Langbein, 1994).
What would that global construct be? Some researchers have concluded that the evaluations most likely create something that could be called a likeability scale (Clayson, 2009; Clayson & Haley, 1990; Marks, 2000; Tang & Tang, 1987). This interpretation answers numerous questions about apparently contradictory findings, including the high impact of instructor personality.
While most individuals have had experiences of learning a great deal from thoroughly disliked instructors, most would agree that instruction is facilitated when a teacher is liked. Yet, as found in the lack of relationship of SET with learning, being liked may not be related to what many educators would consider to be good teaching. As Foote, Harmon, and Mayo (2003) concluded after reviewing the literature and the results of their own study, “those [instructors] who score highly on evaluations may do so not because they teach well, but simply because they get along well with students” (p. 17).
Initial Impressions
Due to serial learning effects, it would be expected that the initial exposure would have a strong impact on students' perception of personality. More than half a century ago, Solomon Asch (1946) found that the order of terms used to describe a person made a difference in how that individual was perceived. When a person was described as envious, stubborn, critical, impulsive, industrious, and intelligent, rather than intelligent, industrious, impulsive, critical, stubborn, and envious, the second order produced higher personal ratings than the first. In some cases, a brief initial experience seems to create a perception that is only slightly modified by further interactions.
It has been shown that when subjects are first introduced to another person, they make judgments of attractiveness, likeability, trustworthiness, competence, and aggressiveness within one tenth of a second. Surprisingly, it has also been shown that more extended exposure (beyond one half of a second) simply boosted the confidence of judgments (Willis & Todorov, 2006). These findings fall under the rubric of the primacy effect and refer to the process by which early information may alter the perception of subsequent information. This is especially true if the initial information has high relevance, but is less true if subsequent information is stronger, the situation is more structured, or if subjects have higher cognitive sophistication (Haugtvedt & Wegener, 1994; Krosnick & Alwin, 1987).
Moreover, observers have a tendency to look for, find, and remember information that fits their preconceived expectations, while information that contradicts these expectations may be dismissed, ignored, or distorted. This confirmatory bias was found in early studies by Wason (1960), who showed subjects a sequence of three numbers and then asked them to find a rule and use that rule to create a new sequence of numbers that would conform to the original set. After every attempt, the subjects were told whether they were correct or wrong. Wason found that subjects had a tendency to create rules that were much more complex than warranted. Furthermore, they seemed to offer only positive tests for their hypotheses, and did not attempt to falsify their rules.
In other words, the subjects chose to select evidence that would confirm a prior hypothesis rather than disconfirm it. Later research found that the retrieval of confirming evidence actively inhibits the retrieval of disconfirming evidence, further strengthening bias (Davies, 2003). Rabin and Schrag (1999) found that initially being wrong often only strengthened the original hypothesis, and that people could believe with near certainty in a false hypothesis despite receiving an infinite amount of information. Prior training, education, and experience seem to have little effect on this tendency (Mahoney & DeMonbreun, 2005).
For initial impressions to alter final evaluations of a class, the students can only use their past experience and their brief exposure to the instructor to form their impressions. Evidence suggests that students do form initial impressions about personality that are long lasting and do affect their perception of the instructor. For example, Widmeyer and Loy (1988) conducted an experiment in which all students were exposed to the same guest instructor, but before the class began half received descriptions of the instructor indicating that he was warm, and the other half that he was cold. After the instructional period, not only did the students in the warm group rate the instructor higher on positive aspects of personality, but they also rated the instructor previously defined as warm as having more teaching ability.
Other evidence indicated that many students appear to form an opinion of a class and the instructor very early in a course, and subsequent class and learning experiences may do little to change that opinion (Feldman, 1977; Ortinau & Bush, 1987; Sauber & Ludlow, 1988). Harvard psychologists (Ambady & Rosenthal, 1993) investigated students' reactions to randomly selected 30-s clips of soundless videotapes of actual classroom instruction and found them highly correlated with end-of-course evaluations. Evaluations based on 6-s exposures were as significant as judgments based on 30-s clips. Not only were classroom and instructor evaluations similar, but personality traits identified by independent raters were also highly correlated with the evaluations. Their findings have been replicated under actual instructional classroom conditions (Babad, Avni-Babad, & Rosenthal, 2004).
HYPOTHESES
Unlike earlier investigations (Kohlan, 1973), which took their first measures after instruction began, here I compared initial impressions gathered after students were exposed to the instructor, but before the syllabus was distributed and before any actual instruction had taken place. The literature predicts that initial impressions of personality may be long lasting. Indeed, one study (Clayson & Sheffet, 2006) did report a simple least-squares correlation between measures of personality taken before a class began and the final evaluation. To the extent that student evaluations at the end of a period of instruction reflect actual teaching practice, it would not be expected that the initial perceptions and impressions would be related to the final evaluations (Wallace et al., 2001).
Hypothesis 1: Student initial impressions of the instructor's personality would be related to the final student impressions of personality.
Hypothesis 2: An initial SET before instruction begins would not be related to the final SET at the end of the instructional period.
METHOD
The study was made possible by mining an existing database. During the spring semester of 2003, over 700 students in organizational management and principles of marketing classes were followed for an entire semester. Longitudinal data was gathered about the students and their perceptions of the class and instructor periodically over a period of 16 weeks. Within this data were measures of student perceptions before instruction actually began and corresponding perceptions in the last week of the semester. These measures taken before instruction actually began could be compared with data taken at the end of the 16 weeks in order to investigate the question raised in this study. The portions of the original study that are directly related to the present research issue, or that may bias the findings, are outlined subsequently.
Utilized Variables
Eight instructors, who taught 13 sections of introductory undergraduate business classes (six sections of organizational management, and seven sections of principles of marketing), gave permission for the study to be conducted in their classes over the period of a semester. On the first meeting of the class, the instructors introduced themselves, turned the class over to a researcher, and left the room. At this point, students had not seen the syllabus, and had an average of about 5 min of exposure to the instructor. Due to the nature of class schedules and the physical facilities, a student could be exposed to the instructor for no less than 1 min and not more than 10, depending on how early the student arrived. Students who signed a consent form were then asked to complete a questionnaire containing the variables that are outlined subsequently, plus a set of demographic questions. Authorized consent procedures were utilized throughout the study. Pertinent to this investigation, the class sections were evaluated again at week 16 of the 16-week term. Because each student was identified by a code, the last questionnaires were identical to the one given before the class began except that no demographic data was gathered.
The initial database came from a total of 737 students. Not all questions were answered by each student, and not all students completed their enrolled course. Consequently, the sample size for this study consisted of 567 students who responded both to the initial questionnaire and to the questionnaire 16 weeks later at the end of the semester.
Variables
Several demographics were gathered at the first class meeting. The student's gender (male = 51%, labeled as 0; female = 49%, labeled as 1, utilized as a dummy variable) was self-reported. In addition, the actual cumulative GPA of each student at the beginning of the class was obtained by student permission from the university registrar (M grade point average [GPA] = 3.03, SD = 0.47).
A number of questions were asked to establish initial class and student conditions. Students reported whether they had heard anything about the instructor's grading policy before the class began (0 = not heard, 69%; 1 = heard, 31%), and estimated how difficult they thought the class would be (0 = easy or average, 81%; 1 = hard, 19%). A preliminary analysis indicated that easy and average estimates were not significantly different on the major variables of the study.

Student grade-related expectations were also surveyed. Respondents were asked, "What grade do you think you will receive in this class?" and "What grade do you think you will deserve to receive in this class?" A new variable was created utilizing these two measures. When the expected grade (Exp Grade: M = 3.27, SD = 0.53) was the same as the deserved grade (M = 3.34, SD = 0.54) for the class, then it could be assumed that the students expected to be treated fairly in grading, but when the two measures did not match, the students apparently believed that they would not receive the grade they deserved. An initial analysis showed that whether the deserved grade was higher or lower than the expected grade made no significant difference in subsequent variables; consequently, fairness was dichotomized as a dummy variable (0 = fair [deserved = expected grade], 84%; 1 = unfair [deserved not equal to expected], 16%).
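As an illustration of this dichotomization, the following minimal Python sketch codes the fairness dummy from the two grade questions; the data frame and column names are hypothetical, not taken from the study.

```python
import pandas as pd

# Hypothetical responses; column names are illustrative only.
df = pd.DataFrame({
    "expected_grade": [3.0, 4.0, 3.0, 2.0],
    "deserved_grade": [3.0, 4.0, 4.0, 2.0],
})

# 0 = fair (deserved equals expected), 1 = unfair (deserved differs),
# mirroring the dichotomization described above.
df["unfair"] = (df["expected_grade"] != df["deserved_grade"]).astype(int)
print(df)
```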
Student evaluation of the teaching was measured by using the five questions on the student evaluation (SET) instrument actually used by the university. These five measures were summed and averaged (the instructor: "Created an atmosphere conducive of learning," "Instructor explains material appropriately," "Instructor shows interest in student learning," "Instructor sets high but reasonable standards," and "Rate your satisfaction with your learning in this class"). A second unambiguous SET measure, "What grade would you give your instructor?" was also asked in all testing periods. The measures were similar, with correlations above 0.80. Consequently, the two measures of evaluation were summed to create a total evaluation measure called Evaluation (Cronbach's α was .71 initially, and .93 for Week 16). This measure is similar to the dependent variables utilized in most SET studies (Feldman, 1986). The evaluation scale ranged from 0 to 4 as in the classical GPA metric.
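For readers unfamiliar with the reliability statistic cited above, the sketch below computes Cronbach's alpha in the standard way; the item scores are invented for illustration and are not the study's data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Invented example: five SET items on the 0-4 scale for four respondents.
scores = np.array([
    [4, 4, 3, 4, 4],
    [2, 3, 2, 2, 3],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2],
])
print(round(cronbach_alpha(scores), 2))
```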
Students also evaluated the personality of the instructor at each testing period. The Big Five personality inventory was
utilized. Many personality theorists have concluded that an adequate taxonomy for personality attributes could be created by five factors (Digman, 1990). This has been referred to as the Big Five, or as the Five Factor Model of personality. The factors have been found to be stable over long periods of a person's life (Soldz & Vaillant, 1999), and are largely genetic (Jang et al., 1998). They seem to be unrelated to culture and have been found in societies as diverse as those in Germany and China (McCrae & Costa, 1997). Because full personality inventories can be too long and detailed for a brief administration, the factors were measured by utilizing a simple semantic 7-point scaling device. The question read, "From what you know now, rate this instructor on the following dimensions." The five dimensions were disagreeable-agreeable, not conscientious-conscientious, emotionally unstable-emotionally stable, introverted-extroverted, and unimaginative-imaginative.
When the larger data set was compiled from which this study was drawn, a validity check compared this shortened personality inventory with the standardized inventory. The shorter instrument was found to have both concurrent and predictive validity. The five factors were summed and averaged to produce a compensatory, global measure of the overall negative-positive perception of personality. Cronbach's alpha initially was .91 and .83 at the end of the term. This variable was called "personality." The measure is not traditional personality in that the construct is typically defined as a cluster of independent traits. Nevertheless, a student could, for example, believe an instructor was positive on one or several factors, but not on all, and still perceive the instructor as having a good or a bad personality globally and independent of the perception of any specific factor. This measure is consistent with many prior studies that did not utilize a personality inventory when measuring personality, but instead relied on some global measure (Erdle et al., 1985; Murray et al., 1990).
RESULTS
There were no significant differences between students from different majors on any dependent variable. Hence, the data were combined for analysis. There has been an ongoing debate on whether data from individual students or from class means should be utilized when studying the effects of student evaluation of instruction (Marsh & Roche, 1997; Stumpf & Freedman, 1979). Because in this study I looked at student perceptions and not at teacher characteristics, within-class student data were utilized rather than between-class means (Clayson, 2007; Stumpf & Freedman, 1979).
Effects of Initial Conditions
Table 1 shows the differences of the initial and the final measure of personality by initial independent variables.
TABLE 1
Initial and Final Personality Measure by Initial Variables
Note. Values in parentheses represent standard errors. Probability was adjusted for class effects. *p < .05.
Table 2 shows the similar differences for the initial evaluation and the final evaluation of the instructor. The statistical probability represents the probability of the null hypothesis assuming no differences. The column labeled "Adj." gives the probability of the same variables controlled for class effects. There are several techniques that would allow an estimate of the association of within-class effects controlled for class effects. The method utilized in this study was a main-effects analysis of covariance utilizing Type III sums of squares. This allows for a test of each variable in the model with all other variables simultaneously included in the analysis. Because I was not interested in class differences, the problem of using this technique without the assumption of homogeneity of group regressional betas was minimal (Tatsuoka, 1971). Except for an estimate of total class effect variance, the result in this model is identical to a linear regression utilizing the same variables.
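As a rough modern analogue of this analysis, the sketch below fits a main-effects model with a class-section factor and reports Type III sums of squares using statsmodels; the variable names and synthetic data are assumptions for illustration, not the study's model or data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in data; names are illustrative, not the study's.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "section": rng.integers(0, 13, n),               # 13 class sections
    "initial_eval": rng.normal(3.0, 0.5, n),
    "initial_personality": rng.normal(5.0, 1.0, n),
    "gpa": rng.normal(3.0, 0.5, n),
    "unfair": rng.integers(0, 2, n),
})
df["final_eval"] = 0.4 * df["initial_eval"] + rng.normal(0, 0.5, n)

# C(section) absorbs class effects; each remaining term is then tested
# with all other variables simultaneously included in the model.
model = smf.ols(
    "final_eval ~ C(section) + initial_eval + initial_personality"
    " + gpa + C(unfair)",
    data=df,
).fit()
print(sm.stats.anova_lm(model, typ=3))
```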
As shown in Table 1 and consistent with the literature review, the initial expected grade and initial SET were strongly related to the initial measure of personality.
TABLE 2
Initial and Final Evaluations by Initial Variables
Note. Values in parentheses represent standard errors. Probability was adjusted for class effects. *p < .05.
In addition, both the initial measure of personality and the initial SET were significantly associated with the final measure of personality, along with the initial perception that the grading would be fair. Note that the ordinal effects of the variables on the initial evaluation are identical to the ordinal relationship of the same variables on the final evaluations.
TABLE 3
Personality Regression: All Variables Included (Variable, B, t, p)

TABLE 4
Evaluation Regression: All Variables Included (Variable, B, t, p)
The same pattern is shown in Table 2 looking at the effects on the SET measures, with one important exception. The initial SET is significantly related to the final SET evaluation, but the initial impression of personality was not significantly related to the final SET when controlled for class effects.
As can be seen in Tables 3 and 4, with all variables included, the initial perception of personality was significantly associated with the final perception of personality, and the initial SET evaluation was significantly related to the final SET evaluation, as were sex and perceptions of fairness. Note that the initial impression of personality was not related to the final SET, and that the initial SET was not related to the final measure of personality. The collinearity measures are all well within the acceptable limits for the model, with the smallest tolerance measure resulting from the measures of the initial evaluation (rii = .68).
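Tolerance, the collinearity measure referred to above, is conventionally computed as one minus the R² obtained when a predictor is regressed on the remaining predictors. A small sketch under that standard definition, with invented data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def tolerance(X: pd.DataFrame, col: str) -> float:
    """1 - R^2 of `col` regressed on the remaining predictors."""
    others = sm.add_constant(X.drop(columns=[col]))
    r_squared = sm.OLS(X[col], others).fit().rsquared
    return 1.0 - r_squared

# Invented predictor matrix for illustration only.
rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["initial_eval", "initial_personality", "gpa"],
)
print(round(tolerance(X, "initial_eval"), 2))
```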
CONCLUSIONS
The first hypothesis, that the initial impressions of the instructor's personality would be related to the final student impressions of personality, could not be rejected. The second hypothesis, that an initial SET before instruction begins would not be related to the final SET at the end of the instructional period, was rejected.
The initial SET evaluation, before any instruction took place, was significantly related to the final SET evaluation given 16 weeks later. Note that the initial belief that the student held about the instructor's fairness in assigning grades also influenced the final SET. Controlling for the students' past performance as measured by GPA seemed to have no effect on these relationships. It would appear that the very best and the very worst students (as measured by previous
grades) are reacting to the instructor in the same fashion. The same pattern is found with student perceptions of personality. In the regressional analysis, the initial impressions of personality were not related to the final SET evaluation, and the initial SET was not related to the final personality measure. This is most likely the result of the two variables standing as proxies for each other, and reinforces previous findings that SET procedures essentially construct a personality measure similar to the proposed likeability scale.
Limitations
It is possible that some of the findings in this study are a result of unique conditions found only in the institution from which the data was taken. Data from other sources may find more or less of the effects found here. It is also possible that some of the effects were related to the nature of the classes. These courses are introductory and contain little technical and quantitative material.
Nevertheless, other research would indicate that the findings of the persistence of the initial perceptions of the instructor on the final results are most likely not a function of unique sampling, but could be generalized to other populations (for reviews of grade effects, see Clayson, 2004; V. E. Johnson, 2003; Marsh & Roche, 2000; Stumpf & Freedman, 1979; for reviews of personality effects, see Erdle et al., 1985; Marks, 2000; Murray et al., 1990).
SET Implications
As mentioned in the introduction, the utilization of some sort of student evaluation of teaching has become almost universal. The instruments are used to make important decisions that can result in major changes in an instructor's career. They are also utilized to make improvements in teaching and, hopefully, to make the students' experience more productive. Given the importance placed on the process, it is essential that the instruments are valid measures of instruction. An attempt to establish or discredit this validity has been the aim of almost all of the hundreds of reports and publications pertaining to SET. It appears from this long process that SET does have convergent validity, but is lacking divergent and discriminant validity (for reviews of this issue, see Clayson, 2009; V. E. Johnson, 2003). In other words, the evaluations appear to be related to what the construct of good instruction should be associated with, but they are also related to many factors that would not be logically associated with the construct. This discrepancy impacts the ability of SET to discriminate between a good teacher and a poor one, the very use to which the instruments are most applied.
Finding that the instruments are influenced by factors that are unrelated to actual instruction weakens arguments that the evaluations can continue to be utilized as they are presently. This study adds to this chorus by showing a strong relationship between student attitudes and perceptions developed before any instruction has taken place and the evaluations, supposedly measuring only instruction, given after four months of instructional interaction.
For individual instructors, this study adds two warnings. First, initial impressions are important. They create perceptions that are long lasting and continue to influence the students' evaluation of the instructor long after what would logically be expected. Second, instructors must be careful in utilizing student evaluations to improve teaching. A strong primacy effect, matched with a newly reported propensity of students to purposefully falsify evaluations (Clayson & Haley, 2011), requires that an instructor be judicious in accepting suggestions found in SET reports for instructional improvement.
In this study, none of the students had seen a course syllabus, nor had they been exposed to any class instruction when they made their initial evaluations. Finding an association between the evaluation of the class at the end of the term and evaluations made within the first ten minutes of exposure, as well as corresponding persistence in perceived grade fairness, indicates that SET instruments are biased toward student perceptions unrelated to the instructor's actual teaching style and abilities.
The results also help clarify several hypotheses made in the literature that attempted to make a validity argument for SET without including variables related to actual instruction. For example, Erdle et al. (1985) maintained that instructors' personalities are reflected in certain classroom teaching behaviors, which in turn are validly rated by students. The findings of this study do not contradict this argument, but make it unlikely, in that students would have to be extremely keen observers of individual differences that would predict future classroom behavior. This acuity is unlikely given that the initial perception of personality was not related to the final teaching evaluation when controlled by the initial evaluation of instruction.
Almost 40 years ago, Kohlan (1973) found a significant relationship between an initial evaluation made early in the class (after instruction had begun) and a final evaluation made by students. He suggested three possible explanations for his findings: (a) the SET process which uses these assessments cannot be valid, (b) very little new information about instructor behavior is presented after the first few classes, or (c) there may be a primacy effect due to stereotyping. This last explanation was not confirmed by this study. With class effects controlled, which also controls for individual instructor effects, the initial evaluation was still related to the final evaluation. Because the evaluation was made with less than ten minutes of exposure to the instructor, Kohlan's second explanation, that little information about instructor behavior is presented after the first few classes, cannot explain the findings of this study because no such information was available. Kohlan's first hypothesis, that the SET evaluation process is invalid, was not contradicted by this study.
The findings reported here reinforce Marks's (2000) and Greenwald and Gillmore's (1997) contention that student evaluations of teaching lack discriminant validity. Even after 16 weeks of personal face-to-face instruction, the students' limited initial impressions of personality and of teaching can still be found in the final evaluation. Irrespective of any definition of good teaching that includes actual instruction, this study indicates that the evaluations are biased.
Especially in the last decade, there have been numerous research findings that have raised troubling questions about the evaluation process. Rachel Johnson (2000) argued that while the student evaluation of teaching has bureaucratic advantages, the system of usage is detrimental to actual teaching, both for practice and theory. Theall and Franklin (2001) stated, "Student ratings are only one source of information about teaching, and teaching is only one aspect of faculty performance. Never make the mistake of judging teaching or overall performance on the basis of ratings alone" (p. 51). The findings of this study reinforce their warning.
Research Implications
Some research has found that unmet student performance expectations on exams may result in student dissatisfaction (Grimes, 2002). It was found here that the students' expectation of a fair grade, even before the class began, was significantly related to the final course evaluation. This could suggest that the mere expectation of lower grades may influence the evaluations, especially if students are not accurately estimating their future performance. Although the expected grade's influence on the evaluations at the end of the term has been extensively studied, it is still unknown how those expectations are formed and how that process influences SET.

The unrealistic grade expectations of students at the beginning of the term raise another issue. Previous research has indicated that students have difficulty estimating their own academic performance (Clayson, 2005; Kennedy, Lawton, & Plumlee, 2002; Williams & Ceci, 1997). Consistent with this literature, the present data find that the initial expected grade, while being significantly related to the initial SET, was not associated with the final evaluation and, more surprisingly, was unrelated to the final class grade (r = .036, p = .393).
The students began their classes expecting a grade significantly higher than the one actually received (3.27 vs. 2.77; t(566) = 21.59, p < .001), and even higher than their own cumulative GPA. This was true even though these classes reported grades well below the university average, a fact regularly and publicly announced by the business college. Furthermore, I found that prior grading information about the class did not modify the exaggerated expectations. These students were not inexperienced; almost all were juniors and seniors.
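The expected-versus-received comparison above is a paired t test; a minimal sketch of that computation follows, with synthetic grades whose distributions loosely echo the reported means, not the study's data.

```python
import numpy as np
from scipy import stats

# Synthetic paired grades; invented for illustration only.
rng = np.random.default_rng(2)
expected = rng.normal(3.27, 0.53, 567)
received = expected - rng.normal(0.50, 0.40, 567)  # received runs lower on average

t_stat, p_value = stats.ttest_rel(expected, received)
print(f"t({len(expected) - 1}) = {t_stat:.2f}, p = {p_value:.3g}")
```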
There are two possible explanations for these paradoxical findings: (a) students may think their expectations will be met because they believe that their performance will lift their grades, or that the instructor will give a grade more lenient than earned; or (b) the students may be dishonest or cavalier in their answers. Fortunately, there is a way to test between these explanations. As indicated previously, research has found a persistent association between expected grades and the evaluation (for detailed reviews, see Clayson, Frost, & Sheffet, 2006; Greenwald & Gillmore, 1997; V. E. Johnson, 2003). By inspecting the data from the last week of the term, it would be expected that if (a) is true, then there should be an association between the final expected grade and the final evaluation, but not between the final grade (not yet received) and the evaluation. If (b) is true, there should be an association between the final course grade, which would be highly related to their given grades by week 16, and the evaluation, but not between the final expected grade (capriciously reported) and the evaluation. All three measures were correlated, primarily because of GPA, so a regression was run with the final evaluation as the dependent variable and expected final grade, actual final grade, and GPA as independent variables. The result was highly significant, F(3, 566) = 23.50, p < .0001, but the only significant variable loading was for the expected grade (β = .384), t = 7.82, p < .0001. The final grade was nonsignificant (β = –.073), t = –1.29, p = .197. The students appeared to be giving an honest response from their perspective, but how did they predict their grade if not from performance?
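The test between explanations (a) and (b) amounts to a single regression; the sketch below reproduces its form with statsmodels on synthetic data, so the variable names and any resulting coefficients are assumptions for illustration, not the study's results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; names are illustrative, not the study's.
rng = np.random.default_rng(3)
n = 567
gpa = rng.normal(3.0, 0.47, n)
df = pd.DataFrame({
    "gpa": gpa,
    "expected_final_grade": gpa + rng.normal(0.3, 0.3, n),
    "final_grade": gpa + rng.normal(-0.2, 0.4, n),
})
df["final_eval"] = 0.4 * df["expected_final_grade"] + rng.normal(0, 0.5, n)

# Final evaluation regressed on expected final grade, actual final grade,
# and GPA, mirroring the test between explanations (a) and (b) above.
fit = smf.ols("final_eval ~ expected_final_grade + final_grade + gpa", data=df).fit()
print(fit.summary())
```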
Because expected grades are related to the evaluations, and faculty believe that this association lowers academic standards (Simpson & Siguaw, 2000), this is a question that needs to be addressed.
REFERENCES
Adams, J. V. (1997). Student evaluations: The rating game. Inquiry, 1(2), 10–16.
Ambady, N., & Rosenthal, R. (1993). Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness. Journal of Personality and Social Psychology, 64, 431–441.
Asch, S. E. (1946). Forming impressions of personality. Journal of Abnormal and Social Psychology, 41, 258–290.
Attiyeh, R., & Lumsden, K. G. (1972). Some modern myths in teaching economics: The UK experience. American Economic Review, 62, 429–433.
Babad, E., Avni-Babad, D., & Rosenthal, R. (2004). Prediction of students' evaluations from brief instances of professors' nonverbal behavior in defined instructional situations. Social Psychology of Education, 7, 3–33.
Baird, J. S. (1987). Perceived learning in relation to student evaluation of university instruction. Journal of Educational Psychology, 79, 90–91.
Boice, R. (1992). Countering common misbeliefs about the student evaluation of teaching. ADE Bulletin, 101(Spring), 1–4.
Braskamp, L. A., & Ory, J. C. (1994). Assessing faculty work: Enhancing individual and institutional performances. San Francisco, CA: Jossey-Bass.
Carrell, S. E., & West, J. E. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118, 409–432.
Centra, J. A. (1993). Reflective faculty evaluations: Enhancing teaching and determining faculty effectiveness. San Francisco, CA: Jossey-Bass.
Clayson, D. E. (2004). A test of reciprocity effects in the student evaluation of instructors in marketing classes. Marketing Education Review, 14(2), 11–21.
Clayson, D. E. (2005). Performance overconfidence: Metacognitive effects or misplaced student expectations? Journal of Marketing Education, 27, 122–129.
Clayson, D. E. (2007). Conceptual and statistical problems of using between-class data in educational research. Journal of Marketing Education, 29(1), 1–5.
Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature. Journal of Marketing Education, 31(1), 16–30.
Clayson, D. E., Frost, T. F., & Sheffet, M. J. (2006). Grades and the student evaluation of instruction: A test of the reciprocity effect. Academy of Management Learning & Education, 5(1), 52–65.
Clayson, D. E., & Haley, D. A. (1990). Student evaluations in marketing: What is actually being measured? Journal of Marketing Education, 12(3), 9–17.
Clayson, D. E., & Haley, D. A. (2011). Are students telling us the truth? A critical look at the student evaluation of teaching. Marketing Education Review, 21, 101–112.
Clayson, D. E., & Sheffet, M. J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28, 149–160.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multi-section validity studies. Review of Educational Research, 51, 281–309.
Davies, M. F. (2003). Confirmatory bias in the evaluation of personality descriptions: Possible test strategies and output interference. Journal of Personality and Social Psychology, 85, 736–744.
Digman, J. M. (1990). Personality structure: An emergence of the five-factor model. The Annual Review of Psychology, 41, 417–440.
Dowell, D. A., & Neal, J. A. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53, 51–62.
Erdle, S., Murray, H. G., & Rushton, J. P. (1985). Personality, classroom behavior and student ratings of college teaching effectiveness: A path analysis. Journal of Educational Psychology, 77, 394–407.
Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses: A review and analysis. Research in Higher Education, 6, 223–274.
Feldman, K. A. (1986). The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24, 139–213.
Foote, D. A., Harmon, S. K., & Mayo, D. T. (2003). The impacts of instructional style and gender role attitude on students' evaluation of faculty. Marketing Education Review, 13(2), 9–19.
Greenwald, A. G., & Gillmore, G. M. (1997). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52, 1209–1217.
Grimes, P. W. (2002). The overconfident principle of economics students: An examination of a metacognitive skill. Journal of Economic Education, 33, 15–30.
Haugtvedt, C. P., & Wegener, D. T. (1994). Message order effect in persuasion: An attitude strength perspective. Journal of Consumer Research, 21, 205–218.
Jang, K. L., McCrae, R. R., Angleitner, A., Riemann, R., & Livesley, W. J. (1998). Heritability of facet-level traits in a cross-cultural twin sample: Support for a hierarchical model of personality. Journal of Personality and Social Psychology, 74, 1556–1565.
Johnson, R. (2000). The authority of the student evaluation questionnaire. Teaching in Higher Education, 5, 419–434.
Johnson, V. E. (2003). Grade inflation: A crisis in college education. New York, NY: Springer.
Kennedy, E. J., Lawton, L., & Plumlee, E. L. (2002). Blissful ignorance: The problem of unrecognized incompetence and academic performance. Journal of Marketing Education, 24, 243–252.
Kohlan, R. G. (1973). A comparison of faculty evaluations early and late in the course. Journal of Higher Education, 44, 587–595.
Krosnick, J. A., & Alwin, D. F. (1987). An evaluation of a cognitive theory of response-order effects in survey measurement. The Public Opinion Quarterly, 51, 201–219.
Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for Institutional Research, 109, 9–25.
Langbein, L. I. (1994). The validity of student evaluations of teaching. Institutional Politics, 27, 545–552.
Lundsten, N. L. (1986). Student evaluations in a business administration curriculum: A marketing viewpoint. AMA Developments in Marketing Science, 9, 169–173.
Mahoney, M. J., & DeMonbreun, B. G. (2005). Psychology of the scientist: An analysis of problem-solving bias. Cognitive Therapy and Research, 1, 229–238.
Marks, R. B. (2000). Determinants of student evaluations of global measures of instructor and course value. Journal of Marketing Education, 22, 108–119.
Marlin, J. W., & Niss, J. F. (1980). End-of-course evaluations as indicators of student learning and instructor effectiveness. Journal of Economic Education, 11, 16–27.
Marsh, H. W., & Roche, L. A. (1997). Making students' evaluations of teaching effectiveness effective. American Psychologist, 52, 1187–1197.
Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students' evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of Educational Psychology, 92, 202–228.
McCrae, R. R., & Costa, P. T. (1997). Personality trait structure as a human universal. American Psychologist, 52, 509–516.
Murray, H. G., Rushton, J. P., & Paunonen, S. V. (1990). Teacher personality traits and student instructional ratings in six types of university courses. Journal of Educational Psychology, 82, 250–261.
Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Doctor Fox lecture: A paradigm of educational seduction. Journal of Medical Education, 48, 630–635.
Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality and Quantity, 43, 197–209.
Orsini, J. L. (1988). Halo effects in student evaluations of faculty: A case application. Journal of Marketing Education, 10(2), 38–45.
Ortinau, D. J., & Bush, R. P. (1987). The propensity of college students to modify course expectations and its impact on course performance information. Journal of Marketing Education, 9, 42–52.
Rabin, M., & Schrag, J. L. (1999). First impressions matter: A model of confirmatory bias. The Quarterly Journal of Economics, 114(1), 37–82.
Remmers, H. H., & Brandenburg, G. C. (1927). Experimental data on the Purdue rating scale for instructors. Educational Administration and Supervision, 13, 519–527.
Rodin, M., & Rodin, B. (1972). Student evaluation of teachers. Science, 177, 1164–1166.
Sauber, M. H., & Ludlow, R. R. (1988). Student evaluation stability in marketing: The importance of early class meetings. The Journal of Midwest Marketing, 3(1), 41–49.
Sherman, B. R., & Blackburn, R. T. (1975). Personal characteristics and teaching of college faculty. Journal of Educational Psychology, 67, 124–131.
Simpson, P. M., & Siguaw, J. A. (2000). Student evaluations of teaching: An exploratory study of the faculty response. Journal of Marketing Education, 22, 199–213.
Soldz, S., & Vaillant, G. E. (1999). The Big Five personality traits and the life course: A 45-year longitudinal study. Journal of Research in Personality, 33, 208–232.
Sproule, R. (2002). The underdetermination of instructor performance by data from the student evaluation of teaching. Economics of Education Review, 21, 287–294.
Stumpf, S. A., & Freedman, R. D. (1979). Expected grade covariation with student ratings of instruction: Individual versus class effects. Journal of Educational Psychology, 71, 293–302.
Tang, T. L., & Tang, T. L. (1987). A correlation study of students' evaluations of faculty performance and their self-ratings in an instructional setting. College Student Journal, 21, 90–97.
Tatsuoka, M. M. (1971). Multivariate analysis: Techniques for educational and psychological research. New York, NY: Wiley.
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 109, 45–56.
Wallace, T. L., Grinnell, L., Carey, L. M., Dedrick, R. F., Ferron, J. M., Dailey, K. A., et al. (2001). A series of studies examining the Florida Board of Regents' course evaluation instrument. Florida Journal of Educational Research, 41(1), 14–42.
Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129–140.
Widmeyer, W. N., & Loy, J. M. (1988). When you're hot, you're hot! Warm-cold effects in first impressions of persons and teaching effectiveness. Journal of Educational Psychology, 80, 118–121.
Williams, W. M., & Ceci, S. J. (1997). "How'm I doing?" Problems with student ratings of instructors and courses. Change, 29(5), 13–23.
Willis, J., & Todorov, A. (2006). First impressions: Making up your mind after a 100-ms exposure to a face. Psychological Science, 17, 592–598.
Yunker, P. J., & Yunker, J. (2003). Are student evaluations of teaching valid? Evidence from an analytical business core course. Journal of Education for Business, 78, 313–317.