The Dreaded Student Evaluation of Teaching

Every academic in Australia has to face student evaluation of teaching (SET), typically on an annual or more frequent basis. Students are asked to fill out questionnaires on each lecturer assessing the value and clarity of their lectures. So far so good. It is important, I hear you say, for lecturers to receive feedback on their performance, and I agree. No-one would take exception to that. But that is not the only way the results are used. The answers to these questionnaires are distilled into a score, usually between 0 and 5, of the instructor’s performance that becomes part of their overall annual assessment. In turn, this SET assessment is taken into account for promotion and can also be used in the ranking of the institution. Good (in terms of SET scores) teachers are rewarded, albeit minimally, and teaching awards are now almost de rigueur on the CVs of applicants for tenured positions in top institutions in Australia.

These are universities we are talking about: the great pursuers of truth and integrity. You would imagine that the methodology had been studied, tested, and validated before being used in this career-making (or career-breaking) way; that it would be subject to the kind of academic scrutiny and rigour we all claim to uphold as part of our research praxis. Certainly there are many publications, some dating back to the 1950s, in journals of educational psychology and educational research that discuss the relationship between SET scores and effective teaching. The majority of those from the earlier era (say, pre-2000) claim that students learn more effectively from more highly rated lecturers: the higher the SET score, the better the lecturer. But, more recently, a number of publications have begun to question the connection between SET score and effectiveness as a teacher.

One of the generally agreed best techniques for studying how SET scores compare with real teaching effectiveness is to use courses taught in multiple sections by different instructors. The performance of students across these sections is compared with the SET scores of the different instructors. This is, or ought to be, a fairly straightforward statistical problem of a kind that people in the social sciences face daily in their research.
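To make the design concrete, here is a minimal sketch of the calculation such multi-section studies boil down to. Every number below is invented purely for illustration; the point is only the shape of the comparison: one row per section of the same course, taught by a different instructor, and a single correlation between section-level SET ratings and section-level exam performance.

```python
# A toy version of the multi-section comparison (Python 3.10+ for
# statistics.correlation); all figures are made up for illustration.
from statistics import correlation

sections = [
    # (mean SET rating for the instructor, mean final-exam score for the section)
    (4.6, 71.2),
    (3.9, 68.5),
    (4.2, 74.0),
    (3.4, 69.8),
    (4.8, 70.1),
]

set_ratings = [s for s, _ in sections]
exam_scores = [e for _, e in sections]

# The question the multi-section literature asks: across sections of the same
# course, do higher-rated instructors produce higher mean student performance?
r = correlation(set_ratings, exam_scores)
print(f"Section-level SET/learning correlation: r = {r:.2f}")
```

Note how few data points a single study of this kind typically yields: one per section, not one per student. That will matter shortly.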

A famous and highly cited paper by Peter Cohen in 1981 undertook a meta-analysis, bringing together multiple earlier multi-section studies, and arrived at the conclusion that there is a small but significant correlation between SET scores and student performance. Cohen says, in particular,

The results of the meta-analysis provide strong support for the validity of student ratings as a measure of teaching effectiveness … we can safely say that student ratings of instruction are a valid index of instructional effectiveness. Students do a pretty good job of distinguishing among teachers on the basis of how much they have learned.

Many later publications cite the Cohen paper as a basis for using SET scores to assess teaching effectiveness. Another paper, by Clayson in 2009, undertook a meta-analysis, again using multi-section studies, and came up with a somewhat different result, though again suggesting a correlation between future student performance and SET scores. A third paper of interest, by McCallum in 1984, arrives at similar conclusions. While I single out these three, let me emphasize that there are many such papers.

One of the issues with studies of this kind is that students who are more interested in a course will rate the instructor more highly than students who are less interested. If assignment of students into sections for, say, an ancillary subject is based on their major area of study, biases will be inevitable. Instructors who teach tougher courses will be penalized, though this should be less of a problem with multi-section studies. Often entertainment value is mistaken by students for quality teaching. I recall a high school teacher who was highly regarded by my friends and me because he was very funny. It was only after becoming a math lecturer myself many years later that I realized he was a fairly ineffective teacher.

In the 2000s, several publications began disputing the connection between future student performance and SET scores. Now a recent hard-hitting paper by three researchers, Bob Uttl, Carmela White, and Daniela Wong Gonzalez, in the Department of Psychology at Mount Royal University in Canada, has provided a detailed critique of the earlier studies, including those of Cohen, McCallum, and Clayson, and pointed to several serious methodological flaws. They list, in particular, the failure of these meta-analyses to report the sample sizes of the primary studies, making it impossible to repeat the original work, and the failure to take account of variable sample sizes. The paper by McCallum does provide numbers, allowing the Mount Royal researchers to demonstrate that the listed sample sizes and correlations in that paper were actually incorrect. According to the Mount Royal group, all of the meta-analytic studies they looked at

failed to seriously consider that the small-to-moderate correlations between SET and learning may be an artifact of small samples of most studies and small sample bias.

The paper goes on to provide a detailed analysis of several of the more significant previous studies, articulating their methodological and, particularly, statistical shortcomings. Specifically, Cohen’s conclusions are not supported by his own data:

The inflated SET/learning correlations reported by Cohen appear to be an artifact of small study effects, most likely arising from publication bias.
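To get a feel for what “small study effects” and publication bias can do, here is a hedged little simulation of my own (not the authors’ analysis, and with arbitrary choices of sample sizes, study counts, and a crude “publish only if r exceeds 0.3” rule). Even when the true SET/learning correlation is exactly zero, the published record can show a respectable-looking positive correlation, and the smallest studies show the most inflated values.

```python
# A toy simulation of small-study effects plus publication bias.
# All parameters (section counts, cutoff, number of studies) are my own
# arbitrary choices for illustration; the true correlation is set to zero.
import numpy as np

rng = np.random.default_rng(0)

def published_correlations(n_sections, n_studies=5000, cutoff=0.3):
    """Simulate studies with a true correlation of zero and keep ('publish')
    only those whose observed correlation exceeds the cutoff."""
    kept = []
    for _ in range(n_studies):
        set_ratings = rng.normal(size=n_sections)   # instructor SET ratings
        learning = rng.normal(size=n_sections)      # unrelated student learning
        r = np.corrcoef(set_ratings, learning)[0, 1]
        if r > cutoff:                              # only "interesting" results survive
            kept.append(r)
    return kept

for n in (8, 50):   # a typical small primary study versus a larger one
    rs = published_correlations(n)
    print(f"n = {n:2d}: {len(rs)/5000:5.1%} of studies published, "
          f"mean published r = {np.mean(rs):.2f}")
```

With only a handful of sections per study, chance alone produces many large correlations, and if those are the ones that get written up, a meta-analysis that ignores sample size will happily average them into a spurious effect.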

As for Clayson, according to the Mount Royal researchers, his

findings are largely uninterpretable and his weighted correlation estimate of SET/learning correlations is meaningless.

The Mount Royal researchers then undertake their own very detailed and careful statistical meta-analysis of the available datasets. Their conclusions are damning and unequivocal:

Contrary to a multitude of reviews, reports, as well as self-help books aimed at new professors … the simple scatterplots as well as more sophisticated meta-analyses methods indicate that students do not learn more from professors who receive higher SET ratings.

As they point out:

The entire notion that we could measure professors’ teaching effectiveness by simple ways such as asking students to answer a few questions about their perceptions of their course experiences, instructors’ knowledge, and the like seems unrealistic given well established findings from cognitive sciences such as strong associations between learning and individual differences including prior knowledge, intelligence, motivation, and interest.

And, in their view, the techniques used appear not to have addressed some fairly obvious and well-understood sources of bias:

prior research indicates that prior interest in a course is one of the strongest predictor of SET ratings and that professors teaching quantitative courses receive lower SET ratings than professors teaching non quantitative courses.

Despite this societal bias against quantitative courses, as the researchers say,

some speculate that quantitative courses receive lower SET ratings because professors teaching them may be less competent and less effective.

One has the impression that the earlier researchers might have gained from paying more attention when they attended quantitative courses as students.

The conclusions of the paper are similarly strongly stated:

Despite more than 75 years of sustained effort, there is presently no evidence supporting the widespread belief that students learn more from professors who receive higher SET ratings. If anything, the latest large sample studies [cited in the paper] show that students who were taught by highly rated professors in prerequisites perform more poorly in follow up courses.

Perhaps a little flippantly, the authors suggest that individual institutions need to decide whether they regard student learning or student satisfaction as their primary educational goal. In the former case, they should assign essentially zero weight to SET scores. The “student satisfaction” institutions should use SET scores as their main measure of teaching performance, and perhaps they should fire any academic whose SET score falls below the average!

Another recent article backing the Mount Royal group’s thesis and arguing against the use of SET scores is by John Lawrence in the May-June 2018 report of the American Association of University Professors (https://www.aaup.org/article/student-evaluations-teaching-are-not-valid#.W3oNfZ19iw4). He states that giving out chocolate has been known to have a significant effect on SET scores. I personally know of a colleague who bought their students a pizza lunch in the assessment week. Their SET score increased by a whole point (from 3.6 to 4.6) over the same class in the previous year. This took them from an assessment as a mediocre teacher to an outstanding one. As Lawrence points out:

Psychologist Wolfgang Stroebe has argued that reliance on SET scores for evaluating teaching may contribute, paradoxically, to a culture of less rigorous education. He reviewed evidence that students tend to rate more lenient professors more favorably…. Thus, professors are rewarded for being less demanding and more lenient graders both by receiving favorable SET ratings and by enjoying higher student enrollment in their courses.

I don’t understand the use of the word “paradoxically” there. The economic model of universities in Australia is putting pressure on academics to reduce the demands placed on students, and it is very clear that SET is contributing to that.
