Variation in the student ratings of school of business instructors: a regional university example.
Caines, W. Royce; Shurden, Mike C.; Cangelosi, Joe
INTRODUCTION
Over the past several years, student evaluations of instructors
have taken on increasing importance as administrators have sought to use
as many objective measures as possible to justify tenure/promotion
decisions, to differentiate pay raises, and to provide feedback to the
public. Student evaluations are convenient tools because they produce
numbers that appear to represent objective assessments of the
effectiveness of the instructor. These instruments are usually
administered during a class session late in the semester and are
relatively cheap to administer. Also, such evaluations give the students
the appearance of having input into decisions that relate to the quality
of educational services offered.
In the case of the regional university in this study, the State
Commission on Higher Education has mandated that student evaluations
must be administered in each course during the fall semester and has
even mandated one specific question that must be included for every
class. That mandate applies not only to this small regional university
but to every public college and university in the state. The mandated
question addresses the availability of the instructor to the student. So
far, the remaining questions have not been mandated, but many observers
expect that such a directive may eventually come.
While most instructors may not object to a survey of student
opinion about instructional effectiveness, many are concerned about how
the data collected will be interpreted. Variation in ratings, even in
the same courses, has been a concern for some time. For
example, Greenwald (1995) reported that his ratings for a course taught
one semester placed him in the highest 10 percent of faculty ratings at
his university. The next semester, he taught the same class by the same
plan but student ratings placed him in the second lowest decile of the
university faculty. How could such variation occur assuming similar
efforts on the part of faculty? Also, how do administrators react to
such variation? Do they consider just the numbers and reward effort one
period and punish effort the next period without considering whether a
large amount of variation may exist regardless of faculty effort?
Another criticism that instructors often level at student
evaluations is that they can be manipulated by grading standards and/or
style issues that have little to do with course content or knowledge of
the instructor. An example can be found in Williams and Ceci (1997),
where the authors showed that merely changing the enthusiasm displayed
by the instructor significantly affected student evaluations. In
another provocative study, Ambady and Rosenthal (1993) concluded that
students arrive at conclusions about instructors within seconds of being
exposed to them. Baird (1987) concluded that a large portion of rating
variance can be explained by students' subjective assessments
of learning as opposed to actual course grades.
However, most administrators appear to subscribe to the view that
"student ratings tend to be statistically reliable, valid, and
relatively free from bias or the need for control; probably more so than
any other data used for evaluation" (Cashin, 1995). A study by
Marsh and Hocevar (1991) concluded that student evaluations of teaching
performance are stable and consistent over time, based on data from
6,024 classes and 195 instructors.
The focus of this paper is to evaluate the student evaluations of
business instructors at a small regional university over a period of
twelve semesters. The variation across semesters and across instructors
is examined and compared to actual administrative ratings of the
instructors to determine whether rational decisions are made in
differentiating instructional efforts. No effort is made to question the
reliability and validity of student evaluations as a means of
evaluating teaching effectiveness.
DATA AND METHODOLOGY
Data for this study were collected from all courses taught in
business at a small (2800 students) regional public university during
each regular semester (summer school excluded) from Fall 1992 through
Spring 1998, a total of twelve semesters. Over that period, a total of
592 sections were taught in the School of Business. Several sections
were taught by part-time instructors, but most were taught by full-time
faculty. Data from all sections are utilized for comparison purposes,
but the focal group of the study is twelve
instructors who taught one or more School of Business disciplines and
who were still on the faculty as of the end of Spring Semester 1998.
The student evaluation instrument is purchased from a large state
university Center for Teaching and Faculty Development. Therefore, the
instrument has been tested for reliability and validity. Also, the
instructor reports are compared to a norm group based on the other
institutions using the instrument. The surveys are completed on a very
formal schedule near the end of each semester. In each section, a
faculty member who is not the instructor of the course is assigned to
administer the student evaluations at the beginning of the class period.
A student is selected to collect the completed forms and deliver them to
the School of Business office. They are then mailed to the test center
for analysis and results are returned to the School of Business office
shortly after the end of the semester. Instructors receive a copy of the
reports at the beginning of the next semester. The reports show
comparisons of the individual instructor to others in the national group
but do not show comparisons to instructors at their own institution.
Simple summary statistics are the first part of this analysis.
Mean, median, range, standard deviation, and coefficient of variation are all calculated for the entire group and for each of the twelve
instructors. Next, evaluation scores for each instructor are tested
against all other section evaluation scores by computing the Wilcoxon
Rank Sum Test for differences of two medians. The Wilcoxon Rank Sum Test
is used because it is a nonparametric procedure that does not require
the assumption of normality. It is a powerful test even when conditions
of normality are met and is more appropriate when they are not (Levine
et al., 1997). The test statistic is approximately normally distributed
for large sample sizes (Levine et al., 1997). For those instructors
whose median evaluation scores are significantly different from the
overall median, a second Wilcoxon Rank Sum Test will be calculated to
compare those scores.
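As an illustrative sketch only (not the study's actual computation), the
summary measures just described can be produced with a few lines of
Python using pandas; the data frame and its columns here are
hypothetical placeholders for the section-level evaluation data:

    import pandas as pd

    # Hypothetical data: one row per course section, giving the instructor
    # and that section's score on a single summary question.
    df = pd.DataFrame({
        "instructor": [1, 1, 2, 2, 3, 3],        # placeholder IDs
        "score":      [75, 81, 41, 39, 27, 23],  # placeholder percentiles
    })

    def summarize(scores):
        # Mean, median, range, standard deviation, coefficient of variation.
        return pd.Series({
            "mean":   scores.mean(),
            "median": scores.median(),
            "range":  scores.max() - scores.min(),
            "stdev":  scores.std(),
            "cv":     100 * scores.std() / scores.mean(),  # CV in percent
        })

    print(summarize(df["score"]))                              # all sections
    print(df.groupby("instructor")["score"].apply(summarize))  # per instructor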
The student evaluations administered at the university of this
study are based on a national instrument. A large number of questions
are asked to try to diagnose specific strengths/weaknesses of each
instructor. Also, the ratings are compared to peer instructors of all
courses and of similar courses where similar courses are specified as
those that are of similar class size and self-reported student
motivation level. Thus the reported results do not allow comparison to
courses specific to the discipline; rather, scores are compared to the
national norm group.
The reports that are printed by the national test center include
the results from a large number of questions, many of which focus on
particular teaching methods such as "creating enthusiasm",
"explaining reasons for criticisms", etc. Many of these
results are not of particular interest to administrators seeking to
evaluate overall teaching performance. Rather, the focus tends to be on a
few summary questions which form an overview of the instructor and
course. For purposes of this paper, five of those summary
items/questions are compared, as shown in Table 1. Questions 1-3 are
reported as percentile scores, where each instructor's score is placed
in a percentile of the norm group of similar classes described in the
preceding paragraph. Questions 4 and 5 are
reported as raw scores on a 1 (strongly disagree) to 5 (strongly agree)
scale with 1 being the lowest and 5 being the highest (preferred) score.
RESULTS
First, summary statistics were calculated for all sections taught
in the School of Business. Results are reported for all sections and for
each of the twelve instructors who are the focus of this study (Table
1).
Based on the results shown in Table 1, it is apparent that a large
amount of variation exists in the observed results of the student
evaluations. In general, higher mean and median scores with less
variation would indicate consistently strong teaching performances as
evaluated by students in those classes.
Particularly with questions 1-3, variation is rather large, as exhibited
by the large ranges (compared to the means) and the coefficients of
variation.
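As a worked example from Table 1, for all sections on question 1 the
coefficient of variation is 100 × 28/54 ≈ 52, i.e., the standard
deviation is roughly half the mean percentile score.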
For question 1, seven of the twelve instructors have median ratings that
exceed the institutional median of the 58th percentile, with Instructor
7 having the highest median rating at the 90th percentile. Recall that
these percentile rankings are based on the norm group that represents
all institutions using this instrument. Five instructors have median
ratings lower than the institutional median, with Instructor 3 receiving
the lowest at the 23rd percentile. For question 2, six instructors received
ratings above the institutional median of the 52nd percentile with
Instructor 11 having the highest rating at the 91st percentile. One
instructor has the same rating as the institutional median and five are
below the institutional median with Instructor 8 receiving the lowest
rating at the 10th percentile.
For question 3, five instructors have median ratings above the
institutional overall median of the 52nd percentile with Instructor 11
again receiving the highest median rating at the 76th percentile. Seven
of the instructors received median ratings below the institutional
median with Instructor 3 having the lowest rating at the 28th
percentile.
Questions 4 and 5 are reported on a different scale than questions
1 through 3. These questions are reported as raw scores on a scale of
1-5, with 5 the more desirable score. Thus there is no comparison to a
norm group. For question 4, five of the instructors received median
ratings above the overall median, with Instructor 7 highest at 4.8. Two
instructors had median ratings identical to the overall median and five
had ratings below the overall median, with Instructor 3 lowest at 3.4.
On question 5, six instructors received median ratings above the
overall median with Instructor 7 on top with a 4.8. One instructor had a
median rating equal to the overall median and five had ratings below the
overall median, with Instructor 8 rated lowest at 3.0.
While these summary statistics allow a visual basis for comparison, the
large amount of variation for each instructor raises concern about the
statistical validity of asserting that significant differences exist.
For example,
Instructor 7 received median ratings well above the institutional median
for questions 1-3, yet the range of ratings exceeded 70 percentile
points. Similar observations can be made for several of the other
instructors.
Therefore the next step is to use the Wilcoxon Rank Sum Test of
differences of medians. In each case, the scores of the individual
instructor are compared to all other instructors of business courses at
this institution. The Wilcoxon Rank Sum Test statistic is approximately
normally distributed, so a Z-score is calculated and its p-value
examined to test whether significant differences exist.
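A minimal sketch of one such comparison, assuming scipy is available;
the score arrays are placeholders, not the study's data:

    from scipy.stats import ranksums

    # Placeholder section-level scores: one instructor versus all other
    # business sections at the institution.
    own    = [75, 81, 70, 68, 77]
    others = [54, 58, 41, 39, 27, 23, 59, 64, 50, 54]

    # ranksums applies the Wilcoxon rank-sum test with the large-sample
    # normal approximation, returning the Z statistic and its p-value.
    z, p = ranksums(own, others)
    print(f"Z = {z:.2f}, p = {p:.4f}")  # compare p to the chosen alpha

In recent versions of scipy (1.7 and later), a one-sided test can be
requested with the alternative keyword argument.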
Table 2 gives the calculated Z-scores and indicates significant
differences. The chosen level of significance for this analysis is
α = .05. Several points can be gleaned from the results. There are many
significant differences. Some instructors had ratings that were
significantly different and positive on some questions and significantly
different and negative on other questions. That makes it much more
difficult to rate the teaching of those instructors.
However, the twelve instructors fall into three general groups.
Instructors 1, 6, 7, and 11 received ratings above the institutional
median, with at least four of the questions significantly different.
Instructors 3, 5, and 8 received ratings on all five questions that were
below the institutional median by a statistically significant margin.
That leaves
Instructors 2, 4, 9, 10, and 12 who received mixed ratings. Thus the
first group might logically be considered to be rated higher on teaching
(as rated by student evaluations) while the second group might logically
be rated lower on teaching. The third group must be evaluated more
carefully and the weights to be placed on each question must be
carefully considered before a meaningful evaluation can be completed.
For example, Instructor 4 received significantly higher ratings for
questions 4 and 5 which indicate that students perceive an excellent
instructor and perceive they learned a lot in the course. Yet the
ratings on the other questions are not significantly different from the
institutional median and on question 3, the ratings are actually
significantly lower than the institutional median.
LOWER LEVEL COMPARED TO UPPER LEVEL
Another area of concern to faculty has been the perceived
differences in teaching lower level classes versus upper level classes.
Students in upper level classes must meet the requirements for admission
to the School of Business. Those requirements include acceptable
completion of a number of lower level courses (a minimum grade of
"C" in several of them) and an overall minimum grade point
average of 2.0. Therefore, faculty perceive a more mature, more
motivated group of students in upper level classes.
The data in general support the perception that upper level classes
receive higher student evaluations (Table 3). Instructors of lower level
classes received lower median ratings on four of the five questions.
However, on question 2, "Would like instructor again", the
ratings were higher for lower level classes than for upper level
classes. Based on α = .05, significant differences existed only
for questions 1 and 4. It should be noted that at α = .10, all
questions would be rated as significantly different.
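To make the two thresholds concrete, the p-values reported in Table 3
appear to be one-tailed and can be recovered from the Z-scores via the
standard normal distribution; a small Python check using the table's own
figures:

    from scipy.stats import norm

    # Z-scores for questions 1-5 as reported in Table 3.
    z_scores = {1: -3.25, 2: 1.55, 3: -1.29, 4: -3.12, 5: -1.29}

    for q, z in z_scores.items():
        p = norm.sf(abs(z))  # one-tailed p-value from |Z|
        print(f"Q{q}: Z = {z:5.2f}, p = {p:.4f}, "
              f"significant at .05: {p < 0.05}, at .10: {p < 0.10}")

Running this reproduces the table's p-values (.0006, .0606, .0985,
.0009, .0985): only questions 1 and 4 clear α = .05, while all five
clear α = .10.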
These results indicate that it may be appropriate to consider the
course assignments of individual instructors. Those who have mostly
taught lower level assignments may deserve some adjustment upward for
questions 1, 3, 4, and 5, but results of question 2 would then be
adjusted downward. It should be noted that three of the instructors
taught only upper division classes, while all others taught a mix, some
weighted toward lower level courses and others toward upper level
courses.
ADMINISTRATIVE ACTIONS
One interesting comparison is between actual administrative actions
and the results of the student evaluations.
Each professor at this institution is evaluated on an annual basis with
the three areas of evaluation being identified as teaching, professional
development, and service. Based on the AACSB Accreditation plan, the
relative weight assigned to teaching is 70% because this is an
undergraduate-only institution. Thus, the teaching component is a large
portion of the overall evaluation.
In years when professors are up for tenure, promotion, or
post-tenure review, administrators have additional information to judge
when evaluating individual instructors as peer reviewers review classes
and write letters of evaluation. In other years, however, the only data
available are the results of the student evaluations.
The actual annual evaluations are not available to the authors of
this study. However, much can be inferred from the information that is
available: annual pay raises and reappointment/promotion/tenure
decisions.
Also, in many cases, individual instructors voluntarily share annual
evaluation information. Table 4 summarizes available information about
the evaluation results based on administrative decisions.
Based on observed data, it appears that administrators have indeed
made decisions that are reasonably consistent with this analysis in
rewarding higher student evaluations even though no effort has ever been
made to test the significance of differences. Instructors 3, 5, and 8
have received annual pay raises considerably lower than average. One of
the instructors has been required to take additional coursework in his
field and one instructor did not receive support for on-time promotion
though the instructor was granted tenure. Support for tenure was mixed
based on both teaching and professional development. All three of these
instructors received student evaluations significantly lower than the
School of Business median.
Instructors 1, 6, 7, and 11 have all received above average pay
raises and all have received tenure and on-time promotions during the
period of this study (or are currently in the process with full support
from administrators). Pay raises have not all been equal, but factors
other than teaching also enter into those decisions.
Instructors 2, 10, and 12 have received average pay raises, consistent
with the mixed results from the student evaluations. However, the
promotion of one instructor was delayed and received mixed support when
awarded. That was not all based on teaching as professional development
also played a role.
Instructors 4 and 9 are interesting cases as both received mixed
ratings on the questions. However, both have received above average pay
raises. In some cases, those raises have exceeded the raises awarded
some of the instructors in the higher rated group. Evidently, in those
cases, administrators weighted some questions more heavily or considered
issues other than teaching evaluations. A cursory review does not reveal
any obvious reason, as professional development activities did not
differ noticeably.
SUMMARY AND CONCLUSIONS
The results of this analysis indicate that significant differences
exist in the student evaluations of the instructors of business courses
at the regional university that provided the data. Although the measured
variation is large for each instructor, differences between instructors
can be supported by the statistically significant differences of medians
shown by the Wilcoxon Rank Sum Test. This analysis gives
administrators a methodology for establishing which instructors have
truly different student evaluations.
One item that has been a concern at the university is also shown to
have some validity: instructors of lower level courses do tend to
receive lower student evaluations than instructors of upper level
courses, except for the question "would like instructor again".
For that question, instructors of lower level classes tend to receive
higher ratings.
One final point that should interest administrators at this university,
which prides itself on supporting undergraduate teaching, is the
comparison of administrative decisions to the analysis of student
evaluations. For the most part, those
instructors with significantly higher evaluations have received more
positive administrative support while those who have significantly lower
evaluations have received less positive administrative support.
REFERENCES
Ambady, N. & R. Rosenthal. (1993). Half a minute: predicting
teacher evaluations from thin slices of nonverbal behavior and physical
attractiveness, Journal of Personality and Social Psychology, 64,
431-441.
Baird, J. S. (1987). Perceived learning in relation to student
evaluations of university instruction, Journal of Educational
Psychology, 79(1), 90-91.
Cashin, W. E. (1995). Student Ratings of Teaching: IDEA Paper No.
32, Manhattan, KS: Kansas State University Center for Faculty Evaluation
and Development.
Greenwald, A.G. (1995). Applying Social Psychology to Reveal a
Major (But Correctable) Flaw in Student Evaluations of Teaching, Paper
presented at the annual meeting of the American Psychological
Association, New York, NY.
Levine, D. M., M.L. Berenson & D. Stephan. (1997). Statistics
for Managers Using Microsoft Excel. Upper Saddle River, NJ:
Prentice-Hall.
Marsh, H. W. & D. Hocevar. (1991). Student ratings of teaching
effectiveness: The stability of mean ratings of the same teachers over a
13-year period, Teaching and Teacher Education, 7(4), 303-314.
Tang, T. L. (1997). Teaching evaluation at a public institution of
higher education: Factors related to the overall teaching effectiveness,
Public Personnel Management, 379-390.
Williams, W. M. & S. Ceci. (1997). How'm I doing? Problems
with student ratings of instructors and courses, Change, 29, 12-24.
W. Royce Caines, Lander University
Mike C. Shurden, Lander University
Joe Cangelosi, University of Central Arkansas
Table 1. Summary Statistics: Student Evaluations of School
of Business Instructors
Question 1--Progress on Relevant Objectives
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 54 58 98 28 52
1 75 81 82 18 24
2 41 39 88 24 60
3 27 23 77 22 81
4 59 64 97 28 48
5 50 54 91 24 48
6 69 69 77 17 24
7 80 90 81 20 25
8 28 22 67 21 73
9 62 64 85 21 35
10 70 74 91 23 33
11 68 73 78 21 31
12 54 53 88 29 53
Question 2--Would Like Instructor Again
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 51 52 98 28 55
1 60 57 72 23 38
2 61 64 80 17 28
3 43 42 88 21 48
4 49 52 88 23 47
5 29 26 69 16 54
6 67 68 78 18 28
7 78 85 75 19 24
8 13 10 47 11 85
9 57 60 76 21 37
10 26 25 62 18 68
11 89 91 39 7 8
12 20 19 33 10 50
Question 3--Improved Attitude Toward Field
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 49 52 97 25 52
1 60 58 74 22 36
2 56 60 87 17 30
3 34 28 85 20 58
4 39 42 82 22 55
5 43 43 87 22 51
6 55 58 79 18 33
7 74 75 82 21 29
8 27 24 77 20 77
9 40 34 83 24 59
10 50 49 91 26 51
11 74 76 76 18 24
12 51 51 78 27 53
Question 4--Overall, I learned a Great Deal In This Course
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 3.9 4.0 3.2 0.5 13.9
1 4.5 4.5 1.1 0.3 6.2
2 3.9 4.0 1.5 0.3 8.7
3 3.3 3.4 2.2 0.5 15.1
4 4.1 4.1 2.1 0.5 11.8
5 3.7 3.7 1.4 0.4 9.8
6 4.0 4.0 1.4 0.3 8.2
7 4.6 4.8 2.0 0.5 10.0
8 3.5 3.5 2.3 0.6 16.6
9 4.2 4.2 1.3 0.3 7.2
10 4.0 3.9 2.2 0.5 12.7
11 4.1 4.1 1.3 0.3 6.9
12 3.7 3.8 1.4 0.4 11.2
Question 5--Overall, I Rate This Instructor An Excellent Instructor
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 4 4.1 3.4 0.65 16.5
1 4.5 4.5 1.5 0.3 7.5
2 4.1 4.1 1.6 0.3 7.6
3 3.5 3.6 2.4 0.6 16.4
4 4.2 4.3 2.2 0.5 11.3
5 3.7 3.7 1.8 0.4 11.2
6 4.2 4.3 1.2 0.3 6.5
7 4.7 4.8 1.6 0.4 7.7
8 3.0 3.0 2.4 0.6 20.4
9 4.3 4.3 1.4 0.3 6.8
10 3.7 3.8 3.1 0.8 20.5
11 4.5 4.6 1.1 0.3 5.9
12 3.3 3.4 1.2 0.4 11.7
Table 2. Comparison of Ratings of Instructors (Wilcoxon Rank Sum Z-Scores)
Instructor     Q1        Q2        Q3        Q4        Q5     Summary
 1           4.99 *    0.46      3.15 *    7.87 *    6.06 *   4 above, 1 ns
 2          -3.37 *    1.86 *    3.04 *   -0.21      0.83     2 above, 1 below, 2 ns
 3          -5.93 *   -3.10 *   -4.31 *   -7.38 *   -5.36 *   5 below
 4           1.10     -0.71     -2.56 *    3.04 *    2.49 *   2 above, 1 below, 2 ns
 5          -4.55 *   -5.05 *   -1.87 *   -3.85 *   -7.54 *   5 below
 6           4.41 *    4.27 *    2.39 *    1.37      2.60 *   4 above, 1 ns
 7           4.91 *    4.37 *    4.79 *    6.64 *    6.40 *   5 above
 8          -5.55 *   -8.49 *   -5.54 *   -5.03 *   -8.36 *   5 below
 9           1.74 *    1.05     -2.36 *    3.71 *    2.50 *   3 above, 1 below, 1 ns
10           4.05 *   -6.63 *    0.44      0.04     -3.17 *   1 above, 2 below, 2 ns
11           3.22 *    9.66 *    6.88 *    1.78 *    6.51 *   5 above
12           0.74     -3.75 *    0.52     -1.46     -3.94 *   2 below, 3 ns
* significant at α = .05
ns = not significant
Table 3. Differences Between Lower Level and Upper Level Courses
QUESTION      1      2      3      4      5
Lower Level
Mean 49 53 47 3.8 3.90
Median 51 56 46 3.9 4.10
Range 98 98 97 3.2 3.40
STDEV 26 30 27 0.52 0.68
CV 54 57 57 14 16.00
Upper Level
Mean 57 50 50 3.98 4.02
Median 62.5 49 52 4.00 4.10
Range 98 98 96 2.90 3.20
STDEV 28.6 27 24 0.55 0.64
CV 50.5 54 48 13.80 16.03
Z-Score -3.25 1.55 -1.29 -3.12 -1.29
p-value 0.0006 0.0606 0.0985 0.0009 0.0985
Table 4. Administrative Actions Affecting Instructors
Instructor Pay Raises Reappointment/Tenure/Promotion
1 Above average Received tenure/on-time promotion
2 Average Previously tenured full professor
3 Below average Previously tenured full professor
4 Above average Previously tenured full professor
5 Below average Tenured/Promotion delayed
6 Above average Received tenure/on-time promotion
7 Above average Received tenure/on-time promotion
8 Below average Previously tenured full professor
9 Above average Received tenure/on-time promotion
10 Average Received tenure/delayed promotion
11 Above average Received tenure/on-time promotion
12 Average Previously tenured full professor