Variation in the student ratings of school of business instructors: a regional university example.
Caines, W. Royce; Shurden, Mike C.; Cangelosi, Joe
INTRODUCTION
Over the past several years, student evaluations of instructors
have taken on increasing importance as administrators have sought to use
as many objective measures as possible to justify tenure/promotion
decisions, to differentiate pay raises, and to provide feedback to the
public. Student evaluations are convenient tools because they produce
numbers that appear to represent objective assessments of the
effectiveness of the instructor. These instruments are usually
administered during a class session late in the semester and are
relatively cheap to administer. Also, such evaluations give the students
the appearance of having input into decisions that relate to the quality
of educational services offered.
In the case of the regional university in this study, the State
Commission on Higher Education has mandated that student evaluations
must be administered in each course during the fall semester and has
even mandated one specific question that must be included for every
class. That mandate applies not only to this small regional university
but to every public college and university in the state. The mandated
question addresses the availability of the instructor to the student. So
far, the remaining questions have not been mandated, but many observers
expect that such a directive may eventually come.
While most instructors may not object to a survey of student
opinion about instructional effectiveness, many are concerned about how
the data collected will be interpreted. Variation in ratings, even in
the same courses, has been a concern for some time. For
example, Greenwald (1995) reported that his ratings for a course taught
one semester placed him in the highest 10 percent of faculty ratings at
his university. The next semester, he taught the same class by the same
plan but student ratings placed him in the second lowest decile of the
university faculty. How could such variation occur assuming similar
efforts on the part of faculty? Also, how do administrators react to
such variation? Do they consider just the numbers and reward effort one
period and punish effort the next period without considering whether a
large amount of variation may exist regardless of faculty effort?
Another criticism that instructors often level at student
evaluations is that they can be manipulated by grading standards and/or
style issues that have little to do with course content or knowledge of
the instructor. An example can be found in Williams and Ceci (1997),
where the authors showed that merely changing the enthusiasm displayed
by the instructor significantly affected student evaluations. In
another provocative study, Ambady and Rosenthal (1993) concluded that
students arrive at conclusions about instructors within seconds of being
exposed to them. Baird (1987) concluded that a large portion of rating
variance can be explained by students' subjective assessments
of learning as opposed to actual course grades.
However, most administrators appear to subscribe to the view that
"student ratings tend to be statistically reliable, valid, and
relatively free from bias or the need for control; probably more so than
any other data used for evaluation" (Cashin, 1995). A study by
Marsh and Hocevar (1991) concluded that student evaluations of teaching
performance are stable and consistent over time, based on data from
6,024 classes and 195 instructors.
The focus of this paper is to evaluate the student evaluations of
business instructors at a small regional university over a period of
twelve semesters. The variation across semesters and across instructors
is examined and compared to actual administrative ratings of the
instructors to determine whether rational decisions are made in
differentiating instructional efforts. No effort is made to question the
reliability and validity of student evaluations as a means of
evaluating teaching effectiveness.
DATA AND METHODOLOGY
Data for this study were collected from all courses taught in
business at a small (2800 students) regional public university during
each regular semester (summer school excluded) from Fall 1992 through
Spring 1998, a total of twelve semesters. Over that period, a total of
592 sections were taught in the School of Business. Several sections
were taught by part-time instructors, but most were taught by full-time
faculty. Data from all sections are utilized for comparison purposes,
but the focal group of the study is twelve
instructors who taught one or more School of Business disciplines and
who were still on the faculty as of the end of Spring Semester 1998.
The student evaluation instrument is purchased from a large state
university Center for Teaching and Faculty Development. Therefore, the
instrument has been tested for reliability and validity. Also, the
instructor reports are compared to a norm group based on the other
institutions using the instrument. The surveys are completed on a very
formal schedule near the end of each semester. In each section, a
faculty member who is not the instructor of the course is assigned to
administer the student evaluations at the beginning of the class period.
A student is selected to collect the completed forms and deliver them to
the School of Business office. They are then mailed to the test center
for analysis and results are returned to the School of Business office
shortly after the end of the semester. Instructors receive a copy of the
reports at the beginning of the next semester. The reports show
comparisons of the individual instructor to others in the national group
but do not show comparisons to instructors at their own institution.
Simple summary statistics are the first part of this analysis.
Mean, median, range, standard deviation, and coefficient of variation are all calculated for the entire group and for each of the twelve
instructors. Next, evaluation scores for each instructor are tested
against all other section evaluation scores by computing the Wilcoxon
Rank Sum Test for differences of two medians. The Wilcoxon Rank Sum Test
is used because it is a nonparametric procedure that does not require
the assumption of normality. It is a powerful test even when conditions
of normality are met and is more appropriate when they are not (Levine
et al., 1997). The test statistic is approximately normally distributed
for large sample sizes (Levine et al., 1997). For those instructors
whose median evaluation scores are significantly different from the
overall median, a second Wilcoxon Rank Sum Test will be calculated to
compare those scores.
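As an illustrative sketch only (not the study's actual computation), the
summary measures just described can be produced with a few lines of
Python using pandas; the data frame and its columns here are
hypothetical placeholders for the section-level evaluation data:

    import pandas as pd

    # Hypothetical data: one row per course section, giving the instructor
    # and that section's score on a single summary question.
    df = pd.DataFrame({
        "instructor": [1, 1, 2, 2, 3, 3],        # placeholder IDs
        "score":      [75, 81, 41, 39, 27, 23],  # placeholder percentiles
    })

    def summarize(scores):
        # Mean, median, range, standard deviation, coefficient of variation.
        return pd.Series({
            "mean":   scores.mean(),
            "median": scores.median(),
            "range":  scores.max() - scores.min(),
            "stdev":  scores.std(),
            "cv":     100 * scores.std() / scores.mean(),  # CV in percent
        })

    print(summarize(df["score"]))                              # all sections
    print(df.groupby("instructor")["score"].apply(summarize))  # per instructor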
The student evaluations administered at the university of this
study are based on a national instrument. A large number of questions
are asked to try to diagnose specific strengths/weaknesses of each
instructor. Also, the ratings are compared to peer instructors of all
courses and of similar courses where similar courses are specified as
those that are of similar class size and self-reported student
motivation level. Thus the reported results do not allow comparison to
courses specific to the discipline; rather, scores are compared to the
national norm group.
The reports that are printed by the national test center include
the results from a large number of questions, many of which focus on
particular teaching methods such as "creating enthusiasm",
"explaining reasons for criticisms", etc. Many of these
results are not of particular interest to administrators seeking to
evaluate overall teaching performance. Rather, the focus tends to be on a
few summary questions which form an overview of the instructor and
course. For purposes of this paper, five of those summary
items/questions are compared, as shown in Table 1. Questions 1-3 are
reported as percentile scores, where each instructor's score is placed
in a percentile of the norm group of similar classes described in the
preceding paragraph. Questions 4 and 5 are
reported as raw scores on a 1 (strongly disagree) to 5 (strongly agree)
scale with 1 being the lowest and 5 being the highest (preferred) score.
RESULTS
First, summary statistics were calculated for all sections taught
in the School of Business. Results are reported for all sections and for
each of the twelve instructors who are the focus of this study (Table
1).
Based on the results shown in Table 1, it is apparent that a large
amount of variation exists in the observed results of the student
evaluations. In general, higher mean and median scores with less
variation would indicate consistently strong teaching performances as
evaluated by students in those classes.
Particularly with questions 1-3, variation is rather large, as exhibited
by the large ranges (compared to the means) and the coefficients of
variation.
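As a worked example from Table 1, for all sections on question 1 the
coefficient of variation is 100 × 28/54 ≈ 52, i.e., the standard
deviation is roughly half the mean percentile score.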
For question 1, seven of the twelve instructors have median ratings that
exceed the institutional median of the 58th percentile, with Instructor
7 having the highest median rating at the 90th percentile. Recall that
these percentile rankings are based on the norm group that represents
all institutions using this instrument. Five instructors have median
ratings lower than the institutional median, with Instructor 3 receiving
the lowest at the 23rd percentile. For question 2, six instructors received
ratings above the institutional median of the 52nd percentile with
Instructor 11 having the highest rating at the 91st percentile. One
instructor has the same rating as the institutional median and five are
below the institutional median with Instructor 8 receiving the lowest
rating at the 10th percentile.
For question 3, five instructors have median ratings above the
institutional overall median of the 52nd percentile with Instructor 11
again receiving the highest median rating at the 76th percentile. Seven
of the instructors received median ratings below the institutional
median with Instructor 3 having the lowest rating at the 28th
percentile.
Questions 4 and 5 are reported on a different scale than questions
1 through 3. These questions are reported as raw scores on a scale of
1-5, with 5 the more desirable score. Thus there is no comparison to a
norm group. For question 4, five of the instructors received median
ratings above the overall median, with Instructor 7 highest at 4.8. Two
instructors had median ratings identical to the overall median and five
had ratings below the overall median, with Instructor 3 lowest at 3.4.
On question 5, six instructors received median ratings above the
overall median with Instructor 7 on top with a 4.8. One instructor had a
median rating equal to the overall median and five had ratings below the
overall median, with Instructor 8 rated lowest at 3.0.
While these summary statistics allow a visual basis for comparison, the
large amount of variation for each instructor raises concern about the
statistical validity of asserting that significant differences exist.
For example,
Instructor 7 received median ratings well above the institutional median
for questions 1-3, yet the range of ratings exceeded 70 percentile
points. Similar observations can be made for several of the other
instructors.
Therefore the next step is to use the Wilcoxon Rank Sum Test of
differences of medians. In each case, the scores of the individual
instructor are compared to all other instructors of business courses at
this institution. The Wilcoxon Rank Sum Test statistic is approximately
normally distributed, so a Z-score is calculated and its p-value
examined to test whether significant differences exist.
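A minimal sketch of one such comparison, assuming scipy is available;
the score arrays are placeholders, not the study's data:

    from scipy.stats import ranksums

    # Placeholder section-level scores: one instructor versus all other
    # business sections at the institution.
    own    = [75, 81, 70, 68, 77]
    others = [54, 58, 41, 39, 27, 23, 59, 64, 50, 54]

    # ranksums applies the Wilcoxon rank-sum test with the large-sample
    # normal approximation, returning the Z statistic and its p-value.
    z, p = ranksums(own, others)
    print(f"Z = {z:.2f}, p = {p:.4f}")  # compare p to the chosen alpha

In recent versions of scipy (1.7 and later), a one-sided test can be
requested with the alternative keyword argument.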
Table 2 gives the calculated Z-scores and indicates significant
differences. The chosen level of significance for this analysis is
α = .05. Several points can be gleaned from the results. There are many
significant differences. Some instructors had ratings that were
significantly different and positive on some questions and significantly
different and negative on other questions. That makes it much more
difficult to rate the teaching of those instructors.
However, the twelve instructors fall into three general groups.
Instructors 1, 6, 7, and 11 received ratings above the institutional
median, with at least four of the questions significantly different.
Instructors 3, 5, and 8 received ratings on all five questions that were
below the institutional median by a statistically significant margin.
That leaves
Instructors 2, 4, 9, 10, and 12 who received mixed ratings. Thus the
first group might logically be considered to be rated higher on teaching
(as rated by student evaluations) while the second group might logically
be rated lower on teaching. The third group must be evaluated more
carefully and the weights to be placed on each question must be
carefully considered before a meaningful evaluation can be completed.
For example, Instructor 4 received significantly higher ratings for
questions 4 and 5 which indicate that students perceive an excellent
instructor and perceive they learned a lot in the course. Yet the
ratings on the other questions are not significantly different from the
institutional median and on question 3, the ratings are actually
significantly lower than the institutional median.
LOWER LEVEL COMPARED TO UPPER LEVEL
Another area of concern to faculty has been the perceived
differences in teaching lower level classes versus upper level classes.
Students in upper level classes must meet the requirements for admission
to the School of Business. Those requirements include acceptable
completion of a number of lower level courses (a minimum grade of
"C" in several of them) and an overall minimum grade point
average of 2.0. Therefore, faculty perceive a more mature, more
motivated group of students in upper level classes.
The data in general support the perception that upper level classes
receive higher student evaluations (Table 3). Instructors of lower level
classes received lower median ratings on four of the five questions.
However, on question 2, "Would like instructor again", the
ratings were higher for lower level classes than for upper level
classes. Based on α = .05, significant differences existed only
for questions 1 and 4. It should be noted that at α = .10, all
questions would be rated as significantly different.
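To make the two thresholds concrete, the p-values reported in Table 3
appear to be one-tailed and can be recovered from the Z-scores via the
standard normal distribution; a small Python check using the table's own
figures:

    from scipy.stats import norm

    # Z-scores for questions 1-5 as reported in Table 3.
    z_scores = {1: -3.25, 2: 1.55, 3: -1.29, 4: -3.12, 5: -1.29}

    for q, z in z_scores.items():
        p = norm.sf(abs(z))  # one-tailed p-value from |Z|
        print(f"Q{q}: Z = {z:5.2f}, p = {p:.4f}, "
              f"significant at .05: {p < 0.05}, at .10: {p < 0.10}")

Running this reproduces the table's p-values (.0006, .0606, .0985,
.0009, .0985): only questions 1 and 4 clear α = .05, while all five
clear α = .10.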
These results indicate that it may be appropriate to consider the
course assignments of individual instructors. Those who have mostly
taught lower level assignments may deserve some adjustment upward for
questions 1, 3, 4, and 5, but results of question 2 would then be
adjusted downward. It should be noted that three of the instructors
taught only upper division classes, while all others taught a mix, some
weighted toward lower level courses and others toward upper level
courses.
ADMINISTRATIVE ACTIONS
One interesting comparison is between actual administrative actions
and the results of the student evaluations.
Each professor at this institution is evaluated on an annual basis with
the three areas of evaluation being identified as teaching, professional
development, and service. Based on the AACSB Accreditation plan, the
relative weight assigned to teaching is 70% because this is an
undergraduate-only institution. Thus, the teaching component is a large
portion of the overall evaluation.
In years when professors are up for tenure, promotion, or
post-tenure review, administrators have additional information to judge
when evaluating individual instructors as peer reviewers review classes
and write letters of evaluation. In other years, however, the only data
available are the results of the student evaluations.
The actual annual evaluations are not available to the authors of
this study. However, much can be inferred from the information that is
available: annual pay raises and reappointment/promotion/tenure
decisions.
Also, in many cases, individual instructors voluntarily share annual
evaluation information. Table 4 summarizes available information about
the evaluation results based on administrative decisions.
Based on observed data, it appears that administrators have indeed
made decisions that are reasonably consistent with this analysis in
rewarding higher student evaluations even though no effort has ever been
made to test the significance of differences. Instructors 3, 5, and 8
have received annual pay raises considerably lower than average. One of
the instructors has been required to take additional coursework in his
field and one instructor did not receive support for on-time promotion
though the instructor was granted tenure. Support for tenure was mixed
based on both teaching and professional development. All three of these
instructors received student evaluations significantly lower than the
School of Business median.
Instructors 1, 6, 7, and 11 have all received above average pay
raises and all have received tenure and on-time promotions during the
period of this study (or are currently in the process with full support
from administrators). Pay raises have not all been equal, but factors
other than teaching also enter into those decisions.
Instructors 2, 10, and 12 have received average pay raises, consistent
with the mixed results from the student evaluations. However, the
promotion of one instructor was delayed and received mixed support when
awarded. That was not all based on teaching as professional development
also played a role.
Instructors 4 and 9 are interesting cases as both received mixed
ratings on the questions. However, both have received above average pay
raises. In some cases, those raises have exceeded the raises awarded
some of the instructors in the higher rated group. Evidently, in those
cases, administrators weighted some questions more heavily or considered
issues other than teaching evaluations. A cursory review does not reveal
any obvious reason, as professional development activities did not
differ noticeably.
SUMMARY AND CONCLUSIONS
The results of this analysis indicate that significant differences
exist in the student evaluations of the instructors of business courses
at the regional university that provided the data. Although the measured
variation is large for each instructor, differences between instructors
can be supported by the statistically significant differences of medians
shown by the Wilcoxon Rank Sum Test. This analysis gives
administrators a methodology for establishing which instructors have
truly different student evaluations.
One item that has been a concern at the university is also shown to
have some validity: instructors of lower level courses do tend to
receive lower student evaluations than instructors of upper level
courses, except for the question "would like instructor again".
For that question, instructors of lower level classes tend to receive
higher ratings.
One final point that should interest administrators at this university,
which prides itself on supporting undergraduate teaching, is the
comparison of administrative decisions to the analysis of student
evaluations. For the most part, those
instructors with significantly higher evaluations have received more
positive administrative support while those who have significantly lower
evaluations have received less positive administrative support.
REFERENCES
Ambady, N. & R. Rosenthal. (1993). Half a minute: predicting
teacher evaluations from thin slices of nonverbal behavior and physical
attractiveness, Journal of Personality and Social Psychology, 64,
431-441.
Baird, J. S. (1987). Perceived learning in relation to student
evaluations of university instruction, Journal of Educational
Psychology, 79(1), 90-91.
Cashin, W. E. (1995). Student Ratings of Teaching: IDEA Paper No.
32, Manhattan, KS: Kansas State University Center for Faculty Evaluation
and Development.
Greenwald, A.G. (1995). Applying Social Psychology to Reveal a
Major (But Correctable) Flaw in Student Evaluations of Teaching, Paper
presented at the annual meeting of the American Psychological
Association, New York, NY.
Levine, D. M., M.L. Berenson & D. Stephan. (1997). Statistics
for Managers Using Microsoft Excel. Upper Saddle River, NJ:
Prentice-Hall.
Marsh, H. W. & D. Hocevar. (1991). Student ratings of teaching
effectiveness: The stability of mean ratings of the same teachers over a
13-year period, Teaching and Teacher Education, 7(4), 303-314.
Tang, T. L. (1997). Teaching evaluation at a public institution of
higher education: Factors related to the overall teaching effectiveness,
Public Personnel Management, 379-390.
Williams, W. M. & S. Ceci. (1997). How'm I doing? Problems
with student ratings of instructors and courses, Change, 29, 12-24.
W. Royce Caines, Lander University
Mike C. Shurden, Lander University
Joe Cangelosi, University of Central Arkansas
Table 1. Summary Statistics: Student Evaluations of School
of Business Instructors
Question 1--Progress on Relevant Objectives
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 54 58 98 28 52
1 75 81 82 18 24
2 41 39 88 24 60
3 27 23 77 22 81
4 59 64 97 28 48
5 50 54 91 24 48
6 69 69 77 17 24
7 80 90 81 20 25
8 28 22 67 21 73
9 62 64 85 21 35
10 70 74 91 23 33
11 68 73 78 21 31
12 54 53 88 29 53
Question 2--Would Like Instructor Again
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 51 52 98 28 55
1 60 57 72 23 38
2 61 64 80 17 28
3 43 42 88 21 48
4 49 52 88 23 47
5 29 26 69 16 54
6 67 68 78 18 28
7 78 85 75 19 24
8 13 10 47 11 85
9 57 60 76 21 37
10 26 25 62 18 68
11 89 91 39 7 8
12 20 19 33 10 50
Question 3--Improved Attitude Toward Field
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 49 52 97 25 52
1 60 58 74 22 36
2 56 60 87 17 30
3 34 28 85 20 58
4 39 42 82 22 55
5 43 43 87 22 51
6 55 58 79 18 33
7 74 75 82 21 29
8 27 24 77 20 77
9 40 34 83 24 59
10 50 49 91 26 51
11 74 76 76 18 24
12 51 51 78 27 53
Question 4--Overall, I learned a Great Deal In This Course
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 3.9 4.0 3.2 0.5 13.9
1 4.5 4.5 1.1 0.3 6.2
2 3.9 4.0 1.5 0.3 8.7
3 3.3 3.4 2.2 0.5 15.1
4 4.1 4.1 2.1 0.5 11.8
5 3.7 3.7 1.4 0.4 9.8
6 4.0 4.0 1.4 0.3 8.2
7 4.6 4.8 2.0 0.5 10.0
8 3.5 3.5 2.3 0.6 16.6
9 4.2 4.2 1.3 0.3 7.2
10 4.0 3.9 2.2 0.5 12.7
11 4.1 4.1 1.3 0.3 6.9
12 3.7 3.8 1.4 0.4 11.2
Question 5--Overall, I Rate This Instructor An Excellent Instructor
Instructor   Mean   Median   Range   Std. Dev.   Coeff. of Var.
All 4 4.1 3.4 0.65 16.5
1 4.5 4.5 1.5 0.3 7.5
2 4.1 4.1 1.6 0.3 7.6
3 3.5 3.6 2.4 0.6 16.4
4 4.2 4.3 2.2 0.5 11.3
5 3.7 3.7 1.8 0.4 11.2
6 4.2 4.3 1.2 0.3 6.5
7 4.7 4.8 1.6 0.4 7.7
8 3.0 3.0 2.4 0.6 20.4
9 4.3 4.3 1.4 0.3 6.8
10 3.7 3.8 3.1 0.8 20.5
11 4.5 4.6 1.1 0.3 5.9
12 3.3 3.4 1.2 0.4 11.7
Table 2. Comparison of Ratings of Instructors (Wilcoxon Rank Sum Z-Scores)
Instructor     Q1        Q2        Q3        Q4        Q5     Summary
 1           4.99 *    0.46      3.15 *    7.87 *    6.06 *   4 above, 1 ns
 2          -3.37 *    1.86 *    3.04 *   -0.21      0.83     2 above, 1 below, 2 ns
 3          -5.93 *   -3.10 *   -4.31 *   -7.38 *   -5.36 *   5 below
 4           1.10     -0.71     -2.56 *    3.04 *    2.49 *   2 above, 1 below, 2 ns
 5          -4.55 *   -5.05 *   -1.87 *   -3.85 *   -7.54 *   5 below
 6           4.41 *    4.27 *    2.39 *    1.37      2.60 *   4 above, 1 ns
 7           4.91 *    4.37 *    4.79 *    6.64 *    6.40 *   5 above
 8          -5.55 *   -8.49 *   -5.54 *   -5.03 *   -8.36 *   5 below
 9           1.74 *    1.05     -2.36 *    3.71 *    2.50 *   3 above, 1 below, 1 ns
10           4.05 *   -6.63 *    0.44      0.04     -3.17 *   1 above, 2 below, 2 ns
11           3.22 *    9.66 *    6.88 *    1.78 *    6.51 *   5 above
12           0.74     -3.75 *    0.52     -1.46     -3.94 *   2 below, 3 ns
* significant at α = .05
ns = not significant
Table 3. Differences Between Lower Level and Upper Level Courses
QUESTION      1      2      3      4      5
Lower Level
Mean 49 53 47 3.8 3.90
Median 51 56 46 3.9 4.10
Range 98 98 97 3.2 3.40
STDEV 26 30 27 0.52 0.68
CV 54 57 57 14 16.00
Upper Level
Mean 57 50 50 3.98 4.02
Median 62.5 49 52 4.00 4.10
Range 98 98 96 2.90 3.20
STDEV 28.6 27 24 0.55 0.64
CV 50.5 54 48 13.80 16.03
Z-Score -3.25 1.55 -1.29 -3.12 -1.29
p-value 0.0006 0.0606 0.0985 0.0009 0.0985
Table 4. Administrative Actions Affecting Instructors
Instructor Pay Raises Reappointment/Tenure/Promotion
1 Above average Received tenure/on-time promotion
2 Average Previously tenured full professor
3 Below average Previously tenured full professor
4 Above average Previously tenured full professor
5 Below average Tenured/Promotion delayed
6 Above average Received tenure/on-time promotion
7 Above average Received tenure/on-time promotion
8 Below average Previously tenured full professor
9 Above average Received tenure/on-time promotion
10 Average Received tenure/delayed promotion
11 Above average Received tenure/on-time promotion
12 Average Previously tenured full professor