ARE STUDENT RATINGS OF TEACHING EFFECTIVENESS INFLUENCED BY INSTRUCTORS' ENGLISH LANGUAGE PROFICIENCY?
T. Aldrich Finegan and John J. Siegfried [*]
Abstract
We use data from norming the third edition of the Test of Understanding of College Economics to examine whether instructors for whom English is a second language (ESLs) receive lower student ratings of teaching effectiveness in principles of economics courses, holding constant what students learn in the course. The results suggest that student ratings of ESL instructors are, on average, about 0.4 points lower, on a scale of 1.0 to 5.0, than the ratings of native English speaking instructors. Most of this deficit can be attributed to differences in how the two groups of instructors teach their courses and evaluate the knowledge of their students.
Almost all economics departments collect student opinions about the teaching effectiveness of their faculty (Becker and Watts, 1999). Many use these evaluations in making appointment, tenure, and renewal decisions (Becker and Watts, 1999, 346; Walstad and Saunders, 1998, 339).
To the extent that teaching effectiveness is expected to reflect student learning, one might hope that differences in student ratings of instructors accurately measure differences in how much students actually learn. Critics, however, contend that student ratings do not measure certain educational goals well (e.g., critical thinking skills), are influenced by students' expected grades, reflect instructor popularity rather than effectiveness, and may be influenced by teachers' age, sex, or ethnic background (see Becker and Watts, 1999, 344, for a summary of the controversy and references to relevant literature). This paper asks whether student ratings of teacher effectiveness are systematically lower for teachers whose first language is not English, other things equal--and if so, why.
This question assumes greater importance as the proportion of new faculty whose first language is not English steadily increases. Siegfried and Stock (1999) report that the percentage of new recipients of Ph.D.s in economics and agricultural economics who were U.S. citizens fell from 67.3 percent in 1977 to 42.9 percent in 1996. This proportion is likely to be correlated with the proportion for whom English is a second language. Much of the increase in foreign-born instructors has come from East Asian countries, where English is seldom a first language. About half of each cohort of new foreign-born economists secure employment at U.S. or Canadian colleges and universities, where decisions on whether to renew their contracts or award them tenure rely at least in part on student ratings of their teaching.
Three earlier studies have explored this issue. First, Watts and Lynch (1989) found that students of introductory economics at Purdue University during 1984-85 who had graduate student instructors or discussion section leaders for whom English was a second language (ESLs) learned significantly less in each course, other things equal; the study did not examine student ratings of instructors. In a follow-up study using 1989 data from the same school, however, Watts and Bosshardt (1992) found that after screening ESL graduate instructors for proficiency in speaking English and providing remedial instruction for those needing it, the second language gap in student learning disappeared--even though the ESLs received significantly lower student ratings in overall teaching effectiveness.
More recently, Kent Saunders (1999) reported findings similar to Watts and Bosshardt's for graduate student associate instructors of micro and macro principles of economics sections at Indiana University from 1984 to 1990. For a subset of classes taught by regular instructors among those who normed the third edition of the Test of Understanding of College Economics (TUCE III) (Phillip Saunders, 1994), Kent Saunders found inconsistent results.
The present study examines a larger sample of TUCE III norming classes taught only by regular faculty. Our main findings are broadly consistent with those reported by Kent Saunders for associate instructors at Indiana University. We extend the analysis (a) to compare the association between proficiency in spoken English and overall instructor ratings for each group of instructors considered separately, (b) to examine differences in how the two types of instructors teach their courses and evaluate student learning, and (c) to assess how such differences influence student evaluations of overall teaching effectiveness.
I. Data and Research Design
We use data from norming the third edition of the Test of Understanding of College Economics (TUCE III) (Saunders, 1994) to explore whether instructors for whom English is a second language (ESLs) receive lower student ratings of teaching effectiveness in principles of economics courses.
The simple model on which our empirical tests are based can be written as follows:
R = R (E, L, C, P, U)
where R is the instructor's average rating in overall teaching effectiveness, as reported by students in the class;
E is the instructor's proficiency in spoken English, as measured by two variables (see below);
L is an objective measure of how much students have learned in the course (see below);
C is a set of variables controlling for the instructor's gender, the gender composition of the class, class size, the class's GPA in other courses, and the students' expected grade in the class;
P is a set of pedagogical variables that measure how the instructor teaches and assesses learning in introductory economics, his or her years of teaching experience, and how many hours students in the course report studying for it each week (on average);
U represents unobserved characteristics of students and teachers.
We assign the control variables (C) and the pedagogical variables (P) to different categories, because the former involve characteristics of instructors and classes that are likely to be known to some outside observers (e.g., department chairs), while most of the latter characteristics (except for years of teaching experience) are likely to be known only to the instructor and the students in the course.
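For concreteness, the model can be written as a linear class-level regression. The display below is only a sketch; the coefficient symbols are ours, not the paper's.

```latex
% Sketch of a linear specification implied by R = R(E, L, C, P, U).
% The subscript j indexes classes; u_j absorbs the unobserved factors U.
R_j = \alpha + \beta' E_j + \delta L_j + \gamma' C_j + \theta' P_j + u_j
```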
We have two measures of English language skills: (1) a dummy variable, ENGFIRST, equal to one for instructors reporting English as their first (native) language (EFLs) and zero otherwise; and (2) a continuous variable, ENGRATE, the students' ratings of their instructors' ability to speak English. The Pearson correlation coefficient between the two measures across the 154 classes in our sample is 0.86.
Our independent, objective measure of student learning (L) is the difference between the class average score on the TUCE III given at the end of the course and the class average score on the same test administered at the beginning of the course. This indicator captures one important dimension of student learning, namely, an objective measure of value added (VALADD). It is limited, of course, to what a multiple-choice examination (albeit a good one) can measure. There are separate TUCE III examinations for introductory micro and macro courses. Each test contains 30 questions, divided into roughly equal proportions of recognition and understanding questions, explicit applications, and implicit applications. [1]
Unfortunately, only two thirds of the instructors in the sample counted the TUCE component of the final exam toward students' grades. Therefore students' motivation to score well on the post-TUCE may have varied across classes. To deal with this possibility, we partitioned value added between classes that counted the post-TUCE exam and classes that did not, expecting the coefficient on value added to be larger where the post-TUCE exam affected students' grades. Surprisingly, it is not.
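To make the construction concrete, the following sketch (our illustration, not the authors' code) builds class-level VALADD from hypothetical student records and partitions it by whether the post-TUCE counted toward the course grade; the column names are assumptions.

```python
# Illustrative sketch: construct class-level VALADD and partition it by whether
# the post-TUCE counted toward students' grades. Column names (class_id,
# pre_tuce, post_tuce, counted) are hypothetical, not from the TUCE III files.
import pandas as pd

def build_valadd(students: pd.DataFrame) -> pd.DataFrame:
    classes = students.groupby("class_id").agg(
        pre_mean=("pre_tuce", "mean"),
        post_mean=("post_tuce", "mean"),
        counted=("counted", "first"),  # 1 if the post-TUCE counted toward the grade
    )
    classes["VALADD"] = classes["post_mean"] - classes["pre_mean"]
    classes["VALADD_COUNTED"] = classes["VALADD"] * classes["counted"]
    classes["VALADD_NOTCOUNTED"] = classes["VALADD"] * (1 - classes["counted"])
    return classes
```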
The paper proceeds as follows. In Section II we present Versions 1 and 2 of our regression analysis. In Version 1 we regress the class average rating for "overall teaching effectiveness" of each instructor on his or her English language status, the class's mean value added on the TUCE III examination partitioned by whether the post-TUCE did or did not count towards students' grades, along with the set of basic control variables identified above. In Version 2, we add the students' average rating of their instructors' ability to speak English. The reference group for both ratings is "other college instructors."
The rating for proficiency in speaking English is entered separately for EFL and ESL instructors because we observe a different relationship between this proficiency and students' rating of teaching effectiveness for each group. Only one ESL instructor received a substantially higher English speaking proficiency rating (4.39) than did the lowest rated EFL instructor (3.82). There also is a much weaker positive association between ENGRATE and overall instructors' teaching effectiveness rating (RATING) for ESL instructors than for their EFL counterparts. Consequently, the second version of our estimates includes the dummy variable for whether English is the instructor's native language as well as the students' rating of speaking proficiency interacted with both the dummy variable and one minus the dummy variable. This specification allows both the slope and intercept of the teaching proficiency relationship to vary with the instructor's native language. [2] The results show that ESL instructors receive lower overall evaluations and reap smaller gains in student evaluations from greater English proficiency than EFLs. [3]
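As a sketch of how this Version 2 specification could be estimated, the fragment below uses the statsmodels formula interface; RATING, ENGFIRST, ENGRATE, and the VALADD terms follow the paper's variable names, while the remaining control-variable names are hypothetical stand-ins for the basic controls listed above.

```python
# Illustrative sketch (not the authors' code) of the Version 2 regression:
# the ENGRATE slope is allowed to differ by native-language status via interactions.
import statsmodels.formula.api as smf

def fit_version2(classes):
    formula = (
        "RATING ~ ENGFIRST"
        " + I(ENGRATE * ENGFIRST)"          # speaking-proficiency slope for EFLs
        " + I(ENGRATE * (1 - ENGFIRST))"    # speaking-proficiency slope for ESLs
        " + VALADD_COUNTED + VALADD_NOTCOUNTED"
        " + FEMALE_INSTR + PCT_FEMALE + CLASS_SIZE + GPA + EXP_GRADE"  # hypothetical control names
    )
    return smf.ols(formula, data=classes).fit()
```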
Section III presents the results of Version 3, where we add pedagogical variables to the analysis. These variables measure (a) student perceptions of how each instructor teaches his or her course; (b) information on instructors' years of teaching experience, how class time is used, and how student learning is assessed; and (c) student-reported data on how many hours per week (on average) were spent studying for the course. The purpose of the final part of the analysis is to appraise the extent to which differences across instructors in pedagogical approaches and student effort can explain the lower ratings received by ESL instructors observed in Versions 1 and 2. The main findings and conclusions are summarized in Section IV.
Throughout the paper we use observations based on classes rather than individual students. Characteristics of the course and the instructor would, of course, be the same for observations of both kinds. Only for student-specific variables (e.g. student rating of instructor's overall teaching effectiveness or the student's gender) would we observe differences within classes.
We use classes as the unit of observation for several reasons. First, because the main focus of our analysis is on an instructor characteristic--proficiency in speaking English--it seems inappropriate to weight larger classes more heavily than smaller classes. Second, suppose that department chairs tend to assign instructors with weaker language skills to smaller classes. Then, the true relationship between teachers' language skills and students' ratings of teaching effectiveness could be obscured by the large number of observations from large classes if individual students were used as the unit of observation. Third, unobserved characteristics of students may influence their ratings of instructors. Since these unobserved characteristics are likely to vary less across classes than within them, using class averages minimizes the influence of these characteristics on our empirical results. Fourth, using individual student observations risks finding statistically significant results with no substantive meaning because of the "too-large-sample size" phenomenon (Kennedy, 1998, p. 64). Finally, averaging the data downplays the importance of extreme observations, reducing the risk of incorrectly estimating an unknown functional form with a linear specification. [4]
The TUCE III sample contains 93 macro and 96 micro classes taught by 139 different instructors. Because of missing data, our analysis is limited to 72 macro classes taught by 55 different instructors and 82 micro classes taught by 66 different instructors. Seven macro classes were taught by five different instructors for whom English is a second language (ESLs). Seven micro classes were taught by seven different ESL instructors. Because two ESL instructors taught both macro and micro courses, there are ten different ESL instructors in our sample. [5]
The TUCE III survey allowed individual students to rate the teaching effectiveness of their instructor along a continuous scale from 1.0 (lowest) to 5.0 (highest). Their responses were recorded to one decimal place (e.g., 3.5). Since we use class averages of these teaching effectiveness ratings, it is appropriate to use ordinary least squares to estimate the relationships at issue. The mean student rating of the instructors' overall teaching effectiveness was 4.05, with a standard deviation of 0.39, suggesting that censoring of the dependent variable is not a serious problem. Definitions and descriptive statistics for all variables using the combined set of 154 classes are reported in Table 1. Means of these variables partitioned by instructor's first language along with t-values of intergroup differences in means are reported in Table 2.
In this data set, students of native English language instructors rated their instructors significantly higher on both overall teaching effectiveness and ability to speak English than did students of instructors for whom English was a second language. Learning (TUCE value added) was also somewhat greater for students of EFL instructors, but the difference was not statistically significant.
There are five basic control variables. The gender of the instructor (1 = female, 0 = male) allows for the possibility of systematic differences in student ratings of male and female faculty. The gender composition of the class is included to see whether male and female students have systematically different expectations or standards of teaching effectiveness. Students in smaller classes are more likely to be known personally by, and to receive more attention from, the instructor, which may lead to higher student ratings. [6] Class grade point average in all prior courses is included to allow for the possibility that academically stronger students have different expectations of their instructors. Finally, we include the students' average expected grade in the class, as reported by each student at the end of the course but before learning the results of the final examination. If students rate higher those instructors from whom they expect to earn a higher grade, holding constant what they actually learned, the coefficient on expected grade should be positive. [7]
The only basic control variable with significantly different means between EFL and ESL instructors is class GPA (2.92 for EFLs, 2.77 for ESLs). As noted earlier, the TUCE data also include information about instructors' teaching methods, kinds of tests, and years of teaching experience, along with student assessments of their instructors' preparation for class, enthusiasm in teaching, and grading standards. In addition, the data contain students' estimates of how many hours (on average) they spend studying introductory economics outside of class. As Table 2 shows, students judged ESL instructors, on average, to display less enthusiasm about teaching (ENTHUS), to be less well prepared for class (PREP), and to have lower grading standards (RIGOR) than their EFL peers. While the absolute gap between group means for each of these three variables is small (between 0.25 and 0.40 on a five-point scale), each difference is significant at the 99 percent level.
We also find that ESL instructors devoted significantly more class time to lecturing (PCTLECT), relied significantly more on multiple choice tests (PCTMC), and had fewer years of college teaching experience (NYRTEACH). Although the last difference was not statistically significant, we retain NYRTEACH in the Version 3 regressions to control for any independent influence it may have on student ratings.
There is no appreciable difference in the average number of hours that students of ESLs and EFLs spent studying principles (HOURS). We expected that students of ESLs might study more outside of class to compensate for learning less in class, but evidently they do not. We retain this variable to see whether there is any relation across classes between HOURS and student ratings, holding other factors constant. Because this relationship might differ for ESLs and EFLs, we partition it by multiplying HOURS by ENGFIRST (for EFLs) and 1 - ENGFIRST (for ESLs).
It would be reasonable for students to give higher ratings to those instructors they perceive to be better prepared for class and more enthusiastic. How students might respond to more rigorous grading standards, especially after controlling for their expected grade in the course, is uncertain. [8] At least one earlier study (Aigner and Thum, 1986) found that students generally give higher ratings to instructors who rely less heavily on formal lectures and get students more involved in class discussion. More hours spent studying economics could reflect either greater interest in the subject matter, boosting the instructor's rating (if he or she is the stimulator), or a need to study more because little is learned in class, reducing the rating. Accordingly, the expected coefficient on this variable, partitioned by language status of instructor, also is ambiguous. Likewise for years of teaching experience, which can lead to either interest-motivating innovations or deeper ruts.
II. How Language Proficiency Affects Instructor Ratings
The estimated coefficients and corresponding t-ratios for the three OLS regressions are reported in Table 3. [9] The empirical results from the first version indicate that ESL instructors are rated about four tenths of a point (one standard deviation) lower in teaching effectiveness than native English speaking instructors, ceteris paribus. The second version confirms a strong positive relationship between students' perceptions of speaking proficiency and teaching effectiveness for EFL instructors. For ESL instructors, however, the positive association between these two ratings is much weaker (but still statistically significant at the 0.95 level), notwithstanding a larger variance in language proficiency ratings for ESL than for EFL instructors (as shown in Table 2).
For native English speaking instructors, the predicted teaching rating rises by 1.3 points for a one point increase in speaking proficiency rating (or 0.63 standard deviations of RATING for a one standard deviation increase in ENGRATE). By contrast, the predicted gain in overall ratings of ESL instructors from the same one-point increase in speaking proficiency is only about one-third as large (0.4 points).
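The conversion between the raw and standardized effects is straightforward arithmetic. The display below is our own check, using the reported standard deviation of RATING (0.39); the standard deviation of ENGRATE is inferred, not quoted from the paper, and refers to whichever sample the authors' standardization uses.

```latex
% Standardized effect = b * sd(ENGRATE) / sd(RATING).
% With b = 1.3 and sd(RATING) = 0.39, the reported 0.63 implies sd(ENGRATE) of roughly 0.19.
1.3 \cdot \frac{\sigma_{\mathrm{ENGRATE}}}{0.39} \approx 0.63
\;\Longrightarrow\; \sigma_{\mathrm{ENGRATE}} \approx 0.19
```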
One possible interpretation of the stronger association between student ratings of English proficiency and teaching effectiveness for EFLs than ESLs turns on the criteria that students may use to judge the language proficiency of each type of instructor. For native-English-speaking teachers, students' language proficiency ratings may reflect how clearly such teachers express concepts, ideas, and relationships, since technical facility with English is less likely to be a problem for them. For instructors with English as a second language, however, students' language proficiency ratings may reflect their difficulty understanding the words the teacher is speaking. The data suggest considerable variation in technical facility with English across these second language instructors, but there may be only a modest association between technical facility with English and what students really value-- i.e., the ability to explain ideas clearly.
How much confidence we should place in the foregoing results is influenced by the coefficients on the control variables. While a higher average VALADD score in Version 1 significantly increases the instructor's overall rating, the practical payoff in higher ratings from more learning in the course is small. [10] A student in a class in which the post-TUCE counted toward the final grade, and who learned the average amount in the sample, would have rated his or her instructor only 0.15 points higher than a student who had learned nothing in the course (as measured by VALADD). Curiously, the comparable ratings gain in a course where the post-TUCE did not count (0.30 points) was larger. But in Versions 2 and 3 of these regressions VALADD is not significant whether or not the post-TUCE counted.
In Version 1, classes with relatively more female students appear to rate their instructors higher than do classes with relatively fewer women, and female instructors appear to be rated higher than male instructors, ceteris paribus. Again, however, these associations disappear when students' perceptions of their instructors' speaking proficiency are controlled, and their signs even reverse when pedagogical characteristics are controlled in Version 3. [11]
Perhaps because of greater expectations of instructor performance, students with higher grade point averages (GPA) tend to give their instructors lower ratings; but measured in standard deviations, instructor ratings fall only about one quarter to one-half as fast as class GPA rises. At the same time, students who expect to receive higher grades, holding constant (TUCE-improvement) learning and their actual GPA in other courses, rate their instructors more generously. Instructors who create expectations of a mean class grade 0.33 higher than average can expect less than a 0.1 boost to their teacher effectiveness rating. Although the relationship is significant in two of three specifications, it is not large enough to provide much incentive for instructors to relax their grading standards. [12]
III. Explaining the Lower Ratings of ESLs
The specification of our Version 3 regression allows us to assess how much of the lower overall ratings of instructors with English as a second language (ESLs) might be attributable to differences in teaching methods. To the extent that this is so, the results of Versions 1 and 2 will have overstated the influence of student perceived language proficiency per se on the overall ratings of these teachers.
Three main conclusions emerge from the Version 3 regression adding the seven pedagogical variables, as reported in column 4 of Table 3. First, three of these variables--ENTHUS, PREP, and RIGOR--have positive regression coefficients that are significant at the 99 percent level, and a fourth, PCTLECT, has a negative sign that is significant at the 95 percent level. Evidently these characteristics of instructors play an important role in shaping students' overall ratings of instructors, as the adjusted R^2 jumps from 0.50 to 0.81. [13]
Second, although adding the new variables produces little change in the Version 2 regression coefficients for the two VALADD variables and most of the basic control variables, controlling for differences in teaching style and grading standards reduces by roughly two-thirds the regression coefficients for each English language indicator (ENGFIRST, ENGRATE*ENGFIRST, and ENGRATE*[1 - ENGFIRST]). The improvement in instructor's overall student rating associated with a one-point gain (on a 5-point scale) in his or her perceived English speaking proficiency falls from 1.3 to 0.4 points for EFLs, and from 0.4 to about 0.1 points for ESLs. Our earlier regressions apparently impute more importance to differences in English language facility than is warranted. Nonetheless, we continue to observe a larger relative payoff to EFLs from speaking English better.
Third, the Version 3 results suggest that nonnative English speaking instructors could improve their overall student ratings by bringing more enthusiasm, better preparation, more rigorous grading standards, and more interaction with students to the classroom. The following experiments suggest that the returns from such pedagogical changes may exceed the payoff from efforts to improve their English language skills.
First, if we assign to ESLs the mean value of ENGRATE for their EFL counterparts (4.63, versus the actual mean for ESLs of 3.33), but assume no change in any other instructor characteristic, the third regression in Table 3 predicts only a 0.16 improvement in their average overall rating--less than half of the actual RATING gap of 0.39 points. Admittedly, this is a tenuous prediction because the highest observed value of ENGRATE for an ESL in our sample, i.e., 4.39, lies somewhat below the actual EFL mean of 4.63; so we are extrapolating beyond the actual range of observations for ESLs. On the other hand, it is arguably difficult for an ESL who has grown up in another culture to make that large an improvement in English proficiency; so the attainable benefits from this strategy may be overstated by this experiment.
A second experiment uses counterfactual assumptions that require almost no extrapolation. Suppose we use the Version 3 regression equation in Table 3 to predict the expected mean overall rating of ESLs if they had the average scores of EFLs on ENTHUS, PREP, RIGOR, PCTLECT, and PCTMC, but their own (ESL) average ratings on all other variables, including English language proficiency. The result is a predicted RATING of 4.05, compared to actual means of 3.70 and 4.09 for the two groups. The means for EFLs in all but one of the five supplementary variables used in this experiment lie inside the actual range of observed values for ESL instructors. [14] These experiments suggest that ESL instructors could close nearly all of the ratings gap by adapting to the style of teaching and methods of student assessment used by the typical EFL.
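The mechanics of this counterfactual can be sketched as follows (our illustration, not the authors' code); `fit` stands for a fitted Version 3 regression estimated with the statsmodels formula interface, and the data frame and column names are assumptions.

```python
# Illustrative sketch of the second experiment: predict the mean RATING of ESL
# instructors after assigning them the EFL means on the five pedagogical variables,
# while holding all other variables at the ESL means.
import pandas as pd

SWAPPED = ["ENTHUS", "PREP", "RIGOR", "PCTLECT", "PCTMC"]

def counterfactual_esl_rating(fit, classes: pd.DataFrame) -> float:
    esl_means = classes.loc[classes["ENGFIRST"] == 0].mean(numeric_only=True)
    efl_means = classes.loc[classes["ENGFIRST"] == 1].mean(numeric_only=True)
    profile = esl_means.copy()
    profile[SWAPPED] = efl_means[SWAPPED]  # give ESLs the EFL pedagogy profile
    return float(fit.predict(pd.DataFrame([profile])).iloc[0])
```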
The validity of these counterfactual estimates depends on how well the all-instructor regression equation in column (4) of Table 3 predicts the overall ratings of ESL instructors under varying premises about the values of key explanatory variables. Ideally, one would like to have enough observations of these instructors to run a subset regression limited to them--a test that our sample does not allow. A second-best test is to see how well this global regression equation predicts the mean RATING of ESLs based on their own observed means of all independent variables. The predicted mean RATING of 3.71 and the actual mean of 3.70 are virtually identical. [15]
IV. Conclusion
Instructors of classes in introductory economics for whom English is a second language (ESLs) receive significantly lower student ratings, on average, than do other instructors (EFLs). The unadjusted difference is 0.39 points (on a five point scale). This language gap in ratings remains just as large after controlling for student learning, class size, the gender of the instructor, the gender composition of the class, the students' cumulative GPA in previous courses, and students' expected grades in the course. Not only are ESLs judged to be less effective teachers, but they appear to reap a much smaller payoff in higher overall ratings from improving students' perceptions of their spoken English. Since students of ESLs, as a whole, perform about as well on the TUCE exam as do other students, the lower overall rating of ESLs is a source of concern, given the importance of such ratings in personnel decisions, including the awarding of tenure.
We find that the lower overall teaching effectiveness rating of ESL instructors is not attributable primarily to less proficiency in spoken English but, instead, can be accounted for mostly by student perceptions of less class preparation, less enthusiasm for teaching, a less interactive teaching style, looser grading standards, and heavier reliance on multiple choice tests. Consistent with this finding, we predict that ESLs could score a much larger gain in overall instructor ratings from adopting the teaching and testing norms of EFLs than from matching (if they could) the average proficiency of EFLs in spoken English.
Why ESL instructors receive lower overall student ratings might seem to be a question that is peripheral to how much students actually learn: within the TUCE data set we find no statistically significant association between our objective measure of student learning and either the instructor's proficiency in English or any of the pedagogical methods that are significantly related to overall instructor ratings. But TUCE scores are only one measure of how much economics students learn in an introductory course. Further research might reveal that the higher instructor ratings associated with greater fluency in English and student-preferred methods of instruction and evaluation are related to learning of a different kind--such as critical thinking skills--or longer retention of core principles.
(*.) The authors are professors of Economics at Vanderbilt University, Nashville, TN, 37203. We wish to thank Stephen G. Buckles and several anonymous referees for helpful comments and Howard Zhang for exceptional research assistance.
Notes
(1.) For further information on the nature of the TUCE, see Phillip Saunders (1994).
(2.) The more complicated specification of the second version implies that the coefficient on the dummy variable for English as a first language can no longer be interpreted appropriately as the advantage of native English speaking instructors, ceteris paribus. Rather, it shows the difference between the predicted RATING of an EFL with an ENGRATE value of zero and that of an ESL with the same value of ENGRATE, ceteris paribus.
(3.) When one constructs a scattergram of the values of RATING (on the vertical axis) and ENGRATE (on the horizontal axis), the observations for EFLs and ESLs fall in distinctly different clusters. All but three of the EFLs have values of ENGRATE between 4.2 and 4.9, but their RATING values are spread out over a much wider range, between 3.0 and 4.7. When RATING is regressed on ENGRATE using the 140 classes taught by EFLs, we get the following results (t-values are in parentheses):
RATING = -2.15 + 1.35 ENGRATE     R^2 = 0.42
(-3.49)   (10.16)                 d.f. = 138
In contrast, ESLs have values of ENGRATE that stretch from 2.4 to 4.4, while their RATING values fall in a narrower band, 3.2 to 4.0. A similar regression for the 14 ESL-taught classes yields:
RATING = 2.42 + 0.39 ENGRATE      R^2 = 0.48
(6.77)   (3.63)                   d.f. = 12
A Chow test reveals that the two regressions are structurally different at the 99 percent level of confidence (F = 27.2).
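For readers who want the formula behind this comparison, the standard Chow statistic is shown below (our sketch); k is the number of estimated coefficients in each regression (2 here), and n_1 + n_2 = 154 classes.

```latex
% SSR_p is the residual sum of squares from the pooled regression;
% SSR_1 and SSR_2 come from the EFL-only and ESL-only regressions.
F = \frac{(SSR_p - SSR_1 - SSR_2)/k}{(SSR_1 + SSR_2)/(n_1 + n_2 - 2k)}
```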
(4.) Hanushek (1979) describes how aggregation of individual student observations can reduce errors in measurement problems in the estimation of educational production functions.
(5.) This sample includes only students who were present at both the beginning and end of the course. If students do not drop courses randomly, the resulting sample could be affected by selection bias (Becker and Walstad, 1990; Douglas and Sulock, 1995). Students are less likely to drop a course in which they are doing well. If academic achievement is related to the English language skills of the instructor, our findings relating course ratings to instructors' language skills could be biased. Fortunately, we have unpublished data reporting enrollments at the beginning of each course in our sample. Surprisingly, the mean drop rate was a little lower in classes taught by ESL instructors than in classes taught by EFLs, although the difference was not significant at the 95 percent level using a two-tail test.
(6.) It might seem desirable to create another dummy variable to control for a likely negative effect of very large classes on student evaluations. Nineteen of the classes in our sample had enrollments of 80 or more (but only three had over 115), and the mean value of RATING in this subset (3.96) was slightly (and not significantly) lower than that of mid-size classes with enrollments of 31 to 79 (4.07) or that of small classes of 30 or less (4.05). However, all 19 of the large classes were taught by EFL instructors; hence the addition of a large class dummy would create an identification problem. Further, these findings support the idea (mentioned earlier) that department chairs select instructors for large classes based in part on their English language proficiency. If so, the regression coefficient for a large class dummy variable would be biased toward zero.
(7.) One characteristic not controlled for in this analysis is the overall quality of the students in each instructor's school. (TUCE III institutions range from junior colleges to research universities.) In an earlier study using the same TUCE data set (Finegan and Siegfried, 1998), we constructed a measure of school selectivity based on students' SAT and ACT scores and other data. When that measure is added to the regressions reported here, it is never statistically significant, and the regression coefficients for the instructor's language status and proficiency and each significant control variable are changed very little.
(8.) Instructors judged to have more rigorous grading standards would presumably give lower grades, on average, for a given level of student aptitude and effort. That relationship, standing alone, would lead us to expect a negative sign. But, after controlling for the average expected grade in the class, it is less clear what the question on grading standards measures. Perhaps students with a given expected grade feel a greater sense of accomplishment when their instructor is viewed as being a "tough grader." Another possibility is that students believe rigorous grading standards improve horizontal equity in the assignment of grades. The latter two conjectures would lead one to expect a positive sign on the coefficient for RIGOR.
(9.) Because we use class means as observations, one might expect our residuals to be correlated with the number of students in each class. Estimated coefficients are inefficient when such heteroskedasticity is present. It turns out, however, that the residuals are not correlated with the number of students in each class. Nor did White's test indicate evidence of heteroskedasticity problems elsewhere (White, 1980). Thus we report uncorrected OLS estimates.
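These diagnostics could be reproduced along the following lines (our sketch, not the authors' code); `fit` is a fitted class-level OLS result, and CLASS_SIZE is a hypothetical column name for class enrollment.

```python
# Illustrative sketch of the two checks described in this note: (1) whether
# residuals move with class size, and (2) White's general heteroskedasticity test.
import numpy as np
from statsmodels.stats.diagnostic import het_white

def heteroskedasticity_checks(fit, classes):
    size_corr = np.corrcoef(fit.resid, classes["CLASS_SIZE"])[0, 1]  # residuals vs. class size
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(fit.resid, fit.model.exog)
    return size_corr, lm_pvalue
```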
(10.) The absence of a stronger link between learning and overall instructor rating may be caused by misspecification. Students who believe the instructor is ineffective may substitute greater effort in other modes of learning to compensate. They may do well in the course but resent the extra work that the less effective instructor induced them to do, thus weakening the expected positive association between value added and instructor rating. Estimates that excluded value added from the set of explanatory variables produced otherwise similar results, however.
(11.) Using individual students as observations, Kent Saunders (1999) finds no significant association between student gender and instructor rating in either his Indiana University data set or his sample of TUCE classes. Since the influence of student-specific characteristics on instructor ratings is better estimated using individuals as observations, our significant coefficient for CLGENDER in Version 1 may arise from a correlation between the gender composition of classes and an unknown omitted variable.
Saunders' results for the gender of the instructor also differ from ours. Female associate instructors at Indiana received markedly and significantly lower overall ratings than their male counterparts in both micro and macro courses, while the opposite pattern appeared for regular instructors in his TUCE sample of macro classes. He found no significant association between instructors' gender and effectiveness rating in TUCE micro classes.
(12.) In earlier work with a slightly larger sample of 166 classes, we ran separate Version 1 and 2 regressions for micro classes and macro classes. Within each subject, the regression coefficient for ENGFIRST in Version 1 was positive and significant at the 0.99 level (0.538 in micro classes, 0.422 in macro). In Version 2, the regression coefficients for ENGRATE partitioned by English status were larger in macro classes than in micro, but within each subject we found the same pattern as in Table 3--namely, a regression coefficient for ESLs only about one-quarter as large as that for EFLs. More of the control variables were statistically significant in the micro class regressions, but a Chow test could not reject the null hypothesis of no significant difference at the .95 level in the structure of the two subject-specific regressions.
In order to draw more reliable conclusions about the small number of ESL instructors in our sample, we combined the classes for each subject into single regressions for each version using only those 154 classes (out of the original 166) for which data were available for all three specifications. We will provide interested readers with a table showing the subject-specific regressions from Versions 1 and 2.
(13.) The robust positive association between RIGOR and RATING is not the result of having controlled for CLEXGRADE. The simple association between RIGOR and RATING is +0.64 (significant at 99 percent), while the simple association between RIGOR and CLEXGRADE is a trivial -0.02. The correlations among RIGOR, ENTHUS, and PREP are much larger (between 0.54 and 0.71), suggesting that these three characteristics of instructors often go hand in hand. There are no correlations larger than 0.38 between any of the significant pedagogical variables and any of the original independent variables; the largest involve the English language variables, as discussed below.
(14.) The exception is RIGOR, where the mean for EFLs is 3.98 and the highest observed value for an ESL is 3.90.
(15.) This near identity does not imply that a regression equation using only ESL observations would produce the same regression coefficients as those in Table 3; it simply means that inter-group differences in these coefficients would be essentially offsetting in predicting the mean RATING of each group. Considering, however, the dominant role played by the supplementary variables in our last specification, it seems unlikely that it could predict the mean RATING of ESL instructors so accurately if the underlying relationships between RATING and ENTHUS, PREP, RIGOR, and PCTLECT differed greatly across kind of instructor. A rigorous test of this inference requires a larger sample of ESLs.
References
Aigner, Dennis, and Frederick Thum, "On Student Evaluation of Teaching Ability," Journal of Economic Education (Fall 1986), Vol. 17, 243-265.
Becker, William E., Jr. and William B. Walstad, "Data Loss From Pretest to Posttest as a Sample Selection Problem," Review of Economics and Statistics (February 1990), Vol. 72, 184-188.
Becker, William E., Jr. and Michael Watts, "How Departments Evaluate Teaching," American Economic Review (May 1999), Vol. 89, No. 2, 344-349.
Douglas, Stratford and Joseph Sulock, "Estimating Educational Production Functions with Correction for Drops," Journal of Economic Education (Spring 1995), Vol. 26, No. 2, 101-112.
Finegan, T. Aldrich, and John J. Siegfried, "Do Introductory Economics Students Learn More If Their Instructor Has a Ph.D.?" The American Economist (Fall 1998), Vol. 42, No. 2, 34-46.
Hanushek, Eric, "Conceptual and Empirical Issues in the Estimation of Educational Production Functions," Journal of Human Resources (Summer 1979), Vol. 14, No. 3, 351-388.
Kennedy, Peter, A Guide to Econometrics. 4th edn. (Cambridge, Mass.: MIT Press, 1998).
Saunders, Kent, "The Influence of Instructor Native Language on Learning and Instructor Ratings," November 29, 1999, unpublished paper presented at a joint session of the National Council on Economic Education and the National Association of Economic Educators at the annual meetings of the Allied Social Science Associations, Boston, MA, January 8, 2000.
Saunders, Phillip, The TUCE III Data Set: Background Information and File Codes [Documentation, Summary Tables, and Five 3 1/2" Double-Sided, High Density Disks in ASCII format] (New York: National Council on Economic Education, 1994).
Siegfried, John J. and Wendy A. Stock, "The Labor Market for New Ph.D. Economists," Journal of Economic Perspectives (Summer 1999), Vol. 13, No. 3, 115-134.
Walstad, William B. and Phillip Saunders, "Using Student and Faculty Evaluations to Improve Economics Instruction," Chapter 22 in Walstad and Saunders, eds., Teaching Undergraduate Economics: A Handbook for Instructors (New York: Irwin/McGraw-Hill, 1998).
Watts, Michael, and William Bosshardt, "International TAs and Student Time Allocations: Impacts on Learning, Grades, and Student Course and Instructor Evaluations," unpublished paper presented at the January 1992 meetings of the Allied Social Science Associations.
Watts, Michael and Gerald J. Lynch, "The Principles Course Revisited," American Economic Review (May 1989), Vol. 79, No. 2, 236-241.
White, Halbert J., "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica (May 1980), Vol. 48, No. 4, 817-838.