When principals rate teachers: the best--and the worst--stand out.
Lefgren, Lars
Elementary- and secondary-school teachers in the United States traditionally have been compensated according to salary schedules based
solely on experience and education. Concerned that this system makes it
difficult to retain talented teachers and provides few incentives for
them to work to raise student achievement while in the classroom, many
policymakers have proposed merit-pay programs that link teachers'
salaries directly to their apparent impact on student achievement.
Until recently, only a handful of isolated districts had attempted
such programs. Now entire state systems are moving toward merit pay,
with new policies established recently in Florida and Texas requiring
districts to set teachers' salaries based in part on the gains
their students are making on the state's accountability exam.
Implementing a merit-pay system, however, comes with challenges.
Students often have more than one teacher but take only one high-stakes
test. How do we know which teacher to reward? If students are not tested
annually in each subject, how do we determine the merit of a teacher in
a year without testing? How do we fairly assess the impact of a teacher
during a testing year if we do not know how students performed during
the previous school year? Can a merit-pay system overcome these
obstacles?
One option is to turn to principals and ask them to help determine
the size of pay raises. Such subjective performance assessments are
already used to evaluate untenured teachers, and they play a large role
in promotion and compensation decisions in other occupations. While
principals can and do judge teachers' performance, there
is little good evidence on the accuracy of their judgments.
The research reported in this paper fills this gap. We found that
principals in a western school district did a good job of assessing
teachers' effectiveness. In fact, principals are quite good at
identifying those teachers who produce the largest and smallest
standardized achievement gains in their schools (the top and bottom
10-20 percent). They are less able to distinguish among teachers in the
middle of this distribution (the middle 60-80 percent), suggesting that
merit-pay programs that reward or sanction teachers should be based on
evaluations by principals and should be focused on the highest- and
lowest-performing teachers.
A Representative Sample
We surveyed all 13 elementary-school principals in a midsized
school district in the western United States that asked to remain
anonymous. We asked the principals to rate the teachers in their schools on a variety
of performance dimensions. The survey, conducted in February 2003,
provides evaluations by their principals of 202 elementary-school
teachers in grades 2 through 6.
The teachers included in the study are fairly representative of
elementary-school teachers nationwide. Sixteen percent of them are men,
the average age is 42, and average teaching experience is 12 years. Most
of these teachers attended a local university; 10 percent attended
another in-state college; and 6 percent attended a school out of state.
Seventeen percent of them have a master's degree or higher, and
most are licensed in either early childhood education or elementary
education. Finally, 8 percent of the teachers in our sample taught in a
mixed-grade classroom in 2002-03, and 5 percent were in a
"split" classroom, sharing a single contract and dividing the
school day with another teacher. The students in grades 2 through 6 in
the district are predominantly white (73 percent), with a sizable ethnic
minority (Latino students compose 21 percent of the elementary
population); 48 percent of them receive a free or reduced-price lunch.
Achievement levels in the district are almost exactly at the average of
the nation (49th percentile on the Stanford Achievement Test).
All elementary-school students in the district take a set of exams
each year, in reading and math. These multiple-choice,
criterion-referenced tests cover topics that are closely linked to the
district's learning objectives. While student achievement results
have not been linked to rewards or sanctions for schools until recently,
the results of the exams have been distributed to parents annually for
at least the past decade, years before implementation of the No Child
Left Behind law. This latter fact is important because our study relies
on a consistent data set covering the years 1998 through 2003. The
district has not had a merit-pay program for teachers at any time during
this period.
To ensure that we could link student achievement data to the
appropriate teacher, we limited our sample to classroom teachers,
omitting music and gym teachers as well as librarians. We excluded
kindergarten and first-grade teachers because earlier achievement exams
were not available for their students; this prevented us from developing
a "value-added" measure of student learning. We retain in our
analysis the small number of teachers who share a contract, each
teaching only half of the school day. For our analysis, the gains made
by students in these classes count toward the estimated value added of
each of the two teachers.
Can Principals Identify Effective Teachers?
Principals were asked not only to provide a rating of overall
teacher effectiveness, but also to assess, on a scale from one
(inadequate) to ten (exceptional), specific teacher characteristics (ten
altogether), including dedication and work ethic, classroom management,
parent satisfaction, positive relationship with administrators, and
ability to improve math and reading achievement. Principals were assured
that their responses would be completely confidential and would not be
revealed to the teachers or to any other employee of the school
district.
While there was some variation among principals, the overall
assessments they gave teachers were generally quite high, with an
average of 8.1. Only 10 percent of the assessments fell below a 6, and
the average rating for the least-generous principal was still a 6.7. At
the same time, principals did not simply assign similar scores to each
of their teachers. In fact, the principals generally used 5 to 6
different ratings for the teachers in their school.
Because principals differ in the generosity and degree of variation
in the ratings they give, we placed all the ratings on the same scale by
subtracting from each teacher's rating the average rating given by
that teacher's principal and then dividing by the principal's
standard deviation. We did this separately for each specific aspect of
teacher performance about which principals were asked.
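The rescaling described above can be illustrated with a minimal Python sketch. The function name, the tuple layout, and the use of the population standard deviation are assumptions made for illustration; the article does not say which standard-deviation formula was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_principal(ratings):
    """Rescale ratings relative to each principal's own mean and spread.

    ratings: list of (principal_id, teacher_id, rating) tuples (hypothetical layout).
    Returns a dict mapping teacher_id to the standardized rating.
    """
    # Collect every rating each principal gave.
    by_principal = defaultdict(list)
    for principal, _teacher, rating in ratings:
        by_principal[principal].append(rating)

    standardized = {}
    for principal, teacher, rating in ratings:
        group = by_principal[principal]
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population standard deviation
        # A principal who gives everyone the same rating carries no ranking
        # information, so those teachers are placed at zero.
        standardized[teacher] = (rating - mu) / sigma if sigma > 0 else 0.0
    return standardized
```

On this scale, a rating of 8 from a generous principal and a rating of 8 from a strict one are no longer treated as equivalent; each is read relative to what that principal typically gives.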
We compared a principal's assessment of how effective a
teacher is at raising student reading or math achievement, one of the
specific items principals were asked about, with that teacher's
actual ability to do so as measured by their value added, the difference
in student achievement that we can attribute to the teacher. To estimate
the value added by a teacher, we examine the performance of her students
after accounting for a wide variety of student and classroom
characteristics that could affect achievement independent of the
teacher's ability. These characteristics include race, gender,
eligibility for the federal lunch program, limited English proficiency,
and, most important, previous student achievement. We also take
advantage of the availability of data on the same teachers from as far
back as the 1997-98 school year; this enables us to distinguish
long-term teacher quality from the possibly idiosyncratic performance of
a class in any one year.
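To make the idea of value added concrete, here is a deliberately simplified Python sketch. It is not the model used in the study: it controls only for previous achievement, using a one-variable least-squares fit, and then averages each teacher's student residuals. The function name and record layout are hypothetical, and the demographic and classroom controls and the multi-year data described above are omitted.

```python
from collections import defaultdict
from statistics import mean


def simple_value_added(records):
    """Average, per teacher, of student scores net of prior achievement.

    records: list of (teacher_id, prior_score, current_score) tuples
    (hypothetical layout). Returns a dict mapping teacher_id to value added.
    """
    priors = [r[1] for r in records]
    currents = [r[2] for r in records]
    mean_prior, mean_current = mean(priors), mean(currents)

    # One-variable least squares: current = intercept + slope * prior.
    sxx = sum((p - mean_prior) ** 2 for p in priors)
    sxy = sum((p - mean_prior) * (c - mean_current)
              for p, c in zip(priors, currents))
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior

    # A student's residual is how far the actual score sits above or below
    # the score predicted from prior achievement alone.
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))

    return {teacher: mean(values) for teacher, values in residuals.items()}
```

A positive value means a teacher's students scored higher than their earlier achievement alone would predict; a negative value means they scored lower.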
We find a positive correlation between a principal's
assessment of how effective a teacher is at raising student achievement
and that teacher's success in doing so as measured by the
value-added approach: 0.32 for reading and 0.36 for math. These
correlations are based not on a principal's overall rating of the
teacher, but rather on the principal's personal assessment of how
effective the teacher is at "raising student math (or reading)
achievement." Previous studies of evaluations by principals have
used only the overall rating of the teacher, a less direct assessment of
a teacher's ability to raise student performance. Using the overall
rating in that way could compromise the accuracy of subjective
performance evaluations, especially if principals value characteristics
of teachers that are unrelated to their effect on student performance.
Our findings lead us to conclude that principals are able to
accurately identify this dimension of teacher effectiveness.
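For readers who want to see how such a correlation is computed, the following is a minimal Python sketch of the standard Pearson correlation coefficient. The function name and the idea of passing two parallel lists (for example, standardized principal ratings and value-added estimates for the same teachers) are illustrative assumptions; the article does not describe the authors' software.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Co-movement of the two series around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each series around its own mean.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)
```

A value of 1 would mean principals rank teachers exactly as the value-added measure does, and 0 would mean no linear relationship at all.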
Why aren't these correlations even higher? One possible
explanation is that principals focus on the average test scores in a
teacher's classroom rather than on student improvement. There is
some evidence for this conjecture. The correlation between ratings by
principals and the average test scores of a teacher's students is
significantly higher than the correlation between ratings by principals
and the teacher's value-added rating in reading (0.56 versus 0.32),
though not in math.
Another reason could be that principals focus on their most recent
observations of teachers. We do find, for example, that the average
achievement gains in a teacher's classroom in 2002-03 are a modestly
stronger predictor of the principal's rating than the gains in any
previous year. In theory, it is possible that principals are correct in
assuming that a teacher's effectiveness changes over time so that
teachers' most recent experience is the best indicator of their
actual effectiveness. If that were the case, however, we would expect to
find that principals' ratings are more highly correlated with
value-added measures that have been adjusted to account for the fact
that teachers tend to be less effective in their first one or two years
in the classroom. In fact, the correlation between principals'
ratings and experience-adjusted value-added measures is no higher than
the correlation with our baseline value-added measures. The bigger
mistake principals make, it seems, is not adequately accounting for
students' incoming ability.
While informative about principals' overall abilities, a
simple correlation does not tell us whether principals are more or less
effective at identifying teachers at certain points on the ability
distribution. We therefore estimated the percentage of teachers that a
principal can correctly identify in the top group within his or her
school. We found that the teachers identified by principals as being in
the top category were, in fact, in the top category according to the
value-added measures about 52 percent of the time in reading and 69
percent of the time in mathematics. If principals randomly assigned ratings to teachers, we would expect the corresponding probabilities to
be 14 and 26 percent, respectively. This suggests that principals have
considerable ability to identify teachers in the top of the
distribution. The results are similar if one examines principals'
ability to identify teachers in the bottom of the ability distribution.
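The comparison between principals' success rates and what random ratings would produce can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the top group is taken to be the k highest-rated teachers in one school, ties are broken by input order, and the chance rate is the expected overlap when k teachers are picked at random from the school, which equals k divided by the number of teachers. The function and argument names are hypothetical.

```python
def top_group_hit_rate(principal_rating, value_added, k):
    """Share of a principal's top-k teachers who are also top-k by value added.

    principal_rating, value_added: dicts mapping teacher_id to a score, for
    the teachers of a single school (hypothetical layout).
    Returns (hit_rate, chance_rate).
    """
    # sorted() is stable, so tied teachers keep their input order.
    top_by_principal = set(
        sorted(principal_rating, key=principal_rating.get, reverse=True)[:k]
    )
    top_by_value_added = set(
        sorted(value_added, key=value_added.get, reverse=True)[:k]
    )
    hit_rate = len(top_by_principal & top_by_value_added) / k
    # Picking k teachers at random, each true top-k teacher is selected with
    # probability k / n, so the expected share of hits is also k / n.
    chance_rate = k / len(principal_rating)
    return hit_rate, chance_rate
```

Comparing the two returned numbers mirrors the comparison in the text: a hit rate well above the chance rate indicates that the principal's ratings carry real information about who the top teachers are.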
Despite their success with the top and bottom of the distribution,
principals are significantly less successful at distinguishing among
teachers in the middle of the ability distribution. Principals correctly
identify only 49 percent of teachers as being better than the median
teacher in their school in boosting students' reading scores,
relative to the 33 percent that one would expect if principals'
ratings were randomly assigned. Principals appear somewhat better at
distinguishing between teachers in the middle of the distribution in
math (they correctly placed 54 percent of teachers above the median,
compared with the 26 percent expected if ratings were random), but they
again appear to be better at identifying the best and worst teachers.
One might suspect that principals have difficulty distinguishing
between teachers in the middle because the distribution of
teachers' value-added ratings is highly compressed there. However, our
analysis of the data suggests that this is not the case. Teachers who
receive ratings at or close to the median in the school have estimated
value-added measures that are quite widely dispersed.
What Characteristics of Teachers Do Principals Value?
Of course, the effects of moving to a system of compensation based
on assessment by principals depend on the relative importance they place
on a teacher's ability to raise standardized test scores when
making overall assessments of teachers' effectiveness. While such
preferences could theoretically be set by district administrators or
other policymakers, it is likely that principals would retain some
autonomy over personnel decisions, so their preferences are important to
investigate. We therefore compared principals' overall rating of
each teacher with their assessment of various teacher attributes to
examine how principals value different dimensions of quality in
teachers.
Perhaps not surprisingly, teachers' ratings on many (though
not all) of the individual survey items are highly correlated. Based on
the relationships between the questions, we created three groups of
teachers' quality characteristics and reanalyzed the results. The
first group captures what might be described as traditional teaching
ability and includes the ratings of classroom management, organization,
and ability to improve students' test scores. The second, including
the principal's assessments of a teacher's relationship with
colleagues and administrators, measures a teacher's collegiality.
The third measures student satisfaction and includes the
principal's ratings of student satisfaction and the teacher as a
role model.
Ability, collegiality, and student satisfaction all contribute
independently to a principal's overall evaluation of a teacher, but
principals weigh the set of questions measuring teachers' ability
to improve student achievement and to manage a classroom most heavily.
An increase of one standard deviation in a principal's evaluation
of a teacher's management and teaching ability, for example, is
associated with an increase of 0.56 standard deviations in the
principal's overall rating. In comparison, an increase of one
standard deviation in teacher collegiality is associated with an
increase of roughly one-third of a standard deviation in the overall
rating. Meanwhile, teachers scoring one standard deviation
higher in student satisfaction score just 0.15 standard deviations
higher in their overall rating, all else being equal.
Predicting Performance
We should care about the quality of principals' assessments of
teacher quality not just for their reliability in a merit-pay system,
but also for their ability to identify teachers who will continue to
improve student achievement. In order to get a sense of how well
principals' assessments forecast teachers' performance, we
examined how well these assessments predict future student achievement
gains. For our February 2003 survey of principals, that meant evaluating
scores on the spring 2003 tests. We compared the predictive accuracy of
a principal's assessment of teacher effectiveness with the
predictive accuracy of a teacher's value-added rating. We also
measured the accuracy of the traditional determinants of teachers'
salaries, experience and education, in predicting those scores.
Throughout, we accounted for differences in previous student
achievement, student demographics, and classroom characteristics.
Our findings suggest that ratings by principals, both overall
ratings and ratings of a teacher's ability to improve achievement,
effectively predict a student's future achievement gains (see
Figure 1). Students whose teachers receive an overall rating one
standard deviation above the mean are predicted to score roughly 0.06
standard deviations higher in reading than students whose teacher
received an average rating. By way of comparison, students receiving
free or reduced-price lunch in the same district experience achievement
gains approximately 0.16 standard deviations lower than similar students
who are not eligible for such programs. Assignment to a teacher with a
favorable evaluation by her principal appears to be more important for
math performance. An increase of one standard deviation in the
principal's evaluation predicts an increase of 0.14 standard
deviations in math performance, roughly on par with the disadvantage
associated with coming from a low-income family.
Measures of teachers' value added in previous years are an
even better predictor of future gains in students' achievement than
are principal ratings. These results, which are similar for math and
reading, suggest that teachers' impact on student achievement, as
measured by simple value-added measures of teacher effectiveness, remains
fairly stable over time and that principals' ratings effectively
capture a substantial fraction of these stable differences in
teachers' effectiveness.
We do not find any statistically significant relationship between
the number of years a teacher has taught and students' achievement,
though this is probably due to the necessary omission of first-year
teachers (because we cannot measure their value added for a previous
school year). Other studies have found that first-year teachers tend to
perform worse on average than experienced teachers. Education does have
some predictive power. Teachers with advanced degrees have students who
score roughly 0.10 standard deviations higher. We hesitate to say that
education itself is producing these gains, because a teacher's
level of education is likely to be associated with personal
characteristics not accounted for in our analysis, and these may be the
very factors responsible for the improvements in student achievement.
Perhaps our most interesting finding is that the salaries teachers
in this district received in 2002-03 bore no relation at all to their
impact on student achievement. Students with highly paid teachers made
no more progress than those with teachers who had low salaries.
Conclusions
In sum, our results suggest that student achievement (as measured
by standardized test scores) would probably improve more under a system
based on principals' assessments than in systems where compensation
is based solely on education and experience. This is because principals
would be able to identify and reward the very best teachers while, at
the same time, identifying the least competent teachers for remediation
or dismissal.
To the extent that the most important staffing decisions involve
sanctioning incompetent teachers and rewarding the very best teachers, a
principal-based assessment system may affect achievement as positively
as a merit-pay system based solely on student test results. Moreover,
evaluation by the principal could offset some of the potential
negative consequences of test-based accountability systems. If
principals can observe inputs as well as outputs, they may be able to
ensure that teachers increase student achievement through improvements
in pedagogy, classroom management, or curriculum rather than teaching to
the test. Principals can also evaluate teachers on the basis of a
broader spectrum of educational outputs in addition to test scores that
parents may value. At the same time, the inability of principals to
distinguish between a broad middle range of teacher quality suggests
caution in relying on principals for fine-grained performance
determinations, as might be required under certain merit-pay policies.
There are two important caveats to consider when interpreting our results.
First, we conducted our analysis in a context where principals were not
being evaluated on the basis of their ability to identify effective
teachers. It is possible that principals' ability to identify the
best-performing teachers would be enhanced by a school system where the
principals had more responsibility for monitoring teachers'
effectiveness. At the same time, social or political pressures might
make principals less willing to assess teachers honestly if their
judgments directly influenced teachers' compensation. Second, our
analysis focuses on the source of the teacher assessment; we do not
address the type of rewards or sanctions associated with teacher
performance. This is clearly an important dimension of any performance
management system, and one would not expect either a principal-based or
a test-based assessment system to have a substantial impact on student
outcomes unless it were accompanied by meaningful consequences.
Brian Jacob is assistant professor of public policy at the John F.
Kennedy School of Government, Harvard University, and a faculty research
fellow with the National Bureau of Economic Research. Lars Lefgren is
assistant professor of economics, Brigham Young University.
Principal Distinctions (Figure 1)
Principals do a reasonably good job of identifying those teachers who
are better (and worse) at raising student test scores. Not surprisingly,
the best way to predict how effective a teacher will be is to find out
how effective the teacher has been in the past. Differences in teachers'
salaries within a school system are entirely unrelated to teachers'
effectiveness.
Predictors of Teacher Ability to Improve Student Performance

Test-score performance explained by measure
(percent of a standard deviation)

Measure                                  Math    Reading
Teacher's previous performance            21        9
Teacher's overall rating by principal     14        6
Teacher's salary                          No explanatory value
Note: The figure shows the degree to which an increase of one standard
deviation in each variable is related to student achievement in 2003.
Previous performance is measured by the teacher's estimated success in
raising test scores between 1998 and 2002. The analysis controls for
student demographic characteristics, classroom characteristics, fixed
effects for grade and school, and lagged math and reading scores. All
reported effects are significant at the 0.05 level.
SOURCE: Authors' calculations from district's data
Note: Table made from bar graph.