Testing Bayesian updating with the Associated Press Top 25.
Stone, Daniel F.
I. INTRODUCTION
Most studies of Bayesian updating use experimental data. (1)
Although this research has led to many insights into human behavior, it
is inherently subject to a few criticisms. A common one is that in
experimental settings agents lack expertise. Many experiments address
this concern by giving subjects opportunities to practice. Still, the
issue can never be completely mitigated. The intuition real-world agents
gain from years of experience is not replicable in the lab. (2)
This article uses a non-experimental data source--the voter ballots
of the Associated Press (AP) "Top 25" college (American)
football poll--to contribute to the behavioral economics belief updating
literature. (3) The AP poll is a weekly subjective ranking of the top
teams by dozens of experienced journalists. As the rankings are revised
primarily in response to game results, the main signals causing rank
updating are publicly observable. As each team plays at most one game
per week, there is at most one major new signal per team per week. These
features of the institutional setting make it an ideal field context for
analyzing belief updating. Furthermore, the prior and signal
distributions are much richer than those typically used in lab work,
where subjects often use binary random variables to update beliefs on a
single binary state variable. The richness of the rankings data yields
evidence of a variety of behaviors: that the poll voters appear to
Bayesian update in some situations, and respond both excessively
(overreact) and insufficiently (underreact) to new information in other
circumstances. The results both support the idea that even experts in
real-world settings indeed sometimes act in a non-Bayesian way, and
enhance understanding of the underlying causes of the different types of
belief updating behavior.
The article's empirical method can be summarized very briefly
as follows. I first estimate the voters' weekly Bayesian posterior rankings (henceforth the estimated posteriors). I then test for
systematic differences between the estimated posteriors and the
voters' actual posterior rankings (the observed posteriors), and
whether these differences are associated with particular contextual
factors.
Estimating the Bayesian posteriors is the main challenge. It is
unclear how voters "should" update their ranks, as the
instructions the AP provides voters regarding how to rank teams are vague
and intended to allow for subjective interpretation. (4) The estimation is based on a model of Bayesian updating that assumes, for each voter-season, there exists a "true" ranking--the ranking based on
full information on team qualities, performances, and any other factors
considered relevant--and that each voter's weekly goal is to rank
teams as closely as possible to this true ranking. The model thus
implies that to rank teams optimally each week voters should Bayesian
update their beliefs about true ranks. I estimate these Bayesian updated
beliefs by using each voter's final ranks for the season as an
estimator, or proxy, for her true ranks for that season. This allows use
of the empirical relations between final ranks and earlier ranks, and
final ranks and game results, to construct posterior distributions.
These distributions are then translated into the estimated posterior ranks.
The final ranks are a natural proxy for true ranks for two main
reasons. First, final ranks are based on more information than all
earlier ranks from the same season, and so should be the most precise
ranks for the season, from each voter's perspective at least.
Second, using the voters' own final ranks to proxy
"truth" allows each voter-season to have its own definition of
truth. Given the subjective nature of the rankings this seems
preferable, as opposed to imposing a single ideal ranking on all voters
(such as the aggregate final ranks). Still, other plausible proxies are
explored in robustness checks, discussed further below.
It is worth noting this method implies the estimated posteriors can
be interpreted as estimates of how to update toward a voter's own
final ranks as "quickly" as possible. If voters failed to do
this it would indicate they fail to use available information
efficiently, that is, they use this information in a non-Bayesian way.
However, as a voter's own final ranks could be flawed, updating
toward them maximally "quickly" is just a necessary, and not a
sufficient, condition for Bayesian updating. (5) I elaborate on this
issue, and all the details of the empirical method and model, further
below, especially in Section III. In particular, I provide evidence that
rank precision increases within the season.
In Section IV, after first showing evidence of the validity of the
estimated posteriors, I use ordinary least squares regressions to test
the null hypothesis of Bayesian updating. There are a number of results
supporting rejection of the null. Voters do not respond sufficiently to
home status and margin of victory over unranked opponents. (6) The
voters' responses to margin of victory over ranked opponents and
margin of loss against all opponents are closer to the estimated
posterior responses. The latter types of signals, losses and wins over
ranked teams, occur less often and are relatively informative regarding
the teams' final ranks. The results are thus consistent with the
voters being, in a sense, selectively Bayesian: determining their
posteriors in a "more Bayesian" way only in response to
relatively infrequent and informative signals. Voters seem to use a
"win is a win" heuristic the rest of the time. Voters may be
more responsive to characteristics of losses and wins over ranked teams
because those games are more salient, that is, receive more attention
from analysts, players, and coaches, and there are information
processing costs. (7) It is also possible that voters prefer to adjust
prior ranks minimally to preserve reputation or ego, and consequently
ignore evidence that is relatively ambiguous (Sloman, Fernbach, and
Hagmayer 2010).
There are also considerable deviations from Bayesian behavior that
seem to result from subtle, or non-salient, differences in the precision
of prior distributions for teams with different prior ranks. The
estimated priors imply the most highly ranked teams are much better than
moderately ranked teams, while the difference between moderately and
low-ranked teams is small. (8) Thus, voters should have relatively
precise prior beliefs for the ranks of the very best teams. This greater
precision implies that voter responses to losses by top-ranked teams
should be small: the mean estimated rank decline after losses by teams
ranked 1-5 is only 5.7 spots, whereas the mean estimated decline
following losses by teams ranked 6-10 is 8.9 spots. The mean observed
declines are 7.7 and 8.4, respectively. In other words, voters downgrade top 1-5 teams by around 2 (7.7-5.7) spots more than they should after
losses. (9)
In summary, a parsimonious explanation for the different types of
behavior is that voters are more responsive to information that is more
salient and less vague, conditional on actual information content.
Although the salience bias and its effects on belief updating are
recognized in psychology and even the popular press, (10) the
relationship does not seem fully appreciated by the economics literature
on belief updating. Barberis and Thaler (2003) note the relevance of
salience to over/underreaction, (11) but do not cite any studies
directly supporting this idea, although their article is an in-depth
survey. Recent experimental work (Holt and Smith 2009) and theory
(Epstein, Noor, and Sandroni 2010) seems to neglect variation in the
degree of salience among signals to focus on other issues. Recent work
on salience and attention (Chetty, Looney, and Kroft 2009) does not
focus on belief updating. The results from this article help fill this
gap in the literature. The results also support the theory that the
salience bias may sometimes be an optimal heuristic, as more salient
games are relatively informative, although ignoring variation of prior
precision seems strictly suboptimal. (12)
Before proceeding, it should be highlighted that using subjective
beliefs (voters' own final ranks) as a proxy for true values is an
unusual, and perhaps unique, empirical approach. The approach may even
seem internally inconsistent, as it may appear to not make sense that
individuals can make mistakes in updating toward their own beliefs,
especially when the mistakes are identified using data on the behavior
of those same individuals. However, it is certainly possible. A
mathematical example is provided to illustrate this idea in the
Supporting Information. The example also shows how rank precision can
increase through the season despite rank updates being affected by
mistakes. The intuition is essentially that mistakes, if sufficiently
small, largely "wash out," while legitimate rank updates are
more persistent.
The relevant question then is how this approach may bias the
results. It is fairly intuitive that, as alluded to above, the bias
likely would be against rejecting the null. This is most clearly seen
for the extreme case of the final rank updating. Using the final ranks
to proxy truth implies the observed posteriors for the next-to-final
rankings are tautologically Bayesian. So the null cannot possibly be
rejected for the final updating. More generally, the closer the updating
is to the final updating, the more likely it will appear the voters are
acting in a Bayesian way even when they are not. The natural way to
address this problem is to focus on early season updating. The sample
used for analysis is restricted to the first half (7 weeks) of each
season. As rankings do vary considerably in the final half, the issue is
substantially mitigated. That the problem is not too severe is supported
by the validity analysis and the fact that numerous results supporting
rejection of the null are indeed found.
Still, robustness checks are especially in order because of the
unusual framework. In Section IV, I show that results are actually very
similar when the posteriors are estimated using two alternative true
rank proxies: (1) the aggregate AP poll final rankings and (2)
"computer rankings." Both are essentially independent of each
individual voter's ranks, and thus avoid the endogeneity of true
ranks problem. Finally, Section V presents two analyses that completely
relax the main empirical framework, but also yield supportive results.
II. THE DATA
The AP college football poll is conducted once per week during the
college football season and teams play exactly one or zero games per
week. As games are the major signals regarding how the teams should be
ranked, the poll voters observe at most one major signal about each team
per week. (13) The signal probabilities--the empirical distributions of
the scores--should be common knowledge, as the voters have all observed
years of scores.
These two features of the data--the single major signal for each
team between observations of the voters' rankings, and knowledge of
the signal distributions--distinguish the rankings data from most
economic data, and allow the rankings to be used to study Bayesian
updating. In most economic situations there are many important signals,
which arrive erratically, that may affect beliefs. It is difficult to
tell which individuals observe which signals, and even more difficult to
say anything about the subjective probabilities of the signals. (14)
The first AP poll is taken before the season starts in late August
and the final poll occurs after the season ends in early January. The
poll is voted on by 60-65 leading college football journalists from
throughout the United States and different forms of media. Each voter
submits a ranking of the top 25 teams, and the aggregate ranking is
determined by assigning teams 25 points for each first-place vote, 24
for second, and so forth, and summing points by team (a Borda ranking).
The poll began in 1934 but the number of teams ranked by each voter has
changed over time, and has been 25 since 1989. Historically, the poll
has played a part in determining the national championship, but this
role ended in 2005. The individual ballots of the AP poll voters are not
confidential, but historical ones from before 2007 are not published or
even available on the Internet. The AP only makes the current
week's ballots available on its website, which is where I obtained
the 2007 and 2008 ballots. I obtained hard copies of the individual
ballots for the 2006 season directly from Paul Montella and Ralph Russo
of the AP. Historical aggregate polls and score data are widely
available. (15)
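To make the aggregation rule concrete, here is a minimal sketch of the Borda scoring described above (25 points for a first-place vote, 24 for second, and so on), written in Python; the ballot layout and team names are hypothetical.

```python
# A minimal sketch of the AP poll's Borda aggregation: each voter's #1 team
# gets 25 points, #2 gets 24, ..., #25 gets 1 point, and teams are sorted by
# total points. The DataFrame layout is an assumption for illustration.
import pandas as pd

ballots = pd.DataFrame({
    "voter": ["v1", "v1", "v1", "v2", "v2", "v2"],
    "team":  ["Ohio St", "USC", "LSU", "USC", "LSU", "Ohio St"],
    "rank":  [1, 2, 3, 1, 2, 3],
})

ballots["points"] = 26 - ballots["rank"]          # rank 1 -> 25 points, rank 25 -> 1
aggregate = (ballots.groupby("team")["points"]
             .sum()
             .sort_values(ascending=False))       # Borda order
print(aggregate)
```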
The empirical setting does have several weaknesses. First, the
voters do not have direct incentives relating to the quality of their
rankings. This is not too serious an issue, as the voters'
reputations, and thus career concerns, depend on their rankings. For
example, a voter was removed from the 2006 poll after mistaking a win
for a loss, and the voters' weekly ballots are scrutinized
carefully by bloggers and websites. (16) In addition, discussions with
voters indicate that they put substantial effort into producing their
best possible rankings.
There are three more significant weaknesses. The first is that the
voters only rank 25 out of more than 100 teams. The second is that the
data only include rankings, rather than distributions of subjective
probabilities regarding specific variables. The third is that there is
ambiguity regarding the criteria on which the teams are being ranked. I
discuss all of these weaknesses at length in the following section.
III. ESTIMATING THE BAYESIAN POSTERIORS
A. The True Rankings
To quickly restate the main empirical method: I first estimate the
voters' Bayesian posterior ranks (after observing game results;
Week 2 is posterior to Week 1, Week 3 posterior to Week 2, etc.), then
conduct tests on the differences between these estimates and the
observed posteriors. Estimating the Bayesian posteriors requires an
assumption regarding what the voters are updating toward. This is
ambiguous, as the criteria on which the rankings are based are
ambiguous. (17) As discussed in Section I, I assume each voter has a
true ranking for the season--an ideal ranking--which she attempts to
update toward throughout the season. I use each voter's final
ranking as an estimate of her true ranking, as the final ranks
incorporate both voter-specific rank criteria and should be the most
precise ranks for the season.
The estimated posteriors are thus essentially estimates of how to
update toward own final ranks efficiently. Consequently, the estimated
posteriors can really only be used to assess whether voters satisfy the
necessary condition for Bayesian updating that future changes in ranks
are not predictable using current information. If this condition is
violated, this would imply voter updating is not fully Bayesian. To
illustrate with an extreme example, if voters always rank teams that win
10 games in the final top 10, and teams that win their first game by 50
points always win 10 games, then non-top 10 teams that win their first
game by 50 should be ranked in the top 10 right away (i.e., in the Week
2 ranking). If voters failed to do this, it would indicate they do not
understand the informativeness of the first signal, that is, do not use
it efficiently.
However, efficient updating toward own final ranks is not a
sufficient condition for Bayesian updating. To again illustrate by
example, if voters simply held their ranks constant all season, then the
estimated posteriors would also be constant, as it would appear that the
prior ranks are equal to the true ranks, making the game results
completely uninformative. This would cause the estimated posteriors to
equal the observed posteriors, and the null of Bayesian updating would
not be rejected. Clearly, however, the rationality of this behavior
would be highly suspect. While it is good to be aware of this issue,
it would only be potentially problematic if the results largely failed
to support rejection of the null. Since, as discussed above, this is not
the case, the issue is not too concerning.
To examine the validity of the empirical method further, it is
worth discussing how exactly the voters actually determine their ranks.
One plausible criterion voters may use is subjective assessment of
season-long performance. If voters rank teams this way and are not aware
of possible mistakes they make in rank updating, then they will think of
their final ranks as literally their true ranks. If voters did make
mistakes and were aware of them, it would be natural to expect voters to
correct for the mistakes to the extent possible, and still think of
their final ranks as the most informed ranks for the season. The other
most plausible criterion is current team quality. (18) While quality may
change throughout the season, if quality is constant then final ranks
would be (at least in the voters' eyes) the most precise estimators
of true ranks throughout the season, since they incorporate maximal
information. Thus, if quality is
constant, whether voters rank teams on "season-long
performance" or "quality," it seems reasonable to use the
final ranks to proxy the true ranks. (19)
Table 1 presents evidence that ranks do indeed become more precise
throughout the season. The significant positive interaction term
indicates the marginal effect of the rank advantage of the favorite (the
team with ex ante better rank) on the probability of it winning the
game, and on its score advantage, is greater later in the season than
early. The results from the simplest model, presented in column 1,
indicate when the favorite has a 20 rank advantage, this increases the
favorite's chance of winning by only 10% in the early season (as
compared to having no rank advantage), but yields a 25% increase in the
late season. Moreover, the finding of Logan (2010) that rank responses
to losses decline throughout the season is consistent with voters
thinking their rank precision improves. Another advantage of using the
voters' own final ranks to proxy truth is they incorporate regional
or other voter-specific biases that are constant throughout a season.
(20)
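The kind of specification behind Table 1 can be sketched as follows; the code is illustrative only, the data file and column names are hypothetical, and the article's actual models (for example, with the score advantage as the outcome) may differ.

```python
# A hedged sketch of a Table 1-style model: a linear probability model of the
# favorite winning on its rank advantage, a late-season dummy, and their
# interaction. Variable names and the data file are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

games = pd.read_csv("games.csv")                  # hypothetical: one row per game
games["late"] = (games["week"] > 7).astype(int)   # second half of the season
res = smf.ols("fav_won ~ rank_adv + late + rank_adv:late", data=games).fit(cov_type="HC1")
print(res.summary())
# A positive coefficient on rank_adv:late indicates the favorite's rank
# advantage predicts winning more strongly late in the season, i.e., ranks
# become more precise as the season progresses.
```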
It seems the main way the final-ranks-as-true-ranks proxy could be
problematic is if team qualities change substantially within seasons and
voters rank teams on current quality.
Then it would be difficult to distinguish changes in ranks that occur
during a season because of learning from those that are caused by
changes in quality. I examine this possibility by testing the hypothesis
that average score differences for teams of different final ranks are
constant throughout the season. (21) If voters rank teams on current
quality, and quality changes throughout the season, then teams highly
ranked in the final poll would have better performances in the later
part of the season, on average. This is because teams highly ranked in
the final poll would improve on average throughout the year, and teams
ranked poorly in the final poll worsen. See the Supporting Information
for a theoretical illustration of this phenomenon.
Table 2 presents empirical evidence that this issue is not a
problem. The table indicates that rankings are either based primarily on
season-long performance, or that team qualities do not change
significantly within seasons, perhaps due to the lack of a hot hand at
the team level (Camerer 1989). While home teams of final rank 1-12 do
beat teams of final rank 13-25 by a greater margin in the late season,
home teams of final rank 13-25 also perform better in later months
versus superior teams of rank 1-12. These results essentially offset each other, as they point in opposite directions. Neither of the other
results (for games between ranked teams and unranked teams that were
ranked in the final poll in one of previous two seasons) indicate
well-ranked teams' performances improved throughout the seasons.
Thus there is little reason to lose confidence in the null, that score
differences are uncorrelated with time.
A few further remarks are in order regarding the true ranks proxy.
As discussed in Section I, the sample is restricted to the first half of
each season to minimize the bias against rejecting the null caused by
the proxy; this restriction should also reduce the chance of voters
committing the hot-hand fallacy, that is, falsely inferring trend in
team quality changes. (22) Another potential problem with using voter
final ranks to proxy truth is it implies voters have no one to please
but themselves; their objective functions do not depend in any way on
others' perceptions of the accuracy of their rankings. (23) Also,
the voters may think the aggregated final ranks contain more
information than their individual final ranks. Hence, an alternative
proxy for true ranks that accounts for these issues is the aggregate
final ranks. I conduct the analysis using this alternative proxy, and a
version using the well-known Sagarin computer rankings as the true ranks
proxy, to estimate the voter prior distributions. These are rankings
calculated based only on game results and strength of schedule, so they
completely eliminate the endogeneity issue.
Figure 1 shows an illustration of the estimated priors given the
true ranks proxy: the final rank distributions conditional on prior rank
group, split out by teams that win and lose their immediate subsequent
game. The figure shows interesting variation in the precision of prior
distributions across prior rank. The distribution for top 1-5 teams
after losses dominates the distribution for top 6-10 teams after wins
(and losses). That is, even after losses, top 1-5 teams still have
higher probabilities of finishing in the top 10 and top 11-25 than top
6-10 teams do after they win. This implies that the priors for top 1-5
teams are quite strong, or precise--these teams are considerably better
than the next best group of teams. On the other hand, the distribution
for top 11-15 after losses is dominated by the distributions of both top
16-20 and 21-25 teams after wins. Moreover, the distributions for all
three of these rank groups after wins are fairly similar. This indicates
there is little distinguishing teams in the bottom 15 of the rankings,
that is, the priors for these teams are relatively imprecise. This
finding will be important for understanding the results from the main
analysis presented in Section IV.
[FIGURE 1 OMITTED]
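The Figure 1 objects are conditional distributions of final-rank groups given prior-rank group and the next game's outcome. A minimal sketch of how such a tabulation could be computed is below; the data file and column names are hypothetical.

```python
# Tabulate P(final-rank group | prior-rank group, win/loss), the objects
# plotted in Figure 1. Column names ("prior_rank", "final_rank", "won") are
# assumptions for illustration.
import pandas as pd

df = pd.read_csv("voter_team_weeks.csv")  # hypothetical team-week panel

prior_grp = pd.cut(df["prior_rank"], bins=[0, 5, 10, 15, 20, 25],
                   labels=["1-5", "6-10", "11-15", "16-20", "21-25"])
final_grp = pd.cut(df["final_rank"], bins=[0, 10, 25, 200],
                   labels=["top 10", "11-25", "unranked"])

# Row-normalized shares by prior group and result of the next game
priors = pd.crosstab([prior_grp, df["won"]], final_grp, normalize="index")
print(priors)
```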
B. Formal Framework
This subsection specifies a model of Bayesian rank updating used as
a framework for estimating the Bayesian posterior ranks. Let
$r^v_i$ denote the true rank of team i for voter v (in a particular season; season index suppressed), with $i \in \{1, \ldots, N\}$, $r^v_i \in \{1, \ldots, N\}$, and $v \in \{1, \ldots, V\}$, in which N is the total number of teams and V is the number of voters. Let $\hat{r}^v_{i,t}$ be the rank that voter v assigns to team i in week t, in which $\hat{r}^v_{i,t} \in \{1, \ldots, 25, \text{unranked}\}$, because the voters can only rank 25 teams, and $t \in \{1, 2, \ldots, T\}$, with T denoting the final week of the season. Thus, $\hat{r}^v_{i,t}$ is the observed rank of team i by v in week t. The voter index v is suppressed in the remainder of this subsection as it is unnecessary.
It is convenient to assume each voter's objective in each week $t \in \{1, 2, \ldots, T\}$ is to minimize the expectation of a quadratic loss function of current and true ranks: $E_t[\sum_{i=1}^{N} (\hat{r}_{i,t} - r_i)^2]$. For this purpose $\hat{r}_{i,t}$ can be equal to any number greater than 25 if in fact $\hat{r}_{i,t} = \text{unranked}$. If $\hat{r}_{i,t}$ were a continuous variable, clearly it would be optimal for voters to set $\hat{r}_{i,t} = E_t(r_i)$ for all i and t. They cannot do this though, because they have to assign the discrete ranks of 1 through 25 to 25 different teams. However, optimal behavior in the discrete case is similar to that of the continuous case, as shown in the following proposition:
PROPOSITION 1. For each $t \in \{1, 2, \ldots, T\}$, the loss function is minimized by ranking teams as follows:
(1) $E_t(r_i) > E_t(r_j) \Rightarrow \hat{r}_{i,t} \geq \hat{r}_{j,t}$.
Proof: Suppose not. Then there exist x, y, t such that $E_t(r_x) > E_t(r_y)$, and $\hat{r}_{x,t} < \hat{r}_{y,t}$ minimizes the loss function. The loss function can be written as $E_t[(\hat{r}_{x,t} - r_x)^2 + (\hat{r}_{y,t} - r_y)^2] + E_t[\sum_{i \neq x, y} (\hat{r}_{i,t} - r_i)^2]$. As $\hat{r}_{x,t}$ and $\hat{r}_{y,t}$ minimize the loss function, it must be true that
(2) $E_t[(\hat{r}_{x,t} - r_x)^2 + (\hat{r}_{y,t} - r_y)^2] \leq E_t[(\hat{r}_{y,t} - r_x)^2 + (\hat{r}_{x,t} - r_y)^2]$, which simplifies to $(\hat{r}_{y,t} - \hat{r}_{x,t})[E_t(r_x) - E_t(r_y)] \leq 0$,
which is a contradiction. This proves the result.
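A small brute-force check of Proposition 1 can also be run numerically; the sketch below uses toy expected ranks and confirms that the loss-minimizing assignment coincides with ranking teams by expected true rank (expected quadratic loss differs from the objective only by a variance term that does not depend on the assignment).

```python
# Brute-force check of Proposition 1 on toy numbers: assigning discrete ranks
# in order of expected true rank minimizes the expected quadratic loss.
from itertools import permutations
import numpy as np

exp_rank = np.array([2.3, 1.1, 4.0, 3.2])   # E_t(r_i) for four hypothetical teams
ranks = range(1, 5)                         # discrete ranks to assign

def loss(assignment):
    # Expected quadratic loss up to a constant that does not depend on the
    # assignment: sum_i (rhat_i - E_t[r_i])^2
    return np.sum((np.asarray(assignment) - exp_rank) ** 2)

best = min(permutations(ranks), key=loss)
by_expectation = tuple(int(x) for x in exp_rank.argsort().argsort() + 1)
print(best, by_expectation)  # both are (2, 1, 4, 3): the assignments coincide
```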
Thus, the objective function implies teams should be ranked in
order of expected final rank. To be clear, this objective function, and
Proposition 1, are just tools for translating subjective probability
distributions into ranks. The objective function may appear odd given
the true ranks proxy. If one were to interpret the proxy literally--that
the final ranks are literally true ranks--it would imply voters could
strategically manipulate their final ranks so as to optimize the
function. But I do not claim final ranks are literal truth; a
voter's final ranks are just her best guess at truth, in the same
way earlier ranks are the best guess at the time. Voters may rank teams
this way because of intrinsic incentives or (as discussed above) career
concerns incentives to rank teams accurately, which likely include the
incentive to avoid being accused of ranking manipulation. In fact, the
possibility of manipulation may explain why the AP does not instruct voters to rank teams based on their expected final ranks--if the AP made
this explicit, it could increase suspicion of manipulation. I should be
clear though that this framework is not claimed to perfectly capture
voter behavior and mindsets. It is intended just to be a coherent model
of their behavior that makes the subsequent empirical analysis
tractable.
Voters thus determine their optimal weekly ranks by updating their
beliefs about the true ranks using as much information as possible.
Attention is restricted to game result information, because this is the
information that is usually most important and is always publicly
observed. Let $s_{ij}$ be the points scored by home team i minus the points scored by away team j (i wins if and only if $s_{ij} > 0$). This variable has no time subscript because teams almost never play each other more than once.
Let $g(s_{ij} | r_i, r_j)$ be the conditional probability that the game between teams with true ranks $r_i$ and $r_j$ results in score $s_{ij}$ (the conditional signal probability). Let $f_{i,t}(r_i)$ be the subjective probability that team i has true rank $r_i$ in week t. ($f(\cdot)$ is the prior; $r_i$ is only indexed by i for clarity in the Bayesian updating formula below.)
After team i plays j, $s_{ij}$ is observed and voters can update their beliefs to $f_{i,t+1}(r_i | s_{ij})$ and $f_{j,t+1}(r_j | s_{ij})$. Technically, if beliefs about team i's rank change, beliefs about at least one other team's rank also must change. That is, for all $k \neq i, j$, the voters update $f_{k,t+1}(r | s_{ij})$. However, because these effects are minimal I ignore them. Similarly, I make the simplifying assumption that $f_{i,t}(r_i | r_j) = f_{i,t}(r_i)$ for all $j \neq i$.
Voters know $g(s_{ij} | r_i, r_j)$ from their observation of years of historical scores and true rankings. Voters can thus use a fairly straightforward application of Bayes' rule to update beliefs. For example, suppose the team indexed 10 hosts a game against team 11 and we are interested in the posterior probability that team 10 has true rank 1: $f_{10,t+1}(1 | s_{10,11})$. Using Bayes' rule, this is equal to the probability of $s_{10,11}$ given $r_{10} = 1$, $g(s_{10,11} | r_{10} = 1)$, times the prior that team 10 has true rank 1, $f_{10,t}(1)$, divided by the unconditional score probability, $g(s_{10,11})$. The first $g(\cdot)$ term depends on beliefs about the true rank of team 11, and the second depends on beliefs about the true ranks of both teams 10 and 11, specifically $g(s_{10,11}) = \sum_{r_{10}} \sum_{r_{11}} g(s_{10,11} | r_{10}, r_{11}) f_{10,t}(r_{10}) f_{11,t}(r_{11})$.
In general, the formula for Bayesian belief updating is:
(3) $f_{i,t+1}(r_i | s_{ij}) = \dfrac{\left[\sum_{r_j} g(s_{ij} | r_i, r_j) f_{j,t}(r_j)\right] f_{i,t}(r_i)}{\sum_{r'_i} \sum_{r_j} g(s_{ij} | r'_i, r_j) f_{i,t}(r'_i) f_{j,t}(r_j)}$.
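For concreteness, a minimal numerical sketch of Equation (3) on a coarsened rank support is given below; the prior vectors and the g table are hypothetical, and the signal space is collapsed to three outcomes purely for illustration.

```python
# A minimal numerical sketch of Equation (3), under the simplifying
# assumptions in the text (independent priors for the two teams, coarsened
# supports). All numbers are hypothetical.
import numpy as np

# Coarsened rank support: 0 = "top 10", 1 = "11-25", 2 = "unranked"
f_i = np.array([0.6, 0.3, 0.1])   # prior for home team i, f_{i,t}(r_i)
f_j = np.array([0.1, 0.4, 0.5])   # prior for away team j, f_{j,t}(r_j)

# g[s, r_i, r_j]: probability of signal s (0 = i wins big, 1 = i wins close,
# 2 = i loses) given the teams' true rank groups; hypothetical values that
# sum to one over s for each (r_i, r_j) pair.
g = np.array([
    [[0.5, 0.7, 0.8], [0.2, 0.5, 0.7], [0.1, 0.2, 0.5]],   # s = 0
    [[0.3, 0.2, 0.15], [0.4, 0.3, 0.2], [0.2, 0.3, 0.3]],  # s = 1
    [[0.2, 0.1, 0.05], [0.4, 0.2, 0.1], [0.7, 0.5, 0.2]],  # s = 2
])

s = 1  # observed signal: i wins a close game

# Numerator of Equation (3): [sum_{r_j} g(s | r_i, r_j) f_j(r_j)] * f_i(r_i)
num = (g[s] @ f_j) * f_i
post_i = num / num.sum()          # denominator = unconditional signal probability
print(post_i)                     # posterior f_{i,t+1}(r_i | s)
```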
C. Estimation Methodology
The estimated Bayesian posteriors are constructed using Equation
(3) and Proposition 1 to translate these distributions into rankings.
The procedure requires estimates of both of the components of Equation
(3), the f's and g's. The Supporting Information provides a
detailed discussion of how these are obtained; the remainder of this
subsection may be skipped without loss of continuity as well. To
summarize, I use empirical frequencies to estimate both sets of
distributions, but am forced to make several methodological compromises
because of data limitations. First, I coarsen the support for both the
f's and g's; second, I condition on aggregate final rank for
the g's; third, I assume all voters have the same f for each prior
rank; fourth, I use the 2006-2008 data to estimate the f's (the
same data that the analysis is conducted on). The first and third are
approximations and should not introduce any systematic bias. The second
may cause the estimated g's to be too "tight"; if so,
this would cause the signals to appear too informative, and the
estimates to move too far from the priors. This would bias the results
toward findings of underreaction. However, because it turns out the
aggregate and individual rankings are very similar in the final poll (as
a robustness check shows) this bias should be minimal. (24) The fourth
issue implies future data are used to estimate current priors, so the
estimated priors are based on information obviously unavailable to the
voters. This method should not be problematic if the priors are stable
from year to year. Then future final rank frequencies would be unbiased
estimates of past frequencies, which voters do observe, and I do not.
Still I check robustness by conducting a separate analysis on the 2008
data only with priors estimated from the 2006-2007 data. The robustness
checks reported in Section V also are not subject to this issue.
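A hedged sketch of the empirical-frequency estimation of the g's, with a coarsened score support, might look as follows; the data file, column names, and bin edges are assumptions, and the article's actual procedure (detailed in the Supporting Information) is more involved.

```python
# Estimate the g(.) distributions from empirical frequencies on a coarsened
# score support. Column names, bins, and the conditioning variables are
# assumptions for illustration.
import pandas as pd

games = pd.read_csv("historical_games.csv")  # hypothetical: one row per game

# Coarsen the score margin (home points minus away points) into bins
games["s_bin"] = pd.cut(games["margin"],
                        bins=[-100, -14, 0, 7, 14, 100],
                        labels=["lose_big", "lose", "win_close",
                                "win_mid", "win_big"])

# Empirical signal frequencies conditional on the two teams' rank groups
g_hat = (games.groupby(["home_rank_grp", "away_rank_grp"])["s_bin"]
              .value_counts(normalize=True)
              .rename("g_hat"))
print(g_hat.head(10))
```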
The other issue that needs to be addressed before constructing the
estimates is the limited number of teams that are actually ranked (the
fact that the voters only rank 25 out of approximately 120 Division I-A
teams). As most games are between ranked and unranked teams, some
objective method of distinguishing among unranked teams is needed. I use
three publicly observable variables to do this: (1) currently ranked by
at least one other voter, (2) ranked by at least one voter in final AP
poll in one of previous two seasons, and (3) ranked by at least one
voter in final AP poll in one of previous three to five seasons. I also
condition on YTD number of losses (0 vs. >0 in Weeks 1-3; 0-1 vs.
> 1 in Weeks 4+) for teams not currently receiving votes from another
voter. This expands the cardinality of the set of possible values of $r_{i,t}$ to 32, in which $r_{i,t} = 26$ means team i receives at least one vote from others in week t, $r_{i,t} = 27$ means team i does not receive any votes but was ranked in one of the two previous seasons and has zero losses, and so forth.
This method of distinguishing among unranked teams is not
sufficient for accurately estimating posterior beliefs for unranked
teams. Consequently, I only estimate posterior beliefs and rankings for
teams that are currently ranked. This forces a need to account for the
fact that several teams do indeed drop from the rankings for most voters
from 1 week to the next. I do this by restricting the maximum (worst)
estimated posterior rank to one greater than the number of teams that
are observed to stay in the poll, by voter week. I also re-rank observed
posteriors among teams that were in the prior poll, and assign the same
maximum rank (one greater than the number of teams that stay in the
voter week's top 25) to teams that drop from the empirical
rankings. This allows comparisons between Bayesian and observed
posteriors to be apples-to-apples, unconfounded by teams entering the
polls at various rank levels. (25) Finally, because teams are ranked in
order of expected rank I use the value 35 for expected rank conditional
on being unranked. (26)
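The rank-construction step can be sketched as follows, assuming expected final ranks have already been computed; the inputs and the coding of dropped teams are hypothetical.

```python
# Translate posterior beliefs into ranks by expected final rank (Proposition 1),
# cap the worst estimated posterior rank at one more than the number of teams
# staying in the voter's poll, and assign that same maximum rank to teams that
# drop from the observed poll. Toy inputs only.
import numpy as np

exp_final_rank = np.array([3.1, 8.4, 35.0, 12.2, 6.0, 28.0])  # E[r_i | s]; 35 = unranked value
stayed_in_poll = np.array([True, True, False, True, True, False])

max_rank = stayed_in_poll.sum() + 1                 # one worse than # of teams staying

est_posterior = exp_final_rank.argsort().argsort() + 1   # rank by expected final rank
est_posterior = np.minimum(est_posterior, max_rank)      # cap at the maximum rank

obs_posterior = np.array([2, 5, 0, 4, 3, 0])              # 0 = dropped (hypothetical coding)
obs_posterior = np.where(stayed_in_poll, obs_posterior, max_rank)
print(est_posterior, obs_posterior, max_rank)
```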
IV. ANALYSIS
A. Validity of the Estimated Posteriors
The validity of the estimation procedure and results can be
assessed before using them to test over/underreaction, as a natural
measure of the posteriors' accuracy--their distance from the final
rankings--is observable. As the estimated posteriors are conditioned on
limited information, if they are at least as close as the observed
posteriors, this would be strong evidence the estimated posteriors use
the information they are conditioned on efficiently, as compared to how
the observed posteriors use the same information. That is, it would be
evidence the estimated posteriors are reasonably unbiased and precise
(valid) estimates of the Bayesian responses to the information they
incorporate. The distances for observed priors and flat priors are also
reported for comparison. The observed priors are the voters'
rankings prior to the game results, and the flat priors are equal prior
rankings for all teams (the average rank).
Distance is measured using mean absolute deviation (MAD): $\frac{1}{n} \sum_{i,t,v} |\hat{r}^v_{i,t+1} - r^v_i|$; n is the number of observations (results are similar using other metrics). (27) Table 3 presents summary statistics, split
out by game result type. The estimated posteriors are actually closer to
the final ranks than the observed posteriors for each type of game
results. In addition, it appears voters at least approximate Bayesian
methods as the observed posteriors are slightly more accurate than the
observed priors, and observed priors are substantially more accurate
than the flat priors.
Because observations are correlated across voters by game, I
formally test the differences in MADs separately by voter, using paired
t tests. The MADs for the estimates are lower for 94 of the 121 voters.
This means the estimates predict the voters' own final ranks better than the
observed posteriors do for a large majority of voters. The null can be rejected in favor of the
estimated posteriors being more accurate at the 5% level for 31.4% of
the voters. The null is not rejected at 5% in favor of the observed
posteriors being more accurate for any voters. Given the limited
information the estimates are conditioned on, this is strong evidence
the estimates use the information they do incorporate efficiently,
supporting the estimates' validity.
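A sketch of this validity check, computing voter-level MADs and paired t tests, is given below; the data file and column names are assumptions.

```python
# Compute absolute deviations from the final ranks for the estimated and
# observed posteriors, then run a paired t test voter by voter. Column names
# are assumptions for illustration.
import pandas as pd
from scipy import stats

df = pd.read_csv("posteriors.csv")  # hypothetical voter-team-week panel

df["ad_est"] = (df["est_posterior"] - df["final_rank"]).abs()
df["ad_obs"] = (df["obs_posterior"] - df["final_rank"]).abs()

results = []
for voter, sub in df.groupby("voter"):
    t, p = stats.ttest_rel(sub["ad_est"], sub["ad_obs"])
    results.append({"voter": voter,
                    "mad_est": sub["ad_est"].mean(),
                    "mad_obs": sub["ad_obs"].mean(),
                    "t": t, "p": p})
results = pd.DataFrame(results)
print((results["mad_est"] < results["mad_obs"]).sum(),
      "voters with lower MAD for the estimates")
```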
B. Hypothesis Testing
Summary statistics for the estimated and observed responses to game
results are presented in Table 4, categorized by the most basic game
result types--win and loss (for simplicity I ignore byes)--and broken
out by prior rank categories. The overall mean estimated and observed
rank changes are similar, implying that on average the voters do not
particularly under- or overreact. However, there are some stark
contrasts within some of the specific rank groups. In particular, the
estimated and observed responses to wins are very different for teams
ranked 21-25; the estimated improvement is 4.9 spots while the observed
improvement is only 2.66 spots. The responses to losses are most
different for the top five teams; the mean estimated rank decline for
them is 5.68 spots, while the mean observed decline is 7.66 spots. These
statistics indicate the voters underreact to wins by low-ranked teams,
and overreact to losses by top-ranked teams. However, this does not
account for potentially confounding factors, or the correlation in
observations in the aggregated sample (the repetition of games across
voters).
To control for these things and estimate the determinants of
over/underreaction, I construct a simple measure of overreaction, and
regress it on a vector of covariates, with standard errors clustered by
game. The measure of overreaction is intended to measure excess rank
improvement following positive information and excess rank decline
following negative information. Underreaction is the opposite. Thus,
under/overreaction are not defined in the absence of a signal (weeks in
which teams have byes), and need to be defined differently depending on
the nature of the signal. I define the nature of the signal in the
simplest possible way that is agnostic to my construction of the
estimated posteriors: wins are positive signals, and losses negative.
The variable overreaction (OVER) is defined thus as estimated posterior
minus observed posterior after wins, and observed minus estimated
posterior after losses. The intuition for this is when OVER for a team
is positive after a win, then the observed posterior must be better than
the estimated posterior, indicating the voter overreacted to the win;
the intuition is similar for underreaction and losses. (28) OVER is then
used in the following regression equation, estimated separately for
games in which the ranked team wins and loses (as will be discussed
shortly the hypothesized coefficient signs depend on whether the game
was won/lost):
(4) $\text{OVER}_{ivts} = X_{ivts}\beta + \delta_v + \text{WEEK}_t \times \delta_v + \gamma_s + \epsilon_{ivts}$.
i, v, t, and s denote team, voter, week, and season, respectively;
$\delta_v$ is a voter fixed effect (FE) and $\gamma_s$ is a season FE. X is a vector of controls. These are defined in Table 5, and also include TOP1_5, TOP6_10, TOP11_15, and TOP16_20, which are dummies for team i being in the top 1-5 (for voter v in week t of season s), and so forth (top 21-25 is omitted).
account for voter-specific variation in priors or tendencies to over- or
underreact. The interaction of the voter FE and week is used to account
for possible heterogeneity of ranking definitions. If voters weigh
season performance and quality differently, their responses to games
might vary over time. (29) Summary statistics for variables used in the
regressions are presented in Table 5.
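A hedged sketch of the OVER construction and the Equation (4) regression with game-clustered standard errors follows; the column names are assumptions, the control set shown is only a subset of the article's X vector, and the voter-by-week interaction terms are omitted here.

```python
# Construct OVER (estimated minus observed posterior after wins, observed
# minus estimated after losses) and estimate a simplified version of
# Equation (4) with standard errors clustered by game. Column names and the
# data file are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("voter_game_weeks.csv")  # hypothetical panel

df["OVER"] = np.where(df["won"] == 1,
                      df["est_posterior"] - df["obs_posterior"],
                      df["obs_posterior"] - df["est_posterior"])

formula = ("OVER ~ HOME + SMARGIN + OPPRANK + OR_SMARGIN "
           "+ TOP1_5 + TOP6_10 + TOP11_15 + TOP16_20 "
           "+ C(voter) + C(season)")

wins = df[df["won"] == 1]                 # estimate wins and losses samples separately
res = smf.ols(formula, data=wins).fit(
    cov_type="cluster", cov_kwds={"groups": wins["game_id"]})
print(res.params.filter(regex="HOME|SMARGIN|TOP"))
```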
If the voters were in fact Bayesian, the variables the estimates
are conditioned on (variables i = 1-4 from Table 5) would not affect the
estimated and observed posteriors in systematically different ways. That
is, these variables would have no effect on OVER; thus, the null hypothesis of Bayesian updating implies $\beta_i = 0$ for $i = 1, \ldots, 4$. The other variables may affect OVER despite Bayesian updating,
because the estimates are not conditioned on them. The primary
alternatives to Bayesian updating are overreaction and underreaction. I
specify the alternative hypothesis of overreaction for each of these
variables and the rank-group dummies below; the alternative hypothesis
of underreaction would imply each coefficient takes the opposite sign.
Overreaction Hypotheses:
(1) For wins: $\beta_1 < 0$; $\beta_2, \beta_3, \beta_4 > 0$; $\beta_{TOP1\_5} < \beta_{TOP6\_10} < \cdots < \beta_{TOP16\_20} < 0$.
(2) For losses: $\beta_1 > 0$; $\beta_2, \beta_3, \beta_4 < 0$; $\beta_{TOP1\_5} > \beta_{TOP6\_10} > \cdots > \beta_{TOP16\_20} > 0$. (30)
Part 1 can be explained as follows. Winning a game is positive
information, which in general causes rank to improve. Additional
positive information would cause the rank to improve excessively if the
voters overreact. If a win occurs away (HOME = 0), this is additional
positive information, because winning is less likely on the road for
teams of worse true rank. Thus, if voters overreact to home status for wins, then OVER would be lower when HOME = 1, so $\beta_1$ would be negative. If voters underreact to home status for wins, OVER would be greater when HOME = 1 and $\beta_1 > 0$. Similarly, if voters
overreact to additional positive information, their rank improvements
will be greater when the score margin is high, the opponent is ranked,
and margin of victory over ranked opponent is high. The coefficients for
the rank-group dummies are predicted to be increasing as the team's
prior rank worsens since, if voters overreact in general, they will
improve ranks excessively after all wins. However, they will do so to a
greater extent for worse ranked teams because they have further to
potentially move up, due to the censored nature of the data. Part 2 of
the overreaction hypotheses is analogous; negative information relating
to losses would cause voters to worsen the losing team's rank
excessively, if the voters overreact.
C. Results
Selected estimation results are presented in Table 6. (31) HOME is
significant at the 1% level for almost all specifications; the estimates
for both wins and losses samples imply voters do not appreciate the
importance of home-field advantage. SMARGIN is negative and highly
significant for wins, but close to 0 for losses. This implies voters are
insensitive to margin of victory but not margin of loss. Voters are more
responsive to margin of victory when the opponent is ranked. The rank of
the opponent does not have a significant effect on loss responses.
In the wins models, the rank-group dummies are highly significant
and of large magnitude for top 20 teams, indicating voters relatively
underreact to wins by the lowest ranked teams. In the losses models, the
top 1-5 dummy's coefficient stands out as it has a large, positive
coefficient, significant at 5% for the preferred specification, and the
top 11-15 dummy has a negative coefficient of smaller magnitude but
significant at 1%. These results indicate voters overreact to losses by
top 1-5 teams, especially relative to their reactions to losses by top
11-15 teams. The other control variables are mostly insignificant;
AGGRKDIFF is the only one with consistently strong economic and
statistical significance, indicating the voters are influenced by their
peers. The results for robustness specifications (2) and (3) are very
similar to those of the preferred model. The results for the regressions
estimated only on the 2008 sample are similar to the other results but
have substantially higher standard errors.
In summary, the results do not uniformly support the null or either
alternative hypothesis (underreaction or overreaction). The $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$ wins-model estimates support underreaction, but the $\hat{\beta}_4$ estimate supports overreaction. The $\hat{\beta}_1$ losses-model estimate supports underreaction, but the insignificant and small $\hat{\beta}_2$, $\hat{\beta}_3$, and $\hat{\beta}_4$ estimates support the null. In addition,
the rank-group coefficients are not fully consistent with any
hypothesis.
D. Interpretation
The fact that the non-Bayesian mistakes go in opposing directions
is not surprising, given that both overreaction and underreaction
behavior have been found in previous literature. In this subsection, I
discuss a possible common thread among these seemingly inconsistent
results. I realize this exercise may have the flavor of post-hoc
rationalization. The discussion is still presented because there does
appear to be one particular factor, salience, that has a great deal of
explanatory power, and is supported by previous literature, so it could
have reasonably been discussed as part of the a priori theory.
Voters underrespond to HOME and SMARGIN for wins. These variables
are arguably salient, in that the sports media of course often discuss
home-field advantage and the final score of games. But they are still
second order, and thus relatively non-salient, as compared to the binary
outcome of win/loss. Voters seem to update ranks in a similar way for
all wins, whether they occur at home or on the road, or by wide or small
margins. (32) In fact there is a saying in sports that "a win is a
win," meaning that athletes, coaches, and commentators ignore
negative features of wins (such as the score being close). The poll
voters appear to use a heuristic like this, despite the fact that these
factors do affect the estimated posteriors, indicating they are
informative regarding final rank. The results for OPPRANK and OR_SMARGIN
indicate voters generally do not take full account of the quality of
opponent, which is also less salient than the win/loss outcome, but are
more responsive to margin of victory when the win occurs against a
ranked opponent. Margin of victory is plausibly more salient for games
involving ranked opponents, because these games receive more attention.
There is no analogous saying that "a loss is a loss," and the
voters indeed are more responsive to home status and score margin for
losses. That estimated overreaction to score margin of loss is close to
zero for models (1)-(3) suggests that the voters are capable of revising
beliefs in a sophisticated way.
A natural question that now arises is why are losses and wins
against ranked teams relatively salient. One reason might be that they
happen relatively infrequently, which, in itself, would not justify
the differences in voter responsiveness. It is also possible though that
these more salient signals are actually more informative than other
signals. I test for this possibility in a simple way: regressions with
dependent variables of rank in final poll, and ranked/not ranked in
final poll, on a single independent variable, estimated Bayesian
posterior rank, separately for teams that lost, teams that beat ranked
teams, and teams that beat unranked teams. The estimated coefficients
and adjusted $R^2$ values are substantially higher for teams that
lost and those that beat ranked teams. This implies those signals indeed
are more predictive of final rank, meaning they contain more relevant
information, than wins over unranked teams. (33) This means that if
voters face the same effort costs in calculating their responses to all
signals, it would be rational to exert more effort in responding to the
more salient signals. Alternatively, voters may feel that minimizing
rank revisions may enhance reputation or ego, as it indicates having
strong priors, and so the benefit of unresponsiveness may be greater for
more vague, less salient, game results.
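A minimal sketch of this informativeness check is below: final rank regressed on the estimated Bayesian posterior rank, separately by signal type; the data file and column names are assumptions.

```python
# Regress final rank on the estimated Bayesian posterior rank separately for
# losses, wins over ranked opponents, and wins over unranked opponents, and
# compare coefficients and adjusted R-squared values. Column names assumed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("voter_game_weeks.csv")  # hypothetical panel

subsets = {
    "losses": df["won"] == 0,
    "wins_vs_ranked": (df["won"] == 1) & (df["opp_ranked"] == 1),
    "wins_vs_unranked": (df["won"] == 1) & (df["opp_ranked"] == 0),
}
for name, mask in subsets.items():
    res = smf.ols("final_rank ~ est_posterior", data=df[mask]).fit()
    print(name, round(res.params["est_posterior"], 2),
          "adj R2:", round(res.rsquared_adj, 3))
```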
Another question is what explains the variation in the rank-group
coefficient estimates. One explanation is that the voters do not
appreciate the differences in the precision of priors noted above. As
these differences are subtle, this can also be interpreted as the voters
failing to appreciate less-salient information. As discussed in Section
III.A, the priors for top 1-5 teams are substantially stronger than
those of teams ranked just below them, but the priors for teams ranked
11-25 are fairly similar. Consequently, the Bayesian response to a loss
by a top 1-5 team should be small, and the response to a loss by a top
11-15 team should be large. The response to a win by a top 21-25 team
should also be large, compared to responses to wins by other teams,
because the teams ranked just better than 21-25 teams are not ex ante
much better. If, on the other hand, the voters think of the priors as
uniformly precise, they will treat losses and wins by all teams
similarly. This would cause voter reactions to wins by top 21-25 teams
to be too small, reactions to losses by top 1-5 teams to be too big, and
reactions to losses by top 11-15 teams to be too small--which is exactly
what occurs.
An alternative explanation for overreaction to losses by very top
teams is that voters put too much weight on these signals because they
are more unusual. However, this would imply voters should overreact to
losses by top 11-15 teams relative to losses by worse ranked teams,
which is clearly not the case. (34) An alternative explanation for
underreaction to wins by top 21-25 teams is that voters underreact to
wins in general, and this underreaction is most pronounced for the
lowest ranked teams simply because they have the furthest to rise in the
polls. But this would imply voters should underreact to wins by top
11-15 teams relative to wins by top 10 teams, which does not occur. A
final alternative explanation I explore is that voters react to aspects
of the signal, such as HOME and SMARGIN, differently for teams with
different prior ranks, which could cause the rank-group dummy estimates
to vary. To investigate this, I estimated the models separately for each
rank group and examine the constants; they indeed vary in a way
consistent with voters not appreciating prior precision variation. (35)
V. ROBUSTNESS CHECKS
A. Game Result Forecast Errors
In this subsection, I check whether the results hold using a
different assumption for the voters' objective function--that they
attempt to rank teams such that, ceteris paribus, higher ranked teams
are likely to perform better in future games than lower ranked teams.
Under this assumption, if historical information from earlier in the
season on ranks or game performances was predictive of future game
results conditional on current rank, current rank of opponent, and other
appropriate controls, this would imply current ranks do not incorporate
the historical information efficiently. That is, if voters make
systematic game result forecast errors that are correlated with
particular types of historical information, this would provide
alternative evidence voters responded to that information in
non-Bayesian ways. One reason this assumption is not used for the main
analysis is that it does not allow ranking criteria to be heterogeneous;
another is that game forecast errors could be caused by incorrect prior
ranks, in addition to incorrect rank updating. The main analysis is
agnostic to the accuracy of prior rank, allowing for a focus on the
updating process. (36)
The three strongest results regarding updating mistakes presented
in Section IV are that voters are underresponsive to score margin for
wins and to home status in general, and overresponsive to losses by top
five teams. (37) To test whether these factors are associated with
forecast errors, I separately regress two measures of game results,
SMARGIN and a binary win variable, on variables representing the ranked
team's history of home status (HIST_HOME), history of scores for
wins (HIST_WSM), and history of being ranked in the top five
(HIST_TOP5), along with controls for rank, opponent rank, current game
home status, STATE, REGION, and week and voter FEs.
HIST_HOME is defined as the number of home games minus the number
of away games the team has played prior to the current game. If voters
underrespond to home status then teams that play more home (away) games
will be overrated (underrated), so this variable should have a negative
coefficient. HIST_WSM is defined similarly (the sum of points minus
opponent's points, for games the team won only), but the hypothesis
goes the other direction--if voters underrespond to score margin, then
this variable should have a positive coefficient. HIST_TOP5 is the
number of weeks the team was ranked in the top five; it should be
positively associated with future game results for teams that were
downgraded from the top five if voters overreacted to negative signals
for those teams. To account for this, I estimate the models for both the
full sample, and a subsample of teams ranked six or worse. HIST_TOP5
should have a much stronger effect for the latter. I also estimate the
models separately for a subsample of games in which the opponent is
ranked by at least one voter, because quality of opponent can be
controlled for more precisely for these games. (38)
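The history regressors can be built from a team-week panel along the following lines; this is a sketch under assumed column names, not the article's exact construction.

```python
# Build HIST_HOME (home minus away games played before the current game),
# HIST_WSM (cumulative score margin over prior wins), and HIST_TOP5 (weeks
# previously ranked in the top five) from a sorted team-week panel.
import pandas as pd

df = pd.read_csv("team_weeks.csv").sort_values(["season", "team", "week"])

def prior_sum(s):
    # Cumulative sum within (season, team), excluding the current game
    return s.groupby([df["season"], df["team"]]).cumsum() - s

df["HIST_HOME"] = prior_sum(df["home"].map({1: 1, 0: -1}))
df["HIST_WSM"] = prior_sum(df["score_margin"].where(df["won"] == 1, 0))
df["HIST_TOP5"] = prior_sum(df["in_top5"])
print(df[["team", "week", "HIST_HOME", "HIST_WSM", "HIST_TOP5"]].head())
```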
Results are presented in Table 7. Results for HIST_WSM are
consistently positive and significant, indicating that current ranks do
not sufficiently account for historical margin of victory. HIST_TOP5 is
positive and significant for the models that exclude teams currently
ranked in the top five, which supports the conclusion that voters
overreact to negative signals for top five teams. However, home status
is consistently insignificant. Thus, only two of the three hypotheses
are supported; the evidence for one is inconclusive.
B. The 1991-2005 Aggregate Polls
While the individual voter ballots prior to 2006 are unavailable,
the aggregate poll data are publicly available for all seasons. Despite
their limitations, the aggregate data can be used to check that the main
trends found above hold for other seasons, and using other analytical
methods. In this subsection, I use very simple tests to verify the main
results from above (underreaction to home status and score margin for
wins, and overreaction to losses by top five teams). This robustness
check is conducted by comparing sample mean posterior and final ranks
for: teams that play at home versus on the road, top 21-25 teams that
win by more than 10 points versus top 16-20 teams that win by less than
10 points, and top 1-5 teams that lose versus top 6-10 teams that win.
If voters underrespond to home status, teams that play at home will tend
to have better posterior ranks because their game performances will be
inflated by the home advantage, but will have no better final ranks, as
being the home team does not make a team better in the long run. If
voters underrespond to score margin, teams ranked 16-20 who win by small
margins will have better posteriors than teams ranked 21-25 who win by
large margins, but the difference in final ranks should be smaller. If
voters overrespond to losses by top 1-5 teams, their posteriors will be
worse than those of winning top 6-10 teams, but the difference in final
ranks will be smaller. These results are reported in Table 8.
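A minimal sketch of one of these comparisons (top 1-5 teams that lose versus top 6-10 teams that win) is given below; the data file and column names are hypothetical.

```python
# Compare mean next-week posterior ranks and mean final ranks across the two
# contrasted groups, as in the Table 8-style checks. Column names assumed.
import pandas as pd
from scipy import stats

agg = pd.read_csv("aggregate_polls.csv")  # hypothetical 1991-2005 team-week panel

lose_top5 = agg[(agg["prior_rank"] <= 5) & (agg["won"] == 0)]
win_6_10 = agg[agg["prior_rank"].between(6, 10) & (agg["won"] == 1)]

for col in ["posterior_rank", "final_rank"]:
    t, p = stats.ttest_ind(lose_top5[col], win_6_10[col], equal_var=False)
    print(col, round(lose_top5[col].mean(), 1), round(win_6_10[col].mean(), 1),
          "p =", round(p, 3))
# Under overreaction to losses by top 1-5 teams, the posterior-rank gap should
# be large while the final-rank gap is small or statistically insignificant.
```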
All of the table's results support the conclusions of Section
IV. Teams that play their next game at home have better next-week
posterior ranks, but insignificantly different final ranks. (39) Top
21-25 teams that win games by large margins have worse posterior ranks
than top 16-20 teams that win by small margins, but the teams have
insignificantly different final ranks. Finally, top 1-5 teams that lose
have worse posterior ranks than top 6-10 teams that win, but the teams
have insignificantly different final ranks.
VI. CONCLUDING REMARKS
This article presents extensive evidence that real-world agents
with substantial experience make belief updating mistakes similar to those
committed by laboratory subjects. The agents also exhibit behavior
consistent with estimated Bayesian behavior in some circumstances. A
simple but powerful explanatory factor driving the different results
seems to be salience. The voters' responses to the most salient
aspects of the most salient signals (score margin of losses) are
estimated to be Bayesian. Other, less-salient aspects of the signals,
which are still informative, tend to be ignored. These results suggest
that, given their experience, the voters would fare relatively well if
faced with a simple belief updating task similar to the standard one
faced by experimental subjects. The overwhelming complexity of updating a
ranking of 25 teams in response to dozens of multi-dimensional signals (game
results) is part, but not all, of what causes the voters to rely on heuristics
and ignore relevant, but less salient, information.
The results also show the voters are unaware of subtle differences
in prior strength across the top 25 teams. Both underreaction and
overreaction sometimes result from this unawareness. These results are in
line with experimental work on confidence and the stability of systems.
It is hard to justify these mistakes with information processing costs,
or reputation concerns, however.
While the patterns found in the data studied in this article are
strong, they are ultimately very broad, and require validation in other
contexts. Using experiments and looking for other field data sources to
confirm the relationship between salience and belief updating, analyzing
individual-level heterogeneity (as in, e.g., El-Gamal and Grether 1995)
and the structural relationship between salience and belief updating at
a deeper level, and applying these results to the study of real-world
economic phenomena are important directions for future research.
doi: 10.1111/j.1465-7295.2011.00431.x
ABBREVIATIONS
AP: Associated Press
BCS: Bowl Championship Series
FE: Fixed Effect
MAD: Mean Absolute Deviation
YTD: Year-To-Date
REFERENCES
Amir, E., and Y. Ganzach. "Overreaction and Underreaction in
Analysts' Forecasts." Journal of Economic Behavior and
Organization, 37(3), 1998, 333-47.
Barberis, N., and R. Thaler. "A Survey of Behavioral
Finance." Handbook of the Economics of Finance, 10, 2003, 6-12.
Cai, H., Y. Chen, and H. Fang. "Observational Learning:
Evidence from a Randomized Natural Field Experiment." The American
Economic Review, 99(3), 2009, 864-82.
Camerer, C. "Does the Basketball Market Believe in the Hot
Hand?" The American Economic Review, 79(5), 1989, 1257-61.
Chetty, R., A. Looney, and K. Kroft. "Salience and Taxation:
Theory and Evidence." The American Economic Review, 99(4), 2009,
1145-77.
DellaVigna, S. "Psychology and Economics: Evidence from the
Field." Journal of Economic Literature, 47(2), 2009, 315-72.
Dominitz, J. "Earnings Expectations, Revisions, and
Realizations." Review of Economics and Statistics, 80(3), 1998,
374-88.
El-Gamal, M., and D. Grether. "Are People Bayesian? Uncovering
Behavioral Strategies." Journal of the American Statistical
Association, 90(432), 1995, 1137-45.
Epstein, L., J. Noor, and A. Sandroni. "Non-Bayesian
Learning." The BE Journal of Theoretical Economies, 10(1), 2010,
article 3.
Goff, B. "An Assessment of Path Dependence in Collective
Decisions: Evidence from Football Polls." Applied Economics, 28(3),
1996, 291-97.
Gonzalez, R., and G. Wu. "On the Shape of the Probability
Weighting Function." Cognitive Psychology, 38(1), 1999, 129-66.
Grether, D. "Bayes Rule as a Descriptive Model: The
Representativeness Heuristic." The Quarterly Journal of Economics,
95(3), 1980, 537-57.
Griffin, D., and A. Tversky. "The Weighing of Evidence and the
Determinants of Confidence." Cognitive Psychology, 24(3), 1992,
411-35.
Holt, C., and A. Smith. "An Update on Bayesian Updating."
Journal of Economic Behavior and Organization, 69(2), 2009, 125-34.
Kraemer, C., and M. Weber. "How Do People Take into Account
Weight, Strength and Quality of Segregated vs. Aggregated Data?
Experimental Evidence." Journal of Risk and Uncertainty, 29(2),
2004, 113-42.
Lebovic, J., and L. Sigelman. "The Forecasting Accuracy and
Determinants of Football Rankings." International Journal of
Forecasting, 17(1), 2001, 105-20.
Levitt, S. "Why Are Gambling Markets Organised So Differently
from Financial Markets?" The Economic Journal, 114(495), 2004,
223-46.
Levitt, S., and J. List. "What Do Laboratory Experiments
Measuring Social Preferences Reveal about the Real World?" Journal
of Economic Perspectives, 21(2), 2007, 153-74.
Logan, T. "Econometric Tests of American College
Football's Conventional Wisdom." Applied Economics, 43, 2010,
2493-518.
Massey, C., and G. Wu. "Understanding Under- and
Overreaction." The Psychology of Economic Decisions, 2, 2004,
15-29.
Nisbett, R., and L. Ross. Human Inference: Strategies and
Shortcomings of Social Judgment. Englewood Cliffs, NJ: Prentice Hall,
1980.
Nutting, A. "And After That, Who Knows?: Detailing the
Marginal Accuracy of Weekly College Football Polls." Journal of
Quantitative Analysis in Sports, 7(3), 2011, 1274.
Sloman, S., P. Fernbach, and Y. Hagmayer. "Self-Deception
Requires Vagueness." Cognition, 115, 2010, 268-81.
Surowiecki, J. "Running Numbers." The New Yorker, January
21, 2008.
Tversky, A., and D. Kahneman. "Judgment under Uncertainty:
Heuristics and Biases." Science, 185(4157), 1974, 1124-31.
Zafar, B. "How Do College Students Form Expectations?"
Journal of Labor Economics, 29(2), 2011, 301-48.
SUPPORTING INFORMATION
Additional Supporting Information may be found in the online
version of this article:
Appendix S1. An illustrative model of rank updating.
Appendix S2. Testing constant team qualities.
Appendix S3. Estimation of score distributions.
Appendix S4. Estimation of prior distributions.
(1.) See, for example, Tversky and Kahneman (1974), Grether (1980),
Massey and Wu (2004), and Holt and Smith (2009). DellaVigna (2009)
provides a thorough review of field evidence of deviations from rational
behavior, but does not refer to any studies that focus on belief
updating. There is a strand of the literature on belief updating that
analyzes survey data, such as Zafar (2011) and Dominitz (1998). These
studies, while non-experimental, lack the data to directly test Bayesian
updating.
(2.) There are a number of other fairly well-known criticisms of
experimental work. Levitt and List (2007) provide an interesting
discussion of some of these issues, including self-selection of
subjects, small stakes, self-consciousness, inability to confer with others, and insufficient time to make optimal decisions.
(3.) Several other academic studies have used the AP Top 25 as a
data source, including Goff (1996), Lebovic and Sigelman (2001), and
Logan (2010). None focus on analyzing the rationality of belief
updating.
(4.) Paul Montella, who manages the rankings for the AP, said in a
phone conversation in the summer of 2009: "There is no real
criteria for voting." While the voters are given a set of
guidelines before each season, they are limited in scope to avoid
imposing substantial structure on the voting. The guidelines are
discussed further below.
(5.) Of course, this discussion refers to efficient updating from
an ex ante, and not ex post, perspective. If the preseason number one
ranked team loses its first game, wins all subsequent games, and
finishes the season the consensus number one, lowering its rank after
the initial loss is only (very likely) optimal ex ante, and not
conditional on knowing the team is the "true" number one.
(6.) These results are consistent with those of Levitt (2004), who
found gamblers also do not pay full heed to home status and score
margin.
(7.) I follow Cai, Chen, and Fang's (2009) use of the term
saliency. They say, "The term 'saliency' is widely
used in the perceptive and cognitive psychology literature to refer to
any aspect of a stimulus that, for whatever reason, stands out from the
rest."
(8.) Nutting (2011) documents this phenomenon in detail; his
article includes a reference to a very apt quote made in 1989 by United
Press International sports editor Fred McMane: "I don't think
there are 25 good teams in the country. I think you generally see five
good teams, 10 who are fairly good, and after that, who knows?"
(9.) These results are consistent with those of Massey and Wu
(2004), who find overreaction is relatively likely in "stable
systems" (situations with relatively precise priors).
(10.) See, for example, Nisbett and Ross (1980) and Surowiecki
(2008).
(11.) They write, "If a data sample (signal) is representative
of an underlying model, then people overweight the data. However, if the
data is not representative of any salient model, people react too little
to the data."
(12.) Amir and Ganzach (1998) analyze over/underreaction with field
data, and analyze salience of the prior, but not the signal. Both
overreaction and underreaction may be caused by a host of more specific
biases identified in the psychology literature; for example, the
availability (overreaction) and the anchoring (underreaction) biases.
Although this article does not focus on any of these particular biases,
it supports the idea that salience may be a key factor that helps to
reconcile them. It is also worth noting that Griffin and Tversky (1992)
hypothesized and showed evidence that when a signal was
"strong" (extreme) but of low "weight" (low
credibility), it would cause overreaction, and vice versa for low
strength, high weight signals. Kraemer and Weber (2004) showed that this
relationship was reversed if the weight of information was presented
clearly. Again, the unifying factor may be salience--that normally
strength is more salient than weight, but if weight of information or
priors is presented relatively salient, it may indeed be overreacted to.
(13.) The voters do obtain other information about each team
besides game scores, but this information has a small impact on the
rankings, especially on a week-to-week basis.
(14.) For example, Dominitz (1998) analyzes revisions in beliefs
about future earnings using survey data. Although he has data on
subjective distributions of future earnings (priors) and the earnings
realizations from an intermediate point in time (signals), he does not
have sufficient information to estimate whether the observed belief
revisions are actually Bayesian. This is mainly because of a lack of
data on the subjective distributions of the signals. In fact, the author
explicitly comments on this issue--the difficulty of analyzing belief
updating using field data--saying: "This component of the analysis
calls attention to the breadth of data required to assess the
responsiveness of expectations to new information."
(15.) The historical aggregate AP polls and 'Others Receiving
Votes' (teams receiving some votes whose point totals were not in
the top 25) are from appollarchive.com and The [Baltimore] Sun.
Historical score data are from
http://homepages.cae.wisc.edu/dwilson/rsfc/history/howell/ and
http://www.knology.net/jashburn/football/archive/ (URL as of May 5,
2009).
(16.) See, for example,
http://sports.espn.go.com/ncf/news/story?id=2663882 and
http://pollspeak.com/.
(17.) Voters are given the following guidelines before each season,
which are intentionally left ambiguous and open to interpretation
according to Paul Montella of the AP: "Base your vote on
performance, not reputation or preseason speculation. Avoid regional
bias, for or against. Your local team does not deserve any special
handling when it comes to your ballot. Pay attention to head-to-head
results. Don't hesitate to make significant changes in your ballot
from week to week. There's no rule against jumping the 16th-ranked
team over the 8th-ranked team, if No. 16 is coming off a big victory and
No. 8 just lost 52-6 to a so-so team." The first line (which may
seem incongruous as there is a preseason poll) was added for the 2008
season, and Montella said it was not indicative of a policy change, but
just meant to encourage the voters to be more responsive to game
results. This is consistent with my finding that the voters are often
underresponsive to game information. Results for the 2008 season are
similar to those from earlier seasons regardless.
(18.) The terms "season-long performance" and
"quality" are admittedly somewhat vague. It is not necessary
for these terms to be defined precisely here, but performance can be
thought of as referring to the realization of game results, and quality
as referring to the unobserved team-specific distribution of game
results (probability of winning).
(19.) Another criterion the rankings might plausibly appear to be
based on is year-to-date (YTD) performance. This would be problematic,
however, because of the existence of a preseason poll. As there is no
YTD performance at that point, yet a poll exists, the preseason poll
cannot be an assessment only of performance that has been observed. It
follows that mid-season polls also cannot be based purely on YTD
performance. The
data bear this out as voters clearly do not rank teams purely on YTD
performance in early season polls, as, for instance, teams with two wins
and one loss are often ranked ahead of teams with three wins and no
losses. Similarly, if the weights placed on YTD performance varied
throughout the season, the rankings criteria would be time inconsistent
and this would be another form of deviation from rationality.
(20.) I thank Andrew Nutting for providing the data set used for
this analysis. The data sets used for the paper's main analysis do
not include ranks on teams for each week throughout the seasons, only
the first half of each season and final ranks. I do not find evidence of
precision increasing substantially in just the first half of seasons.
(21.) I use aggregate ranks for these tests due to lack of
historical individual rank data.
(22.) If voters believed team qualities were changing, they would
appear to overreact but would be making mistakes qualitatively distinct
from basic misuse of Bayes' rule.
(23.) I do test for, and find significant, the effects of
differences between individual and aggregate ranks on individual rank
changes. This is likely largely because of social learning. The data
indicate that indeed the voters do not attempt to rank teams as closely
to the aggregate ranks as possible. For example, in the first poll of
2006 Ohio State received the majority of first-place votes: 35 of 65. In
the second poll, after a strong opening win, Ohio State received 39
first-place votes. If voters were simply trying to match the aggregate
rankings, more than four of them would have switched their first-place
vote.
(24.) This method is used for all of the specifications reported in
Table 6. It is clearly not problematic (actually it is ideal) for the
robustness check that uses the aggregate polls as true rankings; it is
not ideal for the robustness check that uses computer rankings as truth.
But it is a reasonable approximation for this purpose as well, as the
final aggregate and computer polls are similar, and regardless should
not cause the same issue with the signals appearing too informative as
discussed above.
(25.) In other words, it allows the estimates to potentially
exactly match the observed data. To illustrate by example, suppose only
22 of 25 teams in voter 1's Week 1 ballot are ranked in Week 2.
Suppose the teams ranked 19-21 in Week 1 dropped out of the voter's
top 25 and were replaced by new teams (teams unranked by that voter in
Week 1), so the ranks of teams ranked 1-18 and 22-25 did not change. As
I know little about voter 1's beliefs about the new teams in the
poll (because they were unranked before they entered the poll) I ignore
them and adjust the observed Week 2 posteriors. I assign ranks 19-22 to
teams observed ranked 22-25, and 23 to the teams that dropped out. For
the Bayesian estimates, I assign rank 23 to all teams with estimated
rank 23 or higher. Hence, the estimated rankings can potentially be
exactly the same as the observed posteriors.
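A minimal sketch of this adjustment (in Python, with hypothetical data structures and a made-up function name) may help fix ideas; it assumes only that still-ranked Week 1 teams keep their relative order and that all dropped teams share the next available rank, as in the example above.

    def adjust_posteriors(week1_teams, week2_ranks):
        # week1_teams: the voter's Week 1 top 25, in rank order.
        # week2_ranks: dict mapping team -> observed Week 2 rank;
        #              teams that dropped out of the top 25 are absent.
        still_ranked = sorted((r, t) for t, r in week2_ranks.items()
                              if t in week1_teams)
        adjusted = {t: i + 1 for i, (_, t) in enumerate(still_ranked)}  # compress to 1..K
        dropout_rank = len(still_ranked) + 1
        for t in week1_teams:
            adjusted.setdefault(t, dropout_rank)  # dropped teams all share rank K+1
        return adjusted

With 22 of 25 Week 1 teams still ranked, the teams observed at ranks 22-25 receive adjusted ranks 19-22 and the three dropped teams receive rank 23, matching the example; estimated Bayesian ranks of 23 or worse are capped at 23 analogously.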
(26.) Results are similar for other values.
(27.) I adjust the rankings to account for number of teams, per
week and voter, not being in the final poll, in the same way that the
estimated posteriors are adjusted to account for number of teams in
observed posteriors and priors, as discussed in Section III.C.
(28.) This definition is straightforward and easily interpretable;
because of its simplicity, however, it does not allow for
"bad" wins or "good" losses, which certainly do
occur. I experimented with numerous other definitions of overreaction
that do account for these types of signals and found that they generally
do not result in substantially different results.
(29.) This would not imply the true ranks proxy is problematic.
Specifically, while the magnitudes of all voters' reactions to
signals should decrease as the season progresses (and beliefs become
more precise), the degree to which the reactions of voters who emphasize
performance decrease may be larger. This is because these voters'
belief revisions regarding future performance become less important as
the number of future games decreases. However, I do not expect this
difference to be substantial, as the sample is restricted to the first
half of the season and there is a large number of remaining games even
after the last week used in the analysis.
(30.) The other covariates may have significant effects on OVER for
reasons other than non-Bayesian ranking changes; thus, they are mainly
included to serve as controls and are not the focus of the discussion of
results. For example, voters may legitimately be influenced by AGGRKDIFF
via social learning, and while STATE and REGION seemingly should not
have any effect on ranking updates, if they did it would be for reasons
other than non-Bayesian updating.
(31.) Bootstrap standard errors are used because the dependent
variable is estimated. Results are similar with conventional estimates.
The coefficient estimates unreported are mostly insignificant.
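For readers unfamiliar with the procedure, a bare-bones game-clustered bootstrap is sketched below in Python. The DataFrame df, the game_id cluster identifier, and the regression formula are placeholders; this is an illustration of the general idea, not the article's estimation code.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def cluster_bootstrap_se(df, formula, cluster="game_id", reps=500, seed=0):
        # Resample whole games (clusters) with replacement, re-estimate the OLS
        # coefficients each time, and report the standard deviation of the draws.
        rng = np.random.default_rng(seed)
        games = df[cluster].unique()
        draws = []
        for _ in range(reps):
            sampled = rng.choice(games, size=len(games), replace=True)
            boot = pd.concat([df[df[cluster] == g] for g in sampled],
                             ignore_index=True)
            draws.append(smf.ols(formula, data=boot).fit().params)
        return pd.DataFrame(draws).std()

    # se = cluster_bootstrap_se(df, "OVER ~ HOME + SMARGIN + OPPRANK + OR_SMARG")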
(32.) It is actually somewhat amazing how insensitive the
voters' responses are to home status. The mean observed rank
improvement following home wins is 1.36 spots; the mean improvement
following away wins is 1.43 spots. In contrast, the respective estimated
Bayesian rank improvements are 1.21 and 2.55 spots.
(33.) Results not reported; available on request. I use estimated
posterior rather than observed posterior ranks as the independent
variable because if the observed posteriors do not respond appropriately
to the signals the tests would be invalid. But results are similar
either way.
(34.) Other alternative explanations, such as voters committing the
base-rate fallacy (in which priors are in general ignored) or
probability weighting (in which very high/low probabilities are
interpreted as being closer to 0.5; see Gonzalez and Wu 1999), also
should cause overreaction to losses by top 6-15 teams, relative to
losses by top 16-25 teams, and thus cannot explain the results.
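For reference, a commonly used functional form capturing this compression toward 0.5 is the linear-in-log-odds weighting function studied by Gonzalez and Wu (1999); the display below is included only to illustrate the general shape and is not a form estimated in this article:

    w(p) = \frac{\delta p^{\gamma}}{\delta p^{\gamma} + (1 - p)^{\gamma}},
    \qquad \delta > 0, \; 0 < \gamma < 1.

For values of gamma below one the function is inverse-S shaped, so small probabilities are overweighted and large probabilities are underweighted; that is, perceived probabilities are pulled toward intermediate values.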
(35.) The constant is large for the model estimated on the
subsample of losing top 1-5 teams, and small for the losing 11-15 and
winning 21-25 teams subsamples. This indicates over/underreaction
tendencies are generally different for these subsamples, and not just
driven by differences in the types of signals (e.g., score margins) or
responsiveness to signal characteristics across subsamples; results
again are unreported. The claim that voters do not appreciate
differences in prior precision is supported by the literature on
confidence (Griffin and Tversky 1992), which has shown that while people
are in general overconfident, they are more so when facing a difficult
task, and people tend actually to be underconfident when facing easy
tasks. In the AP poll context, ranking top 11-25 teams is difficult,
which would make voters overconfident and use priors that are too
precise, causing underreaction to signals. Ranking top 1-5 teams is
easy, making voters underconfident, thus causing overreaction to
signals.
(36.) The new objective function assumption is not inconsistent
with the true rankings assumption used for the main analysis; as
discussed in Section III.A, if voters only ranked teams on quality, with
quality defined as likelihood of winning, the true rankings assumption
is still valid given the evidence that quality does not vary
substantially throughout the season. I also note forecast errors for the
original assumption could be defined as the differences between final
and posterior ranks. These are not analyzed because they are not clearly
observed for many teams, because voters typically only rank in their
final top 25 around 15 of the teams currently ranked. That is, for 10 of
the ranked teams in each voter's Week 1-7 polls, the final rank is
unobserved, meaning it could be anything from 26 to 120. Game results
are observed for all ranked teams that play games in each week, which is
the vast majority.
(37.) The specification (1) results imply voters react by over
three spots more than they should to losses by top 1-5 teams relative to
top 11-15 teams. The result that voters underreact to wins by top 21-25
teams is also strong, but of somewhat lower magnitude, and is more
difficult to verify in this context.
(38.) Rank and opponent rank are controlled for with FE for each
rank. For the models using games with unranked opponents, separate FE
are used for each rank group used to construct the estimated posteriors.
(39.) The overall mean final rank is worse than the posterior
because teams tend to become unranked as the season progresses.
DANIEL F. STONE: I thank Shan Zhou for excellent research
assistance, Paul Montella of the Associated Press for providing me with
the 2006 ballots and helpful discussion, Andrew Nutting for sharing data
and discussion, and Edi Karni, Matt Shum, Joe Aldy, Tumenjargal
Enkhbayar, Liz Schroeder, Carol Horton Tremblay, Stephen Shore, Peyton
Young, Basit Zafar, and seminar participants at the Econometric Society 2009 North American Summer Meeting and 2009 IAREP/SABE joint meeting for
helpful comments. Two referees and the coeditor (especially) also
provided very helpful feedback. The Sagarin rankings are a component of the Bowl
Championship Series (BCS) rankings along with other computer rankings. I
cannot use the BCS rankings because they are not computed after the
bowls. I use the Sagarin ratings because they were easily obtainable and
I expect other computer rankings would yield similar results.
Stone: Assistant Professor, Department of Economics, Oregon State
University, Corvallis, OR 97331. Phone 541 737 1477, Fax 541 737 5917,
E-mail dan.stone@oregonstate.edu
TABLE 1
Analysis of Within Season Changes in Rank Precision
Dep. Var = Favorite Wins (0/1)
(1) (2) (3) (4)
RANK_DIFF 0.005 0.0032 0.0213 -0.0054
(0.005) (0.011) (0.014) (0.014)
POST_WK7 -0.1143 * -0.1177 * -0.1022 -0.0982
(0.062) (0.063) (0.065) (0.066)
RANK_DIFF x POST_WK7  0.0134 **  0.0135 **  0.0124 **  0.0125 **
                      (0.0058)   (0.0058)   (0.0062)   (0.0062)
Full controls  [check] [check]
Rank and season FE  [check] [check]
[R.sup.2] 0.055 0.059 0.113 0.113
Observations 895 895 895 895
Dep. Var = Favorite Points - Underdog Points
(1) (2) (3) (4)
RANK_DIFF 0.290 * -0.185 0.797 * -0.571
(0.168) (0.408) (0.409) (0.494)
POST_WK7 -4.722 ** -5.015 ** -4.317 * -4.269 *
(2.111) (2.087) (2.206) (2.184)
RANK_DIFF x POST_WK7  0.556 ***  0.580 ***  0.507 **  0.523 **
                      (0.2110)   (0.2100)   (0.2210)  (0.2210)
Full controls  [check] [check]
Rank and season FE  [check] [check]
[R.sup.2] 0.102 0.108 0.155 0.156
Observations 895 895 895 895
Notes: Robust standard errors in parentheses. All models estimated
by ordinary least squares. Sample includes all games between teams
ranked in the aggregate AP top 25 from 1991 to 2008. "Favorite" is the
(ex ante) higher ranked team. RANK_DIFF = favorite's rank minus
opponent's rank; POST_WK7 = 0/1 dummy for the game occurring in Week 8
of the season or later. Dummy variables for home/away and bowl game
are included in all models. Full controls include RANK_DIFF squared
and dummies for favorite/opponent in top 5 and conference game. Rank
and season FE are dummies for favorite rank, opponent rank, and
season. Significance levels: * 10%; ** 5%; *** 1%.
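As a guide to the specification these notes describe, a minimal sketch in Python (using statsmodels) follows; the DataFrame games and its column names are hypothetical, and HC1 standard errors stand in for the robust errors reported in the table.

    import statsmodels.formula.api as smf

    # Favorite-wins indicator regressed on the rank gap, a second-half dummy,
    # and their interaction; the interaction term asks whether the rank gap
    # predicts outcomes better late in the season (i.e., rising precision).
    model = smf.ols(
        "favorite_wins ~ rank_diff * post_wk7 + home + bowl",
        data=games,
    ).fit(cov_type="HC1")
    print(model.summary())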
TABLE 2
Tests of [H.sub.0]: Mean Score Differences
Conditional on Final Rank Groups Are Equal
in the First and Second Halves of the Season
Home Final   Away Final                           p-Value for [H.sub.0]:
Rank         Rank         Period      [bar.s]     [[bar.s].sub.Aug-Oct] =
                                                  [[bar.s].sub.Oct-Dec]
1-12 13-25 Aug-Oct 15 13.6 0.14
1-12 13-25 Oct 16-Dec 15 16.9
1-12 Unranked Aug-Oct 15 22.6 0.35
1-12 Unranked Oct 16-Dec 15 21.0
13-25 1-12 Aug-Oct 15 -7.0 0.35
13-25 1-12 Oct 16-Dec 15 -4.7
13-25 Unranked Aug-Oct 15 15.5 0.09
13-25 Unranked Oct 16-Dec 15 12.9
Notes: "Final Rank" = final AP aggregate rank; [bar.s] = mean home
score-away score. Sample includes games played 1989-2008 with
at least one Division 1-A team on non-neutral field. "Unranked"
restricted to teams receiving votes in final aggregate poll in at
least one of previous two seasons.
TABLE 3
MADs from Final Ranks (SDs in Parentheses)
            [absolute value of:]
         Estimated         Observed          Observed         Flat
         Posterior -       Posterior -       Prior -          Prior -
         Observed Final    Observed Final    Observed Final   Observed Final
Wins 3.90 4.02 3.89 4.66
(3.48) (3.61) (3.66) (2.66)
Losses 2.13 2.36 3.14 4.47
(3.28) (3.38) (3.57) (2.22)
Byes 3.51 4.04 3.97 4.62
(3.72) (3.88) (3.89) (2.43)
Total 3.49 3.67 3.74 4.61
(3.53) (3.65) (3.68) (2.55)
Notes: Sample includes all games from Weeks 1 to 7 of 2006/2008
seasons with available data and games played on non/neutral sites
(for wins/losses); N = 21,758, 6,664, 2,881 for wins, losses, and
byes, respectively.
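In the notation of this table, the MAD statistic is simply the average absolute gap between a given set of ranks and the observed final ranks; written out (with notation introduced here only for exposition),

    \mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| r_i - r_i^{\mathrm{final}} \right|,

where r_i is, for example, the estimated posterior rank of team-week observation i and r_i^final is the corresponding observed final rank.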
TABLE 4
Mean Rank Improvement (Prior
Rank-Posterior Rank; SDs in Parentheses) by
Prior Rank Group
Prior Rank          Wins                       Losses
Group          Observed   Estimated      Observed   Estimated
1-5 0.06 0.00 -7.66 -5.68
(1.29) (1.41) (3.60) (5.67)
6-10 0.74 0.34 -8.39 -8.89
(2.05) (2.46) (4.20) (4.63)
11-15 1.44 0.97 -6.49 -8.11
(2.49) (3.47) (3.16) (2.71)
16-20 2.30 2.56 -3.93 -4.22
(2.88) (3.53) (2.12) (2.02)
21-25 2.66 4.91 -0.42 -0.36
(2.54) (3.62) (1.02) (1.06)
Total 1.39 1.67 -4.90 -5.05
(2.49) (3.48) (4.12) (4.59)
Notes: Sample defined as in Table 3; N = 21,758, 6,664
for wins, losses, respectively. "Observed" = voter prior
rank-observed posterior rank; "Estimated" = voter prior
rank-estimated posterior rank.
TABLE 5
Summary Statistics of Variables Used for Overreaction Hypothesis
Testing
Variable Definition
OVER Estimated overreaction
1. HOME Home game dummy
2. SMARGIN Own score-opponent's score
3. OPPRANK Opponent ranked dummy
4. OR_SMARG SMARGIN x OPPRANK
5. EXPERIENCE Voter years of experience (since 1999)
6. STATE Team in same state as voter dummy
7. REGION Team in same region as voter dummy
8. AGGRKDIFF Aggregate rank-voter rank
9. PREV_YR_RK Previous year final aggregate rank
Wins Losses
Variable M SD M SD
OVER -0.28 3.15 -0.15 3.69
1. HOME 0.66 0.47 0.43 0.50
2. SMARGIN 23.84 15.36 -11.45 10.27
3. OPPRANK 0.27 0.44 0.62 0.48
4. OR_SMARG 4.43 9.77 -8.71 11.02
5. EXPERIENCE 2.64 2.90 2.59 2.92
6. STATE 0.03 0.18 0.03 0.18
7. REGION 0.10 0.30 0.10 0.30
8. AGGRKDIFF 0.46 3.49 0.94 3.82
9. PREV_YR_RK 20.11 15.21 24.23 17.03
Notes: Sample defined as observations used in Tables 3 and 4 with no
missing values for numbered variables; N = 21,645, 6,613 for wins,
losses, respectively. Numbered variables are elements of X in (4).
TABLE 6
Overreaction Estimation Results
Wins
(1) (2) (3) (4)
HOME 1.556 *** 1.562 *** 1.588 *** 0.953 ***
(0.173) (0.152) (0.218) (0.328)
SMARGIN -0.089 *** -0.088 *** -0.090 *** -0.097 ***
(0.005) (0.006) (0.006) (0.011)
OPPRANK -1.925 *** -1.875 *** -1.932 *** -1.602 ***
(0.300) (0.313) (0.296) (0.592)
OR_SMARG 0.052 *** 0.049 *** 0.047 *** 0.046
(0.014) (0.017) (0.015) (0.035)
TOP1_5 2.545 *** 2.522 *** 2.515 *** 3.545 ***
(0.259) (0.216) (0.202) (0.529)
TOP6_10 2.586 *** 2.466 *** 2.447 *** 3.448 ***
(0.213) (0.179) (0.183) (0.406)
TOP11_15 2.579 *** 2.233 *** 2.432 *** 3.445 ***
(0.226) (0.201) (0.196) (0.393)
TOP16_20 1.744 *** 1.830 *** 1.523 *** 2.795 ***
(0.159) (0.163) (0.161) (0.315)
AGGRKDIFF -0.159 *** -0.163 *** -0.153 *** -0.170 ***
(0.013) (0.015) (0.016) (0.025)
[R.sup.2] 0.280 0.270 0.268 0.300
N 21,645 21,645 21,645 7,170
Losses
(1) (2) (3) (4)
HOME -1.220 *** -1.332 ** -1.194 *** -0.549
(0.472) (0.531) (0.450) (0.935)
SMARGIN -0.002 0.007 0.008 -0.057
(0.029) (0.029) (0.035) (0.076)
OPPRANK -0.077 -0.052 -0.204 -0.312
(0.639) (0.789) (0.649) (1.330)
OR_SMARG -0.012 -0.021 -0.023 0.038
(0.040) (0.044) (0.040) (0.080)
TOP1_5 2.127 ** 1.516 2.553 *** 3.061
(1.011) (1.197) (0.816) (2.348)
TOP6_10 -0.526 -0.332 -0.690 0.689
(0.480) (0.509) (0.497) (1.339)
TOP11_15 -1.341 *** -0.958 *** -1.122 *** -1.031
(0.243) (0.312) (0.299) (0.727)
TOP16_20 -0.135 -0.127 -0.186 0.135
(0.208) (0.245) (0.215) (0.407)
AGGRKDIFF 0.105 *** 0.107 *** 0.098 *** 0.045
(0.018) (0.016) (0.017) (0.040)
[R.sup.2] 0.167 0.122 0.192 0.171
N 6,613 6,613 6,613 2,010
Notes: Bootstrap standard errors clustered by game in parentheses.
The dependent variable is OVER in all models; the estimated
posteriors are constructed using the voters' individual final
rankings as the true rankings in specifications (1) and (4), and
using the aggregate AP rankings and Sagarin computer rankings as
the true rankings in (2) and (3), respectively. In specification
(4) the estimated posteriors are constructed using priors estimated
only on data from 2006 to 2007 and the regressions are estimated on
a sample using only 2008 data. Year FE, EXPERIENCE, STATE, REGION,
PREV_YR_RK, WEEK, voter FE, and voter FE-WEEK interactions included
in all specifications.
Significance levels: * 10%; ** 5%; *** 1%.
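A rough sketch of how such a specification might be coded is given below; the DataFrame wins, the lowercase fixed-effect identifiers (year, voter), and the use of plain OLS in place of the bootstrap procedure of footnote 31 are all simplifying assumptions made for illustration.

    import statsmodels.formula.api as smf

    formula = (
        "OVER ~ HOME + SMARGIN + OPPRANK + OR_SMARG"
        " + TOP1_5 + TOP6_10 + TOP11_15 + TOP16_20 + AGGRKDIFF"
        " + EXPERIENCE + STATE + REGION + PREV_YR_RK + WEEK"
        " + C(year) + C(voter) + C(voter):WEEK"  # year FE, voter FE, voter-week terms
    )
    fit = smf.ols(formula, data=wins).fit()
    print(fit.params.filter(like="TOP"))  # prior rank group coefficients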
TABLE 7
Game Forecast Errors Estimation Results
Dep. Var = Win
(1) (2) (3) (4)
HIST_HOME -0.0029 0.0025 -0.0273 -0.0307
(0.0178) (0.0208) (0.0323) (0.0386)
HIST_WSM 0.0019 *** 0.0018 *** 0.0023 ** 0.0020 *
(0.0007) (0.0006) (0.0010) (0.0011)
HIST_TOP5 0.0098 0.0638 *** -0.0148 0.0710 ***
(0.0146) (0.0139) (0.0284) (0.0224)
[R.sup.2] 0.242 0.271 0.170 0.167
N 29,294 23,488 10,586 7,898
Dep. Var = SMARGIN
(1) (2) (3) (4)
HIST_HOME 0.806 0.818 1.020 0.822
(0.710) (0.788) (0.997) (1.057)
HIST_WSM 0.079 *** 0.072 *** 0.087 ** 0.075 *
(0.025) (0.025) (0.036) (0.038)
HIST_TOP5 0.172 2.075 ** -0.326 2.211 **
(0.669) (0.897) (1.085) (0.996)
[R.sup.2] 0.377 0.387 0.259 0.233
N 29,294 23,488 10,586 7,898
Notes: Bootstrap standard errors clustered by game in parentheses.
Specifications (2) and (4) use samples restricted to teams ranked 6-25;
specifications (3) and (4) use samples restricted to games in which the
opponent is ranked by at least one voter. STATE, REGION, voter FE,
WEEK FE, rank FE, and opponent rank FE included in all specifications.
Significance levels: * 10%; ** 5%; *** 1%.
TABLE 8
Mean Aggregate Posterior, Final Ranks
Conditional on Selected Prior Rank Groups and
Game Results (Standard Errors in Parentheses)
                                              Observed         Observed
                                              Posterior Rank   Final Rank
Home teams                                    13.72 (0.29)     20.79 (0.49)
Away teams                                    16.03 (0.41)     20.85 (0.59)
p-Value ([H.sub.0]: Home = Away)              <0.01            0.94
Top 16-20 teams that win by <10               15.82 (0.33)     27.04 (1.80)
Top 21-25 teams that win by [greater than
  or equal to] 10                             19.92 (0.22)     26.26 (1.13)
p-Value ([H.sub.0]: Top 16-20 = Top 21-25)    <0.01            0.72
Losing top 1-5 teams                          10.52 (0.34)     13.35 (1.71)
Winning top 6-10 teams                         7.02 (0.11)     13.69 (0.74)
p-Value ([H.sub.0]: Losing = Winning)         <0.01            0.86
Notes: Sample includes Weeks 1-6 of 1991-2005 seasons.
Posterior rank is following week rank (rank from Weeks 2-7 of
same seasons). Final rank is postseason rank. Expected rank of
50 conditional on being unranked used to calculate unconditional
expected (mean) ranks.
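To make the last sentence of the notes concrete, the unconditional mean rank mixes the mean conditional on remaining ranked with an assigned rank of 50 for unranked teams; in the display below the 0.6/0.4 split and the conditional mean of 12 are invented numbers used only to illustrate the calculation:

    E[\mathrm{rank}] = \Pr(\mathrm{ranked}) \, E[\mathrm{rank} \mid \mathrm{ranked}]
      + \Pr(\mathrm{unranked}) \times 50,
    \qquad \text{e.g., } 0.6 \times 12 + 0.4 \times 50 = 27.2.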