In recent years economists have begun to investigate how people might learn equilibrium behavior. Microeconomists following Binmore (1987) and Fudenberg and Kreps (1988) consider learning models with roots in Cournot (1838) and Brown (1951). Numerous laboratory studies test and refine the microeconomists' learning models; see Camerer (1998) for a recent survey. There is also a separate theoretical macroeconomics literature on learning following Marcet and Sargent (1989a, 1989b, 1989c) and Sargent (1994); see Evans and Honkapohja (1997) for a recent survey. The focus is on how people might learn to forecast relevant prices and whether the learning process permits convergence to rational expectations equilibrium. We are not aware of any laboratory work intended to test and refine the learning models favored by macroeconomists. (1) The current study is intended to fill that gap.

We gather laboratory evidence on the most basic questions a macroeconomist might ask about learning: Can people learn to forecast prices rationally? If there are obstacles to learning, are they transient or innate characteristics of human behavior? What sorts of environments reduce or enlarge those obstacles? Additional questions might be asked about the effects of learning observable in the usual macroeconomic and financial field data and about forecasting in a self-referential macroeconomic setting. Our work does not address such questions directly, but it does lay a foundation for later investigations of these additional questions.

Available evidence on the basic questions is rather disquieting. An extensive cognitive psychology literature, following Kahneman, Slovic, and Tversky (1973), finds that human forecasts are bedeviled by many systematic biases, such as the anchoring and adjustment heuristic, the availability and representativeness heuristics, base rate neglect, and confirmatory and hindsight biases; see Rabin (1998) and Camerer (1998) for recent surveys. There is also a small experimental economics literature on forecasting prices and rational expectations that reaches generally negative conclusions. Garner (1982) presents 12 subjects over 44 periods with a continuous forecasting task that implicitly requires the estimation of seven coefficients in a third-order autoregressive linear stochastic model. He rejects stronger versions of rational expectations but finds some predictive power in weaker versions. Williams (1987) finds autocorrelated and adaptive forecast errors by traders in simple asset markets. However, the true data-generating process is not stationary in this task and is unknown even to the experimenter, which makes it difficult to identify individually rational behavior. Dwyer et al. (1993) test subjects' forecasts of an exogenous random walk. They find excess forecast variance but no systematic positive or negative forecast bias for this nonstationary task.

A possible objection to both strands of the empirical literature is that neither provides good opportunities for learning. Most of the cognitive studies frame the tasks in ways that do not immediately engage subjects' forecasting experience, offer no salient reward, or provide little feedback that would allow subjects to improve performance. The three economics articles just cited have relatively few trials with complicated or nonstationary processes. Our study, by contrast, presents laboratory subjects with a moderately difficult forecasting task in several stationary learning environments.

We examine human learning in an individual choice task called Orange Juice Futures price forecasting (OJF). The OJF task has a form and complexity similar to the forecasting tasks in macroeconomists' models: Subjects must implicitly learn the coefficients of two independent variables in a linear stochastic process. The task is based on the observation of Roll (1984) that the price of Florida orange juice futures depends systematically on only two exogenous variables: the local weather hazard and the competing supply from Brazil. The laboratory experiment consists of many independent trials in which human subjects forecast the OJF price after observing values of the two variables. After each trial the subject receives feedback in the form of the "actual" price generated from the linear stochastic model using the observed values of the two variables. We report results for 99 subjects, each forecasting in 480 trials. Several treatments that may affect the learning environment are varied across subjects, such as the noise amplitude and the relative impact of the two variables.

We are interested in two aspects of learning: consistency and speed. Roughly speaking, learning is consistent to the extent that subjects eventually respond correctly to the exogenous variables, and learning is speedy to the extent that subjects settle quickly into a systematic pattern of response to the variables. To measure learning speed and consistency, we introduce a rolling regression (or sequential least squares) technique inspired by Marcet and Sargent (1989a, 1989b, 1989c). The technique gives us trial-by-trial estimates of each subject's implicit coefficient values or responsiveness to the two exogenous variables. We deem learning to be consistent if these estimates converge by the last trial to the objective values, and say that there is under- (or over-) response if the absolute values of the coefficients are below (or above) the objective values. We measure learning speed by comparing an individual subject's path of coefficient estimates and cumulative squared forecast errors to a Bayesian (or Marcet-Sargent [M-S]) ideal forecast.

The OJF task is a continuous analogue of the discrete response Medical Diagnosis (MD) task studied intensively by psychologists, such as Gluck and Bower (1988) and more recently by Kitzis et al. (1998). (2) The older psychological literature from Thorndike (1898) emphasizes reinforcement learning in binary tasks--actions that do well now are "reinforced" and chosen more frequently in the future. Naive reinforcement models do not extend naturally to our OJF task because it is not clear what reinforcement means in the context of continuous stimuli (weather and supply information) and continuous response (price forecast). The MD literature considers more sophisticated models of error-driven learning, including neural network or connectionist models and generalized discrete Bayesian models. The most striking finding of Kitzis et al. (1998) is that a generalized Bayesian model (a cousin to our rolling regressions) outperforms alternative psychological models in the version of the MD task closest to the present OJF task. The MD results encourage us to pursue rolling regression techniques in the OJF task.

Section II describes our experiment. Section III presents the results. The main conclusions are that (1) learning is quite consistent, in that most subjects' coefficient estimates converge closely to the objective values, although there is a slight general tendency toward overresponse, and (2) learning typically is noticeably slower than the M-S ideal. Among the more striking treatment effects are general tendencies (3) toward overresponse in the High Noise treatment and (4) toward underresponse in the Asymmetric treatment. Section IV discusses the results and proposes extensions of our work. Appendix A reproduces the instructions to subjects, and Appendix B documents the identification of unresponsive subjects. Other articles that rely on our data include Kelley and Friedman (forthcoming), which briefly summarizes the recent MD results together with preliminary OJF results, and Kelley (1998), which reports additional OJF results.


We induce the following linear stochastic relationship of price p to contemporaneous values of two exogenous variables, [x.sub.1] and [x.sub.2]:

(1) [p.sub.t] = [a.sub.1][x.sub.1,t] + [a.sub.2][x.sub.2,t] + [e.sub.t]

Subjects are told that p refers to the local orange juice futures price relative to its normal level. They are also told that [x.sub.1] refers to the local weather hazard, which could potentially destroy part of the domestic orange production, and that [x.sub.2] refers to the competing supply of oranges from Brazil. The realized price [p.sub.t] in trial t depends on the realized value of [x.sub.1,t] in [0, 100] and its coefficient [a.sub.1] (approximately 0.4 in the baseline treatment), and on [x.sub.2,t] in [0, 100] and its coefficient [a.sub.2] (approximately -0.4 in the baseline treatment). The coefficient signs reflect the economic reality that loss of domestic crops tends to increase price and that increased foreign supply tends to decrease price. The noise term e reflects the unpredictability of prices in field markets. Its value [e.sub.t] is drawn independently each trial from the uniform distribution on [-v, v], where the (maximum) noise amplitude v is a treatment variable (approximately 8 in the baseline treatment).

Subjects are instructed on the general nature of the task but are not specifically told the functional form or the coefficient values. Subjects are told that the experiment is a learning experience in which the goal is to learn the relationship between information (weather and competing supply) and the price of OJF. The instructions (Appendix A) state in nontechnical language that the relationship is stable but subject to random events that are independent across trials. Treatments described in the next subsection are held constant for each subject and are varied across subjects.

Subject Pool

We tested 99 undergraduates from the University of California at Santa Cruz, most of them from the pool of psychology students who need to fulfill a class requirement. Salient cash payments were offered in one treatment, described below.


The experiment uses a graphics computer program written in C++, run on Power Mac 7500/100 computers with full color monitors. Subjects in four sound-dampened isolated testing rooms view controlled events on the monitor screen and respond via clicking the mouse on various icons on the display. See Figure 1 for examples of screen displays. This setup was chosen to minimize boredom and to eliminate the possibility of peer pressure.


The realized values for weather [x.sub.1,t] and supply [x.sub.2,t] are independently drawn each period from the uniform distribution on (0,100), so the variables are orthogonal. The noise term is independently drawn each period from a different uniform distribution, U(-v, v). The realized values then are combined using equation (1) and chosen parameter values ([a.sub.1], [a.sub.2], and v) to produce a 480 trial sequence of prices. The same sequence of realized values and prices is used for all subjects in any given treatment condition.
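The procedure in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C++ program: the function name `simulate_prices`, the fixed seed, and the use of Python's `random` module are hypothetical choices, and the default parameters are the rounded Baseline values reported in the text.

```python
import random

def simulate_prices(a1=0.417, a2=-0.417, v=8.33, n_trials=480, seed=0):
    """Draw one sequence of (x1, x2, p) triples following equation (1).

    Defaults are the rounded Baseline values; the seed is arbitrary and
    only serves to make the sequence reproducible.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        x1 = rng.uniform(0, 100)   # weather hazard
        x2 = rng.uniform(0, 100)   # competing supply from Brazil
        e = rng.uniform(-v, v)     # noise term, independent across trials
        p = a1 * x1 + a2 * x2 + e  # equation (1)
        trials.append((x1, x2, p))
    return trials

trials = simulate_prices()
```

Fixing the seed mirrors the design choice described above: every subject in a given treatment condition sees the same sequence of realized values and prices.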


Each trial begins with the graphical presentation of the weather and supply values using two thermometer icons (labeled weather hazard and Brazilian supply) on the left side of the monitor display as in Figure 1 (top). Each thermometer is partially filled in red to indicate the realized value. Except in the No History treatment described below, the subject can also access (by clicking on the Previous Cases icon labeled C in Figure 1, top) the history of prices in previous trials with similar weather and supply levels, as in Figure 1 (middle). (3)

Subjects enter their forecast each period by moving slide B in Figure 1 (top) up or down within the possible price range. After the price prediction is entered and confirmed, a blue line appears on the slide bar to indicate the actual price in that trial as in Figure 1 (bottom). Except in the No Score treatment, the score box then appears as in Figure 1 (bottom). (4) After viewing the score box (if present) the subject advances to the next trial via a mouse click.

Each subject completes 480 self-paced trials. The session is broken into three blocks of 160 trials, and subjects are permitted five-minute breaks between blocks. Subjects generally finish in less than the allotted two hours.


We vary the learning environment using five alternatives to the baseline treatment. Actual participants in financial markets face a wide variety of conditions in terms of the availability of useful historical information, the immediacy and accuracy of information on current conditions, and the quality of feedback on investment decisions. We would like to know something about how such conditions affect the quality of human forecasts. Also, as mentioned in the introduction, most of the existing laboratory data come from unpaid subjects who produce poor forecasts. By using both paid and unpaid subjects, we can see whether this treatment seems pivotal. The treatments described after the baseline are ordered in increasing anticipated difficulty for making accurate price forecasts.

Baseline. The baseline treatment provides parameter values of [a.sub.1] = 0.417, [a.sub.2] = -0.417, and v = 8.33. The history and score boxes appear as described above. We regard the baseline treatment as resembling favorable information conditions in the field, when investors have good access to historical and contemporary information and immediate feedback. If subjects don't learn well in this environment, then either the task exceeds their cognitive abilities or they have insufficient motivation.

Paid. This treatment differs from baseline only in that subjects are paid according to their final scores. Each subject receives a $5 show-up fee covering the first 30,000 points of final cumulative score. (Actual final scores always exceeded 30,000, with the top scores over 37,000.) Subjects also receive an additional $1 for each 700 points scored above 30,000. The median payment was about $15 with top payments about $16.50. Subjects are told the payment procedures on arrival. This is a priori the most favorable environment for learning because the financial incentive seems sufficient to elicit subjects' serious effort. The a priori prediction of most experimental economists is faster and more consistent learning than in the baseline; most psychologists would predict no effect.
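As a worked check of the payment rule, the arithmetic can be written as a small function. The function name and the assumption that the bonus is prorated (rather than paid in whole-dollar steps) are ours; the text does not say how partial blocks of 700 points were handled.

```python
def payment_dollars(final_score, show_up_fee=5.0, threshold=30000,
                    points_per_dollar=700):
    """Payment implied by the rule in the text.

    Assumption (not stated in the text): the bonus is prorated, so a
    partial block of 700 points earns a fraction of a dollar.
    """
    bonus_points = max(0, final_score - threshold)
    return show_up_fee + bonus_points / points_per_dollar
```

Under this assumption a final score of 37,000 pays $15.00 and a score of 38,000 pays about $16.43, which is in line with the reported median payment of about $15 and top payments of about $16.50.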

No Score. This treatment differs from baseline only in that subjects do not have access to the Results or Score box icon and box. The omission of this feedback may degrade the learning environment in that subjects no longer have a direct measure of their relative performance. (5) Of course, subjects still have access to all information that is directly useful in forecasting price, so the effect might be small.

No History. This treatment differs from baseline only in that subjects do not have access to the Previous Cases or History icon and box. A priori we expect that our handy summary of relevant historical information enables subjects to learn more rapidly. It is hard for most people to remember and organize hundreds of previous observations. Of course, the trial-by-trial outcomes still are all observable in this treatment, so an ideal observer would not be affected. Therefore we again have two competing hypotheses for this treatment: no effect, or slower and perhaps less consistent learning.

Asymmetric. The only difference from baseline is that the coefficient values are [a.sub.1] = 0.250 and [a.sub.2] = -0.583. Thus, the weather and the competing supply stimuli no longer have equal (or symmetric) impact on OJF price. A priori it is not clear whether this treatment creates a more difficult learning environment than No History. It would have no effect on a subject who learned each coefficient independently, but would reduce the learning speed and consistency of a subject with symmetric priors. Psychological studies for binary tasks with asymmetrically salient stimuli suggest additional hypotheses. If the overshadowing effect of Kahneman et al. (1982) were present, subjects would tend to overrespond to the more heavily weighted stimulus and ignore the less important stimulus; that is, the larger stimulus overshadows the smaller one. If this effect extends to our continuous task, we would see a bias toward overresponding to the more important news [x.sub.2] and underresponding to the less important news [x.sub.1].

High Noise. The final treatment almost doubles the noise amplitude to v = 14.3, and the coefficient values [a.sub.i] are scaled to [+ or -]0.357, as described in the Data Processing subsection. All other features are as in the Baseline treatment. We expect High Noise to slow down learning appreciably. Even an ideal M-S learner would take longer to reach a given degree of precision in estimating coefficients, because the standard error of the M-S least squares estimates is decreasing in sample size (T) but increasing in the variance of the error term ([[sigma].sub.[epsilon]]); see Greene (1993), section 5.6.1. Humans may face additional obstacles because they tend to have difficulty separating random from systematic variability (e.g., Rabin 1998; Brehmer 1980). The effect is important in field applications, because some macroeconomic and financial variables are quite noisy (i.e., the nonsystematic component has large amplitude relative to the systematic component) and others are not very noisy.
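The dependence of estimation precision on noise and sample size can be illustrated with the textbook formula Var(b) = sigma^2 (X'X)^(-1). The sketch below is a large-sample approximation under our own simplifying assumptions: a two-regressor model with no intercept, independent regressors uniform on (0, 100), and population moments substituted for sample moments. The function name is hypothetical.

```python
import math

def approx_coefficient_se(v, n_trials):
    """Approximate standard error of either OLS coefficient.

    Assumes y = a1*x1 + a2*x2 + e with no intercept, x1 and x2
    independent U(0, 100), and e ~ U(-v, v). Sample moments are
    replaced by their population values.
    """
    sigma2 = v ** 2 / 3.0       # variance of U(-v, v)
    m_xx = 100.0 ** 2 / 3.0     # E[x^2] for x ~ U(0, 100)
    m_x1x2 = 50.0 * 50.0        # E[x1 * x2] under independence
    det = m_xx ** 2 - m_x1x2 ** 2
    inv_diag = m_xx / det       # diagonal element of inverse moment matrix
    return math.sqrt(sigma2 * inv_diag / n_trials)

baseline_se = approx_coefficient_se(8.33, 160)    # roughly 0.010
high_noise_se = approx_coefficient_se(14.3, 160)  # roughly 0.017
```

With a 160-trial sample, the approximation gives a standard error of about 0.010 for the Baseline noise amplitude and about 0.017 for High Noise, so reaching the Baseline precision under High Noise would take roughly three times as many trials.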

The general pattern one might anticipate is reduced forecast consistency and speed as we move down the list of treatments from Paid to High Noise. That is, we should see on average less accurate final estimates of the objective weights [a.sub.i] and slower convergence. The final scores should also decline, because these scores reflect forecast errors accumulated over all trials and therefore proxy for learning speed. Finally, the fraction of subjects displaying significant deviations from consistent forecasts should increase.

Before analyzing the data, we had no idea how good the forecasts would be. Possibly the representativeness heuristic and base rate neglect would lead to massive and persistent overestimates of both coefficients, or the anchoring and adjustment heuristic would produce initial estimates close to zero that converged toward the true values very slowly. Possibly overshadowing would be important in the Asymmetric treatment. Before looking at the results, we need to explain how the data are prepared for analysis.

Data Processing

The Baseline values of [a.sub.i] are scaled as follows. Begin with unscaled values [a.sup.*.sub.1] = 0.5 and [a.sup.*.sub.2] = -0.5. Given noise amplitude [v.sup.*], equation (1) implies that the unscaled price ranges from [p.sup.*] = 0.5(0) - 0.5(100) - [v.sup.*] = -(50 + [v.sup.*]) to [p.sup.*] = 0.5(100) - 0.5(0) + [v.sup.*] = (50 + [v.sup.*]). To fit in the screen's range (-50, 50), we display the scaled price p = 50[p.sup.*]/(50 + [v.sup.*]). The scaled coefficients therefore are [a.sub.i] = 50[a.sup.*.sub.i]/(50 + [v.sup.*]) and the scaled noise amplitude is v = 50[v.sup.*]/(50 + [v.sup.*]). For the Baseline noise value [v.sup.*] = 10 we have v = 8.33 and [a.sub.i] = 0.833[a.sup.*.sub.i] = [+ or -]0.417. The scaled coefficients used in the High Noise ([v.sup.*] = 20) and Asymmetric treatments are derived in a similar fashion.
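The scaling arithmetic can be checked with a short helper. The function name and argument names are ours; the formula is the one given in the paragraph above, applied to the symmetric case in which the unscaled coefficients are [+ or -]0.5.

```python
def scale_parameters(a_star, v_star, screen_half_range=50.0, max_signal=50.0):
    """Scale an unscaled coefficient and noise amplitude to the screen range.

    max_signal is the largest absolute value of the noise-free unscaled
    price (50 when the unscaled coefficients are +0.5 and -0.5).
    """
    factor = screen_half_range / (max_signal + v_star)
    return a_star * factor, v_star * factor

a_base, v_base = scale_parameters(0.5, 10.0)   # Baseline
a_high, v_high = scale_parameters(0.5, 20.0)   # High Noise
```

The Baseline call yields approximately 0.417 and 8.33, and the High Noise call yields approximately 0.357 and 14.3, matching the treatment values quoted in the text.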

For a given subsequence of trials ([p.sub.t], [x.sub.1,t], [x.sub.2,t]), t = [t.sub.0],..., T, we define the ideal Bayesian (or Least Squares or M-S) learner by regressing [p.sub.t] on the independent variables [x.sub.1,t] and [x.sub.2,t] via ordinary least squares (OLS). The regression over this subsequence of trials yields coefficient estimates [a.sub.1,T] and [a.sub.2,T]. The subsequences we consider consist of trials 1 to 160, 2 to 161,..., 321 to 480. Thus we obtain learning curves [a.sub.1,T] and [a.sub.2,T] for T = 160, 161,...,480, which can be interpreted as ideal subjective estimates of the objective values [a.sub.1] and [a.sub.2]. We refer to these as the M-S learning curves.

We use similar rolling regressions for human subjects. An actual subject may think of the task in various idiosyncratic ways--for example, he may believe that prices are serially correlated or that price is a nonlinear deterministic function of the exogenous variables, despite instructions to the contrary. Nevertheless, the analyst can summarize the subject's beliefs by seeing how he responds to the current stimuli [x.sub.i,t], and can summarize the learning process by seeing how the subject's response changes with experience. Our approach therefore is to reconstruct implicit beliefs using equation (1) and subjects' actual responses.

The reconstruction proceeds as follows. Take the subject's actual forecast [c.sub.t] in trial t as the dependent variable, and run rolling regressions as before on the realized values [x.sub.i,t], using a moving window of 160 consecutive trials with the last trial T ranging from 160 to 480. Consistent and speedy learning is indicated by rapid convergence of the coefficient estimates [a.sub.i,T] (as T increases) to the objective values [a.sub.i]. Obstacles to learning are suggested by slow convergence, by convergence to some other value (which represents over- or underresponse), or by divergence of the coefficient estimates.

Some details may be worth noting briefly. (1) In all the results reported below, the intercept coefficient [a.sub.0] is constrained to its objective value of zero. Excluding the intercept doesn't affect our main results, but it does reduce clutter and improve statistical efficiency. (2) In preliminary work we considered stretchable windows of data running from t = 1 to T, to capture fully the evidence available to the subject (or M-S ideal learner) in trial T. However, the entire learning curve then reflects the subject's initial response pattern as well as the recent response pattern. We concluded that learning curves would be more informative when estimated from a moving window that includes only the most recent responses. Of course, the recent responses already incorporate everything the subject has learned since the beginning of the session. (3) Lengthening a (nonstretchable) moving window reduces standard errors in the coefficient estimates but also reduces the weight on the most recent responses. After a cursory investigation of preliminary data, we settled on length 160 as a reasonable compromise. (4) We use OLS in the spirit of Marcet and Sargent. Because the error term [[epsilon].sub.t] is uniformly distributed rather than normal, maximum likelihood (ML) estimation methods in theory could give better coefficient estimates. (6) We checked and found that all ML estimates were insignificantly different from the OLS estimates.
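The rolling regression procedure described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the demonstration data are simulated with the rounded Baseline parameters and an arbitrary seed, and the two-regressor normal equations are solved in closed form with the intercept fixed at zero.

```python
import random

def ols_no_intercept(x1, x2, y):
    """Least squares fit of y = b1*x1 + b2*x2 with the intercept fixed at zero."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def rolling_coefficients(x1, x2, y, window=160):
    """Return (T, b1, b2) for each window ending at trial T (1-indexed)."""
    curves = []
    for last in range(window, len(y) + 1):
        first = last - window
        b1, b2 = ols_no_intercept(x1[first:last], x2[first:last], y[first:last])
        curves.append((last, b1, b2))
    return curves

# Demonstration on simulated Baseline-like data (hypothetical seed).
rng = random.Random(1)
x1 = [rng.uniform(0, 100) for _ in range(480)]
x2 = [rng.uniform(0, 100) for _ in range(480)]
prices = [0.417 * a - 0.417 * b + rng.uniform(-8.33, 8.33)
          for a, b in zip(x1, x2)]
curves = rolling_coefficients(x1, x2, prices)
```

Passing the realized prices as `y` gives an M-S style learning curve; passing a subject's forecasts [c.sub.t] instead gives that subject's implicit coefficient path. With 480 trials and a window of 160, the result has 321 entries, one for each T from 160 to 480.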


The data analysis provides various comparisons between human subjects' forecasts and the ideal Bayesian (or M-S) forecasts. The first step is to get a qualitative impression of the overall comparison and an impression of how the learning environment treatments affect the subjects' performance. Then we proceed to formal statistical tests of the various hypotheses presented earlier.

Figure 2 presents a sample of learning curves in each treatment. Each panel of the figure shows the objective coefficient values as a horizontal dotted line and shows the ideal M-S learning curves as thin, continuous lines. The rolling regressions that generate the M-S curves seem to capture the price data quite well; typical [r.sup.2]s ranged from 0.91 for the first 160-trial window of data to 0.93 for the last window. We were pleased to see that M-S learning is consistent and quite rapid--indeed, it is virtually complete within the first 160 trials, as indicated by closeness of the dotted and continuous lines in every panel. The gap between the lines typically is about one standard error of the M-S coefficient estimate.

The heavy continuous lines in each panel of Figure 2 represent the learning curves for the highest scoring subject or the subject with the median score in each treatment. The width of the line roughly represents a one-standard-deviation band around the coefficient estimate. The rolling regressions again had typical [r.sup.2]s above 0.90. The first two panels show moderate but persistent overresponse to current weather and supply information, with implicit coefficient estimates lying closer to [+ or -]0.45 than to [+ or -]0.42 for both subjects in the Baseline treatment. The next two panels suggest that the top-scoring Paid subject is right on target, but the median scorer tends to underrespond slightly. Overresponse seems strongest for the top-scoring subject in the No History treatment and the two subjects shown in the High Noise treatment. The two subjects shown in the Asymmetric treatment appear to underrespond in most trials.

To conserve space we do not show the learning curves for the other 87 subjects. Suffice it to say that subjects sometimes overrespond, sometimes underrespond, but typically are fairly close to the objective values. Subjects seem to update more slowly than the M-S ideal learner. The prediction that performance deteriorates as we go down the list of treatments is not contradicted by visual impressions of the learning curves. But neither is it strongly supported; individual variability within each treatment makes it difficult to see the treatment effects clearly. The rest of this section seeks answers more systematically using statistical tools.

Distribution of Scores

Figure 3 shows the distribution of the scores earned by subjects in each treatment. Recall that score is a proxy for learning speed, which is expected to decline as we move down the list of treatments from Paid to High Noise. Recall also that the M-S ideal learner would earn the same score in all treatments except High Noise, where the larger error variance would lower the score.

The figure shows that forecasts often are quite good. In most treatments the highest score is close to 38,000, only a bit below the M-S ideal. The modal score and the median score usually are not very far behind. Mean scores are usually lower because the lowest scores are much lower, sometimes below 34,000. The mean scores indeed have the expected ranking. Paid is highest, followed by Baseline, then No Score, No History, and Asymmetric treatments. The main surprise is High Noise, where the mean score is a bit higher than in Asymmetric. For comparison, we calculated scores in the Baseline treatment for two sorts of zero intelligence agents or nonlearners. An agent who always forecasted zero (the optimal uninformed forecast) would score 34,326 and an agent who always used last period's price as the forecast would earn 30,647.

Closer examination of the raw data raises questions about the motivation of the subjects with lowest scores. We found that these subjects generally stopped responding to the weather and Brazil supply information at some point during the session. Subjects who don't care about performance but seek only to finish quickly can do so by just clicking the OK icons in every trial, leaving the price forecast at the default value c = 0. We identified such behavior in 9 of the 99 subjects, whose scores are flagged by asterisks in Figure 3. (7) We now face a methodological issue. In general we do not recommend the ex post exclusion of unresponsive subjects from analyses. However, unthinking responses of c = 0 will bias coefficient estimates toward 0, so it is potentially important for subsequent data analysis to identify such behavior. Our solution is to report results both for the full sample and for a reduced sample that excludes the unresponsive subjects, and to carefully document our exclusion procedures in Appendix B. These complications are the price we pay for maintaining comparability to the psychological literature by using unpaid subjects in most treatments. Fortunately, none of our conclusions are reversed when we move from the full to the reduced sample; some results are sharpened.

We are now ready to report statistical tests of treatment effects. Given the problems with outliers, it is appropriate to use a robust (but possibly less powerful) nonparametric test. The standard Wilcoxon test fails to reject the null hypothesis of no difference in median scores between Paid and Baseline (p = 0.15). Again relative to Baseline, the same test detects no significant impact of the No Score (p = 0.71) and No History (p = 0.56) treatments. The corresponding null hypothesis is rejected for the Asymmetric (p = 0.002) and High Noise (p = 0.002) treatments, confirming the research hypotheses that scores are lower in these treatments. We conclude that humans indeed learn more slowly in these environments. (8)

Distribution of Coefficient Estimates

We now consider the key question of consistency: Do humans eventually learn the objectively correct response to weather and supply news? Recall that the coefficient estimates [[alpha].sub.i] from the regression equation explaining individual forecasts [c.sub.t],

(2) [c.sub.t] = [[alpha].sub.1][x.sub.1,t] + [[alpha].sub.2][x.sub.2,t] + [e.sub.t]

represent the individual's response, and the weights [a.sub.i] used in the data-generating process in equation (1) represent the objectively correct response. Learning is consistent to the extent that the final estimates over trials T - 159 to T coincide with the true values [a.sub.i] by the last trial (T = 480). Figure 4 shows by treatment the distribution across subjects of both coefficient estimates in the last trial.

Overall, the subjects seem to have it about right: The estimates center near the objective value and most of the estimates are not far away. Moreover, most of the outlying estimates are spurious underresponses from the nine unresponsive subjects (denoted with asterisks). The figure also suggests some treatment effects. There may be a slight bias toward overresponse in the High Noise treatment and toward underresponse to the more important stimulus (Supply) in the Asymmetric treatment. The distributions appear to be tighter for the Paid subjects, as predicted in the motivation hypothesis favored by experimental economists. As predicted in the main treatment hypotheses, the dispersion appears to increase slightly as we move to the more challenging learning environments indicated in panels C-F, especially E (Asymmetric) and F (High Noise).

These impressions are not reliable for two reasons. First, some treatments have more subjects than others, so it is difficult for the eye to properly compare the distributions behind the histograms. Second, estimated standard errors are approximately 0.02 for the High Noise treatment and approximately 0.01 for the other treatments. The histogram bins in the figure have width 0.10, or 5-10 standard errors, so the classification is a bit coarse.

Table 1 rectifies these shortcomings. It classifies a final (T = 480) coefficient estimate as objectively correct if its central 95% confidence interval contains the corresponding final value from the M-S simulation. (9) The estimate is classified as over- (or under-) response if the confidence interval lies entirely outside (or entirely within) the interval from zero to the (M-S) objective value. Overall, a plurality of estimates (71 of them) are classified as objectively correct, and there are roughly equal numbers of overresponses (59) and underresponses (48 plus 18 questionables).
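The classification rule can be sketched as a small function. The names are ours, and two points are assumptions rather than statements from the text: the 95% interval is taken as the estimate plus or minus 1.96 standard errors, and an interval that straddles zero or has the wrong sign is labeled "other" (the text's treatment of questionable cases is not spelled out here).

```python
def classify_response(estimate, std_error, target, z=1.96):
    """Classify a coefficient estimate relative to a nonzero target value.

    Returns "correct", "over", "under", or "other". The z = 1.96
    interval and the "other" category are illustrative assumptions.
    """
    sign = 1.0 if target > 0 else -1.0
    center = sign * estimate          # flip so the target is positive
    goal = sign * target
    low = center - z * std_error
    high = center + z * std_error
    if low <= goal <= high:
        return "correct"              # interval contains the target
    if low > goal:
        return "over"                 # entirely beyond the target
    if low >= 0.0:
        return "under"                # entirely between zero and target
    return "other"                    # straddles zero or wrong sign
```

For example, with a target of 0.417 and a standard error of 0.01, an estimate of 0.45 is classified as overresponse, whereas an estimate of -0.30 against a target of -0.583 with standard error 0.02 is classified as underresponse.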

The main imbalances arise in the last two treatments. Underresponse to the more important variable (as in Figure 4) and overresponse to the other variable are quite prevalent in the Asymmetric treatment. In the High Noise treatment, a majority of the nonquestionable estimates for both coefficients are classified as overresponse and none is classified as underresponse.

Formal statistical tests of the main treatment hypotheses are reported in the last column of the table. The entries are Wilcoxon p values for each of the two coefficients in each treatment for the full sample (and in parentheses, for the reduced sample that excludes the nine unresponsive subjects). In three cases the tests reject (at the conventional p = 0.05 level in the reduced sample) the null hypothesis that the estimates center at the objective value [a.sub.i], in favor of the following one-sided alternatives. There is significant underresponse to the Supply variable in the Asymmetric treatment (p = 0.00), and significant overresponse to both variables in the High Noise treatment (p = 0.02, 0.00). There is also marginally significant overresponse to the Supply variable in the Baseline treatment (p = 0.08). The other cases of apparent under- and overresponse do not produce significant results in this conservative test.

Table 1 also reports behavior observed halfway through the session, at T = 240. Recall from Figure 2 the impression that moderate but shrinking overresponse is quite typical at this point. The table shows that overresponse at the halfway point indeed is somewhat more prevalent than at the end of the session, especially in the Paid and High Noise treatments.

Summary and Interpretation

Several general conclusions emerge from the data analysis. First and foremost, our subjects learn rather quickly to produce surprisingly consistent forecasts. We see little evidence in our forecasting task of anchoring and adjustment (systematic underresponse) or of the representativeness heuristic or base rate neglect (systematic overresponse). Typical subjects in early trials often overrespond or underrespond somewhat to the two news sources ($x_1$ = weather and $x_2$ = information), but by the end of the experiment they have it about right.

Of course, human subjects do not learn as fast as an ideal Bayesian (or M-S econometrician). At the halfway point (T = 240) of the experiment, the coefficient estimates $a_1$ and $a_2$ indicate a slight tendency toward overresponse. Table 1 and sample learning curves in Figure 2 show that this tendency almost disappears by the end (T = 480) of the experiment. The magnitude of the lag is indicated by the subjects' scores, which average 3%-8% lower than the ideal.

Some clues as to how human forecast performance varies with the learning environment can be gleaned from the impact of the laboratory treatments. First, increasing subject motivation by offering higher cash payments for more accurate forecasts has a surprisingly modest impact in our experiment. Unlike most other treatments, there are no subjects with questionable motivation in the Paid treatment. Also, the distribution of coefficient estimates seems tightest in the Paid treatment, consistent with significant findings in other experiments (Smith and Walker, 1993). However, the difference from Baseline turns out not to be significant in our data according to standard parametric and non-parametric tests. Of course, our sample size is not large because the effect of salient payments was a secondary concern in our experimental design.

Surprisingly, neither the No Score treatment nor the No History treatment significantly impaired the subjects' scores or accuracy of the estimated coefficients. It seems that our subjects were able to keep track of some sort of summary statistic that made the score and the history summary almost redundant. The Asymmetric treatment, however, significantly lowered scores and pushed subjects significantly toward underresponse to the more important information and (insignificantly) toward overresponse to the less important information. Asymmetry is typical in field environments, so the finding is potentially important. It is the reverse of the overshadowing effect documented by psychologists in other tasks; see Busemeyer (1993) for another example of reverse overshadowing. The High Noise treatment had the strongest impact: significantly lower scores and significant overresponse to both information variables. The implication seems to be that learning is slower and more biased when markets are volatile. One might conjecture that such inefficient learning might contribute to market volatility and partially explain the clustered volatility documented in many financial markets (see Kelley, 1999).


As we noted in the introduction, existing literature from cognitive psychology indicates that humans typically make very irrational choices in simple laboratory tasks. In sharp contrast, our human subjects (with some modest exceptions) rather quickly learn highly rational behavior in a nontrivial forecasting task. What accounts for the divergent results?

In some ways our experiment makes it difficult for subjects to be rational. The task is challenging in that the target variable, price, is stochastic and contingent on two independent variables. Another challenging aspect of our experiment is that we used psychology pool subjects, unpaid in most treatments. Irrational behavior exhibited by such subjects in some tasks disappears when subjects drawn from other pools are offered salient payments (Friedman and Sunder, 1994). With the exception of 9 of 99 subjects whose motivation was questionable, our nonpaid subjects behaved quite rationally and appear to be just as motivated as the paid participants.

But in other ways our experiment gives rationality its best shot. The basic task allows subjects to learn over a relatively long sequence of 480 trials in a stationary environment. Our laboratory setup encourages subjects to draw on relevant intuitions about price determination and avoids features that might suggest inappropriate heuristics. The visual interface encourages rapid and unbiased processing of information and feedback. If anything, the interface biases subjects toward underresponse, because the default response is 0 and the subject must move the slide up or down from that point. The reduced sample used in some of the data analysis screened out the most egregious cases of default response, but perhaps some slight bias remains. (10) Arguably our setup is more representative of economically important field environments than some of the setups used in laboratory studies that find irrational behavior.

The rational behavior is fairly robust. Performance was not significantly impaired in the No Score and No History treatments, which eliminated useful feedback. Even in the Asymmetric and High Noise treatments, performance was still quite good. Kelley (1998) reports several additional robustness checks that reinforce this conclusion. (11)

An important extension of the work presented here, especially from the macroeconomics point of view, is to introduce self-referential price determination. Marcet and Sargent (1989a, 1989b, 1989c) study several linear stochastic models where traders' forecasts affect the actual price observed each period. They derive conditions on traders' learning processes (rolling regressions in essence) that ensure convergence of actual price to rational expectations equilibrium. It seems feasible to implement such economies in the laboratory and (given some stronger assumptions than needed in the present article) to extract estimates of subjects' learning processes. We conjecture that the empirical models introduced herein will continue to do well in a more complex self-referential setting. Equally important, the findings here should be examined in field settings. The observable implications of the modest departures from rationality we observed in the Asymmetric and High Noise treatments must be derived and tested on suitable finance and macroeconomic data. Kelley (1998) begins this task.

We see two main lessons in the present results. First, people can learn to make quite good forecasts. Second, some slight but systematic biases remain. In particular, even after 480 trials, subjects still tended to overrespond to news in the High Noise environment. Slight individual biases might interact to produce economically important market biases (Akerlof and Yellen, 1985; DeLong et al., 1990; Kelley, 1998). More theoretical and empirical work is needed to understand fully the effects in major financial markets, and more experimental work is needed to understand learning in self-referential, nonstationary environments.



(Revised 5/98)


In this experiment you will be asked to use information to make predictions. You will look at information on competing supply levels and on weather hazard and will predict orange juice futures prices. Orange juice price determination in this experiment is fictitious but basically similar to real life. Your job is similar to that of an investor who must use imperfect information to predict futures prices.

In this experiment, new information arrives each period (or harvest season) on (1) the weather hazard for the local orange crop and (2) the supply of oranges in the main competing region, Brazil (see label A at Figure 1, top). Each piece of information can take on a value from 0 to 100. A value of 0 for weather hazard means that there will be no loss of local production due to inclement weather, and a value of 100 means likely massive damage to the local crop. Similarly, a value of 0 for supply means a very small Brazilian production and a value of 100 means the largest possible Brazilian crop.

Each period after viewing the information on weather and supply, you will enter your price prediction. Prices are measured within the range -100 (all the way DOWN, or 100 cents below the normal level) to +100 (all the way UP, or 100 cents above the normal level). For example, sliding the box (see Figure 1, top) to the topmost UP position indicates that you believe that the current supply and weather conditions will result in a price 100 cents above the normal price. Likewise, sliding the best guess box to the bottommost DOWN position indicates that you believe the current crop conditions imply a price 100 cents below the normal price. Moving the box halfway up (halfway down) between the middle and top (bottom) predicts a price 50 cents above (below) the normal level. Leaving the best guess box at its original position predicts exactly the normal price level.


Each period (or harvest season), you should first look at the information chart. You may be able to get useful additional information by clicking on the Previous Cases box. If it is present (see Figure 1, top) it will be under the chart symbols. When you click that box, a window will appear in the lower left corner of the screen (see Figure 1, middle). The first column of the window lists the current information on competing supply and/or weather hazard. The second column lists the number of times so far in the experiment you have seen similar supply and weather conditions, i.e., within plus or minus 10. For example, in Figure 1, middle, in all previous periods a weather hazard between 0 and 17 has occurred 1 time, and a supply between 59 and 79 has occurred 3 times. The third column gives the average price in these similar conditions. For example (see Figure 1, middle), the current harvest's low weather hazard (7) was associated with a price 35 cents below normal, and the somewhat high competing supply (69) was associated with a price 18 cents below normal. Click OK to leave the Previous Cases window.

After you have considered the relevant information, you enter your forecast by clicking the slide box and moving it to your chosen location on the ruler. After you have made your prediction the UP or DOWN box will be darkened if you predict a price different from the normal level, otherwise they will both remain light. Click on OK to submit your forecast. You will then be told the actual price that period. A blue bar will appear on the ruler to indicate the actual price (see Figure 1, bottom). You may then be given a numerical score for your prediction this harvest and a cumulative score for all harvests to date (see Figure 1, bottom). You will then get the information chart for the next period.

Your goal is to predict as accurately as possible each period. There will be many periods for you to predict. Work at your own pace. The experiment should take less than 2 hours. We ask that you do not take notes.


Your score is the profit an investor makes when acting on your price prediction. Each harvest you earn points based on your prediction (between -100 and 100) and the actual price that harvest. Profit is higher the more accurate your forecast (See Figure 1, bottom). For example, if the actual price turns out to be 70 cents above normal, then your score is highest if your prediction was +70, a bit lower if your prediction was +60 or +80, and much lower if you predicted 0 or below.


You should not expect your forecasts to be exactly correct each period. The same supply and weather conditions can sometimes lead to a price increase and sometimes to a price decrease relative to the normal level. But if you properly use the average effects of weather and competing supply, your forecasts will usually be fairly accurate.

Each harvest period researchers collect available information about market conditions affecting orange juice. The information is distilled into the charts you see. The charts always record the available information correctly. The two pieces of information are independent in the sense that, for example, a high local weather hazard does not indicate a high or low Brazilian supply.

Each piece of information tends to be associated with higher or lower prices, but there is never certainty. An expert who completely understands the effects of competing supply and weather hazards typically earns much higher profits than a novice, but even the expert can't predict perfectly each period.

Feel free to ask the experimenter about anything in these instructions or in the experiment that is unclear to you.

The reduced sample omits 9 of the 99 subjects. The omitted nine usually earned the lowest scores in their particular treatment group. The criterion for omission was whether the subject actually responded to the stimuli or always entered the default continuous response of 0 (corresponding to a normal price forecast, or no expected price change) in many consecutive trials. Here are the specifics.

Subject #   Score      Treatment    Subject behavior

10          32638.82   Baseline     Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Second lowest score.
20          33753.02   High Noise   Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Lowest score in group.
29          33071.24   Asymmetric   Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Second lowest score.
30          36294.49   High Noise   Completely stopped responding early in experiment. Sixth lowest score in group.
34          31426.81   Asymmetric   Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Lowest score.
40          35851.27   High Noise   Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Third lowest score.
61          35558.64   High Noise   Completely stopped responding. Overresponse that moves to underresponse. Second lowest score.
74          32271.7    Baseline     Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Lowest score.
89          35965.6    High Noise   Virtually all responses are default ($c_t$ = 0) for 50 to 200 consecutive trials. Fourth lowest score.





Table 1. Frequency of Correct Response

                                            T = 240                 T = 480             Wilcoxon p values
Condition    $|a_1|$  $|a_2|$  #Ss    Over   On    Under     Over   On    Under       $a_1$, $a_2$ (reduced sample)

Baseline     0.40     0.40     17     16     8     10        11     14    9(-4)       0.49, 0.96 (0.98, 0.08)
Paid         0.40     0.40     22     27     6     11        15     13    16          0.67, 0.89
No Score     0.40     0.40     13     10     7     9         7      13    6           0.89, 0.74
No History   0.40     0.40     12     8      7     9         6      9     9           0.47, 0.47
Asymmetric   0.24     0.57     15     2      10    18        3      9     18(-4)      0.08, 0.00 (0.22, 0.00)
High Noise   0.33     0.33     20     28     6     6         17     13    10(-10)     0.81, 0.39 (0.02, 0.00)

Overall                        99     91     44    63        59     71    68(-20)


(1.) We hasten to add that several important laboratory investigations have been inspired by other strands of macroeconomic theory. For example, Van Huyck et al. (1997) and related work study equilibrium convergence in coordination games; Marimon and Sunder (1994) and related work study sunspot equilibria in overlapping generations economies. We will discuss three laboratory studies of rational expectations equilibrium.

(2.) Our OJF study differs in two other respects from the MD studies by other investigators. Our rolling regressions are fit trial by trial utilizing only information the subject has actually seen. Typically the MD studies first train their models and then provide an in-sample fit. Also, the MD studies fit their models to aggregate behavior averaged across groups of subjects. The OJF model fits are to individual subjects, to investigate heterogeneity that might affect macroeconomic aggregates.

(3.) The history box numerically displays the current realization of both variables, the number of previous trials for each variable whose realization is within ten of the current realization, and the average realized price in those previous trials. The box remains on the screen until the subject clicks on the OK icon.
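The history box computation can be sketched as follows. This is a hypothetical reconstruction rather than the experiment's software: the function name, the tuple layout of a trial, the inclusive treatment of the "within ten" boundary, and the sample trials are all assumptions made for illustration.

```python
def previous_cases(history, current_value, variable, window=10):
    """Summarize earlier trials with a similar realization of one variable.

    history: list of (x1, x2, price) tuples for the trials seen so far.
    variable: 0 for the first information variable, 1 for the second.
    Returns (count, average_price); the average is None when count is 0.
    """
    prices = [trial[2] for trial in history
              if abs(trial[variable] - current_value) <= window]
    count = len(prices)
    average = sum(prices) / count if count else None
    return count, average


# Hypothetical earlier trials: (weather hazard, competing supply, price).
past = [(7, 69, -35), (50, 65, -18), (90, 20, 40)]
print(previous_cases(past, 7, 0))    # similar weather hazard
print(previous_cases(past, 69, 1))   # similar competing supply
```

Each variable is matched separately, as in the footnote, so the two rows of the box can draw on different sets of earlier trials.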

(4.) The score box displays the subject's score on the current trial and the cumulative score through the current trial. Each trial the score is calculated from the continuous price forecast c and the realized price p using the quadratic scoring rule $S(p, c) = A - B(p - c)^2$, with A = 80 and B = 280. Thus the maximum score (for a perfect forecast) is A = 80 points and the minimum (for a squared forecast error of 1) is A - B = -200 points. See Friedman and Massaro (1998) for a recent discussion of this scoring rule. The box also displays the "expert" score of a forecaster with nothing left to learn, that is, the score earned by forecasting $a_1 x_{1,t} + a_2 x_{2,t}$ in trial t, using objective values of $a_i$. Subjects, of course, do not observe the expert forecast, just the expert score.
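A minimal sketch of the quadratic scoring rule, using the constants given in this footnote. The function name is hypothetical, and the sketch assumes that price and forecast are expressed on a scale where the forecast error is at most 1 in absolute value, since the footnote does not state the units.

```python
A = 80.0   # score for a perfect forecast
B = 280.0  # penalty per unit of squared forecast error


def quadratic_score(price, forecast):
    """Score S(p, c) = A - B * (p - c)**2 for realized price p and forecast c."""
    return A - B * (price - forecast) ** 2


# Hypothetical trial with realized price 0.7 on the assumed unit scale.
for forecast in (0.7, 0.6, 0.0):
    print(forecast, quadratic_score(0.7, forecast))
```

The penalty grows with the square of the error, so a small miss costs little while a large miss is penalized heavily, which matches the description of the score given to subjects in the instructions.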

(5.) A referee asked an interesting question: What is the relation between score and market discipline? There is indeed a close relation. As can be seen from the quadratic scoring rule definition, the expected score declines linearly in the forecast error variance. But it is well known that the position size (short or long), and hence expected profit, also declines linearly in forecast error variance for an investor with constant absolute risk aversion.
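The first claim can be verified in one step from the quadratic scoring rule. The symbols $\sigma_e^2$ for the forecast error variance and $b$ for the mean forecast error are introduced here only for illustration.

```latex
\begin{aligned}
E[S(p, c)] &= A - B\,E\big[(p - c)^2\big] \\
           &= A - B\left(\sigma_e^2 + b^2\right),
\qquad \sigma_e^2 = \operatorname{Var}(p - c), \quad b = E[p - c].
\end{aligned}
```

For an unbiased forecaster ($b = 0$) this reduces to $E[S] = A - B\sigma_e^2$, which is linear and decreasing in the forecast error variance.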

(6.) Normal errors are unbounded, making them impractical in our laboratory task. Truncated normal errors are practical but are less convenient for our purposes and fail to eliminate the potential econometric problem.

(7.) None of these subjects appeared in the Paid treatment, but five of the nine appeared in High Noise. Perhaps subjects are more likely to become frustrated in this difficult learning environment.

(8.) Interpreting the High Noise test result is complicated by the fact noted earlier that ideal learners also learn more slowly in the High Noise environment. An eyeball examination of Figure 3 suggests that most High Noise scores lie a bit farther away from the M-S ideal as compared with subjects in the alternative treatments, but this difference is not significant in either sample.

(9.) Note that this redefinition of the objective value uses the available sample information rather than unavailable population information to define the objective value. The redefinition is a bit conservative in that the original (population) definition differs by about 0.013 and would tend to shift the classifications very slightly toward overresponse.

(10.) The most questionable remaining subject is 009 in the No History treatment. He made very erratic choices until late in the session, spent no more time making choices than the screened subjects (about half as long as most remaining subjects), and earned almost as low a score as screened subjects. He was not screened out of the reduced sample because he entered mainly non-default responses, but his motivation is also questionable and his coefficient estimates indicate dramatic underresponse. Indeed, the relevant test would indicate marginally significant overresponse (to the second variable in the No History treatment, p = 0.08) if this subject were screened out of the sample.

(11.) Specifications designed to capture prior beliefs and non-linear responses detected some transient effects in many subjects, but for the most part these effects disappeared by the final trial. Tests allowing a nonzero intercept term a0 reached different conclusions only for the Asymmetric treatment, where the marginally significant overresponse to the less important news disappeared. Responses remained fairly rational even in a treatment featuring a structural break.


* This work is supported by NSF grants SBR 9310347 and SBR 9617917. It benefited from the comments of Jules Leichter, Dominic Massaro, Rachel Croson, Vai-Lam Mui, and especially Arlington Williams, as well as participants at the Economic Science Association and Public Choice Society meetings at Tucson and New Orleans. The exposition benefited considerably from the thoughtful suggestions of two anonymous referees and editor William Nielson.

Kelley: Assistant Professor, Department of Economics, Indiana University Bloomington, Bloomington, IN 47405. Phone 1-812-855-7928, Fax 1-812-855-3736, E-mail hukelley@indiana.edu

Friedman: Professor, Department of Economics, University of California, Santa Cruz, CA 95060. Phone 1-831-459-4981, Fax 1-831-459-5077, E-mail dan@cats.ucsc.edu