The harder the task, the higher the score: findings of a difficulty bias.
Morgan, Hillary N. ; Rotthoff, Kurt W.
Studies have found that going first or last in a sequential order
contest leads to a biased outcome, commonly called order bias (or
primacy and recency). Studies have also found that judges have a
tendency to reward contestants they recognize with additional points,
called reference bias. Controlling for known biases, we test for a new
type of bias we refer to as "difficulty bias," which reveals
that athletes attempting more difficult routines receive higher
execution scores, even when difficulty and execution are judged
separately. Despite some identification challenges, we add to the
literature by finding strong evidence of a difficulty bias in
gymnastics. We also provide generalizations beyond athletics. (JEL L10,
L83, D81, J70, Z1)
I. INTRODUCTION
Judgments are made in many areas of life: job interviews, refereed
journal articles, marketing pitches, oral and written exam grades,
auditions, sporting events, debates, or even stock analyst estimates. In
areas where judges determine the outcome of an event, bias in the
judging process can create problems. Biased judging potentially leads to
questions about efficiency and fairness, particularly if it results in
selecting less than optimal candidates (Page and Page 2010).
Judging and perception biases have been observed in a variety of
situations. Psychologists show that sequential presentation of
information can influence the way the information is processed
(Mussweiler 2003). This idea has been carried over to other fields
including economics (Neilson 1998; Page and Page 2010; Sarafidis 2007)
and marketing (Novemsky and Dhar 2005). Judging bias has been found in
orchestra auditions (Goldin and Rouse 2000) and sequential voting
through the "Idol" series (Page and Page 2010). Bias has also
been found in basketball referees (Price and Wolfers 2010).
We test for bias in the judging of elite gymnastics. In particular,
the gymnastics meet we analyze provides a uniquely suitable dataset: the
order of competition is randomly assigned to a given country, and the
difficulty and execution of a routine are separately judged. (1)
Following previous biases found in the literature, we control for
performance order (primacy and recency) and reference bias. Despite some
unit analysis challenges in our control for reference bias and
identification issues concerning our lack of a perfect control for
athlete ability, we add to the literature by finding strong evidence of
difficulty bias; execution judges show a favorable bias for those
athletes attempting more difficult routines.
Measuring difficulty bias requires data where judgment is delivered
in two parts: difficulty and execution. This can be found in the world
of elite level gymnastics. Elite gymnasts receive scores based on the
difficulty of the task and the execution of this task. One panel of
judges is charged to
evaluate the execution, and only the execution, of the routine, with an
independent panel of judges evaluating the difficulty, and only the
difficulty, of each routine. In other words, execution judges should not
be concerned with the difficulty of the routine, and difficulty judges
should not be influenced by the execution. Because the judges sit on
separate panels, we can determine if the difficulty of the routine
influences the execution score.
Using normalized data, mean zero and standard deviation of one, we
regress execution score on difficulty score, with additional controls.
We find that a participant's overall score is artificially inflated
when that athlete attempts a more difficult routine. Figure 1 shows the
extent of this bias. Increasing one's difficulty by one standard
deviation artificially inflates the execution measure by 0.21 standard
deviations.
Likewise, attempting a less difficult routine, one that is one
standard deviation below the mean, decreases the execution score by 0.45
standard deviations.
This finding has major implications for the ability of judges to
accurately rank individuals. In situations where judgment is passed on a
given performance, participants may choose to execute a more difficult
gymnastics routine, play a more complex piece of music at an audition,
tackle a more challenging research topic when applying for a grant, or
even use impracticable statistical approaches to impress a referee; all
with the knowledge that the difficult act in question may influence the
evaluator, resulting in a biased execution score.
The next section provides background on types of judging bias
including order bias, reference bias, and others. We also outline the
ways in which our dataset allows us to distinguish between forms of
bias. Section III discusses an overview of the data and is followed in
Section IV by the methodology, with the limitations of our data. Section
V discusses our addition to the literature, difficulty bias, in detail.
The last section concludes with policy implications.
II. TYPES OF POTENTIAL BIAS IN SEQUENTIAL ORDER EVENTS
The psychology literature looks at judgment bias in sequential
order events, finding two key effects: a primacy effect and a recency
effect. If primacy exists, the first person or people to perform are
judged more accurately. If judges better remember late effects, a
recency effect results. Gershberg and Shimamura (1994) and Burgess and
Hitch (1999) conclude that in a sequential order contest it would be
best to go either first or last, but not in the middle. The economics
literature takes a different view on this idea. In situations where the
scores of each contestant are finalized before the next contestant
competes, as is the situation with our data, findings of an overall
order bias are more common.
The overall order bias impacts a contestant's relative ranking
depending on when in the event they compete. For example, Wilson (1977)
finds evidence that the order of appearance in synchronized swimming
influences the outcome. Analyzing artists that compete in "The
Queen Elizabeth musical competition," Flores and Ginsburgh (1996)
find the day an artist competes impacts that artist's final
standing. Bruine de Bruin (2005) studies both the "Eurovision"
song contest as well as figure skating, finding those that perform later
receive more favorable evaluations in both venues. Page and Page (2010)
also find an overall order bias in the "Idol" song contest.
Damisch, Mussweiler, and Plessner (2006) find a sequential order
bias, where one person's performance impacts the subsequent
performer, in the 2004 Olympic Games. They find that a gymnast's
score is influenced by the previous performance. However, there is no
evidence of this type of bias in the 2009 World meet, as found in
Rotthoff (2013). We therefore focus on overall order bias.
The psychology literature also presents a "reference
bias" in judgment. People, or judges, may have a tendency to rate a
participant relative to their expectations on that person's
performance (Thibaut and Kelley 1959). In the workplace, raters who are
more familiar with a worker tend to give more positive overall ratings
than those that are not familiar with that individual (Kingstrom and
Mainstone 1985). Tversky and Kahneman (1974) and Kahneman and Tversky
(1996) describe heuristics, or the use of a representative tool, as a
shortcut to process information. Findlay and Ste-Marie (2004) find that
figure skating judges use this representative tool, in the form of
athlete reputation, to judge a given athlete's performance, biasing
the known athletes' scores upward.
In addition to order and reference biases, evidence of racial,
gender, and nationalistic judgment biases has been found. For
example, Glejser and Fleyndels (2001) confirm the order bias results
from Flores and Ginsburgh (1996) concerning music competition and
further find that women obtain lower scores in piano while contestants
from the Soviet Union, prior to 1990, receive higher scores. Multiple
other studies find a nationalistic bias in figure skating: Seltzer and
Glass (1991) find a bias based on political loyalties; Sala, Scott, and
Spriggs (2007) find a systematic bias based on the country's status as a
"friend," "rival," or "enemy"; and both
Campbell and Galbraith (1996) and Zitzewitz (2006) find nationalistic
biases. Emerson, Seltzer, and Lin (2009) find strong evidence of a
nationalistic bias in Olympic diving and Segrest Purkiss et al. (2006)
find a negative ethnic bias in the hiring process. Racial bias is found
by Price and Wolfers (2010) in professional basketball refereeing, by
Parsons et al. (2011) in baseball as umpires call strikes, and by
Garicano, Palacios-Huerta, and Prendergast (2005) as referees favor the
home team in soccer (football).
We hypothesize that when reference points are limited, judgment is
made relative to a known element of the given task: difficulty. Given
the judges know what a difficult task is, they present biased scores
when more difficulty exists.
III. DATA
Gymnastics is uniquely able to distinguish the types of bias
described in the previous section. We use data from the 2009 World
Artistic Gymnastic Championships, held in London, England. Unlike the
majority of large international gymnastics meets, this one only offered
individual all-around and individual event competitions for male and
female elite level gymnasts. This meet provides insight into the
described forms of bias because there is no team competition. (2) More
importantly, the meet randomly assigns each country one to three
starting positions, based on the number of spots that country qualifies
for. Each country's governing body then distributes the spots to
their athletes.
Elite gymnastics also recently changed its scoring system, allowing
us to separate the athletes' difficulty of performance from their
execution. The difficulty and execution scores are awarded by separate
panels of judges. The two scores are then added together and, after
taking out any penalties, the final mark is awarded. Scores are given
after each contestant, so each score is finalized before the next
contestant makes their attempt. More detail on scoring is given later in
this section.
A. Gymnastics Basic Rules
In women's gymnastics there are four different events (vault,
uneven bars, beam, and floor) while the men have six events (vault,
floor, pommel horse, rings, high bar, and parallel bars). The structure
of the competition allows for enough recovery time between events, so
the athlete's performance on each event is independent. In the 2009
World competitions, each country could bring up to three athletes to
compete in each event, but many athletes competed in multiple events at
the meet. This is not unusual. Top talent is often good at multiple
events and they compete for the all-around title, where their additive
score for all individual events determines the winner. On the basis of
their performance in the preliminary round, athletes can make finals in
each individual event as well as for the all-around competition.
Most international competitions have a team competition built into
each meet. Teams often strategically place their athletes to maximize
the team score, which traditionally means ordering the athletes from the
lowest expected score to the highest. This meet does not have this team
aspect.
For each of the 10 events, we observe between 106 and 134
performances; the number varies based upon the number of athletes
attempting to make the finals in either the all-around or on a given
event. Each event has a preliminary and final session, usually spaced a
couple of days apart. The finals are structured in a traditional
gymnastics way, with the lowest scoring person going first. The goal in
the preliminary round is to get the best spot in the finals competition
and it is commonly known in the sport that the last spot is best. This
aligns the incentives of the athletes; each athlete wants to perform
their best in prelims in order to have the best position in the finals
competition. For this reason we use only preliminary scoring data.
Because the athletes' goal in this round is always score maximization,
the use of preliminary data does not bias the sample.
B. Gymnastics Scoring
In 2006, the gymnastics governing body, the FIG (Federation
Internationale de Gymnastique), completely overhauled the scoring system
for elite level gymnastics. This change came after an apparent judging
controversy in the 2004 Athens Olympics. Scores are now divided into two
parts: difficulty and execution, which sets this dataset apart from
Damisch, Mussweiler, and Plessner (2006). The system now separates out
the "D" score, which is designed to exclusively measure the
difficulty, and the "E" score, which is designed to
exclusively measure the execution score.
The difficulty score evaluates the content of the routine. Judges
award points on three basic parts: the difficulty value of the routine,
the demonstration of a fixed set of required skills, and added points
for connecting certain elements. (3) On vault, the same difficulty score
is awarded to every athlete who performs the same vault, as determined
by the gymnastics Code of Points. On all other events, a panel of judges
evaluates the difficulty score while the athlete performs. They then
compare the score among themselves and post it. The difficulty score is
theoretically infinite and is determined by the athlete because they
decide what level routine to do, meaning it is exogenous to the judges.
The execution score evaluates how perfectly the athlete performs on
that event. This score has a maximum value, and a starting value, of a
10.0 and salvages the part of the scoring system that made Nadia
Comaneci a household name. From the beginning of each routine, the judge
takes away points for errors in form, execution, technique, artistry,
and routine composition. The execution score is determined solely by the
judges on the execution panel and will capture any bias in the judging
process, if it exists.
The difficulty and execution scores are awarded by completely
separate panels of judges. With the exception of vault, where the
difficulty to be attempted is posted before the gymnast performs, the
difficulty and execution scores are evaluated simultaneously and
directly after the gymnast completes his or her routine. (4) The two
scores are then added together, and after taking out any penalties
(primarily given for athletes stepping out of bounds) the final mark is
awarded. Scores are posted after each contestant, meaning each score is
finalized before the next contestant makes their attempt. The average
and standard deviation of scores for the 2009 World Gymnastics
Championships are shown in Table 1 (women) and Table 2 (men). (5)
C. Normalization
Because there is only one athlete who goes first and one who goes
last on each event over the entire day of preliminary competition, we
aggregate each of the 10 events together and use the overall order of
each event. Aggregation allows more observations and increases the
validity of the estimates. However, because the mean and standard
deviations are different on each event, we first normalize all
men's and women's events to have a mean zero and a standard
deviation of one, then aggregate the data together. The summary
statistics for all aggregated events are in Table 3.
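The normalization step can be illustrated with a short sketch. It assumes a list of records with hypothetical field names ("event", "execution") and uses the population standard deviation; the text does not state which estimator was used, and the scores below are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_by_event(records, key):
    """Add a within-event z-score of `key` to each record (mean 0, SD 1 per event)."""
    by_event = {}
    for rec in records:
        by_event.setdefault(rec["event"], []).append(rec[key])
    # Population SD is an assumption; the text does not say which estimator was used.
    stats = {ev: (mean(vals), pstdev(vals)) for ev, vals in by_event.items()}
    normalized = []
    for rec in records:
        mu, sd = stats[rec["event"]]
        normalized.append({**rec, key + "_z": (rec[key] - mu) / sd})
    return normalized

# Made-up execution scores for two events, for illustration only.
sample = [
    {"event": "vault", "execution": 8.0},
    {"event": "vault", "execution": 8.5},
    {"event": "vault", "execution": 9.0},
    {"event": "beam", "execution": 7.0},
    {"event": "beam", "execution": 8.0},
    {"event": "beam", "execution": 9.0},
]

# After normalizing within each event, the records can be pooled across events.
pooled = normalize_by_event(sample, "execution")
```

Normalizing before pooling keeps an event with a wide score spread from dominating the aggregated sample.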
D. Performance Order
As previously mentioned, each country is randomly assigned a
competition spot, which is then given to a gymnast. For example, one of
the American spots was subdivision 5, starting on vault, in the fifth
position. The women had five potential subdivisions during the day and
the men had three. Within each subdivision the athletes started on
different events: four options for the women and six options for the
men. Finally, because only one gymnast performs on the event at a time,
the individual performance order was determined. Therefore, in our data
athletes are assigned to a competition order on three different levels:
(1) the session, or subdivision, in which they compete, (2) the event
on which they start, or their rotation, and (3) the order in which they
appear in their given event rotation (displayed in Figure 2). Judges
therefore have the opportunity to measure an athlete's performance
relative to the other athletes based on the overall performance order
during the entire competition, the order in which they appear in a given
session, and at the smallest level, the order in which they appear in a
given rotation. Throughout this study we use the overall performance
order as the main control for order bias.
Given the previous findings, we control for the order each athlete
appears in the competition and extend the literature by investigating
difficulty bias. Performance quality is determined by two factors: the
difficulty of the task at hand and the execution of that task. If judges
are charged to evaluate the execution of a performance separate from the
task's difficulty, we can determine whether task difficulty
influences the execution score.
A difficulty bias exists when a participant's overall score is
artificially high, or low, because of the level of difficulty attempted.
This is the primary focus of this study. Discovery of a difficulty bias
in a judged event can change the optimal strategy for the participant
and may lead an organizer to alter the judging process to account for,
or at least test for, this bias.
E. Reputation
Superstar athletes are generally known in the world of gymnastics,
which could create a scenario in which their reputation, or a reference
bias, influences the final scores. Given previous evidence of this type
of bias (Findlay and Ste-Marie 2004; Kahneman and Tversky 1996;
Kingstrom and Mainstone 1985; Thibaut and Kelley 1959; Tversky and
Kahneman 1974), we attempt to reduce it by controlling for athletes who
come from countries that have a reputation for producing superstars, as
a proxy for reference bias. The limitations of this control are
discussed in the methods section.
We define our reference proxy as those superstar countries that
have won at least three medals, in the particular event of interest, in
the top level competitions over the previous 9 years. This includes
three Olympics (2000, 2004, and 2008) and six World competitions
(2001-2003 and 2005-2007). Superstar countries are shown in Tables 4 and
5.
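As a minimal sketch of how such a proxy could be coded, the function below flags a country as a superstar on an event when it has at least three medals on that event in the look-back window. The record layout and the country codes "AAA" and "BBB" are placeholders, not data from the study.

```python
from collections import Counter

def superstar_countries(medals, event, min_medals=3):
    """Countries with at least `min_medals` medals on `event` in the look-back window."""
    counts = Counter(country for ev, country in medals if ev == event)
    return {country for country, n in counts.items() if n >= min_medals}

# Placeholder (event, country) medal records pooled over the nine prior competitions.
medals = [
    ("vault", "AAA"), ("vault", "AAA"), ("vault", "AAA"), ("vault", "BBB"),
    ("beam", "BBB"), ("beam", "BBB"), ("beam", "BBB"), ("beam", "AAA"),
]

vault_stars = superstar_countries(medals, "vault")
# Reputation dummy for athletes competing on vault: 1 for superstar countries, else 0.
reputation = {country: int(country in vault_stars) for country in ("AAA", "BBB")}
```

Because the threshold is applied event by event, a country can count as a superstar on one apparatus and not on another.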
F. Country Influence
Competitions with athletes from many countries also have judges
from many countries. Each event has a panel of judges designed to have a
diverse set of countries represented; those judges score the same event
for the whole competition. It is feared that these judges may show
favoritism to athletes from their home country, resulting in a biased
execution score (Zitzewitz 2006). Using data from GymnasticsResults.com,
we observe the country of each judge on each execution panel. (6) We
create a dummy variable controlling for whether the athlete and a judge
on the execution panel in the event in which they are competing come
from the same country, called Same Judge (judges' countries are
presented in Tables 6 and 7). Because the judges' countries are
known, we do not have to worry about an anonymity bias (Zitzewitz 2010).
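The Same Judge indicator can be sketched as follows, assuming a hypothetical mapping from each event to the set of countries represented on its execution panel. The country codes are placeholders.

```python
def same_judge(athlete_country, event, panels):
    """Return 1 if a judge on the event's execution panel shares the athlete's country."""
    return int(athlete_country in panels[event])

# Placeholder panels: event -> countries of the execution judges on that event.
panels = {
    "vault": {"AAA", "CCC", "DDD"},
    "beam": {"BBB", "CCC"},
}

flag_vault = same_judge("AAA", "vault", panels)  # athlete and a vault judge share a country
flag_beam = same_judge("AAA", "beam", panels)    # no beam judge from the athlete's country
```

The dummy varies by event for the same athlete, because each panel scores a single event for the whole competition.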
IV. METHODOLOGY
In order to obtain an accurate measure of a judge's bias, it
is necessary to separately observe two different sections of the overall
score. These include the difficulty of the task at hand and the
execution of the said task.
(1) Score = f (Difficulty, Execution).
Therefore, the total score a gymnast receives, T, is the sum of the
execution score (E) and the difficulty score (D), subtracting out any
penalties (P):
(2) T = E + D - P.
The difficulty score is a choice variable for the gymnast, and the
execution score can be thought of as
(3) E = f (O, R, J, A, D)
where the execution score is potentially a function of performance
order (O), reputation (R), country of judge (J), ability (A), and
difficulty (D). It is possible that skilled judges provide a
"bonus" in the execution score when people attempt more
difficult tasks. Because judges know that these tasks are more
difficult, they are potentially more lenient on the execution score,
even when these scores should remain independent. If this is the case,
that execution scores are positively correlated with difficulty, then
evidence of difficulty bias exists.
Using the two different judging panels we are able to measure any
impact of a difficulty bias. In order to accurately measure this bias,
we control for known biases in the data: order bias (as shown in Bruine
de Bruin 2005; Flores and Ginsburgh 1996; Page and Page 2010), reference
bias (as seen in Findlay and Ste-Marie 2004; Kingstrom and Mainstone
1985; Thibaut and Kelley 1959), and a same-country bias (Zitzewitz 2006,
2010). As a proxy for order bias, we include the overall performance
order (O) as a measure of a given athlete's relative place in the
competition and also an overall order squared term to allow for a
nonlinear relationship. To determine if there are a few highly talented
individuals driving the results, we control for reputation (R) as a
proxy for reference bias. The last control captures whether a judge from a country
gives athletes from their own country better scores (J). The E vector
controls for event specific effects. (7) We also include country level
fixed effects, C, and estimate the following for each athlete, i,
aggregating all events, for both men and women, together:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4)
To capture whether difficulty bias exists in the judge's
decision, we add a control for the difficulty score and, to control for
any non-linearities, we include a squared difficulty score. A
significant coefficient on D, difficulty score, reveals that there is a
difficulty bias in the judge's decision.
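The expression for Equation (4) is not reproduced in this text. Based only on the variables described in the surrounding paragraphs, one plausible form is sketched below. The coefficient labels, the subscript e for events, and the symbol \lambda_e for the event effects (which the text calls the E vector) are assumptions chosen to avoid a clash with the execution score E; the published specification may differ.

```latex
E_{i} = \beta_0 + \beta_1 O_{i} + \beta_2 O_{i}^{2} + \beta_3 R_{i} + \beta_4 J_{i}
      + \beta_5 D_{i} + \beta_6 D_{i}^{2} + \lambda_{e} + C_{i} + \varepsilon_{i}
```

Here E_i and D_i are the normalized execution and difficulty scores, O_i is overall performance order, R_i is the reputation proxy, J_i is the Same Judge dummy, and C_i is the country fixed effect. Under this form, difficulty bias corresponds to \beta_5 (and \beta_6) differing from zero.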
Recall that the difficulty and execution score, by rule, are
determined by two separate panels of judges. The difficulty section
scores the person for the quality of the routine, measured by how
intricate and difficult the attempted skills are. The execution score is
designed to measure only the execution of routine, capturing the perfect
ten aspects that so many fans are familiar with. If difficulty and
execution scores are positively related, while controlling for the
covariates described above, it will reveal biased judging, which we
define as difficulty bias. (8)
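To make the estimation idea concrete, the following is a minimal least-squares sketch on synthetic, noise-free data. It is not the authors' code or data: the coefficient values in `true_b` are arbitrary, event and country fixed effects are omitted, and the small `ols` helper is written out only to keep the example self-contained.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting, then Gauss-Jordan elimination on this column.
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

rng = random.Random(0)
# Arbitrary illustrative coefficients: const, O, O^2, R, J, D, D^2.
true_b = [0.0, 0.2, -0.05, 0.3, 0.1, 0.4, -0.1]
X, y = [], []
for _ in range(200):
    o, d = rng.gauss(0, 1), rng.gauss(0, 1)   # normalized order and difficulty
    r = 1.0 if rng.random() < 0.3 else 0.0    # reputation dummy
    j = 1.0 if rng.random() < 0.2 else 0.0    # same-judge dummy
    row = [1.0, o, o * o, r, j, d, d * d]
    X.append(row)
    # Noise-free outcome, so the fit should recover true_b up to rounding error.
    y.append(sum(b * v for b, v in zip(true_b, row)))

b_hat = ols(X, y)
```

In this setup the coefficients on D and D squared are the ones whose significance would be read as difficulty bias. With real data, standard errors and the fixed effects would also be needed.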
A. Limitations
While our data are well structured for the necessary analysis to
identify difficulty bias, there are still some limitations. First,
because each country's governing body places athletes into their
starting positions, there are potential implications on the measurement
of order bias. When countries are given starting positions, they tend to
place better athletes later in the competition. This causes an upward
bias in the measurement of an order bias. While we control for
performance order (and therefore primacy and recency effects), an ideal
dataset would have overall performance order randomly assigned to each
athlete instead of each country. To our knowledge this does not exist.
However, the semi-random assignment of performance order which we are
able to measure has no direct impact on the measurement of a difficulty
bias, which is our focus.
Second, the threat of omitted variable bias presents a potential
problem with accurately measuring the impact of difficulty on execution,
and therefore the effect of difficulty bias. The concern is that the
estimated effect of difficulty bias reflects further impacts on the
execution score in addition to those from the difficulty score. In order
to minimize these correlations, we control for order, same country, and
reputation as described above. However, reputation may also be
correlated with difficulty. For example, a gymnast with a high
reputation is likely to have had high scores at previous competitions.
High scores in the past are more likely if the gymnast also performed a
high degree of difficulty and received high difficulty scores.
Furthermore, gymnasts tend to choose similar levels of difficulty over
time, creating a positive correlation between reputation and the
current difficulty score. Without controlling for reputation, or reference
bias as described in the psychology literature, the reputation effect
may be picked up in the estimated effect of difficulty on the execution
score.
We control for reputation, a country level superstar effect, but it
is an imperfect measure because the rest of our data are observed at the
individual level. While the gymnastics governing body (FIG) has a world
ranking system based on the previous year's performance, these
rankings are inefficient the year following an Olympics because there is
generally high turnover in elite level gymnasts in the post-Olympic
year. The following year's World competition, like the one used for
this study, is the coming out of the next group of elite level gymnasts
and many who perform at the Olympics take the following year of
competition off or retire altogether. We objectively control for
reputation at the country level as described in the data section. We
also test our specification with a subjective reputation measure,
identifying by hand the "big names" in the sport, and get
similar results. (9) One reason for the similar outcomes may be that
reputation contributes in a lesser role in judging the year after the
Olympics because of the turnover. This eases any concern of a strong
relationship between reputation and difficulty in our estimations.
Furthermore, it solidifies that a post-Olympic non-team competition is
ideal for capturing difficulty bias because there is less concern with
multicollinearity between reputation and difficulty with regard to
reference bias, as well as between performance order and difficulty
with regard to order bias.
Finally, there is also an issue with ability, which is unobserved
but may be correlated with both the execution score and the
difficulty score. Athletes with varying ability choose their own levels
of difficulty, which introduces self-selection concerns within our data.
In an ideal situation, we would randomly assign the gymnasts different
difficulty levels to measure any bias. This would presumably introduce
additional variation into the execution score because some gymnasts may
be asked to perform routines at a difficulty level that does not
coincide with their optimal choice. Unfortunately, this is not
possible in gymnastics but it should be taken into account in other
situations, such as job interview questions, where the difficulty level
is determined by an outside entity.
As a gymnast chooses a difficulty level to maximize their overall
score, their first-order condition would be
(5) $\frac{\partial \mathrm{Execution}}{\partial \mathrm{Difficulty}} + 1 = 0$
where the marginal cost of increasing one's level of
difficulty is $\partial \mathrm{Execution} / \partial \mathrm{Difficulty} < 0$,
and 1 is the marginal benefit of increasing
difficulty. Also, assume that
$\partial^2 \mathrm{Execution} / \partial \mathrm{Difficulty}^2 > 0$ (Figure 3).
Therefore, at the optimum the gymnast equates their marginal benefits
and marginal costs in their choice of difficulty (D).
An athlete's decision on difficulty level is dependent on the
sign of $\partial^2 \mathrm{Execution} / (\partial \mathrm{Difficulty}\, \partial \mathrm{Ability})$.
When this is equal to zero,
ability has no effect on the choice of the difficulty level. Thus, there
is no correlation between ability and difficulty. However, if
$\partial^2 \mathrm{Execution} / (\partial \mathrm{Difficulty}\, \partial \mathrm{Ability}) < 0$,
then those with higher
ability levels have smaller negative impacts of attempting more
difficult routines. In this case, the expected cost of a more difficult
routine is lower for high ability gymnasts and they will choose a higher
difficulty level. Graphically in Figure 4, a higher ability gymnast, H,
will have a lower marginal cost of attempting more difficult routines
relative to a lower ability gymnast, L.
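The link between the cross-partial and the chosen difficulty can be written out by differentiating the first-order condition (5) with respect to ability. This is a sketch, not a derivation from the paper: the shorthand E(D, A) for the execution score, D*(A) for the chosen difficulty, and subscripts for partial derivatives are notation introduced here, and the sign assumptions are taken from the text as given.

```latex
% First-order condition (5), holding at the chosen difficulty D^*(A):
E_D\bigl(D^*(A), A\bigr) + 1 = 0
% Differentiating with respect to ability A:
E_{DD}\,\frac{dD^*}{dA} + E_{DA} = 0
\quad\Longrightarrow\quad
\frac{dD^*}{dA} = -\frac{E_{DA}}{E_{DD}}
```

With E_{DA} = 0 the chosen difficulty does not vary with ability. With E_{DD} > 0 and E_{DA} < 0, as stated in the text, dD*/dA is positive, so higher ability gymnasts choose higher difficulty.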
Overall, the sign of
$\partial^2 \mathrm{Execution} / (\partial \mathrm{Difficulty}\, \partial \mathrm{Ability})$
is critical to our ability to determine the existence of a
difficulty bias. If this sign is negative, a control for ability is
required to accurately estimate a difficulty bias. If the sign is zero,
adding a control for ability does not add to the estimation's
accuracy, but also does not decrease the estimation's accuracy.
Unfortunately a perfect measure of a gymnast's ability does
not exist and we face an identification challenge much like the
researchers attempting to capture student ability with standardized test
scores or stock traders' ability with records of previous returns.
We do our best to include a proxy to capture at least some of a
gymnast's abilities by including country level reputation effects
as described before. It is also likely that a gymnast's difficulty
score captures at least part of the athlete's innate ability as
well. In this study we estimate difficulty bias with and without the
reputation variable and find similar outcomes. (10) We also argue that
difficulty bias goes beyond acting as a proxy for ability. We encourage
future research involving fine tuning the measurement of gymnastics
ability.
V. RESULTS: DIFFICULTY BIAS
To investigate whether a difficulty bias exists in the data we
estimate Equation (4), first without the normalized difficulty score,
then including a difficulty score, and finally including the difficulty
score squared term. Results of these tests can be found in the first
three columns of Table 8. When predicting the execution score, we find
evidence of a timing bias; competing early in the
competition results in statistically lower execution scores. This
supports literature finding an order bias (Bruine de Bruin 2005; Flores
and Ginsburgh 1996; Page and Page 2010). The reference effect is
positive and significant when the difficulty squared term is included;
athletes from top performing countries receive higher execution scores.
We do not, however, find a same judge effect. Finally, as an addition to
the literature, we find a statistically significant and positive
relationship between difficulty and execution scores, revealing a
difficulty bias; as an athlete's difficulty level increases it
artificially inflates their execution score at a decreasing rate. These
results continue to hold when country level fixed effects are added in
the last two columns.
The main result from Table 8, that a difficulty bias is found in
the data, is economically significant as well. To put it in perspective,
consider the vault score of American gymnast Rebecca Bross, who ranked
12th on this event after the preliminary round. Bross scored a 14.250
and her difficulty score, 5.8, was one standard deviation below that of
Un Jong Hung, the Chinese gymnast in first. If Bross had attempted a
one standard deviation more difficult vault, she would have not only
received a 0.706 boost in her difficulty score, but also a 0.194 boost
in her execution score resulting from the difficulty bias. With this
bias, we estimate that a one standard deviation more difficult vault
would have increased her score by 0.9 points, resulting in a 15.15,
enough for second place. If a difficulty bias did not exist, her more
difficult vault would have scored her a 14.956, placing her third. For
Rebecca Bross, a one standard deviation increase in difficulty is equal
to trying the same level of difficulty as the winning athlete. On the
same event, vault, the Canadian gymnast Britney Rogers scored a 14.1,
with a 5.3 on her difficulty score, ranking her 15th. Had she tried a
one standard deviation more difficult vault, she would have placed
third with a score of 15.0 under the difficulty bias. Without the
difficulty bias, she would have scored a 14.806, placing her fourth
and off the podium.
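The arithmetic behind these two counterfactuals can be checked with a short sketch. It uses only the figures quoted in the text (the 0.706 difficulty step and the 0.194 execution boost); the function and variable names are illustrative assumptions, not part of the original analysis.

```python
# Worked check of the vault example, using only figures quoted in the text.
DIFFICULTY_STEP = 0.706   # one SD of women's vault difficulty score (Table 1)
EXECUTION_BIAS = 0.194    # execution boost the text attributes to the bias


def counterfactual_scores(actual_score):
    """Return (score with bias, score without bias) for a one-SD harder vault."""
    without_bias = actual_score + DIFFICULTY_STEP
    with_bias = without_bias + EXECUTION_BIAS
    return round(with_bias, 3), round(without_bias, 3)


bross = counterfactual_scores(14.250)
rogers = counterfactual_scores(14.1)
print(bross)   # (15.15, 14.956)
print(rogers)  # (15.0, 14.806)
```

The placings (second versus third for Bross, third versus fourth for Rogers) depend on the other athletes' scores, which are not reproduced here.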
It is also important to point out that attempting a one standard
deviation less difficult routine has roughly twice the impact of
attempting a one standard deviation more difficult routine. A one
standard deviation increase in difficulty from
the mean artificially increases the execution score by 0.214 standard
deviations, while a one standard deviation decrease in difficulty from
the mean artificially decreases the execution score by 0.453 standard
deviations.
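The 0.214 and 0.453 figures appear to follow from the quadratic specification in column (5) of Table 8. The sketch below evaluates that quadratic at one standard deviation above and below the mean; treating column (5) as the source of the quoted numbers is an inference from the matching values, and the function name is an illustrative assumption.

```python
# Coefficients copied from Table 8, column (5); D is difficulty in SD units.
BETA_D = 0.333762     # normalized difficulty score
BETA_D2 = -0.119603   # squared normalized difficulty score
VAULT_EXEC_SD = 0.904  # SD of women's vault execution score (Table 1)


def execution_shift(d):
    """Predicted change in normalized execution score when D moves from 0 to d."""
    return BETA_D * d + BETA_D2 * d ** 2


up = execution_shift(1.0)     # one SD harder than the mean
down = execution_shift(-1.0)  # one SD easier than the mean
print(round(up, 3), round(down, 3))  # 0.214 -0.453
print(round(abs(down) / up, 2))      # 2.12
print(round(up * VAULT_EXEC_SD, 3))  # 0.194
```

The last line converts the 0.214 standard deviation effect into raw points on women's vault, which matches the 0.194 execution boost used in the Bross and Rogers examples.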
In addition to the difficulty coefficient being strongly
significant, the R² is roughly 10 times higher when the difficulty score
is controlled for than when it is not. This is an interesting result
because when athletes attempt harder skills, it is reasonable to think
that they may not be able to execute as cleanly, ceteris paribus. Given
the magnitude of the coefficient on difficulty, it is reasonable to
think that the coefficient is capturing more than just a difficulty
bias. We likely capture both a difficulty bias and some proxy for
ability. Given this possibility, we further investigate difficulty bias
and how it is related to reference bias in the next section. We also
measure gender differences, judging effects, and event differences.
A. Interacting Difficulty and Reference Bias (Reputation)
It is possible that the known athletes are driving the results. We
test this in two ways in Table 9. First, we add an interaction term
between the normalized difficulty score and reputation, seen in the
first column. This captures the impact that reputation might have on the
effect of the difficulty score. Those athletes coming from a country
with a reputation
of having great gymnasts (superstar countries) receive a positive and
significant difficulty bias, beyond the difficulty bias for
non-superstar athletes. This shows that the difficulty bias exists and
the reference bias magnifies the impact of this difficulty bias for
athletes from historically successful countries. The positive and
significant interaction also provides evidence that the marginal cost of
attempting a more difficult routine is lower for higher ability
gymnasts.
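To make the interaction concrete, the sketch below computes the implied slope of the execution score with respect to difficulty for superstar and non-superstar athletes, using the coefficients in the "Interaction" column of Table 9. The function name and the choice to evaluate at the mean (D = 0) are illustrative assumptions.

```python
# Coefficients copied from Table 9, "Interaction" column.
BETA_D = 0.316729      # normalized difficulty score
BETA_D2 = -0.123701    # squared normalized difficulty score
BETA_D_X_R = 0.453025  # normalized difficulty score x reputation


def marginal_effect(d, superstar):
    """Slope of predicted execution score with respect to difficulty at D = d."""
    slope = BETA_D + 2 * BETA_D2 * d
    if superstar:
        slope += BETA_D_X_R  # extra slope for athletes from superstar countries
    return slope


print(round(marginal_effect(0.0, superstar=False), 3))  # 0.317
print(round(marginal_effect(0.0, superstar=True), 3))   # 0.77
```

At the mean difficulty level, the estimated slope for athletes from superstar countries is more than twice the slope for other athletes.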
The last two columns in Table 9 examine whether a restricted sample
of the top or bottom 10% of execution scores are driving the results.
This could potentially occur with a reference bias (reputation), because
the athletes from historically successful countries are most likely
those already known by the judges. When excluding the top 10% of
execution scores the results on difficulty bias continue to hold.
Although this sample is smaller, we find similar results, which
strengthens our overall findings. This confirms that we are not
identifying a reference bias, but a separate bias toward those
completing more difficult tasks.
It is also possible that the bottom 10% of the execution scores
impact the results. This may occur because there is a limit of three
participants per country on each event, which means a strong gymnastics
nation like the United States may have to keep very talented gymnasts
home, while countries not known as gymnastics powerhouses get to send
athletes to compete. Because of the rule, there are contestants
competing who may not have qualified otherwise. It is possible that
judges award these gymnasts with higher execution scores in an attempt
to level the playing field with the better gymnasts. If this is the
case, restricting the sample by dropping out the lower 10% would change
the overall results. We find that the same judge and reputation effects
are insignificant. The results for difficulty bias continue to be
positive and significant, supporting the idea that judges who see a hard
routine give higher execution scores, even though the two scores should
be independent.
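The sample restriction can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' procedure: the exact cutoff rule is not stated in the text, so the counts produced here need not match the 1,095 and 1,099 observations reported in Table 9. The helper `drop_tail` and the field name `execution` are hypothetical.

```python
import random


def drop_tail(rows, key, fraction=0.10, tail="top"):
    """Return rows with the top or bottom `fraction` (ranked by `key`) removed."""
    ordered = sorted(rows, key=key)
    cut = int(len(ordered) * fraction)
    if cut == 0:
        return ordered
    return ordered[:-cut] if tail == "top" else ordered[cut:]


random.seed(0)
# Synthetic stand-in for the 1,219 normalized execution scores.
sample = [{"execution": random.gauss(0.0, 1.0)} for _ in range(1219)]
no_top = drop_tail(sample, key=lambda r: r["execution"], tail="top")
no_bottom = drop_tail(sample, key=lambda r: r["execution"], tail="bottom")
print(len(no_top), len(no_bottom))  # 1098 1098
```

The regression would then be re-estimated on each restricted sample to see whether the difficulty coefficient survives.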
In addition to the objective measure of reputation used in these
regressions, we have also run all of them with a subjective measure of
individual level women's superstars. We subjectively identified
women superstars by going through the data and flagging the best-known
names on each event. The results hold in this specification but are not
presented for brevity.
B. Gender Differences
To test whether the difficulty bias result is driven by differences
between men's and women's gymnastics, we split the data by gender.
Scores for male gymnasts show no evidence of a timing impact, whereas
the female gymnasts do show a timing bias; it is better for the women to
go later in the competition. However, they both show strong evidence of
a difficulty bias. Increasing the difficulty of a routine leads to a
positive difficulty bias on the execution score, at a decreasing rate.
It is also important to note that separating male and female
athletes is the only model specification that finds a same judge bias;
that is, having a judge on the panel from the country you represent
matters. The
same judge effect is negative and significant for the women, meaning
that a given athlete is worse off when there is a judge from her country
on the panel. For the men the same judge bias is positive and
significant.
C. Judging Effects
It is possible that judges know they will be scrutinized by
governing bodies of sports or researchers looking for bias. As such,
judges may change their judging strategy to benefit the gymnasts they
want, but in a way that is not easily detectable. For example, a judge
may give a slightly higher score to an athlete from their country in the
medal hunt (because it matters more for her/him) and give a slightly
lower score to an athlete not in the hunt (because she/he was not going
to receive a medal anyway). On average, the judge does not give a point
bonus to his or her country, but they have distributed those points
differently than if they had not favored their own country's
athlete. This effect is measured in the interaction of the normalized
difficulty score and same judge.
We do this for all athletes, as well as for male and female
athletes separately. In all specifications we find an insignificant
relationship for the interaction term of normalized difficulty score and
same judge. This shows that when a judge is judging an athlete from
their country they are not trying to hide their bias by favoring
athletes who try harder routines. Male athletes continue to see a
positive bias with a judge from the same country, while the female
athletes have a negative impact from having a judge from the same
country. We continue to find strong evidence of a difficulty bias.
D. Event Differences
It is possible that these results are driven by certain events,
rather than gymnastics on the whole. If this is true, interacting the
normalized difficulty score with each event will reveal this difference.
The vault, which is a quick movement over in seconds, could yield
different results than the floor routine, which lasts for a few minutes.
We find no discernible pattern across events, although all events do
have a positive and significant difficulty bias. (11)
VI. CONCLUSIONS
This study tests forms of judgment bias using data from elite level
gymnastics. In accordance with previous literature, we control for the
order of performance, judges from the same country, and reputation (a
proxy for reference bias), and we find an additional form of judgment
bias: difficulty bias. In gymnastics, athletes choose the difficulty
level they will attempt, introducing an issue of self-selection. We also
face a common identification challenge when considering a gymnast's
innate ability. Despite these challenges, we find that the execution
score, which is supposed to be unrelated to the difficulty score, is
not; athletes who attempt more difficult routines also receive higher
execution scores. This bias is magnified for athletes from well-known
countries, supporting additional findings of a reference bias. The
reverse is also true; those who attempt less difficult routines are
penalized with lower execution scores. These results hold through
multiple robustness tests.
Our findings suggest an incentive misalignment for those who are
being evaluated; difficulty bias may induce people to attempt more
difficult tasks than they would have otherwise. The implications go
beyond the world of elite level gymnastics. For example, researchers may
submit more difficult projects when applying for grants, in hopes of
benefitting from this new form of judgment bias. Furthermore, authors
will rationally respond by including impracticably difficult statistical
methods to impress referees. Musicians may optimize by choosing
difficult pieces of music at auditions to impress evaluators. Employees
could use unnecessarily complex presentations at work to impress a boss
or gain a client.
Evaluators need to be aware of the potential issue as well,
especially in situations where the participant has no say in the
difficulty level. If difficulty is chosen by the judging body, a
difficulty bias stresses the importance of having similar complexity for
all contestants. For example, in a job interview, it is imperative that
the difficulty of questions asked is similar among candidates. Otherwise
a difficulty bias may exist, making it hard to accurately judge a job
candidate's abilities. Political debate moderators should be aware
of the potential effects as well. A candidate may benefit from being
asked a difficult question during an interview or debate, even if he or
she stumbles over the response. This supports Glejser and
Heyndels's (2001) idea: "it means that it is easier for an
expert to compare two artists if they perform the same piece of music
than if they perform different pieces," supporting the use of
musical pieces with the same level of difficulty at an audition. The
applications are truly endless.
This research has provided insight into judging bias, with our most
significant contribution being the measurement of difficulty bias.
When complexity is controlled by those administering the competition, it
is important that difficulty is equal amongst all candidates. When it is
determined by the participant, judges should find a way to truly keep
difficulty and execution separate. If they cannot, participants may
efficiently respond by increasing their overall difficulty level.
We encourage a continued search for structures that eliminate these
biases, as well as continued research on all forms of judgment bias.
REFERENCES
Bruine de Bruin, W. "Save the Last Dance for Me: Unwanted
Serial Position Effects in Jury Evaluations." Acta Psychologica,
118, 2005, 245-60.
Burgess, N., and G. Hitch. "Memory for Serial Order: A Network
Model of the Phonological Loop and Its Timing." Psychological
Review, 106, 1999, 551-81.
Campbell, B., and J. Galbraith. "Nonparametric Tests of the
Unbiasedness of Olympic Figure-Skating Judgments." The
Statistician, 45(4), 1996, 521-26.
Damisch, L., T. Mussweiler, and H. Plessner. "Olympic Medals
as Fruits of Comparison? Assimilation and Contrast in Sequential
Performance Judgments." Journal of Experimental Psychology Applied,
12, 2006, 166.
Emerson, J., W. Seltzer, and D. Lin. "Assessing Judging Bias:
An Example from the 2000 Olympic Games." The American Statistician,
63, 2009, 124-31.
Findlay, L. C., and D. M. Ste-Marie. "A Reputation Bias in
Figure Skating Judging." Journal of Sport and Exercise Psychology,
26, 2004, 154-66.
Flores, Jr., R. G., and V. A. Ginsburgh. "The Queen Elisabeth
Musical Competition: How Fair Is the Final Ranking?" The
Statistician, 45(1), 1996, 97-104.
Garicano, L., I. Palacios-Huerta, and C. Prendergast.
"Favoritism under Social Pressure." Review of Economics and
Statistics, 87, 2005, 208-16.
Gershberg, F., and A. Shimamura. "Serial Position Effects in
Implicit and Explicit Tests of Memory." Journal of Experimental
Psychology: Learning, Memory and Cognition, 20, 1994, 1370-78.
Glejser, H., and B. Heyndels. "Efficiency and Inefficiency in
the Ranking in Competitions: The Case of the Queen Elisabeth Music
Contest." Journal of Cultural Economics, 25, 2001, 109-29.
Goldin, C., and C. Rouse. "Orchestrating Impartiality: The
Impact of 'Blind' Auditions on Female Musicians."
American Economic Review, 90(4), 2000, 715-41.
Kahneman, D., and A. Tversky. "On the Reality of Cognitive
Illusions." Psychological Review, 103, 1996, 582-91.
Kingstrom, P. O., and L. E. Mainstone. "An Investigation of the
Rater-Ratee Acquaintance and Rater Bias." Academy of Management
Journal, 28, 1985, 641-53.
Mussweiler, T. "Comparison Processes in Social Judgments:
Mechanisms and Consequences." Psychological Review, 110(3), 2003,
472-89.
Neilson, W. "Reference Wealth Effects in Sequential
Choice." Journal of Risk and Uncertainty, 17, 1998, 27-48.
Novemsky, N., and R. Dhar. "Goal Fulfillment and Goal Targets
in Sequential Choice." Journal of Consumer Research, 32, 2005,
396-404.
Page, L., and K. Page. "Last Shall Be First: A Field Study of
Biases in Sequential Performance Evaluation on the Idol Series."
Journal of Economic Behavior & Organization, 73, 2010, 186-98.
Parsons, C. A., J. Sulaeman, M. C. Yates, and D. S. Hamermesh.
"Strike Three: Discrimination, Incentives, and Evaluation."
American Economic Review, 101(4), 2011, 1410-35.
Price, J., and J. Wolfers. "Racial Discrimination Among NBA
Referees." Quarterly Journal of Economics, 125(4), 2010, 1859-87.
Rotthoff, K. W. "(Not Finding a) Sequential Order Bias in
Elite Level Gymnastics." 2013. Accessed June 23, 2013. SSRN:
http://ssrn.com/abstract=2230038 or doi: 10.2139/ssrn.2230038.
Sala, B., J. Scott, and J. Spriggs. "The Cold War on Ice:
Constructivism and the Politics of Olympic Skating Judging."
Perspectives on Politics, 5(1), 2007, 17-29.
Sarafidis, Y. "What Have You Done for Me Lately? Release of
Information and Strategic Manipulation of Memories." The Economic
Journal, 117, 2007, 307-26.
Segrest Purkiss, S., P. Perrewe, T. Gillespie, B. Mayes, and G.
Ferris. "Implicit Sources of Bias in Employment Interview Judgments
and Decisions." Organizational Behavior and Human Decision
Processes, 101, 2006, 152-67.
Seltzer, R., and W. Glass. "International Politics and Judging
in Olympic Skating Events: 1968-1988." Journal of Sports Behavior,
14, 1991, 189-200.
Thibaut, J. W., and H. H. Kelley. The Social Psychology of Groups.
New York: John Wiley & Sons, 1959.
Tversky, A., and D. Kahneman. "Judgment and Uncertainty:
Heuristics and Biases." Science, 185, 1974, 1124-31.
Wilson, V. "Objectivity and Effect of Order of Appearance in
Judging of Synchronized Swimming Meets." Perceptual and Motor
Skills, 44, 1977, 295-98.
Zitzewitz, E. "Nationalism in Winter Sports Judging and Its
Lessons for Organizational Decision Making." Journal of Economics
and Management Strategy, 2006, 67-99.
--. "Does Transparency Really Increase Corruption? Evidence
from the 'Reform' of Figure Skating Judging." Working
Paper. 2010. Accessed January 13, 2014. http://www.dartmouth.edu/~ericz/
transparency.pdf
(1.) Nearly random assignment of athletes in gymnastics is rare,
making this a unique dataset. Separate panels for judging began in 2006.
The event we use is the only elite level meet with numerous countries in
attendance that meets both of these requirements at this point in time.
(2.) In team competitions the coach chooses athlete orders to
optimize the team performance. This behavior removes the random
performance order aspect that is valuable when conducting statistical
analysis.
(3.) An athlete's difficulty score can be increased when two
elements are combined. The combination of elements is considered a more
difficult task than doing them individually.
(4.) Although the vault number from the gymnastics Code of Points
and implicitly the difficulty score for the vault is posted before the
event, the athlete's difficulty rating can change if they complete
a different vault than what has been posted.
(5.) The mean and median are close, showing that any outliers are
not driving the data.
(6.) We do not have this information for the difficulty panel.
(7.) These are set up as dummy variables for each event,
women's vault excluded, and are not reported for brevity. No
important results are found on the coefficients of these controls.
(8.) Because judges have been using the new scoring system since
2006, there has been adequate time to adjust to it. We are therefore not
concerned with biases due to scoring system mistakes.
(9.) These results are not reported in this paper. However, results
can be obtained by contacting the authors.
(10.) These results are available upon request.
(11.) Tables for gender, judging effects, and event differences
have been suppressed for brevity. They are available upon request to the
authors.
HILLARY N. MORGAN and KURT W. ROTTHOFF*
* We would like to thank Angela Dills, Robert Tollison, Sean
Mulholland, Rey Hernandez, Pete Groothuis, Ryan Rodenberg, Jay Emerson,
Sarah Marks, the participants at the American Statistical
Association's annual meetings and referees for helpful comments.
Also a special thanks to the editor, Jeff Borland, for helping us
clarify thoughts throughout the manuscript. Any mistakes are our own.
Morgan: Senior Data Analyst for Admissions and Financial Aid, Drew
University, Madison, NJ 07940. Phone 973-408-3005, Fax 973-408-3188,
E-mail HillaryNMorgan@gmail.com
Rotthoff: Associate Professor of Economics and Finance, Stillman
School of Business, Seton Hall University, South Orange, NJ 07079. Phone
973-761-9102, Fax 973-761-9217, E-mail rotthoff@gmail.com or
Kurt.Rotthoff@shu.edu
TABLE 1
Women's Events
Summary Statistics (Women)
Variable                                Vault         Uneven Bars   Balance Beam  Floor
Participants                            107           113           118           113
Mean difficulty score                   4.94          4.89          4.99          4.92
Standard deviation of difficulty score  0.706         1.194         0.650         0.564
Mean execution score                    8.24          6.91          7.21          7.37
Standard deviation of execution score   0.904         1.517         1.161         0.778
TABLE 2
Men's Events
Summary Statistics (Men)
Variable                                Parallel Bars  High Bar       Rings          Floor          Vault          Pommel Horse
Participants                            127            127            126            134            122            132
Mean difficulty score                   5.31           5.31           5.43           5.51           5.31           5.14
Standard deviation of difficulty score  0.88           1.00           0.91           0.79           0.88           0.90
Mean execution score                    8.07           7.80           7.94           8.16           8.07           7.68
Standard deviation of execution score   0.78           0.85           0.66           0.96           0.78           1.17
TABLE 3
Normalized Data for All Events
Variable                      Observations  Mean        Standard Deviation  Min         Max
Order                         1,219         63.40689    36.59816            1           135
Order-squared                 1,219         5,358.76    4,849.31            1           18,225
Normalized difficulty score   1,219         0.000395    0.996615            -7.009      2.208469
Normalized execution score    1,219         8.14E-05    0.996706            -9.11125    1.75576
Reputation                    1,219         0.053322    0.224768            0           1
Same judge                    1,219         0.101723    0.302407            0           1
Male                          1,219         0.630845    0.482774            0           1
TABLE 4
Superstar Countries for Women's Events
Superstar Countries (Women)
Vault Uneven Bars Balance Beam Floor
USA USA USA USA
Russia Russia Russia Romania
China China Romania
Germany China
TABLE 5
Superstar Countries for Men's Events
Superstar Countries (Men)
Parallel High Pommel
Bars Bar Rings Floor Vault Horse
China Germany China
S. Korea Slovakia Bulgaria Romania Romania Romania
Italy Poland Japan
TABLE 6
Country of the Execution Judges, by Event
Country of Execution Judges (Women)
Vault           Uneven Bars     Balance Beam    Floor
Mexico          North Korea     India           Slovenia
Bulgaria        Egypt           Ireland         Germany
South Korea     Norway          Portugal        Venezuela
Italy           Canada          Argentina       Lithuania
Romania         Brazil          France          China
Ukraine         Germany         Israel          Russia
TABLE 7
Country of the Judges, by Event
Country of Execution Judges (Men)
Parallel Bars    High Bar         Rings            Floor            Vault            Pommel Horse
The Netherlands  Algeria          Bulgaria         Japan            Mexico           Slovenia
South Korea      Portugal         France           Venezuela        New Zealand      Russia
Lithuania        Austria          Germany          Luxemburg        Belarus          Portugal
Argentina        Ukraine          Qatar            Romania          Germany          Brazil
Czech Republic   Hungary          Jordan           Egypt            Canada           North Korea
Poland           Great Britain    South Africa     Italy            Israel           Denmark
TABLE 8
Estimating Execution Score
Execution Score
                                          (1)             (2)             (3)             (4)             (5)
O (order)                                 0.008110 ***    0.008104 ***    0.008400 ***    0.007821 ***    0.008965 ***
                                          (0.003)         (0.003)         (0.002)         (0.003)         (0.002)
O² (order squared)                        -0.000043 *     -0.000060 ***   -0.000057 ***   -0.000052 ***   -0.000054 ***
                                          (0.000)         (0.000)         (0.000)         (0.000)         (0.000)
R (reputation)                            0.730150 ***    0.046359        0.346858 ***    0.028198        0.236493 **
                                          (0.126)         (0.108)         (0.104)         (0.119)         (0.113)
J (same judge)                            0.021926        -0.036960       -0.012211       -0.008326       0.001308
                                          (0.093)         (0.077)         (0.072)         (0.076)         (0.072)
D (normalized difficulty score)                           0.576618 ***    0.375128 ***    0.584243 ***    0.333762 ***
                                                          (0.025)         (0.028)         (0.028)         (0.033)
D² (normalized difficulty score squared)                                  -0.121518 ***                   -0.119603 ***
                                                                          (0.009)                         (0.010)
Constant                                  -0.348287 ***   -0.206036 **    -0.144924       -0.458479 ***   -0.279881 **
                                          (0.126)         (0.105)         (0.098)         (0.115)         (0.109)
Event FE                                  Yes             Yes             Yes             Yes             Yes
Country FE                                No              No              No              Yes             Yes
Observations                              1,219           1,219           1,219           1,219           1,219
R²                                        0.039           0.340           0.424           0.430           0.499
Note: Standard errors in parentheses.
*** p < 0.01; ** p < 0.05; * p < 0.1.
TABLE 9
Interaction Terms and Restricted Samples
Execution Score: Testing Reputation
                                          Interaction               Excluding the Top 10%     Excluding the Bottom 10%
O (order)                                 0.008949 ***              0.007437 ***              0.003780 **
                                          (0.002)                   (0.002)                   (0.002)
O² (order squared)                        -0.000053 ***             -0.000042 **              -0.000028 **
                                          (0.000)                   (0.000)                   (0.000)
R (reputation)                            -0.251925                 0.077316                  0.102997
                                          (0.250)                   (0.135)                   (0.082)
J (same judge)                            -0.003502                 -0.062450                 0.083150
                                          (0.072)                   (0.076)                   (0.056)
D (normalized difficulty score)           0.316729 ***              0.237762 ***              0.213827 ***
                                          (0.034)                   (0.035)                   (0.026)
D² (normalized difficulty score squared)  -0.123701 ***             -0.138530 ***             0.031974 *
                                          (0.010)                   (0.010)                   (0.017)
Normalized difficulty score x reputation  0.453025 **
                                          (0.207)
Constant                                  -0.277673 **              -0.257633 **              -0.054104
                                          (0.109)                   (0.109)                   (0.083)
Event FE                                  Yes                       Yes                       Yes
Country FE                                Yes                       Yes                       Yes
Observations                              1,219                     1,095                     1,099
R²                                        0.501                     0.505                     0.267
Note: Standard errors in parentheses.
*** p < 0.01; ** p < 0.05; * p < 0.1.
COPYRIGHT 2014 Western Economic Association International
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2014 Gale, Cengage Learning. All rights reserved.