The accuracy, agreement and coherence of decision-making in rugby union officials.
Mascarenhas, Duncan R.D.; Collins, Dave; Mortimer, Patrick, et al.
Inaccurate decision-making by game officials can change the course
of a game, and may lead to significant financial implications for the
clubs and hence alter the course of a player's career (Craven,
1999). In many professional team sports, referees have to consider
numerous sources of information, make rapid decisions, and contend with
commentators who scrutinize their accuracy from slow-motion replays,
often culled from several different camera angles. Yet, comparatively
little is taking place to improve the performance of officials (Ford,
Gallagher, Lacy, Bridwell, & Goodwin, 1999; see Garcia, 2003; cf.,
Ste-Marie, 2003) despite initiatives such as the British World Class
Performance Plan, the United States Olympic Centre and the Australian
Institute of Sport seeking to enhance the performance of competitors
(Eady, 1999). Furthermore, England's rugby union premier league
coaches want a lot more consistency in the application of the laws
(Melrose, 1998). Consequently, more research is needed to develop
accurate, objective and reliable measurement systems for assessing
referee performance (Sloan, 2004) and subsequently for deploying
psychologically based training methods to enhance such performance (see
Oudejans, Verheijen, Bakker, Gerrits, Steinbrückner, & Beek, 2000).
However, an essential precursor to such enhancement is the
identification of those factors that determine performance in this
particular sphere.
Research that has focused on referee performance found that rugby
union and basketball officials believed "demonstrating a mastery of
the rules" to be the most important aspect of referee performance
(Anshel, 1995; Anshel & Webb, 1991).
Such mastery of the rules, or laws in the case of rugby union,
demands rapid decision-making, requiring referees to evaluate the
important characteristics of an event and present an appropriate
solution in about 1 second (Jones, Paull, & Erskine, 2002), without
the opportunity for reassessment or contemplation on the implications of
their decision. Referees have to respond quickly to dynamically
unfolding events, which may hold many uncertainties and ambiguities, and
often in response to input from touch-judges (the officials responsible
for controlling the sidelines in rugby union who have a microphone link
to the referee).
Thus, investigating referee performance through a Naturalistic
Decision Making (NDM) perspective, defined as the study of experts
making decisions in complex environments under time pressure where
incomplete and ambiguous information is inherent, seems appropriate as a
basis for psychological intervention (Cannon-Bowers, Salas, &
Pruitt, 1996; Orasanu & Connolly, 1993). Prior to the NDM paradigm,
classic DM strategies often prescribed a rational choice method where
decision-makers would be asked to deliberate amongst a range of
alternatives. More recently however, researchers recognized that time
pressures preclude this strategy in complex situations and have become
more intent on assessing and developing the expert's strategies to
improve situation awareness skills and declarative knowledge (in this
case of both the laws and their application) in a realistic environment
that more closely resembles the real world (Klein, 1997; for a review
see Yates, 2001).
NDM investigations have studied the efficacy of DM assessment and
training methods (e.g., Stout, Cannon-Bowers, & Salas, 1996)
concluding that they need to be of sufficient functional quality to test
the experienced decision-maker's ability (Alessi, 1988; Klein,
1997a). In this context, video and audio presentations provide a
suitable format for assessing perceptual skills (Abernethy, 1996;
Williams & Grant, 1999) and DM (Cannon-Bowers & Bell, 1997).
Similarly, Omodei, McLennan and Whitford (1998) suggest that
'own-point-of-view' video recordings can provide the best
representation of the complexity and dynamics of naturalistic environments and in addition allow the selection of pertinent events
from a wide variety of data. Unfortunately, despite the evidence
supporting video as a medium through which to assess and train DM, there
is no empirical research that has examined suitable criteria to measure
relative success in referee DM performance.
Measuring DM Performance--Accuracy, Coherence and Shared Mental Models
The accuracy of law-application, which is substantially based on
the expert's use of knowledge (see Williams & Davids, 1995),
would seem to be the most crucial criterion for success. Thus, although
law clarifications and interpretations may be guided by advice from the
governing body, it is the application of these laws by the senior
referees 'in the field' that sets the standards (Bunting,
1999). Examinations of umpires calling force-out plays in baseball would
seem to support this, as they were found to collectively adopt a
normative rule in their adjudication of 'phantom tags' (Rainey
& Larsen, 1988; Rainey, Larsen, Stephenson, & Olson, 1993).
In addition to the accuracy of law-application, equability of
decisions may also play a part in the referee's evaluation.
Therefore, performance appraisals should also consider the
individual's DM against his or her peers. Examining the level of
agreement by measuring the range between responses may be an indicator
of the extent to which individuals share interpretations and, as such,
represents another important and face valid criterion for measuring
referee performance. Furthermore, measuring the 'coherence' of
decisions by examining the shared understanding of the event by
different officials is also important (Mascarenhas, Collins, &
Mortimer, 2002; Millgram & Thagard, 1996). Such collective
understanding has been measured by investigating the rationale that
individuals use to arrive at their decisions (Abraham & Collins,
1998; Langan-Fox, Code, & Langfield-Smith, 2000). When decisions are
built on a coherent appreciation of an event, teams then have the
ability to perform together more effectively and successfully (Rouse,
Cannon-Bowers & Salas, 1992).
Cannon-Bowers, Salas and Converse (1990) attributed such coherent
performances to shared mental models (SMMs); a concept that serves to
explain faultless performance through implicit interactions between
members of successful teams (see Brehmer, 1972). Therefore, as a
corollary, for rugby union officials these SMMs consist of not only
knowledge of the other team members and their roles, allowing effective
coordination strategies between referee and touch-judges (who control
the sidelines) but also a declarative knowledge base of the task,
its concepts, and the relationship between them (Stout et al.,
1996). Furthermore, as these SMMs underpin coherent performance by
providing similarly organised expectations surrounding the task (Rouse
& Morris, 1986), the development of SMMs can be used as a basis for
understanding and enhancing both dependent and independent team DM in
real-life settings (see MacMahon & Ste-Marie, 1999; Stout et al.,
1996). As such the SMMs of all those involved in the officiating
process, the referees, touch-judges, their coaches and assessors, are of
interest when exploring the efficacy of rugby union refereeing.
Assessing such DM performance and SMMs in rugby union may be best
served by examining the tackle (law 15) as it regularly creates the most
controversy and is thought by many to be one of the most complex events
to referee in all team sports (Ackford, 2003; Bunting, 1999). Previous
referee researchers have examined the 'matter of fact' offside decision in soccer, asking whether a player was offside or not (e.g.,
Oudejans et al., 2000), and also 'matter of opinion' decisions
that often involve just two players, asking if anyone has committed a
foul and if so, whom (e.g., Plessner & Betsch, 2002; Jones et al.,
2002). However, refereeing the tackle in rugby union presents a unique
situation where multiple, complex and dynamic decisions are required, as
there are timing elements, overlapping elements, interactive elements
and often multiple players involved in the action (see Ackford, 2003).
Thus it is likely that a more extensive declarative knowledge base and
hence a more complex SMM is necessary for rugby union officials (see
MacMahon & Ste-Marie, 1999). Consequently, giving NDM such a
rigorous challenge should test the robustness of our methods and hence
provide implications that will assist DM in other open team sports.
Therefore, the primary aim of this study was to measure the DM
accuracy, agreement and coherence of England's best RFU referees,
their assessors, coaches and touch-judges. Specifically, we were
interested in the relationship between the officials' accuracy (as
measured by their ability to reach an agreed standard), their conformity
to each other and the coherence of their reasons underlying their
decisions. Secondly, recognizing the roles played by different officials
in their coherent application of law we were interested in differences
between groups. Finally, we anticipated that the results would highlight
specific applied areas of concern in refereeing the tackle and provide a
preliminary application of NDM theories with a video-based system to
assess the time-pressured DM of expert officials viewing actual
scenarios.
Method
Participants
The participants consisted of 132 male RFU officials who were the
delegates at the RFU referees national conference. They included 45 of
the top 65 RFU referees, 27 referee assessors, 13 referee coaches, and
47 of the top 120 touch-judges. This sample represents 132 of the 239
individuals responsible for either officiating, or developing officials
in England's top five rugby union divisions. The referees, ranging
in age from 27 to 51 years (M = 38.6 yr.; SD = 5.6 yr.) had refereed on
the English National Panel from 1 to 16 years. Based on their national
rankings (1-65) made by a group of referee development officers in May
1998 from the periodical evaluations of 37 advisors, the referees were
already sub-divided into 1 of 3 groups; a top-20 group, who were
responsible for refereeing in the premier league (level 1; n = 14); a
mid-panel group ranked from 21-40, responsible for national league level
2 and 3 games (n = 8); and a lower-panel group ranked from 41-65 who
officiated at levels 4 and 5 (n = 23).
Instruments
In order to prepare a test instrument, incidents were selected from
actual premier league rugby union games, recorded with professional
video equipment (Betacam-SP). Each scenario was filmed in close-up from
a raised gantry, positioned at the halfway line. Only incidents
occurring in the middle of the pitch (<20° of arc) were examined for
inclusion in the study. This provided a view looking down over the
incident, similar to the angle that the match day referee might
experience (cf., McLennan & Omodei, 1996).
Further steps were taken to ensure the ecological validity of the
test items. From an original tape of 130 tackle incidents compiled from
60 hours of premier league play, an independent expert panel consisting
of elite referees (n = 4), coaches (n = 2) and players (n = 2) examined
the clips. This group independently graded each tackle on the difficulty
of the decision on a three point scale where 1 = easy, 2 = medium, and 3
= hard. In addition, they discarded all the tackles that did not display
sufficient information to make an accurate decision, or those where they
felt the match-day referee's decision would be discernible. The
experts then convened as a group and selected 10 difficult (i.e., grade
3) tackles from those remaining that they regarded as presenting
realistic game scenarios for the accurate application of law 15. It was
anticipated that the use of difficult yet realistic scenarios would
provide information to inform referee DM training in the future.
Finally, these 10 incidents were edited together to provide a test
instrument.
Each edited clip began with a voice-over that introduced the two
teams competing and indicated the team in possession. The tackle
incident was then played with approximately 5 seconds of
'lead-in', in order to orientate the participants to the
scene. After the tackle incident the recording cut to black and the
title "make your decision now" appeared on the screen.
Table 1 Notes:
The mean level of accuracy for all participants across all clips
was 49.6% (K = .25). The mean level of confidence in their decisions for
all participants across all clips was M = 4.0 (SD = 1.0). A significant
kappa statistic indicates better-than-chance agreement (significance
adjusted by the Bonferroni method). Correct decision. Strength of
agreement (1) as per Landis and Koch (1977). (AS) Approaching
significance. * p < .05.
A response sheet was developed to enable participants to quickly
and easily indicate their decision. This was essential since time
pressure, as opposed to slower, more reflective DM, is a crucial factor
for naturalistic environments (Klein, 1997b). Participants were given a
copy of the response sheet, consisting of a series of boxes in which to
indicate their decision, a space to explain the reasoning behind their
decision, a Likert scale to rate their confidence in the accuracy of
each decision, ranging from 1 (low) to 5 (high), and a section to
comment on the quality of each clip. Content of the response sheet is
included in Table 1.
Pilot Testing
Prior to the participants' assessments, pilot testing was
conducted using a group of individuals familiar with the rugby laws to
verify the qualities of the videotape, suitable viewing positions, the
efficacy of the response sheet, and the typical length of time it would
take to complete it. Based on this pilot work, the following procedure
was developed.
Procedure
For the purposes of viewing the 10 assessment clips, the
participants were randomly divided into four viewing groups of no more
than 35 for data-collection purposes only, each having approximately the
same number of referees, touch-judges, assessors and coaches. The pilot
study and subsequent analysis of the results confirmed this to be large
enough to minimise variability due to procedural differences but small
enough to allow each individual an acceptable view of the screen. They
were then informed that their own personal responses would remain
confidential and that their results would only be presented as grouped
data depending upon their officiating classification. After the
participants familiarised themselves with the response sheet they sat in
the darkened room where they could comfortably see the tackle incidents
projected onto a screen via a standard VHS video recorder and a
data-projector. This presented an image about 8 feet wide and 5 feet
high. The first clip from the videotape was then played and paused
immediately after its completion. Participants were asked to make an
immediate decision by ticking the appropriate box. They were then given
3 minutes to complete the remainder of the response sheet, and were
explicitly told not to change the decision once made. An inspection of
the response sheets and observation of participants suggested that all
conformed to these instructions.
After responding to all 10 clips in the same manner, participants
were asked to compare the quality of information upon which they made
decisions in this study to the quality of information they
'tended' to get as referees on the pitch and write their
explanation on the back of the response sheet. This procedure was
followed consistently for all four data-collection groups.
Data Analysis
Two of the full-time RFU referees, at the time nationally ranked 1
and 2, determined the correct response. Replicating the conditions under
which the participants were asked to respond, they both independently
made their immediate decision on the 10 clips. In cases where these two
referees had initially disagreed upon responses (clips 4 and 9) they
reviewed the videotape, and discussed the clip, before agreeing on the
most appropriate decision. In fact their initial disagreement was only
minor as in both clips they agreed on which team to advantage but
provided inconsistent decisions on the sanction for such infringements.
For example, in clip 4 one expert chose to play on, advantaging the
attacking team who retained possession, and the other chose to award a
penalty to the attacking team. Similarly in clip 9, one expert awarded a
scrum to the defending team while the other awarded a penalty. Finally,
these experts indicated "how many times per game" they
typically had to adjudicate a tackle situation like the one presented in
each clip. The experts' mean frequency ratings (number of
occurrences per game) for all 10 tackles was M = 10.9 (SD = 7.8).
Participants' DM performance was assessed by three measures: (1)
accuracy--the percentage of participants achieving the correct decision,
(2) agreement--the degree of spread of their responses, and (3)
coherence--the similarity of their reasons underpinning decisions. The
kappa statistic of agreement (K) was used to measure the spread of
responses. This offers a ratio of the proportion of times that the
raters agree, against the maximum number of times that agreement was
possible and corrects for chance (see Altman, 1991). Thus, a score of K
= .90 would represent a 'very good' (high) level of agreement,
and K = .10 would represent a 'very poor' (low) level of
agreement, as classified by the system proposed by Landis and Koch
(1977). In addition to these measures, for each clip the
participants' reasons for their decisions were examined to
determine the extent of coherence in their mental models of each event.
Similarly, all three analyses were conducted on a group basis,
consisting of the three subgroups of referees, and the three other
groups, assessors, touch-judges, and referee coaches. Bonferroni
adjustments were applied to control for the experiment-wise chances of a
type-one error.
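The agreement measure described above can be sketched in code. The source does not state which multi-rater form of kappa was computed, so the example below assumes a Fleiss-style kappa over a table of response tallies; the function name `fleiss_kappa` and the tallies are hypothetical and are not the study's data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Chance-corrected agreement for many raters (Fleiss-style kappa).

    counts[i][j] = number of raters who chose response option j on clip i.
    Assumption: every clip is rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every clip needs the same number of raters")

    n_options = len(counts[0])
    # Share of all responses falling in each option (basis of chance agreement).
    option_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_options)
    ]
    # Observed agreement: proportion of agreeing rater pairs, averaged over clips.
    observed = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance alone.
    expected = sum(p * p for p in option_share)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one option is ever used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical tallies: 4 raters, 2 response options, 2 clips.
unanimous = [[4, 0], [0, 4]]  # raters agree on every clip
split = [[2, 2], [2, 2]]      # raters split evenly on every clip
print(round(fleiss_kappa(unanimous), 2))
print(round(fleiss_kappa(split), 2))
```

With these made-up tallies the unanimous case gives a kappa of 1, while the evenly split case gives a negative value (about -0.33), which illustrates how observed agreement can fall below the level expected by chance.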
Results
Accuracy, Agreement, Coherence and Confidence levels for all
Participants
Table 1 provides details of the percentage incidence of responses
made, highlighting the accuracy scores and the kappa statistic of
agreement for each clip. The mean level of accuracy across the 10 clips
for all participants was 49.6% (SD = 28.6%). High levels of accuracy
were achieved for clip 1 (82%), clip 7 (89%) and clip 10 (70%), and
naturally these clips also exposed high levels of agreement (clip 1, K =
.60; clip 7, K = .74; and clip 10, K = .41). In addition, these clips
showed very high coherence in the participants' reasoning for each
decision. In clip 1, 95% of the participants who responded accurately
showed agreement by awarding the penalty for offside with only 5%
penalizing for support players arriving off their feet. In clip 7, 94% of
the accurate participants awarded a penalty to the attacking team for
the defender failing to roll away, and similarly in clip 10, 95% of the
respondents making an accurate decision penalized the ball carrier for
not releasing the ball.
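The accuracy and coherence figures reported above can be illustrated with a short sketch. It assumes that coherence is taken as the share of accurate participants who gave the single most common reason, which matches how the percentages are described here but is not confirmed as the exact scoring procedure; the function name and the sample responses are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def accuracy_and_coherence(
    responses: List[Tuple[str, str]], correct_decision: str
) -> Tuple[float, float]:
    """Return (accuracy %, coherence %) for one clip.

    responses: one (decision, reason) pair per participant.
    accuracy: share of participants whose decision matches correct_decision.
    coherence (assumed definition): among accurate participants, the share
    giving the most common reason.
    """
    if not responses:
        raise ValueError("no responses")
    accurate_reasons = [
        reason for decision, reason in responses if decision == correct_decision
    ]
    accuracy = 100.0 * len(accurate_reasons) / len(responses)
    if not accurate_reasons:
        return accuracy, 0.0
    top_count = Counter(accurate_reasons).most_common(1)[0][1]
    coherence = 100.0 * top_count / len(accurate_reasons)
    return accuracy, coherence


# Hypothetical clip: 10 participants, 8 reach the agreed decision,
# and 7 of those 8 give the same reason.
sample = (
    [("penalty_attack", "not rolling away")] * 7
    + [("penalty_attack", "offside")]
    + [("penalty_defence", "not releasing")] * 2
)
acc, coh = accuracy_and_coherence(sample, "penalty_attack")
print(acc, coh)
```

For this invented sample the function returns 80.0% accuracy and 87.5% coherence.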
Accuracy, Agreement, Coherence and Confidence levels by Group
The mean accuracy scores shown in Table 2 revealed that the top-20
referees were the most accurate (M = 54.3%, SD = 32.9%), although
interestingly the lower-panel group showed greater accuracy (M = 52.4%,
SD = 26.3%) than the mid-panel group (M = 47.1%, SD = 28.4%).
Furthermore, despite poorer performance, this middle group of referees
showed greater confidence levels in their decisions than all other
groups (M = 4.4; SD = 0.7). The referee coaches were the least accurate
(M = 43.0%, SD = 37.3%). In fact, their decisions were less accurate
than the referees in 8 of the 10 clips.
Investigating the prevalence of a SMM by measuring the extent of
shared reasons underpinning decisions revealed perfect coherence when
groups displayed perfect accuracy. For example, in clip 1 the top-20
referees achieved 100% accuracy (see Table 2) and all chose to penalize the defending players for encroaching offside. Similarly, in clip 7 both
the top-20 and lower-panel referees revealed maximum accuracy with 100%
agreement since all the participants awarded a penalty to the attacking
team for the defender's failure to roll away. In addition, across
all 10 clips when officials were accurate, the top-20 showed a
considerably higher level of coherence in their reasons underlying
decisions (M = 93%) when compared to all other groups (mid-panel, M =
86%; lower-panel, M = 82%; touch-judges, M = 80%; assessors, M = 87%;
referee coaches, M = 80%).
The mean accuracy of all the support groups (touch-judges,
assessors and referee coaches) across all 10 clips was M = 47.9% (SD =
28.6%).
Applied Areas of Concern in Refereeing the Tackle
Surprisingly, given the level of officials examined, 2 of the 10
clips revealed extremely low accuracy scores (clip 2, M = 15%; clip 5, M
= 21%). Furthermore, in an additional three clips, participants failed
to achieve 50% accuracy (clip 4, M = 31%; clip 6, M = 49%; and clip 9, M
= 30%). Moreover, when the levels of agreement are considered, clip 5
and clip 9 reveal a negative kappa statistic (clip 5, K = -.04; and clip
9, K = -.01), a result in fact lower than the level that would be
predicted by chance alone. Interestingly, for clip 5 there was no drop
in confidence levels (M = 4.0) across all the participants. In fact,
they were nearly as confident in this decision as they were for the
first clip (M = 4.1) where 82% of them agreed and made an accurate
response.
Further exploration into the coherence of the reasoning
underpinning decisions revealed the greatest discrepancy in clips 2 and
9. In clip 2, where only 15% achieved the correct decision, 68% awarded
this penalty for support players arriving off their feet, 19% for the
tackler not rolling away and 14% for offside (all legitimate rulings
within the RFU laws). Similarly in clip 9 the participants were divided,
with 51% awarding the penalty for not releasing the ball and 49% for the
ball carrier's support arriving off their feet.
From an applied perspective, clip 4, as well as showing relatively
low levels of accuracy (31% awarding a penalty to the attacking team)
also resulted in 48% of the participants awarding a penalty to the
defensive team and 45% awarding possession to the attacking team, either
through awarding a scrum, playing advantage or choosing to play on.
Thus the participants were almost equally split on which team should
benefit from the decision, which would clearly have a profound effect on
the game. Although the two experts had initially disagreed on this clip,
they were in agreement that the attacking team should benefit from the
play.
Similarly in clip 7, while producing high levels of accuracy (M =
89%), 13% of participants believed that the clip contained an offence
worthy of a yellow-card, a procedure used to warn, sanction or send off
a player. Once again the levels of confidence in the accuracy of the
participants' decisions (M = 4.6) did not reflect this DM
discrepancy.
The Fidelity of the Video Recordings and the Naturalistic Paradigm
The participants' feedback suggests that the NDM procedure
used in this investigation was acceptable for all the groups examined.
Only 26 of the 1,320 participant responses (i.e., 132 participants
assessing 10 situations) were reported as holding insufficient
information to make a decision, while the mean confidence level for all
participants across all clips was M = 4.0 (SD = 1.0) out of a maximum of
5.
In terms of the ecological validity of the procedure, only 14% of
the participants believed that the quality of the video and camera angle
needed improving in at least one of the clips, although no consistent
pattern emerged as to which clips needed enhancement. Also, 10%
suggested that more information on the game such as scoreline and
knowledge of previous plays would have made the decision easier. Only 5
of the 132 participants made comments on the influence of the referee on
the screen. However, all the participants felt the test to be a fair
evaluation of referee DM prowess and, most pertinently for the present
investigation, there was no relationship between negative feedback on
the information presented on the screen with the levels of accuracy or
agreement shown.
Discussion
Analyses of all Participants
The primary aim of this investigation was to assess the accuracy,
agreement and coherence of England's best RFU referees,
touch-judges, assessors and referee coaches. The mean levels of accuracy
and agreement revealed poor DM performance. Despite selecting difficult
DM scenarios, all 10 clips were judged by experts to be representative
of actual decisions required on the field of play, which occur on
average 11 times per game. Since these RFU officials averaged only 50%
accuracy, this represents approximately 5 or 6 wrong decisions per game.
Clearly, the ramifications on the game may be significant. Moreover, it
is of even greater concern that the participants' level of
confidence in their decisions rarely decreased, even when their
decisions became more discordant. In other words, although these top
officials made both inaccurate and widely dispersed decisions, they were all
as individuals equally confident in the accuracy of their DM.
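The estimate of 5 or 6 wrong decisions per game follows from the reported figures and can be checked with a short calculation. The sketch assumes that accuracy on the test clips carries over unchanged to match conditions and that each comparable tackle is an independent decision; the function name is hypothetical.

```python
def expected_errors_per_game(tackles_per_game: float, accuracy: float) -> float:
    """Expected number of inaccurate decisions per game.

    Assumption: test-clip accuracy applies equally to every comparable
    tackle in a match.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a proportion between 0 and 1")
    return tackles_per_game * (1.0 - accuracy)


# Figures reported in the text: M = 10.9 comparable tackles per game,
# mean accuracy of 49.6%.
print(round(expected_errors_per_game(10.9, 0.496), 1))
```

This gives roughly 5.5 inaccurate decisions per game, consistent with the range stated above.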
Efficacy of SMMs to Test Officials' DM
As suggested earlier in this paper, shared mental models do appear
to help accurate DM since when a high percentage of officials are
accurate their shared understanding as indicated by the same reasoning
is also high. Equally, when the number of accurate responses is low the
coherence of the reasoning underpinning those decisions is even lower. In addition, since
the top-20 referees indicated greater coherence in the reasons
underpinning their accurate decisions, it seems fair to conclude that
their mental models have more similarities. This supports the ecological
and congruent validity of the methods used.
The critical emphasis in the development of a SMM relies on
understanding the reason for differences in decisions. The simplest
explanation may be that the different decisions are a reflection of the
participants' ability to identify the cues pertinent to making an
informed decision. In fact, as outlined by Mortimer and Collins (1997)
it may be that the individual participant has a particular scaling value
for the pertinent cues, using the terms criteria (the recognition of
relevant cues) and weighting (the relative value of each of the criteria
in reaching the decision). Thus, in the rugby union tackle situation,
one referee may rate the tackler's inability to roll away as the
most important criterion above the ball carrier's decision to hold
on to the ball until support arrives. This would result in awarding a
penalty to the attacking team. However, if another referee weighted the
ball carrier's obligation to immediately pass, place, or release
the ball as more important, then this referee would be more likely to
award a penalty to the defending team. This may explain the poor
coherence levels in clips 2 and 9. Accordingly, applying a hierarchical
weighting scale, where elements of the decision are prioritised, may be
one method of improving DM in such highly time pressured environments
(Annett, 1997; Rasmussen, 1985).
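The criteria-and-weighting account above can be made concrete with a small sketch. It is only an illustration of the idea: the cue names, the numeric weights, and the rule that the highest-weighted observed cue decides the call are all hypothetical assumptions, not a model proposed in the studies cited.

```python
from typing import Dict, Iterable

# Hypothetical mapping: which team benefits if a given cue decides the call.
CUE_BENEFITS: Dict[str, str] = {
    "tackler_not_rolling_away": "attacking",
    "ball_carrier_not_releasing": "defending",
}


def decide(observed_cues: Iterable[str], weights: Dict[str, float]) -> str:
    """Return the team awarded the penalty under a referee's cue weighting.

    Assumed rule: the observed cue with the highest weight determines the
    decision; if no weighted cue is observed, play continues.
    """
    candidates = [c for c in observed_cues if c in weights and c in CUE_BENEFITS]
    if not candidates:
        return "play on"
    top_cue = max(candidates, key=lambda c: weights[c])
    return CUE_BENEFITS[top_cue]


# Both cues are present in the same tackle (hypothetical scenario).
cues = ["tackler_not_rolling_away", "ball_carrier_not_releasing"]

# Two referees recognise the same criteria but weight them differently.
referee_a = {"tackler_not_rolling_away": 0.7, "ball_carrier_not_releasing": 0.3}
referee_b = {"tackler_not_rolling_away": 0.3, "ball_carrier_not_releasing": 0.7}

print(decide(cues, referee_a))
print(decide(cues, referee_b))
```

Under these invented weights the first referee awards the penalty to the attacking team and the second to the defending team, even though both recognise the same cues, which is the pattern of disagreement described for clips 2 and 9.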
Analyses by Group
Some inter-group differences were apparent; for example, the
referees collectively were marginally better than the support groups.
However, the mid-panel referees' performance was worse than both
the touch-judges and the assessors, yet they were the most confident in
their decisions. This may suggest that the mid-panel referees achieved
this level of ranking because of their greater confidence levels, rather
than through more accurate DM. A study by Franks, Elliott and Johnson
(1985) would seem to support this idea. This investigation asked expert
and novice gymnasts to view paired handspring performances, to identify
if there were differences between the two and to state where these
differences occurred. Results showed that the experts were no more
accurate, but were simply more confident in their decisions. However, in
the present case the super-elite top-20 group, which included several
international referees, showed more 'realistic' confidence
scores since these levels more accurately represented their levels of
accuracy and coherence.
Most alarmingly the referee coaches revealed the lowest levels of
accuracy. In fact, they were worse than the referees to whom they are
required to offer guidance. Since most of these individuals are
ex-referees who had not performed in many years, this is perhaps not
surprising since the speed of the game is now much quicker (Campsall,
2002) and inevitably interpretations have similarly evolved to meet the
new demands of the professional game. Nevertheless, this has enormous
implications for the development of elite referees. If the referee
coaches, the individuals responsible for teaching referees, are offering
erroneous or disparate advice on this critical area of law-application,
the current levels of inaccurate and incoherent DM may remain.
Before concluding, it is important to consider any methodological
limitations that may have contributed to our findings. First, it is
possible that some of the officials may have seen some of the test
incidents before as they may have been broadcast on television, or
indeed the participants may have been involved in officiating the games.
However, a referee will typically officiate in at least 25 games per
season, each containing in the region of 120 tackles, which would total
about 3,000 of these types of situations. In this intervention it is
questionable whether or not referees would have been able to remember
each incident. Nevertheless, more control should be taken to prevent
this in future.
Finally, another possible limitation of this study is the small
number of test clips that were used to assess referee performance.
Furthermore these difficult clips are not necessarily representative of
the most common tackles that are likely to be encountered. Therefore,
future studies should investigate a wider variety of scenarios in order
to explore the levels of accuracy and coherence that are required to
referee at the top level. More importantly, to help ensure that game
outcomes are not adversely influenced by poor referee decisions,
interventions should provide an expert's detailed interpretations,
particularly focusing on the types of tackles that create problems, in
order to produce more coherent referee DM.
Applied Implications
From an applied perspective, this video-based NDM approach offers a
means of identifying areas of concern (cf., Abernethy, 1996). For
example, this investigation revealed inconsistent use of yellow-cards
(and the subsequent loss of a player for 10 minutes) and revealed
decisions with penalties awarded in opposite directions.
It is surprising that two clips (5 and 9) revealed levels of
agreement lower than that which would be expected by chance (as
reflected by the negative Kappa scores in Table 1). Thus, for these two
specific cases taken from premier league games, the decisions made by
England's best RFU referees (which included two international
referees ranked in the world top-20), touch-judges, assessors and
coaches appear to offer decisions that could fairly be described as
random. It appears that, with respect to the application of law 15,
England's top officials seem to be providing very unpredictable
decisions. Clearly, the influence of such poor coherence on the game can
be substantial, with players having to adjust their play week by week to
fit in with the individual foibles of each particular referee. Perhaps
this is acceptable, although there is currently no data indicating the
levels of consistency that are acceptable in any sports setting.
Nevertheless, the views expressed by premier league coaches, the main
consumers in this case, are clear. They want a lot more consistency, and
see the development of greater coherence in the management of law 15 as
the most critical factor for the improvement of RFU refereeing (Bunting,
1998; Melrose, 1998). Finally, although only very few clips generated
this disappointingly low level of agreement, the impact of such extremes
on player trust and the respect held for officials may have wider
implications. In short, one inaccurate decision, especially at the wrong
time, could change the tenor of the whole match.
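To illustrate how a Kappa score can fall below zero, the following minimal sketch computes Fleiss' kappa, one common multi-rater agreement statistic, on hypothetical response counts. It is an assumption that this variant resembles the calculation behind Table 1; the function name `fleiss_kappa` and the example counts are illustrative only and are not data from this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of counts[item][category].

    Each row holds how many raters chose each category for one item;
    every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of raters")

    # Observed agreement: share of rater pairs that agree, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_observed = sum(
        sum(c * (c - 1) for c in row) / pairs for row in counts
    ) / n_items

    # Chance agreement: derived from the overall proportion of each category.
    total = n_items * n_raters
    n_categories = len(counts[0])
    p_chance = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(n_categories)
    )
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical data (not from this study): 4 raters, 2 categories, 2 clips.
unanimous = [[4, 0], [0, 4]]  # raters always agree with one another
split = [[2, 2], [2, 2]]      # raters split evenly on every clip

print(fleiss_kappa(unanimous))
print(fleiss_kappa(split))
```

In the evenly split case, the observed pairwise agreement (one third) is lower than the agreement expected by chance (one half), so the resulting kappa is negative.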
In addition to highlighting particular types of tackles that create
problems, this test also identified the groups of officials who were
less accurate. The referee coaches' poor performance in particular
may necessitate some form of SMM training. Developing declarative
knowledge of the task, its key concepts and their inter-relationships,
by exposing the reasons underpinning an expert's decisions, might
be an appropriate way to improve their understanding of the tackle
(Stout et al., 1996). Future research should examine the efficacy of
such techniques for sports officials in light of the growing literature
in NDM (Cannon-Bowers et al., 1996).
A Naturalistic Approach to Referee Decision-making
The findings of this preliminary investigation support McLennan and
Omodei's (1996) conclusions that 'own-point-of-view'
video scenarios, in this case closely representing the match day
referee's perspective, can effectively be used for investigating
referee DM through a NDM perspective. All participants indicated that
this approach represented a fair test of their refereeing prowess, while
suggestions for refinement were relatively minor. Furthermore, all the
participants showed high confidence levels and only a very low
percentage were unable to offer decisions due to insufficient
information.
Despite this support for the NDM framework, feedback suggested that
several other factors might need to be refined in order to make the test
and subsequent training systems as real as possible. For example, some
participants felt that knowledge of the flow of the game may be
beneficial, with comments such as "it didn't allow me to get a
feel for the atmosphere", and "it would have been helpful to
have seen previous plays in the game". However, these are factors
that may be more representative of the art rather than the science of
refereeing (i.e., the judgment of context rather than pure
law-application). While it may be argued that context forms a critical
part of 'mastery of the laws' (see Anshel, 1995; Anshel &
Webb, 1991), it may present so many degrees of freedom that it is too
complex to assess and train reliably. Moreover, it seems sensible that
before developing advanced skills such as contextual judgment (see
Mascarenhas et al., 2002), officials develop coherence in pure law
application, which provides the critical foundation upon which to
develop more advanced skills. Without such coherence, officials may
become even more discordant as contextual factors are added. So, in the absence of
contextual factors such as the emotion of the players and the tenor and
flow of the game, the present assessment provides a clear, unambiguous
test that requires a comparatively unequivocal application of the law.
In keeping with the literature, this study supports the contention
that researchers need to look at the reasons underlying decisions, as
well as the actual decisions made. Thus, training packages that use
these types of 'contentious' tackles, independently adjudged
to be realistic refereeing scenarios, may be appropriate to expose an
expert's mental model. This may speed up the process of amassing
experience (Stokes, Kemper, & Kite, 1997) and advance the
development of a SMM so that referees' decisions are not esoteric, but
rather based on an accurate and coherent understanding of law.
Authors' Note
We gratefully acknowledge the financial support from the Rugby
Football Union and the contributions made by Nick Bunting and the
full-time referees at the Rugby Football Union Referees Centre of
Excellence.
References
Abernethy, B. (1996). Training the visual-perceptual skills of
athletes. Insights from the study of motor expertise. American Journal
of Sports Medicine, 24(suppl. 6), S89-S92.
Abraham, A., & Collins, D. (1998). Examining and extending
research in coach development. Quest, 50, 59-79.
Ackford, P. (2003, March 16). Ring of confidence from the whistle
blowers. The Sunday Telegraph, p. 11.
Alessi, S. M. (1988). Fidelity in the design of instructional
simulations. Journal of Computer-Based Instruction, 15(2), 40-47.
Altman, D. G. (1991). Practical statistics for medical research.
Boca Raton, FL: Chapman and Hall.
Annett, J. (1997). Analysing team skills. In R. Flin, E. Salas, M.
Strub, & L. Martin (Eds.), Decision-making under stress: Emerging themes
and applications (pp. 315-325). Aldershot: Ashgate.
Anshel, M. H. (1995). Development of a rating scale for determining
competence in basketball referees: Implications for sport psychology.
The Sport Psychologist, 9, 4-28.
Anshel, M. H., & Webb, P. (1991). Defining competence for
effective refereeing. Sports Coach, 14(3), 32-37.
Brehmer, B. (1972). Policy conflict as a function of policy
similarity and policy complexity. Scandinavian Journal of Psychology,
13, 208-221.
Bunting, N. J. (1998). Rugby Football Union Referee. Welcome to the
National Conference, Bromsgrove, July 1999.
Bunting, N. J. (1999). Allied Dunbar premiership guidance for
referees, players and coaches on the application of law (RFU Tech. Rep.
from the conference on the game). Castlecroft, Wolverhampton.
Campsall, B. (2002). Refereeing the Tackle. In High, C. J. (Chair),
Rugby Football Union Performance Department: Inaugural Elite Referee
Unit Conference, Huddersfield, August 2002.
Cannon-Bowers, J. A., & Bell, H. H. (1997). Training
decision-makers for complex environments: Implications of the
naturalistic decision-making perspective. In C. E. Zsambok, & G.
Klein (Eds.), Naturalistic decision making (pp. 99-110). Mahwah, NJ:
Lawrence Erlbaum.
Cannon-Bowers, J. A., Salas, E., & Converse, S. A. (1990).
Cognitive psychology and team training: Shared mental models in complex
systems. Human Factors Society Bulletin, 33(12), 1-4.
Cannon-Bowers, J. A., Salas, E., & Pruitt (1996). Establishing
the boundaries of a paradigm for decision making research. Human
Factors, 38(2), 193-205.
Craven, B. J. (1999). A psychophysical study of leg-before-wicket
judgments in cricket. British Journal of Psychology, 89, 555-578.
Eady, J. (1999). World class performance plan - Guidelines. London:
Knight, Kavanagh, & Page.
Ford, G. G., Gallagher, S. H., Lacy, B. A., Bridwell, A. M., &
Goodwin, E. (1999). Repositioning the home plate umpire to provide
enhanced perceptual cues and more accurate ball-strike judgments.
Journal of Sport Behavior, 22(1), 28-44.
Franks, I. M., Elliott, M., & Johnson, R. (1985). The effects
of experience on the detection and location of performance differences
in a gymnastic technique. Paper presented at the meeting of the Canadian
society for psychomotor learning and sport psychology, Montreal.
Jones, M. V., Paull, G. C., & Erskine, J. (2002). The impact of
a team's aggressive reputation on the decisions of association
football referees. Journal of Sports Sciences, 20, 991-1000.
Klein, G. (1997a). An overview of naturalistic decision making
applications. In C. E. Zsambok, & G. Klein (Eds.), Naturalistic
decision making (pp. 49-59). Mahwah, NJ: Lawrence Erlbaum.
Klein, G. (1997b). The current status of the naturalistic decision
making framework. In R. Flin, E. Salas, M. Strub, & L. Martin (Eds.),
Decision-making under stress: Emerging themes and applications
(pp. 137-146). Aldershot: Ashgate.
Landis, J. R., & Koch, G. G. (1977). The measurement of
observer agreement for categorical data. Biometrics, 33, 159-174.
Langan-Fox, J., Code, S., & Langfield-Smith, K. (2000).
Team mental models: Techniques, methods and analytic
approaches. Human Factors, 42, 242-271.
MacMahon, C., & Ste-Marie, D. M. (1999). Decision-making in
rugby officials. Paper presented at the Canadian Society for Psychomotor
Learning and Sport Psychology, Edmonton, Alberta.
Mascarenhas, D. R. D., Collins, D., & Mortimer, P. (2002). The
art of reason versus the exactness of science in elite refereeing:
Comments on Plessner and Betsch (2001). Journal of Sport and Exercise
Psychology, 24, 328-333.
Melrose, A. (1998). More Jaw/Jaw, Less War/War: Coach and Referee
Communication. RFU Journal, 12-13. Autumn 1998.
McLennan, J., & Omodei, M. (1996). The role of prepriming in
recognition-primed decision-making. Perceptual and Motor Skills, 82,
1059-1069.
Millgram, E., & Thagard, P. (1996). Deliberative coherence.
Synthese, 108, 63-88.
Mortimer, P. W., & Collins, D. J. (1997, September). Coherence
of decision-making in team sports. Paper presented at the BASES Annual
Conference, York.
Omodei, M., McLennan, J., & Whitford, P. (1998). Using a
head-mounted video camera and two-stage replay to enhance orienteering
performance. International Journal of Sport Psychology, 29, 115-131.
Orasanu, J., & Connolly, T. (1993). The reinvention of
decision-making. In G. A. Klein, J. Orasanu, R. Calderwood, & C. E.
Zsambok (Eds.), Decision-making in action: Models and methods (pp.
3-20). Norwood, NJ: Ablex.
Oudejans, R. R. D., Verheijen, R., Bakker, F. C., Gerrits, J. C.,
Steinbrückner, M., & Beek, P. J. (2000). Errors in judging
'offside' in football. Nature, 404, 33.
Rainey, D., & Larsen, J. D. (1988). Balls, strikes, and norms:
rule violations and normative rules among baseball umpires. Journal of
Sport and Exercise Psychology, 10, 75-80.
Rainey, D., Larsen, J. D., Stephenson A., & Olson, T. (1993).
Normative rules among umpires: the "phantom tag" at second
base. Journal of Sport Behavior, 16(3), 147-155.
Rasmussen, J. (1985). The role of hierarchical knowledge
representations in decision making and system management. IEEE
Transactions on Systems, Man and Cybernetics, SMC-15(2), 234-243.
Rouse, W. B., Cannon-Bowers, J. A., & Salas, E. (1992). The role of
mental models in team performance in complex systems. IEEE Transactions
on Systems, Man and Cybernetics, 22, 1296-1308.
Rouse, W. B., & Morris, N. M. (1986). On looking into the black
box: prospects and limits in the search for mental-models. Psychological
Bulletin, 100, 349-363.
Sloan, T. (2004). Refs can't rate refs. Referee, 329, 58-61.
Ste-Marie, D. (2003). Expertise in sport judges and referees. In J.
Starkes & K. A. Ericsson (Eds.), Expert performance in sports:
Advances in research on sport expertise (pp. 169-189). Illinois: Human
Kinetics.
Stokes, A. F., Kemper, K., & Kite, K. (1997). Aeronautical decision
making, cue recognition, and expertise under time pressure. In
C. E. Zsambok & G. Klein (Eds.), Naturalistic decision making (pp.
183-196). Mahwah, NJ: Lawrence Erlbaum.
Stout, R. J., Cannon-Bowers, J. A., & Salas, E. (1996). The role
of shared mental models in developing team situational awareness:
Implications for training. Training Research Journal, 2, 85-116.
Williams, A. M., & Davids, K. (1995). Declarative knowledge in
sport: A by-product of experience or a characteristic of expertise.
Journal of Sport and Exercise Psychology, 17, 259-273.
Williams, A. M., & Grant, A. (1999). Perceptual skills in
soccer: Implications for talent identification to enhance
coach-performer interactions. Journal of Sports Sciences, 18, 737-750.
Yates, J. F. (2001). "Outsider:" Impressions of
naturalistic decision making. In E. Salas & G. Klein (Eds.), Linking
expertise and naturalistic decision making (pp. 9-33). Mahwah, NJ:
Lawrence Erlbaum.
Address Correspondence To: Duncan R. D. Mascarenhas, University of
Edinburgh, Department of PE, Sport & Leisure Studies, St Leonard's
Land, Holyrood Road, Edinburgh, EH8 8AQ, Scotland. Phone: +44 (0)
131-651-6043. Fax: +44 (0) 131-651-6521. Email:
duncan.mascarenhas@education.ed.ac.uk
Duncan R. D. Mascarenhas, Dave Collins and Patrick Mortimer
University of Edinburgh, UK
Table 1.
Responses of all Participants Expressed as a Percentage
Clip Number
Decision 1 2 3 4 5
No action--play on 2 13 9 2 21 (c)
Not enough info' 7 6 2 2
Manage situation 2 8 3 2 10
Advantage 5 2 2
Penalty to attack 82 (c) 15 (c) 7 31 (c) 15
Penalty to defence 5 7 58 (c) 48 15
Free Kick 11
Scrum 2 51 2 10 22
Scrum with turnover 2 6 14
Level of Confidence in Decision from 1 (low) to 5 (high)
M 4.1 3.6 3.8 4.3 4.0
SD 0.9 1.0 1.1 0.8 0.9
Kappa Statistic .60 * .14 .20 .14 -.04
Strength of agreement (1) Mod Poor Poor Poor V Poor
Clip Number
Decision 6 7 8 9 10
No action--play on 5 18 1
Not enough info' 2 1 2
Manage situation 6 2 16 1
Advantage 1 2 1 1
Penalty to attack 5 89 (c) 17 11 17
Penalty to defence 30 6 12 30 (c) 7 (c)
Free Kick
Scrum 49 2 55 18 5
Scrum with turnover 2 1 14 4 5
Level of Confidence in Decision from 1 (low) to 5 (high)
M 3.8 4.6 3.9 3.9 4.2
SD 1.1 0.7 1.0 1.0 .09
Kappa Statistic 0.17 .74 * .21 -.01 .4 (AS)
Strength of agreement (1) Poor Good Fair V Poor Mod
Table 2.
Percentage of Correct Responses, Agreement and Confidence Scores
by Group

        Referees
Clip    Top-20    21-40     41-65     Touch-    Assessors  Referee
                                      Judges               Coaches
1       100.0     87.5      82.6      83.0      70.4       84.6
2       7.1       0.0       18.2      6.4       38.5       7.7
3       64.3      75.0      45.5      65.9      64.0       18.2
4       42.9      25.0      45.5      25.5      37.0       0.0
5       14.3      12.5      26.1      23.9      22.2       8.3
6       78.6      50.0      36.4      44.7      40.7       66.7
7       100.0     71.4      100.0     79.1      92.3       92.3
8       42.9      62.5      52.2      51.1      55.6       76.9
9       28.6      37.5      39.1      28.3      33.3       8.3
10      64.3      50.0      78.3      73.9      70.4       66.7
M       54.3      47.1      52.4      48.2      52.4       43.0
SD      32.9      28.4      26.3      26.7      21.7       37.3

Kappa Statistic of Agreement
M       .39       .30       .35       .29       .29        .34
SD      .32       .20       .27       .19       .21        .26

Level of Confidence in Decision from 1 (low) to 5 (high)
M       4.3       4.4       4.1       3.9       3.9        3.9
SD      0.8       0.7       0.8       0.9       1.0        1.1

Notes: The mean accuracy of all the referees across all 10 clips
was M = 51.3% (SD = 28.5%).