Analogy in morphology: modeling the choice of linking morphemes in Dutch(*).
KROTT, ANDREA ; BAAYEN, R. HARALD ; SCHREUDER, ROBERT 等
Abstract
This study argues that a productive but not fully regular
morphological phenomenon, the choice of linking morphemes in Dutch
nominal compounds, is based on analogy. In Dutch, a linking -s- or -en-
can appear between the constituents of a nominal compound. We present
production experiments that reveal strong evidence that the choice of
linking morphemes in novel compounds is analogically determined by the
distribution of linking morphemes in what we call the "constituent
families." A "constituent family," is the set of existing
compounds that share the first (or second) constituent with the novel
compound. A further experiment shows that in the case of derived
pseudo-words as first constituents, it is the family of the suffix that
influences the choice of the following linking morpheme. In addition to
these experiments, we present computational simulation studies in which
the choices made by participants in our experiments are predicted with a
high degree of accuracy using a machine-learning algorithm for analogy.
These studies support the status of the constituent family as the
primary basis for analogical prediction. Finally, we outline a
psycholinguistic model for analogy in the mental lexicon that does not
give up symbolic representations and, at the same time, captures
nondeterministic variation.
Introduction
Morphological variation can often be captured by simple rules.
Consider, for example, the realization of the regular plural of English
nouns, which appears in three different forms, /Iz/, /z/, and /s/. These
three variants can be predicted on the basis of the phonological form of
the base word. The plural is pronounced /Iz/ after bases ending in
sibilants (e.g. horses), /z/ after bases ending in vowels and voiced
segments other than /z/, /3/, and /d3/ (e.g, beds), and it is
pronounced/s/after bases ending in voiceless segments other than /s/,
/[integral of]/, and /t[integral of]/ (e.g, months).
In addition to this kind of regular variation, there are
morphological domains where the choice between alternative realizations
is less predictable. One such domain is the analysis of linking elements
in compounds, which are also referred to as connectives, interfixes,
linkers, or linking morphemes. Linking elements occur in various
languages across different language families. In English, linking
elements are extremely rare. We know of only a few examples, all built
with the head word man: marksman, sportsman, craftsman, kinsman,
tradesman, and spokesman. The last example, in which the -s- appears
without any possible semantic function, best illustrates the phenomenon
of linking elements. In some languages, linking elements can be fully
predicted on the basis of the phonological characteristics of the
preceding (and/or the following) constituent. For instance, Zoque, a
Mixe-Zoquean language spoken in Mexico, has a nominal compound formation
in which the linking element is a vowel that is identical to the vowel
in the preceding syllable. However, in many other languages such clear
rules cannot be formulated. For example, Kabardian (North Caucasian) has
the linking elements -ah-, -m-, -n-, and -r-, which tend to be
obligatory in some morphological contexts and optional in others
(Kuipers 1960: 78-80). In Indo-European, the Germanic languages are
especially rich in nonpredictable or only partly predictable variation
in the use of linking elements (e.g. Danish, Norwegian, Swedish, and
German). The distribution of the two main linking elements in Dutch,
-en- and -s-, is likewise only partially predictable by rule.
The systematicities governing the selection of linking morphemes is
a longstanding unsolved problem in the morphology of Dutch and many
other Germanic and non-Germanic languages. It is an issue that has
hardly received attention in the generative tradition,(1) with the
exception of Botha (1968), even though it is a problem that receives
discussion in any good reference grammar (e.g. Haeseryn et al. 1997; De
Haas and Trommelen 1993).
A first goal of the present study is to show that the distribution
of linking morphemes in Dutch noun-noun compounds can be accounted for
by means of a formal computational model of analogy with a higher degree
of observational adequacy than can be achieved by means of the rules
proposed in the literature. Our conclusions are based both on surveys of
existing compounds in the Dutch lexicon and on the choices for linking
morphemes in novel compounds as produced by participants under strict
experimental conditions.
A second goal is to contribute to the discussion in the current
literature about the nature of morphological rules, whether such rules
are symbolic in nature (Clahsen 1999; Marcus et al. 1995; Pinker 1991,
1997) or whether rules are an epiphenomenon of distributed storage in
connectionist networks (Seidenberg 1987; Seidenberg and Hoeffner 1998;
Plunkett and Juola 1999; Rueckl et al. 1997). The phenomenon that we are
dealing with is interesting in the sense that it is fully productive and
yet not completely regular. As such, it poses a serious challenge to
proponents of symbolic rule systems. At the same time, we will show that
it is possible to predict nondeterministic aspects of human cognition without necessarily making use of distributed connectionist networks. In
this sense, our present analogy-based approach provides an alternative
to both symbolist and connectionist approaches to cognition.
The notion of analogy as we use it in this paper is different from
its two traditional interpretations in linguistics. First, analogy is
often contrasted with rules, with regular novel forms being formed by
rules, and exceptional novel forms being built by analogy to individual
examples (e.g. brunch by analogy to smog; see, e.g. Anshen and Aronoff
1988). Second, analogy can also be understood as the initial basis for
the acquisition of rules. In this view, analogical learning might be
involved in determining the conditions under which a rule applies. But
once a rule is established, the instances that led to the rule would
then be irrelevant and would not be kept in memory.
Our use of the term analogy differs from these two interpretations
in the following ways. First, the kind of analogy with which we are
concerned is not the kind of analogy that occasionally leads to
exceptional new creatively coined words such as brunch. Instead, we are
concerned with the regular phenomena that are traditionally described by
means of linguistic rules. Following Skousen (1989) and Daelemans et al.
(1999), we adopt a formal and computationally tractable definition of
analogy that offers a new way of understanding the way in which
linguistic rules actually work. Second, we hypothesize that, at least in
the domain of morphological processing, there are no rules that are
formed on the basis of initially stored examples of complex words, with
the initial exemplars fading from memory. Instead, we assume that many
fully regular complex words, both inflected and derived, remain
available in the mental lexicon (e.g. Bertram et al. 1999; Bertram et
al. 2000; Baayen et al. 1997; Sereno and Jongman 1997; Sandra et al.
1999; Taft 1979; Baayen et al. i.p.) and serve as exemplars for the
analogical formation of novel forms. In other words, we hypothesize that
rules are essentially analogical in nature (De Saussure 1966).
In what follows, we first describe the problem of the
systematicities underlying the distribution of linking morphemes in
Dutch, and we show that the notion of default rules that has figured
prominently in recent discussions (Marcus et al. 1995; Clahsen 1999) is
not applicable to this phenomenon. In the next section, we present the
results of three production experiments, which show that, the
substantial variation in the choice of linking morphemes
notwithstanding, Dutch native speakers tend to converge on the same
linking elements for novel compounds. These experiments show,
furthermore, that the choice of a linking element for a novel compound
is strongly influenced by the distribution of linking elements in the
set of existing compounds sharing the first or second constituent with
the novel compound (e.g. fiets `bike' in fiets-pad `cycle
path' and fiets + bel `bicycle bell'; and winkel `shop'
in schoen + winkel `shoe shop' and hoed + en + winkel, `hat + PLUR + shop', `hat shop'). We will refer to these sets of compounds
as constituent families.
In the subsequent section, we will show that the notion of analogy
based on constituent families can be formalized computationally, and
that this allows us to predict the distribution of linking morphemes in
the Dutch lexicon and also to predict the performance of our
experimental participants. In the general discussion, we outline how the
computational model can be mapped onto a psycholinguistically more
realistic spreading activation model along the lines of Schreuder and
Baayen (1995).
Linking morphemes in Dutch: no rules but tendencies
In this section, we describe the distributional properties of the
linking elements in Dutch and their linguistic status. The two main
linking elements in Dutch noun-noun compounds are -s- and -en-. The
latter is occasionally realized in the orthography as -e-. Both -en- and
-e- are pronounced as schwa in standard Dutch. As the present study
focuses on the production of linking elements, we do not distinguish
between the two orthographic realizations.
There is a longstanding discussion about the status of these
linking elements. Are they just meaningless letters or do they carry
semantic information? Both -s- and -en- are homographic with the two
productive plural suffixes of Dutch nouns.(2) The linking element -en-
may only appear after left constituents that themselves pluralize with
-en-. The linking element -s- is not constrained in the same way. It may
appear following constituents with which it does not form a plural.
There is evidence that -en- marks plurality in compounds, as shown by
Schreuder et al. (1998). Neijt et al. (n.d.) show that, depending on the
first constituent, the -s- may also convey plural semantics. In the
light of this evidence, we will henceforth refer to -en- and -s- as
linking morphemes rather than linking elements. Note, however, that the
question whether the -s- and -en- forms in Dutch compounds are indeed
completely identical to the Dutch plural suffixes is not what is at
issue in the present study. Our aim here is to come to grips with the
distribution of these forms irrespective of their morphological status.
The literature on linking morphemes in Dutch compounds has
attempted to capture the distribution of linking morphemes by means of
rules operating at the levels of phonology, morphology, and semantics
(see, e.g. Van den Toorn 1981a, 1981b, 1982a; 1982b; Mattens 1984). An
example of a phonological rule is the constraint that after first
constituents ending in a vowel, or ending in a schwa followed by a
sonorant, or ending in a liquid followed by /k/ or /m/ (thee + bus `tea
box'; meubel + zaak `furniture shop'), linking morphemes are
not allowed. This rule is not without exceptions, however, as shown by a
compound such as pygmee + en + yolk, `pygmy + PLUR + people',
`pygmy people'.
At the morphological level, particular affixes show preferences for
specific linking morphemes. For instance, the diminutive suffix -je is
always followed by the linking -s- in compounds (plaat + je + s + boek,
picture + DIMUNITIVE + PLUR + book, `small pictures book'). Other
morphemes show strong preferences, such as the suffix -heid
`-ness', which appears predominantly with -s-, but occasionally
without a linking morpheme and rarely with -en-.
At the level of semantics two different kinds of constraints have
been observed. First, the semantics of the first constituent may render
the use of a linking morpheme unlikely. For instance, mass nouns are not
followed by linking morphemes (e.g. papier + handel `paper trade';
exception: tabak + s + rook, `tobacco + GENITIVE + smoke', `tobacco
smoke'). Conversely, the linking morpheme -en- often occurs when
the first constituent of a compound has a plural interpretation
(Haeseryn et al. 1997: 685; Schreuder et al. 1998): boek + en + kast,
`book + PLUR + case', `book case', krent + en + brood,
`currant + PLUR + bread', `currant bread', exception boek +
handel, `book shop'. Semantic factors may interact with the
morphological structure of the first constituent. For instance, first
constituents ending in -er denote human agents or objects. For human
agents one tends to find the linking -s-, as in duik + er + s + ziekte,
`dive + er + PLUR + sickness', `decompression sickness', while
for inanimate objects one tends to find no linking morpheme, as in
straal + jager + piloot, `stream + hunt + er + pilot', `fighter jet
pilot'. These rules are also not without exceptions (e.g. leraar +
en + opleiding `teacher + PLUR + education', `education of
teachers') (see Mattens 1984). Second, the semantic relation between the two constituents has also been argued to codetermine the
choice of the linking morpheme. For instance, copulative compounds such
as man + wijf, `man + bitch', 'mannish woman' never take
a linking morpheme. Similarly, compounds in which the first constituent
is the object of a deverbal agent or action noun to its right also tend
to resist insertion of linking morphemes (boek + verkoper `book
seller'; exception: weer + s + verwachting, `weather + GENITIVE +
expectation', `weather forecast').
A final property of linking morphemes in Dutch is that they
evidence a certain amount of variability. For instance, the word
`spelling change' has two translation equivalents in Dutch,
spelling + verandering and spelling + s + verandering. Even for a single
speaker, forms such as these appear to be in free variation.
Summing up, first constituents seem to have the strongest influence
on the choice of linking morphemes, phonologically, morphologically, and
semantically. The second constituent plays a minor role, being a
codeterminant of the semantic relation between the two constituents. The
numbers of exceptions to the rules describing the distribution of
linking morphemes are so large that Van den Toorn (1982a, 1982b) has
argued that we are dealing with tendencies rather than with real rules.
It is important to note that the distribution of the linking
morphemes in Dutch does not lend itself to an analysis in terms of a set
of rules including a default rule. In such a system of rules, a series
of positively specified cases is supplemented by a general case, the
default, for which a simple and straightforward definition of its input
domain (in the sense of Van Marle, 1985) cannot be given.(3) Focusing on
the phonological rules for the distribution of Dutch linking morphemes,
we observe only negative specifications: linking morphemes do not appear
following left constituents that end in a vowel, in a schwa followed by
a sonorant, or in a liquid followed by /k/ or /m/. Crucially, the notion
of a default, covering those words that do not fall under the negatively
specified input domains, does not make sense for Dutch linking
morphemes, as it does not have any predictive power with respect to the
appropriate linking morpheme. Thus, words falling under the default,
that is, words that do not end in a vowel, in a schwa followed by a
sonorant, or in a liquid followed by /k/ or /m/, can still appear in a
compound with no linking morpheme, with -s-, or with -en-. Clearly, none
of these three possibilities can be the default choice. Turning to the
level of morphology, we again find that the notion of a default is not
applicable, as each suffix has its own stronger or weaker preferences.
Similarly, at the level of semantics, we only observe random
subgeneralizations without a well-specified overall default. In spite of
the absence of a rule system with a default, speakers of Dutch
nevertheless have strong intuitions about which linking morpheme is
appropriate for novel compounds.
Production experiments
In this section, we address two related questions. First, to what
extent do native speakers of Dutch agree about which linking morphemes
are most appropriate to use in novel compounds? How much variability can
be observed given the strong intuitions of native speakers as to what
might be the appropriate choice? Second, what factors underlie these
strong intuitions? We shall see that there is indeed strong agreement
about which linking morpheme is most appropriate. As to the factors
underlying the choice of linking morphemes, we shall see that the
existing compounds sharing the left (or right) constituent with the
target compound forms perhaps the most important factor of all. In what
follows, we will refer to these compounds as the left and right
constituent families of such a target compound. An individual compound
in such a family will be referred to as a constituent family member.
The next section presents experimental evidence for the important
role of the constituent families for the linking morphemes -en- and -s-.
The following section investigates the relevance of the morphological
structure of the first constituent. We have not explicitly included
semantic and phonological factors in our experimental design. However,
we will show that analogical modeling of the experimental data yields
slightly better results when semantic properties of the constituents are
also taken into account. Including phonological information results in
slightly worse performance.
The next two subsections present experiments studying the effect of
the constituent family on the choice of the linking morphemes -en- and
-s-.
The constituent family effect: experiment 1: the linking morpheme
-en-
If the choice of linking morphemes in novel compounds were based
simply on the distribution of the linking morphemes in the lexicon as a
whole, one would expect speakers to choose not to use a linking morpheme
in roughly seven out of ten cases: 69% of all compounds listed in the
CELEX lexical database (Baayen et al. 1995) appear without any linking
morpheme. Their second best guess would then be -s-, which occurs in 20%
of the compounds in this database, and their least probable bet would be
-e(n)- (11%). In the light of the linguistic description of the
distribution of -en- and -s- presented in the previous section, this
simple guessing behavior is unlikely. On the other hand, the linguistic
rules that have been formulated tend to have so many exceptions that
their explanatory value is called into question as well. In what
follows, we explore the hypothesis that native speakers of Dutch base
their choice on the relative frequencies of the linking morphemes as
realized not in the lexicon as a whole, but in the restricted sets
composing the constituent families of individual compounds.
Method
Materials. We constructed three sets of left constituents (L1, L2,
L3) and three sets of right constituents (R1, R2, R3). Each set
contained 21 nouns. The constituents of L1 and R1 had constituent
families with as strong a bias as possible toward the linking morpheme
-en-. Conversely, L3 and R3 showed a bias as strong as possible against
-en-, though we made sure that these constituents form their plural with
the suffix -en so that a linking -en- is possible. The sets L2 and R2,
the neutral sets, contained nouns with families without a clear
preference for or against -en-. We used the CELEX lexical database
(Baayen et al. 1995) to determine the constituent families of the
constituents in these six sets. Compounds with a token frequency of zero
in a corpus of 42 million words were not included.
The constituents in the L1 set had constituent family members of
which at least 70% contained the linking morpheme -en-. The mean number
of compounds in these families was 12.5 (range 5-43). Their mean token
frequency was 149.2 per 42 million word forms (range 58-439). The range
of choices for R1 constituents was more restricted. The constituents in
the R1 set therefore had constituent family members of which at least
60% contained the linking morpheme -en-. The mean number of compounds in
these families was 3.6 (range 2-7). Their mean token frequency was 49.1
per 42 million word forms (range 20-119). Neutral left constituents are
rare. The neutral set L2 included left constituents whose families
contained between 35% and 65% compounds with the linking morpheme -en-.
These families had a mean number of compounds of 8.3 (range 3-24) and a
mean token frequency of 136.3 per 42 million word forms (range 15-439).
The constituents in the R2 set had constituent family members of which
40% to 60% contained the linking morpheme -en-. These families had a
mean number of compounds of 5.3 (range 3-15) and a mean token frequency
of 66.7 per 42 million word forms (range 8-192). The remaining sets L3
and R3, the groups with a bias against -en-, contained constituents
whose family members never have a linking -en-. There were in the mean
25 (range 11-66; L3) and 17.9 (range 10-47; R3) family members
respectively. Their mean token frequency was 573.7 (range 98-2650; L3)
and 349.8 (range 47-2290; R3). These are the maximal contrasts that
allowed us to select 21 constituents for each experimental set.
Each of the three sets of left constituents (L1, L2, L3) was
combined with the three sets of right constituents (R1, R2, R3) to form
pairs of constituents for new compounds in a factorial design with two
factors: bias in the left position (positive, neutral, and negative) and
bias in the right position (positive, neutral, and negative). None of
these compounds is attested in the CELEX lexical database with a token
frequency higher than zero. All have a high degree of semantic
interpretability. Appendix A lists all experimental items. The 9 x 21 =
189 experimental items were divided over three lists. List 1 contained
the compounds of the factorial combinations L1-R1, L2-R3, and L3-R2.
List 2 contained the compounds of the combinations L1-R2, L2-R1, and
L3-R3, and list 3 contained the compounds of the combinations L1-R3,
L2-R2, and L3-R1. In this way, each participant saw a given constituent
only once. We constructed a separate randomized list of the 3 x 21 = 63
compound constituent pairs for each participant.
Procedure. The participants performed a cloze task. The
experimental list of items was presented to the participants in written
form. Each line presented two compound constituents separated by two
underscores. We asked the participants to combine these constituents
into new compounds and to specify the most appropriate linking morpheme,
if any, at the position of the underscores, using their first
intuitions. Occasionally, the first constituent may change its form when
it is combined with a linking morpheme (e.g. schip `ship' appears
as scheep in the compound scheepswerf `shipyard'). The instructions
made clear that these changes were not of interest and could be ignored.
We told the participants that they were free to use -eh- or -e- as
spelling variants of the linking morpheme -en-. The experiment lasted
approximately 15 minutes.
Participants. Sixty participants, mostly undergraduates at Nijmegen
University, were paid to participate in the experiment. All were native
speakers of Dutch. The participants were divided into three groups. Each
group was asked to complete one of the three experimental lists.
Results and discussion
Occasionally, participants filled in a question mark or a letter
sequence other than a linking morpheme. Such responses were counted as
errors. The overall error rate was extremely low (0.05%), which allowed
us to include all participants and all items in the data analysis. Table
1 summarizes the percentages of en responses versus other responses for
the nine experimental conditions. Appendix A lists the individual words
together with the absolute numbers of en and not en responses.
Table 1. Percentages of selected linking morphemes when varying bias
for -en- (positive, neutral, and negative) in the left and right
compound position
Left position Right position
positive neutral negative
Positive
en 94.8 (11.2)(a) 96.4 (6.7) 87.4 (15.3)
not en 5.2 (11.2) 3.6 (6.7) 12.6 (15.3)
other 0 0 0
Neutral
en 75.0 (23.7) 81.9 (15.5) 58.3 (26.9)
not en 25.0 (23.7) 18.1 (15.5) 41.2 (26.9)
other 0 0 0.5
Negative
en 18.1 (19.1) 18.8 (19.9) 6.0 (7.7)
not en 81.9 (19.1) 81.2 (19.9) 94.0 (7.7)
other 0 0 0
(a.) Standard deviations given in parentheses.
A by-item logit analysis (see, e.g., Rietveld and Van Hout 1993;
Fienberg 1980) of the en and not en responses revealed a main effect of
bias in the left position (F(2, 180) = 119.3, p [is less than] 0.0001),
a main effect of bias in the right position (F(2, 180) = 12.8, p [is
less than] 0.0001), and no interaction of the bias in both positions
(F(4, 180) [is less than] 1). Although the neutral bias condition for
the right constituents led to slightly higher numbers of en responses
than the positive bias condition, the difference between these two
conditions is not reliable (F(1, 120) = 1.1, p = 0.2974).
The upper panel of Figure 1 shows the effects of both biases on the
percentage of en responses. Bias has a larger effect on the left
position (a difference of roughly 80% between the positive and negative
conditions) than on the right position (a difference of roughly 15%).
This result reflects an asymmetry in the distribution of the linking
elements in Dutch that is also mirrored in our experimental design.
Figure 2 illustrates this asymmetry for the families of left and right
constituents of compounds with the linking morpheme -en-. The left panel
is a scattergram for the left constituents. It represents each of the
4320 constituents by a dot in the plane spanned by the number of
compounds with -en- in which it appears (horizontal axis) and the number
of compounds without -en- in which it appears (vertical axis). Note that
the points are scattered along the two axes, indicating that there are
many left constituents that occur predominantly either with -en- or
without -en-. Turning to the right panel, we find a more random pattern
for the 3935 right constituents: here, the presence of a larger number
of compounds with -en- does not imply a small number of compounds
without -en-, and vice versa. Thus, a strong bias for -en- exists only
for left constituents. Interestingly, this asymmetry is clearly
reflected in the responses of the participants in the present
experiment. If participants had chosen the linking morpheme at random on
the basis of all the existing compounds (CELEX: 43413) in the language,
one would have expected -en- (CELEX: 4744) to be selected in roughly 11%
of our experimental material. The left constituents provide larger
families with clearer preferences for or against -en-, leading to a much
higher percentage of en responses in the positive and neutral conditions
(58%-96% versus 6%-19% in the negative condition).
[Figures 1&2 ILLUSTRATIONS OMITTED]
In a post-hoc analysis we also tested the overall effect of family
homogeneity on the response homogeneity across the three conditions
(positive, neutral, negative) for both the left and right bias. We
calculated the family homogeneity in terms of the difference between the
number of family members with -en- and the number of family members
without -en-. We calculated the response homogeneity in terms of the
difference between the number of en responses and other responses. The
upper panels of Figure 3 reveal a nonlinear correlation between response
homogeneity and family homogeneity represented by a dotted line.(4) The
upper left panel shows a sigmoid curve for the left constituents. The
upper right panel shows a more diffuse pattern for the right
constituents. Despite this difference, a Spearman correlation test
revealed a significant correlation between the family homogeneity and
the response homogeneity for both the left ([r.sup.s] = 0.87, z = 6.88,
p [is less than] 0.0001) and the right position ([r.sup.s] = 0.34, z =
2.70, p = 0.007). The magnitude of these correlation coefficients
([r.sup.s] = 0.87 versus [r.sub.s] = 0.34) corresponds to the difference
in strength of the left and right bias: in terms of rank correlations,
the left bias explains 76% of the variance, while the right bias
explains only 12% of the variance.
[Figure 3 ILLUSTRATION OMITTED]
Having observed clear effects of analogy on the choice of the
linking morpheme -en-, we now turn to the linking morpheme -s-.
The constituent family effect: experiment 2: the linking morpheme
-s-
Method
Materials. As in experiment 1, we constructed three sets of left
constituents (L1, L2, L3) and three sets of right constituents (R1, R2,
R3). Each set contained 21 nouns. The constituents of the L1 and R1 sets
had constituent families with as strong a bias as possible toward the
linking morpheme -s-. Conversely, L3 and R3 showed a bias as strong as
possible against -s-. The sets L2 and R2, the neutral sets, contained
nouns with families without a clear preference for or against -s-. We
used the CELEX lexical database to determine the constituent families of
the constituents in these six sets. Compounds with a token frequency of
zero in a corpus of 42 million words were not included.
The constituents in the L1 set had constituent family members of
which at least 80% contained the linking morpheme -s-. The mean number
of compounds in these families was 45.7 (range 15-174). Their mean token
frequency was 1196.8 per 42 million word forms (range 102-6663). The
constituents in the R1 set had constituent family members of which at
least 70% contained the linking morpheme -s-. The mean number of
compounds in these families was 6.5 (range 4-19). Their mean token
frequency was 103.5 per 42 million word forms (range 12-409). Neutral
left constituents are rare. The neutral set L2 included left
constituents whose families contained between 35% and 65% compounds with
the linking morpheme -s-. These families had a mean number of compounds
of 6.4 (range 2-34) and a mean token frequency of 116.9 per 42 million
word forms (range 5-915). The constituents in the R2 set had constituent
family members of which 45% to 55% contained the linking morpheme -s-.
These families had a mean number of compounds of 16.4 (range 4-52) and a
mean token frequency of 216.4 per 42 million word forms (range 18-527).
The remaining sets L3 and R3, the groups with a bias against -s-,
contained constituents whose family members never have a linking -s-.
There were in the mean 31.2 (range 15-77; L3) and 2.45 (range 10-17; R3)
family members respectively. Their mean token frequency was 903.1 (range
98-2874; L3) and 532.9 (range 39-2677; R3). These are the maximal
contrasts that allowed us to select 21 constituents for each
experimental set.
As in experiment 1, each of the three sets of left constituents
(L1, L2, L3) was combined with the three sets of right constituents (R1,
R2, R3) to form pairs of constituents for new compounds in a factorial
design with two factors: bias in the left position (positive, neutral,
and negative) and bias in the right position (positive, neutral, and
negative). None of these compounds is attested in the CELEX lexical
database with a token frequency higher than zero. All have a high degree
of semantic interpretability. Appendix B lists all experimental items.
The 9 x 21 = 189 experimental items were divided over three lists. List
1 contained the compounds of the factorial combinations L1-R1, L2-R3,
and L3-R2. List 2 contained the compounds of the combinations L1-R2,
L2-R1, and L3-R3, and list 3 contained the compounds of the combinations
L1-R3, L2-R2, and L3-R1. In this way, each participant saw each
constituent only once. We constructed a separate randomized list of the
3 x 21 = 63 compound constituent pairs for each participant.
Procedure. The procedure was identical to that of experiment 1.
Participants. Sixty participants, mostly undergraduates at Nijmegen
University, were paid to participate in the experiment. All were native
speakers of Dutch; none had participated in the previous experiment. The
participants were divided into three groups. Each group was asked to
complete one of the three experimental lists.
Results and discussion
The participants followed the instructions very closely so that no
responses had to be counted as errors. That allowed us to include all
participants and all items in the data analysis. Table 2 summarizes the
percentages of s responses versus other responses for the nine
experimental conditions. Appendix B lists the individual words together
with the absolute numbers of s and not s responses.
Table 2. Percentages of selected linking morphemes when varying bias
for -s- (positive, neutral, and negative) in the left and right
compound position
Left position Right position
positive neutral negative
Positive
s 96.7 (20.3)(a) 97.4 (22.8) 91.7 (24.2)
not s 3.3 (20.3) 2.6 (22.8) 8.3 (24.2)
Neutral
s 70.5 (10.2) 67.6 (3.7) 53.6 (9.5)
not s 29.5 (10.2) 32.4 (3.7) 46.4 (9.5)
Negative
s 13.6 (5.2) 5.2 (11.5) 1.9 (5.1)
not s 86.4 (5.2) 94.8 (11.5) 98.1 (5.1)
(a.) Standard deviations given in parentheses.
A by-item logit analysis of the s and not s responses revealed a
main effect of bias in the left position (F(2,180) = 150.6, p [is less
than] 0.0001), a main effect of bias in the right position (F(2,180) =
10.5, p [is less than] 0.0001), and no interaction of the bias in both
positions (F(4,180)= 1.6, p = 0.1883). Again, the difference between the
neutral and positive bias conditions on the right position is not
reliable (F(1,120) = 1.9, p = 0.1687).
The lower panel of Figure 1 shows the effects of both biases on the
percentage of s responses. As in experiment 1, bias has a larger effect
on the left position (a difference of minimally 70% between the positive
and negative conditions) than on the right position (a difference of
maximally 17%). This result again reflects an asymmetry in the
distribution of the linking elements in Dutch that is also mirrored in
our experimental design. The left constituents provide larger families
with clearer preferences for or against -s-, leading to a much higher
percentage of s responses in the positive and neutral conditions (from
53% up to 97% versus 2% up to 14% for the negative condition).
In a post-hoc analysis we tested the overall effect of the family
homogeneity on the response homogeneity across the three conditions
(positive, neutral, negative) for both the left and right bias. As
before, we calculated the family homogeneity in terms of the difference
between the number of family members with -s- and the number of family
members without -s-. We calculated the response homogeneity in terms of
the difference between the number of s responses and other responses.
The lower panels of Figure 3 reveal a nonlinear correlation between
response homogeneity and family homogeneity represented by a dotted
line. The lower left panel shows the data of the left constituents, the
lower right panel shows the data of the right constituents. As for the
-en- homogeneity, the left constituents reveal a sigmoid curve, while
the right constituents show a more diffuse pattern. As in experiment 1,
a Spearman correlation test revealed a significant correlation between
the family homogeneity and the response homogeneity for both the left
([r.sub.s] = 0.89, z = 7.00, p [is less than] 0.0001) and the right
position (Spearman: [r.sub.s] = 0.42, z = 3.33, p [is less than]
0.0001). The magnitude of these correlation coefficients ([r.sub.s] =
0.89 versus [r.sub.s] = 0.42) corresponds to the difference in strength
of the left and right bias: in terms of rank correlations, the left bias
explains 79% of the variance, while the right bias explains only 18% of
the variance.
Experiment 2 addressed the question whether the families of the
right and left constituent affect the choice for or against the linking
morpheme -s- when building a new nominal compound. We were able to
replicate the results of experiment 1, which tested the family effect on
the linking morpheme -en-. The family of the left constituent has a
strong effect on the choice of the linking morpheme, while the family of
the right constituent has a smaller but also significant effect.
The suffix family effect: experiment 3: the effect of the preceding
suffix on the linking morpheme -s
We have seen that the families of the immediate constituents of a
new nominal compound have a great influence on the choice of the linking
morpheme. The linguistic literature tells us that, in the case of
derived words as left constituents, it is the suffix that has influence
on the following linking morpheme (Van den Toorn 1981 a, 1981b). For
instance, suffixes -ist (similar to English person-noun forming
"-ist") or -in (similar to English "-ess") appear
mainly with -en-, while suffixes -aard (similar to English
"-ee") or -held (similar to English "-ness") appear
mainly with -s-. However, like the constituents, the suffix does not
completely determine the linking morpheme. We therefore tested whether
the suffix family, that is, all compounds that contain a left
constituent built with a particular suffix, has an effect on the choice
of the linking morpheme. For this experiment we chose the linking
morpheme -s- because the -s- appears much more often with a preceding
suffix (586/1004*100 = 58.4% of all preceding derived words) than the
-en- (54/594*100 = 9.1% of all preceding derived words). To make sure
that we were testing the effect of the suffix and not the effect of the
left constituent, we used pseudo-derivations.
Method
Materials. We constructed two sets of left pseudo-constituents (L1,
L2) and three sets of right existing constituents (R1, R2, R3). Each set
contained 21 nouns. The pseudo-constituents of the sets L1 and L2
contained Dutch suffixes with pseudostems, none of which violated the
phonotactic rules of Dutch. The suffixes of L1 were -ing (similar to
English "-ing"), -heid (similar to English "-ness"),
and -iteit (similar to English "-ity"). They appear in CELEX
compounds mainly with the linking morpheme -s- (-ing: 379/406* 100 =
93.3%; -heid: 65/66*100 =: 98.5%; -iteit: 21/25*100 = 84.0%). The
suffixes of L2 were -in (similar to English "-ess"), -sel
(similar to English "-ee"), and -ster (similar to English
"-ess"). They appear in CELEX in at least 50% of all cases
without the linking morpheme -s- (-in: 0/1 = 0.0%; -sel 0/6 = 0.0%;
-ster: 1/2 = 50.0%). R1, R2, and R3 were the same as in experiment 2.
Thus, R1 had constituent families with as strong a bias as possible
toward the linking morpheme -s-. R3 showed a bias as strong as possible
against -s-. The set R2, the neutral set, contained nouns with families
without a clear preference for or against -s-.
Similar to the previous experiments, each of the two sets of left
pseudo-constituents (L1, L2) was combined with the three sets of right
constituents (R1, R2, R3) to form pairs of constituents for new
compounds in a factorial design with two factors: bias in the left
position (positive and negative) and bias in the right position
(positive, neutral, and negative). Appendix C lists all experimental
items. The 6 x 21 = 126 experimental items were divided over three
lists. List 1 contained the compounds of the factorial combinations
L1-R1 and L2-R2. List 2 contained the compounds of the combinations
L1-R2 and L2-R3, and list 3 contained the compounds of the combinations
L1-R3 and L2-R1. In this way, each participant saw each constituent only
once. We constructed a separate randomized list of the 2 x 21 = 42
compound constituent pairs for each participant.
Procedure. The procedure was identical to that of experiments 1 and
2.
Participants. Sixty participants, mostly undergraduates at Nijmegen
University, were paid to participate in the experiment. All were native
speakers of Dutch; none had participated in the previous experiments.
Each group was asked to complete one of the three experimental lists.
Results and discussion
Occasionally, participants filled in a question mark or a letter
sequence other than a linking morpheme. Such responses were counted as
errors. The overall error rate was extremely low (0.2%), which allowed
us to include all participants and all items in the data analysis. Table
3 summarizes the percentages of s responses versus other responses for
the six experimental conditions. Appendix C lists the individual words
together with the absolute numbers of s and not s responses.
Table 3. Percentages of selected linking morphemes when varying bias
for -s- in the left position (positive and negative) and right
position (positive, neutral, and negative)
Left position Right position
positive neutral negative
Positive
s 84.0 (14.9)(a) 86.4 (9.0) 79.5 (14.7)
not s 16.0 (14.9) 13.6 (9.0) 20.5 (14.7)
other 0 0 0
Negative
s 24.8 (17.4) 20.0 (15.4) 16.4 (16.9)
not s 75.2 (17.4) 80.0 (15.4) 82.6 (16.9)
other 0 0 1.0
(a.) Standard deviations given in parentheses.
A by-item logit analysis of the s and not s responses revealed a
main effect of bias in the left position (F(1,120) = 276.0, p [is less
than] 0.0001), no effect of bias in the right position (F(2,120) = 2.2,
p = 0.1201), and no interaction of bias in both positions (F(2,120) =
0.6, p = 0.5726).
Experiment 3 addressed the question whether the family of the
preceding suffix affects the choice for or against the linking morpheme
-s- when building a new nominal compound. We found a strong effect of
the suffix family on the choice of the linking morpheme. We were not
able to replicate the smaller but significant effect of the family of
the right constituent that we saw in experiments 1 and 2. The use of
pseudo-words in the left position led to compounds that are difficult to
interpret. Maybe the lack of a possible interpretation decreased the
effect of the bias in the right position, which was already small in the
previous two experiments.
Summary: experimental results
Experiments 1 and 2 revealed that linking morphemes in novel
compounds can be predicted on the basis of the families of both left and
right constituents, and that the effect of the left family is much
stronger. We have seen that the difference in strength mirrors a
distributional asymmetry in the lexicon, that is, left constituents tend
to have a stronger bias for or against a linking morpheme than right
constituents. Experiment 3 showed that suffixes attached to pseudo-words
to form left constituents also affect the choice of linking morphemes.
The experimental results are in line with the descriptions in the
literature in so far as the properties of the left constituent are
traditionally described as the main factors influencing the choice of
linking morphemes. The presence of a weaker but significant effect of
the right constituent is in line with the observation that right
constituents may be important because they codetermine the semantic
relation between the constituents in a compound. We have also shown that
the final suffix in derived left pseudo-words plays a role, which is in
line with the observations reported by Van den Toom (1981a, 1981b) for
real words. Most importantly, the results of our experiments have
revealed unambiguous evidence for a strong analogical effect of the
constituent family, a novel factor that is not discussed in the
linguistic literature.
In the next section, we proceed to test whether it is possible to
simulate the effect of the constituent families with the help of an
explicit computational algorithm for analogy. The aim of this section is
to ascertain whether analogy based on constituent families is
computationally tractable. In the general discussion, we will outline
how the computational technique that we have opted for can be mapped
onto a psycholinguistically plausible architecture of the mental
lexicon.
Analogical modeling
Several techniques are available for the modeling of data that
display statistical tendencies rather than discrete regularities.
Connectionist models are widely used to obtain predictions for graded
data where standard rule-based methods fail. Although connectionist
networks are powerful nonlinear classifiers, they have the disadvantage
that additional follow-up analyses of the network are required in order
to understand how the network arrives at its classifications. A second
disadvantage of connectionist models is that it is at present unclear
whether they can accommodate the family-size effect reported in
Schreuder and Baayen (1997) and De Jong et al. (2000). The family-size
effect concerns the finding that type counts of morphologically related
words for target words correlate with lexical decision times and
subjective frequency ratings to these target words, while the
corresponding token counts have emerged as irrelevant. Given the
sensitivity of connectionist networks to frequencies of occurrence, that
is, token frequencies, it is as yet unclear how this type-frequency
effect might emerge in combination with the absence of the
token-frequency effect. As the role of the constituent family that has
emerged from our experiments appears to be a similar type-count effect,
but now in production rather than in comprehension, we have opted for an
exemplar-based approach in which type-count effects are more easily
accommodated.
Exemplar-based approaches have been developed by, for example,
Skousen (1989) and Daelemans et al. (1999). Skousen proposes an
analogical model specifically for the domain of language. In his model,
stored exemplars are compared with a given target word using a
similarity metric defined over a series of user-specified features.
Exemplars that are most similar to the target are most likely to serve
as the analogical basis for its classification.
Various machine-learning techniques proceed along similar lines. We
have opted for a program implementing a series of machine-learning
techniques, TiMBL, developed by Daelemans et al. (1999).(5) This
implementation offers powerful heuristics for finding directly the
features with strong analogical weight. In what follows, we first
describe this machine-earning technique, which we have found very useful
from a computational linguistics point of view. We then discuss the
results that we obtained with this technique. In the general discussion,
we outline the way in which the technical computational model can be
mapped onto a psycholinguistically more plausible model of analogical
processing in the mental lexicon.
Exemplar-based learning
Exemplar-based learning techniques implement the idea that the
performance of cognitive processes is based on explicit storage of
representations of earlier experiences. Reasoning is conducted by
comparing a new instance with stored instances. Crucially, the
information carried by earlier experiences is not extracted from these
experiences and stored in the form of abstract rules. Instead, a general
strategy for similarity-based reasoning is combined with the extensive
storage of exemplars in an instance database. For example, the problem
of assigning the position of the main stress to a novel Dutch word is
solved by storing large numbers of multisyllabic words in the instance
database, and by using a distance measure defined over the phonological
makeup of the final two syllables of these words. A search in the
instance base leads to the exemplar that is most similar to that of the
target noun. The stress position stored with this exemplar is suggested
to be that of the target noun (see Daelemans et al. 1994 for a detailed
study). The main advantage of exemplar-based learning is that no
abstract rules need to be formulated. The price to be paid is that
computational load increases substantially with the size of the
database, because the distance between any new instance and all
exemplars in the instance database must be computed. We will return to
the issue of this computational load below.
In our experience, the k-NN algorithm with the Hamming distance measure known as IB1 in machine-learning literature (Aha et al. 1991)
yields the best results for the modeling of Dutch linking morphemes. Its
similarity metric is very simple. Given two patterns X and K each
represented by n features, the distance between X and Y is the number of
shared features. TiMBL makes three additions to the original k-NN
algorithm. First, the value of k refers to the k-nearest distances and
not the k-nearest cases. In our simulation studies we have set k to
unity, which means that all instances at Hamming distance I are included
in the set of nearest neighbors. Second, if the nearest-neighbor set
contains more than one instance, the linking morpheme is selected that
is most often instantiated in this nearest-neighbor set. Third, in case
of a tie, the linking morpheme is selected that has the highest
frequency in the instance base.
TiMBL has the useful possibility of adding to the Hamming-distance
measure a relevance weight for every feature (the IBI-IG algorithm).
TiMBL accomplishes this by means of the information gain (IG), which
looks at a feature and measures how much information it contributes to
our knowledge of the correct linking morpheme. The information gain of a
feature i is obtained by calculating the difference in uncertainty or
entropy between the situations without and with knowledge of the value
of that feature:
(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
In (1), C denotes the set of linking possibilities (-en-, -s-, ??),
and [V.sub.i] the set of values for feature i (e.g. "stressed"
and "unstressed" for the feature stress). The entropy of the
linking possibilities is
(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
with c ranging over {-en-, -s-, ??}. Using information gain
weights, we get the following distance metric:
(3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
(Daelemans et al. 1999: 9). By computing the information gain for
the many features that one might potentially use in a particular
simulation study, it becomes possible to make an informed preselection
of features.
In what follows, we will apply this methodology to the materials of
the first two experiments in order to ascertain to what extent
machine-learning techniques are able to predict the choice of linking
morphemes.
Predicting linking morphemes
In order to gauge the predictive power of exemplar-based learning
of Dutch linking morphemes, we first studied the preferred choices for
existing compounds using 10-fold cross-validation. In a 10-fold
cross-validation the dataset is divided into 10 "held-out"
subsets. For each held-out subset, linking morphemes are predicted on
the basis of the remaining 90% of the data, which serve as the training
set. The overall performance of the model is evaluated in terms of the
average percentage of correctly predicted linking morphemes calculated
over the ten cross-validation runs.
A crucial determinant of the model's performance is the set of
features defining its input space. In our simulation studies, we made
use of nine features. The first and second features code the left and
right immediate constituents, which represent the left and right
constituent families. The third feature represents the plural suffix
selected by the left constituent. This feature can be used to extract
the knowledge that the linking morpheme -en- is found only after left
constituents that select -en as their plural suffix. Features 4-7 code
the abstractness and animacy of the first and the second constituent.
They allow us to trace whether the semantics of the constituents
codetermine the choice of linking morphemes (Van den Toorn 1982b).
Feature 8 marks the presence of stress on the final syllable of the
first constituent, as it might be possible that the linking morpheme
-eh- is inserted to avoid a stress clash between the two constituents.
Finally, feature 9 codes the morphological complexity of the first
constituent in terms of its number of morphemes, as a greater complexity
of the left constituent has been argued to give rise to a preference for
-s-(see Mattens 1984). In various simulation runs not reported here, we
used the three final phonemes of the first constituent and the three
initial phonemes of the second constituent, as well as the last morpheme
of the first constituent instead of features 1 and 2. As the results
obtained with this alternative feature set invariably turned out to
yield inferior results, we do not discuss these alternative features.
We used the 22,994 Dutch nominal compounds in the CELEX lexical
database that occur with a frequency of at least 2 per 42 million word
forms as our instance base. Each of these compounds was assigned a
vector of values for our nine features. The second column of Table 4
lists the information gain for each individual feature on the basis of
the training sets in the cross-validation runs. When we use all
features, we predict the correct linking morpheme, for 93.2% of the
compounds in the held-out data-sets. When we use only the first feature,
the first constituent, which has the highest information gain, we obtain
an accuracy that is only slightly less, 92.5%. The linguistic literature
describes the choice of linking morpheme as governed by a conspiracy of
tendencies. Our cross-validation results suggest that, indeed, these
tendencies allow the linking morpheme to be predicted with a high degree
of accuracy. Surprisingly, most of the predictive power resides in a
single feature only: the first constituent, that is, the key for the
morphological family of the first constituent.
How well does the model predict the choice of the linking morpheme
for the neologisms used in experiments 1 and 2? First consider
experiment 1, summarized in columns 4-6. The column labeled Fam1 lists
information gain and accuracy when the model is trained on the pooled
constituent families of all experimental words. We trained the model on
this subset of the compounds listed in CELEX for the following reason.
The semantic specification for a constituent of a given compound, as we
have used it for the first study, is not restricted to the meaning of
the constituent in this particular compound but provides the full range
of possible meanings the constituent can have when used in isolation.
For a specific compound, this range of possible feature values is too
broad. For the subset of constituent families it was feasible to
manually narrow down the semantics to the correct meaning for each
specific compound separately. Consequently, there are two differences
between this analysis and the previous analysis based on the CELEX data.
First, the semantic features are more precise; second, the number of
types on which TiMBL is trained is much smaller (CELEX 22,994 vs. Fam1
1864).
When we train on the pooled families using all features, we obtain
an accuracy of 83.6%. As we are dealing with neologisms, accuracy is
evaluated in terms of the percentage of experimental words for which
TiMBL predicts a linking morpheme that is identical to the majority
choice of our participants. Again, we observe that the first constituent
has the highest information gain, and that using this feature
exclusively already leads to an accuracy of 78.8%. By adding features 5
and 6, we can increase the accuracy to 85.2%. Feature 5 concerns the
animacy of the left constituent: animate left constituents elicit higher
numbers of en responses. Feature 6 represents the abstractness of the
right constituent: abstract right nouns lead to fewer en responses. The
selection of these features is based on forward step-wise selection. At
the first step, the feature with the highest information gain is
selected. For each successive step, the feature with the next highest
information gain is considered. If addition of this feature improves
accuracy, it is added to the list of features. Otherwise, the feature
with the next highest information gain is tested. The information gains
of the features selected by this algorithm are marked with an asterisk in Table 4.
Table 4. Features used in the simulation studies, their information
gain (upper part of the table), and the corresponding prediction
accuracy (lower part of the table)
No. Feature CELEX EN:(b) S:(b)
(a) Faml(c) CELEX(c) Fam2(c) Faml(c)
1 1st C 1.11(*) 1.29(*) 1.11(*) (*) 1.14(*)
2 2nd C 0.41 0.96 0.41 0.70
3 1st C: plur 0.10 0.12 0.10 0.07
4 1st C: abst 0.07 0.13 0.07 0.13
5 1st C: anim 0.04 0.13(*) 0.04 (*) 0.07
6 2nd C: abst 0.02 0.06(*) 0.02 (*) 0.06(*)
7 2nd C: 0.00 0.01 0.00(*) 0.01
anim
8 1st C: stress 0.07 0.13 0.07 0.07
9 1st C: 0.11 0.05 0.11 0.08
compl
Accuracy 1-9 (%) 93.2 83.6 78.3 84.7 91.5
Accuracy 1 (%) 92.5 78.8 75.1 79.9 87.8
Accuracy(*)(%) 92.5 85.2 82.0 86.8 91.5
No. Feature
CELEX(c) Fam2(c)
1 1st C 1.11(*) (*)
2 2nd C 0.41
3 1st C: plur 0.10
4 1st C: abst 0.07
5 1st C: anim 0.04
6 2nd C: abst 0.02(*) (*)
7 2nd C: 0.00
anim
8 1st C: stress 0.07
9 1st C: 0.11
compl
Accuracy 1-9 (%) 82.5 82.5
Accuracy 1 (%) 82.5 87.3
Accuracy(*)(%) 83.1 88.4
(a.) CELEX: results using 10-fold cross-validation.
(b.) EN, S: results for experiments 1 and 2, with accuracy being
evaluated against the majority choice of the participants.
(c.) Predictions are made on the basis of various training sets: Fam1:
pooled family members of all experimental items; CELEX: all compounds
in CELEX; Fam2: predictions based on left and right constituent
families of each individual item.
(*) Features determined as relevant by forward step-wise selection.
When we compare these results with those obtained with
cross-validation for all compounds in CELEX (column 3), we observe a
decrease in accuracy of roughly 10%. This loss of accuracy has three
possible sources. First, the experiment made use of neologisms,
nonexisting compounds presented without a natural context, that may have
been somewhat more artificial than existing compounds. However, whatever
the nature of our materials may be, the performance of the model is
similar to that of human subjects. When we calculate the average
accuracy of the subjects in the same way as we evaluate the accuracy of
the model, that is, by treating the majority choice as the norm, we
obtain an average accuracy of 85.1%, which comes close to the maximum of
the range of model accuracies (78.8-85.2). Apparently, participants and
the model find the task equally difficult.
Second, the set of words with a neutral bias in the experiment is
atypical for the population as a whole. As we have already seen in
Figure 1, most of the left constituents in CELEX reveal a strong bias
for or against -en- (98% of all left constituents appear with the
linking morpheme -en- either in less than 35% or in more than 65% of all
members of the constituent family). The overrepresentation of left
constituents without a strong bias in the experiment (30% versus 2% of
all CELEX compounds) renders the experiment more difficult to model than
the CELEX population of compounds using cross-validation. In fact, the
accuracy scores for the subsets of words with a strong bias for or
against -en- are substantially higher than those for the words with a
neutral bias (left positive bias: 92.1%; left neutral bias: 71.4%; left
negative bias: 90.5%). Clearly, the atypical neutral set renders the
experiment more difficult.
Third, the reduced size of the training set may have led to reduced
accuracy. To investigate this possibility we ran additional simulation
experiments. When we train the model on all compounds in CELEX rather
than on the subsets of words for which we checked the coding of
concreteness and animacy of the constituents by hand, we observe a
slight reduction in accuracy of roughly 3%. Possibly, this reduction
arises because the semantic coding is less precise for the database as a
whole. Interestingly, we obtain slightly improved accuracies when we
train the model not on a larger but on an even smaller training set. By
training on the unique family members of each experimental compound
separately, we improve the average accuracy to 86.8% (column 6, Fam2),
using the same features that led to the highest accuracy when training
on the pooled family members.6 It is remarkable that training on the
basis of small by-item families (with a range of 8-84 family members)
results in slightly, although not significantly (p [is greater than]
0.2, proportions test), improved performance compared to training on the
1864 pooled family members or the 22,994 compounds in CELEX. This
suggests that the constituent families provide the analogical basis for
selecting the linking morphemes in novel compounds. From a
psycholinguistic perspective, this is an important result as it obviates
the need to scan the complete lexicon for analogical exemplars. In the
general discussion, we shall use this result to formulate a
psycholinguistic spreading activation model for the analogical selection
of linking morphemes.
The last three columns of Table 4 summarize the results obtained
using the same procedures for the data of experiment 2. The best results
are obtained when we train TiMBL on the pooled constituent family
members of all experimental compounds. On the basis of the first
constituent and the abstractness of the second constituent (abstract
right constituents lead to more s responses), TiMBL achieves an accuracy
of 91.5%. When we train the model on all compounds in CELEX, accuracy
decreases significantly to 83.1% (p = 0.02, proportions test). Training
on the individual families of the experimental compounds leads to a
slight reduction in accuracy that, however, does not differ
significantly from the accuracy when trained on the pooled constituent
family members. Compared to the participants in experiment 2, who on
average opt for the majority choice for 83.5% of the experimental
compounds, TiMBL performs surprisingly well.
The results summarized in Table 4 are the best results that we have
been able to obtain. Replacing the features for the first and second
constituents by features for the last three segments of the first
constituent and the first three segments of the second constituent
invariably leads to decreasing performance. The same holds for training
on the last morpheme of the first constituent.
Table 5 compares the success rate that can be achieved on the basis
of the phonological and morphological rules that have been formulated
for Dutch with the corresponding success rate as achieved by TiMBL
(trained on the constituent families of the individual items), for
experiments 1 and 2. Note that the rules are applicable only to small
subsets of the materials. The phonological rules state that no linking
morpheme is allowed following a rime ending with a vowel, with a liquid
preceding /k/ or /m/, or with a schwa followed by a sonorant. For words
with other rime characteristics, the rules provide no predictions at
all. Not surprisingly, the morphological rules apply only to the
compounds in our materials that have a derived left constituent.
Similarly, the semantic rules apply only to words with a mass noun and
human agents ending in -er as left constituent, as well as to synthetic
compounds in which the left constituent is the nonsubject argument of
the embedded verb to its right. From Table 5, it is clear that TiMBL
outperforms the rules for all applicable words. In addition, TiMBL
provides good predictions where the rules provide none. Interestingly,
TiMBL reveals the animacy and abstractness of the left and right
constituents to be relevant factors codetermining to some extent the
choice of the linking morpheme. Further rigorous quantitative research will have to clarify which semantic factors contribute to the choice of
the linking morpheme over and above the constituent families themselves.
Table 5. Comparison of rule-based and analogy-based predictions for
experbnents 1 and 2
Applicable Not applicable
rules TiMBL rules TiMBL
EN (experiment 1)
phonology(b) 9/15(a) 13/15 -/174 142/174
morphology(c) 15/36 36/36 -/153 119/153
semantics(a) 8/14 10/14 -/175 145/175
S (experiment 2)
phonology(b) 12/24 24/24 -/165 133/165
morphology(c) 27/51 41/51 -/138 116/138
semantics(d) 11/34 28/34 -/155 129/155
(a.) x/y: number of successful prediction/number of applicable cases.
(b.) Phonology: predictions based on the final rime.
(c.) Morphology: predictions based on lhe final sutfix.
(d.) Semantics: predictions based on semantic rules for mass nouns,
human agents ending in -er, and synthetic compounds in which the left
constituent is the nonsubject argument of the embedded verb to its
right.
Finally, Table 6 presents a comparison of the performance of the
participants with the performance of TiMBL when trained on the
constituent families of the individual items. The first two columns
specify the bias (positive, neutral, or negative) for the left and right
constituents. The third and fifth columns list the number of
participants (averaged over items) that selected -en- (column 3) and -s-
(column 5) in experiments 1 and 2 respectively. TiMBL provides for each
item the probabilities for the various linking options. Given that there
were 20 participants in each of the two experiments, the expected number
of participants selecting, for example, -eh- in experiment 1 for a given
item equals 20 times the probability of-eh- for that item. The average
number of participants selecting -en- for the nine experimental
conditions of experiments 1 and 2 are listed in columns 4 and 6
respectively. Note that the expected values as predicted by TiMBL are
similar to the experimental values, and this impression is confirmed by
goodness-of-fit tests.(7) Thus, the predictions of TiMBL as a
computational model of analogy remain accurate even when we consider the
individual conditions of our experimental design.
Table 6. Comparison of the participants and TiMBL across experimental
conditions
Left Right EN (experiment 1) S (experiment 2)
participantsa TiMBL participants(a) TiMBL
pos pos 19.0 17.8 19.3 19.7
pos neutr 19.3 18.3 19.5 19.7
pos neg 17.5 17.9 18.3 19.7
neutr pos 15.0 11.3 13.3 10.4
neutr neutr 16.4 12.4 12.7 10.4
neutr neg 11.7 11.8 10.5 10.4
neg pos 3.6 0.0 2.7 0.0
neg neutr 3.8 0.0 1.0 0.0
neg neg 1.2 0.0 0.4 0.0
(a.) Number of participants (averaged over items) selecting -en- in
experiment 1 and -s- in experiment 2 and the corresponding expectations
based on TiMBL (see text).
Note that this is not a trivial result. The model could have failed
in several ways. First, it could have predicted linking morphemes at
chance level. This would have indicated that constituent bias would not
be the true factor underlying the choice of linking morphemes. In that
case, our conclusion would have been that we had failed to include the
appropriate features in the input data. Second, the model could have
predicted the correct choice for the wrong reasons. Suppose that the
model had based its predictions not on the constituent family but on the
nature of the third phoneme of the right constituent. Suppose,
furthermore, that the left constituent family bias is uncorrelated with
the nature of this third phoneme. In these circumstances, the model
would be interesting from a technical point of view but seriously flawed
from a cognitive point of view, as our experiments show that constituent
bias is an important factor if not the most important factor. Third, we
ran our simulation studies not only on the bases of the constituent
families but on a great many other features as well. The simple fact
that the model assigns the greatest information gain to the constituent
families is not an artifact of the selection of our experimental
materials, as can be seen from the cross-validation data obtained for
all noun-noun compounds in the CELEX lexical database.
Summing up, the present simulation studies show that predictions
mirroring the actual choices of human participants can be made on the
basis of the families of the left constituent in combination with the
semantics of both constituents. These results suggest that analogy may
well underlie the strong intuitions that language users have concerning
the choice of the appropriate linking morpheme.
General discussion
This study has addressed the question of how analogy influences the
choice of linking morphemes in Dutch noun-noun compounds. Even though
the usage of linking morphemes in noun-noun compounds is not well
predictable by rule, it can be quite well predicted analogically on the
basis of the constituent families of both the left and the right
constituents. It is the family of the left constituent that constitutes
the primary domain of analogical prediction for existing words
(experiments 1 and 2). In the case of suffixed pseudo-words as left
constituents, the suffix provides the analogical domain for the choice
of the linking morpheme (experiment 3). A series of computational
simulation studies using an exemplar-based machine-learning algorithm
for the modeling of analogy, TiMBL, revealed that the actual linking
morphemes selected by the participants in our experiments can be
predicted with a high degree of accuracy on the basis of the
morphological family of the first constituent, with some additional
influence of the semantics of the second constituent. These results lead
us to conclude that the left constituent families provide the crucial
analogical basis for selecting the most appropriate linking morpheme in
Dutch. When comparing the choices made by the participants in our
experiments with those made by the machine-learning algorithm, we found
that the selection is equally difficult for human subjects and TiMBL.
Our results show that the choice of the linking morpheme hinges on
existing exemplars with the same left constituent. At the same time, our
experimental evidence suggests that the right constituent has a minor
role to play. We know of three other studies that mention a possible
role for the left constituent. For compounds in Afrikaans, Botha (1968)
argued that nouns are lexically marked for linking morpheme when they
appear as left constituents in compounds;. This works fine for those
left constituents that consistently occur with only one linking
morpheme. However, for the many left constituents with variable
realizations, Botha is forced to assume lexical listing of the full
compounds. Unfortunately, Botha's theory has no predictive power
with respect to neologisms that have a left constituent with variable
realizations.
The idea that analogy might be involved has been suggested for
German linking morphemes by Becker (1992), who, however, makes use of
such a general notion of analogy that it is difficult to see how any
falsifiable predictions might be obtained. Dressler et al. (n.d.)
present experimental data that hint at a role for left constituent bias
in German, but these authors mention this possibility only in passing
for a small subset of their data. Since our present results show that it
is possible to explicitly model analogy quantitatively and to predict
its influence experimentally, we believe that we now have a realistic
methodology for studying the influence of analogy on the realization of
linking morphemes across a wider range of languages.
Recall that there is considerable variation in the realization of
the linking morphemes. We have seen this variation in the responses of
the participants in our experiments, and it is also visible in
comprehensive dictionaries, which list variants such as spelling +
wijziging and spelling + s + wijziging `spelling change' side by
side. This variation is captured by our analogical model, which allows
for some uncertainty with respect to the appropriate linking morpheme
exactly as observed for the responses of our participants. This kind of
variation is not restricted to linking morphemes, it is also found in
the domain of derivational morphology. For instance, Malicka-Kleparska
(1985) discusses the formation of diminutives in Polish and calls
attention to the free variation between the rival forms -ik and -ek that
occurs for words with a particular phonological form. Such free
variation is at odds with strict rule-based systems, while it may arise
in systems based on analogy in the absence of a clear bias for a
particular form. We believe that such variational data provide evidence
in favor of the view that morphological rules are grounded in analogy.
Thus far, we have used the machine-learning algorithm implemented
in TiMBL to model the analogical selection of linking morphemes in novel
compounds. From a computational linguistics point of view, TiMBL
captures the analogy underlying the linking morphemes quite
satisfactorily. From a psycholinguistics point of view, the question
arises whether it is realistic to assume that in general analogy is
really based on an exhaustive calculation of a distance metric for all
forms in the lexicon. In fact, TiMBL itself does not carry out such an
exhaustive calculation for a novel form. While this might be feasible on
a massively parallel machine, present-day sequential machines require
alternative algorithms. TiMBL solves this algorithmic problem by
constructing a decision tree during training (Daelemans et al. 1997). By
dropping a novel form through the decision tree, the appropriate linking
morpheme is identified.
Such a decision tree can in fact be understood as a set of rules.
Given that the analogy underlying the choice of linking morphemes is
based on constituent families, a separate rule for each constituent is
embodied in the decision tree. Those researchers who view morphological
processing as fundamentally rule-based therefore have the option of
reformulating the decision tree of TiMBL as a set of morphological
rules. The cost of this option is a proliferation of rules, one for each
possible left constituent. As we find this cost too high, we have
explored an alternative approach based on the idea of parallel
coactivation of constituents in a spreading activation framework along
the lines of Schreuder and Baayen (1995). Parallel coactivation is a
realistic option precisely because our experimental results have
revealed that it is only the constituent families that have to be
inspected, and not each and every compound in the mental lexicon (or in
TiMBL's instance base). Consider Figure 4.
[Figure 4 ILLUSTRATION OMITTED]
The units in the bottom layer in Figure 4 represent sets of
semantic and syntactic features. For instance, the unit labelled PROBLEM
is a shorthand representation for a series of syntactic and semantic
representations such as NOUN, ABSTRACT, INANIMATE, etc. Even though not
represented graphically in Figure 4, representations such as those for
NOUN and ABSTRACT are shared by the units LIFE and FORM. The central
layer contains lemma nodes, nodes that link sets of semantic and
syntactic representations to form representations. For instance, the
left-hand lemma, which represents leven + s + probleem `life
problem', is activated during production by the semantic and
syntactic representations of PROBLEM and LIFE and in turn activates the
form representations <leven>, <probleem>, and <s>. The
numbers accompanying the outgoing arrows specify the order in which the
form representations have to be linearized for articulation.
In this architecture, the choice of the linking morpheme -s- for
the novel compound leven + ? + therapie made by 19 out of 20
participants in experiment 2 might proceed as follows. Once the
syntactic and semantic representations of LIFE and THERAPY have been
activated, activation spreads to their lemma nodes. In turn, activation
spreads from the lemma nodes to their form representations, activating
<leven> and <therapie>. Because leven + ? + therapie does
not have its own lemma representation, and because the linking morphemes
are not themselves addressed, the form representations of linking
morphemes have not yet been activated.
It has recently been shown that in subjective frequency ratings and
in visual lexical decision, morphological families of target words are
coactivated (Schreuder and Baayen 1997; De Jong et al. 2000). Our
hypothesis is that in production an analogous coactivation of the
constituent families takes place. Thus, we assume that the semantic and
syntactic representations for the left constituent LIFE in Figure 4
coactivate the lemmas of leven + s + vorm `life form', leven + s +
probleem `life problem', and other such compounds when the target
word is leven + ? + therapie.(8,9) The lemmas of these constituent
family members in turn coactivate their form representations, including
their linking morphemes.(10)
In addition to the strong influence of the first constituent, we
have also seen a somewhat weaker effect of the right constituent in our
experiments, both factorially and in the correlation analyses of bias
and response. We can model the prominence of the left constituent
families by having the semantic and syntactic representations of the
left constituent, LIFE in our example, send extra activation to the
lemma nodes with which it is connected. Possibly, the special burst of
activation flowing from the first constituent to the lemma layer is a
consequence of it being the first constituent that has to be articulated
(Roelofs 1996).(11) Recall that the TiMBL results revealed an effect of
the semantics of the right constituent. For instance, right abstract
constituents show a slight preference for the linking morpheme -s-. We
assume that right abstract constituents coactivate lemma nodes for
abstract nouns and, therefore, also abstract noun compounds in the
constituent families. The activation of these compound lemma nodes leads
to some extra support for the linking morpheme -s-.
Finally, the results of experiment 3, in which the left
constituents were suffixed pseudo-words, can be understood along similar
lines. Under the assumption that the suffix in the pseudo-word activates
its semantics, and that these semantics in turn coactivate the lemmas of
the compounds with this suffix, the bias in the suffix family will lead
to a preference for a given linking morpheme.
The present results challenge the idea that in order to model
nondeterministic linguistic phenomena symbolic representations have to
be given up and replaced by subsymbolic representations, as argued by,
for instance, Rumelhart and McClelland (1986) and Seidenberg (1987); see
also Zhou and Marslen-Wilson (n.d.). We have shown that it is possible
to model analogy without giving up symbolic representations such as
lemmas for complex words. At the same time, we do not think it is
necessary to be committed to the view that morphological rules are in
essence symbolic rewrite rules. This formal view of word-formation rules
is challenged by the experimental and simulation results for the
compounds with neutral bias that we have studied. Here, both our
participants and our model showed great uncertainty with respect to what
might be the most appropriate linking morpheme. This uncertainty is
difficult to reconcile with formal deterministic rules. For strongly
converging, consistent domains, formal analogical models will show
behavior similar to that of deterministic rules. For diverging,
inconsistent domains, deterministic rules impose regularity that is not
present in the data nor, if we may trust our experimental results, in
the minds of speakers of Dutch. Formal models of analogy, on the other
hand, reflect the inconsistency present in their input domains both in
the variation in their output and in the confidence they assign to their
output This shows that formal models of analogy are not unconstrained
all-powerful theories that can always predict any outcome and hence have
no explanatory value. Instead, the behavior of formal models of analogy
is tightly constrained by its input domain. For Dutch compounds, local
family-based analogical generalization instead of global lexicon-based
rule generalization has allowed us to approximate human behavior with
greater precision and insight.
Notes
(*) We would like to thank Wolfgang Dressier and Nivja de Jong for
their helpful comments on an earlier version of this paper. This study
was financially supported by the Dutch National Research Council NWO (PIONIER grant to the second author), the University of Nijmegen, The
Netherlands, and the Max Planck Institute for Psycholinguistics,
Nijmegen, The Netherlands. Address for correspondence and reprint requests: Andrea Krott, Interfaculty Research Unit for Language and
Speech and Max Planck Institute for Psycholinguistics, P.O. Box 310,
6500 AH Nijmegen, The Netherlands. E-mail: akrott@mpi.nl.
(1.) For instance, the 800-page handbook of morphology edited by
Spencer and Zwicky (1998) devotes five lines of text to the problem of
linking elements (1998: 81).
(2.) Marcus et al. (1995) and Clahsen et al. (1997) have argued
that it is impossible for a language to have more than one productive
rule for a particular inflectional function. This claim is based on the
distribution of noun plurals in German. The Dutch plural system provides
a counterexample to this claim, as shown by Baayen et al. (i.p.), a
study that presents detailed linguistic and psycholinguistic evidence
for the regularity and productivity of both Dutch plural suffixes.
(3.) For an analysis of German noun pluralization in such a
framework, see Marcus et al. (1995).
(4.) We used a nonparametric regression smoother (see Cleveland
1979), as parametric techniques based on linear models are clearly
inappropriate for our data.
(5.) For a detailed comparison between TiMBL and Skousen's AML model, see Krott et al. (n.d.).
(6.) One might expect to achieve the same accuracy for Fam1, Fam2,
and CELEX when training only on the first constituent (accuracy 1).
However, the different numbers of training items and the resulting
different structures of the three TiMBL-internal decision trees, as well
as the random choice of linking morphemes in the case of ties, lead to
somewhat different results.
(7.) For experiment 1, [[Chi square].sub.(8)] = 6.44, p = 0.60, and
for experiment 2, [[Chi square].sub.(8)] = 9.05, p = 0.34. In order to
avoid technical problems with zero counts for the negative left bias
conditions, the chi-squared tests were actually run on the complement
counts for all conditions, i.e., the number of participants not
selecting -en- (experiment 1) or -s- (experiment 2).
(8.) For evidence of storage of regular complex words in Dutch see
Baayen et al. (1997), Bertram et al. (2000); for compounds see Van
Jaarsveld and Rattink (1988).
(9.) It is in principle possible that compounds are activated that
contain leven as a right constituent, as in student + en + leven
`student life'. However, a post-hoc analysis showed that the family
homogeneity of these compounds in experiment 2 is not correlated with
the response homogeneity. This is true for compounds containing left
constituents at the right position ([r.sub.s] = 0.18; z = 1.44; p =
0.15) as well as for compounds containing right constituents at the left
position ([r.sub.s] = 0.01; z = 0.04; p = 0.97). These results suggest
that only those family members of the left constituent are activated
that share the left constituent with the novel compound, and only those
family members of the right constituent that share the right constituent
with the novel compound.
(10.) Figure 4 illustrates the composition route of our parallel
dual route model. We assume that there is also a full-form
representation <levens>, the plural of <leven>, for which
support can accumulate in the same way as for <s>.
(11.) The prominence of the first constituent is in line with the
observed greater priming effects of first constituents reported by
Kehayia et al. (1999). In addition, Stark and Stark (1991) report
impaired production of second constituents of compounds by a
Wernicke's aphasic.
References
Aha, D. W.; Kibler, D.; and Alber, M. (1991). Instance-based
learning algorithms. Machine Learning 6, 37-66.
Anshen, F.; and Aronoff, M. (1988). Producing morphologically
complex words. Linguistics 26, 641-655.
Baayen, R. H.; Dijkstra, T.; and Schreuder, R. (1997). Singulars
and plurals in Dutch: evidence for a parallel dual route model. Journal
of Memory and Language 36, 94-117.
--; Piepenbrock, R.; and Gulikers, L. (1995). The CELEX Lexical
Database (CD-ROM). Philadelphia: University of Pennsylvania Linguistic
Data Consortium.
--; Schreuder, R.; De Jong, N. H.; and Krott, A. (i.p.). Dutch
inflection: the rules that prove the exception. In Storage and
Computation in the Language Faculty, S. Nooteboom, F. Weerman, and F.
Wijnen (eds.). Dordrecht: Kluwer Academic.
Becker, T. (1992). Compounding in German. Rivista di Linguistica
4(1), 5-36.
Bertram, R.; Laine, M.; Baayen, R. H.; Schreuder, R.; and Hy6nfi,
J. (1999). Affixal homonymy triggers full-form storage even with
inflected words, even in a morphologically rich language. Cognition 74,
B13-B25.
--; Schreuder, R.; and Baayen, R. H. (2000). The balance of storage
and computation in morphological processing: the role of word formation
type, affixal homonymy, and productivity. Journal of Experimental
Psychology: Memory, Learning, and Cognition 26, 419-511.
Botha, R. P. (1968). The Function of the Lexicon in
Transformational Grammar. The Hague: Mouton.
Clahsen, H. (1999). Lexical entries and rules of language: a
multi-disciplinary study of German inflection. Behavioral and Brain
Sciences 22, 991-1060.
--; Eisenbeiss, S.; and Sonnenstuhl-Henning, I. (1997).
Morphological structure and the processing of inflected words.
Theoretical Linguistics 23, 201-249.
Cleveland, W. S. (1979). Robust locally weighted regression and
smoothing scatterplots. Journal of the American Statistical Association 74, 829-836.
Daelemans, W.; Van den Bosch, A.; and Weitjers, A. (1997). IGTree:
using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review 11, 407-423.
--; Gillis, S.; and Durieux, G. (1994). The acquisition of stress,
a data-oriented approach. Computational Linguistics 20 (3), 421-451.
--; Zavrel, J.; Van der Sloot, K.; and Van den Bosch, A. (1999).
TiMBL: Tilburg Memory Based Learner Reference Guide 2.0. Report 99-01.
Tilbury: Computational Linguistics, Tilburg University.
De Haas, W.; and Trommelen, M. (1993). Morfologisch handboek van
het Nederlands. The Hague: SDU.
De Jong, N. H.; Schreuder, R.; and Baayen, R. H. (2000). The
morphological family size effect and morphology. Language and Cognitive
Processes 15, 329-365.
De Saussure, F. (1966). Course in General Linguistics. New York:
McGraw.
Dressier, W. U.; Libben, G.; Stark, J.; Pons, C.; and Jarema, G.
(n.d.). The processing of interfixed compounds. Unpublished manuscript.
Fienberg, S. (1980). The Analysis of Cross-Classified Categorical
Data. Cambridge: MA: MIT Press.
Haeseryn, W.; Romijn, K.; Geerts, G.; de Rooij, J.; and Van den
Toorn, M. (1997). Algemene Nederlandse Spraakkunst. Groningen: Martinus
Nijhoff.
Halle, M.; and Marantz, A. (1993). Distributed morphology and the
pieces of inflection. In The View from Building 20: Essays in
Linguistics in Honor of Sylvain Bromberger, K. Hale and S. Keyser
(eds.), vol. 24, 111-176. Cambridge: MA: MIT Press.
Kehayia, E.; Jarema, G.; Tsapkini, K.; Perlak, D.; Ralli, A.; and
Kadzielawa, D. (1999). The role of morphological structure in the
processing of compounds: the interface between linguistics and
psycholinguistics. Brain and Language 68, 370-377.
Krott, A.; Schreuder, R.; and Baayen, R. H. (n.d.). Analogical
hierarchy: exemplar-based modeling of linkers in Dutch noun-noun
compounds. Unpublished manuscript.
Kuipers, A. H. (1960). Phoneme and Morpheme in Kabardian. The
Hague: Mouton.
Malicka-Kleparska, A. (1985). Parallel derivation and lexicalist
morphology: the case of Polish diminutivization. In Phono-morphology.
Studies in the Interaction of Phonology and Morphology, E. Gussmann
(ed.), 95-112. Lublin: Catholic University of Lublin.
Marcus, G.; Brinkman, U.; Clahsen, H.; Wiese, R.; and Pinker, S.
(1995). German inflection: the exception that proves the rule. Cognitive
Psychology 29, 189-256.
Mattens, W. H. M. (1984). De voorspelbaarheid van tussenklanken in
nominale samenstellingen. De nieuwe taalgids 7, 333-343.
Neijt, A.; Baayen, R. H.; and Schreuder, R. (n.d.). Reading relicts
of the past: the semantics of linking elements in present-day Dutch
orthography. Unpublished manuscript.
Pinker, S. (1991). Rules of language. Science 153, 530-535.
--(1997). Words and rules in the human brain. Nature 387, 547-548.
Plunkett, K.; and Juola, P. (2000). A connectionist model of
English past tense and plural morphology. Cognitive Science.
Rietveld, T.; and Van Hout, R. (1993). Statistical Techniques for
the Study of Language and Language Behavior. Berlin: Mouton de Gruyter.
Roelofs, A. (1996). Serial order in planning the production of
successive morphemes of a word. Journal of Memory and Language 35,
854-876.
Rueckl, J. G.; Mikolinski, M.; Raveh, M.; Miner, C. S.; and Mars,
F. (1997). Morphological priming, fragment completion, and connectionist
networks. Journal of Memory and Language 36(3), 382-405.
Rumelhart, D. E.; and McClelland, J. L. (eds.) (1986). Parallel
Distributed Processing. Explorations in the Microstructure of Cognition,
vol. 1: Foundations. Cambridge, MA: MIT Press.
Sandra, D.; Frisson, S.; and Daems, F. (1999). Why simple verb
forms can be so difficult to spell: the influence of homophone frequency
and distance in Dutch. Brain and Language 68(1/2), 277-283.
Schreuder, R.; and Baayen, R. H. (1995). Modeling morphological
processing. In Morphological Aspects of Language Processing, L. B.
Feldman (ed.), 131-154. Hillsdale, NJ: Erlbaum.
--; and Baayen, R. H. (1997). How complex simplex words can be.
Journal of Memory and Language 37, 118-139.
--; Neijt, A.; Van der Weide, F.; and Baayen, R. H. (1998). Regular
plurals in Dutch compounds: linking graphemes or morphemes? Language and
Cognitive Processes 13, 551-573.
Seidenberg, M. (1987). Sublexical structures in visual word
recognition: access units or orthographic redundancy. In Attention and
Performance XII, 245-263. Hove: Erlbaum.
--; and Hoeffner, J. (1998). Evaluating behavioral and neuroimaging
data on past tense processing. Language 74, 104-122.
Sereno, J.; and Jongman, A. (1997). Processing of English
inflectional morphology. Memory and Cognition 25, 425-437.
Skousen, R. (1989). Analogical Modeling of Language. Dordrecht:
Kluwer.
Spencer, A.; and Zwicky, A. (eds.) (1998). The Handbook of
Morphology. Oxford: Blackwell.
Stark, J.; and Stark, H.-K. (1991). On the processing of compound
nouns by a Wernicke's aphasic. In Neuro- und Patholinguistik, J.
Tesak (ed.), 95-112. Grazer Linguistische Studien 35. Graz: Universitat
Graz.
Taft, M. (1979). Recognition of affixed words and the word
frequency effect. Memory and Cognition 7, 263-272.
Van der Toorn, M. C. (1981a). De tussenklank in samenstellingen
waarvan het eerste lid een afleiding is. De nieuwe taalgids 74, 197-205.
--(1981b). De tussenklank in samenstellingen waarvan het eerste lid
systematisch uitheems is. De nieuwe taalgids 74, 547-552.
--(1982a). Tendenzen bij de beregeling van de verbindingsklank in
nominale samenstellingen I. De nieuwe taalgids 75(1), 24-33.
--(1982b). Tendenzen bij de beregeling van de verbindingsklank in
nominale samenstellingen II. De nieuwe taalgids 75(2), 153-160.
Van Jaarsveld, H.; and Rattink, G. (1988). Frequency effects in the
processing of lexicalized and novel nominal compounds. Journal of
Psychological Research 17, 447-473.
Van Marie, J. (1985). On the Paradigmatic Dimensions of
Morphological Creativity. Dordrecht: Foris.
Zhou, X.; and Marslen-Wilson, W. (n.d.). Lexical representation of
compound words: cross-linguistic evidence. Unpublished manuscript.
Appendix A. Materials for experiment 1: left constituent and right
constituent (number of en responses, number of other responses)
L1-R1: left position: positive -en- bias; right position: positive
-en- bias
student kolder (20, 0); pen prik (20, 0); advocaat geslacht (18,
2); soldaat deken (19, 1); vreemdeling buurt (20, 0); kleur
tegenstelling (10, 10); sigaret knipsel (18, 2); sigaar kiosk (17, 3);
pan rook (19, 1); toerist klooster (20, 0); roos gaas (20, 0); beer
lever (20, 0); noot laan (18, 2); aap klauw (20, 0); tomaat moes (20,
0); kat haat (19, 1); reus hol (20, 0); gans lijt (20, 0); stier beet (20, 0); vrucht massa (20, 0); wesp ras (20, 0).
L1-R2: left position: positive -en- bias; right position: neutral
-en- bias
noot dief (19, 1); sigaret bundel (20, 0); sigaar republiek (17,
3); stier kooi (20, 0); kat paar (20, 0); wesp jacht (20, 0); asp vel
(19, 1); vrucht rek (20, 0); tomaat stam (20, 0); roos zee (19, 1);
soldaat bond (20, 0); pen hout (20, 0); gans boter (20, 0); kleur rad
(19, 1); student kas (20, 0); reus rijk (20, 0); beer galerij (17, 3);
pan kaas (15, 5); vreemdeling steun (20, 0); toerist kuil (20, 0);
advocaat corps (20, 0).
L1-R3: left position: positive -en- bias; right constituent:
negative -en- bias
sigaar juffrouw (20, 0); sigaret tarief (20, 0); tomaat project
(18, 2); pan lengte (11, 9); toerist gedeelte (20, 0); soldaat
bevoegdheid (17, 3); beer maaltijd (19, 1); aap terrein (20, 0);
vreemdeling crisis (20, 0); student voorschrift (20, 0); gans schade
(18, 2); advocaat weg (17, 3); kleur techniek (13, 7); noot gewas (11,
9); pen patroon (12, 8); vrucht kanaal (18, 2); roes kunst (20, 0); kat
therapie (17, 3); wesp deskundige (19, 1), reus vrijheid (19, 1); stier
psycholoog (18, 2).
L2-R1: left position: neutral -en- bias; right position: positive
-en- bias
begrip tegenstelling (7, 13); bloem laan (20, 0); bom massa (14,
6); bron gaas (11, 9); buur geslacht (15, 5); god hol (13, 7); heer
buurt (20, 0); kaart kiosk (20, 0); koe ras (18, 2); klas kolder (19,
1); kool moes (8, 12); leerling klauw (13, 7); lid lijf (10, 10);
persoon beet (7, 13); pijp rook (11, 9); plaat knipsel (19, 1); pop
klooster (19, 1); prul deken (19, 1); wolf lever (12, 8); woord haat
(20, 0); ziel prik (20, 0).
L2-R2: left position: neutral -en- bias; right position: neutral
-en- bias
begrip stam (11, 9); bloem borer (14, 6); bom kuil (16, 4); bron
rijk (15, 5); buur steun (19, 1); god vel (11, 9); beer kaas (18, 2);
kaart bundel (20, 0); klas republiek (20, 0); koe kooi (20, 0); kool rek
(16, 4); leerling corps (19, 1); lid kas (15, 5); persoon bond (13, 7);
pijp galerij (18, 2); plaat hout (11, 9); pop rad (19, 1); prul zee (19,
1); wolf paar (14, 6); woord jacht (19, 1); ziel dief (17, 3).
L2-R3: left position: neutral -en- bias; right position: negative
-en- bias
begrip patroon (10, 10); bloem weg (20, 0); bom lengte (11, 8);
bron terrein (13, 7); buur project (12, 7); god maaltijd (16, 4); heer
tarief (18, 2); kaart juffrouw (18, 2); koe psycholoog (17, 3); kool
gewas (3, 17); leerling bevoegdheid (12, 8); lid voorschrift (11, 9);
persoon therapie (3, 17); pijp schade (4, 16); plaat techniek (11, 9);
pop kunst (14, 6); prul kannal (12, 8); ziel vrijheid (6, 14). woord
gedeelte (3, 17); klas crisis (19, 1); wolf deskundige (12, 8).
L3-R1: left position: negative -en- bias; right position: positive
-en- bias
stad haat (2, 18); gevangenis deken (0, 20); neus knipsel (6, 14);
angst prik (2, 18); industrie rook (4, 16); wijn kiosk (4, 16); kalf
beet (2, 18); bevolking ras (0, 20); bier lever (8, 12); overheid
geslacht (0, 20); christen klooster (6, 14); dokter klauw (0, 20);
fabriek buurt (4, 16); dak gaas (5, 15); aardappel moes (3, 17); rivier
massa (15, 5); citroen laan (10, 10); groep hol (0, 20); wetenschap
kolder (1, 19); kwaliteit tegenstelling (3, 17); koning lijf (1, 19).
L3-R2: left position: -en- bias; right position: neutral -en- bias
stad republiek (0, 20); industrie corps (7, 13); bevolking stam (0,
20); dokter bond (2, 18); rivier hout (8, 12); dak kuil (6, 14); groep
jacht (3, 17); kwaliteit kaas (0, 20); angst steun (6, 14); aardappel
bundel (5, 15); wijn dief (2, 18); kalf kooi (4, 16); koning vel (1,
19); bier zee (5, 15); neus paar (17, 3); wetenschap rijk (0, 20);
overheid kas (0, 20); gevangenis rek (3, 17); citroen boter (3, 17);
christen galerij (6, 14); fabriek rad (1, 19).
L3-R3: left position: negative -en- bias; right position: negative
-en- bias
aardappel juffrouw (2, 18); angst crisis (1, 19); bevolking
gedeelte (0, 20); bier deskundige (1, 19); christen vrijheid (2, 18);
citroen gewas (1, 19); dak lengte (4, 16); fabriek psycholoog (0, 20);
gevangenis terrein (0, 20); groep bevoegdheid (0, 20); industrie weg (1,
19); kalf maaltijd (4, 16); koning therapie (2, 18); kwaliteit kunst (0,
20); neus kanaal (2, 18); overheid project (0, 20); rivier techniek (5,
15); stad patroon (0, 20); wetenschap voorschrift (0, 20); wijn schade
(0, 20).
Appendix B. Materials for experiment 2: left constituent and right
constituent (number of s responses, number of other responses)
L1-R1: left position: positive -s- bias; right position: positive
-s- bias
arbeider standpunt (20, 0); bedrijf bevoegdheid (19, 1); beslissing
angst (19, 1); bestuur aangelegenheid (20, 0); fabriek norm (20, 0);
gezicht dimensie (16, 4); groep afstand (19, 1); handel fractie (20, 0);
investering orientatie (20, 0); leven tactiek (19, 1); macht woede (18,
2); onderzoek reden (20, 0); ontwikkeling duur (20, 0); persoonlijkheid
bevordering (20, 0); regering verhouding (20, 0); staat besluit (19, 1);
training toename (19, 1); veiligheid drang (20, 0); verkeer delegatie
(20, 0); verzorging bijdrage (20, 0); vrede uitoefening (18, 2).
L1-R2: left position: positive -s- bias; right position: neutral
-s- bias
arbeider functie (20, 0); bedrijf organisatie (20, 0); beslissing
conflict (18, 2); bestuur regel (20, 0); fabriek geschiedenis (19, 1);
gezicht verandering (19, 1); groep plicht (18, 2); handel project (20,
0); investering kunst (20, 0); leven therapie (19, 1); macht dienaar
(20, 0); onderzoek niveau (20, 0); ontwikkeling patroon (20, 0);
persoonlijkheid controle (20, 0); regering kwaliteit (20, 0); staat
conferentie (16, 4); training probleem (20, 0); veiligheid mechanisme
(20, 0); verkeer rust (20, 0); verzorging commissie (20, 0); vrede
karakter (20, 0).
L1-R3: left position: positive -s- bias; right position: negative
-s- bias
arbeider tent (20, 0); bedrijf bos (15, 5); beslissing schrift (13,
7); bestuur club (19, 1); fabriek kaas (20, 0); gezicht tekening (17,
3); groep kast (15, 5); handel voorraad (19, 1); investering meester
(20, 0); leven bel (20, 0); macht laag (19, 1); onderzoek schaal (19,
1); ontwikkeling sprang (19, 1); persoonlijkheid spiegel (19, 1);
regering les (20, 0); staat eiland (13, 7); training olie (20, 0);
veiligheid venster (20, 0); verkeer soort (19, 1); verzorging transport
(20, 0); vrede stok (19, 1).
L2-R1: left position: neutral -s- bias; right position: positive
-s- bias
begrip dimensie (14, 6); bisschop fractie (17, 3); directeur
besluit (19, 1); dood reden (18, 2); generaal delegatie (13, 7); geschut
afstand (19, 1); geweld bijdrage (20, 0); god woede (10, 10); heil
bevordering (15, 5); klimaat verhouding (11, 9); lucifer norm (9, 11);
minister bevoegdheid (12, 8); monnik aangelegenheid (0, 20); persoon
angst (14, 6); plicht uitoefening (17, 3); president standpunt (12, 8);
temperatuur toename (14, 6); tijd orientatie (16, 4); voordracht duur
(18, 2); voorkeur drang (20, 0); wolf tactiek (8, 12).
L2-R2: left position: neutral -s- bias; right position: neutral -s-
bias
begrip probleem (12, 8); bisschop karakter (18, 2); directeur
commissie (12, 8); dood rust (16, 4); generaal functie (19, 1); geschut
mechanisme (15, 5); geweld organisatie (14, 6); god dienaar (11, 9);
heil therapie (11, 9); klimaat geschiedenis (10, 10); lucifer kwaliteit
(6, 14); minister plicht (11, 9); monnik regel (6, 14); persoon kunst
(17, 3); plicht verandering (18, 2); president conferentie (12, 8);
temperatuur controle (15, 5); tijd conflict (19, 1); voordracht niveau
(17, 3); voorkeur patroon (17, 3); wolf project (8, 12).
L2-R3: left position: neutral -s- bias; right position: negative
-s- bias
begrip laag (10, 10); bisschop spiegel (18, 2); directeur stok (14,
6); dood eiland (9, 11); generaal kast (18, 2); geschut tent (12, 8);
geweld soort (10, 10); god bos (5, 15); heil olie (9, 11); klimaat
schaal (7, 13); lucifer voorraad (3, 17); minister club (14, 6); monnik
kaas (6, 14); persoon transport (6, 14); plicht schrift (4, 16);
president bel (9, 11); temperatuur venster (14, 6); tijd sprong (14, 6);
voordracht les (13, 7); voorkeur tekening (18, 2); wolf meester (12, 8).
L3-R1: left position: negative -s- bias; right position: positive
-s- bias
avond duur (5, 15); boek bijdrage (0, 20); christen aangelegenheid
(0, 20); dak afstand (5, 15); dwang reden (0, 20); kleur verhouding (5,
15); licht dimensie (5, 15); morgen delegatie (3, 17); nacht tactiek (2,
18); natuur bevordering (6, 14); nood besluit (3, 17); slag uitoefening
(1, 19); soldaat woede (3, 17); straat orientatie (1, 19); student
standpunt (0, 20); vuur angst (2, 18); wapen bevoegdheid (3, 17); wijn
norm (3, 17); woning fractie (4, 16); zand toename (2, 18); zang drang
(4, 16).
L3-R1: left position: negative -s- bias; right position: neutral-s-
bias
avond functie (1, 19); boek organisatie (0, 20); christen commissie
(0, 20); dak controle (1, 19); dwang regel (1, 19); kleur kwaliteit (0,
20); licht kunst (1, 19); morgen rust (1, 19); nacht project (0, 20);
natuur therapie (0, 20); nood mechanisme (1, 19); slag niveau (0, 20);
soldaat dienaar (3, 17); straat karakter (0, 20); student conflict (0,
20); vuur patroon (0, 20); wapen geschiedenis (3, 17); wijn conferentie
(0, 20); woning verandering (9, 11); zand probleem (1, 19); zang plicht
(0, 20).
L3-R1: left position: negative -s- bias; right positiou: negative
-s- bias
avond sprong (0, 20); bock transport (0, 20); christen schrift (1,
19); dak kast (0, 20); dwang soort (0, 20); kleur schaal (0, 20); licht
spiegel (0, 20); morgen bos (2, 18); nacht tent (0, 20); natuur eiland
(0, 20); nood olie (0, 20); slag les (0, 20); soldaat stok (2, 18);
straat bel (1, 19); student kaas (0, 20); vuur venster (0, 20); wapen
club (0, 20); wijn laag (0, 20); woning tekening (2, 18); zand voorraad
(0, 20); zang meester (0, 20).
Appendix C. Materials for experiment 3: left constituent and right
constituent (number of s responses, number of other responses)
L1-R1: left position: positive -s- bias; right position: positive
-s- bias
ontbolfing aangelegenheid (18, 2); verbrimming afstand (18, 2);
bebuiping angst (18, 2); wouking besluit (12, 8); hernabbeling
bevoegdheid (18, 2); struffing bevordering (18, 2); snoking bijdrage
(15, 5); bronkheid delegatie (20, 0); golheid dimensie (19, 1);
pritsheid drang (20, 0); dulligheid duur (20, 0); sloefheid fractie (19,
1); spreunheid norm (19, 1); vlitheid orientatie (18, 2); conviriteit
reden (15, 5); descaliteit standpunt (10, 10); dipromeniteit tactiek
(14, 6); illuniteit toename (15, 5); recarveniteit uitoefening (18, 2);
solutaniteit verhouding (18, 2); virubaniteit woede (11, 9).
L1-R2: left position: positive -s- bias; right position: neutral
-s- bias
ontbolfing commissie (18, 2); verbrimming conferentie (18, 2);
bebuiping conflict (18, 2); wouking controle (15, 5); hernabbeling
dienaar (18, 2); struffing functie (16, 4); snoking geschiedenis (12,
8); bronkheid karakter (19, 1); golheid kunst (18, 2); pritsheid
kwaliteit (16, 4); dulligheid mechanisme (19, 1); sloefheid niveau (20,
0); spreunheid organisatie (19, 1); vlitheid patroon (18, 2);
conviriteit plicht (16, 4); descaliteit probleem (16, 4); dipromeniteit
project (19, 1); illuniteit regel (17, 3); recarveniteit rust (17, 3);
solutaniteit therapie (18, 2); virubaniteit verandering (16, 4).
L1-R3: left position: positive -s- bias; right position: negative
-s- bias
ontbolfing bel (20, 0); verbrimming bos (18, 2); bebuiping club
(13, 7); wouking eiland (10, 10); hernabbeling kaas (12, 8); struffing
kast (11, 9); dipromeniteit laag (17, 3); vlitheid les (19, 1); golheid
meester (19, 1); pritsheid olie (15, 5); dulligheid schaal (19, 1),
sloefheid schrift (19, 1); spreunheid soort (17, 3); bronkheid spiegel
(18, 2); conviriteit sprong (16, 4); descaliteit stok (14, 6); snoking
tekening (13, 7); illuniteit tent (15, 5); recarveniteit transport (14,
6); solutaniteit venster (17, 3); virubaniteit voorraad (18, 2).
L2-R1: left position: negative -s- bias; right position: positive
-s- bias
moepsel aangelegenheid (4, 16); lirksel afstand (3, 17); steukster
angst (9, 11); raalster besluit (7, 13); vilkster bevoegdheid (7, 13);
girdin bevordering (4, 16); kloerdin bijdrage (3, 17); dreekster
delegatie (14, 6); preuksel dimensie (3, 17); pleefster drang (11, 9);
veepsel duur (3, 17); taapster fractie (8, 12); brumsel norm (4, 16);
zwaagster orientatie (7, 13); borberin reden (1, 19); doerin standpunt
(0, 20); darsin tactiek (5, 15); stimsel toename (2, 18); vlatsel
uitoefening (5, 15); ploebin verhouding (2, 18); zwaperin woede (2, 18).
L2-R2: left position: negative -s- bias; right position: neutral
-s- bias
taapster commissie (6, 14); girdin conferentie (1, 19); raalster
conflict (8, 12); preuksel controle (0, 20); ploebin dienaar (3, 17);
steukster functie (10, 10); stimsel geschiedenis (5, 15); dreekster
karakter (5, 15); pleefster kunst (7, 13); veepsel kwaliteit (2, 18);
vlatsel mechanisme (2, 18); moepsel niveau (2, 18); vilkster organisatie
(8, 12); lirksel patroon (1, 19); borberin plicht (3, 17); doerin
probleem (0, 20); darsin project (6, 14); zwaagster regel (9, 11);
kloerdin rust (3, 17); zwaperin therapie (2, 18); brumsel verandering
(1, 19).
L2-R3: left position: negative -s- bias; right position: negative
-s- bias
steukster bel (11, 9); lirksel bos (2, 18); pleefster club (12, 8);
zwaperin eiland (1, 19); kloerdin kaas (1, 18); zwaagster kast (6, 14);
vlatsel laag (4, 16); dreekster les (5, 15): veepsel meester (3, 17);
raalster olie (4, 15); moepsel schaal (1, 18); taapster schrift (7, 13);
vilkster soort (2, 18); brumsel spiegel (3, 17); borberin sprong (0,
20); doerin stok (0, 19); darsin tekening (3, 17), girdin tent (0, 20);
stimsel transport (3, 17); ploebin venster (1, 19); preuksel voorraad
(0, 20).
ANDREA KROTT, R. HARALD BAAYEN, and ROBERT SCHREUDER
Interfaculty Research Unit for Language and Speech Max Planck
Institute for Psycholinguistics University of Nijmegen
Received 30 May 2000 Revised version received 6 November 2000