Journal: Divergencias : Revista de Estudios Linguisticos y Literarios
Print ISSN: 1555-7596
Publication year: 2006
Volume: 4
Issue: 02
Publisher: University of Arizona
Abstract: The fact that language structure is affected by usage is a cornerstone of functional linguistics. One specific idea that is generally accepted is that the words with the greatest token frequency are also the shortest (e.g. Bybee, 2002). The purpose of this paper is to outline a statistical method that may be used to perform tests on corpus data related to the word length-token frequency function. The data used to develop this method come from the spoken portion of Davies' (2005) Corpus del español, a 100 million word corpus of the Spanish language including sources from eight centuries. A rank-order list that includes the number of occurrences of each form was extracted from the Corpus del español, and the 1000 most frequent forms were then tagged for length in terms of number of syllables. Using linear regression analysis, equations were created from the data presenting word length as a function of rank in the list in one case and frequency of occurrence in the other. These equations represent an approximate average word length at any point in the rank-order list. Details for selecting data are discussed and possible future applications of this method are outlined.
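The regression step described in the abstract can be illustrated with a minimal sketch. It assumes ordinary least squares with a single predictor; the abstract does not state which regression variant or software the author used. The function name `fit_line` and the five-row toy list are hypothetical: the counts are invented for illustration and are not taken from the Corpus del español.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy list: (form, invented token count, syllable count).
# The counts are made up for illustration only.
toy_forms = [
    ("de", 1000, 1),
    ("la", 900, 1),
    ("que", 800, 1),
    ("para", 300, 2),
    ("entonces", 100, 3),
]

# Rank 1 is the most frequent form, as in a rank-order list.
ordered = sorted(toy_forms, key=lambda row: row[1], reverse=True)
ranks = list(range(1, len(ordered) + 1))
frequencies = [row[1] for row in ordered]
lengths = [row[2] for row in ordered]

# Word length as a function of rank in one case ...
rank_slope, rank_intercept = fit_line(ranks, lengths)
# ... and as a function of frequency of occurrence in the other.
freq_slope, freq_intercept = fit_line(frequencies, lengths)

print(f"length = {rank_slope:.3f} * rank + {rank_intercept:.3f}")
print(f"length = {freq_slope:.5f} * frequency + {freq_intercept:.3f}")
```

If shorter forms do cluster at the top of the list, the rank equation should have a positive slope and the frequency equation a negative one. Either equation can then be evaluated at any rank or frequency to give an approximate average word length at that point.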