Journal: Drustvena istrazivanja. Journal for General Social Issues
Print ISSN: 1330-0288
Publication year: 2005
Volume: 14
Issue: 1-2 (75-76)
Pages: 227-250
Publisher: Institute of Social Sciences IVO PILAR
Abstract: The existing formula of Heaps' Law for the size of a text's vocabulary, $V_r(n) = K n^{\beta}$, is not universal, so the law needs to be redefined before it can be used to analyse a corpus in a different language. An analysis of a corpus of Croatian texts confirms the hypothesis that the share of functional items (F) in a text is constant and amounts to 21% of the text size n (in English texts the share is 26%). The author proves that the percentage of functional items in a text can be used as the value of the parameter K, and that K is a constant for every language corpus. Empirical research has confirmed the author's thesis that the number of functional items in a text can be calculated with the formula $F = nK/100$, and that the frequency of the most frequent item (MF) can be calculated with the formula $MF = n(K/100)^2$. The other parameter of Heaps' Law can also be determined accurately: $\beta = \log K/100$. The author therefore proposes a new form of the law of text vocabulary size: $V_r(n) = (Kn)^{\beta}$. The number of words appearing only once in the text (HL) can be calculated with the formula $HL = ((Kn)/2)^{\beta}$. The research confirms a very high correlation between the calculated and actual values of the vocabulary size, as well as between the calculated and actual numbers of words occurring only once in the text. Interpreted and defined in this way, the law of text vocabulary size makes it possible to calculate the vocabulary size of a text in any language, provided the constant percentage of functional words for that language is known. Beyond the vocabulary size, this interpretation of the law also allows calculation of the number of functional items in the text, the frequency of the most frequent word in the text, and the number of items in the text's vocabulary that occur only once.
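The formulas stated in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names are hypothetical, and because the abstract does not spell out the base or sign convention of the logarithm that defines β, the exponent is passed in explicitly as `beta` rather than derived from K. The value `beta=0.5` in the usage lines is a placeholder chosen for illustration, not a figure from the paper.

```python
def functional_items(n, k):
    """Number of functional items: F = n * K / 100."""
    return n * k / 100


def most_frequent_item(n, k):
    """Frequency of the most frequent item: MF = n * (K / 100)^2."""
    return n * (k / 100) ** 2


def vocabulary_size(n, k, beta):
    """Proposed vocabulary-size law: V_r(n) = (K * n)^beta.

    beta is supplied by the caller (assumption: the abstract does not
    fix how the logarithm defining beta should be evaluated).
    """
    return (k * n) ** beta


def single_occurrence_words(n, k, beta):
    """Words appearing only once: HL = ((K * n) / 2)^beta."""
    return (k * n / 2) ** beta


# Usage with the Croatian value K = 21 reported in the abstract and a
# hypothetical text of n = 10,000 tokens; beta = 0.5 is a placeholder.
n, k, beta = 10_000, 21, 0.5
print(functional_items(n, k))
print(most_frequent_item(n, k))
print(vocabulary_size(n, k, beta))
print(single_occurrence_words(n, k, beta))
```

Under these formulas the ratio of vocabulary size to single-occurrence words is always 2 raised to the power β, independent of n and K, which gives a quick consistency check on any computed values.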