
Article information

  • Title: Stability of the syntagmatic probability distributions
  • Authors: Dimitrijević Strahinja ; Kostić Aleksandar ; Milin Petar
  • Journal: Psihologija
  • Print ISSN: 0048-5705
  • Electronic ISSN: 1451-9283
  • Publication year: 2009
  • Volume: 42
  • Issue: 1
  • Pages: 107-120
  • DOI: 10.2298/PSI0901107D
  • Publisher: Društvo psihologa Srbije
  • Abstract:

    This study aims to establish criteria for the optimal size of a corpus that yields stable conditional probabilities of morphological and/or syntagmatic types. Optimal corpus size is defined as the smallest sample whose probability distribution equals the distribution derived from a large sample that generates stable probabilities; we refer to the latter as the 'target distribution'. To establish these criteria, we varied the sample size, the word sequence size (bigrams and trigrams), the sampling procedure (randomly chosen words vs. continuous text), and the position of the target word in a sequence. The distributions of conditional probabilities derived from smaller samples were correlated with the target distributions. The sample size at which a probability distribution reaches maximal correlation (r = 1) with the target distribution was taken as optimal. The research was carried out on the Corpus of Serbian Language. With random word selection, the optimal sample size is 65,000 words for bigrams and 281,000 words for trigrams. In contrast, continuous text sampling requires much larger samples to reach stability: 810,000 words for bigrams and 868,000 words for trigrams. The factors behind these differences remain unclear and need additional empirical investigation.

  • Keywords: corpus linguistics; quantitative linguistics; optimal sample size; conditional probabilities; Serbian language
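
The procedure summarized in the abstract can be illustrated with a minimal Python sketch for the bigram case. It is not the authors' implementation: the function names, the way random sampling is drawn (bigrams picked at random from the corpus), the treatment of unseen bigrams as probability 0, and the 0.999 threshold standing in for "r = 1" are all assumptions made for illustration. Tokens may be word forms or, closer to the study, morphological tags.

```python
import random
from collections import Counter
from math import sqrt


def bigrams(tokens):
    """All adjacent (previous, next) pairs in a token sequence."""
    return list(zip(tokens, tokens[1:]))


def conditional_probs(pairs):
    """Estimate P(next | previous) from a list of (previous, next) pairs."""
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_with_target(sample_pairs, target_probs):
    """Correlate a sample's conditional probabilities with the target's.

    Assumption: bigrams absent from the sample count as probability 0.
    """
    sample_probs = conditional_probs(sample_pairs)
    keys = sorted(target_probs)
    return pearson([sample_probs.get(k, 0.0) for k in keys],
                   [target_probs[k] for k in keys])


def draw_sample(pairs, size, continuous=True, seed=0):
    """Continuous text: the first `size` bigrams. Random: `size` bigrams at random."""
    if continuous:
        return pairs[:size]
    return random.Random(seed).sample(pairs, size)


def smallest_stable_size(pairs, sizes, threshold=0.999, continuous=True):
    """First size in `sizes` whose correlation with the target meets the threshold."""
    target = conditional_probs(pairs)  # the full corpus plays the role of the target
    for size in sizes:
        r = correlation_with_target(draw_sample(pairs, size, continuous), target)
        if r >= threshold:
            return size, r
    return None
```

Calling `smallest_stable_size(bigrams(tokens), sizes)` with an increasing list of sizes returns the first sample size that reaches the threshold, or `None` if no size does; passing `continuous=False` switches to the random-selection variant.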