Basic Article Information

  • Title: An Algorithm for Predicting the Relationship between Lemmas and Corpus Size
  • Authors: Yang, Dan-Hee ; Gomez, Pascual Cantos ; Song, Man-Suk
  • Journal: ETRI Journal
  • Print ISSN: 1225-6463
  • Electronic ISSN: 2233-7326
  • Publication year: 2000
  • Volume: 22
  • Issue: 2
  • Pages: 20-20
  • Language: English
  • Publisher: Electronics and Telecommunications Research Institute
  • Abstract: Much research on natural language processing (NLP), computational linguistics and lexicography has relied on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline as to how large machine-readable corpus resources should be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed some new light on this issue, we reveal the flaws of several previous studies that aimed to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we illustrate experimentally that the algorithm presented is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness and linguistic comprehensiveness.