
Article information

  • Title: LemmaGen: Multilingual Lemmatisation with Induced Ripple-Down Rules
  • Authors: Matjaž Juršič; Igor Mozetič; Tomaž Erjavec
  • Journal: Journal of Universal Computer Science
  • Print ISSN: 0948-6968
  • Year of publication: 2010
  • Volume: 16
  • Issue: 9
  • Pages: 1190-1214
  • Publisher: Graz University of Technology and Know-Center
  • Abstract: Lemmatisation is the process of finding the normalised forms of words appearing in text. It is a useful preprocessing step for a number of language engineering and text mining tasks, and especially important for languages with rich inflectional morphology. This paper presents a new lemmatisation system, LemmaGen, which was trained to generate accurate and efficient lemmatisers for twelve different languages. Its evaluation on the corresponding lexicons shows that LemmaGen outperforms the lemmatisers generated by two alternative approaches, RDR and CST, both in terms of accuracy and efficiency. To our knowledge, LemmaGen is the most efficient publicly available lemmatiser trained on large lexicons of multiple languages, whose learning engine can be retrained to effectively generate lemmatisers of other languages.