Article Information

  • Title: Improving Morphosyntactic Tagging of Slovene Language through Meta-tagging
  • Authors: Jan Rupnik; Miha Grčar; Tomaž Erjavec
  • Journal: Informatica
  • Print ISSN: 1514-8327
  • Electronic ISSN: 1854-3871
  • Year: 2008
  • Volume: 32
  • Issue: 4
  • Publisher: The Slovene Society Informatika, Ljubljana
  • Abstract: Part-of-speech (PoS) or, better, morphosyntactic tagging is the process of assigning morphosyntactic categories to words in a text, an important pre-processing step for most human language technology applications. PoS tagging of Slovene texts is a challenging task because the tagset contains over one thousand tags (as opposed to English, where the size is typically around sixty) and state-of-the-art tagging accuracy is still below the desired level. The paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers: a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a trigram HMM tagger trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but in many of the cases where their predictions differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two taggers is correct. We experiment with different classification algorithms and different feature sets for training, and show that some configurations yield a meta-tagger with a significant increase in accuracy compared to that of either tagger in isolation.
  • Keywords: PoS tagging; meta-tagger; Slavic languages; FidaPLUS; JOS corpus; machine learning; Orange; decision trees; CN2 rules; Naive Bayes
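
The meta-tagging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper reports using Orange with decision trees, CN2 rules, and Naive Bayes, whereas this sketch substitutes a simple majority-vote lookup table so that it needs only the standard library. The function names, the feature set (the first character of each tag), and the toy training triples are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def features(tag_a, tag_b):
    # Hypothetical feature set: the coarse category (first character)
    # of the tag proposed by each tagger.
    return (tag_a[:1], tag_b[:1])


def train_meta_tagger(examples):
    """Learn which tagger to trust when the two disagree.

    examples: iterable of (tag_a, tag_b, gold_tag) triples.
    Returns a dict mapping a feature key to "A" or "B".
    """
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in examples:
        if tag_a == tag_b:
            continue  # agreement: there is nothing to decide
        if gold == tag_a:
            votes[features(tag_a, tag_b)]["A"] += 1
        elif gold == tag_b:
            votes[features(tag_a, tag_b)]["B"] += 1
        # Cases where neither tagger is correct are skipped.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def meta_tag(model, tag_a, tag_b, default="A"):
    """Return the tag of the tagger that the model predicts is correct."""
    if tag_a == tag_b:
        return tag_a
    choice = model.get(features(tag_a, tag_b), default)
    return tag_a if choice == "A" else tag_b


# Invented toy data: (tagger A output, tagger B output, gold tag).
training = [
    ("Ncmsn", "Afpmsn", "Ncmsn"),
    ("Ncmsa", "Afpmsa", "Ncmsa"),
    ("Ncmsn", "Afpmsn", "Afpmsn"),
    ("Vmip3s", "Ncfsn", "Ncfsn"),
    ("Vmip3s", "Ncmsn", "Ncmsn"),
]
model = train_meta_tagger(training)
print(meta_tag(model, "Ncfsg", "Afpfsg"))
print(meta_tag(model, "Vmip3s", "Ncnsn"))
```

A lookup table only works for feature combinations seen in training and falls back to a fixed default otherwise; a learned classifier of the kind named in the keywords would generalize across richer feature sets.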