Article information

  • Title: An Efficient Indexer for Large N-Gram Corpora
  • Authors: Hakan Ceylan; Rada Mihalcea
  • Venue: Conference on European Chapter of the Association for Computational Linguistics (EACL)
  • Year of publication: 2011
  • Volume: 2011
  • Publisher: ACL Anthology
  • Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5-gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substitution task by a simple implementation that gives considerable improvement over a basic language model.
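
The abstract does not say how the tool achieves retrieval with a single disk access. One common way to get that property, shown here only as an illustrative sketch and not as the authors' implementation, is a sparse index: the N-grams are stored sorted on disk, only the first key of each fixed-size block is held in memory, and a lookup does an in-memory binary search followed by one seek-and-read of the matching block. The class name `SparseNgramIndex`, the `block_size` parameter, the tab-separated line format, and the toy counts below are all assumptions made for the example.

```python
import bisect
import io


class SparseNgramIndex:
    """Hypothetical sparse index over a sorted "ngram<TAB>count" file."""

    def __init__(self, data, block_size=2):
        # data: seekable binary file-like object whose lines are sorted by n-gram
        self.data = data
        self.block_size = block_size
        self.keys = []     # first n-gram of each block (kept in memory)
        self.offsets = []  # byte offset at which each block starts
        self.reads = 0     # number of block reads, to make the cost visible
        self._build()

    def _build(self):
        # One sequential pass: remember the first key and offset of every block.
        self.data.seek(0)
        offset = 0
        for i, line in enumerate(self.data):
            if i % self.block_size == 0:
                self.keys.append(line.split(b"\t")[0])
                self.offsets.append(offset)
            offset += len(line)
        self.end = offset

    def lookup(self, ngram):
        key = ngram.encode("utf-8")
        # In-memory binary search for the last block whose first key <= key.
        b = bisect.bisect_right(self.keys, key) - 1
        if b < 0:
            return None  # key sorts before everything in the file
        start = self.offsets[b]
        stop = self.offsets[b + 1] if b + 1 < len(self.offsets) else self.end
        # The only access to the data file: one seek plus one read.
        self.data.seek(start)
        block = self.data.read(stop - start)
        self.reads += 1
        for line in block.splitlines():
            k, count = line.split(b"\t")
            if k == key:
                return int(count)
        return None


# Toy data with invented counts, standing in for a sorted n-gram file on disk.
toy_counts = {"a b c": 5, "a b d": 3, "a c a": 7, "b a a": 2, "b b b": 9}
raw = b"".join(
    k + b"\t" + str(v).encode() + b"\n"
    for k, v in sorted((k.encode("utf-8"), v) for k, v in toy_counts.items())
)
index = SparseNgramIndex(io.BytesIO(raw), block_size=2)
print(index.lookup("a b d"), index.lookup("z z z"))
```

In this sketch the in-memory footprint shrinks as `block_size` grows, at the cost of scanning a larger block per lookup; how the actual tool balances that trade-off, and how it supports contiguous wild cards, is not stated in the abstract.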