Article Information

  • Title: Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System
  • Authors: Ethan Millar; Dan Shen; Junli Liu
  • Journal: Journal of Digital Information
  • Print ISSN: 1368-7506
  • Electronic ISSN: 1368-7506
  • Year of publication: 2006
  • Volume: 1
  • Issue: 5
  • Language: English
  • Publisher: Texas A&M University Libraries
  • Abstract: Information retrieval has become more and more important due to the rapid growth of all kinds of information. However, there are few suitable systems available. This paper presents a few approaches that enable large-scale information retrieval for the TELLTALE system. TELLTALE is an information retrieval environment that provides full-text search for text corpora that may be garbled by OCR (optical character recognition) or transmission errors, and that may contain multiple languages. It can find similar documents against a 1 kB query from 1 GB of text data in 45 seconds. This remarkable performance is achieved by integrating new data structures and gamma compression into the TELLTALE framework. This paper also compares several different types of query methods such as tf.idf and incremental similarity to the original technique of centroid subtraction. The new similarity techniques give better performance but less accuracy.
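
The abstract names n-gram matching and tf.idf weighting without showing either. The following is a minimal sketch of how character n-grams with tf.idf weights can match a query against documents even when the text is garbled, as OCR output often is. It is not TELLTALE's implementation: the n-gram length of 5, the particular weighting formula, and the function names `ngrams`, `tfidf_vectors`, `cosine`, and `rank` are all assumptions chosen for illustration, and it covers neither centroid subtraction nor gamma compression.

```python
import math
from collections import Counter


def ngrams(text, n=5):
    """Return overlapping character n-grams of lower-cased, space-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(texts, n=5):
    """Build one sparse tf.idf-weighted n-gram vector (a dict) per text.

    Assumed weighting: relative term frequency times log(1 + N / df).
    """
    counts = [Counter(ngrams(t, n)) for t in texts]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    total = len(texts)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        vectors.append(
            {g: (k / length) * math.log(1 + total / df[g]) for g, k in c.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, n=5):
    """Return (doc_index, similarity) pairs, most similar document first."""
    vectors = tfidf_vectors(docs + [query], n)
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "information retrieval with character n-grams",
        "recipes for baking sourdough bread at home",
    ]
    # A query with simulated OCR errors ("i" read as "1", "m" read as "rn").
    garbled_query = "informat1on retr1eval w1th n-grarns"
    for index, score in rank(garbled_query, docs):
        print(index, round(score, 3))
```

Because the vectors are built from character n-grams rather than whole words, a misrecognized character damages only the few n-grams that overlap it, so the garbled query still shares n-grams such as "infor" and "n-gra" with the first document.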