Article Information

  • Title: n-Gram-Based Text Compression
  • Authors: Vu H. Nguyen; Hien T. Nguyen; Hieu N. Duong
  • Journal: Computational Intelligence and Neuroscience
  • Print ISSN: 1687-5265
  • Electronic ISSN: 1687-5273
  • Publication year: 2016
  • Volume: 2016
  • DOI: 10.1155/2016/9483646
  • Publisher: Hindawi Publishing Corporation
  • Abstract: We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. On the same dataset, it achieves a significantly better compression ratio than state-of-the-art methods. Given a text, the proposed method first splits it into n-grams and then encodes them using the n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigram to five-gram to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigram to five-gram, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
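
The encoding phase described in the abstract can be illustrated with a minimal Python sketch. It assumes a greedy longest-match strategy over window sizes five down to two, which is one simple reading of the sliding window and not necessarily the authors' procedure. The names `TOY_DICTS`, `encode`, and `decode` are hypothetical, the dictionaries are tiny stand-ins for the 12 GB ones, and the output is a list of `(n, id)` pairs rather than the two-to-four-byte codes of the paper, whose layout the abstract does not specify.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical toy dictionaries: n -> {n-gram (tuple of words): integer id}.
# The real dictionaries are built from a 2.5 GB news corpus.
TOY_DICTS: Dict[int, Dict[Tuple[str, ...], int]] = {
    2: {("xin", "chào"): 0, ("việt", "nam"): 1},
    3: {("thành", "phố", "hồ"): 0},
}

MIN_N = 2  # smallest window size (bigram)
MAX_N = 5  # largest window size (five-gram)

Code = Tuple[int, Union[int, str]]


def encode(text: str, dicts=TOY_DICTS) -> List[Code]:
    """Greedy longest-match encoding (an assumed strategy).

    Emits (n, id) for a dictionary hit, or (1, word) as a literal
    fallback when no n-gram starting at the current word is known.
    """
    words = text.split()
    out: List[Code] = []
    i = 0
    while i < len(words):
        # Try the widest window first, then shrink it.
        for n in range(min(MAX_N, len(words) - i), MIN_N - 1, -1):
            idx = dicts.get(n, {}).get(tuple(words[i:i + n]))
            if idx is not None:
                out.append((n, idx))
                i += n
                break
        else:
            # No window matched: keep the word itself.
            out.append((1, words[i]))
            i += 1
    return out


def decode(codes: List[Code], dicts=TOY_DICTS) -> str:
    """Invert encode() by looking ids up in reversed dictionaries."""
    reverse = {n: {v: k for k, v in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for n, payload in codes:
        if n == 1:
            words.append(payload)
        else:
            words.extend(reverse[n][payload])
    return " ".join(words)
```

For example, `encode("xin chào việt nam")` yields two bigram codes, and `decode` restores the text. Splitting on whitespace means runs of spaces are not preserved, so a real compressor would need to handle them separately.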