首页    期刊浏览 2024年12月01日 星期日
登录注册

文章基本信息

  • 标题:Disk Compression of k-mer Sets
  • 本地全文:下载
  • 作者:Amatur Rahman ; Rayan Chikhi ; Paul Medvedev
  • 期刊名称:LIPIcs : Leibniz International Proceedings in Informatics
  • 电子版ISSN:1868-8969
  • 出版年度:2020
  • 卷号:172
  • 页码:16:1-16:18
  • DOI:10.4230/LIPIcs.WABI.2020.16
  • 出版社:Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik
  • 摘要:K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
  • 关键词:de Bruijn graphs; compression; k-mer sets; spectrum-preserving string sets
国家哲学社会科学文献中心版权所有