Article information

Title: Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
Authors: Enrico Seiler; Svenja Mehringer; Mitra Darvish et al.
Journal: iScience
Print ISSN: 2589-0042
Publication year: 2021
Volume: 24
Issue: 7
Pages: 1-20
DOI: 10.1016/j.isci.2021.102782
Language: English
Publisher: Elsevier
Abstract: We present Raptor, a system for approximately searching many queries, such as next-generation sequencing reads or transcripts, in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
Highlights:
• Raptor is a tool to search through large collections of genomic texts
• Raptor is 12-144 times faster and uses up to 30 times less RAM than COBS or Mantis
• The Raptor index is 6-50 times faster to build
• The use of minimizers and Bloom filters makes Raptor very space-efficient
Keywords: Genetics; bioinformatics; high-performance computing in bioinformatics
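Note: the abstract names two building blocks, winnowing minimizers and the interleaved Bloom filter. The following Python sketch is only a toy illustration of those two ideas, not Raptor's implementation. The names `minimizers` and `ToyIBF` are hypothetical, and the lexicographic k-mer ordering and SHA-256 hashing are assumptions chosen for readability.

```python
import hashlib


def minimizers(seq, k, w):
    """Return the set of winnowing minimizers of seq.

    Each window covers w consecutive k-mers, and the smallest k-mer of each
    window is kept. Assumption: k-mers are compared lexicographically here;
    real tools commonly compare hash values instead.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        # Sequence too short for a full window: keep one representative.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


class ToyIBF:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored row-wise.

    Row r packs bit r of every bin's Bloom filter into one Python int, so a
    single lookup answers membership for all bins at once.
    """

    def __init__(self, bins, size, hashes=2):
        self.bins = bins
        self.size = size
        self.hashes = hashes
        self.rows = [0] * size

    def _positions(self, item):
        # Assumption: seeded SHA-256 stands in for the hash functions.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def insert(self, item, bin_index):
        for pos in self._positions(item):
            self.rows[pos] |= 1 << bin_index

    def query(self, item):
        """Return the bins that may contain item (false positives possible)."""
        hits = (1 << self.bins) - 1
        for pos in self._positions(item):
            hits &= self.rows[pos]  # a bin survives only if all its bits are set
        return [b for b in range(self.bins) if (hits >> b) & 1]
```

In this sketch, each reference sequence would be assigned to a bin and only its minimizers inserted, which is what keeps the filter small. A query's minimizers are then looked up, and the bins are ranked by how many of them hit.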