Article information

Title: Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data
Authors: Mikhail Pomaznoy; Ashu Sethi; Jason Greenbaum et al.
Journal: Scientific Reports
Electronic ISSN: 2045-2322
Publication year: 2019
Volume: 9
Issue: 1
Pages: 1-10
DOI: 10.1038/s41598-019-52584-w
Publisher: Springer Nature
Abstract: RNA-seq methods are widely utilized for transcriptomic profiling of biological samples. However, there are known caveats of this technology which can skew the gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, then some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes for which expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes have a two-fold or higher difference in estimated expression when strand information of the reads is ignored. We used parameters of read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which ones do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those which span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
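
Illustration: the two-fold criterion mentioned in the abstract can be sketched as a comparison of two per-gene estimates from the same stranded library, one counted with strand information and one with it ignored. This is a minimal sketch, not the uslcount implementation: the function name `flag_strand_biased`, the pseudocount of 1, and the example values are all assumptions made for illustration.

```python
import math


def flag_strand_biased(stranded, unstranded, min_fold=2.0, pseudocount=1.0):
    """Return gene IDs whose estimate changes by at least `min_fold`
    (in either direction) when strand information is ignored.

    `stranded` and `unstranded` map gene ID -> expression estimate.
    The pseudocount is an assumption; it avoids division by zero.
    """
    threshold = math.log2(min_fold)
    flagged = []
    for gene, s in stranded.items():
        u = unstranded.get(gene, 0.0)
        log2_fold = math.log2((u + pseudocount) / (s + pseudocount))
        if abs(log2_fold) >= threshold:
            flagged.append(gene)
    return flagged


# Hypothetical estimates, invented for this example only.
stranded = {"geneA": 100.0, "geneB": 10.0, "geneC": 50.0}
unstranded = {"geneA": 110.0, "geneB": 45.0, "geneC": 52.0}

biased = flag_strand_biased(stranded, unstranded)
print(biased)
```

With these made-up values only `geneB` crosses the threshold, as would be expected for a gene that picks up reads from an overlapping gene on the opposite strand.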