Abstract: Many statistical frameworks, including neutrality tests, are used to identify evolutionary events from the footprints they leave in a population’s patterns of genetic variation. In datasets from current high-throughput sequencing and genotyping platforms, missing data and low-confidence SNP calls are common at many segregating sites. The traditional statistical framework for neutrality tests does not allow for these possibilities, so the usual treatment is to discard any segregating site with missing or low-confidence calls, regardless of the good SNP calls at that site in other individuals. In this work, we propose a modified neutrality test, Extended Tajima’s D, which incorporates missing data and SNP-calling uncertainties. Because we do not specify any particular error-generating mechanism, the approach is robust and widely applicable. Simulations show that, in most cases, the new test has higher power than the original Tajima’s D at the same type I error rate. Applications to real data show that it detects fewer outliers associated with low-quality data. The downloadable executable and its documentation are available on Google Code: https://code.google.com/p/robust-scan/.
Keywords: neutrality test; Tajima’s D; missing genotype; next-generation sequencing
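
For orientation, the conventional baseline described in the abstract (the original Tajima’s D computed after discarding every site with a missing call) can be sketched as follows. This is a minimal illustration of the classic statistic only, not the Extended Tajima’s D proposed here, and it is unrelated to the robust-scan executable. The function name `tajimas_d`, the 0/1/`None` allele encoding, and the choice to return `None` for fewer than four chromosomes are assumptions made for this example.

```python
from math import sqrt
from typing import Optional, Sequence


def tajimas_d(sites: Sequence[Sequence[Optional[int]]]) -> Optional[float]:
    """Classic Tajima's D with the conventional complete-case filter.

    sites[k][i] is the allele of chromosome i at site k:
    0 (ancestral), 1 (derived), or None (missing / low-confidence call).
    All rows are assumed to have the same length (hypothetical encoding).
    """
    # Conventional treatment: drop any site that has at least one missing call.
    complete = [s for s in sites if all(a is not None for a in s)]
    if not complete:
        return None
    n = len(complete[0])
    if n < 4:
        # Assumption of this sketch: the variance term vanishes for n < 4.
        return None

    # Derived-allele counts at segregating sites only.
    seg = [j for j in (sum(s) for s in complete) if 0 < j < n]
    S = len(seg)
    if S == 0:
        return None

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Mean number of pairwise differences, summed over segregating sites.
    pi = sum(2.0 * j * (n - j) / (n * (n - 1)) for j in seg)
    variance = e1 * S + e2 * S * (S - 1)
    return (pi - S / a1) / sqrt(variance)
```

In this sketch a site with a single `None` contributes nothing, even when the remaining calls at that site are of good quality. That discarded information is what the abstract says the proposed test is designed to use.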