期刊名称:Proceedings of the National Academy of Sciences
印刷版ISSN:0027-8424
电子版ISSN:1091-6490
出版年度:2015
卷号:112
期号:26
页码:E3441-E3450
DOI:10.1073/pnas.1412301112
语种:English
出版社:The National Academy of Sciences of the United States of America
摘要:SignificanceBayesian models, including admixture models, are a powerful framework for articulating complex assumptions about large-scale genetic data; such models are widely used to explore data or to study population-level statistics of interest. However, we assume that a Bayesian model does not oversimplify the complexities in the data, to the point of invalidating our analyses. Here, we develop and study procedures for quantitatively evaluating admixture models of genetic data. Using four large genetic studies, we demonstrate that model checking should be an important part of the modern genetic data analysis pipeline. Our methods help to support inferences drawn from recovered population structure, to protect scientists from being misled by a misspecified model class, and to point scientists toward useful model extensions. Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of fit of a statistical model to a specific dataset. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the admixture model fit to four qualitatively different population genetic datasets: the population reference sample (POPRES) European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
关键词:posterior predictive checks ; admixture models ; population structure ; model checking ; genomic data