Journal title: Proceedings of the National Academy of Sciences
Print ISSN: 0027-8424
Electronic ISSN: 1091-6490
Publication year: 2021
Volume: 118
Issue: 40
DOI: 10.1073/pnas.2105841118
Language: English
Publisher: The National Academy of Sciences of the United States of America
Abstract: Significance
Genome-wide association studies compare a phenotype to thousands of genetic variants, searching for associations of potential biological interest. Standard analyses rely on linear models of the phenotype given one variable at a time. However, their assumptions are difficult to verify and their univariate approaches make it hard to distinguish interesting associations from spurious ones. Our work takes a different path: We analyze all variants simultaneously, modelling the randomness in the genotypes, which is better understood, instead of the phenotype. Our solution accounts for linkage disequilibrium and population structure, controls the false discovery rate, and leverages powerful machine-learning tools. Applications to the UK Biobank data indicate increased power compared to state-of-the-art alternatives and high replicability.
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
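The abstract describes knockoffs as negative controls used to control the false discovery rate. As a minimal sketch of how such a selection step can work, the Python below applies the standard knockoff+ threshold to a vector of per-variable statistics `W`, where a large positive value means the original variable looks more important than its knockoff. The function names, the toy values of `W`, and the target level `q` are illustrative assumptions, not taken from the paper or its software, and the record does not confirm that the authors use exactly this rule.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    offset=1 gives the more conservative knockoff+ variant.
    Returns np.inf when no threshold qualifies.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero magnitudes of W, ascending.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics estimate how many false positives exceed t.
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_select(W, q=0.1):
    """Indices of variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q))

# Toy statistics (hypothetical values, for illustration only).
W_toy = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 3.5, 4.5]
selected = knockoff_select(W_toy, q=0.2)
```

In practice the statistics would come from fitting a multivariate model to the genotypes together with their knockoffs; that step is outside the scope of this sketch.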
Keywords: genome-wide association studies; false discovery rate; knockoffs; population structure; hidden Markov models