Journal: Proceedings of the National Academy of Sciences
Print ISSN: 0027-8424
Electronic ISSN: 1091-6490
Publication year: 2022
Volume: 119
Issue: 31
DOI: 10.1073/pnas.2121279119
Language: English
Publisher: The National Academy of Sciences of the United States of America
Abstract: Significance
Biobanks linking genomic data to individual electronic health records are increasing in number and in size around the world and are important both for the discovery of associations between DNA variation and health outcomes and in the context of precision medicine. There is a need to analyze individual-level data from such biobanks using the most powerful and efficient software applications. Using two large biobank studies, we show that both for detecting associations and for making predictors of health outcomes from the DNA, we can greatly improve current approaches by using statistical models that allow the strength of association to differ according to the properties of the genetic markers.
Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency–linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy $R^2$ was 47% in a UK Biobank holdout sample, which was 76% of the estimated $h^2_{\mathrm{SNP}}$. We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average $\chi^2$ value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.
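The abstract refers to grouping SNP markers into MAF-LD annotation categories. As a rough illustration of that grouping step only, the Python sketch below bins markers by quantiles of minor allele frequency and of a precomputed LD score. The function name `assign_maf_ld_groups`, the quantile-based bins, and the bin counts are assumptions made for this example; they are not taken from the paper or from the GMRM software, and the Bayesian model itself is not shown.

```python
import numpy as np


def assign_maf_ld_groups(maf, ld_score, n_maf_bins=3, n_ld_bins=2):
    """Label each marker with a group index built from MAF and LD-score bins.

    Hypothetical helper: quantile bins and bin counts are illustrative
    assumptions, not the categories used in the paper.
    """
    maf = np.asarray(maf, dtype=float)
    ld_score = np.asarray(ld_score, dtype=float)

    # Interior quantile cut points, e.g. the 1/3 and 2/3 quantiles for 3 bins.
    maf_edges = np.quantile(maf, np.linspace(0, 1, n_maf_bins + 1)[1:-1])
    ld_edges = np.quantile(ld_score, np.linspace(0, 1, n_ld_bins + 1)[1:-1])

    # np.digitize returns a bin index in 0..n_bins-1 for each marker.
    maf_bin = np.digitize(maf, maf_edges)
    ld_bin = np.digitize(ld_score, ld_edges)

    # Combine the two bin indices into a single group label per marker.
    return maf_bin * n_ld_bins + ld_bin


# Toy data: six markers with made-up MAF and LD-score values.
maf = [0.01, 0.02, 0.10, 0.20, 0.40, 0.50]
ld_score = [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
groups = assign_maf_ld_groups(maf, ld_score)
```

In a grouped model, each label would index its own set of prior parameters, so the strength of association can differ between, for example, rare low-LD markers and common high-LD markers.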