摘要:We consider the problem of identifying diseaseassociated genomic regions in genome-wide association studies (GWAS). It is shown that conventional single SNP analysis can be greatly improved by (i) exploiting the spatial dependency and (ii) conducing set-wise analysis. The SNP set association problem can be conceptualized as the problem of simultaneously testing a large number of sets of hypotheses. We use hidden Markov models to exploit the linkage disequilibrium information in GWAS data, based on which a data-driven screening procedure (GLIS) is proposed. GLIS is shown to be optimal in the sense that it has the smallest missed set rate (MSR) among all valid false set rate (FSR) procedures. The numerical results demonstrate that the proposed procedure controls the FSR at the desired level, enjoys certain optimality properties and outperforms conventional combined p-value methods.We apply the GLIS procedure to analyze a Type 1 diabetes (T1D) GWAS dataset for detecting T1D associated genomic regions. The results show that our proposed SNP set analysis not only provides better biological insights, but also increases the statistical power by pooling information from different samples.