Abstract: The problem of inferring the population structure, linkage disequilibrium
pattern, and chromosomal recombination hotspots from genetic polymorphism
data is essential for understanding the origin and characteristics of genome
variations, with important applications to the genetic analysis of disease propen-
sities and other complex traits. Statistical genetic methodologies developed so far
mostly address these problems separately using specialized models ranging from
coalescence and admixture models for population structures, to hidden Markov
models and renewal processes for recombination; but most of these approaches
ignore the inherent uncertainty in the genetic complexity (e.g., the number of ge-
netic founders of a population) of the data and the close statistical and biological
relationships among objects studied in these problems. We present a new statis-
tical framework called hidden Markov Dirichlet process (HMDP) to jointly model
the genetic recombinations among a possibly infinite number of founders and the
coalescence-with-mutation events in the resulting genealogies. The HMDP posits
that a haplotype of genetic markers is generated by a sequence of recombination
events that select an ancestor for each locus from an unbounded set of founders
according to a first-order Markov transition process. Conjoining this process with
a mutation model, our method accommodates both between-lineage recombina-
tion and within-lineage sequence variations, and leads to a compact and natural
interpretation of the population structure and inheritance process underlying haplotype
data. We have developed an efficient sampling algorithm for HMDP based
on a two-level nested Pólya urn scheme, and we present experimental results on
joint inference of population structure, linkage disequilibrium, and recombination
hotspots based on HMDP. On both simulated and real SNP haplotype data, our
method performs competitively with, or significantly better than, extant methods in
uncovering the recombination hotspots along chromosomal loci; in addition, it
infers the ancestral genetic patterns and offers a highly accurate map of
ancestral compositions of modern populations.
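
The generative story summarized above (each locus of a haplotype picks an ancestor from an unbounded founder set through a first-order Markov process, then copies that founder's allele with possible mutation) can be illustrated with a deliberately simplified sketch. It is not the HMDP itself: a single Pólya urn stands in for the two-level nested urn, and the function name `sample_haplotype` and the parameters `alpha`, `recomb`, and `mut` are hypothetical choices made only for this illustration.

```python
import random


def sample_haplotype(num_loci, founders, counts,
                     alpha=1.0, recomb=0.1, mut=0.01, rng=random):
    """Sample one binary haplotype from a toy founder model.

    founders: list of founder allele lists; grows when a new founder is drawn.
    counts:   how many times each founder has been drawn so far (urn weights).
    alpha, recomb, mut are illustrative assumptions, not values from the paper.
    """

    def draw_founder():
        # Polya urn: pick an existing founder with probability proportional
        # to its count, or create a new founder with weight alpha.
        u = rng.random() * (sum(counts) + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                counts[k] += 1
                return k
        founders.append([rng.randint(0, 1) for _ in range(num_loci)])
        counts.append(1)
        return len(founders) - 1

    haplotype, path = [], []
    current = draw_founder()
    for t in range(num_loci):
        # First-order Markov step: keep the previous founder unless a
        # recombination event occurs, in which case redraw from the urn.
        if t > 0 and rng.random() < recomb:
            current = draw_founder()
        allele = founders[current][t]
        # Within-lineage variation: flip the inherited allele with prob. mut.
        if rng.random() < mut:
            allele = 1 - allele
        haplotype.append(allele)
        path.append(current)
    return haplotype, path
```

Calling the function repeatedly with the same `founders` and `counts` lists lets a small set of founders emerge and be shared across haplotypes, which is the qualitative behavior the abstract describes; inferring founders and recombination points from observed data would require the sampling algorithm of the paper, which this sketch does not attempt.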