Abstract: Regional prevalence estimation requires epidemiologic data with substantial local detail. National health surveys may lack sufficient local observations due to limited resources. Therefore, corresponding prevalence estimates may not capture regional morbidity patterns with the necessary accuracy. Health insurance records represent alternative data sources for this purpose. Fund-specific member populations have more local observations than surveys, which benefits regional prevalence estimation. However, due to national insurance market regulations, insurance membership can be informative for morbidity. Regional fund-specific prevalence proportions are selective in the sense that the morbidity structure of a fund’s member population cannot be extrapolated to the national population. This implies a selection bias that marks a major obstacle for statistical inference. We provide a methodology to adjust for fund-specific selectivity and perform regional prevalence estimation from health insurance records. The methodology is applied to estimate regional cohort-referenced diabetes mellitus type 2 prevalence in Germany. Records of the German Public Health Insurance Company from 2014 and Diagnosis-Related Group Statistics data are combined within a benchmarked multi-level model. Fund-specific selectivity is adjusted for in a two-step procedure. First, the conditional expectation of the insurance company’s regional prevalence given related inpatient diagnosis frequencies of its members is quantified. Second, the regional prevalence is estimated by extrapolating the conditional expectation using corresponding inpatient diagnosis frequencies of the Diagnosis-Related Group Statistics as benchmarks. Model assumptions are validated via Monte Carlo simulation. Variable selection is performed via multivariate methods. The optimal model fit is determined by analysis of variance. 95% confidence intervals for the estimates are constructed via semiparametric bootstrapping.
The national diabetes mellitus type 2 prevalence is estimated at 8.70% with a 95% confidence interval of [8.48%, 9.35%]. This indicates an adjustment of the original fund-specific prevalence from −32.79% to −25.93%. The estimated disease distribution shows significant morbidity differences between regions, especially between eastern and western Germany. However, the cohort-referenced estimates suggest that these differences can be partially explained by regional demography. The proposed methodology allows regional prevalence estimation in remarkable detail despite fund-specific selectivity. This enhances and encourages the use of health insurance records for future epidemiologic studies.
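The two-step selectivity adjustment summarized in this abstract can be illustrated with a minimal sketch. It is not the study's model: it assumes purely synthetic data, replaces the benchmarked multi-level model with an ordinary least-squares regression on a single covariate, and replaces the semiparametric bootstrap with simple case resampling. All names (`x_fund`, `y_fund`, `x_bench`, `fit_ols`, `predict`, `bootstrap_ci`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions = 50

# Synthetic, purely illustrative data (NOT taken from the study).
# x_fund: inpatient diagnosis frequency among the fund's members, per region
# y_fund: prevalence among the fund's members, per region
x_fund = rng.uniform(0.02, 0.06, n_regions)
y_fund = 0.03 + 2.0 * x_fund + rng.normal(0.0, 0.002, n_regions)
# x_bench: inpatient diagnosis frequency in the whole population, per region
# (assumed here to be lower than in the fund, mimicking selectivity)
x_bench = 0.7 * x_fund


def fit_ols(x, y):
    """Step 1: estimate E[prevalence | inpatient frequency] on fund data."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict(beta, x):
    """Step 2: evaluate the fitted relation at the benchmark frequencies."""
    return beta[0] + beta[1] * x


def bootstrap_ci(x, y, x_new, n_boot=500, alpha=0.05, seed=1):
    """Percentile interval for the mean adjusted prevalence (case resampling)."""
    boot_rng = np.random.default_rng(seed)
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = boot_rng.integers(0, len(x), len(x))  # resample regions
        means[b] = predict(fit_ols(x[idx], y[idx]), x_new).mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


beta = fit_ols(x_fund, y_fund)          # step 1: conditional expectation
y_adjusted = predict(beta, x_bench)     # step 2: extrapolate to benchmarks
ci_low, ci_high = bootstrap_ci(x_fund, y_fund, x_bench)

print("mean fund prevalence:    ", y_fund.mean())
print("mean adjusted prevalence:", y_adjusted.mean())
print("95% percentile interval: ", (ci_low, ci_high))
```

In this toy setup the adjusted prevalence is lower than the fund-specific one only because the benchmark frequencies were constructed to be lower. The sketch shows the mechanics of the extrapolation and says nothing about the size of the adjustment in real data.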