Abstract: Automatic computer segmentation in three dimensions creates an opportunity to reduce the cost of three-dimensional treatment planning of radiotherapy for cancer treatment. Comparisons between human and computer accuracy in segmenting kidneys in CT scans generate far more distance values than there are CT scans. Such high dimension, low sample size (HDLSS) data present a grand challenge to statisticians: how do we find good estimates and make credible inference? We recommend discovering and using scientifically and statistically sufficient statistics as an additional strategy for overcoming the curse of dimensionality. First, we reduced the three-dimensional array of distances for each image comparison to a histogram to be modeled individually. Second, we used non-parametric kernel density estimation to explore distributional patterns and assess multi-modality. Third, a systematic exploratory search among parametric distributions and their truncated variations led to choosing a Gaussian form as approximating the distribution of a cube root transformation of distance. Fourth, representing each histogram by an individually estimated distribution eliminated the HDLSS problem by reducing, on average, 26,000 distances per histogram to just 2 parameter estimates. In the fifth and final step, we used classical statistical methods to demonstrate that the two human observers disagreed significantly less with each other than with the computer segmentation. Nevertheless, the size of all disagreements was clinically unimportant relative to the size of a kidney. The hierarchical modeling approach to object-oriented data created response variables deemed sufficient by both the scientists and the statisticians. We believe the same strategy provides a useful addition to the imaging toolkit and will succeed with many other high-throughput technologies in genetics, metabolomics, and chemical analysis.
Keywords: curse of dimensionality; genomics; metabolomics; microarray
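
To make the third and fourth steps concrete, the following is a minimal sketch of the dimension reduction described in the abstract: cube-root transform one image comparison's distances, then summarize the approximately Gaussian result by its two parameter estimates, with a kernel density estimate available for inspecting multi-modality. It is written in Python with NumPy/SciPy, which the paper does not specify; the function names and the simulated distances are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats


def summarize_comparison(distances):
    """Reduce one image comparison's distances (~26,000 values) to 2 sufficient statistics.

    Hypothetical sketch: apply the cube root transformation (approximately Gaussian
    per the abstract), then estimate the Gaussian mean and standard deviation.
    """
    z = np.cbrt(distances)            # cube root transformation of distance
    return z.mean(), z.std(ddof=1)    # 2 parameter estimates replacing the full histogram


def kde_on_grid(distances, grid_points=512):
    """Non-parametric kernel density estimate used to explore distributional shape."""
    z = np.cbrt(distances)
    kde = stats.gaussian_kde(z)
    grid = np.linspace(z.min(), z.max(), grid_points)
    return grid, kde(grid)


# Illustrative usage with simulated distances (not the study's data):
rng = np.random.default_rng(0)
simulated = rng.gamma(shape=2.0, scale=1.5, size=26_000)
mu_hat, sigma_hat = summarize_comparison(simulated)
```

Under this scheme, each comparison contributes only its pair of estimates (mu_hat, sigma_hat) to the downstream classical analyses, which is how the HDLSS problem is removed in the fourth step.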