
Article information

  • Title: Cross-validation pitfalls when selecting and assessing regression and classification models
  • Authors: Damjan Krstajic; Ljubomir J Buturovic; David E Leahy
  • Journal: Journal of Cheminformatics
  • Print ISSN: 1758-2946
  • Electronic ISSN: 1758-2946
  • Publication year: 2014
  • Volume: 6
  • Issue: 1
  • Pages: 10
  • DOI: 10.1186/1758-2946-6-10
  • Language: English
  • Publisher: BioMed Central
  • Abstract: We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
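
The idea of repeated grid-search V-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm as published; it is one plausible reading, under several assumptions: the model is a toy 1-D k-nearest-neighbour regressor, the tuning grid is the neighbour count k, the error is the mean squared error pooled over held-out points, and the data are synthetic. All function names here (`vfold_indices`, `knn_predict`, `cv_error`, `repeated_grid_search_cv`) are hypothetical.

```python
import random
from statistics import mean


def vfold_indices(n, v, rng):
    """Shuffle 0..n-1 and deal the indices into v folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(p[1] for p in nearest)


def cv_error(data, k, folds):
    """Squared error of k-NN, pooled over every held-out point of one split."""
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        sq_errors += [(knn_predict(train, data[i][0], k) - data[i][1]) ** 2
                      for i in fold]
    return mean(sq_errors)


def repeated_grid_search_cv(data, grid, v=5, repeats=10, seed=0):
    """Average the CV error of each grid value over `repeats` random splits."""
    rng = random.Random(seed)
    errors = {k: [] for k in grid}
    for _ in range(repeats):
        # One fresh split per repeat, shared by every grid value.
        folds = vfold_indices(len(data), v, rng)
        for k in grid:
            errors[k].append(cv_error(data, k, folds))
    mean_errors = {k: mean(e) for k, e in errors.items()}
    best = min(mean_errors, key=mean_errors.get)
    return best, mean_errors, errors


# Assumed toy data: a noisy quadratic, purely for illustration.
data_rng = random.Random(42)
xs = [data_rng.uniform(-3, 3) for _ in range(60)]
data = [(x, x * x + data_rng.gauss(0, 0.5)) for x in xs]

grid = [1, 3, 5, 9, 15]
best_k, mean_errors, errors = repeated_grid_search_cv(data, grid)
print("selected k:", best_k)
for k in grid:
    # The min-max spread shows how much one split alone could mislead.
    print(k, round(mean_errors[k], 3),
          round(min(errors[k]), 3), round(max(errors[k]), 3))
```

The per-split minimum and maximum printed for each k give a rough picture of the split-to-split variation the abstract refers to; selecting on the mean over repeats is meant to make the chosen parameter less dependent on any single partition.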