Abstract: Multi‐model climate experiments carried out as part of different phases of the Coupled Model Intercomparison Project (CMIP) are crucial for evaluating past and future climate change. The reliability of model simulations is often gauged by their ability to reproduce the historical climate across many time scales. This study compares the global mean surface air temperature from 29 CMIP6 models with observations from three datasets. We examine (1) warming and cooling rates in five subperiods from 1880 to 2014, (2) autocorrelation and long‐term persistence, (3) model performance based on probabilistic and entropy metrics, and (4) the distributional shape of temperature. All models simulate the observed long‐term warming trend from 1880 to 2014. Most models replicate the late twentieth century warming (1975–2014) and the hiatus (1942–1975). The post‐1998 warming is overestimated in 90% of the simulations. Only six of the 29 models reproduce the observed long‐term persistence. All models differ from observations in distributional shape. The varying performance across metrics reveals the challenge of determining the “best” model. We therefore argue that models should be selected using case‐specific metrics suited to the intended use. The metrics proposed here facilitate a comprehensive assessment for various applications.