摘要:Acquiring quantitative metrics-based knowledge about the performance of various space physics modeling approaches is central for the space weather community. Quantification of the performance helps the users of the modeling products to better understand the capabilities of the models and to choose the approach that best suits their specific needs. Further, metrics-based analyses are important for addressing the differences between various modeling approaches and for measuring and guiding the progress in the field. In this paper, the metrics-based results of the ground magnetic field perturbation part of the Geospace Environment Modeling 2008–2009 Challenge are reported. Predictions made by 14 different models, including an ensemble model, are compared to geomagnetic observatory recordings from 12 different northern hemispheric locations. Five different metrics are used to quantify the model performances for four storm events. It is shown that the ranking of the models is strongly dependent on the type of metric used to evaluate the model performance. None of the models rank near or at the top systematically for all used metrics. Consequently, one cannot pick the absolute “winner”: the choice for the best model depends on the characteristics of the signal one is interested in. Model performances vary also from event to event. This is particularly clear for root-mean-square difference and utility metric-based analyses. Further, analyses indicate that for some of the models, increasing the global magnetohydrodynamic model spatial resolution and the inclusion of the ring current dynamics improve the models' capability to generate more realistic ground magnetic field fluctuations.