Basic article information

Title: Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States
Authors: Xiang Ren; Zhongyuan Mi; Panos G. Georgopoulos; et al.
Journal: Environment International
Print ISSN: 0160-4120
Electronic ISSN: 1873-6750
Publication year: 2020
Volume: 142
Pages: 1-13
DOI: 10.1016/j.envint.2020.105827
Language: English
Publisher: Pergamon
Abstract:

Highlights
• Applied 13 Machine Learning (ML) algorithms for ozone within 2 modeling frameworks.
• Tuned sample weights to improve peak accuracy and balance with global accuracy.
• Assessed model’s interpolation and extrapolation ability via 6 targeted validations.
• Visualized complex patterns using 4 advanced black-box model interpretation tools.
• ML performed better than Land-Use Regression, especially for spatiotemporal modeling.
• Spatiotemporal models were more flexible and accurate than spatial models with ML.

Background: Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have been studied using different model structures. There is however a lack of comparisons of methods available within these two modeling frameworks, that can guide model/algorithm selection in air quality epidemiology.

Objective: The present study compares thirteen algorithms for spatial/spatiotemporal modeling applied for daily maxima of 8-hour running averages of ambient ozone concentrations at spatial resolutions corresponding to census tracts, to support estimation of annual ozone design values across the contiguous US. These algorithms were selected from nine representative categories and trained using predictors that included chemistry-transport model predictions, meteorological factors, land use and land cover, and stationary and mobile emissions.

Methods: To obtain the best predictive performance, model structures were optimized through a repeated coarse/fine grid search with expert knowledge. Six target-oriented validation strategies were used to prevent overfitting and avoid over-optimistic model evaluation results. In order to take full advantage of the power of different algorithms, we introduced tuning sample weights in spatiotemporal modeling to ensure predictive accuracy of peak concentrations, that is crucial for exposure assessments. In spatial modeling, four interpretation and visualization tools were introduced to explain predictions from different algorithms.

Results: Nonlinear ML methods achieved higher prediction accuracy than linear LUR, and the improvements were more significant for spatiotemporal modeling (nearly 10%-40% decrease of predicted RMSE). By tuning the sample weights, spatiotemporal models can predict concentrations used to calculate ozone design values that are comparable or even better than spatial models (nearly 30% decrease of cross-validated RMSE). We visualized the underlying nonlinear relationships, heterogeneous associations and complex interactions from the two best performing ML algorithms, i.e., Random Forest and Extreme Gradient Boosting, and found that the complex patterns were relatively less significant with respect to model accuracy for spatial modeling.

Conclusion: Machine Learning can provide estimates that are actually more interpretable and practical than linear regression to improve accuracy in modeling human exposures. A careful design of hyperparameter tuning and flexible data splitting and validations is crucial to obtain reliable and stable results. Desirable/successful nonlinear models are expected to capture similar nonlinear patterns and interactions using different ML algorithms.
Keywords: Machine learning; Land use regression; Ozone; Spatiotemporal modeling; Black-box model interpretation
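
Illustrative sketch: The abstract refers to two quantities that are easy to show concretely: the daily maximum of 8-hour running averages of ozone, and an RMSE in which peak concentrations can be given extra weight. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the threshold-based weighting rule, and the example values are all hypothetical, and the running-average function only uses windows that lie fully inside the supplied hourly series, which simplifies the regulatory definition.

```python
from math import sqrt


def mda8(hourly):
    """Maximum of the 8-hour running means over a list of hourly values.

    Simplification (assumption): only windows fully contained in `hourly`
    are considered; windows spilling into the next day are ignored.
    """
    if len(hourly) < 8:
        raise ValueError("need at least 8 hourly values")
    window_means = [sum(hourly[i:i + 8]) / 8 for i in range(len(hourly) - 7)]
    return max(window_means)


def weighted_rmse(observed, predicted, weights=None):
    """Root-mean-square error, optionally weighted per sample."""
    if weights is None:
        weights = [1.0] * len(observed)
    squared = sum(w * (o - p) ** 2
                  for w, o, p in zip(weights, observed, predicted))
    return sqrt(squared / sum(weights))


def peak_weights(observed, threshold, peak_weight):
    """Hypothetical weighting rule: up-weight samples at or above a threshold."""
    return [peak_weight if o >= threshold else 1.0 for o in observed]


# Made-up hourly ozone values for one day (units assumed to be ppb).
hourly = [20.0] * 8 + [40.0] * 8 + [60.0] * 8
daily_value = mda8(hourly)

# Made-up observed and predicted daily values; only the peak is mispredicted.
observed = [30.0, 40.0, 80.0]
predicted = [30.0, 40.0, 70.0]
plain = weighted_rmse(observed, predicted)
peaked = weighted_rmse(observed, predicted,
                       peak_weights(observed, threshold=70.0, peak_weight=4.0))
```

In this toy case the error sits entirely on the peak sample, so the peak-weighted RMSE is larger than the unweighted one. A model tuned against a peak-weighted loss is therefore pushed toward fitting high concentrations, which is the kind of trade-off with global accuracy that the highlights mention.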