Multi-Source-Data-Oriented Ensemble Learning Based PM2.5 Concentration Prediction in Shenyang

Abstract

Shenyang, which is surrounded by smokestack industries and depends on coal heating in winter, is a typical example of the cities in northeastern China that suffer from serious air pollution, especially PM2.5. Existing machine learning research, which relies on historical air-monitoring data and meteorological data, neither forecasts PM2.5 accurately nor identifies the key pollutants that contribute to it. This paper presents a multi-source-data-oriented ensemble learning framework for predicting PM2.5 concentration. The framework incorporates not only air quality data and weather data but also industrial emission data, especially from winter heating enterprises, in Shenyang and nearby cities; it also takes into account the location and emission frequency of pollution sources. All of these data are fed into an ensemble learning model based on Extreme Gradient Boosting (XGBoost) to predict PM2.5 concentration. This approach improves prediction accuracy and also provides a contribution analysis of the different pollutants. Experimental results show that the top two factors affecting PM2.5 concentration are (1) air pollutant emission quantities and (2) the distance from pollution sources to air-monitoring stations. Based on the importance of these two factors, we refine the feature selection and retrain the ensemble learning model, and find that the new model performs better on 72% of the evaluation metrics.
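The two top-ranked factors, emission quantity and source-to-station distance, can be illustrated with a minimal Python sketch of how such features might be assembled for one monitoring station. This is an assumption for illustration only: the abstract does not state how distance is computed, so great-circle (haversine) distance is used here, and the function names, field names, coordinates, and emission values are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def source_features(station, sources):
    """Build one feature row per pollution source, sorted nearest first.

    `station` and each entry of `sources` are plain dicts with hypothetical
    keys: "lat", "lon", and (for sources) "name" and "emission_t".
    """
    rows = []
    for src in sources:
        rows.append({
            "source": src["name"],
            "distance_km": haversine_km(station["lat"], station["lon"],
                                        src["lat"], src["lon"]),
            "emission_t": src["emission_t"],  # emission quantity, hypothetical unit
        })
    return sorted(rows, key=lambda r: r["distance_km"])


# Hypothetical station and sources; the numbers are made up for illustration.
station = {"lat": 41.80, "lon": 123.40}
sources = [
    {"name": "heating_plant_A", "lat": 41.90, "lon": 123.90, "emission_t": 12.5},
    {"name": "heating_plant_B", "lat": 41.82, "lon": 123.45, "emission_t": 4.0},
]
for row in source_features(station, sources):
    print(row["source"], round(row["distance_km"], 1), row["emission_t"])
```

Rows of this kind could be joined with air quality and weather records before training a gradient-boosted model; how the paper itself aggregates sources per station is not specified in the abstract.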

