Comparative Study on Predicting Particulate Matter (PM2.5) Levels Using LSTM Models

R. Balamurali
Chennai Institute of Technology
Partheeban Pachaivannan ( parthi011@yahoo.co.in )
Chennai Institute of Technology
P. Navin Elamparithi
National Institute of Technology
R. Rani Hemamalini
St. Peter’s Institute of Higher Education and Research
Research Article
Keywords: Air Pollution, Deep Learning, LSTM, PM2.5, Real-time air quality data, Regression
DOI: https://doi.org/10.21203/rs.3.rs-436897/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Abstract
In recent times, air pollution has attracted the attention of policymakers and researchers as an important
issue. The air that people breathe is contaminated by pollutants such as oxides of carbon, nitrogen and
sulphur, as well as minuscule dust particles smaller than 0.0025 mm in diameter. These emissions contain
many substances that are harmful to human health when people are exposed to them for a prolonged period
or above certain concentration levels. Recent advances in sensors and compact instruments make it
possible to measure pollutant concentrations with considerable ease. This paper predicts air pollution
using multiple deep learning models that are variations of the Long Short-Term Memory (LSTM) model. In
this research, only PM2.5 is considered for prediction. Real-time air quality data were collected at
selected places in the study area. The model predictions are found to match well with other researchers'
results and with the real-time data.
1. Introduction
Pollution caused by pollutants such as CO, CO2, NOx, SO2 and dust particles with diameters less than
0.0025 mm is one of the leading causes of death in India. Balakrishnan et al. [1] estimated that about
1.24 million deaths in India could be attributed to air pollution. The key purpose of this analysis was
to identify which deep learning models best predict PM2.5 concentrations using the dataset. The data
necessary for this experiment were collected from the Central Pollution Control Board of India.
Data were collected at 15 min intervals, and a type of artificial recurrent neural network called Long
Short-Term Memory was used to analyze the data and compare predictions. Data were collected from 3
monitoring stations in the city of Chennai, located in the neighbourhoods of Alandur, Manali and
Velachery. These pollution monitoring stations collect a variety of data, and the exploratory variables
were chosen from the data gathered at these stations. The data were then cleaned to obtain a more ordered
and complete dataset and to avoid any inaccuracies caused by missing data.
Air pollution has become a major concern in India in recent years, as large parts of the urban population
are exposed to some of the highest levels of pollution in the world [2]. The World Health Organization
estimates that the health effects of air pollution have increased the hazard risks in major cities of
India [3]. Many cities in India have a population of over 1 million, and some of them rank among the 10
most polluted cities in the world. Of the 3 million premature deaths that occur worldwide each year due
to outdoor and indoor air pollution, the highest number is estimated to occur in India. Of India's many
pollution problems, air pollution is the most severe.
Di et al. [4] developed three machine learning algorithms that predicted the levels of PM2.5 using a
dataset that included downscaled and uncertainty. Huang and Kuo [5] applied Long Short-Term Memory (LSTM)
and Convolutional Neural Network (CNN) models to predict the concentration of PM2.5 and compared the
results with other machine learning methods. Karimian et al. [6] applied three machine learning models to
forecast PM2.5 concentrations; their results explained 80% of the variability (R² = 0.8) in PM2.5
concentrations, and 75% of the pollution levels were predicted. Li et al. [7] used an attention mechanism
to capture the degree of significance of the effects of featured states at different times in the past on
future PM2.5 concentrations. Gorham et al. [8] studied PM2.5 using the Interagency Monitoring of
Protected Visual Environments (IMPROVE) network and the Chemical Speciation Network (CSN). They obtained
33 months of data from these two networks, which differ in operating structures, sampling practices,
analytical methods, analytical facilities, and data handling and validation practices. They concluded
that combining the CSN and IMPROVE datasets gives a better understanding of PM2.5 in urban and rural
areas.
Magazzino et al. [9] examined the relationship among COVID-19-related deaths, economic growth, and PM10,
PM2.5 and NO2 concentrations for New York, USA. They concluded that PM2.5 and NO2 are the most
significant pollutants for COVID-19 death rates and that economic growth increases pollution levels.
Habeebullah et al. [10] studied the impact of outdoor and indoor meteorological conditions on COVID-19
transmission in the Western Region of Saudi Arabia. They considered 10 outdoor and indoor meteorological
parameters and concluded that the highest daily COVID-19 case counts occurred when the temperature ranged
between 40.71 °C and 41.20 °C. Niu et al. [11] carried out a source analysis of heavy metal elements of
PM2.5 in a university canteen in winter. They analysed the indoor and outdoor PM2.5 at the canteen and
found concentrations of 99.43 μg/m³ indoors and 103.09 μg/m³ outdoors. They further found that more than
half of the PM2.5 penetrates from the adjacent outdoor area in the study location.
Li et al. [12] developed a novel long short-term memory neural network extended (LSTME) model with
spatiotemporal correlations. The authors used hourly PM2.5 data from Beijing, and the results showed a
mean absolute percentage error (MAPE) of 11.93%. In another study, Bansal et al. [13] used a deep
recurrent neural network (RNN) model for AQI prediction in Delhi; their LSTM model achieved good results
for pollutant concentrations. Yang et al. [14] carried out a study for Beijing, China, to predict PM2.5
using hourly data collected over one year. Their mean R² values varied from 0.l59 to 0.85 across 216
experiments. Wu et al. [15] developed a composite system that predicted both PM2.5 and PM10. They used
Moderate Resolution Imaging Spectroradiometer (MODIS) images with a 1 km spatial resolution and concluded
that the LSTM model is best for predicting PM2.5/PM10. Kim et al. [16] predicted PM using two sets of 3-D
chemistry-transport model (CTM) simulations, with indices of agreement ranging from 0.62 to 0.79.
Zhang et al. [17] carried out PM2.5 prediction for Wuhan and Chengdu using PM2.5 concentration data from
2015-2017. Meteorological data were also used in developing the model, which improved the results. Kleine
Deters et al. [18] adopted a machine learning method to predict PM2.5 using six years of meteorological
data. Their model showed that machine learning-based statistical models are useful for forecasting PM2.5
concentrations from meteorological data. Soh et al. [19] combined multiple neural networks to predict air
quality up to 48 h ahead. Their approach outperformed the then state-of-the-art methods. Xayasouk and Lee
[20] proposed a deep learning model to predict air quality in South Korea that used stacked autoencoders
to train and test data. Anurag et al. [21] used meteorological data to forecast AQI. This is the only
previous study carried out for Chennai city, and it used one of the monitoring stations from which the
data used here were collected.
2. Methods
2.1. About the study area
Chennai is located along the coast of the Bay of Bengal. It is the state capital of Tamil Nadu and the
fourth largest metropolis in India. Chennai lies between latitudes 12°50'49" and 13°17'24" and
longitudes 79°59'53" and 80°20'12". It is part of the Coromandel Coast along the eastern part of India
[22]. The terrain around Chennai is a flat coastal plain and, since it is close to the equator, the
climate is usually humid and hot. The highest temperatures are reached in May-June, generally around
40 °C for a few days, and the lowest temperatures occur in early January, with recorded temperatures of
about 20 °C throughout the month. Chennai is a major transport hub for road, rail, air and sea transport,
linking major inland and overseas cities. It is also one of India's most prominent educational centres,
with a range of institutions and research centres. The metropolitan area of Chennai covers some
1,189 sq. km.
2.2. About the dataset
The data were collected from the 3 Central Pollution Control Board (CPCB) monitoring stations in the city
of Chennai [19]. The stations are located at Alandur, Manali and Velachery and are shown in Fig. 1. The
exploratory variables collected from these locations were atmospheric pressure (BP), relative humidity
(RH), PM2.5, wind direction (WD, in degrees) and wind speed (WS). The data were recorded at 15 min
intervals from 00:00 on 01 May 2019 to 23:59 on 30 April 2020. Each station yielded a dataset of 35,039
data rows, for a total of 105,117 data rows. Approximately 78.28% of the PM2.5 values were missing. The
data were processed to remove any rows that had empty columns and were restricted to rows with PM2.5
levels below 2.5 × 10⁻⁴ mg/L (250 μg/m³). This reduced the dataset to 22,827 data rows, as certain
elements were missing in all the other rows.
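The cleaning rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the field names, the toy rows and the `clean` helper are hypothetical, and the authors' actual code is not given in the paper. The last lines check the arithmetic on the reported row counts, which implies that about 78.28% of the original rows were dropped.

```python
# Hypothetical column names; the paper lists BP, RH, PM2.5, WD and WS.
FIELDS = ("BP", "RH", "PM2.5", "WD", "WS")

# Threshold from the text, expressed in ug/m^3 (2.5e-4 mg/L).
PM25_LIMIT_UG_M3 = 250.0


def mg_per_l_to_ug_per_m3(value):
    """Convert mg/L to ug/m^3 (1 mg = 1e3 ug, 1 m^3 = 1e3 L)."""
    return value * 1_000_000


def clean(rows):
    """Keep rows with no missing field and PM2.5 below the limit."""
    kept = []
    for row in rows:
        if any(row.get(field) is None for field in FIELDS):
            continue  # drop rows with an empty column
        if row["PM2.5"] >= PM25_LIMIT_UG_M3:
            continue  # drop rows at or above the PM2.5 limit
        kept.append(row)
    return kept


# Toy rows made up for illustration; these are not CPCB measurements.
rows = [
    {"BP": 990.0, "RH": 60.0, "PM2.5": 40.0, "WD": 170.0, "WS": 2.0},
    {"BP": 990.0, "RH": 60.0, "PM2.5": None, "WD": 170.0, "WS": 2.0},
    {"BP": 990.0, "RH": None, "PM2.5": 35.0, "WD": 170.0, "WS": 2.0},
    {"BP": 990.0, "RH": 60.0, "PM2.5": 300.0, "WD": 170.0, "WS": 2.0},
]
cleaned = clean(rows)

# Row counts reported in the text.
total_rows = 35_039 * 3
removed_share = 1 - 22_827 / total_rows
print(len(cleaned), total_rows, round(removed_share * 100, 2))
```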
The statistical summary of the dataset is shown in Table 1, which gives an insight into how the dataset
is structured. The exploratory variables in Table 1 were then used to plot a heatmap: Figure 2 shows the
correlation between the different exploratory variables in the dataset. Figure S1 shows the first 5 rows
of the dataset. Figure S2 provides statistics about the dataset such as the central tendency, dispersion
and shape of its distribution. Figure S3 provides information about the types of data stored in the
dataset.
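A correlation heatmap of this kind is normally drawn from a pairwise correlation matrix. The sketch below assumes the Pearson coefficient, which the paper does not state explicitly; the column values are invented and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented values for illustration only.
columns = {
    "RH": [50.0, 60.0, 70.0, 80.0],
    "PM2.5": [20.0, 35.0, 30.0, 55.0],
    "WS": [3.0, 2.5, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
```

Each cell of the heatmap would then be coloured by the corresponding entry of `matrix`.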
2.3. About DL and RNN
Deep learning is a subset of machine learning methods based on representation learning and artificial
neural networks. Learning can be of 3 types: unsupervised, supervised or semi-supervised. Deep learning
architectures such as deep belief networks, deep neural networks, recurrent neural networks and
convolutional neural networks are applied to speech recognition, natural language processing, computer
vision, audio recognition, machine translation, drug design, social network filtering, medical image
analysis, bioinformatics, material inspection and board game programs. RNNs are the basis for the LSTMs
used in the models. They are a class of artificial neural networks (ANNs) that use their internal state
to process input sequences of variable length. RNNs can be defined as a generalized form of feedforward
neural networks: they can use previous outputs as inputs, together with hidden states. RNNs also have the
added advantage of being able to process inputs of varying lengths, and the size of the model does not
change with the size of the input. The disadvantage is that RNNs are very difficult and time-consuming to
train.
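The idea that an RNN reuses the same parameters at every step, and can therefore process inputs of any length, can be illustrated with a minimal scalar recurrence. The weights below are arbitrary illustrative values, and this is a plain tanh cell, not the LSTM used in the study.

```python
import math


def rnn_final_state(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Run a one-unit tanh RNN over a sequence and return the last hidden state.

    The same three parameters (w_x, w_h, b) are reused at every step, so the
    model size is independent of the sequence length.
    """
    h = h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
    return h


short_run = rnn_final_state([0.1, 0.2])
long_run = rnn_final_state([0.1, 0.2, 0.3, 0.4, 0.5])
```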
3. PM2.5 Prediction Using Deep Learning
The deep learning approaches used here were different variations of the artificial recurrent neural
network (RNN) called long short-term memory (LSTM). LSTMs were introduced by Hochreiter & Schmidhuber in
1997 and can learn long-term dependencies [23]. LSTMs have varied uses and multiple ways of
implementation. The methods studied here tackle time-series prediction problems, namely:
1. LSTM Network for Regression
2. LSTM for Regression with Time Steps
3. LSTM with Memory Between Batches
4. Stacked LSTMs with Memory Between Batches
A generic LSTM unit has three gates that regulate the flow of information within the unit: the input,
output and forget gates. For all the models, the dataset was split into training and testing datasets.
Two-thirds of the data were used to train the models and the remaining one-third was used to test them.
All the models were trained for both 100 and 1,000 epochs.
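For reference, the three gates can be written in the commonly used formulation of an LSTM unit with a forget gate. The symbols are not taken from the paper and are introduced here as assumptions: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

The forget gate scales the previous cell state, the input gate scales the new candidate, and the output gate controls how much of the cell state is exposed as the hidden state.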
3.1. LSTM Network for Regression
The network has three layers, with the visible layer having one input. The hidden layer was made up of 4
LSTM units and the output layer produced a single-value prediction. The model is fitted to the data, from
which its performance on the train and test datasets can be estimated. The model is then used to make
predictions on both the train and test datasets, from which the skill of the model can be assessed
visually.
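To give a sense of how small this network is, the sketch below counts the trainable parameters of a standard LSTM layer (four weight sets, no peephole connections) followed by a dense output, and applies a chronological two-thirds split. The helper names are hypothetical, and the exact split rule (integer floor, no shuffling) is an assumption rather than something the paper specifies.

```python
def lstm_param_count(input_dim, units):
    """Parameters of a standard LSTM layer.

    Each of the four weight sets (input, forget and output gates plus the
    candidate state) has input weights, recurrent weights and a bias.
    """
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, outputs):
    """Weights plus biases of a fully connected layer."""
    return input_dim * outputs + outputs


def chronological_split(series):
    """First two-thirds for training, the rest for testing (order preserved)."""
    cut = (2 * len(series)) // 3
    return series[:cut], series[cut:]


lstm_params = lstm_param_count(input_dim=1, units=4)
dense_params = dense_param_count(input_dim=4, outputs=1)
total_params = lstm_params + dense_params

# Applied to the 22,827 cleaned rows reported in Section 2.2.
train, test = chronological_split(list(range(22_827)))
print(total_params, len(train), len(test))
```

Under these assumptions the network has 101 trainable parameters, and the split gives 15,218 training rows and 7,609 testing rows.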
Fig. 3(a) shows the PM2.5 values against time for 100 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1552 × 10⁻⁴ mg/L for the training dataset and 0.1289 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.77 and 0.67 for the training and testing datasets, respectively. Fig. 3(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets.
Fig. 4(a) shows the PM2.5 values against time for 1,000 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1553 × 10⁻⁴ mg/L for the training dataset and 0.1276 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.77 and 0.68 for the training and testing datasets, respectively. Fig. 4(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets. It can be inferred that training for 100 or 1,000 epochs makes no major
difference to the results, and that the model fits both the training and testing datasets well.
3.2. LSTM for Regression with Time Steps
Time steps can be used as inputs to predict the output at the next step, which provides another way of
framing the time series problem. For any point of failure or surge, the conditions that lead up to it
are the features that define a time step.
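Framing a series with time steps amounts to turning it into overlapping windows, where each window of past observations is one sample and the value that follows is the target. The sketch below assumes a look-back of 3 steps and a single feature; the paper does not report the look-back it used, and `make_windows` is a hypothetical helper.

```python
def make_windows(series, look_back):
    """Build (samples, time_steps, features) inputs and next-step targets."""
    inputs, targets = [], []
    for start in range(len(series) - look_back):
        window = series[start:start + look_back]
        # One feature (PM2.5) per time step.
        inputs.append([[value] for value in window])
        targets.append(series[start + look_back])
    return inputs, targets


# Invented PM2.5 readings for illustration.
series = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]
X, y = make_windows(series, look_back=3)
```

Here `X` holds 3 samples of 3 time steps each, and `y` holds the reading that follows each window.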
Fig. S4(a) shows the PM2.5 values against time for 100 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1500 × 10⁻⁴ mg/L for the training dataset and 0.1329 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.79 and 0.65 for the training and testing datasets, respectively. Fig. S4(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets.
Fig. S5(a) shows the PM2.5 values against time for 1,000 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1483 × 10⁻⁴ mg/L for the training dataset and 0.1394 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.79 and 0.62 for the training and testing datasets, respectively. Fig. S5(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets. It can be inferred that training for 100 or 1,000 epochs makes no major
difference to the results, and that the model fits both the training and testing datasets well.
3.3. LSTM with Memory Between Batches
The LSTMs were implemented in Python through the Keras deep learning library, which supports both
stateless and stateful LSTMs. Stateful LSTMs provide finer control over the internal state of the LSTM
and over when that state is reset. This can be used to make predictions by building up state over the
entire training sequence.
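The difference between resetting and carrying the internal state across batches can be shown with a toy recurrence. This sketch uses a one-unit tanh cell with arbitrary weights instead of an LSTM, and it does not use the Keras API; it only illustrates the idea of memory between batches.

```python
import math


def run_batch(batch, h, w_x=0.5, w_h=0.8):
    """Advance a one-unit tanh cell over one batch, starting from state h."""
    for x in batch:
        h = math.tanh(w_x * x + w_h * h)
    return h


def run_batches(batches, stateful):
    """Process batches in order, optionally carrying the state between them."""
    h = 0.0
    for batch in batches:
        if not stateful:
            h = 0.0  # stateless: forget everything at each batch boundary
        h = run_batch(batch, h)
    return h


sequence = [0.2, 0.4, 0.6, 0.8]
batches = [sequence[:2], sequence[2:]]

whole = run_batch(sequence, 0.0)
stateful_final = run_batches(batches, stateful=True)
stateless_final = run_batches(batches, stateful=False)
```

With the state carried over, processing the two batches gives the same final state as processing the whole sequence at once; with a reset at each boundary, only the last batch influences the result.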
Fig. S6(a) shows the PM2.5 values against time for 100 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1602 × 10⁻⁴ mg/L for the training dataset and 0.1653 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.76 and 0.46 for the training and testing datasets, respectively. Fig. S6(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets.
Fig. S7(a) shows the PM2.5 values against time for 1,000 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1582 × 10⁻⁴ mg/L for the training dataset and 0.1648 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.76 and 0.46 for the training and testing datasets, respectively. Fig. S7(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets. It can be inferred that training for 100 or 1,000 epochs makes no major
difference to the results, and that the model fits both the training and testing datasets well.
3.4. Stacked LSTMs with Memory Between Batches
Stacked LSTMs are an extension of normal LSTMs, which have a single hidden layer. Stacked LSTMs instead
have multiple hidden layers, each with multiple memory cells. Stacking LSTM layers makes the model deeper
and thus justifies the use of the term deep learning.
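Stacking means that each recurrent layer passes its full output sequence to the next layer, and only the last layer's final output is used for prediction. The sketch below shows this wiring with one-unit tanh layers and arbitrary weights; it is a simplified stand-in for stacked LSTM layers, and the number of layers used in the study is not assumed here.

```python
import math


def recurrent_layer(inputs, w_x, w_h):
    """Return the hidden state at every step of a one-unit tanh layer."""
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs


def stacked_forward(inputs, layers):
    """Feed each layer's output sequence into the next; return the last value."""
    sequence = inputs
    for w_x, w_h in layers:
        sequence = recurrent_layer(sequence, w_x, w_h)
    return sequence[-1]


inputs = [0.2, 0.4, 0.6]
single = stacked_forward(inputs, [(0.5, 0.8)])
stacked = stacked_forward(inputs, [(0.5, 0.8), (0.7, 0.3)])
```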
Fig. S8(a) shows the PM2.5 values against time for 100 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1597 × 10⁻⁴ mg/L for the training dataset and 0.1724 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.76 and 0.40 for the training and testing datasets, respectively. Fig. S8(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets.
Fig. S9(a) shows the PM2.5 values against time for 1,000 epochs. Green indicates the training dataset and
red indicates the testing dataset. The RMSE values obtained indicate that the model has an average error
of 0.1595 × 10⁻⁴ mg/L for the training dataset and 0.1717 × 10⁻⁴ mg/L for the testing dataset. The R²
values obtained were 0.76 and 0.42 for the training and testing datasets, respectively. Fig. S9(b) shows
the LSTM trained on regression for the dataset and the comparison of predicted values (blue) against the
training and testing datasets. It can be inferred that training for 100 or 1,000 epochs makes no major
difference to the results, and that the model fits both the training and testing datasets well.
4. Results And Discussions
The collected dataset was divided into two parts: two-thirds of the data were used to train the models,
and the remaining one-third was used to test the performance of the developed models and to benchmark
them against each other. The root-mean-square error (RMSE) and coefficient of determination (R²) were
used to evaluate the performance of the different models presented in this paper.
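Both metrics are simple to compute from observed and predicted values. The sketch below assumes the usual definition of the coefficient of determination, R² = 1 - SS_res / SS_tot, which the paper does not spell out; the sample values are invented.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented values for illustration, in ug/m^3.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

RMSE is expressed in the units of the data, so lower is better, while R² is unitless and approaches 1 for a perfect fit.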
Four different variations of the LSTM model were compared and tested based on performance. All four
models were trained and tested using the same datasets. The dataset contained the PM2.5 mass
concentrations from the three air quality monitoring stations in the city of Chennai, and these were
predicted to evaluate performance. The dataset was cleaned to remove any incomplete data and any data
that exceeded 2.5 × 10⁻⁴ mg/L, in order to provide more uniform predictions and results.
The prediction results of the four models are compared in terms of RMSE and R² in Tables 2 and 3. The
prediction horizon is 15 min, and most prior researchers have not used such a short horizon. All four DL
models have similar RMSE values for the given dataset. The R² values are also similar for the training
dataset, but there is a decrease in R² for the testing dataset. This difference in R² between the
training and testing datasets can be indicative of overfitting by the models.
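The size of the train-test gap can be read directly off Table 2. The short script below uses the 100-epoch values from that table and reports, for each model, the absolute RMSE gap and the drop in R²; the helper and variable names are hypothetical.

```python
# (train RMSE, test RMSE, train R2, test R2) for 100 epochs, from Table 2.
TABLE_2 = {
    "LSTM Network for Regression": (15.52, 12.89, 0.77, 0.67),
    "LSTM for Regression with Time Steps": (15.00, 13.29, 0.79, 0.65),
    "LSTM with Memory Between Batches": (16.02, 16.53, 0.76, 0.46),
    "Stacked LSTMs with Memory Between Batches": (15.97, 17.24, 0.76, 0.40),
}


def gaps(table):
    """Absolute train-test RMSE gap and train-to-test R2 drop per model."""
    result = {}
    for name, (rmse_train, rmse_test, r2_train, r2_test) in table.items():
        result[name] = {
            "rmse_gap": round(abs(rmse_train - rmse_test), 2),
            "r2_drop": round(r2_train - r2_test, 2),
        }
    return result


table_2_gaps = gaps(TABLE_2)
smallest_r2_drop = min(table_2_gaps, key=lambda n: table_2_gaps[n]["r2_drop"])
smallest_rmse_gap = min(table_2_gaps, key=lambda n: table_2_gaps[n]["rmse_gap"])
print(smallest_r2_drop, smallest_rmse_gap)
```

On these values the R² drop ranges from 0.10 for the LSTM Network for Regression to 0.36 for the stacked model, while the smallest RMSE gap (0.51) belongs to the LSTM with Memory Between Batches.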
All the models were trained for both 100 and 1,000 epochs and show very similar results in both cases, in
terms of both RMSE and R². This suggests that the models had already reached their minimum error by 100
epochs; a model should be trained until it reaches its minimum error and not beyond, as further training
may cause overfitting. The results also indicate that these models are suitable for predicting future
urban PM2.5 concentrations.
Nonetheless, the research has some limitations. Emissions have a large effect on air quality, but because
emission data are hard to acquire, the data used in this paper do not include emissions from factories
and vehicles in the region. This affects the accuracy of the models' predictions. Also, when certain
accidents cause a sudden increase in pollution, the concentration of PM2.5 changes abruptly. Whether the
proposed models can forecast such events well still needs to be shown.
5. Conclusions
All the models were developed to predict PM2.5 concentrations, with the LSTM for Regression with Time
Steps showing the best training results for 100 epochs. All four models produce very similar RMSE values
for both the training and testing datasets. The smallest difference between training and testing RMSE was
in the LSTM with Memory Between Batches variation, while the lowest training and testing RMSE values were
observed in the LSTM for Regression with Time Steps and the LSTM Network for Regression, respectively.
The R² values were consistent across models for the training dataset but varied widely for the testing
dataset. From this, it can be concluded that the LSTM Network for Regression produced the best results,
as it showed little to no overfitting. While these results provide some insight into which models might
be appropriate for predicting PM2.5 values, all of the models were trained for 100 and 1,000 epochs with
little to no variation in prediction accuracy. There is also a need to introduce more statistical
analysis techniques, and introducing more exploratory variables should improve the models' performance
and open new avenues for studying such variables and the methods to analyze them. The development of such
models is very useful to the city of Chennai, which is a vital industrial centre and a region of economic
importance. The models can be used to identify factors that affect air pollution within the city and
thereby help reduce pollution levels and their impact on the city's inhabitants. The models can also be
extended to other cities in India and around the world, and thereby help improve the quality of life of
people worldwide.
Declarations
Acknowledgments
The authors would like to thank the Government of India and the Central Pollution Control Board for
providing access to the data collected and stored on their website.
Author Contributions
R. B (Associate Professor) contributed to problem identification, literature collection and data
analysis. P. P (Professor) was involved in problem identification, literature collection, data collection
and manuscript preparation. P. N. E (B. Tech Student) contributed to data collection, data cleaning, data
analysis, model development and interpretation of results. R. R. H (Professor) contributed to data
analysis and manuscript preparation.
References
1. Balakrishnan, K. et al. The impact of air pollution on deaths, disease burden, and life expectancy across the states of India: the Global Burden of Disease Study 2017. Lancet Planet. Heal. 3, e26–e39 (2019).
2. Smith, K. R. Managing the Risk Transition. Toxicol. Ind. Health 7, 319–327 (1991).
3. Lippmann, M. Environmental Toxicants: Human Exposures and Their Health Effects, 3rd Edition. Chromatographia 71, 555–555 (2010).
4. Di, Q. et al. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 130, 104909 (2019).
5. Huang, C.-J. & Kuo, P.-H. A Deep CNN-LSTM Model for Particulate Matter (PM2.5) Forecasting in Smart Cities. Sensors 18, 2220 (2018).
6. Karimian, H. et al. Evaluation of Different Machine Learning Approaches to Forecasting PM2.5 Mass Concentrations. Aerosol Air Qual. Res. 19, 1400–1410 (2019).
7. Li, S. et al. Urban PM2.5 Concentration Prediction via Attention-Based CNN–LSTM. Appl. Sci. 10, 1953 (2020).
8. Gorham, K. A., Raffuse, S. M., Hyslop, N. P. & White, W. H. Comparison of recent speciated PM2.5 data from collocated CSN and IMPROVE measurements. Atmos. Environ. 244, 117977 (2021).
9. Magazzino, C., Mele, M. & Sarkodie, S. A. The nexus between COVID-19 deaths, air pollution and economic growth in New York state: Evidence from Deep Machine Learning. J. Environ. Manage. 286, 112241 (2021).
10. Habeebullah, T. M., Abd El-Rahim, I. H. A. & Morsy, E. A. Impact of Outdoor and Indoor Meteorological Conditions on the COVID-19 transmission in the Western Region of Saudi Arabia. J. Environ. Manage. 184, 112392 (2021).
11. Niu, Y., Wang, F., Liu, S. & Zhang, W. Source analysis of heavy metal elements of PM2.5 in canteen in a university in winter. Atmos. Environ. 244, 117879 (2021).
12. Li, X. et al. Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation. Environ. Pollut. 231, 997–1004 (2017).
13. Bansal, M., Aggarwal, A. & Verma, T. Air Quality Index Prediction of Delhi using LSTM. Int. J. Emerg. Trends Technol. Comput. Sci. 8, 59–68 (2019).
14. Yang, M., Fan, H. & Zhao, K. PM2.5 Prediction with a Novel Multi-Step-Ahead Forecasting Model Based on Dynamic Wind Field Distance. Int. J. Environ. Res. Public Health 16, 4482 (2019).
15. Wu, X., Wang, Y., He, S. & Wu, Z. PM2.5/PM10 ratio prediction based on a long short-term memory neural network in Wuhan, China. Geosci. Model Dev. 13, 1499–1511 (2020).
16. Kim, H. S. et al. Development of a daily PM10 and PM2.5 prediction system using a deep long short-term memory neural network model. Atmos. Chem. Phys. 19, 12935–12951 (2019).
17. Zhang, S. et al. LSTM-based air quality predicted model for large cities in China. Nat. Environ. Pollut. Technol. 19, 229–236 (2020).
18. Kleine Deters, J., Zalakeviciute, R., Gonzalez, M. & Rybarczyk, Y. Modeling PM2.5 Urban Pollution Using Machine Learning and Selected Meteorological Parameters. J. Electr. Comput. Eng. 2017, 1–14 (2017).
19. Soh, P.-W., Chang, J.-W. & Huang, J.-W. Adaptive Deep Learning-Based Air Quality Prediction Model Using the Most Relevant Spatial-Temporal Relations. IEEE Access 6, 38186–38199 (2018).
20. Xayasouk, T. & Lee, H. Air Pollution Prediction System Using Deep Learning. WIT Trans. Ecol. Environ. 230, 71–79 (2018).
21. Anurag, N. V., Burra, Y., Sharanya, S. & MG, G. Air Quality Index Prediction using Meteorological Data using Featured Based Weighted Xgboost. Int. J. Innov. Technol. Explor. Eng. 8, 1026–1029 (2019).
22. Sivacoumar, R. & Jayabalou, R. Assessment of source contribution to ambient air quality through comprehensive emission inventory, long-term monitoring and deterministic modeling. Int. J. Environ. Sci. Technol. 16, 2765–2782 (2019).
23. Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Comput. 9, 1735–1780 (1997).
Tables
Table 1 Statistical summary of the exploratory variables in the dataset
Statistical Measures BP (Pa) RH (%) PM2.5 (μg/m³) WD (deg) WS (m/s)
Mean 988.028 63.801 40.461 170.322 1.959
Standard Deviation 51.422 18.249 32.170 75.237 1.254
Minimum 751.120 4.450 0.010 0.070 0.020
Maximum 1043.600 100.000 249.680 359.960 10.330

Table 2 Comparison of RMSE and R2 values across the four prediction models for 100 epochs
Type of LSTM Models RMSE R2
Training Testing Training Testing
LSTM Network for Regression 15.52 12.89 0.77 0.67
LSTM for Regression with Time Steps 15.00 13.29 0.79 0.65
LSTM with Memory Between Batches 16.02 16.53 0.76 0.46
Stacked LSTMs with Memory Between Batches 15.97 17.24 0.76 0.40
Table 3 Comparison of RMSE and R2 values across the four prediction models for 1000 epochs
Type of LSTM Models RMSE R2
Training Testing Training Testing
LSTM Network for Regression 15.53 12.76 0.77 0.68
LSTM for Regression with Time Steps 14.83 13.94 0.79 0.62
LSTM with Memory Between Batches 15.82 16.48 0.76 0.46
Stacked LSTMs with Memory Between Batches 15.95 17.17 0.76 0.42
Figures
Figure 1
Location of air quality monitoring stations. Note: The designations employed and the presentation of the
material on this map do not imply the expression of any opinion whatsoever on the part of Research
Square concerning the legal status of any country, territory, city or area or of its authorities, or concerning
the delimitation of its frontiers or boundaries. This map has been provided by the authors.
Figure 2
Heatmap of exploratory variables
Figure 3
(a) PM2.5 values vs time (b) Comparison of observed vs predicted PM2.5 values trained for 100 epochs
Figure 4
(a) PM2.5 values vs time (b) Comparison of observed vs predicted PM2.5 values trained for 1,000 epochs
Supplementary Files
This is a list of supplementary files associated with this preprint.
SupplementaryMaterial.docx
Article
Full-text available
Various approaches have been proposed to model PM 2.5 in the recent decade, with satellite-derived aerosol optical depth, land-use variables, chemical transport model predictions, and several meteorological variables as major predictor variables. Our study used an ensemble model that integrated multiple machine learning algorithms and predictor variables to estimate daily PM 2.5 at a resolution of 1 km × 1 km across the contiguous United States. We used a generalized additive model that accounted for geographic difference to combine PM 2.5 estimates from neural network, random forest, and gradient boosting. The three machine learning algorithms were based on multiple predictor variables, including satellite data, meteorological variables, land-use variables, elevation, chemical transport model predictions, several reanalysis datasets, and others. The model training results from 2000 to 2015 indicated good model performance with a 10-fold cross-validated R 2 of 0.86 for daily PM 2.5 predictions. For annual PM 2.5 estimates, the cross-validated R 2 was 0.89. Our model demonstrated good performance up to 60 μg/m 3. Using trained PM 2.5 model and predictor variables, we predicted daily PM 2.5 from 2000 to 2015 at every 1 km × 1 km grid cell in the contiguous United States. We also used localized land-use variables within 1 km × 1 km grids to downscale PM 2.5 predictions to 100 m × 100 m grid cells. To characterize uncertainty, we used meteorological variables, land-use variables, and elevation to model the monthly standard deviation of the difference between daily monitored and predicted PM 2.5 for every 1 km × 1 km grid cell. This PM 2.5 prediction dataset, including the downscaled and uncertainty predictions, allows epidemiologists to accurately estimate the adverse health effect of PM 2.5. Compared with model performance of individual base learners, an ensemble model would achieve a better overall estimation. 
It is worth exploring other ensemble model formats to synthesize estimations from different models or from different groups to improve overall performance.
Article
Full-text available
Background: Air pollution is a major planetary health risk, with India estimated to have some of the worst levels globally. To inform action at subnational levels in India, we estimated the exposure to air pollution and its impact on deaths, disease burden, and life expectancy in every state of India in 2017. Methods: We estimated exposure to air pollution, including ambient particulate matter pollution, defined as the annual average gridded concentration of PM2.5, and household air pollution, defined as percentage of households using solid cooking fuels and the corresponding exposure to PM2.5, across the states of India using accessible data from multiple sources as part of the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2017. The states were categorised into three Socio-demographic Index (SDI) levels as calculated by GBD 2017 on the basis of lag-distributed per-capita income, mean education in people aged 15 years or older, and total fertility rate in people younger than 25 years. We estimated deaths and disability-adjusted life-years (DALYs) attributable to air pollution exposure, on the basis of exposure-response relationships from the published literature, as assessed in GBD 2017; the proportion of total global air pollution DALYs in India; and what the life expectancy would have been in each state of India if air pollution levels had been less than the minimum level causing health loss. Findings: The annual population-weighted mean exposure to ambient particulate matter PM2·5 in India was 89·9 μg/m3 (95% uncertainty interval [UI] 67·0-112·0) in 2017. Most states, and 76·8% of the population of India, were exposed to annual population-weighted mean PM2·5 greater than 40 μg/m3, which is the limit recommended by the National Ambient Air Quality Standards in India. Delhi had the highest annual population-weighted mean PM2·5 in 2017, followed by Uttar Pradesh, Bihar, and Haryana in north India, all with mean values greater than 125 μg/m3. 
The proportion of population using solid fuels in India was 55·5% (54·8-56·2) in 2017, which exceeded 75% in the low SDI states of Bihar, Jharkhand, and Odisha. 1·24 million (1·09-1·39) deaths in India in 2017, which were 12·5% of the total deaths, were attributable to air pollution, including 0·67 million (0·55-0·79) from ambient particulate matter pollution and 0·48 million (0·39-0·58) from household air pollution. Of these deaths attributable to air pollution, 51·4% were in people younger than 70 years. India contributed 18·1% of the global population but had 26·2% of the global air pollution DALYs in 2017. The ambient particulate matter pollution DALY rate was highest in the north Indian states of Uttar Pradesh, Haryana, Delhi, Punjab, and Rajasthan, spread across the three SDI state groups, and the household air pollution DALY rate was highest in the low SDI states of Chhattisgarh, Rajasthan, Madhya Pradesh, and Assam in north and northeast India. We estimated that if the air pollution level in India were less than the minimum causing health loss, the average life expectancy in 2017 would have been higher by 1·7 years (1·6-1·9), with this increase exceeding 2 years in the north Indian states of Rajasthan, Uttar Pradesh, and Haryana. Interpretation: India has disproportionately high mortality and disease burden due to air pollution. This burden is generally highest in the low SDI states of north India. Reducing the substantial avoidable deaths and disease burden from this major environmental risk is dependent on rapid deployment of effective multisectoral policies throughout India that are commensurate with the magnitude of air pollution in each state. Funding: Bill & Melinda Gates Foundation; and Indian Council of Medical Research, Department of Health Research, Ministry of Health and Family Welfare, Government of India.
Article
Meteorological conditions may influence the incidence of many infectious diseases. Coronavirus disease-2019 (COVID-19) is a highly contagious, air-borne, emerging, viral disease caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). In 2020, the COVID-19 global pandemic affected more than 210 countries and territories worldwide including Saudi Arabia. There are contradictory research papers about the correlation between meteorological parameters and incidence of COVID-19 in some countries worldwide. The current study investigates the impact of outdoor and indoor meteorological conditions on the daily recorded COVID-19 cases in western region (Makkah and Madinah cities) of Saudi Arabia over a period of 8 months from March to October 2020. Reports of the daily confirmed COVID-19 cases from the webpage of Saudi Ministry of Health (MOH) were used. Considering, the incubation period of COVID-19 which ranged from 2-14 days, the relationships between daily COVID-19 cases and outdoor meteorological factors (temperature, relative humidity, and wind speed) using a lag time of 10 days are investigated. The results showed that the highest daily COVID-19 cases in Makkah and Madinah were reported during the hottest months of the year (April - July 2020) when outdoor temperature ranged from 26.51 - 40.71 °C in Makkah and of 23.89 - 41.20 °C in Madinah, respectively. Partial negative correlation was detected between outdoor relative humidity and daily recorded COVID-19 cases. No obvious correlation could be demonstrated between wind speed and daily COVID-19 cases. This indicated that most of SARS-CoV-2 infection occurred in the cool, air-conditioned, dry, and bad-ventilated indoor environment in the investigated cities. These results will help the epidemiologists to understand the correlation between both outdoor and indoor meteorological conditions and SARS-CoV-2 transmissibility. 
These findings would be also a useful supplement to assist the local healthcare policymakers to implement and apply a specific preventive measures and education programs for controlling of COVID-19 transmission.
Article
As long-term speciated PM2.5 monitoring programs, the Interagency Monitoring of Protected Visual Environments (IMPROVE) and Chemical Speciation Network (CSN) were designed with different objectives but apply similar analytical methods to 24hr filter samples and report many of the same species. The two networks have different operating structures, sampling practices, analytical methods, analytical facilities, and data handling and validation practices, which require attention when data from the two networks are combined in an analysis. Data from collocated CSN and IMPROVE sites from January 1, 2016 through September 30, 2018 are presented to document the comparability between the networks. While species measured well above the method detection limit (MDL) generally agree well during this period, there is evidence of some inter-network bias for fine-soil-related elements at specific locations, as well as subtle biases for some well-measured species. Many species – particularly for CSN – are measured at or near the MDL and have poor inter- and intra-network collocated agreement; caution should be used when advancing findings on such measurements. However, comparison of reconstructed mass shows good inter-network agreement suggesting that the networks are effective at quantifying predominant mass species.
Article
Indoor air particulate samples were collected in the first floor of the Xingyuan canteen of Nanjing University of Science and Technology (NJUST) in Nanjing during the winter season. Meanwhile, outdoor air particulate samples were collected on the roof of a building that is 28 m away from the canteen. The mean PM2.5 (fine particulate matter) concentrations of the indoor and outdoor samples were found to be 99.43 and 103.09 μg/m³, respectively. Through correlation analysis, it was found that more than half of the PM2.5 penetrates from the adjacent outdoor area into the canteen. Inductively coupled plasma optical emission spectrometry (ICP-OES) was used to determine the concentration of heavy metals (As, Cd, Cr, Cu, Fe, Mn, Ni, Pb, Zn) in the PM2.5, revealing that the concentration of As, Mn and Cd in the canteen exceeded health standards. Positive Matrix Factorization (PMF) was used to identify the pollution sources of the PM2.5-related heavy metals in the canteen, revealing the following sources in descending order: cooking (34.7%), fuel combustion (28.9%), canteen kitchenware (14.4%), transportation (9.6%), indoor building materials (8%) and the Earth's crust (4.4%). Enrichment factor analysis revealed the source of the excessive As in the canteen to be the outdoor air and the cooking of a large amount of meat in the canteen. The outdoor air contained excessive As and infiltrated the canteen. In addition, the Earth's crust was found to be the source of excessive Mn in the canteen, while transportation was the cause of excessive Cd.
Article
In the present study, air pollution monitoring was carried out in Chennai city continuously for more than 3 decades from 1978 to 2016 and air quality trends are established for planning mitigation measures. An extensive air pollution monitoring network consisting of 19 sampling locations covering traffic corridors and intersections, residential, commercial and industrial areas was operated to monitor dust and gaseous pollutants, toxic trace metals, polycyclic aromatic hydrocarbons (PAHs) and other criteria pollutants. Comprehensive emission inventory indicated contribution of pollution load is mainly from transport (80%) followed by domestic (13%), industry (4%), commercial activities (2%) and power back generators (1%). The air pollutant concentrations were high during day time in winter season at traffic corridors, intersections and industrial areas. The monitoring data indicated PM10, PM2.5 and PAHs concentrations were exceeding the limits due to vehicular emissions, road condition (paved and unpaved), construction, industrial and commercial activities. Carbon monoxide and hydrocarbon concentrations were high during traffic peak hours and near road corridors where traffic congestion is high. GM, ATDL and ISCST3 models were employed to assess the contribution of air pollutants from transport, domestic and industry sector, respectively. Performance evaluation of models was also carried out by comparing monitored and model-predicted concentration to assess model prediction accuracy.