
Application of Regularized Logistic Regression and Artificial Neural Network model for Ozone Classification across El Paso County, Texas United States

Abstract

This paper deals with ozone prediction in the atmosphere using a machine learning procedure. The persistence of highly concentrated ozone in the troposphere harms both the biotic and abiotic components of the environment. It is therefore vital to detect high levels of ozone early to ensure a healthy environment. El Paso, Texas is considered a highly ozone-affected city in the USA, with a history of very high ozone levels every year. In this paper we use data sets of air pollutants and meteorological variables from the El Paso area to classify ozone levels as high or low. The dataset was collected from the Texas Commission on Environmental Quality (TCEQ) ground stations and cleaned for research purposes. In this study, we trained Logistic Regression (LR) and Artificial Neural Network (ANN) algorithms on the data sets and compared them to determine the model that most accurately predicts ozone levels in the El Paso area. We found that our best model has a very high classification accuracy (89.3%) for predicting the ozone level on a given day. From our evaluation metrics, the accuracies of the ANN and LR models were 89.3% and 88.4%, respectively. In addition, the AUC values of the two models were almost the same, with the ANN at 95.4% and the LR at 95.2%. Based on the outcome of our odds ratio analysis, features such as 'Solar radiation', 'Std. Dev. Wind Direction', 'outdoor temperature', 'dew point temperature' and 'PM10' contribute to high ozone levels in El Paso, Texas.
Journal of Data Analysis and Information Processing, 2023, 11, 217-239
https://www.scirp.org/journal/jdaip
ISSN Online: 2327-7203
ISSN Print: 2327-7211
DOI: 10.4236/jdaip.2023.113012
Callistus Obunadike, Adekunle Adefabi, Somtobe Olisah, David Abimbola, Kunle Oloyede
Department of Computer Science, Austin Peay State University, Clarksville, USA
Abstract
This paper focuses on ozone prediction in the atmosphere using a machine learning approach. We utilize air pollutant and meteorological variable datasets from the El Paso area to classify ozone levels as high or low. The LR and ANN algorithms are trained on these datasets. The models demonstrate high classification accuracy, up to 89.3%, in predicting ozone levels on a given day. Evaluation metrics show that the ANN and LR models achieve accuracies of 89.3% and 88.4%, respectively. Additionally, the AUC values for both models are comparable, with the ANN achieving 95.4% and the LR obtaining 95.2%. A lower cross-entropy loss (log loss) indicates better model performance. Our ANN model yields a log loss of 3.74, while the LR model shows a log loss of 6.03. The prediction time for the ANN model is approximately 0.00 seconds, whereas the LR model takes 0.02 seconds. Our odds ratio analysis indicates that features such as “Solar radiation”, “Std. Dev. Wind Direction”, “outdoor temperature”, “dew point temperature”, and “PM10” contribute to high ozone levels in El Paso, Texas. Based on metrics such as accuracy, error rate, log loss, and prediction time, the ANN model proves to be faster and more suitable for ozone classification in the El Paso, Texas area.
Keywords
Machine Learning, Ozone Prediction, Pollutants Forecasting, Atmospheric
Monitoring, Air Quality, Logistic Regression, Artificial Neural Network
How to cite this paper: Obunadike, C., Adefabi, A., Olisah, S., Abimbola, D. and Oloyede, K. (2023) Application of Regularized Logistic Regression and Artificial Neural Network model for Ozone Classification across El Paso County, Texas, United States. Journal of Data Analysis and Information Processing, 11, 217-239. https://doi.org/10.4236/jdaip.2023.113012
Received: April 25, 2023
Accepted: July 8, 2023
Published: July 11, 2023
Copyright © 2023 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/
Open Access
1. Introduction
Ozone is created in the atmosphere from gases that are released through smokestacks, tailpipes, and a variety of other sources. These gases react when exposed to sunlight, thereby creating ozone pollution. Ozone is a key component of the Earth's atmosphere; it plays a vital role in protecting life on our planet by absorbing harmful ultraviolet radiation. However, excessive levels of ozone can have negative impacts on human health and the environment. Ozone prediction is an important task that helps us to better understand and manage the effects of ozone on our planet. Machine learning is a powerful tool for predicting ozone levels in the atmosphere. This can be achieved by training a machine learning model on historical data to make predictions about future ozone levels based on factors such as temperature, wind speed, and emissions. These predictions can then be used to inform decision making and mitigate the negative effects of excessive ozone levels. Ozone starts off as an invisible pollutant; when it is not properly monitored, it combines with other contaminants to cause many health challenges [1]. Ozone is one of the most dangerous air pollutants. For the past several decades, researchers have been examining how ozone affects human health. El Paso, Texas, has been recorded as one of the most ozone-affected cities in the United States. Three oxygen atoms make up the gas molecule known as ozone (O3). Ozone, also known as “smog”, is dangerous to breathe. By chemically interacting with lung tissue, ozone actively damages it.
2. Literature Review
Three oxygen atoms make up the gas molecule known as O3 (see Figure 1).
Another name for ozone (O3) is “smog”, which is very dangerous when inhaled.
Ozone becomes very harmful when it chemically interacts with the lung tissue,
thus causing severe damage. Figure 1 illustrates ozone molecules.
2.1. Formation of Ozone (O3)
The same processes that produce ozone also produce other dangerous pollutants, so these are typically present alongside O3. We are protected from the majority of the sun's UV radiation by the ozone layer, which is located high in the stratosphere (i.e., the upper atmosphere). However, O3 air pollution poses major health risks when it is present at ground level, where we may breathe it (i.e., within the troposphere). Nitrogen oxides (NOx) and volatile organic compounds (VOCs) are the two main raw materials that produce ozone. In addition, the burning of fossil fuels such as gasoline, oil, or coal and the evaporation of certain chemicals such as solvents also contribute to ozone production. Power plants, automobiles, and other high-heat combustion sources all emit NOx, whereas vehicles, chemical plants, refineries, factories, petrol stations, paint, and other sources all release VOCs [1]. Figure 2 shows the reaction pattern that leads to ozone formation.
Figure 1. Showing ozone (O3) molecule.
Figure 2. Showing ozone formation.
2.2. Risk of Ozone Exposure
Anyone who spends time outside in an area with high levels of ozone pollution could be in danger. The effects of inhaling ozone are particularly harmful to five groups of people:
o Children and teenagers [2].
o Everyone above the age of 65 [2].
o Those who already have lung conditions including asthma and chronic obstructive pulmonary disease (COPD), which also encompasses emphysema and chronic bronchitis [2].
o Those who work or exercise outside [2].
o People living with obesity [2].
2.3. Implications of Ozone Exposure
People with allergies may respond more strongly to allergens after inhaling ozone. According to a research study published in 2009, children were more likely to experience hay fever and respiratory allergies when ozone and PM2.5 levels were high [3].
2.3.1. Premature Death
Exposure to ozone pollution may shorten one's life. Several studies carried out in cities across the U.S., Europe, and Asia show that ozone has a devastating effect on people's health and life span. Over time, researchers have discovered that exposure to increasing ozone levels raises the chance of premature death [4]. According to more recent research, ozone raises the chance of premature mortality even when other pollutants are also present [1].
2.3.2. Inhalation Problems
In major counties across the United States (such as El Paso, Texas), ozone levels increase over the summer, leading to an increase in health challenges [5]. In addition to a higher risk of premature mortality, ozone exposure causes inhalation problems such as wheezing, coughing, and shortness of breath; triggers asthma episodes; increases the need for hospitalization and medical care for persons with lung disorders including asthma or chronic obstructive pulmonary disease (COPD); and raises the risk of respiratory infections and the susceptibility to pulmonary inflammation [2].
2.3.3. Risk from Long-Term Exposure
Recent research alerts us to the negative consequences of prolonged exposure to ozone. Scientists are discovering that prolonged exposure (i.e., exposure of more than 8 hours, as well as days, months, or years) increases the chance of premature mortality. Researchers have discovered that high levels of ozone are linked to an increased risk of respiratory disease, which leads to a high mortality rate [4]. Also, New York researchers examined hospital data for pediatric asthma patients and discovered that exposure to ozone over an extended period increased the probability of hospital admission for asthma patients. Recent studies show that children from low-income households were more likely to be hospitalized due to high levels of ozone exposure than children from high-income households [6].
2.3.4. US Environmental Protection Agency (EPA) Findings
In February 2013, the EPA published a comprehensive review of its most recent findings on ozone pollution [7]. The EPA had asked the “Clean Air Scientific Advisory Committee”, a group of distinguished scientists, to assist in evaluating the evidence that it had gathered; in particular, they looked at research published between 2006 and 2012. The EPA and the committee's experts concluded that ozone pollution posed numerous, substantial health risks. Based on that evaluation, the EPA in 2015 firmly supported the “National Ambient Air Quality Standard” (i.e., the official acceptable ozone limit). However, recent studies show that ozone can be dangerous even at much lower concentrations. In a scientific paper published in 2017, researchers presented additional evidence that older adults face a higher risk of premature death even at ozone levels below the national acceptable level [8].
2.4. Features or Variable Types
According to [9], the predictor variables may also be known as “PIE (predictor, independent or explanatory) variables”, while the response variables may be termed “DORT (dependent, observatory, response or target) variables”. Selecting features (variables) by importance enables the ML algorithm to train faster, reduces the cost and time required for training, and makes the model simpler to interpret. It also reduces the variance of the model and improves accuracy, provided the right subset is chosen [9].
Odds Ratio
Generally, the magnitude of the odds ratio is called the “strength of the association”. The further an odds ratio is from 1, the stronger the association, and the more plausible it is that the relationship between the exposure and the disease is causal. For instance, an odds ratio of 1.25 is above 1 but does not indicate a strong association, while an odds ratio greater than 9.5 suggests a stronger association [9].
2.5. Selection of Logistic Regression and Artificial Neural Network Model
It is important to note that the choice between LR and ANN models depends on the specific problem, dataset, and desired outcome. LR is suitable for simpler tasks and when interpretability is crucial, while ANN models excel in more complex problems where high accuracy is the priority.
2.5.1. Advantages of LR and ANN
The Logistic Regression model is straightforward and interpretable. It is easy to understand and implement, making it a good choice for simple classification problems [10]. Training an LR model is computationally efficient compared to complex ANN models, and it can handle large datasets with relative ease [11]. LR provides meaningful insights into the impact of each feature on the predicted outcome. It assigns weights to features, indicating their importance in the decision-making process [12].
Artificial Neural Networks (ANNs) can model complex and nonlinear relationships between features and the target (DORT) variable. They can learn intricate patterns that may be difficult for LR models to capture. ANN models can automatically extract relevant features from raw data, reducing the need for manual feature engineering [13]. ANN models, especially deep learning models, have achieved state-of-the-art performance on various tasks, including image and speech recognition, natural language processing, and recommendation systems [14].
2.5.2. Disadvantages of LR and ANN
The Logistic Regression model assumes a linear relationship between the features and the log-odds of the target variable. It may struggle to capture complex patterns and nonlinear relationships in the data [15]. Logistic Regression relies heavily on manual feature engineering; choosing relevant features and transforming them appropriately is therefore crucial for its performance. Although LR performs well in certain scenarios, it may underperform when faced with highly complex datasets or problems that require high predictive accuracy [16].
Artificial Neural Network models, especially deep neural networks, require significant computational resources, can be time-consuming to train, and often require specialized hardware like GPUs [17]. ANN models can be challenging to interpret: understanding how the model arrives at its predictions can be difficult, making it less transparent than LR models [18]. In addition, ANN models are prone to overfitting, especially when working with limited training data [19]. Regularization techniques and careful hyperparameter tuning are necessary to mitigate this risk [5].
2.6. Factors that Influence Accuracy of LR and ANN
It is important to consider the following factors and carefully optimize them to achieve the best classification accuracy for LR and ANN models.
o Dataset quality and size: The quality and size of the dataset used for training and evaluation play a crucial role. A larger dataset with a diverse range of samples can help both LR and ANN models generalize better and achieve higher accuracy [18].
o Feature selection and engineering: The choice and preparation of input features can significantly affect model performance. Proper feature selection and engineering can improve the discriminative power of the features and lead to better accuracy for both LR and ANN models [20].
o Model complexity: The complexity of the model can impact classification accuracy. LR assumes a linear relationship, while ANN models, especially deep neural networks, can capture complex nonlinear relationships [17]. In general, more complex models like ANNs have the potential to achieve higher accuracy, but they are also more prone to overfitting [19].
o Regularization techniques: Regularization methods, such as L1 or L2 regularization, can help prevent overfitting in both LR and ANN models. Regularization adds a penalty term to the model's objective function, discouraging overly complex models and improving generalization [5].
o Hyperparameter tuning: Both LR and ANN models have various hyperparameters that need to be tuned for optimal performance. Examples include learning rate, regularization strength, number of hidden layers, and number of neurons. Proper hyperparameter tuning can significantly affect classification accuracy [18].
o Training duration and convergence: The duration and convergence of the training process can impact final accuracy. Training for too few iterations may result in underfitting, while training for too many iterations may lead to overfitting [19]. Finding the right balance and ensuring convergence is essential for achieving high accuracy.
o Class imbalance: Class imbalance occurs when one class has significantly more or fewer samples than others. This can affect the model's ability to accurately predict the minority class. Techniques like oversampling, undersampling, or class weighting can help address class imbalance and improve accuracy [21].
o Preprocessing and normalization: Proper preprocessing steps, such as handling missing values, scaling features, and handling outliers, can impact the accuracy of both LR and ANN models. Different preprocessing techniques may be more suitable for different models, and their proper application can enhance accuracy [22].
o Model evaluation and validation: The choice of appropriate evaluation metrics and validation techniques can affect the reported accuracy. Metrics such as accuracy, precision, recall, and F1 score provide different perspectives on model performance, and using appropriate validation methods like cross-validation can give a more reliable estimate of the model's accuracy [23].
o Computational resources: ANN models, especially deep learning models, can be computationally intensive and may require specialized hardware, such as GPUs, for efficient training. The availability of computational resources can impact the size and complexity of the ANN models used, which can, in turn, affect their accuracy [17].
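To make the evaluation metrics named in the list above concrete, the following minimal Python sketch computes accuracy, precision, recall and F1 score from true and predicted binary labels. The paper's own analysis was carried out in R; the function name `binary_metrics` and the toy labels are assumptions introduced for illustration only and are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = high ozone)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration; these are not observations from the dataset.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy is most useful when the classes are imbalanced, since accuracy alone can hide poor performance on the minority class.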
3. Methodology
The aim of this research is to examine the ability of Logistic Regression and Artificial Neural Network models to correctly classify ozone levels into high and low categories, given the other predictor variables. The dataset consists of 973 rows and 14 variables (features). Among these variables, ozone was selected as the response (dependent) variable, with “low ozone” assigned as 0 and “high ozone” assigned as 1. Therefore, we are dealing with a binary classification problem. The dataset was analyzed using the R programming language. The first step in the analysis involved checking the variable types, identifying missing values, outliers, and potentially incorrect records, and conducting exploratory data analysis (EDA), including the frequency distribution of the target variable and the association between the target and the predictor variables.
3.1. Descriptive Analysis of the Dataset
The dataset contains 14 variables (features). “Ozone” is assigned to be the target variable, otherwise known as DORT or Y (dependent, observatory, response, or target variable), while the remaining 13 variables represent PIE or X (predictor, independent or explanatory variables) (see Table 1).
3.2. Data Pre-Processing
The following steps were adopted during data pre-processing to ensure the accuracy of our dataset. We performed exploratory data analysis on the dataset by cleaning the data and checking for missing values (refer to Section 3.2.1). Additionally, we applied a for-loop to iterate over the dataset and cross-check for other types of missing values, such as “na”, “NA”, or an empty string (refer to Figure 3). Based on the results, our dataset showed no missing values.
3.2.1. Checking for Missing Values
The anyNA() function was used to check for missing values in our dataset. The outcome was “FALSE”, which implies that we did not have any missing data. In addition, we visualized the dataset for any missing data using the naniar package. Figure 4 shows a bar plot indicating that there are no missing values in our dataset.
3.2.2. Intensive Cross Checking of Other Missing Values
To ensure that our analysis and model are free from errors, it is important to loop through the whole dataset to check for missing values that may occur in forms other than “NA”.
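The cross-check described above can be sketched as follows. The paper performed this step with a for-loop in R (Figure 3), which is not reproduced here; this Python version is only an illustration under stated assumptions. The helper names `is_missing` and `missing_summary`, the column names, and the toy rows are hypothetical, and the set of string markers is limited to the three forms mentioned in the text.

```python
import math

# String forms treated as missing, as named in the text ("na", "NA", empty string).
MISSING_STRINGS = {"", "na", "NA"}


def is_missing(value):
    """True for None, NaN, or one of the string markers treated as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_STRINGS:
        return True
    return False


def missing_summary(rows):
    """Per-column observed count, missing count and missing proportion."""
    n = len(rows)
    summary = {}
    for col in rows[0].keys():
        nmiss = sum(1 for row in rows if is_missing(row[col]))
        summary[col] = {"ncom": n - nmiss, "nmiss": nmiss, "miss_prop": nmiss / n}
    return summary


# Hypothetical rows containing deliberately injected missing markers.
rows = [
    {"solar_radiation": 0.52, "pm10": 31.0, "ozone": 1},
    {"solar_radiation": 0.10, "pm10": "NA", "ozone": 0},
    {"solar_radiation": float("nan"), "pm10": 18.5, "ozone": 0},
    {"solar_radiation": 0.33, "pm10": "", "ozone": 1},
]
summary = missing_summary(rows)
for col, stats in summary.items():
    print(col, stats)
```

A dataset with no missing values, as reported for this study, would yield `nmiss` equal to 0 for every column.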
Table 1. Description of the variables’ data types.
S/No  Variables                  Data Type
1     Nitric Oxide               Numeric
2     Nitrogen Dioxide           Numeric
3     Oxides of Nitrogen         Numeric
4     Wind Speed                 Numeric
5     Resultant Wind Speed       Numeric
6     Resultant Wind Direction   Integer
7     Maximum Wind Gust          Numeric
8     Std. Dev. Wind Direction   Integer
9     Outdoor Temperature        Numeric
10    Dew Point Temperature      Numeric
11    Relative Humidity          Numeric
12    Solar Radiation            Numeric
13    PM10                       Numeric
14    Ozone                      Integer
Figure 3. Showing the (for-loop) code iteration for missing values.
Table 2 shows V.name (the variable names), Mode (data types), N.level (number of distinct values among the observations), ncom (number of total observations), nmiss (number of missing observations), and Miss.prop (proportion of missing observations).
Figure 4. Bar plot of the dataset using the vis_miss() function in the naniar package.
Table 2. Iteration through the dataset using a for-loop to check for other missing values.
Col.num  V.name                     Mode     N.level  ncom  nmiss  Miss.prop
1        Nitric Oxide               numeric  82       973   0      0
2        Nitrogen Dioxide           numeric  228      973   0      0
3        Oxides of Nitrogen         numeric  238      973   0      0
4        Wind Speed                 numeric  146      973   0      0
5        Resultant Wind Speed       numeric  153      973   0      0
6        Resultant Wind Direction   numeric  270      973   0      0
7        Maximum Wind Gust          numeric  242      973   0      0
8        Std. Dev. Wind Direction   numeric  68       973   0      0
9        Outdoor Temperature        numeric  367      973   0      0
10       Dew Point Temperature      numeric  442      973   0      0
11       Relative Humidity          numeric  459      973   0      0
12       Solar Radiation            numeric  549      973   0      0
13       PM10                       numeric  451      973   0      0
14       Ozone                      numeric  2        973   0      0
3.2.3. Frequency Distribution of Target Variable (Ozone)
In analyzing the ozone level, we transformed it from binary-numerical to binary-categorical such that values of 1 are labeled “high level” while values of 0 are labeled “low level”. From the frequency distribution plot (see Figure 5), days with low ozone levels occur more frequently than days with high ozone levels. However, comparing the two frequencies, we can say the distribution is reasonably balanced since the difference is not large.
Figure 5. Frequency distribution of the target variable (ozone).
4. Results
The first approach to building a model with high accuracy is to properly investigate data quality, coherence, association, and correlations between the DORT and PIE variables. This will enable us to correctly predict areas with high and low ozone. Analyzing the output, we observe that all the variables are continuous (quantitative) except for the target (ozone) variable, which is binary (categorical).
4.1. Data Exploration and Correlation
It is important to understand the degree of correlation and association between the predictor variables and the target variable (ozone). The correlations computed from the data show different degrees of correlation, ranging from strong negative to strong positive (see Figure 6). Out of the 13 predictor variables, only “solar radiation”, “outdoor temperature” and “std. dev. wind direction” show a positive correlation with the target variable (ozone). Figure 6 illustrates which predictor variables are positively or negatively correlated with the target variable (ozone).
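As an illustration of how a correlation between a numeric predictor and the binary ozone label might be computed, the sketch below uses the Pearson coefficient (which, for a 0/1 target, is the point-biserial correlation). The paper does not state which correlation coefficient was used, so this choice is an assumption, as are the function name `pearson_r` and the toy values; the original computation was done in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration; these are not measurements from the study.
solar = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
ozone = [0, 0, 0, 1, 1, 1]  # 0 = low ozone, 1 = high ozone
r = pearson_r(solar, ozone)
print(round(r, 3))
```

A value of `r` close to +1 or -1 indicates a strong positive or negative linear association, while a value near 0 indicates little linear association.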
4.2. Box Plots of Predictor Variables
Figure 7 and Figure 8 show the boxplots of the predictor variables that are positively and negatively correlated with the target variable (ozone). In general, it can be seen that the negatively correlated predictor variables have outliers.
4.2.1. Box Plots of Ozone and Selected Predictor Variables
Figures 9-12 show how the selected predictor features (i.e., “Solar Radiation”, “Nitric Oxide”, “Nitrogen Dioxide” and “PM10”) differ between days with high and low ozone levels. The distribution for the “high ozone” level is nearly normal in most cases, whereas the “low ozone” level shows negative skewness (see Figures 13-16). The histograms of “Solar Radiation”, “Nitric Oxide”, “Nitrogen Dioxide” and “PM10” show right-skewed distributions (see Figures 13-16). This already suggests that the distributions of “Solar Radiation”, “Nitric Oxide”, “Nitrogen Dioxide” and “PM10” are not normal. Aside from “Solar Radiation”, the other selected variables show the presence of outliers.
Figure 6. Correlation Coefficient between predictor variables and target
(ozone) variable.
Figure 7. Boxplots of +Corr. predictor variables.
Figure 8. Boxplots of -Corr. predictor variables.
Figure 9. Boxplots of solar radiation vs ozone.
Figure 10. Boxplots of nitric oxide vs ozone.
Figure 11. Boxplots of nitrogen dioxide vs ozone.
Figure 12. Boxplots of PM10 vs ozone level.
Figure 13. Solar radiation histogram.
Figure 14. Nitric oxide histogram.
Figure 15. Nitrogen dioxide histogram.
Figure 16. Particulate matter 10 histogram.
Since some of the box plots for the selected predictor variables show outliers and skewness, we proceed to test for normality using the Anderson-Darling and Shapiro-Wilk tests (see Table 3).
4.2.2. Normality Test and Wilcoxon Test
The normality tests show that the p-values of the Anderson-Darling and Shapiro-Wilk tests are less than the significance level (0.05), which signifies that the distributions are not normal. Since the normality assumption of the t-test is violated, we apply the Wilcoxon rank sum test (a non-parametric alternative) to examine the association between ozone level and each of Solar Radiation, Nitric Oxide, Nitrogen Dioxide and PM10 (see Table 3).
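To show what the rank sum statistic measures, the following Python sketch computes a W statistic for two groups of values, using average ranks for ties. It follows the convention of subtracting n_x(n_x + 1)/2 from the rank sum of the first group, which we believe matches the statistic printed by R's wilcox.test; no p-value is computed here. The helper names `average_ranks` and `rank_sum_w` and the toy values are assumptions for illustration and do not reproduce the paper's R analysis.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_w(x, y):
    """W statistic: rank sum of x in the pooled sample minus n_x(n_x + 1)/2."""
    ranks = average_ranks(list(x) + list(y))
    n_x = len(x)
    return sum(ranks[:n_x]) - n_x * (n_x + 1) / 2


# Toy predictor values grouped by ozone level; not data from the study.
high_ozone_days = [5.1, 6.3, 7.0]
low_ozone_days = [1.2, 2.4, 5.1, 3.3]
w = rank_sum_w(high_ozone_days, low_ozone_days)
print(w)
```

Under this convention, W counts the pairs in which a value from the first group exceeds a value from the second group, with ties counted as one half, so the two possible W values always sum to the product of the group sizes.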
Table 3. Association between the target variable (ozone) and selected predictor variables (Solar Radiation, Nitric Oxide, Nitrogen Dioxide and PM10).
Predictor vs Ozone Level   Anderson-Darling normality test    Shapiro-Wilk normality test         Wilcoxon rank test
Solar Radiation            A = 54.466, p-value < 2.2e−16      W = 0.84817, p-value < 2.2e−16      W = 177795, p-value < 2.2e−16
Nitric Oxide               A = 210.08, p-value < 2.2e−16      W = 0.31033, p-value < 2.2e−16      W = 61823, p-value < 2.2e−16
Nitrogen Dioxide           A = 76.434, p-value < 2.2e−16      W = 0.7364, p-value < 2.2e−16       W = 84314, p-value = 9.622e−11
PM10                       A = 63.252, p-value < 2.2e−16      W = 0.64781, p-value < 2.2e−16      W = 100056, p-value = 0.005453
Alternative hypothesis (HA): true location shift is not equal to 0.
Comments: The p-values of the Wilcoxon rank sum tests above are lower than the alpha value (0.05), indicating that there is a significant relationship between ozone level and each of Solar Radiation, Nitric Oxide, Nitrogen Dioxide and PM10.
4.3. Data Splitting
The dataset was partitioned into two parts in a ratio of 2:1, where the training data (D1) contains 67% of the observations and the test data (D2) contains 33%. The logistic regression technique was applied to the training data to build a predictive model. First, we adopted lasso (L1) regularization and used cross-validation to obtain the tuning parameter (λ). The logistic regression model was then fitted with the lasso regularization method using our training data, D1. The lasso method was applied because our aim is to build a parsimonious model that properly explains our target (ozone) feature.
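A 2:1 partition of this kind can be sketched as follows. The paper does not describe how rows were assigned to D1 and D2, so the random shuffle, the fixed seed, and the rounding rule below are assumptions, as is the function name `train_test_split`; the original split was performed in R.

```python
import random


def train_test_split(rows, train_fraction=0.67, seed=42):
    """Shuffle a copy of rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Row indices standing in for the 973 observations of the dataset.
rows = list(range(973))
d1, d2 = train_test_split(rows)
print(len(d1), len(d2))
```

Fixing the seed keeps the partition reproducible, so that a model fitted on D1 is always evaluated on the same held-out D2.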
4.3.1. MSE and Tuning Parameter
The best lambda for regularizing our model is selected using the MSE and the misclassification rate. The first six rows of the lambdas, misclassification rates and mean square errors are shown in Table 4. Using the MSE (Mean Square Error) as the evaluation metric, our best lambda (tuning parameter) is 0.0026.
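The selection rule can be illustrated with the six rows reported in Table 4. The sketch below simply picks the lambda with the smallest MSE among those rows; since Table 4 shows only the first six candidates, this is an illustration of the rule rather than a reproduction of the full search, and the function name `best_lambda` is hypothetical.

```python
# Rows transcribed from Table 4: (lambda, misclassification rate, MSE).
table4 = [
    (0.00010, 0.355, 0.2500),
    (0.00261, 0.126, 0.0891),
    (0.00512, 0.135, 0.0930),
    (0.00764, 0.138, 0.0982),
    (0.01015, 0.145, 0.1017),
    (0.01266, 0.145, 0.1041),
]


def best_lambda(rows):
    """Return the lambda whose row has the smallest MSE."""
    return min(rows, key=lambda row: row[2])[0]


best = best_lambda(table4)
print(best)
```

Among these six rows the same lambda also has the lowest misclassification rate, so both criteria agree on this subset.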
4.3.2. Mean Square Error vs Lambda
Figure 17 indicates that as the tuning parameter increases, the Mean Square Error increases as well. Therefore, it is important to keep lambda small to obtain a low MSE. However, from a lambda of about 0.3, the MSE becomes constant (see Figure 17).
4.4. Model Fitting and Odds Ratio of LR Model
Having obtained the best lambda, we fit the final lasso logistic regression model with the training and validation data pooled together. The important features can be seen in Table 5. After fitting the model with the best lambda, “Nitrogen Dioxide” and “Resultant Wind Speed” turn out to be the unimportant variables in our LR model; their coefficients are shrunk to zero and are shown as “.” in Table 5.
Table 4. Lambdas, misclassification rate and MSE matrix.
      [1] (lambda)   [2] (misclassification rate)   [3] (MSE)
[1]   0.00010        0.355                          0.2500
[2]   0.00261        0.126                          0.0891
[3]   0.00512        0.135                          0.0930
[4]   0.00764        0.138                          0.0982
[5]   0.01015        0.145                          0.1017
[6]   0.01266        0.145                          0.1041
Figure 17. Showing plot of lambda vs MSE.
Table 5. Coefficients of important predictors using the LGR (L1) model.
Variables                  Coefficients
Nitric Oxide               1.98759
Nitrogen Dioxide           .
Oxides of Nitrogen         0.00638
Wind Speed                 0.25833
Resultant Wind Speed       .
Resultant Wind Direction   0.00391
Maximum Wind Gust          0.01733
Std. Dev. Wind Direction   0.06886
Outdoor Temperature        0.00652
Dew Point Temperature      0.04620
Relative Humidity          0.13025
Solar Radiation            1.23665
PM10                       0.01135
The negative coefficient of “Nitric Oxide” indicates that an increase in “Nitric Oxide” multiplies the odds by a factor less than 1, which increases the probability of the output being labeled as low ozone level (0). In addition, the positive coefficient of “Solar Radiation” suggests that a unit increase in “Solar Radiation” multiplies the odds by a factor greater than 1, which increases the probability of the output being labeled as high ozone level (1). We then apply the best-fitted model to our test data. Table 6 presents the odds ratios of the important predictor variables based on the best-fit model.
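The link between a logistic regression coefficient and its odds ratio is the exponential function: a coefficient β corresponds to an odds ratio of exp(β) per unit increase in the predictor, on whatever scale the predictor had when the model was fitted. The sketch below applies this to the Solar Radiation coefficient from Table 5. The function names `odds_ratio` and `shifted_probability` and the baseline probability of 0.5 are assumptions for illustration; the printed value is computed here and is not quoted from Table 6.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds for a `delta` increase in the predictor."""
    return math.exp(coefficient * delta)


def shifted_probability(p_baseline, ratio):
    """Probability after multiplying the baseline odds by `ratio`."""
    new_odds = ratio * p_baseline / (1.0 - p_baseline)
    return new_odds / (1.0 + new_odds)


solar_or = odds_ratio(1.23665)  # Solar Radiation coefficient from Table 5
print(round(solar_or, 3))

# Hypothetical baseline: a day with a 50% chance of high ozone (odds = 1).
p_after = shifted_probability(0.5, solar_or)
print(round(p_after, 3))
```

A coefficient of zero gives an odds ratio of exactly 1 (no effect), while a negative coefficient gives an odds ratio below 1 and therefore lowers the probability of the high ozone class.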
4.4.1. Logistic Regression Model Evaluation
From Table 7, the AUC of 0.952 indicates that our fitted model discriminates
well between high and low ozone days: it assigns a higher predicted
probability to a randomly chosen high-ozone day than to a randomly chosen
low-ozone day about 95% of the time. The 95% confidence interval is
(0.929, 0.975), so we are 95% confident that the true AUC falls within this
interval. From Table 7, we also obtained an MSE value of 0.0833, which
generally indicates good performance. Because this is a classification model,
MSE is not the most suitable evaluation measure, so we further computed the
misclassification rate. Its value is 0.116, which means that our model
correctly classifies ozone levels as high or low at a rate of 88.4%, which
suggests that our model performs well.
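The reported interval is consistent with a normal-approximation confidence interval built from the cross-validated AUC and its standard error in Table 7. The sketch below assumes that construction (estimate ± z × SE, with z ≈ 1.96 for 95%). The paper does not state how the interval was computed, so the function name and the method are assumptions.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, confidence=0.95):
    """Two-sided normal-approximation confidence interval (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return estimate - z * std_error, estimate + z * std_error


# cvAUC and SE as reported in Table 7.
low, high = normal_ci(0.952, 0.0115)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Under this assumption the bounds round to (0.929, 0.975), which agrees with the interval reported in Table 7.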
4.4.2. Receiver Operating Characteristic (ROC) Curve for LR Model
ROC stands for Receiver Operating Characteristic. The ROC curve is a graphical
representation used to evaluate the performance of binary classification
models. It illustrates the relationship between the hit rate (also known as
sensitivity or true positive rate) and the false alarm rate (also known as the
false positive rate). The hit rate is the proportion of correctly identified
positive instances (true positives) out of all actual positive instances. It
represents the model's ability to correctly classify positive cases. The false
alarm rate, on the other hand, is the proportion of negative instances
incorrectly identified as positive (false positives) out of all actual
negative instances.
The ROC curve plots the hit rate on the y-axis and the false alarm rate on the
x-axis. It shows how the trade-off between these two rates changes as the
classification threshold of the model varies. The threshold determines the
point at which the model classifies instances as positive or negative based on
the predicted probabilities or scores. Ideally, a good classification model
achieves a high hit rate and a low false alarm rate, resulting in a curve that
hugs the upper-left corner of the ROC space. The closer the curve is to this
corner, the better the model's performance. The diagonal line from (0, 0) to
(1, 1) represents the performance of a random classifier.
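The threshold sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `roc_points` and the toy labels and scores are hypothetical and are not the El Paso data or the authors' implementation.

```python
def roc_points(labels, scores):
    """Return (false alarm rate, hit rate) pairs, one per candidate threshold.

    labels: 1 for high ozone, 0 for low ozone.
    scores: predicted probabilities of high ozone.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged as high
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data (hypothetical).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
for false_alarm_rate, hit_rate in curve:
    print(f"FPR = {false_alarm_rate:.2f}, TPR = {hit_rate:.2f}")
```

Lowering the threshold can only add flagged instances, so both rates are non-decreasing along the curve, which starts at (0, 0) and ends at (1, 1).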
Table 6. Odds ratio of important predictor variables based on the best fit model.
Target Variable (Ozone): Low (0), High (1)
Variable                    Odds Ratio    Implications
Nitric Oxide                0.16          Nitric Oxide might not be a protective factor for high ozone level
Oxides of Nitrogen          0.981         Oxides of Nitrogen might lead to high ozone level subsequently
Wind Speed                  0.733         Wind speed might lead to high ozone level subsequently
Resultant Wind Direction    0.997         Resultant wind direction might lead to high ozone level subsequently
Maximum Wind Gust           0.976         Maximum wind gust might lead to high ozone level subsequently
Std. Dev. Wind Direction    1.06          Std. Dev. wind direction is a risk factor for high ozone level
Outdoor Temperature         1             Outdoor temperature is a risk factor for high ozone level
Dew Point Temperature       1.04          Dew point temperature is a risk factor for high ozone level
Relative Humidity           0.885         Relative humidity might lead to high ozone level subsequently
Solar Radiation             4.19          Solar radiation is certainly a major risk factor for high ozone level
PM10                        1.01          Particulate matter 10 is a risk factor for high ozone level
Table 7. Results of the LR model evaluation.
Misclassification rate    MSE       cvAUC    SE        CI                Confidence
0.116                     0.0833    0.952    0.0115    (0.929, 0.975)    0.95
In addition to the ROC curve itself, the area under the curve (AUC) is often
calculated to provide a single metric summarizing the model's performance. The
AUC represents the probability that the classifier will rank a randomly chosen
positive instance higher than a randomly chosen negative instance. A higher
AUC value indicates better discrimination power of the model. Our model has
very high discriminatory power for separating high ozone and low ozone levels
on any given day. Figure 18 shows the ROC curve (i.e., the trade-off between
sensitivity (or TPR) and false positive rate [1 − specificity]). It further
indicates that the model performs better than the random-classifier benchmark
(50%), with a total area of 0.952 (95.2%).
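The ranking interpretation of the AUC can be checked directly by comparing every positive instance with every negative instance. The following is a minimal sketch; the function name `auc_by_ranking`, the tie-handling convention (ties count as half), and the toy data are assumptions made for illustration.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win (Mann-Whitney convention, assumed here).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not the El Paso dataset).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_by_ranking(labels, scores))
```

In this toy example eight of the nine positive-negative pairs are ordered correctly. A classifier that assigns the same score to every instance gets 0.5, the random benchmark.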
4.4.3. Performance of LR Model using Confusion Matrix
Metrics such as accuracy, precision (positive predictive value), recall
(sensitivity), and F1 score provide different perspectives on model
performance. The confusion matrix also helps in the interpretation of model
performance. The sensitivity or recall (TP rate) of 0.8761 (87.6%) indicates
that the model detects a high proportion of the days with a high ozone level.
The specificity (TN rate) of 0.8878 (88.8%), which is relatively high,
indicates that the model also detects a high proportion of the days with a low
ozone level. Our fitted model therefore has an accuracy of 88.4% and a
precision of 81.2%, which implies a low FP rate. The confusion matrix and
other statistical prediction parameters for the logistic regression model are
shown in Table 8.
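The metrics named above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The function name and the counts are hypothetical; the counts are chosen only for illustration and are not the paper's confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summary metrics from the four cells of a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,          # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": precision,         # positive predictive value
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```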
Figure 18. Receiver operating characteristic (ROC) curve of the hit rate vs. the false alarm rate.
Table 8. Confusion matrix and other statistical prediction parameters for LR.
Confusion Matrix and Statistics For LR
Accuracy 0.884
95% CI (0.843, 0.917)
Sensitivity/Recall 0.876
Specificity (True Negative Rate/TNR) 0.888
Pos Pred Value/Precision 0.811
F1 Score 0.843
Prediction Time 0.02 secs
Binary Cross Entropy 6.03
4.5. Artificial Neural Network (ANN) Model
To fit an Artificial Neural Network (ANN) model on our training dataset D1, we
first scaled the training data and created a data frame that includes the
target variable. After scaling, we built our ANN structure, which has 4 hidden
layers containing 9, 7, 5, and 3 neurons, respectively, together with input
and output layers.
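To make the layer structure concrete, the sketch below runs one forward pass through a network with hidden layers of 9, 7, 5, and 3 neurons. Only the hidden layer sizes come from the text. The 13 inputs (one per predictor listed in Table 5), the single sigmoid output, the sigmoid hidden activations, and the random untrained weights are assumptions made for this illustration, and the function names are hypothetical.

```python
import math
import random

# Hidden layers follow the text (9, 7, 5, 3). The input size of 13 and the
# single output unit are assumptions for this sketch.
LAYER_SIZES = [13, 9, 7, 5, 3, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(sizes, seed=0):
    """Random weights and zero biases for each pair of consecutive layers."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        network.append((weights, biases))
    return network


def forward(network, inputs):
    """Propagate one scaled input vector through every layer."""
    activations = list(inputs)
    for weights, biases in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]  # predicted probability of high ozone


def count_parameters(sizes):
    """Weights plus biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


network = init_network(LAYER_SIZES)
probability = forward(network, [0.5] * 13)  # untrained, so the value is arbitrary
print(probability, count_parameters(LAYER_SIZES))
```

With these assumed sizes the network has 258 trainable parameters, which is small relative to typical deep networks.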
4.5.1. Scaling and MSE of ANN Model
After scaling the test data D2, we predicted the target variable “ozone” using
our ANN model. The Mean Squared Error (MSE) of the model is 0.0833, which
indicates that the model performed well. However, MSE alone is not an optimal
evaluation technique for a classification model, so we further calculate the
misclassification rate and the confusion matrix.
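The scaling and MSE steps can be sketched as follows. The paper does not say which scaling method was used, so min-max scaling is an assumption here, as is the choice to reuse the training ranges when scaling the test data. The function names and the sample rows are hypothetical.

```python
def fit_min_max(rows):
    """Column-wise minimum and maximum learned from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned ranges; constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def mean_squared_error(y_true, y_prob):
    """Average squared gap between 0/1 labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical rows standing in for D1 (training) and D2 (test).
train = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
test = [[15.0, 500.0]]
mins, maxs = fit_min_max(train)
scaled_test = apply_min_max(test, mins, maxs)
print(scaled_test)
print(mean_squared_error([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1]))
```

Reusing the training ranges keeps information from the test set out of the preprocessing step, which is why the ranges are fitted on the training rows only in this sketch.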
4.5.2. Misclassification Rate of ANN Model
The misclassification rate is 0.107, which means that our model correctly
predicts both high and low ozone levels at a rate of 89.3%. This suggests that
our model performs well. Nevertheless, for a better evaluation, we further
calculate the AUC and the confusion matrix of our ANN model (see Table 9).
4.5.3. ANN Model Evaluation
The AUC of 0.954 implies that our best-fitted ANN model ranks a randomly
chosen high-ozone day above a randomly chosen low-ozone day about 95.4% of the
time. The 95% confidence interval is (0.929, 0.979), so we are 95% confident
that the true AUC falls within this interval.
4.5.4. ROC Curve
Figure 19 shows the ROC curve of the ANN model, i.e., the trade-off between
the hit rate and the false alarm rate. The area under this curve is 0.954,
with a 95% confidence interval of (0.929, 0.979).
4.5.5. Performance of ANN Model using Confusion Matrix
The accuracy of our model is 0.893 (89.3%), which is relatively high and
indicates that our model performs well in predicting the ozone level for a
day. The sensitivity or recall (TP rate) of 0.802 (80.2%) indicates that the
model detects most of the days with a high ozone level. The specificity (TN
rate) of 0.957 (95.7%), which is relatively high, indicates that the model
detects nearly all of the days with a low ozone level. Together with the high
accuracy, the precision of 92.9% implies that our model has a low False
Positive (FP) rate. The confusion matrix and statistical prediction parameters
for the artificial neural network model are shown in Table 10.
Table 9. Results of the ANN model.
Misclassification rate    MSE       cvAUC    SE        CI                Confidence
0.107                     0.0833    0.954    0.0126    (0.929, 0.979)    0.95
Table 10. Confusion matrix and other statistical prediction parameters for ANN.
Confusion Matrix and Statistics
Accuracy 0.893
95%CI (0.854, 0.925)
Sensitivity/Recall 0.802
Specificity (True Negative Rate/TNR) 0.957
Pos Pred Value/Precision 0.929
F1 Score 0.861
Prediction Time 0.00 secs
Binary Cross Entropy/Log Loss 3.74
Figure 19. ROC curve of the hit rate vs. the false alarm rate for the ANN model.
4.6. F1 Score
The F1 score is a metric commonly used in classification tasks to evaluate the
overall performance of a model. It combines both precision and recall into a
single value, providing a balanced measure of the model's accuracy.

$$F_1\ \text{score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The F1 score considers both FP (false positives) and FN (false negatives),
making it a useful metric when dealing with imbalanced datasets or when both
precision and recall are equally important. When comparing different models or
algorithms, a higher F1 score indicates better performance in terms of both
precision and recall. Based on our results, ANN performs better than LR, with
an F1 score of 0.861 compared to 0.843.
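As a quick check of the F1 definition, the sketch below recomputes the score from the rounded precision and recall values reported in Table 8 (LR) and Table 10 (ANN). The function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported (rounded) in Table 8 and Table 10.
print(f"LR:  {f1_score(0.811, 0.876):.3f}")
print(f"ANN: {f1_score(0.929, 0.802):.3f}")
```

With these rounded inputs the ANN value reproduces the reported 0.861, and the LR value comes out near 0.842, within rounding of the reported 0.843.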
5. Conclusion and Recommendations
The accuracy of our ANN model is 89.3%, which is relatively high and indicates
that the model performs well in predicting the ozone level for a given day.
The sensitivity or recall (TP rate) of 80.2% indicates that the model detects
most high-ozone days. The specificity (TN rate) of 95.7% indicates that the
model is even better at detecting low-ozone days. Together with the precision
of 92.9%, these results imply that the model has a low False Positive (FP)
rate.
In addition, our evaluation metrics show that the ANN model performs slightly
better than the LR model: it has a higher accuracy (89.3% compared to LR's
88.4%), a higher AUC (95.4% compared to LR's 95.2%), and a lower
misclassification rate (10.7% compared to LR's 11.6%).
Furthermore, when we consider the precision and recall of our models, both
perform well, with high precision and high recall, meaning that they have a
high true positive (TP) rate and a low false positive (FP) rate. When the
sensitivity is high, the false negative rate is low, meaning that the model is
unlikely to wrongly predict a negative (low ozone level) outcome on a day with
high ozone.
With regard to prediction time, both models predict very quickly, and the ANN
model has the lower prediction time (0.00 secs compared to LR's 0.02 secs).
The ANN model also has the lower binary cross entropy (3.74 compared to LR's
6.03), indicating that it performed better than the LR model in terms of
classification.
We recommend that subsequent research consider the following points:
Application of other types of supervised machine learning models, such as
Random Forest, Support Vector Machine, K-Nearest Neighbors, Decision Trees,
and Naïve Bayes, for the classification of ozone.
Extension of the scope of this paper by using datasets from other regions
affected by ozone pollution.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this
paper.
References
[1] Di, Q., Wang, Y., Zanobetti, A., Wang, Y., Koutrakis, P., Choirat, C., Dominici, F.
and Schwartz, J.D. (2017) Air Pollution and Mortality in the Medicare Population.
The New England Journal of Medicine
, 376, 2513-2522.
https://doi.org/10.1056/NEJMoa1702747
[2] Lin, S., Liu, X., Le, L.H. and Hwang, S.-A. (2008) Chronic Exposure to Ambient
Ozone and Asthma Hospital Admissions among Children.
Environmental Health
Perspectives
, 116, 1725-1730. https://doi.org/10.1289/ehp.11184
[3] Jerrett, M., Burnett, R.T., Pope, C.A., Ito, K., Thurston, G., Krewski, D., Shi, Y.,
Calle, E. and Thun, M. (2009) Long-Term Ozone Exposure and Mortality.
The New
England Journal of Medicine
, 360, 1085-1095.
https://doi.org/10.1056/NEJMoa0803894
[4] Parker, J.D., Akinbami, L.J. and Woodruff, T.J. (2009) Air Pollution and Childhood
Respiratory Allergies in the United States.
Environmental Health Perspectives
, 117,
140-147. https://doi.org/10.1289/ehp.11497
[5] Bhuiyan, M.A.M., Sahi, R.K., Islam, M.R. and Mahmud, S. (2021) Machine Learn-
ing Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Re-
gion.
Mathematics
, 9, Article No. 2901. https://doi.org/10.3390/math9222901
[6] U.S. EPA. Nonattainment Areas for Criteria Pollutants (Green Book).
https://www.epa.gov/green-book
[7] U.S. Environmental Protection Agency. Integrated Science Assessment (ISA) for
Ozone and Related Photochemical Oxidants.
https://www.epa.gov/isa/integrated-science-assessment-isa-ozone-and-related-phot
ochemical-oxidants
[8] Medina-Ramón, M. and Schwartz, J. (2008) Who Is More Vulnerable to Die from
C. Obunadike et al.
DOI:
10.4236/jdaip.2023.113012 238 Journal of Data Analysis and Information Processing
Ozone Air Pollution?
Epidemiology
, 19, 672-679.
https://doi.org/10.1097/EDE.0b013e3181773476
[9] Olufemi, I., Obunadike, C., Adefabi, A. and Abimbola, D. (2023) Application of Lo-
gistic Regression Model in Prediction of Early Diabetes across United States.
Inter-
national Journal of Scientific and Management Research
, 6, 34-48.
https://doi.org/10.37502/IJSMR.2023.6502
[10] Tran, B., Sudusinghe, C., Nguyen, S. and Alahakoon, D. (2023) Building Interpreta-
ble Predictive Models with Context-Aware Evolutionary Learning.
Applied Soft
Computing
, 132, Article ID: 109854. https://doi.org/10.1016/j.asoc.2022.109854
[11] Issitt, R.W., Cortina-Borja, M., Bryant, W., Bowyer, S., Taylor, A.M. and Sebire, N.
(2022) Classification Performance of Neural Networks versus Logistic Regression
Models: Evidence from Healthcare Practice.
Cureus
, 14, e22443.
https://doi.org/10.7759/cureus.22443
[12] Valluri, C., Raju, S. and Patil, V.H. (2022) Customer Determinants of Used Auto
Loan Churn: Comparing Predictive Performance Using Machine Learning Tech-
niques.
Journal of Marketing Analytics
, 10, 279-296.
https://doi.org/10.1057/s41270-021-00135-6
[13] Xie, X., Wang, L. and Wang, A. (2010) Artificial Neural Network Modeling for De-
ciding If Extractions Are Necessary Prior to Orthodontic Treatment.
The Angle
Orthodontist
, 80, 262-266. https://doi.org/10.2319/111608-588.1
[14] Abiodun, O.I., Jantan, A., Omolara, A.E., Dada, K.V., Mohamed, N.A. and Arshad,
H. (2018) State-of-the-Art in Artificial Neural Network Applications: A Survey.
He-
liyon
, 4, e00938. https://doi.org/10.1016/j.heliyon.2018.e00938
[15] Sarker, I.H. (2021) Machine Learning: Algorithms, Real-World Applications and
Research Directions.
SN Computer Science
, 2, Article No. 160.
https://doi.org/10.1007/s42979-021-00592-x
[16] Couronné, R., Probst, P. and Boulesteix, A.-L. (2018) Random Forest versus Logistic
Regression: A Large-Scale Benchmark Experiment.
BMC Bioinformatics
, 19, Article
No. 270. https://doi.org/10.1186/s12859-018-2264-5
[17] Sarker, I.H. (2021) Deep Learning: A Comprehensive Overview on Techniques,
Taxonomy, Applications and Research Directions.
SN Computer Science
, 2, Article
No. 420. https://doi.org/10.1007/s42979-021-00815-1
[18] Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O.,
Santamaría, J., Fadhel, M.A., Al-Amidie, M. and Farhan, L. (2021) Review of Deep
Learning: Concepts, CNN Architectures, Challenges, Applications, Future Direc-
tions.
Journal of Big Data
, 8, Article No. 53.
https://doi.org/10.1186/s40537-021-00444-8
[19] Montesinos López, O.A., Montesinos López, A. and Crossa, J. (2022) Multivariate
Statistical Machine Learning Methods for Genomic Prediction. Springer, Cham.
https://doi.org/10.1007/978-3-030-89010-0
[20] Albaradei, S., Thafar, M., Alsaedi, A., Van Neste, C., Gojobori, T., Essack, M. and
Gao, X. (2021) Machine Learning and Deep Learning Methods That Use Omics
Data for Metastasis Prediction.
Computational and Structural Biotechnology Jour-
nal
, 19, 5008-5018. https://doi.org/10.1016/j.csbj.2021.09.001
[21] Duan, F., Zhang, S., Yan, Y. and Cai, Z. (2022) An Oversampling Method of Unba-
lanced Data for Mechanical Fault Diagnosis Based on MeanRadius-SMOTE.
Sen-
sors
, 22, Article No. 5166. https://doi.org/10.3390/s22145166
[22] Karrar, A.E. (2022) The Effect of Using Data Pre-Processing by Imputations in
C. Obunadike et al.
DOI:
10.4236/jdaip.2023.113012 239 Journal of Data Analysis and Information Processing
Handling Missing Values.
Indonesian Journal of Electrical Engineering and Infor-
matics
, 10, 375-384. https://doi.org/10.52549/ijeei.v10i2.3730
[23] Bin Rafiq, R., Modave, F., Guha, S. and Albert, M.V. (2020) Validation Methods to
Promote Real-World Applicability of Machine Learning in Medicine. 2020 3
rd In-
ternational Conference on Digital Medicine and Image Processing
, Kyoto, 6-9 No-
vember 2020, 13-19. https://doi.org/10.1145/3441369.3441372
... Another study found that a hybrid model combining RF and CNN achieved higher accuracy and F1-score than individual models for detecting network intrusions [9]. Additionally, application of machine learning e.g., ANN models and logistic regression models, is seen as powerful mechanisms [5]. ...
Research Proposal
Full-text available
The use of machine learning in cyber security has become increasingly popular in recent years due to its potential to identify and mitigate cyber threats. In this paper, we explore the application of machine learning algorithms to detect cyber-attacks in network traffic data. We first preprocessed the data by applying feature engineering and scaling techniques. We then trained and tuned two models: A Random Forest and a Neural network. Our results show that both models performed exceptionally, achieving 100% accuracy and an F1 score on the test data. The Random Forest model achieved these results without any parameter tuning. At the same time, the neural network required careful tuning of its architecture and hyperparameters to evaluate the model's performance using precision, recall, F1 score, and confusion matrix. In conclusion, our findings demonstrate the potential of machine learning in detecting cyber-attacks in network traffic data. The high accuracy achieved by our models indicates that machine learning algorithms can effectively detect cyber threats in real-time effect. This has important implications for developing more robust and reliable cybersecurity systems in the future.
... This model combines the strengths of Support Vector Machines and Random Forests, resulting in higher accuracy and better generalization performance. Additionally, application of machine learning e.g., ANN models and logistic regression models, is seen as powerful mechanisms to mitigate environmental hazards [11]. Other studies have also explored the use of ensemble methods such as stacking and boosting [5]. ...
Preprint
Full-text available
Road accidents have significant economic and societal costs, with a small number of severe accidents accounting for a large portion of these costs. Predicting accident severity can help in the proactive approach to road safety by identifying potential unsafe road conditions and taking well-informed actions to reduce the number of severe accidents. This study investigates the effectiveness of the Random Forest machine learning algorithm for predicting the severity of an accident. The model is trained on a dataset of accident records from a large metropolitan area and evaluated using various metrics. Hyperparameters and feature selection are optimized to improve the model's performance. The results show that the Random Forest model is an effective tool for predicting accident severity with an accuracy of over 80%. The study also identifies the top six most important variables in the model, which include wind speed, pressure, humidity, visibility, clear conditions, and cloud cover. The fitted model has an AUC of 80%, a recall of 79.2%, a precision of 97.1%, and an F1 score of 87.3%. These results suggest that the proposed model has higher performance in explaining the target variable, which is the accident severity class. Overall, the study provides evidence that the Random Forest model is a viable and reliable tool for predicting accident severity and can be used to help reduce the number of fatalities and injuries due to road accidents in the United States.
Article
Full-text available
The study is focused on identity theft and cybersecurity in United States. Hence, the study is aimed at examining the impact of cybersecurity on identity theft in United States using a time series data which covers the period between 2001 and 2021. Trend analysis of complaints of identity theft and cybersecurity over the years was conducted; also, the nature of relationship between the two variables was established. Chi-Square analysis was used to examine the impact of cybersecurity on identity theft in United States. Line graphs were used to analyze the trend in the variable. Time series data was used in the study and the data was obtained from secondary sources; Statista.com, US Federal Trade Commission, Insurance Information Institute and Identitytheft.org. Result from the study revealed that consumers’ complaints on identity theft were on the increase every year. Total spending of the economy (both private and public sector) on cybersecurity was on continuous increase over the years. More than 100% of spending in 2010 was incurred in 2018. The Chi-Square analysis revealed that cybersecurity does not have significant impact on identity theft. The study recommended that the government increase the level of public awareness to ensure that members of the public protect their personal and other information to ensure that they are not compromised for fraud or identity theft. Organizations also need to invest more in the security system and develop policies that will support the security system. At the country level, international treaties and collaboration should be encouraged to prosecute the fraudsters hiding behind national borders.
Research
Full-text available
This study examines a case study and the impact of predicting early diabetes in the United States through the application of Logistic Regression Model. After comparing the predictive ability of machine learning algorithm (Binomial Logistic Model) to diabetes, the important features that cause diabetes were also studied. We predict the test data based on the important variables and compute the prediction accuracy using the Receiver Operating Characteristic (ROC) curve and Area Under Curve (AUC). From the correlation coefficient analysis, we can deduce that, out of the 16 PIE variables, only "Itching and Delayed healing" were statistically insignificant with the target variable (class) with a value of 83% and 33% respectively while "Alopecia and Gender/Sex" has a negative correlation with the target variable (class). In addition, the Lasso Regularization method was used to penalize our logistic regression model, and it was observed that the predictor variable "sudden_weight_loss" does not appear to be statistically significant in the model, and the predictor variables "Polyuria and Polydipsa" contributed most to the prediction of Class "Positive" based on their parameter values and odd ratios. Since the confidence interval of our model falls between 93% and 99%, we are 95% confident that our AUC is accurate and thus, it indicates that our fitted model can predict diabetes status correctly.
Article
Full-text available
In the last decade, ground-level ozone exposure has led to a significant increase in environmental and health risks. Thus, it is essential to measure and monitor atmospheric ozone concentration levels. Specifically, recent improvements in machine learning (ML) processes, based on statistical modeling, have provided a better approach to solving these risks. In this study, we compare Naive Bayes, K-Nearest Neighbors, Decision Tree, Stochastic Gradient Descent, and Extreme Gradient Boosting (XGBoost) algorithms and their ensemble technique to classify ground-level ozone concentration in the El Paso-Juarez area. As El Paso-Juarez is a non-attainment city, the concentrations of several air pollutants and meteorological parameters were analyzed. We found that the ensemble (soft voting classifier) of algorithms used in this paper provide high classification accuracy (94.55%) for the ozone dataset. Furthermore, variables that are highly responsible for the high ozone concentration such as Nitrogen Oxide (NOx), Wind Speed and Gust, and Solar radiation have been discovered.
Article
Full-text available
Importance The US Environmental Protection Agency is required to reexamine its National Ambient Air Quality Standards (NAAQS) every 5 years, but evidence of mortality risk is lacking at air pollution levels below the current daily NAAQS in unmonitored areas and for sensitive subgroups. Objective To estimate the association between short-term exposures to ambient fine particulate matter (PM2.5) and ozone, and at levels below the current daily NAAQS, and mortality in the continental United States. Design, Setting, and Participants Case-crossover design and conditional logistic regression to estimate the association between short-term exposures to PM2.5 and ozone (mean of daily exposure on the same day of death and 1 day prior) and mortality in 2-pollutant models. The study included the entire Medicare population from January 1, 2000, to December 31, 2012, residing in 39 182 zip codes. Exposures Daily PM2.5 and ozone levels in a 1-km × 1-km grid were estimated using published and validated air pollution prediction models based on land use, chemical transport modeling, and satellite remote sensing data. From these gridded exposures, daily exposures were calculated for every zip code in the United States. Warm-season ozone was defined as ozone levels for the months April to September of each year. Main Outcomes and Measures All-cause mortality in the entire Medicare population from 2000 to 2012. Results During the study period, there were 22 433 862 million case days and 76 143 209 control days. Of all case and control days, 93.6% had PM2.5 levels below 25 μg/m³, during which 95.2% of deaths occurred (21 353 817 of 22 433 862), and 91.1% of days had ozone levels below 60 parts per billion, during which 93.4% of deaths occurred (20 955 387 of 22 433 862). The baseline daily mortality rates were 137.33 and 129.44 (per 1 million persons at risk per day) for the entire year and for the warm season, respectively. 
Each short-term increase of 10 μg/m³ in PM2.5 (adjusted by ozone) and 10 parts per billion (10⁻⁹) in warm-season ozone (adjusted by PM2.5) were statistically significantly associated with a relative increase of 1.05% (95% CI, 0.95%-1.15%) and 0.51% (95% CI, 0.41%-0.61%) in daily mortality rate, respectively. Absolute risk differences in daily mortality rate were 1.42 (95% CI, 1.29-1.56) and 0.66 (95% CI, 0.53-0.78) per 1 million persons at risk per day. There was no evidence of a threshold in the exposure-response relationship. Conclusions and Relevance In the US Medicare population from 2000 to 2012, short-term exposures to PM2.5 and warm-season ozone were significantly associated with increased risk of mortality. This risk occurred at levels below current national air quality standards, suggesting that these standards may need to be reevaluated.
Article
Full-text available
Background Studies have shown that long-term exposure to air pollution increases mortality. However, evidence is limited for air-pollution levels below the most recent National Ambient Air Quality Standards. Previous studies involved predominantly urban populations and did not have the statistical power to estimate the health effects in underrepresented groups. Methods We constructed an open cohort of all Medicare beneficiaries (60,925,443 persons) in the continental United States from the years 2000 through 2012, with 460,310,521 person-years of follow-up. Annual averages of fine particulate matter (particles with a mass median aerodynamic diameter of less than 2.5 μm [PM2.5]) and ozone were estimated according to the ZIP Code of residence for each enrollee with the use of previously validated prediction models. We estimated the risk of death associated with exposure to increases of 10 μg per cubic meter for PM2.5 and 10 parts per billion (ppb) for ozone using a two-pollutant Cox proportional-hazards model that controlled for demographic characteristics, Medicaid eligibility, and area-level covariates. Results Increases of 10 μg per cubic meter in PM2.5 and of 10 ppb in ozone were associated with increases in all-cause mortality of 7.3% (95% confidence interval [CI], 7.1 to 7.5) and 1.1% (95% CI, 1.0 to 1.2), respectively. When the analysis was restricted to person-years with exposure to PM2.5 of less than 12 μg per cubic meter and ozone of less than 50 ppb, the same increases in PM2.5 and ozone were associated with increases in the risk of death of 13.6% (95% CI, 13.1 to 14.1) and 1.0% (95% CI, 0.9 to 1.1), respectively. For PM2.5, the risk of death among men, blacks, and people with Medicaid eligibility was higher than that in the rest of the population. Conclusions In the entire Medicare population, there was significant evidence of adverse effects related to exposure to PM2.5 and ozone at concentrations below current national standards. 
This effect was most pronounced among self-identified racial minorities and people with low income. (Supported by the Health Effects Institute and others.)
Article
Full-text available
Although many studies have linked elevations in tropospheric ozone to adverse health outcomes, the effect of long-term exposure to ozone on air pollution-related mortality remains uncertain. We examined the potential contribution of exposure to ozone to the risk of death from cardiopulmonary causes and specifically to death from respiratory causes. Data from the study cohort of the American Cancer Society Cancer Prevention Study II were correlated with air-pollution data from 96 metropolitan statistical areas in the United States. Data were analyzed from 448,850 subjects, with 118,777 deaths in an 18-year follow-up period. Data on daily maximum ozone concentrations were obtained from April 1 to September 30 for the years 1977 through 2000. Data on concentrations of fine particulate matter (particles that are < or = 2.5 microm in aerodynamic diameter [PM(2.5)]) were obtained for the years 1999 and 2000. Associations between ozone concentrations and the risk of death were evaluated with the use of standard and multilevel Cox regression models. In single-pollutant models, increased concentrations of either PM(2.5) or ozone were significantly associated with an increased risk of death from cardiopulmonary causes. In two-pollutant models, PM(2.5) was associated with the risk of death from cardiovascular causes, whereas ozone was associated with the risk of death from respiratory causes. The estimated relative risk of death from respiratory causes that was associated with an increment in ozone concentration of 10 ppb was 1.040 (95% confidence interval, 1.010 to 1.067). The association of ozone with the risk of death from respiratory causes was insensitive to adjustment for confounders and to the type of statistical model used. In this large study, we were not able to detect an effect of ozone on the risk of death from cardiovascular causes when the concentration of PM(2.5) was taken into account. 
We did, however, demonstrate a significant increase in the risk of death from respiratory causes in association with an increase in ozone concentration.
Childhood respiratory allergies, which contribute to missed school days and other activity limitations, have increased in recent years, possibly due to environmental factors. In this study we examined whether air pollutants are associated with childhood respiratory allergies in the United States. For the approximately 70,000 children from the 1999-2005 National Health Interview Survey eligible for this study, we assigned between 40,000 and 60,000 ambient pollution monitoring data from the U.S. Environmental Protection Agency, depending on the pollutant. We used monitors within 20 miles of the child's residential block group. We used logistic regression models, fit with methods for complex surveys, to examine the associations between the reporting of respiratory allergy or hay fever and annual average exposure to particulate matter ≤ 2.5 μm in diameter (PM2.5), PM ≤ 10 μm in diameter, sulfur dioxide, and nitrogen dioxide and summer exposure to ozone, controlling for demographic and geographic factors. Increased respiratory allergy/hay fever was associated with increased summer O3 levels [adjusted odds ratio (AOR) per 10 ppb = 1.20; 95% confidence interval (CI), 1.15-1.26] and increased PM2.5 (AOR per 10 μg/m³ = 1.23; 95% CI, 1.10-1.38). These associations persisted after stratification by urban-rural status, inclusion of multiple pollutants, and definition of exposures by differing exposure radii. No associations between the other pollutants and the reporting of respiratory allergy/hay fever were apparent. These results provide evidence of adverse health for children living in areas with chronic exposure to higher levels of O3 and PM2.5 compared with children with lower exposures.
The association between chronic exposure to air pollution and adverse health outcomes has not been well studied. This project investigated the impact of chronic exposure to high ozone levels on childhood asthma admissions in New York State. We followed a birth cohort born in New York State during 1995-1999 to first asthma admission or until 31 December 2000. We identified births and asthma admissions through the New York State Integrated Child Health Information System and linked these data with ambient ozone data (8-hr maximum) from the New York State Department of Environmental Conservation. We defined chronic ozone exposure using three indicators: mean concentration during the follow-up period, mean concentration during the ozone season, and proportion of follow-up days with ozone levels > 70 ppb. We performed logistic regression analysis to adjust for child's age, sex, birth weight, and gestational age; maternal race/ethnicity, age, education, insurance status, smoking during pregnancy, and poverty level; and geographic region, temperature, and co-pollutants. Asthma admissions were significantly associated with increased ozone levels for all chronic exposure indicators (odds ratios, 1.16-1.68), with a positive dose-response relationship. We found stronger associations among younger children, low sociodemographic groups, and New York City residents as effect modifiers. Chronic exposure to ambient ozone may increase the risk of asthma admissions among children. Younger children and those in low socioeconomic groups have a greater risk of asthma than do other children at the same ozone level.
Daily increases in ambient ozone have been associated with increased mortality. However, little is known about which subpopulations are more susceptible to death related to ozone. We conducted a case-only study in 48 US cities to identify subpopulations particularly vulnerable to ozone air pollution. Mortality and ozone data were obtained for the period 1989-2000 (May through September of each year) for 2,729,640 decedents. For each potential effect modifier, we fitted city-specific logistic regression models and pooled the results across all cities. Additionally, we examined differences in susceptibility factors according to several city characteristics using a meta-regression. For each 10 ppb increase in ozone (average of lags 0 to 2), people aged ≥ 65 years had a 1.10% (95% confidence interval = 0.44% to 1.77%) additional increase in mortality (compared with younger ages). Other groups that were particularly susceptible were black people (additional 0.53% [0.19% to 0.87%]), women (additional 0.58% [0.18% to 0.98%]), and those with atrial fibrillation (additional 1.66% [0.03% to 3.32%]). Susceptibility factors had a larger effect in cities with lower ozone levels. For instance, the additional increase in ozone-related mortality for the elderly was 1.48% (0.81% to 2.15%) in a city with a mean ozone level of 42 ppb versus 0.45% (-0.27% to 1.19%) in a city with a level of 51 ppb. We confirmed the susceptibility of the elderly to die of ambient ozone and identified other vulnerable subpopulations including women, blacks, and those with atrial fibrillation. Differences in vulnerability were particularly marked in cities with lower ozone concentrations.
U.S. EPA. 2017. Nonattainment Areas for Criteria Pollutants (Green Book). Accessed at https://www.epa.gov/green-book. Data updated as of January 31, 2018.