Internaonal Journal of Scienc and Management Research
Volume 6 Issue 05 (May) 2023
ISSN: 2581-6888
Page: 34-48
Application of Logistic Regression Model in Prediction of Early Diabetes
Across United States
I. Olufemi, C. Obunadike, A. Adefabi & D. Abimbola
1,2,3,4 Department of Computer Science, Austin Peay State University, Clarksville, USA
DOI - http://doi.org/10.37502/IJSMR.2023.6502
Abstract
This study examines a case study and the impact of predicting early diabetes in the United States
through the application of a Logistic Regression Model. After assessing the predictive ability
of the machine learning algorithm (Binomial Logistic Model) for diabetes, the important features
that cause diabetes were also studied. We predict the test data based on the important variables
and compute the prediction accuracy using the Receiver Operating Characteristic (ROC) curve
and Area Under the Curve (AUC). From the correlation coefficient analysis, we deduce that,
out of the 16 PIE variables, only "Itching" and "Delayed healing" were statistically insignificant
with respect to the target variable (class), with p-values of 0.83 and 0.33 respectively, while "Alopecia" and
"Gender/Sex" have a negative correlation with the target variable (class). In addition, the Lasso
regularization method was used to penalize our logistic regression model, and it was observed
that the predictor variable "sudden_weight_loss" does not appear to be statistically significant
in the model, while the predictor variables "Polyuria" and "Polydipsia" contributed most to the
prediction of the class "Positive" based on their parameter values and odds ratios. Since the
95% confidence interval of the AUC falls between 0.93 and 0.99, we are 95% confident in the estimated
AUC, and it indicates that our fitted model can predict diabetes status correctly.
Keywords: Machine Learning, Supervised Learning, Binomial Logistic Model, Early Diabetes
Prediction.
1. Introduction
Diabetes is undoubtedly one of the most common diseases worldwide and one of the biggest
health problems affecting millions of people across the world today. Many Machine Learning
(ML) techniques have been utilized to predict diabetes in the last couple of years, and the
ever-increasing complexity of this problem has inspired research scientists to explore other
robust sets of algorithms [2].
Logistic regression is a classification technique adopted in machine learning. It is also a
statistical method applied when analyzing a dataset with one or more PIE variables in order to
determine the outcome. When the goal is to find the best-fitting model that describes the relationship
between the DORT variable and the PIE variables, logistic regression is the model that answers
this puzzle [Bhuiyan].
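In its binomial form, the model expresses the log-odds of the outcome as a linear combination of the PIE variables:

logit(p) = ln(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ,

where p is the probability that the DORT variable takes the class of interest and x₁, …, xₖ are the PIE variables; the fitted coefficients β can then be exponentiated to obtain odds ratios, as done later in this study.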
This paper will help to address predominant challenges encountered in the health sector by
applying machine learning, specifically a logistic regression model, to accurately predict and identify
diabetes at an early stage. According to the Centers for Disease Control and Prevention's
National Diabetes Statistics Report released in 2022, more than 130
million adults are living with diabetes or prediabetes in the United States [2]. The application
of machine learning and artificial intelligence in the health sector will help to tackle this
imminent problem aggressively and subsequently lead to a drastic improvement in the health
sector. The aim and objective of this paper is to help the United States health and medical
sector identify diseases at a very early stage, using diabetes as a case study. In addition,
based on the report from the U.S. Department of Health and Human Services, diabetes is
among the 15 leading causes of death in the United States.
Fig. 1: 15 Leading Causes of Death in the United States [2]
Diabetes is today one of the most frequent diseases affecting the elderly population.
According to the International Diabetes Federation, 451 million people across the world were
diabetic in 2019 [4]. This number is expected to increase greatly, affecting 693
million people within the coming 26 years. Diabetes is a chronic disease associated
with an abnormal state of the human body in which the blood glucose level is inconsistent:
pancreatic dysfunction leads to the production of little or no insulin at all, causing
type 1 diabetes, or cells become resistant to insulin, causing type 2 diabetes.
The dataset used for this analysis was obtained from the University of California, Irvine Machine
Learning Repository. The dataset consists of multivariate data with 520 instances (rows or events)
and 17 attributes (i.e., variables or columns). Out of the 17 variables, 'class' is assigned as
the target variable, otherwise known as the DORT or Y variable (i.e., dependent, observatory,
response, or target variable), while the remaining 16 variables represent the PIE or X variables
(predictor, independent, or explanatory variables).
The main cause of diabetes remains unknown, yet scientists believe that both genetic factors
and environmental lifestyle play a major role. Even though it is incurable, diabetes can be
managed with treatment and medication. Individuals with diabetes face a risk of developing
secondary health issues such as heart disease and nerve damage. Thus, early detection and
treatment of diabetes can prevent complications and help reduce the risk of severe health
problems.
2. Methodology
A systematic literature review was conducted to identify published studies on the early
detection and prevention of diabetes [5]. We present in more detail the use of data relevant to
the early detection of diabetes in an individual, based on our survey
activities and knowledge acquisition. The dataset was analyzed using the R programming language,
and the first step adopted during the analysis was Exploratory Data Analysis (EDA).
Exploratory data analysis is very vital because it gives us first insights into the
associations and relationships that exist between different variables. This is often done
using R packages such as plotly, naniar, and ggplot2.
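As a minimal sketch of this first step (the file name "diabetes_data_upload.csv", the use of base read.csv, and the object names are assumptions not shown in the paper), the data can be loaded and inspected in R as follows:

```r
# Minimal EDA sketch; file name and column names are assumptions
library(ggplot2)

diabetes <- read.csv("diabetes_data_upload.csv", stringsAsFactors = TRUE)

dim(diabetes)      # expect 520 rows (instances) and 17 columns (attributes)
str(diabetes)      # data type of each variable
summary(diabetes)  # quick distributional summary of every variable

# First look at the target variable (class)
ggplot(diabetes, aes(x = class)) +
  geom_bar() +
  labs(title = "Frequency distribution of the target variable (class)")
```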
2.1 Descriptive Analysis of the Dataset
As described above, 'class' is assigned as the target (DORT or Y) variable, while the remaining
16 variables represent the PIE or X (predictor, independent, or explanatory) variables. The
target variable (class) was transformed from negative to 1 and positive to 0 (see Table 1; a
sketch of this recoding in R follows the table).
Table 1: Description and transformation of the data types (variables or features)

| S/No | Variable           | Values  | Initial Data Type (Idt) | Transformed Data Type (Tdt) |
|------|--------------------|---------|-------------------------|------------------------------|
| 1    | Age                | 20-65   | Numeric/Continuous      | Numeric/Continuous           |
| 2    | Sex/Gender         | M/F     | Categorical/Binary      | Numeric/Binary (0/1)         |
| 3    | Polyuria           | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 4    | Polydipsia         | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 5    | Sudden weight loss | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 6    | Weakness           | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 7    | Polyphagia         | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 8    | Genital thrush     | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 9    | Visual blurring    | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 10   | Itching            | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 11   | Irritability       | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 12   | Delayed healing    | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 13   | Partial paresis    | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 14   | Muscle stiffness   | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 15   | Alopecia           | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 16   | Obesity            | Yes/No  | Categorical/Binary      | Numeric/Binary (0/1)         |
| 17   | Class              | +/-     | Categorical/Binary      | Numeric/Binary (0/1)         |
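A minimal sketch of the recoding summarized in Table 1, continuing from the data frame loaded earlier (column names are assumptions; the 0/1 assignments follow the text and the header of Table 7):

```r
# Recode the categorical variables to numeric 0/1 (cf. Table 1)
diabetes_num <- diabetes
yn_cols <- setdiff(names(diabetes_num), c("Age", "Gender", "class"))

# Yes/No symptom variables: Yes = 0, No = 1 (coding taken from Table 7's header)
diabetes_num[yn_cols] <- lapply(diabetes_num[yn_cols],
                                function(x) ifelse(x == "Yes", 0, 1))

# Gender: Female = 1, Male = 0 (as indicated in Table 7)
diabetes_num$Gender <- ifelse(diabetes_num$Gender == "Female", 1, 0)

# Class: Positive = 0, Negative = 1 (as stated in the text)
diabetes_num$class <- ifelse(diabetes_num$class == "Positive", 0, 1)
```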
2.2 Checking for Missing Data
The anyNA() function was used to check for missing values in our dataset. The outcome
was "FALSE", which implies that we do not have any missing data. In addition, we went further
and visualized the missingness using the naniar package (see Fig. 1). Based
on Figures 1 and 2, it is clear that we do not have any missing data.
Figure 1: Bar plot showing missing data using the vis_miss() function in the naniar package.
Figure 2: Graphical line plot showing missing data using the gg_miss_var() function
in the naniar package.
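A sketch of these checks, using the function names cited above (the object name carries over from the earlier sketches and is an assumption):

```r
# Check for missing values; no missing data is expected in this dataset
anyNA(diabetes_num)        # returns FALSE when there are no NA values

library(naniar)
vis_miss(diabetes_num)     # missingness map per variable (cf. Fig. 1)
gg_miss_var(diabetes_num)  # number of missing values per variable (cf. Fig. 2)
```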
2.2.1 Intensive cross-checking for other missing values
To ensure that our analysis and model are free from errors, it is very important to loop
thoroughly through the whole dataset and check for missing values that may occur in forms other
than "NA" (a sketch of such a loop is shown after Table 2). From Table 2 below, the ncom, nmiss and
miss.prop columns show that we do not have any missing values. This implies that our dataset is ready
for use.
Table 2: Iteration through the dataset using a for loop to check for other missing values

| Col.num | V.name             | Mode    | N.level | ncom | nmiss | Miss.prop |
|---------|--------------------|---------|---------|------|-------|-----------|
| 1       | Age                | numeric | 51      | 520  | 0     | 0         |
| 2       | Sex/Gender         | numeric | 1       | 520  | 0     | 0         |
| 3       | Polyuria           | numeric | 1       | 520  | 0     | 0         |
| 4       | Polydipsia         | numeric | 2       | 520  | 0     | 0         |
| 5       | Sudden weight loss | numeric | 2       | 520  | 0     | 0         |
| 6       | Weakness           | numeric | 2       | 520  | 0     | 0         |
| 7       | Polyphagia         | numeric | 2       | 520  | 0     | 0         |
| 8       | Genital thrush     | numeric | 2       | 520  | 0     | 0         |
| 9       | Visual blurring    | numeric | 2       | 520  | 0     | 0         |
| 10      | Itching            | numeric | 2       | 520  | 0     | 0         |
| 11      | Irritability       | numeric | 2       | 520  | 0     | 0         |
| 12      | Delayed healing    | numeric | 2       | 520  | 0     | 0         |
| 13      | Partial paresis    | numeric | 2       | 520  | 0     | 0         |
| 14      | Muscle stiffness   | numeric | 2       | 520  | 0     | 0         |
| 15      | Alopecia           | numeric | 2       | 520  | 0     | 0         |
| 16      | Obesity            | numeric | 2       | 520  | 0     | 0         |
| 17      | Class              | numeric | 2       | 520  | 0     | 0         |
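A loop of the kind summarized in Table 2 could be sketched as follows (the summary column names mirror Table 2; treating empty strings as an additional missing-value form is an assumption):

```r
# Iterate over every column and record its mode, number of distinct values,
# and counts of complete/missing entries (cf. Table 2)
miss_check <- data.frame()
for (j in seq_along(diabetes_num)) {
  x <- diabetes_num[[j]]
  is_missing <- is.na(x) | x == ""          # NA or empty-string entries
  miss_check <- rbind(miss_check, data.frame(
    col.num   = j,
    v.name    = names(diabetes_num)[j],
    mode      = mode(x),
    n.level   = length(unique(x)),
    ncom      = sum(!is_missing),
    nmiss     = sum(is_missing),
    miss.prop = mean(is_missing)
  ))
}
miss_check
```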
2.2.2 Frequency distribution of the target variable and box plots between target and predictor
variables
Fig. 3 shows the frequency distribution of the target variable (class). Out of the 520 events or
rows, 0.62 or 62% of the class values are "Positive" while 0.38 or 38% are "Negative" (see Fig. 3).
This distribution indicates that we have an imbalanced classification problem. We therefore
further investigated the association of the class variable (DORT) with the other 16
variables (PIE). Since the target variable (class) is categorical and the "Age" variable is
continuous, the Wilcoxon rank-sum test was applied to check for their association. Based
on the result, the p-value of 0.0124 indicates that there is an association between the two
variables, because the p-value is below the benchmark of 0.05. Furthermore, a correlation
coefficient of -0.11 was observed between the target variable (class) and the predictor
variable gender/sex (see Fig. 5). Table 3 shows the p-values and test statistics of all 16 predictor
variables against the target (class) variable. From the result, it is obvious that both Itching and
Delayed healing are unimportant variables because their p-values are higher than the
benchmark. It is important to note that, aside from the Age variable, the chi-square test was used
to check for association with the remaining variables because they are categorical.
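These tests can be reproduced with base R (a sketch; Polyuria is shown as one example of the chi-square call, which is repeated for each categorical predictor):

```r
# Continuous Age vs. binary class: Wilcoxon rank-sum test
wilcox.test(Age ~ class, data = diabetes_num)

# Categorical predictor vs. class: chi-square test of independence
chisq.test(table(diabetes_num$Polyuria, diabetes_num$class))
```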
2.2.3 Correlation coefficients and association between variables
Correlation is a very strong statistical measurement. It helps to identify the association between
two variables, and it ranges from -1 to +1. Based on our analysis, all
16 predictor variables have an association with the DORT variable (class) except Itching
and Delayed healing (see Fig. 5). In addition, among the 16 predictor variables, Alopecia and
Gender/sex have a negative correlation with the target variable (see Fig. 5). The
association of the target variable with the predictor variables was observed (see Table 3). The
marginal (bivariate) associations between the class variable (DORT) and each predictor
variable were used to classify the predictor variables into important and unimportant
predictors (see Table 3 and Fig. 5).
The correlation coefficient between two variables is computed as

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ]

where:
r: correlation coefficient;
xᵢ: values of the PIE (predictor, independent, or explanatory) variable in the samples;
yᵢ: values of the DORT (dependent, observatory, response, or target) variable in the samples;
x̄: mean of the PIE variable values in the samples;
ȳ: mean of the DORT variable values in the samples.
Fig. 3: Frequency distribution of DORT (class)
Fig. 4: Box plot of Age (PIE) on Class (DORT)
Table 3: Test statistics and p-values between the variables

| Variable           | Test-Statistic | P-value | Decision (H0/HA) |
|--------------------|----------------|---------|------------------|
| Age                | 27834          | 0.0124  | important        |
| Sex/Gender         | 103.03         | 0       | important        |
| Polyuria           | 227.86         | 0       | important        |
| Polydipsia         | 216.17         | 0       | important        |
| Sudden weight loss | 97.29          | 0       | important        |
| Weakness           | 29.76          | 0       | important        |
| Polyphagia         | 59.59          | 0       | important        |
| Genital thrush     | 5.79           | 0.0161  | important        |
| Visual blurring    | 31.80          | 0       | important        |
| Itching            | 0.04           | 0.82975 | unimportant      |
| Irritability       | 45.20          | 0       | important        |
| Delayed healing    | 0.96           | 0.32666 | unimportant      |
| Partial paresis    | 95.38          | 0       | important        |
| Muscle stiffness   | 7.28           | 0.00694 | important        |
| Alopecia           | 36.06          | 0       | important        |
| Obesity            | 2.32           | 0.12711 | important        |
Fig 5: Correlation between DORT (target) and PIE variables
3. Results and Discussion
The logistic regression technique was applied to the training data to build a predictive model. First,
we adopted Lasso regularization (L1) with a penalty and obtained the tuning parameter (λ) via
cross-validation. It is important to note that the L1 penalty is used for both variable selection
and shrinkage, since it can force some of the coefficient estimates to be exactly zero.
Table 5 presents the important features (PIE variables) retained by the best predictive model with the
L1 penalty. Based on Table 5, it is obvious that, out of the 16 predictor variables, only
"sudden_weight_loss" is not an important predictor. The test data were then predicted using this
model, and its accuracy was computed.
3.1 Dataset Partitioning
Having performed EDA and determined the important features (variables), we are ready
to fit the model on our training data. Therefore, we split the dataset into training (70%)
and testing (30%) data. The logistic regression model used is binomial, due to the
features of our dataset and the binary target variable. In addition, for reproducibility of the
partitioned dataset, we applied set.seed(123); the set.seed() function can take any integer
value as its seed.
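A minimal sketch of this partition (base R sampling is assumed, since the exact splitting utility is not shown):

```r
# Reproducible 70/30 train-test split
set.seed(123)
n         <- nrow(diabetes_num)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_data <- diabetes_num[train_idx, ]
test_data  <- diabetes_num[-train_idx, ]
```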
3.2 Model Fitting
Since our target (DORT) variable is binary (categorical), we resort to a classifier
machine model (CMM). We use a logistic regression model with Lasso as the
regularization approach to predict the diabetes status of a patient. We chose the Lasso regularization
approach because our focus is on obtaining a parsimonious model that adequately explains the target
(DORT) variable. We fit the Lasso logistic regression model over a sequence of 200 tuning
parameter values with lower limit 0.001 and upper limit 0.5 (see Fig. 6).
Fig 6: Sample code used for fitting the model
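A sketch of such a cross-validated Lasso logistic fit (the glmnet package, cv.glmnet, and the object names are assumptions; Fig. 6 shows the authors' actual code):

```r
# Cross-validated Lasso (L1) logistic regression over a grid of
# 200 lambda values between 0.001 and 0.5
library(glmnet)

x_train <- as.matrix(train_data[, setdiff(names(train_data), "class")])
y_train <- train_data$class

lambda_grid <- seq(0.001, 0.5, length.out = 200)

cv_fit <- cv.glmnet(x_train, y_train,
                    family       = "binomial",
                    alpha        = 1,        # alpha = 1 gives the Lasso penalty
                    lambda       = lambda_grid,
                    type.measure = "mse")    # evaluate lambda by mean squared error

best_lambda <- cv_fit$lambda.min   # tuning parameter with the lowest CV error
plot(cv_fit)                       # CV error against lambda (cf. Fig. 8)
```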
Table 4: Lambda values, misclassification rate and mean square error (first six rows)

| Row  | Lambda      | Misclassification rate | MSE        |
|------|-------------|------------------------|------------|
| [1,] | 0.000100000 | 0.1182796              | 0.08601410 |
| [2,] | 0.002612060 | 0.1102151              | 0.08859546 |
| [3,] | 0.005124121 | 0.1102151               | 0.09098334 |
| [4,] | 0.007636181 | 0.1129032              | 0.09312719 |
| [5,] | 0.010148241 | 0.1155914              | 0.09532075 |
| [6,] | 0.012660302 | 0.1155914              | 0.09790142 |
The best lambda to regularize the model is evaluated using the mean square error (MSE)
and the misclassification rate. The first six rows of the lambdas, misclassification rates
and mean square errors are printed above (see Table 4). Using MSE as
the evaluation metric, our best lambda (tuning parameter) is 0.0004 (see Fig. 7).
Fig 7: Best tuning parameter based on Table 4
3.3 Lambda Vs Mean Square Error Plots
Fig. 8: Plot of lambda vs. MSE
Figure 8 shows that the MSE increases as the tuning parameter (lambda) increases. This
implies that the value of lambda must be kept small to obtain a low MSE. In addition, at
a lambda of about 0.18, the MSE values become uniform.
3.4 Final Model Fitting
Having obtained the best lambda, we fit the final Lasso logistic regression model with the
training and validation data pooled together (see Fig. 9).
Fig. 9: Sample code for the final model fitting
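A sketch of the final fit at the best lambda, continuing the objects above (the glmnet call is an assumption; Fig. 9 shows the authors' code):

```r
# Refit the Lasso logistic model at the best lambda and inspect which
# coefficients survive the L1 penalty (cf. Table 5)
final_fit <- glmnet(x_train, y_train,
                    family = "binomial",
                    alpha  = 1,
                    lambda = best_lambda)

coef(final_fit)  # predictors shrunk to exactly zero (e.g. sudden weight loss) drop out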
Table 5: Coefficients of important predictors using the LGR (L1) model

| Variable           | Coefficient  |
|--------------------|--------------|
| Age                | 0.06534024   |
| Sex/Gender         | -4.99548379  |
| Polyuria           | 5.66066441   |
| Polydipsia         | 6.02179976   |
| Sudden weight loss | -            |
| Weakness           | 1.39414925   |
| Polyphagia         | 1.38682283   |
| Genital thrush     | 1.33510590   |
| Visual blurring    | 0.13044201   |
| Itching            | -3.67583671  |
| Irritability       | 1.70405358   |
| Delayed healing    | -0.45646626  |
| Partial paresis    | 1.62380210   |
| Muscle stiffness   | -0.99207617  |
| Alopecia           | 1.49216635   |
| Obesity            | -0.22411546  |
After fitting the model with the best lambda, only the variable “sudden_weight_loss” appears
not to be important in the model (see Table 5).
3.5 Odds ratio of the predictor variables
According to the Centers for Disease Control and Prevention [2], the odds ratio is the "measure of
association" for a case-control study. It quantifies the relationship between an exposure and a
disease in a case-control study. The odds ratio is calculated using the number of case-patients
who did or did not have exposure to a factor and the number of controls who did or did not
have the exposure. The odds ratio tells us how much higher the odds of exposure are among
case-patients than among controls.
Generally, the magnitude of the odds ratio is called the "strength of the association." The further
away an odds ratio is from 1.0, the more likely it is that the relationship between the exposure
and the disease is causal. For instance, an odds ratio of 1.25 is above 1.0 but does not indicate a
strong association, while an odds ratio greater than 9.5 suggests a much stronger association.
Table 6: Odds Ratio Indication and Implication [1]

| Odds ratio             | Indication                                                                                           | Implication                                               |
|------------------------|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
| 1.0 (or close to 1.0)  | The odds of exposure among case-patients are the same as, or similar to, the odds of exposure among controls. | The exposure is not associated with the disease.          |
| > 1.0                  | The odds of exposure among case-patients are greater than the odds of exposure among controls.        | The exposure might be a risk factor for the disease.      |
| < 1.0                  | The odds of exposure among case-patients are lower than the odds of exposure among controls.          | The exposure might be a protective factor against the disease. |
The odds ratios of the important predictor variables were calculated by exponentiating the
coefficients (betas) of the best fitted model (see Tables 6 and 7).
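A one-line sketch of this step, continuing the fitted object above:

```r
# Odds ratios: exponentiate the fitted Lasso coefficients (cf. Table 7)
exp(as.matrix(coef(final_fit)))
```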
Table 7: Odds Ratios of the Predictor Variables
(Coding used in the table: class Positive = 0, Negative = 1; symptoms Yes = 0, No = 1; Gender F = 1, M = 0)

| Variable         | Odds Ratio   | Implication                                            | Likely to Occur |
|------------------|--------------|--------------------------------------------------------|-----------------|
| Age              | 1.067522     | Age might be a risk factor for diabetes                | Yes             |
| Sex/Gender       | 0.006768446  | Sex might be a protective factor for diabetes          | Female          |
| Polyuria         | 287.3395     | Polyuria might be a risk factor for diabetes           | Yes             |
| Polydipsia       | 412.32       | Polydipsia might be a risk factor for diabetes         | Yes             |
| Weakness         | 4.031543     | Weakness might be a risk factor for diabetes           | Yes             |
| Polyphagia       | 4.002114     | Polyphagia might be a risk factor for diabetes         | Yes             |
| Genital thrush   | 3.800398     | Genital thrush might be a risk factor for diabetes     | Yes             |
| Visual blurring  | 1.139332     | Visual blurring might be a risk factor for diabetes    | Yes             |
| Itching          | 0.0253282    | Itching might be a protective factor for diabetes      | No              |
| Irritability     | 5.496182     | Irritability might be a risk factor for diabetes       | Yes             |
| Delayed healing  | 0.6335184    | Delayed healing might be a protective factor for diabetes | No           |
| Partial paresis  | 5.072339     | Partial paresis might be a risk factor for diabetes    | Yes             |
| Muscle stiffness | 0.370806     | Muscle stiffness might be a protective factor for diabetes | No          |
| Alopecia         | 4.446718     | Alopecia might be a risk factor for diabetes           | Yes             |
| Obesity          | 0.7992229    | Obesity might not be associated with diabetes          | Maybe/No        |
3.6 Model Evaluation
It is important to recall that model evaluation determines how well the model performs.
Therefore, it is critical to consider the model outcomes according to every possible evaluation
method, since different methods can provide different perspectives. The following lines of
code were used to generate our AUC (see Fig. 10).
Fig. 10: Lines of code used to obtain the AUC (Area Under the Curve)
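A sketch of this evaluation (the cvAUC package is an assumption, suggested by the cvAUC/se/ci/confidence fields reported in Table 8, and pROC is assumed for the ROC curve in Fig. 11):

```r
# Predicted probabilities on the test data, AUC with 95% CI, and ROC curve
library(cvAUC)
library(pROC)

x_test    <- as.matrix(test_data[, setdiff(names(test_data), "class")])
pred_prob <- as.numeric(predict(final_fit, newx = x_test, type = "response"))

ci.cvAUC(predictions = pred_prob,
         labels      = test_data$class,
         confidence  = 0.95)          # cvAUC, se, ci, confidence (cf. Table 8)

roc_obj <- roc(test_data$class, pred_prob)
plot(roc_obj)                         # ROC curve (cf. Fig. 11)
```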
Table 8: Results of the Model Evaluation

| cvAUC     | se        | ci               | confidence |
|-----------|-----------|------------------|------------|
| 0.9594332 | 0.0145292 | (0.9310, 0.9879) | 0.95       |
From Table 8, the AUC of 0.9594 indicates that our fitted model has a 95.9% ability to correctly
classify diabetes status as class positive or negative. The confidence interval also indicates that the
true AUC falls within the interval (0.9310, 0.9879). Therefore, we are 95% confident in this
AUC estimate.
3.7 Receiver Operating Characteristic (ROC) curve
The ROC curve below shows the trade-off between sensitivity (or TPR) and the false positive
rate (1 − specificity). It further indicates that the model performs better than the benchmark
(50%), with a total area of 0.9594 (95.9%).
Fig. 11: Receiver Operating Characteristic curve of the hit rate vs. the false alarm rate
3.8 Confusion Matrix and statistics
From the result of the confusion matrix, the target variable (class) positive is represented with
the value "0" and negative with the value "1" (see Table 9). The sensitivity (TPR), also known
as "recall", has a value of 0.8696 (87%), indicating that our model correctly detects a high
percentage of the positive diabetes class, while the specificity (TNR) value of 0.8750 (87.50%)
indicates that our model also correctly detects a high percentage of the negative diabetes class
(see Table 9). In addition, the precision value of 92% indicates that our model has a low false
positive rate (i.e., little classification error, or a high ability to correctly predict the positive or
negative diabetes class). Furthermore, the F1 score of 0.8939 indicates that our model performs
well; the higher the F1 score, the better the performance of a binary classifier (see Table 9).
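A sketch of this step (the caret package is an assumption, suggested by the layout of Table 9; the 0.5 probability cut-off is also an assumption):

```r
# Confusion matrix and derived statistics (cf. Table 9)
library(caret)

pred_class <- ifelse(pred_prob > 0.5, 1, 0)

confusionMatrix(data      = factor(pred_class, levels = c(0, 1)),
                reference = factor(test_data$class, levels = c(0, 1)),
                positive  = "0",          # class "Positive" is coded 0 in this study
                mode      = "everything") # adds Precision, Recall and F1
```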
Table 9: Confusion Matrix and Statistics

| Statistic                            | Value            |
|--------------------------------------|------------------|
| Accuracy                             | 0.8716           |
| 95% CI                               | (0.8068, 0.9209) |
| No Information Rate                  | 0.6216           |
| P-Value [Acc > NIR]                  | 1.225e-11        |
| Kappa                                | 0.7318           |
| McNemar's Test P-Value               | 0.3588           |
| Sensitivity                          | 0.8696           |
| Specificity (True Negative Rate/TNR) | 0.8750           |
| Pos Pred Value                       | 0.9195           |
| Neg Pred Value                       | 0.8033           |
| Precision                            | 0.9195402        |
| Recall (True Positive Rate/TPR)      | 0.8696           |
| F1                                   | 0.8939           |
| Prevalence                           | 0.6216           |
| Detection Rate                       | 0.5405           |
| Detection Prevalence                 | 0.5878           |
| Balanced Accuracy                    | 0.8723           |
| Positive Class                       | 0                |
4. Conclusion
The analysis started with a description of the dataset in terms of the sample size and the data types
of the predictor variables. We then moved on to data cleaning, i.e., checking for missing
values, outliers, and wrong records. We proceeded to exploratory data analysis
of the dataset, which involved visualizing the distribution of the target variable.
The association of the predictor variables with the target variable was examined using correlation,
Wilcoxon and chi-square tests to assess the strength of the relationship between the predictors
and the target. The next phase of the analysis was model building. It started with
dividing the dataset into training and test datasets. The model was trained with a Lasso-penalized
logistic regression, where the lambda penalty parameter was first determined. After
that, the best lambda was used to fit a logistic regression model on the training dataset. In
terms of performance metrics, we examined the model's AUC (ROC curve), recall,
precision and F1 score.
The Lasso regularization method was used to penalize our logistic regression model, and it was
observed that the predictor variable sudden_weight_loss does not appear to be statistically
significant in the model. The predictor variables Polyuria and Polydipsia contributed most to the
prediction of the class "Positive" based on their parameter values and odds ratios. The Gender
predictor variable indicates that males are more likely to have diabetes than females. It was also
determined that our fitted model has an AUC of 95.9%, with a recall of 87%, a precision of 92%
and an F1 score of 89.4%. All of this indicates that our fitted model has a high performance in
explaining the target variable ("diabetes class status"). Finally, based on the research questions
related to diabetes symptoms, we can determine the diabetes status of a respondent.
References
1) American Diabetes Association. Diagnosis and classification of diabetes mellitus.
Diabetes Care 2009, 32, S62–S67. [CrossRef] [PubMed]
2) CDC 2022 National Diabetes Statistics Report: 2022 National Diabetes Statistics
Report.
3) Center for Disease Control and Prevention. National Diabetes Fact Sheet: National
Estimates and General Information on Diabetes and Prediabetes in the United States,
2011; US Department of Health and Human Services, Centers for Disease Control and
Prevention: Atlanta, GA, USA, 2011; Volume 201, pp. 2568–2569.
4) Centers for Disease Control and Prevention: CDC WONDER.
https://wonder.cdc.gov/controller/datarequest/D76;jsessionid=F1696B2C464E3B34D922962F0D4E
5) Hastie T, Tibshirani R, Friedman J (2008) The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd ed., Springer.
6) IDF Diabetes Atlas 2022 Report. Available from https://diabetesatlas.org/
7) IDF Diabetes Atlas: Global estimates of diabetes prevalence for 2017 and projections
for 2045
8) International Diabetes Federation: 02 November 2021 Affecting one in 10 adults
https://www.idf.org/news/240:diabetes-now-affects-one-in-10-adults-worldwide.html
9) James G, Witten D, Hastie T, et al. (2013) An Introduction to Statistical Learning with
Applications in R. Springer.
10) Kaur, H.; Kumari, V. Predictive modelling and analytics for diabetes using a machine
learning approach. Appl. Comput. Inform. 2020, 18, 90–100. [CrossRef]
11) Machine Learning Repository. Early-Stage Diabetes Risk Data Set. Available from:
https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset
12) Maniruzzaman, M.; Rahman, M.; Ahammed, B.; Abedin, M. Classification and
prediction of diabetes disease using machine learning paradigm. Health Inf. Sci. Syst.
2020, 8, 1–14. [CrossRef] [PubMed]
13) Salmonella in the Caribbean - 2013 Interpreting Results of Case-Control Studies
https://www.cdc.gov/training/SIC_CaseStudy/Interpreting_Odds_ptversion.pdf
14) Saeedi, P.; Petersohn, I.; Salpea, P.; Malanda, B.; Karuranga, S.; Unwin, N.; Colagiuri,
S.; Guariguata, L.; Motala, A.A.; Ogurtsova, K.; et al. Global and regional diabetes
prevalence estimates for 2019 and projections for 2030 and 2045: Results from the
International Diabetes Federation Diabetes Atlas. Diabetes Res. Clin. Pract. 2019, 157,
107843. [CrossRef] [PubMed]
15) Sisodia, D.; Sisodia, D.S. Prediction of diabetes using classification algorithms.
Procedia Comput. Sci. 2018, 132, 1578–1585. [CrossRef]