Analyzing the Leading Causes of Traffic Fatalities Using XGBoost and Grid-Based Analysis: A City Management Perspective

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2946401, IEEE Access
JUN MA1,2, YUEXIONG DING2, JACK C.P. CHENG1, YI TAN3, VINCENT J.L. GAN1, and
JINGCHENG ZHANG4
1Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
2Department of Research and Development, Big Bay Innovation Research and Development Limited, Hong Kong, China
3College of Civil Engineering, Shenzhen University, Shenzhen, China
4Shenzhen Mariocode Science and Technology Co. Ltd., Shenzhen, China
Corresponding author: JINGCHENG ZHANG (e-mail: zhangjingcheng0306@outlook.com).
ABSTRACT Traffic accidents are one of the most serious global public problems, causing severe loss of human life and property every year. Studying the factors that influence accidents can help identify the reasons behind them, which in turn facilitates the design of effective measures and policies to reduce the traffic fatality rate and improve road safety. However, most existing research either adopted methods based on linear assumptions or neglected to evaluate the spatial relationships. In this paper, we propose a methodology framework based on XGBoost and grid analysis to spatially analyze the leading factors of traffic fatality in Los Angeles County. Characteristics of the collision, time and location, and environmental factors are considered. Results show that the proposed method has the best modeling performance compared with other commonly used machine learning algorithms. Eight factors are found to have the leading impact on traffic fatality. Spatial relationships between the eight factors and the fatality rates within Los Angeles County are further studied using grid-based analysis in GIS. Specific suggestions on how to reduce the fatality rate and improve road safety are provided accordingly.
INDEX TERMS XGBoost; Factors analysis; GIS; Grid-based analysis; non-linear machine learning; traffic
fatality.
I. INTRODUCTION
TRAFFIC accidents are considered one of the most critical and dangerous problems worldwide. According to the World Health Organization (WHO), about 1.3 million people die each year due to traffic accidents, and an additional 20-50 million are injured or disabled [1]. In the United States, traffic fatalities increased by 15.3 percent from 2011 to 2016 (29,867 to 37,461) [2]. How to reduce the occurrence of fatal traffic accidents and improve road safety has been a significant problem for both governments and research institutions. To this end, scholars have conducted various kinds of research on traffic accidents. Some focused on road safety management and education [3], [4], some on the improvement of vehicles [5], [6], some on emergency medical services [7], [8], and some on the factors that influence the severity of traffic accidents [9]–[12].
Knowing what the influential factors are and how they affect accidents helps us better understand the underlying cause and effect. This benefits both the estimation of accident severity and the preparation of countermeasures. Many factors have been studied in the literature. For example, Adanu et al. [10] studied the factors that influence the severity of single-vehicle accidents on weekdays and weekends. They found that factors such as driver unemployment, driving with an invalid license, not wearing a seatbelt, and fatigue are highly correlated with accident severity. Mohamed et al. [11] studied the influential factors of rear-end crashes. They found that factors such as tailgating, driving too fast, years of experience, and number of lanes significantly affect the severity of rear-end crashes. Lee et al. [13] analyzed the impact of rainfall intensity and
water depth on traffic accident severity. Schneider et al. [14] examined statewide motorcycle crash data over a five-year range in Ohio and found that being female, the presence of alcohol, and high-degree curves increase the likelihood of severe traffic accidents. Hashimoto et al. [15] studied the correlations between traffic accidents and environmental factors such as population and road condition. Aziz, Ukkusuri, and Hasan [16] concluded that poorly lit roads and pedestrian-crossing intersections increase the likelihood of severe pedestrian-vehicle accidents.
When analyzing the impact factors, these studies adopted different kinds of statistical methods. For example, Russo and Savolainen [17] applied three regression models, namely negative binomial, ordered logit, and multinomial logit, to ascertain the relationship between different median barrier types and freeway median crash frequency and severity. Dapilah et al. [18] examined how motorcyclist characteristics influence road traffic accidents using chi-square analysis. Karimi and Kashi [19] investigated the effect of geometric parameters on accident reduction using sensitivity analysis. Onozuka et al. [20] studied whether a full moon contributes to road traffic accidents using conditional Poisson regression. However, most of these methods are based on linear assumptions. Their performance is limited because real cases are very complicated and the relationships between the factors and accidents are non-linear. Therefore, non-linear machine learning algorithms have been applied in some recent research to address this problem. For example, Mussone et al. [21] predicted the severity of accidents at urban road intersections using an Artificial Neural Network (ANN). Li et al. [22] used Support Vector Machine (SVM) models for crash injury severity analysis. However, algorithms like ANN and SVM are referred to as black-box processes because they do not provide a direct explanation of variable importance [23]. This does not help when analyzing the cause and effect behind traffic accidents. Therefore, a non-linear methodology framework with the ability to calculate variable importance is needed.
Furthermore, geographical information systems (GIS) are gradually becoming more popular due to their ability to capture, store, manipulate, analyze, manage, and present spatial or geographical data [24]–[26]. Many previous studies have used GIS to analyze data spatially and temporally. For example, Kazmi and Zubair [27] estimated the vehicle damage cost involved in road traffic accidents based on GIS. Din et al. [28] assessed the level of service of public transportation using GIS. Zangeneh et al. [29] conducted a spatial-temporal cluster analysis of mortality from road traffic injuries using GIS. However, few studies have combined GIS and machine learning algorithms to analyze the factors that influence traffic accident fatality.
The objective of this paper is to propose a grid-based non-linear machine learning framework to analyze the critical factors that influence traffic accident fatality. A non-linear machine learning algorithm, XGBoost, is implemented to process the data and identify the variables with higher importance. Its performance is compared with five other machine learning algorithms: multiple linear regression (MLR), logistic regression (LR), multi-layer perceptron (MLP), support vector machine (SVM), and random forest (RF). Factors including crash characteristics, time and location, and environmental characteristics are considered in the model. Whether a traffic accident involves a fatality is the classification target. Results show that XGBoost outperforms the other algorithms with higher classification accuracy. Eight factors, such as alcohol use and lighting conditions, are found to be the leading causes of fatal accidents based on the proposed framework. GIS is then used to conduct a spatial analysis of the relationships between these eight factors and the fatality rate.
The remaining paper is organized as follows: Section II
describes the proposed methodology framework, Section III
presents a case study in Los Angeles County, Section IV and
Section V show the results and discussion, and Section VI
concludes the work.
II. METHODOLOGY FRAMEWORK
Figure 1 shows the proposed methodology framework. It
consists of three parts. The first part is data preprocessing.
Data collection, cleaning, formatting, and balancing are con-
ducted. The second part is model training and optimization.
XGBoost is applied to build the classification model and
calculate the feature importance. The last part is post-mining.
This is accomplished by conducting the grid-based spatial
analysis in GIS.
A. XGBOOST
The classification modeling using Extreme Gradient Boost-
ing Decision Tree (XGBoost) is an essential component
in the proposed framework. It plays the role of modeling
between the urban features and the outputs. This algorithm
has been reported to achieve good performance in differ-
ent research domains. For example, Zhang and Zhan [30]
adopted XGBoost to classify the rock faces. Torlay et al.
[31] utilized XGBoost to identify atypical language patterns
and differentiate patients with epilepsy. Ma and Cheng [32]
improved the identification accuracy using XGBoost when
modeling features that affect the green building markets.
Zheng et al. [33] applied XGBoost to evaluate the feature
importance in short-term electric load forecasting.
XGBoost is an ensemble technique based on the Gradient Boosting proposed by Friedman [34]. It learns a set of classification and regression trees (CARTs) additively, one tree per iteration, and obtains the result by summing the scores of all CARTs. Compared with the original GBDT algorithm [34], Chen and Guestrin [35] added several improvements in 2016 and named the result XGBoost. One notable improvement is the regularized objective added to the loss function. The regularized objective $L_k$ for the $k$th iteration is given in Equation (1).
$$
L_k = \sum_{i=1}^{n} l\left(y^{(i)}, \hat{y}_k^{(i)}\right) + \sum_{j=1}^{k} \Omega(f_j) \tag{1}
$$
FIGURE 1: Methodology Framework.
where $n$ represents the number of samples, $\hat{y}_k^{(i)}$ represents the prediction for sample $i$ at iteration $k$, and $l(\cdot)$ represents the original loss function. $\Omega$ is the regularization term, and is calculated by Equation (2).
$$
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{2}
$$
where $T$ represents the number of leaf nodes, $\omega_j$ is the weight (score) of the $j$th leaf, and $\gamma$ and $\lambda$ are two constants that control the degree of regularization.
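As a small worked instance of Equation (2), the sketch below evaluates the regularization term for a single tree. The function name, the leaf weights, and the values of γ and λ are hypothetical and chosen only for illustration; they are not taken from the case study.

```python
def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2) for one tree."""
    num_leaves = len(leaf_weights)  # T, the number of leaf nodes
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


# Hypothetical tree with three leaves (illustrative values only).
omega = regularization([0.5, -0.2, 0.1], gamma=1.0, lam=2.0)
print(round(omega, 4))
```

A larger γ penalizes trees with many leaves, while a larger λ shrinks the leaf weights toward zero.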
Another improvement of XGBoost is the use of an additive learning strategy [36]. Instead of applying stochastic gradient descent to carry out the optimization, XGBoost adds the best tree model $f_k(x^{(i)})$ to the current classification model to give the prediction for the $k$th iteration. In this case, Equation (1) can be further formulated as follows:
$$
L_k = \sum_{i=1}^{n} l\left(y^{(i)}, \hat{y}_{k-1}^{(i)} + f_k\left(x^{(i)}\right)\right) + \Omega(f_k) + \sum_{j=1}^{k-1} \Omega(f_j) \tag{3}
$$
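Furthermore, XGBoost applies a second-order Taylor expansion to the objective function, so Equation (3) can be further expanded to Equation (4).
$$
L_k \approx \sum_{i=1}^{n} \left[ l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right) + g_i \, f_k\left(x^{(i)}\right) + \frac{1}{2} h_i \, f_k^2\left(x^{(i)}\right) \right] + \Omega(f_k) + C \tag{4}
$$
where $g_i = \partial_{\hat{y}_{k-1}} l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)$ and $h_i = \partial^2_{\hat{y}_{k-1}} l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)$ are the first- and second-order derivatives of the loss function, respectively, and $C$ is a constant that collects the regularization terms of the previous trees.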
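To make the second-order expansion concrete, the following sketch compares the exact loss after adding a tree output with its Taylor approximation for a single sample. It assumes the logistic loss for binary classification, for which the derivatives are $g = p - y$ and $h = p(1 - p)$ with $p$ the sigmoid of the raw score; the paper does not state which loss was used, and the numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, score):
    """Logistic loss for a label y in {0, 1} and a raw (pre-sigmoid) score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)


def taylor_loss(y, prev_score, step):
    """Second-order approximation of the loss after adding `step` to prev_score."""
    g, h = grad_hess(y, prev_score)
    return log_loss(y, prev_score) + g * step + 0.5 * h * step ** 2


# Hypothetical sample: label 1, previous raw score 0.3, new tree output 0.2.
y, prev_score, step = 1, 0.3, 0.2
g, h = grad_hess(y, prev_score)
exact = log_loss(y, prev_score + step)
approx = taylor_loss(y, prev_score, step)
print(exact, approx)
```

For small tree outputs the two values agree closely, which is why the approximated objective can be optimized in place of the exact one.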
These advantages give XGBoost the potential to obtain better results than traditional GBDT. In the experiments of our case study, it also outperforms other commonly used machine learning algorithms. More details are provided later.
B. VARIABLE IMPORTANCE
Besides its ability to model non-linear classification and regression problems, XGBoost can also rank variable importance using the weight, which is the number of times a feature is used to split the data across all CARTs. The weight-based importance $W_v$ of a variable $v$ is calculated by Equations (5) and (6).
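$$
W_v = \sum_{k=1}^{K} \sum_{l=1}^{L-1} I\left(V_k^l, v\right) \tag{5}
$$
$$
I\left(V_k^l, v\right) =
\begin{cases}
1, & \text{if } V_k^l = v \\
0, & \text{otherwise}
\end{cases} \tag{6}
$$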
where $K$ represents the number of trees or iterations, $L$ represents the number of leaf nodes of the $k$th tree, $V_k^l$ represents the feature related to node $l$, and $I(\cdot)$ represents the indicator function.
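The weight-based importance of Equations (5) and (6) amounts to counting how often each feature appears at a split node. A minimal sketch follows; it assumes each tree is represented simply as the list of features used at its split nodes, which is a hypothetical simplification of a real tree structure, and the three example trees are invented for illustration.

```python
from collections import Counter


def weight_importance(trees):
    """Count how many split nodes use each feature across all trees.

    `trees` is a list of trees, each given as the list of features
    used at its split nodes (a simplified, hypothetical representation).
    """
    counts = Counter()
    for split_features in trees:
        counts.update(split_features)
    return counts


# Three hypothetical trees; feature names follow Table 2.
trees = [
    ["ETOH", "PARTIES", "ETOH"],
    ["LIGHTING_A", "ETOH"],
    ["PARTIES"],
]
importance = weight_importance(trees)
ranking = importance.most_common()  # features sorted by descending weight
print(ranking)
```

Features that never appear in a split receive a weight of zero and are simply absent from the counter.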
C. GRID-BASED ANALYSIS IN GIS
A geographical information system (GIS) is a framework for
gathering, integrating, managing, and analyzing data. It can
analyze the spatial locations and organize different layers of
information into visualizations using maps and 3D scenes.
With this capability, GIS reveals deeper and broader spatial
insights in data.
In this article, GIS is applied to further detect the spatial
relationships between the fatality rate and the influential fac-
tors. However, since we cannot directly calculate the fatality
rate using the traffic data points, the gridding technique in
GIS is then applied to tackle the problem. It divides the
studied region into many fishnets (grids), and then maps the
traffic data points into these grids. The fatality rate $R_f^i$ of the $i$th grid can be calculated using Equation (7).
$$
R_f^i = \frac{N_f^i}{N_a^i} \tag{7}
$$
where $N_f^i$ means the total number of fatal accidents in the $i$th grid, and $N_a^i$ means the total number of accidents in the $i$th grid.
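The gridding step and Equation (7) can be sketched without a GIS package as follows. This is only an illustration under simplifying assumptions: square cells on planar coordinates, a hypothetical `grid_fatality_rates` helper, and invented accident points. The paper performs this step with the fishnet tool in GIS.

```python
from collections import defaultdict


def grid_fatality_rates(accidents, x_min, y_min, cell_size):
    """Map accidents to square grid cells and return {(col, row): fatality rate}.

    `accidents` is an iterable of (x, y, is_fatal) tuples. Cells without
    any accident are omitted because the rate is undefined there.
    """
    totals = defaultdict(int)
    fatals = defaultdict(int)
    for x, y, is_fatal in accidents:
        cell = (int((x - x_min) // cell_size), int((y - y_min) // cell_size))
        totals[cell] += 1
        if is_fatal:
            fatals[cell] += 1
    return {cell: fatals[cell] / totals[cell] for cell in totals}


# Hypothetical accident points (illustrative coordinates only).
accidents = [
    (0.2, 0.3, False),
    (0.7, 0.1, True),
    (0.9, 0.9, False),
    (1.5, 0.4, False),
]
rates = grid_fatality_rates(accidents, x_min=0.0, y_min=0.0, cell_size=1.0)
print(rates)
```

Here the cell (0, 0) contains three accidents, one of them fatal, so its rate is 1/3, while the cell (1, 0) has a rate of 0.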
III. CASE STUDY
A. DATA COLLECTION
To validate the proposed framework, we conducted a case study in Los Angeles County, USA. The dataset was collected from the California Statewide Integrated Traffic Records System (SWITRS). It records every traffic accident reported in Los Angeles County during 2010-2012, totaling 307,971 accidents. Each record contains information on the accident severity, the number of parties involved, the surrounding environment, and the location. Table 1 presents an overall summary of the data. The data contains 28 collision features, 18 time and location features, and 7 environment features, so the data dimension is 53. Since the study focuses on analyzing the leading causes of traffic fatalities, we set whether the case involves a fatality as the classification target. Accordingly, 3,146 fatal accidents are marked as negative cases, while 304,825 non-fatal accidents are marked as positive cases.
TABLE 1: Data Summary

| Item | Description |
| --- | --- |
| Negative cases (fatal accidents) | 3,146 cases |
| Positive cases (non-fatal accidents) | 304,825 cases |
| Total number of accidents | 307,971 cases |
| Number of collision description variables | 28 features |
| Number of time and location variables | 18 features |
| Number of environment variables | 7 features |
| Total variable dimension | 53 dimensions |
| Time span | 2010-2012 |
| Geographic area | Los Angeles County |

Table 2 shows the detailed description of the data features. The second column in Table 2 shows the feature abbreviation and description. The third column presents the data type. Here C represents categorical features, N represents numeric features, B represents binary features, and S represents string features. For example, VIOLCAT in Collision Description means the Violation Category; this categorical feature records what kind of traffic violation the vehicle committed, including speeding, impeding traffic, traffic signals, etc. POP in Environment represents the surrounding population density of the accident spot. For more details, please refer to the official SWITRS website.
B. DATA PREPROCESSING
Since the raw data always has some flaws, preprocessing is
usually necessary before applying the data into the math-
ematical model. The preprocessing process in this study
TABLE 2: Feature Description

| Category | Feature abbreviation and description | Data type³ |
| --- | --- | --- |
| Collision Description | CHPTYPE: CHP¹ Beat Type | C |
| | VIOLCODE: PCF² Violation Code | C |
| | VIOLCAT: PCF Violation Category | C |
| | CRASHTYP: Type of Collision | C |
| | INVOLVE: Motor Vehicle Involved With | C |
| | PED: Pedestrian Action | C |
| | CHPFAULT: CHP Vehicle Type is at fault | C |
| | SHIFT: CHP effect of new 12-hour shifts | C |
| | VIOLSUB: PCF Violation Subsection | C |
| | BEATTYPE: Beat Type | C |
| | PARTIES: Party Count | N |
| | PEDCOL: whether involved a pedestrian | B |
| | BICCOL: whether involved a bicycle | B |
| | MCCOL: whether involved a motorcycle | B |
| | TRUCKCOL: whether involved a big truck | B |
| | ETOH: whether involved drinking party | B |
| | STFAULT: indicates who is at fault | C |
| | HITRUN: Hit And Run | C |
| | PCF: Primary Collision Factor | C |
| | RIGHTWAY: Control Device | C |
| | NOTPRIV: whether on private property | B |
| | DIRECT: Direction of the offset distance | C |
| | INTERSECT_: whether at an intersection | B |
| | KILLED: counts of victims with 1-degree injury | N |
| | INJURED: counts of victims with injury degree 2, 3, or 4 | N |
| | CRASHSEV: the severity level of the collision | N |
| | BEATNUMB: Beat Number | S |
| | VIOL: PCF Violation | S |
| Time and Location | YEAR_: Collision Year | C |
| | MONTH_: The month of the year | C |
| | DAYWEEK: The Day of Week | C |
| | TIMECAT: 3-hour categories time | C |
| | DATE_: the date when the collision occurred | C |
| | TIME_: the time (24 hour time) | N |
| | LAPDDIV: City Division LAPD | S |
| | STATEHW: whether on a state highway | B |
| | POINT_X: The longitude of the geocoded location | N |
| | POINT_Y: The latitude of the geocoded location | N |
| | JURDIST: Reporting District | S |
| | DISTANCE: Offset distance from the secondary road | N |
| | LOCATION: the location code PRIMARYRD | S |
| | JURIS: Jurisdiction | S |
| | POSTMILE: markers that indicate the mileage in California | N |
| | SECONDRD: A secondary reference road | S |
| | SPECIAL: Special Condition | C |
| | RAMP: Ramp Intersection | C |
| Environment | WEATHER: the weather condition | C |
| | WEATHER2: the additional weather condition | C |
| | LIGHTING: lighting condition | C |
| | POP: Population level | C |
| | ROADSURF: Road Surface | C |
| | CHPRDTYP: CHP Road Type | C |
| | RDCOND1: Road Condition 1 | C |

¹ California Highway Patrol
² Primary Collision Factor
³ C-categorical, B-binary, N-numeric, S-string

includes three parts: data formatting, data cleaning, and data
balancing.
1) Data Formatting
The first part is data formatting. The features collected in this study are mostly categorical but not ordinal. However, some machine learning algorithms, such as logistic regression, cannot operate on categorical values directly; they require the input and output variables to be numeric. Therefore, the categorical data were converted into dummy variables, each of which uses the numeric value 0 or 1 to represent a binary variable. For example, in the original dataset, the categorical feature DIRECT has four independent labels: north, east, south, and west. After one-hot encoding, four dummy variables, DIRECT_N, DIRECT_E, DIRECT_S, and DIRECT_W, are given to indicate the driving direction of the vehicle. In this way, 29 categorical features were formatted into 359 dummy variables, and the data dimension expanded to 383.
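The one-hot encoding of DIRECT described above can be sketched in plain Python as follows. The `one_hot` helper, the record layout, and the label codes N, E, S, and W in the raw field are assumptions made for illustration; in practice a library routine would typically do this.

```python
def one_hot(records, feature, labels):
    """Replace a categorical feature with one 0/1 dummy column per label."""
    encoded = []
    for record in records:
        # Copy every field except the categorical one being encoded.
        row = {key: value for key, value in record.items() if key != feature}
        for label in labels:
            row[f"{feature}_{label}"] = 1 if record[feature] == label else 0
        encoded.append(row)
    return encoded


# Two hypothetical accident records.
records = [
    {"PARTIES": 2, "DIRECT": "N"},
    {"PARTIES": 3, "DIRECT": "W"},
]
encoded = one_hot(records, "DIRECT", ["N", "E", "S", "W"])
print(encoded[0])
```

Each record keeps its numeric fields and gains four dummy columns, exactly one of which is 1.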
2) Data Cleaning
Data cleaning usually refers to the process of excluding useless data. Two types of data, noisy data and highly correlated data [37], [38], need to be deleted from the dataset. Noisy data is data that is irrelevant to the problem or contributes little to the model while increasing the computational complexity. For instance, POINT_X, POINT_Y, LAPDDIV, and JURIS record location and jurisdiction information. In this study, they are not potential causes of accident fatality. Therefore, they are excluded from the data mining process.
Highly correlated data includes two kinds of data. The first kind is data that is highly correlated with the target. For example, CRASHSEV has a similar meaning to our target "fatality". It would distort the accuracy of our model and should therefore be deleted. The second kind is features that are closely related to each other. For example, TIME records the accident time, and TIMECAT has a similar meaning but divides the time into nine intervals. As TIMECAT is more convenient for further treatment and analysis, TIME is deleted. The Pearson correlation coefficient is used to identify highly correlated data. It is based on covariance and indicates the magnitude of the association or correlation. In each pair with a correlation coefficient higher than 0.9, we delete one feature in order to mitigate the multicollinearity problem. In summary, 15 noisy features and 18 highly correlated features are excluded from this study. The number of remaining features is reduced to 350.
3) Data Balancing
As shown in Table 1, there are 3,146 negative cases and 304,825 positive cases. This means the dataset is highly imbalanced, which could bias our analysis. Therefore, the dataset needs to be balanced. Two sampling schemes, oversampling and undersampling, are commonly used to address this problem [39]. However, both have disadvantages. Oversampling increases the computational load and may cause overfitting, while undersampling may discard potentially useful information. In this study, we combine undersampling with part of the oversampling concept to offset their disadvantages. This is achieved by conducting the undersampling ten times and calculating the average performance. Each time, all 3,146 negative cases were selected, and the same number of positive cases was sampled without replacement. These were combined to form a new dataset of 6,292 cases, on which the mathematical model was cross-validated to obtain a result. Averaging the ten results gives the final result.
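The repeated undersampling scheme can be sketched as follows. The function names, the fixed random seed, and the `evaluate` callback (a stand-in for cross-validating the model on one balanced subset) are all hypothetical, and the toy data are far smaller than the real dataset.

```python
import random


def balanced_subsets(minority, majority, rounds=10, seed=0):
    """Yield `rounds` balanced datasets: all minority cases plus an
    equally sized sample of majority cases drawn without replacement."""
    rng = random.Random(seed)
    for _ in range(rounds):
        sampled = rng.sample(majority, len(minority))
        yield minority + sampled


def averaged_score(minority, majority, evaluate, rounds=10, seed=0):
    """Average the score returned by `evaluate` over all balanced subsets."""
    scores = [evaluate(subset)
              for subset in balanced_subsets(minority, majority, rounds, seed)]
    return sum(scores) / len(scores)


# Toy data: 5 fatal and 100 non-fatal cases (illustrative only).
minority = [("fatal", i) for i in range(5)]
majority = [("non_fatal", i) for i in range(100)]
subsets = list(balanced_subsets(minority, majority))

# `len` stands in for a real cross-validated evaluation of the model.
mean_size = averaged_score(minority, majority, evaluate=len)
print(len(subsets), mean_size)
```

Because every round reuses all minority cases but draws a fresh majority sample, more of the majority class contributes to the averaged result than in a single undersampling pass.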
IV. RESULTS
A. MODELING AND OPTIMIZATION
The experiment was run on a computer with 16 GB of RAM, an Intel(R) Xeon L5640 CPU, and the Windows 7 operating system. The coding environment is Python 3.6. Parameter optimization of XGBoost is conducted first. As stated in Section II, parameter optimization refers to the adjustment of the algorithm parameters so that the algorithm can better model the problem. After consulting the literature [30], [40]–[42], the following parameters of XGBoost are optimized in this section:
• lr: Learning rate. This parameter adjusts the size of the learning steps. A value that is too small slows down the calculation and may lead to a local optimum, while a value that is too large may miss the optimal value and fail to converge.
• n_estimators: Number of boosting rounds (or trees). This parameter represents the number of training iterations on the data. Too few rounds lead to underfitting, while too many cause overfitting.
• max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
• min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
Note that the entire optimization process was conducted using 5-fold cross-validation to obtain stable results. To begin with, we set lr to {0.01, 0.05, 0.1, 0.15, 0.2} and n_estimators to {10, 20, 30, ..., 1000}. The optimization result is shown in Figure 2. It can be seen that the pair {lr, n_estimators}={1.0, 920} offered an optimal value. After this, max_depth and min_child_weight were tuned. As shown in Figure 3, when these two parameters were set to max_depth=10 and min_child_weight=1, the model obtained an optimal result with an Accuracy of 0.8673. Note that the optimization criterion used here is Accuracy. Its calculation is shown in Figure 4.
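The first tuning stage is an exhaustive search over the two candidate sets given above. The sketch below shows the search loop only. The `grid_search` helper is hypothetical, and `toy_score` is an arbitrary stand-in for the 5-fold cross-validated Accuracy of an XGBoost model; its optimum is invented and does not reproduce the result in Figure 2.

```python
from itertools import product


def grid_search(evaluate, **grid):
    """Return (best_params, best_score) over the Cartesian product of the grid."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate sets taken from the text.
lr_values = [0.01, 0.05, 0.1, 0.15, 0.2]
n_estimators_values = list(range(10, 1001, 10))  # {10, 20, ..., 1000}


def toy_score(params):
    """Hypothetical score; a real run would cross-validate a model here."""
    return -abs(params["lr"] - 0.05) - abs(params["n_estimators"] - 500) / 1000


best_params, best_score = grid_search(
    toy_score, lr=lr_values, n_estimators=n_estimators_values
)
print(len(lr_values) * len(n_estimators_values), best_params)
```

The grid contains 5 × 100 = 500 parameter pairs, each of which requires a full cross-validation run in the real experiment, which is why the parameters are tuned in two stages rather than jointly.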
In addition, to further test the performance of XGBoost, we compared it with five other commonly used models: Multiple Linear Regression (MLR), Logistic Regression (LR), Multi-Layer Perceptron (MLP, a typical Artificial Neural Network model), Support Vector Machine
FIGURE 2: Optimization of lr and n_estimators
FIGURE 3: Optimization of max_depth and
min_child_weight
FIGURE 4: Calculation of Precision, Recall and Accuracy
in the confusion matrix.
(SVM) and Random Forest (RF). 5-fold cross-validation is applied to stabilize the results. In addition, two more commonly used criteria for binary classification, Precision and Recall, are tested [42]. Their calculations are shown in Figure 4. Precision reflects the percentage of correct predictions among the cases predicted as positive, while Recall represents the percentage of correct predictions among the actual positive cases. Accuracy, Precision, and Recall all lie within [0,1], and higher values mean better predictions.
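The three criteria follow directly from the four cells of a confusion matrix, using their standard definitions. In the sketch below the counts are hypothetical and serve only to show the arithmetic.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # correct among predicted positives
    recall = tp / (tp + fn)     # correct among actual positives
    return accuracy, precision, recall


# Hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
accuracy, precision, recall = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(accuracy, precision, recall)
```

With these counts the accuracy is 170/200 = 0.85, the precision is 80/90, and the recall is 80/100 = 0.8.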
The performance of the models is shown in Table 3. As can be seen, the indicators of the non-linear algorithms (MLP, SVM, RF, and XGBoost) are generally higher than those of the linear algorithms (MLR and LR). XGBoost outperforms MLP, SVM, and RF, with the highest values in all three indicators. This supports selecting XGBoost to model our problem and further analyze the variable importance.
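TABLE 3: Comparisons of models

| Models | MLR | LR | MLP | SVM | RF | XGBoost |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 0.7739 | 0.7854 | 0.8278 | 0.8280 | 0.8463 | 0.8673 |
| Precision | 0.7249 | 0.7901 | 0.8335 | 0.8080 | 0.8674 | 0.8758 |
| Recall | 0.8490 | 0.7790 | 0.8205 | 0.8510 | 0.8178 | 0.8562 |
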
B. VARIABLE IMPORTANCE
As introduced in Section II, XGBoost can calculate the variable importance. Figure 5 presents the 15 features with the highest importance. The most influential factor is ETOH, followed by PARTIES, CRASHTYPE_C, INVOLVE_C, LIGHTING_A, etc. Explanations of these features can be found in Table 2.
FIGURE 5: Top 15 influential factors
It can be observed from Figure 5 that some features have similar meanings or belong to the same category of data. For example, MCCOL and CHPFAULT_02 both indicate whether the accident involves a motorcycle. DAYWEEK_6, DAYWEEK_5, DAYWEEK_4, and DAYWEEK_7 all refer to the day of the week. LIGHTING_A and LIGHTING_C both refer to the lighting condition. PED_A and PEDCOL both relate to pedestrian accidents. INVOLVE_C refers to motor-motor collisions, and it can be discussed when analyzing motor-pedestrian collisions. Therefore, we grouped the 15 features from Figure 5 into eight representative ones for a more explicit analysis and discussion. Table 4 lists these
eight features and presents the fatality rates with and without
the relevant feature. More details will be discussed in the
following section.
V. DISCUSSION
Before diving into the eight features, identifying the overall distribution of traffic accidents in Los Angeles County helps us better understand the spatial relationship between the fatality rate and the eight features. Figure 6 plots the distribution of the accident density, the fatal accident density, and the fatality rate. Figure 6 (1) and (2) were plotted using kernel density in GIS. It can be seen that most accidents and fatal accidents happened in urban areas. Figure 6 (4) was calculated using the grid-based analysis introduced in Section II-C. It was drawn by first cutting the Los Angeles County map in Figure 6 (3) into 60 × 60 grids. Then the $R_f$ of each grid was calculated. The grids with $R_f = 0$ are shown in green, while the others are colored according to their $R_f$ values from low to high. It can be observed that the fatality rates in eight areas are relatively high. They are:
• Area A: the city of Lancaster, the city of Palmdale, and their surrounding areas.
• Area B: Interstate 5 between Santa Clarita Valley and Pyramid Lake.
• Area C: the southern portion of California State Route 14 (SR 14).
• Area D: the middle portion of California State Route 2 (SR 2).
• Area E: the city of Malibu, Malibu Creek State Park, and their surrounding areas.
• Area F: the east of Los Angeles city and its surrounding areas.
• Area G: the city of Pomona, the city of West Covina, and their surrounding areas.
• Area H: the interchange of Interstates 405, 5, and 210 and its surrounding areas.
Figure 6 shows that the distribution of the fatality rate is
not highly correlated with the accident density. This is because
the density of accidents mostly depends on traffic volumes: higher
traffic volumes in urban areas naturally lead to more traffic
accidents [9]. The higher fatality rates in areas A-H, however,
indicate that these places are more dangerous than others and
that traffic control and safety supervision are less effective
there. Therefore, one objective of this study is to look into the
reasons behind these dangerous areas and discuss possible
measures to improve safety performance. According to the results
shown in Table 4, the eight features may offer some clues.
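The grid-based fatality rate used for Figure 6 (4) can be sketched as follows. This is a minimal illustration, not the authors' GIS workflow: the record layout (longitude, latitude, fatal flag), the bounding box and the function name are assumptions, and Rf is taken as fatal accidents divided by all accidents in a grid, as the paper describes the fatality rate.

```python
from collections import defaultdict


def grid_fatality_rates(accidents, bounds, n=60):
    """Return {(row, col): Rf} for an n x n grid laid over `bounds`.

    accidents: iterable of (lon, lat, is_fatal) tuples (hypothetical layout).
    bounds: (min_lon, min_lat, max_lon, max_lat); all points are assumed
            to lie inside this box.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    cell_w = (max_lon - min_lon) / n
    cell_h = (max_lat - min_lat) / n
    totals = defaultdict(int)
    fatals = defaultdict(int)
    for lon, lat, is_fatal in accidents:
        # Clamp so that points on the upper boundary fall in the last cell.
        col = min(int((lon - min_lon) / cell_w), n - 1)
        row = min(int((lat - min_lat) / cell_h), n - 1)
        totals[(row, col)] += 1
        fatals[(row, col)] += int(is_fatal)
    # Rf = fatal accidents / all accidents; grids without accidents are omitted.
    return {cell: fatals[cell] / totals[cell] for cell in totals}
```

Grids that contain accidents but no fatal ones come out with Rf = 0, which corresponds to the green grids in Figure 6 (4).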
A. DRINK-DRIVING
Drink-driving is the most important feature for accident
fatality according to our results. Table 4 shows that the
fatality rate is 4.645% for accidents under the influence of
alcohol, while it drops to 0.714% for those without alcohol.
Previous studies have demonstrated that driving under the
influence of alcohol elevates the risk of traffic accidents
[18], [43]. The reason is that alcohol impairs the response
time and awareness of drivers. When an accident is imminent,
intoxicated drivers cannot judge the danger or take evasive
action in time, which greatly increases the potential fatality
of traffic accidents.
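To put the Table 4 figures on a common scale, one can divide the fatality rate with a feature by the rate without it. The short sketch below does this for all eight features; the rates are copied from Table 4, while the dictionary and function names are illustrative assumptions. For drink-driving the ratio is 4.645 / 0.714, roughly 6.5.

```python
# (rate without the feature in %, rate with the feature in %), from Table 4.
TABLE_4 = {
    "ETOH": (0.714, 4.645),
    "PARTIES": (0.986, 2.708),     # <= 4 parties vs. > 4 parties
    "CRASHTYP_C": (1.408, 0.385),
    "LIGHTING_A": (2.038, 0.604),
    "PEDCOL": (0.673, 4.547),
    "MCCOL": (0.912, 3.240),
    "DAYWEEK": (0.904, 1.414),     # weekday vs. weekend
    "TIMECAT": (0.817, 3.724),     # 6:00-24:00 vs. 00:00-6:00
}


def rate_ratio(feature):
    """Fatality rate with the feature divided by the rate without it."""
    without, with_feature = TABLE_4[feature]
    return with_feature / without


for name in TABLE_4:
    print(f"{name:11s} ratio = {rate_ratio(name):.2f}")
```

A ratio above 1 marks a feature associated with a higher fatality rate; rear-end collisions and daylight fall below 1.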
Figure 7 (1) shows the distribution of the percentage of
traffic accidents associated with drink-driving. The percentages
are higher in areas A, C, F, G and H, which might be caused by
relatively loose drink-driving control there. Compared with
Figure 6 (4), it can be inferred that drink-driving is one of the
crucial reasons why areas A, C, F, G and H have higher fatality
rates. Hence, drink-driving enforcement should be strengthened
in these areas.
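The paper compares the maps in Figure 7 and Figure 6 (4) visually. As a hedged illustration only, the same comparison could be quantified by correlating two per-grid maps, for example the share of drink-driving accidents and Rf. The use of Pearson correlation here is our assumption and not a step reported in the paper, and both function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare_grids(feature_share, fatality_rate):
    """Correlate two {grid: value} maps over the grids present in both."""
    cells = sorted(set(feature_share) & set(fatality_rate))
    return pearson([feature_share[c] for c in cells],
                   [fatality_rate[c] for c in cells])
```

A value near 1 would indicate that grids with a higher share of the feature also tend to have a higher fatality rate; it would not by itself establish a causal link.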
B. PARTIES INVOLVEMENT
The second feature is the number of parties involved in
the accident. Table 4 shows that when this number is lower than
or equal to 4, the fatality rate is 0.986%, whereas when it is
higher than 4, the fatality rate increases to 2.708%. More details
about the number of parties and the fatality rate are shown in
Figure 8. When the number is larger than one, the fatality rate
goes up as the number of parties increases. This is reasonable
because multi-vehicle accidents are more likely to be serious
accidents, which have a higher probability of causing fatalities.
Previous studies also agreed that the more parties are involved in
an accident, the more people are involved, which in turn
increases the potential fatality rate [44]. Also, in a
multi-vehicle accident, those who are not wounded at first and
decide to leave their vehicles at once are still at risk of being
hit by other oncoming vehicles. However, the cause of
multi-vehicle collisions is often hard to determine.
Lowering the road speed limit can be one option for controlling
the rate of multi-vehicle accidents.
Figure 8 also shows that the fatality rate is high when the
number of parties equals one. Common factors contributing to
single-vehicle collisions include excessive speed, driver fatigue,
driving under the influence and the age of the vehicle [45], [46].
Environmental and roadway factors can also contribute to
single-vehicle accidents. These include inclement weather, falling
rocks, poor lighting conditions, narrow lanes and shoulders,
insufficient curve banking and sharp curves [47], [48]. These may
be the reasons why single-vehicle collisions have a high fatality
rate.
C. REAR-END COLLISION
Whether the accident is a rear-end collision is the third factor
that differentiates fatal from non-fatal accidents. Rear-end
collisions are considered the most frequently occurring type of
traffic accident [11]. In our data, there are more than 108
thousand rear-end crashes, accounting for around 35.3% of the
total accidents. Table 4 shows that the fatality rate is 0.385%
for rear-end collisions and 1.408% otherwise. Fatality rates of
different crash types are shown in Figure 8. Rear-end crashes
have the lowest fatality rate, while overturned and hit-object
crashes have the highest. Overturned means that the vehicle flips
onto its side or roof, which usually happens when the vehicle
makes a sharp turn at high speed. This is expected to lead to a
higher fatality rate. Hit object means leaving the roadway and
hitting a fixed object alongside the road. Such crashes often
result from behaviors like excessive speed, drowsy driving and
drink-driving, and show a higher fatality rate [11], [49].
TABLE 4: Fatality rates of the selected eight features.
Feature      Description                          Fatality rate without the feature (%)   Fatality rate with the feature (%)
ETOH         Drink-driving                        0.714                                   4.645
PARTIES      Number of parties in the collision   0.986 (number <= 4)                     2.708 (number > 4)
CRASHTYP_C   Rear-end collision                   1.408                                   0.385
LIGHTING_A   Daylight                             2.038                                   0.604
PEDCOL       Pedestrian accidents                 0.673                                   4.547
MCCOL        Motorcycle accidents                 0.912                                   3.240
DAYWEEK      Day of the week                      0.904 (weekday)                         1.414 (weekend)
TIMECAT      Time of the day                      0.817 (6:00-24:00)                      3.724 (00:00-6:00)
FIGURE 6: The distributions of (1) accident density, (2) fatal accident density, (3) main cities in Los Angeles County, and (4)
fatality rates (grid-based).
D. LIGHTING CONDITION
Lighting condition has been identified in previous literature as
one of the most critical factors affecting traffic accidents
[50], [51]. Poor lighting significantly increases the
likelihood of fatal accidents. Table 4 shows that the fatality
rate in daylight is only 0.604%, but it increases to 2.038% under
other lighting conditions. Detailed fatality rates under different
lighting conditions are presented in Figure 8. The fatality rates
are higher when there are no street lights or the street lights are
not functioning. This is reasonable since drivers may not see the
road conditions clearly and may make wrong judgments. Besides,
when the surroundings are dark, drivers are more likely to be
dazzled by strong headlights from other vehicles. This is even
more dangerous, as the drivers may not see the road at all.
Figure 7 (2) shows the distribution of the percentage of
accidents that happened under poor lighting conditions (no street
lights or the lights are not functioning). The percentages in
area B, area C, area H and the main roads around area A are higher
than in other places. Compared with Figure 6 (4), it can be
inferred that poor lighting conditions may be one of the reasons
for the higher fatality rates in area B, area C, area H and the
surroundings of area A. Therefore, it is suggested that the
maintenance of street lights be strengthened in these areas.
FIGURE 7: The grid-based distribution of the percentage of (1) drink-driving accidents, (2) accidents under poor
lighting conditions (no street lights or the lights are not functioning), (3) accidents involving pedestrians, (4) accidents
involving motorcycles, (5) accidents that happened on weekends and (6) accidents that happened during 0 am-6 am.
E. PEDESTRIAN INVOLVEMENT
Pedestrian involvement is the fifth factor. Pedestrians
are a typical kind of vulnerable road user (VRU), which
means that they are at higher risk in traffic than other
road users [52]. This is because pedestrians have little
protection when traffic accidents happen. Also, drinking and
traffic-rule violations are a severe risk not only for motor
vehicle drivers but also for pedestrians, especially when they
are crossing the road [55].
Table 4 shows that the fatality rate reaches 4.547% for accidents
involving pedestrians, while it is only 0.673% for accidents
without pedestrians. Fatality rates of accidents between motor
vehicles and other kinds of parties are shown in Figure 8. The
fatality rate of pedestrian accidents is almost eight times
higher than that of motor-motor accidents (Other Motor Vehicle).
This could be the reason why INVOLVE_C (motor-motor collisions)
stands out in Figure 5: its relatively low fatality rate helps
differentiate fatal from non-fatal accidents.
The distribution of the percentage of pedestrian accidents is
presented in Figure 7 (3). Areas F, G and H have a higher
percentage of pedestrian accidents, because the population
density is higher in these areas. Compared with Figure 6 (4), it
can be inferred that pedestrian accidents might be one of the
reasons why areas F, G and H have higher fatality rates.
Therefore, it is suggested that traffic management for pedestrians
be strengthened in these three areas, for example by reducing
the vehicle speed limit, installing more raised crosswalks and
adding pedestrian lights [53]–[55].
F. MOTORCYCLE INVOLVEMENT
Motorcycle accidents are the sixth factor. Table 4 shows that
the fatality rate is 3.240% for motorcycle accidents, but
decreases to 0.912% for accidents without motorcycles. The
reasons behind the high fatality rate of motorcycle accidents
fall into two aspects. On the one hand, unlike automobile
drivers, motorcycle riders are not protected by seatbelts or
airbags. Therefore, when accidents happen, they are more likely
to be ejected from their seats, and their heads may be struck
directly without protection from any cushioning. This increases
the likelihood of fatality. On the other hand, the stability of
motorcycles is poor. They are easily affected by debris, uneven
roads, wet road surfaces and other small objects [56].
FIGURE 8: Fatality rate under different conditions.
Figure 7 (4) shows the distribution of the percentage of
motorcycle accidents. Areas C, D and E have higher percentages
of motorcycle accidents. These three areas are located in the
mountainous parts of Los Angeles County. Steep grades and curves
of highways built in mountainous areas pose serious safety
threats to passing vehicles, especially motorcycles [57].
Compared with Figure 6 (4), it can be inferred that motorcycle
accidents can be one of the causes of the high fatality rates in
areas C, D and E. Thus, in these areas, traffic control on
motorcycles should be strengthened, speed limits should be
lowered, and more warning signs should be installed to alert
motorcyclists.
G. DAY OF THE WEEK
Day of the week appears to be an essential factor differentiating
fatal from non-fatal accidents according to our results. Table 4
shows that the fatality rate is 0.904% for accidents on weekdays
and 1.414% for accidents on weekends. Fatality rates from Monday
to Sunday within the whole county are presented in Figure 8,
which also indicates that the fatality rates on weekends are
higher than those on weekdays. This might be because accidents
involving alcohol use, speeding, not using seatbelts, etc., are
more likely to happen when people are having fun on weekends
[10].
The distribution of the percentage of accidents that happened
on weekends is presented in Figure 7 (5). The percentage of
weekend accidents is higher in areas B, D, E, F and H. Compared
with Figure 6 (4), it can be inferred that weekend-related
problems may be one of the reasons why areas B, D, E, F and H
have higher fatality rates. Therefore, traffic control during
the weekend should be strengthened in these areas.
H. TIME OF THE DAY
Different times within a day have different fatality rates.
Table 4 shows that, compared with other time periods, the
fatality rate is significantly higher from 0 am to 6 am,
reaching 3.724%. Detailed fatality rates of different time
periods within a day are shown in Figure 8. The fatality rate is
highest between 3 am and 6 am. Although the traffic volume
during this period is not as large as in other periods, factors
like fatigued driving, speeding, poor lighting conditions and
loose traffic control could all exacerbate the severity of
accidents after midnight [58].
Figure 7 (6) presents the distribution of the percentage of
accidents that happened from 0 am to 6 am. The percentage is
higher in areas F and H. Compared with Figure 6 (4), it can be
inferred that one of the reasons for the high fatality rates in
areas F and H might be problems arising from 0 am to 6 am.
Therefore, traffic control should be strengthened during this
period in these two areas.
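The time-of-day comparison above amounts to binning accidents by hour and computing the fatal share per bin. The sketch below illustrates this under stated assumptions: the (hour, fatal flag) record layout and the function names are hypothetical, the two TIMECAT periods follow Table 4, the 3-hour bins are only an assumed resolution for a finer breakdown, and the toy records are not data from the study.

```python
from collections import Counter


def timecat(hour):
    """Map an hour of day (0-23) to the two TIMECAT periods of Table 4."""
    return "00:00-6:00" if 0 <= hour < 6 else "6:00-24:00"


def three_hour_bin(hour):
    """Map an hour of day to a 3-hour bin label such as '03-06' (assumed bins)."""
    start = 3 * (hour // 3)
    return f"{start:02d}-{start + 3:02d}"


def fatality_rate_by(records, key):
    """Fatal share of accidents per group, in percent.

    records: iterable of (hour, is_fatal) pairs (hypothetical layout).
    key: function mapping an hour to a group label.
    """
    totals, fatals = Counter(), Counter()
    for hour, is_fatal in records:
        group = key(hour)
        totals[group] += 1
        fatals[group] += int(is_fatal)
    return {group: 100.0 * fatals[group] / totals[group] for group in totals}


# Toy records for illustration only; not data from the study.
toy = [(2, True), (4, False), (9, False), (14, False), (18, False), (23, True)]
print(fatality_rate_by(toy, timecat))
print(fatality_rate_by(toy, three_hour_bin))
```

The same helper works for the weekday and weekend split if the key function maps a day of the week to one of the two groups instead.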
I. SUMMARY
Based on Table 4, Figure 6, Figure 7 and Figure 8, possible
reasons behind the high fatality rates in areas A-H have been
analyzed and discussed. According to the results revealed by our
methodology, relevant countermeasures for those areas can be
summarized as follows:
Area A: Drink-driving and poor lighting conditions are
two crucial reasons for the high fatality rate in this
area. Therefore, drink-driving control should be
emphasized, more street lights should be installed,
and regular maintenance should be strengthened.
Area B: Poor lighting conditions are one of the reasons
why area B has a high fatality rate. Thus, the local
government should increase the budget and improve the
lighting conditions there. A high percentage of weekend
accidents is another possible reason for the high fatality
rate in area B. Therefore, traffic control such as speed
limit enforcement should be strengthened on weekends there.
Area C: Similar to area A, drink-driving and poor
lighting conditions are two principal causes of the high
fatality rate in area C. Furthermore, the high frequency
of motorcycle accidents also worsens road safety
there. Hence, in area C, besides drink-driving control
and the improvement of lighting infrastructure, proper
management of motorcycles should also be emphasized.
Area D and E: The high fatality rates in areas D and E
may be caused by motorcycle accidents and weekend
accidents. Therefore, in these two areas, the control of
motorcycles should be stricter, and traffic management
on weekends needs to be enhanced.
Area F and H: Reasons behind the high fatality rates
in areas F and H include drink-driving, pedestrian
accidents, weekend accidents and the high percentage of
accidents from 0 am to 6 am. It is therefore suggested
that drink-driving control be stricter in these two
areas, that road infrastructure for pedestrians be
further improved, and that traffic control during
weekends and from 0 am to 6 am be given more attention.
Area H also shows poor lighting conditions, so the
lighting there should be improved as well.
Area G: Drink-driving and pedestrian accidents are the
main driving factors of the high fatality rate in area
G. Therefore, drink-driving control should be improved,
and more road infrastructure for pedestrians should be
installed in this area.
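The per-area summary above can be reproduced by inverting the factor-to-area findings of Sections V-A to V-H. The sketch below is a bookkeeping aid only; the dictionary and function names are illustrative assumptions, and the area letters are those reported in the text.

```python
# Areas reported with an elevated percentage for each factor.
# "A" under poor lighting refers to the main roads around area A.
FACTOR_AREAS = {
    "drink-driving": "ACFGH",
    "poor lighting": "ABCH",
    "pedestrian accidents": "FGH",
    "motorcycle accidents": "CDE",
    "weekend accidents": "BDEFH",
    "0 am to 6 am accidents": "FH",
}


def factors_by_area(factor_areas):
    """Invert {factor: areas} into {area: [factors]}."""
    result = {}
    for factor, areas in factor_areas.items():
        for area in areas:
            result.setdefault(area, []).append(factor)
    return result


for area, factors in sorted(factors_by_area(FACTOR_AREAS).items()):
    print(f"Area {area}: {', '.join(factors)}")
```

Keeping the findings in one table like this makes it easy to check that each recommendation in the summary traces back to a factor map in Figure 7.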
Note that this study analyzes the influential factors on
fatality rates rather than accident rates. The fatality rate is
the proportion of fatal accidents among all accidents, while
the accident rate is the proportion of accidents relative to
traffic volume. These are two different problems. The former
asks which factors are likely to make an accident fatal once
it has happened, while the latter studies the factors that
may lead from normal driving to an accident. Both problems
are important and typical in accident analysis [21]. Due to
data availability, this study focuses on the grid-based
fatality rate. Future work will try to integrate traffic
volume data and analyze the influential factors on accident
rates.
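The distinction between the two rates can be written compactly. In the sketch below, Rf is the paper's symbol for the fatality rate, while the symbols for the accident rate, the accident counts and the traffic volume are notation assumed here for illustration.

```latex
% R_f is the paper's symbol; R_a, N_f, N_a and V are assumed notation.
% N_f: fatal accidents in a grid, N_a: all accidents in the grid,
% V: traffic volume in the grid over the same period.
R_f = \frac{N_f}{N_a}, \qquad R_a = \frac{N_a}{V}
```

Only the first ratio can be computed from accident records alone, which is why the traffic volume data mentioned above would be needed to study the second.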
In addition, driving speed is a potentially important factor
that has been mentioned in the literature [11], [55]. However,
the speed-related factor "whether the vehicle is driving
at an unsafe speed (speeding)", derived from the "VIOLCAT"
feature in Table 2, did not rank in the top 15 in our
experiment in Los Angeles. This means that, according to our
methodology, it is not as important as the top 15 factors in
making an accident fatal. This is understandable because, on
the one hand, due to data availability, this speed-related
factor may not perfectly represent the influence of speed on
accident fatality. On the other hand, in many fatal accidents
it was not the drivers who violated the traffic rules, but
drunk pedestrians or other vulnerable road users. A deeper
investigation of the influence of driving speed on fatality
rates should be conducted in the future.
VI. CONCLUSIONS
To conclude, this paper proposed a GIS-based data mining
methodology framework to analyze the influential factors of
fatal traffic accidents. The idea is to first use XGBoost to
build a binary classification model between fatal and non-fatal
accidents. Based on the important factors identified by the
XGBoost model, the grid-based analysis (or the area analysis) is
then used to conduct the spatial analysis of fatality rates. A
case study in Los Angeles County was conducted to validate the
proposed method. Five commonly seen machine learning algorithms,
including MLR, LR, MLP, SVM, and RF, were also applied to model
the influential features and the traffic fatality for
comparison. Results showed that XGBoost obtained the highest
modeling accuracy. It was then applied to investigate the
variable importance of the studied features. Eight factors were
found to be the most influential: drink-driving, the number of
parties involved, rear-end crash, lighting condition, pedestrian
involvement, motorcycle involvement, the day of the week and the
time of the day. Through the grid-based analysis in GIS, their
spatial relationships with the fatality rate were analyzed.
A. CONTRIBUTIONS
The strengths and contributions of the study lie in two
aspects. First, we proposed an effective methodology
framework for analyzing the influential factors on the fatality
of traffic accidents. The methodology not only provided
more accurate modeling performance than traditional
linear methods, but also calculated the variable importance
more objectively and then specifically located the
relationships between the factors and the target. Such a process
can systematically and numerically analyze the most
influential features on traffic fatalities. Of course, the
current experiments only support the effectiveness of the
method in analyzing accidents and fatalities in Los Angeles.
Whether the method fits other places and scopes needs further
study to verify.
The second contribution is that the case study conducted
in Los Angeles County uncovered eight areas with higher
traffic fatality rates. According to our numerical analysis,
the higher fatality rates in these areas may result from the
top influential factors discovered in our model. Based on
the spatial analysis, specific practical suggestions on how to
improve road safety in these eight areas are provided. These
can be useful references for governments during policy-making.
B. LIMITATIONS AND FUTURE WORK
Still, there are limitations in this paper. Due to data
availability, some possibly influential factors, such as traffic
volume, education, and road width, are not considered for
different grids. Also, the authors did not obtain accident
data with sufficient features and factors for more recent years
such as 2017-2018. This restricted us from analyzing and
comparing the differences and improvements in traffic
fatalities in Los Angeles. Further studies can be expanded
to consider those factors and data for a more comprehensive
analysis.
REFERENCES
[1] “WHO | 10 facts on global road safety.” [Online]. Available: http:
//www.who.int/features/factfiles/roadsafety/en/
[2] vi.chilukuri.ctr@dot.gov, “USDOT Releases 2016 Fatal Traffic
Crash Data,” Oct. 2017. [Online]. Available: https://www.nhtsa.gov/
press-releases/usdot-releases-2016-fatal-traffic-crash-data
[3] Y. Shen, E. Hermans, Q. Bao, T. Brijs, G. Wets, and W. Wang, “Inter-
national benchmarking of road safety: State of the art,” Transportation
Research Part C: Emerging Technologies, vol. 50, pp. 37–50, Jan. 2015,
doi: 10.1016/j.trc.2014.07.006. [Online]. Available: https://linkinghub.
elsevier.com/retrieve/pii/S0968090X14002083
[4] P. St-Aubin, N. Saunier, and L. Miranda-Moreno, “Large-scale automated
proactive road safety analysis using video data,” Transportation Research
Part C: Emerging Technologies, vol. 58, pp. 363–379, Sep. 2015, doi:
10.1016/j.trc.2015.04.007. [Online]. Available: https://linkinghub.elsevier.
com/retrieve/pii/S0968090X15001485
[5] S. E. Shladover and C. Nowakowski, “Regulatory challenges for
road vehicle automation: Lessons from the California experience,”
Transportation Research Part A: Policy and Practice, Oct. 2017, doi:
10.1016/j.tra.2017.10.006. [Online]. Available: http://www.sciencedirect.
com/science/article/pii/S0965856417300952
[6] G. Ginzburg, S. Evtiukov, I. Brylev, and S. Volkov, “Reconstruction
of Road Accidents Based on Braking Parameters of Category L3
Vehicles,” Transportation Research Procedia, vol. 20, pp. 212–218,
Jan. 2017, doi: 10.1016/j.trpro.2017.01.054. [Online]. Available: http:
//www.sciencedirect.com/science/article/pii/S2352146517300546
[7] C. Castañeda P. and J. G. Villegas, “Analyzing the response
to traffic accidents in Medellín, Colombia, with facility
location models,” IATSS Research, vol. 41, no. 1, pp. 47–
56, Apr. 2017, doi: 10.1016/j.iatssr.2016.09.002. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0386111216300437
[8] Y. Liu, Z. Li, J. Liu, and H. Patel, “A double standard
model for allocating limited emergency medical service vehicle
resources ensuring service reliability,” Transportation Research
Part C: Emerging Technologies, vol. 69, pp. 120–133,
Aug. 2016, doi: 10.1016/j.trc.2016.05.023. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S0968090X16300602
[9] M. A. Abdel-Aty and A. E. Radwan, “Modeling traffic accident
occurrence and involvement,” Accident Analysis & Prevention, vol. 32,
no. 5, pp. 633–642, Sep. 2000, doi: 10.1016/S0001-4575(99)00094-
9. [Online]. Available: http://www.sciencedirect.com/science/article/pii/
S0001457599000949
[10] E. K. Adanu, A. Hainen, and S. Jones, “Latent class analysis
of factors that influence weekday and weekend single-vehicle crash
severities,” Accident Analysis & Prevention, vol. 113, pp. 187–
192, Apr. 2018, doi: 10.1016/j.aap.2018.01.035. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457518300411
[11] S. A. Mohamed, K. Mohamed, and H. A. Al-Harthi, “Investigating
Factors Affecting the Occurrence and Severity of Rear-End Crashes,”
Transportation Research Procedia, vol. 25, pp. 2098–2107, Jan.
2017, doi: 10.1016/j.trpro.2017.05.403. [Online]. Available: http://www.
sciencedirect.com/science/article/pii/S235214651730710X
[12] F. Netjasov, D. Crnogorac, and G. Pavlović, “Potential safety
occurrences as indicators of air traffic management safety
performance: A network based simulation model,” Transportation
Research Part C: Emerging Technologies, vol. 102, pp. 490–
508, May 2019, doi: 10.1016/j.trc.2019.03.026. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0968090X1831060X
[13] J. Lee, J. Chae, T. Yoon, and H. Yang, “Traffic accident severity analysis
with rain-related factors using structural equation modeling – A case
study of Seoul City,” Accident Analysis & Prevention, vol. 112, pp.
1–10, Mar. 2018, doi: 10.1016/j.aap.2017.12.013. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457517304517
[14] W. H. Schneider and P. T. Savolainen, “Comparison of Severity of
Motorcyclist Injury by Crash Types,” Transportation Research Record,
vol. 2265, no. 1, pp. 70–80, Jan. 2011, doi: 10.3141/2265-08. [Online].
Available: http://journals.sagepub.com/doi/10.3141/2265-08
[15] S. Hashimoto, S. Yoshiki, R. Saeki, Y. Mimura, R. Ando, and
S. Nanba, “Development and application of traffic accident density
estimation models using kernel density estimation,” Journal of Traffic
and Transportation Engineering (English Edition), vol. 3, no. 3, pp.
262–270, Jun. 2016, doi: 10.1016/j.jtte.2016.01.005. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S2095756415305808
[16] H. A. Aziz, S. V. Ukkusuri, and S. Hasan, “Exploring the
determinants of pedestrian–vehicle crash severity in New York
City,” Accident Analysis & Prevention, vol. 50, pp. 1298–
1309, Jan. 2013, doi: 10.1016/j.aap.2012.09.034. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0001457512003533
[17] B. J. Russo and P. T. Savolainen, “A comparison of freeway median
crash frequency, severity, and barrier strike outcomes by median
barrier type,” Accident Analysis & Prevention, vol. 117, pp. 216–
224, Aug. 2018, doi: 10.1016/j.aap.2018.04.023. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457518301763
[18] F. Dapilah, B. Y. Guba, and E. Owusu-Sekyere, “Motorcyclist charac-
teristics and traffic behaviour in urban Northern Ghana: Implications
for road traffic accidents,” Journal of Transport & Health, vol. 4, pp.
237–245, Mar. 2017, doi: 10.1016/j.jth.2016.03.001. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S2214140516000232
[19] A. Karimi and E. Kashi, “Investigating the effect of geometric parameters
influencing safety promotion and accident reduction (Case study:
Bojnurd-Golestan National Park road),” Cogent Engineering, vol. 5, no. 1,
p. 1525812, 2018. [Online]. Available: https://www.tandfonline.com/doi/
abs/10.1080/23311916.2018.1525812
[20] D. Onozuka, K. Nishimura, and A. Hagihara, “Full moon and traffic
accident-related emergency ambulance transport: A nationwide case-
crossover study,” Science of The Total Environment, vol. 644, pp. 801–
805, Dec. 2018, doi: 10.1016/j.scitotenv.2018.07.053. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0048969718325361
[21] L. Mussone, M. Bassani, and P. Masci, “Analysis of factors
affecting the severity of crashes in urban road intersections,”
Accident Analysis & Prevention, vol. 103, pp. 112–122,
Jun. 2017, doi: 10.1016/j.aap.2017.04.007. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457517301355
[22] Z. Li, P. Liu, W. Wang, and C. Xu, “Using support
vector machine models for crash injury severity analysis,”
Accident Analysis & Prevention, vol. 45, pp. 478–486,
Mar. 2012, doi: 10.1016/j.aap.2011.08.016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457511002363
[23] G. Casalicchio, C. Molnar, and B. Bischl, “Visualizing the Feature
Importance for Black Box Models,” arXiv:1804.06620 [cs, stat], Apr.
2018, arXiv: 1804.06620. [Online]. Available: http://arxiv.org/abs/1804.
06620
[24] “Geographic information system,” Aug. 2018. [On-
line]. Available: https://en.wikipedia.org/w/index.php?title=Geographic_
information_system&oldid=853176113
[25] J. Ma and J. C. Cheng, “Estimation of the building energy use
intensity in the urban scale by integrating GIS and big data technology,”
Applied Energy, vol. 183, pp. 182–192, 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0306261916311679
[26] S. Hong, J. Heo, and A. P. Vonderohe, “Simulation-based approach
for uncertainty assessment: Integrating GPS and GIS,” Transportation
Research Part C: Emerging Technologies, vol. 36, pp. 125–137,
Nov. 2013, doi: 10.1016/j.trc.2013.08.008. [Online]. Available: https:
//linkinghub.elsevier.com/retrieve/pii/S0968090X13001721
[27] J. H. Kazmi and S. Zubair, “Estimation of Vehicle Damage Cost
Involved in Road Traffic Accidents in Karachi, Pakistan: A Geospatial
Perspective,” Procedia Engineering, vol. 77, pp. 70–78, Jan. 2014,
doi: 10.1016/j.proeng.2014.07.008. [Online]. Available: http://www.
sciencedirect.com/science/article/pii/S1877705814009850
[28] M. A. M. Din, S. Paramasivam, N. M. Tarmizi, and A. M. Samad,
“The Use of Geographical Information System in the Assessment of
Level of Service of Transit Systems in Kuala Lumpur,” Procedia
- Social and Behavioral Sciences, vol. 222, pp. 816–826, Jun.
2016, doi: 10.1016/j.sbspro.2016.05.181. [Online]. Available: http:
//www.sciencedirect.com/science/article/pii/S1877042816302567
[29] A. Zangeneh, F. Najafi, S. Karimi, S. Saeidi, and N. Izadi, “Spatial-
temporal cluster analysis of mortality from road traffic injuries using
geographic information systems in West of Iran during 2009–2014,”
Journal of Forensic and Legal Medicine, vol. 55, pp. 15–22, Apr. 2018,
doi: 10.1016/j.jflm.2018.02.009. [Online]. Available: https://linkinghub.
elsevier.com/retrieve/pii/S1752928X18300258
[30] L. Zhang and C. Zhan, “Machine Learning in Rock Facies Classification:
An Application of XGBoost,” in International Geophysical Conference,
Qingdao, China, 17-20 April 2017. Qingdao, China: Society of
Exploration Geophysicists and Chinese Petroleum Society, May 2017,
pp. 1371–1374, doi: 10.1190/IGC2017-351. [Online]. Available: http:
//library.seg.org/doi/10.1190/IGC2017-351
[31] L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu, “Machine
learning–XGBoost analysis of language networks to classify patients
with epilepsy,” Brain Inf., vol. 4, no. 3, pp. 159–169, Sep. 2017,
doi: 10.1007/s40708-017-0065-7. [Online]. Available:
http://link.springer.com/10.1007/s40708-017-0065-7
[32] J. Ma and J. C. P. Cheng, “Identification of the numerical patterns behind
the leading counties in the U.S. local green building markets using
data mining,” Journal of Cleaner Production, vol. 151, pp. 406–418,
2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0959652617305176
[33] H. Zheng, J. Yuan, and L. Chen, “Short-Term Load Forecasting
Using EMD-LSTM Neural Networks with a Xgboost Algorithm
for Feature Importance Evaluation,” Energies, vol. 10, no. 8, p.
1168, Aug. 2017, doi: 10.3390/en10081168. [Online]. Available:
http://www.mdpi.com/1996-1073/10/8/1168
[34] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics
& Data Analysis, vol. 38, no. 4, pp. 367–378, Feb. 2002, doi:
10.1016/S0167-9473(01)00065-2. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0167947301000652
[35] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting
System,” in Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining - KDD ’16.
San Francisco, California, USA: ACM Press, 2016, pp. 785–794, doi:
10.1145/2939672.2939785. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=2939672.2939785
[36] H. Zhang, D. Qiu, R. Wu, Y. Deng, D. Ji, and T. Li, “Novel framework
for image attribute annotation with gene selection XGBoost algorithm
and relative attribute model,” Applied Soft Computing, vol. 80, pp.
57–79, Jul. 2019, doi: 10.1016/j.asoc.2019.03.017. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S1568494619301358
[37] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, “Data
Preprocessing for Supervised Leaning,” International Journal of Computer Science,
vol. 1, no. 1, pp. 111–117, Jan. 2006.
[38] J. Benesty, J. Chen, Y. Huang, and I. Cohen, “Pearson Correlation
Coefficient,” in Noise Reduction in Speech Processing, ser. Springer
Topics in Signal Processing, I. Cohen, Y. Huang, J. Chen, and
J. Benesty, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
2009, pp. 1–4, doi: 10.1007/978-3-642-00296-0_5. [Online]. Available:
https://doi.org/10.1007/978-3-642-00296-0_5
[39] A. Y.-c. Liu, “The Effect of Oversampling and Undersampling on
Classifying Imbalanced Text Datasets,” 2004.
[40] J. C. Cheng and L. J. Ma, “A non-linear case-based reasoning
approach for retrieval of similar cases and selection of target credits
in LEED projects,” Building and Environment, vol. 93, pp. 349–361,
2015. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0360132315300676
[41] J. Ma and J. C. P. Cheng, “Data-driven study on the achievement
of LEED credits using percentage of average score and association
rule analysis,” Building and Environment, vol. 98, pp. 121–132, 2016.
[Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0360132316300051
[42] J. Ma and J. C. P. Cheng, “Selection of target LEED credits
based on project information and climatic factors using data mining
techniques,” Advanced Engineering Informatics, vol. 32, pp. 224–236,
2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1474034616301124
[43] B. Assemi and M. Hickman, “Relationship between heavy vehicle
periodic inspections, crash contributing factors and crash severity,”
Transportation Research Part A: Policy and Practice, vol. 113, pp.
441–459, Jul. 2018, doi: 10.1016/j.tra.2018.04.018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S096585641630502X
[44] K. K. Yau, H. Lo, and S. H. Fung, “Multiple-vehicle traffic accidents in
Hong Kong,” Accident Analysis & Prevention, vol. 38, no. 6, pp. 1157–1161,
Nov. 2006, doi: 10.1016/j.aap.2006.05.002. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S0001457506000807
[45] W. Haddon and V. A. Bradess, “ALCOHOL IN THE SINGLE
VEHICLE FATAL ACCIDENT: EXPERIENCE OF WESTCHESTER
COUNTY, NEW YORK,” JAMA, vol. 169, no. 14, pp. 1587–1593,
Apr. 1959, doi: 10.1001/jama.1959.03000310039009. [Online]. Available:
https://jamanetwork.com/journals/jama/fullarticle/325699
[46] J.-K. Kim, G. F. Ulfarsson, S. Kim, and V. N. Shankar,
“Driver-injury severity in single-vehicle crashes in California: A mixed
logit analysis of heterogeneity due to age and gender,” Accident
Analysis & Prevention, vol. 50, pp. 1073–1081, Jan. 2013, doi:
10.1016/j.aap.2012.08.011. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457512002990
[47] V. Shankar and F. Mannering, “An exploratory multinomial logit
analysis of single-vehicle motorcycle accident severity,” Journal of Safety
Research, vol. 27, no. 3, pp. 183–194, Sep. 1996,
doi: 10.1016/0022-4375(96)00010-2. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/0022437596000102
[48] K. K. W. Yau, “Risk factors affecting the severity of single vehicle
traffic accidents in Hong Kong,” Accident Analysis & Prevention, vol. 36,
no. 3, pp. 333–340, May 2004, doi: 10.1016/S0001-4575(03)00012-5.
[Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457503000125
[49] M. Hiramatsu, “Broadside collision scenarios for different types of
pre-crash driving patterns,” JSAE Review, vol. 24, no. 3, pp. 327–334,
Jul. 2003, doi: 10.1016/S0389-4304(03)00033-X. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S038943040300033X
[50] M. Jalayer, R. Shabanpour, M. Pour-Rouholamin, N. Golshani, and
H. Zhou, “Wrong-way driving crashes: A random-parameters ordered probit
analysis of injury severity,” Accident Analysis & Prevention, vol. 117,
pp. 128–135, Aug. 2018, doi: 10.1016/j.aap.2018.04.019. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0001457518301635
[51] R. O. Mujalli, G. López, and L. Garach, “Bayes classifiers for imbalanced
traffic accidents datasets,” Accident Analysis & Prevention, vol. 88, pp.
37–51, Mar. 2016, doi: 10.1016/j.aap.2015.12.003. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0001457515301548
[52] M. Vilaça, N. Silva, and M. C. Coelho, “Statistical Analysis of
the Occurrence and Severity of Crashes Involving Vulnerable Road
Users,” Transportation Research Procedia, vol. 27, pp. 1113–1120,
Jan. 2017, doi: 10.1016/j.trpro.2017.12.113. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S2352146517310104
[53] M. G. Mohamed, N. Saunier, L. F. Miranda-Moreno, and S. V.
Ukkusuri, “A clustering regression approach: A comprehensive injury
severity analysis of pedestrian–vehicle crashes in New York, US and
Montreal, Canada,” Safety Science, vol. 54, pp. 27–37, Apr. 2013,
doi: 10.1016/j.ssci.2012.11.001. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0925753512002664
[54] S. Jung, X. Qin, and C. Oh, “Improving strategic policies for
pedestrian safety enhancement using classification tree modeling,”
Transportation Research Part A: Policy and Practice, vol. 85, pp.
53–64, Mar. 2016, doi: 10.1016/j.tra.2016.01.002. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0965856416000021
[55] M. Soilán, B. Riveiro, A. Sánchez-Rodríguez, and P. Arias, “Safety
assessment on pedestrian crossing environments using MLS data,”
Accident Analysis & Prevention, vol. 111, pp. 328–337, Feb. 2018,
doi: 10.1016/j.aap.2017.12.009. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0001457517304475
[56] Z. Hui, Y. Guang-yu, L. Sheng-xiong, Y. Zhi-yong, W. Zheng-guo,
H. Wei, Y. Yong-min, C. Rong, and L. Gui-e, “Analysis
of 86 fatal motorcycle frontal crashes in Chongqing, China,”
Chinese Journal of Traumatology, vol. 15, no. 3, pp. 170–174,
Jun. 2012, doi: 10.3760/cma.j.issn.1008-1275.2012.03.009.
[Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1008127515302984
[57] F. Chen and S. Chen, “Differences in Injury Severity of Accidents
on Mountainous Highways and Non-mountainous Highways,” Procedia -
Social and Behavioral Sciences, vol. 96, pp. 1868–1879, Nov. 2013,
doi: 10.1016/j.sbspro.2013.08.212. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S1877042813023380
[58] G. Zhang, K. K. Yau, X. Zhang, and Y. Li, “Traffic
accidents involving fatigue driving and their extent of
casualties,” Accident Analysis & Prevention, vol. 87, pp. 34–42,
Feb. 2016, doi: 10.1016/j.aap.2015.10.033. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0001457515301159
Article
In the context of traffic safety, whenever a motorized road user moves against the proper flow of vehicle movement on physically divided highways or access ramps, this is referred to as wrong-way driving (WWD). WWD is notorious for its severity rather than frequency. Based on data from the U.S. National Highway Traffic Safety Administration, an average of 355 deaths occur in the U.S. each year due to WWD. This total translates to 1.34 fatalities per fatal WWD crashes, whereas the same rate for other crash types is 1.10. Given these sobering statistics, WWD crashes, and specifically their severity, must be meticulously analyzed using the appropriate tools to develop sound and effective countermeasures. The objectives of this study were to use a random-parameters ordered probit model to determine the features that best describe WWD crashes and to evaluate the severity of injuries in WWD crashes. This approach takes into account unobserved effects that may be associated with roadway, environmental, vehicle, crash, and driver characteristics. To that end and given the rareness of WWD events, 15 years of crash data from the states of Alabama and Illinois were obtained and compiled. Based on this data, a series of contributing factors including responsible driver characteristics, temporal variables, vehicle characteristics, and crash variables are determined, and their impacts on the severity of injuries are explored. An elasticity analysis was also performed to accurately quantify the effect of significant variables on injury severity outcomes. According to the obtained results, factors such as driver age, driver condition, roadway surface conditions, and lighting conditions significantly contribute to the injury severity of WWD crashes.