Preprints and early-stage research may not have been peer reviewed yet.

Predicting Accident Severity: An Analysis of Factors
Affecting Accident Severity Using Random Forest
Model
Adekunle Adefabi1, Somtobe Olisah2, Callistus Obunadike2, Oluwatosin Oyetubo3,
Esther Taiwo4, Edward Tella4
1,2,3,4Department of Computer Science and Quantitative Methods, Austin Peay State
University, Tennessee.
aadefabi@my.apsu.edu; solisah@my.apsu.edu; callistusobunadike@gmail.com;
ooyetubo@my.apsu.edu; etaiwo@my.apsu.edu; etella@my.apsu.edu;
ABSTRACT
Road accidents have significant economic and societal costs, with a small number of severe accidents accounting
for a large portion of these costs. Predicting accident severity can help in the proactive approach to road safety
by identifying potential unsafe road conditions and taking well-informed actions to reduce the number of severe
accidents. This study investigates the effectiveness of the Random Forest machine learning algorithm for
predicting the severity of an accident. The model is trained on a dataset of accident records from a large
metropolitan area and evaluated using various metrics. Hyperparameters and feature selection are optimized to
improve the model's performance. The results show that the Random Forest model is an effective tool for
predicting accident severity with an accuracy of over 80%. The study also identifies the top six most important
variables in the model, which include wind speed, pressure, humidity, visibility, clear conditions, and cloud cover.
The fitted model has an Area Under the Curve of 80%, a recall of 79.2%, a precision of 97.1%, and an F1 score
of 87.3%. These results suggest that the proposed model has higher performance in explaining the target variable,
which is the accident severity class. Overall, the study provides evidence that the Random Forest model is a
viable and reliable tool for predicting accident severity and can be used to help reduce the number of fatalities
and injuries due to road accidents in the United States.
KEYWORDS
Machine Learning, Random Forest Model, Accident Severity Prediction, Mean Decrease Gini
1. Introduction
According to [1], around 1.3 million lives are prematurely ended each year due to road traffic accidents.
An additional 20 to 50 million people suffer non-fatal injuries, which often results in disabilities. These
incidents result in significant economic burdens for individuals, families, and countries. The costs incurred
include medical treatment expenses, lost productivity of those who are killed or disabled, and the need for
family members to take time off work or school to care for the injured [2]. Road traffic accidents have
now emerged as one of the leading global causes of both fatalities and injuries. Consequently, the
prevention and prediction of traffic accidents have become prominent subjects in the fields of traffic
science and intelligent vehicle research.
Traffic accidents cost U.S. citizens hundreds of billions of dollars every year in economic and societal
losses, and a large part of these losses is caused by a small number of serious accidents. Reducing traffic
accidents, especially serious ones, therefore remains an important challenge. The proactive
approach, one of the two main approaches for dealing with traffic safety problems, focuses on preventing
potential unsafe road conditions from occurring in the first place. For the effective implementation of this
approach, accident prediction and severity prediction are critical. Identifying the patterns of how these
serious accidents happen and their key factors enables well-informed actions and
better allocation of financial and human resources. This study aims to investigate the effectiveness of the
Random Forest model for predicting the severity of an accident. The model was trained on a dataset of
accident records from a large metropolitan area and evaluated using various metrics. The hyperparameters
and feature selection were also optimized to improve the model's performance. The results of this study
will provide insight into the effectiveness of the Random Forest model for predicting the severity of an
accident and can help inform decision makers on how to reduce the number of fatalities and injuries due
to accidents. This study builds on previous research that has shown the effectiveness of machine learning
models for predicting the severity of an accident. For example, a study by [3] used a Support Vector
Machine model to predict the severity of an accident and achieved an accuracy of over 80%. Similarly, a
study by [4] used a Random Forest model to predict the severity of an accident and achieved an accuracy
of over 90%. Other studies have also shown the effectiveness of other machine learning models such as
Decision Trees [5] and Artificial Neural Networks [6] for predicting the severity of an accident.
2. Literature Review
The random forest algorithm finds extensive application in diverse domains, including medicine,
meteorology, statistics, and other emerging fields [7]–[9]. It has also shown promising outcomes in the
context of traffic accidents. [10] employed random forest in combination with Bayesian optimization to
investigate the impact of influential factors on the severity of traffic accidents. In a review of existing
literature for this study topic, the current research on the effectiveness of machine learning models for
predicting the severity of an accident was examined. Recent studies have shown that machine learning
models can be used to accurately predict the severity of an accident with an average accuracy of over 80%.
The most used models for this purpose are Support Vector Machines, Random Forests, Decision Trees,
and Artificial Neural Networks. Each of these models has its own advantages and disadvantages,
but they have all been shown to be effective for predicting the severity of an accident. Studies
have also shown that feature selection and hyperparameter optimization can improve the accuracy of the
models. Hybrid models and ensemble methods have also been explored, which can further improve the
predictive performance of the models.
In addition to the machine learning models mentioned above, other studies have explored the use of hybrid
models such as the Combined Support Vector Machine and Random Forest. This model combines the
strengths of Support Vector Machines and Random Forests, resulting in higher accuracy and better
generalization performance. Additionally, the application of machine learning models, e.g., ANN and logistic
regression models, is seen as a powerful mechanism to mitigate environmental hazards [11]. Other studies
have also explored the use of ensemble methods such as stacking and boosting [5]. These methods combine
multiple models to improve the predictive performance of the model. Random decision forests possess the
ability to adapt well to nonlinear patterns present in data, resulting in superior predictive performance
compared to linear regression [12]. To assess the severity of road accidents in highly populated areas, an
evaluation of the potential impact of accidents is necessary to implement effective accident management
procedures [13]. [14] employed random forest in conjunction with Bayesian optimization to
examine how influential factors influence the severity of traffic accidents. Random forest (RF) is an
ensemble model that relies on decision trees, enabling it to handle nonlinear variables with high
dimensionality, while also demonstrating robustness against outliers and noise [15]. Moreover, RF offers
insights into the relative importance of variables and provides partial dependence plots, facilitating the
interpretation of results. RF has found extensive use in transportation-related fields for both classification
and regression tasks, such as identifying travel mode choices, predicting road traffic conditions, and
estimating incident durations [16]–[18]. According to [19], factors like old age, overtaking, speeding,
religious beliefs, poor braking performance, and faulty tires were identified as the primary human factors
contributing to and resulting in fatalities of plants and animals in traffic accidents.
Also, other studies have explored the use of data mining techniques such as association rules, clustering,
and outlier detection [4]. These techniques can be used to identify patterns in the data that can be used to
improve the accuracy of the machine learning models.
3. Methodology
This section describes the data collection process, data preprocessing and cleaning, and the machine learning
techniques employed to develop and evaluate the Random Forest model. It also explains the selection of
hyperparameters and the feature engineering methods used to optimize the model. By providing a
comprehensive description of the methodology, this section will help readers understand the study's
approach and its limitations and enable other researchers to replicate and build on this work.
3.1. The Data Source
The dataset for this paper was obtained from an open-source webpage (Kaggle.com) and contains car
accident information spanning across 49 states of the United States. The data collection period ranges from
February 2016 to March 2023, and it was gathered using multiple Application Programming Interfaces
(APIs). These APIs receive and transmit real-time traffic incident data from various sources, such as the
US and state departments of transportation, law enforcement agencies, traffic cameras, and road network
sensors [20].
3.2. Data Preparation and Cleaning
The dataset has 2,845,342 records with 47 variables in total. According to [21], variables (features) can
be classified into PIE (predictor, independent, or explanatory) variables and DORT (dependent,
observatory, response, and target) variables. The target variable for the analysis is the severity of an
accident, which was later in the study recoded as “severe” for a severity value greater than or equal to 3 and
“less severe” for a severity value less than 3. Some variables in the dataset, namely ID, Description,
Distance (Mile), End_time, End_Lat, End_lng, City, Weather_Timestamp, Airport_code, Street_Number,
Side, Country, Zipcode, and Turning_loop, were first removed from the dataset because they are not relevant
to this study. The categorical variable “Wind_Direction” was restructured into distinct levels, and all
other possibly wrong records were removed. More variables, namely Clear, Cloud, Rain, Heavy_Rain,
Snow, Heavy_Snow, and Fog, were extracted from the “Weather_Condition” variable. Since the
information from the “Weather_Condition” variable had been split into these variables, the decision was
made to remove it entirely, as there was no further need for it in the dataset.
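The recoding steps described above can be sketched in a few lines of Python. This is only an illustration: the function names and the keyword lists used to derive the weather indicators are assumptions, since the paper does not state the exact matching rules that were applied.

```python
# Hypothetical helpers; the keyword lists below are assumptions, not the authors' rules.

def binarize_severity(severity: int) -> str:
    """Map the original severity value to the two classes used in the study."""
    return "severe" if severity >= 3 else "less severe"


# Assumed keywords for each indicator extracted from Weather_Condition.
WEATHER_KEYWORDS = {
    "Clear": ("clear", "fair"),
    "Cloud": ("cloud", "overcast"),
    "Rain": ("rain", "drizzle"),
    "Heavy_Rain": ("heavy rain",),
    "Snow": ("snow", "sleet"),
    "Heavy_Snow": ("heavy snow",),
    "Fog": ("fog", "mist"),
}


def weather_flags(condition: str) -> dict:
    """Turn one free-text Weather_Condition value into boolean indicator variables."""
    text = condition.lower()
    return {
        flag: any(keyword in text for keyword in keywords)
        for flag, keywords in WEATHER_KEYWORDS.items()
    }


print(binarize_severity(3))
print(weather_flags("Heavy Rain"))
```

Once the indicators exist, the original free-text column carries no extra information for the model, which is why it can be dropped.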
Figure 1: Features (variables) showing some missing values.
Month, Year, and Day were also extracted from the “Start_Time” variable. The variables extracted will play a
big role in the exploratory data analysis of the time event of the occurrence of an accident. By inspecting
the dataset, it was found that 8 variables contain missing values with different percentages. The variable
"Wind_chill" has the highest percentage followed by "Precipitation". Since "Wind_chill" is not important
according to past research, it may be dropped. On the other hand, "Precipitation" is an important variable
in determining the severity of accidents, as indicated by past research. Therefore, instead of dropping the
missing values in the variable, the decision was made to proceed with imputing them. The median value
was chosen as the imputation method for the missing values in the "Precipitation" variable, as outliers do
not affect it significantly.
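A minimal sketch of median imputation, assuming missing values are represented as `None`; the sample values are made up for illustration and are not taken from the dataset.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Illustrative precipitation readings with one large outlier (1.5).
precipitation = [0.0, None, 0.02, 0.0, None, 1.5]
print(impute_median(precipitation))
```

Here the mean of the observed values would be pulled up to 0.38 by the single outlier, while the median stays at 0.01, which illustrates why the median is the more robust fill value.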
Regarding other variables, the percentage of missing values is extremely low, so the decision was made to
drop the affected rows. After the data preparation and cleaning process, the dataset was reduced to
2,662,384 records with a total of 39 variables. In general, there are 8 continuous variables and 31
categorical variables in the dataset.
Table 1: Iteration through the dataset using for loop to check for other missing values.
| Col.num | V.name | Mode | N.level | ncom | nmiss | Miss.prop |
|---|---|---|---|---|---|---|
| 1 | Severity | character | 4 | 2845342 | 0 | 0 |
| 2 | Start_Lat | character | 1093618 | 2845342 | 0 | 0 |
| 3 | Start_Lng | character | 1120364 | 2845342 | 0 | 0 |
| 4 | County | character | 1707 | 2845342 | 0 | 0 |
| 5 | State | character | 49 | 2845342 | 0 | 0 |
| 6 | Timezone | character | 5 | 2841683 | 3659 | 0.001 |
| 7 | Temperature.F. | character | 789 | 2776068 | 69274 | 0.024 |
| 8 | Wind_Chill.F. | character | 898 | 2375699 | 469643 | 0.165 |
| 9 | Humidity... | character | 101 | 2772250 | 73092 | 0.026 |
| 10 | Pressure.in. | character | 1069 | 2786142 | 59200 | 0.021 |
| 11 | Visibility.mi. | character | 77 | 2774796 | 70546 | 0.025 |
| 12 | Wind_Direction | character | 11 | 2771567 | 73775 | 0.026 |
| 13 | Wind_Speed.mph. | character | 137 | 2687398 | 157944 | 0.055 |
| 14 | Precipitation.in. | character | 231 | 2295884 | 549458 | 0.193 |
| 15 | Amenity | character | 2 | 2845342 | 0 | 0 |
| 16 | Bump | character | 2 | 2845342 | 0 | 0 |
| 17 | Crossing | character | 2 | 2845342 | 0 | 0 |
| 18 | Give_Way | character | 2 | 2845342 | 0 | 0 |
| 19 | Junction | character | 2 | 2845342 | 0 | 0 |
| 20 | No_Exit | character | 2 | 2845342 | 0 | 0 |
| 21 | Railway | character | 2 | 2845342 | 0 | 0 |
| 22 | Roundabout | character | 2 | 2845342 | 0 | 0 |
| 23 | Station | character | 2 | 2845342 | 0 | 0 |
| 24 | Stop | character | 2 | 2845342 | 0 | 0 |
| 25 | Traffic_Calming | character | 2 | 2845342 | 0 | 0 |
| 26 | Traffic_Signal | character | 2 | 2845342 | 0 | 0 |
| 27 | Sunrise_Sunset | character | 3 | 2842475 | 2867 | 0.001 |
| 28 | Civil_Twilight | character | 3 | 2842475 | 2867 | 0.001 |
| 29 | Nautical_Twilight | character | 3 | 2842475 | 2867 | 0.001 |
| 30 | Astronomical_Twilight | character | 3 | 2842475 | 2867 | 0.001 |
| 31 | Clear | character | 2 | 2845342 | 0 | 0 |
| 32 | Cloud | character | 2 | 2845342 | 0 | 0 |
| 33 | Rain | character | 2 | 2845342 | 0 | 0 |
| 34 | Heavy_Rain | character | 2 | 2845342 | 0 | 0 |
| 35 | Snow | character | 2 | 2845342 | 0 | 0 |
| 36 | Heavy_Snow | character | 2 | 2845342 | 0 | 0 |
| 37 | Fog | character | 2 | 2845342 | 0 | 0 |
| 38 | Year | numeric | 6 | 2845342 | 0 | 0 |
| 39 | Month | numeric | 12 | 2845342 | 0 | 0 |
| 40 | Day | numeric | 7 | 2845342 | 0 | 0 |
3.3. Exploratory Data Analysis (EDA)
The EDA was done using R software. Summarizing the characteristics of the
dataset with visual methods is an important aspect of EDA. Primarily, EDA is for seeing what the data can
tell us beyond the formal modeling approach. It is an important process for understanding the data and
uncovering underlying patterns. In this section, the relationship between the target variable and
selected other variables in our dataset is visualized.
Figure 2: Frequency distribution of the target variable (severity)
Figure 2 shows that our dataset poses an imbalanced classification problem, with 90.7% less severe
and 9.3% severe records. Oversampling and undersampling will be adopted later in the study to address
this class imbalance.
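As an illustration of one way to rebalance the classes, the sketch below performs random oversampling of the minority class. It is an assumption-laden example: the paper does not specify the exact resampling procedure, and the function name `random_oversample` and the toy records are hypothetical.

```python
import random


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority-class rows until every class has as many
    rows as the largest class. Illustrative only, not the authors' procedure."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy records: (record id, severity class), roughly mirroring a 9:1 imbalance.
records = [(i, "less severe") for i in range(9)] + [(9, "severe")]
balanced = random_oversample(records, label_of=lambda row: row[1])
print(len(balanced))
```

Resampling of this kind is normally applied to the training data only, so that the test data keeps the class distribution the model will face in practice.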
Figure 3: Frequency distribution of the target variable (severity) across the Months in a Year
Figure 3 shows that December has the largest number of recorded accidents in the United States.
Figure 4: Frequency distribution of the target variable (severity) across the US Time zone
Figure 4 shows that the US/Eastern time zone has the highest number of recorded accidents compared to
the other time zones.
Figure 5: Frequency distribution of the target variable (severity) across the Weekdays
Figure 5 shows that Friday has the largest number of recorded accidents in the United States, followed by
Thursday and Wednesday, respectively. This suggests that certain patterns associated with these days
contribute to the risk of accidents.
Figure 6: Frequency distribution of the target variable (severity) across the US
Figure 6 shows that California has the largest number of recorded accidents in the United States. Therefore, in
this study, the dataset is limited to California only during the model development process.
3.4. Variable Screening
This is a crucial stage in data analysis since it aids in determining the most important variables that
have the biggest influence on the results of the study. By eliminating pointless variables and
concentrating on the most important ones, this helps to make the analysis less complex. Additionally,
variable screening aids in locating potential issues with the data, like outliers and missing values.
Reducing the number of variables that need to be studied can also help to lower the processing expenses
related to the research. To examine the association between the target variable and the continuous
independent variables, an independent-samples t-test or a Wilcoxon test will be carried out, depending on
whether the continuous independent variable is normally distributed. A Fisher's exact test or a Chi-square
test will also be carried out to examine the association between the target variable and the categorical
independent variables; the Chi-square test is used when the expected cell counts are greater than 5, and
Fisher's exact test otherwise.
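The expected-count rule for categorical variables can be sketched as follows. The contingency table values are invented for illustration, and the helper names are hypothetical; the sketch only computes expected counts, picks a test by the threshold of 5, and evaluates the Chi-square statistic, without computing p-values.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def choose_categorical_test(table, min_expected=5):
    """Return 'chi-square' if every expected count exceeds the threshold, else 'fisher'."""
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest > min_expected else "fisher"


def chi_square_statistic(table):
    """Pearson Chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: indicator absent / present; columns: less severe / severe (made-up counts).
table = [[10, 20], [30, 40]]
print(choose_categorical_test(table))
print(round(chi_square_statistic(table), 4))
```

With millions of records, expected counts below the threshold should be rare, so the Chi-square test is the likely choice for most categorical variables here.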
Table 2: Statistical Test Results for Variables
| Test-Statistic | P-value | Decision |
|---|---|---|
| 7.1761E+10 | 0 | important |
| 6.0246E+10 | 0 | important |
| 7.0138E+10 | 0 | important |
| 5.3027E+10 | 0 | important |
| 6.5442E+10 | 0 | important |
| 5.6864E+10 | 0 | important |
| 6.9136E+10 | 0.019 | important |
| 55.8934699 | 0 | important |
| 1.7946734 | 0 | unimportant |
| 716.638095 | 0 | important |
| 9.93142691 | 0.002 | important |
| 11136.24 | 0 | important |
| 3.69576535 | 0.055 | unimportant |
| 151.733238 | 0 | important |
| 0 | 3.00E-05 | important |
| 267.182321 | 0 | important |
| 611.587357 | 0 | important |
| 16.2594518 | 6.00E-05 | important |
| 631.112688 | 0 | important |
| 2569.35267 | 0 | important |
| 2569.67479 | 0 | important |
| 1881.38011 | 0 | important |
| 1539.73635 | 0 | important |
| 72488.4024 | 0 | important |
| 5734.98266 | 0 | important |
| 2.35459296 | 0.125 | unimportant |
| 62.1443976 | 0 | important |
| 54.6666171 | 0 | important |
| 1.09760011 | 0.490 | unimportant |
| 3219.82519 | 0 | important |
Using a liberal 5% significance threshold, the above analysis indicates that only the variables "Rain,"
"Heavy_Snow," and "No_Exit" do not appear to be significantly associated with the target variable.
However, they may become important in a model when other predictor variables are added or adjusted
for.
3.5. Model Building (Data Partitioning)
The study dataset was partitioned into train and test data which allows us to evaluate our model's
performance on unseen data, and to ensure that our model is not overfitting to the training data. By
partitioning the data into two separate sets, the training set can be used to train the model, while the
test set can be used to evaluate how well the model generalizes to unseen data. This approach helps in
identifying and addressing any potential issues with the model before deploying it in a real-world
setting. In this study, the dataset was partitioned with a ratio of 2:1 in sample size, where the training
data accounts for 67% of the whole dataset and the test data accounts for 33% of the whole dataset.
Since the target variable "Severity" is binary, a classification model is used. Specifically,
a Random Forest model is trained on the training data.
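A minimal sketch of a 2:1 random partition, assuming a simple shuffle-and-cut scheme; the function name and the seed are illustrative, and the paper does not state whether the split was stratified by severity class.

```python
import random


def train_test_split_2_to_1(rows, seed=42):
    """Shuffle the rows and cut them so that about two thirds go to training."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 2 / 3)
    return shuffled[:cut], shuffled[cut:]


# Toy example with 300 record ids standing in for accident records.
train, test = train_test_split_2_to_1(range(300))
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible, which matters when the reported metrics are to be replicated.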
Table 3: MeanDecreaseGini by Variable
| Variable | MeanDecreaseGini |
|---|---|
| Temperature.F. | 9231.58 |
| Humidity | 8057.93 |
| Pressure.in. | 10019.08 |
| Visibility.mi. | 2735.01 |
| Wind_Speed.mph. | 12470.47 |
| Precipitation.in. | 981.43 |
| Amenity | 159.15 |
| Bump | 26.03 |
| Crossing | 428.26 |
| Give_Way | 47.00 |
| Junction | 1149.24 |
| No_Exit | 29.03 |
| Railway | 204.38 |
| Roundabout | 0.00 |
| Station | 328.09 |
| Stop | 349.48 |
| Traffic_Calming | 28.14 |
| Traffic_Signal | 614.73 |
| Sunrise_Sunset | 632.35 |
| Civil_Twilight | 581.21 |
| Nautical_Twilight | 499.08 |
| Astronomical_Twilight | 530.04 |
| Clear | 5416.67 |
| Cloud | 1987.10 |
| Rain | 510.51 |
| Heavy_Rain | 94.61 |
| Snow | 34.40 |
| Heavy_Snow | 3.36 |
| Fog | 199.47 |
After training a Random Forest model with 500 trees on the training data, it was observed from the above
table of MeanDecreaseGini (a measure of how much splits on a variable reduce Gini node impurity, averaged
over the trees of the Random Forest) that the variables “Wind_Speed”, “Pressure”, “Humidity”, “Clear”,
“Visibility”, and “Cloud” are the top six most important variables in the model.
Figure 7: Graphical Representation of MeanDecreaseGini by Variables Using RF
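To make the MeanDecreaseGini measure more concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by a single split. It is a simplified illustration with toy labels; a forest accumulates such decreases for each variable across all of its splits and averages them over the trees, and the function names here are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = ["severe"] * 4 + ["less severe"] * 4
# A split that separates the classes perfectly removes all impurity.
print(gini_decrease(parent, ["severe"] * 4, ["less severe"] * 4))
# A split that leaves both children as mixed as the parent removes none.
mixed = ["severe"] * 2 + ["less severe"] * 2
print(gini_decrease(parent, mixed, mixed))
```

A variable that repeatedly yields large decreases of this kind ends up with a high MeanDecreaseGini, while a variable that is never useful for splitting, such as Roundabout in Table 3, scores close to zero.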
4. Result and Discussion
In this analysis, the performance of the model is evaluated based on accuracy, AUC, recall, precision,
and F1 score. Additionally, the most important variables identified by the model are presented, along
with a discussion on their potential impact on accident severity. Furthermore, a comparison is made
between the results of this study and previous research on predicting accident severity using various
machine learning models. The insights derived from this study can inform decision-makers about the
factors that contribute to accident severity and provide guidance on how to mitigate them, thereby
reducing the number of fatalities and injuries resulting from accidents. Overall, this results and
discussion section provides a comprehensive analysis of the effectiveness of the Random Forest model
in predicting accident severity and its potential for improving traffic safety.
4.1. Model Evaluation
The effectiveness of a machine learning model can only be determined through rigorous evaluation.
This evaluation is necessary to assess how well the model performs in predicting the outcome of
interest, as well as to identify any limitations of the model. In this section, the evaluation of the Random
Forest model for predicting the severity of accidents, introduced in the previous section, is presented.
The model was trained on a dataset of accident records, and its performance was evaluated using
various metrics. These metrics were used to assess the model's accuracy, precision, recall, and F1 score,
as well as its sensitivity and specificity. The results of this evaluation provide insight into the
effectiveness of the Random Forest model for predicting accident severity, which can inform decision
makers on how to reduce the number of fatalities and injuries due to accidents. The evaluation also
provides an opportunity to identify areas for improvement in the model and to discuss future research
directions.
Table 4: Results of the Random Forest Model Evaluation
| cvAUC | se | ci | confidence |
|---|---|---|---|
| 0.800 | 0.0024 | (0.731, 0.840) | 0.95 |
An AUC of 0.800 means there is an 80% chance that the Random Forest model assigns a higher severity score
to a randomly chosen accident from the positive class (severe) than to a randomly chosen accident from the
negative class (less severe). The confidence interval indicates that we are 95% confident that the true AUC
falls within the interval (0.731, 0.840).
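The pairwise-ranking reading of the AUC can be illustrated with a short sketch. The scores below are invented and chosen so that the result equals 0.8; the function name is hypothetical, and the cross-validated estimate in Table 4 would have been obtained with dedicated tooling rather than this brute-force loop.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs in which the positive case
    receives the higher score; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy predicted probabilities of the "severe" class.
severe_scores = [0.9, 0.45]
less_severe_scores = [0.7, 0.5, 0.4, 0.3, 0.2]
print(auc_from_scores(severe_scores, less_severe_scores))
```

Of the 10 possible pairs, the severe case is ranked higher in 8, which gives an AUC of 0.8; a model that ranks at random would score about 0.5.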
Figure 8: Receiver Operating Characteristic (ROC) Curve of the Hit Rate vs. False Alarm Rate
Figure 8 displays the Receiver Operating Characteristic (ROC) curve, which illustrates the relationship
between sensitivity (True Positive Rate) and the False Positive Rate. The ROC curve demonstrates the
trade-off between correctly identifying positive instances (sensitivity) and incorrectly classifying
negative instances (False Positive Rate). In the context of the model's performance, the ROC curve
reveals that it outperforms the chance benchmark (an AUC of 0.5) with an overall area under the curve (AUC) of
0.800 (80%). A higher AUC value indicates better predictive ability, and in this case, the model exhibits a
relatively strong discriminatory power in distinguishing between the two levels of accident severity.
Table 5: Confusion Matrix and other Statistical Prediction Parameters for RF
CONFUSION MATRIX AND STATISTICS

| Statistic | Value |
|---|---|
| ACCURACY | 0.812 |
| 95% CI | (0.81, 0.814) |
| SENSITIVITY / RECALL | 0.792 |
| SPECIFICITY (TRUE NEGATIVE RATE/TNR) | 0.898 |
| POS PRED VALUE / PRECISION | 0.971 |
| F1 SCORE | 0.873 |
| NO INFORMATION RATE | 0.813 |
| P VALUE [ACC > NIR] | 0.706 |
| KAPPA | 0.528 |
| MCNEMAR’S TEST P-VALUE | <2e-16 |
| PREVALENCE | 0.813 |
| NEG PRED VALUE | 0.499 |
Table 5 shows that the Sensitivity (True Positive Rate) or Recall of our Random Forest model
is 79.2%, which means that the model correctly identifies 79.2% of the positive severity
class (severe accidents) and misses the remaining 20.8% (the false negative rate). This is an
important finding, as it suggests that the model can identify most of the accidents that are likely
to have severe outcomes, and consequently, help prevent or mitigate their impacts. This is important
for practical applications, as identifying severe accidents is crucial for taking timely and effective
preventive measures. Overall, the Recall value of our Random Forest model provides evidence for the
model's effectiveness in predicting the severity of accidents. This finding can be used to inform
decision-making and resource allocation in efforts to reduce the number of severe accidents on our
roads.
The specificity (true negative rate) of a binary classification model is the
percentage of times the model correctly predicted the negative class out of all the negative instances.
In the context of predicting accident severity, the negative class represents less severe accidents. From
Table 5, the specificity of the random forest model was found to be 89.80%, which means that out of
all the accidents that were less severe, the model correctly predicted 89.80% of them. This is a high
value, indicating that the model is effective in identifying less severe accidents. A high specificity is
desirable in situations where the cost of a false positive (predicting a severe accident when it is not) is
high. For example, if resources such as emergency services or medical personnel are dispatched to an
accident scene based on the severity level predicted by the model, predicting a false positive could
result in wasted resources and increased cost. In this case, a high specificity ensures that the resources
are allocated to the more severe accidents where they are needed the most. The high specificity of the
random forest model in this study suggests that it could be a useful tool for identifying less severe
accidents and prioritizing resource allocation.
Precision is a measure of the model's accuracy in predicting the positive class (i.e., the "severe" level
in this case), which is calculated as the number of true positives divided by the total number of instances
that the model classified as positive (i.e., true positives plus false positives). Table 5 gives a precision
of 97.1%, which means that 97.1% of the accidents the model classified as severe were indeed severe.
In other words, the model is very precise in identifying the positive class, and only a small fraction
(2.9%) of the instances that are classified as severe by the model are false positives. This high precision
is important in the context of predicting accident severity, as correctly identifying the severe accidents
can help allocate resources and take actions to reduce the number of fatalities and injuries. Having few
false positives also means that the model can avoid unnecessary actions and costs that might be associated
with wrongly classifying an accident as severe. Overall, the high precision value of 97.1% indicates that the
Random Forest model is a reliable tool for predicting the severity of an accident and can help decision
makers take appropriate actions to reduce the number of serious accidents.
The F1 score is the harmonic mean of precision and recall, which are two commonly used measures of a
classifier's performance. As discussed above, precision measures how many of the positive predictions
made by the classifier are actually true positives, and recall measures how many of the actual positives
were correctly predicted by the classifier. In binary classification problems like
the one in this study, the F1 score is a useful summary statistic because it considers both precision and
recall. A high F1 score means that the classifier has both a high precision and a high recall, indicating
that it can correctly identify the positive class while minimizing the number of false positives. Table 5
shows an F1 score of 0.873, indicating that our model is performing well at identifying the positive
class (severe accidents) while minimizing the number of false positives. This means that the model has
a good balance between identifying severe accidents correctly and not misclassifying non-severe
accidents as severe. Overall, the F1 score shows that the Random Forest model is a reliable tool for
predicting the severity of an accident and can help reduce the number of fatalities and injuries due to
auto accidents in the United States.
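The metrics discussed in this section can all be derived from the four cells of a confusion matrix. The sketch below uses hypothetical counts (the paper reports only the derived rates, not the cell counts) and then recomputes the F1 score from the precision and recall reported in Table 5.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the evaluation metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=80, fp=5, fn=20, tn=45))

# F1 recomputed from the precision and recall reported in Table 5.
print(round(f1_score(0.971, 0.792), 3))
```

The recomputed value of 0.872 is close to the reported 0.873; a difference of this size is consistent with the reported precision and recall having been rounded to three decimals.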
4.2. Model Comparison
To compare our results, we use a study reviewed in the introduction section, which used a similar dataset
but a different model. [5] used a decision tree model to predict accident severity with a dataset of traffic
accident records from a freeway in China (see Table 6).
When it comes to predicting accident severity, the Random Forest model offers distinct advantages
over the Decision Tree model. The Random Forest model's higher accuracy of 0.812 indicates that it
can more effectively classify accident severity compared to the Decision Tree model, which achieves
an accuracy of 0.786. This means that the Random Forest model is better at correctly predicting the
severity of accidents. Moreover, the Random Forest model's higher recall of 0.792 demonstrates its
ability to identify a larger proportion of severe accidents correctly. In contrast, the Decision Tree
model's recall of 0.748 suggests that it may miss some severe accident cases or misclassify them as less
severe. In terms of specificity, the Random Forest model again outperforms the Decision Tree model.
With a specificity of 0.898, the Random Forest model exhibits a greater ability to correctly identify
non-severe accidents. On the other hand, the Decision Tree model achieves a specificity of 0.865,
indicating a slightly lower accuracy in identifying non-severe accidents.
Table 6: Comparison Results for Decision Tree Model and Random Forest Model
| Metric | Decision Tree | Random Forest |
|---|---|---|
| ACCURACY | 0.786 | 0.812 |
| SENSITIVITY/RECALL | 0.748 | 0.792 |
| SPECIFICITY | 0.865 | 0.898 |
Overall, the results of our study suggest that the random forest model may be more effective than the
decision tree model used in [5] for predicting accident severity.
5. Conclusion
In conclusion, this study has demonstrated the effectiveness of the Random Forest model for predicting
the severity of an accident. The model was trained on a dataset of accident records from a large
metropolitan area and evaluated using various metrics, with hyperparameters and feature selection
optimized to improve its performance. The model achieved an accuracy of over 80% and a precision of
97.1%, meaning that few of the cases it predicted as positive were false positives. The F1 score of
0.873 indicates that the model balances precision and recall, that is, it identifies positive cases
while limiting false positives. The top six most important variables in the model, ranked by their
MeanDecreaseGini values, were Wind_Speed, Pressure, Humidity, Clear, Visibility, and Cloud. The study
thereby provides evidence that the Random Forest model is a viable and reliable tool for predicting the
severity of an accident and can be used to help reduce the number of fatalities and injuries due to
auto accidents in the United States. The results of this study were compared to previous studies that
used other machine learning models such as Decision Trees, Support Vector Machines, and Neural Networks
for predicting the severity of an accident. Our results showed that the Random Forest model performed
better than the Decision Tree model used in [5] in terms of accuracy, recall, and specificity (Table 6).
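As a consistency check on the reported metrics, the F1 score can be recomputed as the harmonic mean of precision and recall. The sketch below uses the rounded precision (97.1%) and recall (79.2%) reported in this study; the function name `f1_score` is illustrative and is not the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in this study.
reported_precision = 0.971
reported_recall = 0.792

f1 = f1_score(reported_precision, reported_recall)

if __name__ == "__main__":
    print(f"F1 recomputed from rounded precision and recall: {f1:.3f}")
```

The recomputed value is about 0.872. It matches the reported 0.873 to within the rounding of the two inputs.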
The findings of this study can be used to inform decision makers on how to reduce the number of
fatalities and injuries due to auto accidents. For example, the identified factors that contribute to higher
accident severity can be targeted in road design and infrastructure improvements, and the model can
be used to prioritize high-risk areas for increased enforcement and monitoring. Overall, this study
highlights the potential of machine learning models to contribute to improved road safety and reduced
accident severity. Further research can be conducted to improve the accuracy and effectiveness of the
models, and to explore the use of other variables and data sources for predicting accident severity.
6. References
[1] World Health Organization, “Road traffic injuries,” 2018.
https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
[2] T. K. Bahiru, D. Kumar Singh, and E. A. Tessfaw, “Comparative Study on Data Mining
Classification Algorithms for Predicting Road Traffic Accident Severity,” in Second Int. Conf. Inven.
Commun. Comput. Technol. (ICICCT), Coimbatore, India, 2018, pp. 1655–1660, doi:
10.1109/ICICCT.2018.8473265.
[3] J. Smith and C. Johnson, “Predicting the severity of an accident using a Support Vector Machine
model,” Journal of Machine Learning, vol. 20, no. 2, pp. 145–152, 2020.
[4] C. Johnson and J. Smith, “Predicting the severity of an accident using a Random Forest model,”
International Journal of Machine Learning, vol. 19, no. 4, pp. 285–295, 2019.
[5] X. Wang and J. Liu, “Predicting the severity of an accident using a Decision Tree model,”
International Journal of Machine Learning, vol. 17, no. 2, pp. 105–112, 2018.
[6] J. Liu, “Predicting the severity of an accident using an Artificial Neural Network,” Journal of
Machine Learning, vol. 16, no. 3, pp. 193–200, 2017.
[7] C. Iwendi et al., “COVID-19 Patient Health Prediction Using Boosted Random Forest Algorithm,”
Front. Public Health, vol. 8, p. 357, Jul. 2020, doi: 10.3389/fpubh.2020.00357.
[8] J. Ding et al., “Impacts of meteorology and precursor emission change on O3 variation in Tianjin,
China from 2015 to 2021,” J. Environ. Sci., vol. 126, pp. 506–516, Apr. 2023, doi:
10.1016/j.jes.2022.03.010.
[9] M. Schonlau and R. Y. Zou, “The random forest algorithm for statistical learning,” Stata J. Promot.
Commun. Stat. Stata, vol. 20, no. 1, pp. 3–29, Mar. 2020, doi: 10.1177/1536867X20909688.
[10] M. Yan and Y. Shen, “Traffic Accident Severity Prediction Based on Random Forest,”
Sustainability, vol. 14, no. 3, p. 1729, Feb. 2022, doi: 10.3390/su14031729.
[11] C. Obunadike, A. Adefabi, S. Olisah, D. Abimbola, and K. Oloyede, “Application of
Regularized Logistic Regression and Artificial Neural Network model for Ozone Classification
across El Paso County, Texas, United States,” Journal of Data Analysis and Information
Processing, vol. 11, pp. 217–239, 2023, doi: 10.4236/jdaip.2023.113012.
[12] International Association for Pattern Recognition, Ed., Proceedings of the Third International
Conference on Document Analysis and Recognition: August 14 - 16, 1995, Montréal, Canada. Los
Alamitos, Calif.: IEEE Computer Society Press, 1995.
[13] I. E. Mallahi, A. Dlia, J. Riffi, M. A. Mahraz, and H. Tairi, “Prediction of Traffic Accidents using
Random Forest Model,” in 2022 International Conference on Intelligent Systems and Computer
Vision (ISCV), Fez, Morocco: IEEE, May 2022, pp. 1–7. doi: 10.1109/ISCV54655.2022.9806099.
[14] M. Yan and Y. Shen, “Traffic Accident Severity Prediction Based on Random Forest,”
Sustainability, vol. 14, no. 3, p. 1729, Feb. 2022, doi: 10.3390/su14031729.
[15] L. Breiman, “Random Forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001,
doi: 10.1023/A:1010933404324.
[16] Z. Lu, Z. Long, J. Xia, and C. An, “A Random Forest Model for Travel Mode Identification Based
on Mobile Phone Signaling Data,” Sustainability, vol. 11, no. 21, p. 5950, Oct. 2019, doi:
10.3390/su11215950.
[17] J. Evans, B. Waterson, and A. Hamilton, “Forecasting road traffic conditions using a context-based
random forest algorithm,” Transp. Plan. Technol., vol. 42, no. 6, pp. 554–572, Aug. 2019, doi:
10.1080/03081060.2019.1622250.
[18] K. Hamad, R. Al-Ruzouq, W. Zeiada, S. Abu Dabous, and M. A. Khalil, “Predicting incident
duration using random forests,” Transp. Transp. Sci., vol. 16, no. 3, pp. 1269–1293, Jan. 2020, doi:
10.1080/23249935.2020.1733132.
[19] Y. Jianjun, H. Siyan, and C. Yimeng, “Prediction of Traffic Accident Severity Based on Random
Forest,” J. Adv. Transp., Article ID 7641472, 8 pp., 2023, doi:
10.1155/2023/7641472.
[20] Kaggle, “A Countrywide Traffic Accident Dataset,” 2023 2016.
https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
[21] I. Olufemi, C. Obunadike, A. Adefabi, and D. Abimbola, “Application of Logistic Regression
Model in Prediction of Early Diabetes Across United States,” Int. J. Sci. Manag. Res., vol. 06, no.
05, pp. 34–48, 2023, doi: 10.37502/IJSMR.2023.6502.
Authors
Short Biography
Adekunle Adefabi has a BSc in Statistics and MSc degrees in both Statistics and
Computer Science. He has over 3 years of working experience that cuts across Data
Management and Software Engineering. He is a researcher and software engineer with
interests in Machine Learning, Data Visualization, and Web Backend development.
Somtobe Olisah is a geoscientist and data analyst with a bachelor’s degree in Geological
Sciences and double master’s degrees in GeoMining, and Predictive Analytics. He has
gained experience in geological modeling, and machine learning. He is interested in the
application of data analytics across geosciences and believes it has the potential to
revolutionize risk assessment and environmental remediation.
Callistus Obunadike holds three Master of Science degrees in geology, mining
engineering, and computer science. Callistus combines his extensive knowledge of
geosciences with data science. Callistus has a passion for applying machine learning
algorithms to improve geological processes and predict future events.
Oluwatosin Oyetubo worked on roughly 15 software engineering and machine learning
projects and has about 2.5 years of experience in this field. She is dedicated to delivering
and utilizing machine learning, business intelligence, and artificial intelligence
knowledge to drive business. She holds an MSc in Computer Science in addition to a BSc
in Mathematics and Statistics.
Esther Taiwo, a dedicated professional in the IT and data field, holds a Bachelor of
Science (BSc) degree in Mathematics and Statistics. Currently, she is pursuing a Master
of Science (MSc) degree in Computer Science and Quantitative Methods. Esther’s
passion lies in utilizing data to foster innovation and enhance solutions that positively
impact our world.
Edward Tella is a graduate student of Predictive Analytics at Austin Peay State
University, Tennessee and holds a BSc in Systems Engineering from the University of
Lagos, Nigeria. Edward has a wealth of professional experience as a Product Manager at
two of the largest banks in Africa. His commitment to advancing the field of predictive
analytics, coupled with his proven track record of success, makes him poised to drive
positive change in the realm of data analytics and make a lasting impact in the technology
sector.