Conference Paper
Two-dimensional-reduction Random Forest
Shuquan Ye
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
e-mail: 201536401655@mail.scut.edu.cn
Zhiwen Yu
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
e-mail: zhwyu@scut.edu.cn
Jiaying Lin
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
e-mail: 201530501498@mail.scut.edu.cn
Kaixiang Yang
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
e-mail: xgkaixiang@163.com
Dan Dai
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
e-mail: daidanjune@hotmail.com
Zhi-Hui Zhan
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
Wei-Neng Chen
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
Jun Zhang
School of Computer Science and Engineering
South China University of Technology
Guangzhou 510006, China
Abstract: Random forest (RF) is a competitive machine
learning algorithm, but imbalanced real-world data remains one of
its major challenges. This paper presents a two-dimensional-reduction
RF (2DRRF), which builds on traditional RF with three innovations.
First, to improve the performance of RF on imbalanced data, a
two-dimensional-reduction approach is created. Second, a modified
T-link is proposed that focuses on detecting and reducing safe samples.
Third, a biased sampling manner is employed to build optimal
training datasets. Experiments on 13 imbalanced datasets from the
KEEL dataset repository, with imbalance ratios ranging from 6.38 to
129.44, indicate that 2DRRF steadily holds advantages over two other
relevant implementations of RF in terms of accuracy, recall,
precision and F-value.
Keywords: imbalanced dataset; classification; random forest;
two-dimensional-reduction; biased sampling
I. INTRODUCTION
An imbalanced dataset is one in which a particular class, the
positive class, holds only a small number of samples; this class is
of main interest but is outnumbered by the remaining samples. With
the continuous emergence of information in fields such as medicine,
genomics, finance, education [1] and network intrusion detection [2],
vast amounts of data are generated with high imbalance ratios.
Because standard data mining algorithms are sensitive to imbalance
in the data, extracting vital information from imbalanced datasets
has become a challenging task.
A variety of methods have been proposed to overcome imbalance
problems. Generally, they are classified as data processing methods
and algorithmic methods. The first category applies a bias to the
data distribution by resampling or by adjusting the weights of each
class to balance the dataset (e.g. under-sampling the majority class,
over-sampling the minority class, and weight matrices [3]). The
second category is internal: it adjusts a traditional algorithm to
take imbalance into consideration (e.g. the biased minimax
probability machine [4] and the weighted extreme learning machine
[5]).
The challenge of resampling methods is to distinguish informative
from redundant data, while the challenge of algorithm-level
approaches is selecting an algorithm and improving its accuracy over
the original. This study provides an approach that combines
characteristics of both categories with a bagging random forest model.
Random forests (RF) are a competitive ensemble machine learning
algorithm that combines the results of tree predictors, each trained
on its own bootstrap sample. However, because it relies on bootstrap
sampling, standard RF is not adapted to learning from imbalanced
datasets.
Recently, a parallel random forest algorithm for big data
[6] was proposed to handle high-dimensional and noisy data. It can
be summarized in two steps: bootstrap sampling, followed by a
reduction in the number of features. Its promising results suggest
that there is plenty of room to adapt the random forest model
to real-world data.
To adapt RF to learning from imbalanced datasets, a two-
dimensional-reduction RF (2DRRF) is put forward in this
study. Datasets are usually expressed in two dimensions: the
volume of data and the number of features, which correspond
to the numbers of rows and columns of the data matrix
respectively. For the reduction in the horizontal dimension, we
propose a modified Tomek-link (T-link) that reduces safe samples,
combined with Hart's Condensed Nearest Neighbor Rule (CNN) [7]
under-sampling method. Following this, a biased sampling manner is
applied. For the reduction in the vertical dimension, the Gini index
[8] and the Chi-square test [9] are employed for feature reduction.
With these two indices focusing on informative and correlated
features respectively, the framework can provide a new benchmark
for feature measurement. Based on all of the above, a training set
is built for each tree.
Besides improving learning on imbalanced data, the
two-dimensional reduction (2DR) may also provide some guidance for
dealing with the growth of real-world data.
The rest of the paper is organized as follows. Section 2
reviews background work. Section 3 describes 2DRRF in
detail. Section 4 presents the experimental results, and
Section 5 gives conclusions together with future work.
II. BACKGROUND
A. Related Work
Leo Breiman published the first work on RF in 2001
[10]. Since then, this relatively new learner has shown its
power and robustness and has been applied in a wide variety of
fields. Few experiments on constructing random forest classifiers
on imbalanced data had been reported when T. Khoshgoftaar,
M. Golawala and J. Hulse first discussed the performance of RF on
imbalanced data using Weka in 2007 [11]. In 2011, M. Khalilia,
S. Chakraborty and M. Popescu provided extensive evidence of good
results for RF in disease prediction with the highly imbalanced
Healthcare Cost and Utilization Project (HCUP) dataset [12].
G. Batista et al. studied several methods for handling imbalanced
training data in machine learning in 2004, performing a broad
experimental evaluation [13]. Their work inspired this study;
one-sided selection (OSS) [14], one of the methods in that study,
is the under-sampling method that we improve. Recently, many other
studies have combined or improved earlier methods, such as a hybrid
scheme of over-sampling and under-sampling in 2013 [15], Class Ratio
Preserved RF in 2015 [16], a K-L feature compression method
in 2017 [17], and class weights voting (CWsRF) in 2018 [18].
A parallel RF algorithm for big data on the Apache
Spark platform [6] was presented by J. Chen, K. Li and Z. Tang
in 2017 to deal with high-dimensional data in two steps: the
traditional bootstrap sampling of RF, followed by a
reduction in the number of features. Inspired by their study,
we take a step further by combining the modified T-link and
CNN in sampling, followed by an improved feature-reduction
method.
The ensemble random forest classifier of scikit-learn [19],
one of the most popular machine learning packages in Python,
is a simple and efficient tool for this research. Its
algorithms are perturb-and-combine techniques [20]. Another
strong and robust implementation of ensembles and tree learning
algorithms, treelearn [21], was employed as a second control group.
The experimental datasets are from the online KEEL
(Knowledge Extraction based on Evolutionary Learning)
imbalanced dataset repository [22], a set of benchmarks
that allows a complete analysis of the behavior of
learning methods in comparison with existing ones.
B. Performance measure
In real-world tasks, a learner can take advantage of the
imbalanced distribution of a dataset. For example, on a dataset
with 99% negative samples, a learner that always predicts
negative achieves 99% accuracy. In this case, accuracy is not
an adequate criterion. Several alternative criteria are used in
our experiments.
A confusion matrix, shown in Table 1, records the four possible
outcomes when examples are classified: true positives (TP), false
positives (FP), true negatives (TN) and false negatives (FN). It is
the most straightforward way to evaluate a machine learning model.
From the confusion matrix, accuracy, precision, recall and F-value
are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F\text{-value} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}$$
Here β is usually set to 1. The main goal in classifying
imbalanced datasets is to raise recall without hurting precision;
the F-value represents the trade-off between the two.
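The four criteria can be illustrated with a short Python sketch. It assumes the usual confusion-matrix counts of true positives, false positives, true negatives and false negatives (named `tp`, `fp`, `tn`, `fn` here); the function name `classification_metrics` is illustrative and not part of the paper's implementation. It reproduces the always-negative learner discussed above.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Return (accuracy, precision, recall, f_value) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision and recall are set to 0 when their denominators are empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_value = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_value


# A learner that always predicts "negative" on 99 negatives and 1 positive:
# high accuracy, but zero precision, recall and F-value.
always_negative = classification_metrics(tp=0, fp=0, tn=99, fn=1)
print(always_negative)
```

The zero recall and F-value of the always-negative learner show why accuracy alone is not an adequate criterion on imbalanced data.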
III. TWO DIMENSIONAL REDUCTION RF
This section describes the optimization of the random forest
algorithm to accommodate imbalanced data, for which we propose a
two-dimensional-reduction RF (2DRRF). The major goal is to improve
overall performance and overcome the difficulties that imbalanced
datasets pose for RF, as analyzed below, by applying both vertical
and horizontal reductions to ensure that the training subsets are
sampled reasonably and the features are selected optimally.
The steps for constructing 2DRRF are shown in Figure 1. For
the horizontal dimension reduction, the modified T-link is used
first. Unlike the conventional use of the T-link [23] to remove
boundary samples, our method uses it for reduction in the interior
of the majority class distribution, so as to reduce the proportion
of majority data while emphasizing the original border between the
majority and the minority. Next, CNN is used to find a consistent
subset of the dataset produced by the previous step. In summary, for
the horizontal dimension we remove majority class examples that are
distant from the decision border, since such examples may be
considered redundant or less relevant for learning. After the
horizontal dimension reduction, a biased sampling manner is applied.
For the vertical dimension reduction, the Gini index is first used
to measure impurity, by which feature variables are divided into low
and high importance. After the Chi-square test, the feature
variables are further divided into low and high correlation.
From the resulting four groups, features are selected with bias so
as to pick out informative and correlated features for building the
training sets. More detailed descriptions are given in Subsections
B, C and D.
A. The Random Forest Model
The traditional random forest model is an ensemble of
tree predictors
$$\{h(\mathbf{x}, \Theta_k),\ k = 1, 2, \ldots\}$$
such that each tree is trained on an independent sample drawn
with the same distribution for all trees in the forest, using
random sampling with replacement (bootstrap) along with
random selection of features. During prediction, each sample
in the testing data is classified by all trees, and the forest
returns the final result according to their votes.
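The bootstrap-and-vote scheme can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the RF implementation used in this paper: `fit_stump` is a hypothetical one-threshold stand-in for a decision tree on one-dimensional data, and the dataset (20 negatives, 1 positive) is invented to show how a bootstrap sample can miss a sparse positive class entirely.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Hypothetical stand-in for a tree: threshold at the midpoint of the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # The bootstrap sample contains a single class: predict it everywhere.
        label = 1 if pos else 0
        return lambda x: label
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def majority_vote(predictors, x):
    """Classify x with every predictor and return the most frequent label."""
    votes = Counter(predict(x) for predict in predictors)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Invented imbalanced data: 20 negatives near 0, a single positive at 10.
data = [(0.1 * i, 0) for i in range(20)] + [(10.0, 1)]
forest = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(majority_vote(forest, 0.5))
```

Any stump whose bootstrap sample misses the single positive example degenerates into an always-negative predictor, which is the failure mode that motivates the reductions described in this section.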
However, these fundamental ideas and rules, designed for the
general case, are misled when the imbalance ratio exceeds a
certain threshold. Theoretically, the upper bound for the
generalization error of a forest [10], given by (1), is directly
proportional to the correlation between the individual trees and
decreases as their strength grows. Given sparse positive examples,
the positive regions would be extremely small, with the decision
border fitted closely to the positive samples, causing overfitting
and eventually reducing the strength of the trees. Also, the
variance of the margin function $mr$, which is proportional to the
mean value of the correlation as shown in (2), will be high when the
vast majority of samples in each training set are from the negative
class. The main processes of constructing a random forest are
illustrated in Figure 2.
$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}} \tag{1}$$
where $\bar{\rho}$ is the mean correlation between trees and the strength of the set of classifiers is
$$s = E_{\mathbf{X},Y}\, mr(\mathbf{X}, Y)$$
and the variance of $mr$ is
$$\mathrm{var}(mr) = \bar{\rho}\,\bigl(E_{\Theta}\, sd(\Theta)\bigr)^{2} \tag{2}$$
Table 1: Confusion Matrix
Fig 1: Process for 2DR construction
Fig 2: Process for RF construction
B. Horizontal dimension reduction
For a multi-class problem with $m$ different class
labels $C_1, C_2, \ldots, C_m$, suppose the given training
dataset has the form
$$D = \{(x_i, y_i)\}_{i=1}^{N}$$
where $x_i$ is the feature vector and $y_i$ is the corresponding
label. There are $N$ samples in total. The elements and notation
that appear in the following steps are explained in Table 2.
Modified T-link: Reduction Inside Majority Class
In this step, to improve the performance of the RF
algorithm on imbalanced data, we present a modified T-link
for under-sampling.
The traditional T-link, shown in deep green in Figure 3, is
defined as follows. Given two examples in the dataset, $(x_i, y_i)$
and $(x_j, y_j)$, coming from different classes, that is,
$y_i \ne y_j$, let $d(x_i, x_j)$ denote the distance
between the feature vectors $x_i$ and $x_j$. A T-link is
formed when there is no sample $x_l$ such that
$d(x_i, x_l) < d(x_i, x_j)$ or $d(x_j, x_l) < d(x_i, x_j)$. That is:
$$\nexists\, x_l \in D:\ d(x_i, x_l) < d(x_i, x_j)\ \lor\ d(x_j, x_l) < d(x_i, x_j)$$
When two examples form a T-link, the traditional approach removes
them, because either both are borderline examples or one of them is
noise. In other words, the samples are divided into two categories:
1) Borderline examples, distributed around class
boundaries where the positive and negative classes overlap.
2) Safe samples, located in relatively
homogeneous areas without overlapping class labels.
However, deleting borderline samples is disputed, because the
samples along the boundary carry important information for forming
the decision border. In consideration of this, we adjust the T-link
as shown in bright orange in Figure 3. When a T-link is found
between two samples that both come from the majority class,
$y_i = C_{max}$ and $y_j = C_{max}$, a reduction in the majority
class is carried out. One sample is randomly selected from $x_i$ and
$x_j$, say $x_i$, and moved from $D$ to another set $D_a$. This
counteracts the bias of majority data far away from the borderline
without hurting the original border between the majority and the
minority.
After this step, the raw dataset $D$ is divided into $D_a$ and all
the rest, $D_0$. The modified Tomek-link algorithm is shown in
Algorithm 1.
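The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it assumes Euclidean distance and a brute-force nearest-neighbor search, and the helper names `nearest_neighbor_index` and `modified_tlink` as well as the five example points are hypothetical.

```python
import math
import random


def nearest_neighbor_index(points, i):
    """Index of the point closest to points[i] (brute force, Euclidean)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def modified_tlink(points, majority_ids, rng=None):
    """Split sample indices into (all_link, all_rest) following Algorithm 1."""
    rng = rng or random.Random(0)
    majority = set(majority_ids)
    all_link = set()
    for i in majority_ids:
        j = nearest_neighbor_index(points, i)
        if j not in majority:
            continue  # neighbor is not a majority sample: leave the border intact
        if j < i:
            continue  # skip so that a pair is not processed twice
        all_link.add(rng.choice((i, j)))  # move one of the two samples at random
    all_rest = [k for k in range(len(points)) if k not in all_link]
    return sorted(all_link), all_rest


# Indices 0-3 are majority samples, index 4 is a minority sample.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (3.2, 0.0), (3.05, 0.0)]
all_link, all_rest = modified_tlink(points, [0, 1, 2, 3])
print(all_link, all_rest)
```

In this example only the interior pair (0, 1) is thinned; samples 2 and 3 are kept because their nearest neighbor is the minority sample, so the border region is untouched.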
CNN: Find consistent subset
In this step, Hart's Condensed Nearest Neighbor Rule is
applied to find a consistent subset of the examples in $D_0$ for
under-sampling.
CNN attempts to remove the examples far away from the
decision border and selectively reduces the original
population. A subset of $D_0$ is consistent with $D_0$ when
a 1-nearest-neighbor classifier (1-NN) built on that subset
correctly classifies all the instances in $D_0$.
The algorithm goes as follows. First, select one sample
from the majority class and all samples belonging to the
minority class to form the initial subset. Note that this majority
class is not necessarily the majority class of the raw dataset,
because of the preceding T-link step. Second, apply 1-NN on the
subset to classify the examples in $D_0$. Third, add all
misclassified instances to the subset so as to ensure
correctness. These steps are repeated until no example in $D_0$
is misclassified. The remaining examples are collected into a
separate set. Note that this algorithm does not guarantee the
smallest consistent set.
Table 2: Table of elements
Elements: |D|, D, |y_i = C_1|, C_1, V(C_1), C_max, C_min, dim(x_i), x_i, a_k, a
Fig 3: Tomek-links
The horizontal dimension reduction ends with three
subsets: $D_a$, the consistent subset of $D_0$, and the remainder of $D_0$.
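A minimal Python sketch of the condensed nearest neighbor step is given below, assuming Euclidean distance and samples stored as (feature vector, label) pairs. It adds misclassified samples one at a time, as in Hart's original formulation [7], rather than in batches; the helper names and the example data are hypothetical.

```python
import math


def nearest_label(store, x):
    """Label of the stored example closest to x (1-NN, Euclidean)."""
    return min(store, key=lambda item: math.dist(item[0], x))[1]


def condensed_nn(samples, majority_label):
    """Return (subset, rest), where 1-NN on subset classifies every sample correctly."""
    minority = [s for s in samples if s[1] != majority_label]
    majority = [s for s in samples if s[1] == majority_label]
    subset = minority + majority[:1]  # all minority samples plus one majority sample
    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for s in samples:
            if s not in subset and nearest_label(subset, s[0]) != s[1]:
                subset.append(s)  # misclassified: keep it in the subset
                changed = True
    rest = [s for s in samples if s not in subset]
    return subset, rest


samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((4, 4), 0),
           ((5, 5), 1), ((5, 6), 1)]
subset, rest = condensed_nn(samples, majority_label=0)
print(len(subset), len(rest))
```

Here the majority sample at (4, 4) lies close to the minority class and is therefore retained, while the two majority samples near (0, 0) are far from the border and end up in the remainder.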
  
C. Sampling with bias
In this step, a biased sampling manner is applied to sample
from the dataset that was reduced, or compressed, in the first
step. A biased sample is collected in such a way that some
members of the intended population are less or more likely to
be included than others. Here the aim is to select training
subsets, one for each tree and each with the same number of samples,
from the three subsets produced by the horizontal reduction, so the
sampling probability is set differently for each of them:
 
    
  

This is a simple implementation of dynamic probability setting
for biased sampling. The principle is to apply a square operation
to a quantity in the range [0, 1], because on this range the square
function is monotonically increasing and so is its derivative. As a
result, when one set is too small and another is too large, the
probability of extracting from the corresponding set is
automatically increased by an appropriate amount, which ensures the
diversity of the training subsets.
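The general mechanism of pool-level biased sampling can be sketched as follows. This is only an illustration of the idea: the pool probabilities are placeholder values supplied by the caller and do not reproduce the probability formula of this section, and the function name `biased_sample` and the pool names are hypothetical.

```python
import random


def biased_sample(pools, pool_probs, n, rng=None):
    """Draw n samples: pick a pool by its probability, then a sample uniformly from it."""
    rng = rng or random.Random(0)
    names = list(pools)
    weights = [pool_probs[name] for name in names]
    chosen = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(rng.choice(pools[name]))
    return chosen


# Placeholder pools and probabilities, for illustration only.
pools = {"consistent": ["c1", "c2", "c3"], "remainder": ["r1", "r2"]}
training_subset = biased_sample(pools, {"consistent": 0.8, "remainder": 0.2}, n=10)
print(training_subset)
```

Calling the function once per tree with the same pools yields training subsets of equal size whose composition is steered by the pool probabilities.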
D. Second dimension reduction
In this part, we further reduce, or compress, each
training subset in the dimension of the feature space.
Both the Gini index and the Chi-square test are employed to
evaluate features comprehensively. The former takes the perspective
of tree growth and information gain, which emphasizes
learning depth, while the latter takes the perspective of the
statistical relationship between features and the final result,
which emphasizes correlation.
Gini index: Evaluating importance
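For a training subset $D$ among the subsets drawn in the previous
step, there are $\dim(x_i)$ features in total. If a feature $a$ is
discrete and divides $D$ into two subsets $D^{1}$ and $D^{2}$,
the Gini index can be computed as follows:
$$Gini(D, a) = \frac{|D^{1}|}{|D|}\,Gini(D^{1}) + \frac{|D^{2}|}{|D|}\,Gini(D^{2})$$
If the sample is divided into $n$ subsets where $n > 2$,
$$Gini(D, a) = \sum_{v=1}^{n} \frac{|D^{v}|}{|D|}\,Gini(D^{v})$$
where, for a dataset $D$ with class labels $C_1, \ldots, C_m$,
$$Gini(D) = \sum_{k=1}^{m} p_k (1 - p_k) = 1 - \sum_{k=1}^{m} p_k^{2}, \qquad p_k = \frac{|y_i = C_k|}{|D|}$$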
Here the Gini index is used for feature importance evaluation.
It was chosen because it is much easier to compute than the
information gain ratio (entropy), since it does not use the
logarithm operation.
Algorithm 1 Modified T-link
Input: x = {x_i}_{i=1}^{N}: the set of feature vectors.
       id: the indexes of the samples in x from the majority class.
Output: D_a = all_link, D_0 = all_rest
Initialize all_link = ∅; all_rest = ∅; x_nearest_neighbor = ∅
x_nearest_neighbor = NearestNeighbor(x)    (stores the index of each sample's nearest neighbor)
for i in id do
    if x_nearest_neighbor(i) ∉ id then    (checks whether the neighbor of sample i belongs to the majority class)
        continue
    end if
    if x_nearest_neighbor(i) < i then    (ensures that each pair in all_link is unique)
        continue
    end if
    randomly choose one of x_nearest_neighbor(i) and i, and put it into all_link
end for
all_rest = x − all_link
Table 2: Four groups
High correlation and high importance
High correlation and low importance
Low correlation and high importance
Low correlation and low importance
After sorting the features by Gini index and dividing them at a
certain threshold, an important feature set and an unimportant
feature set are generated. In this application, the top-ranked
features are moved to the important set and the remaining
lower-ranked features form the unimportant set.
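The Gini-based ranking can be illustrated with a short Python sketch for discrete features. It assumes the standard CART definitions, Gini(D) = 1 - sum of p_k^2 and a size-weighted average over the subsets induced by a feature, and it treats a lower Gini index as more important; the function names and the toy columns are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(feature_values, labels):
    """Size-weighted Gini impurity after splitting on a discrete feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())


def rank_features(columns, labels):
    """Sort feature names by Gini index, lowest (purest split) first."""
    return sorted(columns, key=lambda name: gini_index(columns[name], labels))


labels = [1, 1, 0, 0]
columns = {"a": ["x", "x", "y", "y"],   # separates the classes perfectly
           "b": ["x", "y", "x", "y"]}   # carries no class information
print(rank_features(columns, labels))
```

Feature "a" yields pure subsets (index 0) and is ranked first, while "b" leaves the impurity unchanged at 0.5; no logarithm is needed at any point.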
Chi-square: Evaluating correlation
Pearson's Chi-square test is used on nominal variables to
determine whether there is a significant correlation between
two of them, in this situation a feature $a$ and the class label
$y$. The steps of the Chi-square test are as follows:
1) Null hypothesis and alternative hypothesis: the null
hypothesis assumes that there is no association between $a$
and $y$, while the alternative hypothesis assumes that there is an
association.
2) Transform the data into a table: each row represents a
value of $a$ and each column represents a value of $y$. Let $r$ be
the number of rows and $c$ the number of columns, and let $O_{ij}$
be the observed count in row $i$ and column $j$.
3) Expected value for row $i$ and column $j$:
$$E_{ij} = \frac{\left(\sum_{k=1}^{c} O_{ik}\right)\left(\sum_{k=1}^{r} O_{kj}\right)}{N}, \qquad N = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij}$$
4) Chi-square test of independence:
$$\chi^{2} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}$$
5) Degrees of freedom:
$$df = (r - 1)(c - 1)$$
6) Hypothesis testing:
The critical value for the chi-square statistic is determined
by the level of significance (typically 0.05) and the degrees of
freedom. If the observed chi-square statistic is greater than
the critical value, the null hypothesis can be rejected and there
is a significant correlation between $a$ and $y$.
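The test can be traced with a small Python sketch, assuming the standard Pearson statistic (sum of squared differences between observed and expected counts, divided by the expected counts) and (r - 1)(c - 1) degrees of freedom. The function name `chi_square`, the contingency table and the short critical-value lookup are illustrative; every row and column total is assumed to be non-zero.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Critical values of the chi-square distribution at significance level 0.05.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815}

# Rows: two values of a feature; columns: two class labels (invented counts).
statistic, dof = chi_square([[30, 10], [10, 30]])
significant = statistic > CRITICAL_005[dof]
print(statistic, dof, significant)
```

For this table every expected count is 20, so the statistic is 4 x (10^2 / 20) = 20 with one degree of freedom, which exceeds 3.841 and rejects the null hypothesis of no association.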
After sorting with the Gini index and the Chi-square test,
the feature variables are finally grouped as shown in Table
2. The next step is to select features with bias, in
the same way as the biased sampling. For example:
   
 
  

  
This procedure ends with training subsets that include the
features selected with bias.
All operations on the data are summarized in Figure 4.
IV. EXPERIMENTAL EVALUATION
This section presents experiments that evaluate the
proposed RF improvement, which is implemented on
the basis of the scikit-learn ensemble random forest classifier,
while treelearn, another strong and robust implementation of
ensembles and tree learning algorithms, is employed as
a second control group. The accuracy, recall, precision and
F-value of our 2DRRF implementation are evaluated by comparison
with two RF implementations: the scikit-learn RF classifier and the
treelearn classifier.
Fig 4: Data operation process
The original data is split into random training and testing sets,
with a proportion of 0.25 of the dataset included in the test
split. Each performance result is obtained by averaging the
measurements over 30 to 100 loops, depending on the time consumed.
In these experiments, several parameters and attributes turn out
to be key to classification performance. Using
GridSearch, the optimal number of trees can be
selected automatically. Other factors in 2DRRF are also
critical for learning: the ratio of the sample size for each tree,
the ratio of the sample bias, and three ratios of feature bias. For
rigor, in order to carry out a controlled experiment, some
parameters of the scikit-learn RF are set manually instead of using
the default values. In total, 13 real-world datasets from the KEEL
dataset repository [22] are chosen, with the number of samples
ranging from 168 to 4174, the number of features from 6 to 41, and
imbalance ratios from 6.38 to 129.44, as presented in Table 3.
The comparison experiments are performed on two machines, each
with an Intel 3.20 GHz x64-based processor and 8.00 GB of RAM.
As described in Table 4, the performance of scikit-learn RF,
treelearn RF and 2DRRF is compared in terms of
accuracy, recall, precision and F-value.
In general, our implementation of 2DRRF outperforms the
other two, especially in terms of the F-value, and gives very
good results in practice. The 2DR process increases the strength
of the trees, thus reducing the generalization error.
Analyzing the results on each dataset, three conclusions can
be drawn:
1) On datasets 9, 10 and 13, whose imbalance ratios are over 50
and which are therefore highly imbalanced, 2DRRF performs steadily
and holds the highest recall, precision and F-value. These three
criteria are relatively more convincing for highly imbalanced
datasets, since they are designed to measure how the model
classifies positive samples, which makes them more suitable
indicators of performance on imbalanced data. Moreover, the recall
remains at 1.0.
2) On datasets 4-7, which all hold the same data volume, 2DRRF
still maintains its advantages. However, on these four datasets,
especially yeast3, our approach shows less stable recall. Also,
with a fixed data amount and fixed features but an increasing
imbalance ratio, our approach shows no clear rise or fall in
accuracy and precision.
3) On small or medium-sized datasets with neither a high
imbalance ratio nor a large number of features, our
algorithm performs slightly worse than in the other cases.
Looking at these three cases, the explanation may lie in
the design and structure of the algorithm itself. For
the first case, with great performance on large and highly
imbalanced data, and the third case, with poor performance on
data with few features and a small size, the reduction
procedure may be the cause. Since 2DR reduces the size of the
original data and the number of features used for tree
construction, it gives RF a better ability to learn from large
datasets. In contrast, its ability to learn from small datasets
with few features is weakened. As for the second case, the results
show that 2DR is relatively insensitive to changes in the imbalance
ratio. In other words, it has some stability and robustness.
V. CONCLUSION AND LIMITATIONS
This paper presents an improvement of random forest
on imbalanced data with a strategy that employs reduction
in both the data volume and the feature space, and
substantiates the improvement both theoretically and
experimentally. A taxonomy of 2DR is proposed:
under-sampling the input training set with a combination of
the modified T-link and CNN, deriving training subsets for each
tree with a biased sampling manner, and selecting features with a
mixed feature measurement based on the Gini index and the
Chi-square test.
After the experimental study comparing the improved RF with
traditional RF, the main conclusions are as follows:
1) Benefitting from 2DR, our algorithm steadily
outperforms the other two in the comparison and is
more robust to changes in the imbalance ratio in terms
of accuracy, recall and precision, indicating notable
superiority and strength over the others.
2) Taking advantage of the reduction method, it has
gained a better ability to deal with large data with
many features and a high imbalance ratio.
3) The difficulty in dealing with small datasets with few
features indicates that pure dimensionality
reduction may not be the optimal solution for all cases
of the imbalance problem.
For future work, we will focus on improving
2DRRF in three areas:
1) To further improve performance on small datasets,
we may employ an algorithm combining 2DR
with an over-sampling method so that it adjusts itself to
the sample size.
2) In comparison with the original RF algorithm, the
2DR process ensures that the training subsets and
selected features are optimal, but we may lose some
diversity among the trees, which may increase the
generalization error or cause overfitting, a problem not
mentioned for the original algorithm. To solve this
problem, we can simplify the individual decision trees
by pruning.
3) In our implementation, training time and efficiency
are not taken into account, resulting in significantly
increased running time. Therefore, we plan to
simplify the algorithm and optimize the order of its steps on the
basis of a complexity analysis.
ACKNOWLEDGMENT
The work described in this paper was partially funded by
grants from the NSFC (No. 61722205, No. 61751205, No. 61572199,
No. 61572540, and No. U1611461), grants from
the Guangdong Natural Science Funds (No. S2013050014677
and No. 2017A030312008), grants from the Science and
Technology Planning Project of Guangdong Province, China
(No. 2015A050502011, No. 2016B090918042, No. 2016A050503015,
No. 2016B010127003), and a grant from the Guangzhou
Science and Technology Planning Project (No. 201704030051).
REFERENCES
[1] Chau V, Phung N, Proceedings - 2013 RIVF International Conference
on Computing and Communication Technologies: Research,
Innovation, and Vision for Future, RIVF 2013 (2013) pp. 135-140.
[2] J. Zhang and M. Zulkernine. Network intrusion detection using random
forests. Proc. of the Third Annual Conference on Privacy, Security and
Trust (PST), pages 53-61, October 2005.
[3] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P.
(2002). SMOTE: Synthetic minority over-sampling technique. Journal
of Artificial Intelligence Research, 16:321-357.
[4] Huang, K., Yang, H., King, I., and Lyu, M. R. (2006). Imbalanced
learning with a biased minimax probability machine. IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
36(4):913-923.
[5] Zong, W., Huang, G.B., and Chen, Y. (2013). Weighted extreme
learning machine for imbalance learning. Neurocomputing 101,
229-242.
[6] Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C., and Li, K. (2017).
A Parallel Random Forest Algorithm for Big Data in a Spark Cloud
Computing Environment. IEEE Transactions on Parallel and
Distributed Systems 28, 919-933.
[7] Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE
Transactions on Information Theory, 14, 515-516.
[8] Breiman, L., Friedman, J., Olshen, R., Stone., C.: Classification and
Regression Trees, Wadsworth, Belmont, MA (1984).
[9] Michalski, Stepp, & Diday, 1981; Diday, 1974 .
[10] Breiman, L.: Random forests. Mach. Learn. 45(1), 5-32 (2001).
[11] Khoshgoftaar T, Golawala M, Hulse J. 19th IEEE International
Conference on Tools with Artificial Intelligence(ICTAI 2007), vol. 2
(2007) pp. 310-317.
[12] Khalilia M, Chakraborty S, Popescu M. BMC Medical Informatics and
Decision Making, vol. 11, issue 1 (2011).
[13] G.Batista, R.Prati, M.Monard. ACM SIGKDD Explorations
Newsletter, vol. 6, 2004.
[14] Kubat, M., and Matwin, S. Addressing the Curse of Imbalanced
Training Sets: One-sided Selection. In ICML (1997), pp. 179-186.
[15] Chau, V.T.N., and Phung, N.H. (2013). Imbalanced educational data
classification: An effective approach with resampling and random
forest. In Proceedings - 2013 RIVF International Conference on
Computing and Communication Technologies: Research, Innovation,
and Vision for Future, RIVF 2013, pp. 135-140.
[16] Khoshgoftaar T, Fazelpour A, DIttman D, Napolitano A, Proceedings
- 2015 IEEE 16th International Conference on Information Reuse and
Integration, IRI 2015 (2015) pp. 342-348 Published by Institute of
Electrical and Electronics Engineers Inc.
[17] Zhu M, Su B, Ning G, 2017 International Conference on Smart Grid
and Electrical Automation (ICSGEA) (2017) pp. 273-277 Published by
IEEE.
[18] Zhu M, Xia J, Jin X, Yan M, Cai G, Yan J, Ning G, IEEE Access (2018)
pp. 1-1.
[19] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12,
pp. 2825-2830, 2011.
[20] L.Breiman, “Arcing Classifiers”, Annals of Statistics 1998.
[21] Ensembles and Tree Learning Algorithms for Python, 1iskandr
https://github.com/capitalk/treelearn, 2013.
[22] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L.
Sánchez, F. Herrera. KEEL Data-Mining Software Tool: Data Set
Repository, Integration of Algorithms and Experimental Analysis
Framework. Journal of Multiple-Valued Logic and Soft Computing
17:2-3 (2011) 255-287.
[23] Tomek, I. Two Modifications of CNN. IEEE Transactions on Systems,
Man, and Cybernetics SMC-6 (1976), 769-772.
... The random forest classifier is basically used for dimensionality reduction or feature selection (Widmann and Silipo, 2015;Ye et al., 2018). The feature selection can be easily done through variable importance ranking. ...
Article
This study investigates acceleration behavior and crossing decision of the drivers under increasing time pressure driving conditions. A typical urban route was designed in a fixed-base driving simulator consisting of four signalized intersections with varying time to stop line (4 s and 6 s) and maneuver type (right-turn and go-through). 97 participants’ data were obtained under No Time Pressure (NTP), Low Time Pressure (LTP), and High Time Pressure (HTP) driving conditions. The acceleration behavior was examined at the onset of yellow signal in four ways: continuous deceleration, acceleration-deceleration, deceleration-acceleration, and continuous acceleration. A random forest model was used to build an acceleration behavior prediction model for identifying the significant explanatory variables based on variable importance ranking. Further, a Mixed Effects Multinomial Logit (MEML) model was developed using the explanatory variables obtained from a random forest model. Additionally, a generalized linear mixed model was incorporated for estimating the likelihood of crossing an intersection by considering all the explanatory variables. A MEML model result revealed that the odds of adopting acceleration-deceleration, deceleration-acceleration, and continuous acceleration instead of continuous deceleration increased by 63 %, 123 %, and 77 %, respectively under HTP driving conditions. Moreover, the likelihood of crossing a signalized intersection increased by 2.73 times and 4.26 times when the drivers were under LTP and HTP driving conditions, respectively as compared to NTP driving condition. Apart from this, time to stop line (reference: 6 s) and age showed negative association with crossing probability. Overall, the findings from this study revealed that drivers altered their acceleration behavior for executing risky driving decisions under increasing time pressure driving conditions.
Article
With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of classification accuracy, performance, and scalability.
Conference Paper
Educational data mining is emerging in the data mining research arena. Although it is an applied field of data mining techniques and methods, educational data mining is full of challenges that have not been completely resolved. In particular, data classification in an academic credit system is a very tough task: it must deal with imbalance and missing data on the technical side, and with the flexibility of the education system, which leads to heterogeneous data, on the practical side. In this paper, we present our approach with a hybrid resampling scheme and random forest for the imbalanced educational data classification task with multiple classes based on students' performance. Such an approach has not previously been available in educational data mining. Besides, it has been shown extensively in our empirical study to be effective for predicting students' final study status and usable in a knowledge-driven educational decision support system.
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Article
Classification of class-imbalanced data has drawn significant interest in medical applications. Most existing methods are prone to assigning samples to the majority class, resulting in bias and, in particular, insufficient identification of the minority class. A novel approach, class weights random forest, is introduced to address the problem by assigning an individual weight to each class instead of a single weight. Validation tests on UCI datasets demonstrate that, for imbalanced medical data, the proposed method enhances the overall performance of the classifier while producing high accuracy in identifying both the majority and minority classes.
Chapter
Clustering is described as a multistep process in which some of the steps are performed by a data analyst and some by a computer program. At present, those performed by a computer program do not produce any description of the generated clusters. The recently introduced method of conjunctive conceptual clustering overcomes this problem by requiring that each cluster has a conjunctive description built from relations on object attributes and closely “fitting” the cluster. The paper explains the above clustering method in terms of dynamic clustering and shows by an example its advantages over methods of numerical taxonomy from the viewpoint of cluster interpretation.
Article
This work is related to the KEEL (Knowledge Extraction based on Evolutionary Learning) tool, an open source software that supports data management and a designer of experiments. KEEL pays special attention to the implementation of evolutionary learning and soft computing based techniques for Data Mining problems including regression, classification, clustering, pattern mining and so on. The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository which includes the data set partitions in the KEEL format and shows some results of algorithms on these data sets; some guidelines for including new algorithms in KEEL, helping researchers to make their methods easily accessible to other authors and to compare the results of many approaches already included within the KEEL software; and a module of statistical procedures developed in order to provide the researcher with a suitable tool to contrast the results obtained in any experimental study. A case study is given to illustrate a complete application within this experimental analysis framework.
The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a) retention of unnecessary samples and b) occasional retention of internal rather than boundary samples. Two modifications of CNN are presented which remove these disadvantages by considering only points close to the boundary. Performance is illustrated by an example.
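To make the behavior described in the last abstract concrete, the following is a minimal sketch of a Hart-style condensed nearest-neighbor pass, not the implementation from any of the works listed here. The function names `nearest_label` and `condensed_nn`, the `seed` parameter, and the toy data are all assumptions chosen for illustration. Because samples are visited in random order, which samples end up retained depends on that order, which is the weakness the abstract points out.

```python
import random
from math import dist


def nearest_label(point, store):
    """Return the label of the stored sample closest to `point` (1-NN rule)."""
    return min(store, key=lambda s: dist(point, s[0]))[1]


def condensed_nn(samples, seed=0):
    """Hypothetical Hart-style condensing.

    `samples` is a list of (point, label) pairs, where point is a tuple
    of floats. Returns a subset that still classifies every sample
    correctly with 1-NN, assuming no identical points carry different labels.
    """
    rng = random.Random(seed)
    order = samples[:]
    rng.shuffle(order)          # random visiting order, as in plain CNN
    store = [order[0]]          # start from a single arbitrary sample
    changed = True
    while changed:              # repeat until a full pass adds nothing
        changed = False
        for point, label in order:
            if (point, label) in store:
                continue
            if nearest_label(point, store) != label:
                store.append((point, label))  # keep misclassified samples
                changed = True
    return store


# Toy data (assumed): two well-separated classes of four points each.
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 0),
        ((10.0, 10.0), 1), ((10.0, 11.0), 1), ((11.0, 10.0), 1), ((11.0, 11.0), 1)]
kept = condensed_nn(data)
print(len(kept), "of", len(data), "samples retained")
```

On such well-separated data one sample per class suffices, but nothing forces the retained samples to lie near the class boundary; restricting the candidates to boundary points is the kind of modification the abstract refers to.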