Article

A novel random forest approach for imbalance problem in crime linkage


Abstract

Crime linkage is a challenging task in crime analysis whose aim is to find serial crimes committed by the same offenders. It can be regarded as a binary classification task that detects serial case pairs. However, most case pairs in the real world are nonserial, so crime linkage suffers from a serious class imbalance. In this paper, we propose a novel random forest based on information granules. The approach does not resample the minority or the majority class but concentrates on indistinguishable case pairs at the classification boundary. Information granules are used to identify case pairs that are difficult to distinguish in the dataset and to construct a nearly balanced dataset in the uncertainty region, which addresses the imbalance problem. In the proposed approach, random trees are grown from both the original dataset and this nearly balanced dataset. A real-world robbery dataset and several public imbalanced datasets are employed to measure the performance of the approach. The results show that the proposed approach is effective in dealing with class imbalance and can be combined with other methods for handling class imbalance.
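The following is a minimal sketch of the abstract's core idea, not the authors' implementation: grow part of the forest on the full data and part on a nearly balanced subset of "hard" boundary samples found via a k-NN neighbourhood check. Names such as `boundary_mask` and the 100/100 tree split are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# A sample is treated as "indistinguishable" here if its k nearest neighbours mix both classes.
k = 7
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
neigh_labels = y[idx[:, 1:]]                      # drop the point itself
frac = neigh_labels.mean(axis=1)                  # fraction of minority neighbours
boundary_mask = ((frac > 0.1) & (frac < 0.9)) | (y == 1)   # boundary region plus all minority samples

Xb, yb = X[boundary_mask], y[boundary_mask]       # nearly balanced uncertainty region

rf_full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_bound = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xb, yb)

def predict(Xq):
    # Combine the two tree pools by averaging their class-probability votes.
    p = (rf_full.predict_proba(Xq)[:, 1] + rf_bound.predict_proba(Xq)[:, 1]) / 2
    return (p > 0.5).astype(int)

print("positives predicted:", predict(X).sum())
```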


... The paper [18] proposes a supervised, information-granule-based IGRF method to correlate crimes and identify serial crime incidents. The method relies on the experience of single features, each used in a single multi-category classification, for the process pattern, behavioral pattern, and structural pattern of a crime incident. ...
... Exploring the features of the incident profile is a prerequisite for studying incident profiles and incident depiction. Based on the features of incident profiles covered in papers [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][20][21][22][23], combined with the features of incidents, the incident profiles are summarized as time pattern, space pattern, suspect profile, process pattern, and result, as shown in Figure 2. Based on papers [1][2][3][4][5][6][13][17][18][19][21][22][24][25], a taxonomy of incident depiction methods is summarized as depiction methods of space and time, depiction methods of suspect, depiction methods of process, and depiction methods of motive and result. Among these, the spatial and temporal depiction methods include regression-based methods, methods based on decision tree classification, and methods based on neural network classification. ...
... At the algorithmic level, one tends to adjust the network structure to improve the classification accuracy on unbalanced data. Ensemble learning is an effective approach, e.g., random forest [23] and XGBoost [47], which integrate several weak classifiers and can avoid the drawbacks of a single classifier. In [12], a dynamic ensemble learning algorithm based on K-means achieves diverse base classifiers, and a distance-based dynamic ensemble creates a personalized combinational result for each test sample. ...
... Another trend is to optimize the costs of prediction errors or other potential costs with developed conceptualizations or techniques to focus the learning on minority samples. A cost-sensitive model [50] punishes parameters severely if the model misclassifies a minority sample, which can guide the model to pay more attention to the minority class. CS-SVM optimizes the SVM by extending the standard loss function with a constructive procedure [16]. ...
... Three classifiers, namely support vector machine (SVM) [20], multilayer perceptron (MLP) [39] and random forest (RF) [23], are used to assess the efficiency of resampling methods. From a large number of experimental results, the coefficient k in K-means is set to 4, 5, 5, 2, 3, 10, 11, 20, 2, 8, 20, 8, 20 and 5 for each dataset, respectively. ...
Article
Data-imbalanced problems are present in many applications. A big gap in the number of samples in different classes induces classifiers to skew to the majority class and thus diminishes the performance of learning and the quality of the obtained results. Most data-level imbalanced learning approaches generate new samples only from the information associated with the minority samples, through linear generation or data-distribution fitting. Different from these algorithms, we propose a novel oversampling method based on generative adversarial networks (GANs), named OS-GAN. In this method, the GAN is assigned to learn the distribution characteristics of the minority class from some selected majority samples rather than from random noise. As a result, samples released by the trained generator carry information of both the majority and minority classes. Furthermore, the central regularization keeps the distribution of the synthetic samples from being restricted to the domain of the minority class, which can improve the generalization of learning models or algorithms. Experimental results reported on 14 datasets and one high-dimensional dataset show that OS-GAN outperforms 14 commonly used resampling techniques in terms of G-mean, accuracy and F1-score.
... Then, we apply two machine learning methods, i.e., the Logistic Regression (LR) model and the Random Forest (RF) model as the classifiers for training and prediction. The effect of the models is compared using relevant evaluation indexes such as F1-score, G-means, MCC, and AUC [45][46][47]. ...
... The F1-measure is the harmonic mean of the recall rate R and the precision rate P, and it can evaluate the overall classification of unbalanced data sets [45][46][47]. The larger the value of F1, the better the classification effect of the classifier. ...
... Otherwise, the value of G-means will be low. The G-means is expressed as follows [45][46][47]: ...
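As a quick reference for the metrics mentioned in the snippets above, here is a small sketch computing F1 (harmonic mean of precision and recall) and the G-mean (geometric mean of the per-class recalls) from a confusion matrix; the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)        # sensitivity (recall on the minority class)
specificity = tn / (tn + fp)        # recall on the majority class

f1     = 2 * precision * recall / (precision + recall)
g_mean = np.sqrt(recall * specificity)
print(f"F1 = {f1:.3f}, G-mean = {g_mean:.3f}")
```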
Article
Full-text available
As online P2P loans in automotive financing grow, there is a need to manage and control the credit risk of personal auto loans. In this paper, personal auto loan datasets from the Kaggle platform are used in a machine-learning-based credit risk assessment mechanism for personal auto loans. An integrated Smote-Tomek Link algorithm is proposed to convert the data set into a balanced data set. Then, an improved Filter-Wrapper feature selection method is presented to select credit risk assessment indexes for the loans. Combining Particle Swarm Optimization (PSO) with the eXtreme Gradient Boosting (XGBoost) model, a PSO-XGBoost model is formed to assess the credit risk of the loans. The PSO-XGBoost model is compared against the XGBoost, Random Forest, and Logistic Regression models on the standard performance evaluation indexes of accuracy, precision, ROC curve, and AUC value. The PSO-XGBoost model is found to be superior in classification performance and classification effect.
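A rough sketch of the resampling-plus-boosting pipeline described above, assuming the imbalanced-learn and xgboost packages are installed; the PSO hyperparameter search is replaced by fixed illustrative parameters, and synthetic data stand in for the Kaggle loans.

```python
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance the training set with combined SMOTE oversampling + Tomek-link cleaning.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="auc")   # illustrative settings, not PSO-tuned
clf.fit(X_bal, y_bal)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```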
... The proposal of LJP can greatly reduce the workload of relevant employees and reduce the occurrence of unfairness, falsehood, and misunderstanding. At the same time, LJP can help increase the authority of judicial organs and increase people's trust in the judicial system. ...
... In recent years, with the rise of Artificial Intelligence technology, more and more researchers have applied Artificial Intelligence to all fields. At the same time, with the opening of a large number of high-quality legal judgment texts, more and more researchers have applied new technologies to LJP tasks, such as [1,2]. At present, LJP mainly includes the following three tasks [3]: (1) applicable law article prediction, (2) charge prediction, and (3) term of penalty prediction. ...
Article
Full-text available
In recent years, the field of intelligent justice has attracted a lot of academic attention. Among its topics, legal judgment prediction (LJP) is the most important research direction. LJP predicts the results according to the facts of criminal cases and is becoming a research hotspot in the legal realm. LJP mainly includes the following tasks: (1) applicable law article prediction, (2) charge prediction, and (3) term of penalty prediction. Most of the existing research methods are based on neural networks; due to the characteristics of neural networks, the interpretability of their results is poor. In this paper, formal concept analysis (FCA) is introduced into the LJP task and a method called FCA-LJP is proposed. The original FCA method is improved based on the characteristics of the LJP task in order to make FCA-LJP more suitable for it. The generalization and specialization of formal concepts used in FCA can find the common ground between different cases with the same charge. Finally, we conduct experiments on the public dataset of the Legal AI Challenge. The experimental results show that the FCA-LJP method achieves better prediction results.
... Multicriteria decision-making [27,33] has also been used: values of criminal behaviors are described by linguistic variables, and the similarity between crimes is calculated according to their linguistic variables. Machine learning classification algorithms are more popular in crime linkage and have achieved excellent results, including neural networks [34], logistic regression [35][36][37][38], decision trees [39], Bayesian classification [40], and random forest [5,41], etc. However, these studies rely on the output of the algorithms to classify all samples as serial or nonserial crimes, and each sample is assigned a definite class, which may cause decision-making errors. ...
... Crime linkage can be treated as a binary classification task with two classes denoting "serial crimes" and "nonserial crimes" [4]. Many classification algorithms have been applied and achieved excellent performance [5]. However, the disadvantage of traditional classification algorithms is that they only make a decision of accepting or rejecting a sample, that is, two-way decisions [6]. ...
Article
Crime linkage is a difficult task and is of great significance to maintaining social security. It can be treated as a binary classification problem. For some crimes it is difficult to determine from the existing evidence whether they are serial, so two-way decisions are prone to errors for some case pairs. Here, three-way decisions based on the decision-theoretic rough set are applied, and their key issue is determining thresholds by setting appropriate loss functions. However, the loss functions are sometimes difficult to obtain. In this paper, a method is proposed that automatically learns the thresholds of the three-way decisions without the need to preset explicit loss functions. We simplify the loss function matrix according to the characteristics of crime linkage, re-express the thresholds in terms of loss functions, and investigate the relationship between the overall decision cost and the size of the boundary region. The trade-off between the uncertainty of the boundary region and the decision cost is taken as the optimization objective. We apply multiple traditional classification algorithms as base classifiers and employ real-world cases and some public datasets to evaluate the effect of the proposed method. The results show that the proposed method can reduce classification errors.
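A sketch of the three-way decision rule itself (accept / reject / defer to a boundary region), assuming a base classifier that outputs probabilities. The paper's contribution, learning the thresholds automatically, is not reproduced here; the fixed alpha and beta below are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

alpha, beta = 0.7, 0.3                      # illustrative thresholds, alpha > beta
decision = np.full(len(proba), "boundary", dtype=object)
decision[proba >= alpha] = "serial"         # positive (accept) region
decision[proba <= beta] = "nonserial"       # negative (reject) region

for d in ("serial", "nonserial", "boundary"):
    print(d, (decision == d).sum())
```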
... Dhanani [3] uses word embedding to extend legal advice and judgment. Yu [4] constructs a causal relationship between case facts and crimes using random forest. Ji [5] uses the neural network model to vectorise legal texts and capture the potential connections between case elements. ...
Article
Full-text available
The use of computer-assisted legal judgment prediction (LJP) is a current research hotspot, driven by advances in artificial intelligence technology. Previous LJP methods have mainly relied on feature models and emphasized parameter sharing within the coding layer, ignoring progressive sequential relationships between LJP subtasks as well as potential similarity correlations between cases. These limitations have hindered the improvement of the accuracy of LJP methods. This article proposes an LJP algorithm, called MTL-LJP, based on optimised multi-task learning that fuses similarity correlations. MTL-LJP uses EnMo to encode cases and SiMa to compute similarity matrices between cases. MuTa, which is based on multi-task learning, is used to predict LJP subtasks. EnMo vectorises case facts from multiple perspectives using encoders based on CNN, Bi-GRU, Bi-GRU with attention mechanism and MMoE. SiMa computes centroids and distance vectors based on historical case labels and LJP subtask predictions, allowing the computation of similarity matrices for each subtask with low computational complexity. MuTa predicts subsequent subtasks by using the intermediate results of previous subtasks through the forward auxiliary mechanisms. MuTa modifies predictions by correlating the similarity of cases. Experimental results on several real case datasets show that MTL-LJP outperforms previous methods on LJP subtasks.
... The Weighted Random Forest, for instance, assigns higher penalty weights to minority instances, whereas the Balanced Random Forest deals with the class imbalance by undersampling the majority instances (Chen et al., 2004). Li et al. (2020) proposed a novel random forest based on information granules that focuses on classifying the indistinguishable case pairs at the classification boundary and can be combined with other imbalance-handling methods. ...
Article
Full-text available
Churn prediction on imbalanced data is a challenging task. Ensemble solutions exhibit good performance in dealing with class imbalance but fail to improve the profit-oriented goal in churn prediction. This paper attempts to develop a new bagging-based selective ensemble paradigm for profit-oriented churn prediction in class imbalance scenarios. The proposed approach exploits an over-produce and choose strategy, which uses a cost-weighted negative binomial distribution to generate training subsets and a cost-sensitive logistic regression with a lasso penalty to combine base classifiers selectively. Extensive experiments were carried out on ten real-world data sets exhibiting a high level of imbalance from the telecommunication industry. The experimental results show that our proposed method obtains better performance than the other twelve state-of-the-art ensemble solutions for class imbalance in both accuracy-based and profit-based measures. Our research provides a new ensemble tool for imbalanced churn prediction for both academicians and practitioners.
... Learning from class imbalanced data is a hotspot and challenging issue in the field of machine learning [1,2]. Also, we note that the class imbalance problem exists widely in real-world applications, including biology data processing [3], business data analysis [4], industry fault detection [5], face recognition [6] and crime linkage discovery [7]. In these applications, the users generally focus on the minority class, which has fewer training instances than the other classes. ...
Article
Full-text available
Learning from imbalanced data is a challenging task, as with this type of data, most conventional supervised learning algorithms tend to favor the majority class, which has significantly more instances than the other classes. Ensemble learning is a robust solution for addressing the imbalanced classification problem. To construct a successful ensemble classifier, the diversity of base classifiers should receive specific attention. In this paper, we present a novel ensemble learning algorithm called Selective Evolutionary Heterogeneous Ensemble (SEHE), which produces diversity in two ways, as follows: 1) adopting multiple different sampling strategies to generate diverse training subsets and 2) training multiple heterogeneous base classifiers to construct an ensemble. In addition, considering that some low-quality base classifiers may pull down the performance of an ensemble and that it is difficult to estimate the potential of each base classifier directly, we profit from the idea of a selective ensemble to adaptively select base classifiers for constructing an ensemble. In particular, an evolutionary algorithm is adopted to conduct the procedure of adaptive selection in SEHE. The experimental results on 42 imbalanced data sets show that SEHE is significantly superior to some state-of-the-art ensemble learning algorithms which are specifically designed for addressing the class imbalance problem, indicating its effectiveness and superiority.
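A loose sketch of the two diversity sources described above: several sampling strategies produce different training subsets, and heterogeneous base classifiers are trained on them. The evolutionary selection stage of SEHE is omitted here and replaced by a plain soft-voting combination; the sampler/classifier pairings are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

pairs = [(RandomUnderSampler(random_state=0), LogisticRegression(max_iter=1000)),
         (SMOTE(random_state=0), DecisionTreeClassifier(random_state=0)),
         (RandomOverSampler(random_state=0), GaussianNB())]

models = []
for sampler, clf in pairs:
    Xs, ys = sampler.fit_resample(X, y)      # each base learner sees a differently resampled set
    models.append(clf.fit(Xs, ys))

proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("predicted positives:", int((proba > 0.5).sum()))
```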
... In recent years, learning from imbalanced data distributions has gradually become a hotspot issue in the machine learning field because it arises in more and more practical applications, including medical diagnosis [1], industrial fault diagnosis [2,3], network intrusion detection [4], financial fraud detection [5,6], text classification [7,8], bioinformatics [9], soil classification [10], air performance prediction [11], and criminal linkage detection [12]. ...
Article
Full-text available
Class imbalance learning (CIL), which aims to address the performance degradation of traditional supervised learning algorithms under skewed data distributions, has become one of the research hotspots in machine learning, data mining, and artificial intelligence. As a post-processing CIL technique, decision threshold moving (DTM) has been verified to be an effective strategy for addressing the class imbalance problem. However, whether a random or an optimal threshold designation is adopted, the classification hyperplane can only be moved in parallel and cannot change its orientation, so its performance is restricted, especially on complex data with variable density. To further improve the performance of existing DTM strategies, we propose an improved algorithm called CDTM that divides the majority training instances into multiple regions of different density and conducts the DTM procedure on each region independently. Specifically, we adopt the well-known DBSCAN clustering algorithm to split the training set, as it adapts well to density variation. In the context of the support vector machine (SVM) and the extreme learning machine (ELM), we verify the effectiveness and superiority of the proposed CDTM algorithm. The experimental results on 40 benchmark class imbalance datasets indicate that CDTM is superior to several other state-of-the-art DTM algorithms in terms of the G-mean performance metric.
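A minimal sketch of plain decision threshold moving (DTM): keep the trained model fixed and move the classification threshold to maximise G-mean on a validation split. CDTM, as described above, would additionally split the majority class into density regions (e.g. via DBSCAN) and move a threshold per region; that refinement is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

svm = SVC(gamma="scale").fit(X_tr, y_tr)
scores = svm.decision_function(X_va)

def g_mean(y_true, y_hat):
    tp = np.sum((y_true == 1) & (y_hat == 1)); fn = np.sum((y_true == 1) & (y_hat == 0))
    tn = np.sum((y_true == 0) & (y_hat == 0)); fp = np.sum((y_true == 0) & (y_hat == 1))
    return np.sqrt(tp / (tp + fn) * tn / (tn + fp))

# Scan candidate thresholds and keep the one with the best validation G-mean.
thresholds = np.linspace(scores.min(), scores.max(), 200)
best = max(thresholds, key=lambda t: g_mean(y_va, (scores >= t).astype(int)))
print("moved threshold:", best, "G-mean:", g_mean(y_va, (scores >= best).astype(int)))
```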
... Nonetheless, in many real-world cases, serial crime pairs are far fewer than nonserial crime pairs. To address this challenge, some studies applied class imbalance algorithms [28]. On a real robbery dataset, they focus on the indistinguishable case pairs at the classification boundary instead of resampling the minority or majority class to handle the imbalanced dataset. ...
... After oversampling, the weighted average F1-score increased from 88% to 96%, implying the relevance of balancing the classes, particularly for smaller datasets like the one used in this study. Although MCRF models are known to handle class imbalance very efficiently, studies [42,43] have shown that for smaller datasets with very few samples in any of the classes, the Gini measure used for data splitting is skew-sensitive and biased towards the majority class. Notably, the proposed classifier generates such reports every five minutes for each batch of RFs, based on the predictions obtained from the final model. ...
Conference Paper
Full-text available
Maritime Domain Awareness (MDA) is primarily driven by an Automatic Identification System (AIS), an automated tracking system that identifies and displays other vessels in the vicinity. There have been several implementations of AIS, nevertheless this algorithm demonstrates a near real-time Machine Learning (ML) pipeline that detects and clusters the radio frequencies (RF) received from a Software Defined Radio (SDR) implementation of the maritime AIS, operated by USM. The hardware setup consists of an SDR, raspberry-Pi, and a super wide band antenna. The near real-time analytic pipeline reads the RF signals every five minutes and performs a feature-learning based random forest classification of the detected RF frequencies. The model with the selected feature-set and 500 estimators (yielding the lowest Out-Of-Bag error rate of 0.10) was used to test the streaming signals every five minutes. With a 96% classification accuracy, this SDR-ML pipeline demonstrated the ability to be used in port surveillance and security by detecting the center frequencies and bandwidths of the local active signals around the port in near real-time. Further, by integrating the AIS tracking and detection, the source of a suspicious RF signal can also be classified which would play a key role in ensuring MDA.
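A short sketch of the forest configuration mentioned above (500 trees with out-of-bag error tracking), on synthetic data, since the RF-signal features themselves are not available here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error rate:", 1.0 - rf.oob_score_)   # the paper reports 0.10 on its own data
```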
... Another reason for the popularity of DTs is that multiple tree-based ensembles achieve highly competitive classification results; in a 2017 survey [3], Random Forest [4], and XGBoost [5] are among the top-ranked algorithms; in a 2021 survey [6] using synthetic databases, bagged CART and Gradient Boosted Trees were the top-ranked algorithms in problems with few features and a high number of objects. Some recent applications of tree-based classifiers in real-world problems are online fault diagnosis for rotor systems with a variation of CART (D-CART) [7], and crime linkage, where a variation of Random Forest (IGRF) is used to solve the class imbalance problem [8]. ...
Article
Decision trees (DTs) are popular classifiers partly due to their reasonably good classification performance, their ease of interpretation, and their widespread use in ensembles. To improve the classification performance of individual DTs, researchers have used linear combinations of features in inner nodes (Multivariate decision trees), leaf nodes (Model trees), or both (Functional trees). In this paper, we present a new functional tree, Functional Tree for class imbalance problems (FT4cip). FT4cip is designed to work with class imbalance problems, where one of the classes in the database has few objects compared to another class. FT4cip achieves better classification performance, in terms of AUC, than the best model tree (LMT) and functional tree (Gama) that we identified. The statistical comparison was made in 110 databases using Bayesian statistical tests. We also make a meta-analysis of classification performance per type of database, which helps us recommend a classifier given a problem. We show how each design decision taken when building FT4cip contributes to classification performance or simple models, and rank them according to their importance to classification performance. To avoid a problem of fragmentation in DT literature, we contrast each design decision taken when building FT4cip against LMT and Gama.
... And different ensemble learning models have achieved good performance on various classification problems [15,37,44]. Random forest is a representative ensemble learning algorithm; it has significantly better generalization ability than a single classifier and is often applied to small-sample learning problems when the imbalance ratio is not too high [32,73]. ...
Article
Full-text available
Currently, various machine learning (ML) techniques have been developed to solve geotechnical engineering problems. However, the lack of representative field samples limits the application of ML models. A shield jamming risk prediction method based on numerical samples and random forest (RF) classifier is proposed. The database with samples of different shield jamming risk levels is established by numerical simulation of the TBM construction process. By setting different values and combinations, seven influencing parameters, i.e., advance rate, overcut, elastic modulus, tensile strength, in situ stress, maximum thrust and friction coefficient, are considered. The shield jamming risk level is determined according to the ratio of the total friction force to TBM residual thrust. Feature importance analysis indicates that elastic modulus, overcut and in situ stress are the major influencing factors of shield jamming. Based on the labeled database, the RF model integrating multiple decision trees is established to capture the complex relationship between shield jamming risk and different influencing factors. The trained model has shown good prediction performance on the test set, and the prediction results of six field instances in the DXL tunnel are in good agreement with the actual shield jamming situation. Compared with other conventional classifiers, i.e., support vector machine (SVM), k-nearest neighbors (KNN), decision tree (DT) and logistic regression (LR), the proposed RF classifier has higher prediction accuracy and generalization ability. Additionally, the influence of sample numbers and sample imbalance is discussed.
... The ensemble-margin based random forest, proposed by Feng et al. [28], combines the random forest and new subsampling iterative techniques. Random forest also has a wide range of applications for its good stability and generalization [29,30]. Fernández-Delgado et al. [31] asserted that ''The classifiers most likely to be the best are the random forest versions''. ...
Article
Many researchers have studied the combinations of machine learning techniques and traditional statistical strategies, and proposed effective procedures for complicated data sets. Yet there is still room for improvement in running time and prediction accuracy. In this paper, we propose an iterative feature screening procedure, named forward recursive selection. We combine the random forest and forward selection to address the model-based limitations and the related requirements. We also use the forward strategy with a limited number of iterations to improve the computational efficiency. To provide theoretical guarantees for this method, we calculate functions of the permutation importance of this algorithm in different models and data with group structures. Numerical comparisons and empirical analysis support our results, and the proposed procedure works well.
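A rough sketch of a forward selection loop guided by random-forest permutation importance, in the spirit of the procedure described above but not the authors' exact algorithm; the cap of five iterations is an illustrative choice to keep the cost low.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):                                   # limited number of iterations
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[:, remaining], y)
    imp = permutation_importance(rf, X[:, remaining], y,
                                 n_repeats=5, random_state=0).importances_mean
    best = remaining[int(np.argmax(imp))]            # most important remaining feature
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)
```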
... Finally, the trained subtree is merged with the standard random forest. A new method based on information granules (IGRF) [28] applied the idea of BRAF to crime detection. The IGRF process combines information granularity with the serial crime pairs among the k neighbors to form the critical regions. ...
Article
Full-text available
Many machine learning problem domains, such as the detection of fraud, spam, outliers, and anomalies, tend to involve inherently imbalanced class distributions of samples. However, most classification algorithms assume equivalent sample sizes for each class. Therefore, imbalanced classification datasets pose a significant challenge in prediction modeling. Herein, we propose a density-based random forest algorithm (DBRF) to improve the prediction performance, especially for minority classes. DBRF is designed to recognize boundary samples as the most difficult to classify and then use a density-based method to augment them. Subsequently, two different random forest classifiers were constructed to model the augmented boundary samples and the original dataset dependently, and the final output was determined using a bagging technique. A real-world material classification dataset and 33 open public imbalanced datasets were used to evaluate the performance of DBRF. On the 34 datasets, DBRF could achieve improvements of 2–15% over random forest in terms of the F1-measure and G-mean. The experimental results proved the ability of DBRF to solve the problem of classifying objects located on the class boundary, including objects of minority classes, by taking into account the density of objects in space.
... Li et al. [11] applied an information granule in the random forest for the forecasting of crime and crime linkage analysis. The developed method focuses on the indistinguishable case pairs at the classification boundary instead of resampling of minor or major classes to handle the imbalanced dataset. ...
Article
Full-text available
Crime prediction models are very useful for the police force to prevent crimes from happening and to reduce the crime rate of a city. Existing crime prediction models are not efficient in handling data imbalance and suffer from overfitting. In this research, an adaptive DRQN model is proposed to develop a robust crime prediction model. The proposed adaptive DRQN model uses a GRU instead of an LSTM unit to store the relevant features for effective classification of Sacramento city crime data. Storing relevant features for a long time helps handle the data imbalance problem, and irrelevant features are eliminated to avoid overfitting. Adaptive agents based on the MDP are applied to adaptively learn the input data and provide effective predictions. The reinforcement learning method is applied in the proposed adaptive DRQN model to select the optimal state value and to identify the best reward value. The proposed adaptive DRQN model has an MAE of 36.39, which is better than the 38.82 MAE of the existing Recurrent Q-Learning model.
... Learning from imbalanced data is an important and hot topic in machine learning, as it has been widely applied to diagnose and classify diseases [1,2], detect software defects [3,4], analyze biology and pharmacology data [5,6], evaluate credit risk [7], predict actionable revenue change and bankruptcy [8,9], diagnose faults in the industrial procedure [10,11], classify soil types [12,13], and even predict crash injury severity [14] or analyze crime linkages [15]. Meanwhile, class imbalance learning (CIL) is also a challenging task. ...
Article
Full-text available
Class imbalance learning (CIL) is an important branch of machine learning as, in general, it is difficult for classification models to learn from imbalanced data; meanwhile, skewed data distribution frequently exists in various real-world applications. In this paper, we introduce a novel solution of CIL called Probability Density Machine (PDM). First, in the context of Gaussian Naive Bayes (GNB) predictive model, we analyze the reason why imbalanced data distribution makes the performance of predictive model decline in theory and draw a conclusion regarding the impact of class imbalance that is only associated with the prior probability, but does not relate to the conditional probability of training data. Then, in such context, we show the rationality of several traditional CIL techniques. Furthermore, we indicate the drawback of combining GNB with these traditional CIL techniques. Next, profiting from the idea of K-nearest neighbors probability density estimation (KNN-PDE), we propose the PDM which is an improved GNB-based CIL algorithm. Finally, we conduct experiments on lots of class imbalance data sets, and the proposed PDM algorithm shows the promising results.
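A small sketch of the prior-probability viewpoint described above: with Gaussian Naive Bayes, the class-conditional densities are unaffected by the imbalance, so simply overriding the skewed priors with uniform ones already shifts decisions toward the minority class. This illustrates the analysis, not the full PDM algorithm; the data and weights are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)

gnb_skewed  = GaussianNB().fit(X, y)                    # priors estimated from the skewed data
gnb_uniform = GaussianNB(priors=[0.5, 0.5]).fit(X, y)   # imbalance-corrected priors

print("minority recall, data priors   :", recall_score(y, gnb_skewed.predict(X)))
print("minority recall, uniform priors:", recall_score(y, gnb_uniform.predict(X)))
```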
... Compared with a single CART, the majority vote of several CARTs is less susceptible to outliers, which mitigates the volatility due to small data and improves robustness. Random forest has been applied to many fields, such as remote sensing [3], crime linkage [4], target detection [5], lymph node segmentation [6], speech emotion recognition [7], hemagglutinin sequence data [8], driver's stress level classification [9], estimation of daily PM2.5 concentrations [10][11][12], and CO2 emissions [13]. ...
Article
Full-text available
Random forest (RF) is an ensemble classifier method in which all decision trees participate in voting; low-quality decision trees reduce the accuracy of the random forest. To improve the accuracy of random forest, decision trees with a larger degree of diversity and higher classification accuracy are selected for voting. In this paper, an RF based on the Kappa measure and an improved binary artificial bee colony algorithm (IBABC) is proposed. Firstly, the Kappa measure is used for pre-pruning, and the decision trees with a larger degree of diversity are selected from the forest. Then, a crossover operator and a leaping operator are applied in ABC, the improved binary ABC is used for secondary pruning, and the decision trees with better performance are selected for voting. The proposed method (Kappa+IBABC) is tested on a number of UCI datasets. Computational results demonstrate that Kappa+IBABC improves the performance on most datasets with fewer decision trees. The Wilcoxon signed-rank test is used to verify the significant difference between the Kappa+IBABC method and other pruning methods. In addition, Chinese haze pollution is becoming more and more serious; the proposed method is used to predict haze weather and has achieved good results.
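A sketch of the Kappa pre-pruning idea: measure pairwise agreement between the trees of a fitted forest with Cohen's kappa and keep the most diverse ones for voting. The secondary IBABC pruning stage described above is omitted, and keeping 20 of 50 trees is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

preds = np.array([t.predict(X) for t in rf.estimators_])          # (n_trees, n_samples)
n = len(preds)
# Average kappa of each tree with all others: low values mean high diversity.
avg_kappa = np.array([np.mean([cohen_kappa_score(preds[i], preds[j])
                               for j in range(n) if j != i]) for i in range(n)])

keep = np.argsort(avg_kappa)[:20]                                  # 20 most diverse trees
vote = (preds[keep].mean(axis=0) > 0.5).astype(int)
print("accuracy of pruned forest vote:", (vote == y).mean())
```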
... To answer RQ3, we selected SVM and RF as state-of-the-art techniques, and compared the performance with LCBPA. The SVM achieved a better result in classification [37], whereas the performance of RF is adequate to deal with an imbalanced dataset [38]. Therefore, RF is considered as an ensemble technique. ...
Article
Full-text available
Software maintenance is an important phase of the development life cycle that must be performed to avoid software failure. To systematically handle bugs (defects), the software development organization creates bug reports that document the vulnerabilities found in the software under test. However, manually handling bug reports is a laborious, tedious, and time-consuming task. Moreover, the bug repository receives large numbers of bug reports on a daily basis, which demands that the found and received bugs be fixed in a timely manner. Motivated by this, the current work proposes an automated bug prioritization and assignment technique called LCBPA (Long short-term memory, Content-based filtering for Bug Prioritization and Assignment). To perform bug prioritization, we employ Long Short-Term Memory (LSTM) to predict the priority of a bug report. For bug assignment, we use content-based filtering, where the prioritized bug reports are automatically assigned to developers based on their previous knowledge. The performance of the proposed bug prioritization model is determined by comparison with state-of-the-art bug prioritization techniques using precision, recall, and F1-score. Similarly, the effectiveness of the bug assignment model is evaluated by defining various case scenarios. The results show that the proposed LCBPA technique outperforms the current state-of-the-art bug prioritization techniques (with a 22% increase in F1-score) and also handles the bug assignment problem more efficiently than existing bug assignment techniques.
... In [72], imbalanced data are handled by using IGRF, whose information granules are based on case pairs that are not easily distinguishable. ...
Thesis
Electricity theft (ET) is a major problem in developing countries. It affects the economy and causes revenue loss. It also decreases the reliability and stability of electricity utilities. Due to these losses, the quality of supply is affected and tariffs are imposed on legitimate consumers. ET is an essential part of non-technical loss (NTL), and it is challenging for electricity utilities to find the responsible people. Several methodologies have been developed to identify ET behaviors automatically. However, these approaches mainly assess records of consumers' electricity usage and may prove inadequate in detecting ET due to the variety of theft attacks and the irregularity of consumers' behavior. Moreover, some important challenges need to be addressed. (i) Normal consumers are sometimes wrongly identified as fraudulent, which leads to a high false-positive rate (FPR); after the detection of theft, a costly on-site inspection is needed to validate whether the detected person is fraudulent or not. (ii) The imbalanced nature of the datasets negatively affects the models' performance. (iii) The problem of overfitting and generalization error is often faced in deep learning models, which then predict unseen data inaccurately. So, the motivation for this work is to detect illegal consumers accurately. We have proposed four Artificial Intelligence (AI) models in this thesis. In system model 1, we propose Enhanced artificial neural network blocks with skip connections (EANNBS), which makes training easier and reduces overfitting, FPR, generalization error, and execution time. A temporal convolutional network with an enhanced multi-layer perceptron (TCN-EMLP) is proposed in system model 2; it analyzes the sequential data based on daily electricity-usage records obtained from smart meters, while the EMLP integrates non-sequential auxiliary data, such as data related to electrical connection type, property area, electrical appliance usage, etc. System model 3 is based on a Residual network (RN) used to automate feature extraction, while three tree-based classifiers, Decision tree (DT), Random forest (RF) and Adaptive boosting (AdaBoost), are trained on the obtained features for classification. A hyperparameter tuning toolkit, named the Hyperactive optimization toolkit, is presented in this system model; Bayesian optimization is used in this toolkit to simplify the tuning of DT, RF and AdaBoost. In system model 4, the input is forwarded to three different and well-known Machine learning (ML) techniques, such as the Support vector machine (SVM). At this stage, a meta-heuristic algorithm named Simulated annealing (SA) is employed to acquire optimal values for the ML models' hyperparameters. Finally, the ML models' outputs are used as features for meta-classifiers to achieve the final classification with the Light Gradient boosting machine (LGBM) and Multi-layer perceptron (MLP). Furthermore, the Pakistan residential electricity consumption (PRECON), State grid corporation of China (SGCC) and Commission for energy regulation (CER) datasets are used in this thesis. The SGCC dataset contains 9% fraudulent consumers, far fewer than non-fraudulent consumers, due to the imbalanced nature of the data. Furthermore, many classification techniques have poor predictive accuracy for the positive class; these techniques mainly focus on minimizing the error rate while ignoring the minority class.
Many re-sampling techniques are used in the literature to adjust the class ratio; however, these techniques sometimes remove important information that is necessary to learn the model and cause overfitting. Using six previously reported theft attacks, we generate theft cases to mimic real-world theft attacks in the original data. We propose combinations of oversampling and under-sampling techniques, namely the Near miss borderline synthetic minority oversampling technique (NMB-SMOTE), the Tomek link borderline synthetic minority oversampling technique with support vector machine (TBSSVM), and the Synthetic minority oversampling technique with near miss (SMOTE-NM), to handle the imbalanced classification problem. We have conducted comprehensive experiments using the SGCC, CER and PRECON datasets. The performance of the suggested model is validated using different performance metrics derived from the confusion matrix (CM).
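An illustrative over- plus under-sampling chain in the spirit of the combinations listed above (borderline SMOTE followed by NearMiss), assuming the imbalanced-learn package; it is not the thesis's exact NMB-SMOTE / SMOTE-NM implementation, and the sampling ratios are arbitrary.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, weights=[0.91, 0.09], random_state=0)

# Resampling is applied only to the training folds inside cross-validation.
pipe = Pipeline([
    ("over",  BorderlineSMOTE(sampling_strategy=0.5, random_state=0)),
    ("under", NearMiss(version=1)),
    ("clf",   SVC(gamma="scale")),
])
print("F1 (5-fold):", cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
```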
... Predicting the global climate problem using the index quantization ability of random forest and the optimizing ability of PSO in an NN prediction model is the main purpose of [30]. Li et al. [31] address the class imbalance in detecting serial case pairs. ...
Article
Full-text available
The goal of aggregating base classifiers is to achieve an aggregated classifier that has a higher resolution than the individual classifiers. Random forest is an ensemble learning method that has received more attention than other ensemble learning methods due to its simple structure, ease of understanding, and higher efficiency than similar methods. The ability and efficiency of classical methods are always influenced by the data. Independence from the data domain and the ability to adapt to the conditions of the problem space are the most challenging issues for the different types of classifiers. In this paper, a method based on learning automata is presented, through which adaptive capabilities for the problem space, as well as independence from the data domain, are added to the random forest to increase its efficiency. Using the idea of reinforcement learning in the random forest has made it possible to address data that have dynamic behaviour. Dynamic behaviour refers to the variability in the behaviour of a data sample in different domains. Therefore, to evaluate the proposed method, and to create an environment with dynamic behaviour, different domains of data have been considered. In the proposed method, the idea is added to the random forest using learning automata. The reason for this choice is the simple structure of learning automata and their compatibility with the problem space. The evaluation results confirm the improvement of random forest efficiency.
... The experiments were performed using a data split with 70% of the data for training and 30% for testing. The results were obtained by testing different problem transformation and adaptive techniques, such as OvA, BR, LP, CC, and the adaptive ML-KNN, with different classification algorithms such as SVC [43], [44], LR [45], RF [46], Gaussian NB [47], and DT [48]. ...
Article
Full-text available
Outcome-based education (OBE) is a well-proven teaching strategy based upon a predefined set of expected outcomes. The components of OBE are Program Educational Objectives (PEOs), Program Outcomes (POs), and Course Outcomes (COs). The latter are assessed at the end of each course, and several actions can be recommended by faculty members to enhance the quality of courses and therefore of the overall educational program. Considering the large number of courses and the effort demanded of faculty members, poor actions could be recommended, and therefore undesirable and inappropriate decisions may occur. In this paper, a recommender system using different machine learning algorithms is proposed for predicting suitable actions based on course specifications, academic records, and course learning outcomes' assessments. We formulated the problem as a multi-label, multi-class binary classification problem, and the dataset was handled with different problem transformation and adaptive methods such as one-vs.-all, binary relevance, label powerset, classifier chain, and the ML-KNN adaptive classifier. As a case study, the proposed recommender system is applied to the College of Computer and Information Sciences, Jouf University, Kingdom of Saudi Arabia (KSA), to help academic staff improve the quality of teaching strategies. The obtained results showed that the proposed recommender system presents more recommended actions for improving students' learning experiences.
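A sketch of two of the problem-transformation strategies mentioned above (binary relevance via one-vs-rest, and classifier chains) on a synthetic multi-label dataset; the actual course/action data are not available here, and scikit-learn's implementations stand in for the ones used in the paper.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, Y = make_multilabel_classification(n_samples=1000, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)   # binary relevance
cc = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(X_tr, Y_tr)

print("binary relevance micro-F1:", f1_score(Y_te, br.predict(X_te), average="micro"))
print("classifier chain micro-F1:", f1_score(Y_te, cc.predict(X_te), average="micro"))
```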
Article
Full-text available
Frost resistance in very cold areas is an important engineering issue for the durability of concrete, and efficient and accurate prediction of the frost resistance of concrete is a crucial basis for determining reasonable design mix proportions. For quick and accurate prediction of the frost resistance of concrete, a Bayesian optimization (BO)-random forest (RF) approach was used to establish a frost resistance prediction model that consists of three phases. A case study of a key national engineering project shows that (1) the RF can be used to effectively screen the factors that influence concrete frost resistance; (2) the R2 of BO-RF for the training set and the test set is 0.967 and 0.959, respectively, which is better than that of the other algorithms; and (3) using the test data from the first section of the project for prediction, good results are obtained for the second section. The proposed BO-RF hybrid algorithm can accurately and quickly predict the frost resistance of concrete and provide a reference basis for intelligent prediction of concrete durability.
Chapter
Over the past few decades, metamodeling tools have received attention for their ability to represent and improve complex systems. The use of metamodeling techniques in optimization problems via simulation has grown considerably in recent years to promote more robust and agile decision-making, determining the best scenario among the solution space to the problem. In this sense, this Chapter discusses the state of the art of metamodeling-based simulation optimization, presenting the gaps, opportunities, and future perspectives found in literature.
Article
Imbalanced classification is a challenging task in the fields of machine learning and data mining. Cost-sensitive learning can tackle this issue by considering different misclassification costs of classes. Weighted extreme learning machine (W-ELM) takes a cost-sensitive strategy to alleviate the learning bias towards the majority class to achieve better classification performance. However, W-ELM may not achieve the optimal weights for the samples from different classes due to the adoption of empirical costs. In order to solve this issue, a multi-objective optimization-based adaptive class-specific cost extreme learning machine (MOAC-ELM) is presented in this paper. To be specific, the initial weights are first assigned depending on the class information. Based on that, the representation of the minority class can be enhanced by adding penalty factors. In addition, a multi-objective optimization with respect to the penalty factors is formulated to automatically determine the class-specific costs, in which multiple performance criteria are constructed by comprehensively considering the misclassification rate and generalization gap. Finally, an ensemble strategy is implemented to make decisions after optimization. Accordingly, the proposed MOAC-ELM is an adaptive method with good robustness and generalization performance for imbalanced classification problems. Comprehensive experiments have been performed on several benchmark datasets and a real-world application dataset. The statistical results demonstrate that MOAC-ELM can achieve competitive classification performance.
Article
The synthetic minority oversampling technique (SMOTE) algorithm is considered a benchmark algorithm for addressing the class imbalance learning (CIL) problem. However, SMOTE fails to observe the distribution of the training data and to explore its internal structure, resulting in an unstable and non-robust classification result. Recently, more than 100 SMOTE variants have been developed to solve this problem. Most of them attempt to directly explore the prior distribution information of the training data, which may provide extremely inaccurate guidance in some classification scenarios. In this study, we present the instance weighted SMOTE (IW-SMOTE) algorithm, a more robust and universal solution for improving SMOTE by exploiting distribution information indirectly. In particular, an UnderBagging-like undersampling ensemble algorithm that uses the classification and regression tree (CART) as the base classifier is first adopted to classify each training instance and acquire the corresponding confusion information. We can accurately estimate location information for each instance, including noise, border and safe instances, based on this confusion information. Then, the noisy instances can be removed, and the borderline instances can be given more chances than the safe instances to be seed instances in the SMOTE procedure. Finally, the balanced instance set is used to train the CART, K-nearest neighbors (KNN) and support vector machine (SVM) classifiers to verify that the proposed algorithm is independent of the specific classification model. We compare IW-SMOTE with several state-of-the-art SMOTE-based algorithms on many class imbalance data sets, and IW-SMOTE shows promising results.
Article
In recent years, class imbalance learning (CIL) has become an important branch of machine learning. The Synthetic Minority Oversampling TEchnique (SMOTE) is considered to be a benchmark algorithm among CIL techniques. Although the SMOTE algorithm performs well on the vast majority of class-imbalance tasks, it also has the inherent drawback of noise propagation. Many SMOTE-variants have been proposed to address this problem. Generally, the improved solutions conduct a hybrid sampling procedure, i.e., carrying out an undersampling process after SMOTE to remove noises. However, owing to the complexity of data distribution, it is sometimes difficult to accurately identify real instances of noise, resulting in low modeling quality. In this paper, we propose a more robust and universal SMOTE hybrid variant algorithm named SMOTE-reverse k-nearest neighbors (SMOTE-RkNN). The proposed algorithm identifies noise based on probability density but not local neighborhood information. Specifically, the probability density information of each instance is provided by RkNN, a well-known KNN variant. Noisy instances are found and deleted according to their relevant probability density. In experiments on 46 class-imbalanced data sets, SMOTE-RkNN showed promising results in comparison with several popular SMOTE hybrid variant algorithms.
Article
Legal Judgment Prediction (LJP) aims to predict the judgment result based on the fact description of a criminal case, and it is gradually becoming a hot research topic in the legal realm. Generally, a classic LJP contains three subtasks, i.e., applicable law article prediction, charge prediction, and term of penalty prediction. In real-world scenarios, both charge prediction and applicable law article prediction are actually multi-class classification tasks in a multi-label scenario. However, most existing studies only model them as multi-class classification problems in a single-label scenario. Besides, they only consider the context of the fact description and ignore the exploitation of effective keywords that widely exist in abundant law articles. To fill the above gaps, we propose a novel multi-task legal judgment prediction framework via a multi-view encoder fusing legal keywords, named MVE-FLK, to jointly model multiple subtasks in LJP. Specifically, the multi-view encoder is the core module of MVE-FLK; in this module, we devise a word and sentence encoder (WSE) with an attention mechanism to fuse legal keywords. We then develop a multi-view attention network to combine WSE with a classic Transformer and a DAN (Deep Averaging Network) for encoding the case from multiple views. After that, we propose a multi-task prediction module by developing a novel keyword-fusing approach to enhance the performance of multi-task prediction. In addition, we devise a unique prediction principle for each subtask at a fine-grained level, which effectively improves the performance of the subtasks. The experimental results on two real-life legal datasets show that our model yields significant prediction performance advantages over six competitive methods.
Article
Full-text available
The capability of distinguishing between small objects when manipulated with hand is essential in many fields, especially in video surveillance. To date, the recognition of such objects in images using Convolutional Neural Networks (CNNs) remains a challenge. In this paper, we propose improving robustness, accuracy and reliability of the detection of small objects handled similarly using binarization techniques. We propose improving their detection in videos using a two level methodology based on deep learning, called Object Detection with Binary Classifiers. The first level selects the candidate regions from the input frame and the second level applies a binarization technique based on a CNN-classifier with One-Versus-All or One-Versus-One. In particular, we focus on the video surveillance problem of detecting weapons and objects that can be confused with a handgun or a knife when manipulated with hand. We create a database considering six objects: pistol, knife, smartphone, bill, purse and card. The experimental study shows that the proposed methodology reduces the number of false positives with respect to the baseline multi-class detection model.
Article
Full-text available
This study compared the ability of seven statistical models to distinguish between linked and unlinked crimes. The seven models utilised geographical, temporal, and modus operandi information relating to residential burglaries (n = 180), commercial robberies (n = 118), and car thefts (n = 376). Model performance was assessed using receiver operating characteristic analysis and by examining how successfully the seven models could prioritise linked over unlinked crimes. The regression-based and probabilistic models achieved comparable accuracy and were generally more accurate than the tree-based models tested in this study. The Logistic algorithm achieved the highest area under the curve (AUC) for residential burglary (AUC = 0.903) and commercial robbery (AUC = 0.830), and the SimpleLogistic algorithm achieved the highest for car theft (AUC = 0.820). The findings also indicated that discrimination accuracy is maximised (in some situations) if behavioural domains are utilised rather than individual crime scene behaviours, and that the AUC should not be used as the sole measure of accuracy in behavioural crime linkage research.
Article
Full-text available
Purpose To conduct a test of the principles underpinning crime linkage (behavioural consistency and distinctiveness) with a sample more closely reflecting the volume and nature of sexual crimes with which practitioners work, and to assess whether solved series are characterized by greater behavioural similarity than unsolved series. Method A sample of 3,364 sexual crimes (including 668 series) was collated from five countries. For the first time, the sample included solved and unsolved but linked‐by‐DNA sexual offence series, as well as solved one‐off offences. All possible crime pairings in the data set were created, and the degree of similarity in crime scene behaviour shared by the crimes in each pair was quantified using Jaccard's coefficient. The ability to distinguish same‐offender and different‐offender pairs using similarity in crime scene behaviour was assessed using Receiver Operating Characteristic analysis. The relative amount of behavioural similarity and distinctiveness seen in solved and unsolved crime pairs was assessed. Results An Area Under the Curve of .86 was found, which represents an excellent level of discrimination accuracy. This decreased to .85 when using a data set that contained one‐off offences, and both one‐off offences and unsolved crime series. Discrimination accuracy also decreased when using a sample composed solely of unsolved but linked‐by‐DNA series (AUC = .79). Conclusions Crime linkage is practised by police forces globally, and its use in legal proceedings requires demonstration that its underlying principles are reliable. Support was found for its two underpinning principles with a more ecologically valid sample.
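A sketch of the pairwise evaluation described in the two abstracts above: quantify the similarity of two crimes' binary behaviour vectors with Jaccard's coefficient and score how well that similarity separates same-offender from different-offender pairs using ROC analysis. The behaviour vectors here are synthetic stand-ins for real crime scene data.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_offenders, crimes_each, n_behaviours = 30, 3, 25
profiles = rng.random((n_offenders, n_behaviours)) < 0.3            # offender "styles"
crimes, offender = [], []
for o in range(n_offenders):
    for _ in range(crimes_each):
        noise = rng.random(n_behaviours) < 0.1
        crimes.append(profiles[o] ^ noise)                          # style plus per-crime noise
        offender.append(o)
crimes, offender = np.array(crimes), np.array(offender)

def jaccard(a, b):
    union = np.sum(a | b)
    return np.sum(a & b) / union if union else 0.0

similarity, linked = [], []
for i, j in combinations(range(len(crimes)), 2):                    # all crime pairings
    similarity.append(jaccard(crimes[i], crimes[j]))
    linked.append(int(offender[i] == offender[j]))

print("AUC of Jaccard similarity:", roc_auc_score(linked, similarity))
```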
Article
Full-text available
This article addresses serial crimes, which are particularly interesting to study in the absence of proper and solid evidence. From a high volume of criminal cases of similar types, it is difficult to detect which crimes were committed by the same offender. The process of linking crimes committed by the same offender or offenders is called Crime Linkage Analysis. In this article, a new hesitant fuzzy distance measure is introduced and a fuzzy multicriteria decision-making approach is proposed to support Crime Linkage Analysis, enabling us to determine to what extent a pair of crimes shares a common offender or offenders.
Article
Full-text available
Classification of class-imbalanced data has drawn significant interest in medical applications. Most existing methods are prone to categorising samples into the majority class, resulting in bias and, in particular, insufficient identification of the minority class. A novel approach, the class-weighted random forest, is introduced to address the problem by assigning an individual weight to each class instead of a single weight. Validation tests on UCI datasets demonstrate that, for imbalanced medical data, the proposed method enhances the overall performance of the classifier while producing high accuracy in identifying both the majority and minority classes.
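The per-class weighting idea can be approximated with scikit-learn's built-in class weights; the snippet below is an illustrative stand-in (synthetic data, `balanced_subsample` weighting), not the authors' implementation.

```python
# Class-weighted random forest: minority-class errors cost more during tree induction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, class_weight="balanced_subsample",
                            random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), digits=3))
```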
Article
Full-text available
The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again as the number of trees grows. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting T to a computationally feasible large number, depending on convergence properties of the desired performance measure.
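The error-versus-trees behaviour is easy to probe empirically; a small sketch, assuming a scikit-learn forest grown incrementally with `warm_start`, tracks the out-of-bag error as trees are added.

```python
# Grow one forest incrementally and watch whether the OOB error curve flattens.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

for n_trees in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=n_trees)   # warm_start keeps the already-grown trees
    rf.fit(X, y)
    print(f"{n_trees:4d} trees  OOB error = {1 - rf.oob_score_:.4f}")
```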
Article
Full-text available
Purpose: This study compared the utility of different statistical methods in differentiating sexual crimes committed by the same person from sexual crimes committed by different persons. Methods: Logistic regression, iterative classification tree (ICT), and Bayesian analysis were applied to a dataset of 3,364 solved, unsolved, serial, and apparent one-off sexual assaults committed in five countries. Receiver Operating Characteristic analysis was used to compare the statistical approaches. Results: All approaches achieved statistically significant levels of discrimination accuracy. Two out of three Bayesian methods achieved a statistically higher level of accuracy (Areas Under the Curve [AUC] = 0.89 [Bayesian coding method 1]; AUC = 0.91 [Bayesian coding method 3]) than ICT analysis (AUC = 0.88), logistic regression (AUC = 0.87), and Bayesian coding method 2 (AUC = 0.86). Conclusions: The ability to capture and utilize between-offender differences in behavioral consistency appears to be of benefit when linking sexual offenses. Statistical approaches that utilize individual offender behaviors when generating crime linkage predictions may be preferable to approaches that rely on a single summary score of behavioral similarity. Crime linkage decision-support tools should incorporate a range of statistical methods, and future research must compare these methods in terms of accuracy, usability, and suitability for practice.
Conference Paper
Full-text available
As an important part of data mining, the accuracy and generalization ability of classification are influenced by the complexity of the data set and the selected classification algorithm. In criminal data mining, in order to link serial crimes, we often build a discrimination model to determine whether two cases were committed by the same criminal according to the similarity data of the case pair. This is a typical binary classification problem. However, there are multiple features in the similarity data set and only a few contribute to classification. Thus, in order to classify the data effectively, features must be selected. This paper presents our ongoing study on feature selection in the similarity data set based on data separability. Indices of a feature's separability and the data set's separability are introduced, and we use the data set's separation rate as an evaluation criterion in feature selection. The reasonableness of the feature selection is validated by K-fold cross validation and by making predictions on a new data set.
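A hedged sketch of separability-driven feature selection: the score below is a simple Fisher-style ratio assumed for illustration (not the paper's separation-rate index), and K-fold cross-validation checks whether the selected similarity features retain predictive power.

```python
# Score each feature by a Fisher-style separability ratio, keep the top-k, validate with CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)

def fisher_score(x, y):
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    v0, v1 = x[y == 0].var(), x[y == 1].var()
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)   # larger = more separable feature

scores = np.array([fisher_score(X[:, j], y) for j in range(X.shape[1])])
top_k = np.argsort(scores)[::-1][:5]

acc_all = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
acc_sel = cross_val_score(LogisticRegression(max_iter=1000), X[:, top_k], y, cv=5).mean()
print(f"all features: {acc_all:.3f}   selected: {acc_sel:.3f}")
```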
Article
Full-text available
Most existing classification approaches assume the underlying training set is evenly distributed. In class imbalanced classification, the training set for one class (the majority) far surpasses that of the other class (the minority), where the minority class is often the more interesting one. In this paper, we review the issues that come with learning from imbalanced class data sets and various problems in class imbalance classification. A survey of existing approaches for handling classification with imbalanced datasets is also presented. Finally, we discuss current trends and advancements that could potentially shape the future direction of class imbalance learning and classification. We also find that advances in machine learning techniques will largely benefit big data computing in addressing the class imbalance problem, which is inevitably present in many real-world applications, especially in medicine and social media.
Article
Full-text available
To identify series of residential burglaries, detecting linked crimes performed by the same constellations of criminals is necessary. Comparison of crime reports today is difficult, as crime reports traditionally have been written as unstructured text and often lack a common information basis. Based on a novel process for collecting structured crime scene information, the present study investigates the use of clustering algorithms to group similar crime reports based on combined crime characteristics from the structured form. Clustering quality is measured using Connectivity and the Silhouette index (SI), stability using the Jaccard index, and accuracy using the Rand index (RI) and a Series Rand index (SRI). The performance of clustering using combined characteristics was compared with that of the spatial characteristic. The results suggest that the combined characteristics perform better than or similarly to the spatial characteristic. In terms of practical significance, the presented clustering approach is capable of clustering cases using a broader decision basis.
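The evaluation side can be illustrated with standard clustering metrics; the sketch below uses synthetic data in place of structured crime-report features and reports the Silhouette index together with an (adjusted) Rand index against known series labels.

```python
# Cluster toy "crime report" features and score the grouping with Silhouette and Rand-style indices.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, true_series = make_blobs(n_samples=300, centers=6, random_state=0)
labels = AgglomerativeClustering(n_clusters=6).fit_predict(X)

print("Silhouette:", round(silhouette_score(X, labels), 3))
print("Adjusted Rand vs. known series:", round(adjusted_rand_score(true_series, labels), 3))
```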
Article
Full-text available
Cluster ensembles have been shown to be very effective in unsupervised classification learning by generating a large pool of different clustering solutions and then combining them into a final decision. However, the task becomes more difficult due to the inherent complexities among base cluster results, such as uncertainty, vagueness and overlap. Granular computing is one of the fastest growing information-processing paradigms in the domain of computational intelligence and human-centric systems. As the core part of granular computing, rough set theory, which deals with inexact, uncertain, or vague information, has been widely applied in machine learning and knowledge discovery in recent years. From these perspectives, this paper proposes a hierarchical cluster ensemble model based on knowledge granulation, in an attempt to provide a new way of dealing with the cluster ensemble problem together with an ensemble-learning application of knowledge granulation. A novel rough distance is introduced to measure the dissimilarity between base partitions, and the notion of knowledge granulation is improved to measure the agglomeration degree of a given granule. Furthermore, a novel objective function for cluster ensembles is defined and the corresponding inferences are made. A hierarchical cluster ensemble algorithm based on knowledge granulation is designed. Experimental results on real-world data sets demonstrate the effectiveness of the proposed method for cluster ensembles.
Article
Full-text available
One of the most challenging problems facing crime analysts is that of identifying crime series, which are sets of crimes committed by the same individual or group. Detecting crime series can be an important step in predictive policing, as knowledge of a pattern can be of paramount importance toward finding the offenders or stopping the pattern. Currently, crime analysts detect crime series manually; our goal is to assist them by providing automated tools for discovering crime series from within a database of crimes. Our approach relies on a key hypothesis that each crime series possesses at least one core of crimes that are very similar to each other, which can be used to characterize the modus operandi (M.O.) of the criminal. Based on this assumption, as long as we find all of the cores in the database, we have found a piece of each crime series. We propose a subspace clustering method, where the subspace is the M.O. of the series. The method has three steps: We first construct a similarity graph to link crimes that are generally similar, second we find cores of crime using an integer linear programming approach, and third we construct the rest of the crime series by merging cores to form the full crime series. To judge whether a set of crimes is indeed a core, we consider both pattern-general similarity, which can be learned from past crime series, and pattern-specific similarity, which is specific to the M.O. of the series and cannot be learned. Our method can be used for general pattern detection beyond crime series detection, as cores exist for patterns in many domains.
Article
Full-text available
The object of this paper is to develop a statistical approach to criminal linkage analysis that discovers and groups crime events that share a common offender and prioritizes suspects for further investigation. Bayes factors are used to describe the strength of evidence that two crimes are linked. Using concepts from agglomerative hierarchical clustering, the Bayes factors for crime pairs are combined to provide similarity measures for comparing two crime series. This facilitates crime series clustering, crime series identification, and suspect prioritization. The ability of our models to make correct linkages and predictions is demonstrated under a variety of real-world scenarios with a large number of solved and unsolved breaking and entering crimes. For example, a naive Bayes model for pairwise case linkage can identify 82% of actual linkages with a 5% false positive rate. For crime series identification, 74%-89% of the additional crimes in a crime series can be identified from a ranked list of 50 incidents.
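A minimal sketch of a Bayes-factor-style linkage score, assuming Gaussian naive Bayes over pairwise similarity features (a simplification of the hierarchical models described above): the log-likelihood ratio of linked versus unlinked is used to prioritise candidate pairs.

```python
# Log Bayes factor from a naive Bayes model: evidence that two crimes share an offender.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Toy pairwise similarity features; class 1 = linked pair, class 0 = unlinked pair.
X, y = make_classification(n_samples=3000, weights=[0.97, 0.03], random_state=0)

nb = GaussianNB().fit(X, y)
log_post = nb.predict_log_proba(X)                 # log P(class | features)
log_prior = np.log(nb.class_prior_)
# log BF = log P(x | linked) - log P(x | unlinked), i.e. posterior odds minus prior odds.
log_bf = (log_post[:, 1] - log_prior[1]) - (log_post[:, 0] - log_prior[0])

ranked = np.argsort(log_bf)[::-1]                  # strongest evidence of linkage first
print("top-5 candidate linked pairs:", ranked[:5])
```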
Article
Full-text available
Case linkage uses crime scene behaviours to identify series of crimes committed by the same offender. This paper tests the underlying assumptions of case linkage (behavioural consistency and behavioural distinctiveness) by comparing the behavioural similarity of linked pairs of offences (i.e. two offences committed by the same offender) with the behavioural similarity of unlinked pairs of offences (i.e. two offences committed by different offenders). It is hypothesised that linked pairs will be more behaviourally similar than unlinked pairs thereby providing evidence for the two assumptions. The current research uses logistic regression and receiver operating characteristic analyses to explore which behaviours can be used to reliably link personal robbery offences using a sample of 166 solved offences committed by 83 offenders. The method of generating unlinked pairs is then refined to reflect how the police work at a local level, and the success of predictive factors re‐tested. Both phases of the research provide evidence of behavioural consistency and behavioural distinctiveness with linked pairs displaying more similarity than unlinked pairs across a range of behavioural domains. Inter‐crime distance and target selection emerge as the most useful linkage factors with promising results also found for temporal proximity and control. No evidence was found to indicate that the property stolen is useful for linkage. Copyright © 2012 John Wiley & Sons, Ltd.
Article
Full-text available
According to the Swedish National Council for Crime Prevention, law enforcement agencies solved approximately three to five percent of the reported residential burglaries in 2012. Internationally, studies suggest that a large proportion of crimes are committed by a minority of offenders. Law enforcement agencies, consequently, are required to detect series of crimes, or linked crimes. Comparison of crime reports today is difficult, as no systematic or structured way of reporting crimes exists, and no ability to search multiple crime reports exists.
Article
Full-text available
Much previous research on behavioural case linkage has used binary logistic regression to build predictive models that can discriminate between linked and unlinked offences. However, classification tree analysis has recently been proposed as a potential alternative owing to its ability to build user‐friendly and transparent predictive models. Building on previous research, the current study compares the relative ability of logistic regression analysis and classification tree analysis to construct predictive models for the purposes of case linkage. Two samples are utilised in this study: a sample of 376 serial car thefts committed in the UK and a sample of 160 serial residential burglaries committed in Finland. In both datasets, logistic regression and classification tree models achieve comparable levels of discrimination accuracy, but the classification tree models demonstrate problems in terms of reliability or usability that the logistic regression models do not. These findings suggest that future research is needed before classification tree analysis can be considered a viable alternative to logistic regression in behavioural case linkage. Copyright © 2012 John Wiley & Sons, Ltd.
Article
Full-text available
In the absence of forensic evidence (such as DNA or fingerprints), offender behavior can be used to identify crimes that have been committed by the same person (referred to as behavioral case linkage). The current study presents the first empirical test of whether it is possible to link different types of crime using simple aspects of offender behavior. The discrimination accuracy of the kilometer distance between offense locations (the intercrime distance) and the number of days between offenses (temporal proximity) was examined across a range of crimes, including violent, sexual, and property-related offenses. Both the intercrime distance and temporal proximity were able to achieve statistically significant levels of discrimination accuracy that were comparable across and within crime types and categories. The theoretical and practical implications of these findings are discussed and recommendations made for future research.
Conference Paper
Full-text available
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in diverse domains. However, the associated literature provides almost no directions about how many trees should be used to compose a Random Forest. The research reported here analyzes whether there is an optimal number of trees within a Random Forest, i.e., a threshold from which increasing the number of trees would bring no significant performance gain and would only increase the computational cost. Our main conclusions are: as the number of trees grows, it does not always mean the performance of the forest is significantly better than previous forests (fewer trees), and doubling the number of trees is worthless. It is also possible to state that there is a threshold beyond which there is no significant gain, unless a huge computational environment is available. In addition, an experimental relationship was found for the AUC gain when doubling the number of trees in any forest. Furthermore, as the number of trees grows, the full set of attributes tends to be used within a Random Forest, which may not be interesting in the biomedical domain. Additionally, the datasets' density-based metrics proposed here probably capture some aspects of the VC dimension of decision trees, and low-density datasets may require large-capacity machines whilst the opposite also seems to be true.
Article
Full-text available
Purpose. This paper is concerned with case linkage, a form of behavioural analysis used to identify crimes committed by the same offender, through their behavioural similarity. Whilst widely practised, relatively little has been published on the process of linking crimes. This review aims to draw together diverse published studies by outlining what the process involves, critically examining its underlying psychological assumptions and reviewing the empirical research conducted on its viability. Methods. Literature searches were completed on the electronic databases, PsychInfo and Criminal Justice Abstracts, to identify theoretical and empirical papers relating to the practice of linking crimes and to behavioural consistency. Results. The available research gives some support to the assumption of consistency in criminals' behaviour. It also suggests that in comparison with intra‐individual variation in behaviour, inter‐individual variation is sufficient for the offences of one offender to be distinguished from those of other offenders. Thus, the two fundamental assumptions underlying the practice of linking crimes, behavioural consistency and inter‐individual variation, are supported. However, not all behaviours show the same degree of consistency, with behaviours that are less situation‐dependent, and hence more offender‐initiated, showing greater consistency. Conclusions. The limited research regarding linking offenders' crimes appears promising at both a theoretical and an empirical level. There is a clear need, however, for replication studies and for research with various types of crime.
Article
Full-text available
Purpose. Through an examination of serial rape data, the current article presents arguments supporting the use of receiver operating characteristic (ROC) analysis over traditional methods in addressing challenges that arise when attempting to link serial crimes. Primarily, these arguments centre on the fact that traditional linking methods do not take into account how linking accuracy will vary as a function of the threshold used for determining when two crimes are similar enough to be considered linked. Methods. Considered for analysis were 27 crime scene behaviours exhibited in 126 rapes, which were committed by 42 perpetrators. Similarity scores were derived for every possible crime pair in the sample. These measures of similarity were then subjected to ROC analysis in order to (1) determine threshold‐independent measures of linking accuracy and (2) set appropriate decision thresholds for linking purposes. Results. By providing a measure of linking accuracy that is not biased by threshold placement, the analysis confirmed that it is possible to link crimes at a level that significantly exceeds chance (AUC = .75). The use of ROC analysis also allowed for the identification of decision thresholds that resulted in the desired balance between various linking outcomes (e.g. hits and false alarms). Conclusions. ROC analysis is exclusive in its ability to circumvent the limitations of threshold‐specific results yielded from traditional approaches to linkage analysis. Moreover, results of the current analysis provide a basis for challenging common assumptions underlying the linking task.
Chapter
Full-text available
Criminal profiling attempts to understand the behavioral and personality characteristics of an offender and has gained increasing recognition as a valuable investigative procedure. This chapter investigates sexual fantasy within the context of sexual crimes. It opens by providing an account of sexual fantasy, its nexus with sexually aberrant behavior, and how it has been utilized within the domain of criminal profiling. Research that applied grounded theory to develop a tripartite model of sexual fantasy within the context of sexual offending is presented, as well as the implications of the model to the process of criminal profiling. In closing, we argue that sexual fantasy plays an integral role in the development and maintenance of sexually aberrant behavior and can provide important insights into the internal world of the offender.
Article
Full-text available
Case linkage involves identifying crime series on the basis of behavioral similarity and distinctiveness. Research regarding the behavioral consistency of serial rapists has accumulated; however, it has its limitations. One of these limitations is that convicted or solved crime series are exclusively sampled whereas, in practice, case linkage is applied to unsolved crimes. Further, concerns have been raised that previous studies might have reported inflated estimates of case linkage effectiveness due to sampling series that were first identified based on similar modus operandi (MO), thereby overestimating the degree of consistency and distinctiveness that would exist in naturalistic settings. We present the first study to overcome these limitations; we tested the assumptions of case linkage with a sample containing 1) offenses that remain unsolved, and 2) crime series that were first identified as possible series through DNA matches, rather than similar MO. Twenty-two series consisting of 119 rapes from South Africa were used to create a dataset of 7021 crime pairs. Comparisons of crime pairs that were linked using MO vs. DNA revealed significant, but small differences in behavioral similarity with MO-linked crimes being characterized by greater similarity. When combining these two types of crimes together, linked pairs (those committed by the same serial offender) were significantly more similar in MO behavior than unlinked pairs (those committed by two different offenders) and could be differentiated from them. These findings support the underlying assumptions of case linkage. Additional factors thought to impact on linkage accuracy were also investigated. Keywords: Comparative case analysis; Linkage analysis; Behavioral linking; Sexual assault; Sexual offense
Article
Detecting serial crimes is one of the most challenging tasks in crime analysis. Linking crimes committed by the same criminal can improve the work efficiency of police offices and help maintain public safety. Previous crime linkage studies have focused on the crime features of modus operandi (M.O.) but did not address the crime process. In this paper, we propose an approach for detecting serial robbery crimes based on understanding offender M.O. by integrating crime process information. From the crime narrative text, a natural language processing method is used to extract the action and object characteristics of the crime process, a dynamic time warping method is introduced to measure the similarity of these characteristics, and an information entropy method is used to weight the action and object similarities to obtain the comprehensive similarity of the crime process. A real-world robbery dataset is employed to measure the performance of finding serial crimes after adding the crime process information. According to the results, information about the crime process obtained from the case narrative text has significant separability and better characterizes the offender's M.O. Five machine learning algorithms are used to classify the case pairs and identify serial and nonserial cases. Based on the crime features, the results show that the addition of crime process information can substantially improve the detection of serial crimes.
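The dynamic-time-warping step can be illustrated directly; the sketch below (toy action embeddings, not the paper's NLP features) computes the accumulated alignment cost between two crime-process sequences as their dissimilarity.

```python
# Minimal dynamic time warping between two sequences of per-action feature vectors.
import numpy as np

def dtw(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local distance between action embeddings
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

rng = np.random.default_rng(0)
process_a = rng.normal(size=(6, 8))    # 6 actions, 8-dim embeddings (stand-ins)
process_b = rng.normal(size=(9, 8))    # sequences may have different lengths
print("DTW dissimilarity:", round(float(dtw(process_a, process_b)), 3))
```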
Article
Imbalance classification is one of the most challenging research problems in machine learning. Techniques for two-class imbalance classification are relatively mature nowadays, yet multi-class imbalance learning is still an open problem. Moreover, the community lacks a suitable software tool that can integrate the major works in the field. In this paper, we present Multi-Imbalance, an open source software package for multi-class imbalanced data classification. It provides users with seven different categories of multi-class imbalance learning algorithms, including the latest advances in the field. The source codes and documentations for Multi-Imbalance are publicly available at https://github.com/chongshengzhang/Multi_Imbalance.
Article
The authors describe the development of a set of three supervised machine-learning models, which the New York City Police Department uses to help identify related crimes, including burglaries, robberies, and grand larcenies.
Article
Extending previous work on quantile classifiers (q-classifiers), we propose the q*-classifier for the class imbalance problem. The classifier assigns a sample to the minority class if the minority class conditional probability exceeds q*, where 0 < q* < 1 equals the unconditional probability of observing a minority class sample. The motivation for q*-classification stems from a density-based approach and leads to the useful property that the q*-classifier maximizes the sum of the true positive and true negative rates. Moreover, because the procedure can be equivalently expressed as a cost-weighted Bayes classifier, it also minimizes weighted risk. Because of this dual optimization, the q*-classifier can achieve near zero risk in imbalance problems, while simultaneously optimizing true positive and true negative rates. We use random forests to apply q*-classification. This new method, which we call RFQ, is shown to outperform or be competitive with existing techniques with respect to G-mean performance and variable selection. Extensions to the multiclass imbalanced setting are also considered.
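A hedged sketch of the q* rule using an ordinary scikit-learn forest rather than the authors' RFQ implementation: the decision threshold on the minority-class probability is set to the minority prevalence in the training data.

```python
# q*-style thresholding: predict the minority class when P(minority | x) > q*.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)  # 1 = minority
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
q_star = y_tr.mean()                                   # unconditional minority probability
y_hat = (rf.predict_proba(X_te)[:, 1] > q_star).astype(int)

print("default 0.5 threshold:", round(balanced_accuracy_score(y_te, rf.predict(X_te)), 3))
print(f"q* = {q_star:.3f} threshold:", round(balanced_accuracy_score(y_te, y_hat), 3))
```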
Article
Record linkage is a typical two-class recognition problem in data mining. To improve classification performance on this problem, this paper proposes applying three-way classification to identify uncertain points (regions) for further clerical investigation in decision-making. The detailed three-way decision process is realized by a two-phase approach. During the first phase, an information granule is constructed to describe the uncertain region in the data space. In the second phase, the constructed granule is utilized to discriminate between certain points (those with a high likelihood of belonging to one of the classes) and uncertain points (viz. those requiring clerical attention). For uncertain points, manual investigation is carried out; for certain points, a generic binary classifier is applied. Synthetic data and publicly available data are used to demonstrate the performance of the proposed approach. Finally, the proposed approach is shown to be effective in applications involving real-world record linkage data.
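The three-way decision can be illustrated with simple probability thresholds standing in for the information granule (the thresholds below are assumptions, not the paper's construction): confident points are classified automatically and the uncertain band is routed to clerical review.

```python
# Three-way split of predictions: match / non-match / clerical review.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_match = clf.predict_proba(X_te)[:, 1]

alpha, beta = 0.8, 0.2                      # accept / reject thresholds (illustrative)
decision = np.where(p_match >= alpha, "match",
            np.where(p_match <= beta, "non-match", "clerical review"))
print({d: int((decision == d).sum()) for d in ["match", "non-match", "clerical review"]})
```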
Article
The class imbalance issue has been a persistent problem in machine learning that hinders accurate predictive analysis of data in many real-world applications. The class imbalance problem exists when the number of instances in one class (or classes) is significantly smaller than the number of instances belonging to another class (or classes). Sufficiently recognizing the minority class during classification is a problem, as most learning algorithms are biased toward the majority class. The underlying issue is made more complex by the presence of difficult data factors embedded in such inputs. This paper presents a novel and effective ensemble-based method for dealing with the class imbalance problem. It is motivated by the idea of moving oversampling from the data level to the algorithm level: instead of increasing the minority instances in the data sets, the algorithm aims to "oversample the classification ensemble" by increasing the number of classifiers that represent the minority class in the ensemble, i.e., the random forest. The proposed biased random forest algorithm employs the nearest neighbor algorithm to identify the critical areas in a given data set. The standard random forest is then fed with more random trees generated based on the critical areas. The results show that the proposed algorithm is very effective in dealing with the class imbalance problem.
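A rough sketch of the biased-random-forest idea (neighbourhood size, forest sizes and the pooling rule are illustrative choices, not the published BRAF settings): nearest neighbours of minority samples define the critical area, an extra forest is trained on it, and the two forests are pooled.

```python
# Biased-forest sketch: a second forest concentrated on the borderline "critical" area.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)  # 1 = minority

# Critical area: every minority sample plus the samples among its k nearest neighbours.
nn = NearestNeighbors(n_neighbors=10).fit(X)
neigh = nn.kneighbors(X[y == 1], return_distance=False)
critical_idx = np.unique(np.concatenate([np.flatnonzero(y == 1), neigh.ravel()]))

rf_main = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_crit = RandomForestClassifier(n_estimators=100, random_state=1).fit(X[critical_idx], y[critical_idx])

def predict_proba(X_new):
    """Average the two forests, i.e. a pool of trees biased toward the borderline region."""
    return (rf_main.predict_proba(X_new) + rf_crit.predict_proba(X_new)) / 2.0

print(predict_proba(X[:3]))
```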
Article
Payment card fraud leads to heavy annual financial losses around the world, thus giving rise to the need for improvements to the fraud detection systems used by banks and financial institutions. In academia as well, payment card fraud detection has become an important research topic in recent years. With these considerations in mind, we developed a method that involves two stages of detecting fraudulent payment card transactions. The extraction of suitable transactional features is one of the key issues in constructing an effective fraud detection model. In this method, additional transaction features are derived from primary transactional data, creating a better understanding of cardholders' spending behaviors, after which the first stage of detection is initiated. A cardholder's spending behaviors vary over time, so that a cardholder's new behavior is closer to his or her recent behaviors. Accordingly, a new similarity measure is established on the basis of transaction time in this stage, assigning greater weight to recent transactions. In the second stage, the dynamic random forest algorithm is employed for the first time in initial detection, and the minimum risk model is applied in cost-sensitive detection. We tested the proposed method on a real transactional dataset obtained from a private bank. The results showed that the recent behavior of cardholders exerts a considerable effect on decision-making regarding the evaluation of transactions as fraudulent or legitimate. The findings also indicated that using both primary and derived transactional features increases the F-measure. Finally, an average 23% increase in prevention of damage (PoD) is achieved with the proposed cost-sensitive approach.
Article
Fuzzy clustering has emerged as one of the fundamental conceptual and algorithmic frameworks supporting the development of information granules. Generic fuzzy clustering such as Fuzzy C-Means (FCM) has been utilized in a broad range of applications. However, the constructs resulting from fuzzy clustering, namely the partition matrix and prototypes, are numeric and as such are not capable of fully capturing the essence of the data. In this study, we propose an alternative augmented way of building information granules by generating hypercube-like information granules. A collection of hypercubes is referred to as a family of ε-information granules. This family is constructed around numeric prototypes generated through a modified version of the FCM algorithm whose running time is linear with respect to the number of clusters. Next, by admitting a certain level of information granularity (ε), a collection of hypercubes is formed around these prototypes. The quality of information granules realized in this way is assessed by involving them in the granulation-degranulation process and determining a value of the coverage criterion. The level of information granularity and the number of granular prototypes in the family of ε-information granules form an important design asset directly impacting the obtained coverage level of the data. The computational facet of the approach is stressed. It has been demonstrated that the granular enhancements of the description of data come with a very limited computing overhead. Experimental studies involve synthetic data as well as data coming from the UCI Machine Learning repository. The granular reconstruction capabilities delivered by the family of ε-information granules are discussed.
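A sketch of the hypercube construction under stated assumptions: k-means prototypes stand in for the modified FCM of the paper, a hypercube of half-width ε is placed around each prototype, and coverage is the fraction of data falling inside at least one hypercube.

```python
# Hypercube (epsilon-granule) coverage around clustering prototypes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
prototypes = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X).cluster_centers_

def coverage(X, prototypes, eps):
    inside_any = np.zeros(len(X), dtype=bool)
    for p in prototypes:
        inside_any |= np.all(np.abs(X - p) <= eps, axis=1)   # point lies inside this hypercube
    return inside_any.mean()

for eps in [0.5, 1.0, 2.0]:
    print(f"epsilon = {eps}: coverage = {coverage(X, prototypes, eps):.3f}")
```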
Article
Enterprise credit evaluation models are an important tool for bank and enterprise risk management, but how to construct an effective decision tree (DT) ensemble model for imbalanced enterprise credit evaluation has seldom been studied. This paper proposes a new DT ensemble model for imbalanced enterprise credit evaluation based on the synthetic minority over-sampling technique (SMOTE) and the Bagging ensemble learning algorithm with differentiated sampling rates (DSR), named DTE-SBD (Decision Tree Ensemble based on SMOTE, Bagging and DSR). In different iterations of base DT classifier training, new positive (high-risk) samples are produced to different degrees by SMOTE with DSR, and different numbers of negative (low-risk) samples are drawn with replacement by Bagging with DSR. In the same iteration with a certain sampling rate, however, the training positive samples, including the original and the new, are of the same number as the drawn training negative samples, and they are combined to train a base DT classifier. Therefore, DTE-SBD can not only address the class imbalance problem of enterprise credit evaluation but also increase the diversity of base classifiers for the DT ensemble. Empirical experiments are carried out 100 times with the financial data of 552 Chinese listed companies, and the performance of imbalanced enterprise credit evaluation is compared among six models: pure DT, over-sampling DT, over-under-sampling DT, SMOTE DT, Bagging DT, and DTE-SBD. The experimental results indicate that DTE-SBD significantly outperforms the other five models and is effective for imbalanced enterprise credit evaluation.
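The SMOTE-plus-Bagging combination with differentiated sampling rates can be sketched as follows; the rates, ensemble size and data are illustrative rather than the DTE-SBD configuration, and the `imbalanced-learn` package is assumed for SMOTE.

```python
# Each round: oversample the minority to a different degree, bootstrap the majority, train one tree.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)  # 1 = minority
rng = np.random.default_rng(0)
trees = []

for rate in [0.3, 0.5, 0.7, 0.9, 1.0]:                         # differentiated sampling rates
    X_res, y_res = SMOTE(sampling_strategy=rate, random_state=0).fit_resample(X, y)
    n_min = int((y_res == 1).sum())
    maj_idx = rng.choice(np.flatnonzero(y_res == 0), size=n_min, replace=True)   # bootstrap majority
    idx = np.concatenate([maj_idx, np.flatnonzero(y_res == 1)])                  # balanced round
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_res[idx], y_res[idx]))

votes = np.mean([t.predict(X) for t in trees], axis=0)         # majority vote over the ensemble
print("ensemble minority recall:", round(((votes > 0.5) & (y == 1)).sum() / (y == 1).sum(), 3))
```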
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
Current benchmark reports of classification algorithms generally concern common classifiers and their variants but do not include many algorithms that have been introduced in recent years. Moreover, important properties such as the dependency on number of classes and features and CPU running time are typically not examined. In this paper, we carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies. It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines (SVM) and Random Forests (RF), while being the fastest algorithm in terms of prediction efficiency. ELM also yields good accuracy results, ranking in the top-5, alongside GBDT, RF, SVM, and C4.5 but this performance varies widely across all data sets. Unsurprisingly, top accuracy performers have average or slow training time efficiency. DL is the worst performer in terms of accuracy but second fastest in prediction efficiency. SRC shows good accuracy performance but it is the slowest classifier in both training and testing.
Article
Serial crimes pose a great threat to public security. Linking crimes committed by the same offender can assist the detection of serial crimes and is of great importance in maintaining public security. Currently, most crime analysts still link serial crimes empirically, especially in China, and desire quantitative tools to help them. This paper presents a decision support system for crime linkage based on various features of criminal cases, including behavioral ones. Its underlying technique is pairwise classification based on similarity, which is interpretable and easy to tune. We design feature similarity algorithms to calculate the pairwise similarities and build a classifier to determine whether a case pair belongs to a series. A comprehensive case study of a real-world robbery dataset demonstrates its promising performance even with the default settings. The system has been deployed in a public security bureau of China and has been running for more than one year with positive feedback from users. The use of this system provides individual officers with strong support in crime investigation and allows law enforcement agencies to save resources, since the system not only can link serial crimes automatically based on a classification model learned from historical crime data, but also has flexibility in training data updates and domain expert interaction, including adjusting key components such as similarity matrices and decision thresholds to reach a good tradeoff between caseload and the number of true linked pairs.
Article
Rare events, especially those that could potentially negatively impact society, often require humans' decision-making responses. Detecting rare events can be viewed as a prediction task in the data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in-depth review of rare event detection from an imbalanced learning perspective. Five hundred and seventeen related papers published in the past decade were collected for the study. The initial statistics suggest that rare event detection and imbalanced learning are of concern across a wide range of research areas, from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then detail the applications for each category. Finally, some suggestions from the reviewed papers are combined with our experience and judgment to offer further research directions for the imbalanced learning and rare event detection fields.
Article
Advanced biomedical instruments and data acquisition techniques generate large amounts of physiological data. For accurate diagnosis of the related pathology, it has become necessary to develop new methods for analyzing and understanding these data. Clinical decision support systems are designed to provide real-time guidance to healthcare experts and are evolving as an alternative strategy to increase the accuracy of diagnostic testing. The generalization ability of these systems is governed by the characteristics of the dataset used during their development. Sub-pathologies occur at widely varying rates in the population, making such datasets extremely imbalanced. This problem can be addressed at both the data level and the algorithmic level. This work proposes a synthetic sampling technique to balance the dataset, along with a Modified Particle Swarm Optimization (M-PSO) technique. A comparative study of multiclass support vector machine (SVM) classifier optimization algorithms based on grid selection (GSVM), hybrid feature selection (SVMFS), a genetic algorithm (GA) and M-PSO is presented. Empirical analysis of five machine learning algorithms demonstrates that M-PSO statistically outperforms the others.
Article
Decision trees are a simple and effective method, and they can be supplemented with ensemble methods to improve performance. Random Forest and Rotation Forest are two approaches which are perceived as "classic" at present. They can build more accurate and diverse classifiers than Bagging and Boosting by introducing diversity, namely a randomly chosen subset of features or a rotated feature space. However, the splitting criteria used for constructing each tree in Random Forest and Rotation Forest are the Gini index and the information gain ratio respectively, which are skew-sensitive. When learning from highly imbalanced datasets, class imbalance impedes their ability to learn the minority class concept. The Hellinger distance decision tree (HDDT), which is skew-insensitive, was proposed by Chawla. In particular, bagged unpruned HDDTs have proven to be an effective way to deal with highly imbalanced problems. Nevertheless, the bootstrap sampling used in Bagging can lead to ensembles of low diversity compared to Random Forest and Rotation Forest. In order to combine the skew-insensitivity of HDDT and the diversities of Random Forest and Rotation Forest, we use the Hellinger distance as the splitting criterion for building each tree in Random Forest and Rotation Forest respectively. An experimental framework is applied across a wide range of highly imbalanced datasets to investigate the effectiveness of the Hellinger distance, information gain ratio and Gini index as splitting criteria in ensembles of decision trees, including Bagging, Boosting, Random Forest and Rotation Forest. In addition, Balanced Random Forest is also included in the experiment, since it is designed to tackle the class imbalance problem. The experimental results, contrasted through nonparametric statistical tests, demonstrate that using the Hellinger distance as the splitting criterion to build individual decision trees in the forest can improve the performance of Random Forest and Rotation Forest for highly imbalanced classification.
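The Hellinger splitting criterion itself is compact; the helper below computes, for a binary target and a binary split, the distance between the class-conditional branch distributions used by HDDT-style trees (an illustrative implementation, not the paper's code).

```python
# Hellinger distance of a candidate binary split: larger values = more informative, skew-insensitive split.
import numpy as np

def hellinger_split_value(y, left_mask):
    """y: 0/1 labels; left_mask: boolean mask of samples sent to the left child."""
    pos, neg = (y == 1), (y == 0)
    terms = 0.0
    for branch in (left_mask, ~left_mask):
        p_pos = (branch & pos).sum() / max(pos.sum(), 1)   # P(branch | positive class)
        p_neg = (branch & neg).sum() / max(neg.sum(), 1)   # P(branch | negative class)
        terms += (np.sqrt(p_pos) - np.sqrt(p_neg)) ** 2
    return np.sqrt(terms)

y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])
split = np.array([True, True, True, False, False, False, False, False, True, False])
print("Hellinger distance of candidate split:", round(hellinger_split_value(y, split), 3))
```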
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
This paper investigates how text analysis and classification techniques can be used to enhance e-government, particularly law enforcement agencies' efficiency and effectiveness, by analyzing text reports automatically and providing timely supporting information to decision makers. With an increasing number of anonymous crime reports being filed and digitized, it is generally difficult for crime analysts to process and analyze crime reports efficiently. Complicating the problem is that the information has not been filtered or guided in a detective-led interview, resulting in much irrelevant information. We are developing a decision support system (DSS), combining natural language processing (NLP) techniques, similarity measures, and machine learning, i.e., a Naive Bayes classifier, to support crime analysis and to classify which crime reports discuss the same or different crimes. We report on an algorithm essential to the DSS and its evaluations. Two studies with small and large datasets were conducted to compare the system with a human expert's performance. The first study includes 10 sets of crime reports discussing 2 to 5 crimes. The highest algorithm accuracy was found using binary logistic regression (89%), while the Naive Bayes classifier was only slightly lower (87%). The expert achieved still better performance (96%) when given sufficient time. The second study includes two datasets with 40 and 60 crime reports discussing 16 different types of crimes for each dataset. The results show that our system achieved the highest classification accuracy (94.82%), while the crime analyst's classification accuracy (93.74%) is slightly lower.
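In the spirit of the text-classification component, a minimal sketch with toy reports (not the study's data): TF-IDF features of report narratives feed a Naive Bayes classifier that predicts which crime a new report discusses.

```python
# Toy text pipeline: TF-IDF features of crime report narratives + Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "male suspect took wallet at knife point near the bus stop",
    "victim robbed of wallet by suspect with knife close to bus station",
    "window forced overnight, television and laptop stolen from living room",
    "rear window broken, laptop taken while the family was away",
]
crime_id = [0, 0, 1, 1]   # toy labels: which crime each report discusses

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(reports, crime_id)
print(model.predict(["suspect threatened victim with a knife and grabbed the wallet"]))
```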
Article
Traditional classification algorithms usually provide poor accuracy on the prediction of the minority class of imbalanced data sets. This paper proposes a new method for dealing with imbalanced data sets by over-sampling the borderline minority class instances. A Support Vector Machine (SVM) classifier is then trained to predict future instances. Compared with other over-sampling methods, the proposed method focuses only on the minority class instances residing along the decision boundary, due to the fact that this region is the most crucial for establishing the decision boundary. Furthermore, the artificial minority instances are generated in such a way that regions of the minority class with fewer majority class instances are expanded by extrapolation; otherwise, the current boundary of the minority class is consolidated by interpolation. Experimental results show that the proposed method achieves a better performance than other over-sampling methods.
Conference Paper
Grouping events that share similarities has always been of interest to analysts. When a label is put on top of a set of events to denote that they share common properties, the automation and the capability to conduct reasoning with this set drastically increase. This is particularly true for crime analysts considering criminal events: conjunction, interpretation and explanation can be key success factors in apprehending criminals. In this paper, we present the CriLiM methodology for investigating both serious and high-volume crime. Our artifact consists of a tailored computerized crime linkage system, based on a fuzzy MCDM approach, in order to combine spatio-temporal, behavioral, and forensic information. As a proof of concept, series of burglaries are examined from real data and compared to expert results.
Article
The study of data complexity metrics is an emergent area in the field of data mining and is focused on the analysis of several data set characteristics to extract knowledge from them. This information can be used to support the selection of the proper classification algorithm. This paper addresses the analysis of the relationship between data complexity measures and classifier behavior. Each of the metrics is evaluated across its range of values, and the classifiers' accuracy is studied over these values. The results offer information about the usefulness of these measures and about which of them allow us to analyze the nature of the input data set and help us decide which classification method could be the most promising one.
Article
In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, F-measure and weighted accuracy are computed. Both methods are shown to improve the prediction accuracy of the minority class, and have favorable performance compared to the existing algorithms.
Article
Whilst case linkage is used with serious forms of serial crime (e.g. rape and murder), the potential exists for it to be used with volume crime. This study replicates and extends previous research on the behavioural linking of burglaries. One hundred and sixty solved residential burglaries were sampled from a British police force. From these, 80 linked crime pairs (committed by the same serial offender) and 80 unlinked crime pairs (committed by two different serial offenders) were created. Following the methodology used by previous researchers, the behavioural similarity, geographical proximity, and temporal proximity of linked crime pairs were compared with those of unlinked crime pairs. Geographical and temporal proximity possessed a high degree of predictive accuracy in distinguishing linked from unlinked pairs as assessed by logistic regression and receiver operating characteristic analyses. Comparatively, other traditional modus operandi behaviours showed less potential for linkage. Whilst personality psychology literature has suggested we might expect to find a relationship between temporal proximity and behavioural consistency, such a relationship was not observed. Copyright © 2010 John Wiley & Sons, Ltd.