All content in this area was uploaded by Olta Llaha on Dec 28, 2021
Crime Analysis and Prediction using Machine
Learning
Olta Llaha
South East European University/Faculty of Contemporary Sciences and Technologies, Tetovo, North Macedonia
E-mail: ol29064@seeu.edu.mk
Abstract - Data mining and machine learning have become a
vital part of crime detection and prevention. The purpose of
this paper is to evaluate data mining methods and their
performances that can be used for analyzing the collected
data about the past crimes. I identified the most appropriate
data mining methods to analyze the collected data from
sources specialized in crime prevention by comparing them
theoretically and practically. The attributes of this dataset
include gender, age, employment status, and crime place. The
methods are applied to these data to determine their
effectiveness in analyzing and preventing crime. Evaluations on
the data showed that the method with the highest performance is
“Decision Tree”. It achieved better results than the other
methods on performance measures such as the number of
correctly classified instances (accuracy), precision, and
recall. I come to the
conclusion that the data mining methods contribute to the
predictions on the possibility of occurrence of the crime and
as a result in its prevention.
Keywords - Machine Learning, Prediction, Crime Analysis,
Data Mining
I. INTRODUCTION
The increase in crime data recording coupled with data
analytics resulted in the growth of research approaches
aimed at extracting knowledge from crime records to better
understand criminal behavior and ultimately prevent future
crimes.
Crime is a complex social phenomenon that has grown
due to major changes in society. Law enforcement
agencies need to learn the factors that lead to an increase
in crime tendency. To curb this, there is always a need for
strategies and policies to prevent crime. As a result of
technology development, science and information, data
mining and artificial intelligence tools are increasingly
prevalent in the law enforcement community.
Law enforcement agencies face a large volume of data
that needs to be processed and turned into useful
information, and data mining can improve crime analysis
by helping to predict and prevent it. By processing
criminal data, law enforcement agencies can use models
that may be important in the crime prevention process.
The use of data mining accelerates data analysis, and
analysts can examine existing data to identify patterns and
trends of crime. This paper is structured as follows:
Section 2 describes the relationship between
data mining, machine learning and criminology. The
methodology and the dataset are described in
Section 3. Sections 4 and 5 give a theoretical
description of the methods and algorithms that are
applied to our data. Section 6 presents the
results of applying the algorithms and an explanation
of the algorithm with the best results. Section 7 discusses
the conclusions and future work.
II. USING DATA MINING AND MACHINE
LEARNING IN CRIMINOLOGY
Criminology is the scientific study of crime and
criminal behavior. It is one of the most important
areas for applying data mining techniques, which
can produce significant results [1].
Crime analysis, as part of criminology, is tasked with
exploring and discovering crime and its relationship with
criminals. Law enforcement is a process that aims to
identify the characteristics of crime. Identifying crime
characteristics is the first step in developing further
analysis. The high volume of crime data and the
complexity of the relationships between them have made
criminology an appropriate field for applying data mining
techniques [2].
Data mining can be used to examine many large
datasets involving more variables than a single analyst,
or even an analytical team or task force, can correctly
consider, whereas machine learning uses neural
networks, predictive models and automated algorithms to
make decisions. Like any other problem-solving
method, the task of data mining begins with a problem
definition. The identification of the data mining problem
enables the determination of the data mining process and
MIPRO 2020/CTI
the modeling technique. Machine learning is a subfield of
data science that deals with algorithms able to learn from
data and make accurate predictions [3]. Data mining gives
law enforcement agencies the opportunity to learn about
crime trends and about how and why crimes are committed.
Using data mining methods and machine learning improves
crime analysis and helps reduce and prevent crime.
III. DATA AND METHODOLOGY
I compare data mining methods theoretically and
practically to discover the most appropriate method for our
data. The methods were compared by applying machine
learning algorithms to concrete data in the WEKA
(“Waikato Environment for Knowledge Analysis”) [4]
environment. The implemented algorithms are: Simple
Logistic, Logistic, Multilayer Perceptron, Naive Bayes,
Bayes Net, SMO, and C4.5. In the data collection step, I
collected data from law enforcement agencies. The
collected data are stored in a database for further processing.
They describe the areas where the crimes occurred and the
perpetrators.
The dataset is made up of 100 records or instances.
Table 1. Dataset details
The variables or attributes of this dataset are: age
(from 17 to 55 years old), gender, education (middle
school, high school, or university), employment status
(employed or not), civil status (married,
single, or divorced), the area where the crime occurred
(urban or rural), and whether the person who committed
the crime was previously convicted. The crime dataset is
in CSV format.
IV. CLASSIFICATION METHODS
Classification is a data mining technique that
categorizes data in order to assist in more accurate
predictions and analyses [5, 6]. It is one of the data mining
methods that aims to analyze very large datasets. It is used
to derive patterns that accurately define the important data
classes within the data set. Classification consists of
predicting a given result based on a given input [6].
Classification algorithms attempt to detect relationships
between attributes that would make it possible to predict
the result. They analyze the input and produce a
prediction.
A. Artificial Neural Networks
Neural networks are an area of Artificial Intelligence
(AI) inspired by the human brain. They are used
to find data structures and algorithms for learning
and classifying data. By applying neural network
techniques, a program can learn from the examples and
create an internal set of rules for classifying different
inputs. Artificial Neural Networks (ANNs) are capable of
predicting new observations from existing observations. A
neural network consists of interconnected processing
elements also called units, nodes, or neurons [5].
All processing in a neural network is performed by this
group of neurons or units. Each neuron is a separate
communication device whose operation is relatively
simple. The function of a unit is simply to receive data
from other units, calculate an output value as a function of
the inputs it receives, and send that value to other units. In
artificial neural networks, neurons are organized in layers
which process information using dynamic state responses
to external inputs [6]. The Multilayer Perceptron (MLP) is
a feed-forward artificial neural network model that maps
sets of input data to a set of appropriate outputs [7]. In a
feed-forward neural network, the input signal traverses the
neural network in a forward direction from the input layer
to the output layer through the hidden layers.
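The forward pass described above can be sketched in a few lines of Python. This is only an illustration: the layer sizes, the weight values and the sigmoid activation are assumptions chosen for the example, not the configuration WEKA used in the experiments.

```python
import math


def sigmoid(z):
    """Logistic activation: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[j] holds the incoming weights of unit j. Each unit sums its
    weighted inputs, adds its bias and applies the activation.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the signal layer by layer, from input to output."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
example_layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
output = mlp_forward([1.0, 0.0], example_layers)
```

Training, which adjusts the weights from examples, is deliberately left out; the sketch only shows how a signal traverses the layers.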
B. Naive Bayes Classifier
Bayesian classification is both a supervised
learning method and a statistical classification
method. It assumes an underlying probabilistic model,
which allows us to capture uncertainty about the model in a
principled way by determining the probabilities of the
outcomes. The Naive Bayes classifier is based on
Bayes' theorem and is particularly suited to cases where the
dimensionality of the inputs is high [5, 8]. It is a
simple probabilistic classifier based on applying Bayes'
theorem with strong (naive) independence assumptions.
Bayesian classification provides practical learning
algorithms in which prior knowledge and observed data
can be combined. It also provides a useful
perspective for understanding and evaluating many
learning algorithms, and it calculates explicit
probabilities for hypotheses. The algorithm works as follows. Bayes'
theorem offers a way to calculate the probability of a
hypothesis based on our prior knowledge.
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$
where:
P(c|x) is the posterior probability of class (target)
given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood, which is the probability
of predictor given class.
P(x) is the prior probability of predictor.
Naive Bayes assumes that the effect of the value of a
predictor (x) on a given class (c) is independent of the
values of the other predictors. The Naive Bayes classifier
can be trained efficiently in supervised learning [8]. After
calculating the conditional probability for each
hypothesis, I can select the hypothesis (class) with the
highest probability. An advantage of the Naive Bayes
classifier is that it requires only a small amount of training
data to estimate the parameters (mean and variance of the
variables) needed for the classification [8]. Because the
variables are assumed to be independent, only the
variances of the variables for each class need to be
determined, not the full covariance matrix. The Naive
Bayes classifier is fast and incremental, can deal with
discrete and continuous attributes, often performs
well, and can explain its decisions.
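To make the calculation concrete, the following is a minimal sketch of a Naive Bayes classifier for categorical attributes. The six records and their labels are invented for illustration and are not taken from the crime dataset. The function names are hypothetical, and smoothing of zero counts is omitted for brevity.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(record, class_counts, value_counts):
    """Return the class with the highest score P(c) * prod_i P(x_i | c).

    P(x) is the same for every class, so it is left out of the score.
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, value in enumerate(record):
            score *= value_counts[(i, c)][value] / n_c  # likelihood P(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Invented toy data: (area, employment status) -> previously convicted?
records = [
    ("urban", "unemployed"), ("urban", "employed"), ("rural", "unemployed"),
    ("urban", "unemployed"), ("rural", "employed"), ("rural", "employed"),
]
labels = ["yes", "no", "no", "yes", "no", "yes"]

class_counts, value_counts = train_naive_bayes(records, labels)
predicted, scores = predict(("urban", "unemployed"), class_counts, value_counts)
```

For the query record the score for "yes" is 0.5 × 2/3 × 2/3 and the score for "no" is 0.5 × 1/3 × 1/3, so "yes" is selected.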
C. Support Vector Machine
Support Vector Machines are based on the concept of
decision planes that define decision boundaries.
A decision plane is one that separates a set of objects
having different class memberships. Classifiers that
are based on dividing lines between objects of different
class membership are known as hyperplane classifiers
[9]. SVMs are a set of related supervised learning methods
used for classification and regression. The Support Vector
Machine (SVM) is primarily a classification method that
performs classification tasks by constructing hyperplanes
in a multidimensional space. The SVM uses statistical
learning theory to search for a regularized hypothesis that
fits the available data well without over-fitting. It
can handle multiple continuous and categorical variables [9].
The efficiency of SVM-based classification is not
directly dependent on the dimension of the classified
entities. SVM can also be extended to learn nonlinear
decision functions by first projecting the input data into a
high dimensional space using kernel functions and
formulating a linear classification problem in that space.
SMO (Sequential Minimal Optimization) implements
John C. Platt's sequential minimal optimization algorithm
for training a Support Vector classifier using polynomial
or RBF (Radial Basis Function) kernels [9]. This
implementation globally replaces all missing values and
transforms nominal attributes into binary ones. It also
normalizes all attributes by default. The choice of kernel
function and of the parameter values for that kernel is
critical for a given set of data.
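As an illustration of the two kernel families named above, the sketch below computes a polynomial and an RBF kernel value for a pair of feature vectors. The parameter names and default values (degree, coef0, gamma) are assumptions made for the example and are not WEKA's SMO defaults.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel: (u . v + coef0) ** degree."""
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


poly_value = polynomial_kernel([1.0, 2.0], [3.0, 4.0])  # (11 + 1) ** 2
rbf_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])           # identical points
rbf_far = rbf_kernel([0.0, 0.0], [1.0, 1.0])            # exp(-0.5 * 2)
```

A kernel value plays the role of an inner product in the high dimensional space, which is why the classifier never has to build that space explicitly.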
D. The decision tree
The decision tree is a method in which data is
presented in a tree structure based on the values of their
attributes. It splits the data in the database into subsets
based on the values of one or more fields. This process is
repeated recursively for each subset until all
instances at a node belong to a single class. The result of the
decision tree is a tree-shaped structure that describes a
series of decisions made at each step [5, 6]. These
decisions are then treated as rules for the classification
task. The algorithms commonly used to construct decision
trees are ID3 and C4.5.
The ID3 (Iterative Dichotomiser 3) algorithm [10]
induces classification models, or decision trees, from data.
It is a supervised learning algorithm that is trained by
examples for different classes. After being trained, the
algorithm should be able to predict the class of a new item.
ID3 identifies attributes that differentiate one class from
another. All attributes must be known in advance, and
must also be either continuous or selected from a set of
known values. For instance, temperature (continuous), and
country of citizenship (set of known values) are valid
attributes. To determine which attributes are the most
important, ID3 uses the statistical property of entropy [10].
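The entropy of a set of class labels can be computed as in the following sketch, which uses the standard Shannon definition. The function name and the example labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


mixed = entropy(["yes", "yes", "no", "no"])  # evenly split classes
pure = entropy(["yes", "yes", "yes", "yes"])  # a single class
```

An evenly split two-class set has an entropy of 1 bit, and a set containing a single class has an entropy of 0, which is why a node with zero entropy needs no further splitting.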
The C4.5 algorithm [11] is a development of the ID3
algorithm [11]. Both rely on information gain, an
entropy-based measure of how well a given attribute
separates the training samples into the output classes;
C4.5 normalizes it into the gain ratio, which reduces the
bias toward attributes with many values. The algorithm
takes training samples as input, in the form of sample data
used to build a tree that is then validated. C4.5 works by
grouping the training samples so that the resulting
decision tree reflects the facts in the training data.
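The attribute selection step can be sketched as follows. Information gain is the reduction in entropy obtained by splitting on an attribute. The gain ratio function shows the normalized criterion usually attributed to C4.5; it is included as an addition that the text above does not spell out. The function names and toy values are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(items)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(items).values()
    )


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # the attribute has a single value, so it cannot split
    return information_gain(values, labels) / split_info


labels = ["yes", "yes", "no", "no"]
perfect = information_gain(["urban", "urban", "rural", "rural"], labels)
useless = information_gain(["urban", "rural", "urban", "rural"], labels)
```

An attribute that separates the classes perfectly gains the full 1 bit, while one that leaves every subset evenly mixed gains nothing.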
V. ASSOCIATION RULES AND REGRESSION
Association rule mining is one of the most important
canonical tasks in data mining and probably one of the
most studied techniques for pattern discovery. Association
rules are if/then statements that help to uncover
relationships between seemingly unrelated data in a database,
relational database or other information repository [12].
They are used to find relationships between
objects that are frequently used together [12].
Association rules identify the arguments found together
with a given event or record: "the presence of one set of
arguments brings the presence of another set". This is how
rules of the following type are identified: "if argument A is
part of an event, then with a certain probability argument B
is also part of the event" [13]. The objective of association
rule mining is to discover interesting association or
correlation relationships among a large set of data items.
Support and confidence are the best-known measures for
evaluating the interestingness of association rules.
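Support and confidence can be computed directly from their definitions, as in the sketch below. The four transactions are invented for illustration and do not come from the crime dataset; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule A -> B.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented toy records, each described as a set of attribute values.
transactions = [
    {"urban", "unemployed"},
    {"urban", "unemployed"},
    {"urban", "employed"},
    {"rural", "unemployed"},
]

rule_support = support(transactions, {"urban", "unemployed"})
rule_confidence = confidence(transactions, {"urban"}, {"unemployed"})
```

Here the rule "urban -> unemployed" has a support of 2/4 and a confidence of (2/4) / (3/4) = 2/3.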
While classification predicts categorical, discrete
labels, regression predicts continuous values. So
regression is used mainly to predict missing numeric data
values rather than discrete class labels. Regression analysis
is a statistical methodology often used for numerical
prediction, although there are other methods for doing this
[14]. Regression also involves identifying distribution
trends based on available data. For this purpose
regression trees can be used, that is, decision trees whose
nodes have numerical values instead of categorical values.
Linear regression is a mathematical technique that
fits a mathematical equation to a numerical data set
[14]. Logistic regression, on the other hand, estimates
the probability of an event occurring
under certain circumstances, using the factors observed
together with the occurrence of the event [14].
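The two regression variants can be illustrated with a short sketch: an ordinary least squares fit of a straight line with one predictor, and the logistic function that turns a linear score into a probability. The data points and function names are invented for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def logistic_probability(x, slope, intercept):
    """Logistic regression output: probability of the event given x."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))


# Invented points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
probability = logistic_probability(0.0, slope=1.5, intercept=0.0)
```

The line fit recovers the slope and intercept of the generating equation, and a linear score of zero maps to a probability of 0.5.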
VI. EXPERIMENTAL RESULTS
To conduct this study I used the WEKA [4] software because
of its approach and my familiarity with it. WEKA is a
collection of machine learning algorithms for data mining
tasks. It contains tools for data pre-processing,
classification, regression, association rules, and
visualization. It can be used to detect the various hidden
patterns in our dataset and find the most determining data
factors.
Figure. 1. Pre-processed data visualization
Experiments were run using cross-validation with the default
option folds = 10. Cross-validation is a technique for
evaluating predictive models by partitioning the original
sample into a training set to train the model and a test set
to evaluate it. The process is repeated 10 times, once for
each fold, so that every fold serves as the test set exactly
once. Performance indicators are given in
Table 2.
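The partitioning behind 10-fold cross-validation can be sketched as follows. This is a simplified illustration with a hypothetical function name: it omits the shuffling and stratification that a real evaluation would normally apply.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# With 100 instances, each fold tests on 10 and trains on the other 90.
folds = list(k_fold_indices(100, k=10))
```

Each instance appears in exactly one test fold, so every record is used for evaluation once and for training nine times.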
Table 2: Comparison of the results of the algorithms applied in WEKA
In this paper I used several algorithms (Table 2),
among them the C4.5 algorithm, which is a Decision Tree
algorithm. Its results were clear and easy
to interpret. The model was constructed by
modifying the parameter values, and this algorithm
classifies crime data with a higher accuracy than other
algorithms of data mining methods. I converted our data to
format. The C4.5 algorithm was then applied to these data.
Figure 2: Performance of algorithms
The C4.5 algorithm for building decision trees is
implemented in WEKA as a classifier called J48, whose
full name is weka.classifiers.trees.J48. The output of
this algorithm and the resulting decision tree are
presented in Figure 3 and Figure 4.
Figure 3: C4.5 (J48) Classifier
Figure 4: Decision Tree
Figure 3 shows the result of applying the C4.5
algorithm: 76 instances (76%) were classified
correctly and 24 instances (24%) were classified
incorrectly.
F-measure is a measure of a test's accuracy. It
considers both the precision and the recall of the test to
compute the score: precision is the number of correct
positive results divided by the number of all positive results
returned by the classifier, and recall is the number of
correct positive results divided by the number of all
relevant samples (all samples that should have been
identified as positive).
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
The results of this algorithm for recall and precision values
are respectively 0.760 (recall) and 0.762 (precision).
$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
True positive (TP): correct positive prediction
False positive (FP): incorrect positive prediction
True negative (TN): correct negative prediction
False negative (FN): incorrect negative prediction
F-measure after the application of the algorithm has the
value 0.761.
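The relationship between these measures can be checked with a short sketch. The functions follow the standard definitions of precision, recall and F-measure. The confusion counts in the second example are hypothetical, while the first example uses the precision and recall values reported above.

```python
def precision(tp, fp):
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Correct positive predictions over all actual positives."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Reported values: precision 0.762 and recall 0.760.
reported_f = round(f_measure(0.762, 0.760), 3)

# Hypothetical confusion counts, for illustration only.
example_p = precision(tp=40, fp=10)
example_r = recall(tp=40, fn=10)
example_f = f_measure(example_p, example_r)
```

Combining the reported precision and recall in this way reproduces the reported F-measure of 0.761.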
The algorithm classified the
crime data based on the dataset attributes, e.g. the place
where the crime occurred (urban or rural areas). For this
algorithm, the number of correctly classified instances,
precision, and recall have the highest values compared to
the other algorithms.
Figure 4 shows the visualization of the decision tree
which is generated by the implementation of the C4.5
algorithm. Through the decision tree generated I
understand in which areas more crimes occur, as well as
the characteristics of the people who committed the crimes.
Having this information helps law enforcement agencies to
create policies or make decisions about areas where the
crime rate is higher.
VII. CONCLUSION AND FUTURE WORK
The purpose of this study is to examine crime analysis
through the applicability of data mining methods in the
process of crime prediction and prevention. The results of
experiments conducted in this research by implementing
algorithms of data mining methods have revealed that
these methods are applicable in the process of crime
prediction. The decision tree as a data mining
classification method has classified crime data at an
accuracy rate of 76%. This method has shown promising
results for the problem of crime prediction as the accuracy
rate is high in the experiments performed. Furthermore,
the decision tree seems more viable due to the fact that in
contrast to other algorithms, it expresses the rules
explicitly. These rules can be expressed in human
language so that anyone can understand them. The use of
machine learning and data mining in crime analysis is
important because data mining methods can be used in the
decision making process. Decision making is very
important in crime prevention in order to choose appropriate
actions and law enforcement strategies. Through our data
analysis, law enforcement agencies can create strategies for
operating in areas where most crimes occur. In a future
extension of this study, models will be created for
predicting crime hot-spots, which would help the
deployment of police to places where crimes occur. Changes
in the algorithms’ behavior will be examined as more data
is added. I also plan to look into developing social link
networks of criminals, suspects and gangs, and to
integrate this study into enterprise software
that will be created.
REFERENCES
[1] K. Zakir Hussain, M. Durairaj and G. R. J. Farzana, "Criminal
behavior analysis by using data mining techniques," IEEE-International
Conference On Advances In Engineering, Science And Management
(ICAESM -2012), Nagapattinam, Tamil Nadu, 2012, pp. 656-658.
[2] Keyvanpour, Mohammad & Javideh, Mostafa & Ebrahimi,
Mohammadreza. (2011). Detecting and investigating crime by means
of data mining: A general crime matching framework. Procedia CS. 3.
872-880. 10.1016/j.procs.2010.12.143.
[3] Ioannis Kavakiotis, Olga Tsave, Athanasios Salifoglou, Nicos
Maglaveras, Ioannis Vlahavas, Ioanna Chouvarda, Machine Learning
and Data Mining Methods in Diabetes Research, Computational and
Structural Biotechnology Journal, Volume 15, 2017, Pages 104-116.
[4] Frank, Eibe & Hall, Mark & Holmes, Geoffrey & Kirkby, Richard &
Pfahringer, Bernhard & Witten, Ian & Trigg, Len. (2010). Weka-A
Machine Learning Workbench for Data Mining. 10.1007/978-0-387-
09823-4_66.
[5] Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar,
Introduction to Data Mining, 2nd ed., Pearson, 2019, Print
ISBN: 9780133128901, 0133128903, eText ISBN: 9780134080284,
013408028
[6] M. Kantardzic, Data Mining Concepts, Models, Methods, and
Algorithms, 2nd ed, John Wiley & Sons, Inc., Hoboken, New Jersey
2011, ISBN 978-0-470-89045-5 , oBook ISBN: 978-1-118-02914-5,
ePDF ISBN: 978-1-118-02912-1, ePub ISBN: 978-1-118-02913-8
[7] Ahishakiye, Emmanuel & Opiyo, Elisha & Wario, Ruth & Niyonzima,
Ivan. (2017). A Performance Analysis of Business Intelligence
Techniques on Crime Prediction. International Journal of Computer
and Information Technology. 06. 84 - 90.
[8] Marlina, Leni & Muslim, Muslim & Siahaan, Andysah Putera Utama.
(2016). Data Mining Classification Comparison (Naïve Bayes and
C4.5 Algorithms). International Journal of Emerging Trends &
Technology in Computer Science. 38. 380-383.
10.14445/22315381/IJETT-V38P268.
[9] Himani Bhavsar, Mahesh H. Panchal, (2012). A Review on Support
Vector Machine for Data Classification, International Journal of
Advanced Research in Computer Engineering & Technology
(IJARCET), Volume 1, Issue 10, December 2012, ISSN: 2278 –
1323.
[10] Xiaohu, Wang & Lele, Wang & Nianfeng, Li. (2012). An Application
of Decision Tree Based on ID3. Physics Procedia. 25. 1017-1021.
10.1016/j.phpro.2012.03.193.
[11] Hssina, Badr & MERBOUHA, Abdelkarim & Ezzikouri, Hanane &
Erritali, Mohammed. (2014). A comparative study of decision tree ID3
and C4.5. (IJACSA) International Journal of Advanced Computer
Science and Applications. Special Issue on Advances in Vehicular Ad
Hoc Networking and Applications.
10.14569/SpecialIssue.2014.040203.
[12] Kumbhare, Trupti A. and Santosh V. Chobe. “An Overview of
Association Rule Mining Algorithms.” (2014).
[13] Chengqi Zhang, Shichao Zhang, Association Rule Mining Models and
Algorithms, ISSN 0302-9743, ISBN 3-540-43533-6 Springer-Verlag
Berlin Heidelberg New York, Springer, 2002
[14] Larose, Daniel T. Data mining methods and models, Published by John
Wiley & Sons, Inc., Hoboken, New Jersey, 2006, ISBN-13 978-0-471-
66656-1 ISBN-10 0-471-66656-4