Conference Paper

An Outlier-based Data Association Method For Linking Criminal Incidents

Authors: Song Lin and Donald E. Brown

Abstract

Serial criminals are a major threat to modern society. Associating incidents committed by the same offender is of great importance in studying serial criminals. In this paper, we present a new outlier-based approach to the criminal incident association problem. In this approach, criminal incident data are first modeled into a number of cells, and then a measurement function, called the outlier score function, is defined over these cells. Incidents in a cell are determined to be associated with each other when the score is sufficiently significant. We applied our approach to a robbery dataset from Richmond, VA. Results show that this method can effectively solve the criminal incident association problem.


... Due to the limitations mentioned above, we want to consider the characteristics of crime series and design a new blocking method. When some crimes share the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20], so we aim to take the rarity of behaviours into account. For the blocking method, we aim to consider multiple criteria to form a combined blocking method (CB) rather than integrating them into a single comprehensive index. ...
... The fewer case pairs a BHK covers, the rarer the behaviour pattern is. Research has shown that when some crimes have the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20]. Therefore, if a BHK covers fewer case pairs, it indicates that these case pairs are more likely to be serial case pairs, and the BHK corresponds to a higher score. ...
Article
Full-text available
Detecting serial crimes means finding criminals who have committed multiple crimes. A classification technique is often used for serial crime detection, but the pairwise comparison of crimes has quadratic complexity, and the number of nonserial case pairs far exceeds the number of serial case pairs. Blocking can reduce pairwise computation and eliminate nonserial case pairs, but most previous studies use a single criterion to select blocks, which makes it difficult to guarantee a good blocking result. Some studies integrate multiple criteria into one comprehensive index; however, the performance is easily affected by the weighting method. In this paper, we propose a combined blocking (CB) approach. Each criminal behaviour is defined as a behaviour key (BHK) and used to form a block. CB learns several weak blocking schemes under different blocking criteria and then combines them to form the final blocking scheme, which consists of several BHKs. Because rare behaviour can better identify crime series, each BHK is assigned a score according to its rarity. BHKs and their scores are used to determine whether a case pair needs to be compared. Compared with multiple blocking methods, CB effectively preserves serial case pairs while greatly reducing unnecessary nonserial case pairs. CB is embedded in a supervised machine learning framework, and experiments on real-world robbery cases demonstrate that it can effectively reduce pairwise comparison, alleviate the class imbalance problem and improve detection performance.
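As a rough sketch of the blocking idea described above, the toy example below forms one block per behaviour key, scores each key by the inverse of the number of case pairs it covers (so rarer behaviours weigh more), and keeps only pairs whose shared-key score clears a threshold. The behaviour codes, the scoring formula and the threshold are illustrative assumptions, not the published CB procedure.
```python
# Hypothetical sketch of behaviour-key (BHK) blocking for crime linkage.
from collections import defaultdict
from itertools import combinations

crimes = {
    "c1": {"knife", "night", "mask"},
    "c2": {"knife", "night"},
    "c3": {"daylight", "unarmed"},
    "c4": {"knife", "mask"},
}

# Build one block per behaviour key.
blocks = defaultdict(set)
for cid, behaviours in crimes.items():
    for b in behaviours:
        blocks[b].add(cid)

# Score each BHK by the inverse of the number of case pairs it covers:
# rarer behaviour -> fewer covered pairs -> higher score.
def pairs_covered(ids):
    n = len(ids)
    return n * (n - 1) // 2

scores = {b: 1.0 / pairs_covered(ids) for b, ids in blocks.items() if pairs_covered(ids) > 0}

# A case pair is kept for detailed comparison when the summed score of the
# BHKs it shares exceeds a threshold (value chosen arbitrarily here).
THRESHOLD = 0.5
candidate_pairs = []
for a, b in combinations(crimes, 2):
    shared = crimes[a] & crimes[b]
    score = sum(scores.get(k, 0.0) for k in shared)
    if score >= THRESHOLD:
        candidate_pairs.append((a, b, round(score, 2)))

print(candidate_pairs)
```
In this toy run only the pairs sharing rare behaviour keys survive, while pairs that share nothing, or only the very common key, are filtered out before any expensive pairwise comparison.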
... Assessing crime through analysis also helps in crime prevention efforts [3][4][5]. Much research has been conducted to solve the problem of linking criminal incidents [6][7][8]. The work in [6] proposed a similarity-based approach, while [7] relies on outlier analysis. In the present scenario, criminals are becoming technologically sophisticated in committing crimes. ...
Article
Full-text available
Crime is canonically “capricious”. It is not necessarily haphazard, but neither does it occur consistently. A better theoretical perception is needed to facilitate practical crime prevention solutions that correspond to specific places and times. Crime analysis and prevention is a systematic approach for identifying and analyzing patterns and trends in crime. Crime data analysts help law enforcement officers speed up the process of solving crimes, owing to the advent of computerized systems. This research is an attempt to forecast the occurrence of crimes, and predicts the frequency (count) of crimes at the beat-day level in the city of Chicago. Forecasting crimes supports crime prevention efforts, and the frequency of crimes helps to focus on the type of crime. This novel work is a collaboration between computer science and criminal justice aimed at developing a data mining procedure that can help solve crimes faster. Instead of focusing on causes of crime occurrence such as political enmity or the criminal background of the offender, the author focused on crime factors for each day.
... This is because the majority of the survey data were geocoded to just 24 postcodes. The approach of inferring association from outlier overlap has been used by researchers in health geography (Mazumdar et al. 2012; Smurthwaite and Bagheri 2017), climatology (Leathers and Robinson 1993), ecology (Nagalingum et al. 2014) and criminology (Doran and Burgess 2011; Lin and Brown 2006). Spatial clusters are the geographic equivalent of statistical outliers. ...
... Spatial clusters are the geographic equivalent of statistical outliers. The underlying logic behind this approach is that since the likelihood of two spatial clusters of different attributes overlapping by chance is extremely small, the two attributes must be associated (Lin and Brown 2006). Thus for example, in the 'Wollongong Study' a group of researchers used a small dataset of 234 people to map hotspots of 'collective avoidance' or areas that people consistently avoid from fear of crime. ...
Article
Full-text available
A growing literature points towards the prevalence of healthy lifestyles, such as adequate walking, in the Central Business District (CBD) of many cities. However, only one study has investigated the presence of walking hotspots. Using coarsely geocoded, routinely collected survey data from the Australian Capital Territory (ACT) and a unique ‘overlap of spatial clusters’ approach, we investigated (a) whether there were patterns of clustering in walking behaviours, (b) whether these clusters were co-located with, and therefore associated with, various built environment characteristics, and (c) what the demographic and active transport use profiles of the residents of these clusters were. A hotspot of walking was found in the CBD of the ACT, co-located with hotspots or spatial clusters of dwelling density, public transport frequency and destination accessibility. Residents of the hotspot walked approximately an hour more than the average ACT resident and had significantly higher odds of using active transport to workplaces relative to driving. Odds ratios of walking relative to driving to work were 9.20 (5.97, 14.18) and of bicycling were 3.54 (1.87, 6.69) in the walking hotspot compared to non-hotspots. Policies directed towards creating a ‘CBD-like’ built environment in various locations of cities, or refurbishing and redesigning existing CBDs to make them more liveable, help encourage walking behaviours. In addition, coarsely geocoded, freely available, routinely collected survey data can be used to investigate relationships that may help drive policy.
... According to Porter (2016), there are three main types of approaches for detecting crime series committed by the same offender, namely pairwise case linkage, reactive linkage, and crime series clustering: (1) pairwise case linkage (Cocx and Kosters, 2006; Lin and Brown, 2006; Nath, 2006) involves identifying whether a pair of crimes was committed by the same offender or criminal group, where each pair is usually considered separately. Works that evaluate the similarity between cases using weights determined by experts include Cocx and Kosters (2006) and Lin and Brown (2006), while other works learn the similarity from data (Nath, 2006). ...
... According to Porter (2016), there are three main types of approaches for detecting crime series committed by the same offender, namely pairwise case linkage, reactive linkage, and crime series clustering: (1) pairwise case linkage (Cocx and Kosters, 2006; Lin and Brown, 2006; Nath, 2006) involves identifying whether a pair of crimes was committed by the same offender or criminal group, where each pair is usually considered separately. Works that evaluate the similarity between cases using weights determined by experts include Cocx and Kosters (2006) and Lin and Brown (2006), while other works learn the similarity from data (Nath, 2006). However, they do not consider M.O. ...
Preprint
Crimes emerge out of complex interactions of behaviors and situations; thus there are complex linkages between crime incidents. Solving the puzzle of crime linkage is a highly challenging task because we often only have limited information from indirect observations such as records, text descriptions, and associated times and locations. We propose a new modeling and learning framework for detecting linkage between crime events using spatio-temporal-textual data, which are highly prevalent in the form of police reports. We capture the notion of modus operandi (M.O.) by introducing a multivariate marked point process and handling the complex text jointly with the time and location. The model is able to discover the latent space that links the crime series. The model fitting is achieved by a computationally efficient Expectation-Maximization (EM) algorithm. In addition, we explicitly reduce the bias in the text documents in our algorithm. Our numerical results using real data from the Atlanta Police show that our method has competitive performance relative to the state-of-the-art. Our results, including variable selection, are highly interpretable and may bring insights into M.O. extraction.
... non-trivial and useful information) to gain insight for knowledge support. Crime patterns often differ and have their own unique modus operandi (MO), since the opportunities available to potential offenders vary across space due to differences in spatial factors [5]. Hence, a spatial framework with features and instances embedded within it is central to deriving a useful spatial analysis. ...
... Our threshold is computed based on a sound mathematical principle and crime expert recommendations. The significance and prevalence thresholds measure the interest similarity support, help to conceptualise the underlying graphical structure, and ensure that a link ensues between two crimes if and only if the support of the similarity attributes is greater than or equal to the parameters S (:= 5) and P. While the parameter S comes from crime intelligence experts, as was also done in previous research [5], the coefficient P is a parameter we learn from the data. The prevalence threshold considers attributes relating to the "day", "time" and "location" information of a crime incident. ...
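A minimal sketch of the dual-threshold rule quoted above, under the simplifying assumption that support is just the count of shared attributes; the attribute sets and the value of P are invented for illustration (the cited work learns P from data).
```python
# Minimal sketch of a dual-threshold linking rule: two crimes are linked iff the
# support of their shared similarity attributes reaches both S and P.
S = 5  # significance threshold; stated above to come from crime intelligence experts
P = 3  # prevalence threshold; illustrative value, learned from data in the cited work

def linked(crime_a, crime_b):
    """True when the number of shared attributes meets both thresholds."""
    support = len(set(crime_a) & set(crime_b))
    return support >= S and support >= P

print(linked({"knife", "night", "mask", "duo", "alley", "cash"},
             {"knife", "night", "mask", "duo", "alley"}))  # True: support is 5
```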
... As for crime series detection approaches, Lin and Brown presented an outlier-based cluster method [28]. They argued that a group of incidents is likely to be committed by the same criminal when they are not only similar to each other, but also distinctive. ...
... Since the pairwise case linkage approach has some advantages, such as being relatively accurate, interpretable, flexible and easy to develop and tune, we use it as the core method in the decision support system. To the best of our knowledge, only a small part of the literature builds models with multiple types of features [10,28]. However, in these papers, the design of categorical attributes is too simple and cannot reflect subtle differences among attribute values. ...
Article
Serial crimes pose a great threat to public security. Linking crimes committed by the same offender can assist the detection of serial crimes and is of great importance in maintaining public security. Currently, most crime analysts still link serial crimes empirically, especially in China, and they desire quantitative tools to help them. This paper presents a decision support system for crime linkage based on various features of criminal cases, including behavioral features. Its underlying technique is pairwise classification based on similarity, which is interpretable and easy to tune. We design feature similarity algorithms to calculate pairwise similarities and build a classifier to determine whether a case pair belongs to a series. A comprehensive case study of a real-world robbery dataset demonstrates promising performance even with the default settings. The system has been deployed in a public security bureau in China and has been running for more than one year with positive feedback from users. It provides individual officers with strong support in crime investigation and allows law enforcement agencies to save resources: the system not only links serial crimes automatically based on a classification model learned from historical crime data, but also offers flexibility in training data updates and domain expert interaction, including adjusting key components such as similarity matrices and decision thresholds to reach a good trade-off between caseload and the number of true linked pairs.
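To make the pairwise-classification idea concrete, the sketch below turns one case pair into a small vector of per-attribute similarities and lets an off-the-shelf classifier estimate the probability that the pair is serial. The attribute names, similarity functions and toy training data are assumptions for illustration; they are not the similarity matrices or thresholds of the deployed system.
```python
# Illustrative sketch of pairwise case linkage: each case pair becomes a vector
# of per-attribute similarities, and a classifier scores it as serial or not.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(case_a, case_b):
    """Per-attribute similarities for one case pair (hypothetical attributes)."""
    return np.array([
        1.0 if case_a["weapon"] == case_b["weapon"] else 0.0,   # categorical match
        np.exp(-abs(case_a["hour"] - case_b["hour"]) / 6.0),    # temporal proximity
        np.exp(-np.hypot(case_a["x"] - case_b["x"],
                         case_a["y"] - case_b["y"]) / 2.0),     # spatial proximity
    ])

# Toy training data: similarity vectors for labelled case pairs (1 = same series).
X = np.array([[1.0, 0.9, 0.8], [0.0, 0.2, 0.1], [1.0, 0.7, 0.9], [0.0, 0.5, 0.2]])
y = np.array([1, 0, 1, 0])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

new_pair = pair_features({"weapon": "knife", "hour": 23, "x": 1.0, "y": 2.0},
                         {"weapon": "knife", "hour": 22, "x": 1.5, "y": 2.2})
print(clf.predict_proba([new_pair])[0, 1])  # estimated probability the pair is serial
```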
... Outlier detection plays a major role in almost every quantitative discipline, such as cybersecurity, finance, and machine learning (Shah et al. 2022; Shafiq et al. 2020a, b). It has many applications, including fault diagnosis, web analytics, medical diagnosis, fraud detection, criminal activity detection, and malware detection (Andersson et al. 2016; Jabez and Muthukumar 2015; Lin and Brown 2006; Lucas et al. 2020; Malini and Pushpa 2017; Sandosh et al. 2020; Shafiq et al. 2020c). ...
Article
Full-text available
Grid-based approaches provide an efficient framework for data clustering in the presence of incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses entropy on the dataset as a whole or on an individual cluster to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection is concerned with data points that are isolated in grid cells; they are either far from the dense region or isolated points near it, and are therefore declared explicit outliers. Implicit outlier detection is associated with detecting outliers that deviate from the normal pattern in less obvious ways. Such outliers are determined using the entropy change of the dataset or of a specific cluster for each deviation. An elbow criterion based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on CHAMELEON datasets and other similar datasets suggest that the proposed approaches detect outliers more precisely and extend the capability of outlier detection by an additional 4.5% to 8.6%. Moreover, the resulting clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF and HBOS. Finally, a case study for detecting outliers in environmental data is carried out using the proposed approach, with results generated on our synthetically prepared datasets. The performance shows that the proposed approach may be an industry-oriented solution to outlier detection in environmental monitoring data.
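The snippet below is a rough, assumed illustration of the grid-plus-entropy viewpoint (it is not the EGO algorithm): points are binned into grid cells, single points with no occupied neighbouring cell are flagged as explicit outliers, and the entropy of the cell-occupancy distribution summarises how concentrated the data are.
```python
# Toy grid + entropy view of outliers (assumptions only; not the EGO method).
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, size=(200, 2)),      # dense cluster
                    np.array([[8.0, 8.0], [9.5, -7.0]])])  # two far-away points

cell_size = 1.0
cells = [tuple(np.floor(p / cell_size).astype(int)) for p in points]
counts = Counter(cells)

# Explicit outliers: points alone in their cell with no occupied neighbouring cell.
def has_occupied_neighbour(cell):
    cx, cy = cell
    return any(counts.get((cx + dx, cy + dy), 0) > 0
               for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0))

explicit = [i for i, c in enumerate(cells)
            if counts[c] == 1 and not has_occupied_neighbour(c)]

# Entropy of the cell-occupancy distribution; isolated cells raise it slightly,
# so removing them (or a deviating point) changes the entropy of the dataset.
p = np.array(list(counts.values()), dtype=float)
p /= p.sum()
entropy = -(p * np.log2(p)).sum()
print(len(explicit), round(entropy, 3))
```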
... It has been used to discover crime patterns, link crime incidents committed by the same offender, and narrow down possible suspects (D. E. Brown & Hagen, 2003; Buczak & Gifford, 2010; Lin & Brown, 2006). Given its utility in identifying relationships in data, association rule mining was employed in the current study to uncover offending behaviour and personal attributes that co-occur, in order to inform more targeted interventions for justice-involved individuals with FASD. ...
Article
Full-text available
Neurodevelopmental impairments resulting from Foetal Alcohol Spectrum Disorder (FASD) can increase the likelihood of justice system involvement. This study compared offence characteristics in young people with FASD to demographically matched controls (n = 500) in Western Australia. A novel approach (i.e. association rule mining) was adopted to uncover relationships between personal attributes and offence characteristics. For FASD participants (n = 100), file records were reviewed retrospectively. Mean age of the total sample was 15.60 years (range = 10–24), with 82% males and 88% Australian Aboriginal. After controlling for demographic factors, regression analyses showed FASD participants were more likely than controls to be charged with reckless driving (odds ratio, OR = 4.20), breach of bail/community orders (OR = 3.19), property damage (OR = 1.84), and disorderly behaviour (OR = 1.54). Overall, our findings suggest justice-involved individuals with FASD have unique offending profiles. These results have implications for sentencing, diversionary/crime prevention programs and interventions.
... Many methods have been applied to detect crime series. Brown and Hagen [31,32] measured the similarity between crimes by the weighted average of the similarities of all attributes. Multicriteria decision-making [27,33] has also been used: values of criminal behaviors are described by linguistic variables, and the similarity between crimes is calculated according to these linguistic variables. ...
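A hedged sketch of the weighted-average similarity idea attributed to Brown and Hagen above: the total similarity of two crime records is the weighted mean of per-attribute similarities. The attributes, exact-match similarity and weights are illustrative stand-ins, not the published values.
```python
# Weighted-average total similarity between two crime records (illustrative).
def total_similarity(rec_a, rec_b, weights):
    num, den = 0.0, 0.0
    for attr, w in weights.items():
        sim = 1.0 if rec_a.get(attr) == rec_b.get(attr) else 0.0  # exact-match similarity
        num += w * sim
        den += w
    return num / den

weights = {"weapon": 3.0, "premise": 2.0, "method_of_entry": 2.0, "time_of_day": 1.0}
a = {"weapon": "handgun", "premise": "convenience store", "method_of_entry": "front door", "time_of_day": "night"}
b = {"weapon": "handgun", "premise": "gas station", "method_of_entry": "front door", "time_of_day": "night"}
print(total_similarity(a, b, weights))  # 0.75: three of four weighted attributes match
```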
Article
Crime linkage is a difficult task and is of great significance to maintaining social security. It can be treated as a binary classification problem. For some crimes it is difficult to determine from the existing evidence whether they are serial, so two-way decisions are prone to errors on such case pairs. Here, three-way decisions based on the decision-theoretic rough set are applied, and their key issue is to determine thresholds by setting appropriate loss functions. However, the loss functions are sometimes difficult to obtain. In this paper, a method is proposed to learn the thresholds of the three-way decisions automatically, without presetting explicit loss functions. We simplify the loss function matrix according to the characteristics of crime linkage, re-express the thresholds in terms of the loss functions, and investigate the relationship between the overall decision cost and the size of the boundary region. The trade-off between the uncertainty of the boundary region and the decision cost is taken as the optimization objective. We apply multiple traditional classification algorithms as base classifiers, and employ real-world cases and some public datasets to evaluate the effect of the proposed method. The results show that the proposed method can reduce classification errors.
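For orientation, the standard decision-theoretic rough set thresholds referred to here can be written in terms of the loss functions as follows; this is the textbook formulation, not necessarily the simplified matrix the paper derives.
```latex
% \lambda_{PP}, \lambda_{BP}, \lambda_{NP}: losses for accepting, deferring, and
% rejecting a pair that truly is serial; \lambda_{PN}, \lambda_{BN}, \lambda_{NN}:
% the corresponding losses when the pair is nonserial.
\alpha = \frac{\lambda_{PN} - \lambda_{BN}}
              {(\lambda_{PN} - \lambda_{BN}) + (\lambda_{BP} - \lambda_{PP})}, \qquad
\beta  = \frac{\lambda_{BN} - \lambda_{NN}}
              {(\lambda_{BN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{BP})}
```
A pair is accepted as serial when its estimated probability of being serial is at least alpha, rejected when it is at most beta, and deferred to the boundary region otherwise.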
... For example, Brown and Hagen (2003) use ARM algorithms and look at similarities to detect serial robbery incidents. Lin and Brown (2006) then enhance the algorithm with an outlier-based measure to better extract notable series of robbery incidents committed by the same criminal(s). Chen et al. (2004) propose a crime data mining suite which later became the popular package CopLink, an important component of which is association rule learning. ...
Article
Full-text available
This paper focuses on advancing the traditional association rule mining (ARM) approach to capture the rich, multidimensional and multiscalar context that is anticipated to be associated with residential Motor Vehicle Theft (MVT) across urban environments. We tackle the challenge of materializing complex social and spatial components in the mining process and present a novel interactive visualization based on social network analysis of rules and associations to facilitate the analysis of mined rules. The spatial ARM (SARM) findings successfully identify many socio-spatial associations with MVT prevalence and establish their relative influence on crime outcomes in a case study. The analysis also provides unique insights into the interactive relationships between neighborhood characteristics and environmental features for both high and low MVT, and underscores the importance of the spatial properties of spillover and neighborhood effects on urban residential MVT prevalence. This work follows the tradition of inductive and abductive learning and presents a promising analysis framework using data mining which can be applied to different applications in the social sciences.
... Several supervised learning algorithms have been applied to crime linkage, including neural networks [11], logistic regression [29,30], decision trees [31], and Bayesian classification [32]. Researchers also use unsupervised methods to identify entire crime series rather than serial crime pairs, including various clustering algorithms [33], outlier detection [34] and Restricted Boltzmann Machines (RBM) [35]. In addition, some scholars have applied semi-supervised algorithms [13] and fuzzy multi-criteria decision making [36,37] to associate crimes. ...
Article
Crime linkage is a challenging task in crime analysis: finding serial crimes committed by the same offenders. It can be regarded as a binary classification task of detecting serial case pairs. However, most case pairs in the real world are nonserial, so there is a serious class imbalance in crime linkage. In this paper, we propose a novel random forest based on the information granule. The approach does not resample the minority or the majority class but concentrates on indistinguishable case pairs at the classification boundary. The information granule is used to identify case pairs that are difficult to distinguish in the dataset and to construct a nearly balanced dataset in the uncertainty region to deal with the imbalance problem. In the proposed approach, random trees are built from both the original dataset and the above-mentioned nearly balanced dataset. A real-world robbery dataset and some public imbalanced datasets are employed to measure the performance of the approach. The results show that the proposed approach is effective in dealing with class imbalance, and it can be combined with other methods for addressing class imbalance.
... Association rule mining (ARM) is a data mining technique used to discover probabilistic co-occurrence relationships in collections of items. The technique involves finding patterns and correlating them by producing association rules [4,5]. ARM is an unsupervised data mining technique used to reveal hidden relationships among co-occurring data items [6]. ...
... Association techniques have been utilized in applications for law enforcement. The proposed technique looks for similarity through an original total similarity measure with information-theory-based weights among the attributes of robbery records from the Police Department, so that incidents can be associated. After this, Lin and Brown described a new association method in which an outlier score function was added, and examined the same robbery data from the Richmond Police Department [15]. The new outlier-based technique improves on the similarity-based technique, with promising results in providing more helpful information for police officers. ...
... In the calculation of the traditional Jaccard coefficient, the importance of all values of an attribute is regarded as identical. However, when a group of records shares some common features and these features are outliers, these records are more likely to have been generated by the same cause [39]. Therefore, different scores should be given according to the features' frequencies: the lower the occurrence probability of a feature, the higher its importance. ...
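One simple way to realise the rarity weighting described above (an assumed scheme, not necessarily the one used in [39] or the citing paper) is to weight each feature by the negative log of its occurrence probability and compute a weighted Jaccard ratio:
```python
# Rarity-weighted Jaccard similarity: matching on a rare feature counts more.
import math

# Occurrence probability of each feature across the whole crime database
# (illustrative values).
feature_prob = {"knife": 0.05, "night": 0.60, "mask": 0.10, "duo": 0.30}

def weight(f):
    # Rarer feature -> lower probability -> larger weight.
    return -math.log(feature_prob.get(f, 1.0))

def weighted_jaccard(a, b):
    inter = sum(weight(f) for f in a & b)
    union = sum(weight(f) for f in a | b)
    return inter / union if union else 0.0

a = {"knife", "night", "mask"}
b = {"knife", "night", "duo"}
print(round(weighted_jaccard(a, b), 3))  # matches on the rare "knife" dominate
```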
Article
Detecting serial crimes is one of the most challenging tasks in crime analysis. Linking crimes committed by the same criminal can improve the work efficiency of police offices and help maintain public safety. Previous crime linkage studies have focused on the crime features of the modus operandi (M.O.) but did not address the crime process. In this paper, we propose an approach for detecting serial robbery crimes based on understanding the offender's M.O. by integrating crime process information. From the crime narrative text, a natural language processing method is used to extract the action and object characteristics of the crime process; a dynamic time warping method is introduced for measuring the similarity of these characteristics; and an information entropy method is used to weight the similarities of the action and object characteristics to obtain a comprehensive similarity of the criminals' crime processes. A real-world robbery dataset is employed to measure the performance of finding serial crimes after adding the crime process information. According to the results, information about the crime process obtained from the case narrative text has significant separability and can better characterize the offender's M.O. Five machine learning algorithms are used to classify the case pairs and identify serial and nonserial cases. Based on the crime features, the results show that adding crime process information can substantially improve the detection of serial crimes.
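For readers unfamiliar with dynamic time warping, the minimal implementation below aligns two crime processes represented as sequences of action codes; the action vocabulary and unit substitution cost are illustrative, and the cited approach additionally weights action and object similarities by information entropy.
```python
# Minimal dynamic time warping (DTW) over sequences of crime-process actions.
def dtw(seq_a, seq_b, dist=lambda x, y: 0.0 if x == y else 1.0):
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

proc_a = ["approach", "threaten_knife", "demand_cash", "flee_on_foot"]
proc_b = ["approach", "threaten_knife", "search_victim", "demand_cash", "flee_on_foot"]
print(dtw(proc_a, proc_b))  # 1.0: one extra action to absorb despite unequal lengths
```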
... Their study used the Apriori algorithm, and it shows that association rules generate satisfactory results on crime data. The study also found that the true prediction rate increases in proportion to the amount of data in the dataset [4]. Mehmet Sevri et al. applied association rules to datasets of several different types of crimes and revealed relationships between the attributes of criminal cases. ...
Conference Paper
In the past, serial cases were mainly analyzed by manual methods, which were time-consuming, labor-intensive and inefficient. In this paper, the Apriori algorithm is used to analyze serial cases, to achieve intelligent analysis, and to study the implementation path of correlation analysis in serial cases. When the support is set to 0.4 and the lift is set to 2.0, the best association rules can be mined. The Apriori algorithm can realize intelligent association analysis effectively, and machine-learning methods will further improve it in the future, making it suitable for various types of serial cases.
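The self-contained toy example below mines pairwise rules with the support and lift thresholds quoted above (support >= 0.4, lift >= 2.0). It brute-forces item pairs rather than running the full Apriori candidate-generation loop, and the case data are invented.
```python
# Pairwise association rules filtered by support and lift (toy data).
from itertools import combinations

cases = [
    {"knife", "mask", "night"},
    {"knife", "mask"},
    {"night", "unarmed"},
    {"daylight", "unarmed"},
    {"daylight", "shoplifting"},
]
N = len(cases)

def support(itemset):
    return sum(1 for c in cases if itemset <= c) / N

MIN_SUPPORT, MIN_LIFT = 0.4, 2.0
items = sorted({i for c in cases for i in c})
rules = []
for a, b in combinations(items, 2):
    s_ab = support({a, b})
    if s_ab < MIN_SUPPORT:
        continue
    lift = s_ab / (support({a}) * support({b}))
    if lift >= MIN_LIFT:
        rules.append((a, b, round(s_ab, 2), round(lift, 2)))
print(rules)  # [('knife', 'mask', 0.4, 2.5)]: the only pair clearing both thresholds
```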
... It is very important to identify anomalies in data because they can distort the analysis and the decisions based on the analysis [23], [43], [81], [82], [85], [83], and [30]. Anomaly detection (or outlier or novelty detection as it is sometimes referred to in the literature) is a well studied problem in different research areas such as communications, statistics, data mining and machine learning (see, e.g., [80], [33], [16], [151], [97], [115], and [45]). ...
Article
Full-text available
Anomaly detection has numerous applications in diverse fields. For example, it has been widely used for discovering network intrusions and malicious events. It has also been used in numerous other applications such as identifying medical malpractice or credit fraud. Detection of anomalies in quantitative data has received considerable attention in the literature and has a venerable history. By contrast, and despite the widespread availability and use of categorical data in practice, anomaly detection in categorical data has received relatively little attention. This is because detecting anomalies in categorical data is a challenging problem. Some anomaly detection techniques depend on identifying a representative pattern and then measuring distances between objects and this pattern; objects that are far from the pattern are declared anomalies. However, identifying patterns and measuring distances are not as easy in categorical data as in quantitative data. Fortunately, several papers focusing on the detection of anomalies in categorical data have been published in the recent literature. In this article, we provide a comprehensive review of research on the anomaly detection problem in categorical data. Previous review articles focus on either the statistics literature or the machine learning and computer science literature; this review combines both. We review 36 methods for the detection of anomalies in categorical data from both literatures and classify them into 12 categories based on the conceptual definition of anomalies they use. For each approach, we survey the anomaly detection methods and then show the similarities and differences among them. We emphasize two important issues: the number of parameters each method requires and its time complexity. The first issue is critical because the performance of these methods is sensitive to the choice of these parameters. Time complexity is also very important in real applications, especially big data applications. We report the time complexity if it is reported by the authors of a method; if it is not, we derive it ourselves and report it in this article. In addition, we discuss common problems and future directions of anomaly detection in categorical data.
... Hawkins et al. (2002) and Williams et al. (2002) proposed the use of an autoencoder, an unsupervised learning technique, for anomaly detection [49,50]. As an anomaly detection method, the autoencoder technique is not only used for fraud detection [51,52], but is also applicable to various domains where unsupervised learning can detect anomalies [53][54][55][56]. Autoencoder technology has been used for anomaly detection in various fields, including fraud detection in work processes [9]. ...
Article
Full-text available
Fraud detection is becoming an integral part of business intelligence, as detecting fraud in the work processes of a company is of great value. Fraud is an inhibiting factor for accurate appraisal in the evaluation of an enterprise, and it is economically a loss factor for business. Previous studies of fraud detection have had limited performance gains because they learn the fraud pattern of the data as a whole. This paper proposes a novel method using hierarchical clusters based on deep neural networks in order to detect more fine-grained frauds, as well as frauds across the whole data, in the work processes of job placement. The proposed method, Hierarchical Clusters-based Deep Neural Networks (HC-DNN), utilizes anomaly characteristics of hierarchical clusters pre-trained through an autoencoder as the initial weights of deep neural networks to detect various frauds. HC-DNN has the advantage of improving performance and providing an explanation of the relationships between fraud types. In a cross-validation evaluation of fraud detection performance, the proposed method shows higher performance than conventional methods. From the viewpoint of explainable deep learning, the hierarchical cluster structure constructed through HC-DNN can represent the relationships between fraud types.
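The bare-bones sketch below illustrates only the autoencoder-as-anomaly-detector idea that HC-DNN builds on, not HC-DNN itself: a small network is trained to reconstruct normal records, and records with unusually large reconstruction error are flagged. The synthetic data and the use of scikit-learn's MLPRegressor as a stand-in autoencoder are assumptions.
```python
# Reconstruction-error anomaly detection with a crude autoencoder (illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(300, 8))       # mostly "normal" records
anomalies = rng.normal(5.0, 1.0, size=(5, 8))       # a few shifted records
X = np.vstack([normal, anomalies])

# An MLP trained to map inputs to themselves acts as a crude autoencoder.
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(normal, normal)

errors = ((ae.predict(X) - X) ** 2).mean(axis=1)     # reconstruction error per record
threshold = np.percentile(errors[:300], 99)          # threshold set from normal data
print(np.where(errors > threshold)[0])               # indices of flagged records
```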
... The primary focus of these approaches remains the same: to provide some theoretical or statistical method for approximating how likely two or more crimes are to be connected (by the same suspect). Crime linkage has been used in research on a number of crime types, including sexual assault/rape (Grubin et al. 2001; Santtila et al. 2005), arson (Ellingwood et al. 2013), and various types of burglary and theft (Bennell and Jones 2005; Lin and Brown 2006). ...
Article
When a detective arrives at a crime scene or investigates multiple cases, they are often tasked with understanding whether the crimes are linked. Knowing whether the same suspect(s) was involved across multiple crimes is a key part of the investigation. To date, there are numerous methods for crime linkage; however, very few take temporal sequences of events into account. It is known that modus operandi and signatures change over time, and therefore linkage analyses should integrate these temporal changes. The current paper presents a new method of crime linkage, the Path Similarity Metric, which is based on Sequence Analysis procedures. The method is proposed, outlined, and tested against existing linkage analyses (e.g., Jaccard's coefficient). The Path Similarity Metric outperforms Jaccard's coefficient across a series of crimes. Future applications of the Path Similarity Metric are outlined, and directions for the use of the metric in ongoing investigations are considered alongside other linkage methods.
... This provides a scientific basis for improved sustainable land use planning. Similar analyses have been successfully carried out for pollution (Zhang et al., 2008), conflict (Anselin, 1995), disease (Ruiz et al., 2004) and crime management (Anselin et al., 2000; Lin & Brown, 2006). ...
Chapter
The management of urban sprawl is fundamental to achieving sustainable urban development. Monitoring urban sprawl is, however, challenging. This study proposes the use of two spatial statistics, namely the global Moran and local Moran indexes, to identify statistically significant urban sprawl hot and cold spots. The findings reveal that the Moran indexes are sensitive to the distance-band spatial weight matrices employed and that multiple bands should be used when these indexes are applied. The authors demonstrate how the indexes can be used in combination with various visualisation methods to support planning decisions.
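For reference, the global Moran index mentioned here has the standard form below, where w_ij is the spatial weight between areal units i and j (for instance, from a distance band) and x-bar is the mean of the attribute; this is the textbook definition rather than anything specific to this chapter.
```latex
I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \;
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```
Values well above the expected value of -1/(n-1) indicate spatial clustering (hot or cold spots), while values well below it indicate dispersion; the local Moran statistic decomposes this sum into per-unit contributions.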
... In this method, weights among attributes of crime record data are determined in such a way that incidents possibly committed by the same criminal or group of criminals are strongly associated with each other. This association method was then developed further in (Lin & Brown, 2006) by combining it with an outlier score function, and it was tested on the same crime data as investigated in (Brown & Hagen, 2003). The new approach outperforms the similarity-based method, as it provides more helpful insights for criminal investigations carried out by police forces. ...
Chapter
Cybercriminology as a subject area has numerous dimensions. Some studies in the field primarily focus on corrective action to reduce the impact of an already committed crime. However, there are existing computational techniques which can assist in predicting, and therefore preventing, cyber-crimes. These quantitative techniques are capable of providing valuable holistic and strategic insights for law enforcement units and police forces to prevent crimes from happening. Moreover, these techniques can be used to analyse crime patterns to provide a better understanding of the world of cyber-criminals. The main beneficiaries of such research are not only law enforcement units: in the era of Internet connectivity, many businesses would also benefit, given the cyber attacks and crimes being committed in the cyber environment. This chapter provides an all-embracing overview of machine learning techniques for crime analysis, followed by a detailed critical discussion of data mining and predictive analysis techniques within the context of cyber-criminology.
... Novel data sources and analysis methods are also studied in terrorism and organised crime detection as well as case association (e.g. Xu & Chen, 2004; Lin & Brown, 2006; Yang & Li, 2007). ...
Article
Full-text available
Crime prediction is crucial to criminal justice decision makers and efforts to prevent crime. The paper evaluates the explanatory and predictive value of human activity patterns derived from taxi trip, Twitter and Foursquare data. Analysis of a six-month period of crime data for New York City shows that these data sources improve predictive accuracy for property crime by 19% compared to using only demographic data. This effect is strongest when the novel features are used together, yielding new insights into crime prediction. Notably and in line with social disorganisation theory, the novel features cannot improve predictions for violent crimes.
... Novel data sources and analysis methods are also studied in terrorism and organised crime detection as well as case association (e.g. Xu & Chen, 2004; Lin & Brown, 2006; Yang & Li, 2007). ...
Article
Full-text available
Data from social media has created opportunities to understand how and why people move through their urban environment and how this relates to criminal activity. To aid resource allocation decisions in the scope of predictive policing, the paper proposes an approach to predict weekly crime counts. The novel approach captures spatial dependency of criminal activity through approximating human dynamics. It integrates point of interest data in the form of Foursquare venues with Twitter activity and taxi trip data, and introduces a set of approaches to create features from these data sources. Empirical results demonstrate the explanatory and predictive power of the novel features. Analysis of a six-month period of real-world crime data for the city of New York evidences that both temporal and static features are necessary to effectively account for human dynamics and predict crime counts accurately. Furthermore, results provide new evidence into the underlying mechanisms of crime and give implications for crime analysis and intervention.
... Their study uses the Apriori algorithm, and association rules are seen to produce good results on crime data. Their study also finds that the true prediction rate increases in proportion to the amount of data in the dataset [12]. ...
... Similarity-based association mining is used mainly to compare the features of a crime with the criminal's behavioral patterns, which are referred to as the modus operandi or behavioral signature. In outlier-based association mining, crime associations are created based on the fact that both the crime and the criminal may have some distinctive feature or deviant behavior (Lin & Brown, 2006). Entity association mining/link analysis is the task of finding and charting associations between crime entities such as persons, weapons, and organizations. ...
Article
Full-text available
It is a well-known fact that some criminals follow perpetual methods of operation, known as modi operandi (MO), a term commonly used to describe habits in committing crimes, especially in the context of criminal investigations. These modi operandi are then used to relate criminals to other crimes where the suspect has not yet been identified. This paper presents a method focused on identifying the perpetual modus operandi of criminals by analyzing their previous convictions. The method involves generating a feature matrix for a particular suspect based on the flow of events. Then, based on the feature matrix, two representative modi operandi are generated: the complete modus operandi and the dynamic modus operandi. These two representative modi operandi are compared with the flow of events of the crime in order to investigate and relate a particular criminal. This comparison uses several operations to generate two other outputs: a completeness probability and a deviation probability. These two outcomes are used as inputs to a fuzzy inference system to generate a score value that provides a measure of the similarity between the suspect and the crime at hand. The method was evaluated using actual crime data and four other open data sets. ROC analysis was then performed to justify the validity and generalizability of the proposed method. In addition, comparison with five other classification algorithms showed that the proposed method performs competitively with other related methods.
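Purely as an illustrative guess at how such completeness and deviation measures could be computed (the paper's actual operations and fuzzy rules are not described here), the toy sketch below scores a new crime's event flow against a representative modus operandi.
```python
# Hypothetical sketch: completeness = fraction of the representative MO's events
# that appear in the crime's event flow; deviation = fraction of the crime's
# events that are absent from the representative MO. Neither formula is taken
# from the cited paper; both are stand-ins for its undisclosed operations.
representative_mo = ["enter_rear_window", "disable_alarm", "take_jewellery", "exit_rear"]
crime_events = ["enter_rear_window", "disable_alarm", "take_cash", "exit_front"]

mo_set, crime_set = set(representative_mo), set(crime_events)
completeness = len(mo_set & crime_set) / len(mo_set)   # 0.5
deviation = len(crime_set - mo_set) / len(crime_set)   # 0.5
print(completeness, deviation)
```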
... Similarity-based association mining is used mainly to compare the features of a crime with the criminal's behavioral patterns, which are referred to as the modus operandi or behavioral signature. In outlier-based association mining, crime associations are created based on the fact that both the crime and the criminal may have some distinctive feature or deviant behavior (Lin & Brown, 2006). Entity association mining/link analysis is the task of finding and charting associations between crime entities such as persons, weapons, and organizations. ...
Article
Full-text available
It is a well-known fact that some criminals follow perpetual methods of operations known as modi operandi. Modus operandi is a commonly used term to describe the habits in committing crimes. These modi operandi are used in relating criminals to crimes for which the suspects have not yet been recognized. This paper presents the design, implementation and evaluation of a new method to find connections between crimes and criminals using modi operandi. The method involves generating a feature matrix for a particular criminal based on the flow of events of his/her previous convictions. Then, based on the feature matrix, two representative modi operandi are generated: complete modus operandi and dynamic modus operandi. These two representative modi operandi are compared with the flow of events of the crime at hand, in order to generate two other outputs: completeness probability (CP) and deviation probability (DP). CP and DP are used as inputs to a fuzzy inference system to generate a score which is used in providing a measurement for the similarity between the suspect and the crime at hand. The method was evaluated using actual crime data and ten other open data sets. In addition, comparison with nine other classification algorithms showed that the proposed method performs competitively with other related methods proving that the performance of the new method is at an acceptable level.
... The proposed approach automatically looks for similarities by using a new total similarity measure with information-theoretic weights among the attributes of robbery records from the Richmond Police Department, in order to associate incidents possibly committed by the same criminal or group of criminals. Thereafter, Lin and Brown presented a new association method that incorporated an outlier score function and tested it on the same robbery data from the Richmond Police Department in ref. [41]. The new outlier-based method outperforms the similarity-based method, with promising results in providing more helpful information for police officers. ...
Article
Full-text available
Crime continues to remain a severe threat to all communities and nations across the globe alongside the sophistication in technology and processes that are being exploited to enable highly complex criminal activities. Data mining, the process of uncovering hidden information from Big Data, is now an important tool for investigating, curbing and preventing crime and is exploited by both private and government institutions around the world. The primary aim of this paper is to provide a concise review of the data mining applications in crime. To this end, the paper reviews over 100 applications of data mining in crime, covering a substantial quantity of research to date, presented in chronological order with an overview table of many important data mining applications in the crime domain as a reference directory. The data mining techniques themselves are briefly introduced to the reader and these include entity extraction, clustering, association rule mining, decision trees, support vector machines, naive Bayes rule, neural networks and social network analysis amongst others. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
... Similarity-based association mining is used mainly to compare the features of a crime with the criminal's behavioral patterns, which are referred to as the modus operandi or behavioral signature. In outlier-based association mining, crime associations are created based on the fact that both the crime and the criminal may have some distinctive feature or deviant behavior (Lin & Brown, 2006). Entity association mining/link analysis is the task of finding and charting associations between crime entities such as persons, weapons, and organizations. ...
Chapter
Full-text available
Virtual enterprises bring together different companies under one umbrella, and the organizational structure is tailored to the common project rather than reflecting the participating companies' structure. A virtual organization is also dynamic in nature, as jobs/positions can be created or abolished as the project progresses. Access to information needs to be very flexible in such an environment. On one hand, people should have access to all the information they need to perform their duties, and those duties may change as time progresses. On the other hand, providing access to data other than what is needed for a particular job can lead to information overload and also poses a security risk, as it can lead to information leakage or accidental modification of data. Role Based Access Control (RBAC) offers a solution to this problem by associating roles with jobs and assigning access privileges to roles. This paper examines Role Based Access Control and proposes some modifications to conventional RBAC to make it suitable for virtual enterprises.
... Determining fraud in financial transactions, trading activity, or insurance claims typically requires the determination of unusual patterns in the data generated by the actions of the criminal entity (Lin and Brown, 2006). ...
Article
There have been many attempts to find knowledge in data using conventional statistics, data mining, artificial intelligence, machine learning and pattern recognition. In these research areas, knowledge is approached in two ways: researchers either discover knowledge represented as general features for universal recognition, or they discover exceptional and distinctive features. In process mining, an instance is sequential information bounded by a case ID, known as a process instance. An exceptional process instance can cause problems in the analysis and discovery algorithms. Hence, in this paper we develop a method to detect exceptional and distinctive features when performing process mining. We propose a method for anomaly detection named Distance-based Anomaly Process Instance Detection (DAPID), which utilizes the distance between process instances. DAPID contributes to the discovery of distinctive characteristics of process instances. To verify the suggested methodology, we discovered characteristics of exceptional situations from log data. Additionally, we experiment on real data from a domestic port terminal to demonstrate our proposed methodology.
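As a toy illustration of the distance-based intuition (this is not the DAPID algorithm), the sketch below reduces each process instance to its set of activity transitions, measures pairwise Jaccard distances between instances, and reports the instance that is, on average, farthest from the others.
```python
# Distance-based flagging of an anomalous process instance (illustrative only).
instances = {
    "p1": ["register", "check", "approve", "archive"],
    "p2": ["register", "check", "approve", "archive"],
    "p3": ["register", "check", "reject", "archive"],
    "p4": ["register", "archive", "check"],           # unusual ordering
}

def transitions(trace):
    return {(a, b) for a, b in zip(trace, trace[1:])}

def distance(t1, t2):
    # Jaccard distance between the transition sets of two instances.
    u = t1 | t2
    return 1.0 - len(t1 & t2) / len(u) if u else 0.0

T = {k: transitions(v) for k, v in instances.items()}
avg_dist = {k: sum(distance(T[k], T[j]) for j in T if j != k) / (len(T) - 1) for k in T}
print(max(avg_dist, key=avg_dist.get), avg_dist)       # p4 stands out
```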
... Determining fraud in financial transactions, trading activity, or insurance claims typically requires the determination of unusual patterns in the data generated by the actions of the criminal entity [21]. ...
Article
Information systems have been developed and used in various business areas; therefore, an abundance of historical data (log data) is stored and subsequently needs to be analyzed. Previous studies have focused on discovering relationships between events and not on identifying anomalous instances. Previously, anomalous instances were treated as noise and simply ignored; however, this kind of anomalous instance can occur repeatedly. Hence, a new methodology to detect anomalous instances is needed. In this paper, we propose a methodology called LAPID (Local Anomaly Process Instance Detection) for discriminating anomalous process instances in log data. We specify a distance metric derived from the activity relation matrix of each instance and use it to detect APIs (Anomaly Process Instances). To verify the suggested methodology, we discovered characteristics of exceptional situations from log data. To demonstrate our proposed methodology, we performed an experiment on real data from a domestic port terminal.
Article
This paper reviews the crime linkage literature to identify how data were pre-processed for analysis, methods used to predict linkage status/series membership, and methods used to assess the accuracy of linkage predictions. Thirteen databases were searched, with 77 papers meeting the inclusion/exclusion criteria. Methods used to pre-process data were human judgement, similarity metrics (including machine learning approaches), spatial and temporal measures, and Mokken Scaling. Jaccard's coefficient and other measures of similarity (e.g., temporal proximity, inter-crime distance, similarity vectors) are the most common ways of pre-processing data. Methods for predicting linkage status were varied and included human (expert) judgement, logistic regression, multi-dimensional scaling, discriminant function analysis, principal component analysis and multiple correspondence analysis, Bayesian methods, fuzzy logic, and iterative classification trees. A common method used to assess linkage-prediction accuracy was to calculate the hit rate, although position on a ranked list was also used, and receiver operating characteristic (ROC) analysis has emerged as a popular method of assessing accuracy. The article has been published open access and is free to download from https://www.sciencedirect.com/science/article/pii/S1359178924001046
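For readers unfamiliar with the similarity measure that recurs throughout this review, Jaccard's coefficient between the feature sets A and B of two crimes is the standard set-overlap ratio (the general definition, not a formula specific to any one study):
```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```
A value of 1 means the two crimes share exactly the same recorded features, while 0 means they share none; linked pairs are expected to score higher than unlinked pairs.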
Article
The occurrence of incidents seriously affects the operation of the whole urban railway system and passengers’ travel experience. Accurate delay prediction is important for traffic control and management under incidents. Few studies were reported on incident prediction in urban railway systems because of the unexpected nature of incidents and the lack of comprehensive incident data. Existing models used to predict incident delay can be divided into statistical methods and traditional machine learning methods, as well as ensemble learning methods. This study conducts a methodology review for these models by comparing their performance in predicting incident delays using a large-scale incident dataset collected from an urban railway system in Hong Kong. Three statistical models and six machine/ensemble learning methods are examined: ordinary least squares, accelerated failure time, quantile regression (QR), support vector regression (SVR), K-nearest neighbor, random forest, adaptive boosting, gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost) tree. The results indicate that statistical models perform better than machine/ensemble learning models in predicting train delays under incidents. The QR, SVR, and XGBoost tree models outperform other models in incident delay prediction in their respective methodological categories. The factors of the incident type and affected line type present the most significant effects on incident delay prediction in selected models.
Conference Paper
Full-text available
The occurrence of incidents seriously affects the operation of the whole urban railway system and passengers' travel experience. Accurate delay prediction is important for traffic control and management under incidents. Few studies were reported on incident prediction in urban railway systems due to the unexpected nature of incidents and the lack of comprehensive incident data. Existing models used to predict incident delay can be divided into statistical and machine learning methods. This study aims to conduct a methodology review for these models by comparing their performance in predicting incident delays using a 7-year incident dataset collected from an urban railway system. Three statistical models and six machine learning methods are examined, including ordinary least squares (OLS), accelerated failure time (AFT), quantile regression (QR), support vector regression (SVR), K-nearest neighbor (KNN), random forest (RF), adaptive boosting (Adaboost), gradient boosting decision tree (GBDT) and extreme gradient boosting tree (XGBoost). The results indicate that statistical models perform better than machine learning models in predicting train delays under incidents. The QR and XGBoost models outperform other models in incident delay prediction in their respective methodological categories. The factors of the incident type and affected line type present the most significant effects on incident delay prediction in selected models.
Chapter
Assessing crime is an important process, as crime is on the rise these days, and assessing cybercrime can be a daunting task. It can be challenging to collect existing data and work on new techniques. In cybercrime, direct real-time assessment is obligatory; however, it is difficult to pinpoint when the subsequent offense will happen, and information on past offenses is limited to where and when they happened. This work refers to recent crime statistics. The datasets are represented with spatial and temporal residual networks to analyze crime by the hour. In terms of accuracy, comparisons with several existing estimation methods show lower performance than the proposed adaptive deep learning-based crime prediction (ADLCP) model. Finally, to address the shortcomings of existing models in the actual prediction of crimes, the proposal introduces a novel strategy.
Article
Full-text available
A new protection scheme is proposed to avoid this problem. The idea is that each node uses both a salted and an unsalted HLL; if their estimates differ considerably, an attacker is attempting to manipulate the HLL estimates. In addition to avoiding manipulation, the proposed salted-and-unsalted (SNS) scheme can also detect manipulation attempts. A practical configuration shows how manipulation attempts can be detected with a low false-positive probability, so if mergeability is to be preserved, the SNS scheme can be an interesting approach to protect HLLs. This paper also proposes a new mining algorithm based on Animal Migration Optimization, called ARM-AMO, to decrease the number of association rules. The idea is to remove from the data rules that have low support and are unnecessary. First, frequent item sets and association rules are generated with the Apriori algorithm; AMO then reduces the number of association rules using a new fitness function. In addition, we provide a well-organized mechanism for incident derivation under unwanted incidents; this mechanism is useful for measuring the load of an incoming incident and for exact calculation of its probability. A further component is a selectability mechanism, which plays an important role in incident derivation under unwanted incidents for both settled and unknown incidents. A model for representing derived incidents is introduced, together with an advanced sampling technique that approximates the derived incident probabilities. This augmentation is executed through prioritization techniques, which identify cases in which the order of incident detection is determined and provide a mechanism for defining a settled detection execution.
Article
The identification of crime series is of great importance for public safety in smart city development. This research presents a novel crime clustering model, CriClust+, for detecting Crime Series Patterns (CSP). The analysis is augmented using geometric projection with a dual-threshold model. The pattern prevalence information extracted from the model is encoded in similarity graphs. Clusters are identified by finding highly connected subgraphs using adaptive graph size and Monte-Carlo heuristics in the Karger-Stein mincut algorithm. We propose two new interest measures: (i) Proportion Difference Evaluation (PDE), which reveals the propagation effect of a series and dominant series; and (ii) Pattern Space Enumeration (PSE), which reveals strong underlying correlations and defining features for a series. Our findings on an experimental dataset based on a Gaussian distribution and expert knowledge recommendation reveal that identifying CSP and statistically interpretable patterns could contribute significantly to strengthening public safety service delivery. Evaluation was conducted to investigate: (i) the reliability of the model in identifying all inherent series in a crime dataset; (ii) the scalability of the model with varying crime record volumes; and (iii) unique features of the model compared to related research. The study also found that PDE and PSE of series clusters can provide valuable insight into crime deterrence strategies. This research presents considerable empirical evidence showing that the proposed crime clustering (CriClust+) model is promising and can assist in deriving useful crime pattern knowledge.
Article
Crime is pervasive all around the world. Understanding the influence of social features on crime occurrence in a city is a hot topic among researchers. Correlations between crime and other social characteristics have been studied with many statistical models, including the Ordinary Least Squares (OLS) linear regression model, the Random Forest (RF) regression model, the Artificial Neural Network (ANN) model and so on. However, the results of these studies, such as prediction accuracy, are not satisfactory, and many contradictory conclusions have been reached in previous research. These controversies are triggered by several factors, including the non-Gaussian distributions and multicollinearity of urban social data, and the inaccuracy and inadequacy of the processed data. To fill these gaps, we analyzed the influence of 18 urban indicators within 6 categories, covering geography, economy, education, housing, urbanization and population structure, on crime risk in China's major prefecture-level cities by year. We used big-data algorithms, the Least Absolute Shrinkage and Selection Operator (LASSO) and Extremely Randomized Trees (Extra-Trees), to predict crime risk and quantify the influence of urban parameters on crime. Our fitted model achieves 83% accuracy in crime risk prediction, and the importance of the urban indicators is ranked. Results show that the area of land used for living, the number of mobile telephone subscribers, and the employed population are the three main factors influencing crime occurrence in China. Our research contributes to a better understanding of the effects of urban indicators on crime in a socialist nation, and provides instructions and strategies for crime prediction and crime rate control for governments in this big-data era.
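A hedged sketch of ranking indicators with LASSO coefficients and Extra-Trees feature importances; the indicator names and data below are synthetic placeholders, not the study's Chinese city dataset:

```python
# Fit LASSO and Extra-Trees, then list the most important features under each model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=18, n_informative=6, random_state=1)
X = StandardScaler().fit_transform(X)
names = [f"indicator_{i}" for i in range(X.shape[1])]   # hypothetical labels

lasso = LassoCV(cv=5).fit(X, y)
trees = ExtraTreesRegressor(n_estimators=300, random_state=1).fit(X, y)

ranked = sorted(zip(names, lasso.coef_, trees.feature_importances_),
                key=lambda t: -t[2])
for name, coef, imp in ranked[:5]:
    print(f"{name:14s} lasso_coef={coef:+.2f}  extra_trees_importance={imp:.3f}")
```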
Article
Successfully detecting, analyzing, and reasoning about collective anomalies is important for many real-life application domains (e.g., intrusion detection, fraud analysis, software security). The primary challenges to achieving this goal include the overwhelming number of low-risk events and their multimodal relationships, the diversity of collective anomalies across data and anomaly types, and the difficulty of incorporating the domain knowledge of experts. In this paper, we propose the novel concept of the faceted High-Order Correlation Graph (HOCG). Compared with previous low-order correlation graphs, HOCG achieves better user interactivity, computational scalability, and domain generality by synthesizing heterogeneous types of objects, their anomalies, and their multimodal relationships in a single graph. We design elaborate visual metaphors, interaction models, and a coordinated multiple-view interface to allow users to fully unleash the visual analytics power of the HOCG. We conduct case studies in three application domains and collect feedback from domain experts who apply our method to these scenarios. The results demonstrate the effectiveness of the HOCG in the overview of point anomalies, the detection of collective anomalies, and the reasoning process of root cause analysis.
Article
Full-text available
The increase in crime data recording, coupled with data analytics, has resulted in the growth of research approaches aimed at extracting knowledge from crime records to better understand criminal behavior and ultimately prevent future crimes. While many of these approaches make use of clustering and association rule mining techniques, fewer focus on predictive models of crime. In this paper, we explore models for predicting the frequency of several types of crime by LSOA code (Lower Layer Super Output Areas, an administrative geography used by the UK police), as well as the frequency of anti-social behavior crimes. Three algorithms are used from different categories of approaches: instance-based learning, regression and decision trees. The data are from the UK police and contain over 600,000 records before preprocessing. The results, considering predictive performance as well as processing time, indicate that decision trees (the M5P algorithm) can be used to reliably predict crime frequency in general as well as anti-social behavior frequency.
Conference Paper
This paper describes a visual analytics method for visualizing the effects of multiple anomaly detection models, exploring the complex model space of a specific type of detection method, namely Query with Conditional Attributes (QCAT), and facilitating the construction of composite models using multiple QCATs. We have developed a prototype system that features a browser-based interface and a database-driven back end. We tested the system using the "Insider Threat Dataset" provided by CMU.
Article
Full-text available
This survey introduces the emergence of link mining and its relevant application to detect anomalies which can include events that are unusual, out of the ordinary or rare, unexpected behaviour, or outliers.
Article
Full-text available
Data points that exhibit abnormal behavior, either spatially, temporally, or both, are considered spatio-temporal outliers. Spatio-Temporal outlier detection is important for the discovery of exceptional events due to the rapidly increasing amount of spatio-temporal data available, and the need to understand such data. A tropical cyclone system or a hurricane can be considered abnormal activities of the atmosphere system. Discovery of such abnormality usually leverages data from a satellite or radar. Not many people have thought about using a weather buoy, a floating device that provides meteorological/environmental information in real-time for open ocean and coastal zones. The aim of this research is to see if a spatio-temporal outlier approach can help to discover the evolution and movement of the hurricane system from weather buoy observations. This paper leverages an algorithm, Spatio-Temporal Local Density-Based Clustering of Applications with Noise (ST-LDBCAN), which has been developed and used by the authors to detect outliers in various scenarios. The ST-LDBCAN has a novel way of defining spatio-temporal context and can handle multivariate data, which is its advantage over existing algorithms. The results show a good correlation between detected spatio-temporal outliers and the paths and evolution of Hurricanes Katrina and Gustav.
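A hedged sketch of the general idea of density-based spatio-temporal outlier labeling, using plain DBSCAN on a scaled space-time distance as a generic stand-in rather than the authors' ST-LDBCAN algorithm; the coordinates, the time scaling, and the eps/min_samples values are assumptions:

```python
# Flag observations that are isolated in combined space-time as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# columns: latitude, longitude (degrees) and time (hours since start) -- synthetic
events = np.column_stack([
    rng.normal(29.0, 0.05, 300), rng.normal(-90.0, 0.05, 300), rng.uniform(0, 72, 300)
])
events[:5, :2] += 1.0          # a few spatially displaced observations

scaled = events.copy()
scaled[:, 2] /= 24.0           # assume 1 day of separation ~ 1 degree of distance
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)
print("spatio-temporal outliers:", np.where(labels == -1)[0])
```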
Article
Validation of Operations Research/Management Science (OR/MS) decision support models is usually performed with the aim of giving decision makers sufficient confidence to use the model's recommendations to support their decision-making. OR/MS models for investigation and improvement pose a particularly challenging validation task: their complex nature often relies on a wide variety of data sources and empirical estimates used as model parameters, as well as on multiple conceptual and computational modelling techniques. In this paper, we performed an extensive literature review of validation techniques for healthcare models in OR/MS. Despite calls for systematic approaches to the validation of complex OR/MS decision support models, we identified a clear lack of appropriate application of validation techniques in published healthcare models. The “Save a minute – save a day” model for evaluating the long-term benefits of faster access to thrombolysis therapy in acute ischaemic stroke is used as a case to demonstrate how multiple aspects of data validity, conceptual model validity, computerized verification, and operational model validity can be systematically addressed when developing a complex OR/MS decision support model for investigation and improvement in health services.
Book
Clemens Pirker addresses the frequent uncertainty among researchers about how to deal with extreme and outlying observations in data analysis. He draws on various scientific domains to explore possible handling alternatives and the relevance of the phenomenon. In the empirical section, a published segmentation study of international tourists that used sample censoring is replicated, and the effects on the results are discussed. The dissertation concludes that authors may gain additional insight from their data by reporting such cases and their influence on the results.
Article
Outlier detection has many applications, and research attention to it is increasing. The detection rate is a significant measure for outlier mining that evaluates the performance of outlier detection algorithms. The problem is especially challenging because of the difficulty of defining a meaningful outlier measure that yields a good detection rate. This paper proposes a novel approach to outlier detection that considers frequent negative itemsets, and produces positive itemsets together with negative itemsets. The knowledge and interesting patterns generated from frequent positive and negative (FPN) itemsets enhance the outlier detection task in this method. The FPN itemsets help identify transactions that are rare and in conflict with each other. However, discovering negative itemsets remains a challenge. To further investigate the potential of frequent negative itemsets in outlier detection, an experiment is conducted on the UCI datasets. The FPN itemset approach obtains a better detection rate than other algorithms on the majority of datasets, indicating that the proposed approach is promising for solving outlier detection problems.
Article
Full-text available
The improper use of law enforcement data hinders the crime-fighting capabilities of government agencies, and police personnel often lack the means to share information with other agencies. To overcome the challenges of handling excessive information, COPLINK, an integrated information and knowledge management environment, was introduced. COPLINK helps capture, analyze, visualize and share law enforcement-related information in social and organizational contexts. COPLINK consists of two main components: COPLINK Connect, which has been designed to allow diverse police departments to share data seamlessly through an easy-to-use interface, and COPLINK Detect, which uncovers various types of criminal associations that exist in police databases.
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers that appear to be meaningful but cannot be identified with existing approaches. Finally, a careful performance evaluation confirms that our approach to finding local outliers is practical.
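A minimal sketch of the LOF idea using scikit-learn's implementation on synthetic data; the neighborhood size and the two clusters are illustrative assumptions:

```python
# Score each point by how isolated it is relative to its local neighborhood.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense cluster
               rng.normal(6, 0.3, (50, 2)),     # tighter cluster
               [[3.0, 3.0], [10.0, 10.0]]])     # two isolated points

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                     # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_          # higher score = more outlying
print("top LOF scores:", np.round(np.sort(scores)[-3:], 2))
print("flagged points:", np.where(labels == -1)[0])
```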
Article
Full-text available
In this article, we describe an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size. The method is based on measuring similarity between features, whereby redundancy among them is removed. It does not need any search and is therefore fast. A new feature similarity measure, called the maximum information compression index, is introduced. The algorithm is generic in nature and has the capability of multiscale representation of data sets. Its superiority, in terms of speed and performance, is established extensively over various real-life data sets of different sizes and dimensions. It is also demonstrated how redundancy and information loss in feature selection can be quantified with an entropy measure.
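A hedged sketch of a pairwise feature-similarity score in the spirit of the maximum information compression index, computed here simply as the smallest eigenvalue of the 2x2 covariance matrix of two features (small value = high redundancy); the exact normalisation and clustering steps of the original algorithm are omitted:

```python
# Smallest eigenvalue of the pairwise covariance matrix as a redundancy score.
import numpy as np

def compression_index(x, y):
    cov = np.cov(x, y)                      # 2x2 covariance matrix
    return float(np.min(np.linalg.eigvalsh(cov)))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(scale=0.01, size=1000)   # nearly redundant with a
c = rng.normal(size=1000)                          # independent feature

print("lambda2(a, b) =", round(compression_index(a, b), 4))  # close to 0
print("lambda2(a, c) =", round(compression_index(a, c), 4))  # clearly larger
```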
Article
Full-text available
As is said in signal processing, "one person's noise is another person's signal." For many applications, such as the exploration of satellite or medical images and the monitoring of criminal activities in electronic commerce, identifying exceptions can often lead to the discovery of truly unexpected knowledge. In this paper, we study an intuitive notion of outliers. A key contribution of this paper is to show how the proposed notion of outliers unifies or generalizes many existing notions of outliers provided by discordancy tests for standard statistical distributions. Thus, a unified outlier detection system can replace a whole spectrum of statistical discordancy tests with a single module detecting only the kinds of outliers proposed. A second contribution of this paper is the development of an approach to find all outliers in a dataset. The structure underlying this approach resembles a data cube, which has the advantage of facilitating integration with many OLAP and data mining systems.
Article
Full-text available
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
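A hedged sketch of the projection idea: each attribute is split into phi equi-depth ranges, every k-dimensional cell is scored with the sparsity coefficient S = (n - N·f^k) / sqrt(N·f^k·(1 - f^k)), and the most negative cells point at candidate outliers. This brute-force enumeration is an illustration only; the referenced work uses an evolutionary search instead, and the data and parameters below are assumptions:

```python
import itertools
import numpy as np

def sparse_cells(X, phi=4, k=2, top=3):
    """Return the `top` sparsest k-dimensional cells and the points inside them."""
    N, d = X.shape
    f = 1.0 / phi
    # equi-depth bin index of every value, per attribute
    bins = np.column_stack([
        np.searchsorted(np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1]),
                        X[:, j], side="right")
        for j in range(d)
    ])
    expected = N * f ** k
    cells = []
    for dims in itertools.combinations(range(d), k):
        for cell in itertools.product(range(phi), repeat=k):
            mask = np.all(bins[:, list(dims)] == cell, axis=1)
            n = int(mask.sum())
            s = (n - expected) / np.sqrt(expected * (1 - f ** k))
            cells.append((s, dims, cell, np.where(mask)[0]))
    return sorted(cells, key=lambda c: c[0])[:top]

rng = np.random.default_rng(0)
base = rng.normal(size=400)
X = np.column_stack([base, base + rng.normal(scale=0.2, size=400),
                     rng.normal(size=400), rng.normal(size=400)])
X[0, :2] = [2.5, -2.5]     # breaks the strong correlation between attributes 0 and 1
for s, dims, cell, idx in sparse_cells(X):
    print(f"S = {s:+.2f}  dims={dims}  cell={cell}  points={idx.tolist()}")
```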
Article
Full-text available
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers in large datasets can only deal efficiently with two dimensions/attributes of a dataset. Here, we study the notion of DB- (distance-based) outliers. While we provide formal and empirical evidence showing the usefulness of DB-outliers, we focus on the development of algorithms for computing such outliers. First, we present two simple algorithms, both having a complexity of O(k·N²), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm whose complexity is linear with respect to N.
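A minimal sketch of the naive O(N²) form of distance-based DB(p, D)-outliers on synthetic data: a point is an outlier if at most a fraction (1 - p) of the other points lie within distance D of it; the values of p and D below are illustrative assumptions:

```python
import numpy as np

def db_outliers(X, p=0.95, D=1.5):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = (dist <= D).sum(axis=1) - 1          # exclude the point itself
    return np.where(neighbours <= (1 - p) * (n - 1))[0]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), [[6.0, 6.0], [-5.0, 7.0]]])
print("DB(0.95, 1.5)-outliers:", db_outliers(X))
```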
Article
Developments in a number of academic disciplines-the sociology of deviance, criminology, economics, psychology-suggest that it is useful to see criminal behavior not as the result of psychologically and socially determined dispositions to offend, but as the outcome of the offender's broadly rational choices and decisions. This perspective provides a basis for devising models of criminal behavior that (1) offer frameworks within which to locate existing research, (2) suggest directions for new research, (3) facilitate analysis of existing policy, and (4) help to identify potentially fruitful policy initiatives. Such models need not offer comprehensive explanations; they may be limited and incomplete, yet still be "good enough" to achieve these important policy and research purposes. To meet this criterion they need to be specific to particular forms of crime, and they need separately to describe both the processes of involvement in crime and the decisions surrounding the commission of the offense itself. Developing models that are crime specific and that take due account of rationality will also demand more knowledge about the ways in which offenders process and evaluate relevant information. Such a decision perspective appears to have most immediate payoff for crime control efforts aimed at reducing criminal opportunity.
Article
As the practicality of expert systems continues to materialize, many high-value problems are beginning to be addressed. In recent years, expert systems have been developed for practical applications in areas such as oil exploration, chemical structure analysis, medical diagnosis, industrial equipment trouble-shooting, personnel training, farm management, business credit analysis, mathematical analysis, configuration of computer equipment, production scheduling and control, cancer classification, space vehicle electric power management, and monitoring of biological activities in ponds. An expert system is a sophisticated computer program that uses a knowledge base and 'If-Then' inferences to solve difficult decision problems that would otherwise require a human expert to solve. The flexibility in the structure of expert systems makes the technology easily adaptable to a wide-ranging list of human functional endeavors. This paper describes an ongoing project aimed at developing an expert system for suspect identification in armed robbery incidents. A prototype of the system, named AREST (Armed Robbery Eidetic Suspect Typing), has been completed and is described in this paper.
Article
This paper treats essentially the first derivative of an estimator viewed as functional and the ways in which it can be used to study local robustness properties. A theory of robust estimation “near” strict parametric models is briefly sketched and applied to some classical situations. Relations between von Mises functionals, the jackknife and U-statistics are indicated. A number of classical and new estimators are discussed, including trimmed and Winsorized means, Huber-estimators, and more generally maximum likelihood and M-estimators. Finally, a table with some numerical robustness properties is given.
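A small sketch of two of the robust location estimators discussed above, computed with SciPy on a toy sample; the 10% cut level and the data are illustrative assumptions:

```python
# Compare the plain mean with trimmed, Winsorized and median estimates of location.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2, 9.6, 10.4, 48.0])  # one gross error

print("mean        :", round(np.mean(x), 2))
print("10% trimmed :", round(stats.trim_mean(x, 0.10), 2))
print("winsorized  :", round(float(np.mean(winsorize(x, limits=(0.10, 0.10)))), 2))
print("median      :", round(float(np.median(x)), 2))
```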
Article
The discovery of preferences in space and time is important in a variety of applications. In this paper we first establish the correspondence between a set of preferences in space and time and density estimates obtained from observations of spatial-temporal features recorded within large databases. We perform density estimation using both kernel methods and mixture models. The density estimates constitute a probabilistic representation of preferences. We then present a point process transition density model for space-time event prediction that hinges upon the density estimates from the preference discovery process. The added dimension of preference discovery through feature space analysis enables our model to outperform traditional preference modeling approaches. We demonstrate this performance improvement using a criminal incident database from Richmond, Virginia. Criminal incidents are human-initiated events that may be governed by criminal preferences over space and time. We applied our modeling technique to breaking and entering crimes committed in both residential and commercial settings. Our approach effectively recovers the preference structure of the criminals and enables one-week ahead forecasts of threatened areas. This capability to accommodate all measurable features, identify the key features, and quantify their relationship with event occurrence over space and time makes this approach applicable to domains other than law enforcement.
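A hedged sketch of the density-estimation step described above, using a Gaussian kernel over event coordinates; the coordinates are synthetic, not the Richmond data, and the grid resolution is an assumption:

```python
# Estimate a spatial "preference" surface from event locations and find its peak.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
events = np.vstack([rng.normal([2, 3], 0.4, (150, 2)),
                    rng.normal([6, 1], 0.6, (80, 2))]).T     # shape (2, n) for the KDE

kde = gaussian_kde(events)
grid_x, grid_y = np.mgrid[0:8:80j, 0:5:50j]
density = kde(np.vstack([grid_x.ravel(), grid_y.ravel()])).reshape(grid_x.shape)

peak = np.unravel_index(density.argmax(), density.shape)
print("highest-preference grid cell:",
      (round(float(grid_x[peak]), 2), round(float(grid_y[peak]), 2)))
```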
Chapter
Contents: Introduction; Historical Perspective; Graphical Display of Multivariate Data Points; Graphical Display of Multivariate Functionals; Geometry of Higher Dimensions.
Article
Associating records in a large database that are related but not exact matches has importance in a variety of applications. In law enforcement, this task enables crime analysts to associate incidents possibly resulting from the same individual or group of individuals. In practice, most crime analysts perform this task manually by searching through incident reports looking for similarities. This paper describes automated approaches to data association. We report tests showing that our data association methods significantly reduced the time required by manual methods with accuracy comparable to experienced crime analysts. In comparison to analysis using the structured query language (SQL), our methods were both faster and more accurate.
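A hedged sketch of the kind of pairwise record scoring such automated association relies on; the attribute names, weights and distance cut-offs below are hypothetical, not taken from the paper:

```python
# Score how similar two incident records are across MO, weapon, time and location.
from math import hypot

def incident_similarity(a, b, weights=None):
    w = weights or {"mo": 0.4, "weapon": 0.2, "time": 0.2, "space": 0.2}
    score = 0.0
    score += w["mo"] * (a["mo"] == b["mo"])
    score += w["weapon"] * (a["weapon"] == b["weapon"])
    score += w["time"] * max(0.0, 1.0 - abs(a["hour"] - b["hour"]) / 12.0)
    dist = hypot(a["x"] - b["x"], a["y"] - b["y"])
    score += w["space"] * max(0.0, 1.0 - dist / 5.0)   # 5 km cut-off, assumed
    return score

r1 = {"mo": "rear-door entry", "weapon": "handgun", "hour": 22, "x": 1.0, "y": 2.0}
r2 = {"mo": "rear-door entry", "weapon": "handgun", "hour": 23, "x": 1.8, "y": 2.5}
r3 = {"mo": "street approach", "weapon": "knife",   "hour": 14, "x": 9.0, "y": 9.0}

print("similar pair   :", round(incident_similarity(r1, r2), 2))
print("dissimilar pair:", round(incident_similarity(r1, r3), 2))
```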
Conference Paper
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
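A minimal sketch of the formulation described above, ranking points by the distance to their k-th nearest neighbour and reporting the top n; a brute-force neighbour search is used here instead of the paper's partition-based pruning, and the data, k and n are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), [[5.5, 5.5], [-4.0, 6.0], [7.0, -6.0]]])

k, n = 10, 3
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth_dist = dists[:, -1]                         # column 0 is the point itself
outliers = np.argsort(kth_dist)[-n:][::-1]
print("top-n outliers:", outliers, "k-distances:", np.round(kth_dist[outliers], 2))
```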
Conference Paper
Data analysis applications typically aggregate data across many dimensions looking for unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional answers. Applications need the N-dimensional generalization of these operators. The paper defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The cube treats each of the N aggregation attributes as a dimension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower dimensional spaces. Aggregation points are represented by an "infinite value": ALL, so the point (ALL, ALL, ..., ALL, sum(*)) represents the global sum of all items. Each ALL value actually represents the set of values contributing to that aggregation.
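A small sketch emulating the CUBE operator in pandas by aggregating over every subset of the grouping attributes and marking rolled-up attributes with ALL; the column names and values are made up for illustration:

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "crime_type": ["robbery", "robbery", "burglary", "burglary", "robbery"],
    "district":   ["north",   "south",   "north",    "south",    "north"],
    "count":      [3, 1, 2, 4, 2],
})

dims = ["crime_type", "district"]
pieces = []
for r in range(len(dims) + 1):
    for subset in itertools.combinations(dims, r):
        if subset:
            g = df.groupby(list(subset), as_index=False)["count"].sum()
        else:
            g = pd.DataFrame({"count": [df["count"].sum()]})   # grand total
        for col in dims:
            if col not in subset:
                g[col] = "ALL"                                  # rolled-up dimension
        pieces.append(g[dims + ["count"]])

cube = pd.concat(pieces, ignore_index=True)
print(cube)
```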
Article
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.
Article
Ratios of the form (x_n - x_{n-j}) / (x_n - x_i) for small values of i and j and n = 3, ..., 30 are discussed. The variables concerned are order statistics, i.e., sample values such that x_1 < x_2 < ... < x_n. Analytic results are obtained for the distributions of these ratios for several small values of n, and percentage values are tabled for these distributions for samples of size n ≤ 30.
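A small sketch of the r10 ratio of order statistics described above, (x_n - x_{n-1}) / (x_n - x_1), compared against an approximate tabulated 95% critical value for n = 7 (the critical value and the data are assumptions for illustration):

```python
import numpy as np

def dixon_r10(sample):
    x = np.sort(np.asarray(sample, dtype=float))
    return (x[-1] - x[-2]) / (x[-1] - x[0])

data = [10.2, 10.4, 9.9, 10.1, 10.3, 10.0, 12.9]   # suspected outlier: 12.9
r = dixon_r10(data)
critical_95_n7 = 0.568                              # approximate table value, assumed
print(f"r10 = {r:.3f}, reject largest value: {r > critical_95_n7}")
```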
Article
Thesis (Ph. D.)--University of Virginia, 2003. Includes bibliographical references (leaves 117-122).
Article
Receiver operating characteristic (ROC) methodology has been increasingly used in medical applications in the last 10 years. The text by Swets and Pickett has popularized the technique and the journal Medical Decision Making (1981--) provides a forum for further methodologic issues. In this article, I will (1) describe the nature of the data generated by ROC studies; (2) evaluate the choices of summary indices of performance (accuracy); (3) outline the data-analytic techniques used, and how to incorporate data from multiple observers and multiple "readings"; (4) review proposed alternatives to the commonly used binormal ROC model; and (5) discuss issues, such as verification bias, and challenges, such as multicenter comparative imaging studies and the difficulty of obtaining "truth data", which need to be addressed when adapting ROC methods to medical contexts.
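A minimal sketch of computing an ROC curve and its area for a toy two-class scoring problem with scikit-learn; the labels and scores are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(200, dtype=int), np.ones(100, dtype=int)]
scores = np.r_[rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 100)]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", round(roc_auc_score(y_true, scores), 3))
print("TPR at the operating point nearest FPR = 0.1:",
      round(float(tpr[np.argmin(np.abs(fpr - 0.1))]), 3))
```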
Article
Credit card fraud falls broadly into two categories: behavioural fraud and application fraud. Application fraud occurs when individuals obtain new credit cards from issuing companies using false personal information and then spend as much as possible in a short space of time. However, most credit card fraud is behavioural and occurs when details of legitimate cards have been obtained fraudulently and sales are made on a 'Cardholder Not Present' basis. These sales include telephone sales and e-commerce transactions where only the card details are required. In this paper, we are concerned with detecting behavioural fraud through the analysis of longitudinal data. These data usually consist of credit card transactions over time, but can include other variables, both static and longitudinal. Statistical methods for fraud detection are often classification (supervised) methods that discriminate between known fraudulent and non-fraudulent transactions; however, these methods rely on accurate identification of fraudulent transactions in historical databases -- information that is often in short supply or non-existent. We are particularly interested in unsupervised methods that do not use this information but instead detect changes in behaviour or unusual transactions. We discuss two methods for unsupervised fraud detection in credit data in this paper and apply them to some real data sets. Peer group analysis is a new tool for monitoring behaviour over time in data mining situations. In particular, the tool detects individual accounts that begin to behave in a way distinct from accounts to which they had previously been similar. Each account is selected as a target account and is compared with all other accounts in the database, using either external comparison criteria or i...
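A hedged sketch of the peer group idea described above: each account's peers are the k accounts most similar to it over past periods, and the account is flagged if its latest value deviates strongly from the peer-group summary; the k, the z-score threshold and the data are illustrative assumptions:

```python
import numpy as np

def peer_group_flags(history, current, k=10, z_threshold=3.0):
    flags = []
    for i in range(len(history)):
        dist = np.linalg.norm(history - history[i], axis=1)
        peers = np.argsort(dist)[1:k + 1]               # closest accounts, excluding i
        mu, sd = current[peers].mean(), current[peers].std() + 1e-9
        if abs(current[i] - mu) / sd > z_threshold:
            flags.append(i)
    return flags

rng = np.random.default_rng(0)
history = rng.normal(100, 10, (200, 6))                 # past 6 periods, 200 accounts
current = history.mean(axis=1) + rng.normal(0, 5, 200)  # latest period follows history
current[7] += 80                                        # account 7 suddenly changes
print("accounts deviating from their peer group:", peer_group_flags(history, current))
```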
R.O. Heck, Career Criminal Apprehension Program: Annual Report, Office of Criminal Justice Planning, Sacramento, CA, 1991.
M. Ruane, S. Horwitz, Sniper: Inside the Hunt for the Killers Who Terrorized the Nation, 1st edition, Random House, New York, 2003.
D. Andrews, P. Bickel, F. Hampel, P. Huber, W. Rogers, and J. Tukey, Robust Estimates of Location, Princeton University Press, 1972.
H. Liu, Space-Time Point Process Modeling: Feature Selection and Transition Density Estimation, Dissertation, Systems Engineering, University of Virginia, 1999.
Chaudhuri, An Overview of Data Warehousing and OLAP Technology.
S. Lin, Outlier-based Method for Data Association, Dissertation in Systems and Information Engineering.