Article

Automated criminal link analysis based on domain knowledge


Abstract

Link (association) analysis has been used in the criminal justice domain to search large datasets for associations between crime entities in order to facilitate crime investigations. However, link analysis still faces many challenging problems, such as information overload, high search complexity, and heavy reliance on domain knowledge. To address these challenges, this article proposes several techniques for automated, effective, and efficient link analysis. These techniques include the co-occurrence analysis, the shortest path algorithm, and a heuristic approach to identifying associations and determining their importance. We developed a prototype system called CrimeLink Explorer based on the proposed techniques. Results of a user study with 10 crime investigators from the Tucson Police Department showed that our system could help subjects conduct link analysis more efficiently than traditional single-level link analysis tools. Moreover, subjects believed that association paths found based on the heuristic approach were more accurate than those found based solely on the co-occurrence analysis and that the automated link analysis system would be of great help in crime investigations.
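
As an illustration of how co-occurrence analysis and a shortest path search can work together, the following is a minimal sketch and not the CrimeLink Explorer implementation. It assumes that incidents are given as lists of person names, that the association weight is the Jaccard-style share of incidents in which two people appear together, and that the strongest path is the one maximizing the product of weights (found by running Dijkstra's algorithm on -log(weight)). The function names `cooccurrence_weights` and `strongest_path` are hypothetical, and the heuristic approach mentioned in the abstract is not modeled here.

```python
import heapq
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(incidents):
    """Assumed weighting: shared incidents divided by all incidents of either person."""
    appears = defaultdict(set)
    for idx, people in enumerate(incidents):
        for person in set(people):
            appears[person].add(idx)
    weights = {}
    for a, b in combinations(sorted(appears), 2):
        both = len(appears[a] & appears[b])
        if both:
            weights[(a, b)] = both / len(appears[a] | appears[b])
    return weights


def strongest_path(weights, source, target):
    """Dijkstra on -log(weight): the cheapest path has the largest weight product."""
    graph = defaultdict(list)
    for (a, b), w in weights.items():
        cost = -math.log(w)  # w is in (0, 1], so cost is non-negative
        graph[a].append((b, cost))
        graph[b].append((a, cost))
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, cost in graph[node]:
            cand = dist + cost
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    if target not in best:
        return None  # no association path exists
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy incident data (invented for illustration only).
incidents = [["A", "B"], ["A", "B"], ["B", "C"], ["C", "D"], ["A", "D"]]
weights = cooccurrence_weights(incidents)
print(strongest_path(weights, "A", "C"))
```

Taking the negative logarithm turns a product of weights into a sum of costs, which is what lets a standard shortest path algorithm rank multi-step association paths.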


... Over time, many platforms for digital forensic analysis (Table 1) emerged to provide support for Big Data Analysis and provide ways to integrate and link knowledge to support police investigation and security events [17,23]. The needs of such platforms are mainly to integrate a plethora of tools/systems available (such as Pig, Hadoop, Cassandra, Zookeeper, Lucene, and Mahout) and different types of analysis required for big digital forensics analysis, such as link analysis to connect knowledge from different sources (e.g., [17,23]) or text/data mining approaches supported by machine learning [19]. ...
... CrimeLink Analysis Explorer [23] is a platform that provides support for link analysis investigations, supporting co-occurrence analysis, the shortest path algorithm, and a heuristic to identify the importance of associations. The platform was developed as an ad-hoc solution based on a management system supported by a database connection and modules for co-occurrence weights, a heuristic module, an association path module, and a graphical user interface. ...
Preprint
Full-text available
With the advancing digitization of our society, network security has become one of the critical concerns for most organizations. In this paper, we present CopAS, a system targeted at Big Data forensics analysis, allowing network operators to comfortably analyze and correlate large amounts of network data to get insights about potentially malicious and suspicious events. We demonstrate the practical usage of CopAS for insider threat detection on a publicly available PCAP dataset and show how the system can be used to detect insiders hiding their malicious activity in the large amounts of networking data streams generated during the daily activities of an organization.
... Nowadays, numerous highly formatted databases are utilized to construct domain-specific KGs [21], reasoning on which is widely used in various applications, e.g. criminal link analysis [15], [16], suspicious transaction detection [12], e-commerce recommendation [10], etc. In such applications, reasoning between two entities often relates to the reachability queries with both label and substructure constraints [16], [22], as KGs can be considered as edge-labeled graphs. ...
... else if LCS(s, v, L, F) then
8: if LCS(v, t, L, T) then
9: return Q = T
10:
11: if LCS(v, t, L, T) then
12: return Q = T
13: return Q = F
14: Function: LCS(s*, t*, L, B) // B is a boolean
15: if B = T then
16:
18: Take an element u from S
19: for each edge e = (u, l, w), l ∈ L, incident to u do // We explore w in the following cases
20: ...
Preprint
Since knowledge graphs (KGs) describe and model the relationships between entities and concepts in the real world, reasoning on KGs often corresponds to reachability queries with label and substructure constraints (LSCR). Specifically, for a search path p, LSCR queries not only require that the labels of the edges passed by p are in a certain label set, but also require that a vertex in p satisfies a certain substructure constraint. LSCR queries are much more complex than label-constraint reachability (LCR) queries, and, to the best of our knowledge, there is no efficient solution for LSCR queries on KGs. Motivated by this, we introduce two solutions for such queries on KGs, UIS and INS. The former can also be utilized for general edge-labeled graphs, and is relatively handy for practical implementation. The latter is an efficient local-index-based informed search strategy. An extensive experimental evaluation, on both synthetic and real KGs, illustrates that our solutions can efficiently process LSCR queries on KGs.
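To make the LSCR query definition concrete, here is a minimal sketch of a plain breadth-first search that answers such a query. It is neither UIS nor INS from the preprint. It assumes a directed edge-labeled graph given as (source, label, target) triples, and it simplifies the substructure constraint to a caller-supplied predicate on a single vertex; the name `lscr_reachable` and the toy data are hypothetical.

```python
from collections import deque


def lscr_reachable(edges, s, t, labels, satisfies):
    """Return True if t is reachable from s using only edges whose label is in
    `labels`, along a path on which at least one vertex passes `satisfies`."""
    adj = {}
    for u, label, v in edges:
        if label in labels:  # label constraint: drop disallowed edges up front
            adj.setdefault(u, []).append(v)
    # Search state: (vertex, whether the constraint has been met so far).
    start = (s, satisfies(s))
    seen = {start}
    queue = deque([start])
    while queue:
        node, ok = queue.popleft()
        if node == t and ok:
            return True
        for nxt in adj.get(node, []):
            state = (nxt, ok or satisfies(nxt))
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


# Toy graph (invented): only vertex "b" satisfies the stand-in constraint.
edges = [("a", "knows", "b"), ("b", "owns", "c"),
         ("a", "calls", "c"), ("c", "knows", "d")]
is_b = lambda v: v == "b"
print(lscr_reachable(edges, "a", "d", {"knows", "owns"}, is_b))
print(lscr_reachable(edges, "a", "d", {"knows", "calls"}, is_b))
```

Tracking the pair (vertex, constraint met) doubles the state space at most, so this baseline stays linear in the size of the filtered graph; an index-based strategy would aim to avoid exploring most of it.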
... Different types of crime, such as armed robbery and kidnapping, concern police at various divisions and levels of government [27,30]. Other types of crime such as cybercrimes and terrorism are usually investigated by local, national and international agencies [31]. ...
... Method used / Tasks / Gaps
[9,12,14,15,31,32,33] Method used: Association rule mining. Tasks: Crime pattern analysis from crime data. Gaps: Model's processing time and visualization were not considered.
[1,2,10,17,27,35] Method used: Statistical based. Tasks: Analysis of the similar crime data. Gaps: Data collection not available and no concern with solving processing time.
[3,14,15,17,18,24,25,31,33,34] Method used: Clustering. Tasks: Data groupings into similarities and clusters. Gaps: Produces unreliable results with noisy data and when number of clusters are small.
[11,16,17,18,19,23,25,27,36,37,38,42,43,45] Method used: Supervised learning. Tasks: Crime data analysis and classification. Gaps: Accuracy of prediction and efficiency of the model were not considered. ...
... However, our focus is limited to the studies in illicit (and criminal) networks. Schroeder et al. (2007) discussed four principal methods for criminal link analysis: heuristics-based, template-based, similarity-based, and statistical. Link detection in these methods depends on decision rules, pre-defined templates, similarity between entities, and lexical statistics, respectively. ...
... When applied to a drug network, PFS fared better in terms of execution time. Schroeder et al. (2007) combined co-occurrence analysis, shortest path algorithm, and a heuristic approach for automatic link analysis. Isah et al. (2015) used a bipartite network model to infer hidden ties between actors in the illicit medicine supply chain. ...
Article
Illicit trades have emerged as a significant problem to almost every government across the world. Their gradual expansion and diversification throughout the years suggests the existence of robust yet obscure supply chains as well as the inadequacy of current approaches to understand and disrupt them. In response, researchers have been trying hard to identify strategies that would succeed in controlling the proliferation of these trades. With the same motivation, this paper conducts a comprehensive review of prior research in the field of illicit supply-chain networks. The review is primarily focused on the trade of physical products, ignoring virtual products and services. Our discussion includes analyses of their structure and operations, as well as procedures for their detection and disruption, especially from the perspective of operations research, management science, network science, and industrial engineering. We also address persisting challenges in this domain and offer future research directions to pursue.
... As far as applying grouping strategies in crime DM (data mining), numerous usages joined more than one explicit sort of order strategy. In this way, the survey that follows is arranged by every method in sequential requests, and relying upon conditions, those unpredictable blend cases are not recreated [45][46][47][48]. ...
... The characterization strategy, otherwise called decision trees (DT), is utilized [47] for recognizing dubious messages and announced over 95% precision in accurately characterizing messages in a huge estimated dataset. The model of the beguiling hypothesis is applied to the dataset of messages, and the decision tree is produced using the ID3 calculation. ...
Article
Full-text available
Crime continues to be a serious threat to all groups and peoples throughout the world, alongside the growing complexity of the technology and procedures that are manipulated to enable extremely complex criminal acts. Data mining, the method of revealing hidden information from Big Data, is now an essential tool for examining, reducing, and avoiding crime and is used by both government and private institutions across the globe. The data mining methods themselves are briefly presented to the reader; these include social network analysis, neural networks, the naive Bayes rule, support vector machines, decision trees, association rule mining, clustering, and entity extraction, among others. The main objective of this article is to offer a concise analysis of the data mining applications in crime. Finally, the article evaluates applications of data mining in crime, covering a considerable quantity of the study to date, displayed in chronological order with a summary table of numerous crucial data mining applications in the crime area as a directory of reference.
... They have utilised heterogeneous bibliographic information network analysis to predict the potential novel co-citations that are likely to occur in the future as discussed in Section 5.1.3.3. ...
... 109, 110, 113-118, 121-126, 128, 129, 132, 135, 137-140, 142-144, 146-153, 155, 156, 159-161, 163, 165-168, 170, 173-176, 178-181, 184, 189-192, 194-196, 198-206, 210-217, 219-224]
Other domains: Industrial domain (electric vehicles energy storage systems) [200], Water Purification Techniques [111], Robotics↔Gerontology [81,82], Chance discovery↔Olympic games [152], Counterterrorism [84], Built Environment [93], Genetic algorithms [47], Chinese agricultural economics [73], Crime investigation [162], Climate Science [127], Sustainability issues↔Aviation industry [136] ...
... Past Studies
Category 1: [1, 6-8, 11, 12, 16, 18, 22, 30, 38-42, 54, 59-61, 63, 66, 67, 72, 77-79, 94, 95, 113, 115, 121, 124, 128, 135, 137-139, 143, 155, 156, 159, 161, 166-168, 178, 196, 201-203, 216, 223, 224]
Category 2: [10, 15, 17, 19, 21, 24, 26-29, 31, 32, 45, 46, 51-53, 58, 68-71, 73-76, 83, 85, 90-92, 99-101, 104, 106, 107, 110, 113, 114, 116, 118, 123, 125, 126, 132, 140, 144, 146, 147, 149-151, 153, 160, 170, 173-175, 179-181, 184, 189-192, 195, 199, 205, 206, 210, 211, 213-215, 219-221]
Category 3: [14, 25, 33, 35, 48, 49, 109, 122, 129, 142, 163, 165, 194, 198]
Category 4: [152, 200]
Category 5: [47, 73, 81, 82, 84, 93, 111, 127, 136, 162] ...
Article
The vast nature of scientific publications brings out the importance of Literature-Based Discovery (LBD) research that is highly beneficial to accelerate knowledge acquisition and the research development process. LBD is a knowledge discovery workflow that automatically detects significant, implicit knowledge associations hidden in fragmented knowledge areas by analysing existing scientific literature. Therefore, the LBD output not only assists in formulating scientifically sensible, novel research hypotheses but also encourages the development of cross-disciplinary research. In this systematic review, we provide an in-depth analysis of the computational techniques used in the LBD process using a novel, up-to-date, and detailed classification. Moreover, we also summarise the key milestones of the discipline through a timeline of topics. To provide a general overview of the discipline, the review outlines LBD validation checks, major LBD tools, application areas, domains, and generalisability of LBD methodologies. We also outline the insights gathered through our statistical analysis that capture the trends in LBD literature. To conclude, we discuss the prevailing research deficiencies in the discipline by highlighting the challenges and opportunities of future LBD research.
... prediction of a sub-population's involvement in a very specific financial crime) or rely on an entity to seed the problem (e.g. find all shortest paths between the source entity and a target entity) [2,3]. An alternative approach is using the entire context of the criminal complex system to support the detection of crime fragments, adopting the open world assumption. ...
... The COPLINK software, originally developed from research between Arizona State University, Tucson Police Department, and Phoenix Police Department since 1997, has contributed a significant body of literature on using SNA and related technology to increase our understanding of crime [22]. COPLINK does not focus on the detection of criminal subgraphs per se, but conducts a range of topological metrics on subgraphs after they have been identified [21,23] including link analysis [3,22], topology [15], and identification of significant facilitators in evolving criminal networks [16]. The examples provided by Xu and Chen [21] include two graphs of 57 and 60 nodes. ...
Article
Full-text available
Law enforcement and intelligence agencies generally have access to a number of rich data sources, both structured and unstructured, and with the advent of high performing entity resolution it is now possible to fuse multiple heterogeneous datasets into an explicit generic data representation. But once this is achieved, how should agencies go about attempting to exploit this data by proactively identifying criminal events and the actors responsible? The authors will outline an effective generic method that computationally extracts minimally overlapping contextual subgraphs, then uses these subgraphs as the basis to construct a mesoscopic graph based on the intersections between the subgraphs, enabling knowledge discovery from these data representations for the purpose of maximally disrupting terrorism, organised crime and the broader criminal network.
... This technique can correlate massive amounts of data about entities in regard to fraud, terrorism, narcotics, and others [11]. This data mining technique is the first level of analysis by which networks of people, locations, groups, vehicles, contact addresses, bank accounts, and other tangible entities can be explored, assembled, detected, and analyzed (see Figure 1). Linkage data is visualized as a graph with linked nodes, where nodes represent suspects of interest for the investigators, and a link indicates a relationship or transaction between suspects and criminal artifacts. ...
... In CrimeLink Explorer, the heuristic approach supports incorporating investigative domain knowledge into a link analysis approach for measuring association strength automatically [11]. A criminal link analysis approach is proposed in Ref 15, where the system uses the time-based relationship, event similarity, time-related proximity, and document distributional proximity to identify the events of a terrorist attack incident. For a criminal network, the characteristic of a criminal is extracted via some ties and links. ...
Article
Applications of various data analytics technologies to security and criminal investigation during the past three decades have demonstrated the inception, growth, and maturation of criminal analytics. We first identify five cutting‐edge data mining technologies such as link analysis, intelligent agents, text mining, neural networks, and machine learning. Then, we explore their recent applications to the criminal analytics domain, and discuss the challenges arising from these innovative applications. We also extend our study to big data analytics which provides some state‐of‐the‐art technologies to reshape criminal investigations. In this paper, we review the recent literature, and examine the potentials of big data analytics for security intelligence under a criminal analytics framework. We examine some common data sources, analytics methods, and applications related to two important aspects of social network analysis namely, structural analysis and positional analysis that lay the foundation of criminal analytics. Another contribution of this paper is that we also advocate a novel criminal analytics methodology that is underpinned by big data analytics. We discuss the merits and challenges of applying big data analytics to the criminal analytics domain. Finally, we highlight the future research directions of big data analytics enhanced criminal investigations. WIREs Data Mining Knowl Discov 2017, 7:e1208. doi: 10.1002/widm.1208 This article is categorized under: Fundamental Concepts of Data and Knowledge > Data Concepts Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining Technologies > Computer Architectures for Data Mining
... CNA requires the ability to integrate information from multiple crime incidents where the relationships between crime entities are recognized using link analysis [5]. ...
... Jennifer Schroeder, Jennifer Jie Xu, Hsinchun Chen and Michael Chau [5] establish association path linking using heuristic methods in knowledge engineering which builds up a knowledge base that results in an inference engine. This paper implements a system called CrimeLink Explorer based on a set of structured crime incidents from the Tucson Police Department. ...
Article
Full-text available
This paper categorizes and compares various works in the field of social networking for covert networks. It uses criminal network analysis to categorize various approaches in social engineering, such as dynamic network analysis, destabilizing covert networks, counter terrorism, key player, subgroup detection and homeland security. The terrorist network has been taken for study because of its network of individuals who spread from continent to continent and have an effective influence of their ideology throughout the globe. It also presents various metrics by which the centrality of nodes in the graphs can be identified, illustrated on a synthetic dataset for the 9/11 attack. This paper will also discuss various open problems in this area.
... CNA requires the ability to integrate information from multiple crime incidents where the relationships between crime entities are established using link analysis [5]. ...
... Jennifer Schroeder, Jennifer Jie Xu, Hsinchun Chen and Michael Chau [5] establish association path linking using heuristic methods in knowledge engineering which constructs a knowledge base that develops an inference engine. This paper implements a system called CrimeLink Explorer based on a set of structured crime incidents from the Tucson Police Department. ...
Article
Full-text available
The world we live in is a complex socio-technical system and systematically thinking about, representing, modeling and analyzing these systems has been made possible by social network analysis approach. A lot of groups or communities do exist in the society but the terrorist network has been taken for study in this paper because they consist of networks of individuals that span countries, continents, the economic status, and form around specific ideology. In this paper we present a survey to study the terrorist network using the criminal network analysis which is based on dynamic network analysis, destabilizing covert networks, counter terrorism, key player, subgroup detection and criminal network analysis in homeland security. This paper will also discuss various open problems in this area.
... Many other tactical and strategic advantages follow, such as an in-depth assessment of the internal structure of criminal groups and the identification of aliases that are hard to detect in regular investigations (Berlusconi et al., 2016). By analyzing the relationships and associations between the collected entities in this way, link analysis can also provide information on the motives that drove criminals to act and thereby guide the various lines of inquiry (Piza and Feng, 2017; Schroeder et al., 2007). Relationships between different investigation files can therefore be identified by searching for common entities recorded in the documents that make up these files. ...
Article
Full-text available
The information produced by our digital activities is constantly increasing. This continuous flow of information also translates into a significant increase in the amount of data to be processed in intelligence activities and police investigations. To facilitate this data processing, new techniques based on artificial intelligence are available to police personnel to automate part of their work. In this context, this article proposes a six-step approach for deploying a structured process and an algorithmic named entity recognition model specifically adapted to the analysis of police investigation documents. Focusing more specifically on the processing of fraud offence files, the article describes in detail the methodological approach needed to use these new analysis technologies effectively. In addition, the evolving role of the criminal intelligence analyst, the actor at the heart of integrating this type of innovation, is also discussed, while underlining the relevance of named entity recognition in the context of police investigations.
... Non-traditional data types: Few research studies have attempted to perform LBD using non-traditional data types such as tweets (Bhattacharya & Srinivasan, 2012), Food and Drug Administration (FDA) drug labels (Bisgin et al., 2011), Popular Medical Literature (PML) news articles (Maclean & Seltzer, 2011), web content (Gordon, Lindsay & Fan, 2002), crime incident reports (Schroeder et al., 2007) and commission reports (Jha & Jin, 2016a). Their results have proved the suitability of LBD discovery setting in a non-traditional context to elicit hidden links. ...
Article
Full-text available
As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), supporting automated knowledge discovery, helps facilitate this process by eliciting novel knowledge by analysing existing scientific literature. This systematic review provides a comprehensive overview of the LBD workflow by answering nine research questions related to the major components of the LBD workflow (i.e., input, process, output, and evaluation). With regards to the input component, we discuss the data types and data sources used in the literature. The process component presents filtering techniques, ranking/thresholding techniques, domains, generalisability levels, and resources. Subsequently, the output component focuses on the visualisation techniques used in LBD discipline. As for the evaluation component, we outline the evaluation techniques, their generalisability, and the quantitative measures used to validate results. To conclude, we summarise the findings of the review for each component by highlighting the possible future research directions.
... Isah, Neagu & Trundle (2015) have demonstrated the use of the bipartite model over pharmaceutical dataset for extracting the hidden relationship between criminals. Some other earlier examples related to our work are commercial tools like COPLINK Explorer (Schroeder, Xu, Chen, & Chau, 2007), Dynalink (Park, Tsang, & Brantingham, 2012), JIGSAW (Stasko, Görg, & Liu, 2008). However, either most of these tools lack proper visualization or do not have the ability to extract criminal relationship from textual data. ...
... The framework is able to identify crimes of different kinds. Chau et al. [22] improved efficiency and accuracy of a link analysis system by incorporating several techniques including co-occurrence analysis, the shortest path algorithm, and a heuristic approach to effectively identify connections in criminal networks. Techniques presented in the previous two studies have been applied to identify suspicious web communication, called Dark Web, for analyzing online communication between suspects of the 9/11 attacks [3]. ...
Article
Full-text available
Cybercriminals exploit the opportunities provided by the information revolution and social media to communicate and conduct underground illicit activities such as online fraudulence, cyber predation, cyberbullying, hacking, blackmailing, and drug smuggling. To combat the increasing number of criminal activities, structure and content analysis of criminal communities can provide insight and facilitate cybercrime forensics. In this paper, we propose a framework to analyze chat logs for crime investigation using data mining and Natural Language Processing (NLP) techniques. The proposed framework extracts the social network from chat logs and summarizes conversation into topics. The crime investigator can use Information Visualizer to see the crime-related results. To test the validity of our proposed framework, we worked in a joint effort with the cybercrime unit of a Canadian law enforcement agency. Experimental outcomes on real-life data and feedback from the law enforcement officers suggest that the proposed chat log mining framework meets the need of law enforcement agencies and is very effective for crime investigation.
... Following the 2001 terrorist attacks, link analysis came to prominence in academic research to contribute possible technological solutions for uncovering terrorist networks and to enhance public safety and national security (Schroeder, Xu, Chen, & Chau, 2007; Xu & Chen, 2004; Xu & Chen, 2005a, 2005b; Xu, Hu, & Chen, 2009). The general idea behind the use of link analysis in these efforts is that if two or more criminals share a common node in their network of relationships, then chances are high that the shared node is also involved in illegal or terrorist activities. ...
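The shared-node idea in the excerpt above can be sketched in a few lines. This is only an illustration under stated assumptions: relationships are given as a dictionary from a person to their contacts, the set of known offenders is supplied by the caller, and a contact is flagged once at least `min_shared` known offenders link to it. The function name `shared_contacts`, the threshold, and the data are hypothetical, and a flagged node is a lead to examine, not a finding.

```python
from collections import Counter


def shared_contacts(contacts, known, min_shared=2):
    """Count, for each person outside `known`, how many known offenders list them."""
    counts = Counter()
    for person in known:
        for other in set(contacts.get(person, ())):
            if other not in known:
                counts[other] += 1
    return {node: n for node, n in counts.items() if n >= min_shared}


# Toy relationship data (invented for illustration only).
contacts = {"x": ["m", "n"], "y": ["m", "x"], "z": ["n", "m", "q"]}
print(shared_contacts(contacts, {"x", "y", "z"}))
```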
Article
The availability of data in massive collections in recent past not only has enabled data-driven decision-making, but also has created new questions that cannot be addressed effectively with the traditional statistical analysis methods. The traditional scientific research not only has prevented business scholars from working on emerging problems with big and rich data-sets, but also has resulted in irrelevant theory and questionable conclusions; mostly because the traditional method has mainly focused on modeling and analysis/explanation than on the real/practical problem and the data. We believe the lack of due attention to the analytics paradigm can to some extent be attributed to the business scholars' unfamiliarity with the analytics methods/methodologies and the type of questions it can answer. Therefore, our purpose in this paper is to illustrate how analytics, as a complement, rather than a successor, to the traditional research paradigm, can be used to address interesting emerging business research questions.
... Because the data on which we operate distinguish the characteristics of particular nodes and the characteristics of the links between them, it was necessary to apply a model that takes both types of parameters into account [2]. Given that, we perform the analysis using one of the known methods, the statistical one [8]. The strength of the relationship between users is determined by two different parameters, the number of comments and the percentage of commented posts. ...
... We here cite some of the related projects that involve association mining tasks. Among such projects is the Prep-Search [5] that answers only WHO and WHERE questions identifying suspects and their locations. The CrimeLink Explorer [6] identifies associations between people only, without revealing any associations of other entity types such as location or vehicle. They employed shortest path algorithm, co-occurrence analysis, and a heuristic approach over the structured crime incident data extracted from the Tucson Police Department (TPD) Records Management System. Some researchers [7], however, have employed modus operandi based similarity component of the crime to establish associations among the crime cases and chronic criminals on burglary and robbery dataset. ...
... By modeling their performance, a clear and objective rationale is provided for further development as a viable path for improving crime investigation. Schroeder et al. [38] proposed the CrimeLink Explorer system, which combined several techniques, including co-occurrence analysis, shortest-path algorithm and the heuristic approach, to provide a better link analysis to assist crime investigations. In terms of enhancing investigation effectiveness for Indian police, Gupta et al. [39] proposed a crime analysis tool with an interactive query-based interface to help the police in crime investigations. ...
Article
Full-text available
Crime continues to remain a severe threat to all communities and nations across the globe alongside the sophistication in technology and processes that are being exploited to enable highly complex criminal activities. Data mining, the process of uncovering hidden information from Big Data, is now an important tool for investigating, curbing and preventing crime and is exploited by both private and government institutions around the world. The primary aim of this paper is to provide a concise review of the data mining applications in crime. To this end, the paper reviews over 100 applications of data mining in crime, covering a substantial quantity of research to date, presented in chronological order with an overview table of many important data mining applications in the crime domain as a reference directory. The data mining techniques themselves are briefly introduced to the reader and these include entity extraction, clustering, association rule mining, decision trees, support vector machines, naive Bayes rule, neural networks and social network analysis amongst others. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
... As for the specific application domain, in the seminal work in [9], the authors propose the use of different data mining algorithms (for clustering, association rule mining, sequential pattern mining, deviation detection, classification and social network analysis) for analyzing data about criminal activities, in order to solve specific problems such as identification of possible money laundering, characterization of criminals' profiles, network intrusion detection, etc. Following this seminal work, most of the research has mainly focused on (social/criminal) network link analysis [21,20] and crime entity associations extraction from textual documents [9,8,24]. However, to the best of our knowledge, no work considers the problem of mining evolutions of criminal behaviors over time. ...
Article
Full-text available
One of the recently addressed research directions focuses on the issues raised by the diffusion of highly dynamic on-line information, particularly on the problem of mining topic evolutions from news. Among several applications, risk identification and analysis may exploit mining topic evolution from news in order to support law enforcement officers in risk and threat assessment. Assimilating the concept of topic to the concept of crime typology represented by a group of "similar" criminals, it is possible to apply topic evolution mining techniques to discover evolutions of criminal behaviors over time. At this aim, we incrementally analyze streams of publicly available news about criminals (e.g. daily police reports, public court records, legal instruments) in order to identify clusters of similar criminals and represent their evolution over time. Experimental results on both real world and synthetically generated datasets prove the effectiveness of the proposed approach.
... 6.2.5). For visualization, COPLINK resorts to a hyperbolic tree view and a hierarchical list view (Xiang et al. 2005; Schroeder et al. 2007; Xu and Chen 2004). Chen et al. (2004) used a concept-space approach, in order to extract criminal relations from the incident summaries and create a likely network of suspects. ...
Article
Full-text available
‘AI & Law’ research has been around since the 1970s, even though with shifting emphasis. This is an overview of the contributions of digital technologies, both artificial intelligence and non-AI smart tools, to both the legal professions and the police. For example, we briefly consider text mining and case-automated summarization, tools supporting argumentation, tools concerning sentencing based on the technique of case-based reasoning, the role of abductive reasoning, research into applying AI to legal evidence, tools for fighting crime and tools for identification.
... In this paper, we use only three roles that are generally available in most law enforcement databases: suspect, arrestee, and victim. Since it was found in a previous study (Schroeder et al., 2007) that the suspect and arrestee types are very similar, we combine them and use suspect/arrestee to represent both. ...
Article
Full-text available
Complex problems like drug crimes often involve a large number of variables interacting with each other. A complex problem may be solved by breaking it into parts (i.e., sub-problems), which can be tackled more easily. The identity matching problem, for example, is a part of the problem of drug and other types of crimes. It is often encountered during crime investigations when a single criminal is represented by multiple identity records in law enforcement databases. Because of the discrepancies among these records, a single criminal may appear to be different people. Following Enid Mumford's three-stage problem solving framework, we design a new method to address the problem of criminal identity matching for fighting drug-related crimes. Traditionally, the complexity of criminal identity matching was reduced by treating criminals as isolated individuals who maintain certain personal identities. In this research, we recognize the intrinsic complexity of the problem and treat criminals as interrelated rather than isolated individuals. In other words, we take the social relationships between criminals into consideration during the matching process. We study not only the personal identities but also the social identities of criminals. Evaluation results were quite encouraging and showed that combining social features with personal features could improve the performance of criminal identity matching. In particular, the social features become more useful when data contain many missing values for personal attributes.
Article
The ‘Online Voting System’ is a web based voting platform for conducting elections online. This system seeks to use face recognition algorithm for voter identity authentication to enhance the security of the electioneering process and ultimately providing an online platform which enables all eligible voters to exercise this activity from any location. The user must sign in/login using their respective credentials and they will be logged in into the system only after the face recognition authentication is successful. Thereafter, the voter can cast their vote securely and logout of the system. Hence, this project based on Online Voting System could be used for conducting secure and fair elections online. Keywords: online voting, face detection, face recognition, authentication, django, tensorflow.js, deepface
Research
Full-text available
The general public looks for reviews of a product before spending money on it. Users therefore read opinions on various websites, but they cannot tell genuine reviews from fake ones. On some websites, favourable reviews are posted by members of the company itself in order to create false product reviews, and the same people also give good reviews to the other products made by their company. The user is then unable to find out whether a review is authentic or fake. To detect fake reviews on such websites, the application "Fake Product Review Monitoring and Removal for Genuine Online Product Reviews Using IP Address Tracking" is introduced. The system detects fake reviews, made by posting false comments about a product, by identifying the IP address along with review posting patterns. The user logs in to the application with a user ID and password, views various products, and reviews them. To decide whether a review is fake or genuine, the system determines the user's IP address. If the system observes fake reviews sent from the same IP address, they are marked as fake and the admin discards them from the system. The system thus allows a user to find a correct assessment of a product.
Chapter
This chapter presents the central theme and a big picture of the methods and technologies covered in this book (see Fig. 2.2). For the readers to comprehend presented security and forensics issues, and associated solutions, the content is organized as components of a forensics analysis framework. The framework is employed to analyze online messages by integrating machine learning algorithms, natural language processing techniques, and social networking analysis techniques in order to help cybercrime investigation.
Article
This manuscript explains the concept of data mining and its application to cybercrime. Cybercrimes are becoming more serious by the day because organizations generate large data sets and internet users lack awareness. The application of data mining to cybercrime and a data mining framework for detecting financial fraud are explained. A comparative analysis of digital forensic tools and techniques is carried out, together with their benefits.
Chapter
Full-text available
Police and law enforcement agencies perform social media analysis to gain a better understanding of criminal social network structures and to identify potential criminal activities. The use of data mining techniques in social media analysis, however, faces issues and challenges such as linkage-based structural analysis, association extraction, community or group detection, behaviour and mood analysis, sentiment analysis and dynamic analysis of streaming networks. This chapter describes the extension of our developed framework and proposes an association model for extracting multilevel associations based on associative questioning. We also describe data mining techniques used to visualize these associations through a 2D crime cluster space. The developed framework provides a complete data analytic solution towards identifying and understanding associations between crime entities and thus expedites the crime matching process.
Conference Paper
Full-text available
The law enforcement analyst had a necessity for a new type of network analysis to understand the structure of the organization behind such brutal and well executed attacks so that experts could identify the involved key players. The network analysis which studies who is related to whom based on what relation called as Social Network Analysis (SNA) is applied here. In this paper we present graph theoretic and POSET oriented methodologies to efficiently fragment a clandestine network using a naïve Relationship Centrality CR. Our results evidently show that the predominant cut-set of actors recognized, isolate the covert network into multiple fragments with minimum number of actors. CR measure could precisely identify the top ordered crucial nodes of the attack by not only using the distance measure as other centralities but also considers various other attributes like relations, time order and importance of the event occurrence. This paper also compares the relational measure with other conventional distance based centralities.
Conference Paper
Potential cyber victim detection is an important research issue in the domain of network security. During an adjacent period of time, cyber victims or even potential cyber victims within an enterprise have several common patterns to the currently seized victims. Hence, this paper applies the link analysis method and proposes a hybrid method to automatically discover potential victims through their behavioral patterns hidden in the network log data. In the experiment, the proposed method has been applied to reveal potential victims from a big data (6,846,097 records of proxy logs in 1.7G and 84,693,445 records of firewall logs in 9.3G). Afterward, a ranking list of potential victims can consequently be generated for stakeholders to understand the safety condition within an enterprise. Moreover, the hierarchical connection graph of hosts can further assist managers or stakeholders to find out the potential victims more easily. As a result, the safety and prevention practice of the information security group in an enterprise would be upgraded to an active mode rather than passive mode.
Article
Full-text available
The law enforcement analyst had a necessity for a new type of network analysis to understand the structure of the organization behind such brutal and well executed attacks so that experts could identify the involved key players. The network analysis which studies who is related to whom based on what relation called as Social Network Analysis (SNA) is applied here. In this paper we present graph theoretic and POSET oriented methodologies to efficiently fragment a clandestine network using a naïve Relationship Centrality CR. Our results evidently show that the predominant cut-set of actors recognized, isolate the covert network into multiple fragments with minimum number of actors. CR measure could precisely identify the top ordered crucial nodes of the attack by not only using the distance measure as other centralities but also considers various other attributes like relations, time order and importance of the event occurrence. This paper also compares the relational measure with other conventional distance based centralities.
Article
Link charts are used in criminal investigations in order to facilitate the processing of large-scale investigation data. The relevant elements of the inquiry are represented in the form of diagrams describing the relationships between events and entities featuring in the investigation. Traditional uses of those graph-like techniques are: the representation of criminal networks, smuggling of goods, chronologies of events, as well as the visualisation of telephone records and financial data. In this context, visualisations are used for many objectives, such as analysing the traces and the information gathered, evaluating a cold-case, helping along the categorization of a particular offence, facilitating the transmission and receipt of a case or supporting an argument at trial. Common practice includes simple software tools that produce powerful and often elegant visualisations. However their use raises important difficulties. This research suggests that there are astonishing disparities in the use of these techniques. Reasoning and perception biases can be introduced, sometimes leading to wrong decisions with serious consequences. To highlight these difficulties, evaluations were conducted with practitioners and students. An empirical picture of the extent of changes in design and interpretation of representations has been established. The impact of this variety on decision making is also discussed. The nature and variety of concepts to represent, the absence of an emerging consensus on how to represent data, the diversity of visual solutions, the constraints imposed by tools and the absence of a clear formalization of the language, are all supposed causes of the observed difficulties. This observation reveals the need to consolidate the methods.
Chapter
This is a chapter about what link analysis and data mining can do for criminal investigation. It is a long and complex chapter, in which a variety of techniques and topics are accommodated. It is divided in two parts, one about methods, and the other one about real-case studies. We begin by discussing social networks and their visualisation, as well as what unites them with or distinguishes them from link analysis (which itself historically arose from the disciplinary context of ergonomics). Having considered applications of link analysis to criminal investigation, we turn to crime risk assessment, to geographic information systems for mapping crimes, to detection, and then to multiagent architectures and their application to policing. We then turn to the challenge of handling a disparate mass of data, and introduce the reader to data warehousing, XML, ontologies, legal ontologies, and financial fraud ontology. A section about automated summarisation and its application to law is followed by a discussion of text mining and its application to law, and by a section on support vector machines for information retrieval, text classification, and matching. A section follows, about stylometrics, determining authorship, handwriting identification and its automation, and questioned documents evidence. We next discuss classification, clustering, series analysis, and association in knowledge discovery from legal databases; then, inconsistent data; rule induction (including in law); using neural networks in the legal context; fuzzy logic; and genetic algorithms. Before turning to case studies of link analysis and data mining, we take a broad view of digital resources and uncovering perpetration: email mining, computer forensics, and intrusion detection. We consider the Enron email database; the discovery of social coalitions with the SIGHTS text mining system, and recursive data mining. 
We discuss digital forensics, digital steganography, and intrusion detection (the use of learning techniques, the detection of masquerading, and honeypots for trapping intruders). Case studies include, for example: investigating Internet auction fraud with NetProbe; graph mining for malware detection with Polonium; link analysis with Coplink; a project of the U.S. Federal Defense Financial Accounting Service; information extraction tools for integration with a link analysis tool; the Poznan ontology model for the link analysis of fuel fraud; and fiscal fraud detection with the Pisa SNIPER project.
Article
This paper investigates how text analysis and classification techniques can be used to enhance e-government, typically law enforcement agencies' efficiency and effectiveness by analyzing text reports automatically and provide timely supporting information to decision makers. With an increasing number of anonymous crime reports being filed and digitized, it is generally difficult for crime analysts to process and analyze crime reports efficiently. Complicating the problem is that the information has not been filtered or guided in a detective-led interview resulting in much irrelevant information. We are developing a decision support system (DSS), combining natural language processing (NLP) techniques, similarity measures, and machine learning, i.e., a Naïve Bayes' classifier, to support crime analysis and classify which crime reports discuss the same and different crime. We report on an algorithm essential to the DSS and its evaluations. Two studies with small and big datasets were conducted to compare the system with a human expert's performance. The first study includes 10 sets of crime reports discussing 2 to 5 crimes. The highest algorithm accuracy was found by using binary logistic regression (89%) while Naive Bayes' classifier was only slightly lower (87%). The expert achieved still better performance (96%) when given sufficient time. The second study includes two datasets with 40 and 60 crime reports discussing 16 different types of crimes for each dataset. The results show that our system achieved the highest classification accuracy (94.82%), while the crime analyst's classification accuracy (93.74%) is slightly lower.
Article
Surprisingly, evidence used to be a Cinderella until the late 1990s in ‘AI & Law’ artificial intelligence for law, itself a burgeoning domain already in the 1980s. It was not until the 2000s that models of reasoning about legal evidence started to feature prominently. In so doing, it became vulnerable to the controversy about ‘probabilities in law’ among legal theorists; hence, the importance of developing models of plausibility i.e. ranking alternative accounts as opposed to strong commitment to probabilistic models of determination of guilt. Probabilistic models however potentially have a role in helping a prosecutor decide whether to prosecute. Moreover, they are quite useful in models supporting police investigation or helping with police intelligence. This is especially the case of data mining, a class of techniques which has been applied to legal databases as well as to law enforcement. Success was achieved especially in unravelling networks and in detecting fraud. We survey these classes of tools.
Conference Paper
Cyber criminals exploit opportunities for anonymity and masquerade in web-based communication to conduct illegal activities such as phishing, spamming, cyber predation, cyber threatening, blackmail, and drug trafficking. One way to fight cyber crime is to collect digital evidence from online documents and to prosecute cyber criminals in the court of law. In this paper, we propose a unified framework using data mining and natural language processing techniques to analyze online messages for the purpose of crime investigation. Our framework takes the chat log from a confiscated computer as input, extracts the social networks from the log, summarizes chat conversations into topics, identifies the information relevant to crime investigation, and visualizes the knowledge for an investigator. To ensure that the implemented framework meets the needs of law enforcement officers in real-life investigation, we closely collaborate with the cyber crime unit of a law enforcement agency in Canada. Both the feedback from the law enforcement officers and experimental results suggest that the proposed chat log mining framework is effective for crime investigation.
Article
Artificial Intelligence AI and Law has been a burgeoning domain since the 1980s, but it was not until the 2000s that models of reasoning about legal evidence started to feature prominently. With regard to data mining, it has been applied to legal databases, as well as to law enforcement. We survey forensic applications of data mining to fraud and otherwise to crime intelligence or investigation. Traditionally a field separate from AI and Law, the two are now coming together. Success has been achieved especially in unravelling networks and in detecting fraud.
Article
An efficient term mining method to build a general term network is presented. The resulting term network can be used for entity relation visualization and exploration, which is useful in many text-mining applications such as crime exploration and investigation from vast piles of crime news or official criminal records. In the proposed method, terms from each document in a text collection are first identified. They are subjected to an analysis for pairwise association weights. The weights are then accumulated over all the documents to obtain final similarity for each term pair. Based on the resulting term similarity, a general term network for the collection is built with terms as nodes and non-zero similarities as links. In application, a list of predefined terms having similar attributes was selected to extract the desired sub-network from the general term network for entity relation visualization. This text analysis scenario based on the collective terms of the similar type or from the same topic enables evidence-based relation exploration. Some practical instances of crime exploration and investigation are demonstrated. Our application examples show that term relations, be it causality, subordination, coupling, or others, can be effectively revealed by our method and easily verified by the underlying text collection. This work contributes by presenting an integrated term-relationship mining and exploration approach and demonstrating the feasibility of the term network to the increasingly important application of crime exploration and investigation.
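The accumulation step described in this abstract (pairwise weights per document, summed over the collection, then a sub-network selected by a predefined term list) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the weighting formula, so a weight of 1.0 per document in which two terms co-occur is assumed, and the function names and whitespace tokenisation are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(documents, vocabulary):
    """Accumulate a per-document pair weight over the whole collection."""
    similarity = defaultdict(float)
    for text in documents:
        tokens = set(text.lower().split())
        present = sorted(term for term in vocabulary if term in tokens)
        for a, b in combinations(present, 2):
            # Assumed weight: 1.0 per document in which both terms occur.
            similarity[(a, b)] += 1.0
    # Only pairs with non-zero similarity are stored, i.e. the links.
    return dict(similarity)


def subnetwork(network, selected_terms):
    """Keep only the links whose two endpoints are both selected terms."""
    selected = set(selected_terms)
    return {pair: weight for pair, weight in network.items()
            if pair[0] in selected and pair[1] in selected}


# Hypothetical mini-collection of crime news snippets.
documents = [
    "robbery suspect fled in stolen car",
    "suspect arrested for robbery",
    "stolen car recovered",
]
vocabulary = {"robbery", "suspect", "car", "stolen"}
network = build_term_network(documents, vocabulary)
people_and_crimes = subnetwork(network, ["robbery", "suspect"])
```

Because each link weight is a sum over documents, any link in the resulting network can be traced back to the documents that contributed to it, which matches the abstract's point that relations are easily verified against the underlying text collection.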
Article
Full-text available
Link charts are used in criminal investigations in order to facilitate the processing of large-scale investigation data. The relevant elements of the inquiry are represented in the form of diagrams describing the relationships between events and entities featuring in the investigation. Traditional uses of those graph-like techniques are: the representation of criminal networks, smuggling of goods, chronologies of events, as well as the visualisation of telephone records and financial data. In this context, visualisations are used for many objectives, such as analysing the traces and the information gathered, evaluating a cold-case, helping along the categorization of a particular offence, facilitating the transmission and receipt of a case or supporting an argument at trial. Common practice includes simple software tools that produce powerful and often elegant visualisations. However their use raises important difficulties. This research suggests that there are astonishing disparities in the use of these techniques. Reasoning and perception biases can be introduced, sometimes leading to wrong decisions with serious consequences. To highlight these difficulties, evaluations were conducted with practitioners and students. An empirical picture of the extent of changes in design and interpretation of representations has been established. The impact of this variety on decision making is also discussed. The nature and variety of concepts to represent, the absence of an emerging consensus on how to represent data, the diversity of visual solutions, the constraints imposed by tools and the absence of a clear formalization of the language, are all supposed causes of the observed difficulties. This observation reveals the need to consolidate the methods.
Article
Visual displays of information (such as dashboards) have become very sophisticated in rendering world semantics but neglect display semantics, leading to what is commonly called information overload. Underlying storage and retrieval research has been utilizing semantic and cognitive theory to drive the current implementations of ontology markup using the resource description framework (RDF) and Web ontology language (OWL) for over a decade. Yet despite these semantically rich underlying description logics, and despite the very large and mature stream of cognitive and neuroscience theory literature on visual perception and attention, memory, and linguistics, this is one aspect within the area of information visualization and human factors research where empirically tested semantic theory has not yet caught up, and begs for theory-driven research into display semantics using what might be termed “graphical linguistics.” We conducted an experiment to assess the cognitive effort of interpreting domain general knowledge using the same information represented in three forms, and found that graphical linguistics reduce cognitive effort for a specific type of task involving high-density time-sensitive information typically found in situation control rooms.
Conference Paper
In knowledge-work, there are increasing amounts of complex information rendered by information technology, which has led to the common term, information overload. Information visualization is one area where empirically tested semantic theory has not yet caught up with that of the underlying information storage and retrieval theory, contributing to information overload. In spite of a vast body of cognitive theory, much of the human factors research on information visualization has overlooked it. Specifically, information displays have facilitated the data gathering (ontological) aspects of human problem-solving and decision-making, but have exacerbated the meaning-making (epistemological) aspects of those activities by presenting information in linear rather than in graphical (holistic) forms. Drawing from extant empirical research, we present a thesis suggesting that cognitive load may be reduced when holistic information is imbued with transformational grammar to help alleviate the information overload problem, along with a methodological approach for investigation.
Conference Paper
Fuel laundering scams are prevalent in many countries. Typically, a case may concern 100 companies, several hundred people, and up to 100 thousand money transfers/invoices. Analysis of this amount of data is difficult even if it is stored in a database. To gain insight into the mechanism of the case, we use in this work an extension of a previously proposed ontology model, called the minimal model. The conceptual minimal model consists of eight layers of concepts that are structured in order to use available data on facts to uncover relations. FuelFlowVis is an intelligent tool that supports a continuous visual analytic process by exploiting the following features: navigation between global and local views, and filters allowing transactions to be displayed by value, time, and type of goods. A user can inspect selected flows, which give insight into crime patterns. We used the tool for 3 large Polish fuel laundering cases from the 2001–2003 period. For none of the cases do we have complete data. We find that the methods used to hide the proceeds of crime are very similar between the cases. The evidence as presented by prosecutors is of varied quality, and depends on the size of the crime group. In all the cases, prosecutors had an enormous problem uncovering money flows from the source of money (profit centre) to sinks (where the money leaves companies and goes as cash to organizers of the scheme). This occurs because traditional analytic tools (spreadsheets or non-semantic visualization tools) cannot provide information about chains of transactions; a separate binary relations view does not provide complete insight into the case. Prospects for future reasoning capabilities of the tool will be presented.
Article
Strong ties play a crucial role in transmitting sensitive information in social networks, especially in the criminal justice domain. However, large social networks containing many entities and relations may also contain a large amount of noisy data. Thus, identifying strong ties accurately and efficiently within such a network poses a major challenge. This paper presents a novel approach to address the noise problem. We transform the original social network graph into a relation context-oriented edge-dual graph by adding new nodes to the original graph based on abstracting the relation contexts from the original edges (relations). Then we compute the local k-connectivity between two given nodes. This produces a measure of the robustness of the relations. To evaluate the correctness and the efficiency of this measure, we conducted an implementation of a system which integrated a total of 450GB of data from several different data sources. The discovered social network contains 4,906,460 nodes (individuals) and 211,403,212 edges. Our experiments are based on 700 co-offenders involved in robbery crimes. The experimental results show that most strong ties are formed with k⩾2.
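The local k-connectivity measure named in this abstract can be illustrated with a short sketch that counts edge-disjoint paths between two given nodes using a unit-capacity maximum flow. This is an assumption about one reasonable reading of "local k-connectivity"; the abstract does not say whether edge or vertex connectivity is used, and the relation context-oriented edge-dual transformation that precedes it in the paper is not reproduced here. The function name and toy graph are hypothetical.

```python
from collections import defaultdict, deque


def local_edge_connectivity(edges, source, target):
    """Count edge-disjoint paths between two nodes of an undirected graph."""
    if source == target:
        raise ValueError("source and target must differ")
    capacity = defaultdict(int)
    neighbours = defaultdict(set)
    for a, b in edges:
        # An undirected edge gives one unit of capacity in each direction.
        capacity[(a, b)] += 1
        capacity[(b, a)] += 1
        neighbours[a].add(b)
        neighbours[b].add(a)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and target not in parent:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in parent and capacity[(node, nxt)] > 0:
                    parent[nxt] = node
                    queue.append(nxt)
        if target not in parent:
            return flow  # no further augmenting path: flow equals k
        # Push one unit of flow back along the path that was found.
        node = target
        while parent[node] is not None:
            prev = parent[node]
            capacity[(prev, node)] -= 1
            capacity[(node, prev)] += 1
            node = prev
        flow += 1


# Hypothetical co-offender graph: a and d are joined by two independent routes.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("d", "e")]
k_ad = local_edge_connectivity(edges, "a", "d")
k_ae = local_edge_connectivity(edges, "a", "e")
```

Under this reading, a pair such as a and d with k = 2 remains connected if any single relation is removed as noise, whereas a and e with k = 1 depend on one edge, which is consistent with the reported observation that most strong ties have k⩾2.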
Article
Full-text available
The volume of crime data is increasing along with the incidence and complexity of crimes. Data mining is a powerful tool that criminal investigators who may lack extensive training as data analysts can use to explore large databases quickly and efficiently. The collaborative Coplink project between University of Arizona researchers and the Tucson and Phoenix police departments correlates data mining techniques applied in criminal and intelligence analysis with eight crime types. The framework has general applicability to crime and intelligence analysis because it encompasses all major crime types as well as both traditional and new intelligence-specific data mining techniques. Three case studies demonstrate the framework's effectiveness.
Article
Analysis of the job shop scheduling domain has indicated that the crux of the scheduling problem is the determination and satisfaction of a large variety of constraints. Schedules are influenced by such diverse and conflicting factors as due date requirements, cost restrictions, production levels, machine capabilities and substitutability, alternative production processes, order characteristics, resource requirements, and resource availability. This paper describes ISIS, a scheduling system capable of incorporating all relevant constraints in the construction of job shop schedules. We examine both the representation of constraints within ISIS, and the manner in which these constraints are used in conducting a constraint-directed search for an acceptable schedule. The important issues relating to the relaxation of constraints are addressed. Finally, the interactive scheduling facilities provided by ISIS are considered.
Chapter
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
Conference Paper
In this paper, we present Stepping Stones and Pathways (SSP), an alternative model of building and presenting answers for cases in which queries on document collections cannot be answered by a ranked list alone. Stepping Stones can handle questions like: "What is the relation of topics X and Y?" SSP addresses cases in which the content of a small set of related documents is needed as an answer rather than a single document, or in which "query splitting" is required to satisfactorily explore a document space. Query results are networks of document groups representing topics, each group relating to and connecting (by documents) to other groups in the network. Thus, a network answers the user's information need. We devise new and more effective representations and techniques to visualize such answers, and to involve users as part of the answer-finding process. To verify the validity of our approach, and since the questions we aim to answer involve multiple topics, we performed a study involving a custom-built broad collection of operating systems research papers, and evaluated the results with interested computer science students, using multiple measures.
Article
Previous research has shown that researchers can generate medical hypotheses by using computers to analyze several, seemingly unrelated, medical literatures. In this work we suggest broader application for the ideas of literature-based discovery. Specifically, we suggest that literature-based discovery can be fruitful in areas other than medicine; that in addition to finding “cures” for “problems,” literature-based discovery offers the possibility of finding new problems for existing technologies; that the analysis of a single literature may be sufficient for literature-based discovery; and that literature-based discovery can support individuals seeking to draw together ideas from various areas of inquiry, even if such connections have been previously made by others. We describe literature-based discovery experiments conducted on the World Wide Web that support these ideas.
Article
Valuable criminal-justice data in free texts such as police narrative reports are currently difficult for intelligence investigators to access and use in crime analyses. It would be desirable to automatically identify meaningful entities from text reports, such as person names, addresses, narcotic drugs, or vehicle names, to facilitate crime investigation. In this paper, we report our work on a neural network-based entity extractor, which applies named-entity extraction techniques to identify useful entities from police narrative reports. Preliminary evaluation results demonstrated that our approach is feasible and has some potential value for real-life applications. Our system achieved encouraging precision and recall rates for person names and narcotic drugs, but did not perform well for addresses and personal properties. Our future work includes conducting larger-scale evaluation studies and enhancing the system to capture human knowledge interactively.
Article
This paper examines three such challenges: 1) statistical dependence caused by linked instances; 2) bias introduced by sampling density; and 3) multiple comparisons intensified by feature combinatorics. In general, current systems ... See the web pages of the 1998 AAAI Fall Symposium on AI and Link Analysis for more information on link analysis and the relevance of various AI techniques: <http://eksl-www.cs.umass.edu/aila/>. For an additional discussion of the use of link analysis to detect money laundering, see Jensen (1997) and Goldberg and Senator (1995).
Article
We report experiments that use lexical statistics, such as word frequency counts, to discover hidden connections in the medical literature. Hidden connections are those that are unlikely to be found by examination of bibliographic citations or the use of standard indexing methods and yet establish a relationship between topics that might profitably be explored by scientific research. Our experiments were conducted with the MEDLINE medical literature database and follow and extend the work of Swanson.
Article
An evaluation of a large, operational full-text document-retrieval system (containing roughly 350,000 pages of text) shows the system to be retrieving less than 20 percent of the documents relevant to a particular search. The findings are discussed in terms of the theory and practice of full-text document retrieval.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
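The mutually reinforcing relationship between hubs and authorities described above can be sketched as a short power iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hits`, the dictionary-based link graph, the fixed iteration count, and the sample pages are all hypothetical.

```python
def hits(links, iterations=50):
    """Iteratively score pages as hubs and authorities.

    links: dict mapping each page to the list of pages it points to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalise both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# "h1" links to both candidate authorities, "h2" only to "a".
web = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub_scores, auth_scores = hits(web)
```

Repeating the two update steps drives the score vectors toward the principal eigenvectors of matrices derived from the link graph, which is the connection to eigenvectors that the abstract mentions.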
Article
Introduction to Graphs and Networks; Computer Representation and Solution; Tree Algorithms; Shortest-Path Algorithms; Minimum-Cost Flow Algorithms; Matching and Assignment Algorithms; The Postman and Related Arc Routing Problems; The Traveling Salesman and Related Vertex Routing Problems; Location Problems; Project Networks; NETSOLVE User's Manual
Article
Link analysis procedures were developed and evaluated to aid law-enforcement agencies integrate collected information and develop hypotheses leading to the prevention and control of organized crime. The procedures were designed to portray the relationships among suspected criminals, to determine the structure of criminal organizations, and to identify the nature of suspected criminal activities. An experiment was conducted in which 29 teams of law enforcement intelligence analysts completed link analyses from information contained in identical data bases. The results compared favorably with criterion solutions prepared earlier. Subsequent field applications of link analysis by trained law enforcement officers confirmed the utility and potential value of these procedures.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
Article
Due to the nature and costs of data collection, many real-world databases consist of large numbers of independent transactions. Finding evidence of structured groups of entities reflected in this data is a task aptly suited to Link Analysis. However, the databases usually must be restructured to allow effective search and analysis of the linkage structures hidden in the original transactions. The FinCEN AI System (FAIS) (Senator 1995) is an example of such an application. We briefly discuss the process of database restructuring and show how it is used to support the discovery and analysis of evidence of money laundering in a database of cash transactions.
Article
GTE's Compass (Central Office Maintenance Printout Analysis and Suggestion System) is an expert system that aids in the maintenance of a telephone switching system. It analyzes maintenance printouts of telephone company central office switching equipment and suggests maintenance actions to be performed. The results from several field trials indicate that the current version of Compass performs well in identifying system faults and in suggesting corresponding maintenance actions. An expanded version of the expert system is under development while the present version is being put into field use.
Article
The focus of this research is to demonstrate how probabilistic models may be used to provide early warnings for bank failures. While prior research in the auditing literature has recognized the applicability of a Bayesian belief revision framework for many audit tasks, empirical evidence has suggested that auditors' cognitive decision processes often violate probability axioms. We believe that some of the well-documented cognitive limitations of a human auditor can be compensated for by an automated system. In particular, we demonstrate that a formal belief revision scheme can be incorporated into an automated system to provide reliable probability estimates for early warning of bank failures. The automated system examines financial ratios as predictors of a bank's performance and assesses the posterior probability of a bank's financial health (alternatively, financial distress). We examine two different probabilistic models: one that is simpler and makes more assumptions, and another that is somewhat more complex but makes fewer assumptions. We find that both models are able to make accurate predictions with the help of historical data to estimate the required probabilities. In particular, the more complex model is found to be very well calibrated in its probability estimates. We posit that such a model can serve as a useful decision aid to an auditor's judgment process.
Article
Four new shortest-path algorithms, two sequential and two parallel, for the source-to-sink shortest-path problem are presented and empirically compared with five algorithms previously discussed in the literature. The new algorithm, S22, combines the highly effective data structure of the S2 algorithm of Dial et al., with the idea of simultaneously building shortest-path trees from both source and sink nodes, and was found to be the fastest sequential shortest-path algorithm. The new parallel algorithm, PS22, is based on S22 and is the best of the parallel algorithms. We also present results for three new S22-type shortest-path heuristics. These heuristics find very good (often optimal) paths much faster than the best shortest-path algorithm.
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
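ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that selection step, as a hedged illustration rather than the system described in the paper; the function names, the list-of-dicts data layout, and the toy attributes `outlook` and `windy` are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Attribute an ID3-style learner would split on first."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Toy data: the label is fully determined by "outlook" and unrelated to "windy".
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
labels = ["no", "no", "yes", "yes"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

A full tree builder would apply `best_attribute` recursively to each partition until the labels in a partition are uniform or no attributes remain.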
Article
We consider a graph with n vertices, all pairs of which are connected by an edge; each edge is of given positive length. The following two basic problems are solved. Problem 1: construct the tree of minimal total length between the n vertices. (A tree is a graph with one and only one path between any two vertices.) Problem 2: find the path of minimal total length between two given vertices.
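Problem 2 above, finding the path of minimal total length between two given vertices, can be sketched as follows. This is a modern restatement rather than the paper's original formulation: it uses a binary heap from Python's `heapq` module as the priority queue, and the function name `shortest_path`, the nested-dictionary graph layout, and the sample graph are assumptions made for the example.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (length, path) of a shortest path, or (inf, []) if none exists.

    graph: dict mapping node -> dict of neighbour -> positive edge length.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry for an already finalised node
        done.add(u)
        if u == target:
            break
        for v, length in graph.get(u, {}).items():
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# The direct edge a-c (length 4) is longer than the detour through b (1 + 2).
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2},
    "c": {"a": 4, "b": 2},
}
length, path = shortest_path(graph, "a", "c")
```

The requirement that every edge length be positive, stated in the abstract, is what allows a node to be finalised the first time it is removed from the queue.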
Article
Prediction of firm bankruptcies has been extensively studied in accounting, as all stakeholders in a firm have a vested interest in monitoring its financial performance. This paper presents an exploratory study which compares the predictive capabilities for firm bankruptcy of neural networks and classical multivariate discriminant analysis. The predictive accuracy of the two techniques is presented within a comprehensive, statistically sound framework, indicating the value added to the forecasting problem by each technique. The study indicates that neural networks perform significantly better than discriminant analysis at predicting firm bankruptcies. Implications of our results for the accounting professional, neural networks researcher and decision support system builders are highlighted.
Article
Effective and efficient link analysis techniques are needed to help law enforcement and intelligence agencies fight organized crimes such as narcotics violation, terrorism, and kidnapping. In this paper, we propose a link analysis technique that uses shortest-path algorithms, priority-first-search (PFS) and two-tree PFS, to identify the strongest association paths between entities in a criminal network. To evaluate effectiveness, we compared the PFS algorithms with crime investigators' typical association-search approach, as represented by a modified breadth-first-search (BFS). Our domain expert considered the association paths identified by PFS algorithms to be useful about 70% of the time, whereas the modified BFS algorithm's precision rates were only 30% for a kidnapping network and 16.7% for a narcotics network. Efficiency of the two-tree PFS was better for a small, dense kidnapping network, and the PFS was better for the large, sparse narcotics network.
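One way to find a strongest association path with a shortest-path algorithm is sketched below. It rests on assumptions that the abstract does not state: each link carries an association weight in (0, 1], the strength of a path is the product of its weights, and the weights are converted to additive costs with a negative logarithm so that a priority-first search can be applied. The function name `strongest_path` and the sample network are hypothetical.

```python
import heapq
from math import exp, log

def strongest_path(network, source, target):
    """Return (strength, path) maximising the product of link weights.

    network: dict mapping entity -> dict of neighbour -> weight in (0, 1].
    """
    cost = {source: 0.0}  # cost = -log(product of weights along the path)
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        c, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, weight in network.get(u, {}).items():
            nc = c - log(weight)  # a stronger link adds a smaller cost
            if nc < cost.get(v, float("inf")):
                cost[v] = nc
                prev[v] = u
                heapq.heappush(heap, (nc, v))
    if target not in cost:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return exp(-cost[target]), path[::-1]

# The two-step route through "B" (0.9 * 0.8) is stronger than the direct link (0.5).
network = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.5, "B": 0.8},
}
strength, route = strongest_path(network, "A", "C")
```

Because the logarithm turns products into sums, minimising the summed cost is equivalent to maximising the product of weights, so the strongest path may pass through intermediate entities even when a weak direct link exists.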
Article
In crime analysis, law enforcement officials have to process a large amount of criminal data and figure out their relationships. It is important to identify different associations among criminal entities. In this paper, we propose the use of a hyperbolic tree view and a hierarchical list view to visualize criminal relationships. A prototype system called COPLINK Criminal Relationship Visualizer was developed. An experiment was conducted to test the effectiveness and the efficiency of the two views. The results show that the hyperbolic tree view is more effective for an “identify” task and more efficient for an “associate” task. The participants generally thought it was easier to use the hierarchical list, with which they were more familiar. When asked about the usefulness of the two views, about half of the participants thought that the hyperbolic tree was more useful, while the other half thought otherwise. Our results indicate that both views can help in criminal relationship visualization. While the hyperbolic tree view performs better in some tasks, the users' experiences and preferences will impact the decision on choosing the visualization technique.
Article
Associating records in a large database that are related but not exact matches has importance in a variety of applications. In law enforcement, this task enables crime analysts to associate incidents possibly resulting from the same individual or group of individuals. In practice, most crime analysts perform this task manually by searching through incident reports looking for similarities. This paper describes automated approaches to data association. We report tests showing that our data association methods significantly reduced the time required by manual methods with accuracy comparable to experienced crime analysts. In comparison to analysis using the structured query language (SQL), our methods were both faster and more accurate.
Article
This paper explores the opportunities for the application of network analytic techniques to the problems of criminal intelligence analysis, paying particular attention to the identification of vulnerabilities in different types of criminal organization, from terrorist groups to narcotics supply networks. A variety of concepts from the network analysis literature are considered in terms of the promise they hold for helping law enforcement agencies extract useful information from existing collections of link data. For example, six different notions of “centrality” and the three major notions of “equivalence” are examined for their relevance in revealing the mechanics and vulnerabilities of criminal enterprises.
Conference Paper
Associating criminal incidents committed by the same person is important in crime analysis. In this paper, we introduce concepts from OLAP (online-analytical processing) and data-mining to resolve this issue. The criminal incidents are modeled into an OLAP data cube; a measurement function, called the outlier score function is defined on the cube cells. When the score is significant enough, we say that the incidents contained in the cell are associated with each other. The method can be used with a variety of criminal incident features to include the locations of the crimes for spatial analysis. We applied this association method to the robbery dataset of Richmond, Virginia. Results show that this method can effectively solve the problem of criminal incident association.
Article
An unintended consequence of specialization in science is poor communication across specialties. Information developed in one area of research may be of value in another without anyone becoming aware of the fact. We describe and evaluate interactive software and database search strategies that facilitate the discovery of previously unknown cross-specialty information of scientific interest. The user begins by searching MEDLINE for article titles that identify a problem or topic of interest. From downloaded titles the software constructs input for additional database searches and produces a series of heuristic aids that help the user select a second set of articles complementary to the first set and from a different area of research. The two sets are complementary if together they can reveal new useful information that cannot be inferred from either set alone. The software output further helps the user identify the new information and derive from it a novel testable hypothesis. We report several successful tests and applications of the system.
Article
A bilingual concept space approach using the Hopfield network to relieve the vocabulary problem in national security information sharing was presented. It was found that the concept allows the user to interactively refine a search by selecting concepts, which are automatically generated and presented to the user. The research output consisted of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based crosslingual information management and retrieval. The results show that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
Article
In this article, we present a semi-supervised active learning algorithm for pattern discovery in information extraction from textual data. The patterns are reduced regular expressions composed of various characteristics of features useful in information extraction. Our major contribution is a semi-supervised learning algorithm that extracts information from a set of examples labeled as relevant or irrelevant to a given attribute. The approach is semi-supervised because it does not require precise labeling of the exact location of features in the training data. This significantly reduces the effort needed to develop a training set. An active learning algorithm is used to assist the semi-supervised learning algorithm in order to further reduce training set development effort. The active learning algorithm is seeded with a single positive example of a given attribute. The context of the seed is used to automatically identify candidates for additional positive examples of the given attribute. Candidate examples are manually pruned during the active learning phase, and our semi-supervised learning algorithm automatically discovers reduced regular expressions for each attribute. We have successfully applied this learning technique in the extraction of textual features from police incident reports, university crime reports, and patents. The performance of our algorithm compares favorably with competitive extraction systems being used in criminal justice information systems.
Article
Highly portable information collection and transmission technologies such as radio frequency identification (RFID) tags and smart cards are becoming ubiquitous in government and business—employed in functions including homeland security, information security, physical premises security, and even the control of goods in commerce. And, directly or indirectly, in many of these applications, it is individuals and their activities that are tracked. Yet, a significant unknown is (a) whether the public understands these technologies and the manner in which personally identifiable information may be collected, maintained, used, and disseminated; and (b) whether the public consents to these information practices. To answer these and related questions, we surveyed a select group of citizens on the uses of this technology for business as well as homeland security purposes. We found a significant lack of understanding, a significant level of distrust even in the context of homeland security applications, and a very significant consensus for governmental regulation. We conclude that a primary objective for any organization deploying these technologies is the promulgation of a comprehensive Technology Privacy Policy, and we provide detailed specifications for such an effort. © 2005 Wiley Periodicals, Inc.
Article
Scientific literature is often fragmented, which implies that certain scientific questions can only be answered by combining information from various articles. In this paper, a new algorithm is proposed for finding associations between related concepts present in literature. To this end, concepts are mapped to a multidimensional space by a Hebbian type of learning algorithm using co-occurrence data as input. The resulting concept space allows exploration of the neighborhood of a concept and finding potentially novel relationships between concepts. The obtained information retrieval system is useful for finding literature supporting hypotheses and for discovering previously unknown relationships between concepts. Tests on artificial data show the potential of the proposed methodology. In addition, preliminary tests on a set of Medline abstracts yield promising results.
Article
A search tactic is a set of search moves that are temporally and semantically related. The current study examined the tactics of medical students searching a factual database in microbiology. The students answered problems and searched the database on three occasions over a 9-month period. Their search moves were analyzed in terms of the changes in search terms used from one cycle to the next, using two different analysis methods. Common patterns were found in the students' search tactics; the most common approach was the specification of a concept, followed by the addition of one or more concepts, gradually narrowing the retrieved set before it was displayed. It was also found that the search tactics changed over time as the students' domain knowledge changed. These results have important implications for designers in developing systems that will support users' preferred ways of formulating searches. In addition, the research methods used (the coding scheme and the two data analysis methods, zero-order state transition matrices and maximal repeating patterns (MRP) analysis) are discussed in terms of their validity in future studies of search tactics.
Conference Paper
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from 3 years ago. This paper provides an in-depth description of our large-scale web search engine - the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want.
Article
We report experiments that use lexical statistics, such as word frequency counts, to discover hidden connections in the medical literature. Hidden connections are those that are unlikely to be found by examination of bibliographic citations or the use of standard indexing methods and yet establish a relationship between topics that might profitably be explored by scientific research. Our experiments were conducted with the MEDLINE medical literature database and follow and extend the work of Swanson. Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/34257/1/3_ftp.pdf
Article
Divide and conquer—the strategy that science uses to cope with the mountains of printed matter it produces—appears on the surface to serve us well. Science organizes itself into manageable units—scientific specialties—and so its literature is created and assimilated in manageable chunks or units. But a few clouds on the horizon ought not to go unexamined. First, most of the units are no doubt logically related to other units. Second, there are far more combinations of units, therefore far more potential relationships among the units, than there are units. Third, the system is not organized to cope with combinations. I suggest that important relationships might be escaping our notice. Individual units of literature are created to some degree independently of one another, and, insofar as that is so, the logical connections among the units, though inevitable, may be unintended by and even unknown to their creators. Until those fragments, like scattered pieces of a puzzle, are brought together, the relationships among them may remain undiscovered—even though the isolated pieces might long have been public knowledge. My purpose in this essay is to show, by means of an example, how this might happen. I shall identify two units of literature that are logically connected but noninteractive; neither seems to acknowledge the other to any substantial degree. Yet the logical connections, once apparent, lead to a potentially useful and possibly new hypothesis.
Article
Two East-bloc computing knowledge bases, both based on a semantic network structure, were created automatically from large, operational textual databases using two statistical algorithms. The knowledge bases were evaluated in detail in a concept-association experiment based on recall and recognition tests. In the experiment, one of the knowledge bases, which exhibited the asymmetric link property, outperformed four experts in recalling relevant concepts in East-bloc computing. The knowledge base, which contained 20,000 concepts (nodes) and 280,000 weighted relationships (links), was incorporated as a thesaurus-like component in an intelligent retrieval system. The system allowed users to perform semantics-based information management and information retrieval via interactive, conceptual relevance feedback.
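To make the idea of asymmetric weighted links concrete, the sketch below derives directed co-occurrence weights from a small set of documents. It is a deliberately simplified assumption, not the statistical algorithms used in the paper: the weight from concept a to concept b is taken to be the fraction of documents containing a that also contain b. The function name `cooccurrence_weights` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(documents):
    """Return {(a, b): weight} where weight = P(b appears | a appears).

    documents: iterable of concept lists, one list per document.
    """
    doc_freq = Counter()   # number of documents containing each concept
    pair_freq = Counter()  # number of documents containing both concepts
    for doc in documents:
        concepts = set(doc)
        doc_freq.update(concepts)
        pair_freq.update(permutations(concepts, 2))  # both (a, b) and (b, a)
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}

# "z" always appears with "y", but "y" appears with "z" in only one of three documents.
documents = [["x", "y"], ["x"], ["x", "y"], ["y", "z"]]
weights = cooccurrence_weights(documents)
```

Here the link from "z" to "y" has weight 1.0 while the reverse link has weight 1/3, which is the kind of asymmetry a thesaurus-like network can exploit when suggesting related concepts.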
Article
The paper discusses the Coplink system. The system applies a concept space, a statistics-based, algorithmic technique that identifies relationships between suspects, victims, and other pertinent data, to accelerate criminal investigations and enhance law enforcement efforts. The Coplink concept space application, which began as a research project, has evolved into a real-time system being used in everyday police work. Coplink CS has been successfully deployed at the Tucson Police Department, where crime analysts, officers, detectives, and sergeants from 16 departmental units use the technology voluntarily as part of their daily investigative routine.
Article
Databases often inaccurately identify entities of interest. Two operations, consolidation and link formation, which complement the usual machine learning techniques that use similarity-based clustering to discover classifications, are proposed as essential components of KDD systems for certain applications. Consolidation relates identifiers present in a database to a set of real world entities (RWE's) which are not uniquely identified in the database. Consolidation may also be viewed as a transformation of representation from the identifiers present in the original database to the RWE's. Link formation constructs structured relationships between consolidated RWE's through identifiers and events explicitly represented in the database. An operational knowledge discovery system which identifies potential money laundering in a database of large cash transactions implements consolidation and link formation. Consolidation and link formation are easily implemented as index creation in relatio...
Article
We have developed a number of systems to aid analysts in tracking the activities and associations of various criminal elements. These systems each consist of a relational database containing the details on the relevant entities, activities, and associations; assorted tools for retrieving and analyzing those details, including link analysis; and a tool for automatically extracting those details from documents.
Introduction
On several projects, Sterling Software has been leading the development of software systems for the intelligence analyst. These systems have been designed to assist the analyst in discovering and keeping track of individuals and organizations involved in various activities such as drug trafficking, terrorism, insurgency, weapons proliferation, etc. They also provide the ability to tie these individuals and organizations to the facilities, vehicles, etc., that they use, and to the events and activities they are involved in. Each system is a collection of tools inte...
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. Categories and S...
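The mutual reinforcement between hubs and authorities described above can be sketched as a simple power iteration. This is a minimal illustration rather than the paper's full procedure: it assumes the link graph is already given as a dictionary, skips the construction of a topic-focused subgraph, and runs a fixed number of iterations. The function name `hub_authority_scores` is hypothetical.

```python
def hub_authority_scores(links, iterations=50):
    """Compute hub and authority scores by repeated mutual updates.

    links: dict mapping a page to the list of pages it links to.
    A page's authority score is the sum of the hub scores of pages linking
    to it; its hub score is the sum of the authority scores of the pages it
    links to. Scores are rescaled to unit Euclidean length after each update.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    def normalized(scores):
        norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
        return {p: v / norm for p, v in scores.items()}

    for _ in range(iterations):
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        auth = normalized(new_auth)

        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        hub = normalized(new_hub)
    return hub, auth
```

Repeating the two updates is what ties the scores to eigenvectors of matrices built from the link graph, as the abstract notes; pages pointed to by many good hubs rise in authority, and pages pointing to many good authorities rise as hubs.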