Conference Paper

Exploring Corporate Law Violations Through Data-Driven Clustering Approaches

Authors:
  • Westcliff University

Article
We extend network analysis to directed criminal networks in the context of asymmetric links. We computed selected centralities, centralizations and the assortativity of a drug trafficking network with 110 nodes and 295 edges. We also monitored the centralizations of eleven temporal networks corresponding to successive stages of investigation during the period 1994–1996. All indices reach local extrema at the stage of highest activity, extending previous results to directed networks. The sharpest changes (90%) are observed for betweenness and in-degree centralization. A notable difference between entropies is observed: the in-degree entropy reaches a global minimum at month 12, while the out-degree entropy reaches a global maximum. This confirms that at the stage of highest activity, incoming instructions are precise and focused, while outgoing instructions are diversified. These findings are expected to be useful for alerting the authorities to increasing criminal activity. The disruption simulations on the time-averaged network extend previous results on undirected networks to directed networks.
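The degree-based indices mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the abstract does not define its entropy, so the sketch assumes Shannon entropy over each node's share of incoming (or outgoing) links, and Freeman-style degree centralization normalized by the value of a directed star, (n-1)^2. The function names and the toy edge list are hypothetical.

```python
import math

def degrees(n, edges):
    """Return (in_degree, out_degree) lists for a directed graph on nodes 0..n-1."""
    indeg, outdeg = [0] * n, [0] * n
    for source, target in edges:
        outdeg[source] += 1
        indeg[target] += 1
    return indeg, outdeg

def degree_centralization(deg):
    """Freeman centralization: sum of gaps to the maximum degree,
    divided by the value attained by a directed star, (n-1)^2."""
    n = len(deg)
    top = max(deg)
    return sum(top - k for k in deg) / ((n - 1) ** 2)

def share_entropy(deg):
    """Shannon entropy (bits) of each node's share of all links.
    Assumed definition; low values mean links concentrate on few nodes."""
    total = sum(deg)
    return -sum((k / total) * math.log2(k / total) for k in deg if k > 0)

# Toy example: four members all report to node 0 (an in-star).
edges = [(1, 0), (2, 0), (3, 0), (4, 0)]
indeg, outdeg = degrees(5, edges)
print(degree_centralization(indeg), share_entropy(indeg), share_entropy(outdeg))
```

In the toy in-star, incoming links are fully concentrated on one node (in-degree entropy at its minimum) while outgoing links are spread evenly across the other four, which mirrors the qualitative contrast between in-degree and out-degree entropy that the abstract describes.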
Article
The detection of anomalous structures in natural image data is of utmost importance for numerous tasks in the field of computer vision. The development of methods for unsupervised anomaly detection requires data on which to train and evaluate new approaches and ideas. We introduce the MVTec anomaly detection dataset containing 5354 high-resolution color images of different object and texture categories. It contains normal, i.e., defect-free images intended for training and images with anomalies intended for testing. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, contaminations, and various structural changes. In addition, we provide pixel-precise ground truth annotations for all anomalies. We conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods based on deep architectures such as convolutional autoencoders, generative adversarial networks, and feature descriptors using pretrained convolutional neural networks, as well as classical computer vision methods. We highlight the advantages and disadvantages of multiple performance metrics as well as threshold estimation techniques. This benchmark indicates that methods that leverage descriptors of pretrained networks outperform all other approaches and deep-learning-based generative models show considerable room for improvement.
Chapter
The travel behavior of residents is influenced by environmental factors such as the urban transportation system and administrative divisions. In turn, users equipped with navigation devices act as sensors detecting the environmental dynamics. The long-term accumulation of navigation big data contains massive valuable spatio-temporal information. We propose to detect the spatio-temporal distribution and community structure of urban hotspot regions from navigation big data. A framework including data preprocessing, hotspot region detection, urban spatial discretization, and community structure detection is designed in this work. Hotspot regions are detected by kernel density estimation and density-based clustering on origin-destination (OD) points of navigation trajectories. The hotspot regions are discretized into Voronoi polygon grids based on the spatial distribution of OD points. Finally, we analyze the complex network formed by hotspot region grids and employ the Louvain algorithm to detect the community structure of the hotspot region network. This framework is implemented on the taxi dataset of Chengdu. The experimental results reveal the spatio-temporal distribution and community structure of urban hotspot regions in different periods of weekdays and weekends. The urban hotspot regions and the community structure are influenced by the inherent geographical environment and evolve dynamically with time. The spatio-temporal characteristics of urban hotspot regions are shown to result from the interaction of environment and human activities. The findings of this work could provide decision-making support for transportation system optimization, city layout planning, and smart city construction.
Article
Online reviews play a significant role in influencing decisions made by users in day-to-day life. The presence of reviewers who deliberately post fake reviews for financial or other gains, however, negatively impacts both users and businesses. Unfortunately, automatically detecting such reviewers is a challenging problem since fake reviews do not seem out-of-place next to genuine reviews. In this paper, we present a fully unsupervised approach to detect anomalous behavior in online reviewers. We propose a novel hierarchical approach for this task in which we (1) derive distributions for key features that define reviewer behavior, and (2) combine these distributions into a finite mixture model. Our approach is highly generalizable and it allows us to seamlessly combine both univariate and multivariate distributions into a unified anomaly detection system. Most importantly, it requires no explicit labeling (spam/not spam) of the data. Our newly developed approach outperforms prior state-of-the-art unsupervised anomaly detection approaches.
Article
Criminal organizations tend to be clustered to reduce risks of detection and information leaks. Yet, the literature exploring the relevance of subgroups for their internal structure is so far very limited. The paper applies methods of community analysis to explore the structure of a criminal network representing the individuals’ co-participation in meetings. It draws from a case study on a large law enforcement operation (“Operazione Infinito”) tackling the ‘Ndrangheta, a mafia organization from Calabria, a southern Italian region. The results show that the network is indeed clustered and that communities are associated, in a non-trivial way, with the internal organization of the ‘Ndrangheta into different “locali” (similar to mafia families). Furthermore, the results of community analysis can improve the prediction of the “locale” membership of the criminals (up to two thirds of any random sample of nodes) and the leadership roles (above 90% precision in classifying nodes as either bosses or non-bosses). The implications of these findings on the interpretation of the structure and functioning of the criminal network are discussed.
Article
Context: During the past 2 decades, a major transition in the clinical characterization of psychotic disorders has occurred. The construct of a clinical high-risk (HR) state for psychosis has evolved to capture the prepsychotic phase, describing people presenting with potentially prodromal symptoms. The importance of this HR state has been increasingly recognized to such an extent that a new syndrome is being considered as a diagnostic category in the DSM-5. Objective: To reframe the HR state in a comprehensive state-of-the-art review on the progress that has been made while also recognizing the challenges that remain. Data Sources: Available HR research of the past 20 years from PubMed, books, meetings, abstracts, and international conferences. Study Selection and Data Extraction: Critical review of HR studies addressing historical development, inclusion criteria, epidemiologic research, transition criteria, outcomes, clinical and functional characteristics, neurocognition, neuroimaging, predictors of psychosis development, treatment trials, socioeconomic aspects, nosography, and future challenges in the field. Data Synthesis: Relevant articles retrieved in the literature search were discussed by a large group of leading worldwide experts in the field. The core results are presented after consensus and are summarized in illustrative tables and figures. Conclusions: The relatively new field of HR research in psychosis is exciting. It has the potential to shed light on the development of major psychotic disorders and to alter their course. It also provides a rationale for service provision to those in need of help who could not previously access it and the possibility of changing trajectories for those with vulnerability to psychotic illnesses.
Article
We propose an approach to address data uncertainty for discrete optimization and network flow problems that allows controlling the degree of conservatism of the solution, and is computationally tractable both practically and theoretically. In particular, when both the cost coefficients and the data in the constraints of an integer programming problem are subject to uncertainty, we propose a robust integer programming problem of moderately larger size that allows controlling the degree of conservatism of the solution in terms of probabilistic bounds on constraint violation. When only the cost coefficients are subject to uncertainty and the problem is a 0–1 discrete optimization problem on n variables, then we solve the robust counterpart by solving at most n+1 instances of the original problem. Thus, the robust counterpart of a polynomially solvable 0–1 discrete optimization problem remains polynomially solvable. In particular, robust matching, spanning tree, shortest path, matroid intersection, etc. are polynomially solvable. We also show that the robust counterpart of an NP-hard α-approximable 0–1 discrete optimization problem remains α-approximable. Finally, we propose an algorithm for robust network flows that solves the robust counterpart by solving a polynomial number of nominal minimum cost flow problems in a modified network.
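The "at most n+1 instances" claim can be illustrated on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: the feasible set is "choose exactly k of n items" (chosen because its nominal problem is trivially solvable), each cost c_j may rise by a deviation d_j, and at most gamma (an integer budget) of the costs deviate at once. For each threshold theta taken from the sorted deviations plus zero, it solves one nominal problem with costs c_j + max(d_j - theta, 0) and adds gamma * theta; the smallest of these n+1 values is taken as the robust optimum. All function names and data are hypothetical.

```python
def nominal_choose_k(costs, k):
    """Nominal problem: pick exactly k items with minimum total cost."""
    chosen = sorted(range(len(costs)), key=lambda j: costs[j])[:k]
    return sum(costs[j] for j in chosen), sorted(chosen)

def robust_cost(c, d, chosen, gamma):
    """Worst-case cost of a selection when at most `gamma` chosen items
    take their full deviation d_j."""
    worst = sorted((d[j] for j in chosen), reverse=True)[:gamma]
    return sum(c[j] for j in chosen) + sum(worst)

def robust_choose_k(c, d, k, gamma):
    """Solve the robust counterpart via n+1 nominal problems."""
    thresholds = sorted(d, reverse=True) + [0]  # d_1 >= ... >= d_n >= d_{n+1} = 0
    best_value, best_chosen = None, None
    for theta in thresholds:
        # Only deviations above the threshold are charged to the items.
        modified = [cj + max(dj - theta, 0) for cj, dj in zip(c, d)]
        value, chosen = nominal_choose_k(modified, k)
        value += gamma * theta
        if best_value is None or value < best_value:
            best_value, best_chosen = value, chosen
    return best_value, best_chosen

c = [4, 6, 5, 7, 3]   # nominal costs (illustrative)
d = [5, 1, 2, 0, 6]   # maximum deviations (illustrative)
value, chosen = robust_choose_k(c, d, k=2, gamma=1)
print(value, chosen, robust_cost(c, d, chosen, gamma=1))
```

Any other feasible set with an efficient nominal solver (shortest path, spanning tree, matching) could replace `nominal_choose_k` without changing the outer loop, which is the point of the polynomial-solvability statement in the abstract.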
Article
We propose a novel unsupervised anomaly detection and diagnosis algorithm in power electronic networks. Since most anomaly detection and diagnosis algorithms in the literature are based on supervised methods that can hardly be generalized to broader scenarios, we propose unsupervised algorithms. Our algorithm extracts the Time-Frequency Domain (TFD) features from the three-phase currents and three-phase voltages of the point of coupling (PCC) nodes to detect anomalies and distinguish between different types of anomalies, such as cyber-attacks and physical faults. To detect anomalies through TFD features, we propose a novel Informative Leveraging for Anomaly Detection (ILAD) algorithm. The proposed unsupervised ILAD algorithm automatically extracts noise-reduced anomalous signals, resulting in more accurate anomaly detection results than other score-based methods. To assign anomaly types for anomaly diagnosis, we apply a novel Multivariate Functional Principal Component Analysis (MFPCA) clustering method. Unlike the deep learning methods, the MFPCA clustering method does not require labels for training and provides more accurate results than other deep embedding-based clustering approaches. Furthermore, it is even comparable to supervised algorithms in both offline and online experiments. To the best of our knowledge, the proposed unsupervised framework accomplishing anomaly detection and anomaly diagnosis tasks is the first of its kind in power electronic networks.
Chapter
Unsupervised anomaly detection (UAD) aims to find anomalous images by optimising a detector using a training set that contains only normal images. UAD approaches can be based on reconstruction methods, self-supervised approaches, and Imagenet pre-trained models. Reconstruction methods, which detect anomalies from image reconstruction errors, are advantageous because they do not rely on the design of problem-specific pretext tasks needed by self-supervised approaches, and on the unreliable translation of models pre-trained from non-medical datasets. However, reconstruction methods may fail because they can have low reconstruction errors even for anomalous images. In this paper, we introduce a new reconstruction-based UAD approach that addresses this low-reconstruction error issue for anomalous images. Our UAD approach, the memory-augmented multi-level cross-attentional masked autoencoder (MemMC-MAE), is a transformer-based approach, consisting of a novel memory-augmented self-attention operator for the encoder and a new multi-level cross-attention operator for the decoder. MemMC-MAE masks large parts of the input image during its reconstruction, reducing the risk that it will produce low reconstruction errors because anomalies are likely to be masked and cannot be reconstructed. However, when the anomaly is not masked, then the normal patterns stored in the encoder’s memory combined with the decoder’s multi-level cross-attention will constrain the accurate reconstruction of the anomaly. We show that our method achieves SOTA anomaly detection and localisation on colonoscopy, pneumonia, and covid-19 chest x-ray datasets.
Article
OneClass SVM is a popular method for unsupervised anomaly detection. Like many other methods, it suffers from the black box problem: it is difficult to justify, in an intuitive and simple manner, why the decision frontier identifies data points as anomalous or non-anomalous. This problem is being widely addressed for supervised models. However, it is still an uncharted area for unsupervised learning. In this paper, we evaluate several rule extraction techniques over OneClass SVM models, while presenting alternative designs for some of those algorithms. Furthermore, we propose algorithms to compute metrics related to eXplainable Artificial Intelligence (XAI) regarding the “comprehensibility”, “representativeness”, “stability” and “diversity” of the extracted rules. We evaluate our proposals with different data sets, including real-world data coming from industry. Consequently, our proposal contributes to extending XAI techniques to unsupervised machine learning models.
Article
Machine learning (ML) encompasses a broad range of algorithms and modeling tools used for a vast array of data processing tasks, which has entered most scientific disciplines in recent years. This article reviews in a selective way the recent research on the interface between machine learning and the physical sciences. This includes conceptual developments in ML motivated by physical insights, applications of machine learning techniques to several domains in physics, and cross fertilization between the two fields. After giving a basic notion of machine learning methods and principles, examples are described of how statistical physics is used to understand methods in ML. This review then describes applications of ML methods in particle physics and cosmology, quantum many-body physics, quantum computing, and chemical and material physics. Research and development into novel computing architectures aimed at accelerating ML are also highlighted. Each of the sections describes recent successes as well as domain-specific methodology and challenges.
Article
Historical earthquakes have to be parameterized for seismic-hazard analyses, although there may be only a few intensity assignments available for them. We studied epicenter determination for 18 million synthetic samples of 3–11 intensity data points (IDPs). The IDP distributions corresponded to earthquakes that occurred offshore, close to the coast, or onshore. We assumed an ordinal variable, an attenuation relationship, and a point source. The attenuation relationship was utilized to encompass every IDP of a sample using a lower and upper radius that corresponded to the respective intensity. The epicenter must fit all the intensity rings simultaneously. The successes and failures of epicenter determination were monitored for a fixed magnitude and depth. We investigated where the epicenter was found, its uncertainty, and its uniqueness. Small location uncertainties may be obtained for the smallest samples, but increasing the sample size led to a larger proportion of small uncertainties, provided that intensities were error-free. A large range of intensities in the sample, a short distance to the true epicenter, and, to a lesser extent, a small azimuthal gap were indicators of a good solution. A location uncertainty of 20 km or smaller is realistic in many cases, but uncertainties of 5 km are extremely rare. The proportion of good locations was reduced when the intensities were erroneous. Epicenters were determined for the earthquakes of 26 April 1458 in central Italy, 14 July 1765 in Sweden, and 23 December 1875 in the eastern United States.
Article
This article offers some remarks on a few critical issues related to explanation in criminal network research. It first discusses two distinct perspectives on networks, namely a substantive approach that views networks as a distinct form of organisation, and an instrumental one that interprets networks as a collection of nodes and attributes. The latter stands at the basis of Social Network Analysis. This work contends that the instrumental approach is better suited to test hypotheses, as it does not assume any structure a priori, but derives it from the data. Moreover, social network techniques can be applied to investigate criminal networks while rejecting the notion of networks as a distinct form of organisation. Next, the article discusses some potential pitfalls associated with the instrumental approach and cautions against an over-reliance on structural measures alone when interpreting real-world networks. It then stresses the need to complement these measures with additional qualitative evidence. Finally, the article discusses the use of Quadratic Assignment Procedure regression models as a viable strategy to test hypotheses based on criminal network data.
Conference Paper
We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.
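The notion of a customer "supporting" a sequential pattern can be made concrete with a short sketch. This is not AprioriSome or AprioriAll; it only shows the support test those algorithms rely on, assuming transactions arrive as hypothetical (customer_id, time, items) tuples and that a pattern is an ordered list of itemsets, each of which must be contained in a distinct, later transaction of the same customer.

```python
from collections import defaultdict

def build_sequences(transactions):
    """Group (customer_id, time, items) tuples into one time-ordered
    list of itemsets per customer."""
    by_customer = defaultdict(list)
    for customer_id, _time, items in sorted(transactions, key=lambda t: (t[0], t[1])):
        by_customer[customer_id].append(frozenset(items))
    return dict(by_customer)

def contains(customer_sequence, pattern):
    """True if every itemset of `pattern` is a subset of some transaction,
    with the matched transactions in strictly increasing order."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest transaction that covers this itemset.
        while pos < len(customer_sequence) and not itemset <= customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True

def support(sequences, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    return sum(contains(seq, pattern) for seq in sequences.values()) / len(sequences)

transactions = [
    (1, 1, {"a"}), (1, 2, {"b", "c"}),
    (2, 1, {"a", "b"}), (2, 3, {"c"}),
    (3, 2, {"c"}), (3, 5, {"a"}),
]
sequences = build_sequences(transactions)
print(support(sequences, [{"a"}, {"c"}]))   # customers 1 and 2 buy a, then later c
```

A mining algorithm would call `support` on candidate patterns and keep those at or above a minimum threshold; how candidates are generated is what distinguishes the algorithms named in the abstract.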