ABSTRACT: A vector autoregressive model in the discrete time domain (DVAR) is often used to analyze continuous-time, multivariate, linear Markov systems through their observed time series data sampled at discrete time steps. Based on previous studies, the DVAR model has been supposed to be a noncanonical representation of the system, that is, it does not correspond to a unique system bijectively. However, in this article, we characterize the relations of the DVAR model with its corresponding Structural Vector AR (SVAR) and Continuous Time Vector AR (CTVAR) models through a finite difference method across the continuous and discrete time domains. We further clarify that the DVAR model of a continuous-time, multivariate, linear Markov system is canonical under a highly generic condition. Our analysis shows that we can uniquely reproduce its SVAR and CTVAR models from the DVAR model. Based on these results, we propose a novel Continuous and Structural Vector Autoregressive (CSVAR) modeling approach to derive the SVAR and CTVAR models from the DVAR model empirically estimated from the observed time series of continuous-time linear Markov systems. We demonstrate its superior performance through numerical experiments on both artificial and real-world data.
Article · Nov 2015 · ACM Transactions on Intelligent Systems and Technology
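The recovery of continuous-time dynamics from a discrete-time fit that the abstract describes can be illustrated in the simplest scalar case. Below is a minimal sketch (our own construction, not code from the paper): simulate a stable one-dimensional linear Markov system at sampling interval `dt`, fit the DVAR(1) coefficient by least squares, and map it back to the continuous-time coefficient via the finite difference relation `a ≈ (b - 1) / dt`.

```python
import random

random.seed(0)

# Continuous-time system: dx/dt = a*x + noise, with true a < 0 (stable).
a_true = -0.5
dt = 0.01          # sampling interval
n = 20000

# Simulate via Euler discretization: x[t+1] = (1 + a*dt)*x[t] + noise.
x = [1.0]
for _ in range(n):
    x.append((1.0 + a_true * dt) * x[-1] + random.gauss(0.0, 0.1))

# Fit the DVAR(1) coefficient b by least squares: x[t+1] ~ b * x[t].
num = sum(x[t] * x[t + 1] for t in range(n))
den = sum(x[t] * x[t] for t in range(n))
b_hat = num / den

# Recover the continuous-time coefficient: a = (b - 1) / dt.
a_hat = (b_hat - 1.0) / dt
```

With enough samples, `a_hat` lands close to the true coefficient, illustrating why the discrete-time fit determines the continuous-time dynamics once the sampling interval is known.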
ABSTRACT: Data depth is a statistical method that models a data distribution in terms of center-outward ranking rather than density or linear ranking. While it has attracted considerable academic interest, its applications are hampered by the lack of a method that is both robust and efficient. This paper introduces Half-Space Mass, a significantly improved version of half-space data depth. Half-Space Mass is, to the best of our knowledge, the only data depth method that is both robust and efficient. We also reveal four theoretical properties of Half-Space Mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) its maximum point is unique and can be considered a median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher-dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of Half-Space Mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than half-space depth, and it runs orders of magnitude faster than $L_2$ depth, an existing maximally robust location estimator. In clustering, the Half-Space Mass version of K-means overcomes three weaknesses of K-means.
Article · Aug 2015 · Machine Learning
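The concavity and central-maximum properties claimed above can be checked on a toy example. The sketch below is our own one-dimensional reading of the half-space-mass idea (the paper's estimator and constants may differ): for many random split points, average the fraction of data lying on the same side of the split as the query point.

```python
import random

random.seed(1)

# 1-D data: uniform sample on [0, 1].
data = [random.random() for _ in range(500)]
lo, hi = min(data), max(data)

def half_space_mass(x, data, n_splits=2000):
    """Average, over random split points, of the fraction of data
    lying on the same side of the split as x (1-D sketch)."""
    total = 0.0
    for _ in range(n_splits):
        s = random.uniform(lo, hi)          # random half-space boundary
        same_side = sum(1 for d in data if (d < s) == (x < s))
        total += same_side / len(data)
    return total / n_splits

center = half_space_mass(0.5, data)   # near the median
edge = half_space_mass(0.99, data)    # near the boundary of the data
```

For uniform data the mass near the median comes out around 0.75 and falls off toward the edges, consistent with a concave distribution whose unique maximum acts as a median.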
ABSTRACT: Cardiovascular diseases, which lead to cardiovascular events including death, progress with many deleterious pathophysiological sequelae. If a cause-and-effect relationship follows a one-to-one relation, we can focus on a cause to treat an effect, but such a relation does not hold in cardiovascular diseases. To identify novel drugs in the cardiovascular field, we generally adopt two different strategies: induction and deduction. In the cardiovascular field, it is difficult to use deduction because cardiovascular diseases are caused by many factors, leading us to use induction. In this method, we consider all clinical data, such as medical records or genetic data, and identify a few candidates. Recent computational and mathematical advances enable us to use data-mining methods to uncover hidden relationships between many parameters and clinical outcomes. However, because these candidates are not identified as promoting or inhibiting factors, or as causal or consequent factors of cardiovascular diseases, we need to test them in basic research and bring them back to the clinical field to test their efficacy in clinical trials. With such a "back-and-forth loop" between clinical observation and basic research, data-mining methods may provide novel strategies leading to new tools for clinicians, basic findings for researchers, and better outcomes for patients.
Article · Jun 2015 · Cardiovascular Drugs and Therapy
ABSTRACT: This paper presents a first application of a novel Continuous and Structural Autoregressive Moving Average (CSARMA) modeling approach to BWR noise analysis. The CSARMA approach derives a unique representation of the system dynamics through more robust and reliable canonical models, as a basis for signal analysis in general and for reactor diagnostics in particular. In this paper, a stability event that occurred in a Swiss BWR plant during the power ascension phase is analyzed, as well as the time periods that preceded and followed the event. Focusing only on qualitative trends at this stage, the obtained results clearly indicate a different dynamical state during the unstable event compared to the two other, stable periods. They could also be interpreted as pointing to a disturbance in the pressure control system as the primary cause of the event. To benchmark these findings, the frequency-domain-based Signal Transmission-Path (STP) method is also applied, and it yields similar relationships to those mentioned above. This consistency between the two methods can be considered a confirmation that the event was caused by a pressure control system disturbance and not induced by the core. It is also worth noting that the STP analysis failed to capture the relations among the processes during the stable periods that were clearly indicated by the CSARMA method, since the latter uses more precise models as its basis.
Article · Jan 2015 · Annals of Nuclear Energy
ABSTRACT: Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high-dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data-dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as a probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides reliable nearest neighbour search in high-dimensional spaces, particularly in sparse data. mp-dissimilarity produced better task-specific performance than the lp-norm and cosine distance in classification and information retrieval tasks.
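The mass-based intuition above can be sketched in a few lines. This is our simplified reading of the measure (the paper's exact normalisation may differ): in each dimension, count the fraction of data falling in the interval spanned by the two instances, then combine the per-dimension masses with a p-norm style mean.

```python
def mp_dissimilarity(x, y, data, p=2):
    """Per-dimension data mass in the interval spanned by x and y,
    combined p-norm style (a simplified sketch of mp-dissimilarity)."""
    d, n = len(x), len(data)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        mass = sum(1 for row in data if lo <= row[i] <= hi) / n
        total += mass ** p
    return (total / d) ** (1.0 / p)

# Dense cluster near 0 plus a few sparse points; the two pairs below
# have the same geometric distance (0.5), one pair in each region.
data = [[0.01 * i] for i in range(100)] + [[5.0], [7.0], [9.0]]
dense_pair = mp_dissimilarity([0.0], [0.5], data)
sparse_pair = mp_dissimilarity([5.0], [5.5], data)
```

The pair in the sparse region gets a much smaller dissimilarity than the equally distant pair inside the dense cluster, which is exactly the data-dependent behaviour the abstract describes.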
ABSTRACT: While the recent advent of new technologies in biology, such as DNA microarrays and next-generation sequencers, has given researchers a large volume of data representing genome-wide biological responses, it is not necessarily easy to derive knowledge that is both accurate and understandable. In this study, we applied the Classification Based on Association (CBA) algorithm, one of the class association rule mining techniques, to the TG-GATEs database, where both toxicogenomic and toxicological data of more than 150 compounds in rat and human are stored. We compared the classifiers generated by CBA and by linear discriminant analysis (LDA) and showed that CBA is superior to LDA in terms of both predictive performance (accuracy: 83% for CBA vs. 75% for LDA; sensitivity: 82% vs. 72%; specificity: 85% vs. 75%) and interpretability.
Article · Nov 2014 · Toxicology Reports
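Class association rule mining of the kind CBA performs can be sketched compactly. The toy data, thresholds, and function names below are ours, and the rule miner is deliberately bare-bones (itemsets up to length two, no pruning or default-rule selection as in full CBA): mine `itemset -> label` rules above support and confidence thresholds, order them by confidence then support, and classify with the first matching rule.

```python
from itertools import combinations

# Toy transactions: (set of items, class label). Item names are made up.
transactions = [
    ({"high_dose", "liver_marker"}, "toxic"),
    ({"high_dose", "liver_marker"}, "toxic"),
    ({"high_dose"}, "toxic"),
    ({"low_dose"}, "safe"),
    ({"low_dose", "liver_marker"}, "safe"),
    ({"low_dose"}, "safe"),
]

def mine_rules(transactions, min_sup=2, min_conf=0.8, max_len=2):
    """Mine class association rules (itemset -> label) that meet
    support and confidence thresholds."""
    items = sorted({i for t, _ in transactions for i in t})
    rules = []
    for k in range(1, max_len + 1):
        for itemset in combinations(items, k):
            s = set(itemset)
            covered = [lab for t, lab in transactions if s <= t]
            if len(covered) < min_sup:
                continue
            for label in set(covered):
                conf = covered.count(label) / len(covered)
                if conf >= min_conf:
                    rules.append((s, label, conf, len(covered)))
    # CBA-style ordering: confidence first, then support.
    rules.sort(key=lambda r: (-r[2], -r[3]))
    return rules

def classify(rules, transaction, default="safe"):
    """Predict with the first rule whose antecedent matches."""
    for s, label, _, _ in rules:
        if s <= transaction:
            return label
    return default

rules = mine_rules(transactions)
```

The ordered rule list is also the interpretability payoff the abstract mentions: each prediction traces back to one human-readable rule.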
ABSTRACT: A large amount of observational data has been accumulated in various fields in recent years, and there is a growing need to estimate the processes that generate these data. A linear non-Gaussian acyclic model (LiNGAM) based on the non-Gaussianity of external influences has been proposed to estimate the data-generating processes of variables. However, the estimation results can be biased if there are latent classes. In this paper, we first review LiNGAM and its extended model, as well as the estimation procedure for LiNGAM in a Bayesian framework. We then propose a new Bayesian estimation procedure that solves the problem.
ABSTRACT: Despite their widespread use, nearest neighbour density estimators have two fundamental limitations: O(n^2) time complexity and O(n) space complexity. Both limitations constrain nearest neighbour density estimators to small data sets only. Recent progress using indexing schemes has improved this only to near-linear time complexity.
We propose a new approach, called LiNearN for Linear time Nearest Neighbour algorithm, that yields, to the best of our knowledge, the first nearest neighbour density estimator having O(n) time complexity and constant space complexity. This is achieved without any indexing scheme because LiNearN uses a subsampling approach in which the subsample sizes are significantly smaller than the data size. Like existing density estimators, our asymptotic analysis reveals that the new density estimator has a parameter that trades off between bias and variance. We show that algorithms based on the new nearest neighbour density estimator can easily scale up to data sets with millions of instances in anomaly detection and clustering tasks.
Article · Aug 2014 · Pattern Recognition
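The subsampling idea can be sketched in one dimension. This is in the spirit of LiNearN but is our own simplification (the constants, the 1-D kernel, and the function name are not the paper's): estimate density at a point from the nearest-neighbour distance within small random subsamples, so each round touches only a constant number of points, and average over rounds.

```python
import random

random.seed(2)

# 1-D data: a dense cluster around 0 and a sparse tail on [5, 15].
data = [random.gauss(0.0, 0.5) for _ in range(5000)] + \
       [random.uniform(5.0, 15.0) for _ in range(100)]

def nn_density(x, data, subsample=20, rounds=50):
    """Estimate density at x from the nearest-neighbour distance
    within small random subsamples (constant space per round)."""
    total = 0.0
    for _ in range(rounds):
        sample = random.sample(data, subsample)
        r = min(abs(x - s) for s in sample)
        # 1-D NN estimate: ~1 neighbour within radius r among `subsample` pts.
        total += 1.0 / (subsample * 2.0 * max(r, 1e-9))
    return total / rounds

dense = nn_density(0.0, data)    # inside the cluster
sparse = nn_density(10.0, data)  # in the sparse tail
```

Because the subsample size is fixed, the cost per query does not grow with the data size, which is the trade-off (bias and variance against speed) that the abstract's analysis formalises.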
ABSTRACT: iForest uses a collection of isolation trees to detect anomalies. While it is effective in detecting global anomalies, it fails to detect local anomalies in data sets with multiple clusters of normal instances, because the local anomalies are masked by normal clusters of similar density and become less susceptible to isolation. In this paper, we propose a very simple but effective solution to overcome this limitation: replacing the global ranking measure based on path length with a local ranking measure based on relative mass, which takes the local data distribution into consideration. We demonstrate the utility of relative mass by improving the task-specific performance of iForest in anomaly detection and information retrieval tasks.
ABSTRACT: Discovering causal relations among observed variables in a given data set is a major objective in studies of statistics and artificial intelligence. Recently, techniques to discover a unique causal model have been explored based on the non-Gaussianity of the observed data distribution. However, most of these are limited to continuous data. In this paper, we present a novel causal model for binary data and propose an efficient new approach to deriving the unique causal model governing a given binary data set under skew distributions of the external binary noises. Experimental evaluation shows excellent performance on both artificial and real-world data sets.
ABSTRACT: The notion of causality is used in many situations dealing with uncertainty. We consider the problem of whether causality can be identified from a data set generated by discrete random variables rather than continuous ones. In particular, for non-binary data, it was thus far only known that causality can be identified except in rare cases. In this paper, we present a necessary and sufficient condition for an integer modular acyclic additive noise (IMAN) model of two variables. In addition, we relate bivariate and multivariate causal identifiability in a more explicit manner, and develop a practical algorithm to find the order of variables and their parent sets. We demonstrate its performance in applications to artificial data and real-world body motion data, with comparisons to conventional methods.
ABSTRACT: The accurate detection of small deviations in given density matrices is important for quantum information processing. Here we propose a new method based on the concept of data mining. We demonstrate that the proposed method can detect small erroneous deviations in reconstructed density matrices, which contain intrinsic fluctuations due to the limited number of samples, more accurately than a naive method that checks the trace distance from the average of the given density matrices. This method has the potential to be a key tool in broad areas of physics where the detection of small deviations of quantum states reconstructed using a limited number of samples is essential.
Article · Jan 2014 · Physical Review A
ABSTRACT: We often use a discrete time vector autoregressive (DVAR) model to analyse continuous-time, multivariate, linear Markov systems through their time series data sampled at discrete time steps. However, the DVAR model has been considered not to be a structural representation, and hence not to have a bijective correspondence with the system dynamics in general. In this paper, we characterize the relationships of the DVAR model with its corresponding structural vector AR (SVAR) and continuous time vector AR (CVAR) models through a finite difference approximation of the time differentials. Our analysis shows that the DVAR model of a continuous-time, multivariate, linear Markov system bijectively corresponds to the system dynamics. Further, we clarify that the SVAR and CVAR models are uniquely reproduced from their DVAR model under a highly generic condition. Based on these results, we propose a novel Continuous time and Structural Vector AutoRegressive (CSVAR) modeling approach for continuous-time linear Markov systems to derive the SVAR and CVAR models from their DVAR model empirically estimated from the observed time series. We demonstrate its superior performance through numerical experiments on both artificial and real-world data.
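The finite-difference link described above can be written out explicitly. This is a sketch in our own notation (the paper's symbols may differ): write the continuous-time system as $\dot{x}(t) = A\,x(t) + e(t)$ and approximate the time differential over the sampling interval $\Delta$:

```latex
\frac{x(t+\Delta) - x(t)}{\Delta} \approx A\,x(t) + e(t)
\quad\Longrightarrow\quad
x(t+\Delta) \approx (I + \Delta A)\,x(t) + \Delta\,e(t).
```

The DVAR(1) coefficient matrix is therefore $M = I + \Delta A$, and when $\Delta$ is known the continuous-time dynamics are recovered uniquely as $A = (M - I)/\Delta$, which is the sense in which the DVAR model bijectively corresponds to the system.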
ABSTRACT: We consider learning a causal ordering of variables in a linear non-Gaussian acyclic model called LiNGAM. Several methods have been shown to consistently estimate a causal ordering, assuming that all the model assumptions are correct. However, the estimation results can be distorted if some assumptions are violated. In this letter, we propose a new algorithm for learning causal orders that is robust against one typical violation of the model assumptions: latent confounders. The key idea is to detect latent confounders by testing the independence between estimated external influences, and to find subsets (parcels) of variables that are unaffected by latent confounders. We demonstrate the effectiveness of our method using artificial data and simulated brain imaging data.
ABSTRACT: Recently, there has been a growing need for statistical learning of causal structures in data with many variables. A structural equation model called the Linear Non-Gaussian Acyclic Model (LiNGAM) has been extensively studied to uniquely estimate causal structures in data. The key assumptions are that external influences are independent and follow non-Gaussian distributions. However, LiNGAM does not capture temporal structural changes in observed data. In this paper, we consider learning causal structures in longitudinal data that collect samples over a period of time. Previous studies of LiNGAM provided no model specialized to handle longitudinal data with multiple samples. We therefore propose a new model called longitudinal LiNGAM and a new estimation method that uses information on temporal structural changes and the non-Gaussianity of the data. The new approach requires fewer assumptions than previous methods.
ABSTRACT: The analysis of multimedia application traces can reveal important information for enhancing program execution comprehension. However, typical traces can be gigabytes in size, which hinders their effective exploitation by application developers. In this paper, we study the problem of finding a set of sequences of events that allows a reduced-size rewriting of the original trace. These sequences of events, which we call blocks, can simplify the exploration of large execution traces by allowing application developers to see an abstraction instead of low-level events. The problem of computing such a set of blocks is NP-hard, and naive approaches lead to prohibitive running times that prevent the analysis of real-world traces. We propose a novel algorithm that directly mines the set of blocks. Our experiments show that our algorithm can analyse real traces of up to two hours of video. We also show experimentally the quality of the proposed set of blocks, and the value of the rewriting for understanding actual trace data.
ABSTRACT: Density estimation is the ubiquitous base modelling mechanism employed for many tasks, including clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods, such as the kernel density estimator and the k-nearest neighbour density estimator, have high time and space complexities, which render them inapplicable in problems with big data. This weakness sets the fundamental limit in existing algorithms for all these tasks. We propose the first density estimation method having average-case sub-linear time complexity and constant space complexity in the number of instances, which stretches this fundamental limit to the extent that dealing with millions of data points can now be done easily and quickly. We provide an asymptotic analysis of the new density estimator and verify the generality of the method by replacing existing density estimators with the new one in three current density-based algorithms, namely DBSCAN, LOF and Bayesian classifiers, representing the three data mining tasks of clustering, anomaly detection and classification. Our empirical evaluation shows that the new density estimation method significantly improves their time and space complexities, while maintaining or improving their task-specific performances in clustering, anomaly detection and classification. The new method empowers these algorithms, currently limited to small data sizes only, to process big data, setting a new benchmark for what density-based algorithms can achieve.
Article · Jun 2013 · Knowledge and Information Systems