Statistical Analysis and Data Mining

Published by Wiley
Online ISSN: 1932-1872
Publications
Article
Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of the features are insignificant. Ideally, contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations of the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. A rank-based agreement test may capture such unwanted effects, while testing contingency tables generated using hard cutoffs may be sensitive to the arbitrary choice of threshold. We develop a novel method based on feature-level concordance using the local false discovery rate. The association score enjoys straightforward interpretation. In simulations, the method shows higher statistical power to detect association between p-value lists. We demonstrate its utility using real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.
 
Article
Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over elements and element-wise differences of adjacency matrices, for identifying the sparseness structure as well as detecting structural changes over adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives for simulated and real data.
 
Article
In Prequential analysis, an inference method is viewed as a forecasting system, and the quality of the inference method is based on the quality of its predictions. This is an alternative approach to more traditional statistical methods that focus on the inference of parameters of the data generating distribution. In this paper, we introduce adaptive combined average predictors (ACAPs) for the Prequential analysis of complex data. That is, we use convex combinations of two different model averages to form a predictor at each time step in a sequence. A novel feature of our strategy is that the models in each average are re-chosen adaptively at each time step. To assess the complexity of a given data set, we introduce measures of data complexity for continuous response data. We validate our measures in several simulated contexts prior to using them in real data examples. The performance of ACAPs is compared with the performances of predictors based on stacking or likelihood weighted averaging in several model classes and in both simulated and real data sets. Our results suggest that ACAPs achieve a better trade-off between model list bias and model list variability in cases where the data are very complex. This implies that the choices of model class and averaging method should be guided by a concept of complexity matching, i.e., the analysis of a complex data set may require a more complex model class and averaging strategy than the analysis of a simpler data set. We propose that complexity matching is akin to a bias-variance tradeoff in statistical modeling.
 
Article
An important task in personalized medicine is to predict disease risk based on a person's genome, e.g., a large number of single-nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) make SNP and phenotype data available to researchers. A critical question for researchers is how to best predict disease risk. Penalized regression equipped with variable selection, such as LASSO and SCAD, is deemed promising in this setting. However, the sparsity assumption made by the LASSO, SCAD, and many other penalized regression techniques may not be applicable here: it is now hypothesized that many common diseases are associated with many SNPs with small to moderate effects. In this article, we use the GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) to investigate the performance of various unpenalized and penalized regression approaches under true sparse or non-sparse models. We find that, in general, penalized regression outperformed unpenalized regression; SCAD, TLP, and LASSO performed best for sparse models, while elastic net regression was the winner, followed by ridge, TLP, and LASSO, for non-sparse models.
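As a hedged illustration of the kind of comparison described above (not the WTCCC analysis itself), the sketch below contrasts unpenalized, ridge, lasso, and elastic net regression on synthetic sparse and non-sparse coefficient settings with scikit-learn; SCAD and TLP need nonconvex solvers and are omitted, and all data and settings here are made up for illustration.

```python
# A minimal sketch (not the WTCCC analysis): comparing unpenalized, ridge, lasso,
# and elastic net regression under "sparse" vs "non-sparse" coefficient settings.
# SCAD and TLP require nonconvex solvers and are omitted here.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # toy SNP-like 0/1/2 genotypes

def simulate_y(X, sparse=True):
    p = X.shape[1]
    beta = np.zeros(p)
    if sparse:
        beta[:10] = rng.normal(0, 1.0, 10)             # few large effects
    else:
        beta[:] = rng.normal(0, 0.05, p)               # many small effects
    return X @ beta + rng.normal(0, 1.0, X.shape[0])

for sparse in (True, False):
    y = simulate_y(X, sparse)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=1)
    models = {
        "ols": LinearRegression(),
        "ridge": RidgeCV(alphas=np.logspace(-2, 4, 20)),
        "lasso": LassoCV(cv=5, n_alphas=50),
        "enet": ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9], n_alphas=50),
    }
    scores = {name: m.fit(Xtr, ytr).score(Xte, yte) for name, m in models.items()}
    print("sparse" if sparse else "non-sparse", scores)
```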
 
Article
The NCI60 human tumor cell line screen is a public resource for studying selective and non-selective growth inhibition of small molecules against cancer cells. By coupling growth inhibition screening data with biological characterizations of the different cell lines, it becomes possible to infer mechanisms of action underlying some of the observable patterns of selective activity. Using these data, mechanistic relationships have been identified, including specific associations between single genes and small families of closely related compounds, and less specific relationships between biological processes involving several cooperating genes and broader families of compounds. Here we aim to characterize the degree to which such specific and general relationships are present in these data. A related question is whether genes tend to act with a uniform mechanism for all associated compounds, or whether multiple mechanisms are commonly involved. We address these two issues in a statistical framework, placing special emphasis on the effects of measurement error in the gene expression and chemical screening data. We find that as measurement accuracy increases, the pattern of apparent associations shifts from one dominated by isolated gene/compound pairs to one in which families consisting of an average of 25 compounds are associated with the same gene. At the same time, the number of genes that appear to play a role in influencing compound activities decreases. For less than half of the genes, the presence of both positive and negative correlations indicates pleiotropic associations with molecules via different mechanisms of action.
 
Article
Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light-scattering sensor recently developed for real-time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., it should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and, more importantly, the effect of their high mutation rate would not allow for a practical and manageable training library. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unmatched bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28-class bacteria dataset and also on the benchmark 26-class letter recognition dataset for further validation. The proposed approach is compared against state-of-the-art methods, including density-based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes.
 
Article
Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute-force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, conventional parametric regression models suffer from two issues and thus are not applicable to this problem. First, restricting the regression function to a fixed type (e.g., linear or polynomial) imposes assumptions that are too strong and reduce the model's flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs, due to the stochastic nature of most biological simulations and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright-Fisher model. The experimental results show that our method can predict the parameter values of both models with high accuracy.
 
Article
Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum number of SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with an embedded entropy algorithm that handles redundancy in selecting the SNPs with the best disease prediction performance. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that, on average, our proposed method outperforms the well-known SNP selection methods based on Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression for disease classification.
 
Article
Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less studied. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that can handle problems with high dimensions and a large number of classes. Moreover, it is necessary for multicategory classifiers to have sound theoretical properties. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large-scale problems, especially for problems with a large number of classes. Furthermore, the SVM cannot produce class probability estimates directly. In this article, we propose a novel, efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with a large number of classes, asymptotic consistency, the ability to handle high-dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.
 
Article
Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the period parameter for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. The methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the period parameter should be accounted for in the clustering procedure.
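For readers unfamiliar with periodic regression, the following is a minimal, illustrative sketch of how a per-peak period estimate and its standard error might be obtained by nonlinear least squares; the simulated series and model form are assumptions, not the paper's exact specification.

```python
# A minimal sketch (illustrative only): estimating the period of one peak's intensity
# profile by nonlinear least squares, the kind of per-peak estimate that could then be
# clustered with a mixture model weighted by its standard error.
import numpy as np
from scipy.optimize import curve_fit

def periodic_model(t, mean, amp, period, phase):
    return mean + amp * np.cos(2 * np.pi * t / period + phase)

rng = np.random.default_rng(0)
t = np.linspace(0, 24, 49)                      # hours, sampled every 30 min
y = 5 + 2 * np.cos(2 * np.pi * t / 24 + 0.3) + rng.normal(0, 0.5, t.size)

p0 = [y.mean(), y.std(), 24.0, 0.0]             # starting values (period near 24 h)
est, cov = curve_fit(periodic_model, t, y, p0=p0)
period_hat, period_se = est[2], np.sqrt(cov[2, 2])
print(f"estimated period: {period_hat:.2f} h (SE {period_se:.2f})")
```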
 
Article
Successful implementation of feature selection in nuclear magnetic resonance (NMR) spectra not only improves classification ability, but also simplifies the entire modeling process and, thus, reduces computational and analytical efforts. Principal component analysis (PCA) and partial least squares (PLS) have been widely used for feature selection in NMR spectra. However, extracting meaningful metabolite features from the reduced dimensions obtained through PCA or PLS is complicated because these reduced dimensions are linear combinations of a large number of the original features. In this paper, we propose a multiple testing procedure controlling false discovery rate (FDR) as an efficient method for feature selection in NMR spectra. The procedure clearly compensates for the limitation of PCA and PLS and identifies individual metabolite features necessary for classification. In addition, we present orthogonal signal correction to improve classification and visualization by removing unnecessary variations in NMR spectra. Our experimental results with real NMR spectra showed that classification models constructed with the features selected by our proposed procedure yielded smaller misclassification rates than those with all features.
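A generic sketch of FDR-controlled univariate feature selection is given below (per-feature t-tests followed by Benjamini-Hochberg adjustment); it illustrates the idea of multiple testing with FDR control but does not reproduce the paper's specific procedure or the orthogonal signal correction step, and the simulated "spectra" are made up.

```python
# A generic sketch of FDR-controlled feature selection: per-feature two-sample t-tests
# followed by Benjamini-Hochberg adjustment. The paper's exact procedure is not
# reproduced here.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_per_class, n_features = 30, 500
X0 = rng.normal(0, 1, (n_per_class, n_features))
X1 = rng.normal(0, 1, (n_per_class, n_features))
X1[:, :20] += 1.0                               # 20 truly differential "peaks"

pvals = np.array([ttest_ind(X0[:, j], X1[:, j]).pvalue for j in range(n_features)])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
selected = np.flatnonzero(reject)
print(f"{selected.size} features selected at FDR 0.05:", selected[:10], "...")
```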
 
Article
For high-dimensional regression, the number of predictors may greatly exceed the sample size, but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, and consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and the infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound for selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms backward variable selection in terms of price forecasting accuracy.
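The general recipe, generating a path of candidate models with LAR and scoring each model with an information criterion, can be sketched as follows. Because the abstract does not give the RIC(c) formula, BIC is used below purely as a placeholder criterion, and the data are simulated.

```python
# A sketch of the overall recipe: generate a path of candidate models with least angle
# regression, then pick one with an information criterion. BIC is used here only as a
# placeholder because the RIC(c) formula is not given in the abstract.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = [3, -2, 1.5, 1, -1]
y = X @ beta + rng.normal(size=n)

alphas, active, coefs = lars_path(X, y, method="lar")   # coefs has shape (p, n_steps)
best_bic, best_step = np.inf, 0
for step in range(coefs.shape[1]):
    b = coefs[:, step]
    k = np.count_nonzero(b)
    rss = np.sum((y - X @ b) ** 2)
    bic = n * np.log(rss / n) + k * np.log(n)
    if bic < best_bic:
        best_bic, best_step = bic, step
print("selected variables:", np.flatnonzero(coefs[:, best_step]))
```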
 
Article
The novel supervised learning method of vertex discriminant analysis (VDA) has demonstrated good performance in multicategory classification. The current paper explores an elaboration of VDA for nonlinear discrimination. By incorporating reproducing kernels, VDA can be generalized from linear discrimination to nonlinear discrimination. Our numerical experiments show that the new reproducing kernel-based method leads to accurate classification in both linear and nonlinear cases.
 
Article
In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the predictor variables or because the coefficient matrix is of lower rank. To accommodate both, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with the reduced rank constraint on the coefficient matrix to arrive at a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.
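One plausible reading of the algorithm is sketched below: fit a ridge regression for the multivariate response, then project the coefficient matrix onto the leading right singular vectors of the ridge-fitted values to enforce the rank constraint. The details may differ from the authors' exact procedure, and the data are simulated.

```python
# A sketch of the reduced rank ridge idea as read from the abstract: ridge fit for a
# multivariate response, then a rank-r projection of the coefficient matrix via the
# SVD of the ridge-fitted values. This may differ in detail from the authors' algorithm.
import numpy as np

def reduced_rank_ridge(X, Y, lam=1.0, rank=2):
    n, p = X.shape
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)   # p x q ridge fit
    Y_hat = X @ B_ridge
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]                                     # q x q rank-r projector
    return B_ridge @ P                                              # rank-constrained coefficients

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
B_true = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 6))         # true rank 2
Y = X @ B_true + 0.1 * rng.normal(size=(100, 6))
B_hat = reduced_rank_ridge(X, Y, lam=0.5, rank=2)
print("rank of estimate:", np.linalg.matrix_rank(B_hat))
```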
 
Article
Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.
 
Article
High-dimensional data common in genomics, proteomics, and chemometrics often contains complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors. Our approach enjoys many advantages, including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adaptation of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.
 
Article
Interestingness measures provide information that can be used to prune or select association rules. A given value of an interestingness measure is often interpreted relative to the overall range of the values that the interestingness measure can take. However, properties of individual association rules restrict the values an interestingness measure can achieve. An interestingness measure can be standardized to take this into account, but this has only been done for one interestingness measure to date, namely the lift. Standardization provides greater insight than the raw value and may even alter researchers' perception of the data. We derive standardized analogues of three interestingness measures and use real and simulated data to compare them to their raw versions, each other, and the standardized lift.
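To make the standardization idea concrete, the sketch below rescales the lift onto [0, 1] using bounds implied by the rule's margins; the particular bound expressions reflect one reading of the standardized-lift construction and ignore any additional support or confidence thresholds, so they should be treated as an assumption rather than the paper's definition.

```python
# A sketch of the standardization idea for one measure, the lift. Given the margins
# P(A), P(B) and n transactions, the lift is bounded; rescaling it onto [0, 1] within
# those bounds gives a "standardized" value. The bounds below ignore any extra
# support/confidence thresholds and are illustrative only.
def standardized_lift(n_ab, n_a, n_b, n):
    a, b, ab = n_a / n, n_b / n, n_ab / n
    lift = ab / (a * b)
    upper = min(a, b) / (a * b)                      # P(A and B) is at most min(P(A), P(B))
    lower = max(a + b - 1.0, 1.0 / n) / (a * b)      # Frechet bound / at least one transaction
    return lift, (lift - lower) / (upper - lower)

# Example: rule A -> B observed in 60 of 1000 transactions, with |A| = 100, |B| = 400.
lift, std_lift = standardized_lift(60, 100, 400, 1000)
print(f"lift = {lift:.2f}, standardized lift = {std_lift:.2f}")
```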
 
Article
This review article considers some of the most common methods used in astronomy for regressing one quantity against another, in order to estimate the model parameters or to predict an observationally expensive quantity using trends between object values. These methods have to tackle some of the awkward features prevalent in astronomical data, namely heteroscedastic (point-dependent) errors, intrinsic scatter, non-ignorable data collection and selection effects, data structure and non-uniform population (often called Malmquist bias), non-Gaussian data, outliers, and mixtures of regressions. We outline how least squares fits, weighted least squares methods, maximum likelihood, survival analysis, and Bayesian methods have been applied in the astrophysics literature when one or more of these features is present. In particular, we concentrate on errors-in-variables regression, and we advocate Bayesian techniques.
 
Article
We review the use of Bayesian Model Averaging in astrophysics. We first introduce the statistical basis of Bayesian Model Selection and Model Averaging. We discuss methods to calculate the model-averaged posteriors, including Markov Chain Monte Carlo (MCMC), nested sampling, Population Monte Carlo, and Reversible Jump MCMC. We then review some applications of Bayesian Model Averaging in astrophysics, including measurements of the dark energy and primordial power spectrum parameters in cosmology, cluster weak lensing and Sunyaev-Zel'dovich effect data, estimating distances to Cepheids, and classifying variable stars.
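A minimal sketch of the model-averaging step itself, given each model's log-evidence (e.g., from nested sampling) and prior probability, is shown below; the numbers are illustrative and not tied to any of the applications above.

```python
# A minimal sketch of Bayesian Model Averaging: given each model's log-evidence and
# prior probability, form posterior model weights and average a quantity of interest
# across models. All numbers are illustrative.
import numpy as np

log_evidence = np.array([-104.2, -102.8, -103.5])        # three candidate models
prior = np.array([1/3, 1/3, 1/3])

logw = log_evidence + np.log(prior)
weights = np.exp(logw - logw.max())
weights /= weights.sum()                                  # posterior model probabilities

# Per-model posterior means of some parameter of interest (values made up)
param_mean = np.array([-1.02, -0.97, -1.10])
print("model weights:", np.round(weights, 3))
print("model-averaged estimate:", float(weights @ param_mean))
```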
 
Article
Expanding a lower-dimensional problem to a higher-dimensional space and then projecting back is often beneficial. This article rigorously investigates this perspective in the context of finite mixture models, namely how to improve inference for mixture models by using auxiliary variables. Despite the large literature on mixture models and several empirical examples, there is no previous work that gives general theoretical justification for including auxiliary variables in mixture models, even for special cases. We provide a theoretical basis for comparing inference for multivariate mixture models with the corresponding inference for marginal univariate mixture models. Analytical results for several special cases are established. We show that the probability of correctly allocating mixture memberships and the information number for the means of the primary outcome in a bivariate model with two Gaussian mixtures are generally larger than those in each univariate model. Simulations under a range of scenarios, including misspecified models, are conducted to examine the improvement. The method is illustrated by two real applications in ecology and causal inference.
 
Article
Many businesses are using recommender systems for marketing outreach. Recommendation algorithms can be either based on content or driven by collaborative filtering. We study different ways to incorporate content information directly into the matrix factorization approach of collaborative filtering. These content-boosted matrix factorization algorithms not only improve recommendation accuracy, but also provide useful insights about the contents, as well as make recommendations more easily interpretable.
 
Article
We present a technique for spatiotemporal data analysis called nonlinear Laplacian spectral analysis (NLSA), which generalizes singular spectrum analysis (SSA) to take into account the nonlinear manifold structure of complex data sets. The key principle underlying NLSA is that the functions used to represent temporal patterns should exhibit a degree of smoothness on the nonlinear data manifold M, a constraint absent from classical SSA. NLSA enforces such a notion of smoothness by requiring that temporal patterns belong to low-dimensional Hilbert spaces V_l spanned by the leading l Laplace-Beltrami eigenfunctions on M. These eigenfunctions can be evaluated efficiently in high ambient-space dimensions using sparse graph-theoretic algorithms. Moreover, they provide orthonormal bases to expand a family of linear maps, whose singular value decomposition leads to sets of spatiotemporal patterns at progressively finer resolution on the data manifold. The Riemannian measure of M and an adaptive graph kernel width enhance the capability of NLSA to detect important nonlinear processes, including intermittency and rare events. The minimum dimension of V_l required to capture these features while avoiding overfitting is estimated here using spectral entropy criteria.
 
Article
This paper brings explicit considerations of distributed computing architectures and data structures into the rigorous design of Sequential Monte Carlo (SMC) methods. A theoretical result established recently by the authors shows that adapting interaction between particles to suitably control the Effective Sample Size (ESS) is sufficient to guarantee stability of SMC algorithms. Our objective is to leverage this result and devise algorithms which are thus guaranteed to work well in a distributed setting. We make three main contributions to achieve this. Firstly, we study mathematical properties of the ESS as a function of matrices and graphs that parameterize the interaction amongst particles. Secondly, we show how these graphs can be induced by tree data structures which model the logical network topology of an abstract distributed computing environment. Thirdly, we present efficient distributed algorithms that achieve the desired ESS control, perform resampling and operate on forests associated with these trees.
 
Article
Binary classification is a common statistical learning problem in which a model is estimated on a set of covariates for some outcome indicating membership in one of two classes. In the literature, there exists a distinction between hard and soft classification. In soft classification, the conditional class probability is modeled as a function of the covariates. In contrast, hard classification methods only target the optimal prediction boundary. While hard and soft classification methods have been studied extensively, not much work has been done to compare the actual tasks of hard and soft classification. In this paper, we propose a spectrum of statistical learning problems which span the hard and soft classification tasks, based on fitting multiple decision rules to the data. By doing so, we reveal a novel collection of learning tasks of increasing complexity. We study the problems using the framework of large-margin classifiers and a class of piecewise linear convex surrogates, for which we derive statistical properties and a corresponding sub-gradient descent algorithm. We conclude by applying our approach to simulation settings and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
 
Article
In the last few years, many real-world networks have been found to show a so-called community structure organization. Much effort has been devoted in the literature to developing methods and algorithms that can efficiently highlight this hidden structure of the network, traditionally by partitioning the graph. Since network representation can be very complex and can contain different variants of the traditional graph model, each algorithm in the literature focuses on some of these properties and establishes, explicitly or implicitly, its own definition of community. According to this definition, it then extracts communities that reflect only some of the features of real communities. The aim of this survey is to provide a manual for the community discovery problem. Given a meta-definition of what a community in a social network is, our aim is to organize the main categories of community discovery based on their own definition of community. Given a desired definition of community and the features of a problem (size of network, direction of edges, multidimensionality, and so on), this review paper is designed to provide a set of approaches that researchers could focus on.
 
Article
I propose a frequency-domain adaptation of the Expectation Maximization (EM) algorithm to group a family of time series into classes of similar dynamic structure. The method works by viewing the magnitude of the discrete Fourier transform (DFT) of each signal (or its power spectrum) as a probability density/mass function (pdf/pmf) on the unit circle: signals with similar dynamics have similar pdfs, while distinct patterns have distinct pdfs. An advantage of this approach is that it does not rely on any parametric form of the dynamic structure, but can be used for nonparametric, robust, and model-free classification. The new method works for non-stationary signals of similar shape as well as stationary signals with similar auto-correlation structure. Applications to neural spike sorting (non-stationary) and pattern recognition in socio-economic time series (stationary) demonstrate the usefulness and wide applicability of the proposed method.
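A simplified stand-in for the idea (not the author's frequency-domain EM algorithm) is sketched below: each signal's normalized power spectrum is treated as a pmf over frequencies, and the pmfs are then clustered, here by k-means on a square-root (Hellinger-type) embedding purely for illustration.

```python
# A simplified stand-in for the idea (not the author's EM algorithm): treat each
# signal's normalized DFT magnitude as a pmf on frequencies, then cluster the pmfs.
# K-means on the square root of the pmf is used here purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.arange(256)
signals = np.vstack(
    [np.sin(2 * np.pi * t / 16) + 0.3 * rng.normal(size=t.size) for _ in range(10)] +
    [np.sin(2 * np.pi * t / 50) + 0.3 * rng.normal(size=t.size) for _ in range(10)]
)

spec = np.abs(np.fft.rfft(signals, axis=1)) ** 2          # power spectra
spec[:, 0] = 0.0                                          # drop the DC component
pmf = spec / spec.sum(axis=1, keepdims=True)              # one pmf per signal

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.sqrt(pmf))
print(labels)                                             # the two frequency groups separate
```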
 
Article
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Despite the importance of these triadic measures, algorithms to compute them can be extremely expensive. We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of various types of clustering coefficients and triangle counts. Furthermore, these techniques are extensible to counting directed triangles in digraphs. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state of the art, while providing nearly the accuracy of full enumeration.
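A bare-bones version of wedge sampling for the global clustering coefficient, written from the general description above rather than the authors' implementation, looks like this:

```python
# A bare-bones version of wedge sampling: pick a wedge (a path u-v-w centered at v)
# uniformly at random by sampling centers with probability proportional to
# C(deg(v), 2), then check whether the closing edge u-w exists. The fraction of
# closed wedges estimates 3 * #triangles / #wedges (the global clustering coefficient).
import random
from math import comb

def clustering_coefficient_estimate(adj, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    weights = [comb(len(adj[v]), 2) for v in nodes]
    closed = 0
    for _ in range(n_samples):
        v = rng.choices(nodes, weights=weights, k=1)[0]   # center of the wedge
        u, w = rng.sample(sorted(adj[v]), 2)              # two distinct neighbors
        closed += (w in adj[u])
    return closed / n_samples

# Toy graph: two triangles sharing the edge (1, 2), plus a pendant vertex 5.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
print("estimated global clustering coefficient:", clustering_coefficient_estimate(adj))
```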
 
Article
Dynamic model reduction in power systems is necessary for improving computational efficiency. Traditional model reduction using linearized models or offline analysis is not adequate to capture power system dynamic behaviors, especially as the new mix of intermittent generation and intelligent consumption makes the power system more dynamic and nonlinear. Real-time dynamic model reduction thus emerges as an important need. This paper explores the use of clustering techniques to analyze real-time phasor measurements to determine generator groups and representative generators for dynamic model reduction. Two clustering techniques, graph clustering and evolutionary clustering, are studied in this paper. Various implementations of these techniques are compared with one another and with a previously developed Singular Value Decomposition (SVD)-based dynamic model reduction approach. The methods exhibit different levels of accuracy when the reduced model simulation is compared against the original model, but some of them are consistently accurate. From this comparative perspective, this paper provides a good reference point for practical implementations.
 
Article
Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? I review experience with five problems where probabilistic forecasting played an important role. This leads me to identify five types of potential users: Low Stakes Users, who don't need probabilistic forecasts; General Assessors, who need an overall idea of the uncertainty in the forecast; Change Assessors, who need to know if a change is out of line with expectations; Risk Avoiders, who wish to limit the risk of an adverse outcome; and Decision Theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and to consider their goals. Cognitive research tells us that calibration is important for trust in probability forecasts, and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest often seem to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role, but in a limited range of applications.
 
Article
Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of 'residual' network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.
 
Article
Many modern data mining applications are concerned with the analysis of datasets in which the observations are described by paired high-dimensional vectorial representations or "views". Some typical examples can be found in web mining and genomics applications. In this article we present an algorithm for data clustering with multiple views, Multi-View Predictive Partitioning (MVPP), which relies on a novel criterion of predictive similarity between data points. We assume that, within each cluster, the dependence between multivariate views can be modelled by using a two-block partial least squares (TB-PLS) regression model, which performs dimensionality reduction and is particularly suitable for high-dimensional settings. The proposed MVPP algorithm partitions the data such that the within-cluster predictive ability between views is maximised. The proposed objective function depends on a measure of predictive influence of points under the TB-PLS model, which has been derived as an extension of the PRESS statistic commonly used in ordinary least squares regression. Using simulated data, we compare the performance of MVPP to that of competing multi-view clustering methods which rely upon geometric structures of points but ignore the predictive relationship between the two views. State-of-the-art results are obtained on benchmark web mining datasets.
 
Article
Predictions from science and engineering models depend on the values of the model's input parameters. As the number of parameters increases, algorithmic parameter studies like optimization or uncertainty quantification require many more model evaluations. One way to combat this curse of dimensionality is to seek an alternative parameterization with fewer variables that produces comparable predictions. The active subspace is a low-dimensional linear subspace of the space of model inputs that captures the variability in the model's predictions. We describe a method for checking if a model admits an exploitable active subspace, and we apply this method to a single-diode solar cell model. We find that the maximum power of the solar cell has a dominant one-dimensional active subspace in its space of five input parameters.
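A generic sketch of the active subspace check, applied to a made-up test function rather than the single-diode solar cell model, is shown below: sample the inputs, average the outer products of the gradient, and look for a gap in the eigenvalue decay.

```python
# A generic sketch of the active subspace check (not the single-diode solar cell model):
# sample the inputs, average the outer products of the gradient, and inspect the
# eigenvalue decay for a gap. The test function below is made up for illustration.
import numpy as np

def f_grad(x):
    # Toy 5-parameter model whose output varies mostly along one direction a.
    a = np.array([1.0, 0.5, -0.2, 0.1, 0.05])
    return np.exp(a @ x) * a                        # gradient of exp(a . x)

rng = np.random.default_rng(0)
m = 5
samples = rng.uniform(-1, 1, size=(500, m))         # normalized input space [-1, 1]^m
C = np.zeros((m, m))
for x in samples:
    g = f_grad(x)
    C += np.outer(g, g)
C /= len(samples)

eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
print("eigenvalues:", np.round(eigvals, 4))         # a large gap after the first suggests
print("first active direction:", np.round(eigvecs[:, 0], 3))  # a 1-D active subspace
```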
 
Article
This article addresses the problem of classification based on both labeled and unlabeled data, where we assume that the density function for the labeled data differs from that for the unlabeled data. We propose a semi-supervised logistic regression model for this classification problem, along with the technique of covariate shift adaptation. The unknown parameters involved in the proposed models are estimated by regularization with the EM algorithm. A crucial issue in the modeling process is the choice of tuning parameters in our semi-supervised logistic models. In order to select the parameters, a model selection criterion is derived from an information-theoretic approach. Numerical studies show that our modeling procedure performs well in various cases.
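As a simplified stand-in for covariate shift adaptation (not the paper's EM-regularized semi-supervised model or its information criterion), the sketch below estimates the density ratio between unlabeled and labeled covariates with a domain classifier and fits an importance-weighted logistic regression on simulated data.

```python
# A simplified stand-in for covariate shift adaptation: estimate the density ratio
# p_unlabeled(x) / p_labeled(x) with a probabilistic classifier that separates the two
# covariate samples, then fit an importance-weighted logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X_lab = rng.normal(0.0, 1.0, size=(n, 2))                     # labeled covariates
y_lab = (X_lab[:, 0] + 0.5 * X_lab[:, 1] + rng.normal(0, 1, n) > 0).astype(int)
X_unl = rng.normal(1.0, 1.0, size=(n, 2))                     # shifted unlabeled covariates

# Density-ratio estimation via a domain classifier: P(unlabeled | x) / P(labeled | x).
Z = np.vstack([X_lab, X_unl])
d = np.r_[np.zeros(n), np.ones(n)]
dom = LogisticRegression().fit(Z, d)
p_unl = dom.predict_proba(X_lab)[:, 1]
weights = p_unl / (1.0 - p_unl)                               # importance weights on labeled data

clf = LogisticRegression().fit(X_lab, y_lab, sample_weight=weights)
print("weighted coefficients:", np.round(clf.coef_, 3))
```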
 
Article
The wide applicability of kernels makes the problem of max-kernel search ubiquitous and more general than the usual similarity search in metric spaces. We focus on solving this problem efficiently. We begin by characterizing the inherent hardness of the max-kernel search problem with a novel notion of directional concentration. Following that, we present a method to use an O(n log n) algorithm to index any set of objects (points in R^d or abstract objects) directly in the Hilbert space, without any explicit feature representations of the objects in this space. We present the first provably O(log n) algorithm for exact max-kernel search using this index. Empirical results for a variety of data sets as well as abstract objects demonstrate up to four orders of magnitude speedup in some cases. Extensions for approximate max-kernel search are also presented.
 
Conference Paper
Mining labeled subgraphs is a popular research task in data mining because of its potential applications in many different scientific domains. All the existing methods for this task explicitly or implicitly solve the subgraph isomorphism problem, which is computationally expensive, so they suffer from a lack of scalability when the graphs in the input database are large. In this work, we propose FS^3, a sampling-based method that mines a small collection of subgraphs that are most frequent in the probabilistic sense. FS^3 performs Markov Chain Monte Carlo (MCMC) sampling over the space of fixed-size subgraphs such that the potentially frequent subgraphs are sampled more often. In addition, FS^3 is equipped with an innovative queue manager: it stores the sampled subgraphs in a finite queue over the course of mining in such a manner that the top-k positions in the queue contain the most frequent subgraphs. Our experiments on databases of large graphs show that FS^3 is efficient and obtains subgraphs that are the most frequent amongst the subgraphs of a given size.
 
Article
How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph-types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop an efficient method to minimize the description cost; and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.
 
Article
This paper presents a semi-supervised learning algorithm called Gaussian process expectation-maximization (GP-EM), for classification of landcover based on hyperspectral data analysis. Model parameters for each land cover class are first estimated by a supervised algorithm using Gaussian process regressions to find spatially adaptive parameters, and the estimated parameters are then used to initialize a spatially adaptive mixture-of-Gaussians model. The mixture model is updated by expectation-maximization iterations using the unlabeled data, and the spatially adaptive parameters for unlabeled instances are obtained by Gaussian process regressions with soft assignments. Spatially and temporally distant hyperspectral images taken from the Botswana area by the NASA EO-1 satellite are used for experiments. Detailed empirical evaluations show that the proposed framework performs significantly better than all previously reported results by a wide variety of alternative approaches and algorithms on the same datasets.
 
Article
An established method to detect concept drift in data streams is to perform statistical hypothesis testing on the multivariate data in the stream. Statistical theory offers rank-based statistics for this task. However, these statistics depend on a fixed set of characteristics of the underlying distribution. Thus, they work well whenever the change in the underlying distribution affects the properties measured by the statistic, but they do not perform very well if the drift influences the characteristics captured by the test statistic only to a small degree. To address this problem, we show how uniform convergence bounds from learning theory can be adjusted for adaptive concept drift detection. In particular, we present three novel drift detection tests, whose test statistics are dynamically adapted to match the actual data at hand. The first is based on a rank statistic on density estimates for a binary representation of the data, the second compares average margins of a linear classifier induced by the 1-norm support vector machine (SVM), and the last is based on the average zero-one, sigmoid, or stepwise linear error rate of an SVM classifier. We compare these new approaches with the maximum mean discrepancy method, the StreamKrimp system, and the multivariate Wald–Wolfowitz test. The results indicate that the new methods are able to detect concept drift reliably and that they perform favorably in a precision-recall analysis.
 
Article
Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries: they ask for an aggregation over one dimension of the stream elements that satisfy a predicate on another dimension. Since recent events are typically more important than older ones, time decay should also be applied to downweight less significant values. We present space-efficient algorithms as well as space lower bounds for the time-decayed correlated sum, a problem at the heart of many related aggregations. By considering different fundamental classes of decay functions, we separate cases where efficient approximations with relative error or additive error guarantees are possible from cases where linear space is necessary for approximation. In particular, we show that no efficient algorithms with relative error guarantees are possible for the popular sliding window and exponential decay models, resolving an open problem. This negative result for exponential decay holds even if the stream is allowed to be processed in multiple passes. The results are surprising, since efficient approximations are known for other data stream problems under these decay models. This is a step toward better understanding which sophisticated queries can be answered on massive streams using limited memory and computation.
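For concreteness, an exact offline reference implementation of the query itself, not the space-efficient streaming algorithms studied in the paper, is sketched below, with exponential decay as one example decay function and an illustrative threshold predicate.

```python
# An offline reference implementation of the query itself (not a streaming algorithm):
# a time-decayed correlated sum aggregates the x-values of stream elements whose
# y-value satisfies a predicate, each weighted by a decay function of its age.
import math

def decayed_correlated_sum(stream, now, threshold, lam=0.01):
    # stream: iterable of (timestamp, x, y); predicate: y <= threshold
    total = 0.0
    for t, x, y in stream:
        if y <= threshold:
            total += x * math.exp(-lam * (now - t))   # exponential time decay
    return total

stream = [(0, 2.0, 5), (10, 4.0, 12), (20, 1.0, 3), (30, 5.0, 8)]
print(decayed_correlated_sum(stream, now=40, threshold=10))
```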
 
Article
With the advent of contemporary computers, datasets can be massive, too large for direct analysis. One of the many approaches to this problem of size is to aggregate the data according to some appropriate scientific question of interest, with the resulting dataset perforce being one with symbolic-valued observations such as lists, intervals, histograms, and the like. Other datasets, small or large, are naturally symbolic in nature. One aim here is to provide a brief nontechnical overview of symbolic data and discuss how they arise. We also provide brief insights into some of the issues that arise in their analyses. These include the need to take into account the internal variations inherent in symbolic data but not present in classical data. Another issue is that, by the nature of the aggregation, resulting datasets can contain “holes”, or regions that are not possible; these need to be accommodated when, e.g., seemingly interval data are actually some other form of symbolic data (such as histogram data). Also, we show how other forms of complex data differ from symbolic data; e.g., fuzzy data belong to a different domain than symbolic data. Finally, we look at further research needs for the subject. A more technical introduction to symbolic data and available analytic methodology is given by Noirhomme and Brito.
 
Article
In this paper, we propose a new approach to anomaly detection that looks at the latent variable space, as a first step toward latent anomaly detection. Most conventional approaches to anomaly detection are concerned with tracking data that deviate greatly from the ordinary pattern. In this paper, we are instead concerned with how to track changes occurring in the latent variable space, consisting of the meta-information existing behind directly observed data. For example, in the case of masquerade detection, the conventional task is to detect anomalous command lines related to masqueraders' malicious behaviors. We instead attempt to track changes in behavioral patterns such as writing mails, making software, etc., which are at a more abstract level than command lines. The key ideas of the proposed methods are: (i) constructing the model variation vector, which is introduced relative to the latent variable space, and (ii) reducing latent anomaly detection to the issue of change-point detection for the time series that the model variation vector forms. We demonstrate through experimental results using an artificial data set and a UNIX command data set that our method significantly enhances the accuracy of existing anomaly detection methods.
 
Article
Cross-domain text categorization aims at adapting the knowledge learned from a labeled source domain to an unlabeled target domain, where the documents from the source and target domains are drawn from different distributions. However, in spite of the different distributions in raw-word features, the associations between word clusters (conceptual features) and document classes may remain stable across different domains. In this paper, we exploit these unchanged associations as the bridge for knowledge transfer from the source domain to the target domain via non-negative matrix tri-factorization. Specifically, we formulate a joint optimization framework of the two matrix tri-factorizations for the source- and target-domain data, respectively, in which the associations between word clusters and document classes are shared between them. We then give an iterative algorithm for this optimization and theoretically show its convergence. Comprehensive experiments show the effectiveness of this method. In particular, we show that the proposed method can deal with some difficult scenarios where baseline methods usually do not perform well.
 
Article
The classification performance of an associative classification algorithm is strongly dependent on the statistical measure or metric that is used to quantify the strength of the association between features and classes (i.e., confidence, correlation, etc.). Previous studies have shown that classification algorithms produced using different metrics may predict conflicting outputs for the same input, and that the best metric to use is data-dependent and rarely known while designing the algorithm (Veloso et al., Competence-conscious associative classification, Stat Anal Data Min 2(5-6):361-377, 2009; The metric dilemma: competence-conscious associative classification, in Proceedings of the SIAM Data Mining Conference (SDM), SIAM, 2009). This uncertainty concerning the optimal match between metrics and problems is a dilemma, and prevents associative classification algorithms from achieving their maximal performance. A possible solution to this dilemma is to exploit the competence, expertise, or assertiveness of classification algorithms produced using different metrics. The basic idea is that each of these algorithms has a specific sub-domain for which it is most competent (i.e., there is a set of inputs for which this algorithm consistently provides more accurate predictions than algorithms produced using other metrics). In particular, we investigate stacking-based meta-learning methods, which use the training data to find the domain of competence of associative classification algorithms produced using different metrics. The result is a set of competing algorithms that are produced using different metrics. The ability to detect which of these algorithms is the most competent one for a given input leads to new algorithms, which are denoted as competence-conscious associative classification algorithms.
 
Article
Network security has been a serious concern for many years. For example, firewalls often record thousands of exploit attempts on a daily basis. Network administrators could benefit from information on potential aggressive attack sources, as such information can help them proactively defend their networks. For this purpose, several large-scale information sharing systems have been established, in which information on cyberattacks targeting each participant network is shared so that a network can be forewarned of attacks observed by others. However, the total number of reported attackers in these systems is huge. Thus, a challenging problem is to identify the attackers that are most relevant to each individual network (i.e., most likely to come to that network in the near future). We present a framework to estimate the relevance of each attacker with respect to each network. In particular, we model each attacker's relevance as a function over the networks; different attackers have different functions. The distribution of the functions is modeled using a Gaussian process (GP). The relevance function of each attacker is then inferred from the GP, which itself is learned from the collection of attack information. We test our framework on the attack reports in the DShield information sharing system. Experiments show that attackers found relevant to a network by our framework are indeed more likely to come to that network in the future.
 
Article
Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row- or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian cluster ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose a variational approximation based algorithm for learning Bayesian cluster ensembles. BCE is further generalized to deal with the case where the features of the original data points are available, referred to as generalized BCE (GBCE). We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability but also outperforms other algorithms in terms of stability and accuracy. Moreover, GBCE can achieve higher accuracy than BCE, especially with only a small number of available base clusterings.
 
Article
Prior knowledge over general nonlinear sets is incorporated into proximal nonlinear kernel classification problems as linear equalities. The key tool in this incorporation is the conversion of general nonlinear prior knowledge implications into linear equalities in the classification variables without the need to kernelize these implications. These equalities are then included into a proximal nonlinear kernel classification formulation (G. Fung and O. L. Mangasarian, Proximal support vector machine classifiers, in Proceedings KDD-2001: Knowledge Discovery and Data Mining, F. Provost and R. Srikant (eds), San Francisco, CA, New York, Association for Computing Machinery) that is solvable as a system of linear equations. Effectiveness of the proposed formulation is demonstrated on a number of publicly available classification datasets. Nonlinear kernel classifiers for these datasets exhibit marked improvements upon the introduction of nonlinear prior knowledge compared with nonlinear kernel classifiers that do not utilize such knowledge.
 
Article
Real-time prediction problems pose a challenge to machine learning algorithms because learning must be fast, the set of classes may be changing, and the relevance of some features to each class may be changing. To learn robust classifiers in such nonstationary environments, it is essential not to assign too much weight to any single feature. We address this problem by combining regularization mechanisms with online large-margin learning algorithms. We prove bounds on their error and show that removing features with small weights has little influence on prediction accuracy, suggesting that these methods exhibit feature selection ability. We show that such regularized learning algorithms automatically decrease the influence of older training instances and focus on the more recent ones. This makes them especially attractive in dynamic environments. We evaluate our algorithms through experimental results on real data sets and through experiments with an online activity recognition system. The results show that these regularized large-margin methods adapt more rapidly to changing distributions and achieve lower overall error rates than state-of-the-art methods.
 
Article
Finding patterns that are interesting to a user in a certain application context is one of the central goals of data mining research. Regarding all patterns above a certain frequency threshold as interesting is one way of defining interestingness. In this paper, however, we argue that in many applications a different notion of interestingness is required in order to capture “long”, and thus particularly informative, patterns that are correspondingly of low frequency. To identify such patterns, our proposed measure of interestingness is based on the degree or strength of closedness of the patterns. We show (i) that this definition indeed selects long interesting patterns that are difficult to identify with frequency-based approaches, and (ii) that it selects patterns that are robust against noise and/or dynamic changes. We prove that the family of interesting patterns proposed here forms a closure system and use the corresponding closure operator to design a mining algorithm listing these patterns in amortized quadratic time. In particular, for nonsparse datasets its time complexity per pattern, in terms of the number of items n and the size m of the database, is equal to the best known time bound for listing ordinary closed frequent sets, which is a special case of our problem. We also report empirical results with real-world datasets.
 
Article
This paper studies the ensemble selection problem for unsupervised learning. Given a large library of different clustering solutions, our goal is to select a subset of solutions to form a smaller yet better-performing cluster ensemble than using all available solutions. We design our ensemble selection methods based on quality and diversity, the two factors that have been shown to influence cluster ensemble performance. Our investigation revealed that using quality or diversity alone may not consistently achieve improved performance. Based on our observations, we designed three different selection approaches that jointly consider these two factors. We empirically evaluated their performance in comparison with both full ensembles and a random selection strategy. Our results indicate that by explicitly considering both quality and diversity in ensemble selection, we can achieve statistically significant performance improvement over full ensembles.
 
Article
This paper introduces a model for clustering, the Relevant-Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC significance measures can be used to evaluate the relative importance of cluster candidates of various sizes, avoiding the problems of bias found with other shared-neighbor methods that use fixed neighborhood sizes. A scalable clustering heuristic based on the RSC model is also presented and demonstrated for large, high-dimensional datasets using a fast approximate similarity search structure as the oracle.
 
Top-cited authors
Dino Pedreschi
  • Università di Pisa
Michele Coscia
  • Università di Pisa
Fosca Giannotti
  • Italian National Research Council
Eduardo R Hruschka
  • University of São Paulo
Innar Liiv
  • Tallinn University of Technology