Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of features are insignificant, and ideally the contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations among the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. A rank-based agreement test may capture such unwanted effects, while testing contingency tables generated using hard cutoffs may be sensitive to arbitrary threshold choices. We develop a novel method based on feature-level concordance using the local false discovery rate. The association score enjoys a straightforward interpretation. In simulations, the method shows higher statistical power to detect association between p-value lists. We demonstrate its utility in real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.
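As a rough illustration of the idea (not the paper's exact statistic), the sketch below assumes local false discovery rates have already been estimated for each list (e.g., with standard lfdr software), scores concordance by how strongly features appear non-null in both lists, and calibrates the score by permutation; the function names are hypothetical.

```python
import numpy as np

def concordance_score(lfdr1, lfdr2):
    """Illustrative feature-level concordance: features that are likely
    non-null in both lists (small lfdr in both) contribute the most."""
    return np.sum((1.0 - lfdr1) * (1.0 - lfdr2))

def permutation_pvalue(lfdr1, lfdr2, n_perm=2000, seed=0):
    """Assess the score against a null in which features are unrelated
    across the two lists, by permuting one list."""
    rng = np.random.default_rng(seed)
    observed = concordance_score(lfdr1, lfdr2)
    null = np.array([concordance_score(lfdr1, rng.permutation(lfdr2))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Toy example: 1000 features, the first 50 non-null in both lists.
rng = np.random.default_rng(1)
lfdr1 = np.concatenate([rng.uniform(0, 0.2, 50), rng.uniform(0.7, 1.0, 950)])
lfdr2 = np.concatenate([rng.uniform(0, 0.2, 50), rng.uniform(0.7, 1.0, 950)])
print(permutation_pvalue(lfdr1, lfdr2))
```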
Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over elements and element-wise differences of adjacency matrices, for identifying the sparseness structure as well as detecting structural changes over adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives for simulated and real data.
In Prequential analysis, an inference method is viewed as a forecasting system, and the quality of the inference method is based on the quality of its predictions. This is an alternative approach to more traditional statistical methods that focus on the inference of parameters of the data generating distribution. In this paper, we introduce adaptive combined average predictors (ACAPs) for the Prequential analysis of complex data. That is, we use convex combinations of two different model averages to form a predictor at each time step in a sequence. A novel feature of our strategy is that the models in each average are re-chosen adaptively at each time step. To assess the complexity of a given data set, we introduce measures of data complexity for continuous response data. We validate our measures in several simulated contexts prior to using them in real data examples. The performance of ACAPs is compared with the performance of predictors based on stacking or likelihood weighted averaging in several model classes and in both simulated and real data sets. Our results suggest that ACAPs achieve a better trade-off between model list bias and model list variability in cases where the data are very complex. This implies that the choices of model class and averaging method should be guided by a concept of complexity matching, i.e., the analysis of a complex data set may require a more complex model class and averaging strategy than the analysis of a simpler data set. We propose that complexity matching is akin to a bias-variance trade-off in statistical modeling.
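A minimal sketch of the convex-combination step follows, with hypothetical inputs: two prediction streams (standing in for the two model averages) are combined at each time step with weights driven by discounted past squared errors. The adaptive re-choice of models within each average, which is the distinctive part of the ACAP strategy, is not shown.

```python
import numpy as np

def combine_predictors(preds_a, preds_b, y, discount=0.9):
    """Sequentially combine two prediction streams (e.g., two model averages)
    with convex weights driven by discounted past squared errors.
    preds_a, preds_b, y are arrays of equal length, aligned in time."""
    n = len(y)
    err_a = err_b = 0.0
    combined = np.empty(n)
    for t in range(n):
        # Weight each predictor inversely to its accumulated error so far.
        wa = 1.0 / (err_a + 1e-8)
        wb = 1.0 / (err_b + 1e-8)
        alpha = wa / (wa + wb)
        combined[t] = alpha * preds_a[t] + (1 - alpha) * preds_b[t]
        # Update discounted errors after the outcome y[t] is revealed.
        err_a = discount * err_a + (preds_a[t] - y[t]) ** 2
        err_b = discount * err_b + (preds_b[t] - y[t]) ** 2
    return combined
```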
An important task in personalized medicine is to predict disease risk based on a person's genome, e.g. on a large number of single-nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) make SNP and phenotype data available to researchers. A critical question for researchers is how to best predict disease risk. Penalized regression equipped with variable selection, such as LASSO and SCAD, is deemed to be promising in this setting. However, the sparsity assumption taken by the LASSO, SCAD and many other penalized regression techniques may not be applicable here: it is now hypothesized that many common diseases are associated with many SNPs with small to moderate effects. In this article, we use the GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) to investigate the performance of various unpenalized and penalized regression approaches under true sparse or non-sparse models. We find that in general penalized regression outperformed unpenalized regression; SCAD, TLP and LASSO performed best for sparse models, while elastic net regression was the winner, followed by ridge, TLP and LASSO, for non-sparse models.
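For readers who want to run a comparison of this kind on their own data, a hedged sketch using scikit-learn follows; SCAD and TLP are not available there, so only LASSO, ridge, and elastic net are compared, on simulated SNP-like data under a non-sparse model with many small effects.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Simulate SNP-like predictors (0/1/2 minor-allele counts) under a
# non-sparse model: many small effects.
rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = rng.normal(0, 0.05, size=p)          # many small effects
y = X @ beta + rng.normal(0, 1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "lasso": LassoCV(cv=5),
    "ridge": RidgeCV(alphas=np.logspace(-2, 4, 30)),
    "elastic net": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(r2_score(y_te, model.predict(X_te)), 3))
```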
The NCI60 human tumor cell line screen is a public resource for studying selective and non-selective growth inhibition of small molecules against cancer cells. By coupling growth inhibition screening data with biological characterizations of the different cell lines, it becomes possible to infer mechanisms of action underlying some of the observable patterns of selective activity. Using these data, mechanistic relationships have been identified including specific associations between single genes and small families of closely related compounds, and less specific relationships between biological processes involving several cooperating genes and broader families of compounds. Here we aim to characterize the degree to which such specific and general relationships are present in these data. A related question is whether genes tend to act with a uniform mechanism for all associated compounds, or whether multiple mechanisms are commonly involved. We address these two issues in a statistical framework placing special emphasis on the effects of measurement error in the gene expression and chemical screening data. We find that as measurement accuracy increases, the pattern of apparent associations shifts from one dominated by isolated gene/compound pairs, to one in which families consisting of an average of 25 compounds are associated to the same gene. At the same time, the number of genes that appear to play a role in influencing compound activities decreases. For less than half of the genes, the presence of both positive and negative correlations indicates pleiotropic associations with molecules via different mechanisms of action.
Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light-scattering sensor recently developed for real-time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and more importantly the effect of their high mutation rate would not allow for a practical and manageable training. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unmatched bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28-class bacteria dataset and also on the benchmark 26-class letter recognition dataset for further validation. The proposed approach is compared against state-of-the-art involving density-based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes.
Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute-force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, conventional parametric regression models suffer from two issues and thus are not applicable to this problem. First, restricting the regression function to a certain fixed type (e.g. linear, polynomial, etc.) introduces overly strong assumptions that reduce model flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs, due to the stochastic nature of most biological simulations and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright-Fisher model. The experimental results show that our method can predict the parameter values of both models with high accuracy.
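A minimal sketch of the regression idea, with a toy simulator standing in for a real biological model: a Gaussian process maps output summaries back to the parameter, and a white-noise kernel absorbs the simulator's stochastic variability around a fixed parameter value. This is an illustration of the general approach, not the paper's exact model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-in for a stochastic simulator: the "output summary" is a noisy
# function of the parameter theta.
rng = np.random.default_rng(0)
theta_train = rng.uniform(0, 2, size=(200, 1))
outputs = np.hstack([np.sin(3 * theta_train), theta_train ** 2]) \
          + rng.normal(0, 0.1, size=(200, 2))

# Regress the parameter on output summaries; the WhiteKernel absorbs the
# simulator's stochastic variability around a fixed parameter value.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(outputs, theta_train.ravel())

new_output = np.array([[np.sin(3 * 1.2), 1.2 ** 2]])
mean, std = gp.predict(new_output, return_std=True)
print(mean, std)   # point estimate and uncertainty for the parameter
```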
Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum number of SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with an embedded entropy algorithm to handle redundancy when selecting the SNPs with the best predictive performance for disease. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulated data sets and two real disease data sets. Results show that, on average, our proposed method outperforms the well-known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic-regression-based SNP selection for disease classification.
Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary for multicategory classifiers to have sound theoretical properties. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large-scale problems, especially for problems with a large number of classes. Furthermore, the SVM cannot produce class probability estimates directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with a large number of classes, asymptotic consistency, ability to handle high-dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.
Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to the period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. The methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to estimating the period parameter should be accounted for in the clustering procedure.
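A hedged sketch of the periodic-regression step (the mixture-model clustering and the extra variance component are not shown): for each candidate period, a cosinor model is fit by least squares, and the period minimizing the residual sum of squares is retained. The helper name and grid of candidate periods are illustrative.

```python
import numpy as np

def fit_period(t, y, periods):
    """Cosinor-style periodic regression: for each candidate period T, fit
    y ~ a + b*cos(2*pi*t/T) + c*sin(2*pi*t/T) by least squares and keep the
    period with the smallest residual sum of squares."""
    best = None
    for T in periods:
        X = np.column_stack([np.ones_like(t),
                             np.cos(2 * np.pi * t / T),
                             np.sin(2 * np.pi * t / T)])
        coef, rss, _, _ = np.linalg.lstsq(X, y, rcond=None)
        rss = rss[0] if rss.size else np.sum((y - X @ coef) ** 2)
        if best is None or rss < best[1]:
            best = (T, rss, coef)
    return best   # (estimated period, RSS, regression coefficients)

# Example: a peak intensity sampled every hour with a ~24 h rhythm.
t = np.arange(0, 48, 1.0)
y = 5 + 2 * np.cos(2 * np.pi * t / 24 + 0.5) \
      + np.random.default_rng(0).normal(0, 0.3, t.size)
T_hat, rss, coef = fit_period(t, y, periods=np.linspace(6, 36, 301))
print(T_hat)
```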
Successful implementation of feature selection in nuclear magnetic resonance (NMR) spectra not only improves classification ability, but also simplifies the entire modeling process and, thus, reduces computational and analytical efforts. Principal component analysis (PCA) and partial least squares (PLS) have been widely used for feature selection in NMR spectra. However, extracting meaningful metabolite features from the reduced dimensions obtained through PCA or PLS is complicated because these reduced dimensions are linear combinations of a large number of the original features. In this paper, we propose a multiple testing procedure controlling false discovery rate (FDR) as an efficient method for feature selection in NMR spectra. The procedure clearly compensates for the limitation of PCA and PLS and identifies individual metabolite features necessary for classification. In addition, we present orthogonal signal correction to improve classification and visualization by removing unnecessary variations in NMR spectra. Our experimental results with real NMR spectra showed that classification models constructed with the features selected by our proposed procedure yielded smaller misclassification rates than those with all features.
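As an illustration of FDR-controlled feature selection (using the standard Benjamini-Hochberg step-up rule, which may differ in detail from the paper's specific procedure), the sketch below runs feature-wise two-sample t-tests on simulated spectra and keeps the features passing the FDR threshold.

```python
import numpy as np
from scipy.stats import ttest_ind

def fdr_select(X_class0, X_class1, q=0.05):
    """Feature-wise two-sample t-tests followed by Benjamini-Hochberg
    selection at FDR level q. Returns indices of selected features."""
    _, pvals = ttest_ind(X_class0, X_class1, axis=0)
    m = pvals.size
    order = np.argsort(pvals)
    thresh = q * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    return np.sort(order[:k])

# Toy NMR-like example: 300 spectral features, the first 20 differ by class.
rng = np.random.default_rng(0)
A = rng.normal(0, 1, size=(40, 300))
B = rng.normal(0, 1, size=(40, 300))
B[:, :20] += 1.0
print(fdr_select(A, B, q=0.05))
```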
For high-dimensional regression, the number of predictors may greatly exceed the sample size, but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, and consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and the infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound for selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with the adaptivity of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms backward variable selection in terms of price forecasting accuracy.
In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the predictor variables or because the coefficient matrix itself is of lower rank. To accommodate both, we propose reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with a reduced rank constraint on the coefficient matrix to arrive at a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.
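A minimal sketch of one way to combine a ridge penalty with a rank constraint (illustrative, not necessarily the paper's exact estimator): compute the ridge solution, which equals ordinary least squares on ridge-augmented data, and project it onto the leading singular directions of the augmented fitted values.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, rank):
    """Illustrative reduced rank ridge fit: ridge solution followed by a
    projection onto the top-'rank' right singular vectors of the fitted
    values on ridge-augmented data. Details may differ from the paper."""
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])     # ridge as OLS
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    # Right singular vectors of the augmented fitted values define the
    # low-rank projection of the coefficient matrix.
    _, _, Vt = np.linalg.svd(X_aug @ B_ridge, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]
    return B_ridge @ P

# Toy example with a rank-2 coefficient matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
B_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 10))
Y = X @ B_true + rng.normal(scale=0.5, size=(100, 10))
B_hat = reduced_rank_ridge(X, Y, lam=1.0, rank=2)
print(np.linalg.matrix_rank(B_hat))   # 2
```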
Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.
High-dimensional data common in genomics, proteomics, and chemometrics often contains complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors. Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adaptation of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.
Interestingness measures provide information that can be used to prune or
select association rules. A given value of an interestingness measure is often
interpreted relative to the overall range of the values that the
interestingness measure can take. However, properties of individual association
rules restrict the values an interestingness measure can achieve. An
interestingness measure can be standardized to take this into account, but to
date this has been done for only one interestingness measure, namely the lift.
Standardization provides greater insight than the raw value and may even alter
researchers' perception of the data. We derive standardized analogues of three
interestingness measures and use real and simulated data to compare them to
their raw versions, each other, and the standardized lift.
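As a concrete illustration of the standardization idea for the lift, the sketch below rescales the raw lift by the range it can attain given only the marginal supports of the antecedent and consequent; published standardizations also fold in minimum support and confidence thresholds, so this is a simplified version with an illustrative function name.

```python
def standardized_lift(n, n_a, n_b, n_ab):
    """Lift of rule A -> B rescaled to [0, 1] using the range it can attain
    given the marginal counts of A and B alone (a simplified standardization;
    published versions also incorporate support/confidence thresholds).
    n: transactions, n_a/n_b: supports of A and B, n_ab: joint support."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    lift = p_ab / (p_a * p_b)
    upper = min(p_a, p_b) / (p_a * p_b)              # P(A,B) <= min(P(A), P(B))
    lower = max(p_a + p_b - 1, 1 / n) / (p_a * p_b)  # Frechet lower bound
    return (lift - lower) / (upper - lower)

# Raw lift is only ~1.08 here, but it sits near the top of its attainable range.
print(standardized_lift(n=1000, n_a=900, n_b=800, n_ab=780))
```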
This review article considers some of the most common methods used in
astronomy for regressing one quantity against another in order to estimate the
model parameters or to predict an observationally expensive quantity using
trends between object values. These methods have to tackle some of the awkward
features prevalent in astronomical data, namely heteroscedastic
(point-dependent) errors, intrinsic scatter, non-ignorable data collection and
selection effects, data structure and non-uniform population (often called
Malmquist bias), non-Gaussian data, outliers and mixtures of regressions. We
outline how least squares fits, weighted least squares methods, Maximum
Likelihood, survival analysis, and Bayesian methods have been applied in the
astrophysics literature when one or more of these features is present. In
particular we concentrate on errors-in-variables regression and we advocate
Bayesian techniques.
We review the use of Bayesian Model Averaging in astrophysics. We first
introduce the statistical basis of Bayesian Model Selection and Model
Averaging. We discuss methods to calculate the model-averaged posteriors,
including Markov Chain Monte Carlo (MCMC), nested sampling, Population Monte
Carlo, and Reversible Jump MCMC. We then review some applications of Bayesian
Model Averaging in astrophysics, including measurements of the dark energy and
primordial power spectrum parameters in cosmology, cluster weak lensing and
Sunyaev-Zel'dovich effect data, estimating distances to Cepheids, and
classifying variable stars.
Expanding a lower-dimensional problem to a higher-dimensional space and then
projecting back is often beneficial. This article rigorously investigates this
perspective in the context of finite mixture models, namely how to improve
inference for mixture models by using auxiliary variables. Despite the large
literature in mixture models and several empirical examples, there is no
previous work that gives general theoretical justification for including
auxiliary variables in mixture models, even for special cases. We provide a
theoretical basis for comparing inference for multivariate mixture models with
the corresponding inference for marginal univariate mixture models. Analytical
results for several special cases are established. We show that the probability
of correctly allocating mixture memberships and the information number for the
means of the primary outcome in a bivariate model with two Gaussian mixtures
are generally larger than those in each univariate model. Simulations under a
range of scenarios, including misspecified models, are conducted to examine the
improvement. The method is illustrated by two real applications in ecology and
causal inference.
Many businesses are using recommender systems for marketing outreach.
Recommendation algorithms can be either based on content or driven by
collaborative filtering. We study different ways to incorporate content
information directly into the matrix factorization approach of collaborative
filtering. These content-boosted matrix factorization algorithms not only
improve recommendation accuracy, but also provide useful insights about the
contents, as well as make recommendations more easily interpretable.
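A hypothetical, simplified variant of such an algorithm is sketched below: each item's latent vector is the sum of a free part and a linear map of its content features, trained by stochastic gradient descent on observed ratings. This is for illustration only and is not the exact model studied.

```python
import numpy as np

def content_boosted_mf(ratings, item_content, n_users, n_items, k=10,
                       lr=0.01, reg=0.05, epochs=30, seed=0):
    """Hypothetical content-boosted MF: each item's latent vector is
    V[j] + A @ x_j, a free part plus a linear map of the item's content
    features x_j. Trained by SGD on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    d = item_content.shape[1]
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    A = 0.1 * rng.normal(size=(k, d))
    mu = np.mean([r for _, _, r in ratings])
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            i, j, r = ratings[idx]
            q = V[j] + A @ item_content[j]
            err = r - (mu + U[i] @ q)
            u_old = U[i].copy()
            U[i] += lr * (err * q - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
            A += lr * (err * np.outer(u_old, item_content[j]) - reg * A)
    return mu, U, V, A

# Toy usage: 3 users, 4 items with 2 content features each.
content = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
obs = [(0, 0, 5.0), (0, 1, 1.0), (1, 2, 4.0), (2, 3, 3.0), (1, 0, 4.5)]
mu, U, V, A = content_boosted_mf(obs, content, n_users=3, n_items=4, k=4)
print(mu + U[0] @ (V[2] + A @ content[2]))   # predicted rating for user 0, item 2
```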
We present a technique for spatiotemporal data analysis called nonlinear
Laplacian spectral analysis (NLSA), which generalizes singular spectrum
analysis (SSA) to take into account the nonlinear manifold structure of complex
data sets. The key principle underlying NLSA is that the functions used to
represent temporal patterns should exhibit a degree of smoothness on the
nonlinear data manifold M, a constraint absent from classical SSA. NLSA
enforces such a notion of smoothness by requiring that temporal patterns belong
in low-dimensional Hilbert spaces V_l spanned by the leading l Laplace-Beltrami
eigenfunctions on M. These eigenfunctions can be evaluated efficiently in high
ambient-space dimensions using sparse graph-theoretic algorithms. Moreover,
they provide orthonormal bases to expand a family of linear maps, whose
singular value decomposition leads to sets of spatiotemporal patterns at
progressively finer resolution on the data manifold. The Riemannian measure of
M and an adaptive graph kernel width enhance the capability of NLSA to detect
important nonlinear processes, including intermittency and rare events. The
minimum dimension of V_l required to capture these features while avoiding
overfitting is estimated here using spectral entropy criteria.
This paper brings explicit considerations of distributed computing
architectures and data structures into the rigorous design of Sequential Monte
Carlo (SMC) methods. A theoretical result established recently by the authors
shows that adapting interaction between particles to suitably control the
Effective Sample Size (ESS) is sufficient to guarantee stability of SMC
algorithms. Our objective is to leverage this result and devise algorithms
which are thus guaranteed to work well in a distributed setting. We make three
main contributions to achieve this. Firstly, we study mathematical properties
of the ESS as a function of matrices and graphs that parameterize the
interaction amongst particles. Secondly, we show how these graphs can be
induced by tree data structures which model the logical network topology of an
abstract distributed computing environment. Thirdly, we present efficient
distributed algorithms that achieve the desired ESS control, perform resampling
and operate on forests associated with these trees.
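The ESS-based trigger at the heart of this adaptive interaction is simple to state; a minimal, non-distributed sketch follows, with multinomial resampling standing in for the forest-based schemes developed in the paper, and illustrative function names.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS of a set of (possibly unnormalized) importance weights:
    (sum w)^2 / sum w^2, which ranges from 1 to N."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def maybe_resample(particles, weights, threshold=0.5, rng=None):
    """Trigger (multinomial) resampling only when the ESS drops below a
    fraction of the particle count -- the adaptive interaction rule whose
    stability properties motivate the algorithms above.
    particles: array indexed along axis 0, e.g. an (N, d) numpy array."""
    rng = rng or np.random.default_rng()
    n = len(weights)
    if effective_sample_size(weights) < threshold * n:
        probs = np.asarray(weights) / np.sum(weights)
        idx = rng.choice(n, size=n, p=probs)
        return particles[idx], np.full(n, 1.0 / n)
    return particles, weights

# Example: weights collapse onto a few particles, triggering resampling.
rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 2))
weights = np.exp(-10 * particles[:, 0] ** 2)
print(effective_sample_size(weights))
particles, weights = maybe_resample(particles, weights, rng=rng)
```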
Binary classification is a common statistical learning problem in which a
model is estimated on a set of covariates for some outcome indicating the
membership of one of two classes. In the literature, there exists a distinction
between hard and soft classification. In soft classification, the conditional
class probability is modeled as a function of the covariates. In contrast, hard
classification methods only target the optimal prediction boundary. While hard
and soft classification methods have been studied extensively, not much work
has been done to compare the actual tasks of hard and soft classification. In
this paper we propose a spectrum of statistical learning problems which span
the hard and soft classification tasks based on fitting multiple decision rules
to the data. By doing so, we reveal a novel collection of learning tasks of
increasing complexity. We study the problems using the framework of
large-margin classifiers and a class of piecewise linear convex surrogates, for
which we derive statistical properties and a corresponding sub-gradient descent
algorithm. We conclude by applying our approach to simulation settings and a
magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease
Neuroimaging Initiative (ADNI) study.
In the last few years many real-world networks have been found to show a
so-called community structure organization. Much effort has been devoted in the
literature to developing methods and algorithms that can efficiently highlight
this hidden structure of the network, traditionally by partitioning the graph.
Since network representations can be very complex and can contain many variants
of the traditional graph model, each algorithm in the literature focuses on
some of these properties and establishes, explicitly or implicitly, its own
definition of community. According to this definition, it then extracts the
communities that reflect only some of the features of real
communities. The aim of this survey is to provide a manual for the community
discovery problem. Given a meta definition of what a community in a social
network is, our aim is to organize the main categories of community discovery
based on their own definition of community. Given a desired definition of
community and the features of a problem (size of network, direction of edges,
multidimensionality, and so on) this review paper is designed to provide a set
of approaches that researchers could focus on.
I propose a frequency domain adaptation of the Expectation Maximization (EM)
algorithm to group a family of time series into classes of similar dynamic
structure. It does this by viewing the magnitude of the discrete Fourier
transform (DFT) of each signal (or power spectrum) as a probability
density/mass function (pdf/pmf) on the unit circle: signals with similar
dynamics have similar pdfs; distinct patterns have distinct pdfs. An advantage
of this approach is that it does not rely on any parametric form of the dynamic
structure, but can be used for non-parametric, robust and model-free
classification. This new method works for non-stationary signals of similar
shape as well as stationary signals with similar auto-correlation structure.
Applications to neural spike sorting (non-stationary) and pattern-recognition
in socio-economic time series (stationary) demonstrate the usefulness and wide
applicability of the proposed method.
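A rough stand-in for the proposed procedure is sketched below: it normalizes each signal's DFT magnitude to a pmf and then applies hierarchical clustering on pairwise divergences, rather than the frequency-domain EM itself. Equal-length signals are assumed, and the function names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def spectral_pmf(x):
    """Normalized magnitude of the (one-sided) DFT, treated as a pmf over
    frequencies: signals with similar dynamics give similar pmfs."""
    mag = np.abs(np.fft.rfft(x - np.mean(x)))
    return mag / mag.sum()

def symmetric_kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def cluster_signals(signals, n_clusters):
    """Hierarchical clustering on pairwise divergences between spectral pmfs
    (a simple stand-in for the frequency-domain EM; equal-length signals)."""
    pmfs = [spectral_pmf(x) for x in signals]
    n = len(pmfs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = symmetric_kl(pmfs[i], pmfs[j])
    Z = linkage(squareform(D), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy usage: two clusters of noisy sinusoids with different frequencies.
rng = np.random.default_rng(0)
t = np.arange(256)
signals = [np.sin(2 * np.pi * f * t / 256) + rng.normal(0, 0.3, t.size)
           for f in [5] * 10 + [20] * 10]
print(cluster_signals(signals, n_clusters=2))
```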
Graphs are used to model interactions in a variety of contexts, and there is
a growing need to quickly assess the structure of such graphs. Some of the most
useful graph metrics are based on \emph{triangles}, such as those measuring
social cohesion. Despite the importance of these triadic measures, algorithms
to compute them can be extremely expensive. We discuss the method of
\emph{wedge sampling}. This versatile technique allows for the fast and
accurate approximation of various types of clustering coefficients and triangle
counts. Furthermore, these techniques are extensible to counting directed
triangles in digraphs. Our methods come with \emph{provable} and practical
time-approximation tradeoffs for all computations. We provide extensive results
that show our methods are orders of magnitude faster than the state of the art,
while providing nearly the accuracy of full enumeration.
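A minimal sketch of uniform wedge sampling for the global clustering coefficient and total triangle count (undirected case only; the directed extensions are not shown):

```python
import numpy as np

def wedge_sampling_clustering_coef(adj, n_samples=10000, seed=0):
    """Estimate the global clustering coefficient (fraction of closed wedges)
    by uniform wedge sampling. adj: dict mapping node -> set of neighbors."""
    rng = np.random.default_rng(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    # Number of wedges centered at v is C(deg(v), 2); sample centers
    # proportionally so that wedges are drawn uniformly overall.
    wedge_counts = np.array([len(adj[v]) * (len(adj[v]) - 1) / 2 for v in nodes])
    probs = wedge_counts / wedge_counts.sum()
    closed = 0
    centers = rng.choice(len(nodes), size=n_samples, p=probs)
    for c in centers:
        nbrs = list(adj[nodes[c]])
        i, j = rng.choice(len(nbrs), size=2, replace=False)
        if nbrs[j] in adj[nbrs[i]]:
            closed += 1
    kappa = closed / n_samples              # estimated fraction of closed wedges
    triangles = kappa * wedge_counts.sum() / 3
    return kappa, triangles

# Example on a small toy graph (triangle 0-1-2 plus a pendant edge 2-3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(wedge_sampling_clustering_coef(adj, n_samples=5000))
```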
Dynamic model reduction in power systems is necessary for improving
computational efficiency. Traditional model reduction based on linearized
models or offline analysis is not adequate to capture power system dynamic
behaviors, especially since the new mix of intermittent generation and
intelligent consumption makes the power system more dynamic and nonlinear. Real-time
dynamic model reduction emerges as an important need. This paper explores the
use of clustering techniques to analyze real-time phasor measurements to
determine generator groups and representative generators for dynamic model
reduction. Two clustering techniques -- graph clustering and evolutionary
clustering -- are studied in this paper. Various implementations of these
techniques are compared and also compared with a previously developed Singular
Value Decomposition (SVD)-based dynamic model reduction approach. The methods
exhibit different levels of accuracy when the reduced-model simulation is
compared against the original model, but some of them are consistently
accurate. From this comparative perspective, this paper provides a good
reference point for practical implementations.
Probabilistic forecasts are becoming more and more available. How should they
be used and communicated? What are the obstacles to their use in practice? I
review experience with five problems where probabilistic forecasting played an
important role. This leads me to identify five types of potential users: Low
Stakes Users, who don't need probabilistic forecasts; General Assessors, who
need an overall idea of the uncertainty in the forecast; Change Assessors, who
need to know if a change is out of line with expectations; Risk Avoiders, who
wish to limit the risk of an adverse outcome; and Decision Theorists, who
quantify their loss function and perform the decision-theoretic calculations.
This suggests that it is important to interact with users and to consider their
goals. Cognitive research tells us that calibration is important for trust
in probability forecasts, and that it is important to match the verbal
expression with the task. The cognitive load should be minimized, reducing the
probabilistic forecast to a single percentile if appropriate. Probabilities of
adverse events and percentiles of the predictive distribution of quantities of
interest seem often to be the best way to summarize probabilistic forecasts.
Formal decision theory has an important role, but in a limited range of
applications.
Latent variable models are frequently used to identify structure in
dichotomous network data, in part because they give rise to a Bernoulli product
likelihood that is both well understood and consistent with the notion of
exchangeable random graphs. In this article we propose conservative confidence
sets that hold with respect to these underlying Bernoulli parameters as a
function of any given partition of network nodes, enabling us to assess
estimates of 'residual' network structure, that is, structure that cannot be
explained by known covariates and thus cannot be easily verified by manual
inspection. We demonstrate the proposed methodology by analyzing student
friendship networks from the National Longitudinal Survey of Adolescent Health
that include race, gender, and school year as covariates. We employ a
stochastic expectation-maximization algorithm to fit a logistic regression
model that includes these explanatory variables as well as a latent stochastic
blockmodel component and additional node-specific effects. Although
maximum-likelihood estimates do not appear consistent in this context, we are
able to evaluate confidence sets as a function of different blockmodel
partitions, which enables us to qualitatively assess the significance of
estimated residual network structure relative to a baseline, which models
covariates but lacks block structure.
Many modern data mining applications are concerned with the analysis of
datasets in which the observations are described by paired high-dimensional
vectorial representations or "views". Some typical examples can be found in web
mining and genomics applications. In this article we present an algorithm for
data clustering with multiple views, Multi-View Predictive Partitioning (MVPP),
which relies on a novel criterion of predictive similarity between data points.
We assume that, within each cluster, the dependence between multivariate views
can be modelled by using a two-block partial least squares (TB-PLS) regression
model, which performs dimensionality reduction and is particularly suitable for
high-dimensional settings. The proposed MVPP algorithm partitions the data such
that the within-cluster predictive ability between views is maximised. The
proposed objective function depends on a measure of predictive influence of
points under the TB-PLS model which has been derived as an extension of the
PRESS statistic commonly used in ordinary least squares regression. Using
simulated data, we compare the performance of MVPP to that of competing
multi-view clustering methods which rely upon geometric structures of points,
but ignore the predictive relationship between the two views. State-of-the-art
results are obtained on benchmark web mining datasets.
Predictions from science and engineering models depend on the values of the
model's input parameters. As the number of parameters increases, algorithmic
parameter studies like optimization or uncertainty quantification require many
more model evaluations. One way to combat this curse of dimensionality is to
seek an alternative parameterization with fewer variables that produces
comparable predictions. The active subspace is a low-dimensional linear
subspace of the space of model inputs that captures the variability in the
model's predictions. We describe a method for checking if a model admits an
exploitable active subspace, and we apply this method to a single-diode solar
cell model. We find that the maximum power of the solar cell has a dominant
one-dimensional active subspace in its space of five input parameters.
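A minimal sketch of the basic active-subspace check, using sampled gradients of a toy model (not the solar cell model itself): estimate the matrix C = E[grad f grad f^T], eigendecompose it, and look for a gap in the spectrum.

```python
import numpy as np

def active_subspace(grad_samples):
    """Estimate C = E[grad f grad f^T] from sampled gradients and return its
    eigenvalues/eigenvectors; a large gap after the leading eigenvalue(s)
    suggests an exploitable low-dimensional active subspace."""
    G = np.asarray(grad_samples)              # shape (n_samples, n_params)
    C = G.T @ G / G.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Toy model f(x) = sin(w . x): every gradient is parallel to w, so the
# estimated matrix C is (numerically) rank one.
rng = np.random.default_rng(0)
w = np.array([1.0, 0.5, 0.1, 0.05, 0.01])
X = rng.uniform(-1, 1, size=(500, 5))
grads = np.cos(X @ w)[:, None] * w            # gradient of sin(w . x)
vals, vecs = active_subspace(grads)
print(vals / vals.max())                      # sharp drop after the first eigenvalue
```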
This article addresses the problem of classification based on both labeled and
unlabeled data, where we assume that the density function for labeled data
differs from that for unlabeled data. We propose a semi-supervised logistic
regression model for this classification problem, combined with the technique
of covariate shift adaptation. Unknown parameters in the proposed models are
estimated by regularization with the EM algorithm. A crucial issue in the
modeling process is the choices of tuning parameters in our semi-supervised
logistic models. In order to select the parameters, a model selection criterion
is derived from an information-theoretic approach. Some numerical studies show
that our modeling procedure performs well in various cases.
The wide applicability of kernels makes the problem of max-kernel search
ubiquitous and more general than the usual similarity search in metric spaces.
We focus on solving this problem efficiently. We begin by characterizing the
inherent hardness of the max-kernel search problem with a novel notion of
directional concentration. Following that, we present a method to use an $O(n
\log n)$ algorithm to index any set of objects (points in $\mathbb{R}^d$ or
abstract objects) directly in the Hilbert space without any explicit feature
representations of the objects in this space. We present the first provably
$O(\log n)$ algorithm for exact max-kernel search using this index. Empirical
results for a variety of data sets as well as abstract objects demonstrate up
to 4 orders of magnitude speedup in some cases. Extensions for approximate
max-kernel search are also presented.
Mining labeled subgraphs is a popular research task in data mining because of
its potential application in many different scientific domains. All existing
methods for this task explicitly or implicitly solve the subgraph isomorphism
problem, which is computationally expensive, so they suffer from a lack of
scalability when the graphs in the input database are large. In this work, we
propose FS^3, a sampling-based method that mines a small collection of
subgraphs that are most frequent in the probabilistic sense. FS^3 performs
Markov Chain Monte Carlo (MCMC) sampling over the space of fixed-size subgraphs
such that potentially frequent subgraphs are sampled more often. In addition,
FS^3 is equipped with an innovative queue manager, which stores the sampled
subgraphs in a finite queue over the course of mining in such a manner that the
top-k positions in the queue contain the most frequent subgraphs. Our
experiments on databases of large graphs show that FS^3 is efficient and
obtains subgraphs that are the most frequent amongst the subgraphs of a given
size.
How can we succinctly describe a million-node graph with a few simple
sentences? How can we measure the "importance" of a set of discovered subgraphs
in a large graph? These are exactly the problems we focus on. Our main ideas
are to construct a "vocabulary" of subgraph-types that often occur in real
graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the
most succinct description of a graph in terms of this vocabulary. We measure
success in a well-founded way by means of the Minimum Description Length (MDL)
principle: a subgraph is included in the summary if it decreases the total
description length of the graph.
Our contributions are three-fold: (a) formulation: we provide a principled
encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop
an efficient method to minimize the description cost; and (c)
applicability: we report experimental results on multi-million-edge real
graphs, including Flickr and the Notre Dame web graph.
The classification performance of an associative classification algorithm is
strongly dependent on the statistical measure or metric that is used to
quantify the strength of the association between features and classes (i.e.,
confidence, correlation, etc.). Previous studies have shown that classification
algorithms produced using different metrics may predict conflicting outputs for
the same input, and that the best metric to use is data-dependent and rarely
known while designing the algorithm (Veloso et al. Competence-conscious
associative classification. Stat Anal Data Min 2(5-6):361-377, 2009; The metric
dilemma: competence-conscious associative classification. In: Proceedings of
the SIAM Data Mining Conference (SDM). SIAM, 2009). This uncertainty concerning
the optimal match between metrics and problems is a dilemma, and prevents
associative classification algorithms from achieving their maximal performance.
A possible solution to this dilemma is to exploit the competence, expertise, or
assertiveness of classification algorithms produced using different metrics.
The basic idea is that each of these algorithms has a specific sub-domain for
which it is most competent (i.e., there is a set of inputs for which this
algorithm consistently provides more accurate predictions than algorithms
produced using other metrics). In particular, we investigate stacking-based
meta-learning methods, which use the training data to find the domain of
competence of associative classification algorithms produced using different
metrics. The result is a set of competing algorithms that are produced using
different metrics. The ability to detect which of these algorithms is the most
competent one for a given input leads to new algorithms, which are denoted as
competence-conscious associative classification algorithms.
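As a generic illustration of stacking-based meta-learning (with off-the-shelf scikit-learn base learners standing in for associative classifiers built with different interestingness metrics, and synthetic data in place of a real benchmark), a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Generic base learners stand in for associative classifiers built with
# different metrics; the meta-learner is trained on their cross-validated
# predictions and learns which base model to trust for which inputs.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
base = [("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000))]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())
```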