ABSTRACT: A multivariate decision tree attempts to improve upon the single variable split in a traditional tree. With the increase in data sets with many features and a small number of labeled instances in a variety of domains (e.g., bioinformatics, text mining, etc.), a traditional tree-based approach with a greedy variable selection at a node may omit important information. Therefore, the recursive partitioning idea of a simple decision tree combined with the intrinsic feature selection of $L_1$ regularized logistic regression at each node is a natural choice for a multivariate tree model that is simple, but broadly applicable. This natural solution leads to the sparse multivariate tree (SMT) considered here.
SMT can naturally handle non-time-series data and is extended to handle time-series classification problems with the power of extracting interpretable temporal patterns (e.g., means, slopes, deviations). Binary $L_1$ regularized logistic regression models are used here for binary classification problems. However, SMT may be extended to solve multi-class problems with multinomial logistic regression models. The accuracy and computational efficiency of SMT are compared to a large number of competitors on time-series and non-time-series data.
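The node model described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses scikit-learn's L1-penalized logistic regression on synthetic data (all names and settings here are assumptions) to show how one multivariate split performs intrinsic feature selection and routes instances to child nodes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data in the regime the abstract describes: many features,
# few labeled instances.
X = rng.normal(size=(60, 200))
y = (X[:, 3] - 2 * X[:, 7] > 0).astype(int)

# One SMT-style node: an L1-penalized logistic model acts as a
# multivariate split and performs intrinsic feature selection.
node = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
node.fit(X, y)

# Most coefficients are driven to zero; the survivors are the features
# this node "selected".
selected = np.flatnonzero(node.coef_[0])
print("features kept at this node:", selected)

# Instances are routed left/right by the model's decision, and the
# procedure would recurse on each child as in an ordinary tree.
left = X[node.predict(X) == 0]
right = X[node.predict(X) == 1]
print(len(left), "left,", len(right), "right")
```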
ABSTRACT: The regularized random forest (RRF) was recently proposed for feature selection by building only one ensemble. In RRF, the features are evaluated on a part of the training data at each tree node. We derive an upper bound for the number of distinct Gini information gain values in a node, and show that many features can share the same information gain at a node with a small number of instances and a large number of features. Therefore, in a node with a small number of instances, RRF is likely to select a feature that is not strongly relevant. Here an enhanced RRF, referred to as the guided RRF (GRRF), is proposed. In GRRF, the importance scores from an ordinary random forest (RF) are used to guide the feature selection process in RRF. Experiments on 10 gene data sets show that the accuracy of GRRF is, in general, more robust than that of RRF when their parameters change. GRRF is computationally efficient, can select compact feature subsets, and has competitive accuracy compared to RRF, varSelRF, and LASSO logistic regression (with evaluations from an RF classifier). Also, RF applied to the features selected by RRF with the minimal regularization outperforms RF applied to all the features for most of the data sets considered here. Therefore, if accuracy is considered more important than the size of the feature subset, RRF with the minimal regularization may be preferred. We use the accuracy of RF, a strong classifier, to evaluate feature selection methods, and illustrate that weak classifiers are less capable of capturing the information contained in a feature subset. Both RRF and GRRF are implemented in the "RRF" R package available on CRAN, the official R package archive.
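The guidance step in GRRF can be sketched as follows. This is a hedged reading of the abstract, not the RRF package: an ordinary RF supplies importance scores, which are blended (via a hypothetical mixing weight `gamma`) into per-feature penalty coefficients that would multiply the gain of any feature not yet used in a split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

# Step 1: an ordinary RF provides importance scores.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = rf.feature_importances_ / rf.feature_importances_.max()

# Step 2: importance-guided penalty coefficients. A feature not yet
# used in the ensemble has its gain multiplied by lam[i] <= 1, so
# weakly important features are discouraged from entering.
gamma = 0.5          # hypothetical blending weight
lam = (1.0 - gamma) + gamma * imp

# In the regularized ensemble, a node would then compare
#   gain(X_i)            if feature i appeared in a previous split,
#   lam[i] * gain(X_i)   otherwise.
print("penalty range:", lam.min(), "to", lam.max())
```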
ABSTRACT: Time series classification is an important task with many challenging applications. A nearest neighbor (NN) classifier with dynamic time warping (DTW) distance is a strong solution in this context. On the other hand, feature-based approaches have been proposed both as classifiers and as a means to provide insight into the series, but these approaches have problems handling translations and dilations in local patterns. Considering these shortcomings, we present a framework to classify time series based on a bag-of-features representation (TSBF). Multiple subsequences selected from random locations and of random lengths are partitioned into shorter intervals to capture local information. Consequently, features computed from these subsequences measure properties at different locations and dilations when viewed from the original series. This provides a feature-based approach that can handle warping (although differently from DTW). Moreover, a supervised learner (that handles mixed data types, different units, etc.) integrates location information into a compact codebook through class probability estimates. Additionally, relevant global features can easily supplement the codebook. TSBF is compared to NN classifiers and other alternatives (bag-of-words strategies, sparse spatial sample kernels, shapelets). Our experimental results show that TSBF provides better results than competitive methods on benchmark datasets from the UCR time series database.
IEEE Transactions on Pattern Analysis and Machine Intelligence 04/2013; 35(11):2796-2802. · 4.80 Impact Factor
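The subsequence-and-interval step of TSBF can be illustrated on a synthetic series. This sketch (all sizes and feature choices are assumptions, not the paper's exact configuration) samples random-location, random-length subsequences, partitions each into shorter intervals, and summarizes the intervals with simple features plus the subsequence location.

```python
import numpy as np

rng = np.random.default_rng(1)
# A synthetic time series standing in for real data.
series = np.sin(np.linspace(0, 6 * np.pi, 300)) + 0.1 * rng.normal(size=300)

def subsequence_features(ts, n_sub=10, n_intervals=4, rng=rng):
    """Sample random-location, random-length subsequences and summarize
    each of their intervals with simple features (mean, std, slope)."""
    feats = []
    for _ in range(n_sub):
        length = rng.integers(n_intervals * 5, len(ts) // 2)
        start = rng.integers(0, len(ts) - length)
        sub = ts[start:start + length]
        row = [start / len(ts)]            # location information
        for part in np.array_split(sub, n_intervals):
            x = np.arange(len(part))
            slope = np.polyfit(x, part, 1)[0]
            row += [part.mean(), part.std(), slope]
        feats.append(row)
    return np.array(feats)

F = subsequence_features(series)
print(F.shape)   # one row per subsequence: location + 3 features per interval
```

In TSBF proper, a supervised learner would then turn such rows into class probability estimates that are aggregated into a compact codebook per series.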
ABSTRACT: We propose a tree ensemble method, referred to as time series forest (TSF), for time series classification. TSF employs a combination of entropy gain and a distance measure, referred to as the Entrance (entropy and distance) gain, for evaluating splits. Experimental studies show that the Entrance gain criterion improves the accuracy of TSF. TSF randomly samples features at each tree node, has computational complexity linear in the length of the time series, and can be built using parallel computing techniques such as the multi-core computing used here. The temporal importance curve is also proposed to capture the important temporal characteristics useful for classification. Experimental studies show that TSF using simple features such as mean, deviation, and slope outperforms strong competitors such as one-nearest-neighbor classifiers with dynamic time warping, is computationally efficient, and can provide insights into the temporal characteristics.
Information Sciences 01/2013; 239:142-153. · 3.64 Impact Factor
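The three simple interval features the TSF abstract names (mean, deviation, slope) are easy to sketch. This is an illustration under assumed sizes, not the paper's code: a node would draw random intervals and evaluate candidate splits on these feature values.

```python
import numpy as np

def interval_features(ts, start, end):
    """The three TSF-style interval features: mean, standard deviation, slope."""
    seg = ts[start:end]
    x = np.arange(len(seg))
    slope = np.polyfit(x, seg, 1)[0]
    return np.array([seg.mean(), seg.std(), slope])

rng = np.random.default_rng(0)
ts = np.cumsum(rng.normal(size=120))        # a synthetic series

# A TSF node randomly samples intervals as candidate features; here we
# just draw ten candidates and compute their feature vectors.
candidates = []
for _ in range(10):
    s = int(rng.integers(0, 100))
    e = s + int(rng.integers(3, 21))
    candidates.append(interval_features(ts, s, e))
feats = np.array(candidates)
print(feats.shape)
```

The Entrance gain would then score candidate split thresholds on these values using entropy gain plus a distance term; that criterion itself is not reproduced here.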
ABSTRACT: Phenotypic characterization of individual cells provides crucial insights into intercellular heterogeneity and enables access to information that is unavailable from ensemble-averaged, bulk-cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the development of a variety of commercially available and research-grade technologies. To quantify cell-to-cell variability of cell populations, we have developed an experimental platform for real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique challenges inherent to these single-cell measurements arise, and no existing data analysis methodology is available to address them. Here we present a data processing and analysis method that addresses the challenges encountered with this unique type of data in order to extract biologically relevant information. We applied the method to analyze OC profiles obtained from single cells of two different cell lines derived from metaplastic and dysplastic human Barrett's esophageal epithelium. In terms of method development, three main challenges were considered for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types. Several strategies and solutions to address each of these three challenges are presented. Features such as slopes, intercepts, and breakpoints (change-points) were extracted from every OC profile and compared across individual cells and cell types. The results demonstrated that the extracted features facilitated exposition of subtle differences between individual cells and their responses to cell-cell interactions.
With minor modifications, this method can be used to process and analyze data from other acquisition and experimental modalities at the single-cell level, providing a valuable statistical framework for single-cell analysis.
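The slope/intercept/change-point extraction described above can be sketched with a brute-force single-breakpoint fit. The profile shape and all parameters below are hypothetical stand-ins for OC data, not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
# Hypothetical OC-like profile: the consumption slope changes at t = 60.
y = np.where(t < 60, -0.02 * t, -1.2 - 0.08 * (t - 60)) + 0.05 * rng.normal(size=100)

def breakpoint_features(t, y, min_seg=5):
    """Brute-force one change-point: fit a line to each side of every
    candidate split and keep the split with the smallest total squared error."""
    best = None
    for b in range(min_seg, len(t) - min_seg):
        f1 = np.polyfit(t[:b], y[:b], 1)
        f2 = np.polyfit(t[b:], y[b:], 1)
        sse = (np.sum((np.polyval(f1, t[:b]) - y[:b]) ** 2)
               + np.sum((np.polyval(f2, t[b:]) - y[b:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, b, f1[0], f2[0])
    _, bp, slope1, slope2 = best
    return bp, slope1, slope2

bp, s1, s2 = breakpoint_features(t, y)
print("change-point near t =", bp, "| slopes:", round(s1, 3), round(s2, 3))
```

Per-cell features like `(bp, s1, s2)` could then be compared across cells and cell types, which is the spirit of the comparison described in the abstract.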
ABSTRACT: We propose a tree regularization framework, which enables many tree models to perform feature selection efficiently. The key idea of the regularization framework is to penalize selecting a new feature for splitting when its gain (e.g., information gain) is similar to that of the features used in previous splits. The regularization framework is applied here to random forests and boosted trees, and can be easily applied to other tree models. Experimental studies show that the regularized trees can select high-quality feature subsets with regard to both strong and weak classifiers. Because tree models can naturally deal with categorical and numerical variables, missing values, different scales between variables, interactions, and nonlinearities, the tree regularization framework provides an effective and efficient feature selection solution for many practical problems.
International Joint Conference on Neural Networks (IJCNN). 01/2012;
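The key idea of the abstract above, penalizing the gain of features not used in previous splits, fits in a few lines. The penalty value and toy gains below are made up for illustration.

```python
def regularized_gain(gain, feature, used, lam=0.7):
    """Penalize the gain of a feature that no previous split has used."""
    return gain if feature in used else lam * gain

# Toy node evaluation: feature 2 was used before; feature 5 has a
# slightly higher raw gain but is new, so after regularization the
# tree sticks with feature 2 and the feature subset stays compact.
used = {2}
raw = {2: 0.30, 5: 0.33, 9: 0.10}
scores = {f: regularized_gain(g, f, used) for f, g in raw.items()}
best = max(scores, key=scores.get)
print("selected feature:", best, "| regularized gain:", scores[best])
```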
ABSTRACT: Learning Markov blankets is important for classification and regression, causal discovery, and Bayesian network learning. We present an argument that ensemble masking measures can provide an approximate Markov blanket. Consequently, an ensemble feature selection method can be used to learn Markov blankets for either discrete or continuous networks (without linear or Gaussian assumptions). We use masking measures for redundancy and statistical inference for feature selection criteria. We compare our performance in the causal structure learning problem to a collection of common feature selection methods. We also compare to Bayesian local structure learning. These results can also be easily extended to other causal structure models such as undirected graphical models.
ABSTRACT: Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear, data-driven, stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured at 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is performed on gene expression replicates to remove unwanted variability. Then, we focused on the missing measurements of gene expression through nonlinear smoothing-spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles.
The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
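The GBT imputation step can be sketched with scikit-learn's gradient boosting on hypothetical stand-in data (the predictors and signal below are invented; the real model used temporal mRNA expression and functional-role predictors).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Stand-in data: rows are genes, columns are temporal mRNA measurements;
# the response is protein abundance, partially missing.
mrna = rng.normal(size=(200, 4))
protein = 0.8 * mrna[:, 2] + 0.2 * mrna[:, 3] ** 2 + 0.1 * rng.normal(size=200)
detected = rng.random(200) > 0.3          # ~30% of proteins undetected

# Fit the boosted-trees model on detected proteins only...
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(mrna[detected], protein[detected])

# ...then impute abundance for the undetected ones.
imputed = gbt.predict(mrna[~detected])
print(imputed.shape, gbt.feature_importances_.round(2))
```

The trees' implicit variable selection (visible in `feature_importances_`) is one reason GBT suits this kind of imputation.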
ABSTRACT: Many systems (manufacturing, environmental, health, etc.) generate counts (or rates) of events that are monitored to detect changes. Modern data complement event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional attributes. This leads to an important challenge: detecting a change that occurs only within an initially unspecified region defined by these attributes. Current methods for handling the attribute information are challenged by high-dimensional data. Our approach transforms the problem to supervised learning, so that properties of an appropriate learner can be described. Rather than error rates, we generate a signal (of a system change) from an appropriate feature selection algorithm. A measure of statistical significance is included to control false alarms. Results on simulated examples are provided.
Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part II; 01/2011
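The supervised-learning transformation above can be sketched as follows. The windows, the change region, and the use of random-forest importances as the signal are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Reference window vs. current window of high-dimensional attribute data;
# the change occurs only in a region defined by attribute 0.
ref = rng.normal(size=(300, 20))
cur = rng.normal(size=(300, 20))
region = cur[:, 0] > 0.0
cur[region, 5] += 3.0                 # localized shift in attribute 5

# Transform monitoring into supervised learning: label the two windows
# and let a feature selection step reveal where the change lives.
X = np.vstack([ref, cur])
y = np.r_[np.zeros(300), np.ones(300)]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

imp = rf.feature_importances_
print("top attributes:", np.argsort(imp)[-2:])  # change region should rank high
```

A real monitoring scheme would add the statistical-significance step mentioned in the abstract to control false alarms before signaling.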
ABSTRACT: Attribute importance measures for supervised learning are important for improving both learning accuracy and interpretability. However, it is well known that bias can arise when the predictor attributes have different numbers of values. We propose two methods to solve the bias problem. One, called OOBForest, uses an out-of-bag sampling method; the other, called pForest, is based on the new concept of a partial permutation test. Existing research has considered the bias problem only among irrelevant attributes and equally informative attributes, whereas we compare to existing methods in a situation where unequally informative attributes (with or without interactions) and irrelevant attributes co-exist. We observe that the existing methods are not always reliable for multi-valued predictors, while the proposed methods compare favorably in our experiments.
Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part II; 01/2011
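The multi-valued-predictor bias the abstract addresses is easy to reproduce. This sketch does not implement OOBForest or pForest; it only demonstrates the problem, contrasting impurity-based importance with held-out permutation importance on an invented two-feature example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
relevant = rng.integers(0, 2, size=n)      # informative predictor, 2 values
many_valued = rng.permutation(n)           # irrelevant predictor, n values
X = np.c_[relevant, many_valued].astype(float)
y = np.where(rng.random(n) < 0.9, relevant, 1 - relevant)  # noisy labels

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Impurity-based scores credit the irrelevant many-valued predictor...
print("impurity:", rf.feature_importances_.round(2))

# ...while permutation importance on held-out data largely does not.
perm = permutation_importance(rf, Xte, yte, n_repeats=20, random_state=0)
print("permutation:", perm.importances_mean.round(2))
```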
ABSTRACT: A bounded adjustment strategy is an important link between statistical process control and engineering process control (or closed-loop feedback adjustment). The optimal bounded adjustment strategy for the case of a single variable has been reported in the literature, and recently a number of publications have enhanced this relationship (but still for a single variable). The optimal bounded adjustment strategy for multivariate processes (of arbitrary dimension) is derived in this article. The derivation uses optimization and exploits a symmetry relationship to obtain a closed-form solution for the optimal strategy. Furthermore, a numerical method is developed to analyze the adjustment strategy for an arbitrary number of dimensions with only a one-dimensional integral. This provides the link between statistical and engineering process control in the important multivariate case. Both infinite- and finite-horizon solutions are presented along with a numerical illustration.
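A univariate bounded (deadband) adjustment policy, the single-variable case the abstract generalizes, can be simulated in a few lines. The bound, drift, and noise values are arbitrary illustration choices, and the policy shown (full compensation when the deviation leaves the deadband) is a simple textbook variant rather than the article's optimal strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1.5                         # deadband half-width (a chosen value)
drift, noise = 0.05, 0.5
deviation, adjustments, path = 0.0, 0, []

# Bounded adjustment on a drifting univariate process: adjust back to
# target only when the deviation leaves [-L, L].
for _ in range(500):
    deviation += drift + noise * rng.normal()
    if abs(deviation) > L:
        deviation = 0.0         # full compensating adjustment
        adjustments += 1
    path.append(deviation)

print("adjustments made:", adjustments,
      "| mean squared deviation:", round(float(np.mean(np.square(path))), 2))
```

Choosing L trades off adjustment cost against off-target variance; the article derives that trade-off optimally in the multivariate case.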
ABSTRACT: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. With only partial proteomic data, integrative transcriptomic and proteomic analysis is likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into the metabolic mechanisms underlying complex biological systems.
In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected, based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content, and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. This model exhibited a strong plateau effect in regions of high mRNA values and sparse data. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, the main cellular functional categories, and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection, and provides relative variable importance measures using mean squared error as the criterion. The results showed that the coefficients of determination for our nonlinear models ranged from 0.393 to 0.582 on both datasets, better than the linear regression models used in the past.
We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
Supplementary data are available at Bioinformatics online.
ABSTRACT: Clustering algorithms partition data sets into groups of objects such that the pairwise similarity between objects within the same cluster is higher than that between objects assigned to different clusters. Defining a similarity measure becomes challenging in the presence of categorical data and affects the quality and meaningfulness of the clusters formed. Furthermore, the curse of dimensionality diminishes the robustness of such measures. This paper introduces SCAR (supervised clustering with association rules), a nontraditional algorithm for clustering massive high-dimensional categorical data. SCAR is robust to the curse of dimensionality; it relies on association rules as an intuitive way to evaluate the similarity between objects and group them.
International Conference on Innovations in Information Technology (IIT 2008); 01/2009
ABSTRACT: Discretizing continuous attributes is necessary before association rule mining or before using several inductive learning algorithms with a heterogeneous data space. This data preprocessing step should be carried out with minimal information loss; that is, the mutual information between attributes, and between attributes and the class labels, should not be destroyed. This paper introduces a novel supervised, global, and dynamic discretization algorithm called RFDisc (Random Forests Discretizer). It derives its ability to conserve the data properties from the Random Forests learning algorithm. RFDisc is simple, relatively fast, and automatically learns the number of bins into which each continuous attribute is to be discretized. Empirical results indicate that the accuracies of classification algorithms such as CART on several data sets are comparable before and after discretization with RFDisc. Furthermore, C5.0 achieves the highest classification accuracy on data discretized with RFDisc when compared with other well-known discretization algorithms.
The 7th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009, Rabat, Morocco, May 10-13, 2009; 01/2009
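One plausible reading of a forest-based discretizer can be sketched as follows. This is an illustration, not RFDisc itself: it harvests the split thresholds a trained random forest uses for one attribute and treats them as supervised, automatically sized bin boundaries.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def forest_cut_points(forest, feature):
    """Collect the split thresholds the forest's trees use for one
    attribute; leaves (feature == -2 in sklearn trees) are skipped."""
    cuts = []
    for est in forest.estimators_:
        t = est.tree_
        cuts += list(t.threshold[t.feature == feature])
    return np.unique(np.round(cuts, 2))

edges = forest_cut_points(rf, feature=2)   # petal length
binned = np.digitize(X[:, 2], edges)
print(len(edges), "cut points; first bins:", binned[:5])
```

Because the thresholds come from class-supervised splits, bins tend to respect the class structure, which is the property RFDisc is designed to conserve.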