About
20 Publications
17,866 Reads
2,309 Citations
Introduction
Data Scientist.
Publications (20)
We consider a Dynamic Workforce Acquisition (DWA) problem for crowdsourced last-mile delivery platforms that need to match supply (the number of available workers) with demand (the number of requested deliveries). This need arises due to the fact that the initial number of scheduled workers does not always match the number of requested deliveries....
Demand variance can result in a mismatch between planned supply and actual demand. Demand shaping strategies such as pricing can be used to reduce the imbalance between supply and demand. In this work, we propose to consider the demand shaping factor in forecasting. We present a method to reallocate the historical elastic demand to reduce variance,...
Traditional control charts assume a baseline parametric model, against which new observations are compared in order to identify significant departures from the baseline model. To monitor a process without a baseline model, real-time contrasts (RTC) control charts were recently proposed to monitor classification errors when separating new observati...
Phaeosphaeria leaf spot (PLS) is considered one of the major diseases that threaten the stability of maize production in tropical and subtropical African regions. The objective of the present study was to investigate the use of hyperspectral data in detecting the early stage of PLS in tropical maize. Field data were collected from healthy and the e...
Quality control of multivariate processes has been extensively studied in the past decades; however, fundamental challenges still remain due to the complexity and the decision-making challenges that require not only sensitive fault detection but also identification of the truly out-of-control variables. In existing approaches, fault detection and d...
The segmentation of infant brain tissue images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) plays an important role in studying early brain development in health and disease. In the isointense stage (approximately 6-8 months of age), WM and GM exhibit similar levels of intensity in both T1 and T2 MR images, making the tis...
Tree ensembles such as random forests and boosted trees are accurate but
difficult to understand, debug and deploy. In this work, we provide the inTrees
(interpretable trees) framework that extracts, measures, prunes and selects
rules from a tree ensemble, and calculates frequent variable interactions. A
rule-based learner, referred to as the simp...
Associative classifiers have been proposed to achieve an accurate model with each individual rule being interpretable. However, existing associative classifiers often consist of a large number of rules and, thus, can be difficult to interpret. We show that associative classifiers consisting of an ordered rule set can be represented as a tree model....
A multivariate decision tree attempts to improve upon the single variable split in a traditional tree. With the increase in data sets with many features and a small number of labeled instances in a variety of domains (e.g., bioinformatics, text mining, etc.), a traditional tree-based approach with a greedy variable selection at a node may omit impo...
The regularized random forest (RRF) was recently proposed for feature
selection by building only one ensemble. In RRF the features are evaluated on a
part of the training data at each tree node. We derive an upper bound for the
number of distinct Gini information gain values in a node, and show that many
features can share the same information gain...
Brain function is the result of interneuron signal transmission controlled by the fundamental biochemistry of each neuron. The biochemical content of a neuron is in turn determined by spatiotemporal gene expression and regulation encoded into the genomic regulatory networks. It is thus of particular interest to elucidate the relationship between g...
We propose a tree ensemble method, referred to as time series forest (TSF),
for time series classification. TSF employs a combination of the entropy gain
and a distance measure, referred to as the Entrance (entropy and distance)
gain, for evaluating the splits. Experimental studies show that the Entrance
gain criterion improves the accuracy of TSF....
Random Forest (RF) is a powerful supervised learner and has been popularly
used in many applications such as bioinformatics. In this work we propose the
guided random forest (GRF) for feature selection. Similar to a feature
selection method called guided regularized random forest (GRRF), GRF is built
using the importance scores from an ordinary RF....
We propose a tree regularization framework, which enables many tree models to
perform feature selection efficiently. The key idea of the regularization
framework is to penalize selecting a new feature for splitting when its gain
(e.g. information gain) is similar to the features used in previous splits. The
regularization framework is applied on ra...
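
The penalty idea in this abstract can be illustrated with a minimal sketch. The abstract is truncated, so the form used here is an assumption: the gain of a feature not yet used in earlier splits is multiplied by a hypothetical coefficient `penalty` in (0, 1], while already-used features keep their full gain. The function names and the example gains are also illustrative and are not taken from the paper.

```python
from typing import Dict, Set


def regularized_gain(feature: str, gain: float, used: Set[str],
                     penalty: float = 0.8) -> float:
    """Return the gain unchanged for an already-used feature,
    otherwise scale it by `penalty` (assumed multiplicative form)."""
    return gain if feature in used else penalty * gain


def pick_split_feature(gains: Dict[str, float], used: Set[str],
                       penalty: float = 0.8) -> str:
    """Choose the feature with the largest regularized gain."""
    return max(gains,
               key=lambda f: regularized_gain(f, gains[f], used, penalty))


# Hypothetical gains at one tree node; "x1" was used in a previous split.
gains = {"x1": 0.30, "x2": 0.31, "x3": 0.10}
used = {"x1"}

# "x2" is only marginally better than "x1", so the penalty keeps "x1".
chosen = pick_split_feature(gains, used)

# A clearly better new feature still wins despite the penalty.
chosen_strong = pick_split_feature({"x1": 0.30, "x2": 0.50}, used)
```

Under these assumptions a new feature enters the selected set only when its gain exceeds that of the used features by a clear margin, which is what keeps the selected feature subset small.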
Monitoring real-time data streams is an important learning task in numerous disciplines. Traditional process monitoring techniques are challenged by increasingly complex, high-dimensional data, mixed categorical and numerical variables, non-Gaussian distributions, non-linear relationships, etc. A new monitoring method based on real-time contrasts (RT...
Learning Markov Blankets is important for classification and regression, causal discovery, and Bayesian network learning. We present an argument that ensemble masking measures can provide an approximate Markov Blanket. Consequently, an ensemble feature selection method can be used to learn Markov Blankets for either discrete or continuous networks (...
Attribute importance measures for supervised learning are important for improving both learning accuracy and interpretability. However, it is well known that bias can arise when the predictor attributes have different numbers of values. We propose two methods to solve the bias problem. One uses an out-of-bag sampling method called OOBForest and one...