A data value metric forquantifying
information content andutility
Morteza Noshad1,7, Jerome Choi2,3, Yuming Sun2,4, Alfred Hero III1,3,5 and Ivo D. Dinov2,6*
Abstract
Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions of manufacturing costs, and significant demands for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly more complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) and model complexity. DVM can be used to determine if appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choices of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity captures the usefulness of the sample data specifically in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model. We tested the DVM method for several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information are used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data needed to optimize the relative utility of various supervised or unsupervised algorithms.
Keywords: Data energy, Artificial intelligence, Machine learning, Data utility,
Information content
Open Access
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
RESEARCH
Noshadetal. J Big Data (2021) 8:82
https://doi.org/10.1186/s40537-021-00446-6
*Correspondence:
statistics@umich.edu
2 Statistics Online
Computational Resource,
University of Michigan, Ann
Arbor, MI 48109, USA
Full list of author information
is available at the end of the
article
Introduction
Background
Big data sets are becoming ubiquitous, emphasizing the importance of solving the
challenge of balancing information utility, data value, resource costs, computational
efficiency, and inferential reliability [1]. This manuscript tackles this problem by developing a new measure, called the Data Value Metric (DVM), that quantifies the energy, or information content, of large and complex datasets, which can be used as a yardstick to determine if appending, expanding, or otherwise augmenting the data size or complexity may be beneficial in specific application domains. In practice, DVM provides a mechanism to balance, or trade off, a pair of competing priorities: (1) costs or tradeoffs associated with increasing or decreasing the size of heterogeneous datasets (sample size) and controlling the sampling error rate, and (2) expected gains (e.g., decision-making improvement) or losses (e.g., decrease of precision or variability increase) associated with the corresponding scientific inference. The computational complexity of the DVM method is directly proportional to that of calculating mutual information, which is linear in terms of the data size. Thus, the DVM complexity is determined directly by the inferential method or technique used to obtain the classification, regression, or clustering results, which may itself be non-linear. Hence, DVM calculations do not add significant overhead to the standard analytical protocol.
Although several performance measures exist for supervised and unsupervised inference
tasks, it is difficult to use established methods to infer the sufficiency of the data for each
specific inferential task. For example, one could use accuracy measures for a classification task. Assume that an accuracy of 70% is achieved for a non-random, non-stationary, or non-homogeneous dataset. Then, the question is whether we can expect an increase of the accuracy by adding more samples or more features, or perhaps by using alternative models to increase the value of the resulting inference. In general, such questions are difficult to answer solely by considering a particular measure of performance on a given dataset. Several of the previous approaches measuring the quality of data are summarized below.
Related work
Several previous studies have proposed metrics for assessing the information gain of a given dataset. For example, value of information (VoI) analysis, originally proposed in [2] with overviews in [3–5], is a decision-theoretic statistical framework representing the expected increase in inference accuracy or reduction in loss based on additional prospective information [6]. The three basic types of VoI methods include (1) inferential and modeling cases for linear objective functions under simplified parameter distribution restrictions, which limits their broad practical applicability [3, 7]; (2) methods for estimating the expected value of partial perfect information (EVPPI) involving partitioning of the parameter space into smaller subsets and assuming constant and optimal inference over the local neighborhoods, within subsets [8, 9]; and (3) Gaussian process
regression methods approximating the expected inference [10–12]. More specifically, for a particular parameter φ, the EVPPI is the expected inferential gain, or reduction in loss, when φ is perfectly estimated. As the perfect φ is unknown in advance, this reduction of loss expectation is taken over the entire parameter space φ:

$$\mathrm{EVPPI}(\varphi) = E_{\theta}\big(L(d,\theta)\big) - E_{\varphi}\Big(E_{\theta\mid\varphi}\big(L(d_{\varphi},\theta)\big)\Big),$$

where d is the decision, inference, or action, $d_{\varphi}$ is the optimal inference obtained when φ is known, θ is the model parameter vector, E is the expectation, and L(d, θ) is the likelihood function [6]. Note that VoI techniques are mainly suitable for specific types of problems, such as evidence synthesis in the context of decision theory. Further, their computational complexity tends to be high and they require nested Monte Carlo procedures.
Another relevant study [13] utilizes a unique decomposition of the differences (errors) between theoretical (population) parameters and their sample-driven estimates (statistics) into three independent components. If θ and $\hat{\theta}$ represent a theoretical characteristic of interest (e.g., population mean) and its sample-based parameter estimate (e.g., sample arithmetic average), respectively, then the error can be canonically decomposed as:

$$\underbrace{\theta-\hat{\theta}}_{\text{error}} = \underbrace{A}_{(\text{Data Quality})} + \underbrace{B}_{(\text{Data Quantity})} + \underbrace{C}_{(\text{Inference Problem Complexity})}.$$

Suppose J is a (uniform) random subset indexing a sample from the entire (finite, N) population. For a sample $\{X_j : j \in I_n\}$, $R_j$ is a random-sample indicator function (with values 0 or 1) capturing whether $j \in I_n$. Of course, $\sum_{j=1}^{N} R_j = n$. X is a multidimensional design matrix capturing the attributes of the data (features), $g: X \rightarrow \mathbb{R}$ is a linking map that allows us to compute on samples (e.g., polynomial functions for moment calculations or indicator functions for distribution functions), $g_j = g(X_j)$ is a mapping of the j-th feature, $A = A(g, R)$ is a measure of association between $R_J$ and $G_J$, the sampling rate is $f = E_J(R_J) = n/N$ (the ratio of sample-to-population size), $B = \frac{1-f}{f}$, and C is a measure encoding the difficulty of estimating the sample-based parameters ($\hat{\theta}$).
Bayes error rate is another metric that quantifies the intrinsic classification limits. In classification problems, the Bayes error rate represents the minimal classification error achievable by any classifier [14, 15]. The Bayes error rate only depends on the distributions of the classes and characterizes the minimum achievable error of any classifier. Several previous studies proposed effective estimation methods for the Bayes error rate [14–17]. In particular, [18] obtains a rate-optimal non-parametric estimator of the Bayes error rate. The Bayes error rate may not be attainable with a practical classifier.
The proposed data value metric addresses the problem of measuring and tracking data information content relative to the intrinsic limits within the context of a specific analytical inferential model.
Data value metric
For a given dataset, the information-theoretic definition of DVM employs mutual infor-
mation (MI) [19, 20] to quantify the inferential gain corresponding to increasing the data
size or the richness of its features. In general, mutual information evaluates the degree of
relatedness between a pair of data sets. In particular, MI may be used to assess the infor-
mation gain between an initial data set and its augmented counterpart representing an
enhanced version of the former. When both random variables X and Y are either discrete or continuous, the mutual information can be defined by:

(1)   $I(X;Y)=\sum_{y\in\mathcal{Y},\,x\in\mathcal{X}} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$ (discrete distributions),   $I(X;Y)=\int_{\mathcal{Y}}\int_{\mathcal{X}} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$ (continuous distributions),

where p(x) and p(y) are the marginal probability distribution functions and p(x, y) is the joint probability function of X and Y. The non-negative and symmetric MI measure expresses the intrinsic dependence in the joint distribution of X and Y, relative to the assumption of X and Y independence. Thus, MI captures the X and Y dependence in the sense that I(X;Y) = 0 if and only if X and Y are independent random variables, and for dependent X and Y, I(X;Y) > 0. Further, the conditional mutual information is defined as follows:

(2)   $I(X;Y\mid Z)=I(X;Y,Z)-I(X;Z).$
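As an illustrative aside (not part of the DVM package), the discrete case of Eq. (1) can be computed directly from the empirical joint distribution of two samples; the following minimal Python sketch does exactly that.

import numpy as np

def plug_in_mutual_information(x, y):
    # Plug-in estimate of I(X;Y) in nats for two discrete 1-D samples,
    # following the discrete case of Eq. (1).
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))            # empirical joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)    # empirical marginals
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Example: y is a noisy copy of x, so I(X;Y) > 0; an independent y would give roughly 0
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=5000)
y = np.where(rng.random(5000) < 0.8, x, rng.integers(0, 3, size=5000))
print(plug_in_mutual_information(x, y))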
DVM relies on a low-dimensional representation of the data and tracks the quality of the
extracted features. Either the extracted features or the predicted values from a model
can be used in a low-dimensional representation in the DVM formulation. For each
dataset, the DVM quantifies the performance of a specified supervised or unsupervised
inference method. e DVM formulation is inspired by the concept of information bot-
tleneck in deep neural networks (DNNs) [21, 22]. Information bottleneck represents the
trade-off between two mutual information measures: I(X;T) and I(T;Y), where X and Y
are respectively the input and output of the deep learning model and T is an intermedi-
ate feature layer.
Instead of simply computing sample-driven parameter estimates, the DVM approach examines the information-theoretic properties of datasets relative to their sample sizes, feature richness, and the algorithmic complexity of the corresponding scientific inference. There are both similarities and differences between the DVM and other VoI metrics. The main difference is that, for model-based inference, some VoI metrics may have known, exact, or asymptotic expectations based on exact, or Markov chain Monte Carlo (MCMC), posterior estimates [23–25]. Whereas, under model-free inference, estimating the DVM theoretical or ergodic properties is difficult, in general. This challenge prevents the derivation of an exact linear decomposition of the error between population characteristics and their sample-driven counterparts.
This manuscript is organized as follows. In the "Methods" section, we define the data value metric (DVM) as an information-theoretic function of the (training and testing) data and the specific inferential technique. This section also includes the computational details about an effective mutual information (MI) estimator, the ensemble dependency graph estimator (EDGE) [22], as well as the implementation details of a DVM Python package we built, validated, and openly shared. A feature selection application of DVM is also discussed in this section. The estimation of the mutual information using the ensemble dependency graph estimator (EDGE) is discussed in the "Mutual information estimation" section. The "Results" section includes experimental results illustrating the behavior of the proposed DVM metric on a wide range of real and simulated data, low- and
high-energy signals, feature-poor and feature-rich datasets. "Conclusion and discussion"
section summarizes the conclusions and provides a discussion about applications, pos-
sible improvements, limitations, and future work. In the Appendix, we provide DVM
implementation details, source code references, additional results, and references to
interactive 3D plots of DVM performance on real and simulated data.
Methods
There are a wide range of artificial intelligence, machine learning, and statistical inference methods for classification, regression, and clustering [1, 26–28]. The DVM metric is applicable to unsupervised and supervised, model-based and model-free approaches. We employed the following supervised classification methods to identify, predict, or label predefined classes: linear models [29, 30], random forest [31], adaptive [32] and gradient [33] boosting, and k-nearest neighbors [34]. In addition, we tested several unsupervised clustering approaches for categorizing and grouping objects into subsets without explicit a priori labels: K-means [35], Affinity Propagation [36], and Agglomerative clustering [37].
The data value metric (DVM) technique utilizes MI to quantify the energy of datasets relative to the corresponding inferential technique applied to interrogate the data. Our approach is based on transforming the triple (T, S, g), representing the training (model estimation) dataset, the testing (validation) dataset, and the specific inferential method, respectively, into random variables $X = g(X_T, X_S)$ and $Y = Y_S$ whose MI captures the data-method information content in the triple.
Depending upon the type of the intended inference on the data, we will define the
DVM separately for supervised modeling and for unsupervised clustering. While the two
definitions are congruent, this dichotomy is necessary to provide explicitly construc-
tive definitions that can be used for a wide range of domain applications. Expanding the
general regularization problem formulation, given a dataset, D, the DVM is defined as a
mixture blending a fidelity term, F(D), and a regularization term, R(D):

(3)   $\mathrm{DVM}(D)=\underbrace{F(D)}_{\text{fidelity}}-\underbrace{\lambda}_{\text{penalty}}\;\underbrace{R(D)}_{\text{regularizer}}.$

The DVM fidelity term captures the usefulness of the sample data for the specified inferential task (supervised or unsupervised). The second, regularization, term penalizes the DVM based on the computational complexity of the corresponding inferential method. Thus, broadly speaking, the DVM depends on the data (including both training and testing sets) as well as the data-analytic technique used to obtain the desired inference.
Let's first explain the rationale behind mixing fidelity and regularization in the DVM definition. Consider a case-study where a high-energy (low-noise) dataset provides sufficient information to derive either good prediction accuracy, for supervised modeling, or stable clustering results, for unsupervised inference. Expanding heterogeneous data by either appending the number of samples or expanding the set of features may not always increase the DVM and may add substantial costs associated with collecting, managing, quality-controlling, and processing the larger datasets. The penalty term in the DVM accounts for some of these potential detrimental effects due to inflating the data. The effect of the regularization term is mediated by the size of the penalty coefficient λ, which
controls the DVM balance between the quality of the inference and the algorithmic complexity. There are many possible alternative forms of the regularizer term, R(D), such as runtime, computational complexity, or computing costs. In our experiments, we use the Big-O computational complexity of training the predictor to quantify the regularization penalty term R(D) = f(n). Table 1 shows the computational complexities of several commonly used classification (C) and regression (R) classifiers. The table uses the following notation: n represents the size of the training sample, p is the number of features, $k_{trees}$ is the number of trees (for tree-based classifiers), $m_{sv}$ is the number of support vectors (for SVM), and $o_{l_i}$ is the number of neurons at layer i in a deep neural network classifier.

Table 1 Computational complexity of several commonly used regression and classification techniques

Classifier            | Type  | Training                  | Prediction
Linear Regression     | R     | O(p^2 n + p^3)            | O(p)
Decision Trees        | C & R | O(n^2 p)                  | O(p)
Random Forest         | C     | O(n^2 p k_trees)          | O(p k_trees)
Gradient Boosting     | C & R | O(n p k_trees)            | O(p k_trees)
SVM                   | C & R | O(n^2 p + n^3)            | O(m_sv p)
k-Nearest Neighbors   | C & R | varies                    | O(np)
Neural Networks       | C & R | varies                    | O(Σ_i o_{l_i} o_{l_{i+1}})
Naive Bayes           | C     | O(np)                     | O(p)
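To make the regularizer concrete, the following short sketch (our own illustration, with an arbitrary normalization by the cost at a reference data size) evaluates the Big-O training costs from Table 1 as a penalty R(D) = f(n).

import numpy as np

# Big-O training complexities from Table 1, as functions of the sample size n,
# the number of features p, and (where relevant) the number of trees k.
TRAINING_COST = {
    "linear_regression": lambda n, p, k=100: p**2 * n + p**3,
    "decision_tree":     lambda n, p, k=100: n**2 * p,
    "random_forest":     lambda n, p, k=100: n**2 * p * k,
    "gradient_boosting": lambda n, p, k=100: n * p * k,
    "naive_bayes":       lambda n, p, k=100: n * p,
}

def regularizer(method, n, p, n_ref=1000, p_ref=100):
    # Illustrative R(D) = f(n): the training cost normalized by the cost at a
    # reference size, so penalties of different methods are on a comparable scale.
    cost = TRAINING_COST[method]
    return cost(n, p) / cost(n_ref, p_ref)

print(regularizer("random_forest", n=5000, p=200))   # grows quickly with n
print(regularizer("naive_bayes", n=5000, p=200))     # grows only linearly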
Next, we will focus solely on the more complex DVM fidelity term, which will be defined separately for the two alternative approaches: supervised prediction and unsupervised clustering.
Representation ofthedelity term inlow‑dimensions
First we will define the DVM fidelity term based on low-dimensional representations of the
data. e motivation behind this definition of the fidelity is driven by the neural networks
(NNs) process of optimizing an objective function and identifying feature contributions.
Let X, T and Y respectively denote the NN input layer, an intermediate feature layer, and the
output layer.
In [21, 22], the mutual information measures I(X;T) and I(T;Y) are used to demonstrate
the evolution of training in deep neural networks. I(T;Y) represents how the trained feature
layer T is informative about the label. In the training process of a deep neurals network
(DNN), I(T;Y) keeps increasing [21, 22]. On the other hand, I(X;T) shows the complexity of
the representation T. In DNN, I(X;T) increases in the first training phase and it decreases
in the compression phase [21, 22]. us, T is a good representation of X if its information
about Y is maximized for a constrained complexity. is is equivalent to maximizing the
following information bottleneck (IB) loss function [38]:
where
β
is a Lagrange multiplier with the condition
β>0
.
e DVM formulation is inspired by the NN definition of information bottleneck loss
function in equation (5). Intuitively, a feature vector T has high quality if it is informative
about the label and its representation complexity is small. Thus, IB might be used as a measure of feature quality.
However, there are also problems with considering IB as a feature quality measure. First, in general, IB has no fixed range and it is not a priori clear what values of IB represent high and low salient features. Second, the penalty term in the IB function, I(X;T), represents the information of the feature T about X, which captures both necessary and unnecessary information in order to predict Y. It may be better to only consider the information that is independent of Y as a penalty term. In terms of information-theoretic measures, one could formulate this as the conditional information I(X;T|Y). Note that this penalty term is minimized when the representation T yields the information of Y without extra information about X. An example of this case is when Y is an invertible function of T.
Thus, the proposed fidelity term for the Data Value Metric (DVM) is defined in terms of the mutual information and conditional mutual information measures introduced in (1) and (2) as follows:

(6)   $\underbrace{F(T)}_{\text{DVM Fidelity}}=\frac{I(T;Y)-\beta\, I(X;T\mid Y)}{I(X;Y)}.$

The following remarks include some of the properties of the proposed fidelity measure.

Remark 1.a The following inequality holds

(7)   $I(T;Y)-\beta\, I(X;T\mid Y)\le I(X;Y),$

and the fidelity term of the DVM always has the following upper bound:

(8)   $F(T)=\frac{I(T;Y)-\beta\, I(X;T\mid Y)}{I(X;Y)}\le 1.$

Remark 1.b F(T) = 1 if and only if the following equations are true:

(9)   $I(X;Y\mid T)=0,$
(10)  $I(X;T\mid Y)=0.$

The proof for Remarks 1.a and 1.b is given in Appendix 1.

Remark 2 The fidelity term of the DVM can be simplified to the form of the standard information bottleneck [38]:

(11)  $F(T)=\frac{I(Y;T)-\beta\, I(T;X\mid Y)}{I(X;Y)}=\frac{I(Y;T)-\beta\,(I(T;X)-I(T;Y))}{I(X;Y)}=\frac{(1+\beta)\,I(Y;T)-\beta\, I(T;X)}{I(X;Y)}.$
As a simple demonstration of the behavior of DVM = F − λR, we fit a 5-layer DNN to predict the 10 class labels of the MNIST dataset [39], and used the DVM to track the feature quality across epochs and layers. The results of the DVM performance on the digit recognition task are given in Fig. 1. Since the network is trained as a whole with all layers, the regularizer term R is considered fixed for all layers. At a fixed training epoch, the DVM values in different network layers represent the trade-off between the information about the labels and the information about the input. These complementary information components are the first and second terms in the numerator of the DVM fidelity (11). During the iterative network training process, the information about the labels and the fidelity term increase, which suggests improvement of the quality of the feature layers.
Supervised modeling
The DVM fidelity term definition, equation (6), relies on low-dimensional representations. Using the supervised-model predicted values, we can obtain low-dimensional representations that can be used to measure data quality in both supervised and unsupervised problems. Unsupervised inference problems will be considered later.
In supervised inference, we assume that we have a set of independent and identically distributed (i.i.d.) samples $X_i$, $1 \le i \le n$, with a joint distribution f(x) and associated known labels $Y_i$. We define an encoder-decoder pair (E, D), where E maps the high-dimensional input X into a lower dimensional representation T, and D maps the representation T to the predicted labels. In practice, we can think of E as a dimensionality-reduction method, or the intermediate representations of a deep neural network. In addition, D performs the classification task based on the lower dimensional representations. Note that if T is simply the predicted labels, the fidelity would depend on the specific classifier. However, if T is some low-dimensional representation of the data, such as extracted features or any intermediate layer of a deep neural network, the fidelity would be independent of the classifier and would only depend on the encoder (feature extraction) method.
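As a concrete, hypothetical instantiation of the encoder-decoder pair (E, D), the sketch below uses PCA as the encoder producing the low-dimensional representation T and logistic regression as the decoder; any feature-extraction method and classifier could play these roles.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Encoder E: map the high-dimensional input X to a low-dimensional representation T
encoder = PCA(n_components=10).fit(X_train)
T_train, T_test = encoder.transform(X_train), encoder.transform(X_test)

# Decoder D: predict labels from the representation T
decoder = LogisticRegression(max_iter=1000).fit(T_train, y_train)
print("test accuracy:", decoder.score(T_test, y_test))

# Either T_test (the extracted features) or decoder.predict(T_test) (the predicted labels)
# can serve as the low-dimensional representation T entering the DVM fidelity term.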
Fig. 1 Training a neural network of size 784–200–100–60–30–10 on the MNIST dataset with ReLU activation and using the DVM to track feature quality measures for different network layers. As shown, the deeper layers have higher DVM values, suggesting that they represent salient features for predicting the class labels
The definition of the fidelity measure is based on a cross-validation type average of the definition (6) using the estimated mutual information measures. Given any random variables X, Y and Z, with corresponding sets of i.i.d. samples $\mathbf{X}, \mathbf{Y}$ and $\mathbf{Z}$, $\hat{I}(\mathbf{X};\mathbf{Y})$ denotes the estimated mutual information using the sample sets $\mathbf{X}, \mathbf{Y}$.
We randomly split the feature set $\mathbf{X}$ into two subsets $(\mathbf{X}, \tilde{\mathbf{X}})$. The first subset ($\mathbf{X}$) is used for training, whereas the second one ($\tilde{\mathbf{X}}$) is used for independent testing and validation. Also let $\tilde{\mathbf{T}}$ denote the set of intermediate representations (or predicted labels), and $\tilde{\mathbf{Y}}$ represent the true labels associated with the test dataset $\tilde{\mathbf{X}}$. Then, we can define the DVM fidelity term by:

(12)  $F := \frac{1}{M}\sum_{i=1}^{M}\frac{\hat{I}(\tilde{T}_i;\tilde{Y}_i)-\beta\,\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \tilde{Y}_i)}{\hat{I}(\mathbf{X};\mathbf{Y})}.$

Using a weight coefficient, β, this fidelity term formulation, equation (12), mixes two components, $\hat{I}(\tilde{T}_i;\tilde{Y}_i)$ and $\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \tilde{Y}_i)$, via normalization by $\hat{I}(\mathbf{X};\mathbf{Y})$. The first term, $\hat{I}(\tilde{T}_i;\tilde{Y}_i)$, accounts for the fidelity of the low-dimensional representation of the output labels, $\tilde{Y}_i$, whereas the second (penalty) term, $\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \tilde{Y}_i)$, accounts for the compression of the lower-dimensional representation.
e pseudo code below (Algorithm 1) outlines the computational implementation
strategy we employ in the DVM package for evaluating the DVM. e metric captures
the relative analytical value of the dataset relative to the computational complexity of
the supervised prediction, classification, or regression problem. In practice, the regu-
larization term, R(g), is estimated according to the known algorithmic complexity, see
Table1.
Input: data sets X, Y; model g; parameters β, λ
for a random split $(\mathbf{X}_i, \tilde{\mathbf{X}}_i)$ of $\mathbf{X}$ do
    train g based on $(\mathbf{X}_i, \mathbf{Y}_i)$
    $\tilde{T}_i \leftarrow g(\tilde{\mathbf{X}}_i)$
    $F_i \leftarrow \big[\hat{I}(\tilde{T}_i;\tilde{Y}_i)-\beta\,\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \tilde{Y}_i)\big] \,/\, \hat{I}(\mathbf{X};\mathbf{Y})$
$D \leftarrow \frac{1}{M}\sum_{i=1}^{M} F_i - \lambda\, R(g)$
Output: D
Algorithm 1: DVM calculation for supervised problems.
Feature selection
Since DVM can be used to measure the quality of a feature set T, it can also serve as
a feature selection method. In this section, we demonstrate a heuristic algorithm for
sequential feature selection based on DVM values.
For a classification problem, the feature selection is defined as follows. Based on an initial feature set, choose a smaller set of features that yields a minimum prediction error. Let X = {X1, ..., Xd} denote the d initial features. The objective is to select a smaller set of r features with maximum DVM score. One specific approach is based on a forward selection involving r iterative steps. At each step, we select the feature from the initial feature set, {X1, ..., Xd}, which increases the DVM score the most. For a given (initial or intermediate) feature set F, DVM{F} represents the DVM score corresponding to that specific feature set F. The pseudocode implementing this strategy for DVM-based feature selection is given in Algorithm 2.
Input: input dataset X = {X1, ..., XN}; labels Y = {Y1, ..., YN}; desired number of output features, r
F := ∅,  R := {1, ..., r}
for each i ∈ R do
    f ← argmax over features $X_j \notin F$ of ( DVM{F ∪ {X_j}} − DVM{F} )
    add f to F
Output: F
Algorithm 2: DVM-based feature selection.
Unsupervised inference
We can extend the definition of DVM for supervised problems to unsupervised clustering models. In unsupervised problems, we do not have explicit outcomes to evaluate the model performance. Intuitively, the definition of fidelity for an unsupervised clustering method reflects the stability of the derived clusters, regardless of the clustering labels.
Our strategy for estimating the DVM fidelity for unsupervised clustering methods is based on randomly splitting the dataset $\mathbf{X}$ into three subsets $(\mathbf{X}', \mathbf{X}'', \tilde{\mathbf{X}})$. The first two of these sets, $(\mathbf{X}', \mathbf{X}'')$, are used for cross-validation training, whereas the remaining one, $\tilde{\mathbf{X}}$, is used for independent testing and validation. By training the classifier on the first subset ($\mathbf{X}'$), we obtain derived computed labels. These predicted labels, $\hat{Y}$, may be used as a baseline for computing the fidelity based on the information bottleneck in equation (12). Let $\tilde{T}$ be the representation layer (or the predicted indices associated with the test dataset $\tilde{\mathbf{X}}$). The DVM fidelity term for unsupervised learners may then be defined as follows:

(13)  $F := \frac{1}{M}\sum_{i=1}^{M}\frac{\hat{I}(\tilde{T}_i;\hat{Y}_i)-\beta\,\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \hat{Y}_i)}{\hat{I}(\mathbf{X};\hat{\mathbf{Y}})},$
where the index i in the above definition denotes the variables associated with the i-th randomized splitting of $\mathbf{X}$. Just as we did for the supervised problems, we can explicate the DVM algorithmic implementation via the pseudo code used in the DVM package. The algorithm below (Algorithm 3) shows the DVM calculation for unsupervised clustering and classification problems. Again, the regularization term is derived using the approximate estimate of the computational complexity associated with the classifier, R(g); see Table 1.
Input: data sets X, Y; model g; parameters β and λ
for a random split $(\mathbf{X}'_i, \mathbf{X}''_i, \tilde{\mathbf{X}}_i)$ of $\mathbf{X}$ do
    apply the unsupervised model g to $\mathbf{X}'_i$
    $\hat{Y}_i \leftarrow g(\mathbf{X}'_i)$
    $\tilde{T}_i \leftarrow g(\mathbf{X}''_i)$
    $F_i \leftarrow \big[\hat{I}(\tilde{T}_i;\hat{Y}_i)-\beta\,\hat{I}(\tilde{X}_i;\tilde{T}_i\mid \hat{Y}_i)\big] \,/\, \hat{I}(\mathbf{X};\hat{\mathbf{Y}})$
$D \leftarrow \frac{1}{T}\sum_{i=1}^{T} F_i - \lambda\, R(g)$
Output: D
Algorithm 3: DVM calculation for unsupervised problems.
Mutual information estimation
In many areas, including data science and machine learning, the density of the data
is unknown. In these cases, one needs to estimate the mutual information from
the data points. Examples of MI estimation strategies include KSG [40], KDE [41],
Parzen window density estimation [42], and adaptive partitioning [43].
The computational complexity and convergence rate are two important performance metrics of various MI estimators. The process of MI estimation is computationally intensive for large data sets; e.g., the computational complexity of the KDE method is O(n²), while the KSG method takes O(kn log(n)) time to compute MI (k is a parameter of the KSG estimator). More computationally efficient estimators such as [44] provide improvements with an MI estimation time of O(n log(n)).
Thus, estimation of mutual information for large and complex data sets requires some approximation. For instance, we can use one of the standard estimators that exist for non-parametric distributions. Non-parametric estimators are a family of estimators for which we impose minimal assumptions on the density functions. There are several previous approaches, e.g., [45–48], that guarantee optimal convergence rates. Among these estimators, the hash-based estimator proposed in [48] has
linear computational complexity. As we deal with large and complex data sets, here
we employ a hash-based mutual information estimator, called the ensemble depend-
ency graph estimator (EDGE) [22]. EDGE has an optimal mean square error (MSE)
convergence rate and low computational complexity that make it suitable for our
task of detecting the information gain associated with augmenting a data set.
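The EDGE estimator is distributed with the DVM package (EDGE.py); as a readily available alternative for quick experimentation, scikit-learn ships k-nearest-neighbor (KSG-style) mutual information estimators, illustrated below. These are a substitute we suggest for exploration, not the estimator used in the experiments reported here.

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 1))
y_continuous = x[:, 0] ** 3 + 0.5 * rng.normal(size=5000)   # continuous target
y_discrete = (x[:, 0] > 0).astype(int)                      # discrete target

# k-NN (KSG-style) estimates of I(X;Y) in nats, one value per column of x
print(mutual_info_regression(x, y_continuous, n_neighbors=5, random_state=0))
print(mutual_info_classif(x, y_discrete, n_neighbors=5, random_state=0))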
Results
We conducted a number of experiments to illustrate the use of the proposed DVM
on a wide range of real and simulated datasets. Each dataset was labeled as low,
medium, or high energy, indicating the strength of the signal information content in
the data. The results of different machine learning and statistical modeling methods,
their quality, accuracy, and reproducibility heavily depend on the intrinsic signal
energy. We contrast the proposed DVM against classifier accuracy and the Bayes optimal classifier accuracy, which is a measure of classification task difficulty. In this paper, we define the Bayes classifier accuracy as the additive complement of the classical Bayes error rate (risk), i.e., Bayes accuracy = 1 − Bayes error.
Datasets
MNIST Handwritten Digits Data: e Modified National Institute of Standards and
Technology (MNIST) dataset consists of a large number of fixed-size, grayscale images
of handwritten digits. It includes a set of 60,000 training images, and a set of 10,000
test images. Each image has a dimension
28 ×28
, and each pixel intensity takes a value
between 0 and 255. e training data are also paired with a label (0, 1, 2, ...,9) indicating
the correct number represented in the corresponding image [39].
ALS dataset: Amyotrophic lateral sclerosis (ALS) is a complex progressive neurodegenerative disorder with an estimated prevalence of about 5 per 100,000 people in the United States. The disease severity varies enormously, with many patients surviving only a few years after ALS diagnosis and few living with ALS for decades [49]. We used the ProACT open-access database [50], which collects and aggregates clinical data from 16 ALS clinical trials and one observational study completed in the past twenty years [51]. This dataset contains the information of 2,424 patients with 249 clinical features, tracked over 12 months. The ALS disease progression, which is measured by the change of the Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS) score over time, is used as the target variable. The ALSFRS is a real-valued number in the range [0, 5].
Simulated dataset: Synthetic data were generated using the make_blobs function in scikit-learn (https://scikit-learn.org). There were five centers for the dataset. Each dataset had 2,000 samples and 800 features. The standard deviation for the strong-signal data was 20, while it was 40 for the weak-signal data.
The continuous data were generated using the following formula:

(14)  $Y = X_1^{3} + K\cdot \text{Noise},$

where X was generated by sampling 800 random observations from a multivariate Gaussian distribution. The mean vector of this multivariate Gaussian distribution was generated from a Gaussian distribution with mean zero and variance 25. The eigenvalues of the diagonal variance-covariance matrix of the multivariate Gaussian distribution were generated from a Uniform(2, 12) distribution. The noise term follows a standard Gaussian distribution and its magnitude term, K, was chosen to be 10 for the strong-signal or 50 for the weak-signal simulated datasets.
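A sketch reproducing the flavor of this simulation is shown below. The sample sizes, cluster standard deviations, noise magnitudes, and the distributions of the mean vector and variances follow the description above; the random seeds and the exact covariance structure are our own illustrative choices.

import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(42)

# Cluster-structured data: 2,000 samples, 800 features, five centers;
# cluster_std = 20 for the strong-signal and 40 for the weak-signal variant
X_strong, y_strong = make_blobs(n_samples=2000, n_features=800, centers=5,
                                cluster_std=20, random_state=42)
X_weak, y_weak = make_blobs(n_samples=2000, n_features=800, centers=5,
                            cluster_std=40, random_state=42)

# Continuous outcome following Eq. (14): Y = X_1^3 + K * Noise,
# with K = 10 (strong signal) or K = 50 (weak signal)
means = rng.normal(0, 5, size=800)            # mean vector drawn from N(0, 25)
variances = rng.uniform(2, 12, size=800)      # diagonal covariance entries from Uniform(2, 12)
X = rng.normal(means, np.sqrt(variances), size=(2000, 800))
K = 10
Y = X[:, 0] ** 3 + K * rng.normal(size=2000)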
Validation experimental design
Our experimental design included supervised and unsupervised machine learning
methods using real and simulated datasets with different signal profiles – weak and
strong signals. Figure 2 shows the specific supervised and unsupervised methods,
and the type of data used in the DVM validation protocol. The labels strong and weak
associated with different datasets qualify the relative size of the information content
in the data, i.e., the relative signal to noise ratio. For the observed datasets, this infor-
mation content reflects the power of the covariate features to predict an outcome (for
supervised problems) or the consistency of the derived labels (for unsupervised problems). For the simulated data, the information energy is directly related to the signal-to-noise ratio (SNR < 0.2 vs. SNR > 2.0). For each of the cells in the validation design, we computed the DVM as a parametric surface defined over the 2D grid parameterized by the data sample size and number of features. The reported results include 2D plots of cross-sections of the DVM surface for a fixed sample size or a fixed number of features. We also plotted the complete 3D DVM surfaces rendered as triangulated 2-manifolds. These interactive 3D plots are available in the supplementary materials and are accessible on our webserver.
Strong signal datasets: Fig.3 compares the DVM value to the classification accuracy
and Bayes accuracy rates on the MNIST dataset using the Random Forest classifier.
As the sample size and the number of features increase, the classification accuracy,
Bayes accuracy, and the DVM increase. e
95%
confidence interval is represented by
the shaded area around the DVM curve.
Using the MNIST data, the results in Fig.3a imply that both the classification accu-
racy and DVM drastically increase with increase of the sample size between 500 and
4,500. e accuracy converges to around 0.85 when the sample size approaches 4,500.
In the same range, the DVM also converges to around 0.8. Similar results relative to
the increase of the number of features are show in Fig.3b. As the number of features
approaches 800, the accuracy converges to around 0.86 and the DVM approaches 0.8.
Using strong-signal simulated data, the results in Fig. 4a show that classification accuracy, Bayes accuracy, and DVM increase as the sample size grows from 200 to 2,000. The accuracy converges to around 0.95 and the DVM approaches 0.92 for large sample sizes. The result in Fig. 4b also shows the growth of the classification accuracy and DVM as the number of features increases from 100 to 800, with a plateau around 300 features.
Fig. 2 Summary of experimental design
Fig. 3 Graph panels (a, b) compare the DVM value to the classification accuracy and Bayes accuracy rates for using the Random Forest method on the MNIST dataset, across sample size and the number of features, respectively. As the sample size and the number of features increase, the classification accuracy, Bayes accuracy, and the DVM increase. The shaded area around the DVM curve represents the 95% confidence interval
Figure5 displays a 3D surface plot for the classification accuracy and DVM param-
eterized by the sample size and the number of features. is graph provides more
information compared to the cross-sectional linear plots shown in Figs.3, 4. Interac-
tive 3D surface plots for all experiments are available online (see Appendix 1, 2).
ese results illustrate that for some strong signals, there may be little gain of
increasing the sample-size or the number of features.
Weak-signal datasets: Fig. 6 shows the results of the accuracy and DVM for the
real (ALS) weak-signal dataset. As expected, the DVM pattern is less stable, but still
Fig. 4 The plots on panels (a, b) respectively compare the DVM value to the classification accuracy and
Bayes accuracy rates using the Random Forest method on the strong-signal simulated dataset, across sample
size and the number of features. As the sample size and the number of features increase, the classification
accuracy, Bayes accuracy, and the DVM increase. The shaded area around the DVM curve represents the
95%
confidence interval
Fig. 5 This 3D graph compares the DVM value to the classification accuracy and Bayes accuracy rates for
using the K-Nearest Neighbor classifier on the MNIST dataset, in terms of both the number of samples and
the number of features
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 15 of 23
Noshadetal. J Big Data (2021) 8:82
suggests that adding additional cases or enhancing the features of the data adds little
value to improve the unsupervised clustering of the data (K-means clustering).
Figure7 depicts the DVM trends using the weak simulated data. Again, the overall
low DVM values suggest that increasing the size of augmenting the complexity of weak-
signal data may not significantly improve the subsequent unsupervised clustering.
Interactive 2D and 3D DVM surface plots illustrating the results of each experiment are available online at https://socr.umich.edu/docs/uploads/2020/DVM/. These graphs show the behavior of the DVM spanning the domain of possible numbers of cases and numbers of features for the real and simulated datasets.
In the appendix, we show examples of cases (pairs of datasets and classifiers) where the DVM may actually decrease with an increase of the number of samples or the number of features.
Feature selection
We demonstrate the feature selection algorithm introduced in Algorithm 2 on a simulated dataset. The simulated dataset consists of 1,000 samples randomly drawn from a 4-cluster 2D-Gaussian distribution. The clusters are on a square with edge size 1, where the label for each sample determines the distribution cluster. The dimension of the samples is 20 and the problem is to select up to 15 features. Figure 8 represents the steps of the feature selection algorithm. At each step, the best of all features is selected using DVM and added to the chosen-features set. Note that due to the dimensionality and runtime complexity terms in the DVM definition, we do not expect a monotonic graph; however, the local maxima suggest an appropriate stopping criterion for the feature selection process. Figure 8 shows the performance of the DVM-based feature selection yielding a 6-element feature set, {F18, F4, F13, F9, F5, F12}, corresponding to a high DVM value, DVM = 0.84.
Fig. 6 The DVM value in terms of (a) sample size and (b) the number of features, for using the K-Means method on the ALS dataset. The shaded area around the DVM curve represents the 95% confidence interval
Fig. 7 The DVM value in terms of (a) sample size and (b) the number of features, for using the K-Means method on the simulated dataset. The shaded area around the DVM curve represents the 95% confidence interval. The intervals may be too tight and not visible in some plots
Fig. 8 DVM-based feature selection on a simulated dataset. At each step, the best feature that increases the DVM is selected and added to the chosen-features set. Note that due to the dimensionality and runtime complexity terms in the DVM definition, the feature by DVM value graph is not expected to be monotonic. However, the local maxima suggest appropriate stopping criteria for the feature selection algorithm
Conclusion anddiscussion
is manuscript presents the mathematical formulation, algorithmic implementa-
tion, and computational validation of a data value metric (DVM) for quantifying the
analytical-value and information-content (energy) of a dataset. DVM depends on the
intended data processing, modeling, forecasting or classification strategies used to
interrogate the data. e significance of the work is the introduction of a new meas-
ure of intrinsic data value, the DVM, that complements other traditional measures of
analytic performance, e.g., accuracy, sensitivity, log-odds ratio, Bayesian risk, positive
predictive power, and area under the receiver operating characteristic curve. rough
the experiments presented herein, authors discovered that the DVM captures the
important trends of traditional measures applied to different types of datasets. e
DVM tuning parameter (alpha) provides flexibility for balancing between algorithmic
performance and computational complexity, which facilitates a data-specific quanti-
zation of the relative information content in a dataset.
As the DVM is applicable for a wide range of datasets and a broad gamut of super-
vised and unsupervised analytical methods, it can be used as a single unified measure
to guide the process of data augmentation, data reduction, and feature selection. It
would be interesting to compare the DVM-driven feature selection to other variable
selection methods [1], e.g., filtering methods such as information gain and Markov
blanket filtering, wrapper techniques such as recursive feature elimination and simu-
lated annealing, and embedded strategies such as random forests and weighted-SVM.
The DVM evaluates the analytical value of a dataset relative to a predefined analytical technique for the data interrogation. The two primary benefits of using an information-theoretic measure, such as the regularized DVM, as a data-utility metric include: (1) the estimate of the DVM is easy to compute for each triple of a dataset, analytical strategy, and performance measure, and (2) the DVM magnitude (high or low value) serves as a proxy translating specific data-mining challenges and observable data into a continuous pseudo-distance metric of information content relative to computational complexity.
The normalization of the DVM fidelity term ensures that the information value of the data is standardized in a uniform range, [0, 1]. Relative to an a priori analytical strategy, extreme fidelity values close to 0 or 1 correspond respectively to low-quality and high-information-content datasets. The real data and simulation-based results show that there is a connection between the error rate and the DVM values. However, theoretical bounds on the discrepancy between the prediction error rate and the information-based DVM are not yet understood. Future studies are needed to explore this theoretical relation for various types of analytical methods and data characteristics.
As firm supporters of open science, we have shared all code, data, and results on the DVM GitHub page (https://github.com/SOCR/DVM/) and the SOCR DVM documentation site (https://socr.umich.edu/docs/uploads/2020/DVM/).
Appendix1
Proof of Remark 1
First note that the following equation holds [22]:
Using the chain rule for the mutual information we have
On the other hand we also have the following equation:
From (16) and (18) we obtain the following inequality:
and the equality holds if and only if
I(X;Y|T)=0
. erefore,
F(1)=1
if and only if
the second term in (6) is equal to zero and we have
I(X;Y|T)=0
. An example of a case
with conditions in (9) is when Y is an invertible function of T.
Appendix2
Implementation
Below, we briefly describe the DVM Python package organization and invocation. We
have implemented a DVM python package for our data value metric framework and
made it available on GitHub (https:// github. com/ SOCR/ DVM).
e DVM package can be used on any dataset and any user-defined supervised or
supervised tasks in order to evaluate the quality of the data. e package consists of
three main python files, DVM.py, methods.py, and DVM_plot.py. Please note that DVM.
py uses the mutual information estimator file, EDGE.py, as its dependency.
DVM.py gets the input datasets X and in the case of a supervised task, a set of corre-
sponding labels denoted by Y. Further, the user needs to specify the input parameters,
β
,
problem_type and method.
β
is the coefficient of the regularizer term of DVM. problem_
type specifies whether the task is supervised or unsupervised, and method is the learning
method that is used by the user. For a given method, we can also input the corresponding required arguments. For example, if we are using KNN_classifier from methods.py as our method, it requires the parameters n_neighbors (number of neighbors) and weights (type of weights) as input:

DVM_value = DVM(X, Y, problem_type = supervised, method = KNN_classifier, n_neighbors = 10, weights = uniform)
There are two DVM output values: DVM_value and confidence_band. DVM_value gives the average value computed according to the DVM formula in equation (6), and confidence_band gives the 95% confidence limits of the DVM values computed over the different subsets in equation (13).
The methods.py file consists of various supervised and unsupervised methods. Each supervised method takes the following arguments: X_train, Y_train, X_test, **kwargs. X_train, Y_train, X_test respectively are the training data set and labels, and the test data set for which we would like to predict labels. **kwargs specifies all of the arguments that the given method requires. An example of the format is as follows:

Y_predict = KNN_classifier(X_train, Y_train, X_test, **kwargs)
The output of the method is a numpy array of predicted labels. Note that in addition to the methods listed in the methods.py file, any other user-defined method that satisfies the above format can be used for DVM.
DVM_plot.py gets the input dataset X and, in the case of a supervised task, a set of corresponding labels denoted by Y. Further, the user needs to specify the input parameters continuous, β, problem_type, method, and plot_type. continuous indicates whether the response variable is continuous or discrete. plot_type specifies the plots that the user wants to generate, where plot_type = '3D' generates 3D plots of DVM, plot_type = '2D' generates 2D plots, and plot_type = 'Both' generates both 2D and 3D plots. β, problem_type, and method have the same meaning as in DVM.py. For a given method, we can also input the corresponding required arguments as in DVM.py. The same example as for DVM.py is used here to illustrate the syntax:
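The DVM_plot call below mirrors the DVM.py example above; the exact argument spelling for the additional parameters is our best guess from the parameter list given in this section, so the GitHub documentation should be consulted for the authoritative signature:

DVM_plot(X, Y, continuous = False, problem_type = supervised, method = KNN_classifier, plot_type = Both, n_neighbors = 10, weights = uniform)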
In addition to the 2D and 3D DVM plots, DVM_plot also outputs a dictionary containing Accuracy, MI (mutual information), Complexity, DVM, Sample Number (a sequence of different numbers of samples), and Feature Number (a sequence of different numbers of features).
As calculating the DVM measure actually involves another parameter (λ) that represents the weight-averaging of the DVM fidelity and regularization terms, the actual DVM manifold is intrinsically a surface embedded in 4D. We have designed a kime-surface visualization that allows us to explore the properties of the DVM manifold by including a λ slider that reduces the DVM into an animated 3D space.
Supplementary experiments
The appendix below includes four additional tables of results that illustrate some of the DVM performance in different situations. Readers are encouraged to view the corresponding DVM interactive 2D plots and 3D surface graphs on the website, https://socr.umich.edu/docs/uploads/2020/DVM/. See Tables 2, 3, 4, 5.
Table 2 Test results by the signal profile (Supervised - Strong)
Table 3 Test results by the signal profile (Supervised - Weak)
Acknowledgements
This work was supported in part by NSF grants 1916425, 1734853, 1636840, 1416953, 0716055 and 1023115, NIH grants P20
NR015331, U54 EB020406, P50 NS091856, P30 DK089503, UL1TR002240, R01CA233487, R01MH121079, and K23 ES027221,
and ARO grant W911NF-15-1-0479. The funders played no role in the study design, data collection and analysis, decision
to publish, or preparation of the manuscript. Colleagues at the University of Michigan Statistics Online Computational
Resource (SOCR) and the Michigan Institute for Data Science (MIDAS) contributed ideas, infrastructure, and support for the
project.
Authors’ contributions
ID and MN contributed to the study conception and design. Material preparation, data collection, coding, and analysis were
performed by MN, JC, YS, and ID. All authors read and approved the final manuscript.
Availability of data and materials
All data generated, simulated, or analysed during this study are included in this published article [and its supplementary
information files].
Declarations
Ethics approval and consent to participate
Not applicable.
Table 4 Test results by the signal profile (Unsupervised - Strong)
Table 5 Test results by the signal profile (Unsupervised - Weak)
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA.
2 Statistics Online Computational Resource, University of Michigan, Ann Arbor, MI 48109, USA. 3 Department of Statistics,
University of Michigan, Ann Arbor, MI 48109, USA. 4 Depar tment of Biostatistics, University of Michigan, Ann Arbor, MI
48109, USA. 5 Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA. 6 Michigan
Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA. 7 Stanford Center for Biomedical Informatics
Research, Stanford University, Stanford, CA 94305, USA.
Received: 22 January 2021 Accepted: 27 March 2021
References
1. Dinov ID. Data science and predictive analytics: biomedical and health applications using R. Berlin: Springer; 2018.
2. Raiffa H, Schlaifer R. Applied statistical decision theory. 1961.
3. Baio G. Statistical modeling for health economic evaluations. Ann Rev Statist Appl. 2018;5(1):289–309. https://doi.org/10.1146/annurev-statistics-031017-100404.
4. Baio G, Heath A. When simple becomes complicated: why Excel should lose its place at the top table. London: SAGE
Publications Sage UK; 2017.
5. Parmigiani G, Inoue L. Decision Theory: Principles and Approaches, vol. 812. Hoboken: Wiley; 2009.
6. Jackson C, Presanis A, Conti S, Angelis DD. Value of information: sensitivity analysis and research design in Bayesian evidence synthesis. J Am Statist Associat. 2019;114(528):1436–49. https://doi.org/10.1080/01621459.2018.1562932.
7. Madan J, Ades AE, Price M, Maitland K, Jemutai J, Revill P, Welton NJ. Strategies for efficient computation of the
expected value of partial perfect information. Med Decis Making. 2014;34(3):327–42.
8. Strong M, Oakley JE. An efficient method for computing single-parameter partial expected value of perfect informa-
tion. Med Decis Making. 2013;33(6):755–66.
9. Sadatsafavi M, Bansback N, Zafari Z, Najafzadeh M, Marra C. Need for speed: an efficient algorithm for calculation of
single-parameter expected value of partial perfect information. Value Health. 2013;16(2):438–48.
10. Strong M, Oakley JE, Brennan A. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample: a nonparametric regression approach. Med Decis Making. 2014;34(3):311–26.
11. Strong M, Oakley JE, Brennan A, Breeze P. Estimating the expected value of sample information using the probabilistic
sensitivity analysis sample: a fast, nonparametric regression-based method. Med Decis Making. 2015;35(5):570–83.
12. Heath A, Manolopoulou I, Baio G. Estimating the expected value of partial perfect information in health economic evaluations using integrated nested Laplace approximation. Statist Med. 2016;35(23):4264–80.
13. Meng X-L. Statistical paradises and paradoxes in big data (I): law of large populations, big data paradox, and the 2016 US presidential election. Ann Appl Stat. 2018;12(2):685–726. https://doi.org/10.1214/18-AOAS1161SF.
14. Wang Q, Kulkarni SR, Verdú S. Divergence estimation of continuous distributions based on data-dependent partitions.
IEEE Transact Informat Theory. 2005;51(9):3064–74.
15. Póczos B, Xiong L, Schneider J. Nonparametric divergence estimation with applications to machine learning on distri-
butions. In: UAI, 2011 (also arXiv preprint arXiv:1202.3758, 2012).
16. Berisha V, Wisler A, Hero AO, Spanias A. Empirically estimable classification bounds based on a nonparametric diver-
gence measure. IEEE Transact Signal Process. 2016;64(3):580–91.
17. Noshad M, Hero A. Scalable hash-based estimation of divergence measures. In: International Conference on Artificial
Intelligence and Statistics, 2018;pp. 1877–1885.
18. Noshad M, Xu L, Hero A. Learning to benchmark: Determining best achievable misclassification error from training
data. arXiv preprint arXiv:1909.07192; 2019.
19. Ho S-W, Verdú S. Convexity/concavity of Rényi entropy and α-mutual information. In: Information Theory (ISIT), 2015 IEEE International Symposium On, 2015;pp. 745–749. IEEE.
20. Cover TM, Thomas JA. Elements of information theory. Hoboken: Wiley; 2012.
21. Shwartz-Ziv R, Tishby N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810; 2017.
22. Noshad M, Zeng Y, Hero AO. Scalable mutual information estimation using dependence graphs. In: ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019;pp. 2962–2966. IEEE
23. Ades A, Sutton A. Multiparameter evidence synthesis in epidemiology and medical decision-making: current
approaches. J Royal Stat Soci Series A. 2006;169(1):5–35.
24. Oakley JE, O’Hagan A. Probabilistic sensitivity analysis of complex models: a Bayesian approach. J Royal Statist Soc Series B. 2004;66(3):751–69.
25. Saltelli A, Tarantola S, Campolongo F, Ratto M. Sensitivity analysis in practice: a guide to assessing scientific models.
Chichester: Wiley; 2004.
26. Pan SJ, Yang Q. A survey on transfer learning. IEEE Transact Knowl Data Eng. 2010;22(10):1345–59. https://doi.org/10.1109/TKDE.2009.191.
27. Denison DD, Hansen MH, Holmes CC, Mallick B, Yu B. Nonlinear Estimation and Classification. Lecture Notes in Statistics. Springer; 2013. https://books.google.com/books?id=0IDuBwAAQBAJ.
28. Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature. 2015;521(7553):452–9.
29. Faraway JJ. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman and Hall/CRC; 2016.
30. Tibshirani R. The lasso method for variable selection in the Cox model. Statist Med. 1997;16(4):385–95.
31. Liaw A, Wiener M, et al. Classification and regression by randomForest. R News. 2002;2(3):18–22.
32. Margineantu DD, Dietterich TG. Pruning adaptive boosting. In: ICML, 1997;vol. 97, pp. 211–218. Citeseer
33. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22Nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. KDD ’16, pp. 785–794. ACM, New York; 2016. https://doi.org/10.1145/2939672.2939785.
34. Dudani SA. The distance-weighted k-nearest-neighbor rule. IEEE Transact Syst Man Cybernet. 1976;4:325–7.
35. Hartigan JA, Wong MA. Algorithm AS 136: a k-means clustering algorithm. J Royal Statist Soc Series C. 1979;28(1):100–8.
36. Bodenhofer U, Kothmeier A, Hochreiter S. APCluster: an R package for affinity propagation clustering. Bioinformatics.
2011;27(17):2463–4.
37. Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? J Classificat. 2014;31(3):274–95.
38. Alemi AA, Fischer I, Dillon JV, Murphy K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410; 2016.
39. Deng L. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal
Process Magaz. 2012;29(6):141–2. https://doi.org/10.1109/MSP.2012.2211477.
40. Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004;69(6):066138.
41. Moon Y, Rajagopalan B, Lall U. Estimation of mutual information using kernel density estimators. Phys Rev E.
1995;52(3):2318.
42. Kwak N, Choi C-H. Input feature selection by mutual information based on parzen window. IEEE Transact Pattern Analy
Mach Intell. 2002;24(12):1667–71.
43. Stowell D, Plumbley MD. Fast multidimensional entropy estimation by k-d partitioning. IEEE Signal Process Lett. 2009;16(6):537–40. https://doi.org/10.1109/LSP.2009.2017346.
44. Evans D. A computationally efficient estimator for mutual information. In: Proceedings of the Royal Society of London
A: Mathematical, Physical and Engineering Sciences, 2008;vol. 464, pp. 1203–1215. The Royal Society
45. Walters-Williams J, Li Y. Estimation of mutual information: A survey. In: International Conference on Rough Sets and
Knowledge Technology, 2009;pp. 389–396. Springer
46. Singh S, Póczos B. Generalized exponential concentration inequality for Rényi divergence estimation. In: International
Conference on Machine Learning, 2014;pp. 333–341.
47. Noshad M, Moon KR, Sekeh SY, Hero AO. Direct estimation of information divergence using nearest neighbor ratios. In:
2017 IEEE International Symposium on Information Theory (ISIT), 2017;pp. 903–907. IEEE
48. Noshad M, Hero AO. Scalable hash-based estimation of divergence measures. In: 2018 Information Theory and Applica-
tions Workshop (ITA), 2018; pp. 1–10. IEEE
49. Tang M, Gao C, Goutman SA, Kalinin A, Mukherjee B, Guan Y, Dinov ID. Model-based and model-free techniques for
amyotrophic lateral sclerosis diagnostic prediction and patient clustering. Neuroinformatics. 2019;17(3):407–21. https://doi.org/10.1007/s12021-018-9406-9.
50. Rahme R, Yeatts SD, Abruzzo TA, Jimenez L, Fan L, Tomsick TA, Ringer AJ, Furlan AJ, Broderick JP, Khatri P. Early reperfusion and clinical outcomes in patients with M2 occlusion: pooled analysis of the PROACT II, IMS, and IMS II studies. J Neurosurgery JNS. 2014;121(6):1354–8.
51. Glass JD, Hertzberg VS, Boulis NM, Riley J, Federici T, Polak M, Bordeau J, Fournier C, Johe K, Hazel T, Cudkowicz M, Atassi
N, Borges LF, Rutkove SB, Duell J, Patil PG, Goutman SA, Feldman EL. Transplantation of spinal cord–derived neural stem
cells for ALS. Neurology. 2016;87(4):392–400. https://doi.org/10.1212/WNL.0000000000002889. https://n.neurology.org/content/87/4/392.full.pdf
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.