Detecting Novel Associations in Large Data Sets

DOI: 10.1126/science.1205438
David N. Reshef et al., Science 334, 1518 (2011)
David N. Reshef,1,2,3* Yakir A. Reshef,2,4* Hilary K. Finucane,5
Sharon R. Grossman,2,6 Gilean McVean,3,7 Peter J. Turnbaugh,6
Eric S. Lander,2,8,9 Michael Mitzenmacher,10 Pardis C. Sabeti2,6
Identifying interesting relationships between pairs of variables in large data sets is increasingly
important. Here, we present a measure of dependence for two-variable relationships: the maximal
information coefficient (MIC). MIC captures a wide range of associations both functional and
not, and for functional relationships provides a score that roughly equals the coefficient of
determination ($R^2$) of the data relative to the regression function. MIC belongs to a larger
class of maximal information-based nonparametric exploration (MINE) statistics for identifying
and classifying relationships. We apply MIC and MINE to data sets in global health, gene
expression, major-league baseball, and the human gut microbiota and identify known and
novel relationships.
Imagine a data set with hundreds of variables,
which may contain important, undiscovered
relationships. There are tens of thousands of
variable pairs, far too many to examine manually.
If you do not already know what kinds of
relationships to search for, how do you efficiently
identify the important ones? Data sets of this size
are increasingly common in fields as varied as
genomics, physics, political science, and econom-
ics, making this question an important and grow-
ing challenge (1,2).
One way to begin exploring a large data set
is to search for pairs of variables that are closely
associated. To do this, we could calculate some
measure of dependence for each pair, rank the
pairs by their scores, and examine the top-scoring
pairs. For this strategy to work, the statistic we
use to measure dependence should have two heu-
ristic properties: generality and equitability.
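The ranking strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_pairs` and `squared_pearson` and the toy variables are hypothetical, and squared Pearson correlation is used purely as a stand-in score; any dependence statistic with the same call signature could be substituted.

```python
from itertools import combinations

import numpy as np


def squared_pearson(x, y):
    """Stand-in dependence score (an assumption for this sketch, not MIC)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def rank_pairs(columns, score=squared_pearson):
    """Score every unordered pair of variables and sort from highest to lowest.

    `columns` maps a variable name to a 1-D array; all arrays share one length.
    """
    scored = [
        (score(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)


# Toy data (hypothetical): "b" is a linear function of "a"; "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = {"a": a, "b": 3.0 * a + 1.0, "c": rng.normal(size=200)}
ranking = rank_pairs(toy)
top_score, top_first, top_second = ranking[0]
```

For d variables the loop visits d(d - 1)/2 pairs, so the cost of the chosen score function dominates the run time on large data sets.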
By generality, we mean that with sufficient
sample size the statistic should capture a wide
range of interesting associations, not limited to
specific function types (such as linear, exponential,
or periodic), or even to all functional relation-
ships (3). The latter condition is desirable because
not only do relationships take many functional
forms, but many important relationships (for
example, a superposition of functions) are not well
modeled by a function (4-7).
By equitability, we mean that the statistic
should give similar scores to equally noisy rela-
tionships of different types. For example, we do
not want noisy linear relationships to drive strong
sinusoidal relationships from the top of the list.
Equitability is difficult to formalize for associa-
tions in general but has a clear interpretation in
the basic case of functional relationships: An
equitable statistic should give similar scores to
functional relationships with similar $R^2$ values
(given sufficient sample size).
Here, we describe an exploratory data anal-
ysis tool, the maximal information coefficient
(MIC), that satisfies these two heuristic
properties. We establish MIC's generality through proofs,
show its equitability on functional relationships
through simulations, and observe that this trans-
lates into intuitively equitable behavior on more
general associations. Furthermore, we illustrate
that MIC gives rise to a larger family of sta-
tistics, which we refer to as MINE, or maximal
information-based nonparametric exploration.
MINE statistics can be used not only to identify
interesting associations, but also to characterize
them according to properties such as nonline-
arity and monotonicity. We demonstrate the
application of MIC and MINE to data sets in
health, baseball, genomics, and the human
microbiota.
The maximal information coefficient. Intuitively,
MIC is based on the idea that if a relationship
exists between two variables, then a grid can be
drawn on the scatterplot of the two variables that
partitions the data to encapsulate that
relationship. Thus, to calculate the MIC of a set
of two-variable data, we explore all grids up to a
maximal grid resolution, dependent on the sample
size (Fig. 1A), computing for every pair of
integers (x, y) the largest possible mutual
information achievable by any x-by-y grid applied
to the data. We then normalize these mutual
information values to ensure a fair comparison
between grids of different dimensions and to
obtain modified values between 0 and 1. We define
the characteristic matrix $M = (m_{x,y})$, where
$m_{x,y}$ is the highest normalized mutual
information achieved by any x-by-y grid, and the
statistic MIC to be the maximum value in M
(Fig. 1, B and C).
1 Department of Computer Science, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA.
2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
3 Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.
4 Department of Mathematics, Harvard College, Cambridge, MA 02138, USA.
5 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.
6 Center for Systems Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
7 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK.
8 Department of Biology, MIT, Cambridge, MA 02139, USA.
9 Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.
10 School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.
*These authors contributed equally to this work.
To whom correspondence should be addressed. E-mail:
dnreshef@mit.edu (D.N.R.); yreshef@post.harvard.edu (Y.A.R.)
These authors contributed equally to this work.

Fig. 1. Computing MIC. (A) For each pair (x, y), the
MIC algorithm finds the x-by-y grid with the highest
induced mutual information. (B) The algorithm
normalizes the mutual information scores and
compiles a matrix that stores, for each resolution,
the best grid at that resolution and its normalized
score. (C) The normalized scores form the
characteristic matrix, which can be visualized as a
surface; MIC corresponds to the highest point on this
surface. In this example, there are many grids that
achieve the highest score. The star in (B) marks a
sample grid achieving this score, and the star in (C)
marks that grid's corresponding location on the
surface.

More formally, for a grid G, let $I_G$ denote
the mutual information of the probability
distribution induced on the boxes of G, where the
probability of a box is proportional to the number
of data points falling inside the box. The
(x, y)-th entry $m_{x,y}$ of the characteristic matrix
equals $\max\{I_G\} / \log \min\{x, y\}$, where the
maximum is taken over all x-by-y grids G. MIC is the
maximum of $m_{x,y}$ over ordered pairs (x, y) such
that $xy < B$, where B is a function of sample size;
we usually set $B = n^{0.6}$ (see SOM Section 2.2.1).
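The definition above can be made concrete with a deliberately simplified Python sketch. It is not the authors' algorithm: instead of maximizing over all x-by-y grids, it tries only one grid per resolution, the equal-frequency grid, so each normalized score is at most the corresponding entry $m_{x,y}$ and the result is only a rough stand-in for MIC. The function names and the natural-log base are assumptions made for this illustration.

```python
import numpy as np


def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding (nearly) equal numbers of points."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k) // len(values)


def grid_mutual_information(col_bins, row_bins, n_cols, n_rows):
    """Mutual information (nats) of the distribution a grid induces on its boxes."""
    counts = np.zeros((n_cols, n_rows))
    np.add.at(counts, (col_bins, row_bins), 1)  # count the points in each box
    p = counts / counts.sum()
    p_col = p.sum(axis=1, keepdims=True)
    p_row = p.sum(axis=0, keepdims=True)
    occupied = p > 0
    return float(np.sum(p[occupied] * np.log(p[occupied] / (p_col @ p_row)[occupied])))


def approx_mic(x, y, exponent=0.6):
    """Best normalized score over equal-frequency grids with n_cols * n_rows < B."""
    x, y = np.asarray(x), np.asarray(y)
    budget = len(x) ** exponent  # B = n^0.6, the setting quoted in the text
    best = 0.0
    n_cols = 2
    while n_cols * 2 < budget:
        n_rows = 2
        while n_cols * n_rows < budget:
            info = grid_mutual_information(
                equal_frequency_bins(x, n_cols),
                equal_frequency_bins(y, n_rows),
                n_cols,
                n_rows,
            )
            # Normalize by log min{x, y} so that scores lie between 0 and 1.
            best = max(best, info / np.log(min(n_cols, n_rows)))
            n_rows += 1
        n_cols += 1
    return float(best)


rng = np.random.default_rng(0)
x = rng.uniform(size=500)
line_score = approx_mic(x, 2.0 * x + 1.0)            # noiseless line
noise_score = approx_mic(x, rng.uniform(size=500))   # independent variables
```

On the noiseless line the 2-by-2 equal-frequency grid already places every point on the diagonal, so the sketch returns a score of essentially 1, while the independent pair scores close to 0.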
Fig. 2A (table of scores):

Relationship Type                    MIC    Pearson  Spearman  MI (KDE)  MI (Kraskov)  CorGC  Maximal Correlation
Random                               0.18   -0.02    -0.02     0.01       0.03         0.19   0.01
Linear                               1.00    1.00     1.00     5.03       3.89         1.00   1.00
Cubic                                1.00    0.61     0.69     3.09       3.12         0.98   1.00
Exponential                          1.00    0.70     1.00     2.09       3.62         0.94   1.00
Sinusoidal (Fourier frequency)       1.00   -0.09    -0.09     0.01      -0.11         0.36   0.64
Categorical                          1.00    0.53     0.49     2.22       1.65         1.00   1.00
Periodic/Linear                      1.00    0.33     0.31     0.69       0.45         0.49   0.91
Parabolic                            1.00   -0.01    -0.01     3.33       3.15         1.00   1.00
Sinusoidal (non-Fourier frequency)   1.00    0.00     0.00     0.01       0.20         0.40   0.80
Sinusoidal (varying frequency)       1.00   -0.11    -0.11     0.02       0.06         0.38   0.76

MI, mutual information; CorGC, principal curve-based dependence measure.

Fig. 2, panels B to F: each statistic plotted against noise (1 - $R^2$). Panel G: MIC versus added noise for six relationship types (Two Lines, Line and Parabola, X, Ellipse, Sinusoid (mixture of two signals), and Non-coexistence).

Fig. 2. Comparison of MIC to existing methods. (A) Scores given to various
noiseless functional relationships by several different statistics (8,12,14,19).
Maximal scores in each column are accentuated. (B to F) The MIC, Spearman
correlation coefficient, mutual information (Kraskov et al. estimator), maximal
correlation (estimated using ACE), and the principal curve-based CorGC
dependence measure, respectively, of 27 different functional relationships with
independent uniform vertical noise added, as the $R^2$ value of the data relative to
the noiseless function varies. Each shape and color corresponds to a different
combination of function type and sample size. In each plot, pairs of thumbnails
show relationships that received identical scores; for data exploration, we would
like these pairs to have similar noise levels. For a list of the functions and sample
sizes in these graphs as well as versions with other statistics, sample sizes, and
noise models, see figs. S3 and S4. (G) Performance of MIC on associations not
well modeled by a function, as noise level varies. For the performance of other
statistics, see figs. S5 and S6.
Every entry of M falls between 0 and 1, and
so MIC does as well. MIC is also symmetric [i.e.,
MIC(X, Y) = MIC(Y, X)] due to the symmetry of
mutual information, and because $I_G$ depends
only on the rank order of the data, MIC is
invariant under order-preserving transformations of
the axes. Notably, although mutual information
is used to quantify the performance of each grid,
MIC is not an estimate of mutual information
(SOM Section 2).
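The symmetry and rank-invariance properties just stated can be checked numerically on a single grid. The sketch below is an illustration under assumptions: it uses equal-frequency grids (one convenient rank-based choice, not the grids the authors' algorithm selects), and the helper names are hypothetical. Because $I_G$ is computed from the box counts alone, identical counts imply identical mutual information.

```python
import numpy as np


def rank_bins(values, k):
    """Equal-frequency bin labels; they depend on the values only through their ranks."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k) // len(values)


def grid_counts(x, y, n_cols, n_rows):
    """Number of data points in each box of an n_cols-by-n_rows equal-frequency grid."""
    counts = np.zeros((n_cols, n_rows), dtype=int)
    np.add.at(counts, (rank_bins(x, n_cols), rank_bins(y, n_rows)), 1)
    return counts


rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=300)

original = grid_counts(x, y, 4, 3)
# exp and cubing are strictly increasing, so ranks (and hence boxes) are unchanged.
transformed = grid_counts(np.exp(x), y ** 3, 4, 3)
# Swapping the variables and the grid dimensions transposes the table of counts.
swapped = grid_counts(y, x, 3, 4)
```

The transformed data fill the boxes exactly as the original data do, and swapping the roles of the two variables only transposes the count table, which leaves mutual information unchanged.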
To calculate M, we would ideally optimize
over all possible grids. For computational effi-
ciency, we instead use a dynamic programming
algorithm that optimizes over a subset of the pos-
sible grids and appears to approximate well the
true value of MIC in practice (SOM Section 3).
Main properties of MIC. We have proven
mathematically that MIC is general in the sense
described above. Our proofs show that, with
probability approaching 1 as sample size grows, (i)
MIC assigns scores that tend to 1 to all
never-constant noiseless functional relationships;
(ii) MIC assigns scores that tend to 1 for a larger
class of noiseless relationships (including
superpositions of noiseless functional
relationships); and (iii) MIC assigns scores that
tend to 0 to statistically independent variables.
Specifically, we have proven that for a pair of
random variables X and Y, (i) if Y is a function of
X that is not constant on any open interval, then
data drawn from (X, Y) will receive an MIC tending
to 1 with probability one as sample size grows;
(ii) if the support of (X, Y) is described by a
finite union of differentiable curves of the form
c(t) = [x(t), y(t)] for t in [0, 1], then data drawn
from (X, Y) will receive an MIC tending to 1 with
probability one as sample size grows, provided
that dx/dt and dy/dt are each zero on finitely
many points; (iii) the MIC of data drawn from
(X, Y) converges to zero in probability as sample
size grows if and only if X and Y are
statistically independent. We have also proven that
the MIC of a noisy functional relationship is
bounded from below by a function of its $R^2$.
(For proofs, see SOM.)
We tested MIC's equitability through
simulations. These simulations confirm the
mathematical result that noiseless functional
relationships (i.e., $R^2$ = 1.0) receive MIC scores
approaching 1.0 (Fig. 2A). They also show that, for
a large collection of test functions with varied
sample sizes, noise levels, and noise models, MIC
roughly equals the coefficient of determination
$R^2$ relative to each respective noiseless function.
This makes it easy to interpret and compare scores
across various function types (Fig. 2B and fig.
S4). For instance, at reasonable sample sizes, a
sinusoidal relationship with a noise level of
$R^2$ = 0.80 and a linear relationship with the same
$R^2$ value receive nearly the same MIC score. For a
wide range of associations that are not well
modeled by a function, we also show that MIC
scores degrade in an intuitive manner as noise
is added (Fig. 2G and figs. S5 and S6).
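To make "equally noisy" concrete, the following Python sketch generates a line and a sinusoid with uniform vertical noise scaled so that both have roughly the same $R^2$ relative to their noiseless functions. It is a minimal illustration, not the authors' simulation code: the function names, the particular test functions, and the sample size are assumptions, and the noise scaling uses the fact that Uniform(-a, a) noise has variance a^2/3.

```python
import numpy as np


def r_squared(y_noisy, y_clean):
    """Coefficient of determination of noisy data relative to a known noiseless function."""
    residual = np.sum((y_noisy - y_clean) ** 2)
    total = np.sum((y_noisy - np.mean(y_noisy)) ** 2)
    return 1.0 - residual / total


def add_uniform_noise(y_clean, target_r2, rng):
    """Add uniform vertical noise whose variance is chosen to give roughly target_r2."""
    # R^2 ~ var(f) / (var(f) + var(noise))  =>  var(noise) = var(f) * (1 - R^2) / R^2
    noise_variance = np.var(y_clean) * (1.0 - target_r2) / target_r2
    half_width = np.sqrt(3.0 * noise_variance)  # Uniform(-a, a) has variance a^2 / 3
    return y_clean + rng.uniform(-half_width, half_width, size=len(y_clean))


rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=5000)
line = 2.0 * x
sinusoid = np.sin(8.0 * np.pi * x)

r2_line = r_squared(add_uniform_noise(line, 0.80, rng), line)
r2_sinusoid = r_squared(add_uniform_noise(sinusoid, 0.80, rng), sinusoid)
```

An equitable statistic would assign these two noisy data sets similar scores, since both sit near $R^2$ = 0.80 despite their different shapes.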
Comparisons to other methods. We compared
MIC to a wide range of methods, including
methods formulated around the axiomatic framework
for measures of dependence developed by Rényi
(8), other state-of-the-art measures of dependence,
and several nonparametric curve estimation
techniques that can be used to score pairs of
variables based on how well they fit the estimated
curve.
Methods such as splines (1) and regression
estimators (1,9,10) tend to be equitable across
functional relationships (11) but are not
general: They fail to find many simple and important
types of relationships that are not functional.
(Figures S5 and S6 depict examples of
relationships of this type from existing literature,
and compare these methods to MIC on such
relationships.) Although these methods are not
intended to provide generality, the failure to assign
high scores in such cases makes them unsuitable for
identifying all potentially interesting relationships
in a data set.

Fig. 3. Visualizations of the characteristic matrices of common
relationships. (A to F) Surfaces representing the characteristic matrices of
several common relationship types. For each surface, the x axis represents
number of vertical axis bins (rows), the y axis represents number of
horizontal axis bins (columns), and the z axis represents the normalized score
of the best-performing grid with those dimensions. The inset plots show the
relationships used to generate each surface. For surfaces of additional
relationships, see fig. S7.
Other methods such as mutual information
estimators (12-14), maximal correlation (8,15),
principal curve-based methods (16-19,20),
distance correlation (21), and the Spearman rank
correlation coefficient all detect broader classes
of relationships. However, they are not equitable
even in the basic case of functional
relationships: They show a strong preference for some
types of functions, even at identical noise levels
(Fig. 2, A and C to F). For example, at a sample
size of 250, the Kraskov et al. mutual
information estimator (14) assigns a score of 3.65 to
a noiseless line but only 0.59 to a noiseless
sinusoid, and it gives equivalent scores to a very
noisy line ($R^2$ = 0.35) and to a much cleaner
sinusoid ($R^2$ = 0.80) (Fig. 2D). Again, these
results are not surprising: they correctly
reflect the properties of mutual information. But
this behavior makes these methods less
practical for data exploration.

Fig. 4. Application of MINE to global indicators from the WHO. (A) MIC
versus r for all pairwise relationships in the WHO data set. (B) Mutual
information (Kraskov et al. estimator) versus r for the same relationships.
High mutual information scores tend to be assigned only to relationships
with high r, whereas MIC gives high scores also to relationships that are
nonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yield
low scores for unassociated variables. (D) Ordinary linear relationships score
high under both tests. (E to G) Relationships detected by MIC but not by r,
because the relationships are nonlinear (E and G) or because more than one
relationship is present (F). In (F), the linear trendline comprises a set of
Pacific island nations in which obesity is culturally valued (33); most other
countries follow a parabolic trend (table S10). (H) A superposition of two
relationships that scores high under all three tests, presumably because the
majority of points obey one relationship. The less steep minority trend
consists of countries whose economies rely largely on oil (37) (table S11).
The lines of best fit in (D) to (H) were generated using regression on each
trend. (I) Of these four relationships, the left two appear less noisy than the
right two. MIC accordingly assigns higher scores to the two relationships on
the left. In contrast, mutual information assigns similar scores to the top two
relationships and similar scores to the bottom two relationships.
An expanded toolkit for exploration. The
basic approach of MIC can be extended to de-
fine a broader class of MINE statistics based
on both MIC and the characteristic matrix M.
These statistics can be used to rapidly charac-
terize relationships that may then be studied with
more specialized or computationally intensive
techniques.
Some statistics are derived, like MIC, from
the spectrum of grid resolutions contained in M.
Different relationship types give rise to different
types of characteristic matrices (Fig. 3). For
example, just as a characteristic matrix with a high
maximum indicates a strong relationship, a
symmetric characteristic matrix indicates a
monotonic relationship. We can thus detect deviation
from monotonicity with the maximum asymmetry
score (MAS), defined as the maximum
over M of $|m_{x,y} - m_{y,x}|$. MAS is useful, for
example, for detecting periodic relationships with
unknown frequencies that vary over time, a
common occurrence in real data (22). MIC and MAS
together detect such relationships more
effectively than either Fisher's test (23) or a recent
specialized test developed by Ahdesmäki et al.
(figs. S8 and S9) (24).
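Given a characteristic matrix, MAS as defined above is a one-line computation. The Python sketch below assumes, for illustration only, that the matrix is stored as a dictionary keyed by grid dimensions (x, y); the entries in the two examples are invented and do not come from the paper.

```python
def maximum_asymmetry_score(characteristic):
    """Largest |m[x, y] - m[y, x]| over entries whose mirror entry is also present.

    `characteristic` maps grid dimensions (x, y) to normalized scores m_{x,y}.
    """
    gaps = [
        abs(score - characteristic[(y, x)])
        for (x, y), score in characteristic.items()
        if (y, x) in characteristic
    ]
    return max(gaps, default=0.0)


# Hypothetical characteristic-matrix entries, invented for illustration only.
symmetric_like = {(2, 2): 0.9, (2, 3): 0.8, (3, 2): 0.8}
asymmetric_like = {(2, 2): 0.4, (2, 5): 0.9, (5, 2): 0.3}

mas_symmetric = maximum_asymmetry_score(symmetric_like)
mas_asymmetric = maximum_asymmetry_score(asymmetric_like)
```

A dictionary is used rather than a square array because the characteristic matrix is only defined for grid sizes with xy < B, so not every (x, y) has an entry.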
Because MIC is general and roughly equal to
$R^2$ on functional relationships, we can also define
a natural measure of nonlinearity by $\mathrm{MIC} - r^2$,
where r denotes the Pearson product-moment
correlation coefficient, a measure of linear
dependence. The statistic $\mathrm{MIC} - r^2$ is near 0 for linear
relationships and large for nonlinear relationships
with high values of MIC. As seen in the real-world
examples below, it is useful for uncovering novel
nonlinear relationships.
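The nonlinearity measure is simple to compute once an MIC value is available. In the Python sketch below the MIC score is passed in as a number rather than computed; the value 1.0 used in the examples is an assumption that matches the scores the text reports for noiseless functional relationships, and the function name is hypothetical.

```python
import numpy as np


def nonlinearity(mic, x, y):
    """MIC - r^2, where r is the Pearson correlation of x and y and `mic` is supplied."""
    r = np.corrcoef(x, y)[0, 1]
    return float(mic - r ** 2)


x = np.linspace(-1.0, 1.0, 201)
# MIC = 1.0 is assumed here for both noiseless relationships (see lead-in).
parabola_gap = nonlinearity(1.0, x, x ** 2)        # r is about 0, so the gap is about 1
line_gap = nonlinearity(1.0, x, 2.0 * x + 1.0)     # r is 1, so the gap is about 0
```

The parabola on a symmetric interval has Pearson correlation near zero, so a high MIC translates directly into a large nonlinearity score, whereas the line scores near zero.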
Similar MINE statistics can be defined to
detect properties that we refer to as
"complexity" and "closeness to being a function." We
provide formal definitions and a performance
summary of these two statistics (SOM section
2.3 and table S1). Finally, MINE statistics can
also be used in cluster analysis to observe the
higher-order structure of data sets (SOM
section 4.9).
Application of MINE to real data sets. We
used MINE to explore four high-dimensional
data sets from diverse fields. Three data sets
have previously been analyzed and contain
many well-understood relationships. These data
sets are (i) social, economic, health, and political
indicators from the World Health Organization
(WHO) and its partners (7,25); (ii) yeast gene
expression profiles from a classic paper re-
porting genes whose transcript levels vary pe-
riodically with the cell cycle (26); and (iii)
performance statistics from the 2008 Major
League Baseball (MLB) season (27,28). For
our fourth analysis, we applied MINE to a data
set that has not yet been exhaustively analyzed:
a set of bacterial abundance levels in the human
gut microbiota (29). All relationships discussed
in this section are significant at a false discov-
ery rate of 5%; p-values and q-values are listed
in the SOM.
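For readers unfamiliar with false discovery rate control, the Python sketch below shows the Benjamini-Hochberg procedure, one standard way to turn p-values into q-values. This passage does not say which procedure was used for the analyses reported here, so the choice of Benjamini-Hochberg, the function name, and the example p-values are all assumptions made for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each q-value is the smallest scaled value at or after its rank.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(monotone, 1.0)
    return q


# Invented p-values, for illustration only.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = q_values <= 0.05  # relationships retained at a 5% false discovery rate
```

With tens of thousands of variable pairs tested at once, a correction of this kind is what keeps the list of reported relationships from being dominated by chance findings.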
We explored the WHO data set (357
variables, 63,546 variable pairs) with MIC, the
commonly used Pearson correlation coefficient (r),
and Kraskov's mutual information estimator
(Fig. 4 and table S9). All three statistics detected
many linear relationships. However, mutual
information gave low ranks to many nonlinear
relationships that were highly ranked by MIC
(Fig. 4, A and B). Two-thirds of the top 150
relationships found by mutual information were
strongly linear (|r| ≥ 0.97), whereas most of the
top 150 relationships found by MIC had |r|
below this threshold. Further, although equitability
is difficult to assess for general associations, the
results on some specific relationships suggest
that MIC comes closer than mutual information
to this goal (Fig. 4I). Using the nonlinearity
measure $\mathrm{MIC} - r^2$, we found several interesting
relationships (Fig. 4, E to G), many of which are
confirmed by existing literature (30-32). For
example, we identified a superposition of two
functional associations between female obesity and
income per person: one from the Pacific Islands,
where female obesity is a sign of status (33), and
one from the rest of the world, where weight
and status do not appear to be linked in this way
(Fig. 4F).
We next explored a yeast gene expression
data set (6223 genes) that was previously
analyzed with a special-purpose statistic developed
by Spellman et al. to identify genes whose
transcript levels oscillate during the cell cycle
(26). Of the genes identified by Spellman et al.
and MIC, 70 and 69%, respectively, were also
identified in a later study with more time points
conducted by Tu et al. (22). However, MIC
identified genes at a wider range of frequencies
than did Spellman et al., and MAS sorted those
genes by frequency (Fig. 5). Of the genes
identified by MINE as having high frequency (MAS >
75th percentile), 80% were identified by Spellman
et al., while of the low-frequency genes (MAS <
25th percentile), Spellman et al. identified only
20% (Fig. 5B). For example, although both
methods found the well-known cell-cycle
regulator HTB1 (Fig. 5G) required for chromatin
assembly, only MIC detected the heat-shock protein
HSP12 (Fig. 5E), which Tu et al. confirmed to
be in the top 4% of periodic genes in yeast.
HSP12, along with 43% of the genes identified
by MINE but not Spellman et al., was also in
the top third of statistically significant periodic
genes in yeast according to the more sophisticated
specialty statistic of Ahdesmäki et al., which was
specifically designed for finding periodic
relationships without a prespecified frequency in
biological systems (24). Because of MIC's
generality and the small size of this data set
(n = 24), relatively few of the genes analyzed (5%)
had significant MIC scores after multiple testing
correction at a false discovery rate of 5%.
However, using a less conservative false discovery
rate of 15% yielded a larger list of significant
genes (16% of all genes analyzed), and this
larger list still attained a 68% confirmation rate
by Tu et al.

Fig. 5. Application of MINE to Saccharomyces cerevisiae gene expression data. (A) MIC versus
scores obtained by Spellman et al. for all genes considered (26). Genes with high Spellman scores
tend to receive high MIC scores, but some genes undetected by Spellman's analysis also received
high MICs. (B) MAS versus Spellman's statistic for genes with significant MICs. Genes with a high
Spellman score also tend to have a high MAS score. (C to G) Examples of genes with high MIC and
varying MAS (trend lines are moving averages). MAS sorts the MIC-identified genes by frequency. A
higher MAS signifies a shorter wavelength for periodic data, indicating that the genes found by
Spellman et al. are those with shorter wavelengths. None of the examples except for (F) and (G)
were detected by Spellman's analysis. However, subsequent studies have shown that (C) to (E) are
periodic genes with longer wavelengths (22,24). More plots of genes detected with MIC and MAS
are given in fig. S11.
In the MLB data set (131 variables), MIC and r
both identified many linear relationships, but
interesting differences emerged. On the basis
of r, the strongest three correlates with player
salary are walks, intentional walks, and runs
batted in. By contrast, the strongest three asso-
ciations according to MIC are hits, total bases,
and a popular aggregate offensive statistic called
Replacement Level Marginal Lineup Value
(27,34) (fig. S12 and table S12). We leave it
to baseball enthusiasts to decide which of these
statistics are (or should be!) more strongly tied
to salary.
Our analysis of gut microbiota focused on
the relationships between prevalence levels of
the trillions of bacterial species that colonize the
gut of humans and other mammals (35,36). The
data set consisted of large-scale sequencing of
16S ribosomal RNA from the distal gut
microbiota of mice colonized with a human fecal
sample (29). After successful colonization, a subset
of the mice was shifted from a low-fat,
plant-polysaccharide-rich (LF/PP) diet to a high-fat,
high-sugar "Western" diet. Our initial analysis
identified 9472 significant relationships (out of
22,414,860) between species-level groups called
operational taxonomic units (OTUs);
significantly more of these relationships occurred between
OTUs in the same bacterial family than expected
by chance (30% versus 24 ± 0.6%).
Examining the 1001 top-scoring nonlinear
relationships (MIC − r² > 0.2), we observed that a
common association type was noncoexistence:
When one species is abundant the other is less
abundant than expected by chance, and vice versa
(Fig. 6, A, B, and D). Additionally, we found that
312 of the top 500 nonlinear relationships were
affected by one or more factors for which data
were available (host diet, host sex, identity of
human donor, collection method, and location in the
gastrointestinal tract; SOM section 4.8). Many
are noncoexistence relationships that are
explained by diet (Fig. 6A and table S13). These
diet-explained noncoexistence relationships occur
at a range of taxonomic depths (interphylum,
interfamily, and intrafamily) and form a highly
interconnected network of nonlinear relationships
(Fig. 6E).
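The nonlinearity filter used above (MIC minus r squared greater than 0.2) is simple to apply once both scores are available for each pair. The sketch below assumes the MIC and r values have already been computed elsewhere; the `Relationship` type, the function names, and the scores are hypothetical and chosen only to show the thresholding and ranking step.

```python
from typing import List, NamedTuple, Tuple


class Relationship(NamedTuple):
    """One scored variable pair; all values used below are hypothetical."""
    pair: Tuple[str, str]
    mic: float  # maximal information coefficient, in [0, 1]
    r: float    # Pearson correlation coefficient, in [-1, 1]


def nonlinearity(rel: Relationship) -> float:
    """Return MIC - r^2, the nonlinearity score used in the text."""
    return rel.mic - rel.r ** 2


def top_nonlinear(rels: List[Relationship],
                  threshold: float = 0.2) -> List[Relationship]:
    """Keep pairs with MIC - r^2 > threshold, highest score first."""
    kept = [rel for rel in rels if nonlinearity(rel) > threshold]
    return sorted(kept, key=nonlinearity, reverse=True)


# Toy scores (made up for illustration, not taken from the paper).
scores = [
    Relationship(("otu_a", "otu_b"), mic=0.62, r=-0.30),  # 0.62 - 0.09 = 0.53
    Relationship(("otu_c", "otu_d"), mic=0.85, r=0.90),   # 0.85 - 0.81 = 0.04
    Relationship(("otu_e", "otu_f"), mic=0.40, r=0.10),   # 0.40 - 0.01 = 0.39
]

ranked = top_nonlinear(scores)
ranked_pairs = [rel.pair for rel in ranked]
```

The strongly linear pair (`otu_c`, `otu_d`) is dropped even though its MIC is the highest, because nearly all of its MIC is accounted for by r squared.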
The remaining 188 of the 500 highly ranked
nonlinear relationships were not affected by
any of the factors in the data set and included
many noncoexistence relationships (table S14
and Fig. 6D). These unexplained noncoexistence
relationships may suggest interspecies
competition and/or additional selective
factors that shape gut microbial ecology and
therefore represent promising directions for
future study.

[Figure 6. Scatterplot axes give abundance (%) for the OTU pairs OTU5948 (Bacteroidaceae) and OTU710 (Erysipelotrichaceae); OTU4273 (Eubacteriaceae) and OTU3857 (Porphyromonadaceae); OTU1462 (Lachnospiraceae) and OTU708 (Lachnospiraceae); OTU3349 (Porphyromonadaceae) and OTU2728 (Lachnospiraceae). Legends: LF/PP versus Western; Male versus Female; Fresh, Second Generation, Frozen.]

Fig. 6. Associations between bacterial
species in the gut microbiota of humanized
mice. (A) A noncoexistence
relationship explained by diet: Under
the LF/PP diet a Bacteroidaceae species-level
OTU dominates, whereas under
a Western diet an Erysipelotrichaceae
species dominates. (B) A noncoexistence
relationship occurring only in males. (C)
A nonlinear relationship partially explained
by donor. (D) A noncoexistence
relationship not explained by diet. (E) A
spring graph (see SOM Section 4.9)
in which nodes correspond to OTUs
and edges correspond to the top 300
nonlinear relationships. Node size is
proportional to the number of these
relationships involving the OTU, black edges represent relationships explained by diet, and node glow
color is proportional to the fraction of adjacent edges that are black (100% is red, 0% is blue).
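The node attributes described for Fig. 6E (size from the number of relationships involving an OTU, glow color from the fraction of its edges explained by diet) can be derived from an edge list alone. The following is a minimal sketch under that reading; the edge tuple layout, the function name, and the OTU names are hypothetical, and the spring layout itself (SOM Section 4.9) is not reproduced here.

```python
from collections import defaultdict


def node_stats(edges):
    """Map each OTU to (degree, fraction of its edges explained by diet).

    `edges` is an iterable of (otu_a, otu_b, explained_by_diet) tuples.
    """
    degree = defaultdict(int)
    diet_edges = defaultdict(int)
    for a, b, explained_by_diet in edges:
        for node in (a, b):
            degree[node] += 1
            if explained_by_diet:
                diet_edges[node] += 1
    return {n: (degree[n], diet_edges[n] / degree[n]) for n in degree}


# Toy edge list (hypothetical OTU names and diet flags).
edges = [
    ("otu_a", "otu_b", True),
    ("otu_a", "otu_c", True),
    ("otu_a", "otu_d", False),
    ("otu_c", "otu_d", False),
]
stats = node_stats(edges)
```

In a drawing, the first value of each pair would scale the node size and the second would be mapped onto the blue-to-red color range described in the caption.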
Conclusion. Given the ever-growing,
technology-driven data stream in today's
scientific world, there is an increasing need for
tools to make sense of complex data sets in
diverse fields. The ability to examine all potentially
interesting relationships in a data set, independent
of their form, allows tremendous versatility in
the search for meaningful insights. On the basis
of our tests, MINE is useful for identifying and
characterizing structure in data.
References and Notes
1. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and
Prediction (Springer Verlag, New York, 2009).
2. Science Staff, Science 331, 692 (2011).
3. By "functional relationship" we mean a distribution
(X, Y) in which Y is a function of X, potentially with
independent noise added.
4. A. Caspi et al., Science 301, 386 (2003).
5. R. N. Clayton, T. K. Mayeda, Geochim. Cosmochim. Acta
60, 1999 (1996).
6. T. J. Algeo, T. W. Lyons, Paleoceanography 21, PA1016
(2006).
7. World Health Organization Statistical Information
Systems, World Health Organization Statistical
Information Systems (WHOSIS) (2009);
www.who.int/whosis/en/.
8. A. Rényi, Acta Math. Hung. 10, 441 (1959).
9. C. J. Stone, Ann. Stat. 5, 595 (1977).
10. W. S. Cleveland, S. J. Devlin, J. Am. Stat. Assoc. 83, 596
(1988).
11. For both splines and regression estimators, we used R²
with respect to the estimated spline/regression function
to score relationships.
12. Y. I. Moon, B. Rajagopalan, U. Lall, Phys. Rev. E Stat.
Phys. Plasmas Fluids Relat. Interdiscip. Topics 52, 2318
(1995).
13. G. Darbellay, I. Vajda, IEEE Trans. Inf. Theory 45, 1315
(1999).
14. A. Kraskov, H. Stögbauer, P. Grassberger, Phys. Rev. E Stat.
Nonlin. Soft Matter Phys. 69, 066138 (2004).
15. L. Breiman, J. H. Friedman, J. Am. Stat. Assoc. 80, 580
(1985).
16. T. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502
(1989).
17. R. Tibshirani, Stat. Comput. 2, 183 (1992).
18. B. Kégl, A. Krzyzak, T. Linder, K. Zeger, Adv. Neural Inf.
Process. Syst. 11, 501 (1999).
19. P. Delicado, M. Smrekar, Stat. Comput. 19, 255
(2009).
20. "Principal curve-based methods" refers to mean-squared
error relative to the principal curve, and CorGC, the
principal curve-based measure of dependence of
Delicado et al.
21. G. Székely, M. Rizzo, Ann. Appl. Stat. 3, 1236
(2009).
22. B. P. Tu, A. Kudlicki, M. Rowicka, S. L. McKnight, Science
310, 1152 (2005).
23. R. Fisher, Tests of significance in harmonic analysis.
Proc. R. Soc. Lond. A 125, 54 (1929).
24. M. Ahdesmäki, H. Lähdesmäki, R. Pearson, H. Huttunen,
O. Yli-Harja, BMC Bioinformatics 6, 117 (2005).
25. H. Rosling, Gapminder, Indicators in Gapminder World
(2008); www.gapminder.org/data/
26. P. T. Spellman et al., Mol. Biol. Cell 9, 3273
(1998).
27. Baseball Prospectus Statistics Reports (2009);
www.baseballprospectus.com/sortable/
28. S. Lahman, The Baseball Archive, The Baseball Archive
(2009); baseball1.com/statistics/
29. P. J. Turnbaugh et al., The effect of diet on the
human gut microbiome: A metagenomic analysis in
humanized gnotobiotic mice. Sci. Transl. Med. 1, 6ra14
(2009).
30. L. Chen et al., Lancet 364, 1984 (2004).
31. S. Desai, S. Alva, Demography 35, 71 (1998).
32. S. Gupta, M. Verhoeven, J. Policy Model. 23, 433
(2001).
33. T. Gill et al., Obesity in the Pacific: Too big to ignore.
Noumea, New Caledonia: World Health Organization
Regional Office for the Western Pacific, Secretariat of the
Pacific Community (2002).
34. RPMLV estimates how many more runs per game a
player contributes over a replacement-level player in an
average lineup.
35. P. J. Turnbaugh et al., Nature 449, 804 (2007).
36. R. E. Ley et al., Science 320, 1647 (2008).
37. The World Factbook 2009 (Central Intelligence Agency,
Washington, DC, 2009).
Acknowledgments: We thank C. Blättler, B. Eidelson,
M. D. Finucane, M. M. Finucane, M. Fujihara, T. Gingrich,
E. Goldstein, R. Gupta, R. Hahne, T. Jaakkola, N. Laird,
M. Lipsitch, S. Manber, G. Nicholls, A. Papageorge, N. Patterson,
E. Phelan, J. Rinn, B. Ripley, I. Shylakhter, and R. Tibshirani for
invaluable support and critical discussions throughout; and
O. Derby, M. Fitzgerald, S. Hart, M. Huang, E. Karlsson, S. Schaffner,
C. Edwards, and D. Yamins for assistance. P.C.S. and this
work are supported by the Packard Foundation. For data set
analysis, P.C.S. was also supported by NIH MIDAS award
U54GM088558, D.N.R. by a Marshall Scholarship, M.M. by
NSF grant 0915922, H.K.F. by ERC grant 239985, S.R.G. by
the Medical Scientist Training Program, and P.J.T. by NIH
P50 GM068763. Data and software are available online at
http://exploredata.net.
Supporting Online Material
www.sciencemag.org/cgi/content/full/334/6062/1518/DC1
Materials and Methods
SOM Text
Figs. S1 to S13
Tables S1 to S14
References (38–54)
10 March 2011; accepted 5 October 2011
10.1126/science.1205438
... All properties were related to the measured D32 values using the Pearson coefficient of correlation (R) and the maximal information coefficient [33]. The former measures linear correlation, whereas the latter indicates the level of association between the evaluated variables, not constrained to linear relationships [33]. The maximal information coefficient (MIC) was used to detect variables that were non-linearly related to the Sauter diameter and did not lead to a high coefficient of correlation. ...
Article
Full-text available
This paper studies the correlation between different macroscopic features of image regions and object properties with the Sauter diameter (D32) of bubble size in flotation. Bubbles were sampled from the collection zone of a two-dimensional flotation cell using a McGill Bubble Size Analyzer, and photographed bubbles were processed using image analysis. The Sauter mean diameters were obtained under different experimental conditions using a semiautomated methodology, in which non-identifiable bubbles were manually characterized to estimate the bubble size distribution. For the same processed images, different image properties from their binary representation were studied in terms of their correlation with D32. The median and variability of the shadow percentage, aspect ratio, power spectral density, perimeter, equivalent diameters, solidity, and circularity, among other image or object properties, were studied. These properties were then related to the measured D32 values, from which four predictors were chosen to obtain a multivariable model that adequately described the Sauter diameter. After removing abnormal gas dispersion conditions, the multivariable linear model was able to represent D32 values (99 datasets) for superficial gas rates in the range of 0.4–2.5 cm/s, for four types of frothers and surfactant concentrations ranging from 0 to 32 ppm. The model was tested with 72 independent datasets, showing the generalizability of the results. Thus, the approach proved to be applicable at the laboratory scale for D32 = 1.3–6.7 mm.
... The multi-objective optimization model finds a Pareto front in the solution set and each solution on the front represents a combination of individual features, which exhibits a good balance the relevance and the redundancy. For each feature group on the Pareto front, we selected no more than ten features by utilizing a feature selection method named MIC-mRMR, which used maximal information coefficient (MIC) [24] as the correlation metric. Among all feature groups on the Pareto front, we further identified the optimal feature group with the highest training AUC on the training dataset, where a linear regression (LR) model was used for classifying glioma grades and the AUC was calculated by comparing with the true labels of patients in the training dataset. ...
Article
Radiomics, providing quantitative data extracted from medical images, has emerged as a critical role in diagnosis and classification of diseases such as glioma. One main challenge is how to uncover key disease-relevant features from the large amount of extracted quantitative features. Many existing methods suffer from low accuracy or overfitting. We propose a new method, Multiple-Filter and Multi-Objective-based method (MFMO), to identify predictive and robust biomarkers for disease diagnosis and classification. This method combines a multi-filter feature extraction with a multi-objective optimization-based feature selection model, which identifies a small set of predictive radiomic biomarkers with less redundancy. Taking magnetic resonance imaging (MRI) images-based glioma grading as a case study, we identify 10 key radiomic biomarkers that can accurately distinguish low-grade glioma (LGG) from high-grade glioma (HGG) on both training and test datasets. Using these 10 signature features, the classification model reaches training Area Under the receiving operating characteristic Curve (AUC) of 0.96 and test AUC of 0.95, which shows superior performance over existing methods and previously identified biomarkers.
... where I(X m ; Y ) is the value of mutual information between the two sequences. The detailed algorithm is also available in Reshef et al. [35]. ...
Article
Full-text available
The volatility of tourism demand is often caused by some irregular events in recent years. Typically, inbound tourists are quite sensitive to various factors, including the exchange rate fluctuation, consumer price index, personal or household income or consumption expenditure. We combine these multivariate time series data onto an ingenious multi-factor fusion strategy to contribute to precise tourism demand forecasting. A novel hybrid deep learning forecasting approach is developed by integrating several modules such as improved complete ensemble empirical mode decomposition with adaptive noise, intrinsic mode functions classification, multi-factors fusion and predictors matching. The monthly tourist flow data of Shanghai inbounding from USA, Korea and Japan are conducted to verify the performance of the proposed approach, which outperforms all benchmark models for different prediction horizons. The experimental results show that introducing external influencing factors can improve the prediction accuracy significantly, and therefore confirm the rationality and validity of the proposed approach.
... Sejdinovic et al. [65] showed that distance covariance and the kernel-based independence criterion are, in fact, equivalent if the kernel is chosen based on the relevant distance function. Other popular methods include mutual information-based tests [2,6,43,74], graph-based methods [19,22,34], the maximal information coefficient [61], ranking of interpoint distances [35,55], ball covariance [58], and binning approaches based on partitions of the sample space [35,50,77] (see [41,49] for reviews of the various methods). However, none of the aforementioned multivariate tests simultaneously inherit the distribution-free and universal consistency properties of the rank-based univariate tests. ...
Preprint
Full-text available
In this paper we study the problem of measuring and testing joint independence for a collection of multivariate random variables. Using the emerging theory of optimal transport (OT) based multivariate ranks, we propose a distribution-free test for multivariate joint independence. Towards this we introduce the notion of rank joint distance correlation (RJdCov), the higher-order rank analogue of the celebrated distance covariance measure, that captures the dependencies among all the subsets of the variables. The RJdCov can be easily estimated from the data without any moment assumptions and the associated test for joint independence is universally consistent. We can calibrate the test without any knowledge of the (unknown) marginal distributions (due to the distribution-free property), both asymptotically and in finite samples. In addition to being distribution-free and universally consistent, the proposed test is also statistically efficient, that is, it has non-trivial asymptotic (Pitman) efficiency. We demonstrate this by computing the limiting local power of the test for both mixture alternatives and joint Konijn alternatives. We also use the RJdCov measure to develop a method for independent component analysis (ICA) that is easy to implement and robust to outliers and contamination. Extensive simulations are performed to illustrate the efficacy of the proposed test in comparison to other existing methods. Finally, we apply the proposed method to learn the higher-order dependence structure among different US industries based on stock prices.
... • including multiple indices of agreement and model efficiency such as: i) index of agreement d (Willmott, 1981), and its modified d1 (Willmott et al., 1985) and refined d1r (Willmott et al., 2012) variants, ii) Nash-Sutcliffe model efficiency (NSE) (Nash & Sutcliffe, 1970) and its improved variants E1 (Legates & McCabe Jr., 1999), Erel (Krause et al., 2005), and Kling-Gupta model efficiency (KGE) (Kling et al., 2012), iii) Robinson's index of agreement (RAC) (Robinson, 1957, 1959), iv) Ji & Gallo agreement coefficient (AC) (Ji & Gallo, 2006), v) Duveiller's lambda (Duveiller et al., 2016), vi) distance correlation (dcorr) (Székely et al., 2007), or vii) maximal information coefficient (MIC) (Reshef et al., 2011), among others. ...
... Previous studies [34] have shown that the high dimension of the input parameters can cause overfitting in models and reduce computational efficiency. Therefore, this section will study the correlation between the input and output parameters using Maximal Information Coefficient (MIC) [35] to remove redundant input parameters. Figure 10 shows that the correlation coefficient (C) between cutter temperature, driving speed, shield attitude, penetration depth, grouting amount, grouting pressure, chamber earth pressure, buried depth ratio, friction angle, cohesion, pile diameter, tensile strength of reinforcing bar, concrete strength and thrust is 0.59, 0.86, 0.89, 0.93, 0.46, 0.31, 0.76, 0.21, 0.32, 0.29, 0.61, 0.68, 0.66, respectively. ...
Article
Full-text available
Shield thrust is a critical operational parameter during shield driving, which is of vital significance for adjusting operational parameters and ensuring efficient and safe propulsion of shield tunneling machine. In this paper, a novel hybrid prediction model (CLM) combining attention mechanism, convolutional neural networks (CNN) and Bi-directional long short-term memory (BiLSTM) network is proposed for shield thrust prediction. Correlation analysis based on Maximal Information Coefficient (MIC) between the thrust and other parameters is first conducted to select optimal parameters and reduce input dimension. An attention mechanism is introduced into CNN to distinguish the importance of different features, with the convolution layer and pooling layer further extracting dimension features of the data. Then, a BiLSTM neural network integrating first attention layer is employed to extract time-varying characteristics of the data, with a second attention layer added to capture important time information. Field data during shield cutting bridge piles are investigated to support and validate the effectiveness and superiority of the proposed method. Results show that the proposed CLM model are general enough to avoid overfitting problems and have good performance at prediction. The predicted value match reasonably well the monitoring data, with coefficient of determination (R²) equaling to 0.85, root mean square error (RMSE) equaling to 0.05, mean absolute error (MAE) equaling to 0.02. The CLM model in this paper can accurately predict the thrust even under complicated construction conditions, which provides reference for similar industrial application.
... Mutual information (MI) measures how much the information of one random indicator is communicated with another [22] . Here, high MI means a large reduction in uncertainty, while low MI indicates a small reduction. ...
Article
Indicator selection has been a compelling problem in data envelopment analysis. With the advent of the big data era, scholars are faced with more complex indicator selection situations. The boom in machine learning presents an opportunity to address this problem. However, poor quality indicators may be selected if inappropriate methods are used in overfitting or underfitting scenarios. To date, some scholars have pioneered the use of the least absolute shrinkage and selection operator to select indicators in overfitting scenarios, but researchers have not proposed classifying the big data scenarios encountered by DEA into overfitting and underfitting scenarios, nor have they attempted to develop a complete indicator selection system for both scenarios. To fill these research gaps, this study employs machine learning methods and proposes a mean score approach based on them. Our Monte Carlo simulations show that the least absolute shrinkage and selection operator dominates in overfitting scenarios but fails to select good indicators in underfitting scenarios, while the ensemble methods are superior in underfitting scenarios, and the proposed mean approach performs well in both scenarios. Based on the strengths and limitations of the different methods, a smart indicator selection mechanism is proposed to facilitate the selection of DEA indicators.
Article
Worldwide cities are becoming more sustainable and are being monitored using data collection techniques at various geographical levels. Given the growing volume of data, there is a need to identify challenges associated with the processing, visualization, and analysis of the generated data from an urban scale. This study proposes a framework to investigate the capabilities of dimensionality reduction techniques (t-SNE, and UMAP) applied to city-scale data to identify key features of high consumption and generation areas based on building characteristics. The analysis is performed on measured data from 2735 postcodes consisting of 72000 households/buildings from a city in the Netherlands. The evaluation results showed that the UMAP's algorithm mean sigma quickly approaches a threshold of 0.6 at n_neighbor values of 50 and the low dimensional shape does not change with increasing values. Whereas the t-SNE's mean sigma value increases continuously with the increasing perplexity value, implying that t-SNE is significantly more sensitive to the perplexity parameter. The UMAP algorithm was used to extract information about the high photovoltaic generation and consumption regions. The proposed framework will assist grid operators and energy planners in extracting information from energy consumption data at the neighbourhood level by utilizing high dimensional reduction techniques.
Article
Accurate prediction of the future degradation trend (FDT) and remaining useful life (RUL) of proton exchange membrane fuel cell (PEMFC) is crucial in the prognosis and health management (PHM) of PEMFC. Therefore, this study develops a stacked generalization model (SGM) to predict the FDT and RUL of the PEMFC. Firstly, maximum information coefficient (MIC) and Akaike information criterion (AIC) are used to select input variables to obtain key information and reduce the difficulty of model training. Then use improved grasshopper optimization algorithm (GOA) to optimize the hyperparameters of the SGM to further improve the prediction performance of the model. Finally, according to the dataset after feature selection, improved GOA (IGOA) optimized SGM (M/AIC-IGOA-SGM) is established to predict FDT and RUL. Among them, IGOA is obtained by introducing Chebyshev chaotic mapping initialization, chaotic decreasing factor, and a spiral position update mechanism to GOA. The base models of SGM are support vector regression (SVR) and gated recurrent unit (GRU), and the meta-model is regularized extreme learning machine (RELM). In the comparative experiments under two different current conditions, the superiority of the proposed model is verified, and the effectiveness of M/AIC, IGOA, and SGM is discussed.
Article
The real purpose of collecting big data is to identify causality in the hope that this will facilitate credible predictivity. But the search for causality can trap one into infinite regress, and thus one takes refuge in seeking associations between variables in data sets. Regrettably, the mere knowledge of associations does not enable predictivity. Associations need to be embedded within the framework of probability calculus to make coherent predictions. This is so because associations are a feature of probability models, and hence they do not exist outside the framework of a model. Measures of association, like correlation, regression, and mutual information merely refute a preconceived model. Estimated measures of associations do not lead to a probability model; a model is the product of pure thought. This paper discusses these and other fundamentals that are germane to seeking associations in particular, and machine learning in general.
Article
Full-text available
A strategy to understand the microbial components of the human genetic and metabolic landscape and how they contribute to normal physiology and predisposition to disease.
Article
Ramsey theory is a fast-growing area of combinatorics with deep connections to other fields of mathematics such as topological dynamics, ergodic theory, mathematical logic, and algebra. The area of Ramsey theory dealing with Ramsey-type phenomena in higher dimensions is particularly useful. Introduction to Ramsey Spaces presents in a systematic way a method for building higher-dimensional Ramsey spaces from basic one-dimensional principles. It is the first book-length treatment of this area of Ramsey theory, and emphasizes applications for related and surrounding fields of mathematics, such as set theory, combinatorics, real and functional analysis, and topology. In order to facilitate accessibility, the book gives the method in its axiomatic form with examples that cover many important parts of Ramsey theory both finite and infinite. An exciting new direction for combinatorics, this book will interest graduate students and researchers working in mathematical subdisciplines requiring the mastery and practice of high-dimensional Ramsey theory.
Article
This classic introduction to probability theory for beginning graduate students covers laws of large numbers, central limit theorems, random walks, martingales, Markov chains, ergodic theorems, and Brownian motion. It is a comprehensive treatment concentrating on the results that are the most useful for applications. Its philosophy is that the best way to learn probability is to see it in action, so there are 200 examples and 450 problems. The fourth edition begins with a short chapter on measure theory to orient readers new to the subject.
Article
Oxygen isotope abundances provide a powerful tool for recognizing genetic relationships among meteorites. Among the differentiated achondrites, three isotopic groups are recognized: (1) SNC (Mars), (2) Earth and Moon, and (3) HED (howardites, eucrites, diogenites). The HED group also contains the mesosiderites, main-group pallasites, and silicates from IIIAB irons. The angrites may be marginally resolvable from the HED group. Within each of these groups, internal geologic processes give rise to isotopic variations along a slope- fractionation line, as is well known for terrestrial materials. Variations of Δ17O from one planet to another are inherited from the inhomogeneities in the solar nebula, as illustrated by the isotopic compositions of chondrites and their constituents. Among the undifferentiated achondrites, five isotopic groups are found: (1) aubrites, (2) winonaites and IAB-IIICD irons, (3) brachinites, (4) acapulcoites and lodranites, and (5) ureilites. The isotopic compositions of aubrites coincide with the Earth and Moon, and also with the enstatite chondrites. These bodies apparently were derived from a common reservoir, the isotopic composition of which was established at the chondrule scale by nebular processes. Isotopic similarities between chondrites and achondrites are seen only for the following instances: (1) enstatite chondrites and aubrites, (2) H chondrites and HE irons, and (3) L or LL chondrites and IVA irons. The isotopic data also support the following genetic associations: (1) winonaites and IAB-IIICD irons, (2) acapulcoites and lodranites, and (3) ureilites and dark inclusions of C3 chondrites. An attempt to reconcile the whole-planet isotopic compositions of Earth, Mars, and the eucrite parent body with mixing models of their chemical compositions failed. It is not possible to satisfy both the chemical and isotopic compositions of the terrestrial planets using known primitive Solar System components.
Article
Principal curves are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal-component line. The curve in each successive iteration is a smooth or local average of the p-dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration. In this article principal curves are defined, an algorithm for their construction is given, some theoretical results are presented, and the procedure is compared to other generalizations of principal components. Two applications illustrate the use of principal curves. The first describes how the principal-curve procedure was used to align the magnets of the Stanford linear collider. The collider uses about 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction, it was found that some of the magnets had ended up significantly out of place. As a result, the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations, but rather to a sufficiently smooth arc through the middle of the existing positions. This arc was found using the principal-curve procedure. In the second application, two different assays for gold content in several samples of computer-chip waste appear to show some systematic differences that are blurred by measurement error. The classical approach using linear errors-in-variables regression can detect systematic linear differences but is not able to account for nonlinearities. When the first linear principal component is replaced with a principal curve, a local “bump” is revealed, and bootstrapping is used to verify its presence.
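The iteration described in this abstract (start from the principal-component line, then alternate a smoothing step with a projection step) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `principal_curve`, the parameters `n_iter` and `bandwidth`, and the synthetic half-circle data are all hypothetical, and a Gaussian kernel average stands in for whatever scatterplot smoother the article actually uses.

```python
import numpy as np

def principal_curve(X, n_iter=10, bandwidth=0.1):
    """Sketch of a principal-curve iteration (hypothetical helper).

    Returns (curve, lam): curve is an (n, p) array of points ordered along
    the fitted curve, lam is the arc-length position of each row of X.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Prior summary: position of each point along the first principal component.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ vt[0]
    rows = np.arange(len(X))
    for _ in range(n_iter):
        # Smoothing step: local average of the points, where "local" is
        # measured in the current curve parameter (rescaled to [0, 1]).
        order = np.argsort(lam)
        t = lam[order] - lam[order][0]
        t = t / t[-1]
        w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
        w /= w.sum(axis=1, keepdims=True)
        curve = w @ X[order]
        # Projection step: project every point onto the polyline through
        # the curve points and record its arc length along that polyline.
        start = curve[:-1]
        vec = curve[1:] - curve[:-1]
        seg_len = np.linalg.norm(vec, axis=1)
        cum = np.concatenate([[0.0], np.cumsum(seg_len)])
        diff = X[:, None, :] - start[None, :, :]
        u = (diff * vec[None, :, :]).sum(axis=-1) / np.maximum(seg_len ** 2, 1e-12)
        u = np.clip(u, 0.0, 1.0)
        proj = start[None, :, :] + u[..., None] * vec[None, :, :]
        d2 = ((X[:, None, :] - proj) ** 2).sum(axis=-1)
        best = d2.argmin(axis=1)
        lam = cum[best] + u[rows, best] * seg_len[best]
    return curve, lam

# Illustrative data (an assumption, not from the article): a noisy half circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0.0, 0.05, (200, 2))
curve, lam = principal_curve(X)
```

The kernel average is biased toward the inside of a bend and near the ends of the curve, so the bandwidth trades smoothness against fidelity; a local linear smoother would reduce that bias at the cost of a longer example.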
Article
Sedimentary molybdenum, [Mo]s, has been widely used as a proxy for benthic redox potential owing to its generally strong enrichment in organic-rich marine facies deposited under oxygen-depleted conditions. A detailed analysis of [Mo]s–total organic carbon (TOC) covariation in modern anoxic marine environments and its relationship to ambient water chemistry suggests that (1) [Mo]s, while useful in distinguishing oxic from anoxic facies, is not related in a simple manner to dissolved sulfide concentrations within euxinic environments and (2) patterns of [Mo]s–TOC covariation can provide information about paleohydrographic conditions, especially the degree of restriction of the subchemoclinal water mass and temporal changes thereof related to deepwater renewal. These inferences are based on data from four anoxic silled basins (the Black Sea, Framvaren Fjord, Cariaco Basin, and Saanich Inlet) and one upwelling zone (the Namibian Shelf), representing a spectrum of aqueous chemical conditions related to water mass restriction. In the silled-basin environments, increasing restriction is correlated with a systematic decrease in [Mo]s/TOC ratios, from ∼45 ± 5 for Saanich Inlet to ∼4.5 ± 1 for the Black Sea. This variation reflects control of [Mo]s by [Mo]aq, which becomes depleted in stagnant basins through removal to the sediment without adequate resupply by deepwater renewal (the “basin reservoir effect”). The temporal dynamics of this process are revealed by high-resolution chemostratigraphic data from Framvaren Fjord and Cariaco Basin sediment cores, which exhibit long-term trends toward lower [Mo]s/TOC ratios following development of water column stratification and deepwater anoxia. Mo burial fluxes peak in weakly sulfidic environments such as Saanich Inlet (owing to a combination of greater [Mo]aq availability and enhanced Mo transport to the sediment-water interface via Fe-Mn redox cycling) and are lower in strongly sulfidic environments such as the Black Sea and Framvaren Fjord. These observations demonstrate that, at timescales associated with deepwater renewal in anoxic silled basins, decreased sedimentary Mo concentrations and burial fluxes are associated with lower benthic redox potentials (i.e., more sulfidic conditions). These conclusions apply only to anoxic marine environments exhibiting some degree of water mass restriction (e.g., silled basins) and are not valid for low-oxygen facies in open marine settings such as continent-margin upwelling systems.