DOI: 10.1126/science.1205438

David N. Reshef et al., Science 334, 1518 (2011)

Detecting Novel Associations in Large Data Sets

The following resources related to this article are available online at www.sciencemag.org (this information is current as of December 15, 2011):

Updated information and services, including high-resolution figures, can be found in the online version of this article at:
http://www.sciencemag.org/content/334/6062/1518.full.html

Supporting Online Material can be found at:
http://www.sciencemag.org/content/suppl/2011/12/14/334.6062.1518.DC1.html
http://www.sciencemag.org/content/suppl/2011/12/15/334.6062.1518.DC2.html

This article cites 35 articles, 6 of which can be accessed free:
http://www.sciencemag.org/content/334/6062/1518.full.html#ref-list-1

This article has been cited by 1 article hosted by HighWire Press; see:
http://www.sciencemag.org/content/334/6062/1518.full.html#related-urls

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2011 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.

Downloaded from www.sciencemag.org on December 15, 2011

Detecting Novel Associations in Large Data Sets

David N. Reshef,^{1,2,3}*† Yakir A. Reshef,^{2,4}*† Hilary K. Finucane,^{5} Sharon R. Grossman,^{2,6} Gilean McVean,^{3,7} Peter J. Turnbaugh,^{6} Eric S. Lander,^{2,8,9} Michael Mitzenmacher,^{10}‡ Pardis C. Sabeti^{2,6}‡

Identifying interesting relationships between pairs of variables in large data sets is increasingly

important. Here, we present a measure of dependence for two-variable relationships: the maximal

information coefficient (MIC). MIC captures a wide range of associations both functional and

not, and for functional relationships provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function. MIC belongs to a larger

class of maximal information-based nonparametric exploration (MINE) statistics for identifying

and classifying relationships. We apply MIC and MINE to data sets in global health, gene

expression, major-league baseball, and the human gut microbiota and identify known and

novel relationships.

Imagine a data set with hundreds of variables,

which may contain important, undiscovered

relationships. There are tens of thousands of

variable pairs—far too many to examine manu-

ally. If you do not already know what kinds of

relationships to search for, how do you efficiently

identify the important ones? Data sets of this size

are increasingly common in fields as varied as

genomics, physics, political science, and econom-

ics, making this question an important and grow-

ing challenge (1,2).

One way to begin exploring a large data set

is to search for pairs of variables that are closely

associated. To do this, we could calculate some

measure of dependence for each pair, rank the

pairs by their scores, and examine the top-scoring

pairs. For this strategy to work, the statistic we

use to measure dependence should have two heu-

ristic properties: generality and equitability.
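The screening strategy described above (score every pair, rank, inspect the top) can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `squared_pearson` and `rank_pairs` and the toy data are hypothetical, and the squared Pearson correlation stands in as a placeholder score, not as the statistic proposed in this article.

```python
from itertools import combinations

import numpy as np


def squared_pearson(x, y):
    """Placeholder dependence score; any function of (x, y) can be swapped in."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r * r)


def rank_pairs(data, score=squared_pearson):
    """Score every unordered pair of variables and sort from strongest to weakest.

    `data` maps a variable name to a 1-D array; all arrays share one length.
    """
    scored = [
        (score(data[a], data[b]), a, b)
        for a, b in combinations(sorted(data), 2)
    ]
    return sorted(scored, reverse=True)


# Toy data set (made up for illustration): one linear and one parabolic
# dependence on x, plus an unrelated variable.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
data = {
    "x": x,
    "linear": 2 * x + rng.normal(0, 0.1, 500),
    "parabola": x ** 2,
    "noise": rng.uniform(-1, 1, 500),
}
ranking = rank_pairs(data)
for value, a, b in ranking:
    print(f"{a:>8} vs {b:<8} score = {value:.3f}")
```

With this placeholder score the noiseless parabola ranks near the bottom even though it is a perfect functional relationship, which is the kind of failure the generality property is meant to rule out.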

By generality, we mean that with sufficient

sample size the statistic should capture a wide

range of interesting associations, not limited to

specific function types (such as linear, exponential,

or periodic), or even to all functional relation-

ships (3). The latter condition is desirable because

not only do relationships take many functional

forms, but many important relationships—for ex-

ample, a superposition of functions—are not well

modeled by a function (4–7).

By equitability, we mean that the statistic

should give similar scores to equally noisy rela-

tionships of different types. For example, we do

not want noisy linear relationships to drive strong

sinusoidal relationships from the top of the list.

Equitability is difficult to formalize for associa-

tions in general but has a clear interpretation in

the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar $R^2$ values (given sufficient sample size).

Here, we describe an exploratory data anal-

ysis tool, the maximal information coefficient

(MIC), that satisfies these two heuristic proper-

ties. We establish MIC’s generality through proofs,

show its equitability on functional relationships

through simulations, and observe that this trans-

lates into intuitively equitable behavior on more

general associations. Furthermore, we illustrate

that MIC gives rise to a larger family of sta-

tistics, which we refer to as MINE, or maximal

information-based nonparametric exploration.

MINE statistics can be used not only to identify

interesting associations, but also to characterize

them according to properties such as nonline-

arity and monotonicity. We demonstrate the

application of MIC and MINE to data sets in

health, baseball, genomics, and the human

microbiota.

The maximal information coefficient. Intu-

itively, MIC is based on the idea that if a re-

lationship exists between two variables, then a

grid can be drawn on the scatterplot of the two

variables that partitions the data to encapsulate

that relationship. Thus, to calculate the MIC of a

set of two-variable data, we explore all grids up

to a maximal grid resolution, dependent on the

sample size (Fig. 1A), computing for every pair

of integers (x,y) the largest possible mutual information achievable by any x-by-y grid applied to the data. We then normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1. We define the characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any x-by-y grid, and the statistic MIC to be the maximum value in $M$ (Fig. 1, B and C).

^1 Department of Computer Science, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. ^2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. ^3 Department of Statistics, University of Oxford, Oxford OX1 3TG, UK. ^4 Department of Mathematics, Harvard College, Cambridge, MA 02138, USA. ^5 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. ^6 Center for Systems Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. ^7 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK. ^8 Department of Biology, MIT, Cambridge, MA 02139, USA. ^9 Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA. ^10 School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.

*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: dnreshef@mit.edu (D.N.R.); yreshef@post.harvard.edu (Y.A.R.)
‡These authors contributed equally to this work.

[Figure 1, panels A to C. Panel C axes: vertical axis bins, horizontal axis bins, normalized score (0.0 to 1.0).]

Fig. 1. Computing MIC. (A) For each pair (x,y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. (B) The algorithm normalizes the mutual information scores and compiles a matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on this surface. In this example, there are many grids that achieve the highest score. The star in (B) marks a sample grid achieving this score, and the star in (C) marks that grid’s corresponding location on the surface.

RESEARCH ARTICLES
16 DECEMBER 2011 VOL 334 SCIENCE www.sciencemag.org 1518

More formally, for a grid $G$, let $I_G$ denote the mutual information of the probability distribution induced on the boxes of $G$, where the probability of a box is proportional to the number of data points falling inside the box. The (x,y)-th entry $m_{x,y}$ of the characteristic matrix equals $\max\{I_G\} / \log \min\{x,y\}$, where the maximum is taken over all x-by-y grids $G$. MIC is the maximum of $m_{x,y}$ over ordered pairs (x,y) such that $xy < B$, where $B$ is a function of sample size; we usually set $B = n^{0.6}$ (see SOM Section 2.2.1).
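To make the definition concrete, here is a minimal sketch that evaluates the normalized mutual information $I_G / \log \min\{x,y\}$ over grids with $xy < B$ and $B = n^{0.6}$. It rests on a loud simplifying assumption: for each grid size it tries only one grid, the one that splits each axis into near-equal-count bins, rather than maximizing over all x-by-y grids. The result is therefore a lower bound on the quantity defined above, not the article's algorithm, and the function names are hypothetical.

```python
import numpy as np


def equipartition(values, k):
    """Assign each point to one of k near-equal-count bins, using ranks only."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k) // len(values)


def grid_mutual_information(col, row, nx, ny):
    """Mutual information (in nats) of the box distribution induced by a grid."""
    counts = np.zeros((nx, ny))
    np.add.at(counts, (col, row), 1)           # number of points in each box
    p = counts / counts.sum()                   # box probabilities
    outer = p.sum(axis=1, keepdims=True) * p.sum(axis=0, keepdims=True)
    nz = p > 0                                  # skip empty boxes (0 log 0 = 0)
    return float(np.sum(p[nz] * np.log(p[nz] / outer[nz])))


def approx_mic(x, y, exponent=0.6):
    """Lower-bound sketch of MIC: equipartition grids only (an assumption)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    B = len(x) ** exponent                      # grid-size limit, xy < B
    best = 0.0
    nx = 2
    while nx * 2 < B:
        ny = 2
        while nx * ny < B:
            mi = grid_mutual_information(
                equipartition(x, nx), equipartition(y, ny), nx, ny
            )
            best = max(best, mi / np.log(min(nx, ny)))
            ny += 1
        nx += 1
    return best


rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
score_line = approx_mic(x, x)                          # noiseless line
score_independent = approx_mic(x, rng.uniform(size=1000))
print(score_line, score_independent)
```

Because the bins are built from ranks, this sketch inherits the rank-order dependence of $I_G$. Ties are broken by position in the input, which is an arbitrary choice made here for simplicity.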

[Figure 2, panels A to G. Panels B to F each plot the named statistic against Noise (1 - $R^2$): Maximal Information Coefficient (MIC); Squared Spearman Rank Correlation Coefficient; Mutual Information (Kraskov et al.); Maximal Correlation (ACE); CorGC (Principal Curve-Based). Panel G shows, under the heading Maximal Information Coefficient (MIC), columns labelled 0.80, 0.65, 0.50, 0.35 along an Added Noise axis for the relationship types Two Lines, Line and Parabola, X, Ellipse, Sinusoid (Mixture of two signals), and Non-coexistence.]

Panel A scores:

Relationship Type                     MIC    Pearson  Spearman  Mutual Information  Mutual Information  CorGC (Principal  Maximal
                                                                (KDE)               (Kraskov)           Curve-Based)      Correlation
Random                                0.18   -0.02    -0.02     0.01                0.03                0.19              0.01
Linear                                1.00    1.00     1.00     5.03                3.89                1.00              1.00
Cubic                                 1.00    0.61     0.69     3.09                3.12                0.98              1.00
Exponential                           1.00    0.70     1.00     2.09                3.62                0.94              1.00
Sinusoidal (Fourier frequency)        1.00   -0.09    -0.09     0.01               -0.11                0.36              0.64
Categorical                           1.00    0.53     0.49     2.22                1.65                1.00              1.00
Periodic/Linear                       1.00    0.33     0.31     0.69                0.45                0.49              0.91
Parabolic                             1.00   -0.01    -0.01     3.33                3.15                1.00              1.00
Sinusoidal (non-Fourier frequency)    1.00    0.00     0.00     0.01                0.20                0.40              0.80
Sinusoidal (varying frequency)        1.00   -0.11    -0.11     0.02                0.06                0.38              0.76

Fig. 2. Comparison of MIC to existing methods. (A) Scores given to various noiseless functional relationships by several different statistics (8,12,14,19). Maximal scores in each column are accentuated. (B to F) The MIC, Spearman correlation coefficient, mutual information (Kraskov et al. estimator), maximal correlation (estimated using ACE), and the principal curve-based CorGC dependence measure, respectively, of 27 different functional relationships with independent uniform vertical noise added, as the $R^2$ value of the data relative to the noiseless function varies. Each shape and color corresponds to a different combination of function type and sample size. In each plot, pairs of thumbnails show relationships that received identical scores; for data exploration, we would like these pairs to have similar noise levels. For a list of the functions and sample sizes in these graphs as well as versions with other statistics, sample sizes, and noise models, see figs. S3 and S4. (G) Performance of MIC on associations not well modeled by a function, as noise level varies. For the performance of other statistics, see figs. S5 and S6.


Every entry of $M$ falls between 0 and 1, and so MIC does as well. MIC is also symmetric [i.e., MIC(X,Y) = MIC(Y,X)] due to the symmetry of mutual information, and because $I_G$ depends only on the rank order of the data, MIC is invariant under order-preserving transformations of the axes. Notably, although mutual information is used to quantify the performance of each grid, MIC is not an estimate of mutual information (SOM Section 2).

To calculate M, we would ideally optimize

over all possible grids. For computational effi-

ciency, we instead use a dynamic programming

algorithm that optimizes over a subset of the pos-

sible grids and appears to approximate well the

true value of MIC in practice (SOM Section 3).

Main properties of MIC. We have proven

mathematically that MIC is general in the sense

described above. Our proofs show that, with prob-

ability approaching 1 as sample size grows, (i)

MIC assigns scores that tend to 1 to all never-

constant noiseless functional relationships; (ii)

MIC assigns scores that tend to 1 for a larger

class of noiseless relationships (including super-

positions of noiseless functional relationships);

and (iii) MIC assigns scores that tend to 0 to

statistically independent variables.

Specifically, we have proven that for a pair of random variables X and Y, (i) if Y is a function of X that is not constant on any open interval, then data drawn from (X,Y) will receive an MIC tending to 1 with probability one as sample size grows; (ii) if the support of (X,Y) is described by a finite union of differentiable curves of the form c(t) = [x(t), y(t)] for t in [0,1], then data drawn from (X,Y) will receive an MIC tending to 1 with probability one as sample size grows, provided that dx/dt and dy/dt are each zero on finitely many points; (iii) the MIC of data drawn from (X,Y) converges to zero in probability as sample size grows if and only if X and Y are statistically independent. We have also proven that the MIC of a noisy functional relationship is bounded from below by a function of its $R^2$. (For proofs, see SOM.)

We tested MIC’s equitability through simulations. These simulations confirm the mathematical result that noiseless functional relationships (i.e., $R^2$ = 1.0) receive MIC scores approaching 1.0 (Fig. 2A). They also show that, for a large collection of test functions with varied sample sizes, noise levels, and noise models, MIC roughly equals the coefficient of determination $R^2$ relative to each respective noiseless function. This makes it easy to interpret and compare scores across various function types (Fig. 2B and fig. S4). For instance, at reasonable sample sizes, a sinusoidal relationship with a noise level of $R^2$ = 0.80 and a linear relationship with the same $R^2$ value receive nearly the same MIC score. For a wide range of associations that are not well modeled by a function, we also show that MIC scores degrade in an intuitive manner as noise is added (Fig. 2G and figs. S5 and S6).
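The noise level used in these comparisons is the $R^2$ of the data relative to the noiseless function. The sketch below uses the standard definition of the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, which is an assumption about the exact convention; the test functions, noise scale, and names are made up for illustration and are not the article's simulation settings.

```python
import numpy as np


def r_squared(y_observed, y_noiseless):
    """Coefficient of determination of the data relative to a known function."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_noiseless = np.asarray(y_noiseless, dtype=float)
    ss_res = np.sum((y_observed - y_noiseless) ** 2)        # noise energy
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 2000)
line = x
sinusoid = np.sin(4 * np.pi * x)
noise = rng.uniform(-0.5, 0.5, x.size)   # independent uniform vertical noise

# The same noise amplitude gives different R^2 values for different functions.
r2_line = r_squared(line + 0.3 * noise, line)
r2_sine = r_squared(sinusoid + 0.3 * noise, sinusoid)
print(f"line: {r2_line:.3f}  sinusoid: {r2_sine:.3f}")
```

The point of the example is that identical noise amplitudes do not mean identical noise levels: the sinusoid has more variance than the line, so the same added noise leaves it with a higher $R^2$. Comparing relationships at matched $R^2$ requires scaling the noise per function.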

Comparisons to other methods. We compared

MIC to a wide range of methods—including meth-

ods formulated around the axiomatic framework

for measures of dependence developed by Rényi

(8), other state-of-the-art measures of dependence,

and several nonparametric curve estimation tech-

niques that can be used to score pairs of vari-

ables based on how well they fit the estimated

curve.

Methods such as splines (1) and regression estimators (1,9,10) tend to be equitable across functional relationships (11) but are not general: They fail to find many simple and important types of relationships that are not functional. (Figures S5 and S6 depict examples of relationships of this type from existing literature, and compare these methods to MIC on such relationships.) Although these methods are not intended to provide generality, the failure to assign high scores in such cases makes them unsuitable for identifying all potentially interesting relationships in a data set.

[Figure 3, panels A to F. Each panel is a surface with axes vertical axis bins (0 to 30), horizontal axis bins (0 to 30), and normalized score (0.0 to 1.0).]

Fig. 3. Visualizations of the characteristic matrices of common relationships. (A to F) Surfaces representing the characteristic matrices of several common relationship types. For each surface, the x axis represents number of vertical axis bins (rows), the y axis represents number of horizontal axis bins (columns), and the z axis represents the normalized score of the best-performing grid with those dimensions. The inset plots show the relationships used to generate each surface. For surfaces of additional relationships, see fig. S7.

Other methods such as mutual information

estimators (12–14), maximal correlation (8,15),

principal curve–based methods (16–19,20), dis-

tance correlation (21), and the Spearman rank

correlation coefficient all detect broader classes

of relationships. However, they are not equitable

even in the basic case of functional relation-

ships: They show a strong preference for some

types of functions, even at identical noise levels

(Fig. 2, A and C to F). For example, at a sample

size of 250, the Kraskov et al. mutual informa-

tion estimator (14) assigns a score of 3.65 to a

noiseless line but only 0.59 to a noiseless sinus-

oid, and it gives equivalent scores to a very

noisy line ($R^2$ = 0.35) and to a much cleaner sinusoid ($R^2$ = 0.80) (Fig. 2D). Again, these results are not surprising: they correctly reflect the properties of mutual information. But this behavior makes these methods less practical for data exploration.

[Figure 4, panels A to I. Panels A and B plot MIC score and mutual information (Kraskov et al.), respectively, against the Pearson correlation coefficient. Panel I annotations: MIC = 0.85, MIC = 0.65; Mutual Information = 0.83, Mutual Information = 0.65.]

Fig. 4. Application of MINE to global indicators from the WHO. (A) MIC versus r for all pairwise relationships in the WHO data set. (B) Mutual information (Kraskov et al. estimator) versus r for the same relationships. High mutual information scores tend to be assigned only to relationships with high r, whereas MIC gives high scores also to relationships that are nonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yield low scores for unassociated variables. (D) Ordinary linear relationships score high under both tests. (E to G) Relationships detected by MIC but not by r, because the relationships are nonlinear (E and G) or because more than one relationship is present (F). In (F), the linear trendline comprises a set of Pacific island nations in which obesity is culturally valued (33); most other countries follow a parabolic trend (table S10). (H) A superposition of two relationships that scores high under all three tests, presumably because the majority of points obey one relationship. The less steep minority trend consists of countries whose economies rely largely on oil (37) (table S11). The lines of best fit in (D) to (H) were generated using regression on each trend. (I) Of these four relationships, the left two appear less noisy than the right two. MIC accordingly assigns higher scores to the two relationships on the left. In contrast, mutual information assigns similar scores to the top two relationships and similar scores to the bottom two relationships.

An expanded toolkit for exploration. The

basic approach of MIC can be extended to de-

fine a broader class of MINE statistics based

on both MIC and the characteristic matrix M.

These statistics can be used to rapidly charac-

terize relationships that may then be studied with

more specialized or computationally intensive

techniques.

Some statistics are derived, like MIC, from the spectrum of grid resolutions contained in $M$. Different relationship types give rise to different types of characteristic matrices (Fig. 3). For example, just as a characteristic matrix with a high maximum indicates a strong relationship, a symmetric characteristic matrix indicates a monotonic relationship. We can thus detect deviation from monotonicity with the maximum asymmetry score (MAS), defined as the maximum over $M$ of $|m_{x,y} - m_{y,x}|$. MAS is useful, for example, for detecting periodic relationships with unknown frequencies that vary over time, a common occurrence in real data (22). MIC and MAS together detect such relationships more effectively than either Fisher’s test (23) or a recent specialized test developed by Ahdesmäki et al. (figs. S8 and S9) (24).
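As a small illustration of the MAS definition, the sketch below takes a characteristic matrix stored as a square array and returns the largest absolute difference between mirrored entries. The storage convention (entry `[i, j]` holding $m_{i+2,j+2}$, with NaN for grid sizes excluded by the $xy < B$ limit) and the toy values are assumptions made for this example only.

```python
import numpy as np


def maximum_asymmetry_score(M):
    """MAS: max over (x, y) of |m_{x,y} - m_{y,x}|, ignoring undefined entries."""
    M = np.asarray(M, dtype=float)
    return float(np.nanmax(np.abs(M - M.T)))


nan = np.nan
# Toy characteristic matrix (made-up values). Entry [i, j] stands for
# m_{i+2, j+2}; NaN marks grid sizes that were not evaluated.
M_toy = np.array([
    [0.30, 0.55, 0.80],
    [0.35, 0.60, nan],
    [0.40, nan, nan],
])
mas_toy = maximum_asymmetry_score(M_toy)

# A symmetric matrix has no asymmetry at all.
M_sym = np.array([
    [0.30, 0.50],
    [0.50, 0.70],
])
mas_sym = maximum_asymmetry_score(M_sym)
print(mas_toy, mas_sym)
```

In the toy matrix the largest gap is between $m_{2,4}$ = 0.80 and $m_{4,2}$ = 0.40, so the score is about 0.40.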

Because MIC is general and roughly equal to $R^2$ on functional relationships, we can also define a natural measure of nonlinearity by MIC - $r^2$, where r denotes the Pearson product-moment correlation coefficient, a measure of linear dependence. The statistic MIC - $r^2$ is near 0 for linear relationships and large for nonlinear relationships with high values of MIC. As seen in the real-world examples below, it is useful for uncovering novel nonlinear relationships.
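The nonlinearity measure can be sketched as follows. The MIC value is passed in rather than computed, and the value 1.0 used for both examples is the noiseless score reported for linear and parabolic relationships in Fig. 2A; the function name `nonlinearity` is hypothetical.

```python
import numpy as np


def nonlinearity(mic, x, y):
    """MIC - r^2, where r is the Pearson correlation of x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return float(mic - r ** 2)


x = np.linspace(-1.0, 1.0, 201)
line_score = nonlinearity(1.0, x, 2 * x + 1)   # noiseless line
parabola_score = nonlinearity(1.0, x, x ** 2)  # noiseless parabola
print(line_score, parabola_score)
```

The line has r = 1, so the measure is near 0. The parabola is symmetric about x = 0, so r is near 0 and the measure is near 1.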

Similar MINE statistics can be defined to detect properties that we refer to as “complexity” and “closeness to being a function.” We provide formal definitions and a performance summary of these two statistics (SOM section 2.3 and table S1). Finally, MINE statistics can also be used in cluster analysis to observe the higher-order structure of data sets (SOM section 4.9).

Application of MINE to real data sets. We

used MINE to explore four high-dimensional

data sets from diverse fields. Three data sets

have previously been analyzed and contain

many well-understood relationships. These data

sets are (i) social, economic, health, and political

indicators from the World Health Organization

(WHO) and its partners (7,25); (ii) yeast gene

expression profiles from a classic paper re-

porting genes whose transcript levels vary pe-

riodically with the cell cycle (26); and (iii)

performance statistics from the 2008 Major

League Baseball (MLB) season (27,28). For

our fourth analysis, we applied MINE to a data

set that has not yet been exhaustively analyzed:

a set of bacterial abundance levels in the human

gut microbiota (29). All relationships discussed

in this section are significant at a false discov-

ery rate of 5%; p-values and q-values are listed

in the SOM.

We explored the WHO data set (357 varia-

bles, 63,546 variable pairs) with MIC, the com-

monly used Pearson correlation coefficient (r),

and Kraskov’s mutual information estimator

(Fig. 4 and table S9). All three statistics detected

many linear relationships. However, mutual in-

formation gave low ranks to many nonlinear re-

lationships that were highly ranked by MIC

(Fig. 4, A and B). Two-thirds of the top 150 rela-

tionships found by mutual information were

strongly linear (|r| ≥ 0.97), whereas most of the top 150 relationships found by MIC had |r| below this threshold. Further, although equitability is difficult to assess for general associations, the results on some specific relationships suggest that MIC comes closer than mutual information to this goal (Fig. 4I). Using the nonlinearity measure MIC - $r^2$, we found several interesting relationships (Fig. 4, E to G), many of which are confirmed by existing literature (30–32). For example, we identified a superposition of two functional associations between female obesity and income per person: one from the Pacific Islands, where female obesity is a sign of status (33), and one from the rest of the world, where weight and status do not appear to be linked in this way (Fig. 4F).

We next explored a yeast gene expression data set (6223 genes) that was previously analyzed with a special-purpose statistic developed by Spellman et al. to identify genes whose transcript levels oscillate during the cell cycle (26). Of the genes identified by Spellman et al. and MIC, 70 and 69%, respectively, were also identified in a later study with more time points conducted by Tu et al. (22). However, MIC identified genes at a wider range of frequencies than did Spellman et al., and MAS sorted those genes by frequency (Fig. 5). Of the genes identified by MINE as having high frequency (MAS > 75th percentile), 80% were identified by Spellman et al., while of the low-frequency genes (MAS < 25th percentile), Spellman et al. identified only 20% (Fig. 5B). For example, although both methods found the well-known cell-cycle regulator HTB1 (Fig. 5G) required for chromatin assembly, only MIC detected the heat-shock protein HSP12 (Fig. 5E), which Tu et al. confirmed to be in the top 4% of periodic genes in yeast. HSP12, along with 43% of the genes identified by MINE but not Spellman et al., was also in the top third of statistically significant periodic genes in yeast according to the more sophisticated specialty statistic of Ahdesmäki et al., which was specifically designed for finding periodic relationships without a prespecified frequency in biological systems (24). Because of MIC’s generality and the small size of this data set (n = 24), relatively few of the genes analyzed (5%) had significant MIC scores after multiple testing correction at a false discovery rate of 5%. However, using a less conservative false discovery rate of 15% yielded a larger list of significant genes (16% of all genes analyzed), and this larger list still attained a 68% confirmation rate by Tu et al.

[Figure 5, panels A to G. Panels A and B plot the Spellman score against MIC and MAS, respectively (panel B is marked q < 0.05). Panels C to G plot gene expression against time (hours) for GIT1, CPR6, HSP12, MCM3, and HTB1.]

Fig. 5. Application of MINE to Saccharomyces cerevisiae gene expression data. (A) MIC versus scores obtained by Spellman et al. for all genes considered (26). Genes with high Spellman scores tend to receive high MIC scores, but some genes undetected by Spellman’s analysis also received high MICs. (B) MAS versus Spellman’s statistic for genes with significant MICs. Genes with a high Spellman score also tend to have a high MAS score. (C to G) Examples of genes with high MIC and varying MAS (trend lines are moving averages). MAS sorts the MIC-identified genes by frequency. A higher MAS signifies a shorter wavelength for periodic data, indicating that the genes found by Spellman et al. are those with shorter wavelengths. None of the examples except for (F) and (G) were detected by Spellman’s analysis. However, subsequent studies have shown that (C) to (E) are periodic genes with longer wavelengths (22,24). More plots of genes detected with MIC and MAS are given in fig. S11.

In the MLB data set (131 variables), MIC and r

both identified many linear relationships, but

interesting differences emerged. On the basis

of r, the strongest three correlates with player

salary are walks, intentional walks, and runs

batted in. By contrast, the strongest three asso-

ciations according to MIC are hits, total bases,

and a popular aggregate offensive statistic called

Replacement Level Marginal Lineup Value

(27,34) (fig. S12 and table S12). We leave it

to baseball enthusiasts to decide which of these

statistics are (or should be!) more strongly tied

to salary.

Our analysis of gut microbiota focused on the relationships between prevalence levels of the trillions of bacterial species that colonize the gut of humans and other mammals (35,36). The data set consisted of large-scale sequencing of 16S ribosomal RNA from the distal gut microbiota of mice colonized with a human fecal sample (29). After successful colonization, a subset of the mice was shifted from a low-fat, plant-polysaccharide–rich (LF/PP) diet to a high-fat, high-sugar “Western” diet. Our initial analysis identified 9472 significant relationships (out of 22,414,860) between “species”-level groups called operational taxonomic units (OTUs); significantly more of these relationships occurred between OTUs in the same bacterial family than expected by chance (30% versus 24 ± 0.6%).

Examining the 1001 top-scoring nonlinear relationships (MIC - $r^2$ > 0.2), we observed that a common association type was “noncoexistence”: When one species is abundant the other is less abundant than expected by chance, and vice versa (Fig. 6, A, B, and D). Additionally, we found that 312 of the top 500 nonlinear relationships were affected by one or more factors for which data were available (host diet, host sex, identity of human donor, collection method, and location in the gastrointestinal tract; SOM section 4.8). Many are noncoexistence relationships that are explained by diet (Fig. 6A and table S13). These diet-explained noncoexistence relationships occur at a range of taxonomic depths (interphylum, interfamily, and intrafamily) and form a highly interconnected network of nonlinear relationships (Fig. 6E).

The remaining 188 of the 500 highly ranked nonlinear relationships were not affected by any of the factors in the data set and included many noncoexistence relationships (table S14 and Fig. 6D). These unexplained noncoexistence relationships may suggest interspecies competition and/or additional selective factors that shape gut microbial ecology and therefore represent promising directions for future study.

[Figure 6: scatter plots (A to D) and a network graph (E). Axes show abundance (%) of the paired OTUs: (A) OTU5948 (Bacteroidaceae) and OTU710 (Erysipelotrichaceae), legend LF/PP and Western; (B) OTU4273 (Eubacteriaceae) and OTU3857 (Porphyromonadaceae), legend Male and Female; (C) OTU1462 (Lachnospiraceae) and OTU708 (Lachnospiraceae), legend Fresh, Frozen, and Second Generation; (D) OTU3349 (Porphyromonadaceae) and OTU2728 (Lachnospiraceae), legend LF/PP and Western.]

Fig. 6. Associations between bacterial species in the gut microbiota of “humanized” mice. (A) A noncoexistence relationship explained by diet: Under the LF/PP diet a Bacteroidaceae species-level OTU dominates, whereas under a Western diet an Erysipelotrichaceae species dominates. (B) A noncoexistence relationship occurring only in males. (C) A nonlinear relationship partially explained by donor. (D) A noncoexistence relationship not explained by diet. (E) A spring graph (see SOM section 4.9) in which nodes correspond to OTUs and edges correspond to the top 300 nonlinear relationships. Node size is proportional to the number of these relationships involving the OTU, black edges represent relationships explained by diet, and node glow color is proportional to the fraction of adjacent edges that are black (100% is red, 0% is blue).
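The node attributes described in the Fig. 6E caption can be sketched as follows. This is a minimal illustration, assuming an edge list in which each nonlinear relationship carries a flag for whether it is explained by diet; the function name, the input layout, and the toy OTU labels are hypothetical, and the spring layout itself (SOM section 4.9) is not reproduced.

```python
from collections import defaultdict


def node_stats(edges):
    """Compute per-OTU attributes for a relationship graph.

    `edges` is an iterable of (otu_a, otu_b, explained_by_diet) tuples.
    Returns {otu: (degree, fraction_of_edges_explained_by_diet)}, where
    degree would drive node size and the fraction would drive glow color.
    """
    degree = defaultdict(int)
    diet_edges = defaultdict(int)
    for otu_a, otu_b, explained_by_diet in edges:
        for otu in (otu_a, otu_b):
            degree[otu] += 1
            if explained_by_diet:
                diet_edges[otu] += 1
    return {otu: (degree[otu], diet_edges[otu] / degree[otu])
            for otu in degree}


# Toy edge list with made-up labels (not data from the study).
toy_edges = [
    ("A", "B", True),
    ("A", "C", False),
    ("B", "C", True),
]
stats = node_stats(toy_edges)
```

Here node B touches only diet-explained edges and would receive the fully red glow, whereas A and C, with half of their edges explained by diet, would fall midway between red and blue.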

Conclusion. Given the ever-growing, technology-driven data stream in today’s scientific world, there is an increasing need for tools to make sense of complex data sets in diverse fields. The ability to examine all potentially interesting relationships in a data set, independent of their form, allows tremendous versatility in the search for meaningful insights. On the basis of our tests, MINE is useful for identifying and characterizing structure in data.


Acknowledgments: We thank C. Blättler, B. Eidelson, M. D. Finucane, M. M. Finucane, M. Fujihara, T. Gingrich, E. Goldstein, R. Gupta, R. Hahne, T. Jaakkola, N. Laird, M. Lipsitch, S. Manber, G. Nicholls, A. Papageorge, N. Patterson, E. Phelan, J. Rinn, B. Ripley, I. Shylakhter, and R. Tibshirani for invaluable support and critical discussions throughout; and O. Derby, M. Fitzgerald, S. Hart, M. Huang, E. Karlsson, S. Schaffner, C. Edwards, and D. Yamins for assistance. P.C.S. and this work are supported by the Packard Foundation. For data set analysis, P.C.S. was also supported by NIH MIDAS award U54GM088558, D.N.R. by a Marshall Scholarship, M.M. by NSF grant 0915922, H.K.F. by ERC grant 239985, S.R.G. by the Medical Scientist Training Program, and P.J.T. by NIH P50 GM068763. Data and software are available online at http://exploredata.net.

Supporting Online Material

www.sciencemag.org/cgi/content/full/334/6062/1518/DC1

Materials and Methods

SOM Text

Figs. S1 to S13

Tables S1 to S14

References (38–54)

10 March 2011; accepted 5 October 2011

10.1126/science.1205438

The Structure of the Eukaryotic Ribosome at 3.0 Å Resolution

Adam Ben-Shem,*† Nicolas Garreau de Loubresse,* Sergey Melnikov,* Lasse Jenner, Gulnara Yusupova, Marat Yusupov†

Ribosomes translate genetic information encoded by messenger RNA into proteins. Many aspects of translation and its regulation are specific to eukaryotes, whose ribosomes are much larger and more intricate than their bacterial counterparts. We report the crystal structure of the 80S ribosome from the yeast Saccharomyces cerevisiae (including nearly all ribosomal RNA bases and protein side chains as well as an additional protein, Stm1) at a resolution of 3.0 angstroms. This atomic model reveals the architecture of eukaryote-specific elements and their interaction with the universally conserved core, and describes all eukaryote-specific bridges between the two ribosomal subunits. It forms the structural framework for the design and analysis of experiments that explore the eukaryotic translation apparatus and the evolutionary forces that shaped it.

Ribosomes are responsible for the synthesis of proteins across all kingdoms of life. The core, which is universally conserved and was described in detail by structures of prokaryotic ribosomes, catalyzes peptide bond formation and decodes mRNA (1). However, eukaryotes and prokaryotes differ markedly in other translation processes such as initiation, termination, and regulation (2, 3), and eukaryotic ribosomes play a central role in many eukaryote-specific cellular processes. Accordingly, eukaryotic ribosomes are at least 40% larger than their bacterial counterparts as a result of additional ribosomal RNA (rRNA) elements called expansion segments (ESs) and extra protein moieties (4).

All ribosomes are composed of two subunits. The large 60S subunit of the eukaryotic ribosome (50S in bacteria) consists of three rRNA molecules (25S, 5.8S, and 5S) and 46 proteins, whereas the small 40S subunit (30S in bacteria) includes one rRNA chain (18S) and 33 proteins. Of the 79 proteins, 32 have no homologs in crystal structures of bacterial or archaeal ribosomes, and those that do have homologs can still harbor large eukaryote-specific extensions (5). Apart from variability in certain rRNA expansion segments, all eukaryotic ribosomes, from yeast to human, are very similar.

Three-dimensional cryoelectron microscopy (cryo-EM) reconstructions of eukaryotic ribosomes at 15 to 5.5 Å resolution provided insight into the interactions of the ribosome with several factors (4, 6–8). A crystal structure of the S. cerevisiae ribosome at 4.15 Å resolution described the fold of all ordered rRNA expansion segments, but the relatively low resolution precluded localization of most eukaryote-specific proteins (9). Crystallographic data at a better resolution (3.9 Å) from the Tetrahymena thermophila 40S led to a definition of the locations and folds of all eukaryote-specific proteins in the

Institut de Génétique et de Biologie Moléculaire et Cellulaire, 1 rue Laurent Fries, BP10142, Illkirch F-67400, France; INSERM, U964, Illkirch F-67400, France; CNRS, UMR7104, Illkirch F-67400, France; and Université de Strasbourg, Strasbourg F-67000, France.

*These authors contributed equally to this work.

†To whom correspondence should be addressed. E-mail: adam@igbmc.fr (A.B.-S.); marat@igbmc.fr (M.Y.)
