ARTICLE
Perturbation-response genes reveal signaling
footprints in cancer gene expression
Michael Schubert1, Bertram Klinger2,3, Martina Klünemann2,3, Anja Sieber2,3, Florian Uhlitz 2,3, Sascha Sauer4,
Mathew J. Garnett5, Nils Blüthgen 2,3 & Julio Saez-Rodriguez 1,6
Aberrant cell signaling can cause cancer and other diseases and is a focal point of drug
research. A common approach is to infer signaling activity of pathways from gene expression.
However, mapping gene expression to pathway components disregards the effect of
post-translational modifications, and downstream signatures represent very specific experimental
conditions. Here we present PROGENy, a method that overcomes both limitations by
leveraging a large compendium of publicly available perturbation experiments to yield a
common core of Pathway RespOnsive GENes. Unlike pathway mapping methods, PROGENy
can (i) recover the effect of known driver mutations, (ii) provide or improve strong markers
for drug indications, and (iii) distinguish between oncogenic and tumor suppressor pathways
for patient survival. Collectively, these results show that PROGENy accurately infers pathway
activity from gene expression in a wide range of conditions.
DOI: 10.1038/s41467-017-02391-6 OPEN
1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, CB10 1SD, UK. 2Institute of Pathology,
Charité Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany. 3IRI Life Sciences and Institute for Theoretical Biology, Humboldt University Berlin,
Philippstr. 13/Haus 18, 10115 Berlin, Germany. 4Max Delbrück Center for Molecular Medicine (MDC), Berlin Institute for Medical Systems Biology/Berlin
Institute of Health, Robert-Rössle-Str. 10, 13092 Berlin, Germany. 5Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK.
6RWTH Aachen University, Faculty of Medicine, Joint Research Centre for Computational Biomedicine, Aachen 52057, Germany. Correspondence and
requests for materials should be addressed to J.S-R. (email: saezrodriguez@gmail.com)
NATURE COMMUNICATIONS | (2018) 9:20 |DOI: 10.1038/s41467-017-02391-6 |www.nature.com/naturecommunications 1
A wealth of molecular data have become available that
reflect a cell's state in different diseases. The challenge
that remains is how to derive predictive and reliable
biomarkers for disease status, treatment opportunities, or patient
outcome in a way that is both relevant and interpretable. Of
particular interest are methods that infer and quantify deregulation
of signaling pathways, as those are key for many processes
underpinning different diseases.
A particular example of this is cancer, which is largely caused
by cell signaling aberrations created by driver mutations and copy
number alterations1. Here, efforts like the TCGA2 and ICGC3
have pioneered molecular characterization of primary tumors on
a large scale. In addition, the GDSC4,5 and CCLE6 have focussed
on preclinical biomarkers of drug sensitivity in cancer cell lines.
These initiatives have provided profound insight into the molecular
makeup of the disease. However, putting the genomic alterations
investigated in the functional context of the pathways they alter
may provide additional information on mechanisms of pathogenesis
and treatment opportunities7.
With direct measurements of signaling activity not widely
available, the latter has often been inferred using gene expression.
This includes quantifying the expression level of a pathway gene
set (e.g., Gene Ontology8 or Reactome9) using Gene Set Enrichment
Analysis10, or other methods that are able to take pathway
structure into account11–13. While these methods can be applied
to almost any pathway, they are based on mapping transcript
expression to the corresponding signaling proteins and hence do
not take into account the effect of post-translational modifications
(Fig. 1a). It is therefore unclear if and under what circumstances
the pathway scores obtained by these methods reflect
signaling activity.
A complementary approach is to contrast two conditions with
known differential activity by means of a gene expression signature14.
Of particular interest are short-term perturbation
experiments that capture the primary response to a stimulus. A
well-known example of this is the Connectivity Map15 that has
been used to match drug-induced gene expression changes for
disease indications or drug repurposing16. In a similar manner,
many signatures have been proposed to infer pathway activity17–23,
including seminal work by Bild et al.17 that was later also used
to predict drug response in breast cancer cell lines18,24.
However, the same signaling pathways may trigger different
downstream gene expression programs depending on the cell type
or the perturbing agent used. Hence, if gene expression signatures
are to be used as a generally applicable pathway method, there is a
need to address this context specificity. In the past, methods have
been developed that addressed this by building consensus models
over multiple signatures and using these to infer pathway activity5,25,26.
These methods, however, have been limited by a low
number of perturbation experiments as well as inherent application
constraints.
Here, we overcome the limitations of both approaches by
leveraging a large compendium of publicly available perturbation
experiments that yield a common core of Pathway RespOnsive
GENes to a specified set of stimuli. PROGENy is able to better
infer pathway activity from perturbation experiments than
EPSA25, is applicable to panels of samples unlike SPEED26, and
performs better than a previous extension we proposed to the
latter5.
We performed a systematic comparison of PROGENy and
other commonly used pathway methods for 11 cancer-relevant
pathways. We investigated how well each method can recover
pathway perturbations and constitutive activity mediated by
driver mutations in The Cancer Genome Atlas (TCGA)2. We
further examined how well they can explain drug sensitivity to
265 drugs in 805 cancer cell lines in the Genomics of Drug
Sensitivity in Cancer (GDSC)4,5 and patient survival in 7254
primary tumors spanning 34 tumor types using TCGA data. We
found that PROGENy significantly outperforms existing methods
for these tasks.
Results
Consensus gene signatures for pathway activity. We curated
(workflow in Fig. 1b; experiments in Supplementary Note 1) a
total of 208 different submissions to ArrayExpress/GEO, span-
ning perturbations of the 11 pathways EGFR, MAPK, PI3K,
VEGF, JAK-STAT, TGFb, TNFa, NFkB, Hypoxia, p53-mediated
DNA damage response, and Trail (apoptosis). Our data set
consists of 568 experiments and 2652 microarrays, making it the
largest study of pathway signatures to date (Fig. 1c and Supple-
mentary Fig. 1).
We calculated z-scores of gene expression changes for each
experiment, for which we trained a regression model using the
perturbed pathway as input and gene expression as a response
variable. For each pathway, we identified 100 responsive genes
that are most consistently deregulated across experiments
(Supplementary Fig. 2). These responsive genes are specific to
the perturbed pathway and have little overlap with genes
encoding its signaling proteins (Supplementary Fig. 3). This
underscores the fact that pathway expression and activation are
distinct processes and suggests that they should be treated
separately. We use the z-scores of those 100 pathway-responsive
genes in a simple, yet effective, linear model, called PROGENy (for Pathway
RespOnsive GENes, but also to indicate the descent of the
method from previously published experiments; Supplementary
Data 1), to infer pathway activity from gene expression.
We find that our responsive genes are often enriched in
biological processes related to a signaling pathway, but not the
pathway itself (Supplementary Fig. 4).
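The model building and scoring described above can be illustrated with a minimal Python sketch. It rests on a simplification that is not taken from the paper: each experiment perturbs exactly one pathway and the regression has one indicator column per pathway and no intercept, in which case the least-squares coefficient of a gene for a pathway reduces to its mean z-score over that pathway's experiments. Ranking genes by absolute coefficient, the function names, and the toy gene names and values are likewise illustrative assumptions; the published method selects the 100 most consistently deregulated genes per pathway.

```python
from collections import defaultdict
from statistics import mean


def fit_coefficients(experiments):
    """Fit one coefficient per (pathway, gene).

    experiments: list of (perturbed_pathway, {gene: z_score}).
    Assumption: with one-hot pathway indicators and no intercept, the
    least-squares coefficient equals the mean z-score per pathway.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for pathway, z in experiments:
        for gene, value in z.items():
            collected[pathway][gene].append(value)
    return {p: {g: mean(v) for g, v in genes.items()}
            for p, genes in collected.items()}


def top_genes(coefficients, k):
    """Keep the k genes with the largest absolute coefficient per pathway
    (a hypothetical stand-in for 'most consistently deregulated')."""
    model = {}
    for pathway, genes in coefficients.items():
        ranked = sorted(genes, key=lambda g: abs(genes[g]), reverse=True)[:k]
        model[pathway] = {g: genes[g] for g in ranked}
    return model


def pathway_scores(model, expression):
    """Score one sample: weighted sum of its expression over the model genes."""
    return {p: sum(c * expression.get(g, 0.0) for g, c in genes.items())
            for p, genes in model.items()}


# Toy data: three perturbation experiments and four made-up genes.
experiments = [
    ("MAPK", {"gene_a": 4.0, "gene_b": 3.0, "gene_c": 0.1, "gene_d": 0.0}),
    ("MAPK", {"gene_a": 5.0, "gene_b": 2.0, "gene_c": -0.2, "gene_d": 0.5}),
    ("p53", {"gene_a": 0.0, "gene_b": 0.3, "gene_c": 0.2, "gene_d": 6.0}),
]
model = top_genes(fit_coefficients(experiments), k=2)
sample = {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 0.0, "gene_d": -1.0}
scores = pathway_scores(model, sample)
```

With real data, k would be 100 and the sample values would be the basal expression of each cell line or tumor, so that the score for a pathway is high when its responsive genes are expressed in the direction seen after perturbation.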
Using a leave-one-out strategy of model building and
perturbation scoring, our inferred pathway activation is strongly
(p < 10⁻¹⁰, except p < 10⁻⁵ for Trail) associated with the pathway
that was experimentally perturbed. The associations of a pathway
signature with other pathways are weaker (p > 10⁻⁵), except for
EGFR with MAPK/PI3K and TNFa with NFkB/MAPK (Fig. 2a
and Supplementary Fig. 5, left), where there is biologically known
cross-activation27. Relative activation patterns are consistent
across input experiments (Supplementary Fig. 5, right).
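A leave-one-out evaluation of this kind might look as follows in Python. This is a sketch under stated assumptions, not the authors' procedure: the model is simplified to the mean z-score of each gene per pathway, and the held-out experiment is simply assigned to the highest-scoring pathway, whereas the paper tests the association between scores and perturbations statistically. All names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit(experiments):
    """Simplified model: mean z-score per (pathway, gene)."""
    collected = defaultdict(lambda: defaultdict(list))
    for pathway, z in experiments:
        for gene, value in z.items():
            collected[pathway][gene].append(value)
    return {p: {g: mean(v) for g, v in genes.items()}
            for p, genes in collected.items()}


def score(model, z):
    """Weighted sum of the held-out z-scores per pathway."""
    return {p: sum(c * z.get(g, 0.0) for g, c in genes.items())
            for p, genes in model.items()}


def leave_one_out(experiments):
    """For each experiment, build the model without it and predict its pathway.

    Returns a list of (perturbed_pathway, predicted_pathway) pairs.
    """
    results = []
    for i, (pathway, z) in enumerate(experiments):
        training = experiments[:i] + experiments[i + 1:]
        held_out_scores = score(fit(training), z)
        predicted = max(held_out_scores, key=held_out_scores.get)
        results.append((pathway, predicted))
    return results


# Toy data: two made-up pathways with two experiments each.
experiments = [
    ("X", {"g1": 3.0, "g2": 0.5}),
    ("X", {"g1": 4.0, "g2": -0.5}),
    ("Y", {"g1": 0.2, "g2": 5.0}),
    ("Y", {"g1": -0.1, "g2": 4.0}),
]
results = leave_one_out(experiments)
```

The key property is that the scored experiment never contributes to the coefficients used to score it, so agreement between perturbed and predicted pathway is not a result of fitting to the same data.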
PROGENy separates basal and perturbed arrays better
(Supplementary Table 1; binomial test; p<0.04) than EPSA25
on our curated set of experiments, and in addition to SPEED26
also infers the sign of pathway activity (Supplementary Fig. 6).
We find that building the consensus of many experiments is
essential, as the z-scores from a single experiment perform no
better than random, and using too few experiments to derive the
model degrades performance. The exact number of experiments
required differs between pathways, but we see a plateau effect
between 20 and 50 signatures for most of them.
In order to also test PROGENy on a completely separate set, we
set aside 10 perturbation experiments that also measured pathway
activity in an orthogonal manner. We compared the activity
measurement from basal and perturbed condition with the
pathway scores that PROGENy inferred, and found that our
method could always predict the direction of perturbation
correctly, with separation statistics that are comparable to direct
measurements (Supplementary Fig. 7). Furthermore, we per-
formed independent validation experiments using the HEK293-
ER cell line, where we performed 5 distinct pathway perturba-
tions. We induced RAF/MAPK signaling using 4-hydroxy
tamoxifen (4OHT) that stimulates an RAF-ER transgene, and
used the PI3K inhibitor Ly294002 to block the PI3K/AKT
pathway, TNF-alpha to activate the TNF-alpha pathway as well as
the NFkB pathway downstream of it, TGF-beta 1 to activate the
TGFb pathway, and IFN-gamma to activate the JAK-STAT
pathway. We subsequently measured phospho-proteomics
(Fig. 2b) and gene expression upon perturbation. Results of these
experiments confirmed that the PROGENy scores (Fig. 2c)
capture pathway activity, as they accurately reflected the activated
pathway and agreed with the measured changes in the activity
status of key proteins in the corresponding pathways measured by
phosphorylation.
Now that we have confirmed how pathway-responsive genes
behave when a stimulus is present, we can take the idea one step
further and hypothesize that the existence of a different basal
expression level of the responsive genes may in turn correspond
to cell-intrinsic signaling activity. When we apply PROGENy to a
cell line panel, we find that the obtained pathway scores are
robust to changes in the experiments that the model was derived
from (Fig. 2d), and also observe a similar correlation as the
previously observed cross-activation upon perturbation (Supple-
mentary Fig. 8).
Recovering mechanisms of known driver mutations. If our
reasoning is correct and PROGENy signatures in basal gene
expression correspond to intrinsic signaling activity, we should be
able to see a higher pathway score in cancer patients with an
activating driver mutation in that pathway and a lower score for
pathway suppression compared to patients where no such
alteration is present.
[Figure 1: schematic with panels a (pathway reasoning: GO and pathway enrichment; SPIA, Pathifier, PARADIGM; signatures and PROGENy), b (workflow steps 1–6, from experiment curation through expression contrasts, QC, z-scores, and the linear regression model to the scores matrix), and c (number of experiments per pathway, to scale: EGFR 106, Hypoxia 66, JAK-STAT 66, MAPK 88, NFkB 46, PI3K 27, TGFb 31, TNFa 69, Trail 10, VEGF 36, p53 23).]
Fig. 1 Deriving pathway-response signatures for 11 pathways. a Reasoning about pathway activation. Most pathway approaches make use of either the set
(top panel) or infer or incorporate structure (middle panel) of signaling molecules to make statements about a possible activation, while signature-based
approaches such as PROGENy consider the genes affected by perturbing the pathway. b Workflow of the data curation and model building. (1) Finding and
curation of 208 publicly available experiment series in the ArrayExpress database, (2) Extracting 556 perturbation experiments from series' raw data, (3)
Performing QC metrics and discarding failures, (4) Computing z-scores per experiment, (5) Using a linear regression model to fit genes responsive to all
pathways simultaneously obtaining the z-coefficients matrix, (6) Assigning pathway scores using the coefficients matrix and basal expression data. See
methods section for details. c Size of the data set compared to an individual gene expression signature experiment. The amount of experiments that
comprise each pathway is shown to scale and indicated. Figure 1b (2) created by Guillaime Paumier is published under a CC-BY-SA license, sourced from
https://commons.wikimedia.org/wiki/File:DNA_microarray.svg. Figure 1b (4) is an adaptation (by Chen-Pan Liao) of the original work of User:Jhguch at
en.wikipedia, published under a CC-BY-SA license, sourced from https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg. Figure 1b (6) is an
adaptation (by User:Ogrebot) of the original work of User:Bilou at en.wikipedia, published under a CC-BY-SA license, sourced from
https://commons.wikimedia.org/wiki/File:Matrix_multiplication_diagram_2.svg
We selected all cancer types in the TCGA for which there were
tissue-matched normals available, in order to make full use of the
pathway methods that require them. We calculated pathway
scores for those using PROGENy, Reactome9 and Gene
Ontology8 enrichment, SPIA11, Pathifier13, PARADIGM12, a
modified version of SPEED5, and the Gatza et al.18 signatures
(Supplementary Table 2). We used an ANOVA to calculate
significant associations between the presence and absence of
mutations and copy number alterations and the inferred pathway
scores for our method (Fig. 3a) and others (Supplementary
Fig. 9).
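To make the association test concrete, the following Python sketch computes a one-way ANOVA F statistic for a pathway score split by the presence or absence of an alteration. It is an illustration only: the function name and the toy scores are hypothetical, and any covariates the authors may have included are not modeled. Turning F into a P value additionally requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic.

    groups: list of groups, each a list of pathway scores
    (for example, samples with and without a given mutation).
    """
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Variation explained by group membership.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation inside each group.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy pathway scores for mutated and wild-type samples (made-up values).
mutated = [2.0, 3.0, 4.0]
wild_type = [0.0, 1.0, 2.0]
f_statistic = anova_f([mutated, wild_type])
effect = mean(mutated) - mean(wild_type)  # positive: higher score when mutated
```

The sign of `effect` corresponds to inferred activation (positive) or inhibition (negative) of the pathway in altered samples, while the F statistic measures how strongly the two groups separate.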
In terms of proliferative signaling, we find that PROGENy
identifies EGFR amplifications to activate both the EGFR and
MAPK pathways (FDR < 10⁻⁹). KRAS mutations and amplifications
show an increase in inferred MAPK/EGFR activity. Other
methods do not detect a strong activation of the MAPK/EGFR
pathways given those alterations (Fig. 3b; top right and bottom
left). We find the same effect for BRAF mutations (FDR < 10⁻¹⁰)
that additionally activate TNFa/NFkB (FDR < 10⁻¹⁵).
For TP53 mutations, PROGENy finds a significant reduction in
p53/DNA damage response activity (FDR < 10⁻⁶⁴) and activation
of the PI3K and Hypoxia pathways (FDR < 10⁻¹⁵). This is in
contrast to loss of TP53, where we only find a reduction in p53/
DDR (FDR < 10⁻³), but no strong evidence of modification of any
other pathway (FDR > 0.04). The dual nature of TP53 mutations
and loss is in line with the recent discovery that TP53 mutations
can act in an oncogenic manner in addition to disrupting its
tumor suppressor activity, which has been shown for individual
cancer types28–31. In addition, this analysis suggests a link
between TP53 mutations and genes that are induced by activation
of canonical oncogenic signaling such as PI3K or the hypoxic
response. Other methods (Fig. 3b; top left) do not recover the
expected negative association between these alterations and p53/
DDR activity. Gene Ontology showed a much weaker effect in the
same direction, while Reactome, Pathifier, and SPIA showed an
incorrect positive effect. These methods do, however, capture the
activation of other oncogenic pathways, suggesting that this effect
is driven by expression changes that then lead to changes in
activity.
PROGENy finds that VHL mutations (which have a high
overlap with Kidney Renal Carcinoma, KIRC) are associated with
an expected stronger induction of hypoxic genes32 compared to
other cancer types (FDR < 10⁻²⁰⁰). It is the only method to
recover hypoxia as the strongest link with VHL mutations, while
the other methods primarily report expression changes in
unrelated pathways (Fig. 3b; bottom right). More surprisingly,
we find that the presence of PIK3CA amplifications and PTEN
deletions is more strongly associated with an increased hypoxic
response (FDR < 10⁻⁶) than with an effect on the PI3K-
responsive genes (Supplementary Table 3). A role of PI3K
signaling in hypoxia has been shown before33–35.
These highlights reflect the more general pattern that PROGENy
is able to correctly infer the impact of driver mutations that
the other pathway expression-based methods could not. The
latter are only able to identify some cases where activity is
mediated by changes in the expression level of the pathway
members themselves.
[Figure 2: panels a–d. a: heatmap of assigned score against pathway perturbed for the 11 pathways (Wald statistic, −30 to 30). b: relative activation (−0.5 to 0.5) of the phosphoproteins AKT, ERK, IkBa, JNK, MEK, Smad2, Stat3, cJun, and mTOR under the perturbations 4OHT, IFN, LY, TGF, and TNF. c: PROGENy pathway scores under the same perturbations (standard deviations over control, −2 to 2). d: cell line variance over input variance per pathway (log scale, 1 to 100).]
Fig. 2 Evaluation of pathway-response signatures. a Associations for PROGENy pathway scores with experimental perturbation for experiments that the
model was not built with (leave-one-out cross-validation). Each pathway is strongly associated with its own perturbation, and we observe few cases of
cross-talk in agreement with biological knowledge. b Pathway perturbations in HEK293 cell line activate the corresponding signaling proteins. MEK and ERK
for MAPK pathway, Stat3 for Interferon-induced JAK-STAT, AKT for PI3K, Smad2 for TGFb, and IKb for TNF-alpha-induced NFkB. As expected, all
increased upon stimulation except AKT that decreased upon inhibition. Activation shown relative to maximum readout per antibody, p values reported for
one-sample one-sided t test. Results are significant if p<0.05 and perturbation is at least 30% of maximum. c PROGENy correctly infers pathway activity
from gene expression in the HEK293 experiments. Associations are significant if p value of two-sample one-sided t test <0.05 and experiments are at least
1.5 standard deviations above or below the control. d Stability of basal pathway scores when bootstrapping input experiments. Bars show how much more
variance in pathway scores (GDSC panel) is introduced by cell line identity over using resampled perturbation experiments in model building. Variance by
cell line is over five times as high for most pathways, and roughly twice as high for Trail and VEGF
Associations with drug response. The next question we tried to
answer is how well PROGENy is able to explain drug sensitivity
in cancer cell lines. We took as a measure of efficacy the IC50, i.e.,
the drug concentration that reduces viability of cancer cells by
50%, for 265 drugs and 805 cell lines from the GDSC project5. We
performed an ANOVA between those IC50 values and inferred
pathway scores of PROGENy and the other methods we
investigated.
We found 178 significant associations for PROGENy (10%
FDR in Fig. 4a and Supplementary Fig. 10), dominated by
sensitivity associations between MAPK/EGFR activity and drugs
targeting the MAPK pathway (Fig. 4b) that are consistent with
oncogene addiction. In particular, this includes associations of the
MAPK/EGFR pathways with different MEK inhibitors (Trametinib,
RDEA119, CI-1040, etc.), a RAF inhibitor (AZ628) and a
TAK1 inhibitor (7-Oxozeaenol). However, the strongest hit we
obtained was the association between Nutlin-3a and p53-
responsive genes. Nutlin-3a is an MDM2-inhibitor that in turn
stabilizes p53. Since it has also previously been shown that a
mutation in TP53 is strongly associated with increased resistance
to Nutlin-3a4, this is a well-understood mechanism of sensitivity
(presence) or resistance (absence of p53 activity) to this drug that
our method captures, but none of the pathway expression-based
methods do.
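Because every drug is tested against every pathway, the resulting P values need a multiple-testing correction before a 10% FDR threshold can be applied. The sketch below shows the Benjamini-Hochberg procedure in Python as one common way to do this; the text only states that P values are FDR-corrected, so the choice of this particular procedure, the function name, and the toy P values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy P values for four hypothetical drug-pathway pairs.
raw = [0.001, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw)
n_significant = sum(1 for q in fdr if q < 0.10)  # associations at 10% FDR
```

Counting the adjusted values below a threshold, as in the last line, is how a number such as the associations reported at 10% FDR would be obtained from a table of per-pair tests.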
Considering the overall number of associations, the other
pathway methods provided a lower number across the range of
significance (Fig. 4a). PROGENy outperforms associations
obtained with driver mutations at 10% FDR, as those only yield
136 associations. The latter only provide stronger associations
for TP53, where the signature is a compound of p53 signaling and
DNA damage response, and PLX4720/Dabrafenib, drugs that
were specifically designed to target mutated BRAF. Of the 147 out of
265 drugs covered by significant associations with either
PROGENy or driver mutations, PROGENy provided stronger
associations for 78, with a significant enrichment in cytotoxic
drugs compared to targeted drugs for mutations (Fisher's exact
test, p<0.01).
[Figure 3: panels a and b. a: volcano plot of Δ normalized pathway score (horizontal axis) against adjusted P value (vertical axis) for mutations and copy number aberrations; labelled associations include TP53 mut with p53/DDR, BRAF mut with MAPK, and EGFR amp with EGFR. b: heatmaps of Wald statistics per pathway for PROGENy, Gene Ontology, Reactome, SPIA, Pathifier, PARADIGM, Gatza (2009), and Iorio (2016), shown for TP53 mutations, KRAS mutations, EGFR amplifications, and VHL mutations.]
Fig. 3 Ability of pathway methods to recover well-known mutations. a Volcano plot of pan-cancer associations between driver mutations and copy number
aberrations with differences in pathway score. Pathway scores calculated from basal gene expression in the TCGA for primary tumors. Size of points
corresponds to occurrence of aberration. Type of aberration is indicated by superscript "mut" if mutated and "amp"/"del" if amplified or deleted, with
colors as indicated. Effect sizes on the horizontal axis larger than zero indicate pathway activation and smaller than zero indicate inferred inhibition. P
values on the vertical axis FDR-adjusted with a significance threshold of 5%. Associations shown without correcting for different cancer types. Associations
with a black outer ring are also significant if corrected. b Comparison of pathway scores (vertical axes) across different methods (horizontal axes) for TP53
and KRAS mutations, EGFR amplifications and VHL mutations. Wald statistic shown as shades of green for downregulated and red for upregulated
pathways. P value labels shown as indicated. White squares where a pathway was not available for a method
However, stratification using PROGENy and mutated driver
genes is not mutually exclusive. Our pathway scores are able to
further stratify the mutated and wild-type sub-populations into
more and less sensitive cell lines (Fig. 4c and Supplementary
Tables 4–5). This includes, but is not limited to, BRAF, NRAS, or
KRAS mutations using MAPK pathway activity and the MEK
inhibitor Trametinib (Fig. 4c; top left) or RAF inhibitor AZ628
(Fig. 4c; bottom left), BRAF mutations with Dabrafenib (Fig. 4c;
top right), and TP53 mutations with p53/DDR and Nutlin-3a
(Fig. 4c; bottom right). For MAPK- and BRAF-mutated cell lines,
we find that cell lines with an active MAPK pathway according to
PROGENy are 65-fold (AZ628), 9130-fold (Trametinib), or 10⁴-fold
(Dabrafenib) more sensitive than those where it is inactive. For
Trametinib, cell lines with active MAPK but no mutation in
BRAF, KRAS, or NRAS are 15 times more sensitive than cell lines
that harbor a mutation in any of them but in which our analysis
determined that MAPK is inactive (Supplementary Table 5; fold
changes reported for median of subset).
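The fold changes above compare median drug response between cell lines with high and low inferred pathway activity. A minimal Python sketch of such a comparison follows. It assumes IC50 values on a linear scale and defines the quartiles by rank; the function name and the toy numbers are hypothetical, and the Mann-Whitney U-test that accompanies the fold changes in Fig. 4c is not implemented here.

```python
from statistics import median


def quartile_fold_change(scores, ic50):
    """Median IC50 of the bottom score quartile divided by that of the top.

    scores, ic50: parallel lists with one entry per cell line.
    Values above 1 mean that cell lines with high pathway activity
    need less drug, i.e. they are more sensitive.
    """
    if len(scores) != len(ic50) or len(scores) < 4:
        raise ValueError("need at least four cell lines with paired values")
    # Cell line indices ordered from lowest to highest pathway score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quarter = len(order) // 4
    bottom = [ic50[i] for i in order[:quarter]]
    top = [ic50[i] for i in order[-quarter:]]
    return median(bottom) / median(top)


# Toy data: eight cell lines with made-up pathway scores and IC50 values.
pathway_score = [-2.0, -1.5, -0.5, 0.0, 0.3, 0.8, 1.6, 2.1]
ic50_um = [40.0, 60.0, 20.0, 10.0, 8.0, 5.0, 1.5, 0.5]
fold = quartile_fold_change(pathway_score, ic50_um)
```

The same function can be applied to a subset, for example only cell lines carrying a BRAF, NRAS, or KRAS mutation, to ask whether the pathway score stratifies response within a mutation-defined group.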
Taken together, these results show that PROGENy can be used
to complement mutation-derived biomarkers by either refining
them or providing an alternative where no such marker exists.
Associations obtained with other methods do not show strong
interactions between pathways and drugs that target their
members (Supplementary Fig. 10). Furthermore, our associations
hold true in an independent sensitivity screen for overlapping
drugs (CCLE; Supplementary Table 6).
Implications for patient survival. The implications of inferred
pathway activity compared to pathway expression are expected to
be less clear for patient survival than for cell line drug response
due to the many more factors that affect the phenotype observed.
Nonetheless, we were interested in how our inferred pathway
activity compared to pathway expression methods in terms of
overall patient survival.
Across all cancer types, PROGENy found a strong association
between the activation of EGFR, MAPK, PI3K, and Hypoxia
pathways and decreased survival, similar to other signature
methods (Fig. 5a). Gene Ontology found much weaker
associations for expression of those pathways, and the other
pathway mapping methods missed them almost entirely.
PROGENy is the only method to find an increase in survival
associated with the activation of the Trail/apoptosis pathway,
while other methods show either a decrease or no effect, or we did
not find an appropriate pathway in the signatures we compared.
For JAK-STAT, NFkB, p53, and VEGF pathways there are no
significant associations that are picked up by more than one
method (FDR <0.05). Compared to pathway mapping, signature
methods provide associations of similar strength with overlapping
pathways.
For individual cancer types, PROGENy finds a similar
separation between oncogenic and tumor-suppressor pathways
(Fig. 5b), showing that it can capture both general and specific
patterns in gene expression changes. Importantly, pathway
mapping methods do not provide this separation and our
associations are significant for more cancer types and more
specific to individual pathways (Supplementary Fig. 11). In
addition, we find cancer-specific associations of pathways with no
effect in the pan-cancer setting: For instance, with PROGENy,
Adrenocortical Carcinoma (ACC) shows a significant increase of
survival with p53 activity (FDR < 10⁻³). This positive effect of p53
on survival is supported by the fact that ACC samples do not
EGFR
TGFb
Ras
BRAF/Raf
MEK
ERK
PI3K
TAK1
JNK
JUN
FOS
AZ628
Dabrafenib PD-0325901
RDEA119
Trametinib
CI-1040
VX-11e
7-Oxo-
zeaenol
MAPK
p53
Nutlin-3a
−5
0
5
Trametinib
All wt BRAF|NRAS |KRAS MAPK (+,0,)MAPK (+,0,)
All wt BRAF|NRAS |KRAS MAPK (+,0,)
AZ628
0
5
IC50 [log μM]
mut+wt mut wt MAPK top quartile p53 top quartile Pathway bottom quartile
FC 9.1·103, p 7.1·10–7
−5
0
5
Dabrafenib
All wt BRAF mut
FC 65, p 0.014
0
2
4
6
Nutlin-3a
All mut TP53 wt p53 (+,0,)
– log FDR
Number of associations
1
10
100
1 10 20 30
PROGENy
Gene ontology
Reactome
SPIA
Pathifier
Mutations
Gatza (2009)
TP53 mut
Nutlin-3a
BRAF mut
PLX4720, Dabrafenib
Iorio (2016)
p53: Nutlin-3a
MAPK, EGFR: MEK, ERK,
BRAF inhibitors
TNFa
XAV 939 [Wnt]
ab
c
FC 5.4·104, p 0.14
FC 5, p 0.017
Fig. 4 MAPK and p53 scores drive drug response across all cancer types. aComparison of the associations obtained by different pathway methods.
Number of associations on the vertical and FDR on the horizontal axis. PROGENy yield more and stronger associations than all other pathway methods.
Mutation associations are only stronger for TP53/Nutlin-3a and drugs that were specically designed to bind to a mutated protein. PARADIGM not shown
because no associations <10% FDR. markers (green) and greater than zero resistance markers (red). Pvalues FDR-corrected. bPathway context of the
strongest associations (Supplementary Fig. 10) between EGFR/MAPK pathways and their inhibitors obtained by PROGENy. cComparison of stratication
by mutations and pathway scores. MAPK pathway (BRAF,NRAS,orKRAS) mutations and Trametinib on top left panel, AZ628 bottom left, BRAF mutations
and Dabrafenib top right, and p53 pathway/TP53 mutations/Nutlin-3a bottom right. For each of the four cases, the leftmost violin plot shows the
distribution of IC
50
s across all cell lines, followed by a stratication in wild-type (green) and mutant cell lines (blue box). The three rightmost violin plots
show stratication of all the cell lines by the top, the two middle, and the bottom quartile of inferred pathway score (indicated by shade of color). The two
remaining violin plots in the middle show mutated (BRAF,KRAS,orNRAS; blue color) or wild-type (TP53; green color) cell lines stratied by the top- and
bottom quartiles of MAPK or p53 pathways scores (MannWhitney U-test statistics as indicated)
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-017-02391-6
6NATURE COMMUNICATIONS | (2018) 9:20 |DOI: 10.1038/s41467-017-02391-6 |www.nature.com/naturecom munications
Content courtesy of Springer Nature, terms of use apply. Rights reserved
harbor any previously reported gain-of-function TP53 variants31.
Kidney Renal Clear Cell Carcinoma (KIRC) and Low-Grade
Glioma (LGG) show decreased survival with TNFa and JAK-
STAT pathways, respectively, where specic activating mutations
are much less known than for EGFR/MAPK. For these three
associations, the top and bottom quartiles of PROGENy pathway
activity were able to stratify patients in groups with over 25%
difference in one year survival (Fig. 5c). These associations are
stable when resampling patients (Supplementary Table 7).
In summary, we can observe that signature-based methods
generally outperform pathway mapping for survival associations,
but the difference between PROGENy and the signatures of
SPEED5,26 and Gatza et al.18 is less pronounced than for driver
mutations or drug response.
Discussion
The explanation of phenotypes in cancer, such as cell line drug
response or patient survival, has largely been focussed on genomic
alterations (mutations, copy number alterations, and structural
variations). While this approach has generated many
important insights into cancer biology, it does not directly make
statements about the impact of those aberrations on cellular
processes and signal transduction in particular. Pathway methods,
mostly used on gene expression, have produced mixed results
when it comes to delivering actionable evidence. This can in part
be due to lack of robustness, as suggested by the heterogeneity in
responses of individual signatures (Supplementary Fig. 6), but
arguably also to the fact that extracting features that reflect
pathway activity from gene expression is not trivial. With
proteomics lagging behind sequencing data for the foreseeable future,
we have a need to address the accurate inference of pathway
activity from gene expression in heterogeneous samples using a
general downstream gene expression pattern.
We developed PROGENy in order to achieve this. PROGENy
leverages a large compendium of pathway-responsive gene
signatures derived from a wide range of different conditions in order
to identify genes that are consistently deregulated. While this
approach has been taken before, previous studies either focussed
less on integrating responses from many different cell lines25 or
derived their scores from a much smaller collection of
perturbation experiments5,26.
We found that despite the heterogeneity of individual gene
expression experiments, PROGENy closely corresponds to pathway
perturbations. PROGENy can recover the impact of known
driver mutations from basal gene expression, but also identify
cases where a pathway is active without their presence. In
contrast, pathway mapping only recovers known associations, where
this effect is mediated by expression changes in pathway
members, such as TP53 oncogene activation or copy number
aberrations. We applied PROGENy to a drug sensitivity data set,
where the significant associations we obtained corresponded
better to known drug–pathway interactions than those of
competing methods. PROGENy was also able to consistently
distinguish between oncogenic driver pathways (mainly EGFR and
MAPK) and cell death (Trail) pathways for patient survival.

[Figure 5. Panel a: methods (Gatza (2009), PARADIGM, Pathifier, SPIA, Reactome, Gene ontology, PROGENy, Iorio (2016)) against pathways (EGFR, Hypoxia, JAK-STAT, MAPK, NFkB, p53, PI3K, TGFb, TNFa, Trail, VEGF), colored by Wald statistic. Panel b axes: survival decrease / Δ pathway score (horizontal), adjusted P value (vertical). Panel c axes: weeks (horizontal), fraction (vertical); panels LGG: JAK-STAT (p < 10⁻³), ACC: p53 (p < 0.002), KIRC: TNFa (p < 10⁻⁴).]
Fig. 5 Response signatures outperform pathway methods for patient survival. a Pan-cancer associations between pathway scores and patient survival.
Pathways on the horizontal axis and different methods on the vertical axis. Associations of survival increase (green) and decrease (red). Significance labels as
indicated. Shades correspond to effect size, P values as indicated. b Volcano plot of cancer-specific associations between patient survival and inferred
pathway score using PROGENy. Effect size on the horizontal axis. Below zero indicates increased survival (green), above decreased survival (red).
FDR-adjusted P values on the vertical axis. Size of the dots corresponds to number of patients in each cohort. c Kaplan–Meier curves of individual associations
for kidney (KIRC), low-grade glioma (LGG), and adrenocortical carcinoma (ACC). Pathway scores are split in top and bottom quartiles and center half.
Lines show the fraction of patients (vertical axis) that are alive at a given time (horizontal axis) within one year. P values for discretized scores

Overall, our results suggest that PROGENy provides a better
measure of pathway activity than other pathway methods, irrespective
of whether the latter were derived from gene sets or
directed paths. The latter can be used for many more pathways, as
information on the pathway components is more often available
than perturbation experiments. However, our results indicate that
one should be cautious when interpreting the expression level of a
pathway as its activity.
We have shown that PROGENy is able to refine our understanding
of the impact of mutations, as well as their utility for cell
line drug response and patient survival. It provides strong
evidence that in order to infer pathway activity, e.g., for patient
stratification, a downstream readout should be used instead of
mapping transcript expression levels to signaling molecules.
While PROGENy provides a good estimate of pathway activity
in large and heterogeneous data sets, signatures derived from, for
instance, a specific tissue may still more closely reflect activation
status given the same context. We see a hint of this when applying
the Gatza et al. signatures for the TCGA breast cancer cohort, but
more studies will be required to fully elucidate the differences
between a common response and additional transcriptional
modules that may not always be activated. We believe that our
curated set of experiments and computational pipeline will be
useful to further investigate this aspect of specialized vs.
consensus signatures and when either of them should be used.
Methods
Data from The Cancer Genome Atlas (TCGA). To obtain the TCGA data, we
used the Firehose tool from the BROAD institute (http://gdac.broadinstitute.org/),
release 28 January 2016.
For gene expression, we used all data labeled "Level 3 RNA-seq v2". We
extracted the raw counts from the text files for each gene, discarded those that did
not have a valid HGNC symbol, and averaged expression levels where more than
one row corresponded to a given gene. We then performed a variance stabilizing
transformation (DESeq2 package36, BioConductor) for each TCGA study
separately, to be able to use linear modeling techniques with the count-based
RNA-seq data. The data used corresponds to 34 cancer types and a total of 9737 tumor
and 641 matched normal samples.
From the clinical data, we extracted the vital status and used known survival
time or known time of last follow-up as the survival time for the downstream
analyses. We converted the time in days to months by dividing by 30.4. We
obtained both mRNA expression levels as well as survival times for 10,544 patients,
distributed across 33 cancer types. For comparing different pathway methods, we
only used cancer types with tissue-matched controls, leaving 5927 samples in 13
cancer types.
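The survival-time handling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and its two arguments are hypothetical, and the sketch assumes that a recorded death takes precedence over a follow-up time when both are present.

```python
DAYS_PER_MONTH = 30.4  # conversion factor stated in the text


def survival_months(days_to_death, days_to_last_followup):
    """Return (time in months, event observed) for one patient.

    Hypothetical inputs: either value may be None when it is unknown.
    """
    if days_to_death is not None:
        return days_to_death / DAYS_PER_MONTH, True
    if days_to_last_followup is not None:
        return days_to_last_followup / DAYS_PER_MONTH, False
    raise ValueError("no survival or follow-up time available")
```

The boolean marks whether the time is an observed death or a censored follow-up, which is the pair of values a survival model typically expects.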
Data for cell line gene expression and drug sensitivity. We used version 17a of
the Genomics of Drug Sensitivity in Cancer (GDSC) data5, comprised of molecular
data for 1001 cell lines and 265 anticancer drugs, specifically microarray gene
expression data (ArrayExpress accession E-MTAB-3610) and the IC50 values for
each drug–cell line combination. For computing pan-cancer associations, we used
the subset with TCGA-like cancer type label, leaving 805 cell lines.
We downloaded the Cancer Cell Line Encyclopedia (CCLE) microarray gene
expression and drug sensitivity data from the CCLE web page (https://portals.
broadinstitute.org/ccle). For microarray data (20130318), we performed RMA
normalization, and mapped the probes to HGNC gene symbols. We used drug
profiling data version 20120220 and drug metadata version 20150224.
Perturbation experiments of HEK293 cell line. HEK293ΔRAF1:ER cells were
acquired and cultured as previously described37. Before treatments, cells were
starved in serum-free medium overnight. Cells were treated with
4-hydroxytamoxifen (4OHT, Sigma-Aldrich; 0.5 µM), Ly294002 (Life Technologies; 10 µM)
or the following ligands from Peprotech: TNF-alpha (20 ng/ml), TGF-beta 1
(10 ng/ml), IFN-gamma (50 ng/ml). Cell lines have been tested for Mycoplasma
infection using Venor GeM Classic (Minerva Biolabs).
RNA sequencing for HEK293 perturbations. After 4 h of treatment, total RNA
was extracted with Qiagen RNeasy Mini Kit. Sequencing libraries were prepared
using Illumina TruSeq mRNA Library Prep Kit v2 and sequenced on Illumina
HiSeq 2000. Read quality was assessed using FastQC and sequencing adapters were
trimmed using cutadapt38. Reads were mapped with STAR aligner v2.5.0c39 on
hg19 using GENCODE v19 for annotation and quantified with subread
featureCounts40. The preprocessing pipeline was written in Snakemake41. Raw read
counts were then normalized with DESeq2 and variance stabilization
transformed36.
Phosphoprotein measurements for HEK293 perturbations. Protein extracts of
cells were prepared by incubation with cell lysis buffer (Bio-Plex Pro Cell signaling
Reagent Kit, BioRad). The BioPlex Protein Array system (BioRad, Hercules, CA)
was used, as described earlier42. A total of 10 µg protein was analyzed. The
following analytes were used: AKT (S473), c-Jun (S63), ERK1/2 (T202,Y204/T185,Y187),
IkBa (S32,S36), JNK (T183,Y185), MEK1 (S217,S221) and mTOR (S2448). The beads and
detection antibodies were diluted 1:3. For data acquisition, the xPONENT software
was used.
The following antibodies were used for western blot measurements: rabbit anti
human p-SMAD2 (Ser465/467) (138D4) #3108, rabbit anti human p-Stat3 (Tyr
705) #9131 and rabbit anti human β-Tubulin #2146. All primary antibodies were
diluted 1:1000 and obtained from Cell Signaling Technology. Electrophoresis was
performed and lysates were transferred onto nitrocellulose membranes. Unbound
protein sites were blocked with 1:2 Odyssey Blocking Buffer (from LiCOR) and
PBS. Thereafter, specific proteins were detected by incubation with primary
antibodies diluted in the same blocking buffer containing 0.1% Tween20 overnight
at 4 °C followed by near-infrared dye labeled secondary antibodies. For detection of
phosphorylated SMAD2 and Stat3, a total of 30 and 60 µg protein was used,
respectively. Membrane images were taken using LiCOR Odyssey Fc. The bands
were quantified by determining the background corrected total intensities using
ImageStudio software (Li-COR). All signals were normalized to β-Tubulin.
Two biological replicates were measured both after 30 min and 1 h and
outcomes were analyzed together by calculating log2 ratios to their respective
solvent control (BSA).
Curation of perturbation-response experiments. Our method is dependent on a
sufficiently large number of available perturbation experiments that activate or
inhibit one of the pathways we were looking at. The following conditions needed to
be met in order for us to consider an experiment: (1) the compound or factor used
for perturbation was one of our curated list of pathway-perturbing agents
(Supplementary Note 1); (2) the perturbation lasted for less than 24 h to capture genes
that belong to the primary response; (3) there was raw data available for at least two
control arrays and one perturbed array; (4) it was a single-channel array; (5) we
could process the arrays using available BioConductor packages; (6) the array was
not custom-made so we could use standard annotations.
We curated a list of known pathway activators and inhibitors for 11 pathways,
where the interaction between each compound and pathway is well established in
literature (Supplementary Note 1). We then used those as query terms for public
perturbation experiments in the ArrayExpress database43 and included a total of
219 submissions and 581 experiments in our data set, where each experiment is a
distinct comparison between basal and perturbed arrays. If there were multiple
time points, different cells, different concentrations, or different perturbing agents
within a single database submission, they were considered as different experiments.
Microarray processing. Starting from the curated list of perturbation-induced gene
expression experiments, we included all single-channel microarrays with at least
two replicates in the basal condition with raw data available that could be processed
by either the limma44, oligo45, or affy46 BioConductor packages and for which
there was a respective annotation package available.
We first calculated probe-level expression levels for 581 full series of arrays,
where we performed quality control of the raw data using RLE and NUSE cutoffs
under 0.1 and kept all arrays below that threshold. If after filtering fewer than two
basal condition arrays remained, the whole experiment was discarded. For the
remaining 575 experiments we normalized expression data using the RMA
algorithm and mapped the probe identifiers to HGNC symbols.
Building a linear model of pathway-response genes. We set aside 10
experiments for model validation. For the remainder and each HGNC symbol, we
calculated a model based on mean and standard deviation of the gene expression level,
and computed the z-score as average number of standard deviations that the
expression level in the perturbed array was shifted from the basal arrays. We then
performed LOESS smoothing for all z-scores in a given experiment using our null
model as described previously26.
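The z-score step can be illustrated with a small sketch. It assumes, for simplicity, that the shift is measured against the mean and sample standard deviation of the basal arrays of one gene in one experiment; the LOESS-smoothed null model mentioned above is not reproduced here, and the function name is hypothetical.

```python
from statistics import mean, stdev


def gene_z_score(basal, perturbed):
    """Average shift of the perturbed arrays from the basal mean,
    expressed in basal standard deviations (simplified sketch)."""
    if len(basal) < 2:
        raise ValueError("at least two basal arrays are required")
    center = mean(basal)
    spread = stdev(basal)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("basal arrays show no variation")
    return mean([(value - center) / spread for value in perturbed])
```

For example, `gene_z_score([0.0, 2.0, 4.0], [2.0, 6.0])` uses a basal mean of 2 and a standard deviation of 2, so the two perturbed arrays are shifted by 0 and 2 standard deviations and the average is 1.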
From the z-scores of all experiments and all pathways, we performed a linear
regression with the pathway as input and the z-scores as response variable for each
gene separately:
$$Z_g \sim M_p \quad \forall\, g, p$$
where $Z_g$ is the z-score for a given gene $g$ across all input experiments. $M_p$ is a
perturbation indicator vector across all input experiments for each pathway $p$ that
has the coefficient 1 if the experiment had a pathway activated, −1 if inhibited, and
0 otherwise. For instance, the Hypoxia pathway had experiments with low oxygen
conditions set as 1, HIF1A knockdown as −1, and all other experiments as 0. The
same is true for EGFR and EGF treatment vs. EGFR inhibitors, respectively, with
the additional coefficients of MAPK pathways set to 1 because of known cross-talk.
TNFa perturbations also changed NFkB coefficients for the same reason.
From the result of the linear model, we selected the top 100 genes per pathway
according to their significance (p value) and took their estimate (the fitted z-scores)
as coefficient. We set all other gene coefficients to 0. This way, we obtained a matrix
with HGNC symbols in rows and pathways in columns, where each pathway had
100 non-zero gene coefficients (Supplementary Data 1).
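A simplified sketch of the gene selection for a single pathway is given below. Several choices are assumptions made for illustration and are not taken from the paper: the fit uses one pathway indicator at a time with an intercept, and genes are ranked by the absolute t-statistic of the slope, which gives the same ordering as the p value when all genes share the same number of experiments. The function names are hypothetical.

```python
def fit_gene(indicator, z_scores):
    """Least-squares fit of z_scores on a pathway indicator (1, -1, 0)
    with an intercept. Returns (slope, t_statistic of the slope)."""
    n = len(indicator)
    m_bar = sum(indicator) / n
    z_bar = sum(z_scores) / n
    sxx = sum((m - m_bar) ** 2 for m in indicator)
    sxy = sum((m - m_bar) * (z - z_bar) for m, z in zip(indicator, z_scores))
    slope = sxy / sxx
    intercept = z_bar - slope * m_bar
    rss = sum((z - intercept - slope * m) ** 2
              for m, z in zip(indicator, z_scores))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    if std_err == 0.0:
        return slope, (float("inf") if slope != 0 else 0.0)
    return slope, slope / std_err


def top_k_coefficients(indicator, z_by_gene, k=100):
    """Keep the slope of the k most significant genes, set the rest to 0."""
    fits = {gene: fit_gene(indicator, z) for gene, z in z_by_gene.items()}
    ranked = sorted(fits, key=lambda gene: abs(fits[gene][1]), reverse=True)
    kept = set(ranked[:k])
    return {gene: (fits[gene][0] if gene in kept else 0.0) for gene in fits}
```

Repeating this for every pathway and stacking the resulting columns would give a genes-by-pathways coefficient matrix with k non-zero entries per column.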
PROGENy scores. Each column in the matrix of perturbation-response genes
corresponds to a plane in gene expression space, in which each cell line or tumor
sample is located. Following its normal vector from the origin, the distance
spanned corresponds to the pathway score P that each sample is assigned (matrix of
samples in rows, pathways in columns). In practice, this is achieved by a simple
matrix multiplication between the gene expression matrix E (samples in rows, genes
in columns, values are expression levels) and the model matrix G (genes in rows,
pathways in columns, values are our top 100 coefficients):
$$P = E \times G$$
We then scaled each pathway or gene set score to have a mean of zero and standard
deviation of one, in order to factor out the difference in strength of gene expression
signatures and thus be able to compare the relative scores across pathways and
samples at the same time.
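The scoring step amounts to a matrix product followed by column-wise scaling, which can be sketched with plain nested lists. The function names are hypothetical, and the use of the sample standard deviation (n − 1) for scaling is an assumption.

```python
from statistics import mean, stdev


def pathway_scores(expression, coefficients):
    """Multiply a samples x genes matrix by a genes x pathways matrix."""
    n_pathways = len(coefficients[0])
    return [
        [sum(value * gene_row[p] for value, gene_row in zip(sample, coefficients))
         for p in range(n_pathways)]
        for sample in expression
    ]


def scale_columns(scores):
    """Scale each pathway (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*scores))
    centers = [mean(column) for column in columns]
    spreads = [stdev(column) for column in columns]
    return [
        [(value - c) / s for value, c, s in zip(row, centers, spreads)]
        for row in scores
    ]
```

With three samples, two genes and two pathways, `scale_columns(pathway_scores(E, G))` returns a 3 × 2 matrix whose columns are comparable across pathways.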
EPSA model. The EPSA model was built as previously published25 with the
following modifications: (1) we used the mean of the treated and untreated arrays for
each experiment in order to avoid bias by experiment size; (2) we calculated
significance of differential expression with limma44, not SAM; and (3) we selected
the top 100 significant genes due to very different gene numbers at 5 or 10% FDR.
Comparison to other signature and signature consensus methods. We
calculate pathway scores for all perturbation experiments in the direction of activation
(activated − control and control − inhibition). For methods that work on
differential expression (SPEED: both using the original web server at
https://speed.sys-bio.net and running the method on our perturbation experiments, GSEA using
Kolmogorov–Smirnov statistic), we use the negative logarithm of the p value as
pathway score. For methods that score individual samples (PROGENy, EPSA), we
use the difference of the mean between basal and perturbed arrays for each
experiment. For both, we normalize the pathway scores per experiment because of
the different strength of perturbations. We then quantify how well each pathway
signature ranks experiments where a pathway was perturbed before experiments
where it was not, using the area under the receiver operating characteristic
curve (ROC AUC). We quantify if
a given method has a consistently higher AUC than another across pathways using
a binomial test (Supplementary Fig. 6a).
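The ranking criterion can be illustrated with a small ROC AUC function. This sketch uses the pairwise (Mann–Whitney) formulation of the AUC, counting ties as one half; the function name is hypothetical and the paper does not state which implementation was used.

```python
def roc_auc(scores, perturbed):
    """Fraction of (perturbed, unperturbed) experiment pairs in which the
    perturbed experiment receives the higher pathway score."""
    positives = [s for s, flag in zip(scores, perturbed) if flag]
    negatives = [s for s, flag in zip(scores, perturbed) if not flag]
    if not positives or not negatives:
        raise ValueError("both perturbed and unperturbed experiments are needed")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 1 means that every perturbed experiment is ranked above every unperturbed one, while 0.5 corresponds to a random ordering.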
In addition, we quantify the influence of the number of signatures on the ROC
AUC. For this, we build the PROGENy model by sampling n signatures per
pathway 10 times with replacement and calculate the AUC as described above
(Supplementary Fig. 6b).
Validation of PROGENy scores on public experiments. We previously set aside
10 public perturbation experiments that measure both pathway activation (mainly
western blots) and gene expression upon perturbation, which were not included in
any of the model building. For each of those experiments, we quantified the blot
bands in the original publication (DOI and experimental details in Supplementary
Fig. 7) using ImageJ for the control vs. perturbed condition if no numerical values
were reported. We calculated PROGENy pathway scores for both the control and
perturbed condition, and plotted the spread of the control scores vs. the spread of
the perturbed scores. We set the median of the control to 0, and the total standard
deviation of the control-perturbed pair to 1 for easier presentation (without
changing test statistics). We performed a one-tailed t-test between each control and
perturbed pair and report the p values (Supplementary Fig. 7).
Validation of PROGENy scores on HEK293 perturbations. We confirmed
pathway activation using MEK for the MAPK pathway, Stat3 for JAK-STAT, AKT
for PI3K, Smad2 for TGFbeta, IKb for NFkB by performing a one-sample
one-tailed t-test of the fold change over BSA, including samples from both 0.5 and 1 h
after perturbation. We report all fold changes with a p value < 0.05 and at least 30%
of the maximum antibody readout as significant (Fig. 2c). We then computed the
pathway scores for all conditions, and scaled each pathway score to have a mean of
0 and standard deviation of 1. We then computed the difference between the
control condition (BSA treatment) and each perturbation. For this comparison, we
plot the difference in means (Fig. 2c) and perform a one-tailed t-test. Here, we
reported all pathway changes as significant if they have a p value < 0.05 and an
activation status that is 1.5 standard deviations above or below the control.
Pathway scores using gene sets. We matched our defined set of pathways to the
publicly available pathway database Reactome9 and Gene Ontology (GO)8
categories, as well as Gatza et al. signatures (Supplementary Table 2a, b, f), to obtain a
uniform set across pathway resources that makes them comparable. The SPEED
platform already uses the same pathways, so no mapping was required. We
calculated pathway scores as Gene Set Variation Analysis (GSVA) scores that are able
to assign a score to each individual sample (unlike GSEA that compares groups).
SPIA scores. Signaling Pathway Impact Analysis (SPIA)11 is a method that utilizes
the directionality and signs in a KEGG47 pathway graph to determine if in a given
pathway structure the available species are more or less available to transduce a
signal. As the species considered for a pathway are usually mRNAs of genes, this
method infers signaling activity by the proxy of gene expression. To do
this, SPIA requires a comparison with a normal condition in order to
compute both its scores and their significance.
We used the SPIA Bioconductor package11 in our analyses, focussing on the
subset of pathways used by the other methods (Supplementary Table 2c). We
calculated our scores either for each cell line compared to the rest of a given tissue
where no normals are available (i.e. for the GDSC and drug response data) or
compared to the tissue-matched normals (for the TCGA data used in driver and
survival associations).
Pathifier scores. As Pathifier13 requires the comparison with a baseline condition
in order to compute scores, we computed the GDSC/TCGA scores as with SPIA. As
gene sets, we selected Reactome pathways that corresponded to our set of pathways
(Supplementary Table 2b), where Pathifier calculated the "signal flow" from the
baseline and compared it to each sample.
PARADIGM scores. We used the PARADIGM software from the public software
repository (https://github.com/sbenz/Paradigm) and a model of the cell signaling
network48 from the corresponding TCGA publication
(https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/). We normalized our gene expression data
from both GDSC and TCGA using ranks to assign equally spaced values between 0
and 1 for each sample within a given tissue. We then ran PARADIGM inference
using the same options as in the above publication for each sample separately. We
mapped nodes in the network representing pathway activity to our set of pathways
(Supplementary Table 2d) to obtain pathway scores that are comparable to the
other methods, averaging scores where there were more than one for a given
sample and node.
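The rank-based normalization can be sketched as below. The text does not specify along which axis the ranks are taken or how ties are handled, so this example simply maps one vector of values to equally spaced values between 0 and 1 and breaks ties by input order; both choices, and the function name, are assumptions.

```python
def rank_to_unit_interval(values):
    """Replace each value by its rank, rescaled to equally spaced
    values between 0 (smallest) and 1 (largest)."""
    n = len(values)
    if n == 1:
        return [0.5]  # a single value has no meaningful rank; assumed midpoint
    order = sorted(range(n), key=lambda index: values[index])
    normalized = [0.0] * n
    for rank, index in enumerate(order):
        normalized[index] = rank / (n - 1)
    return normalized
```

For instance, `rank_to_unit_interval([10.0, 30.0, 20.0])` gives `[0.0, 1.0, 0.5]`, regardless of how far apart the original values are.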
Recall of perturbation experiments. We calculate pathway scores for each of our
curated experiments using all pathway methods. For gene set methods (GO,
Reactome, Gatza) we use the difference in GSVA without kernel density estimator
due to low sample numbers. For PROGENy, we exclude the experiment we
quantify from model building (leave-one-out cross-validation).
We calculate linear associations between perturbations and assigned scores and
plot the assigned pathway scores (using whether a pathway was perturbed as 0/1
coefficients and the pathway scores as response variable; pathway inhibitions
encoded as negative pathway activations) and show the relative (column scale
function in pheatmap) activation patterns as heatmaps (Supplementary Fig. 7).
Associations with known driver mutations and CNAs. For comparing the
impact of mutations across different pathway methods, we used TCGA cohorts,
where tissue-matched controls were available, leaving 6549 samples across 13
cancer types. For mutated genes, we considered all genes that had a change of
coding sequence (SNP, small indels in MAF files) as mutated and all others as not
mutated. For copy number alterations (CNAs), we used the thresholded GISTIC49
scores, where we considered homozygous deletions (−2) and strong amplifications
(2) as altered, no change (0) as basal and discarded intermediate values (−1, 1)
from our analysis. We focussed our analysis of the mutations and copy number
alterations on the subset of 464 driver genes that were also used in the GDSC. We
used the sets of mutations and CNAs to compute the linear associations between
samples for all different methods we looked at.
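The handling of thresholded GISTIC scores can be written as a small lookup. The sketch assumes the usual five-level encoding (−2, −1, 0, 1, 2), with −2 for homozygous deletions and 2 for strong amplifications; the function name and the returned labels are hypothetical.

```python
def classify_cna(gistic_score):
    """Map a thresholded GISTIC score to 'altered', 'basal', or None.

    None marks intermediate values (-1, 1) that are discarded.
    """
    if gistic_score in (-2, 2):
        return "altered"
    if gistic_score == 0:
        return "basal"
    if gistic_score in (-1, 1):
        return None
    raise ValueError(f"unexpected thresholded GISTIC score: {gistic_score}")


def usable_samples(scores_by_sample):
    """Drop samples with intermediate copy number for one gene."""
    labels = {sample: classify_cna(score)
              for sample, score in scores_by_sample.items()}
    return {sample: label for sample, label in labels.items()
            if label is not None}
```

Samples with a single-copy gain or loss are removed from the comparison for that gene, so altered and basal groups stay clearly separated.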
Drug associations using GDSC cell lines. We performed drug association using
an ANOVA between 265 drug IC50s and 11 inferred pathway scores conditioned
on MSI status, doing a total of 2915 comparisons for which we correct the p values
using the False Discovery Rate. For pan-cancer associations, we used the cancer
type as a covariate in order to discard the effect that different tissues have on the
observed drug response. While this will also remove genuine differences in pathway
activation between different cancer types, we would not be able to distinguish
between those and other confounders that impact the sensitivity of a certain cell
line from a given tissue to a drug. Our pan-cancer associations thus correspond to
intra-tissue differences in drug response explained by inferred (our method, GO, or
Reactome) pathway scores.
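The multiple-testing step can be illustrated as follows. The text only states that the False Discovery Rate was used; the Benjamini–Hochberg procedure shown here is an assumption, and the function name is hypothetical. The number of comparisons follows from 265 drugs times 11 pathways.

```python
N_DRUGS = 265
N_PATHWAYS = 11
N_COMPARISONS = N_DRUGS * N_PATHWAYS  # 2915, as stated in the text


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda index: p_values[index])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

An association would then be reported when its adjusted value falls below the chosen FDR threshold.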
We selected four of our strongest associations to investigate whether they
provide information beyond what is known from mutation data. For two MEK
inhibitors, we show the difference between wild-type and mutant MAPK pathway
(defined as a mutation in either NRAS, KRAS, or BRAF) with a discretized pathway
score (upper and lower quartile vs. the rest), as well as the combination between the
upper quartile of tissue-specific pathway scores and presence of a MAPK mutation.
For a BRAF inhibitor, we show additional stratification on top of BRAF mutations,
and for Nutlin-3a on top of TP53 mutations.
Survival associations using TCGA data. Starting from the pathway scores
derived with GO/Reactome GSEA, SPIA, Pathifier, PARADIGM, and our method
on the TCGA data as described above, we used a Cox proportional hazards model (R
package survival) to calculate survival associations for pan-cancer and each
tissue-specific cohort. For the pan-cancer cohort, we fitted the model for each pathway
and method separately, regressing out the study of origin and age of the patient.
For the tissue-specific cohorts, we regressed out the age of the patients. We adjusted
the p values using the FDR method for each method and study separately. We
selected a significance threshold of 5 and 10% for the pan-cancer and
cancer-specific associations for which we show a matrix plot and a volcano plot of
associations, respectively.
In order to get distinct classes needed for interpretable Kaplan–Meier survival
curves (Fig. 5c), we split three of our obtained pathway scores in upper, the two
middle, and lower quartile.
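The discretization used for the survival curves can be sketched with the standard library. The quartile boundaries are computed here with the inclusive method of `statistics.quantiles`, and scores exactly on a boundary are assigned to the outer group; both choices are assumptions, as the text does not specify them, and the function name is hypothetical.

```python
from statistics import quantiles


def quartile_groups(scores):
    """Label each pathway score as 'bottom', 'middle', or 'top'.

    'middle' covers the two central quartiles.
    """
    lower, _, upper = quantiles(scores, n=4, method="inclusive")
    labels = []
    for score in scores:
        if score <= lower:
            labels.append("bottom")
        elif score >= upper:
            labels.append("top")
        else:
            labels.append("middle")
    return labels
```

Each of the three groups would then be drawn as a separate curve, with the top and bottom quartiles carrying the comparison of interest.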
Code availability. progeny is available as an R package on Bioconductor. The code
used for the analysis in this paper is available at
https://github.com/saezlab/footprints.
Data availability. RNA-Seq data are accessible from gene expression omnibus
(GEO) under accession number GSE97979. Phosphoprotein measurements are
available as Supplementary Data 2.
Received: 7 September 2016 Accepted: 24 November 2017
Acknowledgements
M.S. is funded by an MRC Case fellowship (1246915) awarded to J.S.-R. and Joanna Betts
(GSK). N.B. acknowledges funding by BMBF (OncoPath). M.J.G. is supported with
funding from the Wellcome Trust (102696), Stand Up To Cancer (SU2C-AACR-DT1213),
The Dutch Cancer Society (H1/2014-6919) and Cancer Research UK (C44943/A22536).
We thank Francesco Iorio, Florian Markowetz, Bence Szalai and Alvis Brazma
for useful discussions. We thank S. Cagnol and P. Lenormand for providing the
HEK293ΔRAF1:ER cell line.
Author contributions
M.S. designed research, performed all analyses, and wrote the manuscript. A.S., F.U., B.K.
and S.S. performed and preprocessed validation experiments, supervised by N.B.
B.K., M.K., N.B. and M.J.G. supported result interpretation and manuscript writing.
J.S.-R. supervised the project and contributed to writing the manuscript.
Additional information
Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467-017-02391-6.
Competing interests: The authors declare no competing financial interests.
Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article's Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2017