Thesis

Network Inference from Perturbation Data: Robustness, Identifiability and Experimental Design

Abstract

High-throughput methods quantify a multitude of cellular components but can rarely describe their interactions. For this reason, a wide variety of network reconstruction methods have been developed over the past 20 years. Perturbation data in particular allow conclusions to be drawn about functional mechanisms in gene regulation, signal transduction, intracellular communication and other processes. Nevertheless, network inference remains an unsolved problem, because most methods rely on unsuitable assumptions and do not clarify the identifiability of network edges. To address this, this dissertation describes a new reconstruction method that is based on simple assumptions about perturbation propagation. It is therefore applicable in a wide range of contexts and outperforms other methods in standard benchmarks. For the MAPK and PI3K signalling pathways in an adenocarcinoma cell line, it generates plausible network hypotheses that convincingly explain the different sensitivities of PI3K mutants to various inhibitors. The dissertation further shows that network identifiability can be described by an intuitive max-flow problem. This analytical result makes it possible to determine effective, identifiable networks and to optimize the experimental design of costly perturbation experiments. Extensive tests show that, compared with randomly generated perturbation sequences, the approach reduces the number of perturbations required for full identifiability to less than a third. Finally, the dissertation describes a mathematical advancement of Modular Response Analysis. It is shown that the problem can be approximated as an analytically solvable orthogonal regression. This drastically reduces the numerical effort, so that considerably larger networks can be reconstructed and the latest high-throughput perturbation data can be analysed.
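To make the notion of an analytically solvable orthogonal regression concrete, the following is a minimal sketch of the simplest case: fitting a straight line in two dimensions by minimizing perpendicular (rather than vertical) distances, which has a closed-form solution. It is a generic textbook illustration, not the thesis's reformulation of Modular Response Analysis; the function name `orthogonal_fit` is hypothetical.

```python
import math

def orthogonal_fit(xs, ys):
    """Fit y = a + b*x by minimising perpendicular distances (closed form).

    Generic 2-D orthogonal regression; a sketch only, not the thesis's method.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Centred second moments of the data.
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxy == 0:
        # Simplification: the horizontal/vertical special cases are not handled.
        raise ValueError("slope undefined: x and y are uncorrelated")
    # Closed-form slope; no iterative optimisation is needed.
    b = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    a = my - b * mx  # the fitted line passes through the centroid
    return a, b

intercept, slope = orthogonal_fit([0, 1, 2, 3], [1, 3, 5, 7])
```

Because the slope follows directly from three sums, the cost grows only linearly with the number of data points, which is the kind of saving over iterative fitting that the abstract alludes to.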

Article
Full-text available
Motivation: A common strategy to infer and quantify interactions between components of a biological system is to deduce them from the network's response to targeted perturbations. Such perturbation experiments are often challenging and costly. Therefore, optimizing the experimental design is essential to achieve a meaningful characterization of biological networks. However, it remains difficult to predict which combination of perturbations allows specific interaction strengths to be inferred in a given network topology. Yet, such a description of identifiability is necessary to select perturbations that maximize the number of inferable parameters. Results: We show analytically that the identifiability of network parameters can be determined by an intuitive maximum-flow problem. Furthermore, we used the theory of matroids to describe identifiability relationships between sets of parameters in order to build identifiable effective network models. Collectively, these results allowed us to devise strategies for an optimal design of the perturbation experiments. We benchmarked these strategies on a database of human pathways. Remarkably, full network identifiability was achieved, on average, with less than a third of the perturbations that are needed in a random experimental design. Moreover, we determined perturbation combinations that additionally decreased experimental effort compared to single-target perturbations. In summary, we provide a framework that allows a maximal number of interaction strengths to be inferred with a minimal number of perturbation experiments. Availability and implementation: IdentiFlow is available at github.com/GrossTor/IdentiFlow. Supplementary information: Supplementary data are available at Bioinformatics online.
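The maximum-flow problem mentioned above is a standard graph computation. As a generic illustration only, the sketch below implements the Edmonds-Karp algorithm on a toy directed graph. The abstract does not specify how the flow network is built from the perturbation design, so the graph, its unit capacities, the node names and the function name `max_flow` are all assumptions and do not reproduce IdentiFlow.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity is a dict {u: {v: cap}}."""
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual graph: every edge gets a reverse edge of capacity 0.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical toy network with unit capacities.
toy = {"s": {"a": 1, "b": 1}, "a": {"b": 1, "t": 1}, "b": {"t": 1}}
result = max_flow(toy, "s", "t")
```

With unit capacities the maximum flow equals the number of edge-disjoint paths from source to sink, which is why flow formulations are a natural fit for counting independent routes through a network.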
Article
Full-text available
The transcriptome contains rich information on molecular, cellular and organismal phenotypes. However, experimental and statistical limitations constrain sensitivity and throughput of genetic screening with single-cell transcriptomics readout. To overcome these limitations, we introduce targeted Perturb-seq (TAP-seq), a sensitive, inexpensive and platform-independent method focusing single-cell RNA-seq coverage on genes of interest, thereby increasing the sensitivity and scale of genetic screens by orders of magnitude. TAP-seq permits routine analysis of thousands of CRISPR-mediated perturbations within a single experiment, detects weak effects and lowly expressed genes, and decreases sequencing requirements by up to 50-fold. We apply TAP-seq to generate perturbation-based enhancer–target gene maps for 1,778 enhancers within 2.5% of the human genome. We thereby show that enhancer–target association is jointly determined by three-dimensional contact frequency and epigenetic states, allowing accurate prediction of enhancer targets throughout the genome. In addition, we demonstrate that TAP-seq can identify cell subtypes with only 100 sequencing reads per cell.
Article
Full-text available
SciPy is an open-source scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments. This Perspective describes the development and capabilities of SciPy 1.0, an open source scientific computing library for the Python programming language.
Preprint
Full-text available
Modular response analysis (MRA) is a widely used modeling technique to uncover coupling strengths in molecular networks under a steady-state condition by means of perturbation experiments. We propose an extension of this methodology to search genomic data for new associations with a network modeled by MRA and to improve the predictive accuracy of MRA models. These extensions are illustrated by exploring the cross talk between estrogen and retinoic acid receptors, two nuclear receptors implicated in several hormone-driven cancers such as breast cancer. We also present a novel, rigorous and elegant mathematical derivation of MRA equations, which is the foundation of this work and of an R package that is freely available at https://github.com/bioinfo-ircm/aiMeRA/. This mathematical analysis should facilitate MRA understanding by newcomers. Author summary: Estrogen and retinoic acid receptors play an important role in several hormone-driven cancers and share co-regulators and co-repressors that modulate their transcription factor activity. The literature shows evidence for crosstalk between these two receptors and suggests that spatial competition on the promoters could be a mechanism. We used MRA to explore the possibility that key co-repressors, i.e., NRIP1 (RIP140) and LCoR, could also mediate crosstalk by exploiting new quantitative (qPCR) and RNA sequencing data. The transcription factor role of the receptors and the availability of genome-wide data enabled us to extend the MRA methodology to explore genome-wide data sets a posteriori, searching for genes associated with a molecular network that was sampled by perturbation experiments. Despite nearly two decades of use, we felt that MRA lacked a systematic mathematical derivation. We present here an elegant and rather simple analysis that should greatly facilitate newcomers' understanding of MRA details. Moreover, an easy-to-use R package is released that should make MRA accessible to biology labs without mathematical expertise. Quantitative data are embedded in the R package and RNA sequencing data are available from GEO.
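For orientation, the relation at the core of classical MRA can be stated compactly. This is the standard textbook form, given here under the usual assumptions (steady state, each module perturbed by a perturbation that acts directly on that module only); it is not the derivation presented in the preprint above. The symbols are assumptions of this note: R is the measured global response matrix, r is the local response (coupling) matrix, and dg(.) keeps only the diagonal of a matrix.

```latex
% Perturbation j does not act directly on module i (i \neq j), so the
% local responses must balance the propagated global responses:
\sum_{k} r_{ik} R_{kj} = 0 \quad (i \neq j), \qquad r_{ii} = -1 .
% Hence rR is diagonal, and solving for r gives
r = -\left[\operatorname{dg}\!\left(R^{-1}\right)\right]^{-1} R^{-1} .
```

The normalization r_ii = -1 fixes the otherwise arbitrary scale of each row, so that the off-diagonal entries r_ij can be read as the direct effect of module j on module i.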
Article
Full-text available
While gene expression profiling is commonly used to gain an overview of cellular processes, the identification of upstream processes that drive expression changes remains a challenge. To address this issue, we introduce CARNIVAL, a causal network contextualization tool which derives network architectures from gene expression footprints. CARNIVAL (CAusal Reasoning pipeline for Network identification using Integer VALue programming) integrates different sources of prior knowledge including signed and directed protein–protein interactions, transcription factor targets, and pathway signatures. The use of prior knowledge in CARNIVAL enables capturing a broad set of upstream cellular processes and regulators, leading to a higher accuracy when benchmarked against related tools. Implementation as an integer linear programming (ILP) problem guarantees efficient computation. As a case study, we applied CARNIVAL to contextualize signaling networks from gene expression data in IgA nephropathy (IgAN), a condition that can lead to chronic kidney disease. CARNIVAL identified specific signaling pathways and associated mediators dysregulated in IgAN including Wnt and TGF-β, which we subsequently validated experimentally. These results demonstrated how CARNIVAL generates hypotheses on potential upstream alterations that propagate through signaling networks, providing insights into diseases.
Article
Full-text available
Technological advances enable assaying multiplexed spatially resolved RNA and protein expression profiling of individual cells, thereby capturing molecular variations in physiological contexts. While these methods are increasingly accessible, computational approaches for studying the interplay of the spatial structure of tissues and cell-cell heterogeneity are only beginning to emerge. Here, we present spatial variance component analysis (SVCA), a computational framework for the analysis of spatial molecular data. SVCA enables quantifying different dimensions of spatial variation and in particular quantifies the effect of cell-cell interactions on gene expression. In a breast cancer Imaging Mass Cytometry dataset, our model yields interpretable spatial variance signatures, which reveal cell-cell interactions as a major driver of protein expression heterogeneity. Applied to high-dimensional imaging-derived RNA data, SVCA identifies plausible gene families that are linked to cell-cell interactions. SVCA is available as a free software tool that can be widely applied to spatial data from different technologies.
Article
Full-text available
Motivation: A major challenge in molecular and cellular biology is to map out the regulatory networks of cells. As regulatory interactions can typically not be directly observed experimentally, various computational methods have been proposed to disentangle direct and indirect effects. Most of these rely on assumptions that are rarely met or cannot be adapted to a given context. Results: We present a network inference method that is based on a simple response logic with minimal presumptions. It requires that we can experimentally observe whether or not some of the system's components respond to perturbations of some other components, and then identifies the directed networks that most accurately account for the observed propagation of the signal. To cope with the intractable number of possible networks, we developed a logic programming approach that can infer networks of hundreds of nodes, while being robust to noisy, heterogeneous or missing data. This allows prior network knowledge and additional constraints such as sparsity to be integrated directly. We systematically benchmark our method on KEGG pathways, and show that it outperforms existing approaches in DREAM3 and DREAM4 challenges. Applied to a novel perturbation dataset on PI3K and MAPK pathways in isogenic models of a colon cancer cell line, it generates plausible network hypotheses that explain distinct sensitivities toward various targeted inhibitors due to different PI3K mutants. Availability and implementation: A Python/Answer Set Programming implementation can be accessed at github.com/GrossTor/response-logic. Data and analysis scripts are available at github.com/GrossTor/response-logic-projects. Supplementary information: Supplementary data are available at Bioinformatics online.
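One simple reading of this response logic, assumed here for illustration, is that a node responds to a perturbation exactly when it is reachable from the perturbed node along directed edges. The sketch below scores a candidate network by counting disagreements with an observed response pattern. It is plain Python rather than the authors' Answer Set Programming implementation, and the function names and toy data are hypothetical.

```python
def predicted_responses(edges, nodes):
    """Map each perturbed node to the set of nodes reachable from it."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    reach = {}
    for start in nodes:
        seen = set()
        stack = [start]
        # Depth-first traversal along directed edges.
        while stack:
            u = stack.pop()
            for v in succ[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        seen.discard(start)  # assumption: self-responses are ignored
        reach[start] = seen
    return reach

def mismatches(edges, nodes, observed):
    """Count node/perturbation pairs where prediction and data disagree."""
    predicted = predicted_responses(edges, nodes)
    return sum(len(predicted[p] ^ observed[p]) for p in observed)

nodes = ["A", "B", "C"]
# Hypothetical data: perturbing A affects B and C, perturbing B affects C.
observed = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
chain = [("A", "B"), ("B", "C")]
chain_plus_shortcut = chain + [("A", "C")]
scores = [mismatches(g, nodes, observed) for g in (chain, chain_plus_shortcut)]
```

Both candidate networks reproduce the toy data without a single mismatch, which illustrates why response data alone can leave the network ambiguous and why extra constraints such as sparsity or prior knowledge are useful for choosing among candidates.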
Article
Full-text available
Oncogenic mutations in KRAS or BRAF are frequent in colorectal cancer and activate the ERK kinase. Here, we find graded ERK phosphorylation correlating with cell differentiation in patient-derived colorectal cancer organoids with and without KRAS mutations. Using reporters, single cell transcriptomics and mass cytometry, we observe cell type-specific phosphorylation of ERK in response to transgenic KRASG12V in mouse intestinal organoids, while transgenic BRAFV600E activates ERK in all cells. Quantitative network modelling from perturbation data reveals that activation of ERK is shaped by cell type-specific MEK to ERK feed forward and negative feedback signalling. We identify dual-specificity phosphatases as candidate modulators of ERK in the intestine. Furthermore, we find that oncogenic KRAS, together with β-Catenin, favours expansion of crypt cells with high ERK activity. Our experiments highlight key differences between oncogenic BRAF and KRAS in colorectal cancer and find unexpected heterogeneity in a signalling pathway with fundamental relevance for cancer therapy.
Article
Full-text available
The use of single-cell transcriptomics has become a major approach to delineate cell subpopulations and the transitions between them. While various computational tools using different mathematical methods have been developed to infer clusters, marker genes, and cell lineage, none yet integrate these within a mathematical framework to perform multiple tasks coherently. Such coherence is critical for the inference of cell-cell communication, a major remaining challenge. Here we present similarity matrix-based optimization for single-cell data analysis (SoptSC), in which unsupervised clustering, pseudotemporal ordering, lineage inference, and marker gene identification are inferred via a structured cell-to-cell similarity matrix. SoptSC then predicts cell-cell communication networks, enabling reconstruction of complex cell lineages that include feedback or feedforward interactions. Application of SoptSC to early embryonic development, epidermal regeneration, and hematopoiesis demonstrates robust identification of subpopulations, lineage relationships, and pseudotime, and prediction of pathway-specific cell communication patterns regulating processes of development and differentiation.
Article
Full-text available
The first decade of genome sequencing stimulated an explosion in the characterization of unknown proteins. More recently, the pace of functional discovery has slowed, leaving around 20% of the proteins even in well-studied model organisms without informative descriptions of their biological roles. Remarkably, many uncharacterized proteins are conserved from yeasts to human, suggesting that they contribute to fundamental biological processes (BP). To fully understand biological systems in health and disease, we need to account for every part of the system. Unstudied proteins thus represent a collective blind spot that limits the progress of both basic and applied biosciences. We use a simple yet powerful metric based on Gene Ontology BP terms to define characterized and uncharacterized proteins for human, budding yeast and fission yeast. We then identify a set of conserved but unstudied proteins in S. pombe, and classify them based on a combination of orthogonal attributes determined by large-scale experimental and comparative methods. Finally, we explore possible reasons why these proteins remain neglected, and propose courses of action to raise their profile and thereby reap the benefits of completing the catalogue of proteins’ biological roles.
Article
Full-text available
Protein signaling networks are static views of dynamic processes where proteins go through many biochemical modifications such as ubiquitination and phosphorylation to propagate signals that regulate cells and can act as feedback systems. Understanding the precise mechanisms underlying protein interactions can elucidate how signaling and cell cycle progression occur within cells in different diseases such as cancer. Large-scale protein signaling networks contain a large number of experimentally verified protein relations but lack the capability to predict the outcomes of the system, and therefore to be trained with respect to experimental measurements. Boolean Networks (BNs) are a simple yet powerful framework to study and model the dynamics of protein signaling networks. While many BN approaches exist to model biological systems, they focus mainly on system properties, and few exist to integrate experimental data into them. In this work, we show an application of a method conceived to integrate time series phosphoproteomic data into protein signaling networks. We use a large-scale real case study from the HPN-DREAM Breast Cancer challenge. Our efficient and parameter-free method combines logic programming and model-checking to infer a family of BNs from multiple perturbation time series data of four breast cancer cell lines given a prior protein signaling network. Because each predicted BN family is cell line specific, our method highlights commonalities and discrepancies between the four cell lines. Our models have a Root Mean Square Error (RMSE) of 0.31 with respect to the testing data, while the best-performing method of this HPN-DREAM challenge had an RMSE of 0.47. To further validate our results, BNs are compared with the canonical mTOR pathway, showing a comparable AUROC score (0.77) to the top performing HPN-DREAM teams. In addition, our approach can also be used as a complementary method to identify erroneous experiments. These results demonstrate that our methodology is an efficient dynamic model discovery method for multiple perturbation time course experimental data of large-scale signaling networks. The software and data are publicly available at https://github.com/misbahch6/caspo-ts.
Article
Full-text available
KEGG (Kyoto Encyclopedia of Genes and Genomes; https://www.kegg.jp/ or https://www.genome.jp/kegg/) is a reference knowledge base for biological interpretation of genome sequences and other high-throughput data. It is an integrated database consisting of three generic categories of systems information, genomic information and chemical information, and an additional human-specific category of health information. KEGG pathway maps, BRITE hierarchies and KEGG modules have been developed as generic molecular networks with KEGG Orthology nodes of functional orthologs so that KEGG pathway mapping and other procedures can be applied to any cellular organism. Unfortunately, however, this generic approach was inadequate for knowledge representation in the health information category, where variations of human genomes, especially disease-related variations, had to be considered. Thus, we have introduced a new approach where human gene variants are explicitly incorporated into what we call 'network variants' in the recently released KEGG NETWORK database. This allows accumulation of knowledge about disease-related perturbed molecular networks caused not only by gene variants, but also by viruses and other pathogens, environmental factors and drugs. We expect that KEGG NETWORK will become another reference knowledge base for the basic understanding of disease mechanisms and practical use in clinical sequencing and drug development.
Article
Full-text available
Motivation: Signal-transduction networks are often aberrated in cancer cells, and new anti-cancer drugs that specifically target oncogenes involved in signaling show great clinical promise. However, the effectiveness of such targeted treatments is often hampered by innate or acquired resistance due to feedbacks, crosstalks or network adaptations in response to drug treatment. A quantitative understanding of these signaling networks and how they differ between cells with different oncogenic mutations or between sensitive and resistant cells can help in addressing this problem. Results: Here, we present Comparative Network Reconstruction (CNR), a computational method to reconstruct signaling networks based on possibly incomplete perturbation data, and to identify which edges differ quantitatively between two or more signaling networks. Prior knowledge about network topology is not required but can straightforwardly be incorporated. We extensively tested our approach using simulated data and applied it to perturbation data from a BRAF mutant, PTPN11 KO cell line that developed resistance to BRAF inhibition. Comparing the reconstructed networks of sensitive and resistant cells suggests that the resistance mechanism involves re-establishing wild-type MAPK signaling, possibly through an alternative RAF-isoform. Availability and implementation: CNR is available as a python module at https://github.com/NKI-CCB/cnr. Additionally, code to reproduce all figures is available at https://github.com/NKI-CCB/CNR-analyses. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Cascades of phosphorylation between protein kinases comprise a core mechanism in the integration and propagation of intracellular signals. Although we have accumulated a wealth of knowledge around some such pathways, this is subject to study biases and much remains to be uncovered. Phosphoproteomics, the identification and quantification of phosphorylated proteins on a proteomic scale, provides a high-throughput means of interrogating the state of intracellular phosphorylation, both at the pathway level and at the whole-cell level. In this review, we discuss methods for using human quantitative phosphoproteomic data to reconstruct the underlying signalling networks that generated it. We address several challenges imposed by the data on such analyses and we consider promising advances towards reconstructing unbiased, kinome-scale signalling networks.
Article
Full-text available
The analysis of protein interaction networks is one of the key challenges in the study of biology. It connects genotypes to phenotypes, and disruption often leads to diseases. Hence, many technologies have been developed to study protein‐protein interactions (PPIs) in a cellular context. The expansion of the PPI technology toolbox however complicates the selection of optimal approaches for diverse biological questions. This review gives an overview of the binary and co‐complex technologies, with the former evaluating the interaction of two co‐expressed genetically tagged proteins, and the latter only needing the expression of a single tagged protein or no tagged proteins at all. Mass spectrometry is crucial for some binary and all co‐complex technologies. After the detailed description of the different technologies, the review compares their unique specifications, advantages, disadvantages, and applicability, while highlighting opportunities for further advancements.
Article
Full-text available
Motivation: Intracellular signalling is realised by complex signalling networks which are almost impossible to understand without network models, especially if feedbacks are involved. Modular Response Analysis (MRA) is a convenient modelling method to study signalling networks in various contexts. Results: We developed the software package STASNet that provides an augmented and extended version of MRA suited to model signalling networks from incomplete perturbation schemes and multi-perturbation data. Using data from the DREAM challenge, we show that predictions from STASNet models are among the top-performing methods. We applied the method to study the effect of SHP2, a protein that has been implicated in resistance to targeted therapy in colon cancer, using a novel data set from the colon cancer cell line Widr and a SHP2-depleted derivative. We find that SHP2 is required for MAPK signalling, whereas AKT signalling only partially depends on SHP2. Availability: An R-package is available at https://github.com/molsysbio/STASNet. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
A cornerstone of statistical inference, the maximum entropy framework is being increasingly applied to construct descriptive and predictive models of biological systems, especially complex biological networks, from large experimental data sets. Both its broad applicability and the success it obtained in different contexts hinge upon its conceptual simplicity and mathematical soundness. Here we try to concisely review the basic elements of the maximum entropy principle, starting from the notion of ‘entropy’, and describe its usefulness for the analysis of biological systems. As examples, we focus specifically on the problem of reconstructing gene interaction networks from expression data and on recent work attempting to expand our system-level understanding of bacterial metabolism. Finally, we highlight some extensions and potential limitations of the maximum entropy approach, and point to more recent developments that are likely to play a key role in the upcoming challenges of extracting structures and information from increasingly rich, high-throughput biological data.
Article
Full-text available
We conducted comprehensive integrative molecular analyses of the complete set of tumors in The Cancer Genome Atlas (TCGA), consisting of approximately 10,000 specimens and representing 33 types of cancer. We performed molecular clustering using data on chromosome-arm-level aneuploidy, DNA hypermethylation, mRNA, and miRNA expression levels and reverse-phase protein arrays, of which all, except for aneuploidy, revealed clustering primarily organized by histology, tissue type, or anatomic origin. The influence of cell type was evident in DNA-methylation-based clustering, even after excluding sites with known preexisting tissue-type-specific methylation. Integrative clustering further emphasized the dominant role of cell-of-origin patterns. Molecular similarities among histologically or anatomically related cancer types provide a basis for focused pan-cancer analyses, such as pan-gastrointestinal, pan-gynecological, pan-kidney, and pan-squamous cancers, and those related by stemness features, which in turn may inform strategies for future therapeutic development.
Article
Full-text available
The Cancer Genome Atlas (TCGA) has catalyzed systematic characterization of diverse genomic alterations underlying human cancers. At this historic junction marking the completion of genomic characterization of over 11,000 tumors from 33 cancer types, we present our current understanding of the molecular processes governing oncogenesis. We illustrate our insights into cancer through synthesis of the findings of the TCGA PanCancer Atlas project on three facets of oncogenesis: (1) somatic driver mutations, germline pathogenic variants, and their interactions in the tumor; (2) the influence of the tumor genome and epigenome on transcriptome and proteome; and (3) the relationship between tumor and the microenvironment, including implications for drugs targeting driver events and immunotherapies. These results will anchor future characterization of rare and common tumor types, primary and relapsed tumors, and cancers across ancestry groups and will guide the deployment of clinical genomic sequencing.
Article
Full-text available
New technologies to generate, store and retrieve medical and research data are inducing a rapid change in clinical and translational research and health care. Systems medicine is the interdisciplinary approach wherein physicians and clinical investigators team up with experts from biology, biostatistics, informatics, mathematics and computational modeling to develop methods to use new and stored data to the benefit of the patient. We here provide a critical assessment of the opportunities and challenges arising out of systems approaches in medicine and from this provide a definition of what systems medicine entails. Based on our analysis of current developments in medicine and healthcare and associated research needs, we emphasize the role of systems medicine as a multilevel and multidisciplinary methodological framework for informed data acquisition and interdisciplinary data analysis to extract previously inaccessible knowledge for the benefit of patients.
Article
Full-text available
The development of new methodologies has driven the expansion of systems biology over the past decades. Technological breakthroughs in sequencing, in quantitative proteomics, in single‐cell measurements, to name only a few, have each opened up whole new fields of research. To highlight the importance of new experimental and computational methodologies in enabling novel biological discoveries, we are pleased to announce the introduction of a new Methods section in Molecular Systems Biology (http://msb.embopress.org/authorguide#methodsguide).
Article
Full-text available
The recent advent of methods for high-throughput single-cell molecular profiling has catalyzed a growing sense in the scientific community that the time is ripe to complete the 150-year-old effort to identify all cell types in the human body. The Human Cell Atlas Project is an international collaborative effort that aims to define all human cell types in terms of distinctive molecular profiles (such as gene expression profiles) and to connect this information with classical cellular descriptions (such as location and morphology). An open comprehensive reference map of the molecular state of cells in healthy human tissues would propel the systematic study of physiological states, developmental trajectories, regulatory circuitry and interactions of cells, and also provide a framework for understanding cellular dysregulation in human disease. Here we describe the idea, its potential utility, early proofs-of-concept, and some design considerations for the Human Cell Atlas, including a commitment to open data, code, and community.
Article
Full-text available
Hundreds of different species colonize multicellular organisms, making them "metaorganisms". A growing body of data supports the role of microbiota in health and in disease. Grasping the principles of host-microbiota interactions (HMIs) at the molecular level is important since it may provide insights into the mechanisms of infections. The crosstalk between the host and the microbiota may help resolve puzzling questions such as how a microorganism can contribute to both health and disease. Integrated superorganism networks that consider host and microbiota as a whole may uncover their code, clarifying perhaps the most fundamental question: how they modulate immune surveillance. Within this framework, structural HMI networks can uniquely identify potential microbial effectors that target distinct host nodes or interfere with endogenous host interactions, as well as how mutations on either host or microbial proteins affect the interaction. Furthermore, structural HMIs can help identify master host cell regulator nodes and modules whose tweaking by the microbes promotes aberrant activity. Collectively, these data can delineate pathogenic mechanisms and thereby help maximize beneficial therapeutics. To date, challenges in experimental techniques limit large-scale characterization of HMIs. Here we highlight an area in its infancy which we believe will increasingly engage the computational community: predicting interactions across kingdoms, and mapping these on the host cellular networks to figure out how commensal and pathogenic microbiota modulate the host signaling and, more broadly, the cross-species consequences.
Article
Although single-cell RNA-seq is revolutionizing biology, data interpretation remains a challenge. We present SCENIC for the simultaneous reconstruction of gene regulatory networks and identification of cell states. We apply SCENIC to a compendium of single-cell data from tumors and brain, and demonstrate that the genomic regulatory code can be exploited to guide the identification of transcription factors and cell states. SCENIC provides critical biological insights into the mechanisms driving cellular heterogeneity.
Article
The increasing volume of ecologically and biologically relevant data has revealed a wide collection of emergent patterns in living systems. Analysing different data sets, ranging from metabolic and gene-regulatory to species interaction networks, we find that these networks are sparse, i.e. the percentage of the active interactions scales inversely with the system size. To explain the origin of this puzzling common characteristic, we introduce the new concept of explorability: a measure of the ability of an interacting system to adapt to newly intervening changes. We show that sparsity is an emergent property resulting from optimising both explorability and dynamical robustness, i.e. the capacity of the system to remain stable after perturbations of the underlying dynamics. In networks with higher connectivities it becomes increasingly difficult to find better values for both the explorability and dynamical robustness, because the newly added interactions require fine-tuning. A relevant characteristic of our solution is its scale invariance, i.e., it remains optimal when several communities are assembled together. Connectivity is also a key ingredient in determining ecosystem stability and our proposed solution contributes to solving May's celebrated complexity-stability paradox.
Article
Author summary: Microbiomes are important for better health, sustainable agriculture, and climate management. Since experimental studies of natural microbial communities are often challenging, microbiome wide association studies (MWAS) have been the primary method to reveal the connection between specific microbes and host phenotype. MWAS have established that many diseases are associated with a complex dysbiosis driven by a large number of microbes. We show that many of these associations may not reflect a mechanistic link with the disease, but arise instead due to interspecific interactions such as cross-feeding and competition for resources. We also propose a method grounded in the maximum entropy models of statistical physics to separate direct from indirect associations. Using both synthetic and real microbiome data, we show that this method detects only direct associations, identifies most predictive features of microbiomes, and avoids p-value inflation. We demonstrate the power of this method on one of the largest microbiome data sets on inflammatory bowel disease.
Article
Systems-biology approaches in immunology take various forms, but here we review strategies for measuring a broad swath of immunological functions as a means of discovering previously unknown relationships and phenomena and as a powerful way of understanding the immune system as a whole. This approach has rejuvenated the field of vaccine development and has fostered hope that new ways will be found to combat infectious diseases that have proven refractory to classical approaches. Systems immunology also presents an important new strategy for understanding human immunity directly, taking advantage of the many ways the immune system of humans can be manipulated.
Article
Systems biologists often distance themselves from reductionist approaches and formulate their aim as understanding living systems “as a whole.” Yet, it is often unclear what kind of reductionism they have in mind, and in what sense their methodologies would offer a superior approach. To address these questions, we distinguish between two types of reductionism which we call “modular reductionism” and “bottom-up reductionism.” Much knowledge in molecular biology has been gained by decomposing living systems into functional modules or through detailed studies of molecular processes. We ask whether systems biology provides novel ways to recompose these findings in the context of the system as a whole via computational simulations. As an example of computational integration of modules, we analyze the first whole-cell model of the bacterium M. genitalium. Secondly, we examine the attempt to recompose processes across different spatial scales via multi-scale cardiac models. Although these models rely on a number of idealizations and simplifying assumptions as well, we argue that they provide insight into the limitations of reductionist approaches. Whole-cell models can be used to discover properties arising at the interfaces of dynamically coupled processes within a biological system, thereby making more apparent what is lost through decomposition. Similarly, multi-scale modeling highlights the relevance of macroscale parameters and models and challenges the view that living systems can be understood “bottom-up.” Specifically, we point out that system-level properties constrain lower-scale processes. Thus, large-scale modeling reveals how living systems at the same time are more and less than the sum of the parts. Part of a special issue, Ontologies of Living Beings, guest-edited by A. M. Ferner and Thomas Pradeu
Article
High-throughput technologies have revolutionized medical research. The advent of genotyping arrays enabled large-scale genome-wide association studies and methods for examining global transcript levels, which gave rise to the field of “integrative genetics”. Other omics technologies, such as proteomics and metabolomics, are now often incorporated into the everyday methodology of biological researchers. In this review, we provide an overview of such omics technologies and focus on methods for their integration across multiple omics layers. As compared to studies of a single omics type, multi-omics offers the opportunity to understand the flow of information that underlies disease.
Article
Motivation: The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. Results: In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with a DNaseI-footprint-based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. Availability: The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE. Contact: hirotaka.matsumoto@riken.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
Book
This volume explores recent techniques for the computational inference of gene regulatory networks (GRNs). The chapters in this book cover topics such as methods to infer GRNs from time-varying data; the extraction of causal information from biological data; GRN inference from multiple heterogeneous data sets; non-parametric and hybrid statistical methods; the joint inference of differential networks; and mechanistic models of gene regulation dynamics. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, descriptions of recently developed methods for GRN inference, applications of these methods on real and/or simulated biological data, and step-by-step tutorials on the usage of associated software tools. Cutting-edge and thorough, Gene Regulatory Networks: Methods and Protocols is an essential tool for evaluating the current research needed to further address the common challenges faced by specialists in this field.
Article
One of the most interesting, difficult, and potentially useful topics in computational biology is the inference of gene regulatory networks (GRNs) from expression data. Although researchers have been working on this topic for more than a decade and much progress has been made, it remains an unsolved problem and even the most sophisticated inference algorithms are far from perfect. In this paper, we review the latest developments in network inference, including state-of-the-art algorithms like PIDC, Phixer, and more. We also discuss unsolved computational challenges, including the optimal combination of algorithms, integration of multiple data sources, and pseudo-temporal ordering of static expression data. Lastly, we discuss some exciting applications of network inference in cancer research, and provide a list of useful software tools for researchers hoping to conduct their own network inference analyses.
Article
What does it mean to say that event X caused outcome Y in biology? Explaining the causal structure underlying the dynamic function of living systems is a central goal of biology. Transformative advances in regenerative medicine and synthetic bioengineering will require efficient strategies to cause desired system-level outcomes. We present a perspective on the need to move beyond the classical ‘necessary and sufficient’ approach to biological causality. Michael Levin and colleagues discuss the need to rethink how the causes of cell biological processes are being identified.
Article
The functional interpretation of single-cell RNA sequencing (scRNA-seq) data can be enhanced by integrating additional data types beyond RNA-based gene expression. In this Review, Stuart and Satija discuss diverse approaches for integrative single-cell analysis, including experimental methods for profiling multiple omics types from the same cells, analytical approaches for extracting additional layers of information directly from scRNA-seq data and computational integration of omics data collected across different cell samples.
Chapter
Transcriptional regulatory networks specify the regulatory proteins of target genes that control the context-specific expression levels of genes. With our ability to profile the different types of molecular components of cells under different conditions, we are now uniquely positioned to infer regulatory networks in diverse biological contexts such as different cell types, tissues, and time points. In this chapter, we cover two main classes of computational methods to integrate different types of information to infer genome-scale transcriptional regulatory networks. The first class of methods focuses on integrative methods for specifically inferring connections between transcription factors and target genes by combining gene expression data with regulatory edge-specific knowledge. The second class of methods integrates upstream signaling networks with transcriptional regulatory networks by combining gene expression data with protein–protein interaction networks and proteomic datasets. We conclude with a section on practical applications of a network inference algorithm to infer a genome-scale regulatory network.
Chapter
Recent technological breakthroughs in single-cell RNA sequencing are revolutionizing modern experimental design in biology. The increasing size of the single-cell expression data from which networks can be inferred allows identifying more complex, non-linear dependencies between genes. Moreover, the inter-cellular variability that is observed in single-cell expression data can be used to infer not only one global network representing all the cells, but also numerous regulatory networks that are more specific to certain conditions. By experimentally perturbing certain genes, the deconvolution of the true contribution of these genes can also be greatly facilitated. In this chapter, we will therefore tackle the advantages of single-cell transcriptomic data and show how new methods exploit this novel data type to enhance the inference of gene regulatory networks.
Chapter
This chapter addresses the problem of reconstructing regulatory networks in molecular biology by integrating multiple sources of data. We consider data sets measured from diverse technologies all related to the same set of variables and individuals. This situation is becoming more and more common in molecular biology, for instance, when both proteomic and transcriptomic data related to the same set of “genes” are available on a given cohort of patients. To infer a consensus network that integrates both proteomic and transcriptomic data, we introduce a multivariate extension of Gaussian graphical models (GGM), which we refer to as multiattribute GGM. Indeed, the GGM framework offers a good proxy for modeling direct links between biological entities. We perform the inference of our multivariate GGM with a neighborhood selection procedure that operates at a multiscale level. This procedure employs a group-Lasso penalty in order to select interactions which operate both at the proteomic and at the transcriptomic level between two genes. We end up with a consensus network embedding information shared at multiple scales of the cell. We illustrate this method on two breast cancer data sets. An R-package is publicly available on github at https://github.com/jchiquet/multivarNetwork to promote reproducibility. © 2019, Springer Science+Business Media, LLC, part of Springer Nature.
Chapter
Reconstructing a gene regulatory network from one or more sets of omics measurements has been a major task of computational biology in the last 20 years. Despite an overwhelming number of algorithms proposed to solve the network inference problem either in the general scenario or in an ad-hoc tailored situation, assessing the stability of reconstruction is still an uncharted territory and exploratory studies mainly tackled theoretical aspects. We introduce here empirical stability, which is induced by variability of reconstruction as a function of data subsampling. By evaluating differences between networks that are inferred using different subsets of the same data we obtain quantitative indicators of the robustness of the algorithm, of the noise level affecting the data, and, overall, of the reliability of the reconstructed graph. We show that empirical stability can be used whenever no ground truth is available to compute a direct measure of the similarity between the inferred structure and the true network. The main ingredient here is a suite of indicators, called NetSI, providing statistics of distances between graphs generated by a given algorithm fed with different data subsets, where the chosen metric is the Hamming–Ipsen–Mikhailov (HIM) distance evaluating dissimilarity of graph topologies with shared nodes. Operatively, the NetSI family is demonstrated here on synthetic and high-throughput datasets, inferring graphs at different resolution levels (topology, direction, weight), showing how the stability indicators can be effectively used for the quantitative comparison of the stability of different reconstruction algorithms.
Article
Gene regulatory networks control the cellular phenotype by changing the RNA and protein composition. Despite its importance, the gene regulatory network in higher organisms is only partly mapped out. Here, we investigate the potential of reverse engineering methods to unravel the structure of these networks. Particularly, we focus on modular response analysis (MRA), a method that can disentangle networks from perturbation data. We benchmark a version of MRA that was previously successfully applied to reconstruct a signalling-driven genetic network, termed MLMSMRA, to test cases mimicking various aspects of gene regulatory networks. We then investigate the performance in comparison with other MRA realisations and related methods. The benchmark shows that MRA has the potential to predict functional interactions, but also shows that successful application of MRA is restricted to small sparse networks and to data with a high signal-to-noise ratio, i.e. a low noise level.
Article
A large body of data have accumulated that characterize the gene regulatory network of stem cells. Yet, a comprehensive and integrative understanding of this complex network is lacking. Network reverse engineering methods that use transcriptome data to derive these networks may help to uncover the topology in an unbiased way. Many methods exist that use co-expression to reconstruct networks. However, it remains unclear how these methods perform in the context of stem cell differentiation, as most systematic assessments have been made for regulatory networks of unicellular organisms. Here, we report a systematic benchmark of different reverse engineering methods against functional data. We show that network pruning is critical for reconstruction performance. We also find that performance is similar for algorithms that use different co-expression measures, i.e. mutual information or correlation. In addition, different methods yield very different network topologies, highlighting the challenge of interpreting these resulting networks as a whole. This article is part of the theme issue ‘Designer human tissue: coming to a lab near you’.
Article
High-throughput DNA sequencing methodology (next generation sequencing; NGS) has rapidly evolved over the past 15 years and new methods are continually being commercialized. As the technology develops, so does the number of corresponding applications for basic and applied science. The purpose of this review is to provide a compendium of NGS methodologies and associated applications. Each brief discussion is followed by web links to the manufacturer and/or web-based visualizations. Keyword searches, such as with Google, may also provide helpful internet links and information.
Article
Genetic alterations in signaling pathways that control cell-cycle progression, apoptosis, and cell growth are common hallmarks of cancer, but the extent, mechanisms, and co-occurrence of alterations in these pathways differ between individual tumors and tumor types. Using mutations, copy-number changes, mRNA expression, gene fusions and DNA methylation in 9,125 tumors profiled by The Cancer Genome Atlas (TCGA), we analyzed the mechanisms and patterns of somatic alterations in ten canonical pathways: cell cycle, Hippo, Myc, Notch, Nrf2, PI-3-Kinase/Akt, RTK-RAS, TGFβ signaling, p53 and β-catenin/Wnt. We charted the detailed landscape of pathway alterations in 33 cancer types, stratified into 64 subtypes, and identified patterns of co-occurrence and mutual exclusivity. Eighty-nine percent of tumors had at least one driver alteration in these pathways, and 57% of tumors had at least one alteration potentially targetable by currently available drugs. Thirty percent of tumors had multiple targetable alterations, indicating opportunities for combination therapy.
Article
In this review we discuss the origination and evolution of Modular Response Analysis (MRA), which is a physics-based method for reconstructing quantitative topological models of biochemical pathways. We first focus on the core theory of MRA, demonstrating how both the direction and the strength of local, causal connections between network modules can be precisely inferred from the global responses of the entire network to a sufficient number of perturbations, under certain conditions. Subsequently, we analyze statistical reformulations of MRA and show how MRA is used to build and calibrate mechanistic models of biological networks. We further discuss what sets MRA apart from other network reconstruction methods and outline future directions for MRA-based methods of network reconstruction.
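As an illustration of the inference step this abstract describes, the following is a minimal sketch in pure Python. It assumes the classical MRA setting, which the abstract itself does not spell out: each perturbation acts directly on exactly one module, the global response matrix R is square and invertible, and local responses are normalised so that r_ii = -1. Under these assumptions the local response matrix follows from r = -[diag(R^-1)]^-1 R^-1. The three-node network, the perturbation strengths and all function names are hypothetical and chosen only for the example.

```python
# Sketch of the classical MRA inversion r = -diag(R^-1)^-1 R^-1 (pure Python).

def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment m with the identity matrix.
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for k in range(n):
            if k != col:
                factor = aug[k][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def local_from_global(R):
    """Recover local response coefficients from the global response matrix R."""
    R_inv = invert(R)
    n = len(R)
    # Row i of R^-1 is rescaled so that the diagonal of the result equals -1.
    return [[-R_inv[i][j] / R_inv[i][i] for j in range(n)] for i in range(n)]


# Hypothetical network: 1 -> 2 -> 3 with negative feedback from 3 to 1.
r_true = [[-1.0, 0.0, -0.5],
          [0.8, -1.0, 0.0],
          [0.0, 1.2, -1.0]]

# Hypothetical perturbation strengths; perturbation j acts on node j only.
strengths = [0.3, -0.2, 0.5]
P = [[strengths[i] if i == j else 0.0 for j in range(3)] for i in range(3)]

# Forward model (steady state, linearised): r R = -P, hence R = -r^-1 P.
R = [[-v for v in row] for row in matmul(invert(r_true), P)]

# Inverse problem: the perturbation strengths are not needed for recovery.
r_est = local_from_global(R)
for row in r_est:
    print([round(v, 6) for v in row])
```

In this noise-free setting the recovered matrix matches `r_true` up to floating-point error. With noisy data or fewer perturbations than modules the direct inversion no longer applies, which is where the statistical reformulations mentioned in the abstract come in.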
Chapter
The word omics refers to a field of study in biological sciences that ends with -omics, such as genomics, transcriptomics, proteomics, or metabolomics. The ending -ome is used to address the objects of study of such fields, such as the genome, proteome, transcriptome, or metabolome, respectively. More specifically, genomics is the science that studies the structure, function, evolution, and mapping of genomes and aims at characterization and quantification of genes, which direct the production of proteins with the assistance of enzymes and messenger molecules. The transcriptome is the set of all messenger RNA molecules in one cell, tissue, or organism. It includes the amount or concentration of each RNA molecule in addition to the molecular identities. The term proteome refers to the sum of all the proteins in a cell, tissue, or organism. Proteomics is the science that studies those proteins as related to their biochemical properties and functional roles, and how their quantities, modifications, and structures change during growth and in response to internal and external stimuli. The metabolome represents the collection of all metabolites in a biological cell, tissue, organ, or organism, which are the end products of cellular processes. Metabolomics is the science that studies all chemical processes involving metabolites. More specifically, metabolomics is the study of chemical fingerprints that specific cellular processes establish during their activity; it is the study of all small-molecule metabolite profiles. Overall, the objective of omics sciences is to identify, characterize, and quantify all biological molecules that are involved in the structure, function, and dynamics of a cell, tissue, or organism.
Book
The emergence of systems biology raises many fascinating questions: What does it mean to take a systems approach to problems in biology? To what extent is the use of mathematical and computational modelling changing the life sciences? How does the availability of big data influence research practices? What are the major challenges for biomedical research in the years to come? This book addresses such questions of relevance not only to philosophers and biologists but also to readers interested in the broader implications of systems biology for science and society. The book features reflections and original work by experts from across the disciplines including systems biologists, philosophers, and interdisciplinary scholars investigating the social and educational aspects of systems biology. In response to the same set of questions, the experts develop and defend their personal perspectives on the distinctive character of systems biology and the challenges that lie ahead. Readers are invited to engage with different views on the questions addressed, and may explore numerous themes relating to the philosophy of systems biology. This edited work will appeal to scholars at all levels, from undergraduates to researchers, and to those interested in a variety of scholarly approaches such as systems biology, mathematical and computational modelling, cell and molecular biology, genomics, systems theory, and of course, philosophy of biology.
Book
Stuart Kauffman here presents a brilliant new paradigm for evolutionary biology, one that extends the basic concepts of Darwinian evolution to accommodate recent findings and perspectives from the fields of biology, physics, chemistry and mathematics. The book drives to the heart of the exciting debate on the origins of life and maintenance of order in complex biological systems. It focuses on the concept of self-organization: the spontaneous emergence of order widely observed throughout nature. Kauffman here argues that self-organization plays an important role in the emergence of life itself and may play as fundamental a role in shaping life's subsequent evolution as does the Darwinian process of natural selection. Yet until now no systematic effort has been made to incorporate the concept of self-organization into evolutionary theory. The construction requirements which permit complex systems to adapt remain poorly understood, as is the extent to which selection itself can yield systems able to adapt more successfully. This book explores these themes. It shows how complex systems, contrary to expectations, can spontaneously exhibit stunning degrees of order, and how this order, in turn, is essential for understanding the emergence and development of life on Earth. Topics include the new biotechnology of applied molecular evolution, with its important implications for developing new drugs and vaccines; the balance between order and chaos observed in many naturally occurring systems; new insights concerning the predictive power of statistical mechanics in biology; and other major issues. Indeed, the approaches investigated here may prove to be the new center around which biological science itself will evolve. The work is written for all those interested in the cutting edge of research in the life sciences.
Article
The mammalian brain contains diverse neuronal types, yet we lack single-cell epigenomic assays that are able to identify and characterize them. DNA methylation is a stable epigenetic mark that distinguishes cell types and marks regulatory elements. We generated >6000 methylomes from single neuronal nuclei and used them to identify 16 mouse and 21 human neuronal subpopulations in the frontal cortex. CG and non-CG methylation exhibited cell type–specific distributions, and we identified regulatory elements with differential methylation across neuron types. Methylation signatures identified a layer 6 excitatory neuron subtype and a unique human parvalbumin-expressing inhibitory neuron subtype. We observed stronger cross-species conservation of regulatory elements in inhibitory neurons than in excitatory neurons. Single-nucleus methylomes expand the atlas of brain cell types and identify regulatory elements that drive conserved brain cell diversity. © 2017, American Association for the Advancement of Science. All rights reserved.