Andrei Zinovyev

Institut Curie · Department of Bioinformatics and Systems Biology of Cancer

MS in Theoretical Physics, PhD in Machine Learning, HDR in Biology

About

346
Publications
56,246
Reads
9,858
Citations
Citations since 2017
166 Research Items
6,829 Citations
Introduction
I have been a permanent senior researcher at Institut Curie since 2005. I co-lead and coordinate the scientific projects of the Computational Systems Biology of Cancer team within the Bioinformatics of Cancer department. Since 2019, I have held an interdisciplinary chair at Prairie - PaRis Artificial Intelligence Research InstitutE. My research areas include the following topics: computational systems biology of cancer, the methodology of machine learning and artificial intelligence, and dealing with the complexity of biological systems through mathematical modeling. More information is available on my personal website: http://andreizinovyev.site/
Additional affiliations
October 2009 - present
Ecole Normale Supérieure de Paris
Position
  • Member of a pedagogical team for Systems Biology M2 course
January 2005 - present
Institut Curie
Position
  • Scientific coordinator of Computational Systems Biology of Cancer team
Description
  • http://sysbio.curie.fr
January 2005 - present
French Institute of Health and Medical Research
Position
  • Scientific coordinator


Publications (346)
Article
Full-text available
Important quantities of biological data can today be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data co...
Article
Full-text available
Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently from their dataset of origin. Data integration is a crucial step in most scRNA-seq da...
Article
Background: Molecular understanding of muscle-invasive (MIBC) and non-muscle-invasive (NMIBC) bladder cancer is currently based primarily on transcriptomic and genomic analyses. Objective: To conduct proteogenomic analyses to gain insights into bladder cancer (BC) heterogeneity and identify underlying processes specific to tumor subgroups and th...
Article
Full-text available
Motivation: Mathematical models of biological processes altered in cancer are built using the knowledge of complex networks of signaling pathways, detailing the molecular regulations inside different cell types, such as tumor cells, immune and other stromal cells. If these models mainly focus on intracellular information, they often omit a descrip...
Article
Full-text available
Background: Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression...
Article
Full-text available
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informati...
Preprint
Full-text available
Background: Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression...
Chapter
In recent cancer genomics programs, large-scale profiling of microRNAs has been routinely used in order to better understand the role of microRNAs in gene regulation and disease. To support the analysis of such amount of data, scalability of bioinformatics pipelines is increasingly important to handle larger datasets. Here, we describe a scalable im...
Preprint
Full-text available
Data integration of single-cell data describes the task of embedding datasets obtained from different sources into a common space, so that cells with similar cell type or state end up close from one another in this representation independently from their dataset of origin. Data integration is a crucial early step in most data analysis pipelines inv...
Preprint
Full-text available
The efficiency of analyzing high-throughput data in systems biology has been demonstrated in numerous studies, where molecular data, such as transcriptomics and proteomics, offers great opportunities for understanding the complexity of biological processes. One important aspect of data analysis in systems biology is the shift from a reductionist ap...
Preprint
Full-text available
Mathematical models of biological processes implicated in cancer are built using the knowledge of complex networks of signaling pathways, describing the molecular regulations inside different cell types, such as tumor cells, immune and other stromal cells. If these models mainly focus on intracellular information, they often omit a description of t...
Preprint
Full-text available
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence between training or validation dataset possessing labels for learning and testing a classifier (source domain) and a potentially large unlabeled dataset where the model is exploited (target domain). The task is to find such a common r...
Preprint
The cell cycle is one of the most fundamental biological processes important for understanding normal physiology and various pathologies such as cancer. Single cell RNA sequencing technologies give an opportunity to analyse the cell cycle transcriptome dynamics in an unprecedented range of conditions (cell types and perturbations), with thousands o...
Preprint
Full-text available
The presence of serotonergic system during early pre-neural development is enigmatic and conserved amongst all studied invertebrate and vertebrate animals. We took advantage of zebrafish model system to address what is the role of early serotonin before first neurons form. Unexpectedly, we experimentally revealed the existence of delayed developmen...
Article
Full-text available
We developed BIODICA, an integrated computational environment for application of Independent Component Analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata. The computational core is the novel Python package stabilized-ica which provides interface to...
Preprint
Full-text available
Finding best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks which ma...
Article
Full-text available
Cell cycle is a biological process underlying the existence and propagation of life in time and space. It has been an object for mathematical modeling for long, with several alternative mechanistic modeling principles suggested, describing in more or less details the known molecular mechanisms. Recently, cell cycle has been investigated at single c...
Article
Motivation: Single-cell RNA-seq (scRNAseq) datasets are characterized by large ambient dimensionality, and their analyses can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. existence of data points with surprisingly large incoming connectivity degree in the datapoint nei...
Article
Full-text available
WebMaBoSS is an easy-to-use web interface for conversion, storage, simulation and analysis of Boolean models that allows to get insight from these models without any specific knowledge of modeling or coding. It relies on an existing software, MaBoSS, which simulates Boolean models using a stochastic approach: it applies continuous time Markov proce...
Article
Full-text available
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note intr...
Article
Full-text available
Independent Component Analysis is a matrix factorization method for data dimension reduction. ICA has been widely applied for the analysis of transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze the publicly available esophageal cancer data using the ICA f...
Chapter
Full-text available
We introduce LNetReduce, a tool that simplifies linear dynamic networks. Dynamic networks are represented as digraphs labeled by integer timescale orders. Such models describe deterministic or stochastic monomolecular chemical reaction networks, but also random walks on weighted protein-protein interaction networks, spreading of infectious diseases...
Preprint
Full-text available
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note intr...
Conference Paper
Full-text available
Chronic obstructive pulmonary disease (COPD) presents significant clinical heterogeneity and non-trivial progression trajectories resulting in a wide range of patient outcomes. Clinical trajectory analysis (ClinTrajAn) is a powerful tool based on elastic principal graphs for the discovery and evaluation of trajectories in large cross-sectional clin...
Preprint
Full-text available
Abstract: Cell cycle is the most fundamental biological process underlying the existence and propagation of life in time and space. It has been an object for mathematical modeling for long, with several alternative mechanistic modeling principles suggested, describing in more or less details the known molecular mechanisms. Recently, cell cycle has...
Article
Full-text available
Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on earth. Since the launch of the Suomi National Polar Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS/DNB) onboard, global nighttime images have significantly improved; however, they remained panchro...
Preprint
Full-text available
A formulation of the dataset integration problem describes the task of aligning two or more empirical distributions sampled from sources of the same kind, so that records of similar object end up close to one another. We propose a variant of the optimal transport- and Gromov-Wasserstein-based dataset integration algorithm introduced in SCOT. We for...
Preprint
Full-text available
We introduce LNetReduce, a tool that simplifies linear dynamic networks. Dynamic networks are represented as digraphs labeled by integer timescale orders. Such models describe deterministic or stochastic monomolecular chemical reaction networks, but also random walks on weighted protein-protein interaction networks, spreading of infectious diseases...
Article
Full-text available
The rising interest for precise characterization of the tumour immune contexture has recently brought forward the high potential of RNA sequencing (RNA-seq) in identifying molecular mechanisms engaged in the response to immunotherapy. In this review, we provide an overview of the major principles of single-cell and conventional (bulk) RNA-seq appli...
Preprint
Full-text available
Background. Single-cell RNA-seq datasets are characterized by large ambient dimensionality, and their analyses, such as clustering or cell trajectory inference, can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. existence of data points with surprisingly large incoming...
Article
Full-text available
Multilayer networks allow interpreting the molecular basis of diseases, which is particularly challenging in rare diseases where the number of cases is small in comparison with the size of the associated multi-omics datasets. In this work, we develop a dimensionality reduction methodology to identify the minimal set of genes that characterize disea...
Article
Ewing sarcoma (EwS) is a highly aggressive pediatric bone cancer that is defined by a somatic fusion between the EWSR1 gene and an ETS family member, most frequently the FLI1 gene, leading to expression of a chimeric transcription factor EWSR1−FLI1. Otherwise, EwS is one of the most genetically stable cancers. The situation when the major cancer dr...
Article
Full-text available
After the success of the new generation of immune therapies, immune checkpoint receptors have become one important center of attention of molecular oncologists. The initial success and hopes of anti-programmed cell death protein 1 (anti-PD1) and anti-cytotoxic T-lymphocyte-associated protein 4 (anti-CTLA4) therapies have shown some limitations sinc...
Article
Full-text available
Background: Large observational clinical datasets are becoming increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete disease state develops through stereotypical routes, charac...
Article
Full-text available
Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods hav...
Chapter
The construction of models of biological networks from prior knowledge and experimental data often leads to a multitude of candidate models. Devising a single model from them can require arbitrary choices, which may lead to strong biases in subsequent predictions. We introduce here a methodology for a) synthesizing Boolean model ensembles satisfyin...
Preprint
Full-text available
Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on Earth. Since the launch of the Suomi National Polar Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day/Night Band (VIIRS/DNB) onboard in late 2011, global nighttime satellite images have considerably improved in terms...
Conference Paper
Full-text available
The construction of models of biological networks from prior knowledge and experimental data often leads to a multitude of candidate models. Devising a single model from them can require arbitrary choices, which may lead to strong biases in subsequent predictions. We introduce here a methodology for a) synthesizing Boolean model ensembles satisfyin...
Preprint
Full-text available
Large observational clinical datasets become increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized b...
Article
Full-text available
Background: Solutions to stochastic Boolean models are usually estimated by Monte Carlo simulations, but as the state space of these models can be enormous, there is an inherent uncertainty about the accuracy of Monte Carlo estimates and whether simulations have reached all attractors. Moreover, these models have timescale parameters (transition r...
Article
Full-text available
A subset of cancer-associated fibroblasts (FAP+/CAF-S1) mediates immunosuppression in breast cancers, but its heterogeneity and its impact on immunotherapy response remain unknown. Here, we identify 8 CAF-S1 clusters by analyzing more than 19,000 single CAF-S1 fibroblasts from breast cancer. We validate the five most abundant clusters by flow cytom...
Article
Full-text available
The processes leading to, or avoiding cell death are widely studied, because of their frequent perturbation in various diseases. Cell death occurs in three highly interconnected steps: Initiation, signaling and execution. We used a systems biology approach to gather information about all known modes of regulated cell death (RCD). Based on the exper...
Article
Full-text available
Multidimensional datapoint clouds representing large datasets are frequently characterized by non-trivial low-dimensional geometry and topology which can be recovered by unsupervised machine learning approaches, in particular, by principal graphs. Principal graphs approximate the multivariate data by a graph injected into the data space with some c...
Article
Full-text available
English Wikipedia, containing more than five million articles, has approximately eleven thousand web pages devoted to proteins or genes, most of which were generated by the Gene Wiki project. These pages contain information about interactions between proteins and their functional relationships. At the same time, they are interconnected with other...
Article
Full-text available
EWSR1-FLI1, the chimeric oncogene specific for Ewing sarcoma (EwS), induces a cascade of signaling events leading to cell transformation. However, it remains elusive how genetically homogeneous EwS cells can drive the heterogeneity of transcriptional programs. Here, we combine independent component analysis of single-cell RNA sequencing data from d...
Preprint
Full-text available
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing ID is crucial to choose the appropriate machine learning approach as well as to understand its behavior and validate it. ID can be computed globally for the whole data distribution, or estimated locally in a point. In this pa...
Article
Full-text available
Machine learning deals with datasets characterized by high dimensionality. However, in many cases, the intrinsic dimensionality of the datasets is surprisingly low. For example, the dimensionality of a robot's perception space can be large and multi-modal but its variables can have more or less complex non-linear interdependencies. Thus multidimens...
Article
Full-text available
Motivation: CellDesigner is a well-established biological map editor used in many large-scale scientific efforts (Funahashi et al., 2007). However, the interoperability between the Systems Biology Graphical Notation (SBGN) Markup Language (SBGN-ML) and the CellDesigner's proprietary Systems Biology Markup Language (SBML) extension formats remains...
Article
Full-text available
Cancer driver gene alterations influence cancer development, occurring in oncogenes, tumor suppressors, and dual role genes. Discovering dual role cancer genes is difficult because of their elusive context-dependent behavior. We define oncogenic mediators as genes controlling biological processes. With them, we classify cancer driver genes, unveili...
Chapter
ACSN (https://acsn.curie.fr) is a web-based resource of multi-scale biological maps depicting molecular processes in cancer cell and tumor microenvironment. The core of the Atlas is a set of interconnected cancer-related signaling and metabolic network maps. Molecular mechanisms are depicted on the maps at the level of biochemical interactions, for...
Article
Full-text available
Motivation: Matrix factorization (MF) methods are widely used in order to reduce dimensionality of transcriptomic datasets to the action of few hidden factors (metagenes). MF algorithms have never been compared based on the between-datasets reproducibility of their outputs in similar independent datasets. Lack of this knowledge might have a crucial...
Article
Full-text available
The lack of integrated resources depicting the complexity of the innate immune response in cancer represents a bottleneck for high-throughput data interpretation. To address this challenge, we perform a systematic manual literature mining of molecular mechanisms governing the innate immune response in cancer and represent it as a signalling network...
Preprint
Full-text available
Motivation: Solutions to stochastic Boolean models are usually estimated by Monte Carlo simulations, but as the state space of these models can be enormous, there is an inherent uncertainty about the accuracy of Monte Carlo estimates and whether simulations have reached all asymptotic solutions. Moreover, these models have timescale parameters (tran...
Article
Background: Deep learning (DL) is one of the best approaches to predict nonlinear behaviors from high dimensional data. Nevertheless, predicting the outcome of patients affected by cancers from transcriptomic data has shown limited performance, even with DL (C-index usually <0.65). Transfer learning is a DL two-step method where a model is pre-traine...
Article
Full-text available
The glucocorticoid receptor (GR) acts as a ubiquitous cortisol-dependent transcription factor (TF). To identify co-factors, we used protein-fragment complementation assays and found that GR recognizes FLI1 and additional ETS family proteins, TFs relaying proliferation and/or migration signals. Following steroid-dependent translocation of FLI1 and G...
Article
Full-text available
Background: The amount of publicly available cancer-related “omics” data is constantly growing and can potentially be used to gain insights into the tumour biology of new cancer patients, their diagnosis and suitable treatment options. However, the integration of different datasets is not straightforward and requires specialized approaches to deal w...
Preprint
Full-text available
Boolean networks model finite discrete dynamical systems with complex behaviours. The state of each component is determined by a Boolean function of the state of (a subset of) the components of the network. This paper addresses the synthesis of these Boolean functions from constraints on their domain and emerging dynamical properties of the resulti...
Article
Full-text available
Independent component analysis (ICA) is a matrix factorization approach where the signals captured by each individual matrix factors are optimized to become as mutually independent as possible. Initially suggested for solving source blind separation problems in various fields, ICA was shown to be successful in analyzing functional magnetic resonanc...