
Andrei Zinovyev
Evotec · In silico R&D
MS in Theoretical Physics, PhD in Machine Learning, HDR in Biology
About
Publications: 367
Reads: 63,553
Citations: 11,671
Introduction
I am a Principal Scientist in the in silico R&D department of Evotec, where my mission is to develop and apply cutting-edge AI/ML methods in early drug discovery projects. I am interested in the methodology of computational disease modeling, systems biology, high-dimensional data analysis, and causal AI. More information is available on my personal website: http://andreizinovyev.site/
Additional affiliations
January 2005 - present
October 2009 - present
November 2001 - December 2004
Publications (367)
Immunotherapy is improving the survival of patients with metastatic non-small cell lung cancer (NSCLC), yet reliable biomarkers are needed to identify responders prospectively and optimize patient care. In this study, we explore the benefits of multimodal approaches to predict immunotherapy outcome using multiple machine learning algorithms and int...
Cell Painting images offer valuable insights into the state of a cell and enable many biological applications, but publicly available arrayed datasets only include hundreds of genes perturbed. The JUMP (Joint Undertaking in Morphological Profiling) Cell Painting Consortium perturbed roughly 75% of the protein-coding genome in human U-2 OS cells, ge...
Digital twins represent a key technology for precision health. Medical digital twins consist of computational models that represent the health state of individual patients over time, enabling optimal therapeutics and forecasting patient prognosis. Many health conditions involve the immune system, so it is crucial to include its key features when de...
Living cells presumably employ optimized information transfer methods, enabling efficient communication even in noisy environments. As expected, the efficiency of chemical communications between cells depends on the properties of the molecular messenger. Evidence suggests that proteins from narrow ranges of molecular masses have been naturally sele...
Boolean networks provide robust explainable and predictive models of cellular dynamics, especially for cellular differentiation and fate decision processes. Yet, the construction of such models is extremely challenging, as it requires integrating prior knowledge with experimental observation of transcriptome, potentially relating thousands of genes...
This review examines current and potential applications of DTs in healthcare, focusing on the integration of multi-omics data using multilayer network approaches in cancer research. We discuss methodologies, tools and platforms commonly used for this integration, while highlighting case studies, challenges, and research gaps. Finally, we advocate t...
Background:
Chronic obstructive pulmonary disease (COPD) exhibits considerable progression heterogeneity. We hypothesized that elastic principal graph analysis (EPGA) would identify distinct clinical phenotypes and their longitudinal relationships.
Methods:
Cross-sectional data from 8,972 tobacco-exposed COPDGene participants, with and without C...
Background: In neuroblastoma (NB), intratumor heterogeneity (ITH) is frequently observed, but the role of cell-to-cell allele-specific copy number alterations in phenotypic variation, clonal evolution and treatment response remains to be determined. Here we investigate ITH, timing of specific genomic aberrations, single-cell replication timing and...
Motivation
Deciphering molecular signals from omics data helps understanding cellular processes and disease progression. Effective algorithms for extracting these signals are essential, with a strong emphasis on robustness and reproducibility.
Results
R/Bioconductor package consICA implements consensus independent component analysis (ICA)—a data-d...
Boolean networks are largely employed to model the qualitative dynamics of cell fate processes by describing the change of binary activation states of genes and transcription factors with time. Being able to bridge such qualitative states with quantitative measurements of gene expression in cells, as scRNA-seq, is a cornerstone for data-driven mode...
Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal cancer in Central Asia, often diagnosed at advanced stages. Understanding population-specific patterns of ESCC is crucial for tailored treatments. This study aimed to unravel ESCC’s genetic basis in Kazakhstani patients and identify potential biomarkers for early dia...
In the context of natural disasters, human responses inevitably intertwine with natural factors. The COVID-19 pandemic, as a significant stress factor, has brought to light profound variations among different countries in terms of their adaptive dynamics in addressing the spread of infection outbreaks across different regions. This emphasizes the c...
The efficiency of analyzing high-throughput data in systems biology has been demonstrated in numerous studies, where molecular data, such as transcriptomics and proteomics, offers great opportunities for understanding the complexity of biological processes. One important aspect of data analysis in systems biology is the shift from a reductionist ap...
Boolean networks are largely employed to model the qualitative dynamics of cell fate processes by describing the change of binary activation states of genes and transcription factors with time. Being able to bridge such qualitative states with quantitative measurements of gene expressions in cells, as scRNA-Seq, is a cornerstone for data-driven mod...
Important quantities of biological data can today be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data co...
Data integration of single-cell RNA-seq (scRNA-seq) data describes the task of embedding datasets gathered from different sources or experiments into a common representation so that cells with similar types or states are embedded close to one another independently from their dataset of origin. Data integration is a crucial step in most scRNA-seq da...
Background:
Molecular understanding of muscle-invasive (MIBC) and non-muscle-invasive (NMIBC) bladder cancer is currently based primarily on transcriptomic and genomic analyses.
Objective:
To conduct proteogenomic analyses to gain insights into bladder cancer (BC) heterogeneity and identify underlying processes specific to tumor subgroups and th...
Motivation:
Mathematical models of biological processes altered in cancer are built using the knowledge of complex networks of signaling pathways, detailing the molecular regulations inside different cell types, such as tumor cells, immune and other stromal cells. If these models mainly focus on intracellular information, they often omit a descrip...
Background
Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression...
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informati...
Background: Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression...
In recent cancer genomics programs, large-scale profiling of microRNAs has been routinely used in order to better understand the role of microRNAs in gene regulation and disease. To support the analysis of such amount of data, scalability of bioinformatics pipelines is increasingly important to handle larger datasets. Here, we describe a scalable im...
Data integration of single-cell data describes the task of embedding datasets obtained from different sources into a common space, so that cells with similar cell type or state end up close from one another in this representation independently from their dataset of origin. Data integration is a crucial early step in most data analysis pipelines inv...
The efficiency of analyzing high-throughput data in systems biology has been demonstrated in numerous studies, where molecular data, such as transcriptomics and proteomics, offers great opportunities for understanding the complexity of biological processes.
One important aspect of data analysis in systems biology is the shift from a reductionist ap...
Mathematical models of biological processes implicated in cancer are built using the knowledge of complex networks of signaling pathways, describing the molecular regulations inside different cell types, such as tumor cells, immune and other stromal cells. If these models mainly focus on intracellular information, they often omit a description of t...
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence between training or validation dataset possessing labels for learning and testing a classifier (source domain) and a potentially large unlabeled dataset where the model is exploited (target domain). The task is to find such a common r...
The cell cycle is one of the most fundamental biological processes important for understanding normal physiology and various pathologies such as cancer. Single cell RNA sequencing technologies give an opportunity to analyse the cell cycle transcriptome dynamics in an unprecedented range of conditions (cell types and perturbations), with thousands o...
The presence of serotonergic system during early pre-neural development is enigmatic and conserved amongst all studied invertebrate and vertebrate animals. We took advantage of zebrafish model system to address what is the role of early serotonin before first neurons form. Unexpectedly, we experimentally revealed the existence of delayed developmen...
We developed BIODICA, an integrated computational environment for application of Independent Component Analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata. The computational core is the novel Python package stabilized-ica which provides interface to...
Finding best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks which ma...
Cell cycle is a biological process underlying the existence and propagation of life in time and space. It has been an object for mathematical modeling for long, with several alternative mechanistic modeling principles suggested, describing in more or less details the known molecular mechanisms. Recently, cell cycle has been investigated at single c...
Motivation
Single-cell RNA-seq (scRNAseq) datasets are characterized by large ambient dimensionality, and their analyses can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. existence of data points with surprisingly large incoming connectivity degree in the datapoint nei...
WebMaBoSS is an easy-to-use web interface for conversion, storage, simulation and analysis of Boolean models that allows to get insight from these models without any specific knowledge of modeling or coding. It relies on an existing software, MaBoSS, which simulates Boolean models using a stochastic approach: it applies continuous time Markov proce...
Independent Component Analysis is a matrix factorization method for data dimension reduction. ICA has been widely applied for the analysis of transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze the publicly available esophageal cancer data using the ICA f...
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note intr...
We introduce LNetReduce, a tool that simplifies linear dynamic networks. Dynamic networks are represented as digraphs labeled by integer timescale orders. Such models describe deterministic or stochastic monomolecular chemical reaction networks, but also random walks on weighted protein-protein interaction networks, spreading of infectious diseases...
Chronic obstructive pulmonary disease (COPD) presents significant clinical heterogeneity and non-trivial progression trajectories resulting in a wide range of patient outcomes. Clinical trajectory analysis (ClinTrajAn) is a powerful tool based on elastic principal graphs for the discovery and evaluation of trajectories in large cross-sectional clin...
Abstract
Cell cycle is the most fundamental biological process underlying the existence and propagation of life in time and space. It has been an object for mathematical modeling for long, with several alternative mechanistic modeling principles suggested, describing in more or less details the known molecular mechanisms. Recently, cell cycle has...
Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on earth. Since the launch of the Suomi National Polar Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS/DNB) onboard, global nighttime images have significantly improved; however, they remained panchro...
A formulation of the dataset integration problem describes the task of aligning two or more empirical distributions sampled from sources of the same kind, so that records of similar object end up close to one another. We propose a variant of the optimal transport- and Gromov-Wasserstein-based dataset integration algorithm introduced in SCOT. We for...
The rising interest for precise characterization of the tumour immune contexture has recently brought forward the high potential of RNA sequencing (RNA-seq) in identifying molecular mechanisms engaged in the response to immunotherapy. In this review, we provide an overview of the major principles of single-cell and conventional (bulk) RNA-seq appli...
Background. Single-cell RNA-seq datasets are characterized by large ambient dimensionality, and their analyses, such as clustering or cell trajectory inference, can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. existence of data points with surprisingly large incoming...
Multilayer networks allow interpreting the molecular basis of diseases, which is particularly challenging in rare diseases where the number of cases is small in comparison with the size of the associated multi-omics datasets. In this work, we develop a dimensionality reduction methodology to identify the minimal set of genes that characterize disea...
Ewing sarcoma (EwS) is a highly aggressive pediatric bone cancer that is defined by a somatic fusion between the EWSR1 gene and an ETS family member, most frequently the FLI1 gene, leading to expression of a chimeric transcription factor EWSR1−FLI1. Otherwise, EwS is one of the most genetically stable cancers. The situation when the major cancer dr...
After the success of the new generation of immune therapies, immune checkpoint receptors have become one important center of attention of molecular oncologists. The initial success and hopes of anti-programmed cell death protein 1 (anti-PD1) and anti-cytotoxic T-lymphocyte-associated protein 4 (anti-CTLA4) therapies have shown some limitations sinc...
Background
Large observational clinical datasets are becoming increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete disease state develops through stereotypical routes, charac...
Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods hav...
The construction of models of biological networks from prior knowledge and experimental data often leads to a multitude of candidate models. Devising a single model from them can require arbitrary choices, which may lead to strong biases in subsequent predictions. We introduce here a methodology for a) synthesizing Boolean model ensembles satisfyin...
Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on Earth. Since the launch of the Suomi National Polar Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day/Night Band (VIIRS/DNB) onboard in late 2011, global nighttime satellite images have considerably improved in terms...
Large observational clinical datasets become increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized b...
Background:
Solutions to stochastic Boolean models are usually estimated by Monte Carlo simulations, but as the state space of these models can be enormous, there is an inherent uncertainty about the accuracy of Monte Carlo estimates and whether simulations have reached all attractors. Moreover, these models have timescale parameters (transition r...
A subset of cancer-associated fibroblasts (FAP+/CAF-S1) mediates immunosuppression in breast cancers, but its heterogeneity and its impact on immunotherapy response remain unknown. Here, we identify 8 CAF-S1 clusters by analyzing more than 19,000 single CAF-S1 fibroblasts from breast cancer. We validate the five most abundant clusters by flow cytom...
The processes leading to, or avoiding cell death are widely studied, because of their frequent perturbation in various diseases. Cell death occurs in three highly interconnected steps: Initiation, signaling and execution. We used a systems biology approach to gather information about all known modes of regulated cell death (RCD). Based on the exper...
Multidimensional datapoint clouds representing large datasets are frequently characterized by non-trivial low-dimensional geometry and topology which can be recovered by unsupervised machine learning approaches, in particular, by principal graphs. Principal graphs approximate the multivariate data by a graph injected into the data space with some c...
English Wikipedia, containing more than five million articles, has approximately eleven thousand web pages devoted to proteins or genes, most of which were generated by the Gene Wiki project. These pages contain information about interactions between proteins and their functional relationships. At the same time, they are interconnected with other...
EWSR1-FLI1, the chimeric oncogene specific for Ewing sarcoma (EwS), induces a cascade of signaling events leading to cell transformation. However, it remains elusive how genetically homogeneous EwS cells can drive the heterogeneity of transcriptional programs. Here, we combine independent component analysis of single-cell RNA sequencing data from d...
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing ID is crucial to choose the appropriate machine learning approach as well as to understand its behavior and validate it. ID can be computed globally for the whole data distribution, or estimated locally in a point. In this pa...
Machine learning deals with datasets characterized by high dimensionality. However, in many cases, the intrinsic dimensionality of the datasets is surprisingly low. For example, the dimensionality of a robot's perception space can be large and multi-modal but its variables can have more or less complex non-linear interdependencies. Thus multidimens...