Maria Poptsova’s research while affiliated with National Research University Higher School of Economics and other places


Publications (43)


Figure 3. Examples of the GQ forming sequences associated with RNA editing. (A) ALU-associated editing of the RHOA transcript. The red boxes highlight the GQ. (B) The long dsRNA editing substrate formed by the fold back of the two ALU inverted repeats displayed in (C). The red box shows the position of the RHOA exon relative to the ALU elements. (D) The well-characterized K242R edit of NEIL1 (endonuclease VIII-like DNA glycosylase) is associated with a GQ. (E) The NEIL1 editing substrate with the edits indicated by red arrows. (F) The edits and various RNA isoforms of TNFRSF14 pre-mRNA are associated with a GQ motif. (G) Location of the edits in a dsRNA substrate formed by the TNFRSF14 transcript. (H) The FLG (filaggrin) RNA is associated with both a GQ and the antisense CCDST (cervical cancer associated DHX9 suppressive transcript) that could form a dsRNA substrate. (I) The FLG pre-mRNA by itself also folds into a dsRNA substrate.
Zα and Zβ Localize ADAR1 to Flipons That Modulate Innate Immunity, Alternative Splicing, and Nonsynonymous RNA Editing
Article · March 2025 · 22 Reads · Oleksandr Cherednichenko · Terry P. Lybrand · [...] · Maria Poptsova

The double-stranded RNA editing enzyme ADAR1 connects two forms of genetic programming, one based on codons and the other on flipons. ADAR1 recodes codons in pre-mRNA by deaminating adenosine to form inosine, which is translated as guanosine. ADAR1 also plays essential roles in the immune defense against viruses and cancers by recognizing left-handed Z-DNA and Z-RNA (collectively called ZNA). Here, we review various aspects of ADAR1 biology, starting with codons and progressing to flipons. ADAR1 has two major isoforms, with the p110 protein lacking the p150 Zα domain that binds ZNAs with high affinity. The p150 isoform is induced by interferon and targets ALU inverted repeats, a class of endogenous retroelement that promotes their transcription and retrotransposition by incorporating Z-flipons that encode ZNAs and G-flipons that form G-quadruplexes (GQ). Both p150 and p110 include the Zβ domain that is related to Zα but does not bind ZNAs. Here we report strong evidence that Zβ binds the GQ that are formed co-transcriptionally by ALU repeats and within R-loops. By binding GQ, ADAR1 suppresses ALU-mediated alternative splicing, generates most of the reported nonsynonymous edits and promotes R-loop resolution. The recognition of the various alternative nucleic acid conformations by ADAR1 connects genetic programming by flipons with the encoding of information by codons. The findings suggest that incorporating G-flipons into editmers might improve the therapeutic editing efficacy of ADAR1.



Graphs of AUC-ROC values measured for different machine learning methods. Each value at the X axis corresponds to a phenotype with a certain contribution of epistasis. For each AUC-ROC value, the boundaries of the 95% confidence interval are indicated. Each graph corresponds to a different dataset composition and feature-to-instance ratio. In all cases, 3-loci epistasis model with heritability of 0.25 was used.
Distributions of metrics measured for different machine learning models. Three theoretical forms of epistasis and their corresponding datasets were generated using GAMETES (LR, Lasso regression; GB, Gradient Boosting).
Deep learning captures the effect of epistasis in multifactorial diseases

January 2025 · 12 Reads

Background: Polygenic risk score (PRS) prediction is widely used to assess the risk of diagnosis and progression of many diseases. Routinely, the weights of individual SNPs are estimated by a linear regression model that assumes an independent and linear contribution of each SNP to the phenotype. However, for complex multifactorial diseases such as Alzheimer’s disease, diabetes, cardiovascular disease, cancer, and others, the association between individual SNPs and disease could be non-linear due to epistatic interactions. The aim of the presented study is to explore the power of non-linear machine learning algorithms and deep learning models to predict the risk of multifactorial diseases with epistasis.

Methods: Simulated data with 2- and 3-loci interactions were generated using GAMETES, and three models of epistasis were tested: additive, multiplicative, and threshold. Penetrance tables were generated using the PyTOXO package. The machine learning methods were a multilayer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), Lasso regression, random forest, and gradient boosting. Model performance was assessed using accuracy, AUC-ROC, AUC-PR, recall, precision, and F1 score.

Results: First, we tested ensemble tree methods and deep learning neural networks against the LASSO linear regression model on simulated data with different types and strengths of epistasis. The results showed that as the strength of the epistatic effect increased, non-linear models significantly outperformed linear ones. The higher performance of non-linear models over linear ones was then confirmed on real genetic data for multifactorial phenotypes such as obesity, type 1 diabetes, and psoriasis. Among non-linear models, gradient boosting appeared to be the best model in obesity and psoriasis, while deep learning methods significantly outperformed linear approaches in type 1 diabetes.

Conclusion: Overall, our study underscores the efficacy of non-linear models and deep learning approaches in more accurately accounting for the effects of epistasis in simulations with specific configurations and in the context of certain diseases.
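
The point that a per-SNP linear model can miss epistasis can be illustrated with a toy example. The sketch below is not taken from the study: it assumes two binary loci and a purely epistatic XOR-style phenotype (a hypothetical model, not one of the additive, multiplicative, or threshold models named in the abstract), and shows that each locus on its own carries no information about disease status while the pair determines it completely.

```python
from itertools import product

# Toy assumption: two loci, each coded 0 (non-carrier) or 1 (carrier),
# with all four genotype combinations equally frequent.
genotypes = list(product([0, 1], repeat=2))

# Hypothetical purely epistatic phenotype: affected only when exactly
# one of the two loci carries the variant (XOR).
phenotype = {g: g[0] ^ g[1] for g in genotypes}


def marginal_risk(locus, allele):
    """Disease frequency among genotypes with the given allele at one locus."""
    carriers = [g for g in genotypes if g[locus] == allele]
    return sum(phenotype[g] for g in carriers) / len(carriers)


# Risk seen by a model that looks at one SNP at a time.
marginals = {
    (locus, allele): marginal_risk(locus, allele)
    for locus in (0, 1)
    for allele in (0, 1)
}

# Risk seen by a model that can use both loci jointly.
joint = {g: float(phenotype[g]) for g in genotypes}

print("marginal risk per (locus, allele):", marginals)
print("joint risk per genotype pair:", joint)
```

Every marginal risk here equals the overall disease frequency, so any score built from independent per-SNP weights has nothing to work with, whereas a model that can represent the interaction separates cases from controls exactly.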


Kolmogorov-Arnold Networks for Genomic Tasks

December 2024 · 83 Reads

Kolmogorov-Arnold Networks (KANs) emerged as a promising alternative to multilayer perceptrons (MLPs) in dense fully connected networks. Multiple attempts have been made to integrate KANs into various deep learning architectures in the domains of computer vision and natural language processing. Integrating KANs into deep learning models for genomic tasks has not been explored. Here, we tested linear KANs (LKANs) and convolutional KANs (CKANs) as replacements for MLPs in baseline deep learning architectures for classification and generation of genomic sequences. We used three genomic benchmark datasets: Genomic Benchmarks, Genome Understanding Evaluation, and Flipon Benchmark. We demonstrated that LKANs outperformed both the baseline and CKANs on almost all datasets. CKANs can achieve comparable results but struggle to scale to a large number of parameters. Ablation analysis demonstrated that the number of KAN layers correlates with model performance. Overall, linear KANs show promising results in improving the performance of deep learning models with a relatively small number of parameters. Unleashing the potential of KANs in the different state-of-the-art (SOTA) deep learning architectures currently used in genomics requires further research.



ML model performance comparison based on ROC AUC metrics for 1,000 bootstrapped test sets. (A) ML models trained on 39 features. (B) ML models trained on various numbers of features.
Sequential forward feature selection for the model with the best performance on all features (CatBoost).
SHAP feature importance plot for the CatBoost model built on 9 SFS-selected features.
ML model risk assessment for a high-risk patient.
ML model risk assessment for a low-risk patient.
Machine learning models for predicting risks of MACEs for myocardial infarction patients with different VEGFR2 genotypes

September 2024 · 17 Reads · 1 Citation

Background: The development of prognostic models for the identification of high-risk myocardial infarction (MI) patients is a crucial step toward personalized medicine. Genetic factors are known to be associated with an increased risk of cardiovascular diseases; however, little is known about whether they can be used to predict major adverse cardiac events (MACEs) for MI patients. This study aimed to build a machine learning (ML) model to predict MACEs in MI patients based on clinical, imaging, laboratory, and genetic features and to assess the influence of genetics on the prognostic power of the model.

Methods: We analyzed the data from 218 MI patients admitted to the emergency department at the Surgut District Center for Diagnostics and Cardiovascular Surgery, Russia. Upon admission, standard clinical measurements and imaging data were collected for each patient. Additionally, patients were genotyped for VEGFR-2 variation rs2305948 (C/C, C/T, T/T genotypes with T being the minor risk allele). The study included a 9-year follow-up period during which major ischemic events were recorded. We trained and evaluated various ML models, including Gradient Boosting, Random Forest, Logistic Regression, and AutoML. For feature importance analysis, we applied the sequential feature selection (SFS) and SHapley Additive exPlanations (SHAP) methods.

Results: The CatBoost algorithm, with features selected using the SFS method, showed the best performance on the test cohort, achieving a ROC AUC of 0.813. Feature importance analysis identified the dose of statins as the most important factor, with the VEGFR-2 genotype among the top 5. The other important features are coronary artery lesions (coronary artery stenoses ≥70%), left ventricular (LV) parameters such as lateral LV wall and LV mass, diabetes, type of revascularization (CABG or PCI), and age. We also showed that contributions are additive and that high risk can be determined by cumulative negative effects from different prognostic factors.

Conclusion: Our ML-based approach demonstrated that the VEGFR-2 genotype is associated with an increased risk of MACEs in MI patients. However, the risk can be significantly reduced by high-dose statins and positive factors such as the absence of coronary artery lesions, absence of diabetes, and younger age.
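
The figure caption for this entry reports ROC AUC over 1,000 bootstrapped test sets. As a minimal sketch of how such an interval can be obtained, the code below computes ROC AUC from pairwise ranks and a percentile bootstrap interval. It is an illustration under stated assumptions, not the authors' pipeline: the function names `roc_auc` and `bootstrap_auc_ci`, the percentile method, and the toy labels and scores are all hypothetical.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive scores above a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC AUC on a fixed test set."""
    roc_auc(labels, scores)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        # Resample test cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy test set: 1 = event, 0 = no event; scores are predicted risks.
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]
    print("AUC:", roc_auc(y_true, y_score))
    print("95% CI:", bootstrap_auc_ci(y_true, y_score))
```

With a cohort of a few hundred patients, the width of this interval is usually more informative than the point estimate when comparing models.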


Z-DNA formation in promoters conserved between human and mouse are associated with increased transcription reinitiation rates

August 2024 · 20 Reads · 5 Citations

A long-standing question concerns the role of Z-DNA in transcription. Here we use a deep learning approach, DeepZ, that predicts Z-flipons based on DNA sequence, structural properties of nucleotides, and omics data. We examined Z-flipons that are conserved between human and mouse genomes after generating whole-genome Z-flipon maps and then validated them by orthogonal approaches based on high-resolution chemical mapping of Z-DNA and the transformer algorithm Z-DNABERT. For human and mouse, we revealed a similar pattern of transcription factors, chromatin remodelers, and histone marks associated with conserved Z-flipons. We found significant enrichment of Z-flipons in alternative and bidirectional promoters associated with neurogenesis genes. We show that conserved Z-flipons are associated with increased experimentally determined transcription reinitiation rates compared to promoters without Z-flipons, but without affecting elongation or pausing. Our findings support a model where Z-flipons engage Transcription Factor E and impact phenotype by enabling the reset of preinitiation complexes when active, and the suppression of gene expression when engaged by repressive chromatin complexes.


Analysis of live cell data with G-DNABERT supports a role for G-quadruplexes in chromatin looping

June 2024 · 30 Reads

Alternative DNA conformations formed by sequences called flipons potentially alter the readout of genetic information by directing the shape-specific assembly of complexes on DNA. The biological roles of G-quadruplexes formed by motifs rich in guanosine repeats have been investigated experimentally using many different methodologies, including G4-seq, G4 ChIP-seq, permanganate nuclease footprinting (KEx), KAS-seq, and CUT&Tag, with varying degrees of overlap between the results. Here we trained the large language model DNABERT on existing data generated by KEx, a rapid chemical footprinting technique performed on live, intact cells using potassium permanganate. The snapshot of flipon state, when combined with results from other in vitro methods that are performed on permeabilized cells, allows a high-confidence mapping of G-flipons to proximal enhancer and promoter sequences. Using G4-DNABERT predictions with data from ENdb, Zoonomia cCREs, and single-cell G4 CUT&Tag experiments, we found support for a model where G-quadruplexes regulate gene expression through chromatin loop formation.


Fig. 1. Data augmentation pipeline. Synthetic data are created with three types of generative models: diffusion model, WGAN, and VQ-VAE. Concatenated real and artificial data are fed to the detector for testing the power to identify non-B DNA structures.
Repeats with the highest Z-DNABERT attention scores in generated and real data
Generative Models for Prediction of Non-B DNA Structures

March 2024 · 52 Reads · 1 Citation

Motivation: Deep learning methods have been successfully applied to the task of predicting non-B DNA structures; however, model performance depends on the availability of experimental data for training. Experimental technologies for non-B DNA structure detection are limited to the subsets that are active at the time of an experiment and cannot detect the entire functional set of elements. Recently, deep generative models have shown promising results in data augmentation, improving the performance of classifiers trained on a combination of real and generated data. Here we aimed to test the performance of diffusion models in comparison to other generative models and to explore the data augmentation approach for the task of non-B DNA structure prediction.

Results: We tested denoising diffusion probabilistic and implicit models (DDPM and DDIM), a Wasserstein generative adversarial network (WGAN), and a vector quantised variational autoencoder (VQ-VAE) for the task of improving detection of Z-DNA, G-quadruplexes, and H-DNA. We showed that data augmentation increased the quality of classifiers, with diffusion models being the best for Z-DNA and H-DNA while WGAN worked better for G4s. Diffusion models are the best in diversity for all types of non-B DNA structures, while WGAN produced the best novelty for G-quadruplexes and H-DNA. Since diffusion models require substantial resources, we showed that a distillation technique can significantly enhance sampling in training diffusion models. When considering three criteria (quality of generated samples, sampling speed, and diversity), we conclude that a trade-off is possible between a generative diffusion model and other architectures such as WGAN and VQ-VAE.

Availability: The code with conducted experiments is freely available at https://github.com/powidla/nonB-DNA-structures-generation.
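
The abstract ranks generative models by the diversity of their samples but does not say how diversity is measured. One common choice, and the one a citing snippet further down this page mentions (diversity calculated on edit distance within a sample of synthetic sequences), is the mean pairwise edit distance. The sketch below assumes that definition; the function names and the example sequences are hypothetical and are not taken from the linked repository.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(seqs):
    """Assumed diversity score: average edit distance over all pairs in a sample."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up generated sequences, for illustration only.
    sample = ["CGCGCGCG", "CGCGCACG", "GGGAGGGT", "GGGTGGGA"]
    print("diversity:", mean_pairwise_distance(sample))
```

A sample of near-identical sequences scores close to zero, so the score helps flag mode collapse, but it says nothing about whether the sequences are biologically plausible.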


Deep Learning captures the effect of epistasis in multifactorial diseases

March 2024 · 105 Reads

Background: Polygenic risk score (PRS) prediction is widely used to assess the risk of diagnosis and progression of many diseases. Routinely, the weights of individual SNPs are estimated by a linear regression model that assumes an independent and linear contribution of each SNP to the phenotype. However, for complex multifactorial diseases such as Alzheimer's disease, diabetes, cardiovascular disease, cancer, and others, the association between individual SNPs and disease could be non-linear due to epistatic interactions. The aim of the presented study is to explore the power of non-linear machine learning algorithms and deep learning models to predict the risk of multifactorial diseases with epistasis.

Results: First, we tested ensemble tree methods and deep learning neural networks against the LASSO linear regression model on simulated data with different types and strengths of epistasis. The results showed that as the strength of the epistatic effect increased, non-linear models significantly outperformed linear ones. The higher performance of non-linear models over linear ones was then confirmed on real genetic data for multifactorial phenotypes such as obesity, type 1 diabetes, and psoriasis. Among non-linear models, gradient boosting appeared to be the best model in obesity and psoriasis, while deep learning methods significantly outperformed linear approaches in type 1 diabetes.

Conclusions: Overall, our study underscores the efficacy of non-linear models and deep learning approaches in more accurately accounting for the effects of epistasis in simulations with specific configurations and in the context of certain diseases.


Citations (20)


... /2024 We compare the performance of KAN in generative design by evaluating Kullback-Leibler divergence and Wasserstein distance. Since diffusion model require substantial resources for training and inference (see Supplementary Materials for details), we conduct these experiments only for flipon's datasets (Cherednichenko & Poptsova, 2025). We compare the diversity calculated on edit distance within a sample of synthetic sequences for various models (Table 6). ...

Reference:

Kolmogorov-Arnold Networks for Genomic Tasks
Data augmentation with generative models improves detection of Non-B DNA structures
  • Citing Article
  • November 2024

Computers in Biology and Medicine

... Mohamed et al. performed a systematic mutation-analysis based on DNA-sequencing of all coding exons and adjacent splice consensus sequences of NOTCH1 gene, they demonstrated that Notch1 rs3124591 play a role in bicuspid aortic valve 25 . Several studies explored the potential relationships between polymorphisms in VEGFR2 and coronary heart disease or major adverse cardiac events, consistent results were found for rs2305948 polymorphism 26,27 . However, investigations on whether these genetic polymorphisms affect the association between lead exposure and hypertension remains unexplored. ...

Machine learning models for predicting risks of MACEs for myocardial infarction patients with different VEGFR2 genotypes

... Enzymes like topoisomerases can relax the tension and restore the right-handed DNA helical conformation, while helicases can unwind dsRNA with free ends to produce single-stranded RNA (ssRNA) [79,80]. The flipon state can also be biased by editing the epigenetic markings on DNA, RNA, or chromatin, or by proteins and RNAs that dock to one conformation or another via sequence- or structure-specific complexes [81][82][83][84][85][86][87]. Usually, structure-specific proteins can recognize folds formed by RNA, DNA, or DRHs (DNA:RNA hybrids), although the affinity for each of these ligands may vary. ...

Z-DNA formation in promoters conserved between human and mouse are associated with increased transcription reinitiation rates

... Diffusion Models also require significantly more computational resources to train, and this made WGANs a more suitable choice for us given the tradeoff between resources and quality. Moreover, in several scenarios WGANs may outperform Diffusion Models on smaller datasets, making it the more practical choice for real world applications [70,71]. ...

Generative Models for Prediction of Non-B DNA Structures

... The KRAB genes encode different combinations of the zinc-finger sequences that they have captured and embellished. The clusters potentially produce even more permutations by trans-splicing RNAs from different genes (Umerenkov et al., 2023). The many KRAB variants generated counter any attempted escape by ERE based on the recombination of existing sequences. ...

Z-flipon variants reveal the many roles of Z-DNA and Z-RNA in health and disease

Life Science Alliance

... In Table 3, transfer learning is shown to accelerate the generalization of QS systems, allowing synthetic biologists to efficiently design adaptive systems that work across multiple microbial environments. [309][310][311][312][313] 12 Evolutionary Algorithms: Evolutionary algorithms are inspired by the principles of natural selection, where the most efficient solutions evolve over time through iterative optimization. In synthetic biology, these algorithms are used to optimize synthetic gene circuits, allowing QS systems to evolve towards greater efficiency and adaptability. ...

Unsupervised domain adaptation methods for cross-species transfer of regulatory code signals

Frontiers in Big Data

... These methods are low-throughput, which restrict their applications on a genome-wide scale. In stark contrast, only a few high-throughput methodologies have been applied for the global profiling of Z-DNA, including the first global view of Z-DNA using Zaa-based ChIP-seq (Shin et al., 2016) and computer based prediction such as DeepZ in the human genome (Beknazarov et al., 2020;Beknazarov and Poptsova, 2023) and Z-Hunt in the genomes of humans, Arabidopsis and rice (Ho et al., 1986;Schroth et al., 1992;Zhou et al., 2009). Therefore, increased investment is necessary to develop more robust high-throughput methodologies for the global characterization of Z-DNA. ...

DeepZ: A Deep Learning Approach for Z-DNA Prediction
  • Citing Article
  • March 2023

Methods in molecular biology (Clifton, N.J.)

... For example, while d(TG)n favors Z-DNA, r(UG)n crystallizes as a left-handed GQ [67], while triplet repeats can also form Z-DNA [68]. There is also substantial overlap between ZNA- and GQ-forming sequences in promoter regions [69]. The high frequency of flipons in the genome renders them uninformative as B-DNA. ...

Citation: Conserved microRNAs and Flipons Shape Gene Expression during Development by Altering Promoter Conformations

... Specifically, they employed these concepts, which had previously proved effective in the training of deep Convolutional Neural Networks (CNNs), including the use of residual connections, dense connections, and dilated convolutions. Voytetskiy et al. [46] carried out an implementation of a GCN approach for PPI. They experimented with three distinct GNN architectures: GCN, Graph Attention Network (GAT), and an inductive representation learning network. ...

Graph Neural Networks for Z-DNA prediction in Genomes
  • Citing Conference Paper
  • December 2022

... Since the incidence of the coronavirus disease 2019 (COVID-19) pandemic, there is a significant interest in studying the role of ZBP1 in SARS-CoV-2-induced pathology. The SARS-CoV-2 genome encompasses several potential Z-RNA-forming regions, and infected cells generate Z-RNAs [65][66][67]. In SARS-CoV-2-infected primary human airway epithelia (HAE), Z-RNA sensing activates ZBP1 to initiate MLKL phosphorylation, causing necroptosis. ...

Z-RNA and the Flipside of the SARS Nsp13 Helicase: Is There a Role for Flipons in Coronavirus-Induced Pathology?