Kuan Pang’s research while affiliated with University of Toronto and other places

Publications (5)


Schematic illustration of the 3UTRBERT framework. 3UTRBERT is pre-trained on the 3'UTR sequences of human genes using self-supervised learning. Unlabeled mRNA sequences are segmented into sequential 3-mer tokens and fed into a Transformer architecture to learn general-purpose representations with a masked language model, i.e., reconstructing the masked tokens from the remaining tokens and thereby learning the syntax and grammar hidden in 3'UTRs. The pre-trained model is then fine-tuned on several downstream prediction tasks, including predicting RBP–RNA interactions, m6A modification sites, and mRNA subcellular localization.
a) Performance comparisons between 3UTRBERT and other methods on predicting RBP–RNA interactions (22 RBPs under the eCLIP protocol); prediction results are color coded and * represents a value greater than 0.75. b) Boxplot evaluating the performance of different methods under the RBP-generic strategy, in which a single model is trained to be applicable to any RBP. c) Violin plots of the overall AUROC and AUPRC scores of 3UTRBERT versus other language models on CDSs and 5'UTRs. d) Landscape of attention scores surrounding RBP–RNA binding sites and negative control sites; RBM15 in the K562 cell line and SND1 in the HepG2 cell line are shown as examples. e) Radial tree plot showing the congruent clustering of RBP-binding motifs and RNA-binding domains.
a) Performance comparison between 3UTRBERT semantic embeddings and PWM-based representations on different CLIP datasets, with AUROCs in green and AUPRCs in red (top half); the corresponding AUROC and AUPRC scores are shown in the barplot below. b) Distribution of prediction scores for positive and negative samples on PUM2_K562, LIN28B_K562, and YBX3_K562 (left half); the right half shows the density of false-positive hits in negative samples. c) The top five 7-mer counts in the positive and negative samples, along with the significant binding motif extracted by 3UTRBERT. d) 3UTRBERT can assess the impact of functional variants with a mutation map; the attentional contribution at RBP targets decreases when the mRNA sequence is mutated. e) Predicted complex structure between SND1 and a target RNA sequence (UUCCAGG). Multiple hydrogen bonds are observed: Glu744–A58, Asp765–G60, and Tyr766–G59.
a) Bubble plots comparing prediction performance between 3UTRBERT and other methods. Each panel corresponds to a statistical metric: accuracy, AUROC, AUPRC, F1 measure, and Matthews correlation coefficient. b) Two examples of predicted m6A sites with high attention scores consistent with the well-defined RRACH m6A motif. c) A bird's-eye view of global attention scores throughout the entire architecture, showing that 3UTRBERT correctly self-focused on two converged regions (red boxes) corresponding to known m6A modification sites. d) Performance comparison between 3UTRBERT and other baselines using ninefold cross-validation under the model-generic strategy. e) 3UTRBERT has sufficient power to perform dynamic modification prediction under cross-cell-line conditions, where the A549, CD8T, and HCT116 samples show significant differences in motif signals. f) Impact of gene-specific regulatory relationships on generalization performance (measured by AUROCs). g) (top) Distribution of attention scores of experimentally determined m6A sites (positive) and negative control sites, at single-nucleotide level and within a 40-nucleotide window. (bottom) Distribution plots showing that the predicted m6A sites are enriched near stop codons (see text).
a) Comparison of prediction performance between 3UTRBERT and other methods on mRNA subcellular localization, using localization data from DM3Loc. b) Prediction performance on a newer independent set of localization data from RNALocate 2.0. c,d) Bar diagrams separately plotting the AUROC and AUPRC performance of different feature schemes on an identical model architecture for multi-label mRNA subcellular localization prediction. e) Visualization of the top three localization-specific sequence motifs; the sequence motifs assembled by 3UTRBERT are compared against known targeting signals from the CISBP-RNA database. f) 3UTRBERT is naturally suited to locating functional regulatory regions on the zipcode sequence, which controls mRNA transport by interacting with the ZBP1 and HuD proteins. g,h) Gene Ontology networks capturing the relationships between multiple enriched terms for the endoplasmic reticulum (left) and for a set consisting of nucleus, exosome, cytosol, and membrane (right), respectively.

Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning
  • Article
  • Full-text available

August 2024 · 107 Reads · 14 Citations

Gen Li · Kuan Pang · [...]

The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites and predicting RNA subcellular localization. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods on each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationships between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool for various sequence-labeling tasks in the 3'UTR field, thus enhancing the decipherability of post-transcriptional regulatory mechanisms.
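To make the pre-training recipe in the abstract concrete, here is a minimal sketch of 3-mer tokenization plus a BERT-style masked-token objective. The overlapping 3-mers, the 15% mask rate, and the [MASK] token are standard masked-language-modeling conventions assumed for illustration, not necessarily 3UTRBERT's exact settings.

```python
# Toy sketch of the 3-mer tokenization and masked-token objective
# described in the abstract; details are assumptions, not the
# authors' exact implementation.
import random

def tokenize_3mers(seq: str) -> list[str]:
    """Segment an RNA sequence into overlapping 3-mer tokens."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def mask_tokens(tokens: list[str], mask_rate: float = 0.15):
    """Randomly replace tokens with [MASK]; the model must reconstruct
    the originals from the surrounding context (BERT-style MLM)."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)   # token the Transformer must recover
        else:
            masked.append(tok)
            targets.append(None)  # not scored in the MLM loss
    return masked, targets

utr = "GCUAAUGCAUUAGCC"           # toy 3'UTR fragment
tokens = tokenize_3mers(utr)      # ['GCU', 'CUA', 'UAA', ...]
masked, targets = mask_tokens(tokens)
```

Training a Transformer to recover each masked 3-mer from its unmasked neighbors is what lets the model pick up the "grammar" of 3'UTRs without any labels; the fine-tuning tasks then reuse the learned representations.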

scGPT: toward building a foundation model for single-cell multi-omics using generative AI

February 2024 · 589 Reads · 313 Citations · Nature Methods

Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
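As a toy illustration of the words/genes analogy in the abstract, the snippet below turns a single cell's expression vector into a "sentence" of (gene, expression-level) tokens suitable for a transformer. The gene names, quantile binning scheme, and length cap are illustrative assumptions, not scGPT's exact input pipeline.

```python
# Hedged sketch: a cell becomes a sequence of gene tokens paired with
# discretized expression levels, mirroring words in a sentence.
import numpy as np

def cell_to_tokens(expr: np.ndarray, gene_names: list[str],
                   n_bins: int = 5, max_len: int = 1200):
    """Keep expressed genes, bin their counts into discrete levels,
    and emit (gene, level) token pairs."""
    expressed = np.nonzero(expr)[0]
    # discretize nonzero expression into n_bins quantile levels
    edges = np.quantile(expr[expressed],
                        np.linspace(0, 1, n_bins + 1)[1:-1])
    levels = np.digitize(expr[expressed], edges)
    pairs = [(gene_names[i], int(l)) for i, l in zip(expressed, levels)]
    return pairs[:max_len]

genes = ["CD3D", "MS4A1", "LYZ", "NKG7"]
counts = np.array([5.0, 0.0, 12.0, 1.0])
print(cell_to_tokens(counts, genes))  # e.g. [('CD3D', 2), ('LYZ', 4), ('NKG7', 0)]
```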


Fig. 2: (a) Performance comparisons between 3UTRBERT and other methods on predicting RBP-RNA interactions (22 RBPs); prediction accuracy is color coded. (b) Performance comparisons between 3UTRBERT and other methods on 19 additional RBP CLIP datasets; prediction accuracy is coded by size and color of the bubbles. (c) Landscape of attention scores surrounding RBP-RNA binding sites and the negative control sites; RBM15 in K562 cell line and SND1 in HepG2 cell line are shown as examples. (d) Radial tree plot showing the congruent clustering of RBP-binding motifs and RNA-binding domains.
Fig. 4: (a) Bubble plots comparing prediction performance between 3UTRBERT and other methods. Each panel corresponds to a statistical metric: AUROC, F1 measure, accuracy, and Matthews correlation coefficient. (b) Two examples of predicted m6A sites with high attention scores consistent with the well-defined RRACH m6A motif. (c) A bird's-eye view of global attention scores throughout the entire architecture (top), showing that 3UTRBERT correctly self-focused on two converged regions (red boxes) corresponding to known m6A modification sites. The details of the contextual relationships at the k-mer level are shown in the attention-head plot (bottom). (d) (top) Distribution of attention scores of experimentally determined m6A sites (positive) and negative control sites, at single-nucleotide level and within a 40-nucleotide window. (bottom) Distribution plots showing that the predicted m6A sites are enriched near stop codons (see text).
Fig. 5: (a) Comparison of prediction performance between 3UTRBERT and other methods on mRNA subcellular localizations. Localization data from DM3Loc was used. (b) Prediction performance on a newer independent set of localization data from RNALocate 2.0.
Fig. 6: (a) Scatter plots of AUROC scores after pre-training (x-axis) and after fine-tuning (y-axis) for 22 RBP eCLIP datasets; different k-mer sizes are shown separately. Fine-tuning generally improves prediction performance. (b) UMAP projection of the embedding vectors generated by 3UTRBERT for RBP binding sites (positive) and negative controls; different k-mer sizes are shown. (c) Clustering of positive RNA sequences bound by RBM15 as encoded by four different coding schemes: 3UTRBERT, one-hot encoding, secondary structure, and a graph-based embedding. (d) Attention maps illustrating that 3UTRBERT can extract both local and global contextual semantics. (e) Illustration of how information is extracted at the from-scratch, pre-training, and fine-tuning stages.
Fig. 7: (a) The attention-head overview provides the contribution score of each attention head for the input sequences; the dynamic representations of 3UTRBERT capture different contextual information for homologous sequences. (b) Text plot visualizing the importance of individual tokens in 3UTRBERT. A token with a strong positive SHAP value is highlighted in red, indicating a positive contribution to the model's prediction, while a token with a negative SHAP value is marked in blue, indicating a decrease in the model's prediction.
Deciphering 3'UTR mediated gene regulation using interpretable deep representation learning

September 2023 · 242 Reads · 1 Citation

The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. We hypothesize that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language models such as Transformers, which have been very effective in modeling protein sequences and structures. Here we describe 3UTRBERT, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT was pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model was then fine-tuned for specific downstream tasks such as predicting RBP binding sites, m6A RNA modification sites, and RNA subcellular localization. Benchmark results showed that 3UTRBERT generally outperformed other contemporary methods on each of these tasks. We also showed that the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationships between sequence elements.


Figure 2: Predicted expression profiles compared with reference expression profiles, normalized by gene count means (upper) or gene count variances (lower).
Figure 5: Leiden clusterings of the original and predicted expression profiles overlaid on the H&E image. Image-only expression predictions are unaffected by regions of low experimental quality (red).
Predicted gene expression for the five genes most highly correlated with the original profile, for each method, from one representative replicate.
Spatially Resolved Gene Expression Prediction from H&E Histology Images via Bi-modal Contrastive Learning

June 2023 · 168 Reads · 2 Citations

Histology imaging is an important tool in medical diagnosis and research, enabling the examination of tissue structure and composition at the microscopic level. Understanding the underlying molecular mechanisms of tissue architecture is critical for uncovering disease mechanisms and developing effective treatments. Gene expression profiling provides insight into the molecular processes underlying tissue architecture, but the process can be time-consuming and expensive. In this study, we present BLEEP (Bi-modaL Embedding for Expression Prediction), a bi-modal embedding framework capable of generating spatially resolved gene expression profiles from whole-slide hematoxylin and eosin (H&E)-stained histology images. BLEEP uses a contrastive learning framework to construct a low-dimensional joint embedding space from a reference dataset using paired image and expression profiles at micrometer resolution. With this framework, the gene expression of any query image patch can be imputed using the expression profiles from the reference dataset. We demonstrate BLEEP's effectiveness in gene expression prediction by benchmarking its performance on a human liver tissue dataset captured via the 10x Visium platform, where it achieves significant improvements over existing methods. Our results demonstrate the potential of BLEEP to provide insights into the molecular mechanisms underlying tissue architecture, with important implications for the diagnosis and study of various diseases. The proposed framework can significantly reduce the time and cost associated with gene expression profiling, opening up new avenues for high-throughput analysis of histology images for both research and clinical applications.
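The abstract describes two mechanisms: a contrastive objective that pulls paired image/expression embeddings together, and imputation of a query patch's expression from its nearest reference neighbors in the joint space. Below is a hedged sketch of both; the CLIP-style symmetric InfoNCE loss, the temperature, and k are illustrative assumptions rather than BLEEP's exact hyperparameters.

```python
# Sketch of (1) a CLIP-style contrastive loss over paired image and
# expression embeddings and (2) k-nearest-neighbor expression
# imputation from a reference set. Encoders are assumed given.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, expr_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings;
    matching pairs sit on the diagonal of the similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    expr_emb = F.normalize(expr_emb, dim=-1)
    logits = img_emb @ expr_emb.t() / temperature
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def impute_expression(query_img_emb, ref_img_emb, ref_expr, k=50):
    """Average the expression profiles of the k reference patches
    whose image embeddings are closest to each query patch."""
    sims = (F.normalize(query_img_emb, dim=-1) @
            F.normalize(ref_img_emb, dim=-1).t())
    topk = sims.topk(k, dim=-1).indices   # (n_query, k)
    return ref_expr[topk].mean(dim=1)     # (n_query, n_genes)
```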


Citations (5)


... Recently, there has been a plethora of RNA foundation models pre-trained on vast RNA sequence data to learn evolutionary information for comprehensive downstream tasks [10][11][12][13][14][15][16]. Most of them adopt encoder-only Transformer frameworks and leverage masked language modeling (MLM) as the pretraining strategy. ...

Reference:

RNAGenesis: Foundation Model for Enhanced RNA Sequence Generation and Structural Insights
Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning

... Eqs. (6) and (7) demonstrate that our definition of sample-level representation results in a hierarchical two-step aggregation mechanism, first over the cell representations to obtain the cell type representations, and then over the cell type representations to obtain the sample-level representations. Note that our hierarchical aggregation results in normalized attention weights since ... (A toy sketch of this two-step pooling follows the reference below.)

scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Nature Methods
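As promised above, here is a hedged sketch of the hierarchical two-step aggregation described in that citation context: cells are attention-pooled into cell-type representations, which are then attention-pooled into a sample representation. The linear scorer and dimensions are placeholders; the cited work's Eqs. (6) and (7) define the actual weights, which softmax keeps normalized at both steps.

```python
# Toy two-step attention pooling: cells -> cell types -> sample.
import torch
import torch.nn.functional as F

def attention_pool(reps, scorer):
    """Weighted mean with softmax-normalized attention weights."""
    weights = F.softmax(scorer(reps).squeeze(-1), dim=0)  # sums to 1
    return weights @ reps

def sample_representation(cell_reps, cell_types, scorer):
    """Step 1: pool cells within each type; step 2: pool the types."""
    type_reps = torch.stack([
        attention_pool(cell_reps[cell_types == t], scorer)
        for t in cell_types.unique()
    ])
    return attention_pool(type_reps, scorer)

d = 16
scorer = torch.nn.Linear(d, 1)        # placeholder attention scorer
cells = torch.randn(100, d)           # 100 cell embeddings
types = torch.randint(0, 4, (100,))   # 4 cell types
sample_vec = sample_representation(cells, types, scorer)
```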

... The evolution of deep learning, especially through convolutional neural networks (CNNs), has enabled more nuanced analyses of RNA sequences and structures [59,6,35]. More recently, pre-trained language models (LMs) have revolutionized RNA research, facilitating more accurate predictions of RNA function and interactions [8,12,76]. These advancements significantly deepen our understanding of RNA's regulatory roles in cellular processes. ...

Deciphering 3'UTR mediated gene regulation using interpretable deep representation learning

... Another fruitful research avenue is the use of rich latent spaces as a basis for segmentation. Rather than segmenting pixels directly, other approaches, such as MAESTER [12] or DINOv2 [13,14], train a large network on a different task (e.g., reconstructing masked areas of the image) in order to produce rich feature embeddings of the image. ...

MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition
  • Citing Conference Paper
  • June 2023

... Leng et al. [27] proposed a label-efficient approach that leverages curriculum learning and confidence learning to detect noise in the analysis of ST data. To decipher intercellular communication within spatial transcriptomics graphs, BLEEP [28] constructs paired images and expression profiles simultaneously using contrastive learning at micrometer resolution, thus mapping the original dataset to a low-dimensional joint embedding space. TCGN [29] combines convolutional neural networks (CNNs), a Transformer encoder, and graph neural networks (GNNs) for histopathological image analysis, processing the pathology images in ST data. ...

Spatially Resolved Gene Expression Prediction from H&E Histology Images via Bi-modal Contrastive Learning