BMC Bioinformatics

Published by Springer Nature
Online ISSN: 1471-2105
Recent publications
Article
Background Estimating relatedness is an important step for many genetic study designs. A variety of methods for estimating coefficients of pairwise relatedness from genotype data have been proposed. Both the kinship coefficient φ and the fraternity coefficient ψ for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual-level genotypes cannot be confidently called. To ignore such uncertainty is known to result in biased estimates. Accordingly, methods have recently been developed to estimate kinship from uncertain genotypes. Results We present new method-of-moment estimators of both the coefficients φ and ψ calculated directly from genotype likelihoods. We have simulated low-depth genetic data for a sample of individuals with extensive relatedness by using the complex pedigree of the known genetic isolates of Cilento in South Italy. Through this simulation, we explore the behaviour of our estimators, demonstrate their properties, and show advantages over alternative methods. A demonstration of our method is given for a sample of 150 French individuals with down-sampled sequencing data. Conclusions We find that our method can provide accurate relatedness estimates whilst holding advantages over existing methods in terms of robustness, independence from external software, and required computation time. The method presented in this paper is referred to as LowKi (Low-depth Kinship) and has been made available in an R package (https://github.com/genostats/LowKi).
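For orientation only, the following is a minimal sketch of a textbook moment estimator of kinship computed from expected genotype dosages. It is not the LowKi estimator, which works directly on genotype likelihoods and is not reproduced here. The function names, the flat genotype prior and the assumption of known allele frequencies are all illustrative choices, and plugging dosages into a standard estimator is exactly the kind of shortcut that can be biased at low depth.

```python
import numpy as np

def expected_dosage(gl):
    """gl: array (n_individuals, n_snps, 3) of genotype likelihoods
    P(data | g) for g = 0, 1, 2 copies of the alternate allele.
    Assumes a flat prior, so posteriors are normalised likelihoods."""
    post = gl / gl.sum(axis=2, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])

def moment_kinship(dosage, p):
    """Hypothetical helper: phi_ij is the mean over SNPs of
    (d_i - 2p)(d_j - 2p) / (4 p (1 - p)), with p the allele frequencies."""
    z = (dosage - 2 * p) / np.sqrt(4 * p * (1 - p))
    return z @ z.T / dosage.shape[1]

# Toy input: two individuals, three SNPs, genotypes known with certainty.
gl = np.array([
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # genotypes 0, 1, 2
    [[0, 0, 1], [0, 1, 0], [1, 0, 0]],   # genotypes 2, 1, 0
], dtype=float)
dosage = expected_dosage(gl)
phi = moment_kinship(dosage, p=np.array([0.5, 0.5, 0.5]))
```

With certain genotypes the dosages equal the genotypes themselves; with real low-depth data the likelihood rows would be diffuse and the dosages fractional.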
 
Enrichment of top candidate genes in the SFARI T1 gene set
Enrichment of ASD heritability for top candidate genes
Tissue expression specificity of top candidate ASD genes. The results are reported in bullseye plots with the size of the bullseye scaled to the number of enriched genes and color coded by Fisher's exact test p-values. Brain tissue is the only category of tissues significantly enriched for the top candidate genes (BH corrected p value = 1.42 × 10⁻¹⁰)
Article
Background Autism spectrum disorder (ASD) is a group of complex neurodevelopmental disorders with a strong genetic basis. Large scale sequencing studies have identified over one hundred ASD risk genes. Nevertheless, the vast majority of ASD risk genes remain to be discovered, as it is estimated that more than 1000 genes are likely to be involved in ASD risk. Prioritization of risk genes is an effective strategy to increase the power of identifying novel risk genes in genetics studies of ASD. As ASD risk genes are likely to exhibit distinct properties from multiple angles, we reason that integrating multiple levels of genomic data is a powerful approach to pinpoint genuine ASD risk genes. Results We present BNScore, a Bayesian model selection framework to probabilistically prioritize ASD risk genes through explicitly integrating evidence from sequencing-identified ASD genes, biological annotations, and a gene functional network. We demonstrate the validity of our approach and its improved performance over existing methods by examining the resulting top candidate ASD risk genes against sets of high-confidence benchmark genes and large-scale ASD genome-wide association studies. We assess the tissue-, cell type- and development stage-specific expression properties of top prioritized genes, and find strong expression specificity in brain tissues, striatal medium spiny neurons, and fetal developmental stages. Conclusions In summary, we show that by integrating sequencing findings, functional annotation profiles, and a gene-gene functional network, our proposed BNScore provides competitive performance compared to current state-of-the-art methods in prioritizing ASD genes. Our method offers a general and flexible strategy for risk gene prioritization that can potentially be applied to other complex traits as well.
 
RE models performance on TBGA dataset
Article
Background Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size, preventing models from scaling effectively to large amounts of data. Results To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the GDA was extracted, the corresponding GDA, and the information about the gene-disease pair. Conclusions TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.
 
Article
Background Extraction of drug-drug interactions from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing works mostly centre on relation extraction using bidirectional long short-term memory networks (BiLSTM) and the BERT model, which do not attain the best feature representations. Results Our proposed DDI (drug-drug interaction) prediction model provides multiple advantages: (1) a newly proposed attention vector is added to better deal with the problem of overlapping relations, (2) the molecular structure information of drugs is integrated into the model to better express the functional group structure of drugs, (3) text features that combine the T-distribution and chi-square distribution are added to make the model more focused on drug entities, and (4) it achieves similar or better prediction performance (F-scores up to 85.16%) compared to state-of-the-art DDI models when tested on benchmark datasets. Conclusions Our model, which leverages a state-of-the-art transformer architecture in conjunction with multiple features, can bolster the performance of drug-drug interaction tasks in the biomedical domain. In particular, we believe our research would be helpful in the identification of potential adverse drug reactions.
 
Article
Background We study in this work the inverse folding problem for RNA, which is the discovery of sequences that fold into given target secondary structures. Results We implement a Lévy mutation scheme in an updated version of an evolutionary inverse folding algorithm and apply it to the design of RNAs with and without pseudoknots. We find that the Lévy mutation scheme increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. Compared to , CPU time is higher but more successful in finding designed sequences that fold correctly into the target structures. Conclusion We propose that a Lévy flight offers a better standard mutation scheme for optimizing RNA design. Our new version of is available on GitHub as a python script and the benchmark results show improved performance on both and the datasets, compared to existing inverse folding tools.
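To make the idea of a Lévy mutation scheme concrete, here is a minimal sketch in which the number of point mutations per step is drawn from a heavy-tailed distribution, so that most steps change one or two nucleotides and occasional steps change many. The choice of a Zipf distribution, the exponent value and the function name are assumptions made for illustration; the abstract does not state how the published tool samples mutation sizes.

```python
import numpy as np

def levy_mutate(seq, exponent=1.5, rng=None):
    """Hypothetical helper: mutate k random positions of an RNA sequence,
    where k follows a Zipf law (heavy tail), capped at the sequence length."""
    if rng is None:
        rng = np.random.default_rng()
    k = min(int(rng.zipf(exponent)), len(seq))
    positions = rng.choice(len(seq), size=k, replace=False)
    out = list(seq)
    for pos in positions:
        # The new nucleotide may coincide with the old one,
        # so at most k positions actually change.
        out[pos] = rng.choice(list("ACGU"))
    return "".join(out), k

parent = "GGGAAACCC"
child, n_sites = levy_mutate(parent, rng=np.random.default_rng(0))
hamming = sum(a != b for a, b in zip(parent, child))
```

A scheme with a fixed, small number of mutations per step would correspond to replacing the Zipf draw with a constant.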
 
Simplifying single-cell RNA-seq data with SuperCell. a Overview of the SuperCell coarse-graining pipeline, including the following steps. (1) A single-cell network is constructed from the single-cell gene expression matrix using the k-nearest neighbors (kNN) algorithm. (2) Densely connected cells are merged into metacells at a user-defined graining level (γ). (3) A gene expression matrix of metacells is computed by averaging gene expression within each metacell. (4) The metacell gene expression matrix can be used for visualization and downstream analyses such as clustering, differential expression, cell type annotation, gene correlation, imputation, RNA velocity and data integration. b–e Examples of metacell networks at several graining levels. For comparison, the network of clusters is shown on the right. b Five cancer cell lines (cell_lines, N = 3918) shown with different colors. c Tumor-infiltrating immune cells (TIICs, N = 15,939). d T cells sorted from PBMC (Tcells, N = 40,560). e Tumor-infiltrating CD8 T lymphocytes (Cd8_TILs, N = 3574)
Metacells preserve clustering and differential expression results, and reveal genes specifically expressed in dendritic cell subtypes. a Median purity of metacells computed with SuperCell, MetaCell_def and MetaCell_SC at different graining levels for the four datasets shown in Fig. 1b–e (cell_lines, TIICs, Tcells, Cd8_TILs). As a lower bound, the purity after random grouping of cells is shown in gray. b Consistency between the hierarchical clustering of metacells or after subsampling and the one of single cells (see Additional file 1: Fig. S4a for results with other clustering algorithms). The blue line shows the range of ARI values when other clustering algorithms are applied to the single-cell data (median shown with “X”). c Proportion of the cluster-specific DE genes (based on weighted t-test) found at the single-cell level and recovered at the metacell level or after subsampling. d Proportion of the condition-specific DE genes found in bulk RNA-seq and recovered at the metacell level or after subsampling in the Mouse_DE dataset. e Expression of genes coding for trans-membrane proteins in single cells (top) and metacells (bottom) that are more differentially expressed (i.e., better ranking) between cDCs and pDCs at the metacell level. The number following the ‘#’ sign indicates the ranking of each gene among the top differentially expressed ones. f Flow cytometry analysis of DCs from murine KP1.9 lung adenocarcinoma (N = 7). g Median fluorescence intensity of proteins coded by the genes from (e). All comparisons shown in e and g pass statistical significance based on two-tailed unpaired Student’s t-test (p values < 0.05)
Metacells improve cell type annotation, gene correlation, imputation and RNA velocity. a–b AUC of recovery of CD4 (top) and CD8 (bottom) T cells from the Tcells dataset using single markers (a) or signatures defined from bulk (b) consisting of the top 5 or top 50 genes for metacells computed with SuperCell, MetaCell_def, MetaCell_SC and random grouping, or after subsampling. c Expression of CD4 (top) and CD8A (bottom) in T cells from the Tcells dataset at the single-cell and the metacell (γ = 100) levels. d Gene correlation at the single-cell (top) and the metacell (γ = 50) (bottom) levels for selected gene pairs in the Cd8_TILs dataset, with the corresponding sample-weighted Pearson correlation (ρ). e Comparison of the GO similarity of metacell and single-cell top correlated genes identified for individual cell lines from the cell_lines dataset. The y-axis shows the ratio between mean GO match scores of the top correlated genes at the metacell and the single-cell levels. f Mean Spearman correlation between bulk and MAGIC-imputed data in each cell line of the cell_lines dataset. The dashed lines show the correlation between the pseudo-bulk (i.e., averaged gene expression within a cell line) and bulk gene expression. g Joint tSNE visualization of RNA velocity for the brain_cells dataset (N = 3396) for single cells (left) and metacells (γ = 10) (right) colored by cell type annotation. h Velocity purity in metacells (defined as the cosine similarity of single-cell velocities within each metacell). i Number of genes with valid estimated equilibrium slope values. j Pearson correlation of gene equilibrium slope values obtained in single-cell and metacell RNA velocities. k Cosine similarity between 2D single-cell and metacell RNA velocities shown in (g). For the subsampling and random grouping, the center of the error bars denotes the median, and the extrema denote the 1st and 3rd quartiles (obtained with different random seeds)
Metacells facilitate data integration and accelerate downstream analyses. a–b UMAP visualization of the non-integrated (a) and Harmony-integrated (b) COVID-19_atlas dataset (N = 1,462,702) at the metacell (γ = 10) level. Metacells are colored according to the cell type annotation, protocol or sample. c Batch effect level in terms of protocol (top) and sample (bottom) in the non-integrated and Harmony-integrated COVID-19_atlas dataset, computed as the kBET acceptance rate for the four most frequent cell types. d Computational time (top) and memory allocation (bottom) for the visualization (UMAP), clustering (Seurat), DE analysis (t-test, each cell type versus the rest), data integration (Harmony) and all steps together (‘Combined analysis’) for the metacells (dashed lines) and single cells (solid line). Red dots show the limits reached on standard desktops (16G of RAM). Black dots correspond to the limits reached on a machine with 512G RAM (linear extrapolations shown in gray). e UMAP visualization of the TIM_atlas dataset (N = 108,566) at the single-cell (left) and the metacell (γ = 50) (right) levels computed with the approximate coarse-graining. Cells are colored according to the cell type annotation. f Relative (z-score) expression of genes experimentally tested in Fig. 2g at the single-cell (top) and metacell (bottom) levels. The number following the ‘#’ sign indicates the ranking of each gene among the top differentially expressed ones. All comparisons pass statistical significance based on two-tailed unpaired Student’s t-test (p values < 0.05) except for CD74 at the single-cell level (p value = 1). Ranks for genes showing a different behavior both at single-cell and metacell levels between mouse and human are shown in red. g Computational time (top) and memory allocation (bottom) for the building of metacells followed by downstream analyses including dimensionality reduction, clustering and DE analysis for metacells computed with SuperCell (red dashed line) or MetaCell (green dashed line), and for the single cells (solid black line). Red dots show the limit reached on standard desktops (16G of RAM)
Article
Background Single-cell RNA sequencing (scRNA-seq) technologies offer unique opportunities for exploring heterogeneous cell populations. However, in-depth single-cell transcriptomic characterization of complex tissues often requires profiling tens to hundreds of thousands of cells. Such large numbers of cells represent an important hurdle for downstream analyses, interpretation and visualization. Results We develop a framework called SuperCell to merge highly similar cells into metacells and perform standard scRNA-seq data analyses at the metacell level. Our systematic benchmarking demonstrates that metacells not only preserve but often improve the results of downstream analyses including visualization, clustering, differential expression, cell type annotation, gene correlation, imputation, RNA velocity and data integration. By capitalizing on the redundancy inherent to scRNA-seq data, metacells significantly facilitate and accelerate the construction and interpretation of single-cell atlases, as demonstrated by the integration of 1.46 million cells from COVID-19 patients in less than two hours on a standard desktop. Conclusions SuperCell is a framework to build and analyze metacells in a way that efficiently preserves the results of scRNA-seq data analyses while significantly accelerating and facilitating them.
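Once each cell has been assigned to a metacell, metacell-level expression can be obtained by averaging the expression of the member cells. The sketch below shows only that averaging step with NumPy; it is not the SuperCell implementation, the merging step that produces the assignment is taken as given, and the function name and the toy membership vector are hypothetical.

```python
import numpy as np

def metacell_expression(expr, membership):
    """expr: (n_genes, n_cells) expression matrix.
    membership: (n_cells,) integer labels 0..K-1, one per cell.
    Assumes every label in 0..K-1 is used by at least one cell.
    Returns the (n_genes, K) matrix of per-metacell means and the metacell sizes."""
    n_cells = expr.shape[1]
    n_meta = membership.max() + 1
    indicator = np.zeros((n_cells, n_meta))
    indicator[np.arange(n_cells), membership] = 1.0
    sizes = indicator.sum(axis=0)
    return (expr @ indicator) / sizes, sizes

expr = np.array([[1.0, 3.0, 5.0, 7.0],
                 [2.0, 2.0, 4.0, 0.0]])
membership = np.array([0, 0, 1, 1])          # toy assignment: 4 cells, 2 metacells
meta_expr, sizes = metacell_expression(expr, membership)
graining_level = expr.shape[1] / len(sizes)  # cells per metacell on average
```

Keeping the metacell sizes is useful because downstream statistics can then be weighted by the number of cells each metacell represents.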
 
The t-SNE maps of siRNAs of training and test data, where potent and ineffective siRNAs were coloured according to their memberships. The X-axis represented the first projections (1P) of t-SNE. The Y-axis represented the second projections (2P) of t-SNE. a The t-SNE maps of $C_{15}$-features of training data. b The t-SNE maps of $C_{31}$-features of training data. c The t-SNE maps of $C_{15}$-features of test data. d The t-SNE maps of $C_{31}$-features of test data
The $C^{\alpha_{s},t}$-feature maps of training and test data, where potent and ineffective siRNAs were coloured according to their memberships. The X-axis represented $c^{20,t}(1)$-features of siRNAs. The Y-axis represented $c^{20,t}(2)$-features of siRNAs. a The map of $C^{20,1}$-features of siRNAs of training data. b The map of $C^{20,3}$-features of siRNAs of training data. c The map of $C^{20,1}$-features of siRNAs of test data. d The map of $C^{20,3}$-features of siRNAs of test data
a The cumulative number of the removed ineffective siRNAs, where the X-axis represented $\alpha_{s}$, the Y-axis represented the number of the removed siRNAs. b The cumulative number of the removed potent siRNAs, where the X-axis represented $\alpha_{s}$, the Y-axis represented the number of the removed siRNAs. c The map of $\beta^{s,3}_{1}(1)$-parameters and $\beta^{s,4}_{1}(1)$-parameters. d The map of $\beta^{s,t}_{2}(1)$-parameters
a The t-SNE map of $C^{\alpha_{10}}$-features of siRNAs in test data, where the X-axis represented the first projections (1P) of t-SNE, the Y-axis represented the second projections (2P) of t-SNE, TN, FN, TP and FP of siRNAs were coloured according to their memberships. b The predicted efficacy and observed inhibition of siRNAs, where TN, FN, TP and FP of siRNAs were coloured according to their memberships, the X-axis represented the observed inhibition of siRNAs, and the Y-axis represented the predicted efficacy of siRNAs
Article
Background In siRNA-based antiviral therapeutics, selection of potent siRNAs is an indispensable step, but commonly used features are unable to construct the boundary between potent and ineffective siRNAs. Results Here, we select potent siRNAs by removing ineffective ones. The conditions for removal are constructed from C-features of siRNAs, which are generated by the MG-algorithm, the Icc-cluster and different combinations of some commonly used features; the MG-algorithm and the Icc-cluster are two different algorithms for searching the nearest siRNA neighbors. Ineffective siRNAs are removed from the test data by I-iteration, which continually updates the training data by adding the successively removed siRNAs. Furthermore, the efficacy of siRNAs in the test data is predicted from their nearest neighbors in the training data. Conclusions On the siRNAs of the Hencken dataset, results show that our algorithm removes almost all ineffective siRNAs from the test data, gives a clear boundary between potent and ineffective siRNAs, and also accurately predicts the efficacy of siRNAs. We suggest that our algorithm can provide new insights for selecting potent siRNAs.
 
Article
Background Image segmentation in fluorescence microscopy is often based on spectral separation of fluorescent probes (color-based segmentation) or on significant intensity differences in individual image regions (intensity-based segmentation). These approaches fail if dye fluorescence shows large spectral overlap with other employed probes or with strong cellular autofluorescence. Results Here, a novel model-free approach is presented which determines bleaching characteristics based on dynamic mode decomposition (DMD) and uses the inferred photobleaching kinetics to distinguish different probes or dye molecules from autofluorescence. DMD is a data-driven computational method for detecting and quantifying dynamic events in complex spatiotemporal data. Here, DMD is first used on synthetic image data and thereafter used to determine photobleaching characteristics of a fluorescent sterol probe, dehydroergosterol (DHE), compared to that of cellular autofluorescence in the nematode Caenorhabditis elegans. It is shown that decomposition of those dynamic modes allows for separating probe from autofluorescence without invoking a particular model for the bleaching process. In a second application, DMD of dye-specific photobleaching is used to separate two green-fluorescent dyes, an NBD-tagged sphingolipid and Alexa488-transferrin, thereby assigning them to different cellular compartments. Conclusions Data-based decomposition of dynamic modes can be employed to analyze spatially varying photobleaching of fluorescent probes in cells and tissues for spatial and temporal image segmentation, discrimination of probe from autofluorescence and image denoising. The new method should find wide application in analysis of dynamic fluorescence imaging data.
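The following is a minimal sketch of the standard (exact) DMD algorithm applied to a synthetic bleaching movie, assuming each frame has been flattened into one column of a snapshot matrix. It illustrates the generic technique only and is not the authors' pipeline; the function name, the rank and the synthetic two-component data are illustrative assumptions.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD. snapshots: (n_pixels, n_frames), one flattened frame per column.
    Returns the DMD eigenvalues and the corresponding spatial modes."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]      # frames t and t+1
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]     # truncate to the chosen rank
    V = Vh.conj().T
    A_tilde = U.conj().T @ Y @ V / s                # reduced operator U* Y V S^-1
    eigvals, W = np.linalg.eig(A_tilde)
    modes = Y @ V / s @ W                           # exact DMD modes
    return eigvals, modes

# Synthetic movie: two spatial patterns bleaching at different rates.
rng = np.random.default_rng(0)
slow_pattern, fast_pattern = rng.random(50), rng.random(50)
t = np.arange(12)
movie = np.outer(slow_pattern, 0.9 ** t) + np.outer(fast_pattern, 0.5 ** t)
eigvals, modes = dmd(movie, rank=2)
```

In this toy case each eigenvalue is the per-frame decay factor of one component, and the matching mode is its spatial pattern up to scaling, which is what allows two signals with different bleaching kinetics to be assigned to different image regions.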
 
Host-specific signature identification method based on both adjusted and unadjusted (Shannon) entropy measurement
The BLOSUM62 scoring matrix for amino acid substitution. A table value for a particular pair of amino acids is the log odds defined as 2 log₂(P(O)/P(E)), where P(O) is the observed probability of occurrence of the pair and P(E) is the expected probability of occurrence of the pair assuming independence [18]. Similarities between amino acid pairs are based on log odds as described in the text
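As a small worked example of the log-odds definition in the caption above, the sketch below computes 2 log₂(P(O)/P(E)) and rounds it to the nearest integer, as is done for published BLOSUM tables. The probabilities are invented for illustration and the function name is hypothetical.

```python
import math

def log_odds_score(p_observed, p_expected):
    """Score in half-bit units: 2 * log2(P(O) / P(E)), rounded to an integer."""
    return round(2 * math.log2(p_observed / p_expected))

# A pair seen four times more often than expected by chance scores positively,
# a pair seen half as often as expected scores negatively.
favoured = log_odds_score(0.04, 0.01)
disfavoured = log_odds_score(0.005, 0.01)
```

A score of zero therefore means the pair occurs about as often as independence would predict.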
Article
Background Influenza A viruses (IAV) exhibit vast genetic mutability and have great zoonotic potential to infect avian and mammalian hosts and are known to be responsible for a number of pandemics. A key computational issue in influenza prevention and control is the identification of molecular signatures with cross-species transmission potential. We propose an adjusted entropy-based host-specific signature identification method that uses a similarity coefficient to incorporate the amino acid substitution information and improve the identification performance. Mutations in the polymerase genes (e.g., PB2) are known to play a major role in avian influenza virus adaptation to mammalian hosts. We thus focus on the analysis of PB2 protein sequences and identify host specific PB2 amino acid signatures. Results Validation with a set of H5N1 PB2 sequences from 1996 to 2006 results in adjusted entropy having a 40% false negative discovery rate compared to a 60% false negative rate using unadjusted entropy. Simulations across different levels of sequence divergence show a false negative rate of no higher than 10% while unadjusted entropy ranged from 9 to 100%. In addition, under all levels of divergence adjusted entropy never had a false positive rate higher than 9%. Adjusted entropy also identifies important mutations in H1N1pdm PB2 previously identified in the literature that explain changes in divergence between 2008 and 2009 which unadjusted entropy could not identify. Conclusions Based on these results, adjusted entropy provides a reliable and widely applicable host signature identification approach useful for IAV monitoring and vaccine development.
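For reference, the unadjusted (Shannon) entropy of a single alignment position can be computed as in the sketch below. The adjusted entropy described in the abstract additionally uses a similarity coefficient between amino acids and is not reproduced here; the function name and the toy columns are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one alignment position."""
    counts = Counter(column)
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy columns at one position: one residue per sequence, grouped by host.
avian_column = "EEEE"      # fully conserved
human_column = "KKKK"      # fully conserved, but a different residue
pooled_column = avian_column + human_column
```

Under this simple reading, a position that is conserved within each host (entropy 0) yet variable in the pooled data (entropy 1 bit here) is the kind of site a host-specific signature search would flag.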
 
The overall workflow of the proposed algorithm
The average AUC of each classifier is based on the size of feature vectors extracted by DeepWalk
ROC curve and AUC based on the average values of 5 folds (the size of extracted feature vector with Deepwalk set to the optimum value for each classifier)
ROC curve and AUC based on the average values of 5 folds for different algorithms compared with our method
Article
Background Several types of RNA in the cell are usually involved in biological processes with multiple functions. Coding RNAs code for proteins while non-coding RNAs regulate gene expression. Some single-strand RNAs can create a circular shape via the back splicing process and convert into a new type called circular RNA (circRNA). circRNAs are among the essential non-coding RNAs in the cell and are involved in multiple disorders. One of the critical functions of circRNAs is to regulate the expression of other genes through sponging microRNAs (miRNAs) in diseases. This mechanism, known as the competing endogenous RNA (ceRNA) hypothesis, and additional information obtained from biological datasets can be used by computational approaches to predict novel associations between disease and circRNAs. Results We applied multiple classifiers to validate the extracted features from the heterogeneous network and selected the most appropriate one based on some evaluation criteria. Then, XGBoost is utilized in our pipeline to generate a novel approach, called CircWalk, to predict CircRNA-Disease associations. Our results demonstrate that CircWalk has reasonable accuracy and AUC compared with other state-of-the-art algorithms. We also use CircWalk to predict novel circRNAs associated with lung, gastric, and colorectal cancers as a case study. The results show that our approach can accurately detect novel circRNAs related to these diseases. Conclusions Considering the ceRNA hypothesis, we integrate multiple resources to construct a heterogeneous network from circRNAs, mRNAs, miRNAs, and diseases. Next, the DeepWalk algorithm is applied to the network to extract feature vectors for circRNAs and diseases. The extracted features are used to learn a classifier and generate a model to predict novel CircRNA-Disease associations. Our approach uses the concept of the ceRNA hypothesis and the miRNA sponge effect of circRNAs to predict their associations with diseases.
Our results show that this outlook could help identify CircRNA-Disease associations more accurately.
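As a rough illustration of the feature-extraction step described above, the sketch below samples truncated uniform random walks over a small heterogeneous network, which is the corpus-building stage of DeepWalk. The node names, the toy adjacency list, and the function name random_walks are hypothetical; the skip-gram training that turns walks into feature vectors and the downstream classifier are omitted.

```python
import random


def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Sample truncated uniform random walks (the corpus-building step of DeepWalk)."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # start nodes are visited in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical toy network mixing circRNA, miRNA, mRNA and disease nodes.
toy_network = {
    "circRNA_1": ["miRNA_1"],
    "miRNA_1": ["circRNA_1", "mRNA_1", "disease_1"],
    "mRNA_1": ["miRNA_1", "disease_1"],
    "disease_1": ["miRNA_1", "mRNA_1"],
}

walks = random_walks(toy_network)
```

In a full pipeline each walk would be treated as a sentence and passed to a skip-gram model, so that nodes appearing in similar walk contexts receive similar vectors.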
 
Examples show the ambiguous boundaries between WM and GM
An overview of the proposed model
The architecture of our model’s transformer
An example of the MICCAI iSEG dataset (T1, T2, manual reference contour)
Visualization results on MRBrainS dataset
Article
MRI brain images are typically of low contrast, which makes it difficult to determine which region the information at a boundary belongs to. This makes feature extraction at the boundary more challenging, since those features can be misleading when they mix properties of different brain regions. Hence, image boundary detection plays a vital role in medical image segmentation, and in brain segmentation in particular, as unclear boundaries can worsen segmentation results. Yet, given the low quality of brain images, boundary detection in the context of brain image segmentation remains challenging. Despite the research invested in improving boundary detection and brain segmentation, these two problems have been addressed independently, i.e., little attention has been paid to applying boundary detection to brain segmentation tasks. Therefore, in this paper, we propose a boundary detection-based model for brain image segmentation. To this end, we first design a boundary segmentation network for detecting and segmenting brain tissue images. Then, we design a boundary information module (BIM) to distinguish boundaries from the three different brain tissues. After that, we add a boundary attention gate (BAG) to the encoder output layers of our transformer to capture more informative local details. We evaluate our proposed model on two datasets of brain tissue images, including infant and adult brains. Extensive evaluation experiments show that our model performs better (a Dice Coefficient (DC) accuracy gain of up to 5.3% over state-of-the-art models) in detecting and segmenting brain tissue images.
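For readers unfamiliar with the evaluation metric mentioned above, here is a minimal sketch of the Dice coefficient for two binary masks. The function name and the flat 0/1 list representation are assumptions made for illustration, and the example masks are invented.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * intersection / total


# Invented masks: 3 foreground voxels each, 2 of them shared.
predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]
dice = dice_coefficient(predicted_mask, reference_mask)
```

For multi-class brain tissue segmentation the same function would typically be applied once per tissue label and the results averaged.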
 
Article
Background Biological data suffers from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, in order to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can not only obscure the genes' temporal patterns, but applying predictive and clustering tools to noisy data may also yield inconsistent, and potentially incorrect, results. Results To reduce the noise of short-term (< 48 h) time-series expression data, we relied on the three basic temporal patterns of gene expression: waves, impulses and sustained responses. We constrained the estimation of the true signals to these patterns by estimating the parameters of first and second-order Fourier functions and using the nonlinear least-squares trust-region optimization technique. Our approach lowered the noise in at least 85% of synthetic time-series expression data, significantly more than the spline method ($$p<10^{-6}$$). When the data contained a higher signal-to-noise ratio, our method allowed downstream network component analyses to calculate consistent and accurate predictions, particularly when the noise variance was high. Conversely, these tools led to erroneous results from untreated noisy data. Our results suggest that at least 5–7 time points are required to efficiently de-noise logarithmic scaled time-series expression data. Investing in sampling additional time points provides little benefit to clustering and prediction accuracy. Conclusions Our constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.
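The abstract does not give the fitted functions explicitly. One plausible reading, stated here as an assumption, is a truncated Fourier series with a free angular frequency $$\omega$$, fitted to the observed expression values $$y_i$$ at time points $$t_i$$ by least squares. The symbols below are introduced for illustration only.

```latex
\begin{align}
x_1(t) &= a_0 + a_1 \cos(\omega t) + b_1 \sin(\omega t) \\
x_2(t) &= x_1(t) + a_2 \cos(2\omega t) + b_2 \sin(2\omega t) \\
\hat{\theta}_k &= \arg\min_{\theta_k} \sum_{i=1}^{n} \bigl( y_i - x_k(t_i; \theta_k) \bigr)^2, \qquad k \in \{1, 2\}
\end{align}
```

Under this reading the model is linear in the amplitudes but nonlinear in $$\omega$$, which would explain why a nonlinear least-squares solver is used rather than ordinary linear regression.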
 
The SMaSH framework. A SMaSH works directly from the counts matrix, producing a dictionary relating the user-defined classes of interest (e.g. cell type annotations) to top marker genes for each class (default top 5). B SMaSH filters and ranks genes according to an ensemble learning model or a deep neural network
Classifying broad cell types based on SMaSH-specific marker genes. A Confusion matrices for the top 30 marker genes per cell type in the lung broad cell classification data-set for scGeneFit, RankCorr, SMaSH using the network mode, and SMaSH using the ensemble mode (using XGBoost). B Cell misclassification and $$F_{1}$$ scores for the two SMaSH modes against scGeneFit and RankCorr. C Benchmarking different SMaSH ensemble learning models across biological scRNA-seq data and related modalities
Marker genes for the broad mouse brain cell types. A The mean |Shapley value| for the top 30 ranked marker genes across all broad cell types of the mouse brain, before additional filtering and sorting, using SMaSH’s network mode. Different colours indicate the different class contributions which that particular gene explains. B the final three markers for each class/broad cell type are shown, with the colour profile corresponding to the mean logarithm of the gene expression and a pattern uniquely matching specific markers to specific cell types
Marker gene misclassification rates and $$F_{1}$$ scores in cell types in the lung and mouse brain. Performance for each human lung cancer cell sub-type and framework, including the two modes in SMaSH. HLC: Human lung cancer; MB: Mouse brain
Marker genes for the mouse brain cell sub-types from the Inhibitory neuron broad types, and human foetal organ of origin classification. The mean logarithm of gene expression for mouse brain cell Inhibitory neuron cell sub-type markers. A the markers for scGeneFit; B the markers for RankCorr; C markers from SMaSH’s network mode. Particularly in the case of SMaSH unique patterns can still be identified in this highly granular cell-type identification problem, whereas approaches such as scGeneFit are not able to identify many markers which uniquely resolve the sub-types present. D SMaSH is able to select statistically significant markers for a highly imbalanced problem of distinguishing organs of origin in foetal scRNA-seq
Article
Background Single-cell RNA-sequencing is revolutionising the study of cellular and tissue-wide heterogeneity in a large number of biological scenarios, from highly tissue-specific studies of disease to human-wide cell atlases. A central task in single-cell RNA-sequencing analysis design is the calculation of cell type-specific genes in order to study the differential impact of different replicates (e.g. tumour vs. non-tumour environment) on the regulation of those genes and their associated networks. The crucial task is the efficient and reliable calculation of such cell type-specific ‘marker’ genes. These optimise the ability of the experiment to isolate highly-specific cell phenotypes of interest to the analyser. However, while methods exist that can calculate marker genes from single-cell RNA-sequencing, no such method places emphasis on specific cell phenotypes for downstream study in e.g. differential gene expression or other experimental protocols (spatial transcriptomics protocols for example). Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterise highly-specific and niche populations of cells in numerous different biological data-sets. Results SMaSH extracts robust and biologically well-motivated marker genes, which characterise a given single-cell RNA-sequencing data-set better than existing computational approaches for general marker gene calculation. We demonstrate the utility of SMaSH through its substantial performance improvement over several existing methods in the field. Furthermore, we evaluate the SMaSH markers on spatial transcriptomics data, demonstrating they identify highly localised compartments of the mouse cortex. Conclusion SMaSH is a new methodology for calculating robust marker genes from large single-cell RNA-sequencing data-sets, and has implications for e.g. effective gene identification for probe design in downstream spatial transcriptomics experiments. 
SMaSH has been fully integrated with the framework and provides a valuable bioinformatics tool for cell type characterisation and validation in ever-growing data-sets spanning over 50 different cell types across hundreds of thousands of cells.
 
Article
Abstract Background The treatment and prognosis of lung adenocarcinoma (LUAD) remain a challenge. The study aimed to conduct a systematic analysis of the predictive capacity of N6-methyladenosine (m6A)-related long non-coding RNAs (lncRNAs) in the prognosis of LUAD. Methods A total of 594 samples were selected from a dataset from The Cancer Genome Atlas. Prognostic m6A-related lncRNAs were identified by Pearson correlation analysis and Cox regression analysis. Systematic analyses, including cluster analysis, survival analysis, and immuno-correlated analysis, were conducted. A prognosis model was built from the optimized subset of m6A-related lncRNAs. The model was assessed by survival analysis and receiver operating characteristic (ROC) curves. Finally, the risk score of patients with LUAD calculated by the prognosis model was evaluated by Cox regression analysis. Differential analysis was used to further evaluate the cuproptosis-related genes in the two risk sets. Results The patients were grouped into two clusters according to the expression levels of 22 prognostic m6A-related lncRNAs. Patients with LUAD in cluster 2 had significantly worse overall survival (OS) (P = 0.006). Three scores calculated by the ESTIMATE method were significantly lower in cluster 2. After applying the least absolute shrinkage and selection operator algorithm, a total of 10 prognostic m6A-related lncRNAs were selected to construct the final model and obtain the risk score. The area under the ROC curve of the prognosis model for 1-, 3-, and 5-year OS was 0.767, 0.709, and 0.736 in the training set, and 0.707, 0.691, and 0.675 in the test set. The OS of the low-risk cohort was significantly higher than that of the high-risk cohort in both the training set (P
 
Potential transmembrane proteins in the globular data set. AlphaFold2 [11, 68] structure of extracellular serine protease (P09489) and Lipase 1 (P40601). Transmembrane segments (dark purple) predicted by TMbed correlate well with membrane boundaries (dotted lines: red = outside, blue = inside) predicted by the PPM [45] web server. Images created using Mol* Viewer [71]. Though our data set lists them as globular proteins, the predicted structures indicate transmembrane domains, which align with segments predicted by our method. The predicted domains overlap with autotransporter domains detected by the UniProtKB [46] automatic annotation system. Transmembrane segment predictions were made with the final TMbed ensemble model
New membrane proteins. PDB structures for probable flagellin 1 (Q9YAN8; 7TXI [73]), protein-serine O-palmitoleoyltransferase porcupine (Q9H237; 7URD [74]), choline transporter-like protein 1 (Q8WWI5; 7WWB [75]), S-layer protein SlpA (Q9RRB6; 7ZGY [76]), and membrane protein (P0DTC5; 8CTK [77]). Transmembrane segments (dark purple) predicted by TMbed; membrane boundaries (dotted lines: red = outside, blue = inside) predicted by the PPM [45] web server. Images created using Mol* Viewer [71]
Article
Background Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4–5 times underrepresented compared to non-TMPs. Today’s top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. Results Here, we present TMbed, a novel method that takes embeddings from protein Language Models (pLMs, here ProtT5) as input to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar to or better than those of methods that use evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). Conclusions Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.
 
Article
Background Malaria risk prediction is currently limited to advanced statistical methods, such as time series and cluster analysis of epidemiological data. Machine learning models have also been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS). Results We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variable. Moreover, to provide a comparison for the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction. Conclusions Our proposed approach identified SNP rs334 as the most contributing feature with an importance score of 6.224 compared to the baseline, with an importance score of 1.1314. This is an important result as prior studies have proven that rs334 is a major genetic risk factor for malaria. 
The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than wGRS alone, in which LightGBM obtained the best performance (0.0033 MAE score).
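The abstract does not spell out how the wGRS is computed. A common convention, assumed here, is to sum each SNP's risk-allele count (0, 1 or 2) weighted by the natural logarithm of its odds ratio. The SNP identifiers, odds ratios, and function name below are hypothetical values chosen only to make the arithmetic visible.

```python
import math


def weighted_genetic_risk_score(risk_allele_counts, odds_ratios):
    """Sum of risk-allele counts weighted by log odds ratios (an assumed, common wGRS definition)."""
    score = 0.0
    for snp, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"risk-allele count for {snp} must be 0, 1 or 2")
        score += math.log(odds_ratios[snp]) * count
    return score


# Hypothetical genotype of one individual and hypothetical per-SNP odds ratios.
counts = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
ratios = {"snp_a": 1.5, "snp_b": 2.0, "snp_c": 1.2}
wgrs = weighted_genetic_risk_score(counts, ratios)  # 2*ln(1.5) + 1*ln(2.0) + 0*ln(1.2)
```

A score computed this way per individual could then serve as the regression target for models such as LightGBM, XGBoost, or Ridge regression.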
 
Conversion of UCSC Pathway Tab Format to valid pathways for the mathematical model. These pathways model the relationship between the i-th child and its progenitors using various types of interactions including component>, member>, and activations & inhibitions
Flow diagram of the methodology. Starting from a specific experimental picture (discrete gene expression), we calculate the minimum number of lowly expressed genes required to be active for the cell to sustain cellular life ($$S_{a}^{wild}$$). Then, we systematically knock-out one by one all the expressed genes g present in the pathway P (Eg = 0) and recalculate the minimum number of lowly expressed genes required to be active for the cell to sustain cellular life ($$S_{a}^{g}$$). We define a gene as essential for a given active if $$S_{a}^{g} > S_{a}^{wild}$$. We repeat this process for all the genes, actives, and pathways included in the database. The essentiality of a gene g is finally defined as the maximum of all its essentiality predictions across all actives A and pathways where the gene appears Pg
Toy example. Graphical representation of the pathway activating the Wnt receptor signaling pathway, planar cell polarity pathway. Component-type interactions are represented with solid arrows whilst activation-type interactions are illustrated with dashed lines
Toy example solution. Possible scenarios when FZD7 and FZD1 are expressed. W5F7C represents the WNT5A/FZD7 complex, W3F1C represents the WNT3A/FZD1 complex, and ABSTR represents the Wnt receptor signaling pathway, planar cell polarity pathway. Dark and light nodes represent inactive and active nodes in the final solution respectively, namely Ei = 0 and Ei = 1. The dashed edge in a gene g represents highly expressed genes (g ∈ G) whereas continuous edges represent lowly expressed genes (g ∈ L). A WNT5A and WNT3A are lowly expressed. For the abstract to be active we need to activate one lowly expressed gene (WNT3A in the example). A knockout of FZD1 requires the activation of one lowly expressed gene (WNT5A in the example) thus providing an equivalent solution (Swild = SFZD1, FZD1 is not essential). B WNT5A is lowly expressed and WNT3A is highly expressed. For the abstract to be active we do not need to activate any lowly expressed gene. A knock-out of FZD7 does not require the activation of any lowly expressed gene for the abstract to be active (Swild = SFZD7, FZD7 is not essential). C WNT5A is highly expressed and WNT3A is lowly expressed. For the abstract to be active we do not need to activate any lowly expressed gene. A knock-out of FZD7 requires the activation of one lowly expressed gene (WNT3A) for the abstract to be active (Swild < SFZD7, FZD7 is essential)
Method validation. A Histogram showing the results from the validation of the method. The dark distribution shows the Achilles scores of those gene and cell-line pairs predicted as essential; the light distribution shows the Achilles scores of those predicted as not essential. Genes predicted as essential have significantly lower Achilles scores than genes predicted as not essential (p-value = 6.4032 × 10^−246). The average difference between both distributions is defined by the parameter delta.score = − 0.1463. B Impact of ECS on the performance of the method. Evolution of the results when different thresholds of ECS are used to define a gene as essential. delta.score: average difference in Achilles score between the genes predicted as essential and the genes predicted as not essential; MCC: Matthews Correlation Coefficient; N: number of genes included in the comparison; Precision: obtained precision assuming as real essential genes those with an Achilles score < − 0.5. C Histogram when MCC finds its maximum (ECS = 0.6667). The average difference in Achilles Score between genes predicted as essential and genes predicted as not essential becomes bigger (delta.score = − 0.5954) and so does their significance (p-value = 0)
Article
A gene is considered essential when it is indispensable for cells to grow and replicate in a certain environment. However, gene essentiality is not a structural property but rather a contextual one, which depends on the specific biological conditions affecting the cell. This circumstantial essentiality of genes is what attracts the attention of scientists, since we can identify genes that are essential for cancer cells but not for healthy cells. This same contextuality makes their identification extremely challenging. Huge experimental efforts such as Project Achilles, where the essentiality of thousands of genes is measured together with a plethora of molecular data (transcriptomics, copy number, mutations, etc.) in over one thousand cell lines, can shed light on the causality behind the essentiality of a gene in a given environment. Here, we present an in-silico method for the identification of patient-specific essential genes using constraint-based modelling (CBM). Our method expands the ideas behind traditional CBM to accommodate multisystem networks. In essence, it first calculates the minimum number of lowly expressed genes required to be activated by the cell to sustain life as defined by a set of requirements; and second, it performs an exhaustive in-silico gene knockout to find those genes whose removal leads to the need to activate additional lowly expressed genes. We validated the proposed methodology using a set of 452 cancer cell lines derived from the Cancer Cell Line Encyclopedia, for which an exhaustive experimental large-scale gene knockout study using CRISPR (Achilles Project) evaluates the impact of each removal. We also show that the integration of different essentiality predictions per gene, which we call the Essentiality Congruity Score, reduces the number of false positives. Finally, we explored our method in a breast cancer patient dataset, and our results showed high concordance with previous publications. 
These findings suggest that identifying genes whose activity is fundamental to sustain cellular life in a patient-specific manner is feasible using in-silico methods. The patient-level gene essentiality predictions can pave the way for precision medicine by identifying potential drug targets whose deletion can induce death in tumour cells.
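The two-step procedure described above can be illustrated on the Wnt toy pathway from the figure captions. The sketch below replaces the constraint-based optimisation with a brute-force search over subsets of lowly expressed genes, which is only practical for tiny examples. The encoding of the pathway (each complex needs all of its components, and the abstract node is activated by either complex) and the function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical encoding of the toy pathway: a complex is active only if all its
# components are active, and the abstract node needs at least one active complex.
COMPLEXES = {"W5F7C": {"WNT5A", "FZD7"}, "W3F1C": {"WNT3A", "FZD1"}}


def min_low_activations(high, low, knocked_out=None):
    """Smallest number of lowly expressed genes to switch on so the abstract node is active."""
    available_high = set(high) - {knocked_out}
    candidates = sorted(set(low) - {knocked_out})
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            active = available_high | set(chosen)
            if any(parts <= active for parts in COMPLEXES.values()):
                return size
    return float("inf")  # the abstract node cannot be activated at all


def essential_genes(high, low):
    """Expressed genes whose knockout increases the required number of activations."""
    wild = min_low_activations(high, low)
    return {g for g in high if min_low_activations(high, low, g) > wild}


# Scenario A of the toy example: both receptors expressed, both ligands lowly expressed.
scenario_a = essential_genes(high={"FZD7", "FZD1"}, low={"WNT5A", "WNT3A"})
# Scenario C: WNT5A highly expressed, WNT3A lowly expressed.
scenario_c = essential_genes(high={"FZD7", "FZD1", "WNT5A"}, low={"WNT3A"})
```

Under this encoding no gene is essential in scenario A, while in scenario C the knockout of FZD7 forces the activation of WNT3A, in line with the caption. The sketch also flags WNT5A in scenario C, because it knocks out every highly expressed gene rather than only the receptors.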
 
An overview of the general approach. We take advantage of graph analysis algorithms to convert knowledge from the literature to features that enhance the performance of multi-target machine learning models
The results of models built with different methods. b: baseline, nn: common neighbour, ib: local path index, ent: entropy-based method, ll: locally linear embedding, le: Laplacian eigenmaps, n2v: Node2vec, dw: DeepWalk, st: stacking model. The figure shows the number of times a model outperformed the other models for each gene. The stacking model surpassed the other models by a wide margin
A pairwise comparison between the baseline and models built with graph methods. b: baseline, nn: common neighbour, ib: local path index, ent: entropy-based method, ll: locally linear embedding, le: Laplacian eigenmaps, n2v: Node2vec, dw: DeepWalk, st: stacking model. The y axis shows the number of times a model outperformed the other. When compared to the base model, all graph-based models recorded higher scores
The general workflow describing the integration of signalling pathways models into the machine learning model
Article
Background A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data; while ML methods do not fully utilize existing biological knowledge. Results Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven different sets of similarity features were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the feature sets significantly improved accuracy over the baseline model, which lacks the similarity features. Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. Conclusion Integrating mechanistic models as graphs helps both to improve the predictive results of machine learning models and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models.
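To make the idea of a graph-derived similarity feature concrete, the sketch below computes the common-neighbour score, one of the graph methods named in the figure captions, between two nodes of a signalling graph. The toy graph, the gene names, and the function name are hypothetical, and how such scores are fed into the multi-target regression models is not specified in the abstract.

```python
def common_neighbour_score(graph, u, v):
    """Number of nodes adjacent to both u and v in an undirected graph."""
    return len(set(graph[u]) & set(graph[v]))


# Hypothetical undirected signalling graph stored as an adjacency list.
toy_graph = {
    "gene_a": ["gene_b", "gene_c", "gene_d"],
    "gene_b": ["gene_a", "gene_c"],
    "gene_c": ["gene_a", "gene_b", "gene_d"],
    "gene_d": ["gene_a", "gene_c"],
}

score_bd = common_neighbour_score(toy_graph, "gene_b", "gene_d")  # shared: gene_a, gene_c
```

A higher score means two genes share more interaction partners in the pathway model, which is the kind of prior knowledge a similarity feature can expose to a purely empirical learner.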
 
Article
Rupture of an intracranial aneurysm is the leading cause of subarachnoid hemorrhage; among cerebrovascular events it is second only to cerebral thrombosis and hypertensive cerebral hemorrhage, and its mortality rate is very high. MRI technology plays an irreplaceable role in the early detection and diagnosis of intracranial aneurysms and supports evaluating the size and structure of aneurysms. The growing number of aneurysm images can create a massive workload for doctors, which increases the likelihood of misdiagnosis. Therefore, we propose a simple and effective comprehensive residual attention network (CRANet) to improve the accuracy of aneurysm detection, using a residual network to extract the features of an aneurysm. Extensive experiments show that the proposed CRANet model can detect aneurysms effectively. In addition, on the test set, the accuracy and recall rates reached 97.81% and 94%, respectively, which significantly improved the detection rate of aneurysms.
 
Article
Background Visceral Leishmaniasis (VL) is a fatal vector-borne parasitic disorder occurring mainly in tropical and subtropical regions. VL falls under the category of neglected tropical diseases, with growing drug resistance and no licensed vaccine. Conventional vaccine synthesis techniques are often very laborious and challenging. With the advancement of bioinformatics and its application in immunology, it is now more convenient to design multi-epitope vaccines comprising predicted immuno-dominant epitopes of multiple antigenic proteins. We have chosen four antigenic proteins of Leishmania donovani and identified their T-cell and B-cell epitopes, utilizing those for in-silico chimeric vaccine design. The various physicochemical characteristics of the vaccine have been explored, and the tertiary structure of the chimeric construct has been predicted in order to perform docking studies and molecular dynamics simulations. Results The vaccine construct is generated by joining the epitopes with specific linkers. The predicted tertiary structure of the vaccine has been found to be valid, and docking studies reveal that the construct shows a high affinity towards the TLR-4 receptor. Population coverage analysis shows the vaccine can be effective in the majority of the world population. In-silico immune simulation studies confirm that the vaccine raises a pro-inflammatory response with the proliferation of activated T and B cells. In-silico codon optimization and cloning of the vaccine nucleic acid sequence have also been achieved in the pET28a vector. Conclusion The above bioinformatics data support that the construct may act as a potential vaccine. Further wet lab synthesis of the vaccine and in vivo work in animal models must be undertaken to confirm vaccine potency.
 
Article
Background Applying directed acyclic graph (DAG) models to proteogenomic data has been shown to be effective for detecting causal biomarkers of complex diseases. However, there remain unsolved challenges in DAG learning to jointly model binary clinical outcome variables and continuous biomarker measurements. Results In this paper, we propose a new tool, DAGBagM, to learn DAGs with both continuous and binary nodes. By using appropriate models, DAGBagM allows for either continuous or binary nodes to be parent or child nodes. It employs a bootstrap aggregating strategy to reduce false positives in edge inference. At the same time, the aggregation procedure provides a flexible framework to robustly incorporate prior information on edges. Conclusions Through extensive simulation experiments, we demonstrate that DAGBagM has superior performance compared to alternative strategies for modeling mixed types of nodes. In addition, DAGBagM is computationally more efficient than two competing methods. When applying DAGBagM to proteogenomic datasets from ovarian cancer studies, we identify potential protein biomarkers for platinum refractory/resistant response in ovarian cancer. DAGBagM is made available as a GitHub repository at https://github.com/jie108/dagbagM .
 
Performance comparison of modified inverse-normal, inverse-normal and fused inverse-normal methods. Plots of receiver operating characteristics (ROC) curves averaged over 100 trials for each simulation setting for all three methods. Simulation settings are represented by rows (from top to bottom): corresponding to low (σ = 0.15) and high (σ = 0.5) inter-study variability and columns (from left to right): corresponding to 3 (S = 3) and 5 studies (S = 5) combined. The black, blue, and red ROC curves represent the modified inverse-normal (MIN), inverse-normal (IN) and fused inverse-normal (FIN) methods respectively
Characteristics of modified inverse-normal method. a False discovery rates (FDR) for modified inverse-normal (MIN) method for all simulation settings. b Proportion of true positives (TPs) among unique differentially expressed genes (DEGs) identified by MIN method as compared to inverse-normal (IN) method. c Proportion of truly unique DEGs (MIN) with the observed effective direction of expression as the true direction of expression
Characteristics of fused inverse-normal method. a False discovery rates (FDR) for fused inverse-normal (FIN) method for all simulation settings. b Proportion of true-positives (TPs) among unique differentially expressed genes (DEGs) identified by FIN method as compared to inverse-normal method. c Proportion of truly unique DEGs (FIN) with the observed effective direction of expression as the true direction of expression
Comparison of results from meta-analysis methods. a Histograms of raw p-values obtained from per-study differential analysis of GSE123892 and GSE151352 and TCGA-GBM datasets used in real data application. b Venn diagram of the differentially expressed genes (DEGs) identified using inverse-normal (IN), modified inverse-normal (MIN) and fused inverse-normal (FIN) methods
Significant pathways identified by IPA. The top ten significant pathways based on Benjamini Hochberg (BH) p-value among the canonical pathways identified by Ingenuity Pathway Analysis (IPA) for the up-regulated differentially expressed genes (DEGs) (orange bar) and down-regulated DEGs (green bar). The numbers on the bar plot show the ratio between the numbers of DEGs enriched and total number of genes in each of these pathways
Article
Background The application of next-generation sequencing technologies to transcriptomics (RNA-seq) for gene expression profiling has found widespread use in studying different biological conditions, including cancers. However, RNA-seq experiments are still small-sample-size experiments because of their cost. Recently, there has been an increased focus on meta-analysis methods for integrated differential expression analysis to explore potential biomarkers. In this study, we propose a p-value combination method for meta-analysis of multiple independent but related RNA-seq studies that accounts for the sample size of each study and the direction of expression of genes in individual studies. Results The proposed method generalizes the inverse-normal method without an increase in statistical or computational complexity and does not pre- or post-hoc filter genes that have conflicting direction of expression in different studies. Thus, the proposed method, as compared to the inverse-normal, has better potential for the discovery of differentially expressed genes (DEGs) with potentially conflicting differential signals from multiple studies related to disease. We demonstrated the use of the proposed method in detection of biologically relevant DEGs in glioblastoma (GBM), the most aggressive brain cancer. Our approach notably enabled the identification of over-expressed tumour suppressor gene RAD51 in GBM compared to healthy controls, which has recently been shown to be a target for inhibition to enhance radiosensitivity of GBM cells during treatment. Pathway analysis identified multiple aberrant GBM related pathways as well as novel regulators such as TCF7L2 and MAPT as important upstream regulators in GBM. Conclusions The proposed meta-analysis method generalizes the existing inverse-normal method by providing a way to establish differential expression status for genes with conflicting direction of expression in individual RNA-seq studies. 
This enables further exploration of these genes as potential biomarkers for the disease.
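For orientation, the classical inverse-normal (weighted Stouffer) combination that the abstract takes as its starting point can be sketched as follows. This shows only the standard method with sample-size weights, not the modified or fused variants proposed by the authors; the function name and the square-root weighting are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def inverse_normal_combine(p_values, sample_sizes):
    """Combine one-sided p-values from independent studies.

    Each p-value is mapped to a z-score, the z-scores are averaged with
    weights proportional to sqrt(sample size), and the combined z-score
    is mapped back to a p-value. P-values must lie strictly between 0
    and 1, otherwise NormalDist.inv_cdf raises StatisticsError.
    """
    if len(p_values) != len(sample_sizes) or not p_values:
        raise ValueError("need one sample size per p-value")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    total = sum(sample_sizes)
    weights = [sqrt(n / total) for n in sample_sizes]
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Dividing by sqrt(sum of squared weights) keeps the combined
    # statistic standard normal under the null hypothesis.
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - std_normal.cdf(combined_z)
```

Because one-sided p-values are combined, a gene that is up-regulated in one study and down-regulated in another yields z-scores of opposite sign that cancel, which is the situation the proposed method is designed to handle.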
 
Article
Motivation Aberrant DNA methylation in transcription factor binding sites has been shown to lead to anomalous gene regulation that is strongly associated with human disease. However, the majority of methylation-sensitive positions within transcription factor binding sites remain unknown. Here we introduce SEMplMe, a computational tool to generate predictions of the effect of methylation on transcription factor binding strength in every position within a transcription factor’s motif. Results SEMplMe uses ChIP-seq and whole genome bisulfite sequencing to predict effects of methylation within binding sites. SEMplMe validates known methylation sensitive and insensitive positions within a binding motif, identifies cell type specific transcription factor binding driven by methylation, and outperforms SELEX-based predictions for CTCF. These predictions can be used to identify aberrant sites of DNA methylation contributing to human disease. Availability and Implementation SEMplMe is available from https://github.com/Boyle-Lab/SEMplMe.
 
Workflow of Cogito. After preparation and aggregation of the input data (tracks) on gene level, Cogito summarizes and compares all provided data columns for single tracks and groups of tracks and creates a comprehensive output report
Cogito base output for King et al.’s murine dataset tracks and subgroups of tracks. a ChIP-seq peak score visualization of a single track (interval attribute). b Methylation status plot for a single track (ordinal attribute). c ChIP-seq score overview for replicate wildtype samples (condition J1). d Barplot depiction of RRBS replicates for condition J1. e Methylation status plot per track, grouped by condition. f ChIP-seq scores per track, color-coded by condition
Advanced Cogito output graphics for pairwise comparisons in King et al.’s dataset. a Comparison plot for the gene expression of two tracks. b Correspondence visualization of the methylation status of one track and the gene expression of another track. c Correlation heatmap of the methylation status of two tracks: the lighter the color is, the higher is the quantity of genes which have the corresponding methylation status
Overview correlation heatmap for the full murine sample set of King et al. A high-level visualization of pairwise comparisons of all samples contained in the murine example dataset presents rich information density in one heatmap, and emphasizes possible connections
Article
Background Genetic and epigenetic biological studies often combine different types of experiments and multiple conditions. While the corresponding raw and processed data are made available through specialized public databases, the processed files are usually limited to a specific research question. Hence, they are unsuitable for an unbiased, systematic overview of a complex dataset. However, the possible combinations of different sample types and conditions grow exponentially with the number of sample types and conditions. Therefore, the risk of missing a correlation or overrating an identified correlation should be mitigated in a complex dataset. Since reanalysis of a full study is rarely a viable option, new methods are needed to address these issues systematically, reliably, reproducibly and efficiently. Results Cogito “COmpare annotated Genomic Intervals TOol” provides a workflow for an unbiased, structured overview and systematic analysis of complex genomic datasets consisting of different data types (e.g. RNA-seq, ChIP-seq) and conditions. Cogito is able to visualize valuable key information of genomic or epigenomic interval-based data, thereby providing a straightforward analysis approach for comparing different conditions. It supports getting an unbiased impression of a dataset and developing an appropriate analysis strategy for it. In addition to a text-based report, Cogito offers a fully customizable report as a starting point for further in-depth investigation. Conclusions Cogito implements a novel approach to facilitate high-level overview analyses of complex datasets, and offers additional insights into the data without the need for a full, time-consuming reanalysis. The R/Bioconductor package is freely available at https://bioconductor.org/packages/release/bioc/html/Cogito.html; comprehensive documentation with detailed descriptions and reproducible examples is included.
 
Example of real-data genotype-imputation accuracy-measures for the telomere region on chromosome 9. Top left: hiQ, top right: info, centre left: Iamchance, centre right: IamHWE, bottom left: $$\text{r}_{\text{Beagle}}^{2}$$, bottom right: $$\text{r}_{\text{MACH}}^{2}$$; each dot represents one imputed marker; the marker size is according to minor allele frequency (MAF); threshold values are freely selectable; vertical lines: centre of region classified as: “cold”, “tepid”, “hot”, or “very hot” (the definition is given in the Additional file 1).
Article
Background ImputAccur is a software tool to measure genotype-imputation accuracy. Imputation of untyped markers is a standard approach in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, high accuracy for imputed genotypes is fundamental. Several accuracy measures have been proposed, but unfortunately, they are implemented on different platforms, which is impractical. Results With ImputAccur, the accuracy measures info, Iam-hiQ and r²-based indices can be derived from standard output files of imputation software. Sample/probe and marker filtering is possible. This allows, e.g., accurate marker filtering ahead of data analysis. Conclusions The source code (Python version 3.9.4), a standalone executable file, and example data for ImputAccur are freely available at https://gitlab.gwdg.de/kolja.thormann1/imputationquality.git.
 
Article
Background Essential proteins have been shown to exert vital functions in cellular processes and are indispensable for the survival and reproduction of the organism. Traditional centrality methods perform poorly on complex protein–protein interaction (PPI) networks. Machine learning approaches based on high-throughput data do not exploit the temporal and spatial dimensions of biological information. Results We put forward a deep learning framework to predict essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles. In our model, the node2vec method is applied to learn continuous feature representations for proteins in the PPI network, which capture the diversity of connectivity patterns in the network. The concept of depthwise separable convolution is employed on gene expression profiles to extract properties and observe the trends of gene expression over time under different experimental conditions. Subcellular localization information is mapped into a long one-dimensional vector to capture its characteristics. Additionally, we use a sampling method to mitigate the impact of imbalanced learning when training the model. In experiments carried out on data from Saccharomyces cerevisiae, our model outperforms traditional centrality methods and machine learning methods. Likewise, the comparative experiments show that our way of processing the various types of biological information is preferable. Conclusions Our proposed deep learning framework effectively identifies essential proteins by integrating multiple biological data sources, showing that a broader selection of subcellular localization information significantly improves prediction results and that depthwise separable convolution applied to gene expression profiles enhances performance.
 
Our proposed multi-granularity multi-scaled SAN model for DTI prediction
Our proposed multi-scaled SAN block
Results of DeepDTA [6] model on the KIBA dataset with different multi-granularity representations as inputs. These multi-granularity representations are encoded by BPE algorithm with different threshold T. Here, $$T_d$$ is the threshold T for drug segmentation and $$T_p$$ is the threshold T for protein segmentation
Results of DeepDTA [6] model on the Davis dataset with different multi-granularity representations as inputs. These multi-granularity representations are encoded by BPE algorithm with different threshold T. Here, $$T_d$$ is the threshold T for drug segmentation and $$T_p$$ is the threshold T for protein segmentation
Article
Background Drug–target interaction (DTI) prediction plays a crucial role in drug discovery. Although advanced deep learning has shown promising results in predicting DTIs, it still needs improvement in two aspects: (1) the encoding method, in which the existing character encoding overlooks the chemical textual information of multi-character atoms and chemical functional groups; and (2) the architecture of the deep model, which should focus on multiple chemical patterns in drug and target representations. Results In this paper, we propose a multi-granularity multi-scaled self-attention (SAN) model to alleviate the above problems. Specifically, in the encoding process, we investigate a segmentation method for drug and protein sequences and then label the segmented groups as the multi-granularity representations. Moreover, in order to enhance the various local patterns in these multi-granularity representations, a multi-scaled SAN is built and exploited to generate deep representations of drugs and targets. Finally, our proposed model predicts DTIs based on the fusion of these deep representations. Our proposed model is evaluated on two benchmark datasets, KIBA and Davis. The experimental results reveal that our proposed model yields better prediction accuracy than strong baseline models. Conclusion Our proposed multi-granularity encoding method and multi-scaled SAN model improve DTI prediction by encoding the chemical textual information of drugs and targets and extracting their various local patterns, respectively.
 
Mechanism of functioning of DIRs interacting with mono- or bis-quinone methide intermediates. A In lignan-forming DIRs, for example FiDIR1 as a (−)-pinoresinol-forming DP and AtDIR6 as a (+)-pinoresinol-forming DP. B In the terpenoid-forming DIR GhDIR4 and C in a pterocarpan-forming DIR such as GePTS1 (adapted from [10]). FiDIR1, Forsythia intermedia (−)-pinoresinol–forming DIR; AtDIR6, A. thaliana (+)-pinoresinol–forming DIR; GhDIR4, Gossypium hirsutum gossypol–forming DIR; GePTS1, Glycyrrhiza echinata pterocarpan synthase 1
HMM Dirigent domain profile of Pfam PF03018 DIR family from available sequenced genomes
Phylogenetic tree constructed by Seaview server according to the Neighbor-Joining method, corrected by ML method. In green, branches linked to the 8 best characterized plant DIRs (PTS1 and DRR206 from Pisum sativum and FiDIR1 from Forsythia intermedia are from family DIR-a1. AtDIR5 and AtDIR6 from Arabidopsis thaliana are from family DIR-a2. GmDRR1 from Glycine max, GePTS1 from Glycyrrhiza echinata and GhDIR4 from Gossypium hirsutum are from family DIR-b/d. AtDIR10 from A. thaliana is from family DIR-e); in black, branches connecting the 49 bacterial DIRLs; groups of similar DIRLs are boxed and numbered in 5 groups, DIRL I to V. Branch distance scale is indicated. Bacteria that do not belong to families I to V and are not clustered are not boxed
3D model of AtDIR6 (5LAL) and 3D model predictions of DPLs representative of family I to V. A superposition of the 6 models is presented far right
Schematic representation of part of the genome of Streptomyces formicae, annotated manually in order to obtain genomic information around the genes encoding a potential DIRL
Article
Background DIRs are mysterious proteins that have the ability to scavenge free radicals, which are highly reactive with molecules in their vicinity. Even more fascinating, they use these highly unstable species to carry out a selective (i.e., stereoenantioselective) reaction that converts a well-defined substrate into a very precise product. Unfortunately, to date, only three products have been demonstrated in studies of DIRs from plants, which until now was the only kingdom in which these proteins had been demonstrated. Within this kingdom, each DIR protein has its own type of substrate. The products identified to date nevertheless have a strong economic impact: in agriculture, for example, the DIRs of cotton are involved in the biosynthesis of (+)-gossypol (a repellent antifeedant produced by the cotton plant). In forsythia plant species, the biosynthesis of (−)-pinoresinol, an intermediate leading to the synthesis of podophyllotoxin (a powerful anticancer agent), has been revealed. Recently, a promising line of study with potentially strong impact has emerged from the hypothesis that DIR proteins exist within the genomes of prokaryotes. Working with this type of organism is an undeniable advantage, since many sequenced genomes are available and the molecular tools are already developed. Microbes, being of less complex composition, are also easier to work with and offer many opportunities for laboratory studies. In addition, the diversity of their environments (e.g., soil, aquatic environments, and extreme conditions of pH, temperature or pressure) makes them very diverse subjects of study. Identifying new DIR proteins from bacteria means identifying new substrate or product molecules from these organisms. 
This holds the promise of a deeper understanding of the mechanism of action of these proteins, and will most likely have a strong impact in the fields of agricultural, pharmaceutical and/or food chemistry. Results Our goal is to obtain as much information as possible about these proteins to unlock the secrets of their exceptional functioning. Analyses of structural and functional genomic data led to the identification of the Pfam PF03018 domain as characteristic of DIR proteins. This domain has further been identified in the sequences of bacterial proteins, which are therefore named DIR-like (DIRL). We have chosen a multidisciplinary bioinformatic approach centered on bacterial genome identification, gene expression and regulation signals, protein structures, and their molecular information content. The objective of this study was to perform a thorough bioinformatic analysis of these DIRLs to highlight any information leading to the selection of candidate bacteria for further cloning, purification, and characterization of bacterial DIRs. Conclusions From studies of DIRL gene identification, primary structures, predicted secondary and tertiary structures, predicted DIRL signal sequences, and gene organization and potential regulation, a list of primary bacterial candidates is proposed.
 
Article
Background Cervical cancer is the fourth most common cancer affecting women and is caused by human papillomavirus (HPV) infections that are sexually transmitted. There are currently commercially available prophylactic vaccines that have been shown to protect vaccinated individuals against HPV infections; however, these vaccines have no therapeutic effect for those who are already infected with the virus. The current study’s aim was to use immunoinformatics to develop a multi-epitope vaccine with therapeutic potential against cervical cancer. Results In this study, T-cell epitopes from E5 and E7 proteins of HPV16/18 were predicted. These epitopes were evaluated and chosen based on their antigenicity, allergenicity, toxicity, and induction of IFN-γ production (only in helper T lymphocytes). Then, the selected epitopes were sequentially linked by appropriate linkers. In addition, a C-terminal fragment of Mycobacterium tuberculosis heat shock protein 70 (HSP70) was used as an adjuvant for the vaccine construct. The physicochemical parameters of the vaccine construct were acceptable. Furthermore, the vaccine was soluble, highly antigenic, and non-allergenic. The vaccine’s 3D model was predicted, and the structural improvement after refinement was confirmed using the Ramachandran plot and ProSA-web. The vaccine’s B-cell epitopes were predicted. Molecular docking analysis showed that the vaccine's refined 3D model had a strong interaction with the Toll-like receptor 4. The structural stability of the vaccine construct was confirmed by molecular dynamics simulation. Codon adaptation was performed in order to achieve efficient vaccine expression in Escherichia coli strain K12 (E. coli). Subsequently, in silico cloning of the multi-epitope vaccine into the pET-28a(+) expression vector was conducted. Conclusions According to the results of bioinformatics analyses, the multi-epitope vaccine is structurally stable, as well as a non-allergenic and non-toxic antigen. 
However, in vitro and in vivo studies are needed to validate the vaccine’s efficacy and safety. If satisfactory results are obtained from in vitro and in vivo studies, the vaccine designed in this study may be effective as a therapeutic vaccine against cervical cancer.
 
Expression level of the APOBEC3B gene in different tumors and pathological stages. a The mRNA expression of the APOBEC3B in different tumor subtypes and adjacent normal tissues. **P < 0.01; ***P < 0.001. b The mRNA expression of the APOBEC3B in ACC, LAML, LGG, OV, SKCM, TGCT, THYM, and UCS tumor types in the TCGA dataset, corresponding normal tissues of the GTEx database were included as controls. ***P < 0.001. c The protein levels of APOBEC3B between normal tissues and primary tissues of breast cancers, clear cell RCC, hepatocellular carcinomas, PAAD, UCEC, HNSC, and LUAD, based on the CPTAC dataset. **P < 0.01. d The relationship between APOBEC3B expression and clinical stages in ACC, BLCA, CHOL, KIRC, KIRP, LIHC, OV, and THCA tumor types
Correlation between APOBEC3B gene expression and survival prognosis for patients with cancers in TCGA. We performed overall survival (OS) (a) and disease-free survival (DFS) (b) analyses of different tumors in TCGA according to APOBEC3B gene expression. The survival maps and Kaplan–Meier curves with positive results were provided
Mutation feature of APOBEC3B in different tumors listed in TCGA. a, b The mutation features of APOBEC3B for the TCGA-listed tumors using the cBioPortal tool. The alteration frequencies with mutation type (a) and mutation site (b) were displayed. c The highest alteration frequency of mutation sites (R114, R306, R355) of APOBEC3B were shown in the 3D structure. d The potential correlation between mutation status and disease-specific, overall and progression-free survival in patients with SKCM
Correlation analysis between APOBEC3B expression and immune infiltration of cancer-associated fibroblasts and CD8⁺ T-cells. a Heatmap showing the correlation between the expression level of APOBEC3B and the infiltration level of cancer-associated fibroblasts/CD8⁺ T-cells obtained by different algorithms. b The correlation between APOBEC3B expression and infiltration of cancer-associated fibroblasts in some TCGA-listed tumors
APOBEC3B-related gene enrichment analysis. a The APOBEC3B-binding proteins identified using the STRING tool. b The correlation between the expression of APOBEC3B and the top 5 genes co-expressed with APOBEC3B (TK1, MELK, CEP55, MCM2, and NCAPH). c Heatmap showing the correlation between the expression of APOBEC3B and the top 5 genes co-expressed with APOBEC3B (TK1, MELK, CEP55, MCM2, and NCAPH) in the detailed cancer types. d Venn diagram showing the intersection analysis of the APOBEC3B-binding and correlated genes. e Bubble chart of KEGG pathway analysis and GO enrichment analysis based on the APOBEC3B-binding and interacting genes
Article
Although recent cell and animal experiments indicate that expression of the gene encoding apolipoprotein B mRNA editing enzyme catalytic subunit 3B (APOBEC3B) is closely related to cancer, a pan-cancer analysis is still lacking. Here we analyzed the potential carcinogenic role of APOBEC3B in 33 tumors based on The Cancer Genome Atlas (TCGA). APOBEC3B was highly expressed in most tumors and weakly expressed in a few. Differences in expression level were significantly correlated with the pathological tumor stage and prognosis of affected patients. The high-frequency APOBEC3B changes were principally mutations and amplifications in some tumors, such as uterine corpus endometrial carcinomas or cutaneous melanomas. In testicular germ cell tumors and invasive breast carcinomas, APOBEC3B expression and CD8⁺ T lymphocyte counts were correlated. In other cancers, such as human papilloma virus (HPV)-related head and neck squamous cell carcinomas or esophageal adenocarcinomas, there was also cancer-associated fibroblast infiltration. The APOBEC3B enzyme acts in the mitochondrial respiratory electron transport chain and in oxidative phosphorylation. This first pan-cancer study provides a comprehensive understanding of the multiple roles of APOBEC3B in different tumor types.
 
Simulation results for Binary cFDR and BL. Mean +/− standard error for the sensitivity, specificity and FDR of FDR values (derived from the Benjamini–Hochberg procedure) from Binary cFDR when iterating over independent (A; “simulation A”) and dependent (B; “simulation B” and C; “simulation C”) binary auxiliary data. BL refers to results when using Boca and Leek’s FDR regression to leverage the 5-dimensional covariate data. Iteration 0 corresponds to the original FDR values. Results were averaged across 100 simulations
of cFDR results from type 1 diabetes application. A FDR values (derived from the Benjamini–Hochberg procedure) before and after each iteration of cFDR, coloured by the auxiliary data values. B Manhattan plot of ($$-\log_{10}$$) FDR values (y-axis truncated to aid visualisation). Green points indicate the four lead variants that were newly FDR significant after cFDR. Black dashed line at FDR significance threshold ($$FDR=3.3\times 10^{-6}$$)
Article
Background Genome-wide association studies (GWAS) are limited in power to detect associations that exceed the stringent genome-wide significance threshold. This limitation can be alleviated by leveraging relevant auxiliary data, such as functional genomic data. Frameworks utilising the conditional false discovery rate (cFDR) have been developed for this purpose, and have been shown to increase power for GWAS discovery whilst controlling the false discovery rate (FDR). However, the methods are currently only applicable for continuous auxiliary data and cannot be used to leverage auxiliary data with a binary representation, such as whether SNPs are synonymous or non-synonymous, or whether they reside in regions of the genome with specific activity states. Results We describe an extension to the cFDR framework for binary auxiliary data, called “Binary cFDR”. We demonstrate FDR control of our method using detailed simulations, and show that Binary cFDR performs better than a comparator method in terms of sensitivity and FDR control. We introduce an all-encompassing user-oriented CRAN R package ( https://annahutch.github.io/fcfdr/ ; https://cran.r-project.org/web/packages/fcfdr/index.html ) and demonstrate its utility in an application to type 1 diabetes, where we identify additional genetic associations. Conclusions Our all-encompassing R package serves as a comprehensive toolkit to unite GWAS and functional genomic data in order to increase statistical power to detect genetic associations.
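The FDR values shown in the accompanying figures are described as derived from the Benjamini–Hochberg procedure. As a point of reference, a minimal sketch of Benjamini–Hochberg adjusted p-values is given below. It illustrates only the standard procedure, not Binary cFDR or the authors' R package; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the raw
    adjustment is p * m / r; a running minimum taken from the largest
    rank downwards makes the adjusted values monotone.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A hypothesis is then declared significant at FDR level q when its adjusted value is at most q.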
 
Article
Background Reference intervals represent the expected range of physiological test results in a healthy population and are essential to support medical decision making. Particularly in the context of pediatric reference intervals, where recruitment regulations make prospective studies challenging to conduct, indirect estimation strategies are becoming increasingly important. Established indirect methods enable robust identification of the distribution of “healthy” samples from laboratory databases, which include unlabeled pathologic cases, but are currently severely limited when adjusting for essential patient characteristics such as age. Here, we propose the use of mixture density networks (MDN) to overcome this problem and model all parameters of the mixture distribution in a single step. Results Estimated reference intervals from varying settings with simulated data demonstrate the ability to accurately estimate latent distributions from unlabeled data using different implementations of MDNs. Comparing the performance with alternative estimation approaches further highlights the importance of modeling the mixture component weights as a function of the input in order to avoid biased estimates for all other parameters and the resulting reference intervals. We also provide a strategy to generate partially customized starting weights to improve proper identification of the latent components. Finally, the application on real-world hemoglobin samples provides results in line with current gold standard approaches, but also suggests further investigations with respect to adequate regularization strategies in order to prevent overfitting the data. 
Conclusions Mixture density networks provide a promising approach capable of extracting the distribution of healthy samples from unlabeled laboratory databases while simultaneously and explicitly estimating all parameters and component weights as non-linear functions of the covariate(s), thereby allowing the estimation of age-dependent reference intervals in a single step. Further studies on model regularization and asymmetric component distributions are warranted to consolidate our findings and expand the scope of applications.
 
Article
Background Technical improvements in ATAC-seq make high-throughput profiling of the chromatin states of single cells possible. However, data from multiple sources frequently show strong technical variation, which is referred to as batch effects. To perform joint analysis across multiple datasets, specialized methods are required to remove technical variation between datasets while preserving biological information. Results Here we present an algorithm named epiConv to perform joint analyses on scATAC-seq datasets. We first show that epiConv better corrects batch effects and is less prone to over-fitting than existing methods on a collection of PBMC datasets. In a collection of mouse brain data, we show that epiConv is capable of aligning low-depth scATAC-seq from co-assay data (simultaneous profiling of transcriptome and chromatin) onto a high-quality ATAC-seq reference and increasing the resolution of chromatin profiles of co-assay data. Finally, we show that epiConv can be used to integrate cells from different biological conditions (T cells in normal vs. germ-free mouse; normal vs. malignant hematopoiesis), which reveals hidden cell populations that would otherwise be undetectable. Conclusions In this study, we introduce epiConv to integrate multiple scATAC-seq datasets and perform joint analysis on them. Through several case studies, we show that epiConv removes the batch effects and retains the biological signal. Moreover, joint analysis across multiple datasets improves the performance of clustering and differentially accessible peak calling, especially when the biological signal is weak in a single dataset.
 
A sentence with visualized events provided by BioNLP-ST 2013
The distribution of the event types on the MLEE corpus
The overall architecture of biomedical event extraction
An example of “Binding” type biomedical event
Article
Background Biomedical event extraction is a fundamental task in biomedical text mining, which provides inspiration for medicine research and disease prevention. Biomedical events include simple events and complex events. Existing biomedical event extraction methods usually deal with simple events and complex events uniformly, and the performance of complex event extraction is relatively low. Results In this paper, we propose a fine-grained Bidirectional Long Short Term Memory method for biomedical event extraction, which designs different argument detection models for simple and complex events respectively. In addition, multi-level attention is designed to improve the performance of complex event extraction, and sentence embeddings are integrated to obtain sentence level information which can resolve the ambiguities for some types of events. Our method achieves state-of-the-art performance on the commonly used dataset Multi-Level Event Extraction. Conclusions The sentence embeddings enrich the global sentence-level information. The fine-grained argument detection model improves the performance of complex biomedical event extraction. Furthermore, the multi-level attention mechanism enhances the interactions among relevant arguments. The experimental results demonstrate the effectiveness of the proposed method for biomedical event extraction.
 
Article
Background Ewing sarcoma (ES) is the second most common primary malignant bone tumor, mainly occurring in children, adolescents and young adults, with high metastasis and mortality. Autophagy has been reported to be involved in the survival of ES, but its role remains unclear. Therefore, it is necessary to investigate the prognostic value of autophagy related genes using bioinformatics methods. Results ATG2B , ATG10 and DAPK1 were the genes finally selected for a prognostic model. KM and risk score plots showed that patients in the high score group had better prognoses in both the training and validation sets. C-indexes of the model for the training and validation sets were 0.68 and 0.71, respectively. Calibration analyses indicated that the model had high prediction accuracy in the training and validation sets. The AUC values of ROC for 1-, 3-, 5-year prediction were 0.65, 0.73 and 0.84 in the training set and 0.88, 0.73 and 0.79 in the validation set, which suggested high prediction accuracy of the model. Decision curve analyses showed that patients could benefit from the model. Differential and functional analyses suggested that autophagy and apoptosis were upregulated in the high risk score group. Conclusions ATG2B , ATG10 and DAPK1 were autophagy related genes with potential protective function in ES. The prognostic model established from them exhibited excellent prediction accuracy and discriminatory capacity. They might be used as potential prognostic biomarkers and therapeutic targets in ES.
 
Article
Background Previous studies have demonstrated the value of re-analysing publicly available genetics data with recent analytical approaches. Publicly available datasets, such as the Women’s Health Initiative (WHI) offered by the database of genotypes and phenotypes (dbGaP), provide a rich resource for researchers to perform multiple analyses, including Genome-Wide Association Studies (GWAS). Often, the genetic information of individuals in these datasets is stored in imputed dosage files output by MaCH: mldose and mlinfo files. In order for researchers to perform GWAS with these data, the files must first be converted to a format compatible with their tool of choice, e.g., PLINK. Currently, there is no published tool which easily converts the datasets provided in MaCH dosage files into PLINK-ready files. Results Herein, we present Canary, a Singularity-based tool which converts MaCH dosage files into PLINK-compatible files with a single line of user input at the command line. Further, we provide a detailed tutorial on preparation of phenotype files. Moreover, Canary comes with preinstalled software often used during GWAS, to further increase the ease-of-use of HPC systems for researchers. Conclusions Until now, conversion of imputed data in the form of MaCH mldose and mlinfo files needed to be completed manually. Canary uses Singularity container technology to allow users to automatically convert these MaCH files into PLINK-compatible files. Additionally, Canary provides researchers with a platform to conduct GWAS analysis more easily, as it contains essential software needed for conducting GWAS, such as PLINK and Bioconductor. We hope that this tool will greatly increase the ease with which researchers can perform GWAS with imputed data, particularly in HPC environments.
 
A saturated genetic and environmental factor model for three traits
A genetic factor model for height and body mass index (BMI) observed at five different points in time (denoted by subscripts indicating waves 7, 8, ..., 11)
Typical MGREML estimate of a genetic correlation ($$\rho _G$$) matrix in Simulation 2. True genetic correlations ($$\rho _G$$’s) are shown above the diagonal. Estimated $$\rho _G$$’s (standard error between parentheses) are shown below the diagonal
Article
Background Heritability and genetic correlation can be estimated from genome-wide single-nucleotide polymorphism (SNP) data using various methods. We recently developed multivariate genomic-relatedness-based restricted maximum likelihood (MGREML) for statistically and computationally efficient estimation of SNP-based heritability ($$h^2_{\text{SNP}}$$) and genetic correlation ($$\rho _G$$) across many traits in large datasets. Here, we extend MGREML by allowing it to fit and perform tests on user-specified factor models, while preserving the low computational complexity. Results Using simulations, we show that MGREML yields consistent estimates and valid inferences for such factor models at low computational cost (e.g., for data on 50 traits and 20,000 individuals, a saturated model involving 50 $$h^2_{\text{SNP}}$$’s, 1225 $$\rho _G$$’s, and 50 fixed effects is estimated and compared to a restricted model in less than one hour on a single notebook with two 2.7 GHz cores and 16 GB of RAM). Using repeated measures of height and body mass index from the US Health and Retirement Study, we illustrate the ability of MGREML to estimate a factor model and test whether it fits the data better than a nested model. The MGREML tool, the simulation code, and an extensive tutorial are freely available at https://github.com/devlaming/mgreml/ . Conclusion MGREML can now be used to estimate multivariate factor structures and perform inferences on such factor models at low computational cost. This new feature enables simple structural equation modeling using MGREML, allowing researchers to specify, estimate, and compare genetic factor models of their choosing using SNP data.
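The parameter counts quoted for the saturated 50-trait model can be checked with a short calculation: one heritability per trait, and one genetic correlation per unordered pair of traits. The sketch below is only an illustration of that arithmetic; the function name is hypothetical, and the fixed-effect count assumes one fixed effect per trait, which matches the quoted figure of 50 but is not stated explicitly in the abstract.

```python
from math import comb


def saturated_model_counts(n_traits: int) -> dict:
    """Count the parameters of a saturated genetic model for n_traits traits.

    Assumption (not from the abstract): one fixed effect per trait.
    """
    return {
        "heritabilities": n_traits,                 # one h^2_SNP per trait
        "genetic_correlations": comb(n_traits, 2),  # one rho_G per trait pair
        "fixed_effects": n_traits,                  # assumed: one per trait
    }


counts = saturated_model_counts(50)
print(counts)
```

The number of correlations grows quadratically with the number of traits, which is why low computational complexity matters for models of this size.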
 
The relationship between the number of features (number of clusters) with Acc and Auc. a represents DLBCL data set, b represents leukemia data set, c represents prostate data set, and d represents ALL_4 data set
The distribution of biomarkers selected by the proposed method in positive and negative samples. a represents DLBCL data set, b represents leukemia data set, c represents prostate data set, and d represents ALL_4 data set
Heat map analysis of the features selected by the proposed method. a represents DLBCL data set, b represents leukemia data set, c represents prostate data set, and d represents ALL_4 data set
Partial dependency graph of the features selected by the proposed method on the DLBCL dataset
The overall framework of the proposed approach: The gene relationship data is obtained from GeneMANIA, the expression of each gene in positive and negative samples is embedded as node information, and the gene relationship data and Pearson correlation coefficient are embedded as edges after passing through a layer of softmax function. The graph neural networks’ information dissemination and aggregation process is carried out. The dependency relationship is predicted by the link prediction method, and spectral clustering is carried out to delete redundant features. The feature of each subgraph is evaluated, eight kinds of evaluators are used, the ranking information is aggregated by the robust ranking method, and the feature subset is finally output
Article
Background The discovery of critical biomarkers is significant for clinical diagnosis and for drug research and development. Researchers usually obtain biomarkers from microarray data, which suffer from the curse of dimensionality. Feature selection in machine learning is usually used to solve this problem. However, most methods do not fully consider feature dependence, especially the real pathway relationships of genes. Results Experimental results show that the proposed method is superior to classical algorithms and advanced methods in feature number and accuracy, and the selected features have more significance. Method This paper proposes a feature selection method based on a graph neural network. The proposed method uses the actual dependencies between features and the Pearson correlation coefficient to construct graph-structured data. The information dissemination and aggregation operations of the graph neural network are applied to fuse node information on the graph-structured data. The redundant features are clustered by the spectral clustering method. Then, a feature ranking aggregation model based on eight feature evaluation methods is applied to each cluster for feature selection. Conclusion The proposed method can effectively remove redundant features. The algorithm’s output has high stability and classification accuracy, and can potentially identify biomarkers.
 
Article
Background Protein complexes are essential for biologists to understand cell organization and function effectively. In recent years, predicting complexes from protein–protein interaction (PPI) networks through computational methods has become a research hotspot. Many methods for protein complex prediction have been proposed. However, how to use the information of known protein complexes is still a fundamental problem that needs to be solved urgently in predicting protein complexes. Results To address this problem, we propose a supervised learning method based on network representation learning and gene ontology knowledge, which can fully use the information of known protein complexes to predict new protein complexes. This method first constructs a weighted PPI network based on gene ontology knowledge and topology information, reducing the network's noise. On this basis, the topological information of known protein complexes is extracted as features, and the supervised learning model SVCC is trained on these features. The SVCC model is then used to predict candidate protein complexes from the protein interaction network. Next, we use the network representation learning method to obtain the vector representation of the protein complexes and train a random forest model. Finally, we use the random forest model to classify the candidate protein complexes to obtain the final predicted protein complexes. We evaluate the performance of the proposed method on two publicly available PPI data sets. Conclusions Experimental results show that our method can effectively improve the performance of protein complex recognition compared with existing methods. In addition, we analyze the biological significance of protein complexes predicted by our method and other methods. The results show that the protein complexes predicted by our method have high biological significance.
 
Flowchart of ISPIP Methodology: ISPIP’s classification models are generated through training on the interface likelihoods of the three input predictors. (Created with BioRender.com)
Enhanced prediction as ISPIP model evolves: (A) The PR curves of the 3 input methods indicate that PredUs 2.0 and ISPRED4 perform slightly better than DockPred. (B) All the ISPIP models significantly outperform the input predictors, and PR-AUC is boosted as the model evolves from simple linear regression to more complex ensemble decision tree algorithms
ISPIP consensus prediction of interface residues: On the left, the structure (1YPI.A) is shown. In the middle, the interface prediction of the 3 input classifiers is displayed. On the right, the ISPIP consensus prediction includes overlapping and unique TP residues of the input classifiers to yield an improved interface prediction of 19 TP out of the 23 annotated residues
ISPIP outperforms other structure-based classifiers and meta-predictors: The PR curves highlight ISPIP’s improved performance over a complex structure-based classifier (VORFFIP) and a previous meta-predictor (meta-PPISP)
ISPIP is robust to poor performance of input classifier: On the left, the structure (1CP2.A) is shown. In the middle, the interface prediction of the 3 input classifiers is displayed. PredUs 2.0 has an especially poor prediction relative to the other 2 input classifiers. On the right, ISPIP has a robust consensus prediction with 10 TP out of the 13 annotated residues, despite the poor performance of the PredUs 2.0 input classifier
Article
Background Identifying protein interfaces can inform how proteins interact with their binding partners, uncover the regulatory mechanisms that control biological functions and guide the development of novel therapeutic agents. A variety of computational approaches have been developed for predicting a protein’s interfacial residues from its known sequence and structure. Methods using the known three-dimensional structures of proteins can be template-based or template-free. Template-based methods have limited success in predicting interfaces when homologues with known complex structures are not available to use as templates. The prediction performance of template-free methods that rely only upon proteins’ intrinsic properties is limited by the number of biologically relevant features that can be included in an interface prediction model. Results We describe the development of an integrated method for protein interface prediction (ISPIP) to explore the hypothesis that the efficacy of a computational prediction method of protein binding sites can be enhanced by using a combination of methods that rely on orthogonal structure-based properties of a query protein, combining and balancing both template-free and template-based features. ISPIP is a method that integrates these approaches through simple linear or logistic regression models and more complex decision tree models. On a diverse test set of 156 query proteins, ISPIP outperforms each of its individual classifiers in identifying protein binding interfaces. Conclusions The integrated method captures the best performance of individual classifiers and delivers an improved interface prediction. The method is robust and performs well even when one of the individual classifiers performs poorly on a particular query protein. This work demonstrates that integrating orthogonal methods that depend on different structural properties of proteins performs better at interface prediction than any individual classifier alone.
 
Article
Background A large body of evidence from biological experiments has confirmed that miRNAs play an important role in the progression and development of various complex human diseases. However, traditional experimental methods are expensive and time-consuming. Therefore, developing more accurate and efficient methods for predicting potential associations between miRNAs and diseases is a challenging task. Results In this study, we developed a computational model that combines a heterogeneous graph convolutional network with an enhanced layer for miRNA–disease association prediction (HGCNELMDA). The major improvements of our method are that a random walk with restart optimizes the original features of nodes, and that a reinforcement layer added to the hidden layer of the graph convolutional network retains similar information between nodes in the feature space. In addition, the proposed approach recalculates the influence of neighborhood nodes on target nodes by introducing an attention mechanism. The reliable performance of HGCNELMDA was confirmed by an AUC of 93.47% in global leave-one-out cross-validation (LOOCV) and an average AUC of 93.01% in fivefold cross-validation. Meanwhile, we compared HGCNELMDA with state-of-the-art methods. Comparative results indicated that HGCNELMDA is very promising and may provide a cost-effective alternative for miRNA–disease association prediction. Moreover, we applied HGCNELMDA to three case studies to predict potential miRNAs related to lung cancer, prostate cancer, and pancreatic cancer. Results showed that 48, 50, and 50 of the top 50 predicted miRNAs were supported by experimental association evidence. Therefore, HGCNELMDA is a reliable method for predicting disease-related miRNAs. Conclusions The AUCs of the HGCNELMDA method in LOOCV and fivefold cross-validation were 93.47% and 93.01%, respectively. Compared with other typical methods, the performance of HGCNELMDA is higher. Three cases of lung cancer, prostate cancer, and pancreatic cancer were studied. Among the predicted top 50 candidate miRNAs, 48, 50, and 50 were verified in the biological database HMDD v2.0. This further confirms the feasibility and effectiveness of our method. To facilitate future research on disease-related miRNAs, we developed a freely available web server for HGCNELMDA at http://124.221.62.44:8080/HGCNELMDA.jsp.
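For readers unfamiliar with the random walk with restart mentioned in the abstract, the following is a minimal, generic sketch of that technique on a small undirected graph. It is not the HGCNELMDA implementation: the toy adjacency matrix, the restart probability of 0.5, and the function name are all assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walker that starts at
    `seed` and jumps back to it with probability `restart` at every step."""
    n = len(adj)
    # Column-normalise so that trans[i][j] is the probability of moving j -> i.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    trans = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    p0[seed] = 1.0
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(trans[i][j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        # Stop once the probability vector no longer changes.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical three-node path graph: 0 - 1 - 2
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = random_walk_with_restart(toy_adj, seed=0)
print(scores)
```

Nodes closer to the seed receive higher scores, so the resulting vector can serve as a proximity-aware feature for the seed node.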
 
Article
Background The advent of high throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, identifying many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they demonstrate many differences concerning their mutational profiles. Hence, there is no definitive treatment for most cancer types. This reveals the importance of developing new pipelines to identify cancer-associated genes accurately and re-classify patients with similar mutational profiles. Classification of cancer patients with similar mutational profiles may help discover subtypes of cancer patients who might benefit from specific treatment types. Results In this study, we propose a new machine learning pipeline to identify protein-coding genes mutated in many samples to identify cancer subtypes. We apply our pipeline to 12,270 samples collected from the international cancer genome consortium, covering 19 cancer types. As a result, we identify 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates distinguishable properties, including unique cancer-related signaling pathways. Conclusions This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. Additionally, we analyze the mutational signatures for samples in each subtype, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and the Axon guidance pathways, are frequently observed in cancer disease. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study on “gene-motif” suggests the importance of considering both the context of the mutations and mutational processes in identifying cancer-associated genes. 
The source codes for our proposed clustering pipeline and analysis are publicly available at: https://github.com/bcb-sut/Pan-Cancer.
 
Article
Background Probabilistic functional integrated networks (PFINs) are designed to aid our understanding of cellular biology and can be used to generate testable hypotheses about protein function. PFINs are generally created by scoring the quality of interaction datasets against a Gold Standard dataset, usually chosen from a separate high-quality data source, prior to their integration. Use of an external Gold Standard has several drawbacks, including data redundancy, data loss and the need for identifier mapping, which can complicate the network build and impact PFIN performance. Additionally, there are typically no Gold Standard data for non-model organisms. Results We describe the development of an integration technique, ssNet, that scores and integrates both high-throughput and low-throughput data from a single source database in a consistent manner without the need for an external Gold Standard dataset. Using data from Saccharomyces cerevisiae we show that ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias and ID mapping. In addition, ssNet results in less loss of data and produces a more complete network. Conclusions The ssNet method allows PFINs to be built successfully from a single database, while producing comparable network performance to networks scored using an external Gold Standard source and with reduced data loss.
 
Article
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC .
 
Mediation analysis of A a single mediator; B high dimensional mediators, plotted similarly to [3]. An arrow from $$X$$ to $$U$$ is possible though omitted to avoid the complexity in interpreting $$\alpha $$ as the total effect
Article
Mediation analysis plays a major role in identifying significant mediators in the pathway between environmental exposures and health outcomes. With advanced data collection technology for large-scale studies, there has been growing research interest in developing methodology for high-dimensional mediation analysis. In this paper we present HIMA2, an extension of the HIMA method (Zhang in Bioinformatics 32:3150–3154, 2016). First, the proposed HIMA2 reduces the dimension of mediators to a manageable level based on the sure independence screening (SIS) method (Fan in J R Stat Soc Ser B 70:849–911, 2008). Second, a de-biased Lasso procedure is implemented for estimating regression parameters. Third, we use a multiple-testing procedure to accurately control the false discovery rate (FDR) when testing high-dimensional mediation hypotheses. We demonstrate its practical performance using Monte Carlo simulation studies and apply our method to identify DNA methylation markers which mediate the pathway from smoking to reduced lung function in the Coronary Artery Risk Development in Young Adults (CARDIA) Study.
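The screening step can be illustrated with a generic sure independence screening sketch: rank candidate mediators by the absolute value of a marginal statistic and keep only the top few. The code below uses the marginal Pearson correlation with the exposure and a default cut-off of n / log(n); both are common conventions assumed here for illustration, and the statistic actually used by HIMA2 may differ. All names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def sis_screen(exposure, mediators, keep=None):
    """Keep the `keep` mediators most strongly correlated with the exposure.

    Assumption: the default cut-off is int(n / log(n)), a conventional choice.
    """
    n = len(exposure)
    if keep is None:
        keep = max(1, int(n / math.log(n)))
    scores = {name: abs(pearson(exposure, vals)) for name, vals in mediators.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical toy data: one exposure and three candidate mediators.
exposure = [1, 2, 3, 4, 5, 6]
mediators = {
    "m1": [2, 4, 6, 8, 10, 12],
    "m2": [1, -1, 1, -1, 1, -1],
    "m3": [6, 5, 4, 3, 2, 2],
}
selected = sis_screen(exposure, mediators, keep=2)
print(selected)
```

Screening of this kind only reduces the number of candidates; the de-biased Lasso and the multiple-testing step described in the abstract then operate on the retained mediators.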
 
The first two rows of the figure show illustrative examples of RxRx19a [25] (a) and RxRx1 [26] datasets (b). The third row (c) presents representative examples of Style-GAN generated images for the RxRx19a [25] dataset
Overview of the GAN-DL self-supervised representation learning framework, whose pretext task consists in the adversarial game between the generator and the discriminator of the backbone StyleGAN2 (a). The discriminator’s features are exploited for several downstream tasks (b): (1) Controls classification - classification of active and inactive compounds against SARS-CoV-2 in two different cell models; (2) Dose-response modelling - disease-associated profiling from raw microscopy images; (3) Cell models classification - zero-shot representation learning classification task consisting in categorizing four different cell types
The left column of the figure shows the scatter plots of GAN-DL’s embedding of the RxRx19a [25] dataset projected onto the $$E^2$$ (a) and $$C^2$$ (c) axes. The right column shows the baseline embeddings of the RxRx19a [25] dataset projected onto the $$E^2$$ (b) and $$C^2$$ (d) axes
Drug effectiveness as a function of concentration, obtained using our GAN-DL (a) and the baseline embedding (b)
Confusion matrix of the cell classification task on the RxRx1 [26] dataset
Article
Motivation Computer-aided analysis of biological images typically requires extensive training on large-scale annotated datasets, which is not viable in many situations. In this paper, we present Generative Adversarial Network Discriminator Learner (GAN-DL), a novel self-supervised learning paradigm based on the StyleGAN2 architecture, which we employ for self-supervised image representation learning in the case of fluorescent biological images. Results We show that Wasserstein Generative Adversarial Networks enable high-throughput compound screening based on raw images. We demonstrate this by classifying active and inactive compounds tested for the inhibition of SARS-CoV-2 infection in two different cell models: the primary human renal cortical epithelial cells (HRCE) and the African green monkey kidney epithelial cells (VERO). In contrast to previous methods, our deep learning-based approach does not require any annotation, and can also be used to solve subtle tasks it was not specifically trained on, in a self-supervised manner. For example, it can effectively derive a dose-response curve for the tested treatments. Availability and implementation Our code and embeddings are available at https://gitlab.com/AlesioRFM/gan-dl . StyleGAN2 is available at https://github.com/NVlabs/stylegan2 .
 
The example of FASTQ format
The Structure of the random access and index
The averaged compression rates of five compressors over 11 datasets
Article
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality scores is a key and difficult aspect of file compression. Throughout the literature, we found that the majority of current quality score compression methods do not support random access. It is therefore reasonable to investigate a lossless quality score compressor with a high compression rate, fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive compressor with random access support for lossless compression of quality score sequences. CMIC is an acronym of its four steps: classification, mapping, indexing, and compression. The experimental results show that our compressor has good performance in terms of compression rates on all the tested datasets. The file sizes are reduced by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors in most of the tested cases. In terms of random access speed, CMIC is faster than LCQS, which also provides a random access function for compressed quality scores. Conclusions CMIC is a compressor especially designed for quality score sequences, with good performance in terms of compression rate, compression speed, decompression speed, and random access speed. CMIC can be obtained at https://github.com/Humonex/Cmic .
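To make the idea of random access into compressed quality scores concrete, here is a minimal sketch of block-wise compression with an index. It is not CMIC's algorithm: the general-purpose `zlib` codec stands in for the compressor, and the block size, index layout, and function names are hypothetical.

```python
import zlib


def compress_with_index(lines, block_size=4):
    """Compress quality-score lines in independent blocks.

    Returns the concatenated payload and an index of
    (byte offset, compressed length) per block.
    """
    blocks, index, offset = [], [], 0
    for start in range(0, len(lines), block_size):
        raw = "\n".join(lines[start:start + block_size]).encode("ascii")
        comp = zlib.compress(raw, 9)
        index.append((offset, len(comp)))
        blocks.append(comp)
        offset += len(comp)
    return b"".join(blocks), index


def read_line(payload, index, line_no, block_size=4):
    """Decompress only the block that contains `line_no` and return that line."""
    offset, length = index[line_no // block_size]
    raw = zlib.decompress(payload[offset:offset + length])
    return raw.decode("ascii").split("\n")[line_no % block_size]


# Hypothetical quality-score lines (FASTQ line 4 of each record).
quality = ["IIIIHHHG", "FFFF#FFF", "IIII9III", "HHHHHHHH", "####IIII", "GGGGFFFF"]
payload, index = compress_with_index(quality)
print(read_line(payload, index, 4))
```

Smaller blocks give faster random access but a lower compression rate, because each block is compressed without context from its neighbours.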
 
Venn diagrams presenting a the overlap of genes associated with each disease, b the overlap of GO Biological Process terms enriched in genes associated with each disease, c the overlap of GO Biological Process terms enriched in genes associated with each disease after collective filtering by orsum
Top 20 representative terms and the quartiles their ranks belong to according to each input enrichment result
Top 20 representative terms and the number of terms they represent
Numbers of representative terms resulting from orsum and REVIGO applied to the enrichment analyses of artificially generated gene lists. Each point corresponds to an enrichment analysis result obtained for one of the 100 artificially generated gene lists. The size and color of a point indicates the number of terms in the original enrichment analysis result. The red line shows the coordinates where the representative term numbers in orsum and REVIGO are equal
Article
Background Enrichment analyses are widely applied to investigate lists of genes of interest. However, such analyses often result in long lists of annotation terms with high redundancy, making the interpretation and reporting difficult. Long annotation lists and redundancy also complicate the comparison of results obtained from different enrichment analyses. An approach to overcome these issues is using down-sized annotation collections composed of non-redundant terms. However, down-sized collections are generic and the level of detail may not fit the user’s study. Other available approaches include clustering and filtering tools, which are based on similarity measures and thresholds that can be complicated to comprehend and set. Result We propose orsum, a Python package to filter enrichment results. orsum can filter multiple enrichment results collectively and highlight common and specific annotation terms. Filtering in orsum is based on a simple principle: a term is discarded if there is a more significant term that annotates at least the same genes; the remaining more significant term becomes the representative term for the discarded term. This principle ensures that the main biological information is preserved in the filtered results while reducing redundancy. In addition, as the representative terms are selected from the original enrichment results, orsum outputs filtered terms tailored to the study. As a use case, we applied orsum to the enrichment analyses of four lists of genes, each associated with a neurodegenerative disease. Conclusion orsum provides a comprehensible and effective way of filtering and comparing enrichment results. It is available at https://anaconda.org/bioconda/orsum .
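The filtering principle stated in the abstract can be illustrated with a minimal Python sketch. It is not the orsum implementation: the function name `filter_terms`, the input layout (a list of term/gene-set pairs sorted from most to least significant), and the GO-style identifiers are assumptions made for illustration only.

```python
def filter_terms(ranked_terms):
    """Apply the filtering principle described in the abstract.

    ranked_terms: list of (term_id, genes) pairs, most significant first
                  (hypothetical input layout, not orsum's file format).
    Returns a dict mapping each representative term to the list of
    terms it represents (the discarded, less significant terms).
    """
    represented = {}  # representative term -> terms it stands for
    gene_sets = {}    # representative term -> its annotated genes

    for term, genes in ranked_terms:
        genes = frozenset(genes)
        # Every representative seen so far is more significant than `term`.
        for rep in represented:
            # Discard `term` if a more significant term annotates
            # at least the same genes (superset or equal set).
            if genes <= gene_sets[rep]:
                represented[rep].append(term)
                break
        else:
            # No more significant term covers it: keep it as a representative.
            represented[term] = []
            gene_sets[term] = genes
    return represented


ranked = [
    ("GO:A", {"g1", "g2", "g3"}),
    ("GO:B", {"g1", "g2"}),        # covered by the more significant GO:A
    ("GO:C", {"g4"}),
    ("GO:D", {"g3", "g4"}),        # covers GO:C but is less significant
]
result = filter_terms(ranked)
print(result)
```

In this toy input, GO:B is discarded in favour of GO:A, while GO:C is kept even though GO:D annotates a superset of its genes, because GO:D is less significant. Checking only the retained representatives is sufficient: any term covered by a discarded term is also covered by that term's representative.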
 
Article
Background With the widespread availability of microarray technology for epigenetic research, methods for calling differentially methylated probes or differentially methylated regions have become effective tools to analyze this type of data. Furthermore, visualization is usually employed for quality checks of results and for further insights. Expert knowledge is required to leverage the capabilities of these methods. To overcome this limitation and make visualization in epigenetic research available to the public, we designed EpiVisR. Results The EpiVisR tool allows users to select and visualize combinations of traits (e.g., concentrations of chemical compounds) and differentially methylated probes/regions. It supports various modes of enriched presentation to get the most knowledge out of existing data: (1) enriched Manhattan plot and enriched volcano plot for selection of probes, (2) trait-methylation plot for visualization of selected trait values against methylation values, (3) methylation profile plot for visualization of a selected range of probes against selected trait values, and (4) correlation profile plot for selection and visualization of further probes that are correlated to the selected probe. EpiVisR additionally allows exporting selected data to external tools for tasks such as network analysis. Conclusion The key advantage of EpiVisR is the annotation of data in the enriched plots (and tied tables) as well as linking to external data sources for further integrated data analysis. Using the EpiVisR approach will allow users to integrate trait data with epigenetic analyses that are connected by belonging to the same individuals. Merging data from various data sources within the same cohort and visualizing them will enable users to gain more insights from existing data.
 
Article
In recent years, the introduction of single-cell RNA sequencing (scRNA-seq) has enabled the analysis of a cell’s transcriptome at an unprecedented granularity and processing speed. The experimental outcome of applying this technology is an M × N matrix containing aggregated mRNA expression counts of M genes and N cell samples. From this matrix, scientists can study how cell protein synthesis changes in response to various factors, for example, disease versus non-disease states in response to a treatment protocol. This technology’s critical challenge is detecting and accurately recording lowly expressed genes. As a result, low expression levels tend to be missed and recorded as zero, an event known as dropout. This makes the lowly expressed genes indistinguishable from true zero expression and different from the low expression present in other cells of the same type. This issue makes any subsequent downstream analysis difficult. To address this problem, we propose an approach to measure cell similarity using consensus clustering and demonstrate an effective and efficient algorithm that takes advantage of this new similarity measure to impute the most probable dropout events in scRNA-seq datasets. We demonstrate that our approach exceeds the performance of existing imputation approaches while introducing the least amount of new noise, as measured by clustering performance characteristics on datasets with known cell identities.
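The abstract does not give the details of the similarity measure or the imputation rule, so the following is only a generic sketch of one common way to derive a cell similarity from consensus clustering: a co-association matrix recording the fraction of clustering runs in which two cells share a cluster. The function names `coassociation` and `impute_zeros`, the neighbour count `k`, and the weighted-mean imputation rule are all assumptions for illustration, not the authors' algorithm.

```python
import numpy as np


def coassociation(labelings):
    """Fraction of clustering runs in which each pair of cells co-clusters.

    labelings: array-like of shape (n_runs, n_cells) holding cluster labels.
    Returns an (n_cells, n_cells) similarity matrix with values in [0, 1].
    """
    labelings = np.asarray(labelings)
    n_runs, n_cells = labelings.shape
    sim = np.zeros((n_cells, n_cells))
    for labels in labelings:
        # Pairwise comparison via broadcasting: True where two cells share a label.
        sim += labels[:, None] == labels[None, :]
    return sim / n_runs


def impute_zeros(counts, sim, k=2):
    """Replace zero entries with a similarity-weighted mean over neighbours.

    counts: genes x cells expression matrix (M x N, as in the abstract).
    sim:    cells x cells similarity matrix, e.g. from coassociation().
    k:      number of most similar cells to use (hypothetical parameter).
    """
    counts = np.asarray(counts, dtype=float)
    imputed = counts.copy()
    for j in range(counts.shape[1]):
        weights = sim[j].copy()
        weights[j] = 0.0  # a cell is not its own neighbour
        neighbours = np.argsort(weights)[::-1][:k]
        w = weights[neighbours]
        if w.sum() == 0:
            continue  # no similar cells: leave this cell unchanged
        estimate = counts[:, neighbours] @ w / w.sum()
        zero_mask = counts[:, j] == 0
        imputed[zero_mask, j] = estimate[zero_mask]
    return imputed


# Two clustering runs over three cells; cells 0 and 1 always co-cluster.
sim = coassociation([[0, 0, 1], [0, 0, 1]])
counts = np.array([[4, 0, 9],
                   [0, 0, 5]])
imputed = impute_zeros(counts, sim, k=1)
print(imputed)
```

Note that this sketch overwrites every zero that has an informative neighbour, whereas the abstract describes imputing only the most probable dropout events; a faithful implementation would need an additional criterion for deciding which zeros are dropouts.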
 
Top-cited authors
Steve Horvath
  • University of California, Los Angeles
Colin N Dewey
  • University of Wisconsin–Madison
Ning Ma
  • National Institutes of Health
Bo Li
  • University of California, Berkeley
Frederique Lisacek
  • Swiss Institute of Bioinformatics