# BMC Bioinformatics

Online ISSN: 1471-2105
Recent publications
Article
Background Estimating relatedness is an important step for many genetic study designs. A variety of methods for estimating coefficients of pairwise relatedness from genotype data have been proposed. Both the kinship coefficient $$\varphi$$ φ and the fraternity coefficient $$\psi$$ ψ for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual level genotypes cannot be confidently called. To ignore such uncertainty is known to result in biased estimates. Accordingly, methods have recently been developed to estimate kinship from uncertain genotypes. Results We present new method-of-moment estimators of both the coefficients $$\varphi$$ φ and $$\psi$$ ψ calculated directly from genotype likelihoods. We have simulated low-depth genetic data for a sample of individuals with extensive relatedness by using the complex pedigree of the known genetic isolates of Cilento in South Italy. Through this simulation, we explore the behaviour of our estimators, demonstrate their properties, and show advantages over alternative methods. A demonstration of our method is given for a sample of 150 French individuals with down-sampled sequencing data. Conclusions We find that our method can provide accurate relatedness estimates whilst holding advantages over existing methods in terms of robustness, independence from external software, and required computation time. The method presented in this paper is referred to as LowKi ( Low -depth Ki nship) and has been made available in an R package ( https://github.com/genostats/LowKi ).

Article
Background Autism spectrum disorder (ASD) is a group of complex neurodevelopment disorders with a strong genetic basis. Large scale sequencing studies have identified over one hundred ASD risk genes. Nevertheless, the vast majority of ASD risk genes remain to be discovered, as it is estimated that more than 1000 genes are likely to be involved in ASD risk. Prioritization of risk genes is an effective strategy to increase the power of identifying novel risk genes in genetics studies of ASD. As ASD risk genes are likely to exhibit distinct properties from multiple angles, we reason that integrating multiple levels of genomic data is a powerful approach to pinpoint genuine ASD risk genes. Results We present BNScore, a Bayesian model selection framework to probabilistically prioritize ASD risk genes through explicitly integrating evidence from sequencing-identified ASD genes, biological annotations, and gene functional network. We demonstrate the validity of our approach and its improved performance over existing methods by examining the resulting top candidate ASD risk genes against sets of high-confidence benchmark genes and large-scale ASD genome-wide association studies. We assess the tissue-, cell type- and development stage-specific expression properties of top prioritized genes, and find strong expression specificity in brain tissues, striatal medium spiny neurons, and fetal developmental stages. Conclusions In summary, we show that by integrating sequencing findings, functional annotation profiles, and gene-gene functional network, our proposed BNScore provides competitive performance compared to current state-of-the-art methods in prioritizing ASD genes. Our method offers a general and flexible strategy to risk gene prioritization that can potentially be applied to other complex traits as well.

Article
Background Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size—preventing models from scaling effectively to large amounts of data. Results To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the GDA was extracted, the corresponding GDA, and the information about the gene-disease pair. Conclusions TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Article
Background Extraction of drug drug interactions from biomedical literature and other textual data is an important component to monitor drug-safety and this has attracted attention of many researchers in healthcare. Existing works are more pivoted around relation extraction using bidirectional long short-term memory networks (BiLSTM) and BERT model which does not attain the best feature representations. Results Our proposed DDI (drug drug interaction) prediction model provides multiple advantages: (1) The newly proposed attention vector is added to better deal with the problem of overlapping relations, (2) The molecular structure information of drugs is integrated into the model to better express the functional group structure of drugs, (3) We also added text features that combined the T-distribution and chi-square distribution to make the model more focused on drug entities and (4) it achieves similar or better prediction performance (F-scores up to 85.16%) compared to state-of-the-art DDI models when tested on benchmark datasets. Conclusions Our model that leverages state of the art transformer architecture in conjunction with multiple features can bolster the performances of drug drug interation tasks in the biomedical domain. In particular, we believe our research would be helpful in identification of potential adverse drug reactions.

Article
Background We study in this work the inverse folding problem for RNA, which is the discovery of sequences that fold into given target secondary structures. Results We implement a Lévy mutation scheme in an updated version of an evolutionary inverse folding algorithm and apply it to the design of RNAs with and without pseudoknots. We find that the Lévy mutation scheme increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. Compared to , CPU time is higher but more successful in finding designed sequences that fold correctly into the target structures. Conclusion We propose that a Lévy flight offers a better standard mutation scheme for optimizing RNA design. Our new version of is available on GitHub as a python script and the benchmark results show improved performance on both and the datasets, compared to existing inverse folding tools.

Article
Background Single-cell RNA sequencing (scRNA-seq) technologies offer unique opportunities for exploring heterogeneous cell populations. However, in-depth single-cell transcriptomic characterization of complex tissues often requires profiling tens to hundreds of thousands of cells. Such large numbers of cells represent an important hurdle for downstream analyses, interpretation and visualization. Results We develop a framework called SuperCell to merge highly similar cells into metacells and perform standard scRNA-seq data analyses at the metacell level. Our systematic benchmarking demonstrates that metacells not only preserve but often improve the results of downstream analyses including visualization, clustering, differential expression, cell type annotation, gene correlation, imputation, RNA velocity and data integration. By capitalizing on the redundancy inherent to scRNA-seq data, metacells significantly facilitate and accelerate the construction and interpretation of single-cell atlases, as demonstrated by the integration of 1.46 million cells from COVID-19 patients in less than two hours on a standard desktop. Conclusions SuperCell is a framework to build and analyze metacells in a way that efficiently preserves the results of scRNA-seq data analyses while significantly accelerating and facilitating them.

Article
Background In siRNA based antiviral therapeutics, selection of potent siRNAs is an indispensable step, but these commonly used features are unable to construct the boundary between potent and ineffective siRNAs. Results Here, we select potent siRNAs by removing ineffective ones, where these conditions for removals are constructed by C -features of siRNAs, C -features are generated by MG -algorithm, Icc -cluster and the different combinations of some commonly used features, MG -algorithm and Icc -cluster are two different algorithms to search the nearest siRNA neighbors. For the ineffective siRNAs in test data, they are removed from test data by I -iteration, where I -iteration continually updates training data by adding these successively removed siRNAs. Furthermore, the efficacy of siRNAs of test data is predicted by their nearest neighbors of training data. Conclusions By siRNAs of Hencken dataset, results show that our algorithm removes almost ineffective siRNAs from test data, gives the clear boundary between potent and ineffective siRNAs, and accurately predicts the efficacy of siRNAs also. We suggest that our algorithm can provide new insights for selecting the potent siRNAs.

Article
Background Image segmentation in fluorescence microscopy is often based on spectral separation of fluorescent probes (color-based segmentation) or on significant intensity differences in individual image regions (intensity-based segmentation). These approaches fail, if dye fluorescence shows large spectral overlap with other employed probes or with strong cellular autofluorescence. Results Here, a novel model-free approach is presented which determines bleaching characteristics based on dynamic mode decomposition (DMD) and uses the inferred photobleaching kinetics to distinguish different probes or dye molecules from autofluorescence. DMD is a data-driven computational method for detecting and quantifying dynamic events in complex spatiotemporal data. Here, DMD is first used on synthetic image data and thereafter used to determine photobleaching characteristics of a fluorescent sterol probe, dehydroergosterol (DHE), compared to that of cellular autofluorescence in the nematode Caenorhabditis elegans. It is shown that decomposition of those dynamic modes allows for separating probe from autofluorescence without invoking a particular model for the bleaching process. In a second application, DMD of dye-specific photobleaching is used to separate two green-fluorescent dyes, an NBD-tagged sphingolipid and Alexa488-transferrin, thereby assigning them to different cellular compartments. Conclusions Data-based decomposition of dynamic modes can be employed to analyze spatially varying photobleaching of fluorescent probes in cells and tissues for spatial and temporal image segmentation, discrimination of probe from autofluorescence and image denoising. The new method should find wide application in analysis of dynamic fluorescence imaging data.

Article

Article
Background Several types of RNA in the cell are usually involved in biological processes with multiple functions. Coding RNAs code for proteins while non-coding RNAs regulate gene expression. Some single-strand RNAs can create a circular shape via the back splicing process and convert into a new type called circular RNA (circRNA). circRNAs are among the essential non-coding RNAs in the cell that involve multiple disorders. One of the critical functions of circRNAs is to regulate the expression of other genes through sponging micro RNAs (miRNAs) in diseases. This mechanism, known as the competing endogenous RNA (ceRNA) hypothesis, and additional information obtained from biological datasets can be used by computational approaches to predict novel associations between disease and circRNAs. Results We applied multiple classifiers to validate the extracted features from the heterogeneous network and selected the most appropriate one based on some evaluation criteria. Then, the XGBoost is utilized in our pipeline to generate a novel approach, called CircWalk, to predict CircRNA-Disease associations. Our results demonstrate that CircWalk has reasonable accuracy and AUC compared with other state-of-the-art algorithms. We also use CircWalk to predict novel circRNAs associated with lung, gastric, and colorectal cancers as a case study. The results show that our approach can accurately detect novel circRNAs related to these diseases. Conclusions Considering the ceRNA hypothesis, we integrate multiple resources to construct a heterogeneous network from circRNAs, mRNAs, miRNAs, and diseases. Next, the DeepWalk algorithm is applied to the network to extract feature vectors for circRNAs and diseases. The extracted features are used to learn a classifier and generate a model to predict novel CircRNA-Disease associations. Our approach uses the concept of the ceRNA hypothesis and the miRNA sponge effect of circRNAs to predict their associations with diseases. Our results show that this outlook could help identify CircRNA-Disease associations more accurately.

Article
MRI brain images are always of low contrast, which makes it difficult to identify to which area the information at the boundary of brain images belongs. This can make the extraction of features at the boundary more challenging, since those features can be misleading as they might mix properties of different brain regions. Hence, to alleviate such a problem, image boundary detection plays a vital role in medical image segmentation, and brain segmentation in particular, as unclear boundaries can worsen brain segmentation results. Yet, given the low quality of brain images, boundary detection in the context of brain image segmentation remains challenging. Despite the research invested to improve boundary detection and brain segmentation, these two problems were addressed independently, i.e., little attention was paid to applying boundary detection to brain segmentation tasks. Therefore, in this paper, we propose a boundary detection-based model for brain image segmentation. To this end, we first design a boundary segmentation network for detecting and segmenting images brain tissues. Then, we design a boundary information module (BIM) to distinguish boundaries from the three different brain tissues. After that, we add a boundary attention gate (BAG) to the encoder output layers of our transformer to capture more informative local details. We evaluate our proposed model on two datasets of brain tissue images, including infant and adult brains. The extensive evaluation experiments of our model show better performance (a Dice Coefficient (DC) accuracy of up to 5.3% compared to the state-of-the-art models) in detecting and segmenting brain tissue images.

Article
Background Biological data suffers from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, in order to to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can not only obscure the genes temporal patterns, but applying predictive and clustering tools on noisy data may yield inconsistent, and potentially incorrect, results. Results To reduce the noise of short-term (< 48 h) time-series expression data, we relied on the three basic temporal patterns of gene expression: waves, impulses and sustained responses. We constrained the estimation of the true signals to these patterns by estimating the parameters of first and second-order Fourier functions and using the nonlinear least-squares trust-region optimization technique. Our approach lowered the noise in at least 85% of synthetic time-series expression data, significantly more than the spline method ( $$p<10^{-6}$$ p < 10 - 6 ). When the data contained a higher signal-to-noise ratio, our method allowed downstream network component analyses to calculate consistent and accurate predictions, particularly when the noise variance was high. Conversely, these tools led to erroneous results from untreated noisy data. Our results suggest that at least 5–7 time points are required to efficiently de-noise logarithmic scaled time-series expression data. Investing in sampling additional time points provides little benefit to clustering and prediction accuracy. Conclusions Our constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.

Article
Background Single-cell RNA-sequencing is revolutionising the study of cellular and tissue-wide heterogeneity in a large number of biological scenarios, from highly tissue-specific studies of disease to human-wide cell atlases. A central task in single-cell RNA-sequencing analysis design is the calculation of cell type-specific genes in order to study the differential impact of different replicates (e.g. tumour vs. non-tumour environment) on the regulation of those genes and their associated networks. The crucial task is the efficient and reliable calculation of such cell type-specific ‘marker’ genes. These optimise the ability of the experiment to isolate highly-specific cell phenotypes of interest to the analyser. However, while methods exist that can calculate marker genes from single-cell RNA-sequencing, no such method places emphasise on specific cell phenotypes for downstream study in e.g. differential gene expression or other experimental protocols (spatial transcriptomics protocols for example). Here we present , a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterise highly-specific and niche populations of cells in numerous different biological data-sets. Results extracts robust and biologically well-motivated marker genes, which characterise a given single-cell RNA-sequencing data-set better than existing computational approaches for general marker gene calculation. We demonstrate the utility of through its substantial performance improvement over several existing methods in the field. Furthermore, we evaluate the markers on spatial transcriptomics data, demonstrating they identify highly localised compartments of the mouse cortex. Conclusion is a new methodology for calculating robust markers genes from large single-cell RNA-sequencing data-sets, and has implications for e.g. effective gene identification for probe design in downstream analyses spatial transcriptomics experiments. has been fully-integrated with the framework and provides a valuable bioinformatics tool for cell type characterisation and validation in every-growing data-sets spanning over 50 different cell types across hundreds of thousands of cells.

Article
Abstract Background The treatment and prognosis of lung adenocarcinoma (LUAD) remains a challenge. The study aimed to conduct a systematic analysis of the predictive capacity of N6-methyladenosine (m6A)-related long non-coding RNAs (lncRNAs) in the prognosis of LUAD. Methods 594 samples were totally selected from a dataset from The Cancer Genome Atlas. The identification of prognostic m6A-related lncRNAs were performed by Pearson correlation analysis and Cox regression analysis. Systematic analyses, including cluster analysis, survival analysis, and immuno-correlated analysis, were conducted. A prognosis model was built from the optimized subset of m6A-related lncRNAs. The assessment of model was performed by survival analysis, and receiver operating characteristic (ROC) curve. Finally, the risk score of patients with LUAD calculated by the prognosis model was implemented by the analysis of Cox regression. Differential analysis was for further evaluation of the cuproptosis-related genes in two risk sets. Results These patients were grouped into two clusters according to the expression levels of 22 prognostic m6A-related lncRNAs. The patients with LUAD in cluster 2 was significantly worse in the overall survival (OS) (P = 0.006). Three scores calculated by the ESTIMATE methods in cluster 2 were significantly lower. After the least absolute shrinkage and selection operator algorithm, 10 prognostic m6A-related lncRNAs were totally selected to construct the final model to obtain the risk score. Then the area under the ROC curve of the prognosis model for 1, 3, and 5-year OS was 0.767, 0.709, and 0.736 in the training set, and 0.707, 0.691, and 0.675 in the test set. The OS of the low-risk cohort was significantly higher than that of the high-risk cohort in both the training set (P

Article
Background Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4–5 times underrepresented compared to non-TMPs. Today’s top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. Results Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5), to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar or better than methods, which are using evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). Conclusions Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.

Article
Background The malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS). Results We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variables. Moreover, to compare the performance of the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction. Conclusions Our proposed approach identified SNP rs334 as the most contributing feature with an importance score of 6.224 compared to the baseline, with an importance score of 1.1314. This is an important result as prior studies have proven that rs334 is a major genetic risk factor for malaria. The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than wGRS alone, in which LightGBM obtained the best performance (0.0033 MAE score).

Article
A gene is considered as essential when it is indispensable for cells to grow and replicate in a certain environment. However, gene essentiality is not a structural property but rather a contextual one, which depends on the specific biological conditions affecting the cell. This circumstantial essentiality of genes is what brings the attention of scientist since we can identify genes essential for cancer cells but not essential for healthy cells. This same contextuality makes their identification extremely challenging. Huge experimental efforts such as Project Achilles where the essentiality of thousands of genes is measured together with a plethora of molecular data (transcriptomics, copy number, mutations, etc.) in over one thousand cell lines can shed light on the causality behind the essentiality of a gene in a given environment. Here, we present an in-silico method for the identification of patient-specific essential genes using constraint-based modelling (CBM). Our method expands the ideas behind traditional CBM to accommodate multisystem networks. In essence, it first calculates the minimum number of lowly expressed genes required to be activated by the cell to sustain life as defined by a set of requirements; and second, it performs an exhaustive in-silico gene knockout to find those that lead to the need of activating additional lowly expressed genes. We validated the proposed methodology using a set of 452 cancer cell lines derived from the Cancer Cell Line Encyclopedia where an exhaustive experimental large-scale gene knockout study using CRISPR (Achilles Project) evaluates the impact of each removal. We also show that the integration of different essentiality predictions per gene, what we called Essentiality Congruity Score, reduces the number of false positives. Finally, we explored our method in a breast cancer patient dataset, and our results showed high concordance with previous publications. These findings suggest that identifying genes whose activity is fundamental to sustain cellular life in a patient-specific manner is feasible using in-silico methods. The patient-level gene essentiality predictions can pave the way for precision medicine by identifying potential drug targets whose deletion can induce death in tumour cells.

Article
Background A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data; while ML methods do not fully utilize existing biological knowledge. Results Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven sets of different similarity feature were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the features have significantly improved accuracy over the baseline model - without the similarity features. Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. Conclusion Integrating mechanistic models as graphs helps to both improve the predictive results of machine learning models, and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models.

Article
Rupture of intracranial aneurysm is the first cause of subarachnoid hemorrhage, second only to cerebral thrombosis and hypertensive cerebral hemorrhage, and the mortality rate is very high. MRI technology plays an irreplaceable role in the early detection and diagnosis of intracranial aneurysms and supports evaluating the size and structure of aneurysms. The increase in many aneurysm images, may be a massive workload for the doctors, which is likely to produce a wrong diagnosis. Therefore, we proposed a simple and effective comprehensive residual attention network (CRANet) to improve the accuracy of aneurysm detection, using a residual network to extract the features of an aneurysm. Many experiments have shown that the proposed CRANet model could detect aneurysms effectively. In addition, on the test set, the accuracy and recall rates reached 97.81% and 94%, which significantly improved the detection rate of aneurysms.

Article
Background Visceral Leishmaniasis (VL) is a fatal vector-borne parasitic disorder occurring mainly in tropical and subtropical regions. VL falls under the category of neglected tropical diseases with growing drug resistance and lacking a licensed vaccine. Conventional vaccine synthesis techniques are often very laborious and challenging. With the advancement of bioinformatics and its application in immunology, it is now more convenient to design multi-epitope vaccines comprising predicted immuno-dominant epitopes of multiple antigenic proteins. We have chosen four antigenic proteins of Leishmania donovani and identified their T-cell and B-cell epitopes, utilizing those for in-silico chimeric vaccine designing. The various physicochemical characteristics of the vaccine have been explored and the tertiary structure of the chimeric construct is predicted to perform docking studies and molecular dynamics simulations. Results The vaccine construct is generated by joining the epitopes with specific linkers. The predicted tertiary structure of the vaccine has been found to be valid and docking studies reveal the construct shows a high affinity towards the TLR-4 receptor. Population coverage analysis shows the vaccine can be effective on the majority of the world population. In-silico immune simulation studies confirms the vaccine to raise a pro-inflammatory response with the proliferation of activated T and B cells. In-silico codon optimization and cloning of the vaccine nucleic acid sequence have also been achieved in the pET28a vector. Conclusion The above bioinformatics data support that the construct may act as a potential vaccine. Further wet lab synthesis of the vaccine and in vivo works has to be undertaken in animal model to confirm vaccine potency.

Article
Background Applying directed acyclic graph (DAG) models to proteogenomic data has been shown effective for detecting causal biomarkers of complex diseases. However, there remain unsolved challenges in DAG learning to jointly model binary clinical outcome variables and continuous biomarker measurements. Results In this paper, we propose a new tool, DAGBagM, to learn DAGs with both continuous and binary nodes. By using appropriate models, DAGBagM allows for either continuous or binary nodes to be parent or child nodes. It employs a bootstrap aggregating strategy to reduce false positives in edge inference. At the same time, the aggregation procedure provides a flexible framework to robustly incorporate prior information on edges. Conclusions Through extensive simulation experiments, we demonstrate that DAGBagM has superior performance compared to alternative strategies for modeling mixed types of nodes. In addition, DAGBagM is computationally more efficient than two competing methods. When applying DAGBagM to proteogenomic datasets from ovarian cancer studies, we identify potential protein biomarkers for platinum refractory/resistant response in ovarian cancer. DAGBagM is made available as a github repository at https://github.com/jie108/dagbagM .

Article
Background Use of next-generation sequencing technologies to transcriptomics (RNA-seq) for gene expression profiling has found widespread application in studying different biological conditions including cancers. However, RNA-seq experiments are still small sample size experiments due to the cost. Recently, an increased focus has been on meta-analysis methods for integrated differential expression analysis for exploration of potential biomarkers. In this study, we propose a p -value combination method for meta-analysis of multiple independent but related RNA-seq studies that accounts for sample size of a study and direction of expression of genes in individual studies. Results The proposed method generalizes the inverse-normal method without an increase in statistical or computational complexity and does not pre- or post-hoc filter genes that have conflicting direction of expression in different studies. Thus, the proposed method, as compared to the inverse-normal, has better potential for the discovery of differentially expressed genes (DEGs) with potentially conflicting differential signals from multiple studies related to disease. We demonstrated the use of the proposed method in detection of biologically relevant DEGs in glioblastoma (GBM), the most aggressive brain cancer. Our approach notably enabled the identification of over-expressed tumour suppressor gene RAD51 in GBM compared to healthy controls, which has recently been shown to be a target for inhibition to enhance radiosensitivity of GBM cells during treatment. Pathway analysis identified multiple aberrant GBM related pathways as well as novel regulators such as TCF7L2 and MAPT as important upstream regulators in GBM. Conclusions The proposed meta-analysis method generalizes the existing inverse-normal method by providing a way to establish differential expression status for genes with conflicting direction of expression in individual RNA-seq studies. Hence, leading to further exploration of them as potential biomarkers for the disease.

Article
Motivation Aberrant DNA methylation in transcription factor binding sites has been shown to lead to anomalous gene regulation that is strongly associated with human disease. However, the majority of methylation-sensitive positions within transcription factor binding sites remain unknown. Here we introduce SEMplMe, a computational tool to generate predictions of the effect of methylation on transcription factor binding strength in every position within a transcription factor’s motif. Results SEMplMe uses ChIP-seq and whole genome bisulfite sequencing to predict effects of methylation within binding sites. SEMplMe validates known methylation sensitive and insensitive positions within a binding motif, identifies cell type specific transcription factor binding driven by methylation, and outperforms SELEX-based predictions for CTCF. These predictions can be used to identify aberrant sites of DNA methylation contributing to human disease. Availability and Implementation SEMplMe is available from https://github.com/Boyle-Lab/SEMplMe.

Article
Background Genetic and epigenetic biological studies often combine different types of experiments and multiple conditions. While the corresponding raw and processed data are made available through specialized public databases, the processed files are usually limited to a specific research question. Hence, they are unsuitable for an unbiased, systematic overview of a complex dataset. However, possible combinations of different sample types and conditions grow exponentially with the amount of sample types and conditions. Therefore the risk to miss a correlation or to overrate an identified correlation should be mitigated in a complex dataset. Since reanalysis of a full study is rarely a viable option, new methods are needed to address these issues systematically, reliably, reproducibly and efficiently. Results Cogito “COmpare annotated Genomic Intervals TOol” provides a workflow for an unbiased, structured overview and systematic analysis of complex genomic datasets consisting of different data types (e.g. RNA-seq, ChIP-seq) and conditions. Cogito is able to visualize valuable key information of genomic or epigenomic interval-based data, thereby providing a straightforward analysis approach for comparing different conditions. It supports getting an unbiased impression of a dataset and developing an appropriate analysis strategy for it. In addition to a text-based report, Cogito offers a fully customizable report as a starting point for further in-depth investigation. Conclusions Cogito implements a novel approach to facilitate high-level overview analyses of complex datasets, and offers additional insights into the data without the need for a full, time-consuming reanalysis. The R/Bioconductor package is freely available at https://bioconductor.org/packages/release/bioc/html/Cogito.html, a comprehensive documentation with detailed descriptions and reproducible examples is included.

Article
Background ImputAccur is a software tool to measure genotype-imputation accuracy. Imputation of untyped markers is a standard approach in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, high accuracy for imputed genotypes is fundamental. Several accuracy measures have been proposed, but unfortunately, they are implemented on different platforms, which is impractical. Results With ImputAccur, the accuracy measures info, Iam-hiQ and r²-based indices can be derived from standard output files of imputation software. Sample/probe and marker filtering is possible. This allows e.g. accurate marker filtering ahead of data analysis. Conclusions The source code (Python version 3.9.4), a standalone executive file, and example data for ImputAccur are freely available at https://gitlab.gwdg.de/kolja.thormann1/imputationquality.git.

Article
Background Essential Proteins are demonstrated to exert vital functions on cellular processes and are indispensable for the survival and reproduction of the organism. Traditional centrality methods perform poorly on complex protein–protein interaction (PPI) networks. Machine learning approaches based on high-throughput data lack the exploitation of the temporal and spatial dimensions of biological information. Results We put forward a deep learning framework to predict essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles. In our model, the node2vec method is applied to learn continuous feature representations for proteins in the PPI network, which capture the diversity of connectivity patterns in the network. The concept of depthwise separable convolution is employed on gene expression profiles to extract properties and observe the trends of gene expression over time under different experimental conditions. Subcellular localization information is mapped into a long one-dimensional vector to capture its characteristics. Additionally, we use a sampling method to mitigate the impact of imbalanced learning when training the model. With experiments carried out on the data of Saccharomyces cerevisiae, results show that our model outperforms traditional centrality methods and machine learning methods. Likewise, the comparative experiments have manifested that our process of various biological information is preferable. Conclusions Our proposed deep learning framework effectively identifies essential proteins by integrating multiple biological data, proving a broader selection of subcellular localization information significantly improves the results of prediction and depthwise separable convolution implemented on gene expression profiles enhances the performance.

Article
Background Drug–target interaction (DTI) prediction plays a crucial role in drug discovery. Although the advanced deep learning has shown promising results in predicting DTIs, it still needs improvements in two aspects: (1) encoding method, in which the existing encoding method, character encoding, overlooks chemical textual information of atoms with multiple characters and chemical functional groups; as well as (2) the architecture of deep model, which should focus on multiple chemical patterns in drug and target representations. Results In this paper, we propose a multi-granularity multi-scaled self-attention (SAN) model by alleviating the above problems. Specifically, in process of encoding, we investigate a segmentation method for drug and protein sequences and then label the segmented groups as the multi-granularity representations. Moreover, in order to enhance the various local patterns in these multi-granularity representations, a multi-scaled SAN is built and exploited to generate deep representations of drugs and targets. Finally, our proposed model predicts DTIs based on the fusion of these deep representations. Our proposed model is evaluated on two benchmark datasets, KIBA and Davis. The experimental results reveal that our proposed model yields better prediction accuracy than strong baseline models. Conclusion Our proposed multi-granularity encoding method and multi-scaled SAN model improve DTI prediction by encoding the chemical textual information of drugs and targets and extracting their various local patterns, respectively.

Article
Background DIRs are mysterious protein that have the ability to scavenge free radicals, which, are highly reactive with molecules in their vicinity. What is even more fascinating is that they carry out from these highly unstable species, a selective reaction (i.e., stereoenantioselective) from a well-defined substrate to give a very precise product. Unfortunately, to date, only three products have been demonstrated following studies on DIRs from the plant world, which until now was the kingdom where these proteins had been demonstrated. Within this kingdom, each DIR protein has its own type of substrate. The products identified to date, have on the other hand, a strong economic impact: in agriculture for example, the biosynthesis of (+)-gossypol could be highlighted (a repellent antifood produced by the cotton plant) by the DIRs of cotton. In forsythia plant species, it is the biosynthesis of (−)-pinoresinol, an intermediate leading to the synthesis of podophyllotoxine (a powerful anicancerous agent) which has been revealed. Recently, a clear path of study, potentially with strong impact, appeared by the hypothesis of the potential existence of protein DIR within the genomes of prokaryotes. The possibility of working with this type of organism is an undeniable advantage: since many sequenced genomes are available and the molecular tools are already developed. Even easier to implement and working on microbes, of less complex composition, offers many opportunities for laboratory studies. On the other hand, the diversity of their environment (e.g., soil, aquatic environments, extreme environmental conditions (pH, temperature, pressure) make them very diverse and varied subjects of study. Identifying new DIR proteins from bacteria means identifying new substrate or product molecules from these organisms. It is the promise of going further in understanding the mechanism of action of these proteins and this will most likely have a strong impact in the fields of agricultural, pharmaceutical and/or food chemistry. Results Our goal is to obtain as much information as possible about these proteins to unlock the secrets of their exceptional functioning. Analyzes of structural and functional genomic data led to the identification of the Pfam PF03018 domain as characteristic of DIR proteins. This domain has been further identified in the sequence of bacterial proteins therefore named as DIR-like (DIRL). We have chosen a multidisciplinary bioinformatic approach centered on bacterial genome identification, gene expression and regulation signals, protein structures, and their molecular information content. The objective of this study was to perform a thorough bioinformatic analysis on these DIRLs to highlight any information leading to the selection of candidate bacteria for further cloning, purification, and characterization of bacterial DIRs. Conclusions From studies of DIRL genes identification, primary structures, predictions of their secondary and tertiary structures, prediction of DIRL signals sequences, analysis of their gene organization and potential regulation, a list of primary bacterial candidates is proposed.

Article
Background Cervical cancer is the fourth most common cancer affecting women and is caused by human Papillomavirus (HPV) infections that are sexually transmitted. There are currently commercially available prophylactic vaccines that have been shown to protect vaccinated individuals against HPV infections, however, these vaccines have no therapeutic effects for those who are previously infected with the virus. The current study’s aim was to use immunoinformatics to develop a multi-epitope vaccine with therapeutic potential against cervical cancer. Results In this study, T-cell epitopes from E5 and E7 proteins of HPV16/18 were predicted. These epitopes were evaluated and chosen based on their antigenicity, allergenicity, toxicity, and induction of IFN-γ production (only in helper T lymphocytes). Then, the selected epitopes were sequentially linked by appropriate linkers. In addition, a C-terminal fragment of Mycobacterium tuberculosis heat shock protein 70 (HSP70) was used as an adjuvant for the vaccine construct. The physicochemical parameters of the vaccine construct were acceptable. Furthermore, the vaccine was soluble, highly antigenic, and non-allergenic. The vaccine’s 3D model was predicted, and the structural improvement after refinement was confirmed using the Ramachandran plot and ProSA-web. The vaccine’s B-cell epitopes were predicted. Molecular docking analysis showed that the vaccine's refined 3D model had a strong interaction with the Toll-like receptor 4. The structural stability of the vaccine construct was confirmed by molecular dynamics simulation. Codon adaptation was performed in order to achieve efficient vaccine expression in Escherichia coli strain K12 (E. coli). Subsequently, in silico cloning of the multi-epitope vaccine was conducted into pET-28a ( +) expression vector. Conclusions According to the results of bioinformatics analyses, the multi-epitope vaccine is structurally stable, as well as a non-allergic and non-toxic antigen. However, in vitro and in vivo studies are needed to validate the vaccine’s efficacy and safety. If satisfactory results are obtained from in vitro and in vivo studies, the vaccine designed in this study may be effective as a therapeutic vaccine against cervical cancer.

Article
Although there have been some recent cell and animal experiments indicating that expression of the gene encoding apolipoprotein B mRNA editing enzyme catalytic subunit 3B (APOBEC3B) is closely related to cancer, it still lacks pan-cancer analysis. Here we analyzed the potential carcinogenic role of APOBEC3B in 33 tumors based on The Cancer Genome Atlas (TCGA). APOBEC3B was highly expressed in most tumors and weakly expressed in a few. Differences in expression level were significantly correlated with the pathological tumor stage and prognosis of affected patients. The high-frequency APOBEC3B changes were principally mutations and amplifications in some tumors, such as uterine corpus endometrial carcinomas or cutaneous melanomas. In testicular germ cell tumors and invasive breast carcinomas, APOBEC3B expression and CD8⁺ T lymphocyte counts were correlated. In other cancers, such as human papilloma virus (HPV)-related head and neck squamous cell carcinomas or esophageal adenocarcinomas, there was also cancer-associated fibroblast infiltration. The APOBEC3B enzyme acts in the mitochondrial respiratory electron transport chain and in oxidative phosphorylation. This first pan-cancer study provides a comprehensive understanding of the multiple roles of APOBEC3B in different tumor types.

Article
Background Genome-wide association studies (GWAS) are limited in power to detect associations that exceed the stringent genome-wide significance threshold. This limitation can be alleviated by leveraging relevant auxiliary data, such as functional genomic data. Frameworks utilising the conditional false discovery rate have been developed for this purpose, and have been shown to increase power for GWAS discovery whilst controlling the false discovery rate. However, the methods are currently only applicable for continuous auxiliary data and cannot be used to leverage auxiliary data with a binary representation, such as whether SNPs are synonymous or non-synonymous, or whether they reside in regions of the genome with specific activity states. Results We describe an extension to the cFDR framework for binary auxiliary data, called “Binary cFDR”. We demonstrate FDR control of our method using detailed simulations, and show that Binary cFDR performs better than a comparator method in terms of sensitivity and FDR control. We introduce an all-encompassing user-oriented CRAN R package ( https://annahutch.github.io/fcfdr/ ; https://cran.r-project.org/web/packages/fcfdr/index.html ) and demonstrate its utility in an application to type 1 diabetes, where we identify additional genetic associations. Conclusions Our all-encompassing R package, , serves as a comprehensive toolkit to unite GWAS and functional genomic data in order to increase statistical power to detect genetic associations.

Article
Background Reference intervals represent the expected range of physiological test results in a healthy population and are essential to support medical decision making. Particularly in the context of pediatric reference intervals, where recruitment regulations make prospective studies challenging to conduct, indirect estimation strategies are becoming increasingly important. Established indirect methods enable robust identification of the distribution of “healthy” samples from laboratory databases, which include unlabeled pathologic cases, but are currently severely limited when adjusting for essential patient characteristics such as age. Here, we propose the use of mixture density networks (MDN) to overcome this problem and model all parameters of the mixture distribution in a single step. Results Estimated reference intervals from varying settings with simulated data demonstrate the ability to accurately estimate latent distributions from unlabeled data using different implementations of MDNs. Comparing the performance with alternative estimation approaches further highlights the importance of modeling the mixture component weights as a function of the input in order to avoid biased estimates for all other parameters and the resulting reference intervals. We also provide a strategy to generate partially customized starting weights to improve proper identification of the latent components. Finally, the application on real-world hemoglobin samples provides results in line with current gold standard approaches, but also suggests further investigations with respect to adequate regularization strategies in order to prevent overfitting the data. Conclusions Mixture density networks provide a promising approach capable of extracting the distribution of healthy samples from unlabeled laboratory databases while simultaneously and explicitly estimating all parameters and component weights as non-linear functions of the covariate(s), thereby allowing the estimation of age-dependent reference intervals in a single step. Further studies on model regularization and asymmetric component distributions are warranted to consolidate our findings and expand the scope of applications.

Article
Background Technical improvement in ATAC-seq makes it possible for high throughput profiling the chromatin states of single cells. However, data from multiple sources frequently show strong technical variations, which is referred to as batch effects. In order to perform joint analysis across multiple datasets, specialized method is required to remove technical variations between datasets while keep biological information. Results Here we present an algorithm named epiConv to perform joint analyses on scATAC-seq datasets. We first show that epiConv better corrects batch effects and is less prone to over-fitting problem than existing methods on a collection of PBMC datasets. In a collection of mouse brain data, we show that epiConv is capable of aligning low-depth scATAC-Seq from co-assay data (simultaneous profiling of transcriptome and chromatin) onto high-quality ATAC-seq reference and increasing the resolution of chromatin profiles of co-assay data. Finally, we show that epiConv can be used to integrate cells from different biological conditions (T cells in normal vs. germ-free mouse; normal vs. malignant hematopoiesis), which reveals hidden cell populations that would otherwise be undetectable. Conclusions In this study, we introduce epiConv to integrate multiple scATAC-seq datasets and perform joint analysis on them. Through several case studies, we show that epiConv removes the batch effects and retains the biological signal. Moreover, joint analysis across multiple datasets improves the performance of clustering and differentially accessible peak calling, especially when the biological signal is weak in single dataset.

Article
Background Biomedical event extraction is a fundamental task in biomedical text mining, which provides inspiration for medicine research and disease prevention. Biomedical events include simple events and complex events. Existing biomedical event extraction methods usually deal with simple events and complex events uniformly, and the performance of complex event extraction is relatively low. Results In this paper, we propose a fine-grained Bidirectional Long Short Term Memory method for biomedical event extraction, which designs different argument detection models for simple and complex events respectively. In addition, multi-level attention is designed to improve the performance of complex event extraction, and sentence embeddings are integrated to obtain sentence level information which can resolve the ambiguities for some types of events. Our method achieves state-of-the-art performance on the commonly used dataset Multi-Level Event Extraction. Conclusions The sentence embeddings enrich the global sentence-level information. The fine-grained argument detection model improves the performance of complex biomedical event extraction. Furthermore, the multi-level attention mechanism enhances the interactions among relevant arguments. The experimental results demonstrate the effectiveness of the proposed method for biomedical event extraction.

Article
Background Ewing sarcoma (ES) is the second most common primary malignant bone tumor mainly occurring in children, adolescents and young adults with high metastasis and mortality. Autophagy has been reported to be involved in the survival of ES, but the role remains unclear. Therefore, it’s necessary to investigate the prognostic value of autophagy related genes using bioinformatics methods. Results ATG2B , ATG10 and DAPK1 were final screened genes for a prognostic model. KM and risk score plots showed patients in high score group had better prognoses both in training and validation sets. C-indexes of the model for training and validation sets were 0.68 and 0.71, respectively. Calibration analyses indicated the model had high prediction accuracy in training and validation sets. The AUC values of ROC for 1-, 3-, 5-year prediction were 0.65, 0.73 and 0.84 in training set, 0.88, 0.73 and 0.79 in validation set, which suggested high prediction accuracy of the model. Decision curve analyses showed that patients could benefit much from the model. Differential and functional analyses suggested that autophagy and apoptosis were upregulated in high risk score group. Conclusions ATG2B , ATG10 and DAPK1 were autophagy related genes with potential protective function in ES. The prognostic model established by them exhibited excellent prediction accuracy and discriminatory capacities. They might be used as potential prognostic biomarkers and therapeutic targets in ES.

Article
Background Previous studies have demonstrated the value of re-analysing publicly available genetics data with recent analytical approaches. Publicly available datasets, such as the Women’s Health Initiative (WHI) offered by the database of genotypes and phenotypes (dbGaP), provide a wealthy resource for researchers to perform multiple analyses, including Genome-Wide Association Studies. Often, the genetic information of individuals in these datasets are stored in imputed dosage files output by MaCH; mldose and mlinfo files. In order for researchers to perform GWAS studies with this data, they must first be converted to a file format compatible with their tool of choice e.g., PLINK. Currently, there is no published tool which easily converts the datasets provided in MACH dosage files into PLINK-ready files. Results Herein, we present Canary a singularity-based tool which converts MaCH dosage files into PLINK-compatible files with a single line of user input at the command line. Further, we provide a detailed tutorial on preparation of phenotype files. Moreover, Canary comes with preinstalled software often used during GWAS studies, to further increase the ease-of-use of HPC systems for researchers. Conclusions Until now, conversion of imputed data in the form of MaCH mldose and mlinfo files needed to be completed manually. Canary uses singularity container technology to allow users to automatically convert these MaCH files into PLINK compatible files. Additionally, Canary provides researchers with a platform to conduct GWAS analysis more easily as it contains essential software needed for conducting GWAS studies, such as PLINK and Bioconductor. We hope that this tool will greatly increase the ease at which researchers can perform GWAS with imputed data, particularly on HPC environments.

Article
Background Heritability and genetic correlation can be estimated from genome-wide single-nucleotide polymorphism (SNP) data using various methods. We recently developed multivariate genomic-relatedness-based restricted maximum likelihood (MGREML) for statistically and computationally efficient estimation of SNP-based heritability ( $$h^2_{\text{SNP}}$$ h SNP 2 ) and genetic correlation ( $$\rho _G$$ ρ G ) across many traits in large datasets. Here, we extend MGREML by allowing it to fit and perform tests on user-specified factor models, while preserving the low computational complexity. Results Using simulations, we show that MGREML yields consistent estimates and valid inferences for such factor models at low computational cost (e.g., for data on 50 traits and 20,000 individuals, a saturated model involving 50 $$h^2_{\text{SNP}}$$ h SNP 2 ’s, 1225 $$\rho _G$$ ρ G ’s, and 50 fixed effects is estimated and compared to a restricted model in less than one hour on a single notebook with two 2.7 GHz cores and 16 GB of RAM). Using repeated measures of height and body mass index from the US Health and Retirement Study, we illustrate the ability of MGREML to estimate a factor model and test whether it fits the data better than a nested model. The MGREML tool, the simulation code, and an extensive tutorial are freely available at https://github.com/devlaming/mgreml/ . Conclusion MGREML can now be used to estimate multivariate factor structures and perform inferences on such factor models at low computational cost. This new feature enables simple structural equation modeling using MGREML, allowing researchers to specify, estimate, and compare genetic factor models of their choosing using SNP data.

Article
Background The discovery of critical biomarkers is significant for clinical diagnosis, drug research and development. Researchers usually obtain biomarkers from microarray data, which comes from the dimensional curse. Feature selection in machine learning is usually used to solve this problem. However, most methods do not fully consider feature dependence, especially the real pathway relationship of genes. Results Experimental results show that the proposed method is superior to classical algorithms and advanced methods in feature number and accuracy, and the selected features have more significance. Method This paper proposes a feature selection method based on a graph neural network. The proposed method uses the actual dependencies between features and the Pearson correlation coefficient to construct graph-structured data. The information dissemination and aggregation operations based on graph neural network are applied to fuse node information on graph structured data. The redundant features are clustered by the spectral clustering method. Then, the feature ranking aggregation model using eight feature evaluation methods acts on each clustering sub-cluster for different feature selection. Conclusion The proposed method can effectively remove redundant features. The algorithm’s output has high stability and classification accuracy, which can potentially select potential biomarkers.

Article
Background Protein complexes are essential for biologists to understand cell organization and function effectively. In recent years, predicting complexes from protein–protein interaction (PPI) networks through computational methods is one of the current research hotspots. Many methods for protein complex prediction have been proposed. However, how to use the information of known protein complexes is still a fundamental problem that needs to be solved urgently in predicting protein complexes. Results To solve these problems, we propose a supervised learning method based on network representation learning and gene ontology knowledge, which can fully use the information of known protein complexes to predict new protein complexes. This method first constructs a weighted PPI network based on gene ontology knowledge and topology information, reducing the network's noise problem. On this basis, the topological information of known protein complexes is extracted as features, and the supervised learning model SVCC is obtained according to the feature training. At the same time, the SVCC model is used to predict candidate protein complexes from the protein interaction network. Then, we use the network representation learning method to obtain the vector representation of the protein complex and train the random forest model. Finally, we use the random forest model to classify the candidate protein complexes to obtain the final predicted protein complexes. We evaluate the performance of the proposed method on two publicly PPI data sets. Conclusions Experimental results show that our method can effectively improve the performance of protein complex recognition compared with existing methods. In addition, we also analyze the biological significance of protein complexes predicted by our method and other methods. The results show that the protein complexes predicted by our method have high biological significance.

Article
Background Identifying protein interfaces can inform how proteins interact with their binding partners, uncover the regulatory mechanisms that control biological functions and guide the development of novel therapeutic agents. A variety of computational approaches have been developed for predicting a protein’s interfacial residues from its known sequence and structure. Methods using the known three-dimensional structures of proteins can be template-based or template-free. Template-based methods have limited success in predicting interfaces when homologues with known complex structures are not available to use as templates. The prediction performance of template-free methods that only rely only upon proteins’ intrinsic properties is limited by the amount of biologically relevant features that can be included in an interface prediction model. Results We describe the development of an integrated method for protein interface prediction (ISPIP) to explore the hypothesis that the efficacy of a computational prediction method of protein binding sites can be enhanced by using a combination of methods that rely on orthogonal structure-based properties of a query protein, combining and balancing both template-free and template-based features. ISPIP is a method that integrates these approaches through simple linear or logistic regression models and more complex decision tree models. On a diverse test set of 156 query proteins, ISPIP outperforms each of its individual classifiers in identifying protein binding interfaces. Conclusions The integrated method captures the best performance of individual classifiers and delivers an improved interface prediction. The method is robust and performs well even when one of the individual classifiers performs poorly on a particular query protein. This work demonstrates that integrating orthogonal methods that depend on different structural properties of proteins performs better at interface prediction than any individual classifier alone.

Article
Background A large number of evidences from biological experiments have confirmed that miRNAs play an important role in the progression and development of various human complex diseases. However, the traditional experiment methods are expensive and time-consuming. Therefore, it is a challenging task that how to develop more accurate and efficient methods for predicting potential associations between miRNA and disease. Results In the study, we developed a computational model that combined heterogeneous graph convolutional network with enhanced layer for miRNA–disease association prediction (HGCNELMDA). The major improvement of our method lies in through restarting the random walk optimized the original features of nodes and adding a reinforcement layer to the hidden layer of graph convolutional network retained similar information between nodes in the feature space. In addition, the proposed approach recalculated the influence of neighborhood nodes on target nodes by introducing the attention mechanism. The reliable performance of the HGCNELMDA was certified by the AUC of 93.47% in global leave-one-out cross-validation (LOOCV), and the average AUCs of 93.01% in fivefold cross-validation. Meanwhile, we compared the HGCNELMDA with the state‑of‑the‑art methods. Comparative results indicated that o the HGCNELMDA is very promising and may provide a cost‑effective alternative for miRNA–disease association prediction. Moreover, we applied HGCNELMDA to 3 different case studies to predict potential miRNAs related to lung cancer, prostate cancer, and pancreatic cancer. Results showed that 48, 50, and 50 of the top 50 predicted miRNAs were supported by experimental association evidence. Therefore, the HGCNELMDA is a reliable method for predicting disease-related miRNAs. Conclusions The results of the HGCNELMDA method in the LOOCV (leave-one-out cross validation, LOOCV) and 5-cross validations were 93.47% and 93.01%, respectively. Compared with other typical methods, the performance of HGCNELMDA is higher. Three cases of lung cancer, prostate cancer, and pancreatic cancer were studied. Among the predicted top 50 candidate miRNAs, 48, 50, and 50 were verified in the biological database HDMMV2.0. Therefore; this further confirms the feasibility and effectiveness of our method. Therefore, this further confirms the feasibility and effectiveness of our method. To facilitate extensive studies for future disease-related miRNAs research, we developed a freely available web server called HGCNELMDA is available at http://124.221.62.44:8080/HGCNELMDA.jsp.

Article
Background The advent of high throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, identifying many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they demonstrate many differences concerning their mutational profiles. Hence, there is no definitive treatment for most cancer types. This reveals the importance of developing new pipelines to identify cancer-associated genes accurately and re-classify patients with similar mutational profiles. Classification of cancer patients with similar mutational profiles may help discover subtypes of cancer patients who might benefit from specific treatment types. Results In this study, we propose a new machine learning pipeline to identify protein-coding genes mutated in many samples to identify cancer subtypes. We apply our pipeline to 12,270 samples collected from the international cancer genome consortium, covering 19 cancer types. As a result, we identify 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates distinguishable properties, including unique cancer-related signaling pathways. Conclusions This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. Additionally, we analyze the mutational signatures for samples in each subtype, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and the Axon guidance pathways, are frequently observed in cancer disease. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study on “gene-motif” suggests the importance of considering both the context of the mutations and mutational processes in identifying cancer-associated genes. The source codes for our proposed clustering pipeline and analysis are publicly available at: https://github.com/bcb-sut/Pan-Cancer.

Article
Background Probabilistic functional integrated networks (PFINs) are designed to aid our understanding of cellular biology and can be used to generate testable hypotheses about protein function. PFINs are generally created by scoring the quality of interaction datasets against a Gold Standard dataset, usually chosen from a separate high-quality data source, prior to their integration. Use of an external Gold Standard has several drawbacks, including data redundancy, data loss and the need for identifier mapping, which can complicate the network build and impact on PFIN performance. Additionally, there typically are no Gold Standard data for non-model organisms. Results We describe the development of an integration technique, ssNet, that scores and integrates both high-throughput and low-throughout data from a single source database in a consistent manner without the need for an external Gold Standard dataset. Using data from Saccharomyces cerevisiae we show that ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias and ID mapping. In addition ssNet results in less loss of data and produces a more complete network. Conclusions The ssNet method allows PFINs to be built successfully from a single database, while producing comparable network performance to networks scored using an external Gold Standard source and with reduced data loss.

Article
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One of the consequences is that it becomes extremely difficult to store, backup, and migrate enormous amount of genomic datasets, not to mention they continue to expand as the cost of sequencing decreases. Herein, a much more efficient and scalable program to perform genome compression is required urgently. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is better than the best state-of-the-art methods, at least better by 30%. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC .

Article
Mediation analysis plays a major role in identifying significant mediators in the pathway between environmental exposures and health outcomes. With advanced data collection technology for large-scale studies, there has been growing research interest in developing methodology for high-dimensional mediation analysis. In this paper we present HIMA2, an extension of the HIMA method (Zhang in Bioinformatics 32:3150–3154, 2016). First, the proposed HIMA2 reduces the dimension of mediators to a manageable level based on the sure independence screening (SIS) method (Fan in J R Stat Soc Ser B 70:849–911, 2008). Second, a de-biased Lasso procedure is implemented for estimating regression parameters. Third, we use a multiple-testing procedure to accurately control the false discovery rate (FDR) when testing high-dimensional mediation hypotheses. We demonstrate its practical performance using Monte Carlo simulation studies and apply our method to identify DNA methylation markers which mediate the pathway from smoking to reduced lung function in the Coronary Artery Risk Development in Young Adults (CARDIA) Study.

Article
Motivation Computer-aided analysis of biological images typically requires extensive training on large-scale annotated datasets, which is not viable in many situations. In this paper, we present Generative Adversarial Network Discriminator Learner (GAN-DL), a novel self-supervised learning paradigm based on the StyleGAN2 architecture, which we employ for self-supervised image representation learning in the case of fluorescent biological images. Results We show that Wasserstein Generative Adversarial Networks enable high-throughput compound screening based on raw images. We demonstrate this by classifying active and inactive compounds tested for the inhibition of SARS-CoV-2 infection in two different cell models: the primary human renal cortical epithelial cells (HRCE) and the African green monkey kidney epithelial cells (VERO). In contrast to previous methods, our deep learning-based approach does not require any annotation, and can also be used to solve subtle tasks it was not specifically trained on, in a self-supervised manner. For example, it can effectively derive a dose-response curve for the tested treatments. Availability and implementation Our code and embeddings are available at https://gitlab.com/AlesioRFM/gan-dl StyleGAN2 is available at https://github.com/NVlabs/stylegan2 .

Article
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality score is a key and difficult aspect of FASTQ file compression. Throughout the literature, we found that the majority of the current quality score compression methods do not support random access. Based on the above consideration, it is reasonable to investigate a lossless quality score compressor with a high compression rate, a fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive and random access supported compressor for lossless compression of quality score sequences. CMIC is an acronym of the four steps (classification, mapping, indexing and compression) in the paper. Its framework consists of the following four parts: classification, mapping, indexing, and compression. The experimental results show that our compressor has good performance in terms of compression rates on all the tested datasets. The file sizes are reduced by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors on most of the tested cases. In terms of random access speed, the CMIC is faster than the LCQS, which provides a random access function for compressed quality scores. Conclusions CMIC is a compressor that is especially designed for quality score sequences, which has good performance in terms of compression rate, compression speed, decompression speed, and random access speed. The CMIC can be obtained in the following way: https://github.com/Humonex/Cmic .

Article
Background Enrichment analyses are widely applied to investigate lists of genes of interest. However, such analyses often result in long lists of annotation terms with high redundancy, making the interpretation and reporting difficult. Long annotation lists and redundancy also complicate the comparison of results obtained from different enrichment analyses. An approach to overcome these issues is using down-sized annotation collections composed of non-redundant terms. However, down-sized collections are generic and the level of detail may not fit the user’s study. Other available approaches include clustering and filtering tools, which are based on similarity measures and thresholds that can be complicated to comprehend and set. Result We propose orsum, a Python package to filter enrichment results. orsum can filter multiple enrichment results collectively and highlight common and specific annotation terms. Filtering in orsum is based on a simple principle: a term is discarded if there is a more significant term that annotates at least the same genes; the remaining more significant term becomes the representative term for the discarded term. This principle ensures that the main biological information is preserved in the filtered results while reducing redundancy. In addition, as the representative terms are selected from the original enrichment results, orsum outputs filtered terms tailored to the study. As a use case, we applied orsum to the enrichment analyses of four lists of genes, each associated with a neurodegenerative disease. Conclusion orsum provides a comprehensible and effective way of filtering and comparing enrichment results. It is available at https://anaconda.org/bioconda/orsum .

Article
Background With the widespread availability of microarray technology for epigenetic research, methods for calling differentially methylated probes or differentially methylated regions have become effective tools to analyze this type of data. Furthermore, visualization is usually employed for quality check of results and for further insights. Expert knowledge is required to leverage capabilities of these methods. To overcome this limitation and make visualization in epigenetic research available to the public, we designed EpiVisR. Results The EpiVisR tool allows to select and visualize combinations of traits (i.e., concentrations of chemical compounds) and differentially methylated probes/regions. It supports various modes of enriched presentation to get the most knowledge out of existing data: (1) enriched Manhattan plot and enriched volcano plot for selection of probes, (2) trait-methylation plot for visualization of selected trait values against methylation values, (3) methylation profile plot for visualization of a selected range of probes against selected trait values as well as, (4) correlation profile plot for selection and visualization of further probes that are correlated to the selected probe. EpiVisR additionally allows exporting selected data to external tools for tasks such as network analysis. Conclusion The key advantage of EpiVisR is the annotation of data in the enriched plots (and tied tables) as well as linking to external data sources for further integrated data analysis. Using the EpiVisR approach will allow users to integrate data from traits with epigenetic analyses that are connected by belonging to the same individuals. Merging data from various data sources among the same cohort and visualizing them will enable users to gain more insights from existing data.

Article
In recent years, the introduction of single-cell RNA sequencing (scRNAseq) has enabled the analysis of a cell’s transcriptome at an unprecedented granularity and processing speed. The experimental outcome of applying this technology is a M × N matrix containing aggregated mRNA expression counts of M genes and N cell samples. From this matrix, scientists can study how cell protein synthesis changes in response to various factors, for example, disease versus non-disease states in response to a treatment protocol. This technology’s critical challenge is detecting and accurately recording lowly expressed genes. As a result, low expression levels tend to be missed and recorded as zero - an event known as dropout. This makes the lowly expressed genes indistinguishable from true zero expression and different than the low expression present in cells of the same type. This issue makes any subsequent downstream analysis difficult. To address this problem, we propose an approach to measure cell similarity using consensus clustering and demonstrate an effective and efficient algorithm that takes advantage of this new similarity measure to impute the most probable dropout events in the scRNA-seq datasets. We demonstrate that our approach exceeds the performance of existing imputation approaches while introducing the least amount of new noise as measured by clustering performance characteristics on datasets with known cell identities.

Top-cited authors
• University of California, Los Angeles