About
56
Publications
28,867
Reads
46,175
Citations
Introduction
Skills and Expertise
Additional affiliations
January 1999 - December 2007
January 1997 - December 2011
Publications (56)
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with string...
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomi...
Motivation:
Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased.
Results:
In this study, we focus on predicting the solubility and usability...
The native subcellular location (also referred to as localization or cellular compartment) of a protein is the one in which it acts most frequently; it is one aspect of protein function. Do ten eukaryotic model organisms differ in their location spectrum, i.e., the fraction of its proteome in each of seven major cellular compartments? As experiment...
A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimiz...
Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased.
In this study, we focus on predicting the solubility and usability for purification of pro...
Matrix targeting sequences (MTSs) direct proteins from the cytosol into mitochondria. Efficient targeting often relies on internal matrix targeting-like sequences (iMTS-Ls) which share structural features with MTSs. Predicting iMTS-Ls was tedious and required multiple tools and webservices. We present iMLP, a deep learning approach for the predicti...
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. As experimental characterization of SPs is costly, prediction algorithms are applied to predict them from sequence data. However, existing methods are unable to detect all known types of SPs. We introduce SignalP 6.0, the f...
GPI-anchors constitute a very important post-translational modification, linking many proteins to the outer face of the plasma membrane in eukaryotic cells. Since experimental validation of GPI-anchoring signals is slow and costly, computational approaches for predicting them from amino acid sequences are needed. However, the most recent GPI predic...
Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed toward...
Background:
In the last decade, increasing evidence has shown that changes in human gut microbiota are associated with diseases, such as obesity. The excreted/secreted proteins (secretome) of the gut microbiota affect the microbial composition, altering its colonization and persistence. Furthermore, it influences microbiota-host interactions by tr...
Even though several families of eukaryotic β-barrel outer membrane proteins (OMPs) have been discovered in the last few years, their computational characterization and their annotation in public databases are far from complete. The PFAM database includes only very few characteristic profiles for these families and, in most cases, the profile Hidden...
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting...
Ever since the signal hypothesis was proposed in 1971, the exact nature of signal peptides has been a focal point of research. The prediction of signal peptides and protein subcellular location from amino acid sequences has been an important problem in bioinformatics since the dawn of this research field, involving many statistical and machine lear...
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting sig...
Signal peptides (SPs) are short amino acid sequences in the amino terminus of many newly synthesized proteins that target proteins into, or across, membranes. Bioinformatic tools can predict SPs from amino acid sequences, but most cannot distinguish between various types of signal peptides. We present a deep neural network-based approach that impro...
The ability to predict local structural features of a protein from the primary sequence is of paramount importance for unravelling its function in the absence of experimental structural information. Two main factors affect the utility of potential prediction tools: their accuracy must enable extraction of reliable structural information on the proteins...
Predicting unconventional protein secretion is a much harder problem than predicting signal peptide-based protein secretion, both due to the small number of examples and due to the heterogeneity and the limited knowledge of the pathways involved, especially in eukaryotes. However, the idea that secreted proteins share certain properties regardless...
Motivation:
Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular as machine learning tools in recent years. The availability of greater computational resources, more data, new algorithms for training deep models and easy-to-use libraries for implementation and training...
Motivation:
The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins wh...
Many computational methods are available for predicting protein sorting in bacteria. When comparing them, it is important to know that they can be grouped into three fundamentally different approaches: signal-based, global-property-based and homology-based prediction. In this chapter, the strengths and drawbacks of each of these approaches are descr...
SignalP is the currently most widely used program for prediction of signal peptides from amino acid sequences. Proteins with signal peptides are targeted to the secretory pathway, but are not necessarily secreted. After a brief introduction to the biology of signal peptides and the history of signal peptide prediction, this chapter will describe al...
When predicting the subcellular localization of proteins from their amino acid sequences, there are basically three approaches: signal-based, global property-based, and homology-based. Each of these has its advantages and drawbacks, and it is important when comparing methods to know which approach was used. Various statistical and machine learning...
Machine learning is widely used to analyze biological sequence data. Non-sequential models such as SVMs or feed-forward neural networks are often used, although they have no natural way of handling sequences of varying length. Recurrent neural networks such as the long short-term memory (LSTM) model, on the other hand, are designed to handle sequences...
The prediction of protein sub-cellular localization is an important step toward elucidating protein function. For each query protein sequence, LocTree2 applies machine learning (profile kernel SVM) to predict the native sub-cellular localization in 18 classes for eukaryotes, in six for bacteria and in three for archaea. The method outputs a score t...
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization base...
Additional file 1 which has been uploaded with this manuscript is a PDF document of 3.3 MB (17 pages). It contains Supplementary Figures S1–S4 and Supplementary Tables S1 and S2 which have been referred to in the text. The data sets used in this study are deposited at our website [35], where you can also find a web page version of the supplementary...
A knowledge of the positions of introns in eukaryotic genes is important for understanding the evolution of introns. Despite this, there has been relatively little focus on the distribution of intron positions in genes.
In proteins with signal peptides, there is an overabundance of phase 1 introns around the region of the signal peptide cleavage si...
Proteins have signals that direct their localization in the cell. The best known of these signals is the secretory signal peptide. Prediction of protein localization from the amino acid sequence is done by two approaches: either by recognizing the signals themselves or by using global properties, such as amino acid composition, of the protein. A nu...
Proteins carrying twin-arginine (Tat) signal peptides are exported into the periplasmic compartment or extracellular environment independently of the classical Sec-dependent translocation pathway. To complement other methods for classical signal peptide prediction we here present a publicly available method, TatP, for prediction of bacterial Tat si...
We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition...
A method to predict lipoprotein signal peptides in Gram-negative Eubacteria, LipoP, has been developed. The hidden Markov model (HMM) was able to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins. This predictor was able to predict 96.8% of the lipoproteins correct...
We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes, and enzyme categories for enzymes. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more direct...
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 9...
We provide a unified overview of methods that currently are widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, and to information theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadv...
Also at the Department of Biological Sciences, University of California, Irvine, USA, to whom all correspondence should be addressed.
We provide a unified overview of methods that currently are widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficie...
Recently, a new protein translocation pathway, the twin-arginine translocation (TAT) pathway, has been identified in both bacteria and chloroplasts. To study the possible competition between the TAT- and the well-characterized Sec translocon-dependent pathways in Escherichia coli, we have fused the TorA TAT-targeting signal peptide to the Sec-depen...
Hidden Markov models were introduced in the beginning of the 1970's as a tool in speech recognition. During the last decade they have been found useful in addressing problems in computational biology such as characterising sequence families, gene finding, structure prediction and phylogenetic analysis. In this paper we propose several measures betw...
We present a neural network based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloropl...
Hidden Markov models were introduced in the beginning of the 1970's as a tool in speech recognition. During the last decade they have been found useful in addressing problems in computational biology such as characterising sequence families, gene finding, structure prediction and phylogenetic analysis. In this paper we propose several measures betw...
Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches such as neural networks and hidden Markov models, has made it possible to achieve a level of reliability where practical use in, for exam...
A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our data, the length distributions for the three regions...
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal...
Translation in eukaryotes does not always start at the first AUG in an mRNA, implying that context information also plays a role. This makes prediction of translation initiation sites a non-trivial task, especially when analysing EST and genome data where the entire mature mRNA sequence is not known. In this paper, we employ artificial neural netwo...
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes and can easily be applied to genome-wide data sets. Discrimination between cleaved signa...
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved sig...
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on...
The gntP gene, located between the fim and uxu loci in Escherichia coli K-12, has been cloned and characterized. Nucleotide sequencing of a region encompassing the gntP gene revealed an open reading frame of 447 codons with significant homology to the Bacillus subtilis gluconate permease. Northern (RNA) blotting indicated that the gntP gene was mon...
Projects
Project (1)