Article

Hidden Markov models in biological sequence analysis


Abstract

The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of crucial importance are large-scale data management and machine learning. The field between computational science and biology is varyingly described as “computational biology” or “bioinformatics.” This paper reviews machine learning techniques based on the use of hidden Markov models (HMMs) for investigating biomolecular sequences. The approach is illustrated with brief descriptions of gene-prediction HMMs and protein family HMMs.


... Model (HMM) [74] is a state modeling system that has proven exceptionally useful in describing complex probabilistic systems. While originally applied to speech recognition tasks, this modeling format has quickly gained increasing popularity in areas ranging from data mining [58,87], to computer vision, to gene / protein sequence analysis [90,19,38,51,52,13] and fingerprint identification [86]. Much of this popularity comes from the HMM's ability to characterize these complex systems in a mathematically tractable way. ...
... The energy benefit of this configuration over the base idle waiting configuration is shown in Figure 8.12 and the operating frequency distribution for 4 and 8 processors (4 contexts each) is depicted in Figure 8.13. We see that, while a reasonable energy savings is produced, it is only slightly more than half the energy benefit seen in the ideal case. ...
... 13: Operating Frequency Distribution of Base Scaling System. This figure depicts the distribution of cycles spent at a given operating frequency for 4 and 8 processors (4 contexts each) using per-iteration frequency scaling at 25 MHz increments. ...
... The hidden Markov model takes a prior probabilistic approach in which hidden states are represented over observed sequence values. Birney et al. [6] utilized a Markov chain process to capture the activity of students while they interacted with a mathematical expert and tutoring system, and recorded their learning path sequences. Jiyong et al. observed the behavior of students in a teacher-centric learning environment, where different hidden variables were considered for activity recognition related to state sequences [28]. ...
... Similarly, the second row shows the sum of B in increasing order of places in Table 3. From each column we select the highest value, which is 0.0001461 in the first place in the first column, corresponding to observation (2), i.e. pleasure, out of the given observations. Similarly, for the second observation (6), i.e. isolation, the highest value is 0.00013105, as shown in the second column. The detailed description of Table IV is given below. ...
Article
Full-text available
In the proposed work, a Hidden Markov Model (HMM) is deployed to improve learners’ performance or grades on the basis of psychological and environmental factors: Connect/Gather, Isolation, Pleasure/Comfort, Depression, Trust, Anxiety, Proper Guidance, Improper Guidance, Entertainment and Stress. These factors are categorized as positive or negative: positive factors boost a learner’s performance or grades, whereas negative factors reduce it. The study applies the HMM to determine the optimal sequence of states, with the grades A, B and C as states, for different emission observations. Identifying the states makes it possible to train the HMM, where the optimal value of each state is computed from different observation sequences, which in turn determine the probability of state sequences. The probability of the optimal states achieved is shown for different logical combinations, where the best state is searched for among the available states using different search techniques. The computational results obtained after training are encouraging and useful.
... A set of stochastic processes that produces the sequence of observed symbols is used to infer an underlying stochastic process that is not observable (hidden states). HMMs have been widely utilized in many application areas including speech recognition [1], bioinformatics [2], finance [3], computer vision [4], and driver behavior modeling [5,6]. A comprehensive survey on the applications of HMMs is presented in [7]. ...
... The prediction probability $p(x_t^v, y_{1:T}^e)$ for $t > T$ can be handled by the Forward-Backward algorithm by computing $F(v, t)$ with an empty set of evidence for $Y_{T+1:t}$. It can be implemented by replacing the term $o_t^{v,e}$ by 1 for $t > T$ in Equation (2). As a result, for $T = 0$, the prediction probability $p(x_t^v, y_{1:T}^e)$ becomes the prior probability $p(x_t^v)$. ...
Article
Full-text available
In this paper, a new algorithm for sensitivity analysis of discrete hidden Markov models (HMMs) is proposed. Sensitivity analysis is a general technique for investigating the robustness of the output of a system model. Sensitivity analysis of probabilistic networks has recently been studied extensively. This has resulted in the development of mathematical relations between a parameter and an output probability of interest and also methods for establishing the effects of parameter variations on decisions. Sensitivity analysis in HMMs has usually been performed by taking small perturbations in parameter values and re-computing the output probability of interest. As recent studies show, the sensitivity analysis of an HMM can be performed using a functional relationship that describes how an output probability varies as the network’s parameters of interest change. To derive this sensitivity function, existing Bayesian network algorithms have been employed for HMMs. These algorithms are computationally inefficient as the length of the observation sequence and the number of parameters increases. In this study, a simplified efficient matrix-based algorithm for computing the coefficients of the sensitivity function for all hidden states and all time steps is proposed and an example is presented.
... Basically, the nucleotide sequence stays the same, with the nucleotide T replaced by U. Proteins are sequences of amino acids (combinations of the twenty possible amino acids). The step from RNA to protein (translation) is carried out by mapping three letters of RNA to one letter of protein (Birney, 2001). This mapping is given by the so-called genetic code (Figure 2). The linear amino acid sequence of the protein (primary structure) determines the protein structure, and the protein structure determines function (Bergeron, 2002). ...
... The use of hidden Markov models in bioinformatics studies is ubiquitous; however, they are classically used to solve the following problems (Birney, 2001): • Gene prediction (discrimination of introns and exons). • Protein folding (prediction of secondary structure from primary structure). ...
Article
This document reviews the use of hidden Markov models in the analysis of biological sequences, and specifically how they are employed in computational molecular biology.
... Alternatively, model-based approaches have also been adopted for time series classification, e.g. Hidden Markov Model (HMM)-based approaches for biological sequence classification (Birney, 2001). ...
Preprint
We present a general framework for classifying partially observed dynamical systems based on the idea of learning in the model space. In contrast to the existing approaches using model point estimates to represent individual data items, we employ posterior distributions over models, thus taking into account in a principled manner the uncertainty due to both the generative (observational and/or dynamic noise) and observation (sampling in time) processes. We evaluate the framework on two testbeds - a biological pathway model and a stochastic double-well system. Crucially, we show that the classifier performance is not impaired when the model class used for inferring posterior distributions is much more simple than the observation-generating model class, provided the reduced complexity inferential model class captures the essential characteristics needed for the given classification task.
... We may also want to determine which protein family this novel protein sequence belongs to. HMMs have been used to build a variety of sophisticated sequence analysis techniques that effectively describe biological sequences (Birney, 2001; Durbin et al., 1998; Yoon, 2009). ...
Article
Full-text available
The major objective of the paper is to review the theory of the hidden Markov model, a very general type of probabilistic model for sequences of symbols. For the hidden Markov model to be applicable to real-world problems, three key problems must be addressed. We first go over how to choose the best state sequence to explain an observation sequence, then how to calculate the probability of an observation sequence, and finally how to maximize the probability of the observation sequence. From these three angles, we review the mathematical concept behind the identification of CpG islands. The entire process and the study of the outcomes are tackled by examining hypothetical and real DNA sequences side by side. We use well-known biological sequence analysis servers to carry out the experiment. Analytical and algorithmic approaches are compared on the hypothetical DNA sequence example.
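The first of the three problems named in the abstract above, choosing the best state sequence for an observation sequence, is commonly solved with the Viterbi algorithm. The abstract does not name the algorithm, so its use here is an assumption, and the sketch below is illustrative rather than the authors' method. The two states and all probabilities (`STATES`, `START`, `TRANS`, `EMIT`) are hypothetical toy values chosen only to mimic a CpG-island versus background model.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_path) for the observation sequence obs."""
    # V[i][s]: best log-probability of any state path ending in s at position i.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[i - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            V[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return V[-1][last], path

# Hypothetical toy parameters (not taken from any cited paper).
STATES = ("island", "background")
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

if __name__ == "__main__":
    log_prob, path = viterbi("ATATCGCGCGCGATAT", STATES, START, TRANS, EMIT)
    print(log_prob, path)
```

Working in log space avoids numerical underflow on long sequences; the probabilities must all be non-zero for `math.log` to be defined, which holds for the toy values above.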
... In bioinformatics, HMMs are used to model biological sequences (Franzese & Iuliano, 2019). Profile HMMs are Markov models applied to protein families (Birney, 2001). They are able to capture the evolutionary changes that have occurred in a set of related sequences by using position-specific information about how conserved each amino acid is, based on a multiple alignment (EMBL-EBI, n.d.). ...
Thesis
Full-text available
In the Hudson Lab, which is focused on discovering new antibiotics, bacterial samples are taken from the environment and cultured in large quantities. They are then tested for antibiotic resistance before they are sequenced and their secondary metabolite compounds are extracted. This is a lengthy and expensive process that becomes more difficult as the number of samples increases. This project assessed a different approach to rejuvenating antibiotic development with antiSMASH. antiSMASH is an online tool, created by collaborators from many different institutions, that uses profile Hidden Markov Models (pHMMs) to detect gene clusters that produce secondary metabolites in bacteria. The antiSMASH tool has its own repository of these “profiles”, which hold position-specific information about the amino acids of a protein-encoding gene, derived from multiple sequence alignments. Once a genome is entered into antiSMASH, the tool reports to the user whether these profiles are detected and hence whether a certain metabolite/cluster is present. Many gene clusters are known to produce metabolites with antimicrobial properties, which the antiSMASH tool could potentially detect. The goal was to use this tool as a screen for viable antibiotic candidates and so identify a potential antibiotic discovery pipeline that saves time and reduces costs. In this project, 30 genomes were fed into antiSMASH. They were divided into positive and negative controls, known producers and unknown producers. We then looked at the tool's ability to screen for antibiotics in each of those data types.
... This study is modelled significantly on heterogeneous groups. Birney (2001) investigated biomolecular sequences using HMMs for gene prediction. That paper briefly described the techniques used in sequence analysis. ...
Article
Diabetes is a chronic disease that occurs when the pancreas is not able to secrete insulin. Insulin is an important factor that transforms glucose into energy. Analysing multiple DNA sequences related to diabetes is helpful in deriving more information about the disease. The Profile Hidden Markov Model has wide application in molecular biology. We therefore emphasize the use of the PHMM for this Multiple Sequence Alignment (MSA). The main objective of this paper is to find the sequence pattern that the disease follows, estimating the parameters with the Baum-Welch algorithm and finding the optimal path with the Viterbi algorithm. All valuable information from the sequences is obtained using the PHMM.
... HMMs are recurring themes in computational biology since often biological sequence analysis is just a matter of putting the right label on each residue. For example, in genomics, it is possible to use HMMs to label nucleotides as exons, introns, or intergenic sequences, as reported in [89][90][91], while, in proteomics, HMM applications are available, for example, in protein modeling [92]. ...
Article
Full-text available
Background and objective: Mechanistic-based Model simulations (MM) are an effective approach commonly employed, for research and learning purposes, to better investigate and understand the inherent behavior of biological systems. Recent advancements in modern technologies and the large availability of omics data allowed the application of Machine Learning (ML) techniques to different research fields, including systems biology. However, the availability of information regarding the analyzed biological context, sufficient experimental data, as well as the degree of computational complexity, represent some of the issues that both MMs and ML techniques could present individually. For this reason, recently, several studies suggest overcoming or significantly reducing these drawbacks by combining the above-mentioned two methods. In the wake of the growing interest in this hybrid analysis approach, with the present review, we want to systematically investigate the studies available in the scientific literature in which both MMs and ML have been combined to explain biological processes at genomics, proteomics, and metabolomics levels, or the behavior of entire cellular populations. Methods: Elsevier Scopus®, Clarivate Web of Science™ and National Library of Medicine PubMed® databases were queried using the queries reported in Table 1, resulting in 350 scientific articles. Results: Only 14 of the 350 documents returned by the comprehensive search conducted on the three major online databases met our search criteria, i.e. present a hybrid approach consisting of the synergistic combination of MMs and ML to treat a particular aspect of systems biology. Conclusions: Despite the recent interest in this methodology, from a careful analysis of the selected papers, it emerged how examples of integration between MMs and ML are already present in systems biology, highlighting the great potential of this hybrid approach at both micro and macro biological scales.
... The hidden Markov models (HMMs) are the prospective ones [1][2][3][4]. First, they form the mathematical basis for the description of the processes in telecommunications [5,6], finance [7][8][9], biology and medicine [10][11][12][13], tracking and navigation [14,15] and speech and image recognition [16][17][18][19]. Second, the HMM advanced theoretical framework provides a solution to a series of optimal estimation and control problems [20]. ...
Article
Full-text available
The paper aims to identify hidden Markov model parameters. The unobservable state represents a finite-state Markov jump process. The observations contain Wiener noise with state-dependent intensity. The identified parameters include the transition intensity matrix of the system state, conditional drift and diffusion coefficients in the observations. We propose an iterative identification algorithm based on the fixed-interval smoothing of the Markov state. Using the calculated state estimates, we restore all required system parameters. The paper contains a detailed description of the numerical schemes of state estimation and parameter identification. The comprehensive numerical study confirms the high precision of the proposed identification estimates.
... Probing their structure is even more difficult. Nevertheless, the broad popularity and application of HMCs (not only in the study of complex systems [10], but also in coding theory [24], stochastic processes [25], stochastic thermodynamics [26], speech recognition [27], computational biology [28,29], epidemiology [30], and finance [31]) give testimony to the ubiquity of truly complex systems, in both theory and nature. ...
Article
Even simply defined, finite-state generators produce stochastic processes that require tracking an uncountable infinity of probabilistic features for optimal prediction. For processes generated by hidden Markov chains, the consequences are dramatic. Their predictive models are generically infinite state. Until recently, one could determine neither their intrinsic randomness nor structural complexity. The prequel to this work introduced methods to accurately calculate the Shannon entropy rate (randomness) and to constructively determine their minimal (though, infinite) set of predictive features. Leveraging this, we address the complementary challenge of determining how structured hidden Markov processes are by calculating their statistical complexity dimension—the information dimension of the minimal set of predictive features. This tracks the divergence rate of the minimal memory resources required to optimally predict a broad class of truly complex processes.
... In human genomes the pair CG often transforms to (methyl-C) G, which often transforms to TG. Hence the pair CG appears less often than expected from the independent frequencies of C and G alone. [5] Due to biological reasons, this process is sometimes suppressed in short stretches of genomes, such as the start regions of many genes. These areas are called CpG islands (p denotes "pair"). ...
Article
Full-text available
CpG islands (CGIs) play a vital role in genome analysis as genomic markers. Identification of the CpG pair has contributed not only to the prediction of promoters but also to the understanding of the epigenetic causes of cancer. In the human genome [1], wherever the dinucleotide CG occurs, the C nucleotide (cytosine) undergoes chemical modification. There is a relatively high probability that this modification mutates the C into a T. For biologically important reasons, this modification process is suppressed in short stretches of the genome, such as 'start' regions. In these regions [2], CpG dinucleotides are more predominant than elsewhere. Such regions are called CpG islands. DNA methylation is an effective means by which gene expression is silenced. In normal cells, DNA methylation functions to prevent the expression of imprinted and inactive X chromosome genes. In cancerous cells, DNA methylation inactivates tumor-suppressor genes as well as DNA repair genes, and can disrupt cell-cycle regulation. Most current methods for identifying CGIs suffer from various limitations and involve a lot of human intervention. This paper gives an easy search technique based on data mining with Markov chains in genes. A Markov chain model is applied to study the probability of occurrence of the C-G pair in a given gene sequence. Maximum likelihood estimators for the transition probabilities of each model are developed, and the resulting log-odds ratio indicates the presence or absence of CpG islands in the given gene, which yields many facts relevant to cancer detection in the human genome.
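The log-odds scoring described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes two first-order Markov chains (one trained on island sequences, one on background sequences), estimates their transition probabilities from dinucleotide counts, and sums the per-step log ratio. The function names, the pseudocount (which makes the estimates smoothed rather than pure maximum likelihood) and the tiny training sequences are all hypothetical.

```python
import math

ALPHABET = "ACGT"

def estimate_transitions(sequences, pseudocount=1.0):
    """Estimate P(b | a) from dinucleotide counts, smoothed by a pseudocount."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }

def log_odds(seq, plus, minus):
    """Sum of log2 ratios of island ('plus') to background ('minus') transitions."""
    return sum(math.log2(plus[a][b] / minus[a][b]) for a, b in zip(seq, seq[1:]))

# Hypothetical toy training data; real models would use annotated genomic regions.
ISLAND_TRAIN = ["CGCGCGGCGC", "GCGCGCCGCG"]
BACKGROUND_TRAIN = ["ATATTAATTA", "TTAATATAAT"]

PLUS = estimate_transitions(ISLAND_TRAIN)
MINUS = estimate_transitions(BACKGROUND_TRAIN)

if __name__ == "__main__":
    for query in ("CGCGCG", "ATATAT"):
        print(query, log_odds(query, PLUS, MINUS))
```

A positive score suggests the query looks more like the island model and a negative score suggests background. The pseudocount keeps every transition probability non-zero, so the ratio inside the logarithm is always defined.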
... The recent work introduced a suite of tools to capture this state of affairs for a broad class of stochastic processes-those used not only in the study of complex systems [10], but also in coding theory [13], stochastic processes [14], stochastic thermodynamics [15], speech recognition [16], computational biology [17,18], epidemiology [19], and finance [20]. ...
Preprint
The ϵ-machine is a stochastic process' optimal model -- maximally predictive and minimal in size. It often happens that to optimally predict even simply-defined processes, probabilistic models -- including the ϵ-machine -- must employ an uncountably-infinite set of features. To constructively work with these infinite sets we map the ϵ-machine to a place-dependent iterated function system (IFS) -- a stochastic dynamical system. We then introduce the ambiguity rate that, in conjunction with a process' Shannon entropy rate, determines the rate at which this set of predictive features must grow to maintain maximal predictive power. We demonstrate, as an ancillary technical result which stands on its own, that the ambiguity rate is the (until now missing) correction to the Lyapunov dimension of an IFS's attractor. For a broad class of complex processes and for the first time, this then allows calculating their statistical complexity dimension -- the information dimension of the minimal set of predictive features.
... This established the vital role that information plays in physical theories of complex systems. In particular, the application of hidden Markov chains to model and analyze the randomness and structure of physical systems has seen considerable success, not only in complex systems [14], but also in coding theory [15], stochastic processes [16], stochastic thermodynamics [17], speech recognition [18], computational biology [19,20], epidemiology [21], and finance [22], to offer a nonexhaustive list of examples. ...
Article
Full-text available
Hidden Markov chains are widely applied statistical models of stochastic processes, from fundamental physics and chemistry to finance, health, and artificial intelligence. The hidden Markov processes they generate are notoriously complicated, however, even if the chain is finite state: no finite expression for their Shannon entropy rate exists, as the set of their predictive features is generically infinite. As such, to date one cannot make general statements about how random they are nor how structured. Here, we address the first part of this challenge by showing how to efficiently and accurately calculate their entropy rates. We also show how this method gives the minimal set of infinite predictive features. A sequel addresses the challenge’s second part on structure.
... Probing their structure is even more difficult. Nevertheless, the broad popularity and application of HMCs (not only in the study of complex systems [10], but also in coding theory [24], stochastic processes [25], stochastic thermodynamics [26], speech recognition [27], computational biology [28,29], epidemiology [30], and finance [31]) give testimony to the ubiquity of truly complex systems, in both theory and nature. ...
Preprint
Inferring models from samples of stochastic processes is challenging, even in the most basic setting in which processes are stationary and ergodic. A principle reason for this, discovered by Blackwell in 1957, is that finite-state generators produce realizations with arbitrarily-long dependencies that require an uncountable infinity of probabilistic features be tracked for optimal prediction. This, in turn, means predictive models are generically infinite-state. Specifically, hidden Markov chains, even if finite, generate stochastic processes that are irreducibly complicated. The consequences are dramatic. For one, no finite expression for their Shannon entropy rate exists. Said simply, one cannot make general statements about how random they are and finite models incur an irreducible excess degree of unpredictability. This was the state of affairs until a recently-introduced method showed how to accurately calculate their entropy rate and, constructively, to determine the minimal set of infinite predictive features. Leveraging this, here we address the complementary challenge of determining how structured hidden Markov processes are by calculating the rate of statistical complexity divergence -- the information dimension of the minimal set of predictive features.
... For this segmentation, we employ Hidden Markov Models (HMM) [26,27] (Fig. 3). For the "observations" in the HMM models, we required a parameter that could be used as a surrogate for the percentage of randomness within a region. ...
Article
Nucleosomal profiling is an effective method to determine the positioning and occupancy of nucleosomes, which is essential to understand their roles in genomic processes. However, the positional randomness across the genome and its relationship with nucleosome occupancy remains poorly understood. Here we present a computational method that segments the profile into nucleosomal domains and quantifies their randomness and relative occupancy level. Applying this method to published data, we find on average ~ 3-fold differences in the degree of positional randomness between regions typically considered "well-ordered", as well as an unexpected predominance of only two types of domains of positional randomness in yeast cells. Further, we find that occupancy levels between domains actually differ maximally by ~ 2-3-fold in both cells, which has not been described before. We also developed a procedure by which one can estimate the sequencing depth that is required to identify nucleosomal positions even when regional positional randomness is high. Overall, we have developed a pipeline to quantitatively characterize domain-level features of nucleosome randomness and occupancy genome-wide, enabling the identification of otherwise unknown features in nucleosomal organization.
... Gene finding refers to the process of identifying protein-coding segments of a given DNA sequence. Gene searching algorithms are used to find proteins [8,10,11,13]. This is not an easy task, as the structure of the gene has varying degrees of complexity. ...
... Merging biological knowledge with computational techniques, machine learning aims to build a predictive model by learning the difference between coded and non-coded regions and use the learned model to predict the coding regions in the DNA sequence. The Hidden Markov Model (HMM) is the foundation of many current gene recognition algorithms [10][11][12][13][14]. The Hidden Markov Model considers the DNA sequence as a random process and automatically finds its internal hidden rules based on the difference in the frequency of nucleotide selection between the encoded and non-encoded DNA sequences. ...
Article
Full-text available
Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.
... A multimedia-driven networked system has been used to capture the future behaviour of specific users and deliver lectures from dedicated courseware according to personalized behavior [2]. HMMs are able to detect gene encoding in different strands of DNA through sequencing of eukaryotes [3]. Hasan and Nath presented an HMM approach for stock market forecasting [4]. ...
Chapter
Full-text available
The use of web-enabled e-learning systems for education has increased in recent years. In the present study, a Hidden Markov Model (HMM) driven approach is used to predict the future lecture topics or paths in C programming that students access in an adaptive web-enabled educational system. Data were collected from the e-learning system and preprocessed; the HMM parameters were then adjusted and a modified algorithm was used to train on the data. This system helps faculty identify students’ problems and provide assistance according to their needs. The experimental results show that the prediction accuracy of the proposed system is 80.23%, which is better than that of a neural network multilayer perceptron model, whose accuracy is 78.15%.
... Bioinformatics analysis. Initial screening for potential toxins in the genome of V. parahaemolyticus RIMD2210633 was conducted by using the HMMER program [63] with a previously curated toxin domain profile database [35,36]. To identify homologous RHS-containing toxins, we queried the non-redundant database at NCBI with the Rhs fragment of RhsP using the PSI-BLAST program with a profile-inclusion E-value threshold of 0.005 [64]. ...
Article
Full-text available
Type VI secretion systems (T6SSs) translocate effector proteins, such as Rhs toxins, to eukaryotic cells or prokaryotic competitors. All T6SS Rhs-type effectors characterized thus far contain a PAAR motif or a similar structure. Here, we describe a T6SS-dependent delivery mechanism for a subset of Rhs proteins that lack a PAAR motif. We show that the N-terminal Rhs domain of protein RhsP (or VP1517) from Vibrio parahaemolyticus inhibits the activity of the C-terminal DNase domain. Upon auto-proteolysis, the Rhs fragment remains inside the cells, and the C-terminal region interacts with PAAR2 and is secreted by T6SS2; therefore, RhsP acts as a pro-effector. Furthermore, we show that RhsP contributes to the control of certain "social cheaters" (opaR mutants). Genes encoding proteins with similar Rhs and PAAR-interacting domains, but diverse C-terminal regions, are widely distributed among Vibrio species.
... i.e., the probability of the partial observation sequence, x_1 … x_i, from the start state until the state k at time i. This variable can also be calculated efficiently by using the recurrence (1)(2)(3)(4)(5)(6)(7)(8). ...
Article
The prediction of the secondary structure of proteins is one of the most studied problems in computational biology. However, the accuracy of the predicted secondary structure is insufficient for practical utility. In this paper, we propose an algorithmic approach based on Hidden Markov Models (HMMs) to model the problem of prediction. HMMs are often used for data mining in bioinformatics. In this research, we have built an HMM that models the prediction problem of protein secondary structure. Moreover, two procedures for estimating the probability parameters were performed by Maximum Likelihood Estimation (MLE) on protein sequences from a public database (Brookhaven PDB). Finally, a new prediction approach based on the a posteriori probability of hidden regimes has been implemented. Our model appears to be very efficient on single sequences: comparing the first results with the real secondary structure gives a score of 66.6%, which is encouraging for further improvement of the system.
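The forward variable mentioned in the excerpt above (the probability of the partial observation sequence x_1 … x_i ending in state k at position i) can be illustrated with a minimal sketch. The function name `forward`, the two-state toy model and all of its probabilities are assumptions made here for illustration; they are not taken from the cited paper.

```python
def forward(obs, start, trans, emit):
    """Forward algorithm for a discrete HMM (assumes a non-empty obs).

    obs   : list of symbol indices
    start : start[k]    = P(first state is k)
    trans : trans[j][k] = P(next state is k | current state is j)
    emit  : emit[k][x]  = P(symbol x | state k)
    Returns (f, likelihood) with f[i][k] = P(x_1..x_i, state at i is k).
    """
    states = range(len(start))
    # Initialisation: first symbol emitted from each possible start state.
    f = [[start[k] * emit[k][obs[0]] for k in states]]
    # Recurrence: sum over all predecessor states, then emit the symbol.
    for x in obs[1:]:
        prev = f[-1]
        f.append([emit[k][x] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states])
    # Termination: total probability of the whole observation sequence.
    return f, sum(f[-1])


# Hypothetical toy model: 2 hidden states, 2 observable symbols.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1]

f, likelihood = forward(obs, start, trans, emit)
print(likelihood)
```

The recurrence costs time proportional to the sequence length times the square of the number of states, which is what makes it cheaper than summing over every possible state path. For long sequences a practical implementation would rescale each row or work in log space to avoid numerical underflow.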
... [145] They have been vigorously applied in the analysis of biological sequences. [153][154][155] HMM can have either supervised or unsupervised learning depending on the training data provided. [116,156] HMM comprises two stochastic processes: an invisible process of hidden states and a visible process of observable symbols. ...
Article
T-cell epitopes are specific peptide sequences derived from foreign or own proteins that can initiate an immune response and which are recognized by specific T-cells when displayed on the surface of other cells. The prediction of T-cell epitopes is of particular interest in vaccine design, disease prevention and the development of immunotherapeutics. There are two principal categories of predictive methods: peptide-sequence based and peptide-structure-based. Sequence-based methods make use of various approaches to identify likely immunogenic amino acid sequences, such as sequence motifs, decision trees, partial least squares (PLS), quantitative matrices (QM), artificial neural networks (ANN), hidden Markov models (HMM), and support vector machines (SVM). Structure-based methods are more diverse in nature and involve approaches such as quantitative structure-activity relationships (QSAR), molecular modelling, molecular docking and molecular dynamics simulations (MD). This review highlights the key features of all of these approaches, provides some key examples of their application, and compares and contrasts the most important methods currently in use.
... In this study, the problem of estimating a driver's intention near a road intersection is studied using discrete HMM and the HSS framework. HMM has been applied to many applications including speech recognition, bioinformatics, finance, computer vision, etc. [7], [8], [9], [10]. Assuming the driver's decisions that affect the vehicle trajectory are governed by a Markov process, HMM is used to represent the stochastic process that results in the continuous vehicle observations using the Markov chain intuition. ...
... Alternatively, model-based approaches have also been adopted for time series classification, e.g. Hidden Markov Model (HMM)-based approaches for biological sequence classification (Birney, 2001). ...
Article
We present a general framework for classifying partially observed dynamical systems based on the idea of learning in the model space. In contrast to the existing approaches using model point estimates to represent individual data items, we employ posterior distributions over models, thus taking into account in a principled manner the uncertainty due to both the generative (observational and/or dynamic noise) and observation (sampling in time) processes. We evaluate the framework on two testbeds - a biological pathway model and a stochastic double-well system. Crucially, we show that the classifier performance is not impaired when the model class used for inferring posterior distributions is much more simple than the observation-generating model class, provided the reduced complexity inferential model class captures the essential characteristics needed for the given classification task.
... In an HMM one must select as emissions the monomers of the sequence, because they are the only known data, and as internal states the features to be estimated. An excellent case is provided by the polypeptides, for which it is just the amino acid sequence that causes the secondary structures, while in an HMM the amino acids are treated as emissions and the secondary structures as internal states [31][32][33]. HMMs can also be considered special instances of machine learning techniques, which are often used as alternatives for similar applications. ...
... In 2002, building on their long history of collaboration in bioinformatics [5][6][7][8][9], the authors received NSF funding to support a dissemination grant for curricular materials in bioinformatics education. The resulting BEDROCK Project (Bioinformatics Education Dissemination: Reaching Out, Connecting and Knitting together) was named in tribute to Birney's assertion that 'arguments of homology are the bedrock of bioinformatics' [10]. From its inception, BEDROCK has emphasized that nucleotide sequences, gene-expression levels and other bioinformatics data result from evolutionary processes (e.g. ...
... By using a Markov model, one can then simply compute the probability of the sequence generated according to this model [61]. HMMs are more successful because they can naturally accommodate models of sequence regions with uneven lengths, since most biological data has variable-length properties [62,63]. They are used for motif finding [64], multiple sequence alignment [65] and identification of protein structure [66]. ...
Article
Bioinformatics is a promising and innovative research field in the 21st century. Automatic gene prediction has been an actively researched field of bioinformatics. Despite a high number of techniques specifically dedicated to bioinformatics problems as well as many successful applications, we are at the beginning of a process to massively integrate the aspects and experiences of the different core subjects such as biology, medicine, computer science, engineering, chemistry, physics, and mathematics. Presently, a large number of gene identification tools are based on computational intelligence approaches. Here, we discuss the existing conventional as well as computational methods to identify genes, and various gene predictors are compared. The paper describes some drawbacks of the presently available methods and also discusses probable guidelines for future directions.
... The drawback of PSI-BLAST is that it can converge before finding all the true positive hits. On the other hand, probabilistic model based profile HMMs (Eddy, 1998, 2011) are more sensitive and efficient in remote homolog detection (Birney, 2001; Karplus et al., 1998). HMMER utility provides one such package named Jackhmmer, which iterates profile-HMMs till convergence, to detect distant relationships of proteins (Eddy, 2011; Finn et al., 2011). ...
Article
Motivation: In the post-genomic era, automatic annotation of protein sequences using computational homology-based methods is highly desirable. However, often protein sequences diverge to an extent where detection of homology and automatic annotation transfer is not straightforward. Sophisticated approaches to detect such distant relationships are needed. We propose a new approach to identify deep evolutionary relationships of proteins to overcome shortcomings of the available methods. Results: We have developed a method to identify remote homologues more effectively from any protein sequence database by using several cascading events with Hidden Markov Models (C-HMM). We have implemented clustering of hits and profile generation of hit clusters to effectively reduce the computational timings of the cascaded sequence searches. Our C-HMM approach achieved 94%, 83% and 40% coverage at family, superfamily and fold levels respectively, when applied on diverse protein folds. We have compared C-HMM with various remote homology detection methods and discuss the trade-offs between coverage and false positives. Availability and implementation: A standalone package implemented in Java along with detailed documentation can be downloaded from https://github.com/RSLabNCBS/C-HMM. Contact: mini@ncbs.res.in.
... Starting in the last decade, the power and flexibility of HMM-based statistical machine learning techniques has been recognized by the bioinformatics community who adapted HMM techniques to various computational biology applications such as gene prediction, protein fold recognition and sequence alignment (these and other uses of HMMs in bioinformatics have been outlined in [20,31]). The HMMER workload that we used in our study is one of the most common HMM-based applications in bioinformatics, and it uses "profile HMMs" for analyzing protein families and classifying proteins by comparing them to statistical models of protein families. ...
... The erroneous input sources due to technical limitations have created the undesirable result mentioned previously, thus there is a need to implement an automated scheme that can be used to accurately identify and cluster transcript sequences. This study proposes to implement an automated and high-throughput scheme to filter out and mark the erroneous regions of each EST sequence using Hidden Markov Models [15][16][17][18][19], so that EST sequences may be accurately clustered together and give users more reliable information when locating SNPs [20,21]. The purpose of this paper is not to replace the existing base-calling programs such as Phred or Phrap [22] but to facilitate them, and in the meantime to point out the hidden errors embedded in the EST sequence data during the process. ...
Article
In the post-Human Genome Project era, researchers have shifted their emphasis from mapping the human genome to discovering correlations between genetic markers and clinical phenotypes, where finding effective treatments against disease is becoming a crucial and applicable goal. Expressed Sequence Tag (EST) data play an important role in the completion of human genome sequencing and are widely used for gene discovery, polymorphism analysis, expression studies, and gene prediction. However, owing to chemical properties and manufacturing processes, EST data might contain errors, which might mislead bioinformatics researchers who attempt to use EST libraries to identify Single Nucleotide Polymorphisms (SNPs). Therefore, this study proposes a paradigm for EST data with which users might better address this issue and correctly identify SNPs.
... The majority of molecular data used in computational biology consists in sequences of nucleotides corresponding to the primary structure of DNA and RNA, or sequences of amino acids corresponding to the primary structure of proteins. Birney in his work [1], reviews gene-prediction HMMs and protein family HMMs. The role of gene-prediction in DNA is to discover the location of genes on the genome. ...
Article
Ubiquitous systems use context information to adapt appliance behavior to human needs. Even more convenience is reached if the appliance foresees the user's desires and acts proactively. This paper introduces Hidden Markov Models, in order to anticipate the next movement of some persons. The optimal configuration of the model is determined by evaluating some movement sequences of real persons within an office building. The simulation results show accuracy in next location prediction reaching up to 92%.
Article
The ε-machine is a stochastic process's optimal model—maximally predictive and minimal in size. It often happens that to optimally predict even simply defined processes, probabilistic models—including the ε-machine—must employ an uncountably infinite set of features. To constructively work with these infinite sets we map the ε-machine to a place-dependent iterated function system (IFS)—a stochastic dynamical system. We then introduce the ambiguity rate that, in conjunction with a process's Shannon entropy rate, determines the rate at which this set of predictive features must grow to maintain maximal predictive power over increasing horizons. We demonstrate, as an ancillary technical result that stands on its own, that the ambiguity rate is the (until now missing) correction to the Lyapunov dimension of an IFS's attracting invariant set. For a broad class of complex processes, this then allows calculating their statistical complexity dimension—the information dimension of the minimal set of predictive features.
Chapter
In recent years, biological research has witnessed a sea change mainly spearheaded by the advent of novel high throughput technologies that can provide unprecedented amounts of valuable data. This has given rise to novel fields sharing the popular suffix ‘omics’. Genomics/transcriptomics, proteomics, metabolomics, interactomics/regulomics and numerous other terms have been coined to categorize this ever increasing number of new fields. Biomarkers comprise the most critical tools for the early detection, diagnosis, prognosis and prediction of diseases, providing key clues for drug development processes. A significant challenge is to define appropriate levels of specificity and sensitivity of new biomarkers in detecting complex diseases. The establishment of new biomarkers is not only an issue of optimizing wet lab experiments but also of designing appropriate and robust data analysis methods. Various approaches, like multivariate analysis methods as well as standard statistical tests, have been applied to search for the important features in ‘omics’ data. Likewise, several methods, e.g. FDA, SVM, CART, nonparametric kernels, kNN, boosted decision stump and genetic algorithms, have been reported. However, it still remains an unsolved challenge to analyze and interpret the enormous volumes of ‘omics’ data.
Chapter
Current advances in technology allow for the efficient capturing and storage of high-resolution and high-frequency person movement data.
Chapter
The pattern matching problem is to find all occurrences of a given pattern in an input text. In particular, we consider the case when the pattern is a stochastic regular language where each pattern string has its own probability. Our problem is to find all matching patterns—(start, end) indices in the text—whose probability is larger than a given threshold probability. A pattern matching procedure is frequently used on streaming data in several applications, and often it is very challenging to find the start index of a matching in streaming data. We design an efficient algorithm for the stochastic pattern matching problem over streaming data based on the transformation of the pattern PFA into a weighted automaton and a constant bound on the number of backtracks required to find a start index while reading the streaming input. We also employ heuristics that enable us to reduce the number of backtracks, which improves the practical runtime of our algorithm. We establish the tight theoretical runtime of the proposed algorithm and experimentally demonstrate its practical performance. Finally, we show a possible application of our algorithm to another stochastic pattern matching problem where we search for the maximum probability substring of a text that is a superstring of a specified string.
Article
Recently more and more universities have been incorporating HPC (High Performance Computing) in their computing curriculum. The Bio-Grid REU (Research Experience for Undergraduates) Site offers undergraduate students interested or experienced in HPC a summer research opportunity to participate in projects that apply HPC in various life-science disciplines. The projects are associated with the Bio-Grid Initiatives conducted at the University of Connecticut. Training seminars are designed to equip students with background knowledge such as basic parallel programming, large-scale data analytics, and middleware support, etc., as well as some ongoing projects using these computing methods. Students participate in several collaborative projects supported by a campus-wide computational and data grid. The REU project introduces such interdisciplinary research work to students in the early stage of their academic career to spark their interest. The project aims at preparing future software engineers to formalize and solve emerging life-science problems, as well as life-science researchers with a strong background in high-performance computing. The Bio-Grid REU Site was supported by the National Science Foundation from 08-10 and 12-14, with a website located at http://biogrid.engr.uconn.edu/REU.
Conference Paper
One of the tasks in the analysis of biomolecular data is the clustering of protein sequences. Clustering protein sequences can lead to the discovery of protein subfamilies. One of the methods developed in bioinformatics is the hidden Markov model (HMM). Globin is one of the proteins in the blood. Clustering results depend on the HMM architecture, which is obtained from the training process. The training process may end in a local minimum or overfit. Therefore, alternatives are needed to obtain more accurate clustering with HMMs. This research implements the HMM method for clustering globin protein sequences. The globin protein sequences are taken from the UNIPROT database. A prototype is developed in the Java programming language, and accuracy is measured by the number of misclustered members: the fewer the misclustered members, the more accurate the clustering. To improve HMM-based clustering of protein sequences, prior knowledge is used as the training data. The prior knowledge is obtained from a multiple alignment of the protein sequences. The HMM produces a good clustering if part of the multiple alignment of the protein sequences is used as the training data.
Article
CpG islands (CGIs) play a vital role in genome analysis as genomic markers. Identification of the CpG pair has contributed not only to the prediction of promoters but also to the understanding of the epigenetic causes of cancer. In the human genome [1], wherever the dinucleotide CG occurs, the C nucleotide (cytosine) undergoes chemical modification. There is a relatively high probability that this modification mutates C into T. For biologically important reasons, the modification process is suppressed in short stretches of the genome, such as ‘start’ regions. In these regions [2], more CpG dinucleotides are found than elsewhere. Such regions are called CpG islands. DNA methylation is an effective means by which gene expression is silenced. In normal cells, DNA methylation functions to prevent the expression of imprinted and inactive X chromosome genes. In cancerous cells, DNA methylation inactivates tumor-suppressor genes as well as DNA repair genes, and can disrupt cell-cycle regulation. Most current methods for identifying CGIs suffer from various limitations and involve a lot of human intervention. This paper gives an easy searching technique based on data mining with Markov chains in genes. A Markov chain model has been applied to study the probability of occurrence of the C-G pair in the given gene sequence. Maximum likelihood estimators for the transition probabilities of each model have been developed, and the log-odds ratio that is calculated estimates the presence or absence of CpG islands in the given gene, which brings in many facts for cancer detection in the human genome.
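The Markov chain log-odds idea in the abstract above can be sketched as follows. This is an illustrative reconstruction of the general technique, not the cited paper's code: the function names, the add-one pseudocount (pure maximum likelihood would use a pseudocount of zero) and the tiny training sequences are all assumptions invented for the example.

```python
import math

ALPHABET = "ACGT"


def estimate_transitions(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities from training sequences.

    With pseudocount=0 this is the maximum likelihood estimate; the default
    add-one pseudocount (an assumption here) avoids zero probabilities.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET}


def log_odds(seq, plus, minus):
    """Sum of log2 ratios of island ('plus') to background ('minus')
    transition probabilities; positive values favour the island model."""
    return sum(math.log2(plus[a][b] / minus[a][b])
               for a, b in zip(seq, seq[1:]))


# Made-up toy training data, for illustration only.
island_seqs = ["CGCGCGGCGC", "GCGCGCCGCG"]
background_seqs = ["ATTATACATG", "TTAACATGTA"]

plus = estimate_transitions(island_seqs)
minus = estimate_transitions(background_seqs)

score_cg = log_odds("CGCGCG", plus, minus)   # CG-rich query
score_at = log_odds("ATATAT", plus, minus)   # AT-rich query
print(score_cg, score_at)
```

On this toy data the CG-rich query scores above zero and the AT-rich query below zero. Real use would need training sets of annotated islands and background sequence, and scores are often normalised by sequence length before thresholding.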
Conference Paper
The Bio-Grid REU (Research Experience for Undergraduates) Site offers undergraduate students the opportunity to participate in the research activities associated with the Bio-Grid Initiatives conducted at UConn. The initiatives aim at advancing the application of modern computing infrastructures and information technology to research and practice in various life-science disciplines. Training seminars are designed to equip students with preliminary background knowledge such as basic parallel programming skills, large-scale data analytics, and middleware support, etc., as well as some ongoing life-science research projects using these computing methods. Students participate in research activities associated with several collaborative projects supported by a campus-wide computational and data grid. The Site was supported by the National Science Foundation from 08-10 and 12-14. The REU project introduces such interdisciplinary research work to students in the early stage of their academic career to spark their interest. The project aims at preparing future software engineers to formalize and solve emerging life-science problems, as well as life-science researchers with a strong background in high-performance computing.
Article
Recently, bio-informatics has become a central focus area of research in the life sciences owing to the accumulation of biological data from different organism projects, such as the Human Genome Project, and other studies [1][2]. The interaction between molecular biology and informatics is reflected in the word bio-informatics [3]. Bio-informatics is a new, developing field, and researchers, especially from computer science, have a good opportunity to contribute by engineering solutions to the open problems existing in this field.
Article
Classifiers based on discriminant models achieved the highest accuracy among protein classification methods in remote homology detection, but all of these classifiers were troubled by imbalanced training sets in modeling. This paper presents a protein classification method based on optimization of the discriminant model, which further improves classifier performance by setting different penalty coefficients for the positive and negative samples to balance the training set weights. Comparative experiments show that the method based on the optimized discriminant model obtains higher accuracy, and that it can improve the performance of all discriminant-model classifiers through parameter optimization.
Article
The selection of a scoring matrix and gap penalty parameters continues to be an important problem in sequence alignment. We describe here an algorithm, the 'Bayes block aligner', which bypasses this requirement. Instead of requiring a fixed set of parameter settings, this algorithm returns the Bayesian posterior probability for the number of gaps and for the scoring matrices in any series of interest. Furthermore, instead of returning the single best alignment for the chosen parameter settings, this algorithm returns the posterior distribution of all alignments considering the full range of gapping and scoring matrices selected, weighing each in proportion to its probability based on the data. We compared the Bayes aligner with the popular Smith-Waterman algorithm with parameter settings from the literature which had been optimized for the identification of structural neighbors, and found that the Bayes aligner correctly identified more structural neighbors. In a detailed examination of the alignment of a pair of kinase and a pair of GTPase sequences, we illustrate the algorithm's potential to identify subsequences that are conserved to different degrees. In addition, this example shows that the Bayes aligner returns an alignment-free assessment of the distance between a pair of sequences.
Article
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding codon. Two neural networks are used, as in (Brunak, Engelbrecht, & Knudsen 1991), for splice site prediction. We show that this simple model performs quite well. For a cross-validated standard test set of 304 genes [ftp:@www-hgc.lbl.gov/pub/genesets] in human DNA, our gene-finding system identified up to 85% of protein-coding bases correctly with a specificity of 80%. 58% of exons were exactly identified with a specificity of 51%. Genie is shown to perform favorably compared with several other gene-finding systems.
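The dynamic programming step described above, finding the maximum-probability path through the model, can be illustrated with the Viterbi algorithm on an ordinary HMM. This sketch is a deliberate simplification: it is not the generalized HMM used by Genie (there are no length distributions, sensors or neural networks), and the two-state model, its probabilities and the function name `viterbi` are assumptions made for illustration.

```python
import math


def viterbi(obs, start, trans, emit):
    """Most probable state path for a discrete HMM, computed in log space.

    obs is a non-empty list of symbol indices; start, trans and emit are
    nested lists of probabilities. Returns (path, log_probability).
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    states = range(len(start))
    v = [log(start[k]) + log(emit[k][obs[0]]) for k in states]
    back = []  # back[i][k] = best predecessor of state k at position i + 1
    for x in obs[1:]:
        ptr, nxt = [], []
        for k in states:
            best = max(states, key=lambda j: v[j] + log(trans[j][k]))
            ptr.append(best)
            nxt.append(v[best] + log(trans[best][k]) + log(emit[k][x]))
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


# Hypothetical model: state 0 is AT-rich, state 1 is GC-rich.
SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]
obs = [SYMBOLS[c] for c in "AATTGCGCGC"]

path, logp = viterbi(obs, start, trans, emit)
print(path)
```

Working in log space keeps the products of many small probabilities from underflowing on long sequences. A generalized HMM additionally lets each state emit a whole segment with its own length distribution, which this sketch does not attempt.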
Article
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
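As a rough guide to what such an estimate looks like, the standard posterior-mean form under a Dirichlet mixture prior is written out below. The notation is an assumption chosen here and may differ from the paper's: n_i is the observed count of amino acid i in a column, |n| is the total count, α_j is the parameter vector of mixture component j with |α_j| the sum of its entries, q_j is the mixture weight and B is the multivariate beta function.

```latex
\hat{p}_i \;=\; \sum_{j=1}^{l} \Pr(\alpha_j \mid \vec{n})\,
          \frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\alpha_j|},
\qquad
\Pr(\alpha_j \mid \vec{n}) \;=\;
  \frac{q_j \, \dfrac{B(\vec{n} + \alpha_j)}{B(\alpha_j)}}
       {\sum_{k=1}^{l} q_k \, \dfrac{B(\vec{n} + \alpha_k)}{B(\alpha_k)}},
\qquad
B(\vec{x}) \;=\; \frac{\prod_i \Gamma(x_i)}{\Gamma\!\left(\sum_i x_i\right)}
```

Each component contributes a pseudocount-smoothed frequency, weighted by how well that component explains the observed counts. With few observations the prior dominates; with many, the estimate approaches the observed frequencies.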
Article
We introduce a general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived to account for the many substantial differences in gene density and structure observed in distinct C + G compositional regions of the human genome. In addition, new models of the donor and acceptor splice signals are described which capture potentially important dependencies between signal positions. The model is applied to the problem of gene identification in a computer program, GENSCAN, which identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly. The program is also capable of indicating fairly accurately the reliability of each predicted exon. Consistently high levels of accuracy are observed for sequences of differing C + G content and for distinct groups of vertebrates.
Article
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Article
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.
Article
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/ , in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/ . The latest version (4.3) of Pfam contains 1815 families. These Pfam families match 63% of proteins in SWISS-PROT 37 and TrEMBL 9. For complete genomes Pfam currently matches up to half of the proteins. Genomic DNA can be directly searched against the Pfam library using the Wise2 package.
Article
We introduce a maximum discrimination method for building hidden Markov models (HMMs) of protein or nucleic acid primary sequence consensus. The method compensates for biased representation in sequence data sets, superseding the need for sequence weighting methods. Maximum discrimination HMMs are more sensitive for detecting distant sequence homologs than various other HMM methods or BLAST when tested on globin and protein kinase catalytic domain sequences.
Article
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling. This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
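The discrimination test mentioned above asks how well a sequence fits a model. One common way to express such a fit is a log-odds score: the log-likelihood of the sequence under the HMM (summed over all state paths with the forward algorithm) minus its log-likelihood under a background model. The sketch below is a minimal illustration under stated assumptions, not the scoring used in the paper: the two-state model, the two-letter alphabet `AB`, the uniform background and all function names are hypothetical.

```python
import math

def forward_log_likelihood(seq, states, start, trans, emit):
    """Return log P(seq | model), summed over all state paths.

    Uses the forward recursion with per-position rescaling so that
    long sequences do not underflow.
    """
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    log_like = 0.0
    for i in range(len(seq)):
        if i > 0:
            alpha = {
                s: emit[s][seq[i]] * sum(alpha[r] * trans[r][s] for r in states)
                for s in states
            }
        scale = sum(alpha.values())  # assumed non-zero for this sketch
        log_like += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return log_like

def log_odds(seq, model, background):
    """Log-odds of seq under the HMM versus an i.i.d. background model."""
    null = sum(math.log(background[ch]) for ch in seq)
    return forward_log_likelihood(seq, *model) - null

# Hypothetical toy model: two states that strictly alternate,
# the first preferring 'A' and the second preferring 'B'.
STATES = ["s1", "s2"]
START = {"s1": 1.0, "s2": 0.0}
TRANS = {"s1": {"s1": 0.0, "s2": 1.0}, "s2": {"s1": 1.0, "s2": 0.0}}
EMIT = {"s1": {"A": 0.9, "B": 0.1}, "s2": {"A": 0.1, "B": 0.9}}
MODEL = (STATES, START, TRANS, EMIT)
BACKGROUND = {"A": 0.5, "B": 0.5}

member_score = log_odds("ABAB", MODEL, BACKGROUND)
non_member_score = log_odds("BABA", MODEL, BACKGROUND)
```

In this toy setting a sequence that follows the alternating pattern receives a positive score and one that violates it receives a negative score; a real protein family model would use match, insert and delete states over the 20-letter amino acid alphabet.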
Article
A hidden Markov model for gene finding consists of submodels for coding regions, splice sites, introns, intergenic regions and possibly more. We describe how to estimate the model as a whole from labeled sequences instead of estimating the individual parts independently from subsequences. We argue that the standard maximum likelihood estimation criterion is not optimal for training such a model: instead of maximizing the probability of the DNA sequence, one should maximize the probability of the correct prediction. This criterion, called conditional maximum likelihood, is used for the gene finder HMMgene. A new approximate algorithm is described that finds the most probable prediction summed over all paths yielding the same prediction. We show that these methods contribute significantly to the high performance of HMMgene.
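The distinction between the single most probable path and the most probable prediction can be seen on a very small example. The sketch below is not the approximate algorithm from the paper: it simply enumerates every state path by brute force, which is only feasible for tiny inputs. The three-state model, its labels and all probabilities are hypothetical and chosen so that the two answers differ.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy model: one coding state and two non-coding states
# that share the label 'n'. All numbers are made up for illustration.
STATES = ["C", "N1", "N2"]
LABEL = {"C": "c", "N1": "n", "N2": "n"}
START = {"C": 0.4, "N1": 0.3, "N2": 0.3}
TRANS = {
    "C": {"C": 0.8, "N1": 0.1, "N2": 0.1},
    "N1": {"C": 0.1, "N1": 0.45, "N2": 0.45},
    "N2": {"C": 0.1, "N1": 0.45, "N2": 0.45},
}
# Uniform emissions keep the arithmetic easy to follow.
EMIT = {s: {base: 0.25 for base in "ACGT"} for s in STATES}

def joint_prob(path, seq):
    """P(path, seq): start, transition and emission terms multiplied."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, base in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][base]
    return p

def decode(seq):
    """Return (labeling of the best single path, most probable labeling)."""
    best_path, best_p = None, -1.0
    labeling_prob = defaultdict(float)
    for path in product(STATES, repeat=len(seq)):
        p = joint_prob(path, seq)
        if p > best_p:
            best_path, best_p = path, p
        # Sum the probability of all paths that give the same labeling.
        labeling_prob["".join(LABEL[s] for s in path)] += p
    viterbi_labeling = "".join(LABEL[s] for s in best_path)
    best_labeling = max(labeling_prob, key=labeling_prob.get)
    return viterbi_labeling, best_labeling

viterbi_labeling, best_labeling = decode("AC")
```

Here the best single path stays in the coding state, but the four paths through the two non-coding states together carry more probability, so the most probable labeling is non-coding throughout.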
Article
We have developed a code generating language, called Dynamite, specialised for the production and subsequent manipulation of complex dynamic programming methods for biological sequence comparison. From a relatively simple text definition file Dynamite will produce a variety of implementations of a dynamic programming method, including database searches and linear space alignments. The speed of the generated code is comparable to hand written code, and the additional flexibility has proved invaluable in designing and testing new algorithms. An innovation is a flexible labelling system, which can be used to annotate the original sequences with biological information. We illustrate the Dynamite syntax and flexibility by showing definitions for dynamic programming routines (i) to align two protein sequences under the assumption that they are both poly-topic transmembrane proteins, with the simultaneous assignment of transmembrane helices and (ii) to align protein information to genomic DNA, allowing for introns and sequencing error.
Article
Algorithms for generating alignments of biological sequences have inherent statistical limitations when it comes to the accuracy of the alignments they produce. Using simulations, we measure the accuracy of the standard global dynamic programming method and show that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap. We also give a table from which accuracy values can be predicted for commonly used scoring schemes and sequence divergences (the PAM and BLOSUM series). Finally we describe how to calculate the expected accuracy of a given alignment, and show how this can be used to construct an optimal accuracy alignment algorithm which generates significantly more accurate alignments than standard dynamic programming methods in simulated experiments.
Article
Motivation: Evolutionary models of amino acid sequences can be adapted to incorporate structure information; protein structure biologists can use phylogenetic relationships among species to improve prediction accuracy. Results: A computer program called PASSML ('Phylogeny and Secondary Structure using Maximum Likelihood') has been developed to implement an evolutionary model that combines protein secondary structure and amino acid replacement. The model is related to that of Dayhoff and co-workers, but we distinguish eight categories of structural environment: alpha helix, beta sheet, turn and coil, each further classified according to solvent accessibility, i.e. buried or exposed. The model of sequence evolution for each of the eight categories is a Markov process with discrete states in continuous time, and the organization of structure along protein sequences is described by a hidden Markov model. This paper describes the PASSML software and illustrates how it allows both the reconstruction of phylogenies and prediction of secondary structure from aligned amino acid sequences. Availability: PASSML 'ANSI C' source code and the example data sets described here are available at http://ng-dec1.gen.cam.ac.uk/hmm/Passml.html and 'downstream' Web pages. Contact: P.Lio@gen.cam.ac.uk
Article
The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise.
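The idea of turning an alignment into a position-specific scoring system can be sketched in a few lines. The example below is a deliberate simplification and only an assumption about how such scores might be computed: it handles ungapped match columns with a fixed pseudocount and omits the insert and delete states that a full profile HMM has. The function names, the DNA alphabet and the three-sequence alignment are hypothetical.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, background, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum the column-specific scores for an ungapped sequence."""
    return sum(column[ch] for column, ch in zip(pssm, seq))

# Hypothetical three-sequence alignment over a DNA alphabet.
ALPHABET = "ACGT"
BACKGROUND = {a: 0.25 for a in ALPHABET}
PSSM = build_pssm(["ACG", "ACG", "ACT"], ALPHABET, BACKGROUND)

consensus_score = score(PSSM, "ACG")
unrelated_score = score(PSSM, "TTT")
```

A sequence that matches the conserved columns scores above zero, while an unrelated one scores below zero. The pseudocount keeps residues that never appear in a column from receiving a score of minus infinity.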
Article
We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. The algorithm has a worst case complexity of O(N^6) in time and O(N^4) in storage. The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory. We present an implementation of the algorithm that generates the optimal minimum energy structure for a single RNA sequence, using standard RNA folding thermodynamic parameters augmented by a few parameters describing the thermodynamic stability of pseudoknots. We demonstrate the properties of the algorithm by using it to predict structures for several small pseudoknotted and non-pseudoknotted RNAs. Although the time and memory demands of the algorithm are steep, we believe this is the first algorithm to be able to fold optimal (minimum energy) pseudoknotted RNAs with the accepted RNA thermodynamic model.
Article
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de). More than 400 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Article
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version (4.3) of Pfam contains 1815 families. These Pfam families match 63% of proteins in SWISS-PROT 37 and TrEMBL 9. For complete genomes Pfam currently matches up to half of the proteins. Genomic DNA can be directly searched against the Pfam library using the Wise2 package.
IBM J. Res. & Dev., Vol. 45, No. 3/4, May/July 2001, E. Birney