ABSTRACT: In this thesis we present two methods for the annotation and evolutionary analysis of overlapping coding sequences (CDS) in single-stranded RNA (ssRNA) viral genomes. RNA viruses use the overlapping coding mode to increase the coding potential of their short genomes. We discuss the complications this poses for gene finders and suggest methodologies which can improve annotations. We introduce the problems we are addressing and relevant previous work. The two methods are described in detail in the two methodological chapters. The application of these methods to ssRNA viral datasets forms the bulk of the two results chapters, in which we also illustrate how our methods perform on simulated datasets (including goodness-of-fit tests). Our methods use Hidden Markov Models (HMMs) to infer the underlying genomic arrangements as well as the evolutionary pressures operating along the genome. The first method presents our novel HMM topology, which allows any nucleotide to code in up to three genes. This method uses only a single sequence and solves jointly for the best annotation and parameter set within the model definition (using an Expectation Maximisation algorithm). The second method is an extension that uses continuous-time Markov processes to model the differences in evolutionary selection pressures along the genome. We use additional comparative sequence information to improve the annotation. We illustrate that it is possible to jointly solve for the protein-coding annotation and the selection pressures operating along the HIV2 genome within our extended phylogenetic HMM method. We conclude with suggestions for areas of further work, including incorporating RNA secondary structure conservation and reducing the methodological reliance on the input topologies and alignment.
ABSTRACT: MOTIVATION: Viral genomes tend to code in overlapping reading frames to maximize informational content. This may result in atypical codon bias and particular evolutionary constraints. Due to the high mutation rate of viruses, there is additional strong evidence for varying selection between intra- and intergenomic regions. The presence of multiple coding regions complicates the concept of the K(a)/K(s) ratio, and thus calls for an alternative approach when investigating selection strengths. Building on the paper by McCauley and Hein, we develop a method for annotating a viral genome coding in overlapping reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single-sequence HMM methodology, extending it to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference sequence, we can thus obtain an annotation of both coding regions and selection strengths, allowing us to investigate different selection patterns and hypotheses. RESULTS: We illustrate our method by applying it to a multiple alignment of four HIV2 sequences, as well as of three Hepatitis B sequences. We obtain an annotation of the coding regions, as well as a posterior probability for each site of the strength of selection acting on it. From this we can deduce the average posterior selection acting on the different genes. Whilst we are encouraged to see that in HIV2 the genes gag and pol, which are known to be conserved, are indeed annotated as such, we also discover several sites of less stringent negative selection within the env gene. To the best of our knowledge, we are the first to provide a full selection annotation of the Hepatitis B genome by explicitly modelling the evolution within overlapping reading frames, rather than relying on simple K(a)/K(s) ratios.
ABSTRACT: Motivation: Single-stranded RNA (ssRNA) viral genomes are generally constrained in length and utilise overlapping reading frames to maximally exploit the coding potential within the genome length restrictions. This overlapping coding phenomenon leads to complex evolutionary constraints operating on the genome. In regions which code for more than one protein, silent mutations in one reading frame generally have a protein-coding effect in another. To maximise coding flexibility in all reading frames, overlapping regions are often compositionally biased towards amino acids which are sixfold degenerate with respect to the 64-codon alphabet. Previous methodologies have used this fact in an ad hoc manner to look for overlapping genes by motif matching. In this paper, differentiated nucleotide compositional patterns in overlapping regions are incorporated into a probabilistic Hidden Markov Model (HMM) framework which is used to annotate