Thomas M Keane

Wellcome Trust Sanger Institute, Cambridge, ENG, United Kingdom

Publications (30) · 195.41 Total impact

  • Article: Go retro and get a GRIP.
    Kim Wong, David J Adams, Thomas M Keane
    ABSTRACT: Gene retrocopy insertions are a source of new genes and new gene functions, and can now be identified using paired-end whole genome sequencing data.
    Genome biology 03/2013; 14(3):108. · 6.63 Impact Factor
  • Article: RetroSeq: Transposable element discovery from Illumina paired-end sequencing data.
    Thomas M Keane, Kim Wong, David J Adams
    ABSTRACT: A significant proportion of eukaryote genomes consist of transposable element (TE) derived sequence. These elements are known to have the capacity to modulate gene function and genome evolution. We have developed RetroSeq for detecting non-reference TE insertions from Illumina paired-end whole-genome sequencing data. We evaluate RetroSeq on a human trio from the 1000 Genomes Project showing that it produces highly accurate TE calls. AVAILABILITY: RetroSeq is open-source and available from https://github.com/tk2/RetroSeq CONTACT: tk2@sanger.ac.uk.
    Bioinformatics 12/2012; · 5.47 Impact Factor
  • Article: Sequencing and characterization of the FVB/NJ mouse genome.
    ABSTRACT: BACKGROUND: The FVB/NJ mouse strain has its origins in a colony of outbred Swiss mice established in 1935 at the National Institutes of Health. Mice derived from this source were selectively bred for sensitivity to histamine diphosphate and the B strain of Friend leukemia virus. This led to the establishment of the FVB/N inbred strain, which was subsequently imported to the Jackson Laboratory and designated FVB/NJ. The FVB/NJ mouse has several distinct characteristics, such as large pronuclear morphology, vigorous reproductive performance, and consistently large litters that make it highly desirable for transgenic strain production and general purpose use. RESULTS: Using next-generation sequencing technology, we have sequenced the genome of FVB/NJ to approximately 50-fold coverage, and have generated a comprehensive catalog of single nucleotide polymorphisms, small insertion/deletion polymorphisms, and structural variants, relative to the reference C57BL/6J genome. We have examined a previously identified quantitative trait locus for atherosclerosis susceptibility on chromosome 10 and identify several previously unknown candidate causal variants. CONCLUSION: The sequencing of the FVB/NJ genome and generation of this catalog has increased the number of known variant sites in FVB/NJ by a factor of four, and will help accelerate the identification of the precise molecular variants that are responsible for phenotypes observed in this widely used strain.
    Genome biology 08/2012; 13(8):R72. · 6.63 Impact Factor
  • Article: Next-generation sequencing of experimental mouse strains.
    ABSTRACT: Since the turn of the century the complete genome sequence of just one mouse strain, C57BL/6J, has been available. Knowing the sequence of this strain has enabled large-scale forward genetic screens to be performed, the creation of an almost complete set of embryonic stem (ES) cell lines with targeted alleles for protein-coding genes, and the generation of a rich catalog of mouse genomic variation. However, many experiments that use other common laboratory mouse strains have been hindered by a lack of whole-genome sequence data for these strains. The last 5 years have witnessed a revolution in DNA sequencing technologies. Recently, these technologies have been used to expand the repertoire of fully sequenced mouse genomes. In this article we review the main findings of these studies and discuss how the sequence of mouse genomes is helping pave the way from sequence to phenotype. Finally, we discuss the prospects for using de novo assembly techniques to obtain high-quality assembled genome sequences of these laboratory mouse strains, and what advances in sequencing technologies may be required to achieve this goal.
    Mammalian Genome 07/2012; 23(9-10):490-8. · 2.89 Impact Factor
  • Article: The genomic landscape shaped by selection on transposable elements across 18 mouse strains.
    ABSTRACT: Transposable element (TE)-derived sequence dominates the landscape of mammalian genomes and can modulate gene function by dysregulating transcription and translation. Our current knowledge of TEs in laboratory mouse strains is limited primarily to those present in the C57BL/6J reference genome, with most mouse TEs being drawn from three distinct classes, namely short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs) and the endogenous retrovirus (ERV) superfamily. Despite their high prevalence, the different genomic and gene properties controlling whether TEs are preferentially purged from, or are retained by, genetic drift or positive selection in mammalian genomes remain poorly defined. Using whole genome sequencing data from 13 classical laboratory and 4 wild-derived mouse inbred strains, we developed a comprehensive catalogue of 103,798 polymorphic TE variants. We employ this extensive data set to characterize TE variants across the Mus lineage, and to infer neutral and selective processes that have acted over 2 million years. Our results indicate that the majority of TE variants are introduced through the male germline and that only a minority of TE variants exert detectable changes in gene expression. However, among genes with differential expression across the strains there are twice as many TE variants identified as being putative causal variants as expected. Most TE variants that cause gene expression changes appear to be purged rapidly by purifying selection. Our findings demonstrate that past TE insertions have often been highly deleterious, and help to prioritize TE variants according to their likely contribution to gene expression or phenotype variation.
    Genome biology 06/2012; 13(6):R45. · 6.63 Impact Factor
  • Article: High levels of RNA-editing site conservation amongst 15 laboratory mouse strains.
    ABSTRACT: Adenosine-to-inosine (A-to-I) editing is a site-selective post-transcriptional alteration of double-stranded RNA by ADAR deaminases that is crucial for homeostasis and development. Recently the Mouse Genomes Project generated genome sequences for 17 laboratory mouse strains and rich catalogues of variants. We also generated RNA-seq data from whole brain RNA from 15 of the sequenced strains. Here we present a computational approach that takes an initial set of transcriptome/genome mismatch sites and filters these calls taking into account systematic biases in alignment, single nucleotide variant calling, and sequencing depth to identify RNA editing sites with high accuracy. We applied this approach to our panel of mouse strain transcriptomes identifying 7,389 editing sites with an estimated false-discovery rate of between 2.9 and 10.5%. The overwhelming majority of these edits were of the A-to-I type, with less than 2.4% not of this class, and only three of these edits could not be explained as alignment artifacts. We validated 24 novel RNA editing sites in coding sequence, including two non-synonymous edits in the Cacna1d gene that fell into the IQ domain portion of the Cav1.3 voltage-gated calcium channel, indicating a potential role for editing in the generation of transcript diversity. We show that despite over two million years of evolutionary divergence, the sites edited and the level of editing at each site are remarkably consistent across the 15 strains. In the Cds2 gene we find evidence for RNA editing acting to preserve the ancestral transcript sequence despite genomic sequence divergence.
    Genome biology 04/2012; 13(4):R26. · 6.63 Impact Factor
  • Article: The fine-scale architecture of structural variants in 17 mouse genomes.
    ABSTRACT: Accurate catalogs of structural variants (SVs) in mammalian genomes are necessary to elucidate the potential mechanisms that drive SV formation and to assess their functional impact. Next generation sequencing methods for SV detection are an advance on array-based methods, but are almost exclusively limited to four basic types: deletions, insertions, inversions and copy number gains. By visual inspection of 100 Mbp of genome to which next generation sequence data from 17 inbred mouse strains had been aligned, we identify and interpret 21 paired-end mapping patterns, which we validate by PCR. These paired-end mapping patterns reveal a greater diversity and complexity in SVs than previously recognized. In addition, Sanger-based sequence analysis of 4,176 breakpoints at 261 SV sites reveals additional complexity at approximately a quarter of structural variants analyzed. We find micro-deletions and micro-insertions at SV breakpoints, ranging from 1 to 107 bp, and SNPs that extend breakpoint micro-homology and may catalyze SV formation. An integrative approach using experimental analyses to train computational SV calling is essential for the accurate resolution of the architecture of SVs. We find considerable complexity in SV formation; about a quarter of SVs in the mouse are composed of a complex mixture of deletion, insertion, inversion and copy number gain. Computational methods can be adapted to identify most paired-end mapping patterns.
    Genome biology 03/2012; 13(3):R18. · 6.63 Impact Factor
  • Article: Mouse genomic variation and its effect on phenotypes and gene regulation.
    ABSTRACT: We report genome sequences of 17 inbred strains of laboratory mice and identify almost ten times more variants than previously known. We use these genomes to explore the phylogenetic history of the laboratory mouse and to examine the functional consequences of allele-specific variation on transcript abundance, revealing that at least 12% of transcripts show a significant tissue-specific expression bias. By identifying candidate functional variants at 718 quantitative trait loci we show that the molecular nature of functional variants and their position relative to genes vary according to the effect size of the locus. These sequences provide a starting point for a new era in the functional analysis of a key model organism.
    Nature 09/2011; 477(7364):289-94. · 36.28 Impact Factor
  • Article: Sequence-based characterization of structural variation in the mouse genome.
    ABSTRACT: Structural variation is widespread in mammalian genomes and is an important cause of disease, but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear. Without knowing how many SVs there are, and how they arise, it is difficult to discover what they do. Combining experimental with automated analyses, we identified 711,920 SVs at 281,243 sites in the genomes of thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 160,000 SVs were mapped to base pair resolution, allowing us to infer that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene expression or quantitative phenotypic variation. We identified 24 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One-third of the genes so affected have immunological functions.
    Nature 09/2011; 477(7364):326-9. · 36.28 Impact Factor
  • Article: Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly.
    ABSTRACT: We present a pipeline, SVMerge, to detect structural variants by integrating calls from several existing structural variant callers, which are then validated and the breakpoints refined using local de novo assembly. SVMerge is modular and extensible, allowing new callers to be incorporated as they become available. We applied SVMerge to the analysis of a HapMap trio, demonstrating enhanced structural variant detection, breakpoint refinement, and a lower false discovery rate. SVMerge can be downloaded from http://svmerge.sourceforge.net.
    Genome biology 12/2010; 11(12):R128. · 6.63 Impact Factor
  • Article: Multi-heuristic dynamic task allocation using genetic algorithms in a heterogeneous distributed system.
    Andrew J Page, Thomas M Keane, Thomas J Naughton
    ABSTRACT: We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms.
    Journal of Parallel and Distributed Computing 07/2010; 70(7):758-766. · 0.86 Impact Factor
  • Article: New insights into the blood-stage transcriptome of Plasmodium falciparum using RNA-Seq.
    ABSTRACT: Recent advances in high-throughput sequencing present a new opportunity to deeply probe an organism's transcriptome. In this study, we used Illumina-based massively parallel sequencing to gain new insight into the transcriptome (RNA-Seq) of the human malaria parasite, Plasmodium falciparum. Using data collected at seven time points during the intraerythrocytic developmental cycle, we (i) detect novel gene transcripts; (ii) correct hundreds of gene models; (iii) propose alternative splicing events; and (iv) predict 5' and 3' untranslated regions. Approximately 70% of the unique sequencing reads map to previously annotated protein-coding genes. The RNA-Seq results greatly improve existing annotation of the P. falciparum genome with over 10% of gene models modified. Our data confirm 75% of predicted splice sites and identify 202 new splice sites, including 84 previously uncharacterized alternative splicing events. We also discovered 107 novel transcripts and expression of 38 pseudogenes, with many demonstrating differential expression across the developmental time series. Our RNA-Seq results correlate well with DNA microarray analysis performed in parallel on the same samples, and provide improved resolution over the microarray-based method. These data reveal new features of the P. falciparum transcriptional landscape and significantly advance our understanding of the parasite's red blood cell-stage transcriptome.
    Molecular Microbiology 02/2010; 76(1):12-24. · 5.01 Impact Factor
  • Article: Plasmodium falciparum var gene expression is modified by host immunity.
    ABSTRACT: Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1) is a potentially important family of immune targets, which play a central role in the host-parasite interaction by binding to various host molecules. They are encoded by a diverse family of genes called var, of which there are approximately 60 copies in each parasite genome. In sub-Saharan Africa, although P. falciparum infection occurs throughout life, severe malarial disease tends to occur only in childhood. This could potentially be explained if (i) PfEMP1 variants differ in their capacity to support pathogenesis of severe malaria and (ii) this capacity is linked to the likelihood of each molecule being recognized and cleared by naturally acquired antibodies. Here, in a study of 217 Kenyan children with malaria, we show that expression of a group of var genes "cys2," containing a distinct pattern of cysteine residues, is associated with low host immunity. Expression of cys2 genes was associated with parasites from young children, those with severe malaria, and those with a poorly developed antibody response to parasite-infected erythrocyte surface antigens. Cys-2 var genes form a minor component of all genomic var repertoires analyzed to date. Therefore, the results are compatible with the hypothesis that the genomic var gene repertoire is organized such that PfEMP1 molecules that confer the most virulence to the parasite tend also to be those that are most susceptible to the development of host immunity. This may help the parasite to adapt effectively to the development of host antibodies through modification of the host-parasite relationship.
    Proceedings of the National Academy of Sciences 12/2009; 106(51):21801-6. · 9.68 Impact Factor
  • Article: Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ mouse genomes.
    ABSTRACT: Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting beta-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.
    Genomics 11/2009; 95(2):105-10. · 3.02 Impact Factor
  • Article: ABACAS: algorithm-based automatic contiguation of assembled sequences.
    ABSTRACT: Due to the availability of new sequencing technologies, we are now increasingly interested in sequencing closely related strains of existing finished genomes. Recently a number of de novo and mapping-based assemblers have been developed to produce high quality draft genomes from new sequencing technology reads. New tools are necessary to take contigs from a draft assembly through to a fully contiguated genome sequence. ABACAS is intended as a tool to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun assembled contigs based on a reference sequence. The input to ABACAS is a set of contigs which will be aligned to the reference genome, ordered and orientated, visualized in the ACT comparative browser, and optimal primer sequences are automatically generated. Availability and Implementation: ABACAS is implemented in Perl and is freely available for download from http://abacas.sourceforge.net.
    Bioinformatics 07/2009; 25(15):1968-9. · 5.47 Impact Factor
  • Article: MultiPhyl: a high-throughput phylogenomics webserver using distributed computing.
    ABSTRACT: With the number of fully sequenced genomes increasing steadily, there is greater interest in performing large-scale phylogenomic analyses from large numbers of individual gene families. Maximum likelihood (ML) has been shown repeatedly to be one of the most accurate methods for phylogenetic construction. Recently, there have been a number of algorithmic improvements in maximum-likelihood-based tree search methods. However, it can still take a long time to analyse the evolutionary history of many gene families using a single computer. Distributed computing refers to a method of combining the computing power of multiple computers in order to perform some larger overall calculation. In this article, we present the first high-throughput implementation of a distributed phylogenetics platform, MultiPhyl, capable of using the idle computational resources of many heterogeneous non-dedicated machines to form a phylogenetics supercomputer. MultiPhyl allows a user to upload hundreds or thousands of amino acid or nucleotide alignments simultaneously and perform computationally intensive tasks such as model selection, tree searching and bootstrapping of each of the alignments using many desktop machines. The program implements a set of 88 amino acid models and 56 nucleotide maximum likelihood models and a variety of statistical methods for choosing between alternative models. A MultiPhyl webserver is available for public use at: http://www.cs.nuim.ie/distributed/multiphyl.php.
    Nucleic Acids Research 08/2007; 35(Web Server issue):W33-7. · 8.03 Impact Factor
  • Article: Building Large Phylogenetic Trees on Coarse-Grained Parallel Machines
    ABSTRACT: Phylogenetic analysis is an area of computational biology concerned with the reconstruction of evolutionary relationships between organisms, genes, and gene families. Maximum likelihood evaluation has proven to be one of the most reliable methods for constructing phylogenetic trees. The huge computational requirements associated with maximum likelihood analysis mean that it is not feasible to produce large phylogenetic trees using a single processor. We have completed a fully cross-platform coarse-grained distributed application, DPRml, which overcomes many of the limitations imposed by the current set of parallel phylogenetic programs. We have completed a set of efficiency tests that show how to maximise efficiency while using the program to build large phylogenetic trees. The software is publicly available under the terms of the GNU general public licence from the system webpage at http://www.cs.nuim.ie/distributed.
    Algorithmica 06/2006; 45(3):285-300. · 0.60 Impact Factor
  • Article: Emergence of a three codon deletion in gag p17 in HIV type 1 subtype C long-term survivors, and general population spread.
    ABSTRACT: In a population-based study in northern Malawi we investigated HIV-1 subtype C gag and env gene sequences associated with long-term survival. DNA samples were available from 31 individuals surviving between population surveys carried out in the 1980s and 1990s. Most survivors with paired sequences dating from the 1980s and the 1990s had a three codon deletion in the gag p17 region of the sequence retrieved from the sample collected in the 1990s that was not present in the sequence from the same individual dating from the 1980s. This deletion was also not present in any other 1980s sequences from Malawi, but was common in samples collected in Malawi in the 1990s. The deletion is equivalent to the loss of three amino acids in the D helix region of the gag protein, and may be associated with longer survival and onward transmission.
    AIDS Research and Human Retroviruses 03/2006; 22(2):195-201. · 2.25 Impact Factor
  • Article: Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified.
    ABSTRACT: In recent years, model based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner. We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily-chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three Domains of life and our results show that no single amino acid matrix is optimal for any of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins. This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.
    BMC Evolutionary Biology 02/2006; 6:29. · 3.52 Impact Factor