Publications (26)

  • Article: Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer.
    Pierre Peterlongo, Rayan Chikhi
    ABSTRACT: BACKGROUND: The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (de novo assemblers) are typically employed to process such data. However, these methods require large memory resources and computation time. Many basic biological questions could be answered by targeting specific information in the reads, thus avoiding complete assembly. RESULTS: We present Mapsembler, an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around each of them, either as a plain sequence or as a graph showing contextual structure. We introduce new algorithms to retrieve approximate occurrences of a sequence from reads and to construct an extension graph. Among other results presented in this paper, Mapsembler made it possible to retrieve previously described candidate fusion genes in human breast cancer and to detect previously unknown ones. CONCLUSIONS: Mapsembler is the first software that enables de novo discovery of repeats, SNPs, exon skipping, gene fusions, and other structural events around a region of interest, directly from raw sequencing reads. As indexing is localized, the memory footprint of Mapsembler is negligible. Mapsembler is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/mapsembler/.
    BMC Bioinformatics 03/2012; 13(1):48. · 2.75 Impact Factor
  • Article: KISSPLICE: de-novo calling alternative splicing events from RNA-seq data.
    ABSTRACT: In this paper, we address the problem of identifying and quantifying polymorphisms in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each polymorphism corresponds to a recognisable pattern in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all polymorphisms in such graphs. We then introduce an exact algorithm, called KISSPLICE, to extract alternative splicing events. We show that KISSPLICE identifies more correct events than general-purpose transcriptome assemblers. Additionally, on a dataset of 71 million reads from human brain and liver tissues, KISSPLICE identified 3497 alternative splicing events, 56% of which are not present in the annotations. This confirms recent estimates showing that the complexity of alternative splicing has been largely underestimated so far. We propose new models and algorithms for the detection of polymorphism in RNA-seq data. This opens the way to a new kind of study on large HTS RNA-seq datasets, where the focus is not the global reconstruction of full-length transcripts, but the local assembly of polymorphic regions. KISSPLICE is available for download at http://alcovna.genouest.org/kissplice/.
    BMC Bioinformatics 01/2012; 13 Suppl 6:S5. · 2.75 Impact Factor
  • Chapter: Filters and Seeds Approaches for Fast Homology Searches in Large Datasets
    Nadia Pisanti, Mathieu Giraud, Pierre Peterlongo
    12/2010: pages 299-319; ISBN: 9780470892107
  • Conference Proceeding: An optimized filter for finding multiple repeats in DNA sequences.
    Maria Federico, Pierre Peterlongo, Nadia Pisanti
    The 8th ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2010, Hammamet, Tunisia, May 16-19, 2010; 01/2010
  • Chapter: c-GAMMA: Comparative Genome Analysis of Molecular Markers
    ABSTRACT: Discovery of molecular markers for efficient identification of living organisms remains a challenge of high interest. The diversity of species can now be observed in detail with low-cost genomic sequences produced by a new generation of sequencers. A method, called c-GAMMA, is proposed. It formalizes the design of new markers for such data. It is based on a series of filters on forbidden pairs of words, followed by an optimization step on the discriminative power of candidate markers. First results are presented on a set of microbial genomes. The importance of further developments is stressed, in order to cope with the huge amounts of data that will soon become available in all kingdoms of life.
    08/2009: pages 255-269
