SURVEY AND SUMMARY: Prospects and limitations of full-text index structures in genome analysis

Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan, Belgium.
Nucleic Acids Research (Impact Factor: 9.11). 05/2012; 40(15):6993-7015. DOI: 10.1093/nar/gks408
Source: PubMed

ABSTRACT The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive, state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.

Available from: Veerle Fack, Sep 27, 2015
  • Source
    • "This has so far not been possible with competing approaches, as surveyed by Vyverman et al. [32]. When we are not given an upper bound on the pattern length, we can use one of the competing indexes that does not require such a bound or we can scan, with an online pattern-matching algorithm, the reference genome and the parts of the other genomes near phrase boundaries. "
    ABSTRACT: The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases effectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they differ. In this paper we survey the twenty-year history of this idea and discuss its relation to kernelization in parameterized complexity.
    Frontiers in Bioengineering and Biotechnology 12/2014; 3. DOI:10.3389/fbioe.2015.00012
  • Source
    • "The first algorithms from 2009 were soon followed by more mature proposals, which will be presented below, focusing on their indexing capabilities. More information on genome data compressors and indexes can be found in the recent surveys (Vyverman et al., 2012; Deorowicz and Grabowski, 2013; Giancarlo et al., 2013). Mäkinen et al. (2010) added index functionalities to compressed DNA sequences: display (which can also be called the random access functionality) returning the substring specified by its start and end position, count telling the number of times the given pattern occurs in the text, and locate listing the positions of the pattern in the text. "
    ABSTRACT: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16-32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at under a free license. Data S1 is available at PLOS One online.
    PLoS ONE 03/2014; 9(10):e109384. DOI:10.1371/journal.pone.0109384 · 3.23 Impact Factor
  • Source
    • "Moreover, in bioinformatics, applications finding regularities, like transcription factor–binding sites, and inexact searches are of great interest, for these reasons indexing genomic data on suffix trees is still fundamental. For a comprehensive review on prospects and limitations of full text indexes in genome analysis we refer the reader to Vyverman et al. (2012). "
    ABSTRACT: The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
    Journal of computational biology: a journal of computational molecular cell biology 03/2014; 21(4). DOI:10.1089/cmb.2012.0256 · 1.74 Impact Factor