Gary Benson

Aalto University, Helsinki, Province of Southern Finland, Finland

Publications (46) · 107.45 Total impact

  • Source
    ABSTRACT: The use of sequencing technologies to investigate the microbiome of a sample can positively impact patient healthcare by providing therapeutic targets for personalized disease treatment. However, these samples contain genomic sequences from various sources that complicate the identification of pathogens.
    BMC Bioinformatics 08/2014; 15(1):262. · 3.02 Impact Factor
  • ABSTRACT: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations comprised of AND, OR, XOR, complement, shift, and addition. Bit-parallelism has been successfully applied to the Longest Common Subsequence (LCS) and edit-distance problems, producing very fast algorithms in practice.
    Bioinformatics (Oxford, England). 07/2014;
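To make the bit-parallel idea in the abstract above concrete, here is a minimal sketch of a well-known bit-vector recurrence for LCS length from the literature. It is not claimed to be the algorithm of this paper. The function names are hypothetical, and Python's unbounded integers stand in for machine words, which is why the vector is masked back to len(a) bits after each step.

```python
def lcs_length_bitparallel(a: str, b: str) -> int:
    """LCS length of a and b using one bit per position of a."""
    m = len(a)
    if m == 0 or not b:
        return 0
    mask = (1 << m) - 1
    # match[c] has bit i set exactly when a[i] == c
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: nothing matched yet
    for ch in b:
        mc = match.get(ch, 0)
        u = v & mc
        # The addition propagates carries along runs of ones; AND, OR and
        # complement then select the surviving bits. Mask to m bits because
        # Python integers do not overflow like fixed-width words.
        v = ((v + u) | (v & ~mc)) & mask
    # Each zero bit left in v contributes one to the LCS length.
    return m - bin(v).count("1")


def lcs_length_dp(a: str, b: str) -> int:
    """Textbook row-by-row dynamic program, used only as a cross-check."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

In a compiled language the same loop would operate on fixed-width words, with explicit carry handling once len(a) exceeds the word size.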
  • ABSTRACT: DNA tandem repeats (TRs) are ubiquitous genomic features which consist of two or more adjacent copies of an underlying pattern sequence. The copies may be identical or approximate. Variable number of tandem repeats or VNTRs are polymorphic TR loci in which the number of pattern copies is variable. In this paper we describe VNTRseek, our software for discovery of minisatellite VNTRs (pattern size ≥ 7 nucleotides) using whole genome sequencing data. VNTRseek maps sequencing reads to a set of reference TRs and then identifies putative VNTRs based on a discrepancy between the copy number of a reference and its mapped reads. VNTRseek was used to analyze the Watson and Khoisan genomes (454 technology) and two 1000 Genomes family trios (Illumina). In the Watson genome, we identified 752 VNTRs with pattern sizes ranging from 7 to 84 nt. In the Khoisan genome, we identified 2572 VNTRs with pattern sizes ranging from 7 to 105 nt. In the trios, we identified between 2660 and 3822 VNTRs per individual and found nearly 100% consistency with Mendelian inheritance. VNTRseek is, to the best of our knowledge, the first software for genome-wide detection of minisatellite VNTRs. It is available at http://orca.bu.edu/vntrseek/.
    Nucleic Acids Research 07/2014; · 8.81 Impact Factor
  • Source
    Gary Benson, Avivit Levy, Riva Shalom
    ABSTRACT: In this paper we define a new problem, motivated by computational biology: $LCSk$, which aims to find the maximal number of $k$-length substrings matching in both input strings while preserving their order of appearance. The traditional LCS definition is a special case of our problem, where $k = 1$. We provide an algorithm solving the general case in $O(n^2)$ time, where $n$ is the length of the input strings, equaling the time required for the special case of $k=1$. The space requirement of the algorithm is $O(kn)$. We also define a complementary $EDk$ distance measure and show that $EDk(A,B)$ can be computed in $O(nm)$ time and $O(km)$ space, where $m$, $n$ are the lengths of the input sequences $A$ and $B$ respectively.
    02/2014;
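The LCSk definition in the abstract above can be illustrated with a short dynamic program. This is a sketch under assumptions, not the authors' implementation: the function name `lcsk` and both table names are hypothetical, the matched k-length substrings are taken to be non-overlapping, and full tables are kept for readability, so the sketch uses quadratic space, not the $O(kn)$ space reported in the abstract.

```python
def lcsk(a: str, b: str, k: int) -> int:
    """Maximal number of ordered, non-overlapping k-length substring matches."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(a), len(b)
    # suffix[i][j]: length of the longest common suffix of a[:i] and b[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    # best[i][j]: LCSk value for the prefixes a[:i] and b[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                suffix[i][j] = suffix[i - 1][j - 1] + 1
            # Skip the last character of one of the two prefixes.
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            # A common suffix of length >= k means a k-length match ends here
            # (this also guarantees i >= k and j >= k).
            if suffix[i][j] >= k:
                best[i][j] = max(best[i][j], best[i - k][j - k] + 1)
    return best[n][m]
```

With k = 1 the recurrence reduces to the ordinary LCS length, matching the special case noted in the abstract. Only the previous k rows of each table are ever read, which is the observation behind a smaller-space variant.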
  • Source
    ABSTRACT: We study the problem of mining poly-regions in DNA sequences and propose three methods to solve it. A poly-region is defined as a bursty DNA area, i.e., area of high occurrence of a DNA pattern. In this paper, we introduce a general formulation that covers all possibly meaningful types of poly-regions in DNA and develop three efficient methods to detect them. The first one is entropy-based and applies a recursive segmentation technique that produces a set of candidate segments which may potentially lead to a poly-region. The key idea behind the second approach is to use a set of sliding windows over the sequence. Each sliding window covers a sequence segment and keeps a summary that mainly includes the number of occurrences of each item or pattern in that segment. Combining these summaries yields the complete set of poly-regions in the given sequence. The third approach applies a technique based on the majority vote, achieving linear running time with a minimal number of false negatives. In addition, we apply an existing method to discover frequently occurring arrangements of those poly-regions in several types of DNA regions, such as introns, exons, and nucleosomes. The proposed algorithms are tested on DNA sequences of four different organisms in terms of recall and runtime.
    Int. J. Data Mining and Bioinformatics. 09/2012;
  • ABSTRACT: We study the problem of mining poly-regions in DNA. A poly-region is defined as a bursty DNA area, i.e., area of elevated frequency of a DNA pattern. We introduce a general formulation that covers a range of meaningful types of poly-regions and develop three efficient detection methods. The first applies recursive segmentation and is entropy-based. The second uses a set of sliding windows that summarize each sequence segment using several statistics. Finally, the third employs a technique based on majority vote. The proposed algorithms are tested on DNA sequences of four different organisms in terms of recall and runtime.
    International Journal of Data Mining and Bioinformatics 01/2012; 6(4):406-28. · 0.39 Impact Factor
  • ABSTRACT: The population structure of the species Legionella pneumophila was investigated by multilocus variable number of tandem repeats (VNTR) analysis (MLVA) and sequencing of three VNTRs (Lpms01, Lpms04 and Lpms13) in selected strains. Of 150 isolates of diverse origins, 136 (86 %) were distributed into eight large MLVA clonal complexes (VACCs) and the rest were either unique or formed small clusters of up to two MLVA genotypes. In spite of the lower degree of genome-wide linkage disequilibrium of the MLVA loci compared with sequence-based typing, the clustering achieved by the two methods was highly congruent. The detailed analysis of VNTR Lpms04 alleles showed a very complex organization, with five different repeat unit lengths and a high level of internal variation. Within each MLVA-defined VACC, Lpms04 was endowed with a common recognizable pattern with some interesting exceptions. Evidence of recombination events was suggested by analysis of internal repeat variations at the two additional VNTR loci, Lpms01 and Lpms13. Sequence analysis of L. pneumophila VNTR locus Lpms04 alone provides a first-line assay for allocation of a new isolate within the L. pneumophila population structure and for epidemiological studies.
    Microbiology 05/2011; 157(Pt 9):2582-94. · 3.06 Impact Factor
  • Source
    ABSTRACT: Although a variety of possible functions have been proposed for inverted repeat sequences (IRs), it is not known which of them might occur in vivo. We investigate this question by assessing the distributions and properties of IRs in the Saccharomyces cerevisiae (SC) genome. Using the IRFinder algorithm we detect 100,514 IRs having copy length greater than 6 bp and spacer length less than 77 bp. To assess statistical significance we also determine the IR distributions in two types of randomization of the S. cerevisiae genome. We find that the S. cerevisiae genome is significantly enriched in IRs relative to random. The S. cerevisiae IRs are significantly longer and contain fewer imperfections than those from the randomized genomes, suggesting that processes to lengthen and/or correct errors in IRs may be operative in vivo. The S. cerevisiae IRs are highly clustered in intergenic regions, while their occurrence in coding sequences is consistent with random. Clustering is stronger in the 3' flanks of genes than in their 5' flanks. However, the S. cerevisiae genome is not enriched in those IRs that would extrude cruciforms, suggesting that this is not a common event. Various explanations for these results are considered.
    Current Genetics 05/2010; 56(4):321-40. · 2.41 Impact Factor
  • Source
    ABSTRACT: Hepatitis C virus (HCV) infection can promote the development of hepatocellular carcinoma (HCC). Published data implicate the HCV core gene in oncogenesis. We tested the hypothesis that core gene sequences from HCC patients differ from those of patients without cirrhosis/HCC. Full-length HCV sequences from HCC patients and controls were obtained from the investigators and GenBank and compared with each other. A logistic regression model was developed to predict the HCC risk of individual point mutations and other sequence features. Mutations in partial sequences (bases 36-288) from HCC patients and controls were also analyzed. The first base of the AUG start codon was designated position 1. A logistic regression model developed through analysis of full-length core gene sequences identified seven polymorphisms significantly associated with increased HCC risk (36G/C, 209A, 271U/C, 309A/C, 435A/C, 481A, and 546A/C) and an interaction term (for 209A-271U/C) that had an odds ratio <1.0. Three of these polymorphisms could be analyzed in the partial sequences. Two of them, 36G/C and 209A, were again associated with increased HCC risk, but 271U/C was not. The odds ratio of 209A-271U/C was not significant. HCV core genes from patients with and without HCC differ at several positions. Of interest, 209A has been associated with IFN resistance and HCC in previous studies. Our findings suggest that HCV core gene sequence data might provide useful information about HCC risk. Prospective investigation is needed to establish the temporal relationship between appearance of the viral mutations and development of HCC.
    Clinical Cancer Research 05/2009; 15(9):3205-13. · 7.84 Impact Factor
  • Source
    Denise Y F Mak, Gary Benson
    ABSTRACT: MOTIVATION: Standard search techniques for DNA repeats start by identifying small matching words, or seeds, that may inhabit larger repeats. Recent innovations in seed structure include spaced seeds and indel seeds which are more sensitive than contiguous seeds. Evaluating seed sensitivity requires (i) specifying a homology model for alignments and (ii) assigning probabilities to those alignments. Optimal seed selection is resource intensive because all alternative seeds must be tested. Current methods require that the model and its probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. RESULTS: We show how to eliminate the need for prior parameter specification by exploiting a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe an efficient preprocessing step, which is computed once per seed. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to identify optimal seeds and the boundaries within probability space where they are optimal.
    Bioinformatics 01/2009; 25(3):302-8. · 5.47 Impact Factor
  • Source
    Denise Y. F. Mak, Gary Benson
    ABSTRACT: The accurate computational prediction of RNA secondary structures is a difficult task, but an important one, since RNA structure is usually more evolutionarily conserved than primary sequence. We describe a dynamic programming algorithm called FoldRRS (Folding of RNA by Ranking of Stems) that predicts a consensus secondary structure from a multiple sequence alignment. Our algorithm exploits the use of k-length stems (k = 2) to acquire base pairing probability and covariation information from individual sequences. We test sequences from the BRAliBase I data set (1) and the Rfam database (2). Our results were compared against three algorithms, RNAalifold, Pfold, and KNetFold, that are similar in nature. FoldRRS exhibits an increase in accuracy over the other programs in data sets which contain longer and/or more numerous sequences.
    International Conference on Bioinformatics & Computational Biology, BIOCOMP 2009, July 13-16, 2009, Las Vegas Nevada, USA, 2 Volumes; 01/2009
  • Source
    ABSTRACT: Historically, approximate pattern matching has mainly focused on coping with errors in the data, while the order of the text/pattern was assumed to be more or less correct. In this paper we consider a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed. We formally define a broad class of problems of this type, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements. We consider several natural rearrangement schemes, including the analogues of the ℓ1 and ℓ2 distances, as well as two distances based on interchanges. For these, we present efficient algorithms to solve the resulting string matching problems.
    Journal of Computer and System Sciences 01/2009; · 1.00 Impact Factor
  • Source
    J. Comput. Syst. Sci. 01/2009; 75:359-370.
  • Source
    Gary Benson, Denise Y. F. Mak
    ABSTRACT: Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as CnSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
    String Processing and Information Retrieval, 15th International Symposium, SPIRE 2008, Melbourne, Australia, November 10-12, 2008. Proceedings; 01/2008
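The notion of covered 1s defined in the abstract above can be checked on small cases with a brute-force sketch. This is not the efficient method the paper presents: it enumerates all 2^n binary strings, so it is only usable for small n. The function names are hypothetical, and the seed's wildcard symbol is assumed to be written `*`.

```python
from itertools import product


def covered_ones(seed: str, text: str) -> int:
    """Count the 1s in text that lie under a 1 of some occurrence of seed."""
    k = len(seed)
    ones = [i for i, s in enumerate(seed) if s == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs in this window if every 1 of the seed meets a 1.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed: str, n: int, p: float) -> list:
    """dist[c] = P(exactly c covered 1s) in an iid Bernoulli(p) string of length n."""
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        w = text.count("1")
        dist[covered_ones(seed, text)] += p**w * (1 - p) ** (n - w)
    return dist
```

For the seed `11*1` and n = 4, the seed fits in only one window, so the only nonzero outcomes are 0 and 3 covered 1s, which makes a convenient sanity check.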
  • Source
    ABSTRACT: The constant bombardment of mammalian genomes by transposable elements (TEs) has resulted in TEs comprising at least 45% of the human genome. Because of their great age and abundance, TEs are important in comparative phylogenomics. However, estimates of TE age were previously based on divergence from derived consensus sequences or phylogenetic analysis, which can be unreliable, especially for older more diverged elements. Therefore, a novel genome-wide analysis of TE organization and fragmentation was performed to estimate TE age independently of sequence composition and divergence or the assumption of a constant molecular clock. Analysis of TEs in the human genome revealed approximately 600,000 examples where TEs have transposed into and fragmented other TEs, covering >40% of all TEs or approximately 542 Mbp of genomic sequence. The relative age of these TEs over evolutionary time is implicit in their organization, because newer TEs have necessarily transposed into older TEs that were already present. A matrix of the number of times that each TE has transposed into every other TE was constructed, and a novel objective function was developed that derived the chronological order and relative ages of human TEs spanning >100 million years. This method has been used to infer the relative ages across all four major TE classes, including the oldest, most diverged elements. Analysis of DNA transposons over the history of the human genome has revealed the early activity of some MER2 transposons, and the relatively recent activity of MER1 transposons during primate lineages. The TEs from six additional mammalian genomes were defragmented and analyzed. Pairwise comparison of the independent chronological orders of TEs in these mammalian genomes revealed species phylogeny, the fact that transposons shared between genomes are older than species-specific transposons, and a subset of TEs that were potentially active during periods of speciation.
    PLoS Computational Biology 08/2007; 3(7):e137. · 4.87 Impact Factor
  • Source
    ABSTRACT: The remarkable responsiveness of dog morphology to selection is a testament to the mutability of mammals. The genetic sources of this morphological variation are largely unknown, but some portion is due to tandem repeat length variation in genes involved in development. Previous analysis of tandem repeats in coding regions of developmental genes revealed fewer interruptions in repeat sequences in dogs than in the orthologous repeats in humans, as well as higher levels of polymorphism, but the fragmentary nature of the available dog genome sequence thwarted attempts to distinguish between locus-specific and genome-wide origins of this disparity. Using whole-genome analyses of the human and recently completed dog genomes, we show that dogs possess a genome-wide increase in the basal germ-line slippage mutation rate. Building on the approach that gave rise to the initial observation in dogs, we sequenced 55 coding repeat regions in 42 species representing 10 major carnivore clades and found that a genome-wide elevated slippage mutation rate is a derived character shared by diverse wild canids, distinguishing them from other Carnivora. A similarly heightened slippage profile was also detected in rodents, another taxon exhibiting high diversity and rapid evolvability. The correlation of enhanced slippage rates with major evolutionary radiations suggests that the possession of a "slippery" genome may bestow on some taxa greater potential for rapid evolutionary change.
    Journal of Heredity 02/2007; 98(5):452-60. · 2.00 Impact Factor
  • Source
    ABSTRACT: Tandem repeats in DNA have been under intensive study for many years, first, as a consequence of their usefulness as genomic markers and DNA fingerprints and more recently as their role in human disease and regulatory processes has become apparent. The Tandem Repeats Database (TRDB) is a public repository of information on tandem repeats in genomic DNA. It contains a variety of tools for repeat analysis, including the Tandem Repeats Finder program, query and filtering capabilities, repeat clustering, polymorphism prediction, PCR primer selection, data visualization and data download in a variety of formats. In addition, TRDB serves as a centralized research workbench. It provides user storage space and permits collaborators to privately share their data and analysis. TRDB is available at https://tandem.bu.edu/cgi-bin/trdb/trdb.exe.
    Nucleic Acids Research 02/2007; 35(Database issue):D80-7. · 8.81 Impact Factor
  • Source
    ABSTRACT: Short (~5 nucleotides) interspersed repeats regulate several aspects of post-transcriptional gene expression. Previously we developed an algorithm (REPFIND) that assigns P-values to all repeated motifs in a given nucleic acid sequence and reliably identifies clusters of short CAC-containing motifs required for mRNA localization in Xenopus oocytes. In order to facilitate the identification of genes possessing clusters of repeats that regulate post-transcriptional aspects of gene expression in mammalian genes, we used REPFIND to create a database of all repeated motifs in the 3' untranslated regions (UTR) of genes from the Mammalian Gene Collection (MGC). The MGC database includes seven vertebrate species: human, cow, rat, mouse and three non-mammalian vertebrate species. A web-based application was developed to search this database of repeated motifs to generate species-specific lists of genes containing specific classes of repeats in their 3'-UTRs. This computational tool is called 3'-UTR SIRF (Short Interspersed Repeat Finder), and it reveals that hundreds of human genes contain an abundance of short CAC-rich and CAG-rich repeats in their 3'-UTRs that are similar to those found in mRNAs localized to the neurites of neurons. We tested four candidate mRNAs for localization in rat hippocampal neurons by in situ hybridization. Our results show that two candidate CAC-rich (Syntaxin 1B and Tubulin beta4) and two candidate CAG-rich (Sec61alpha and Syntaxin 1A) mRNAs are localized to distal neurites, whereas two control mRNAs lacking repeated motifs in their 3'-UTR remain primarily in the cell body. Computational data generated with 3'-UTR SIRF indicate that hundreds of mammalian genes have an abundance of short CA-containing motifs that may direct mRNA localization in neurons. In situ hybridization shows that four candidate mRNAs are localized to distal neurites of cultured hippocampal neurons. 
    These data suggest that short CA-containing motifs may be part of a widely utilized genetic code that regulates mRNA localization in vertebrate cells. The use of 3'-UTR SIRF to search for new classes of motifs that regulate other aspects of gene expression should yield important information in future studies addressing cis-regulatory information located in 3'-UTRs.
    BMC Bioinformatics 01/2007; 8. · 3.02 Impact Factor
  • Source
    ABSTRACT: We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
    Bioinformatics 07/2006; 22(14):e341-9. · 5.47 Impact Factor
  • Source
    ABSTRACT: The problem of discovering arrangements of regions of high occurrence of one or more items of a given alphabet in a sequence is studied, and two efficient approaches are proposed to solve it. The first approach is entropy-based and uses an existing recursive segmentation technique to split the input sequence into a set of homogeneous segments. The key idea of the second approach is to use a set of sliding windows over the sequence. Each sliding window keeps a set of statistics of a sequence segment that mainly includes the number of occurrences of each item in that segment. Combining these statistics efficiently yields the complete set of regions of high occurrence of the items of the given alphabet. After identifying these regions, the sequence is converted to a sequence of labeled intervals (each one corresponding to a region). An efficient algorithm for mining frequent arrangements of temporal intervals on a single sequence is applied on the converted sequence to discover frequently occurring arrangements of these regions. The proposed algorithms are tested on various DNA sequences producing results with potentially significant biological meaning.
    Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China; 01/2006

Publication Stats

3k Citations
107.45 Total Impact Points

Institutions

  • 2012
    • Aalto University
      • Department of Information and Computer Science
      Helsinki, Province of Southern Finland, Finland
  • 2000–2012
    • Boston University
      • Department of Computer Science
      • Department of Electrical and Computer Engineering
      • Department of Biology
      Boston, MA, United States
  • 1994–2004
    • Mount Sinai School of Medicine
      • Department of Genetics and Genomic Sciences
      Manhattan, NY, United States
    • Georgia Institute of Technology
      • College of Computing
      Atlanta, GA, United States
  • 2003
    • Université Paris-Sud 11
      • Institut de Génétique et Microbiologie (IGMORS)
      Paris, Ile-de-France, France
  • 1997
    • Bar Ilan University
      • Department of Computer Science
      Ramat Gan, Tel Aviv, Israel
    • University of Southern California
      • Department of Mathematics
      Los Angeles, CA, United States
  • 1992
    • University of Maryland, College Park
      • Department of Computer Science
      College Park, MD, United States