Gary Benson

Aalto University, Helsinki, Province of Southern Finland, Finland

Publications (42) · 93.56 Total impact

  •
    ABSTRACT: We study the problem of mining poly-regions in DNA. A poly-region is defined as a bursty DNA area, i.e., an area of elevated frequency of a DNA pattern. We introduce a general formulation that covers a range of meaningful types of poly-regions and develop three efficient detection methods. The first applies recursive segmentation and is entropy-based. The second uses a set of sliding windows that summarize each sequence segment using several statistics. Finally, the third employs a technique based on majority vote. The proposed algorithms are tested on DNA sequences of four different organisms in terms of recall and runtime.
    International Journal of Data Mining and Bioinformatics 01/2012; 6(4):406-28. · 0.39 Impact Factor
  •
    ABSTRACT: The population structure of the species Legionella pneumophila was investigated by multilocus variable number of tandem repeats (VNTR) analysis (MLVA) and sequencing of three VNTRs (Lpms01, Lpms04 and Lpms13) in selected strains. Of 150 isolates of diverse origins, 136 (86 %) were distributed into eight large MLVA clonal complexes (VACCs) and the rest were either unique or formed small clusters of up to two MLVA genotypes. In spite of the lower degree of genome-wide linkage disequilibrium of the MLVA loci compared with sequence-based typing, the clustering achieved by the two methods was highly congruent. The detailed analysis of VNTR Lpms04 alleles showed a very complex organization, with five different repeat unit lengths and a high level of internal variation. Within each MLVA-defined VACC, Lpms04 was endowed with a common recognizable pattern with some interesting exceptions. Evidence of recombination events was suggested by analysis of internal repeat variations at the two additional VNTR loci, Lpms01 and Lpms13. Sequence analysis of L. pneumophila VNTR locus Lpms04 alone provides a first-line assay for allocation of a new isolate within the L. pneumophila population structure and for epidemiological studies.
    Microbiology 05/2011; 157(Pt 9):2582-94. · 3.06 Impact Factor
  •
    ABSTRACT: Although a variety of possible functions have been proposed for inverted repeat sequences (IRs), it is not known which of them might occur in vivo. We investigate this question by assessing the distributions and properties of IRs in the Saccharomyces cerevisiae (SC) genome. Using the IRFinder algorithm we detect 100,514 IRs having copy length greater than 6 bp and spacer length less than 77 bp. To assess statistical significance we also determine the IR distributions in two types of randomization of the S. cerevisiae genome. We find that the S. cerevisiae genome is significantly enriched in IRs relative to random. The S. cerevisiae IRs are significantly longer and contain fewer imperfections than those from the randomized genomes, suggesting that processes to lengthen and/or correct errors in IRs may be operative in vivo. The S. cerevisiae IRs are highly clustered in intergenic regions, while their occurrence in coding sequences is consistent with random. Clustering is stronger in the 3' flanks of genes than in their 5' flanks. However, the S. cerevisiae genome is not enriched in those IRs that would extrude cruciforms, suggesting that this is not a common event. Various explanations for these results are considered.
    Current Genetics 05/2010; 56(4):321-40. · 2.41 Impact Factor
  •
    ABSTRACT: Hepatitis C virus (HCV) infection can promote the development of hepatocellular carcinoma (HCC). Published data implicate the HCV core gene in oncogenesis. We tested the hypothesis that core gene sequences from HCC patients differ from those of patients without cirrhosis/HCC. Full-length HCV sequences from HCC patients and controls were obtained from the investigators and GenBank and compared with each other. A logistic regression model was developed to predict the HCC risk of individual point mutations and other sequence features. Mutations in partial sequences (bases 36-288) from HCC patients and controls were also analyzed. The first base of the AUG start codon was designated position 1. A logistic regression model developed through analysis of full-length core gene sequences identified seven polymorphisms significantly associated with increased HCC risk (36G/C, 209A, 271U/C, 309A/C, 435A/C, 481A, and 546A/C) and an interaction term (for 209A-271U/C) that had an odds ratio <1.0. Three of these polymorphisms could be analyzed in the partial sequences. Two of them, 36G/C and 209A, were again associated with increased HCC risk, but 271U/C was not. The odds ratio of 209A-271U/C was not significant. HCV core genes from patients with and without HCC differ at several positions. Of interest, 209A has been associated with IFN resistance and HCC in previous studies. Our findings suggest that HCV core gene sequence data might provide useful information about HCC risk. Prospective investigation is needed to establish the temporal relationship between appearance of the viral mutations and development of HCC.
    Clinical Cancer Research 05/2009; 15(9):3205-13. · 7.84 Impact Factor
  •
    J. Comput. Syst. Sci. 01/2009; 75:359-370.
  •
    ABSTRACT: Historically, approximate pattern matching has mainly focused on coping with errors in the data, while the order of the text/pattern was assumed to be more or less correct. In this paper we consider a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed. We formally define a broad class of problems of this type, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements. We consider several natural rearrangement schemes, including the analogues of the ℓ1 and ℓ2 distances, as well as two distances based on interchanges. For these, we present efficient algorithms to solve the resulting string matching problems.
    Journal of Computer and System Sciences. 01/2009;
  •
    Denise Y. F. Mak, Gary Benson
    ABSTRACT: The accurate computational prediction of RNA secondary structures is a difficult task, but an important one, since RNA structure is usually more evolutionarily conserved than primary sequence. We describe a dynamic programming algorithm called FoldRRS (Folding of RNA by Ranking of Stems) that predicts a consensus secondary structure from a multiple sequence alignment. Our algorithm exploits the use of k-length stems (k = 2) to acquire base pairing probability and covariation information from individual sequences. We test sequences from the BRAliBase I data set (1) and the Rfam database (2). Our results were compared against three algorithms, RNAalifold, Pfold, and KNetFold, that are similar in nature. FoldRRS exhibits an increase in accuracy over the other programs in data sets which contain longer and/or more numerous sequences.
    International Conference on Bioinformatics & Computational Biology, BIOCOMP 2009, July 13-16, 2009, Las Vegas Nevada, USA, 2 Volumes; 01/2009
  •
    Denise Y F Mak, Gary Benson
    ABSTRACT: MOTIVATION: Standard search techniques for DNA repeats start by identifying small matching words, or seeds, that may inhabit larger repeats. Recent innovations in seed structure include spaced seeds and indel seeds which are more sensitive than contiguous seeds. Evaluating seed sensitivity requires (i) specifying a homology model for alignments and (ii) assigning probabilities to those alignments. Optimal seed selection is resource intensive because all alternative seeds must be tested. Current methods require that the model and its probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. RESULTS: We show how to eliminate the need for prior parameter specification by exploiting a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe an efficient preprocessing step, which is computed once per seed. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to identify optimal seeds and the boundaries within probability space where they are optimal.
    Bioinformatics 01/2009; 25(3):302-8. · 5.47 Impact Factor
  •
    Gary Benson, Denise Y. F. Mak
    ABSTRACT: Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as CnSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
    String Processing and Information Retrieval, 15th International Symposium, SPIRE 2008, Melbourne, Australia, November 10-12, 2008. Proceedings; 01/2008
  •
    ABSTRACT: The constant bombardment of mammalian genomes by transposable elements (TEs) has resulted in TEs comprising at least 45% of the human genome. Because of their great age and abundance, TEs are important in comparative phylogenomics. However, estimates of TE age were previously based on divergence from derived consensus sequences or phylogenetic analysis, which can be unreliable, especially for older more diverged elements. Therefore, a novel genome-wide analysis of TE organization and fragmentation was performed to estimate TE age independently of sequence composition and divergence or the assumption of a constant molecular clock. Analysis of TEs in the human genome revealed approximately 600,000 examples where TEs have transposed into and fragmented other TEs, covering >40% of all TEs or approximately 542 Mbp of genomic sequence. The relative age of these TEs over evolutionary time is implicit in their organization, because newer TEs have necessarily transposed into older TEs that were already present. A matrix of the number of times that each TE has transposed into every other TE was constructed, and a novel objective function was developed that derived the chronological order and relative ages of human TEs spanning >100 million years. This method has been used to infer the relative ages across all four major TE classes, including the oldest, most diverged elements. Analysis of DNA transposons over the history of the human genome has revealed the early activity of some MER2 transposons, and the relatively recent activity of MER1 transposons during primate lineages. The TEs from six additional mammalian genomes were defragmented and analyzed. Pairwise comparison of the independent chronological orders of TEs in these mammalian genomes revealed species phylogeny, the fact that transposons shared between genomes are older than species-specific transposons, and a subset of TEs that were potentially active during periods of speciation.
    PLoS Computational Biology 08/2007; 3(7):e137. · 4.87 Impact Factor
  •
    ABSTRACT: Tandem repeats in DNA have been under intensive study for many years, first, as a consequence of their usefulness as genomic markers and DNA fingerprints and more recently as their role in human disease and regulatory processes has become apparent. The Tandem Repeats Database (TRDB) is a public repository of information on tandem repeats in genomic DNA. It contains a variety of tools for repeat analysis, including the Tandem Repeats Finder program, query and filtering capabilities, repeat clustering, polymorphism prediction, PCR primer selection, data visualization and data download in a variety of formats. In addition, TRDB serves as a centralized research workbench. It provides user storage space and permits collaborators to privately share their data and analysis. TRDB is available at https://tandem.bu.edu/cgi-bin/trdb/trdb.exe.
    Nucleic Acids Research 02/2007; 35(Database issue):D80-7. · 8.28 Impact Factor
  •
    ABSTRACT: The remarkable responsiveness of dog morphology to selection is a testament to the mutability of mammals. The genetic sources of this morphological variation are largely unknown, but some portion is due to tandem repeat length variation in genes involved in development. Previous analysis of tandem repeats in coding regions of developmental genes revealed fewer interruptions in repeat sequences in dogs than in the orthologous repeats in humans, as well as higher levels of polymorphism, but the fragmentary nature of the available dog genome sequence thwarted attempts to distinguish between locus-specific and genome-wide origins of this disparity. Using whole-genome analyses of the human and recently completed dog genomes, we show that dogs possess a genome-wide increase in the basal germ-line slippage mutation rate. Building on the approach that gave rise to the initial observation in dogs, we sequenced 55 coding repeat regions in 42 species representing 10 major carnivore clades and found that a genome-wide elevated slippage mutation rate is a derived character shared by diverse wild canids, distinguishing them from other Carnivora. A similarly heightened slippage profile was also detected in rodents, another taxon exhibiting high diversity and rapid evolvability. The correlation of enhanced slippage rates with major evolutionary radiations suggests that the possession of a "slippery" genome may bestow on some taxa greater potential for rapid evolutionary change.
    Journal of Heredity 02/2007; 98(5):452-60. · 2.00 Impact Factor
  •
    ABSTRACT: Short (~5 nucleotides) interspersed repeats regulate several aspects of post-transcriptional gene expression. Previously we developed an algorithm (REPFIND) that assigns P-values to all repeated motifs in a given nucleic acid sequence and reliably identifies clusters of short CAC-containing motifs required for mRNA localization in Xenopus oocytes. In order to facilitate the identification of genes possessing clusters of repeats that regulate post-transcriptional aspects of gene expression in mammalian genes, we used REPFIND to create a database of all repeated motifs in the 3' untranslated regions (UTR) of genes from the Mammalian Gene Collection (MGC). The MGC database includes seven vertebrate species: human, cow, rat, mouse and three non-mammalian vertebrate species. A web-based application was developed to search this database of repeated motifs to generate species-specific lists of genes containing specific classes of repeats in their 3'-UTRs. This computational tool is called 3'-UTR SIRF (Short Interspersed Repeat Finder), and it reveals that hundreds of human genes contain an abundance of short CAC-rich and CAG-rich repeats in their 3'-UTRs that are similar to those found in mRNAs localized to the neurites of neurons. We tested four candidate mRNAs for localization in rat hippocampal neurons by in situ hybridization. Our results show that two candidate CAC-rich (Syntaxin 1B and Tubulin beta4) and two candidate CAG-rich (Sec61alpha and Syntaxin 1A) mRNAs are localized to distal neurites, whereas two control mRNAs lacking repeated motifs in their 3'-UTR remain primarily in the cell body. Computational data generated with 3'-UTR SIRF indicate that hundreds of mammalian genes have an abundance of short CA-containing motifs that may direct mRNA localization in neurons. In situ hybridization shows that four candidate mRNAs are localized to distal neurites of cultured hippocampal neurons. 
    These data suggest that short CA-containing motifs may be part of a widely utilized genetic code that regulates mRNA localization in vertebrate cells. The use of 3'-UTR SIRF to search for new classes of motifs that regulate other aspects of gene expression should yield important information in future studies addressing cis-regulatory information located in 3'-UTRs.
    BMC Bioinformatics 01/2007; 8. · 3.02 Impact Factor
  •
    ABSTRACT: We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
    Bioinformatics 07/2006; 22(14):e341-9. · 5.47 Impact Factor
  •
    ABSTRACT: The problem of discovering arrangements of regions of high occurrence of one or more items of a given alphabet in a sequence is studied, and two efficient approaches are proposed to solve it. The first approach is entropy-based and uses an existing recursive segmentation technique to split the input sequence into a set of homogeneous segments. The key idea of the second approach is to use a set of sliding windows over the sequence. Each sliding window keeps a set of statistics of a sequence segment that mainly includes the number of occurrences of each item in that segment. Combining these statistics efficiently yields the complete set of regions of high occurrence of the items of the given alphabet. After identifying these regions, the sequence is converted to a sequence of labeled intervals (each one corresponding to a region). An efficient algorithm for mining frequent arrangements of temporal intervals on a single sequence is applied on the converted sequence to discover frequently occurring arrangements of these regions. The proposed algorithms are tested on various DNA sequences producing results with potentially significant biological meaning.
    Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China; 01/2006
  • Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006; 01/2006
  •
    ABSTRACT: Tandem repeats are an important class of DNA repeats and much research has focused on their efficient identification, their use in DNA typing and fingerprinting, and their causative role in trinucleotide repeat diseases such as Huntington Disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups or families based on sequence similarity so that their biological importance may be further explored. To cluster tandem repeats we need a notion of pairwise distance which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments--Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. It is important to analyze and compare these functions because the choice of distance metric forms the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions based on the cluster validation techniques--Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method which produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10 and X and C. elegans Chromosome III.
    Genome informatics. International Conference on Genome Informatics 02/2005; 16(1):3-12.
  •
    Gary Benson
    ABSTRACT: We present a solution for the following problem. Given two sequences X = x_1 x_2 ⋯ x_n and Y = y_1 y_2 ⋯ y_m, n ⩽ m, find the best scoring alignment of X′ = X^k[i] vs. Y over all possible pairs (k, i), for k = 1, 2, … and 1 ⩽ i ⩽ n, where X[i] is the cyclic permutation of X starting at x_i, X^k[i] is the concatenation of k complete copies of X[i] (k tandem copies), and the alignment must include all of Y and all of X′. Our algorithm allows any alignment scoring scheme with additive gap costs and uses time and O(nm) space. We use it to identify related tandem repeats in the C. elegans genome as part of the development of a multi-genome database of tandem repeats.
    Discrete Applied Mathematics. 01/2005;
  •
    ABSTRACT: We have performed the first genome-wide analysis of the Inverted Repeat (IR) structure in the human genome, using a novel and efficient software package called Inverted Repeats Finder (IRF). After masking of known repetitive elements, IRF detected 22,624 human IRs characterized by arm size from 25 bp to >100 kb with at least 75% identity, and spacer length up to 100 kb. This analysis required 6 h on a desktop PC. In all, 166 IRs had arm lengths >8 kb. From this set, IRs were excluded if they were in unfinished/unassembled regions of the genome, or clustered with other closely related IRs, yielding a set of 96 large IRs. Of these, 24 (25%) occurred on the X-chromosome, although it represents only approximately 5% of the genome. Of the X-chromosome IRs, 83.3% were ≥99% identical, compared with 28.8% of autosomal IRs. Eleven IRs from Chromosome X, one from Chromosome 11, and seven already described from Chromosome Y contain genes predominantly expressed in testis. PCR analysis of eight of these IRs correctly amplified the corresponding region in the human genome, and six were also confirmed in gorilla or chimpanzee genomes. Similarity dot-plots revealed that 22 IRs contained further secondary homologous structures partially categorized into three distinct patterns. The prevalence of large highly homologous IRs containing testes genes on the X- and Y-chromosomes suggests a possible role in male germ-line gene expression and/or maintaining sequence integrity by gene conversion.
    Genome Research 10/2004; 14(10A):1861-9. · 14.40 Impact Factor
  •
    ABSTRACT: We develop a metric for probability distributions with applications to biological sequence analysis. Our distance metric is obtained by minimizing a functional defined on the class of paths over probability measures on N categories. The underlying mathematical theory is connected to a constrained problem in the calculus of variations. The solution presented is a numerical solution, which approximates the true solution in a set of cases called rich paths where none of the components of the path is zero. The functional to be minimized is motivated by entropy considerations, reflecting the idea that nature might efficiently carry out mutations of genome sequences in such a way that the increase in entropy involved in transformation is as small as possible. We characterize sequences by frequency profiles or probability vectors, in the case of DNA where N is 4 and the components of the probability vector are the frequency of occurrence of each of the bases A, C, G and T. Given two probability vectors a and b, we define a distance function as the infimum of path integrals of the entropy function H(p) over all admissible paths p(t), 0 ≤ t ≤ 1, with p(t) a probability vector such that p(0)=a and p(1)=b. If the probability paths p(t) are parameterized as y(s) in terms of arc length s and the optimal path is smooth with arc length L, then smooth and "rich" optimal probability paths may be numerically estimated by a hybrid method of iterating Newton's method on solutions of a two point boundary value problem, with unknown distance L between the abscissas, for the Euler-Lagrange equations resulting from a multiplier rule for the constrained optimization problem together with linear regression to improve the arc length estimate L. Matlab code for these numerical methods is provided which works only for "rich" optimal probability vectors. These methods motivate a definition of an elementary distance function which is easier and faster to calculate, works on non-rich vectors, does not involve variational theory and does not involve differential equations, but is a better approximation of the minimal entropy path distance than the distance ‖b−a‖₂. We compute minimal entropy distance matrices for examples of DNA myostatin genes and amino-acid sequences across several species. Output tree dendrograms for our minimal entropy metric are compared with dendrograms based on BLAST and BLAST identity scores.
    Journal of Mathematical Biology 06/2004; 48(5):563-90. · 2.37 Impact Factor
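
The SPIRE 2008 abstract in the list above defines when a spaced seed occurs in a binary string and which 1s it covers. The sketch below is only an illustration of that definition, not the exact-distribution method the paper presents. The function name `covered_ones`, the use of 0-based indices, and the choice of `*` for the non-1 seed symbol are assumptions made for this example.

```python
def covered_ones(seed: str, text: str) -> int:
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` uses '1' for required positions and any other character
    (here '*', an assumed notation) for don't-care positions.
    `text` is a binary string of '0' and '1'.
    """
    k = len(seed)
    # Offsets inside the seed where a 1 is required.
    required = [j for j, c in enumerate(seed) if c == "1"]
    covered = set()
    # h is the 0-based index of the last character of the length-k window.
    for h in range(k - 1, len(text)):
        start = h - k + 1
        # The seed occurs here if every required offset holds a 1 in the text.
        if all(text[start + j] == "1" for j in required):
            covered.update(start + j for j in required)
    return len(covered)
```

For instance, with the seed `11*1` and the text `110111`, the seed occurs only in the first window, so three 1s are covered. Overlapping occurrences count each position once, which is why a set is used.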

Publication Stats

3k Citations
121 Downloads
2k Views
93.56 Total Impact Points

Institutions

  • 2012
    • Aalto University
      • Department of Information and Computer Science
      Helsinki, Province of Southern Finland, Finland
  • 2000–2010
    • Boston University
      • Department of Electrical and Computer Engineering
      • Department of Biology
      Boston, Massachusetts, United States
  • 1994–2004
    • Mount Sinai School of Medicine
      • Department of Genetics and Genomic Sciences
      Manhattan, NY, United States
    • Georgia Institute of Technology
      • College of Computing
      Atlanta, GA, United States
  • 2003
    • Université Paris-Sud 11
      • Institut de Génétique et Microbiologie (IGMORS)
      Paris, Ile-de-France, France
  • 1997
    • Bar Ilan University
      • Department of Computer Science
      Ramat Gan, Tel Aviv, Israel
    • University of Southern California
      • Department of Mathematics
      Los Angeles, CA, United States
  • 1992
    • University of Maryland, College Park
      • Department of Computer Science
      College Park, MD, United States