Sophie Schbath

University of Western Australia, Perth, Western Australia, Australia

Publications (19) · 74.38 Total impact

  • Article: Separating significant matches from spurious matches in DNA sequences.
    Hugo Devillers, Sophie Schbath
    ABSTRACT: Word matches are widely used to compare genomic sequences. Complete genome alignment methods often rely on the use of matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among matches that are retrieved from the comparison of two genomic sequences, a part of them may correspond to spurious matches (SMs), which are matches obtained by chance rather than by homologous relationships. The number of SMs depends on the minimal match length (ℓ) that has to be set in the algorithm used to retrieve them. Indeed, if ℓ is too small, a lot of matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved but many smaller significant matches are certainly ignored. To date, the choice of ℓ mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric distributions to characterize the distribution of the length of matches obtained from the comparison of two genomic sequences.
    Journal of computational biology: a journal of computational molecular cell biology 12/2011; 19(1):1-12. · 1.69 Impact Factor
  • Article: Statistical significance of threading scores.
    ABSTRACT: We present a general method for assessing threading score significance. The threading score of a protein sequence, threaded onto a given structure, should be compared with the threading score distribution of a random amino-acid sequence, of the same length, threaded on the same structure; small p-values point to significantly high scores. We claim that, due to general protein contact map properties, this reference distribution is a Weibull extreme value distribution whose parameters depend on the threading method, the structure, the length of the query and the random sequence simulation model used. These parameters can be estimated off-line with simulated sequence samples, for different sequence lengths. They can further be interpolated at the exact length of a query, enabling the quick computation of the p-value.
    Journal of computational biology: a journal of computational molecular cell biology 12/2011; 19(1):13-29. · 1.69 Impact Factor
  • Article: Robustness assessment of whole bacterial genome segmentations.
    ABSTRACT: Comparison of closely related bacterial genomes has revealed the presence of highly conserved sequences forming a "backbone" that is interrupted by numerous, less conserved, DNA fragments. Segmentation of bacterial genomes into backbone and variable regions is particularly useful to investigate, among other things, bacterial genome evolution. Several software tools have been designed to compare complete bacterial chromosomes and a few online databases store pre-computed genome comparisons. However, very few statistical methods are available to evaluate the reliability of these software tools and to compare the results obtained with them. To fill this gap, we have developed two local scores to measure the robustness of bacterial genome segmentations. Our method uses a simulation procedure based on random perturbations of the compared genomes. The two scores described in this article provide useful information and are easy to implement, and their interpretation is intuitive. We show that they are suited to discriminate between robust and non-robust segmentations when genome aligners such as MAUVE and MGA are used.
    Journal of computational biology: a journal of computational molecular cell biology 09/2011; 18(9):1155-65. · 1.69 Impact Factor
  • Article: DNA motifs that sculpt the bacterial chromosome.
    ABSTRACT: During the bacterial cell cycle, the processes of chromosome replication, DNA segregation, DNA repair and cell division are coordinated by precisely defined events. Tremendous progress has been made in recent years in identifying the mechanisms that underlie these processes. A striking feature common to these processes is that non-coding DNA motifs play a central part, thus 'sculpting' the bacterial chromosome. Here, we review the roles of these motifs in the mechanisms that ensure faithful transmission of genetic information to daughter cells. We show how their chromosomal distribution is crucial for their function and how it can be analysed quantitatively. Finally, the potential roles of these motifs in bacterial chromosome evolution are discussed.
    Nature Reviews Microbiology 01/2011; 9(1):15-26. · 21.18 Impact Factor
  • Article: Occurrence of structured motifs in random sequences: Arbitrary number of boxes
    ABSTRACT: Structured motifs with arbitrary number of boxes are considered. In particular, such motifs are of interest in molecular biology for identifying gene promoters along genomes. Neat closed-form expressions for relevant distributions associated with occurrences of structured motifs are derived. Our methodology is based on developing a suitable semi-Markov embedding of the problem. A numerical example is also provided.
    Discrete Applied Mathematics. 01/2011;
  • Conference Proceeding: Assessing the Robustness of Complete Bacterial Genome Segmentations.
    Comparative Genomics - International Workshop, RECOMB-CG 2010, Ottawa, Canada, October 9-11, 2010. Proceedings; 01/2010
  • Chapter: How Can Pattern Statistics Be Useful for DNA Motif Discovery?
    Sophie Schbath, Stéphane Robin
    ABSTRACT: Statistics of motifs have been widely revisited in the last 15 years due to the increasing availability of genomic sequences. The identification of DNA motifs with biological functions is still a huge challenge of genome analysis. Many functional and essential motifs have the particularity to be very frequent all along the chromosome or to be concentrated in some particular regions (e.g. in front of genes) or to be co-oriented with the replication direction. The prediction of functional motifs is then mostly based on statistical properties of pattern occurrences in Markovian sequences. This chapter is primarily devoted to such properties with a special focus on pattern frequency. How does one compute or approximate the count distribution to assess motif exceptionality? How can we test if a motif is significantly unbalanced between two (sets of) sequences? How should one deal with degenerated patterns? How can we model occurrences to find regions significantly enriched with a given pattern? Examples of functional motifs will illustrate all these questions, and we will see how the Chi motif has been identified in Staphylococcus aureus because of its statistical properties.
    Keywords and phrases: pattern statistics, word count, Markov chain, DNA sequence, exceptional words, unexpected frequency, compound Poisson process
    12/2009: pages 319-350;
  • Article: Assessing the Exceptionality of Coloured Motifs in Networks.
    EURASIP J. Bioinformatics and Systems Biology. 01/2009; 2009.
  • Article: The MatP/matS site-specific system organizes the terminus region of the E. coli chromosome into a macrodomain.
    ABSTRACT: The organization of the Escherichia coli chromosome into insulated macrodomains influences the segregation of sister chromatids and the mobility of chromosomal DNA. Here, we report that organization of the Terminus region (Ter) into a macrodomain relies on the presence of a 13 bp motif called matS repeated 23 times in the 800-kb-long domain. matS sites are the main targets in the E. coli chromosome of a newly identified protein designated MatP. MatP accumulates in the cell as a discrete focus that colocalizes with the Ter macrodomain. The effects of MatP inactivation reveal its role as main organizer of the Ter macrodomain: in the absence of MatP, DNA is less compacted, the mobility of markers is increased, and segregation of Ter macrodomain occurs early in the cell cycle. Our results indicate that a specific organizational system is required in the Terminus region for bacterial chromosome management during the cell cycle.
    Cell 11/2008; 135(3):475-85. · 32.40 Impact Factor
  • Article: SIGffRid: a tool to search for sigma factor binding sites in bacterial genomes using comparative approach and biologically driven statistics.
    ABSTRACT: Many programs have been developed to identify transcription factor binding sites. However, most of them are not able to infer two-word motifs with variable spacer lengths. This case is encountered for RNA polymerase Sigma (sigma) Factor Binding Sites (SFBSs) usually composed of two boxes, called -35 and -10 in reference to the transcription initiation point. Our goal is to design an algorithm detecting SFBS by using combinational and statistical constraints deduced from biological observations. We describe a new approach to identify SFBSs by comparing two related bacterial genomes. The method, named SIGffRid (SIGma Factor binding sites Finder using R'MES to select Input Data), performs a simultaneous analysis of pairs of promoter regions of orthologous genes. SIGffRid uses a prior identification of over-represented patterns in whole genomes as selection criteria for potential -35 and -10 boxes. These patterns are then grouped using pairs of short seeds (of which one is possibly gapped), allowing a variable-length spacer between them. Next, the motifs are extended guided by statistical considerations, a feature that ensures a selection of motifs with statistically relevant properties. We applied our method to the pair of related bacterial genomes of Streptomyces coelicolor and Streptomyces avermitilis. Cross-check with the well-defined SFBSs of the SigR regulon in S. coelicolor is detailed, validating the algorithm. SFBSs for HrdB and BldN were also found; and the results suggested some new targets for these sigma factors. In addition, consensus motifs for BldD and new SFBSs binding sites were defined, overlapping previously proposed consensuses. Relevant tests were carried out also on bacteria with moderate GC content (i.e. Escherichia coli/Salmonella typhimurium and Bacillus subtilis/Bacillus licheniformis pairs). Motifs of house-keeping sigma factors were found as well as other SFBSs such as that of SigW in Bacillus strains. 
    We demonstrate that our approach, combining statistical and biological criteria, successfully predicts SFBSs. The method's versatility allows the recognition of other kinds of two-box regulatory sites.
    BMC Bioinformatics 02/2008; 9:73. · 2.75 Impact Factor
  • Article: Assessing the Exceptionality of Network Motifs.
    Journal of Computational Biology. 01/2008; 15:1-20.
  • Article: Assessing the exceptionality of coloured motifs in networks.
    ABSTRACT: Various methods have been recently employed to characterise the structure of biological networks. In particular, the concept of network motif and the related one of coloured motif have proven useful to model the notion of a functional/evolutionary building block. However, algorithms that enumerate all the motifs of a network may produce a very large output, and methods to decide which motifs should be selected for downstream analysis are needed. A widely used method is to assess if the motif is exceptional, that is, over- or under-represented with respect to a null hypothesis. Much effort has been put in the last thirty years to derive p-values for the frequencies of topological motifs, that is, fixed subgraphs. They rely either on (compound) Poisson and Gaussian approximations for the motif count distribution in Erdös-Rényi random graphs or on simulations in other models. We focus on a different definition of graph motifs that corresponds to coloured motifs. A coloured motif is a connected subgraph with fixed vertex colours but unspecified topology. Our work is the first analytical attempt to assess the exceptionality of coloured motifs in networks without any simulation. We first establish analytical formulae for the mean and the variance of the count of a coloured motif in an Erdös-Rényi random graph model. Using simulations under this model, we further show that a Pólya-Aeppli distribution better approximates the distribution of the motif count compared to Gaussian or Poisson distributions. The Pólya-Aeppli distribution, and more generally the compound Poisson distributions, are indeed well designed to model counts of clumping events. Altogether, these results make it possible to derive a p-value for a coloured motif, without spending time on simulations.
    EURASIP Journal on Bioinformatics and Systems Biology 01/2008; 2009(1):616234.
  • Article: Identification of DNA motifs implicated in maintenance of bacterial core genomes by predictive modeling.
    ABSTRACT: Bacterial biodiversity at the species level, in terms of gene acquisition or loss, is so immense that it raises the question of how essential chromosomal regions are spared from uncontrolled rearrangements. Protection of the genome likely depends on specific DNA motifs that impose limits on the regions that undergo recombination. Although most such motifs remain unidentified, they are theoretically predictable based on their genomic distribution properties. We examined the distribution of the "crossover hotspot instigator," or Chi, in Escherichia coli, and found that its exceptional distribution is restricted to the core genome common to three strains. We then formulated a set of criteria that were incorporated in a statistical model to search core genomes for motifs potentially involved in genome stability in other species. Our strategy led us to identify and biologically validate two distinct heptamers that possess Chi properties, one in Staphylococcus aureus, and the other in several streptococci. This strategy paves the way for wide-scale discovery of other important functional noncoding motifs that distinguish core genomes from the strain-variable regions.
    PLoS Genetics 10/2007; 3(9):1614-21. · 8.69 Impact Factor
  • Article: Statistical tests to compare motif count exceptionalities.
    ABSTRACT: Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count p-values in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required. We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops. The exact binomial test is particularly well suited to small counts. For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
    BMC Bioinformatics 02/2007; 8:84. · 2.75 Impact Factor
  • Article: Waiting times for clumps of patterns and for structured motifs in random sequences.
    Discrete Applied Mathematics. 01/2007; 155:868-880.
  • Chapter: Statistical Methods in Physical Mapping
    Sophie Schbath
    07/2006; ISBN: 9780470015902
  • Article: FADO: a statistical method to detect favored or avoided distances between occurrences of motifs using the Hawkes' model.
    Gaelle Gusto, Sophie Schbath
    ABSTRACT: We propose an original statistical method to estimate how the occurrences of a given process along a genome, genes or motifs for instance, may be influenced by the occurrences of a second process. More precisely, the aim is to detect avoided and/or favored distances between two motifs, for instance, suggesting possible interactions at a molecular level. For this, we consider occurrences along the genome as point processes and we use the so-called Hawkes' model. In such a model, the intensity at position t depends linearly on the distances to past occurrences of both processes via two unknown profile functions to be estimated. We perform a nonparametric estimation of both profiles by using B-spline decompositions and a constrained maximum likelihood method. Finally, we use the AIC criterion for the model selection. Simulations show the excellent behavior of our estimation procedure. We then apply it to study (i) the dependence between gene occurrences along the E. coli genome and the occurrences of a motif known to be part of the major promoter for this bacterium, and (ii) the dependence between the yeast S. cerevisiae genes and the occurrences of putative polyadenylation signals. The results are coherent with known biological properties or previous predictions, meaning this method can be of great interest for functional motif detection, or to improve knowledge of some biological mechanisms.
    Statistical Applications in Genetics and Molecular Biology 02/2005; 4:Article24. · 1.52 Impact Factor
  • Article: Numerical Comparison of Several Approximations of the Word Count Distribution in Random Sequences.
    Stéphane Robin, Sophie Schbath
    Journal of Computational Biology. 01/2002; 8:349-359.
  • Article: Occurrence Probability of Structured Motifs in Random Sequences.
    Journal of Computational Biology. 01/2002; 9:761-774.