Publications (15) · 263 Total impact
-
Article: Getting started in text mining: part two.
PLoS Computational Biology 08/2009; 5(7):e1000411. · 5.22 Impact Factor -
Article: Mismatch oligonucleotides in human and yeast: guidelines for probe design on tiling microarrays.
ABSTRACT: Mismatched oligonucleotides are widely used on microarrays to differentiate specific from nonspecific hybridization. While many experiments rely on such oligos, the hybridization behavior of various degrees of mismatch (MM) structure has not been extensively studied. Here, we present the results of two large-scale microarray experiments on S. cerevisiae and H. sapiens genomic DNA, to explore MM oligonucleotide behavior with real sample mixtures under tiling-array conditions. We examined all possible nucleotide substitutions at the central position of 36-nucleotide probes, and found that nonspecific binding by MM oligos depends upon the individual nucleotide substitutions they incorporate: C-->A, C-->G and T-->A (yielding purine-purine mispairs) are most disruptive, whereas A-->X were least disruptive. We also quantify a marked GC skew effect: substitutions raising probe GC content exhibit higher intensity (and vice versa). This skew is small in highly-expressed regions (+/- 0.5% of total intensity range) and large (+/- 2% or more) elsewhere. Multiple mismatches per oligo are largely additive in effect: each MM added in a distributed fashion causes an additional 21% intensity drop relative to PM, three-fold more disruptive than adding adjacent mispairs (7% drop per MM). We investigate several parameters for oligonucleotide design, including the effects of each central nucleotide substitution on array signal intensity and of multiple MM per oligo. To avoid GC skew, individual substitutions should not alter probe GC content. RNA sample mixture complexity may increase the amount of nonspecific hybridization, magnify GC skew and boost the intensity of MM oligos at all levels.
BMC Genomics 01/2009; 9:635. · 4.07 Impact Factor -
Article: Seeking a new biology through text mining.
ABSTRACT: Tens of thousands of biomedical journals exist, and the deluge of new articles in the biomedical sciences is leading to information overload. Hence, there is much interest in text mining, the use of computational tools to enhance the human ability to parse and understand complex text.
Cell 08/2008; 134(1):9-13. · 32.40 Impact Factor -
Article: Uncovering trends in gene naming.
ABSTRACT: We take stock of current genetic nomenclature and attempt to organize strange and notable gene names. We categorize, for instance, those that involve a naming system transferred from another context (for example, Pavlov's dogs). We hope this analysis provides clues to better steer gene naming in the future.
Genome Biology 02/2008; 9(1):401. · 6.63 Impact Factor -
Article: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
ABSTRACT: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
Nature 07/2007; 447(7146):799-816. · 36.28 Impact Factor -
Article: Publishing perishing? Towards tomorrow's information architecture
ABSTRACT: Scientific articles are tailored to present information in human-readable aliquots. Although the Internet has revolutionized the way our society thinks about information, the traditional text-based framework of the scientific article remains largely unchanged. This format imposes sharp constraints upon the type and quantity of biological information published today. Academic journals alone cannot capture the findings of modern genome-scale inquiry. Like many other disciplines, molecular biology is a science of facts: information inherently suited to database storage. In the past decade, a proliferation of public and private databases has emerged to house genome sequence, protein structure information, functional genomics data and more; these digital repositories are now a vital component of scientific communication. The next challenge is to integrate this vast and ever-growing body of information with academic journals and other media. To truly integrate scientific information we must modernize academic publishing to exploit the power of the Internet. This means more than online access to articles, hyperlinked references and web-based supplemental data; it means making articles fully computer-readable with intelligent markup and Structured Digital Abstracts. Here, we examine the changing roles of scholarly journals and databases. We present our vision of the optimal information architecture for the biosciences, and close with tangible steps to improve our handling of scientific information today while paving the way for an expansive central index in the future.
BMC Bioinformatics. 01/2007; -
Article: Predicting essential genes in fungal genomes.
ABSTRACT: Essential genes are required for an organism's viability, and the ability to identify these genes in pathogens is crucial to directed drug development. Predicting essential genes through computational methods is appealing because it circumvents expensive and difficult experimental screens. Most such prediction is based on homology mapping to experimentally verified essential genes in model organisms. We present here a different approach, one that relies exclusively on sequence features of a gene to estimate essentiality and offers a promising way to identify essential genes in unstudied or uncultured organisms. We identified 14 characteristic sequence features potentially associated with essentiality, such as localization signals, codon adaptation, GC content, and overall hydrophobicity. Using the well-characterized baker's yeast Saccharomyces cerevisiae, we employed a simple Bayesian framework to measure the correlation of each of these features with essentiality. We then employed the 14 features to learn the parameters of a machine learning classifier capable of predicting essential genes. We trained our classifier on known essential genes in S. cerevisiae and applied it to the closely related and relatively unstudied yeast Saccharomyces mikatae. We assessed predictive success in two ways: First, we compared all of our predictions with those generated by homology mapping between these two species. Second, we verified a subset of our predictions with eight in vivo knockouts in S. mikatae, and we present here the first experimentally confirmed essential genes in this species.
Genome Research 10/2006; 16(9):1126-35. · 13.61 Impact Factor -
Article: Genomic analysis of insertion behavior and target specificity of mini-Tn7 and Tn3 transposons in Saccharomyces cerevisiae.
ABSTRACT: Transposons are widely employed as tools for gene disruption. Ideally, they should display unbiased insertion behavior, and incorporate readily into any genomic DNA to which they are exposed. However, many transposons preferentially insert at specific nucleotide sequences. It is unclear to what extent such bias affects their usefulness as mutagenesis tools. Here, we examine insertion site specificity and global insertion behavior of two mini-transposons previously used for large-scale gene disruption in Saccharomyces cerevisiae: Tn3 and Tn7. Using an expanded set of insertion data, we confirm that Tn3 displays marked preference for the AT-rich 5 bp consensus site TA[A/T]TA, whereas Tn7 displays negligible target site preference. On a genome level, both transposons display marked non-uniform insertion behavior: certain sites are targeted far more often than expected, and both distributions depart drastically from Poisson. Thus, to compare their insertion behavior on a genome level, we developed a windowed Kolmogorov-Smirnov (K-S) test to analyze transposon insertion distributions in sequence windows of various sizes. We find that when scored in large windows (>300 bp), both Tn3 and Tn7 distributions appear uniform, whereas in smaller windows, Tn7 appears uniform while Tn3 does not. Thus, both transposons are effective tools for gene disruption, but Tn7 does so with less duplication and a more uniform distribution, better approximating the behavior of the ideal transposon.
Nucleic Acids Research 02/2006; 34(8):e57. · 8.03 Impact Factor -
Article: Large-scale mutagenesis of the yeast genome using a Tn7-derived multipurpose transposon.
ABSTRACT: We present here an unbiased and extremely versatile insertional library of yeast genomic DNA generated by in vitro mutagenesis with a multipurpose element derived from the bacterial transposon Tn7. This mini-Tn7 element has been engineered such that a single insertion can be used to generate a lacZ fusion, gene disruption, and epitope-tagged gene product. Using this transposon, we generated a plasmid-based library of approximately 300,000 mutant alleles; by high-throughput screening in yeast, we identified and sequenced 9032 insertions affecting 2613 genes (45% of the genome). From analysis of 7176 insertions, we found little bias in Tn7 target-site selection in vitro. In contrast, we also sequenced 10,174 Tn3 insertions and found a markedly stronger preference for an AT-rich 5-base pair target sequence. We further screened 1327 insertion alleles in yeast for hypersensitivity to the chemotherapeutic cisplatin. Fifty-one genes were identified, including four functionally uncharacterized genes and 25 genes involved in DNA repair, replication, transcription, and chromatin structure. In total, the collection reported here constitutes the largest plasmid-based set of sequenced yeast mutant alleles to date and, as such, should be singularly useful for gene and genome-wide functional analysis.
Genome Research 11/2004; 14(10A):1975-86. · 13.61 Impact Factor -
Article: Analyzing cellular biochemistry in terms of molecular networks.
ABSTRACT: One way to understand cells and circumscribe the function of proteins is through molecular networks. These networks take a variety of forms including webs of protein-protein interactions, regulatory circuits linking transcription factors and targets, and complex pathways of metabolic reactions. We first survey experimental techniques for mapping networks (e.g., the yeast two-hybrid screens). We then turn our attention to computational approaches for predicting networks from individual protein features, such as correlating gene expression levels or analyzing sequence coevolution. All the experimental techniques and individual predictions suffer from noise and systematic biases. These problems can be overcome to some degree through statistical integration of different experimental datasets and predictive features (e.g., within a Bayesian formalism). Next, we discuss approaches for characterizing the topology of networks, such as finding hubs and analyzing subnetworks in terms of common motifs. Finally, we close with perspectives on how network analysis represents a preliminary step toward a systems approach for modeling cells.
Annual Review of Biochemistry 02/2004; 73:1051-87. · 34.32 Impact Factor -
Article: A semantic web approach to integrating heterogeneous yeast genome data
Top Journals
- Nature (3)
- Genome Research (2)
- Annual Review of Biochemistry (1)
- Cell (1)
- Nucleic Acids Research (1)
Institutions
- 2006–2009: Yale University, Department of Molecular Biophysics and Biochemistry, New Haven, CT, USA
- 2008: University of Chicago, Chicago, IL, USA