ABSTRACT: We describe a new computer program, SnpEff, for rapidly categorizing the effects of variants in genome sequences. Once a genome is sequenced, SnpEff annotates variants based on their genomic locations and predicts coding effects. Annotated genomic locations include intronic, untranslated region, upstream, downstream, splice site, or intergenic regions. Coding effects such as synonymous or non-synonymous amino acid replacement, start codon gains or losses, stop codon gains or losses, or frame shifts can be predicted. Here the use of SnpEff is illustrated by annotating ~356,660 candidate SNPs in ~117 Mb unique sequences, representing a substitution rate of ~1/305 nucleotides, between the Drosophila melanogaster w(1118); iso-2; iso-3 strain and the reference y(1); cn(1) bw(1) sp(1) strain. We show that ~15,842 SNPs are synonymous and ~4,467 SNPs are non-synonymous (N/S ~0.28). The remaining SNPs are in other categories, such as stop codon gains (38 SNPs), stop codon losses (8 SNPs), and start codon gains (297 SNPs) in the 5'UTR. We found, as expected, that the SNP frequency is proportional to the recombination frequency (i.e., highest in the middle of chromosome arms). We also found that start-gain or stop-lost SNPs in Drosophila melanogaster often result in additions of N-terminal or C-terminal amino acids that are conserved in other Drosophila species. It appears that the 5' and 3' UTRs are reservoirs for genetic variation that changes the termini of proteins during evolution of the Drosophila genus. As genome sequencing is becoming inexpensive and routine, SnpEff enables rapid analyses of whole-genome sequencing data to be performed by an individual laboratory.
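The coding-effect categories named in the abstract above can be illustrated with a minimal sketch. This is not SnpEff's implementation or interface: the function name `classify_coding_snp` and its arguments are hypothetical, the standard genetic code is assumed, and only a single in-frame codon is considered (no transcripts, strands, start-codon gains, or frame shifts).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, pos, alt_base):
    """Classify a single-base substitution inside one codon.

    ref_codon: reference codon (3 DNA letters, coding strand).
    pos:       0-based position of the SNP within the codon.
    alt_base:  the substituted base.
    Hypothetical helper for illustration only.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos] + alt_base.upper() + ref_codon[pos + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "non_synonymous"


def ns_ratio(non_synonymous, synonymous):
    """Ratio of non-synonymous to synonymous SNP counts."""
    return non_synonymous / synonymous
```

With the counts quoted in the abstract, `ns_ratio(4467, 15842)` reproduces the reported N/S of about 0.28.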
ABSTRACT: A single expressing copy of the human protamine domain was randomly inserted into an intron of Cyp2c38. The transgenic locus was shown to recapitulate the level of expression observed in normal human testis while not perturbing endogenous protamine expression. The development of an interspecies tiling array was pursued to enable direct comparison of the orthologous protamine domains in a single experiment. Probe design was adapted to generate species-specific high resolution probe sets that would tolerate repetitive elements. Results from competitive hybridizations demonstrate that interspecies tiling arrays are a valuable tool for parallel analysis of highly similar DNA sequences. This approach provides a rapid and reliable means of interrogating samples prior to deep sequencing analysis. These arrays should readily complement most DNA isolation and analysis techniques such as ChIP, nuclease sensitivity and nuclear matrix association assays.
Systems Biology in Reproductive Medicine 02/2011; 57(1-2):54-62. · 1.85 Impact Factor
ABSTRACT: At fertilization, the male germ cell conveys a richly layered genetic landscape consisting of both DNA and its associated epigenetic information. A systems level understanding of these forms of information could reveal some of the origins of idiopathic male infertility. Characterizing the genetic and epigenetic contributions to fertilization could also offer insight into the root causes of aberrant development. Perhaps some of these elements reflect the fetal origins of adult disease. As a host of new tools and techniques emerge, we have the opportunity to reassess our models of gametogenesis in the male. The challenge is no longer to construct biological models from sparse data but to assimilate a wealth of data being generated by high throughput technologies. By aggregating data from multiple high throughput and targeted experiments, bioinformatics offers potential insight into how genetic and epigenetic information are utilized in the sperm-oocyte system. In this chapter, we will review online resources that can aid in conducting an epigenetic investigation and describe approaches to managing second and third generation deep sequencing data.
Keywords: Bioinformatics; Epigenetics; Imprinting; Male gamete; Micro RNA; NGS; Next generation sequencing; RNA
ABSTRACT: Abnormal trophoblast invasion is associated with the most common and most severe complications of human pregnancy. The biology of invasion, as well as the etiology of abnormal invasion, remains poorly understood. The aim of this study was to characterize the transcriptome of the HTR-8/SVneo human cytotrophoblast cell line, which displays well characterized invasive and non-invasive behavior, and to correlate the activity of the transcriptome with nuclear matrix attachment and cell phenotype. Comparison of the invasive to non-invasive HTR transcriptomes was unremarkable. In contrast, comparison of the MARs on chromosomes 14-18 revealed an increased number of MARs associated with the invasive phenotype. These attachment areas were more likely to be associated with silent rather than actively transcribed genes. This study supports the view that nuclear matrix attachment may play an important role in cytotrophoblast invasion by ensuring specific silencing that facilitates invasion.
ABSTRACT: Networks of genes are typically generated from expression changes observed between control and test conditions. Nevertheless, within a single control state many genes show expression variance across biological replicates. These transcripts, typically termed unstable, are usually excluded from analyses because their behavior cannot be reconciled with biological constraints. Grouped as pairs of covariant genes, they can, however, show a consistent response to the progression of a disease. We present a model of coherence arising from sets of covariant genes that was developed in vitro then tested against a range of solid tumors. DGPMs, Decoherence Gene Pair Models, showed changes in network topology reflective of the metastatic transition. Across a range of solid tumor studies the model generalizes to reveal a richly connected topology of networks in healthy tissues that becomes sparser as the disease progresses, reaching a minimum size in the advanced tumors with minimal survivability.
Molecular and Cellular Probes 10/2009; 24(1):53-60. · 1.87 Impact Factor
ABSTRACT: During the haploid phase of mammalian spermatogenesis, nucleosomal chromatin is ultimately repackaged by small, highly basic protamines to generate an extremely compact, toroidal chromatin architecture that is critical to normal spermatozoal function. In common with several species, however, the human spermatozoon retains a small proportion of its chromatin packaged in nucleosomes. As nucleosomal chromatin in spermatozoa is structurally more open than protamine-packaged chromatin, we considered it likely to be more accessible to exogenously applied endonucleases. Accordingly, we have used this premise to identify a population of endonuclease-sensitive DNA sequences in human and murine spermatozoa. Our results show unequivocally that, in contrast to the endonuclease-resistant sperm chromatin packaged by protamines, regions of increased endonuclease sensitivity are closely associated with gene regulatory regions, including many promoter sequences and sequences recognized by CCCTC-binding factor (CTCF). Similar differential packaging of promoters is observed in the spermatozoal chromatin of both mouse and man. These observations imply the existence of epigenetic marks that distinguish gene regulatory regions in male germ cells and prevent their repackaging by protamines during spermiogenesis. The ontology of genes under the control of endonuclease-sensitive regulatory regions implies a role for this phenomenon in subsequent embryonic development.
Genome Research 09/2009; 19(8):1338-49. · 14.40 Impact Factor
ABSTRACT: We used the Illumina reversible-short sequencing technology to obtain 17-fold average depth (s.d. approximately 8) of approximately 94% of the euchromatic genome and approximately 1-5% of the heterochromatin sequence of the Drosophila melanogaster isogenic strain w(1118); iso-2; iso-3. We show that this strain has an approximately 9 kb deletion that uncovers the first exon of the white (w) gene, approximately 4 kb of downstream promoter sequences, and most of the first intron, thus demonstrating that whole-genome sequencing can be used for mutation characterization. We chose this strain because there are thousands of transposon insertion lines and hundreds of isogenic deficiency lines available with this genetic background, such as the Exelixis, Inc., and the DrosDEL collections. We compared our sequence to Release 5 of the finished reference genome sequence, which was made from the isogenic strain y(1); cn(1) bw(1) sp(1), and identified 356,614 candidate SNPs in the approximately 117 Mb unique sequence genome, which represents a substitution rate of approximately 1/305 nucleotides (approximately 0.30%). The distribution of SNPs is not uniform, but rather there is an approximately 2-fold increase in SNPs on the autosome arms compared with the X chromosome and an approximately 7-fold increase when compared to the small 4th chromosome. This is consistent with previous analyses that demonstrated a correlation between recombination frequency and SNP frequency. An unexpected finding was a SNP hotspot in an approximately 20 Mb central region of the 4th chromosome, which might indicate higher than expected recombination frequency in this region of this chromosome. Interestingly, genes involved in sensory perception are enriched in SNP hotspots and developmental genes are enriched in SNP coldspots, which suggests that recombination frequencies might be proportional to the evolutionary selection coefficient.
There are currently 12 Drosophila species sequenced, and this represents one of many isogenic Drosophila melanogaster genome sequences that are in progress. Because of the dramatic increase in power in using isogenic lines rather than outbred individuals, the SNP information should be valuable as a test bed for understanding genotype-by-environment interactions in human population studies.
ABSTRACT: To date, there has been little progress towards identifying markers of normal male fertility. The need to supplement current subjective methods that rely on variable semen parameters to assess fertility status continues to be acknowledged. Several studies have shown that spermatozoal RNAs can describe characteristic failures of the spermatogenic pathway among infertile males. In spite of the inherent heterogeneity of semen that describes the "normal" fertile male, this holds the promise of developing markers that could help identify the ever elusive idiopathic infertile male. Through the analyses of the spermatozoal transcriptome from 24 donors of proven fertility, we identified a series of transcripts that were consistently present among all individuals. The heterogeneous nature of the samples, reflected by their semen parameters, was mirrored by the variability of the observed array signal. Nevertheless, clusters of invariable transcript pairs were identified. These were founded by a single central member that was linked in constant proportion even though the absolute level of each member of the transcript pair often varied among individuals. The presence of pairs of stable transcripts suggests that among the heterogeneity observed in the sperm transcriptome, a distinct set is strictly regulated.
Journal of Molecular Medicine 08/2009; 87(7):735-48. · 4.77 Impact Factor
ABSTRACT: Microarray experiments can appear daunting because the considerations called for in their analysis cover several fields of research. To understand the data microarrays generate, some knowledge of classical statistics and recent complexity theory is useful, while emerging computational techniques such as XML directed workflows could aid in managing the data. These considerations are called for because as experimental tools, microarrays (arrays) exemplify the recent trend in biological research towards high dimensionality datasets. Until recently, observations were made on only a few variables at a time and these were used to support or refute hypotheses, but high dimensionality datasets are generated by observing a very large number of variables (e.g. gene expression measurements) at the same time. The number of expression measurements made on arrays is not only high, but notably high when compared to the size of a typical sample population. This combination of high dimensionality and asymmetry leads to large datasets and fundamental problems when using standard approaches to interpret the data. An end-to-end approach is a general framework in which to place some useful considerations when planning an analysis. The framework described here explores the origins of signal and several sources of variance, approaches to representing high-throughput data, the statistical considerations when modeling array data and the software tools that can aid in carrying out the analysis.
Keywords: Microarray; Statistical analysis; Data repository; False discovery rate; Promoter analysis
ABSTRACT: It is well established that nuclear architecture plays a key role in poising regions of the genome for transcription. This may be achieved using scaffold/matrix attachment regions (S/MARs) that establish loop domains. However, the relationship between changes in the physical structure of the genome as mediated by attachment to the nuclear scaffold/matrix and gene expression is not clearly understood. To define the role of S/MARs in organizing our genome and to resolve the often contradictory loci-specific studies, we have surveyed the S/MARs in HeLa S3 cells on human chromosomes 14-18 by array comparative genomic hybridization. Comparison of LIS (lithium 3,5-diiodosalicylate) extraction to identify SARs and 2 M NaCl extraction to identify MARs revealed that approximately one-half of the sites were in common. The results presented in this study suggest that SARs 5' of a gene are associated with transcript presence whereas MARs contained within a gene are associated with silenced genes. The varied functions of the S/MARs as revealed by the different extraction methods highlight their unique functional contribution.
Human Molecular Genetics 12/2008; 18(4):645-54. · 7.69 Impact Factor
ABSTRACT: The folding of chromatin into topologically constrained loop domains is essential for genomic function. We have identified genomic anchors that define the organization of chromatin loop domains across the human major histocompatibility complex (MHC). This locus contains critical genes for immunity and is associated with more diseases than any other region of the genome. Classical MHC genes are expressed in a cell type-specific pattern and can be induced by cytokines such as interferon-gamma (IFNG). Transcriptional activation of the MHC was associated with a reconfiguration of chromatin architecture resulting from the formation of additional genomic anchors. These findings suggest that the dynamic arrangement of genomic anchors and loops plays a role in transcriptional regulation.
Genome Research 11/2008; 18(11):1778-86. · 14.40 Impact Factor
ABSTRACT: It is known that transcription factors (TFs) work in cooperation with each other to govern gene expression and thus single TF studies may not always reflect the underlying biology. Using microarray data obtained from two independent studies of the first wave of spermatogenesis, we tested the hypothesis that co-expressed spermatogenic genes in cells committed to differentiation are regulated by a set of distinct combinations of TF modules. A computational approach was designed to identify over-represented module combinations in the promoter regions of genes associated with transcripts that either increase or decrease in abundance between the first two major spermatogenic cell types: spermatogonia and spermatocytes. We identified five TFs constituting four module combinations that were correlated with expression and repression of similarly regulated genes. These modules were biologically assessed in the context that they represent the key transcriptional mediators in the developmental transition from the spermatogonia to spermatocyte.
Molecular and General Genetics 08/2008; 280(3):263-74. · 2.88 Impact Factor
ABSTRACT: The quantitative real-time polymerase chain reaction (PCR) remains a cornerstone technique in gene expression analysis and sequence characterization. Despite the importance of the approach to experimental biology, the confident assignment of reaction efficiency to the early cycles of real-time PCR reactions remains problematic. Considerable noise may be generated when few cycles in the amplification are available to estimate peak efficiency. An alternate approach that uses data from beyond the log–linear amplification phase is explored in this article with the aim of reducing noise and adding confidence to efficiency estimates. PCR reaction efficiency is regressed to estimate the per-cycle profile of an asymptotically departed peak efficiency even when this is not closely approximated in the measurable cycles. The process can be repeated over replicates to develop a robust estimate of peak reaction efficiency. This leads to an estimate of the maximum reaction efficiency that may be considered primer design specific. Using a series of biological scenarios, we demonstrate that this approach can provide an accurate estimate of initial template concentration.
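The idea of regressing per-cycle efficiency to recover a peak value can be sketched as follows. This is not the algorithm from the article, whose details the abstract does not give. It assumes a simple logistic amplification model, under which per-cycle efficiency falls linearly with fluorescence, so the intercept of a straight-line fit at zero fluorescence estimates the peak efficiency. The function names and the synthetic data are hypothetical.

```python
def simulate_amplification(f0, e_max, f_max, cycles):
    """Synthetic fluorescence readings under an assumed logistic model:
    F[n+1] = F[n] * (1 + e_max * (1 - F[n] / f_max))."""
    readings = [f0]
    for _ in range(cycles):
        f = readings[-1]
        readings.append(f * (1.0 + e_max * (1.0 - f / f_max)))
    return readings


def estimate_peak_efficiency(readings):
    """Least-squares fit of per-cycle efficiency against fluorescence.

    Per-cycle efficiency is E[n] = F[n+1] / F[n] - 1. Under the logistic
    assumption E[n] = e_max - (e_max / f_max) * F[n], so the intercept of
    the fitted line (fluorescence -> 0) is the peak efficiency.
    """
    xs = readings[:-1]
    ys = [nxt / cur - 1.0 for cur, nxt in zip(readings, readings[1:])]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept = estimated peak efficiency


# Noise-free synthetic run: 40 cycles, true peak efficiency 0.9.
readings = simulate_amplification(f0=1e-3, e_max=0.9, f_max=1000.0, cycles=40)
peak = estimate_peak_efficiency(readings)
```

On real data the early cycles sit below the detection threshold and are dominated by noise, so a fit like this would normally be restricted to cycles with measurable signal and repeated over replicates.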
ABSTRACT: Systems biology presents a new paradigm for elucidating the processes required to organize and sustain life. We now have access to whole genome sequences, gene expression data for multiple cell types, and databases for regulatory elements governing these genes. These resources make it feasible to identify conserved genomic sequences across multiple species, transcription factors regulating the expression of genes with similar expression patterns within a given cell type and to compare expression levels of specific genes between normal and diseased cellular states. In order to utilize this wealth of information, new computational tools that integrate these datasets in a genome-wide context are required. Using the protamine cluster as an example, we present a series of in-house applications that we have developed to integrate, contextualize and visualize datasets across multiple hierarchies.
Systems Biology in Reproductive Medicine 01/2008; 54(2):97-108. · 1.85 Impact Factor
ABSTRACT: High-throughput technologies now afford the opportunity to directly determine the distribution of MARs (matrix attachment regions) throughout a genome. The utility of cosmid and oligonucleotide platforms to identify human chromosome 16 MARs from preparations that employed LIS (lithium di-iodosalicylic acid) and NaCl extraction protocols was examined. The effectiveness of the platforms was then evaluated by Q-PCR (quantitative real-time PCR). Analysis revealed that caution must be exercised, since the representation of non-coding regions varies among platforms. Nevertheless, several interesting trends were revealed. We expect that these technologies will prove useful in systems approaches directed towards defining the role of MARs in various cell types and cellular processes.
Biochemical Society Transactions 07/2007; 35(Pt 3):612-7. · 2.59 Impact Factor
ABSTRACT: Proteolysis is a critical regulatory mechanism for a wide variety of physiologic and pathologic processes. To assist in the identification of proteases, their endogenous inhibitors, and proteins that interact with proteases or proteolytic pathways in biological tissues, a dual-species oligonucleotide microarray has been developed in conjunction with Affymetrix. The Hu/Mu ProtIn microarray contains 516 and 456 probe sets that survey human and mouse genes of interest (proteases, protease inhibitors, or interactors), respectively. To investigate the performance of the array, gene expression profiles were analyzed in pure mouse and human samples (reference RNA; normal and tumor cell lines/tissues) and orthotopically implanted xenografts of human A549 lung and MDA-MB-231 breast carcinomas. Relative gene expression and "present-call" P values were determined for each probe set using dChip and MAS5 software, respectively. Despite the high level of sequence identity of mouse and human protease/inhibitor orthologues and the theoretical potential for cross-hybridization of some of the probes, >95% of the "present calls" (P<0.01) resulted from same-species hybridizations (e.g., human transcripts to human probe sets). To further assess the performance of the microarray, differential gene expression and false discovery rate analyses were carried out on human or mouse sample groups, and data processing methods to optimize performance of the mouse and human probe sets were identified. The Hu/Mu ProtIn microarray is a valuable discovery tool for the identification of components of human and murine proteolytic pathways in health and disease and has particular utility in the determination of cellular origins of proteases and protease inhibitors in xenograft models of human cancer.
Molecular Cancer Research 05/2007; 5(5):443-54. · 4.35 Impact Factor
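The abstract above mentions false discovery rate analyses without naming a procedure. As one common possibility, and only as an assumption about the general kind of method involved, the Benjamini-Hochberg step-up procedure can be sketched as below. The function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the input order, marking which
    hypotheses are rejected while controlling the false discovery
    rate at level alpha.
    """
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical per-probe-set P values, not taken from the study.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(example, alpha=0.05)
```

Because the procedure is step-up, a P value that misses its own threshold is still rejected when a larger P value further down the ranking meets its threshold.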
ABSTRACT: We are coming to appreciate that at fertilization human spermatozoa deliver the paternal genome alongside a suite of structures, proteins and RNAs. Although the role of some of the structures and proteins as requisite elements for early human development has been established, the function of the sperm-delivered RNAs remains a point for discussion. The presence of RNAs in transcriptionally quiescent spermatozoa can only be derived from transcription that precedes late spermiogenesis. A cross-platform microarray strategy was used to assess the profile of human spermatozoal transcripts from fertile males who had fathered at least one child compared to teratozoospermic individuals. Unsupervised clustering of the data followed by pathway and ontological analysis revealed the transcriptional perturbation common to the affected individuals. Transcripts encoding components of various cellular remodeling pathways, such as the ubiquitin-proteasome pathway, were severely disrupted. The origin of the perturbation could be traced as far back as the pachytene stage of spermatogenesis. It is anticipated that this diagnostic strategy will prove valuable for understanding male factor infertility.
Human Molecular Genetics 05/2007; 16(7):763-73. · 7.69 Impact Factor
ABSTRACT: A novel approach to DNase I-sensitivity analysis was applied to examining genes of the spermatogenic pathway, reflective of the substantial morphological and genomic changes that occur during this program of differentiation. A new real-time PCR-based strategy that considers the nuances of response to nuclease treatment was used to assess the nuclease susceptibility through differentiation. Data analysis was automated with the K-Lab PCR algorithm, facilitating the rapid analysis of multiple samples while eliminating the subjectivity usually associated with C(t) analyses. The utility of this assay and analytical paradigm as applied to nuclease-sensitivity mapping is presented.