ABSTRACT: We have developed a statistical framework and software for Differential
isOform usage Testing (DOT) using RNA-seq data. Our method, IsoDOT,
provides accurate p-values for differential isoform usage testing with respect
to a continuous or categorical covariate for any sample size. Simulation
studies show that IsoDOT delivers significant improvement in sensitivity and
specificity to detect differential isoform usage. We apply IsoDOT to study the
change of the mouse transcriptome upon treatment with placebo or haloperidol, which
is a commonly used schizophrenia drug with possible adverse side effects. By
comparing the mice treated with placebo or haloperidol, we identify a group of
genes (e.g., Utrn, Dmd, Grin2b, and Snap25) whose isoform usage responds to
haloperidol treatment. We also show that this treatment effect depends on the
genetic background of the mice.
ABSTRACT: X chromosome inactivation (XCI) is the mammalian mechanism of dosage compensation that balances X-linked gene expression between the sexes. Early during female development, each cell of the embryo proper independently inactivates one of its two parental X chromosomes. In mice, the choice of which X chromosome is inactivated is affected by the genotype of a cis-acting locus, the X-chromosome controlling element (Xce). Xce has been localized to a 1.9 Mb interval within the X-inactivation center (Xic), yet its molecular identity and mechanism of action remain unknown. We combined genotype and sequence data for mouse stocks with detailed phenotyping of ten inbred strains and with a statistical model that incorporates phenotyping data from multiple sources to disentangle the sources of phenotypic variance in X inactivation in natural female populations. We have reduced the Xce candidate interval 10-fold to a 176 kb region located approximately 500 kb proximal to Xist. We propose that structural variation in this interval explains the presence of multiple functional Xce alleles in the genus Mus. We have identified a new allele, Xce(e), present in Mus musculus and a possible sixth functional allele in Mus spicilegus. We have also confirmed a parent-of-origin effect on X inactivation choice and provide evidence that maternal inheritance magnifies the skewing associated with strong Xce alleles. Based on the phylogenetic analysis of 155 laboratory strains and wild mice, we conclude that Xce(a) is either a derived allele that arose concurrently with the domestication of fancy mice but prior to the derivation of most classical inbred strains, or a rare allele in the wild. Furthermore, we have found that, despite the presence of multiple haplotypes in the wild, Mus musculus domesticus has only one functional Xce allele, Xce(b). Lastly, we conclude that each mouse taxon examined has a different functional Xce allele.
ABSTRACT: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing genuinely expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.
We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives.
The software can be downloaded at http://csbio.unc.edu/genescissors/
Supplementary data are available at Bioinformatics online.
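For readers unfamiliar with the evaluation measures named in this abstract, the following minimal sketch shows how precision, recall, and the F-measure are conventionally computed from counts of true positives, false positives, and false negatives. The function name and the example counts are illustrative assumptions; they are not taken from GeneScissors or from the study's data.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from confusion counts.

    tp: transcripts called expressed that really are expressed
    fp: spurious calls (e.g. a pseudogene reported because of misaligned reads)
    fn: expressed transcripts that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
```

Removing false positives raises precision without changing recall, which is why a filter for spurious calls can improve the F-measure even if it recovers no additional transcripts.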
ABSTRACT: Genetic variation contributes to host responses and outcomes following infection by influenza A virus or other viral infections. Yet narrow windows of disease symptoms and confounding environmental factors have made it difficult to identify polymorphic genes that contribute to differential disease outcomes in human populations. Therefore, to control for these confounding environmental variables in a system that models the levels of genetic diversity found in outbred populations such as humans, we used incipient lines of the highly genetically diverse Collaborative Cross (CC) recombinant inbred (RI) panel (the pre-CC population) to study how genetic variation impacts influenza-associated disease across a genetically diverse population. A wide range of variation in influenza disease-related phenotypes including virus replication, virus-induced inflammation, and weight loss was observed. Many of the disease-associated phenotypes were correlated, with viral replication and virus-induced inflammation being predictors of virus-induced weight loss. Despite these correlations, pre-CC mice with unique and novel disease phenotype combinations were observed. We also identified sets of transcripts (modules) that were correlated with aspects of disease. In order to identify how host genetic polymorphisms contribute to the observed variation in disease, we conducted quantitative trait loci (QTL) mapping. We identified several QTL contributing to specific aspects of the host response including virus-induced weight loss, titer, pulmonary edema, neutrophil recruitment to the airways, and transcriptional expression. Existing whole-genome sequence data were applied to identify high priority candidate genes within QTL regions. A key host response QTL was located at the site of the known anti-influenza gene. 
We sequenced the coding regions of in the eight CC founder strains, and identified a novel allele that showed reduced ability to inhibit viral replication, while maintaining protection from weight loss.
ABSTRACT: The Collaborative Cross (CC) is a panel of recombinant inbred lines derived from eight genetically diverse laboratory inbred strains. Recently, the genetic architecture of the CC population was reported based on the genotype of a single male per line, and other publications reported incompletely inbred CC mice that have been used to map a variety of traits. The three breeding sites, in the US, Israel, and Australia, are actively collaborating to accelerate the inbreeding process through marker-assisted inbreeding and to expedite community access of CC lines deemed to have reached defined thresholds of inbreeding. Plans are now being developed to provide access to this novel genetic reference population through distribution centers. Here we provide a description of the distribution efforts by the University of North Carolina Systems Genetics Core, Tel Aviv University, Israel and the University of Western Australia.
ABSTRACT: The JAX Diversity Outbred population is a new mouse resource derived from partially inbred Collaborative Cross strains and maintained by randomized outcrossing. As such, it segregates the same allelic variants as the Collaborative Cross but embeds these in a distinct population architecture in which each animal has a high degree of heterozygosity and carries a unique combination of alleles. Phenotypic diversity is striking and often divergent from phenotypes seen in the founder strains of the Collaborative Cross. Allele frequencies and recombination density in early generations of Diversity Outbred mice are consistent with expectations based on simulations of the mating design. We describe analytical methods for genetic mapping using this resource and demonstrate the power and high mapping resolution achieved with this population by mapping a serum cholesterol trait to a 2-Mb region on chromosome 3 containing only 11 genes. Analysis of the estimated allele effects in conjunction with complete genome sequence data of the founder strains reduced the pool of candidate polymorphisms to seven SNPs, five of which are located in an intergenic region upstream of the Foxo1 gene.
ABSTRACT: Inbred model organisms are powerful tools for genetic studies because they provide reproducible genomes for use in mapping and genetic manipulation. Generating inbred lines via sibling matings, however, is a costly undertaking that requires many successive generations of breeding, during which time many lines fail. We evaluated several approaches for accelerating inbreeding, including the systematic use of back-crosses and marker-assisted breeder selection, which we contrasted with randomized sib-matings. Using simulations, we explored several alternative breeder-selection methods and monitored the gain and loss of genetic diversity, measured by the number of recombination-induced founder intervals, as a function of generation. For each approach we simulated 100,000 independent lines to estimate distributions of generations to achieve full-fixation as well as to achieve a mean heterozygosity level equal to 20 generations of randomized sib-mating. Our analyses suggest that the number of generations to fully inbred status can be substantially reduced with minimal impact on genetic diversity through combinations of parental backcrossing and marker-assisted inbreeding. Although simulations do not consider all confounding factors underlying the inbreeding process, such as a loss of fecundity, our models suggest many viable alternatives for accelerating the inbreeding process.
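To make the baseline in this abstract concrete, here is a deliberately simplified sketch of randomized sib-mating at a single locus: it counts the generations until both breeders carry the same allele on all four chromosomes. It is not the authors' simulation. It ignores recombination, multiple loci, backcrossing, and marker-assisted selection, and every name and default value is an assumption made for illustration.

```python
import random
from statistics import mean


def generations_to_fixation(rng: random.Random, max_generations: int = 10_000) -> int:
    """Simulate full-sib mating at one locus until the breeding pair is fixed.

    The founding pair carries four distinct alleles (0, 1) x (2, 3).
    Each generation, one daughter and one son are drawn; each inherits one
    randomly chosen allele from the mother and one from the father, and the
    two sibs become the next breeding pair.
    """
    mother, father = (0, 1), (2, 3)
    for generation in range(1, max_generations + 1):
        mother, father = (
            (rng.choice(mother), rng.choice(father)),  # daughter
            (rng.choice(mother), rng.choice(father)),  # son
        )
        # The locus is fixed when all four allele copies are identical.
        if len(set(mother + father)) == 1:
            return generation
    raise RuntimeError("locus did not fix within max_generations")


def simulate_lines(n_lines: int = 1000, seed: int = 0) -> list:
    """Run independent lines and return the generations each took to fix."""
    rng = random.Random(seed)
    return [generations_to_fixation(rng) for _ in range(n_lines)]


if __name__ == "__main__":
    results = simulate_lines()
    print("mean generations to fixation at one locus:", mean(results))
```

A whole genome takes far longer to fix than a single locus, because every segregating interval must fix. That is the cost accelerated schemes such as those evaluated above aim to reduce.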
ABSTRACT: We present full-genome genotype imputations for 100 classical laboratory mouse strains, using a novel method. Using genotypes at 549,683 SNP loci obtained with the Mouse Diversity Array, we partitioned the genome of 100 mouse strains into 40,647 intervals that exhibit no evidence of historical recombination. For each of these intervals we inferred a local phylogenetic tree. We combined these data with 12 million loci with sequence variations recently discovered by whole-genome sequencing in a common subset of 12 classical laboratory strains. For each phylogenetic tree we identified strains sharing a leaf node with one or more of the sequenced strains. We then imputed high- and medium-confidence genotypes for each of 88 nonsequenced genomes. Among inbred strains, we imputed 92% of SNPs genome-wide, with 71% in high-confidence regions. Our method produced 977 million new genotypes with an estimated per-SNP error rate of 0.083% in high-confidence regions and 0.37% genome-wide. Our analysis identified which of the 88 nonsequenced strains would be the most informative for improving full-genome imputation, as well as which additional strain sequences will reveal more new genetic variants. Imputed sequences and quality scores can be downloaded and visualized online.
ABSTRACT: Mouse models play a crucial role in the study of human behavioral traits and diseases. Variation of gene expression in brain may play a critical role in behavioral phenotypes, and thus it is of great importance to understand regulation of transcription in mouse brain. In this study, we analyzed the role of two important factors influencing steady-state transcriptional variation in mouse brain. First we considered the effect of assessing whole brain vs. discrete regions of the brain. Second, we investigated the genetic basis of strain effects on gene expression. We examined the transcriptome of three brain regions using Affymetrix expression arrays: whole brain, forebrain, and hindbrain in adult mice from two common inbred strains (C57BL/6J vs. NOD/ShiLtJ) with eight replicates for each brain region and strain combination. We observed significant differences between the transcriptomes of forebrain and hindbrain. In contrast, the transcriptomes of whole brain and forebrain were very similar. Using 4.3 million single-nucleotide polymorphisms identified through whole-genome sequencing of C57BL/6J and NOD/ShiLtJ strains, we investigated the relationship between strain effect in gene expression and DNA sequence similarity. We found that cis-regulatory effects play an important role in gene expression differences between strains and that the cis-regulatory elements are more often located in 5' and/or 3' transcript boundaries, with no apparent preference on either 5' or 3' ends.
ABSTRACT: High-density genotyping arrays that measure hybridization of genomic DNA fragments to allele-specific oligonucleotide probes are widely used to genotype single nucleotide polymorphisms (SNPs) in genetic studies, including human genome-wide association studies. Hybridization intensities are converted to genotype calls by clustering algorithms that assign each sample to a genotype class at each SNP. Data for SNP probes that do not conform to the expected pattern of clustering are often discarded, contributing to ascertainment bias and resulting in lost information (as much as 50% in a recent genome-wide association study in dogs).
We identified atypical patterns of hybridization intensities that were highly reproducible and demonstrated that these patterns represent genetic variants that were not accounted for in the design of the array platform. We characterized variable intensity oligonucleotide (VINO) probes that display such patterns and are found in all hybridization-based genotyping platforms, including those developed for human, dog, cattle, and mouse. When recognized and properly interpreted, VINOs recovered a substantial fraction of discarded probes and counteracted SNP ascertainment bias. We developed software (MouseDivGeno) that identifies VINOs and improves the accuracy of genotype calling. MouseDivGeno produced highly concordant genotype calls when compared with other methods but it uniquely identified more than 786,000 VINOs in 351 mouse samples. We used whole-genome sequence from 14 mouse strains to confirm the presence of novel variants explaining 28,000 VINOs in those strains. We also identified VINOs in human HapMap 3 samples, many of which were specific to an African population. Incorporating VINOs in phylogenetic analyses substantially improved the accuracy of a Mus species tree and local haplotype assignment in laboratory mouse strains.
The problems of ascertainment bias and missing information due to genotyping errors are widely recognized as limiting factors in genetic studies. We have conducted the first formal analysis of the effect of novel variants on genotyping arrays, and we have shown that these variants account for a large portion of miscalled and uncalled genotypes. Genetic studies will benefit from substantial improvements in the accuracy of their results by incorporating VINOs in their analyses.
ABSTRACT: Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research.
We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations.
Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains.
ABSTRACT: Here we provide a genome-wide, high-resolution map of the phylogenetic origin of the genome of most extant laboratory mouse inbred strains. Our analysis is based on the genotypes of wild-caught mice from three subspecies of Mus musculus. We show that classical laboratory strains are derived from a few fancy mice with limited haplotype diversity. Their genomes are overwhelmingly Mus musculus domesticus in origin, and the remainder is mostly of Japanese origin. We generated genome-wide haplotype maps based on identity by descent from fancy mice and show that classical inbred strains have limited and non-randomly distributed genetic diversity. In contrast, wild-derived laboratory strains represent a broad sampling of diversity within M. musculus. Intersubspecific introgression is pervasive in these strains, and contamination by laboratory stocks has played a role in this process. The subspecific origin, haplotype diversity and identity by descent maps can be visualized using the Mouse Phylogeny Viewer (see URLs).
ABSTRACT: The Collaborative Cross (CC) is a mouse recombinant inbred strain panel that is being developed as a resource for mammalian systems genetics. Here we describe an experiment that uses partially inbred CC lines to evaluate the genetic properties and utility of this emerging resource. Genome-wide analysis of the incipient strains reveals high genetic diversity, balanced allele frequencies, and dense, evenly distributed recombination sites, all ideal qualities for a systems genetics resource. We map discrete, complex, and biomolecular traits and contrast two quantitative trait locus (QTL) mapping approaches. Analysis based on inferred haplotypes improves power, reduces false discovery, and provides information to identify and prioritize candidate genes that is unique to multifounder crosses like the CC. The number of expression QTLs discovered here exceeds all previous efforts at eQTL mapping in mice, and we map local eQTL at 1-Mb resolution. We demonstrate that the genetic diversity of the CC, which derives from random mixing of eight founder strains, results in high phenotypic diversity and enhances our ability to map causative loci underlying complex disease-related traits.
Genome Research 03/2011; 21(8):1213-22. · 14.40 Impact Factor
ABSTRACT: We have developed a novel tool for visualizing and analyzing multiple collinear genomes. Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser is web-based and provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse strains.
ABSTRACT: MOTIVATION: High-density SNP data of model animal resources provides opportunities for fine-resolution genetic variation studies. These genetic resources are generated through a variety of breeding schemes that involve multiple generations of matings derived from a set of founder animals. In this article, we investigate the problem of inferring the most probable ancestry of resulting genotypes, given a set of founder genotypes. Due to computational difficulty, existing methods either handle only small pedigree data or disregard the pedigree structure. However, large pedigrees of model animal resources often contain repetitive substructures that can be utilized in accelerating computation. RESULTS: We present an accurate and efficient method that can accept complex pedigrees with inbreeding in inferring genome ancestry. Inbreeding is a commonly used process in generating genetically diverse and reproducible animals. It is often carried out for many generations and can account for most of the computational complexity in real-world model animal pedigrees. Our method builds a hidden Markov model that derives the ancestry probabilities through inbreeding process without explicit modeling in every generation. The ancestry inference is accurate and fast, independent of the number of generations, for model animal resources such as the Collaborative Cross (CC). Experiments on both simulated and real CC data demonstrate that our method offers comparable accuracy to those methods that build an explicit model of the entire pedigree, but much better scalability with respect to the pedigree size.
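The general idea of inferring the most probable founder ancestry along a chromosome with a hidden Markov model can be sketched with a textbook Viterbi decoder. The sketch below is a haploid toy model that ignores pedigree structure entirely, so it does not reproduce the method described in this abstract. The hidden state at each marker is the founder of origin, the chance of a switch between adjacent markers is a single constant, and a mismatch with the founder allele is treated as genotyping error. The function name and both probability parameters are assumptions for illustration.

```python
import math
from typing import List, Sequence


def viterbi_ancestry(observed: Sequence[int],
                     founders: Sequence[Sequence[int]],
                     switch_prob: float = 0.01,
                     error_prob: float = 0.01) -> List[int]:
    """Return the most probable founder index at each marker.

    observed: alleles (0/1) of one haplotype, one entry per marker
    founders: founders[f][m] is the allele of founder f at marker m
    switch_prob: assumed probability of an ancestry change between markers
    error_prob: assumed probability that an observed allele is miscalled
    Requires at least two founders and at least one marker.
    """
    n_founders = len(founders)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (n_founders - 1))

    def log_emit(f: int, m: int) -> float:
        match = founders[f][m] == observed[m]
        return math.log(1.0 - error_prob if match else error_prob)

    # Initialise with a uniform prior over founders at the first marker.
    score = [math.log(1.0 / n_founders) + log_emit(f, 0) for f in range(n_founders)]
    back = []  # back[m-1][f] = best predecessor of state f at marker m
    for m in range(1, len(observed)):
        new_score, pointers = [], []
        for f in range(n_founders):
            best_prev = max(
                range(n_founders),
                key=lambda g: score[g] + (log_stay if g == f else log_switch),
            )
            trans = log_stay if best_prev == f else log_switch
            new_score.append(score[best_prev] + trans + log_emit(f, m))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_founders), key=lambda f: score[f])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example with two hypothetical founders and one ancestry switch.
founders = [[0] * 8, [1] * 8]
path = viterbi_ancestry([0, 0, 0, 0, 1, 1, 1, 1], founders)
```

This decoder's cost grows with the number of markers and the square of the number of founder states. Modelling a full pedigree explicitly multiplies the state space, which is the scalability problem the abstract addresses.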
ABSTRACT: A perspective image represents the spatial relationships of objects in a scene as they appear from a single viewpoint. In contrast, a multi-perspective image combines what is seen from several viewpoints into a single image. Despite their incongruity of view, effective multi-perspective images are able to preserve spatial coherence and can depict, within a single context, details of a scene that are simultaneously inaccessible from a single view, yet easily interpretable by a viewer. In computer vision, multi-perspective images have been used for analysing structure revealed via motion and generating panoramic images with a wide field-of-view using mirrors. In this STAR, we provide a practical guide on topics in multi-perspective modelling and rendering methods and multi-perspective imaging systems. We start with a brief review of multi-perspective image techniques frequently employed by artists such as the visual paradoxes of Escher, the Cubism of Picasso and Braque and multi-perspective panoramas in cel animations. We then characterize existing multi-perspective camera models, with an emphasis on their underlying geometry and image properties. We demonstrate how to use these camera models for creating specific multi-perspective rendering effects. Furthermore, we show that many of these cameras satisfy the multi-perspective stereo constraints and we demonstrate several multi-perspective imaging systems for extracting 3D geometry for computer vision. The participants learn about topics in multi-perspective modelling and rendering for generating compelling pictures for computer graphics and in multi-perspective imaging for extracting 3D geometry for computer vision.
We hope to provide enough fundamentals to satisfy the technical specialist without intimidating curious digital artists interested in multi-perspective images. The intended audience includes digital artists, photographers and computer graphics and computer vision researchers using or building multi-perspective cameras. They will learn about multi-perspective modelling and rendering, along with many real world multi-perspective imaging systems.
ABSTRACT: We present a new method for identifying gene sets associated with labeled samples, where the labels can be case versus control, or genotype differences. Existing approaches to this problem assume that variations observed within a group are due primarily to noise and they, therefore, look for significant mean shifts between groups. Biological evidence suggests variations can also result from the coordination of genes. Our method attempts to identify and assess the significance of changes in gene-gene correlation patterns. We model gene-gene correlations using principal component analysis and compare their significance to a baseline of linear models generated by random permutations of the sample labels. Simulation results show that our method detects changes that are undetectable by Hotelling's T2 method. Its performance on real data is comparable to existing methods with the additional capability of detecting changes in gene-interactions between sample groups.
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, BCB 2010, Niagara Falls, NY, USA, August 2-4, 2010; 01/2010
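The label-permutation baseline described in the preceding abstract can be illustrated with a much simpler statistic than the paper's. The sketch below swaps in the absolute difference in Pearson correlation between two genes across two sample groups in place of the PCA-based model, and assesses it against random relabelings of the samples. All function names, the choice of statistic, and the synthetic data are assumptions for illustration only.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_shift(gene_a, gene_b, labels) -> float:
    """Absolute difference in gene-gene correlation between groups 0 and 1."""
    groups = {0: ([], []), 1: ([], [])}
    for a, b, lab in zip(gene_a, gene_b, labels):
        groups[lab][0].append(a)
        groups[lab][1].append(b)
    return abs(pearson(*groups[0]) - pearson(*groups[1]))


def permutation_p_value(gene_a, gene_b, labels, n_perm: int = 2000, seed: int = 0) -> float:
    """Estimate how often shuffled labels give a shift at least as large."""
    rng = random.Random(seed)
    observed = correlation_shift(gene_a, gene_b, labels)
    shuffled = list(labels)  # shuffling keeps the two group sizes unchanged
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if correlation_shift(gene_a, gene_b, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic data: the two genes are positively correlated in group 0 and
# negatively correlated in group 1, while both genes have the same mean
# in the two groups.
gene_a = [float(i) for i in range(20)] * 2
gene_b = [float(i) for i in range(20)] + [float(19 - i) for i in range(20)]
labels = [0] * 20 + [1] * 20
p_value = permutation_p_value(gene_a, gene_b, labels, n_perm=500, seed=1)
```

In this synthetic example neither gene's mean differs between the groups, so a mean-shift test has nothing to detect, whereas the correlation-based statistic separates the groups clearly.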