ABSTRACT: We have developed a statistical framework and software for Differential isOform usage Testing (DOT) using RNA-seq data. Our method, IsoDOT, provides accurate p-values for differential isoform usage testing with respect to a continuous or categorical covariate for any sample size. Simulation studies show that IsoDOT delivers significant improvement in sensitivity and specificity to detect differential isoform usage. We apply IsoDOT to study the change of the mouse transcriptome upon treatment with placebo or haloperidol, a commonly used schizophrenia drug with possible adverse side effects. By comparing mice treated with placebo or haloperidol, we identify a group of genes (e.g., Utrn, Dmd, Grin2b, and Snap25) whose isoform usage responds to haloperidol treatment. We also show that this treatment effect depends on the genetic background of the mice.
Journal of the American Statistical Association 11/2015; 110(511):975-986. DOI:10.1080/01621459.2015.1040880 · 1.98 Impact Factor
ABSTRACT: Surveys of inbred strains of mice are standard approaches to determine the heritability and range of phenotypic variation for biomedical traits. In addition, they may lead to the identification of novel phenotypes and models of human disease. Surprisingly, male reproductive phenotypes are among the least represented traits in the Mouse Phenome Database. Here we report the results of a broad survey of the eight founder inbred strains of both the Collaborative Cross (CC) and the Diversity Outbred populations, two new mouse resources that are being used as platforms for systems genetics and sources of mouse models of human diseases. Our survey includes representatives of the three main subspecies of the house mouse and a mix of classical and wild-derived inbred strains. In addition to standard staples of male reproductive phenotyping such as reproductive organ weights, sperm counts and sperm morphology, our survey includes sperm motility and the first survey of testis histology. As expected for such a broad survey, heritability varies widely among traits. We conclude that although all eight inbred strains are fertile, most display a mix of advantageous and deleterious male reproductive traits. The CAST/EiJ strain is an outlier, with an unusual combination of deleterious male reproductive traits including low sperm counts, high levels of morphologically abnormal sperm, and poor motility. In contrast, sperm from the PWK/PhJ and WSB/EiJ strains had the highest percentages of normal morphology and vigorous motility. Finally, we report an abnormal testis phenotype that is highly heritable and restricted to the WSB/EiJ strain. This phenotype is characterized by the presence of a large, but variable, number of vacuoles in at least 10% of the seminiferous tubules. The onset of the phenotype between two and three weeks of age is temporally correlated with the formation of the blood-testis barrier. We speculate that this phenotype may play a role in high rates of extinction in the CC project and in the phenotypes associated with speciation in genetic crosses that use the WSB/EiJ strain as representative of the Mus musculus domesticus subspecies.
ABSTRACT: New systems genetics approaches are needed to rapidly identify host genes and genetic networks that regulate complex disease outcomes. Using genetically diverse animals from incipient lines of the Collaborative Cross mouse panel, we demonstrate a greatly expanded range of phenotypes relative to classical mouse models of SARS-CoV infection including lung pathology, weight loss and viral titer. Genetic mapping revealed several loci contributing to differential disease responses, including an 8.5Mb locus associated with vascular cuffing on chromosome 3 that contained 23 genes and 13 noncoding RNAs. Integrating phenotypic and genetic data narrowed this region to a single gene, Trim55, an E3 ubiquitin ligase with a role in muscle fiber maintenance. Lung pathology and transcriptomic data from mice genetically deficient in Trim55 were used to validate its role in SARS-CoV-induced vascular cuffing and inflammation. These data establish the Collaborative Cross platform as a powerful genetic resource for uncovering genetic contributions of complex traits in microbial disease severity, inflammation and virus replication in models of outbred populations.
ABSTRACT: Complex human traits are influenced by variation in regulatory DNA through mechanisms that are not fully understood. Because regulatory elements are conserved between humans and mice, a thorough annotation of cis regulatory variants in mice could aid in further characterizing these mechanisms. Here we provide a detailed portrait of mouse gene expression across multiple tissues in a three-way diallel. Greater than 80% of mouse genes have cis regulatory variation. Effects from these variants influence complex traits and usually extend to the human ortholog. Further, we estimate that at least one in every thousand SNPs creates a cis regulatory effect. We also observe two types of parent-of-origin effects, including classical imprinting and a new global allelic imbalance in expression favoring the paternal allele. We conclude that, as with humans, pervasive regulatory variation influences complex genetic traits in mice and provide a new resource toward understanding the genetic control of transcription in mammals.
ABSTRACT: Significant departures from expected Mendelian inheritance ratios (transmission ratio distortion, TRD) are frequently observed in both experimental crosses and natural populations. TRD on mouse Chromosome (Chr) 2 has been reported in multiple experimental crosses, including the Collaborative Cross (CC). Among the eight CC founder inbred strains, we found that Chr 2 TRD was exclusive to females that were heterozygous for the WSB/EiJ allele within a 9.3 Mb region (Chr 2 76.9 - 86.2 Mb). A copy number gain of a 127 kb-long DNA segment (designated as responder to drive, R2d) emerged as the strongest candidate for the causative allele. We mapped R2d sequences to two loci within the candidate interval. R2d1 is located near the proximal boundary, and contains a single copy of R2d in all strains tested. R2d2 maps to a 900 kb interval, and the number of R2d copies varies from zero in classical strains (including the mouse reference genome) to more than 30 in wild-derived strains. Using real-time PCR assays for the copy number, we identified a mutation (R2d2WSBdel1) that eliminates the majority of the R2d2WSB copies without apparent alterations of the surrounding WSB/EiJ haplotype. In a three-generation pedigree segregating for R2d2WSBdel1, the mutation is transmitted to the progeny and Mendelian segregation is restored in females heterozygous for R2d2WSBdel1, thus providing direct evidence that the copy number gain is causal for maternal TRD. We found that transmission ratios in R2d2WSB heterozygous females vary between Mendelian segregation and complete distortion depending on the genetic background, and that TRD is under genetic control of unlinked distorter loci. Although the R2d2WSB transmission ratio was inversely correlated with average litter size, several independent lines of evidence support the contention that female meiotic drive is the cause of the distortion.
We discuss the implications and potential applications of this novel meiotic drive system.
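As a rough illustration of what a "significant departure from expected Mendelian inheritance ratios" means in practice, the sketch below runs an exact two-sided binomial test of observed allele transmissions against the 1:1 Mendelian expectation. The function name and the counts used are hypothetical and chosen only for illustration; this is not the analysis performed in the study above.

```python
from math import comb

def mendelian_p_value(transmitted, offspring):
    """Exact two-sided binomial p-value against a 1:1 transmission ratio.

    transmitted: offspring that inherited the allele of interest (hypothetical count)
    offspring:   total informative offspring
    """
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the upper tail at the more extreme count.
    extreme = max(transmitted, offspring - transmitted)
    tail = sum(comb(offspring, i) for i in range(extreme, offspring + 1)) / 2 ** offspring
    return min(1.0, 2 * tail)

# A perfectly balanced sample gives no evidence of distortion,
# while 10 transmissions out of 10 does.
print(mendelian_p_value(5, 10))
print(mendelian_p_value(10, 10))
```

With larger pedigrees the same test becomes more sensitive to moderate distortion, which is one reason transmission ratios are usually reported together with sample sizes.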
ABSTRACT: Motivation:
The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.
We present a novel algorithm that merges multi-string BWTs in [Formula: see text] time, where LCS is the length of the longest common substring between any of the inputs and N is the total length of all inputs combined (number of symbols), using [Formula: see text] bits, where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
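To make the k-mer search property mentioned above concrete, here is a minimal sketch that builds a multi-string BWT naively (by sorting all rotations of every read) and counts k-mer occurrences with standard FM-index backward search. It does not implement the merge algorithm described in the abstract; the function names are hypothetical, the construction is quadratic and meant only for toy inputs, and it assumes every read is at least as long as the queried k-mer.

```python
from collections import Counter

def multi_string_bwt(reads):
    """Naive multi-string BWT: sort every rotation of every read + '$'
    and keep the last character of each sorted rotation."""
    rotations = []
    for index, read in enumerate(reads):
        text = read + "$"  # '$' sorts before A, C, G, T
        for i in range(len(text)):
            rotations.append((text[i:] + text[:i], index))
    rotations.sort()
    return "".join(rotation[-1] for rotation, _ in rotations)

def count_kmer(bwt, kmer):
    """Count occurrences of kmer in the original reads by backward search."""
    counts = Counter(bwt)
    # first_row[c] = number of symbols in the BWT that sort before c
    first_row, total = {}, 0
    for symbol in sorted(counts):
        first_row[symbol] = total
        total += counts[symbol]
    lo, hi = 0, len(bwt)
    for symbol in reversed(kmer):
        if symbol not in first_row:
            return 0
        # Narrow the row range to rotations prefixed by symbol + current suffix.
        lo = first_row[symbol] + bwt[:lo].count(symbol)
        hi = first_row[symbol] + bwt[:hi].count(symbol)
        if lo >= hi:
            return 0
    return hi - lo

reads = ["ACGT", "CGTA", "ACGA"]  # toy reads, made up for illustration
bwt = multi_string_bwt(reads)
print(bwt)
print(count_kmer(bwt, "CG"))
```

A practical implementation would replace the `bwt[:lo].count(...)` scans with precomputed rank structures; the linear scans here keep the sketch short.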
ABSTRACT: Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins.
Database The Journal of Biological Databases and Curation 01/2014; 2014. DOI:10.1093/database/bau057 · 3.37 Impact Factor
ABSTRACT: X chromosome inactivation (XCI) is the mammalian mechanism of dosage compensation that balances X-linked gene expression between the sexes. Early during female development, each cell of the embryo proper independently inactivates one of its two parental X-chromosomes. In mice, the choice of which X chromosome is inactivated is affected by the genotype of a cis-acting locus, the X-chromosome controlling element (Xce). Xce has been localized to a 1.9 Mb interval within the X-inactivation center (Xic), yet its molecular identity and mechanism of action remain unknown. We combined genotype and sequence data for mouse stocks with detailed phenotyping of ten inbred strains and with the development of a statistical model that incorporates phenotyping data from multiple sources to disentangle sources of XCI phenotypic variance in natural female populations on X inactivation. We have reduced the Xce candidate 10-fold to a 176 kb region located approximately 500 kb proximal to Xist. We propose that structural variation in this interval explains the presence of multiple functional Xce alleles in the genus Mus. We have identified a new allele, Xce(e), present in Mus musculus, and a possible sixth functional allele in Mus spicilegus. We have also confirmed a parent-of-origin effect on X inactivation choice and provide evidence that maternal inheritance magnifies the skewing associated with strong Xce alleles. Based on the phylogenetic analysis of 155 laboratory strains and wild mice, we conclude that Xce(a) is either a derived allele that arose concurrently with the domestication of fancy mice but prior to the derivation of most classical inbred strains, or a rare allele in the wild. Furthermore, we have found that despite the presence of multiple haplotypes in the wild, Mus musculus domesticus has only one functional Xce allele, Xce(b). Lastly, we conclude that each mouse taxon examined has a different functional Xce allele.
ABSTRACT: In this paper, we contrast the resolution and accuracy of determining recombination boundaries using genotyping arrays compared to high-throughput sequencing. In addition, we consider the impacts of sequence coverage and genetic diversity on localizing recombination boundaries. We developed a hidden Markov model for estimating recombination breakpoints based on variant observations seen in the read coverage spanning uniformly sized genomic windows. Our model includes 36 states representing all combinations of 8 genomes, and estimates a founder mosaic that is consistent with the variants observed in the aligned sequences. At HMM transition locations we consider the most likely founder-pair and refine the recombination breakpoints down to an interval spanning two informative variants. We compare this solution to alternate solutions based on microarrays that we have estimated. At 30x coverage the recombination mapping accuracy far exceeds the resolution attainable by any microarray. Even at coverages of 1x and below we are generally able to estimate recombination breakpoints with comparable accuracy.
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
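The count of 36 HMM states quoted in the abstract above is consistent with taking every unordered pair of the 8 founder genomes, with a founder allowed to pair with itself (8 homozygous plus 28 heterozygous pairs). The short sketch below enumerates such a state space; reading "all combinations" as unordered pairs is an assumption here, and the single-letter founder labels are placeholders rather than the strain names used in the paper.

```python
from itertools import combinations_with_replacement

# Placeholder labels for the eight founder genomes (hypothetical).
FOUNDERS = list("ABCDEFGH")

def founder_pair_states(founders):
    """All unordered founder pairs, including a founder paired with itself."""
    return list(combinations_with_replacement(founders, 2))

states = founder_pair_states(FOUNDERS)
homozygous = [pair for pair in states if pair[0] == pair[1]]
heterozygous = [pair for pair in states if pair[0] != pair[1]]
print(len(states), len(homozygous), len(heterozygous))
```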
ABSTRACT: Next generation sequencing techniques have enabled new methods of DNA and RNA quantification. Many of these methods require a step of aligning short reads to some reference genome. If the target organism differs significantly from this reference, alignment errors can lead to significant errors in downstream analysis. Various attempts have been made to integrate known genetic variants into the reference genome so as to construct sample-specific genomes to improve read alignments. However, many hurdles in generating and annotating such genomes remain unsolved. In this paper, we propose a general framework for mapping back and forth between genomes. It employs a new format, MOD, to represent known variants between genomes, and a set of tools that facilitate genome manipulation and mapping. We demonstrate the utility of this framework using three inbred mouse strains. We built pseudogenomes from the mm9 mouse reference genome for three highly divergent mouse strains based on MOD files and used them to map the gene annotations to these new genomes. We observe that a large fraction of genes have their positions or ranges altered. Finally, using RNA-seq and DNA-seq short reads from these strains, we demonstrate that mapping to the new genomes yields a better alignment result than mapping to the standard reference. The MOD files for the 17 mouse strains sequenced in the Wellcome Trust Sanger Institute's Mouse Genomes Project can be found at http://www.csbio.unc.edu/CCstatus/index.py?run=Pseudo The auxiliary tools (i.e. MODtools and Lapels), written in Python, are available at http://code.google.com/p/modtools/ and http://code.google.com/p/lapels/.
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
ABSTRACT: Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings, and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins.
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
ABSTRACT: Motivation:
RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that ~3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ~10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.
We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives.
The software can be downloaded at http://csbio.unc.edu/genescissors/.
Supplementary data are available at Bioinformatics online.
ABSTRACT: Genetic variation contributes to host responses and outcomes following infection by influenza A virus or other viral infections. Yet narrow windows of disease symptoms and confounding environmental factors have made it difficult to identify polymorphic genes that contribute to differential disease outcomes in human populations. Therefore, to control for these confounding environmental variables in a system that models the levels of genetic diversity found in outbred populations such as humans, we used incipient lines of the highly genetically diverse Collaborative Cross (CC) recombinant inbred (RI) panel (the pre-CC population) to study how genetic variation impacts influenza associated disease across a genetically diverse population. A wide range of variation in influenza disease related phenotypes including virus replication, virus-induced inflammation, and weight loss was observed. Many of the disease associated phenotypes were correlated, with viral replication and virus-induced inflammation being predictors of virus-induced weight loss. Despite these correlations, pre-CC mice with unique and novel disease phenotype combinations were observed. We also identified sets of transcripts (modules) that were correlated with aspects of disease. In order to identify how host genetic polymorphisms contribute to the observed variation in disease, we conducted quantitative trait loci (QTL) mapping. We identified several QTL contributing to specific aspects of the host response including virus-induced weight loss, titer, pulmonary edema, neutrophil recruitment to the airways, and transcriptional expression. Existing whole-genome sequence data was applied to identify high priority candidate genes within QTL regions. A key host response QTL was located at the site of the known anti-influenza gene.
We sequenced the coding regions of in the eight CC founder strains, and identified a novel allele that showed reduced ability to inhibit viral replication, while maintaining protection from weight loss.
ABSTRACT: Numerous methods exist for inferring the ancestry mosaic of an admixed individual based on its genotypes and those of its ancestors. These methods rely on biallelic SNPs obtained from genotype calling algorithms, which classify each marker as belonging to one of four states (reference allele, alternate allele, heterozygous, or no call) based on probe hybridization intensity signals. We demonstrate that this conversion of probe intensities to discrete genotypes can lead to a loss of information and introduce errors via incorrect genotype calls. We propose a method that directly infers ancestry from probe intensities by minimizing the intensity difference between a target individual and one or more of its ancestors. We demonstrate our method on mice from the developing Collaborative Cross (CC) genetic reference population, which are admixtures of a common set of eight ancestors. Our samples were genotyped using a 7.8K-marker Illumina Infinium platform called the Mouse Universal Genotyping Array (MUGA). We compare our reconstructions with a standard genotype-based method and validate our results using DNA sequencing data. Our algorithm is able to use information not captured by genotype calls and avoid errors due to incorrect calls.
Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine; 10/2012
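The core idea in the abstract above, choosing ancestry by minimizing the intensity difference rather than by comparing discrete genotype calls, can be sketched in a few lines. The example below is a deliberately simplified illustration, not the published algorithm: it considers a single window of markers, compares the target against one ancestor at a time using a sum of squared differences, and uses made-up two-channel intensity values and hypothetical founder names.

```python
def closest_founder(target, founders):
    """Return the founder whose probe intensities are nearest to the target's.

    target:   list of (x, y) intensity pairs, one per marker in a window
    founders: dict mapping founder name -> list of (x, y) pairs for the same markers
    """
    def squared_difference(a, b):
        # Sum of squared differences over both intensity channels.
        return sum((ax - bx) ** 2 + (ay - by) ** 2
                   for (ax, ay), (bx, by) in zip(a, b))
    return min(founders, key=lambda name: squared_difference(target, founders[name]))

# Toy window of three markers; all numbers are invented for illustration.
founders = {
    "founder_1": [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)],
    "founder_2": [(0.1, 0.9), (0.8, 0.2), (0.9, 0.1)],
}
target = [(0.85, 0.15), (0.75, 0.25), (0.2, 0.8)]
print(closest_founder(target, founders))
```

Note that the second marker has identical intensities in both founders, so a genotype call there would be uninformative, while the first and third markers separate the two candidates clearly.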
ABSTRACT: The Collaborative Cross (CC) is a panel of recombinant inbred lines derived from eight genetically diverse laboratory inbred strains. Recently, the genetic architecture of the CC population was reported based on the genotype of a single male per line, and other publications reported incompletely inbred CC mice that have been used to map a variety of traits. The three breeding sites, in the US, Israel, and Australia, are actively collaborating to accelerate the inbreeding process through marker-assisted inbreeding and to expedite community access of CC lines deemed to have reached defined thresholds of inbreeding. Plans are now being developed to provide access to this novel genetic reference population through distribution centers. Here we provide a description of the distribution efforts by the University of North Carolina Systems Genetics Core, Tel Aviv University, Israel and the University of Western Australia.
ABSTRACT: In this report, I present a complete image-based rendering system. This includes the derivation of a mapping function from first principles, an algorithm for determining the visibility of these mapped points in the resulting image, and a method for reconstructing a continuous image from these mapped points. I refer to this type of mapping function as image warping, because it processes the elements of an image according to their image coordinates and produces outputs that are image coordinates in the resulting image. In addition to the coordinates of the reference image, additional information is required for each pixel. This information is related to the distance of the object seen at a particular pixel from the image plane. There are many different measures that can be used to describe this distance. Distance can be specified as range values describing the Euclidean distance from the visible object to the image’s center-of-projection. If the viewing or image plane is known and the coordinate system is chosen so that the normal of this plane lies a unit distance along the z-axis, then this distance information is called depth or the pixel’s z-value. However, there are many other reasonable choices for representing this same distance. For instance, distance values can be described indirectly by the relative motion of image points induced by a change in the camera’s position; this distance representation is frequently called optical flow, and it is inversely related to the point’s range. Disparity and projective-depth are two more representations of distance for
ABSTRACT: Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research.
We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations.
Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains.
ABSTRACT: The Collaborative Cross Consortium reports here on the development of a unique genetic resource population. The Collaborative Cross (CC) is a multiparental recombinant inbred panel derived from eight laboratory mouse inbred strains. Breeding of the CC lines was initiated at multiple international sites using mice from The Jackson Laboratory. Currently, this innovative project is breeding independent CC lines at the University of North Carolina (UNC), at Tel Aviv University (TAU), and at Geniad in Western Australia (GND). These institutions aim to make publicly available the completed CC lines and their genotypes and sequence information. We genotyped, and report here, results from 458 extant lines from UNC, TAU, and GND using a custom genotyping array with 7500 SNPs designed to be maximally informative in the CC and used a novel algorithm to infer inherited haplotypes directly from hybridization intensity patterns. We identified lines with breeding errors and cousin lines generated by splitting incipient lines into two or more cousin lines at early generations of inbreeding. We then characterized the genome architecture of 350 genetically independent CC lines. Results showed that founder haplotypes are inherited at the expected frequency, although we also consistently observed highly significant transmission ratio distortion at specific loci across all three populations. On chromosome 2, there is significant overrepresentation of WSB/EiJ alleles, and on chromosome X, there is a large deficit of CC lines with CAST/EiJ alleles. Linkage disequilibrium decays as expected and we saw no evidence of gametic disequilibrium in the CC population as a whole or in random subsets of the population. Gametic equilibrium in the CC population is in marked contrast to the gametic disequilibrium present in a large panel of classical inbred strains.
Finally, we discuss access to the CC population and to the associated raw data describing the genetic structure of individual lines. Integration of rich phenotypic and genomic data over time and across a wide variety of fields will be vital to delivering on one of the key attributes of the CC, a common genetic reference platform for identifying causative variants and genetic networks determining traits in mammals.
ABSTRACT: We present full-genome genotype imputations for 100 classical laboratory mouse strains, using a novel method. Using genotypes at 549,683 SNP loci obtained with the Mouse Diversity Array, we partitioned the genome of 100 mouse strains into 40,647 intervals that exhibit no evidence of historical recombination. For each of these intervals we inferred a local phylogenetic tree. We combined these data with 12 million loci with sequence variations recently discovered by whole-genome sequencing in a common subset of 12 classical laboratory strains. For each phylogenetic tree we identified strains sharing a leaf node with one or more of the sequenced strains. We then imputed high- and medium-confidence genotypes for each of 88 nonsequenced genomes. Among inbred strains, we imputed 92% of SNPs genome-wide, with 71% in high-confidence regions. Our method produced 977 million new genotypes with an estimated per-SNP error rate of 0.083% in high-confidence regions and 0.37% genome-wide. Our analysis identified which of the 88 nonsequenced strains would be the most informative for improving full-genome imputation, as well as which additional strain sequences will reveal more new genetic variants. Imputed sequences and quality scores can be downloaded and visualized online.