ABSTRACT: Genotyping microarrays are an important resource for genetic mapping, population genetics and monitoring of the genetic integrity of laboratory stocks. We have developed the third generation of the Mouse Universal Genotyping Array (MUGA) series, GigaMUGA, a 143,259-probe Illumina Infinium II array for the house mouse (Mus musculus). The bulk of the content of GigaMUGA is optimized for genetic mapping in the Collaborative Cross and Diversity Outbred populations and for substrain-level identification of laboratory mice. In addition to 141,090 SNP probes, GigaMUGA contains 2,006 probes for copy number concentrated in structurally polymorphic regions of the mouse genome. The performance of the array is characterized in a set of 500 high-quality reference samples spanning laboratory inbred strains, recombinant inbred lines, outbred stocks, and wild-caught mice. GigaMUGA is highly informative across a wide range of genetically-diverse samples, from laboratory substrains to other Mus species. In addition to describing the content and performance of the array, we provide detailed probe-level annotation and recommendations for quality control.
Full-text · Article · Dec 2015 · G3-Genes Genomes Genetics
ABSTRACT: We have developed a statistical framework and software for Differential isOform usage Testing (DOT) using RNA-seq data. Our method, IsoDOT, provides accurate p-values for differential isoform usage testing with respect to a continuous or categorical covariate, for any sample size. Simulation studies show that IsoDOT delivers significant improvement in sensitivity and specificity to detect differential isoform usage. We apply IsoDOT to study the change in the mouse transcriptome upon treatment with placebo or haloperidol, a commonly used schizophrenia drug with possible adverse side effects. By comparing mice treated with placebo or haloperidol, we identify a group of genes (e.g., Utrn, Dmd, Grin2b, and Snap25) whose isoform usage responds to haloperidol treatment. We also show that this treatment effect depends on the genetic background of the mice.
Full-text · Article · Nov 2015 · Journal of the American Statistical Association
ABSTRACT: Surveys of inbred strains of mice are standard approaches to determine the heritability and range of phenotypic variation for biomedical traits. In addition, they may lead to the identification of novel phenotypes and models of human disease. Surprisingly, male reproductive phenotypes are among the least represented traits in the Mouse Phenome Database. Here we report the results of a broad survey of the eight founder inbred strains of both the Collaborative Cross (CC) and the Diversity Outbred populations, two new mouse resources that are being used as platforms for systems genetics and sources of mouse models of human diseases. Our survey includes representatives of the three main subspecies of the house mouse and a mix of classical and wild-derived inbred strains. In addition to standard staples of male reproductive phenotyping such as reproductive organ weights, sperm counts and sperm morphology, our survey includes sperm motility and the first survey of testis histology. As expected for such a broad survey, heritability varies widely among traits. We conclude that although all eight inbred strains are fertile, most display a mix of advantageous and deleterious male reproductive traits. The CAST/EiJ strain is an outlier, with an unusual combination of deleterious male reproductive traits including low sperm counts, high levels of morphologically abnormal sperm, and poor motility. In contrast, sperm from the PWK/PhJ and WSB/EiJ strains had the highest percentages of normal morphology and vigorous motility. Finally, we report an abnormal testis phenotype that is highly heritable and restricted to the WSB/EiJ strain. This phenotype is characterized by the presence of a large, but variable, number of vacuoles in at least 10% of the seminiferous tubules. The onset of the phenotype between two and three weeks of age is temporally correlated with the formation of the blood-testis barrier.
We speculate that this phenotype may play a role in high rates of extinction in the CC project and in the phenotypes associated with speciation in genetic crosses that use the WSB/EiJ strain as representative of the Mus musculus domesticus subspecies.
Preview · Article · Oct 2015 · G3-Genes Genomes Genetics
ABSTRACT: Author Summary
New emerging pathogens are a significant threat to human health with at least six highly pathogenic viruses, including four respiratory viruses, having spread from animal hosts into the human population within the past 15 years. With the emergence of new pathogens, new and better animal models are needed in order to better understand the disease these pathogens cause; to assist in the rapid development of therapeutics; and importantly to evaluate the role of natural host genetic variation in regulating disease outcome. We used incipient lines of the Collaborative Cross, a newly available recombinant inbred mouse panel, to identify polymorphic host genes that contribute to SARS-CoV pathogenesis. We discovered new animal models that better capture the range of disease found in human SARS patients and also found four novel susceptibility loci governing various aspects of SARS-induced pathogenesis. By integrating statistical, genetic and bioinformatic approaches we were able to narrow candidate genome regions to highly likely candidate genes. We narrowed one locus to a single candidate gene, Trim55, and confirmed its role in the inflammatory response to SARS-CoV infection through the use of knockout mice. This work identifies a novel function for Trim55 and also demonstrates the utility of the CC as a platform for identifying the genetic contributions of complex traits.
ABSTRACT: Complex human traits are influenced by variation in regulatory DNA through mechanisms that are not fully understood. Because regulatory elements are conserved between humans and mice, a thorough annotation of cis regulatory variants in mice could aid in further characterizing these mechanisms. Here we provide a detailed portrait of mouse gene expression across multiple tissues in a three-way diallel. More than 80% of mouse genes have cis regulatory variation. Effects from these variants influence complex traits and usually extend to the human ortholog. Further, we estimate that at least one in every thousand SNPs creates a cis regulatory effect. We also observe two types of parent-of-origin effects, including classical imprinting and a new global allelic imbalance in expression favoring the paternal allele. We conclude that, as with humans, pervasive regulatory variation influences complex genetic traits in mice and provide a new resource toward understanding the genetic control of transcription in mammals.
ABSTRACT: Significant departures from expected Mendelian inheritance ratios (transmission ratio distortion, TRD) are frequently observed in both experimental crosses and natural populations. TRD on mouse Chromosome (Chr) 2 has been reported in multiple experimental crosses, including the Collaborative Cross (CC). Among the eight CC founder inbred strains, we found that Chr 2 TRD was exclusive to females that were heterozygous for the WSB/EiJ allele within a 9.3 Mb region (Chr 2 76.9 - 86.2 Mb). A copy number gain of a 127 kb-long DNA segment (designated as responder to drive, R2d) emerged as the strongest candidate for the causative allele. We mapped R2d sequences to two loci within the candidate interval. R2d1 is located near the proximal boundary, and contains a single copy of R2d in all strains tested. R2d2 maps to a 900 kb interval, and the number of R2d copies varies from zero in classical strains (including the mouse reference genome) to more than 30 in wild-derived strains. Using real-time PCR assays for the copy number, we identified a mutation (R2d2WSBdel1) that eliminates the majority of the R2d2WSB copies without apparent alterations of the surrounding WSB/EiJ haplotype. In a three-generation pedigree segregating for R2d2WSBdel1, the mutation is transmitted to the progeny and Mendelian segregation is restored in females heterozygous for R2d2WSBdel1, thus providing direct evidence that the copy number gain is causal for maternal TRD. We found that transmission ratios in R2d2WSB heterozygous females vary between Mendelian segregation and complete distortion depending on the genetic background, and that TRD is under genetic control of unlinked distorter loci. Although the R2d2WSB transmission ratio was inversely correlated with average litter size, several independent lines of evidence support the contention that female meiotic drive is the cause of the distortion.
We discuss the implications and potential applications of this novel meiotic drive system.
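As an illustration of what a departure from Mendelian segregation means numerically, the sketch below tests offspring allele counts from a heterozygous parent against the expected 1:1 ratio. It is a generic chi-square goodness-of-fit test, not the analysis used in the study; the function names and the example counts are hypothetical.

```python
import math
from typing import Tuple


def transmission_ratio(n_focal: int, n_other: int) -> float:
    """Fraction of offspring that inherited the focal allele."""
    return n_focal / (n_focal + n_other)


def mendelian_chi_square(n_focal: int, n_other: int) -> Tuple[float, float]:
    """Chi-square goodness-of-fit test against 1:1 transmission.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    total = n_focal + n_other
    if total == 0:
        raise ValueError("need at least one genotyped offspring")
    expected = total / 2
    chi2 = ((n_focal - expected) ** 2 + (n_other - expected) ** 2) / expected
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 80 offspring carry the focal allele, 20 carry the other.
ratio = transmission_ratio(80, 20)
chi2, p_value = mendelian_chi_square(80, 20)
print(f"transmission ratio = {ratio:.2f}, chi2 = {chi2:.1f}, p = {p_value:.2e}")
```

A transmission ratio near 0.5 with a large p-value is consistent with Mendelian segregation; a ratio approaching 1.0 corresponds to the complete distortion described in the abstract.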
ABSTRACT: Numerous microarray genotype-calling methods rely on fitting a parametric model to clusters derived from the hybridization intensities of training data. However, in most cases we are uncertain about the expected sample distribution, and the resulting parametric model tends to be inaccurate if the assumptions about the data distribution are not met. Moreover, many methods assume four genotypes (reference allele, alternate allele, heterozygous allele, or no call) and use a common parametric model that applies to all probes. We demonstrate that converting probe intensities to discrete genotypes and applying the same model to all probes results in information loss and even incorrect genotype calls due to genomic variations within probes. We make no assumption about the data distribution. We represent the cluster distribution using a non-parametric model that is consistent with the data. The model can be easily evaluated, and it provides genotype calls for a given sample's marker using a table lookup. Furthermore, our algorithms make no prior assumptions concerning the number of genotype calls. We apply the algorithms to each probe separately, whereas others infer a common set of clusters that apply to all probes. We demonstrate our methods on the Collaborative Cross (CC) genetic reference mouse population, with all samples genotyped using a 78,000-marker genotyping array on the Illumina platform. Our algorithm exhibits high concordance with Illumina genotype calls and achieves > 98% call rates on all CC samples. Code for the described algorithms is available by request from the authors.
ABSTRACT: Many methods have been developed for mapping quantitative trait loci (QTLs) using microarrays. Traditional methods for QTL mapping rely on the assumption that biallelic genotype calls represent the complete genetic variation at a marker. In reality, the process of converting microarray intensities to discrete genotype calls results in the loss of marker information on other variations involving the marker sequence, such as nearby SNPs, deletions, or copy numbers. We have developed a novel approach to QTL mapping that directly uses microarray marker intensities. Our method scans for marker windows where the intensity distances between sample pairs are correlated with the quantitative phenotype differences. The presence of such markers indicates that samples which are genetically close together in the region also share similar phenotype values, suggesting the presence of a QTL. The significance of putative QTLs is then assessed through permutation testing. By directly incorporating genotype intensities, our method eliminates intermediate processes such as genotype calling or ancestry inference that may introduce uncertainty or data loss. We tested our method on synthetic phenotype data of mice genotyped with the 78K-marker MegaMUGA array, and our results compared favorably to those of R/qtl, a well-established QTL mapping package. In addition, we used our method to map the binary albino trait in inbred and backcrossed mice to the tyrosinase (Tyr) gene on chromosome 7, and we also verified several QTLs found to affect colitis-related traits from a previous mouse study.
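The idea of correlating pairwise intensity distances with pairwise phenotype differences, then assessing significance by permutation, can be sketched as follows. This is a minimal Mantel-style illustration for a single marker window, assuming Euclidean distance and Pearson correlation; the abstract does not specify the actual distance, statistic, or windowing, and all names and data here are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def window_score(intensities, phenotypes):
    """Correlate pairwise intensity distances with pairwise phenotype differences."""
    pairs = list(itertools.combinations(range(len(phenotypes)), 2))
    distances = [math.dist(intensities[i], intensities[j]) for i, j in pairs]
    differences = [abs(phenotypes[i] - phenotypes[j]) for i, j in pairs]
    return pearson(distances, differences)


def permutation_p_value(intensities, phenotypes, n_perm=999, seed=0):
    """Shuffle phenotypes across samples to build a null distribution of scores."""
    rng = random.Random(seed)
    observed = window_score(intensities, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if window_score(intensities, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical window: six samples, one (x, y) intensity pair each.
intensities = [(0.10, 0.90), (0.12, 0.88), (0.09, 0.92),
               (0.90, 0.10), (0.88, 0.12), (0.91, 0.09)]
phenotypes = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
score, p_value = permutation_p_value(intensities, phenotypes)
print(f"score = {score:.3f}, permutation p = {p_value:.3f}")
```

With only six samples the smallest achievable p-value is limited by the number of distinct permutations, so a real scan would need many more samples per window.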
ABSTRACT: Motivation:
The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.
We present a novel algorithm that merges multi-string BWTs in [Formula: see text] time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using [Formula: see text] bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
ABSTRACT: RNA-seq technology enables large-scale studies of allele-specific expression (ASE), or the expression difference between maternal and paternal alleles. Here, we study ASE in animals for which parental RNA-seq data are available. While most methods for determining ASE rely on read alignment, read alignment either leads to reference bias or requires knowledge of genomic variants in each parental strain. When RNA-seq data are available for both parental strains of a hybrid animal, it is possible to infer ASE with minimal reference bias and without knowledge of parental genomic variants. Our approach first uses parental RNA-seq reads to discover maternal and paternal versions of transcript sequences. Using these alternative transcript sequences as features, we estimate abundance levels of transcripts in the hybrid animal using a modified lasso linear regression model.
We tested our methods on synthetic data from the mouse transcriptome and compared our results with those of Trinity, a state-of-the-art de novo RNA-seq assembler. Our methods achieved high sensitivity and specificity in both identifying expressed transcripts and transcripts exhibiting ASE. We also ran our methods on real RNA-seq mouse data from two F1 samples with wild-derived parental strains and were able to validate known genes exhibiting ASE, as well as confirm the expected maternal contribution ratios in all genes and genes on the X chromosome.
ABSTRACT: Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins.
Preview · Article · Jan 2014 · Database The Journal of Biological Databases and Curation
ABSTRACT: X chromosome inactivation (XCI) is the mammalian mechanism of dosage compensation that balances X-linked gene expression between the sexes. Early during female development, each cell of the embryo proper independently inactivates one of its two parental X-chromosomes. In mice, the choice of which X chromosome is inactivated is affected by the genotype of a cis-acting locus, the X-chromosome controlling element (Xce). Xce has been localized to a 1.9 Mb interval within the X-inactivation center (Xic), yet its molecular identity and mechanism of action remain unknown. We combined genotype and sequence data for mouse stocks with detailed phenotyping of ten inbred strains and with the development of a statistical model that incorporates phenotyping data from multiple sources to disentangle sources of XCI phenotypic variance in natural female populations on X inactivation. We have reduced the Xce candidate 10-fold to a 176 kb region located approximately 500 kb proximal to Xist. We propose that structural variation in this interval explains the presence of multiple functional Xce alleles in the genus Mus. We have identified a new allele, Xce(e), present in Mus musculus and a possible sixth functional allele in Mus spicilegus. We have also confirmed a parent-of-origin effect on X inactivation choice and provide evidence that maternal inheritance magnifies the skewing associated with strong Xce alleles. Based on the phylogenetic analysis of 155 laboratory strains and wild mice, we conclude that Xce(a) is either a derived allele that arose concurrently with the domestication of fancy mice but prior to the derivation of most classical inbred strains, or a rare allele in the wild. Furthermore, we have found that despite the presence of multiple haplotypes in the wild, Mus musculus domesticus has only one functional Xce allele, Xce(b). Lastly, we conclude that each mouse taxon examined has a different functional Xce allele.
ABSTRACT: Mapping reads to a reference sequence is a common step when analyzing allele effects in high throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings, and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins.
ABSTRACT: In this paper, we contrast the resolution and accuracy of determining recombination boundaries using genotyping arrays compared to high-throughput sequencing. In addition, we consider the impacts of sequence coverage and genetic diversity on localizing recombination boundaries. We developed a hidden Markov model for estimating recombination breakpoints based on variant observations seen in the read coverage spanning uniformly sized genomic windows. Our model includes 36 states representing all combinations of 8 genomes, and estimates a founder mosaic that is consistent with the variants observed in the aligned sequences. At HMM transition locations we consider the most likely founder-pair and refine the recombination breakpoints down to an interval spanning two informative variants. We compare this solution to alternate solutions based on microarrays that we have estimated. At 30x coverage the recombination mapping accuracy far exceeds the resolution attainable by any microarray. Even at coverages of 1x and below we are generally able to estimate recombination breakpoints with comparable accuracy.
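The count of 36 states follows from taking unordered pairs of the 8 founder genomes with repetition allowed: 8 homozygous states plus 28 heterozygous ones. The sketch below enumerates them; the one-letter founder labels and the helper names are assumptions for illustration, and the abstract does not describe how the model's transitions are parameterized.

```python
from collections import Counter
from itertools import combinations_with_replacement

FOUNDERS = "ABCDEFGH"  # hypothetical one-letter labels for the eight founder genomes


def diplotype_states(founders=FOUNDERS):
    """All unordered founder pairs, e.g. 'AA' (homozygous) or 'AB' (heterozygous)."""
    return [a + b for a, b in combinations_with_replacement(founders, 2)]


def haplotypes_changed(state_from, state_to):
    """How many of the two founder haplotypes differ between two states (0, 1, or 2)."""
    shared = sum((Counter(state_from) & Counter(state_to)).values())
    return 2 - shared


states = diplotype_states()
homozygous = [s for s in states if s[0] == s[1]]
heterozygous = [s for s in states if s[0] != s[1]]
print(len(states), len(homozygous), len(heterozygous))  # prints "36 8 28"
```

A helper such as `haplotypes_changed` could be used to make single-haplotype switches more probable than double switches, although that design choice is an assumption and not something the abstract states.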
ABSTRACT: Next generation sequencing techniques have enabled new methods of DNA and RNA quantification. Many of these methods require a step of aligning short reads to some reference genome. If the target organism differs significantly from this reference, alignment errors can lead to significant errors in downstream analysis. Various attempts have been made to integrate known genetic variants into the reference genome so as to construct sample-specific genomes to improve read alignments. However, many hurdles in generating and annotating such genomes remain unsolved. In this paper, we propose a general framework for mapping back and forth between genomes. It employs a new format, MOD, to represent known variants between genomes, and a set of tools that facilitate genome manipulation and mapping. We demonstrate the utility of this framework using three inbred mouse strains. We built pseudogenomes from the mm9 mouse reference genome for three highly divergent mouse strains based on MOD files and used them to map the gene annotations to these new genomes. We observe that a large fraction of genes have their positions or ranges altered. Finally, using RNA-seq and DNA-seq short reads from these strains, we demonstrate that mapping to the new genomes yields a better alignment result than mapping to the standard reference. The MOD files for the 17 mouse strains sequenced in the Wellcome Trust Sanger Institute's Mouse Genomes Project can be found at http://www.csbio.unc.edu/CCstatus/index.py?run=Pseudo. The auxiliary tools (i.e. MODtools and Lapels), written in Python, are available at http://code.google.com/p/modtools/ and http://code.google.com/p/lapels/.
ABSTRACT: Motivation:
RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that ~3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ~10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.
We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives.
The software can be downloaded at http://csbio.unc.edu/genescissors/.
Supplementary data are available at Bioinformatics online.
ABSTRACT: Genetic variation contributes to host responses and outcomes following infection by influenza A virus or other viral infections. Yet narrow windows of disease symptoms and confounding environmental factors have made it difficult to identify polymorphic genes that contribute to differential disease outcomes in human populations. Therefore, to control for these confounding environmental variables in a system that models the levels of genetic diversity found in outbred populations such as humans, we used incipient lines of the highly genetically diverse Collaborative Cross (CC) recombinant inbred (RI) panel (the pre-CC population) to study how genetic variation impacts influenza associated disease across a genetically diverse population. A wide range of variation in influenza disease related phenotypes including virus replication, virus-induced inflammation, and weight loss was observed. Many of the disease associated phenotypes were correlated, with viral replication and virus-induced inflammation being predictors of virus-induced weight loss. Despite these correlations, pre-CC mice with unique and novel disease phenotype combinations were observed. We also identified sets of transcripts (modules) that were correlated with aspects of disease. In order to identify how host genetic polymorphisms contribute to the observed variation in disease, we conducted quantitative trait loci (QTL) mapping. We identified several QTL contributing to specific aspects of the host response including virus-induced weight loss, titer, pulmonary edema, neutrophil recruitment to the airways, and transcriptional expression. Existing whole-genome sequence data was applied to identify high priority candidate genes within QTL regions. A key host response QTL was located at the site of the known anti-influenza gene.
We sequenced the coding regions of in the eight CC founder strains, and identified a novel allele that showed reduced ability to inhibit viral replication, while maintaining protection from weight loss.
ABSTRACT: Numerous methods exist for inferring the ancestry mosaic of an admixed individual based on its genotypes and those of its ancestors. These methods rely on biallelic SNPs obtained from genotype calling algorithms, which classify each marker as belonging to one of four states (reference allele, alternate allele, heterozygous, or no call) based on probe hybridization intensity signals. We demonstrate that this conversion of probe intensities to discrete genotypes can lead to a loss of information and introduce errors via incorrect genotype calls. We propose a method that directly infers ancestry from probe intensities by minimizing the intensity difference between a target individual and one or more of its ancestors. We demonstrate our method on mice from the developing Collaborative Cross (CC) genetic reference population, which are admixtures of a common set of eight ancestors. Our samples were genotyped using a 7.8K-marker Illumina Infinium platform called the Mouse Universal Genotyping Array (MUGA). We compare our reconstructions with a standard genotype-based method and validate our results using DNA sequencing data. Our algorithm is able to use information not captured by genotype calls and avoid errors due to incorrect calls.
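A deliberately simplified way to picture "minimizing the intensity difference between a target individual and its ancestors" is a per-marker nearest-neighbor assignment, sketched below. This is not the published algorithm: it treats each marker independently, considers only single ancestors rather than ancestor pairs, and assumes Euclidean distance on two-channel (x, y) intensities. All names and values are hypothetical.

```python
import math


def nearest_ancestor_per_marker(target, ancestors):
    """Assign each marker to the ancestor whose intensity is closest to the target's.

    target:    list of (x, y) intensity pairs, one per marker
    ancestors: dict mapping ancestor name -> list of (x, y) pairs, same marker order
    Ties go to the ancestor listed first in the dict.
    """
    calls = []
    for marker, point in enumerate(target):
        best = min(ancestors, key=lambda name: math.dist(point, ancestors[name][marker]))
        calls.append(best)
    return calls


# Hypothetical intensities for two ancestors and one target across three markers.
ancestors = {
    "A": [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "B": [(0.0, 1.0), (1.0, 0.1), (1.0, 0.0)],
}
target = [(0.9, 0.1), (0.95, 0.02), (0.9, 0.2)]
print(nearest_ancestor_per_marker(target, ancestors))  # prints ['A', 'A', 'B']
```

A practical method would also need to penalize switching between ancestors at adjacent markers, since true ancestry segments span many consecutive markers; independent per-marker calls like these would be noisy.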