ABSTRACT: The advent of next-generation sequencing technologies has greatly promoted advances in the study of human diseases at the genomic, transcriptomic, and epigenetic levels. Exome sequencing, where the coding region of the genome is captured and sequenced at a deep level, has proven to be a cost-effective method to detect disease-causing variants and discover gene targets. In this review, we outline the general framework of whole exome sequence data analysis. We focus on established bioinformatics tools and applications that support five analytical steps: raw data quality assessment, preprocessing, alignment, post-processing, and variant analysis (detection, annotation, and prioritization). We evaluate the performance of open-source alignment programs and variant calling tools using simulated and benchmark datasets, and highlight the challenges posed by the lack of concordance among variant detection tools. Based on these results, we recommend adopting multiple tools and resources to reduce false positives and increase the sensitivity of variant calling. In addition, we briefly discuss the current status and solutions for big data management, analysis, and summarization in the field of bioinformatics.
Cancer Informatics 09/2014; 2014(Suppl. 2):67-82.
ABSTRACT: Background / Purpose:
Next-generation sequencing (NGS) technology has led to the discovery of causal genetic variants associated with human diseases at high productivity and low cost. Although many open-source software tools are available, each carries different assumptions and biases. It remains a challenge to create a framework that (1) minimizes human intervention, (2) reduces analysis time, (3) leverages the optimal configuration of these tools, and (4) integrates the results of distinct methodologies into one consensus set with higher confidence.
We developed ExScalibur, a highly scalable and automated whole exome sequencing analysis pipeline on the University of Chicago’s Tarbell high-performance cluster system. The pipeline integrates three aligners and six variant callers, allowing accurate detection of genomic aberrations. Having compared the performance of individual tools and consensus approaches, we recommend focusing on high-quality calls detected by at least two callers for downstream analysis.
International Conference on Intelligent Systems for Molecular Biology (ISMB) 2014; 08/2014
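The "detected by at least two callers" recommendation above can be illustrated with a minimal Python sketch. It is not ExScalibur's code: the function name `consensus_calls`, the caller labels, and the toy variants are all hypothetical, and variants are assumed to be already normalized to comparable `(chrom, pos, ref, alt)` tuples.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` distinct callers.

    `calls_by_caller` maps a caller label to an iterable of variants,
    each a hashable (chrom, pos, ref, alt) tuple (an assumed encoding).
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # set() ensures a caller that lists a variant twice counts once.
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_callers}


# Hypothetical toy input; not real call sets.
calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G")],
    "caller_c": [("chr3", 300, "G", "A"), ("chr2", 200, "C", "T")],
}
consensus = consensus_calls(calls)
```

Raising `min_callers` trades sensitivity for fewer false positives; with the toy input above, no variant is supported by all three callers.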
ABSTRACT: Community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) is an emerging threat to human health throughout the world. Rodent MRSA pneumonia models mainly focus on the early innate immune responses to MRSA lung infection. However, the molecular pattern and mechanisms of recovery from MRSA lung infection are largely unknown. In this study, a sublethal mouse MRSA pneumonia model was employed to investigate late events during the recovery from MRSA lung infection. We compared lung bacterial clearance, bronchoalveolar lavage fluid (BALF) characterization, lung histology, lung cell proliferation, lung vascular permeability and lung gene expression profiling between days 1 and 3 post MRSA lung infection. Compared to day 1 post infection, bacterial colony counts, BALF total cell number and BALF protein concentration significantly decreased at day 3 post infection. Lung cDNA microarray analysis identified 47 significantly up-regulated and 35 down-regulated genes (p<0.01, 1.5 fold change [up and down]). The pattern of gene expression suggests that lung recovery is characterized by enhanced cell division, vascularization, wound healing and adjustment of host adaptive immune responses. Proliferation assay by PCNA staining further confirmed that at day 3 lungs have significantly higher cell proliferation than at day 1. Furthermore, at day 3 lungs displayed significantly lower levels of vascular permeability to albumin, compared to day 1. Collectively, these data help us elucidate the molecular mechanisms of the recovery after MRSA lung infection.
PLoS ONE 01/2013; 8(8):e70176. · 3.53 Impact Factor
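The gene-selection rule quoted in the abstract above (p<0.01 with a 1.5 fold change in either direction) can be sketched as a simple filter. This is an illustrative Python sketch only: it assumes fold change is expressed as a day 3 / day 1 ratio, and the function name and gene labels are hypothetical rather than taken from the study.

```python
def classify_genes(results, fc_cutoff=1.5, p_cutoff=0.01):
    """Split genes into up- and down-regulated lists.

    `results` maps a gene label to (fold_change, p_value), where
    fold_change is assumed to be a ratio (day 3 / day 1), so that
    down-regulation by 1.5-fold corresponds to a ratio <= 1 / 1.5.
    """
    up, down = [], []
    for gene, (fold_change, p_value) in results.items():
        if p_value >= p_cutoff:
            continue  # not significant at the chosen threshold
        if fold_change >= fc_cutoff:
            up.append(gene)
        elif fold_change <= 1.0 / fc_cutoff:
            down.append(gene)
    return up, down


# Hypothetical toy values; not data from the paper.
toy = {
    "gene_a": (2.0, 0.001),   # significant, up
    "gene_b": (0.5, 0.005),   # significant, down
    "gene_c": (1.2, 0.0001),  # significant, but change too small
    "gene_d": (3.0, 0.2),     # large change, but not significant
}
up_genes, down_genes = classify_genes(toy)
```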
ABSTRACT: Linezolid (LZD) is beneficial to patients with MRSA pneumonia, but whether and how LZD influences global host lung immune responses at the mRNA level during MRSA-mediated pneumonia is still unknown.
A lethal mouse model of MRSA pneumonia mediated by USA300 was employed to study the influence of LZD on survival, while the sublethal mouse model was used to examine the effect of LZD on bacterial clearance and lung gene expression during MRSA pneumonia. LZD (100 mg/kg/day, IP) was given to C57BL/6 mice for three days. On Day 1 and Day 3 post infection, bronchoalveolar lavage fluid (BALF) protein concentration and levels of cytokines including IL6, TNFα, IL1β, Interferon-γ and IL17 were measured. In the sublethal model, left lungs were used to determine bacterial clearance and right lungs for whole-genome transcriptional profiling of lung immune responses.
LZD therapy significantly improved survival and bacterial clearance. It also significantly decreased BALF protein concentration and levels of cytokines including IL6, IL1β, Interferon-γ and IL17. No significant gene expression changes in the mouse lungs were associated with LZD therapy.
LZD is beneficial in MRSA pneumonia, but it does not modulate host lung immune responses at the transcriptional level.
PLoS ONE 01/2013; 8(6):e67994. · 3.53 Impact Factor
ABSTRACT: Therapies for central nervous system (CNS) diseases remain an unmet medical need. This is largely due to multiple unknown disease-modifying genes and pathways. Systems biology through network modeling has shown promise in discovering novel therapeutic targets, deciphering disease mechanisms, and suggesting drug repurposing opportunities. In this article we cover current progress in systems biology and its role, applications, and challenges in the pharmaceutical industry. We also outline a practical strategy to infer drug repositioning candidates for rare CNS diseases by describing Multiple Level Network Modeling (MLNM) analysis.
Drug Discovery Today 06/2012; · 6.63 Impact Factor
ABSTRACT: Use of microarray data to generate expression profiles of genes associated with disease can aid in identification of markers of disease and potential therapeutic targets. Pathway analysis methods further extend expression profiling by creating inferred networks that provide an interpretable structure of the gene list and visualize gene interactions. This chapter describes GeneAnswers, a novel gene-concept network analysis tool available as an open source Bioconductor package. GeneAnswers creates a gene-concept network and also can be used to build protein-protein interaction networks. The package includes an example multiple myeloma cell line dataset and tutorial. Several network analysis methods are included in GeneAnswers, and the tutorial highlights the conditions under which each type of analysis is most beneficial and provides sample code.
ABSTRACT: The nucleoside analogues 8-amino-adenosine and 8-chloro-adenosine have been investigated in the context of B-lineage lymphoid malignancies by our laboratories due to the selective cytotoxicity they exhibit toward multiple myeloma (MM), chronic lymphocytic leukemia (CLL), and mantle cell lymphoma (MCL) cell lines and primary cells. Encouraging pharmacokinetic and pharmacodynamic properties of 8-chloro-adenosine being documented in an ongoing Phase I trial in CLL provide additional impetus for the study of these promising drugs. In order to foster a deeper understanding of the commonalities between their mechanisms of action and gain insight into specific patient cohorts positioned to achieve maximal benefit from treatment, we devised a novel two-tiered chemoinformatic screen to identify molecular determinants of responsiveness to these compounds. This screen entailed: 1) the elucidation of gene expression patterns highly associated with the anti-tumor activity of 8-chloro-adenosine in the NCI-60 cell line panel, 2) characterization of altered transcript abundances between paired MM and MCL cell lines exhibiting differential susceptibility to 8-amino-adenosine, and 3) integration of the resulting datasets. This approach generated a signature of seven unique genes including G6PD which encodes the rate-determining enzyme of the pentose phosphate pathway (PPP), glucose-6-phosphate dehydrogenase. Bioinformatic analysis of primary cell gene expression data demonstrated that G6PD is frequently overexpressed in MM and CLL, highlighting the potential clinical implications of this finding. Utilizing the paired sensitive and resistant MM and MCL cell lines as a model system, we go on to demonstrate through loss-of-function and gain-of-function studies that elevated G6PD expression is necessary to maintain resistance to 8-amino- and 8-chloro-adenosine but insufficient to induce de novo resistance in sensitive cells. Taken together, these results indicate that G6PD activity antagonizes the cytotoxicity of 8-substituted adenosine analogues and suggest that administration of these agents to patients with B-cell malignancies exhibiting normal levels of G6PD expression may be particularly efficacious.
PLoS ONE 01/2012; 7(7):e41455. · 3.53 Impact Factor
ABSTRACT: The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics, which studies genetic material recovered directly from an environment. Characterization of genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Multiple genomes contained in a metagenomic sample can be identified and quantitated through homology searches of sequence reads with known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, which impairs accurate estimation of the relative abundance of the multiple genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After obtaining the estimated proportion of reads generated by each genome, sequence reads are assigned to the candidate genomes and the taxonomy tree based on the estimated probability by taking into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach of taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
PLoS ONE 01/2012; 7(10):e46450. · 3.53 Impact Factor
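The idea described in the abstract above, estimating the proportion of reads generated by each candidate genome and then assigning reads using both alignment evidence and estimated abundance, can be illustrated with a generic mixture-model EM. This is a hedged Python sketch and not the TAMER implementation (which is in R): the function names, the weight matrix, and the use of a plain multinomial mixture are all assumptions made for illustration.

```python
def estimate_proportions(weights, n_iter=200, tol=1e-9):
    """Estimate genome proportions with a basic EM for a mixture model.

    `weights[i][j]` is an assumed non-negative score-derived likelihood
    that read i came from genome j. Every read is assumed to have at
    least one positive weight (i.e. at least one hit).
    """
    n_reads = len(weights)
    n_genomes = len(weights[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from uniform abundance
    for _ in range(n_iter):
        totals = [0.0] * n_genomes
        for row in weights:
            # E-step: posterior probability of each genome for this read.
            joint = [p * w for p, w in zip(pi, row)]
            norm = sum(joint)
            for j, value in enumerate(joint):
                totals[j] += value / norm
        # M-step: proportions are the average posterior memberships.
        new_pi = [t / n_reads for t in totals]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


def assign_reads(weights, pi):
    """Assign each read to the genome with the highest posterior weight."""
    return [max(range(len(pi)), key=lambda j: pi[j] * row[j]) for row in weights]


# Hypothetical toy input: three reads hit only genome 0, one hits only
# genome 1, and the last read hits both equally well.
toy_weights = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1]]
proportions = estimate_proportions(toy_weights)
assignments = assign_reads(toy_weights, proportions)
```

In the toy input the ambiguous last read is assigned to genome 0, because the uniquely mapped reads make that genome the more abundant one; this is the behaviour that distinguishes abundance-aware assignment from pushing multi-hit reads to a higher taxonomic rank.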
ABSTRACT: The Disease Ontology (DO) database (http://disease-ontology.org) represents a comprehensive knowledge base of 8043 inherited, developmental and acquired human diseases (DO version 3, revision 2510). The DO web browser has been designed for speed, efficiency and robustness through the use of a graph database. Full-text contextual searching functionality using Lucene allows the querying of name, synonym, definition, DOID and cross-reference (xrefs) with complex Boolean search strings. The DO semantically integrates disease and medical vocabularies through extensive cross mapping and integration of MeSH, ICD, NCI's thesaurus, SNOMED CT and OMIM disease-specific terms and identifiers. The DO is utilized for disease annotation by major biomedical databases (e.g. Array Express, NIF, IEDB), as a standard representation of human disease in biomedical ontologies (e.g. IDO, Cell line ontology, NIFSTD ontology, Experimental Factor Ontology, Influenza Ontology), and as an ontological cross mappings resource between DO, MeSH and OMIM (e.g. GeneWiki). The DO project (http://diseaseontology.sf.net) has been incorporated into open source tools (e.g. Gene Answers, FunDO) to connect gene and disease biomedical data through the lens of human disease. The next iteration of the DO web browser will integrate DO's extended relations and logical definition representation along with these biomedical resource cross-mappings.
Nucleic Acids Research 11/2011; 40(Database issue):D940-6. · 8.81 Impact Factor
ABSTRACT: Exposure to ambient particulate matter (PM) significantly increases cardiovascular morbidity and mortality in the general population. We hypothesized that some components of PM can affect the gene expression patterns in the hearts of rats exposed for 3 months to filtered air (FA), coarse (CP; 2.5 < dp < 10 μm), fine (FP; dp ≤ 2.5 μm) or ultrafine (UFP; dp ≤ 0.18 μm) components of PM. The median diameters of CP, FP, and UFP were 3 μm, 0.7 μm and 0.07 μm, respectively. Exposures (n = 8 per group) were performed using a particle concentrator system in Riverside, California, an area with high ambient levels of photochemically derived gaseous and particulate pollutants. At the end of the exposure, hearts were subjected to gene expression profiling using Illumina RatRef-12 bead chips, and levels of malonaldehyde (MDA), a biomarker of oxidative stress, were measured. Applying a fold ratio >1.5 (for both up- and down-regulated genes), we found three genes in the CP and nine genes in the UFP groups with significantly changed expression, compared with FA. No significant changes in gene expression patterns were observed in the FP group. In the UFP group, thioredoxin interacting protein (Txnip), a negative regulator of the antioxidant enzyme thioredoxin, and cytochrome P450 (Cyp2e1), an enzyme involved in the metabolism of foreign substances, demonstrated significant up-regulation (fold ratios 1.79 and 1.57, respectively, with false discovery rate, FDR < 0.05). In the CP group there was also a trend towards increased Txnip expression (fold ratio 1.43, FDR > 0.05) and a significant increase in Cyp2e1 expression (fold ratio 1.79, FDR < 0.05). Changes in Txnip and Cyp2e1 expression showed a statistically significant positive correlation with each other (p < 0.0009) and were confirmed by real-time PCR. In addition, Txnip and Cyp2e1 expression demonstrated a statistically significant moderate correlation with the levels of MDA in the heart. Up-regulation of both Cyp2e1 and Txnip is observed in hearts of patients with certain cardiac diseases. Therefore, chronic exposure to CP and UFP directly affects expression of disease-relevant genes in the myocardium.
Keywords: Particulate air pollution; Heart; Gene expression; Oxidative stress
Air Quality Atmosphere & Health 01/2011; 4(1):15-25. · 1.98 Impact Factor
ABSTRACT: Gene-list annotations are critical for researchers to explore the complex relationships between genes and functionalities. Currently, the annotations of a gene list are usually summarized by a table or a barplot. As such, potentially biologically important complexities such as one gene belonging to multiple annotation categories are difficult to extract. We have devised explicit and efficient visualization methods that provide intuitive methods for interrogating the intrinsic connections between biological categories and genes.
We have constructed a data model and now present two novel methods in a Bioconductor package, "GeneAnswers", to simultaneously visualize genes, concepts (a.k.a. annotation categories), and concept-gene connections (a.k.a. annotations): the "Concept-and-Gene Network" and the "Concept-and-Gene Cross Tabulation". These methods have been tested and validated with microarray-derived gene lists.
These new visualization methods can effectively present annotations using Gene Ontology, Disease Ontology, or any other user-defined gene annotations that have been pre-associated with an organism's genome by human curation, automated pipelines, or a combination of the two. The gene-annotation data model and associated methods are available in the Bioconductor package "GeneAnswers" described in this publication.
ABSTRACT: GATA-2 is an essential transcription factor that regulates multiple aspects of hematopoiesis. Dysregulation of GATA-2 is a hallmark of acute megakaryoblastic leukemia in children with Down syndrome, a malignancy that is defined by the combination of trisomy 21 and a GATA1 mutation. Here, we show that GATA-2 is required for normal megakaryocyte development as well as aberrant megakaryopoiesis in Gata1 mutant cells. Furthermore, we demonstrate that GATA-2 indirectly controls cell cycle progression in GATA-1-deficient megakaryocytes. Genome-wide microarray analysis and chromatin immunoprecipitation studies revealed that GATA-2 regulates a wide set of genes, including cell cycle regulators and megakaryocyte-specific genes. Surprisingly, GATA-2 also negatively regulates the expression of crucial myeloid transcription factors, such as Sfpi1 and Cebpa. In the absence of GATA-1, GATA-2 prevents induction of a latent myeloid gene expression program. Thus, GATA-2 contributes to cell cycle progression and the maintenance of megakaryocyte identity of GATA-1-deficient cells, including GATA-1s-expressing fetal megakaryocyte progenitors. Moreover, our data reveal that overexpression of GATA-2 facilitates aberrant megakaryopoiesis.
Molecular and Cellular Biology 08/2009; 29(18):5168-80. · 6.06 Impact Factor
ABSTRACT: Subjective methods have been reported to adapt a general-purpose ontology for a specific application. For example, Gene Ontology (GO) Slim was created from GO to generate a highly aggregated report of the human-genome annotation. We propose statistical methods to adapt the general-purpose OBO Foundry Disease Ontology (DO) for the identification of gene-disease associations. Thus, we need a simplified definition of disease categories derived from implicated genes. On the basis of the assumption that the DO terms having similar associated genes are closely related, we group the DO terms based on the similarity of gene-to-DO mapping profiles. Two types of binary distance metrics are defined to measure the overall and subset similarity between DO terms. A compactness-scalable fuzzy clustering method is then applied to group similar DO terms. To reduce false clustering, the semantic similarities between DO terms are also used to constrain clustering results. As such, the DO terms are aggregated and the redundant DO terms are largely removed. Using these methods, we constructed a simplified vocabulary list from the DO called Disease Ontology Lite (DOLite). We demonstrated that DOLite yields more interpretable results than DO for gene-disease association tests. The resultant DOLite has been used in the Functional Disease Ontology (FunDO) Web application at http://www.projects.bioinformatics.northwestern.edu/fundo.
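The abstract above does not give the definitions of its two binary distance metrics, so the following Python sketch uses two standard stand-ins, chosen as an assumption: the Jaccard distance for "overall" similarity and an overlap-coefficient distance for "subset" similarity between the gene sets mapped to two DO terms. The function names and gene labels are hypothetical.

```python
def jaccard_distance(genes_a, genes_b):
    """Overall dissimilarity: 1 - |A & B| / |A | B|."""
    union = genes_a | genes_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(genes_a & genes_b) / len(union)


def overlap_distance(genes_a, genes_b):
    """Subset dissimilarity: 1 - |A & B| / min(|A|, |B|).

    This is 0 whenever one gene set is fully contained in the other.
    """
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 1.0
    return 1.0 - len(genes_a & genes_b) / smaller


# Hypothetical gene-to-term mapping profiles; not real annotations.
term_broad = {"gene1", "gene2", "gene3", "gene4"}
term_narrow = {"gene2", "gene3"}

overall = jaccard_distance(term_broad, term_narrow)
subset = overlap_distance(term_broad, term_narrow)
```

Here the narrow term's genes are a subset of the broad term's genes, so the subset distance is zero while the overall distance is not, which is the kind of contrast that motivates using two metrics before clustering.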
ABSTRACT: The human genome has been extensively annotated with Gene Ontology for biological functions, but minimally computationally annotated for diseases.
We used the Unified Medical Language System (UMLS) MetaMap Transfer tool (MMTx) to discover gene-disease relationships from the GeneRIF database. We utilized a comprehensive subset of UMLS, which is disease-focused and structured as a directed acyclic graph (the Disease Ontology), to filter and interpret results from MMTx. The results were validated against the Homayouni gene collection using recall and precision measurements. We compared our results with the widely used Online Mendelian Inheritance in Man (OMIM) annotations.
The validation data set suggests a 91% recall rate and 97% precision rate of disease annotation using GeneRIF, in contrast with a 22% recall and 98% precision using OMIM. Our thesaurus-based approach allows for comparisons to be made between disease-containing databases and allows for increased accuracy in disease identification through synonym matching. The much higher recall rate of our approach demonstrates that annotating the human genome with Disease Ontology and GeneRIF dramatically increases the coverage of disease annotation of the human genome.
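The recall and precision figures quoted above follow the standard definitions, which can be sketched in Python as below. The function name and the toy gene-disease pairs are hypothetical and are not drawn from the Homayouni collection or from the reported results.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for sets of (gene, disease) pairs.

    precision = true positives / all predicted pairs
    recall    = true positives / all reference pairs
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy annotations for illustration only.
predicted_pairs = {("g1", "d1"), ("g2", "d2"), ("g3", "d3")}
reference_pairs = {("g1", "d1"), ("g2", "d2"), ("g4", "d4"), ("g5", "d5")}

precision, recall = precision_recall(predicted_pairs, reference_pairs)
```

A source can therefore be highly precise yet cover few of the reference associations, which is the pattern the abstract reports for OMIM (98% precision, 22% recall).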