Article

Perspectives on Data Integration in Human Complex Disease Analysis

Abstract

The identification of causal or predictive variants, genes, and mechanisms for disease-associated traits is characterized by "complex" networks of molecular phenotypes. Present technology and computer power allow building and processing large collections of these data types. However, the rapid pace of data generation is offset by the slow pace at which data integration methods are developed. Most currently available integrative analytic tools pertain to pairing omics data and focus on between-data source relationships, making strong assumptions about within-data source architectures. Only a limited number of initiatives aim to find the best ways to analyze multiple, possibly related, omics databases while fully acknowledging the specific characteristics of each data type. A thorough understanding of the underlying assumptions of integrative methods is needed to draw sound conclusions afterwards. In this chapter, the authors discuss how the field of "integromics" has evolved and give pointers towards essential research developments in this context.

... Examples of 2-omics analyses include expression quantitative trait locus (eQTL) (Franke & Jansen, 2009) and methylation quantitative trait locus (meQTL) (Smith, Kilaru, Kocak, Almli, & Mercer, 2014) that, respectively, assess the influence of genetic and epigenetic markers on gene expression. Combining more than two omics data types is much more complex, given the hierarchical structure and interdependencies such data entails (Hamid, Hu, Roslin, Ling, & Greenwood, 2009; Ritchie, Holzinger, Li, Pendergrass, & Kim, 2015; Van Steen & Malats, 2014). With a few exceptions, most methods integrate more than two different data sources by combining evidences obtained from pairwise analyses (Van Steen & Malats, 2014). These evidences are often based on the derivation of standard measures of association, linking (epi-)genetic markers to gene expression combined with gene expression analysis (Wagner, Busche, Ge, Kwan, & Pastinen, 2014). ...
Article
The vast amount of heterogeneous omics data, encompassing a broad range of biomolecular information, requires novel methods of analysis, including those that integrate the available levels of information. In this work, we describe Regression2Net, a computational approach that is able to integrate gene expression and genomic or methylation data in two steps. First, penalized regressions are used to build Expression-Expression (EEnet) and Expression-Genomic or Expression-Methylation (EMnet) networks. Second, network theory is used to highlight important communities of genes. When applying Regression2Net to gene expression and methylation profiles for individuals with glioblastoma multiforme, we identified, respectively, 284 and 447 potentially interesting genes in relation to glioblastoma pathology. These genes showed at least one connection in the integrated networks ANDnet and XORnet derived from the aforementioned EEnet and EMnet networks. Whereas the edges in ANDnet occur in both EEnet and EMnet, the edges in XORnet occur in EMnet but not in EEnet. In-depth biological analysis of connected genes in ANDnet and XORnet revealed genes that are related to energy metabolism, cell cycle control (AATF), immune system response, and several cancer types. Importantly, we observed significant overrepresentation of cancer-related pathways including glioma, especially in the XORnet network, suggesting a nonignorable role of methylation in glioblastoma multiforme. In the ANDnet, we furthermore identified potential glioma suppressor genes ACCN3 and ACCN4 linked to the NBPF1 neuroblastoma breakpoint family, as well as numerous ABC transporter genes (ABCA1, ABCB1) suggesting drug resistance of glioblastoma tumors.
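The ANDnet and XORnet constructions described in this abstract amount to set operations on edge lists. The following is a minimal sketch in Python, assuming each network is stored as a set of undirected gene pairs. The gene labels and the helper names `edge`, `and_net`, `xor_net`, and `connected_genes` are hypothetical and are not taken from the Regression2Net software. Following the abstract's wording, XORnet here keeps edges found in EMnet but not in EEnet, which is a set difference rather than a symmetric difference.

```python
def edge(a, b):
    """Undirected edge: the order of the two gene labels does not matter."""
    return frozenset((a, b))


def and_net(ee_net, em_net):
    """Edges present in both EEnet and EMnet."""
    return ee_net & em_net


def xor_net(ee_net, em_net):
    """Edges present in EMnet but absent from EEnet (as worded in the abstract)."""
    return em_net - ee_net


def connected_genes(net):
    """Genes with at least one connection in the given network."""
    return set().union(*net) if net else set()


# Toy networks with hypothetical gene labels (not real results).
ee_net = {edge("G1", "G2"), edge("G2", "G3")}
em_net = {edge("G2", "G1"), edge("G3", "G4")}

andnet = and_net(ee_net, em_net)
xornet = xor_net(ee_net, em_net)
print(sorted(connected_genes(andnet)))
print(sorted(connected_genes(xornet)))
```

Storing edges as `frozenset` pairs makes `("G1", "G2")` and `("G2", "G1")` compare equal, so the intersection is not sensitive to the direction in which a regression happened to report an edge.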
... It includes data fusion as well as more fancy and more elaborate forms of combining evidence from different data sets or sources [258]. Further, it agrees with the definition of Thorsen and Oxley [259] as the process of connecting systems (which may have fusion in them) into a larger system. Apart from data integrative analysis, integrative analysis sometimes also refers to the integration of analytic tools or methods, to combine different analytic viewpoints to the same data. ...
... This definition does not explicitly refer to statistical, bioinformatics or computational tools but to any approach that fits within a transdisciplinary viewpoint. It includes data fusion as well as more fancy and more elaborate forms of combining evidence from different data sets or sources [258]. Furthermore, it agrees with the definition of Oxley and Thorsen [259] as the process of connecting systems (which may have fusion in them) into a larger system. ...
Article
Full-text available
Background: Systems Medicine is a novel approach to medicine, i.e. an interdisciplinary field that considers the human body as a system, composed of multiple parts and of complex relationships at multiple levels, and further integrated into an environment. Exploring Systems Medicine implies understanding and combining concepts coming from diametrically different fields, including medicine, biology, statistics, modelling and simulation, and data science. Such heterogeneity leads to semantic issues, which may slow down implementation and fruitful interaction between these highly diverse fields. Methods: In this review we collect and explain over one hundred terms related to Systems Medicine. These include both modelling and data science terms and basic systems medicine terms, along with some synthetic definitions, examples of applications, and lists of relevant references. Results: This glossary aims at being a first aid kit for the Systems Medicine researcher facing an unfamiliar term, where he/she can get a first understanding of it, and, more importantly, examples and references to keep digging into the topic.
Article
Full-text available
The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in stark contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.
Article
Full-text available
Motivation: The considerable amount of different types of omics data requires novel methods of analysis and data integration. In this work we describe Regression2Net, a computational approach to analyse gene expression and methylation profiles via regression analysis and network-based techniques. Results: We identified 284 and 447 unique candidate genes potentially associated with glioblastoma pathology from two networks inferred from mixed genetic datasets. In-depth biological analysis of these networks reveals genes that are related to energy metabolism, cell cycle control (AATF), immune system response and several types of cancer. Importantly, we observed significant overrepresentation of cancer-related pathways, including glioma, especially in the methylation network. This confirms the strong link between methylation and glioblastomas. Potential glioma suppressor genes ACCN3 and ACCN4 linked to the NBPF1 neuroblastoma breakpoint family have been identified in our expression network. Numerous ABC transporter genes (ABCA1, ABCB1) present in the expression network suggest drug resistance of glioblastoma tumors.
Article
Full-text available
... genomic data combining strengths of different methods and statistical tools. The different steps of this protocol are illustrated on a real-life data application for Alzheimer's disease (AD) (2259 patients and 6017 controls from France). Particularly, in the exhaustive genome-wide epistasis screening we identified an AD-associated interacting SNP pair from chromosomes 6q11.1 (rs6455128, the KHDRBS2 gene) and 13q12.11 (rs7989332, the CRYL1 gene) (p = 0.006, corrected for multiple testing). A replication analysis in the independent AD cohort from Germany (555 patients and 824 controls) confirmed the discovered epistasis signal (p = 0.036). This signal was also supported by a meta-analysis approach in 5 independent AD cohorts that was applied in the context of epistasis for the first time. Transcriptome analysis revealed negative correlation between expression levels of KHDRBS2 and CRYL1 in both the temporal cortex (β = -0.19, p = 0.0006) and cerebellum (β = -0.23, p < 0.0001) brain regions. This is the first time a replicable epistasis associated with AD was identified using a hypothesis-free screening approach.
Article
Full-text available
DNA methylation is strongly associated with smoking status at multiple sites across the genome. Studies have largely been restricted to European origin individuals yet the greatest increase in smoking is occurring in low income countries, such as the Indian subcontinent. We determined whether there are differences between South Asians and Europeans in smoking related loci, and if a smoking score, combining all smoking related DNA methylation scores, could differentiate smokers from non-smokers. Illumina HM450k BeadChip arrays were performed on 192 samples from the Southall And Brent REvisited (SABRE) cohort. Differential methylation in smokers was identified in 29 individual CpG sites at 18 unique loci. Interaction between smoking status and ethnic group was identified at the AHRR locus. Ethnic differences in DNA methylation were identified in non-smokers at two further loci, 6p21.33 and GNG12. With the exception of GFI1 and MYO1G these differences were largely unaffected by adjustment for cell composition. A smoking score based on methylation profile was constructed. Current smokers were identified with 100% sensitivity and 97% specificity in Europeans and with 80% sensitivity and 95% specificity in South Asians. Differences in ethnic groups were identified in both single CpG sites and combined smoking score. The smoking score is a valuable tool for identification of true current smoking behaviour. Explanations for ethnic differences in DNA methylation in association with smoking may provide valuable clues to disease pathways.
Article
Full-text available
To better understand dynamic disease processes, integrated multi-omic methods are needed, yet comparing different types of omic data remains difficult. Integrative solutions benefit experimenters by eliminating potential biases that come with single omic analysis. We have developed the methods needed to explore whether a relationship exists between co-expression network models built from transcriptomic and proteomic data types, and whether this relationship can be used to improve the disease signature discovery process. A naïve, correlation based method is utilized for comparison. Using publicly available infectious disease time series data, we analyzed the related co-expression structure of the transcriptome and proteome in response to SARS-CoV infection in mice. Transcript and peptide expression data was filtered using quality scores and subset by taking the intersection on mapped Entrez IDs. Using this data set, independent co-expression networks were built. The networks were integrated by constructing a bipartite module graph based on module member overlap, module summary correlation, and correlation to phenotypes of interest. Compared to the module level results, the naïve approach is hindered by a lack of correlation across data types, less significant enrichment results, and little functional overlap across data types. Our module graph approach avoids these problems, resulting in an integrated omic signature of disease progression, which allows prioritization across data types for down-stream experiment planning. Integrated modules exhibited related functional enrichments and could suggest novel interactions in response to infection. These disease and platform-independent methods can be used to realize the full potential of multi-omic network signatures.
Article
Full-text available
Gene expression profiles have been broadly used in cancer research as a diagnostic or prognostic signature for the clinical outcome prediction such as stage, grade, metastatic status, recurrence, and patient survival, as well as to potentially improve patient management. However, emerging evidence shows that gene expression-based prediction varies between independent data sets. One possible explanation of this effect is that previous studies were focused on identifying genes with large main effects associated with clinical outcomes. Thus, non-linear interactions without large individual main effects would be missed. The other possible explanation is that gene expression as a single level of genomic data is insufficient to explain the clinical outcomes of interest since cancer can be dysregulated by multiple alterations through genome, epigenome, transcriptome, and proteome levels. In order to overcome the variability of diagnostic or prognostic predictors from gene expression alone and to increase its predictive power, we need to integrate multi-levels of genomic data and identify interactions between them associated with clinical outcomes. Here, we proposed an integrative framework for identifying interactions within/between multi-levels of genomic data associated with cancer clinical outcomes using the Grammatical Evolution Neural Networks (GENN). In order to demonstrate the validity of the proposed framework, ovarian cancer data from TCGA was used as a pilot task. We found not only interactions within a single genomic level but also interactions between multi-levels of genomic data associated with survival in ovarian cancer. Notably, the integration model from different levels of genomic data achieved 72.89% balanced accuracy and outperformed the top models with any single level of genomic data. Understanding the underlying tumorigenesis and progression in ovarian cancer through the global view of interactions within/between different levels of genomic data is expected to provide guidance for improved prognostic biomarkers and individualized therapies.
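For readers unfamiliar with the metric quoted in this abstract, balanced accuracy is conventionally the mean of sensitivity and specificity, which keeps a classifier from scoring well simply by favouring the larger class. A minimal sketch in Python follows; the function name `balanced_accuracy` and the confusion-matrix counts are hypothetical and unrelated to the TCGA analysis.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (true positive rate) and specificity (true negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts: 80 of 100 cases and 95 of 100 controls classified correctly.
print(balanced_accuracy(tp=80, fn=20, tn=95, fp=5))
```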
Article
Full-text available
The identification of genes involved in human complex diseases remains a great challenge in computational systems biology. Although methods have been developed to use disease phenotypic similarities with a protein-protein interaction network for the prioritization of candidate genes, other valuable omics data sources have been largely overlooked in these methods. With this understanding, we proposed a method called BRIDGE to prioritize candidate genes by integrating disease phenotypic similarities with such omics data as protein-protein interactions, gene sequence similarities, gene expression patterns, gene ontology annotations, and gene pathway memberships. BRIDGE utilizes a multiple regression model with lasso penalty to automatically weight different data sources and is capable of discovering genes associated with diseases whose genetic bases are completely unknown. We conducted large-scale cross-validation experiments and demonstrated that more than 60% of known disease genes can be ranked first by BRIDGE in simulated linkage intervals, suggesting the superior performance of this method. We further performed two comprehensive case studies by applying BRIDGE to predict novel genes and transcriptional networks involved in obesity and type II diabetes. The proposed method provides an effective and scalable way for integrating multi-omics data to infer disease genes. Further applications of BRIDGE will help to identify novel disease genes and the underlying mechanisms of human diseases.
Article
Full-text available
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵. The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
Article
Full-text available
We have developed Lynx (http://lynx.ci.uchicago.edu)—a web-based database and a knowledge extraction engine, supporting annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Its underlying knowledge base (LynxKB) integrates various classes of information from >35 public databases and private collections, as well as manually curated data from our group and collaborators. Lynx provides advanced search capabilities and a variety of algorithms for enrichment analysis and network-based gene prioritization to assist the user in extracting meaningful knowledge from LynxKB and experimental data, whereas its service-oriented architecture provides public access to LynxKB and its analytical tools via user-friendly web services and interfaces.
Article
Full-text available
Over the past decade, the Database of Genomic Variants (DGV; http://dgv.tcag.ca/) has provided a publicly accessible, comprehensive curated catalogue of structural variation (SV) found in the genomes of control individuals from worldwide populations. Here, we describe updates and new features, which have expanded the utility of DGV for both the basic research and clinical diagnostic communities. The current version of DGV consists of 55 published studies, comprising >2.5 million entries identified in >22 300 genomes. Studies included in DGV are selected from the accessioned data sets in the archival SV databases dbVar (NCBI) and DGVa (EBI), and then further curated for accuracy and validity. The core visualization tool (gbrowse) has been upgraded with additional functions to facilitate data analysis and comparison, and a new query tool has been developed to provide flexible and interactive access to the data. The content from DGV is regularly incorporated into other large-scale genome reference databases and represents a standard data resource for new product and database development, in particular for copy number variation testing in clinical labs. The accurate cataloguing of variants in DGV will continue to enable medical genetics and genome sequencing research.
Article
Full-text available
Advancements in high-throughput technology have allowed researchers to examine the genetic etiology of complex human traits in a robust fashion. While genome-wide association studies (GWAS) have identified many novel variants associated with hundreds of traits, a large proportion of the estimated trait heritability remains unexplained. One hypothesis is that the commonly used statistical techniques and study designs are not robust to the complex etiology that may underlie these human traits. This etiology could include non-linear gene x gene or gene x environment interactions. Additionally, other levels of biological regulation may play a large role in trait variability. In order to address the need for computational tools that can explore enormous data sets to detect complex susceptibility models, we have developed a software package called the Analysis Tool for Heritable and Environmental Network Associations (ATHENA). ATHENA combines various variable filtering methods with machine learning techniques in order to analyze high-throughput categorical (e.g. SNPs) and quantitative (e.g. gene expression levels) predictor variables to generate multi-variable models that predict either categorical (e.g. disease status) or quantitative (e.g. cholesterol levels) outcomes. The goal of this paper is to demonstrate the utility of ATHENA using simulated and biological data sets that consist of both SNPs and gene expression variables to identify complex prediction models. Importantly, this method is flexible and can be expanded to include other types of high-throughput data (e.g. RNA-seq data and biomarker measurements). ATHENA is freely available for download. The software, user manual, and tutorial can be downloaded from http://ritchielab.psu.edu/ritchielab/software.
Article
Full-text available
Massively parallel sequencing greatly facilitates the discovery of novel disease genes causing Mendelian and oligogenic disorders. However, many mutations are present in any individual genome, and identifying which ones are disease causing remains a largely open problem. We introduce eXtasy, an approach to prioritize nonsynonymous single-nucleotide variants (nSNVs) that substantially improves prediction of disease-causing variants in exome sequencing data by integrating variant impact prediction, haploinsufficiency prediction and phenotype-specific gene prioritization.
Article
Full-text available
Pancreatic cancer is a highly lethal cancer with limited diagnostic and therapeutic modalities. To begin to explore the genomic landscape of pancreatic cancer, we used massively parallel sequencing to catalog and compare transcribed regions and potential regulatory elements in two human cell lines derived from normal and cancerous pancreas. By RNA-sequencing, we identified 2,146 differentially expressed genes in these cell lines that were enriched in cancer related pathways and biological processes that include cell adhesion, growth factor and receptor activity, signaling, transcription and differentiation. Our high throughput Chromatin immunoprecipitation (ChIP) sequence analysis furthermore identified over 100,000 regions enriched in epigenetic marks, showing either positive (H3K4me1, H3K4me3, RNA Pol II) or negative (H3K27me3) correlation with gene expression. Notably, an overall enrichment of RNA Pol II binding and depletion of H3K27me3 binding were seen in the cancer derived cell line as compared to the normal derived cell line. By selecting genes for further assessment based on this difference, we confirmed enhanced expression of aldehyde dehydrogenase 1A3 (ALDH1A3) in two larger sets of pancreatic cancer cell lines and in tumor tissues as compared to normal derived tissues. As aldehyde dehydrogenase (ALDH) activity is a key feature of cancer stem cells, our results indicate that a member of the ALDH superfamily, ALDH1A3, may be upregulated in pancreatic cancer, where it could mark pancreatic cancer stem cells.
Article
Full-text available
Previously, we reported strong influences of genetic variants on metabolic phenotypes, some of them with clinical relevance. Here we hypothesize that DNA methylation may have an important and potentially independent effect on human metabolism. To test this hypothesis we conducted what is to the best of our knowledge the first epigenome-wide association study (EWAS) between DNA methylation and metabolic traits (metabotypes) in human blood. We assess 649 blood metabolic traits from 1,814 participants of the KORA population study for association with methylation of 457,004 CpG sites, determined on the Infinium HumanMethylation450 BeadChip platform. Using the EWAS approach, we identified two types of methylome-metabotype associations. One type is driven by an underlying genetic effect; the other type is independent of genetic variation and potentially driven by common environmental and life-style dependent factors. We report eight CpG loci at genome-wide significance that have a genetic variant as confounder (p = 3.9 × 10⁻²⁰ to 2.0 × 10⁻¹⁰⁸, r² = 0.036 to 0.221). Seven loci display CpG-site-specific associations to metabotypes, but do not exhibit any underlying genetic signals (p = 9.2 × 10⁻¹⁴ to 2.7 × 10⁻²⁷, r² = 0.008 to 0.107). We further identify several groups of CpG loci that associate with the same metabotype, such as 4-vinylphenol sulfate and 4-androsten-3beta,17beta-diol disulfate. In these cases the association between CpG methylation and metabotype is likely the result of a common external environmental factor, including smoking. Our study shows that analysis of EWAS with large numbers of metabolic traits in large population cohorts is, in principle, feasible. Taken together, our data suggest that DNA methylation plays an important role in regulating human metabolism.
Article
Full-text available
Modern high-throughput methods allow the investigation of biological functions across multiple 'omics' levels. Levels include mRNA and protein expression profiling as well as additional knowledge on, for example, DNA methylation and microRNA regulation. The reason for this interest in multi-omics is that actual cellular responses to different conditions are best explained mechanistically when taking all omics levels into account. To map gene products to their biological functions, public ontologies like Gene Ontology are commonly used. Many methods have been developed to identify terms in an ontology, overrepresented within a set of genes. However, these methods are not able to appropriately deal with any combination of several data types. Here, we propose a new method to analyse integrated data across multiple omics-levels to simultaneously assess their biological meaning. We developed a model-based Bayesian method for inferring interpretable term probabilities in a modular framework. Our Multi-level ONtology Analysis (MONA) algorithm performed significantly better than conventional analyses of individual levels and yields best results even for sophisticated models including mRNA fine-tuning by microRNAs. The MONA framework is flexible enough to allow for different underlying regulatory motifs or ontologies. It is ready-to-use for applied researchers and is available as a standalone application from http://icb.helmholtz-muenchen.de/mona.
Article
Full-text available
DNA methylation patterns are important for establishing cell, tissue, and organism phenotypes, but little is known about their contribution to natural human variation. To determine their contribution to variability, we have generated genome-scale DNA methylation profiles of three human populations (Caucasian-American, African-American, and Han Chinese-American) and examined the differentially methylated CpG sites. The distinctly methylated genes identified suggest an influence of DNA methylation on phenotype differences, such as susceptibility to certain diseases and pathogens, and response to drugs and environmental agents. DNA methylation differences can be partially traced back to genetic variation, suggesting that differentially methylated CpG sites serve as evolutionarily established mediators between the genetic code and phenotypic variability. Notably, one-third of the DNA methylation differences were not associated with any genetic variation, suggesting that variation in population-specific sites takes place at the genetic and epigenetic levels, highlighting the contribution of epigenetic modification to natural human variation.
Article
Full-text available
Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci.
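The core idea of stratification can be illustrated by running a false discovery rate procedure separately within each annotation category instead of once over the pooled SNPs. The sketch below is a simplified illustration in Python, not the authors' pipeline: it assumes hard stratum labels and the Benjamini-Hochberg step-up procedure, and it omits the linkage disequilibrium-weighted annotations described in the abstract. The function names `bh_reject` and `stratified_bh`, the stratum labels, and the p-values are all hypothetical.

```python
from collections import defaultdict


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= k * alpha / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def stratified_bh(pvals, strata, alpha=0.05):
    """Apply Benjamini-Hochberg separately within each stratum."""
    groups = defaultdict(list)
    for i, label in enumerate(strata):
        groups[label].append(i)
    rejected = [False] * len(pvals)
    for indices in groups.values():
        flags = bh_reject([pvals[i] for i in indices], alpha)
        for i, flag in zip(indices, flags):
            rejected[i] = flag
    return rejected


# Toy data: small p-values concentrated in one (hypothetical) category.
pvals = [0.001, 0.012, 0.04, 0.03, 0.6, 0.9]
strata = ["coding"] * 3 + ["intergenic"] * 3

pooled = bh_reject(pvals)
stratified = stratified_bh(pvals, strata)
print(sum(pooled), sum(stratified))
```

In this toy example the enriched stratum gains one rejection relative to the pooled analysis, while the depleted stratum yields none, which mirrors the qualitative pattern the abstract reports.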
Article
Full-text available
Integrative and comparative analyses of multiple transcriptomics, proteomics and metabolomics datasets require an intensive knowledge of tools and background concepts. Thus, it is challenging for users to perform such analyses, highlighting the need for a single tool for such purposes. The 3Omics one-click web tool was developed to visualize and rapidly integrate multiple human inter- or intra-transcriptomic, proteomic, and metabolomic data by combining five commonly used analyses: correlation networking, coexpression, phenotyping, pathway enrichment, and GO (Gene Ontology) enrichment. 3Omics generates inter-omic correlation networks to visualize relationships in data with respect to time or experimental conditions for all transcripts, proteins and metabolites. If only two of three omics datasets are input, then 3Omics supplements the missing transcript, protein or metabolite information related to the input data by text-mining the PubMed database. 3Omics' coexpression analysis assists in revealing functions shared among different omics datasets. 3Omics' phenotype analysis integrates Online Mendelian Inheritance in Man with available transcript or protein data. Pathway enrichment analysis on metabolomics data by 3Omics reveals enriched pathways in the KEGG/HumanCyc database. 3Omics performs statistical Gene Ontology-based functional enrichment analyses to display significantly overrepresented GO terms in transcriptomic experiments. Although the principal application of 3Omics is the integration of multiple omics datasets, it is also capable of analyzing individual omics datasets. The information obtained from the analyses of 3Omics in Case Studies 1 and 2 is also in accordance with comprehensive findings in the literature. 3Omics incorporates the advantages and functionality of existing software into a single platform, thereby simplifying data analysis and enabling the user to perform a one-click integrated analysis. Visualization and analysis results are downloadable for further user customization and analysis. The 3Omics software can be freely accessed at http://3omics.cmdm.tw.
Article
Full-text available
Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are more likely to be expression quantitative trait loci (eQTLs). Incorporating eQTL information hence has potential to increase power of genome-wide association studies (GWAS). In this paper, we propose using eQTL weights as prior information in SNP based association tests to improve test power while maintaining control of the family-wise error rate (FWER) or the false discovery rate (FDR). We apply the proposed methods to the analysis of a GWAS for childhood asthma consisting of 1296 unrelated individuals with German ancestry. The results confirm that eQTLs are enriched for previously reported asthma SNPs. We also find that some SNPs are insignificant using procedures without eQTL weighting, but become significant using eQTL-weighted Bonferroni or Benjamini-Hochberg procedures, while controlling the same FWER or FDR level. Some of these SNPs have been reported by independent studies in recent literature. The results suggest that the eQTL-weighted procedures provide a promising approach for improving power of GWAS. We also report the results of our methods applied to the large-scale European GABRIEL consortium data.
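A minimal sketch of weighted multiple testing, assuming the standard formulation in which weights average to one, the weighted Bonferroni rule rejects when p_i <= w_i * alpha / m, and the weighted Benjamini-Hochberg rule applies BH to p_i / w_i. The raw weight of 3 for eQTL SNPs and the p-values are hypothetical; the abstract does not state how the authors derive their weights.

```python
def normalize_weights(raw_weights):
    """Rescale positive weights so that they average to 1."""
    mean = sum(raw_weights) / len(raw_weights)
    return [w / mean for w in raw_weights]


def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Reject hypothesis i when p_i <= w_i * alpha / m."""
    m = len(pvalues)
    return [p <= w * alpha / m for p, w in zip(pvalues, weights)]


def weighted_bh(pvalues, weights, alpha=0.05):
    """Apply the BH step-up rule to the weighted p-values p_i / w_i."""
    m = len(pvalues)
    scaled = [p / w for p, w in zip(pvalues, weights)]
    order = sorted(range(m), key=lambda i: scaled[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if scaled[idx] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


# Hypothetical toy data: only the first SNP is a known eQTL.
pvalues = [0.015, 0.04, 0.3, 0.6, 0.9]
is_eqtl = [True, False, False, False, False]
weights = normalize_weights([3.0 if flag else 1.0 for flag in is_eqtl])
uniform = [1.0] * len(pvalues)

print("unweighted Bonferroni:", weighted_bonferroni(pvalues, uniform))
print("weighted Bonferroni:  ", weighted_bonferroni(pvalues, weights))
print("unweighted BH:        ", weighted_bh(pvalues, uniform))
print("weighted BH:          ", weighted_bh(pvalues, weights))
```

With uniform weights both functions reduce to the ordinary procedures, which makes the effect of up-weighting the eQTL SNP easy to compare.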
Article
Full-text available
Given recent advances in the generation of high-throughput data such as whole-genome genetic variation and transcriptome expression, it is critical to come up with novel methods to integrate these heterogeneous datasets and to assess the significance of identified phenotype-genotype relationships. Recent studies show that genome-wide association findings are likely to fall in loci with gene regulatory effects such as expression quantitative trait loci (eQTLs), demonstrating the utility of such integrative approaches. When genotype and gene expression data are available on the same individuals, we and others developed methods wherein top phenotype-associated genetic variants are prioritized if they are associated, as eQTLs, with gene expression traits that are themselves associated with the phenotype. Yet there has been no method to determine an overall p-value for the findings that arise specifically from the integrative nature of the approach. We propose a computationally feasible permutation method that accounts for the assimilative nature of the method and the correlation structure among gene expression traits and among genotypes. We apply the method to data from a study of cellular sensitivity to etoposide, one of the most widely used chemotherapeutic drugs. To our knowledge, this study is the first statistically sound quantification of the overall significance of the genotype-phenotype relationships resulting from applying an integrative approach. This method can be easily extended to cases in which gene expression data are replaced by other molecular phenotypes of interest, e.g., microRNA or proteomic data. This study has important implications for studies seeking to expand on genetic association studies by the use of omics data. Finally, we provide an R code to compute the empirical false discovery rate when p-values for the observed and simulated phenotypes are available.
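The empirical false discovery rate mentioned at the end of this abstract can be sketched as follows. This assumes a common permutation-based estimator (average number of permuted p-values at or below a threshold, divided by the number of observed p-values at or below it); it is not the authors' R code, and the function name and the numbers are hypothetical.

```python
def empirical_fdr(observed, permuted, threshold):
    """Estimate the FDR at a p-value threshold from permutation results.

    observed: p-values from the real phenotype.
    permuted: list of p-value lists, one per permuted (simulated) phenotype.
    """
    n_observed = sum(p <= threshold for p in observed)
    if n_observed == 0:
        return 0.0  # no discoveries, so no false discoveries
    # Average count of null "discoveries" per permutation.
    false_counts = [sum(p <= threshold for p in perm) for perm in permuted]
    expected_false = sum(false_counts) / len(permuted)
    return min(1.0, expected_false / n_observed)


# Hypothetical toy data: six tests, four permutations.
observed = [0.0005, 0.002, 0.01, 0.2, 0.6, 0.9]
permuted = [
    [0.004, 0.15, 0.4, 0.55, 0.7, 0.95],
    [0.03, 0.2, 0.35, 0.6, 0.8, 0.9],
    [0.02, 0.1, 0.5, 0.65, 0.75, 0.85],
    [0.009, 0.25, 0.45, 0.5, 0.7, 0.99],
]

fdr = empirical_fdr(observed, permuted, threshold=0.01)
print(f"empirical FDR at p <= 0.01: {fdr:.3f}")
```

Permuting the phenotype while keeping genotypes and expression traits intact preserves their correlation structure, which is the property the abstract emphasizes.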
Article
Full-text available
Identification and functional interpretation of gene regulatory variants is a major focus of modern genomics. The application of genetic mapping to molecular and cellular traits has enabled the detection of regulatory variation on genome-wide scales and revealed an enormous diversity of regulatory architecture in humans and other species. In this review I summarise the insights gained and questions raised by a decade of genetic mapping of gene expression variation. I discuss recent extensions of this approach using alternative molecular phenotypes that have revealed some of the biological mechanisms that drive gene expression variation between individuals. Finally, I highlight outstanding problems and future directions for development.
Article
Full-text available
Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues, this framework increases power to detect eQTLs that are present in more than one tissue compared with "tissue-by-tissue" analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells, and fibroblasts, we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR = 0.05). Further, the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.
Article
Full-text available
Regulated gene expression is a major requirement for all living organisms. The requirement for complex spatio-temporal regulation is most obvious during development and differentiation, when precise gene switching choreographs the generation of many different cell types, at the right time and the right place, from a single fertilized cell. When this process goes awry, deciphering the genetic cause can provide detailed insight into mechanisms. While chromatin structure and the recruitment of the transcriptional machinery to proximal promoters are well understood, how far-distant enhancers direct the correct spatial and temporal control of transcription is less clear. This concept prompted us to organize a Royal Society Discussion Meeting on this topic in October 2012. The timeliness of the debate was highlighted by the publication of results from the prominently heralded ENCODE project published just a month before the meeting (http://www.nature.com/encode/#/threads). This highlighted the unexpectedly large expanse of the human genome that appears to harbour regulatory elements [1,2]. Here, we present papers from some of the speakers at this lively and exciting meeting.
Article
Full-text available
The last few years have seen the development of large efforts for the analysis of genome function, especially in the context of genome variation. One of the most prominent directions has been the extensive set of studies on expression quantitative trait loci (eQTLs), namely, the discovery of genetic variants that explain variation in gene expression levels. Such studies have offered promise not just for the characterization of functional sequence variation but also for the understanding of basic processes of gene regulation and interpretation of genome-wide association studies. In this review, we discuss some of the key directions of eQTL research and its implications.
Article
Full-text available
Background Genome-wide association studies can provide novel insights into diseases of interest, as well as to the responsiveness of an individual to specific treatments. In such studies, it is very important to correct for population stratification, which refers to allele frequency differences between cases and controls due to systematic ancestry differences. Population stratification can cause spurious associations if not adjusted properly. The principal component analysis (PCA) method has been relied upon as a highly useful methodology to adjust for population stratification in these types of large-scale studies. Recently, the linear mixed model (LMM) has also been proposed to account for family structure or cryptic relatedness. However, neither of these approaches may be optimal in properly correcting for sample structures in the presence of subject outliers. Results We propose to use robust PCA combined with k-medoids clustering to deal with population stratification. This approach can adjust for population stratification for both continuous and discrete populations with subject outliers, and it can be considered as an extension of the PCA method and the multidimensional scaling (MDS) method. Through simulation studies, we compare the performance of our proposed methods with several widely used stratification methods, including PCA and MDS. We show that subject outliers can greatly influence the analysis results from several existing methods, while our proposed robust population stratification methods perform very well for both discrete and admixed populations with subject outliers. We illustrate the new method using data from a rheumatoid arthritis study. Conclusions We demonstrate that subject outliers can greatly influence the analysis result in GWA studies, and propose robust methods for dealing with population stratification that outperform existing population stratification methods in the presence of subject outliers.
Article
Full-text available
We evaluated the presence/absence of proteins encoded by 14 077 genes in adipocytes obtained from different tissue samples using immunohistochemistry. By combining this with previously published adipocyte-specific proteome data, we identified proteins associated with 7340 genes in human adipocytes. This information was used to reconstruct a comprehensive and functional genome-scale metabolic model of adipocyte metabolism. The resulting metabolic model, iAdipocytes1809, enables mechanistic insights into adipocyte metabolism on a genome-wide level, and can serve as a scaffold for integration of omics data to understand the genotype–phenotype relationship in obese subjects. By integrating human transcriptome and fluxome data, we found an increase in the metabolic activity around androsterone, ganglioside GM2 and degradation products of heparan sulfate and keratan sulfate, and a decrease in mitochondrial metabolic activities in obese subjects compared with lean subjects. Our study hereby shows a path to identify new therapeutic targets for treating obesity through combination of high throughput patient data and metabolic modeling.
Article
Full-text available
Asthma is a common chronic respiratory disease characterized by airway hyperresponsiveness (AHR). The genetics of asthma have been widely studied in mouse and human, and homologous genomic regions have been associated with mouse AHR and human asthma-related phenotypes. Our goal was to identify asthma-related genes by integrating AHR associations in mouse with human genome-wide association study (GWAS) data. We used Efficient Mixed Model Association (EMMA) analysis to conduct a GWAS of baseline AHR measures from males and females of 31 mouse strains. Genes near or containing SNPs with EMMA p-values <0.001 were selected for further study in human GWAS. The results of the previously reported EVE consortium asthma GWAS meta-analysis consisting of 12,958 diverse North American subjects from 9 study centers were used to select a subset of homologous genes with evidence of association with asthma in humans. Following validation attempts in three human asthma GWAS (i.e., Sepracor/LOCCS/LODO/Illumina, GABRIEL, DAG) and two human AHR GWAS (i.e., SHARP, DAG), the Kv channel interacting protein 4 (KCNIP4) gene was identified as nominally associated with both asthma and AHR at a gene- and SNP-level. In EVE, the smallest association P-value was at rs6833065 (P-value 2.9e-04), while the strongest associations for Sepracor/LOCCS/LODO/Illumina, GABRIEL, DAG were 1.5e-03, 1.0e-03, 3.1e-03 at rs7664617, rs4697177, rs4696975, respectively. At a SNP level, the strongest association across all asthma GWAS was at rs4697177 (P-value 1.1e-04). The smallest P-values for association with AHR were 2.3e-03 at rs11947661 in SHARP and 2.1e-03 at rs402802 in DAG. Functional studies are required to validate the potential involvement of KCNIP4 in modulating asthma susceptibility and/or AHR. Our results suggest that a useful approach to identify genes associated with human asthma is to leverage mouse AHR association data.
Article
Full-text available
Background: Increased risk of pancreatic cancer has been reported in breast cancer families carrying BRCA1 and BRCA2 mutations; however, pancreatic cancer risk in mutation-negative (BRCAX) families has not been explored to date. The aim of this study was to estimate pancreatic cancer risk in high-risk breast cancer families according to the BRCA mutation status. Methods: A retrospective cohort analysis was applied to estimate standardized incidence ratios (SIR) for pancreatic cancer. A total of 5,799 families with ≥1 breast cancer case tested for mutations in BRCA1 and/or BRCA2 were eligible. Families were divided into four classes: BRCA1, BRCA2, BRCAX with ≥2 breast cancer diagnosed before age 50 (class 3), and the remaining BRCAX families (class 4). Results: BRCA1 mutation carriers were at increased risk of pancreatic cancer [SIR = 4.11; 95% confidence interval (CI), 2.94–5.76] as were BRCA2 mutation carriers (SIR = 5.79; 95% CI, 4.28–7.84). BRCAX family members were also at increased pancreatic cancer risk, which did not appear to vary by number of members with early-onset breast cancer (SIR = 1.31; 95% CI, 1.06–1.63 for class 3 and SIR = 1.30; 95% CI, 1.13–1.49 for class 4). Conclusions: Germline mutations in BRCA1 and BRCA2 are associated with an increased risk of pancreatic cancer. Members of BRCAX families are also at increased risk of pancreatic cancer, pointing to the existence of other genetic factors that increase the risk of both pancreatic cancer and breast cancer. Impact: This study clarifies the relationship between familial breast cancer and pancreatic cancer. Given its high mortality, pancreatic cancer should be included in risk assessment in familial breast cancer counseling.
Article
Full-text available
Technology is driving the field of human genetics research with advances in techniques to generate high-throughput data that interrogate various levels of biological regulation. With this massive amount of data comes the important task of using powerful bioinformatics techniques to sift through the noise to find true signals that predict various human traits. A popular analytical method thus far has been the genome-wide association study (GWAS), which assesses the association of single nucleotide polymorphisms (SNPs) with the trait of interest. Unfortunately, GWAS has not been able to explain a substantial proportion of the estimated heritability for most complex traits. Due to the inherently complex nature of biology, this phenomenon could be a factor of the simplistic study design. A more powerful analysis may be a systems biology approach that integrates different types of data, or a meta-dimensional analysis. For this study we used the Analysis Tool for Heritable and Environmental Network Associations (ATHENA) to integrate high-throughput SNPs and gene expression variables (EVs) to predict high-density lipoprotein cholesterol (HDL-C) levels. We generated multivariable models that consisted of SNPs only, EVs only, and SNPs + EVs with testing r-squared values of 0.16, 0.11, and 0.18, respectively. Additionally, using just the SNPs and EVs from the best models, we generated a model with a testing r-squared of 0.32. A linear regression model with the same variables resulted in an adjusted r-squared of 0.23. With this systems biology approach, we were able to integrate different types of high-throughput data to generate meta-dimensional models that are predictive for the HDL-C in our data set. Additionally, our modeling method was able to capture more of the HDL-C variation than a linear regression model that included the same variables.
Article
Full-text available
Background Genome-wide association studies have identified thousands of SNP variants associated with hundreds of phenotypes. For most associations the causal variants and the molecular mechanisms underlying pathogenesis remain unknown. Exploration of the underlying functional annotations of trait-associated loci has thrown some light on their potential roles in pathogenesis. However, there are some shortcomings of the methods used to date, which may undermine efforts to prioritize variants for further analyses. Here, we introduce and apply novel methods to rigorously identify annotation classes showing enrichment or depletion of trait-associated variants taking into account the underlying associations due to co-location of different functional annotations and linkage disequilibrium. Results We assessed enrichment and depletion of variants in publicly available annotation classes such as genic regions, regulatory features, measures of conservation, and patterns of histone modifications. We used logistic regression to build a multivariate model that identified the most influential functional annotations for trait-association status of genome-wide significant variants. SNPs associated with all of the enriched annotations were 8 times more likely to be trait-associated variants than SNPs annotated with none of them. Annotations associated with chromatin state together with prior knowledge of the existence of a local expression QTL (eQTL) were the most important factors in the final logistic regression model. Surprisingly, despite the widespread use of evolutionary conservation to prioritize variants for study we find only modest enrichment of trait-associated SNPs in conserved regions. Conclusion We established odds ratios of functional annotations that are more likely to contain significantly trait-associated SNPs, for the purpose of prioritizing GWAS hits for further studies. 
Additionally, we estimated the relative and combined influence of the different genomic annotations, which may facilitate future prioritization methods by adding substantial information.
Article
Full-text available
This annual editorial from Genome Medicine’s Section Editors highlights the most exciting research from the past year and the potential of these advances for medicine. Last year, we noted that medical ‘omics continued its inexorable move towards the clinic; in 2012 it has truly arrived. DNA capture technologies and sequencing continue to lead the way, with implications for human genomics, personalized medicine, pharmacogenomics and drug labeling, public health screening, and public policy already apparent. There have also been technological advances in proteomics and other ‘omic approaches, and in the integration of these approaches to provide more informative molecular signatures of health and susceptibility to disease. De novo mutations: from complexity to the clinic
Article
Full-text available
Gene expression levels can be an important link between DNA variation and phenotypic manifestations. Our previous map of global gene expression based on approximately 400K SNPs and 50K transcripts in 400 sib pairs from the MRCA family panel has been widely used to interpret the results of GWAS. Here, we more than double the size of our initial dataset with expression data on 550 additional individuals from the MRCE family panel using the Illumina whole genome expression array. We have used new statistical methods for dimension reduction to account for non-genetic effects in estimates of expression levels and we have also included SNPs imputed from the 1000 Genomes Project. Our methods reduced false discovery rates and increased the number of eQTLs mapped either locally or at a distance (i.e. in cis or trans) from 1,534 in the MRCA dataset to 4,452 (with <5% FDR). Imputation of 1000 Genomes SNPs further increased the number of eQTLs to 7,302. Using the same methods and imputed SNPs in the newly acquired MRCE dataset we identified eQTLs for 9000 genes. The combined results identify strong local and distant effects for transcripts from 14,177 genes.
Article
Full-text available
Missing data are a major problem in the behavioral neurosciences, particularly when data collection is costly. Often researchers exclude cases with missing data, which can result in biased estimates and reduced power. Deleting a case because of a single missing data point can be avoided, but implementing a naïve missing data method can result in distorted estimates and incorrect conclusions. New approaches for handling missing data have been developed but these techniques are not typically included in undergraduate research methods texts. The topic of missing data techniques would be useful for teaching research methods and for helping students with their research projects. This paper aimed to illustrate that estimating missing data is often more efficacious than complete case analysis, otherwise known as listwise deletion. Longitudinal data were obtained from an experiment examining the effects of an anorectic drug on food consumption in a small sample (n=17) of rats. The complete dataset was degraded by removing a percentage of datapoints (1-5%, 10%). Four missing data techniques (listwise deletion, mean substitution, regression, and expectation-maximization (EM)) were applied to all six datasets to ensure that each approach was applied to the same missing data points. P-values, effect sizes, and Bayes factors were computed. Results demonstrated listwise deletion was the least effective method. EM and regression imputation were the preferred methods when more than 5% of the data were missing. Based on these findings it is recommended that researchers avoid using listwise deletion and consider alternative missing data techniques.
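To make the contrast concrete, here is a minimal sketch of the two simplest techniques named in the abstract, listwise deletion and mean substitution, on a hypothetical table of repeated measurements (rows are subjects, `None` marks a missing value). Regression and EM imputation, which the abstract prefers at higher missingness, are not shown; all names and values are illustrative assumptions.

```python
def listwise_deletion(rows):
    """Keep only the cases (rows) that have no missing value."""
    return [row for row in rows if all(value is not None for value in row)]


def mean_substitution(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_columns = len(rows[0])
    column_means = []
    for j in range(n_columns):
        observed = [row[j] for row in rows if row[j] is not None]
        column_means.append(sum(observed) / len(observed))
    return [
        [column_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical food-consumption measurements at three time points.
rows = [
    [10.0, 12.0, 11.0],
    [9.0, None, 10.0],
    [14.0, 13.0, None],
    [11.0, 11.0, 12.0],
]

complete_cases = listwise_deletion(rows)
imputed = mean_substitution(rows)
print("cases kept by listwise deletion:", len(complete_cases), "of", len(rows))
print("imputed table:", imputed)
```

Two missing cells out of twelve cost half of the cases under listwise deletion, which illustrates the loss of power described above; mean substitution keeps all cases but shrinks the column variance, one reason it counts as a naïve method.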
Article
Full-text available
Background The predominant model for regulation of gene expression through DNA methylation is an inverse association in which increased methylation results in decreased gene expression levels. However, recent studies suggest that the relationship between genetic variation, DNA methylation and expression is more complex. Results Systems genetic approaches for examining relationships between gene expression and methylation array data were used to find both negative and positive associations between these levels. A weighted correlation network analysis revealed that i) both transcriptome and methylome are organized in modules, ii) co-expression modules are generally not preserved in the methylation data and vice-versa, and iii) highly significant correlations exist between co-expression and co-methylation modules, suggesting the existence of factors that affect expression and methylation of different modules (i.e., trans effects at the level of modules). We observed that methylation probes associated with expression in cis were more likely to be located outside CpG islands, whereas specificity for CpG island shores was present when methylation, associated with expression, was under local genetic control. A structural equation model based analysis found strong support in particular for a traditional causal model in which gene expression is regulated by genetic variation via DNA methylation instead of gene expression affecting DNA methylation levels. Conclusions Our results provide new insights into the complex mechanisms between genetic markers, epigenetic mechanisms and gene expression. We find strong support for the classical model of genetic variants regulating methylation, which in turn regulates gene expression. Moreover we show that, although the methylation and expression modules differ, they are highly correlated.
Conference Paper
The rapid development of sequencing technologies makes thousands to millions of genetic attributes available for testing associations with various biological traits. Searching this enormous high-dimensional data space imposes a great computational challenge in genome-wide association studies. We introduce a network-based approach to supervise the search for three-locus models of disease susceptibility. Such statistical epistasis networks (SEN) are built using strong pairwise epistatic interactions and provide a global interaction map to search for higher-order interactions by prioritizing genetic attributes clustered together in the networks. Applying this approach to a population-based bladder cancer dataset, we found a high susceptibility three-way model of genetic variations in DNA repair and immune regulation pathways, which holds great potential for studying the etiology of bladder cancer with further biological validations. We demonstrate that our SEN-supervised search is able to find a small subset of three-locus models with significantly high associations at a substantially reduced computational cost.
Book
Presenting an area of research that intersects with and integrates diverse disciplines, including molecular biology, applied informatics, and statistics, among others, Bioinformatics for Omics Data: Methods and Protocols collects contributions from expert researchers in order to provide practical guidelines to this complex study. Divided into three convenient sections, this detailed volume covers central analysis strategies, standardization and data-management guidelines, and fundamental statistics for analyzing Omics profiles, followed by a section on bioinformatics approaches for specific Omics tracks, spanning genome, transcriptome, proteome, and metabolome levels, as well as an assortment of examples of integrated Omics bioinformatics applications, complemented by case studies on biomarker and target identification in the context of human disease. Written in the highly successful Methods in Molecular Biology™ series format, chapters contain introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and notes on troubleshooting and avoiding known pitfalls. Authoritative and accessible, Bioinformatics for Omics Data: Methods and Protocols serves as an ideal guide to scientists of all backgrounds and aims to convey the appropriate sense of fascination associated with this research field.
Article
Background: Hereditary factors have been reported in 5–10% of cases with exocrine pancreatic cancer and recent data support a role for BRCA2. Aims: We have studied the prevalence of germline BRCA2 mutations in two groups of patients with exocrine pancreatic cancer from an unselected series in Spain: group A included 24 cases showing familial aggregation of cancer and group B included 54 age, sex, and hospital matched cases without such evidence. Methods: Information was obtained by interview of patients and was validated by a telephone interview with a structured questionnaire. In patients from group A, >80% of the coding sequence of BRCA2 was analysed; in patients from group B, the regions in which germline BRCA2 mutations have been described to be associated with pancreatic cancer were screened. Results: Telephone interviews led to reclassification of 7/54 cases (13%). Familial aggregation of cancer was found in 24/165 cases (14.5%); six patients had a first degree relative with pancreatic cancer (3.6%) and nine patients had relatives with breast cancer. Germline BRCA2 mutations were not identified in any patient from group A (0/23). Among group B cases, one germline variant (T5868G>Asn1880Lys) was found in a 59 year old male without a family history of cancer. The 6174delT mutation was not found in any of the 71 cases analysed. Conclusions: The overall prevalence of BRCA2 mutations among patients with pancreatic cancer in Spain is low and the 6174delT mutation appears to be very infrequent. Our data do not support screening patients with cancer of the pancreas for germline BRCA2 mutations to identify relatives at high risk of developing this tumour.
Book
Thirty-five years have elapsed between the development of modern DNA sequencing and today’s apogee of high-throughput sequencing. During that time, starting from the sequencing of the first small phage genome (5,386 bases in length) and moving towards the sequencing of 1,000 human genomes (three billion bases in length each), massive amounts of data from thousands of species have been generated and are available in public repositories. This is mostly due to the development of a new generation of sequencing instruments a few years ago. With the advent of these data, new bioinformatics challenges arose, and work needs to be done to teach biologists to swim in this ocean of sequences so that they get safely into port.
Article
Recent technologies have made it cost-effective to collect diverse types of genome-wide data. Computational methods are needed to combine these data to create a comprehensive view of a given disease or a biological process. Similarity network fusion (SNF) solves this problem by constructing networks of samples (e.g., patients) for each available data type and then efficiently fusing these into one network that represents the full spectrum of underlying data. For example, to create a comprehensive view of a disease given a cohort of patients, SNF computes and fuses patient similarity networks obtained from each of their data types separately, taking advantage of the complementarity in the data. We used SNF to combine mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets. SNF substantially outperforms single data type analysis and established integrative approaches when identifying cancer subtypes and is effective for predicting survival.
Article
To locate multiple interacting quantitative trait loci (QTL) influencing a trait of interest within experimental populations, methods such as Cockerham's model are usually applied. Within this framework, interactions are understood as the part of the joint effect of several genes which cannot be explained as the sum of their additive effects. However, if a change in the phenotype (such as disease) is caused by Boolean combinations of genotypes of several QTLs, Cockerham's approach is often not capable of identifying them properly. To detect such interactions more efficiently, we propose a logic regression framework. Even though a larger number of models has to be considered with the logic regression approach (requiring more stringent multiple testing correction), the efficient representation of higher order logic interactions in logic regression models leads to a significant increase in power to detect such interactions as compared to Cockerham's approach. The increase in power is demonstrated analytically for a simple two-way interaction model and illustrated in more complex settings with a simulation study and real data analysis.
Article
Functional genomics experiments and analyses give rise to large sets of results, each typically quantifying the relation of molecular entities including genes, gene products, polymorphisms, and other genomic features with biological characteristics or processes. There is tremendous utility and value in using these data in an integrative fashion to find convergent evidence for the role of genes in various processes, to identify functionally similar molecular entities, or to compare processes based on their genomic correlates. However, these gene-centered data are often deposited in diverse and non-interoperable stores. Therefore, integration requires biologists to implement computational algorithms and harmonization of gene identifiers both within and across species. The GeneWeaver web-based software system brings together a large data archive from diverse functional genomics data with a suite of combinatorial tools in an interactive environment. Account management features allow data and results to be shared among user-defined groups. Users can retrieve curated gene set data, upload, store, and share their own experimental results and perform integrative analyses including novel algorithmic approaches for set-set integration of genes and functions.
Article
This essay focuses on multidisciplinary, interdisciplinary, and transdisciplinary research. The definitions and objectives for these three types of multiple discipline research are given. Discussion centers on the gains and losses that may be experienced by individual nurses who engage in such research, as well as gains and losses for the discipline of nursing.
Article
The rapidly growing availability of electronic biomedical data has increased the need for innovative data mining methods. Clustering in particular has been an active area of research in many different application areas, with existing clustering algorithms mostly focusing on one modality or representation of the data. Complementary ensemble clustering (CEC) is a recently introduced framework in which Kmeans is applied to a weighted, linear combination of the coassociation matrices obtained from separate ensemble clustering of different data modalities. The strength of CEC is its extraction of information from multiple aspects of the data when forming the final clusters. This study assesses the utility of CEC for biomedical data, which often have multiple data modalities, e.g., text and images, by applying CEC to two distinct biomedical datasets (PubMed images and radiology reports) that each have two modalities. Compared with five different clustering approaches based on the Kmeans algorithm, CEC exhibited equal or better performance in the metrics of micro-averaged precision and Normalized Mutual Information across both datasets. The reference methods included clustering of single modalities as well as ensemble clustering of separate and merged data modalities. Our experimental results suggest that CEC is equivalent to or more efficient than comparable Kmeans-based clustering methods using either single or merged data modalities.
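The combination step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published CEC implementation: the base partitions are hand-written rather than produced by repeated Kmeans runs, the weight of 0.5 is arbitrary, each row of the combined matrix is treated as a feature vector, and the small `kmeans` helper is a plain Lloyd's algorithm with a deterministic farthest-point start.

```python
def coassociation(partitions):
    """Fraction of base clusterings in which each pair of items shares a cluster."""
    n = len(partitions[0])
    return [[sum(p[i] == p[j] for p in partitions) / len(partitions)
             for j in range(n)] for i in range(n)]


def combine(matrix_a, matrix_b, weight):
    """Weighted linear combination weight*A + (1 - weight)*B."""
    return [[weight * a + (1 - weight) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(matrix_a, matrix_b)]


def dist2(u, v):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, iterations=20):
    """Lloyd's algorithm; centers start at row 0 plus farthest remaining rows."""
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(dist2(r, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest center for every row.
        labels = [min(range(k), key=lambda c: dist2(r, centers[c])) for r in rows]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical base clusterings of six items, two per modality.
text_partitions = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]
image_partitions = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1]]

combined = combine(coassociation(text_partitions),
                   coassociation(image_partitions), weight=0.5)
labels = kmeans(combined, k=2)
print(labels)
```

Here the two modalities disagree about items 2 and 3 in one base clustering each, and the combined matrix still separates items 0 to 2 from items 3 to 5. In practice the weight would need to be tuned or justified for the data at hand.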
Article
Where once there was the genome, now there are thousands of ’omes. Nature goes in search of the ones that matter.
Article
SUMMARY When making sampling distribution inferences about the parameter of the data, θ, it is appropriate to ignore the process that causes missing data if the missing data are ‘missing at random’ and the observed data are ‘observed at random’, but these inferences are generally conditional on the observed pattern of missing data. When making direct-likelihood or Bayesian inferences about θ, it is appropriate to ignore the process that causes missing data if the missing data are missing at random and the parameter of the missing data process is ‘distinct’ from θ. These conditions are the weakest general conditions under which ignoring the process that causes missing data always leads to correct inferences.
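The likelihood argument summarized above can be written out in a short sketch. The notation is an assumption introduced here for illustration and does not appear in the abstract: the data are split into observed and missing parts, $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$, $R$ is the missingness indicator, and $\phi$ is the parameter of the missing-data process.

```latex
% Missing at random: given the observed values, missingness does not
% depend on the values that are missing.
\[
  f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)
  = f(R \mid Y_{\mathrm{obs}}, \phi)
  \quad \text{for all } Y_{\mathrm{mis}}.
\]

% Under this condition the observed-data likelihood factorizes.
\[
  L(\theta, \phi \mid Y_{\mathrm{obs}}, R)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\,
         f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)\, dY_{\mathrm{mis}}
  = f(R \mid Y_{\mathrm{obs}}, \phi)\, f(Y_{\mathrm{obs}} \mid \theta).
\]
```

If, in addition, $\phi$ is distinct from $\theta$, the first factor carries no information about $\theta$, so direct-likelihood or Bayesian inference about $\theta$ can be based on $f(Y_{\mathrm{obs}} \mid \theta)$ alone.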
Article
The rapid development of sequencing technologies makes thousands to millions of genetic attributes available for testing associations with various biological traits. Searching this enormous high-dimensional data space imposes a great computational challenge in genome-wide association studies. We introduce a network-based approach to supervise the search for three-locus models of disease susceptibility. Such statistical epistasis networks (SEN) are built using strong pairwise epistatic interactions and provide a global interaction map to search for higher-order interactions by prioritizing genetic attributes clustered together in the networks. Applying this approach to a population-based bladder cancer dataset, we found a high susceptibility three-way model of genetic variations in DNA repair and immune regulation pathways, which holds great potential for studying the etiology of bladder cancer with further biological validations. We demonstrate that our SEN-supervised search is able to find a small subset of three-locus models with significantly high associations at a substantially reduced computational cost.
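The idea of letting a pairwise network restrict the three-locus search can be sketched as below. This is a hedged illustration, not the authors' SEN procedure: the pairwise scores, the threshold, and the SNP names are hypothetical, and "clustered together" is interpreted here as three loci joined by at least two retained edges, which is only one possible reading.

```python
from itertools import combinations


def build_network(pair_scores, threshold):
    """Keep only pairwise interactions whose score reaches the threshold."""
    adjacency = {}
    for (a, b), score in pair_scores.items():
        if score >= threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


def candidate_triples(adjacency):
    """Three-locus sets in which one locus is linked to the other two."""
    triples = set()
    for center, neighbours in adjacency.items():
        for u, v in combinations(sorted(neighbours), 2):
            triples.add(tuple(sorted((center, u, v))))
    return sorted(triples)


# Hypothetical pairwise interaction strengths between five SNPs.
pair_scores = {
    ("snp1", "snp2"): 0.90,
    ("snp2", "snp3"): 0.80,
    ("snp3", "snp4"): 0.20,
    ("snp4", "snp5"): 0.85,
    ("snp1", "snp5"): 0.10,
}

network = build_network(pair_scores, threshold=0.5)
prioritized = candidate_triples(network)

all_snps = sorted({snp for pair in pair_scores for snp in pair})
exhaustive = list(combinations(all_snps, 3))

print(prioritized)
print(len(prioritized), "of", len(exhaustive), "possible triples retained")
```

Only the prioritized triples would then be tested for three-way association, which is where the reduction in computational cost comes from; the network search can miss higher-order models whose pairwise components are all weak.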
Article
A forum of the Human Variome Project (HVP) was held as a satellite to the 2012 Annual Meeting of the American Society of Human Genetics in San Francisco, California. The theme of this meeting was "Getting Ready for the Human Phenome Project". Understanding the genetic contribution to both rare single gene "Mendelian" disorders and to more complex common diseases will require integration of research efforts among many fields and better defined phenotypes. The HVP is dedicated to bringing together researchers and research populations throughout the world to provide the resources to investigate the impact of genetic variation on disease. To this end, there needs to be a greater sharing of phenotype and genotype data. For this to occur, the many databases that currently exist will need to become interoperable to allow for the combining of cohorts with similar phenotypes to increase statistical power for studies attempting to identify novel disease genes or causative genetic variants. Improved systems and tools that enhance the collection of phenotype data from clinicians are urgently needed. This meeting begins the HVP's effort towards this important goal.
Article
Broader functional annotation of known as well as putative genetic variations is a valuable means of prioritizing targets in disease studies and large-scale genotyping projects. In this article, we present a practical guide to SNPnexus, a web-based tool that provides an aggregate set of functional annotations for genomic variation data by characterizing related consequences at the transcriptome/proteome levels with in-depth analysis of potential deleterious effects, inferring physical and cytogenetic mapping, reporting related HapMap data, finding overlaps with potential regulatory, structural as well as conserved elements and retrieving links with previously reported genetic disease studies. We focus on the SNPnexus query system, its annotation categories and the biological interpretation of results.
Article
The 4th Biennial Meeting of the Human Variome Project Consortium was held at the headquarters of the United Nations Educational, Scientific and Cultural Organization (UNESCO) in Paris, 11–15 June 2012. The Human Variome Project, a nongovernmental organization and an official partner of UNESCO, enables the routine collection, curation, interpretation, and sharing of information on all human genetic variation. This meeting was attended by more than 180 delegates from 39 countries and continued the theme of addressing issues of implementation in this unique project. The meeting was structured around the four main themes of the Human Variome Project strategic plan, “Project Roadmap 2012–2016”: setting normative function, behaving ethically, sharing knowledge, and building capacity. During the meeting, the members held extensive discussions to formulate an action plan in the key areas of the Human Variome Project. The actions agreed on were promulgated at the Project’s two Advisory Council and Scientific Advisory Committee postconference meetings. Genet Med 2013:15(7):507–512