Conference Paper

Improving biological significance of gene expression biclusters with key missing genes

Abstract

Identifying condition-specific co-expressed gene groups is critical for gene functional and regulatory analysis. However, because genes with critical functions (such as transcription factors) may not be co-expressed with their target genes, gene expression data alone are insufficient to uncover gene functional associations. In this paper, we propose a novel integrative biclustering approach that builds high-quality biclusters from gene expression data and uses Gene Ontology to identify critical genes missing from those biclusters. Our approach delivers complete inter- and intra-bicluster functional relationships, and thus gives biologists a clear picture for studying gene functional associations. We experimented with the yeast cell cycle and Arabidopsis cold-response gene expression datasets. Experimental results show that clear inter- and intra-bicluster relationships are identified and that the biological significance of the biclusters is considerably improved.

Article
Full-text available
Background Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited. Supplementary information and software are available at http://www.msu.edu/~jinchen/NETSIM. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0474-7) contains supplementary material, which is available to authorized users.
Article
Full-text available
Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, the existing GO-based similarity measurements are limited because each measure considers only a subset of GO information. An appropriate integration of the existing measures that takes more of the information in GO into account is needed. We propose a novel integrative measure called InteGO2 to automatically select appropriate seed measures and then to integrate them using a metaheuristic search method. The experimental results show that InteGO2 significantly improves the performance of gene similarity in human, Arabidopsis and yeast on both molecular function and biological process GO categories. InteGO2 computes gene-to-gene similarities more accurately than the tested existing measures and has high robustness. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
Article
Full-text available
Motivation: Gene regulatory network (GRN) inference reveals the influences genes have on one another in cellular regulatory systems. If the experimental data are inadequate for reliable inference of the network, informative priors have been shown to improve the accuracy of inferences. Results: This study explores the potential of undirected, confidence-weighted networks, such as those in functional association databases, as a prior source for GRN inference. Such networks often erroneously indicate symmetric interaction between genes and may contain mostly correlation-based interaction information. Despite these drawbacks, our testing on synthetic datasets indicates that even noisy priors reflect some causal information that can improve GRN inference accuracy. Our analysis on yeast data indicates that using the functional association databases FunCoup and STRING as priors can give a small improvement in GRN inference accuracy with biological data. Contact: matthew.studham@scilifelab.se Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Identifying patterns in temporal data is key to uncover meaningful relationships in diverse domains, from stock trading to social interactions. Also of great interest are clinical and biological applications, namely monitoring patient response to treatment or characterizing activity at the molecular level. In biology, researchers seek to gain insight into gene functions and dynamics of biological processes, as well as potential perturbations of these leading to disease, through the study of patterns emerging from gene expression time series. Clustering can group genes exhibiting similar expression profiles, but focuses on global patterns denoting rather broad, unspecific responses. Biclustering reveals local patterns, which more naturally capture the intricate collaboration between biological players, particularly under a temporal setting. Despite the general biclustering formulation being NP-hard, considering specific properties of time series has led to efficient solutions for the discovery of temporally aligned patterns. Notably, the identification of biclusters with time-lagged patterns, suggestive of transcriptional cascades, remains a challenge due to the combinatorial explosion of delayed occurrences. Herein, we propose LateBiclustering, a sensible heuristic algorithm enabling a polynomial rather than exponential time solution for the problem. We show that it identifies meaningful time-lagged biclusters relevant to the response of Saccharomyces cerevisiae to heat stress.
Article
Full-text available
In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. The existing gene similarity measures, however, solely rely on one or few aspects of MF without utilizing all the rich information available including structure, annotation, common terms, lowest common parents. We introduce a rank-based gene semantic similarity measure called InteGO by synergistically integrating the state-of-the-art gene-to-gene similarity measures. By integrating three GO based seed measures, InteGO significantly improves the performance by about two-fold in all the three species studied (yeast, Arabidopsis and human). InteGO is a systematic and novel method to study gene functional associations. The software and description are available at http://www.msu.edu/~jinchen/InteGO.
Article
Full-text available
Prognostic genes are key molecules informative for cancer prognosis and treatment. Previous studies have focused on the properties of individual prognostic genes, but have lacked a global view of their system-level properties. Here we examined their properties in gene co-expression networks for four cancer types using data from 'The Cancer Genome Atlas'. We found that prognostic mRNA genes tend not to be hub genes (genes with an extremely high connectivity), and this pattern is unique to the corresponding cancer-type-specific network. In contrast, the prognostic genes are enriched in modules (a group of highly interconnected genes), especially in module genes conserved across different cancer co-expression networks. The target genes of prognostic miRNA genes show similar patterns. We identified the modules enriched in various prognostic genes, some of which show cross-tumour conservation. Given the cancer types surveyed, our study presents a view of emergent properties of prognostic genes.
Article
Full-text available
Background Gene Ontology (GO) has been widely used in biological databases, annotation projects, and computational analyses. Although the three GO categories are structured as independent ontologies, the biological relationships across the categories are not negligible for biological reasoning and knowledge integration. However, the existing cross-category ontology term similarity measures are either developed by utilizing the GO data only or based on manually curated term name similarities, ignoring the fact that GO is evolving quickly and the gene annotations are far from complete. Results In this paper we introduce a new cross-category similarity measurement called CroGO by incorporating genome-specific gene co-function network data. The performance study showed that our measurement outperforms the existing algorithms. We also generated genome-specific term association networks for yeast and human. An enrichment based test showed our networks are better than those generated by the other measures. Conclusions The genome-specific term association networks constructed using CroGO provided a platform to enable a more consistent use of GO. In the networks, the frequently occurring MF-centered hubs indicate that a molecular function may be shared by different genes in multiple biological processes, or a set of genes with the same functions may participate in distinct biological processes. Common subgraphs in multiple organisms also revealed conserved GO term relationships. Software and data are available online at http://www.msu.edu/~jinchen/CroGO.
Article
Full-text available
Living cells are realized by complex gene expression programs that are moderated by regulatory proteins called transcription factors (TFs). The TFs control the differential expression of target genes in the context of transcriptional regulatory networks (TRNs), either individually or in groups. Deciphering the mechanisms of how the TFs control the differential expression of a target gene in a TRN is challenging, especially when multiple TFs collaboratively participate in the transcriptional regulation. To unravel the roles of the TFs in the regulatory networks, we model the underlying regulatory interactions in terms of the TF-target interactions' directions (activation or repression) and their corresponding logical roles (necessary and/or sufficient). We design a set of constraints that relate gene expression patterns to regulatory interaction models, and develop TRIM (Transcriptional Regulatory Interaction Model Inference), a new hidden Markov model, to infer the models of TF-target interactions in large-scale TRNs of complex organisms. Besides, by training TRIM with wild-type time-series gene expression data, the activation timepoints of each regulatory module can be obtained. To demonstrate the advantages of TRIM, we applied it on yeast TRN to infer the TF-target interaction models for individual TFs as well as pairs of TFs in collaborative regulatory modules. By comparing with TF knockout and other gene expression data, we were able to show that the performance of TRIM is clearly higher than DREM (the best existing algorithm). In addition, on an individual Arabidopsis binding network, we showed that the target genes' expression correlations can be significantly improved by incorporating the TF-target regulatory interaction models inferred by TRIM into the expression data analysis, which may introduce new knowledge in transcriptional dynamics and bioactivation.
Article
Full-text available
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.
Article
Full-text available
The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. Biclustering overcomes these limitations by grouping genes and samples simultaneously. It discovers subsets of genes that are co-expressed in certain samples. Recent studies showed that biclustering has a great potential in detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation measures. Besides that, there is a need for a biclustering algorithm that is capable of analyzing very large datasets in reasonable time. Here we present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on a well known data mining approach called frequent itemset. It discovers maximum size homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets. We demonstrate that the DeBi algorithm provides functionally more coherent gene sets compared to standard clustering or biclustering algorithms using biological validation measures such as Gene Ontology term and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful tool in analyzing large datasets. The method is also applicable on multiple gene expression datasets coming from different labs or platforms.
Article
Full-text available
Clustering methods are a useful and common first step in gene expression studies, but the results may be hard to interpret. We bring in explicitly an indicator of which genes tie each cluster, changing the setup to biclustering. Furthermore, we make the indicators hierarchical, resulting in a hierarchy of progressively more specific biclusters. A non-parametric Bayesian formulation makes the model rigorous yet flexible and computations feasible. The model can additionally be used in information retrieval for relating relevant samples. We show that the model outperforms four other biclustering procedures on a large miRNA data set. We also demonstrate the model's added interpretability and information retrieval capability in a case study. Software is publicly available at http://research.ics.tkk.fi/mi/software/treebic/.
Article
Full-text available
Matrix metalloproteinases (MMPs) and A Disintegrin and Metalloproteinases (ADAMs) are two related protease families that play key roles in matrix remodeling and growth factor ligand shedding. Directly ascertaining the proteolytic activities of particular MMPs and ADAMs in physiological environments in a non-invasive, real-time, multiplex manner remains a challenge. This work describes Proteolytic Activity Matrix Analysis (PrAMA), an integrated experimental measurement and mathematical analysis framework for simultaneously determining the activities of particular enzymes in complex mixtures of MMPs and ADAMs. The PrAMA method interprets dynamic signals from panels of moderately specific FRET-based polypeptide protease substrates to deduce a profile of specific MMP and ADAM proteolytic activities. Deconvolution of signals from complex mixtures of proteases is accomplished using prior data on individual MMP/ADAM cleavage signatures for the substrate panel measured with purified enzymes. We first validate PrAMA inference using a compendium of roughly 4000 measurements involving known mixtures of purified enzymes and substrates, and then demonstrate application to the live-cell response of wildtype, ADAM10-/-, and ADAM17-/- fibroblasts to phorbol ester and ionomycin stimulation. Results indicate PrAMA can distinguish closely related enzymes from each other with high accuracy, even in the presence of unknown background proteolytic activity. PrAMA offers a valuable tool for applications ranging from live-cell in vitro assays to high-throughput inhibitor screening with complex enzyme mixtures. Moreover, our approach may extend to other families of proteases, such as caspases and cathepsins, that also can lack highly-specific substrates.
Article
Full-text available
Although most biclustering formulations are NP-hard, in time series expression data analysis, it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results in synthetic and real data showing the effectiveness of the approach and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed methodology to extract relevant information compatible with documented biological knowledge but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general.
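The idea of working on a discretized matrix with contiguous columns can be illustrated with a small Python sketch. It is not the suffix-tree algorithm described above and does not run in linear time; it only shows what a shared pattern over consecutive time points looks like. The three-symbol alphabet (`U`, `D`, `N`), the `threshold` value, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def discretize(profile, threshold=0.5):
    """Turn a time series into a string of U (up), D (down), N (no change).

    Each symbol describes the change between two consecutive time points.
    The alphabet and the threshold are illustrative assumptions.
    """
    symbols = []
    for previous, current in zip(profile, profile[1:]):
        change = current - previous
        if change > threshold:
            symbols.append("U")
        elif change < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def contiguous_groups(expression, start, end, threshold=0.5):
    """Group genes sharing the same pattern on transitions start..end-1.

    Naive grouping for one fixed window of contiguous columns; groups
    with a single gene are dropped.
    """
    groups = defaultdict(list)
    for gene, profile in expression.items():
        pattern = discretize(profile, threshold)[start:end]
        groups[pattern].append(gene)
    return {p: genes for p, genes in groups.items() if len(genes) > 1}


expression = {
    "g1": [0, 1, 2, 2],
    "g2": [5, 6, 7, 5],
    "g3": [3, 2, 1, 1],
}
shared = contiguous_groups(expression, 0, 2)
```

Here `g1` and `g2` rise over the first two transitions even though their absolute levels differ, so they fall into the same group for that window.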
Article
Full-text available
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
Article
Full-text available
An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. This approach overcomes some problems associated with traditional clustering methods, by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or regulated by many factors.
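The mean squared residue score can be made concrete with a short Python sketch. It assumes the standard definition, in which the residue of an entry is its value minus its row mean, minus its column mean, plus the overall mean of the submatrix, and the score is the average squared residue. The function name and the list-of-lists representation are illustrative choices, and the node-deletion search itself is not shown.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by rows and cols.

    A score of 0 means every row is a shifted copy of every other row
    over the chosen columns (a perfectly additive pattern).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue * residue
    return total / (n_rows * n_cols)


additive = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
crossed = [[1, 2], [2, 1]]
score_additive = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2])
score_crossed = mean_squared_residue(crossed, [0, 1], [0, 1])
```

The additive matrix scores (numerically) zero because its rows differ only by a constant shift, while the crossed matrix, whose two rows move in opposite directions, scores 0.25.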
Article
Full-text available
This article by Yu. Lazebnik, “Can a Biologist Fix a Radio? — or, What I Learned while Studying Apoptosis” has already been published in English (Cancer Cell, 2002, 2, 179–182) and in Russian (Uspekhi Gerontologii, 2003, No. 12, 166–171). Nevertheless, we have undertaken its secondary publication in our journal for two reasons: first, our journal has different readers, and, second, the great significance of this manifest of Yuri Lazebnik. The author in bright and clever form shows the emerging necessity to create formalized language designed to describe complicated systems of regulation of biochemical processes in living cells. The article is published with permission of Cancer Cell and Uspekhi Gerontologii. Editor-in-Chief of Biokhimiya/Biochemistry (Moscow) V.P.Skulachev
Article
Full-text available
For effective exposition of biological information, especially with regard to analysis of large-scale data types, researchers need immediate access to multiple categorical knowledge bases and need summary information presented to them on collections of genes, as opposed to the typical one gene at a time. We present here a web-based tool (FunSpec) for statistical evaluation of groups of genes and proteins (e.g. co-regulated genes, protein complexes, genetic interactors) with respect to existing annotations (e.g. functional roles, biochemical properties, localization). FunSpec is available online at http://funspec.med.utoronto.ca. FunSpec is helpful for interpretation of any data type that generates groups of related genes and proteins, such as gene expression clustering and protein complexes, and is useful for predictive methods employing "guilt-by-association."
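A common way to evaluate a gene group against an annotation category is a hypergeometric tail probability: the chance of seeing at least the observed overlap if the group were drawn at random from the genome. The Python sketch below illustrates that general technique only; the text above does not state which test the tool applies, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: number of genes in the genome
    annotated:  genes in the genome carrying the annotation
    sample:     size of the gene group being evaluated
    overlap:    genes in the group carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# All 3 genes of a 3-gene group carry an annotation held by 3 of 10 genes.
p_small = enrichment_p_value(10, 3, 3, 3)
```

In practice the resulting p-values would also need a multiple-testing correction when many categories are tested, which this sketch leaves out.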
Article
Full-text available
Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.
Article
Full-text available
Many bioinformatics data resources not only hold data in the form of sequences, but also as annotation. In the majority of cases, annotation is written as scientific natural language: this is suitable for humans, but not particularly useful for machine processing. Ontologies offer a mechanism by which knowledge can be represented in a form capable of such processing. In this paper we investigate the use of ontological annotation to measure the similarities in knowledge content or 'semantic similarity' between entries in a data resource. These allow a bioinformatician to perform a similarity measure over annotation in an analogous manner to those performed over sequences. A measure of semantic similarity for the knowledge component of bioinformatics resources should afford a biologist a new tool in their repertoire of analyses. We present the results from experiments that investigate the validity of using semantic similarity by comparison with sequence similarity. We show a simple extension that enables a semantic search of the knowledge held within sequence databases. Software available from http://www.russet.org.uk.
Article
Full-text available
Clustering is often one of the first steps in gene expression analysis. How do clustering algorithms work, which ones should we use and what can we expect from them?
Article
Full-text available
Motivation: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, allows to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available. Results: First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) already the simple reference model delivers relevant patterns within all considered settings.
Article
Full-text available
Gene Ontology (GO) is a standard vocabulary of functional terms and allows for coherent annotation of gene products. These annotations provide a basis for new methods that compare gene products regarding their molecular function and biological role. We present a new method for comparing sets of GO terms and for assessing the functional similarity of gene products. The method relies on two semantic similarity measures; simRel and funSim. One measure (simRel) is applied in the comparison of the biological processes found in different groups of organisms. The other measure (funSim) is used to find functionally related gene products within the same or between different genomes. Results indicate that the method, in addition to being in good agreement with established sequence similarity approaches, also provides a means for the identification of functionally related proteins independent of evolutionary relationships. The method is also applied to estimating functional similarity between all proteins in Saccharomyces cerevisiae and to visualizing the molecular function space of yeast in a map of the functional space. A similar approach is used to visualize the functional relationships between protein families. The approach enables the comparison of the underlying molecular biology of different taxonomic groups and provides a new comparative genomics tool identifying functionally related gene products independent of homology. The proposed map of the functional space provides a new global view on the functional relationships between gene products or protein families.
Article
Full-text available
The whole-genome response of Arabidopsis (Arabidopsis thaliana) exposed to different types and durations of abiotic stress has now been described by a wealth of publicly available microarray data. When combined with studies of how gene expression is affected in mutant and transgenic Arabidopsis with altered ability to transduce the low temperature signal, these data can be used to test the interactions between various low temperature-associated transcription factors and their regulons. We quantized a collection of Affymetrix microarray data so that each gene in a particular regulon could vote on whether a cis-element found in its promoter conferred induction (+1), repression (-1), or no transcriptional change (0) during cold stress. By statistically comparing these election results with the voting behavior of all genes on the same gene chip, we verified the bioactivity of novel cis-elements and defined whether they were inductive or repressive. Using in silico mutagenesis we identified functional binding consensus variants for the transcription factors studied. Our results suggest that the previously identified ICEr1 (induction of CBF expression region 1) consensus does not correlate with cold gene induction, while the ICEr3/ICEr4 consensuses identified using our algorithms are present in regulons of genes that were induced coordinate with observed ICE1 transcript accumulation and temporally preceding genes containing the dehydration response element. Statistical analysis of overlap and cis-element enrichment in the ICE1, CBF2, ZAT12, HOS9, and PHYA regulons enabled us to construct a regulatory network supported by multiple lines of evidence that can be used for future hypothesis testing.
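The quantization-and-voting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the use of log2 fold changes, the `threshold` of 1.0, and the function names are hypothetical, and the statistical comparison against all genes on the chip is not shown.

```python
def quantize(log2_fold_change, threshold=1.0):
    """Map an expression change to a vote: +1 induced, -1 repressed, 0 unchanged.

    The log2 scale and the threshold are illustrative assumptions.
    """
    if log2_fold_change >= threshold:
        return 1
    if log2_fold_change <= -threshold:
        return -1
    return 0


def tally_votes(fold_changes, threshold=1.0):
    """Count the votes cast by the genes of one regulon."""
    votes = {1: 0, -1: 0, 0: 0}
    for value in fold_changes:
        votes[quantize(value, threshold)] += 1
    return votes


# Hypothetical cold-stress fold changes for five genes sharing a cis-element.
regulon_votes = tally_votes([2.0, 1.5, -1.2, 0.3, 0.0])
```

The tally for a regulon would then be compared with the tally over all genes on the same chip to judge whether the element is inductive or repressive.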
Article
Full-text available
In this paper we propose a data based algorithm to marry existing biological knowledge (e.g., functional annotations of genes) with experimental data (gene expression profiles) in creating an overall dissimilarity that can be used with any clustering algorithm that uses a general dissimilarity matrix. We explore this idea with two publicly available gene expression data sets and functional annotations where the results are compared with the clustering results that uses only the experimental data. Although more elaborate evaluations might be called for, the present paper makes a strong case for utilizing existing biological information in the clustering process. Availability Supplement is available at www.somnathdatta.org/Supp/Bioinformation/appendix.pdf
Article
Full-text available
A challenge in microarray data analysis concerns discovering local structures composed of sets of genes that show homogeneous expression patterns across subsets of conditions. We present an extension of the mixture of factor analyzers model (MFA) allowing for simultaneous clustering of genes and conditions. The proposed model is rather flexible since it models the density of high-dimensional data assuming a mixture of Gaussian distributions with a particular component-specific covariance structure. Specifically, a binary and row stochastic matrix representing tissue membership is used to cluster tissues (experimental conditions), whereas the traditional mixture approach is used to define the gene clustering. An alternating expectation conditional maximization (AECM) algorithm is proposed for parameter estimation; experiments on simulated and real data show the efficiency of our method as a general approach to biclustering. The Matlab code of the algorithm is available upon request from the authors.
Conference Paper
Full-text available
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bio-informatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. We introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
Article
Full-text available
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
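The pattern-based similarity described above can be sketched in a few lines. The sketch assumes the commonly cited pScore definition for a 2×2 submatrix, |(x_a − x_b) − (y_a − y_b)|, and treats a set of rows as a δ-pCluster when every such score stays within a threshold δ. The function names and sample values are hypothetical, and the brute-force check below is not the efficient algorithm the paper introduces.

```python
from itertools import combinations
from typing import Sequence


def p_score(x_a: float, x_b: float, y_a: float, y_b: float) -> float:
    """Difference between the changes of two objects across two dimensions."""
    return abs((x_a - x_b) - (y_a - y_b))


def is_delta_pcluster(rows: Sequence[Sequence[float]], delta: float) -> bool:
    """Check every pair of rows against every pair of columns (brute force)."""
    n_cols = len(rows[0])
    for x, y in combinations(rows, 2):
        for a, b in combinations(range(n_cols), 2):
            if p_score(x[a], x[b], y[a], y[b]) > delta:
                return False
    return True


# Hypothetical expression levels: the second gene is the first plus 100,
# so the magnitudes differ while the rise-and-fall pattern is identical.
coherent = [[10, 30, 20], [110, 130, 120]]
incoherent = [[10, 30, 20], [110, 100, 120]]

print(is_delta_pcluster(coherent, delta=0))
print(is_delta_pcluster(incoherent, delta=5))
```

A distance-based measure would place the two coherent genes far apart because their values differ by 100, which is the limitation the abstract points out.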
Article
Biclustering, one of the emerging techniques for analyzing DNA microarray data, is the search for subsets of genes and conditions that are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean squared residue as a quality measure, but relevant and interesting patterns, such as shifting or scaling patterns, cannot be detected. Furthermore, recent papers show that there exist new coherence patterns involved in different kinds of cancer and tumors, such as inverse relationships between genes, which cannot be captured. The proposed measure, called Spearman's biclustering measure (SBM), estimates the quality of a bicluster based on the non-linear correlation among genes and conditions simultaneously. The search for biclusters is performed using an evolutionary technique called estimation of distribution algorithms, which uses the SBM measure as the fitness function. This approach has been examined from different points of view using artificial and real microarrays. The assessment process has involved the use of quality indexes, a set of reference bicluster patterns including new patterns, and a set of statistical tests. Performance has also been examined using real microarrays and compared with different algorithmic approaches such as Bimax, CC, OPSM, Plaid and xMotifs. SBM shows several advantages, such as the ability to recognize more complex coherence patterns such as shifting, scaling and inversion, and the capability to selectively marginalize genes and conditions depending on the statistical significance.
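To see why a rank correlation can recognize shifting, scaling and inversion, the sketch below computes Spearman's coefficient between one expression profile and three transformed copies of it. This only illustrates the underlying rank correlation; it is not the SBM measure, whose exact definition is not given in the abstract. All names and values are hypothetical, and the profiles are assumed to be non-constant.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


base = [1.0, 4.0, 2.0, 8.0]
shifted = [v + 10.0 for v in base]   # shifting pattern
scaled = [v * 3.0 for v in base]     # scaling pattern
inverted = [-v for v in base]        # inverse relationship

for name, profile in [("shifted", shifted), ("scaled", scaled), ("inverted", inverted)]:
    print(name, round(spearman(base, profile), 6))
```

Shifted and scaled copies keep the same ranks, so their coefficient is 1, while the inverted copy reverses the ranks and gives -1; taking the absolute value would treat all three as coherent.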
Article
Many systems take the form of networks, including the Internet, distribution and transport networks, neural networks, food webs, and social networks. The characterization and modeling of these systems has proved amenable to treatment using techniques drawn from statistical and computational physics, and has as a result attracted considerable attention in the physics literature in recent years. In this paper the author reviews some of the interesting issues in this area and recounts some recent work on these issues by himself and by others.
Conference Paper
We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, we focus our attention on gene expression datasets mining. In particular, given that genes are often co-expressed under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous approaches, our method is based on the observation that the number of subspace clusters is related to the number of maximal subspace clusters to which any gene pair can belong. By performing discretization to gene expression profiles, the similarity between two genes is transformed as a sequence of symbols that represents the maximal subspace cluster for the gene pair. This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes. Based on the symbolic representations of genes, we present an efficient subspace clustering algorithm that is scalable to the number of dimensions. In addition, the running time can be drastically reduced by utilizing inverted index and pruning non-interesting subspaces. Experimental results indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for a yeast cell cycle dataset.
Conference Paper
Mining biclusters that exhibit both consistent trends and trends with similar degrees of fluctuations is vital to bioinformatics research. However, existing biclustering methods are not very efficient and effective at mining such biclusters. Moreover, few inter-bicluster relationships are delivered to biologists. In this paper, we introduce a quick hierarchical biclustering algorithm (QHB) to efficiently mine biclusters with both consistent trends and trends with similar degrees of fluctuations. Our QHB produces not only biclusters but also a hierarchical graph of inter-bicluster relationships. We experimented with the Yeast dataset and compared QHB against an existing biclustering scheme, DBF. Our results show that QHB identifies biclusters with better quality. In addition, QHB shows the relationships among biclusters. Moreover, compared with DBF, QHB is much more efficient and offers users a progressive way of bicluster exploration.
Article
Methods are presented for computing the equilibrium distribution of customers in closed queueing networks with exponential servers. Expressions for various marginal distributions are also derived. The computational algorithms are based on two-dimensional ...
Article
This article by Yu. Lazebnik, “Can a Biologist Fix a Radio? — or, What I Learned while Studying Apoptosis” has already been published in English (Cancer Cell, 2002, 2, 179–182) and in Russian (Uspekhi Gerontologii, 2003, No. 12, 166–171). Nevertheless, we have undertaken its secondary publication in our journal for two reasons: first, our journal has different readers, and, second, the great significance of this manifest of Yuri Lazebnik. The author in bright and clever form shows the emerging necessity to create formalized language designed to describe complicated systems of regulation of biochemical processes in living cells. The article is published with permission of Cancer Cell and Uspekhi Gerontologii.
Article
Biological networks, such as protein interaction, regulatory or metabolic networks, derived from public databases, biological experiments or text mining can be useful for the analysis of high-throughput experimental data. We present two algorithms embedded in the ToPNet application that show promising performance in analyzing expression data in the context of such networks. First, the Significant Area Search algorithm detects subnetworks consisting of significantly regulated genes. These subnetworks often provide hints on which biological processes are affected in the measured conditions. Second, Pathway Queries allow detection of networks including molecules that are not necessarily significantly regulated, such as transcription factors or signaling proteins. Moreover, using these queries, the user can formulate biological hypotheses and check their validity with respect to experimental data. All resulting networks and pathways can be explored further using the interactive analysis tools provided by the ToPNet program.
Article
Summary The CBF cold response pathway has a prominent role in cold acclimation. The pathway includes action of three transcription factors, CBF1, 2 and 3 (also known as DREB1b, c and a, respectively), that are rapidly induced in response to low temperature followed by expression of the CBF-targeted genes (the CBF regulon) that act in concert to increase plant-freezing tolerance. The results of transcriptome profiling and mutagenesis experiments, however, indicate that additional cold response pathways exist and may have important roles in life at low temperature. To further understand the roles that the CBF proteins play in configuring the low temperature transcriptome and to identify additional transcription factors with roles in cold acclimation, we used the Affymetrix GeneChip containing probe sets for approximately 24,000 Arabidopsis genes to define a core set of cold-responsive genes and to determine which genes were targets of CBF2 and 6 other transcription factors that appeared to be coordinately regulated with CBF2. A total of 514 genes were placed in the core set of cold-responsive genes, 302 of which were upregulated and 212 downregulated. Hierarchical clustering and bioinformatic analysis indicated that the 514 cold-responsive transcripts could be assigned to one of seven distinct expression classes and identified multiple potential novel cis-acting cold-regulatory elements. Eighty-five cold-induced genes and eight cold-repressed genes were assigned to the CBF2 regulon. An additional nine cold-induced genes and 15 cold-repressed genes were assigned to a regulon controlled by ZAT12. Of the 25 core cold-induced genes that were most highly upregulated (induced over 15-fold), 19 genes (84%) were induced by CBF2 and another two genes (8%) were regulated by both CBF2 and ZAT12. Thus, the large majority (92%) of the most highly induced genes belong to the CBF and ZAT12 regulons. 
Constitutive expression of ZAT12 in Arabidopsis caused a small, but reproducible, increase in freezing tolerance, indicating a role for the ZAT12 regulon in cold acclimation. In addition, ZAT12 downregulated the expression of the CBF genes indicating a role for ZAT12 in a negative regulatory circuit that dampens expression of the CBF cold response pathway.
Article
Cluster analysis of gene expression profiles has been widely applied to clustering genes for gene function discovery. Many approaches have been proposed. The rationale is that the genes with the same biological function or involved in the same biological process are more likely to co-express, hence they are more likely to form a cluster with similar gene expression patterns. However, most existing methods, including model-based clustering, ignore known gene functions in clustering. To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions as prior probabilities in model-based clustering. In contrast to a global mixture model applicable to all the genes in the standard model-based clustering, we use a stratified mixture model: one stratum corresponds to the genes of unknown function, while each of the others corresponds to the genes sharing the same biological function or pathway; the genes from the same stratum are assumed to have the same prior probability of coming from a cluster, while those from different strata are allowed to have different prior probabilities of coming from the same cluster. We derive a simple EM algorithm that can be used to fit the stratified model. A simulation study and an application to gene function prediction demonstrate the advantage of our proposal over the standard method. Contact: weip@biostat.umn.edu
Article
Motivation: Because co-expressed genes are likely to share the same biological function, cluster analysis of gene expression profiles has been applied for gene function discovery. Most existing clustering methods ignore known gene functions in the process of clustering. Results: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions into a new distance metric, which shrinks a gene expression-based distance towards 0 if and only if the two genes share a common gene function. A two-step procedure is used. First, the shrinkage distance metric is used in any distance-based clustering method, e.g. K-medoids or hierarchical clustering, to cluster the genes with known functions. Second, while keeping the clustering results from the first step for the genes with known functions, the expression-based distance metric is used to cluster the remaining genes of unknown function, assigning each of them to either one of the clusters obtained in the first step or some new clusters. A simulation study and an application to gene function prediction for yeast demonstrate the advantage of our proposal over the standard method.
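The shrinkage idea in the abstract above can be sketched as follows. The abstract does not give the exact formula, so the multiplicative form below, the `shrink` parameter, and the use of Euclidean distance are all assumptions made for illustration; the function names and sample data are hypothetical.

```python
import math
from typing import Iterable, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain expression-based distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def shrinkage_distance(
    expr_a: Sequence[float],
    expr_b: Sequence[float],
    funcs_a: Iterable[str],
    funcs_b: Iterable[str],
    shrink: float = 0.5,
) -> float:
    """Shrink the distance towards 0 only when the genes share a function.

    Assumed form: d * (1 - shrink) for genes with a common annotation,
    and the unchanged distance d otherwise. shrink lies in [0, 1].
    """
    d = euclidean(expr_a, expr_b)
    if set(funcs_a) & set(funcs_b):
        return (1.0 - shrink) * d
    return d


# Hypothetical profiles and annotations.
gene_a = ([0.0, 0.0], {"cell cycle"})
gene_b = ([3.0, 4.0], {"cell cycle", "DNA repair"})
gene_c = ([3.0, 4.0], {"transport"})

print(shrinkage_distance(gene_a[0], gene_b[0], gene_a[1], gene_b[1]))
print(shrinkage_distance(gene_a[0], gene_c[0], gene_a[1], gene_c[1]))
```

The resulting values can be fed to any distance-based method such as K-medoids or hierarchical clustering for the genes with known functions, while genes of unknown function would use the unshrunk expression distance.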
Article
This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their Gene Ontology (GO) annotation. We analyze how accurate this assumption proves to be using real publicly available data. We also aim to validate a measure of semantic similarity for GO annotation. We use the Pearson correlation coefficient and its absolute value as a measure of similarity between expression profiles of gene products. We explore a number of semantic similarity measures (Resnik, Jiang, and Lin) and compute the similarity between gene products annotated using the GO. Finally, we compute correlation coefficients to compare gene expression similarity against GO semantic similarity. Our results suggest that the Resnik similarity measure outperforms the others and seems better suited for use in Gene Ontology. We also deduce that there seems to be correlation between semantic similarity in the GO annotation and gene expression for the three GO ontologies. We show that this correlation is negligible up to a certain semantic similarity value; then, for higher similarity values, the relationship trend becomes almost linear. These results can be used to augment the knowledge provided by clustering algorithms and in the development of bioinformatic tools for finding and characterizing gene products.
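As an illustration of the Resnik measure mentioned above, the sketch below scores two ontology terms by the information content of their most informative common ancestor, where the information content of a term is the negative log of the fraction of annotated genes that fall under it. The tiny ontology, the term names, and the annotation counts are hypothetical, each gene is assumed to be annotated to exactly one term, and the sketch works at the term level rather than on gene products.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "metabolism": 2,
    "signaling": 4,
    "lipid_metabolism": 1,
    "sugar_metabolism": 1,
}


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """-log of the share of genes annotated to the term or its descendants."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik(term_1, term_2):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_1) & ancestors(term_2)
    return max(information_content(t) for t in common)


print(resnik("lipid_metabolism", "sugar_metabolism"))
print(resnik("lipid_metabolism", "signaling"))
```

The two metabolism terms share the ancestor "metabolism", which covers half of the genes, so their score is log 2; the second pair shares only the root, which carries no information.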
Article
The Gene Ontology (GO) project is a collaboration among model organism databases to describe gene products from all organisms using a consistent and computable language. GO produces sets of explicitly defined, structured vocabularies that describe biological processes, molecular functions and cellular components of gene products in both a computer- and human-readable manner. Here we describe key aspects of GO, which, when overlooked, can cause erroneous results, and address how these pitfalls can be avoided.
Conference Paper
A new placement algorithm for general cell assemblies is presented which combines the ideas of polar graph representation and min-cut placement. First a detailed description of the initial placement procedure is given, then the various methods for placement improvement (rotation, squeezing, reflecting) and global routing are discussed. A sample circuit is used to demonstrate the performance of the algorithms. Results are shown to compare favourably with manually achieved solutions.
Conference Paper
A bicluster of a gene expression dataset captures the coherence of a subset of genes and a subset of conditions. Biclustering algorithms are used to discover biclusters whose subset of genes are co-regulated under a subset of conditions. In this paper, we present a novel approach, called DBF (deterministic biclustering with frequent pattern mining), for finding biclusters. Our scheme comprises two phases. In the first phase, we generate a set of good quality biclusters based on frequent pattern mining. In the second phase, the biclusters are further iteratively refined (enlarged) by adding more genes and/or conditions. We evaluated our scheme against FLOC and our results show that DBF can generate larger and better biclusters.
Article
Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
Gene ontology annotations and resources
  • Gene Ontology Consortium
Garcia-Hernandez et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools
  • P Lamesch
  • T Z Berardini
  • D D Li
  • C Swarbreck
  • R Wilks R. Sasidharan
  • K Muller
  • D L Dreher
  • M Alexander