Effective similarity measures for expression profiles

Department of Computer Science, Cornell University, NY, USA.
Bioinformatics (Impact Factor: 4.98). 08/2006; 22(13):1616-22. DOI: 10.1093/bioinformatics/btl127
Source: PubMed


It is commonly accepted that genes with similar expression profiles are functionally related. However, there are many ways one can measure the similarity of expression profiles, and it is not clear a priori what is the most effective one. Moreover, so far no clear distinction has been made as for the type of the functional link between genes as suggested by microarray data. Similarly expressed genes can be part of the same complex as interacting partners; they can participate in the same pathway without interacting directly; they can perform similar functions; or they can simply have similar regulatory sequences. Here we conduct a study of the notion of functional link as implied from expression data. We analyze different similarity measures of gene expression profiles and assess their usefulness and robustness in detecting biological relationships by comparing the similarity scores with results obtained from databases of interacting proteins, promoter signals and cellular pathways, as well as through sequence comparisons. We also introduce variations on similarity measures that are based on statistical analysis and better discriminate genes which are functionally nearby and faraway. Our tools can be used to assess other similarity measures for expression profiles, and are accessible at

Download full-text


Available from: Golan Yona
  • Source
    • "As a normalized and dimensionless similarity metric, the correlation coefficient is appropriate for comparing variables with different units or scales, such as different expressive parameters, and is especially useful for comparing profiles or sequences (Hubert, 1979). Thus, correlation coefficients are among the most effective measures for detecting similarity in gene expression profiles (Yona et al., 2006), a research question which has many parallels with the similarity analysis of expressive profiles in music performance . Unlike the parametric Pearson correlation coefficient, non-parametric correlation coefficients such as Spearman's rho and Kendall's tau are not sensitive to outliers and are less affected by the shape of the statistical distribution of the data, making them more widely applicable as similarity indices. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Although a growing body of research has examined issues related to individuality in music performance, few studies have attempted to quantify markers of individuality that transcend pieces and musical styles. This study aims to identify such meta-markers by discriminating between influences linked to specific pieces or interpretive goals and performer-specific playing styles, using two complementary statistical approaches: linear mixed models (LMMs) to estimate fixed (piece and interpretation) and random (performer) effects, and similarity analyses to compare expressive profiles on a note-by-note basis across pieces and expressive parameters. Twelve professional harpsichordists recorded three pieces representative of the Baroque harpsichord repertoire, including three interpretations of one of these pieces, each emphasizing a different melodic line, on an instrument equipped with a MIDI console. Four expressive parameters were analyzed: articulation, note onset asynchrony, timing, and velocity. LMMs showed that piece-specific influences were much larger for articulation than for other parameters, for which performer-specific effects were predominant, and that piece-specific influences were generally larger than effects associated with interpretive goals. Some performers consistently deviated from the mean values for articulation and velocity across pieces and interpretations, suggesting that global measures of expressivity may in some cases constitute valid markers of artistic individuality. Similarity analyses detected significant associations among the magnitudes of the correlations between the expressive profiles of different performers. These associations were found both when comparing across parameters and within the same piece or interpretation, or on the same parameter and across pieces or interpretations. These findings suggest the existence of expressive meta-strategies that can manifest themselves across pieces, interpretive goals, or expressive devices.
    Full-text · Article · Nov 2013 · Frontiers in Psychology
  • Source
    • "For transcriptomic (microarray) and proteomic (mass-spectrometry) data, we use the Pearson correlation coefficient that gives high similarity to genes with correlated transcriptomic/proteomic profiles. Other similarity measures, as the classical Euclidean metric, could be possible, but the Pearson correlation measure has been shown to perform well to detect functional links in several analyses [20]. For the protein-protein interaction data, we use the Czekanovski-Dice metric [21], which gives high similarity to pairs of genes that share many interactors, and has been shown to perform well to predict biological functions [10]. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Of the 5,484 predicted proteins of Plasmodium falciparum, the main causative agent of malaria, about 60% do not have sufficient sequence similarity with proteins in other organisms to warrant provision of functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. We present PlasmoDraft, a database of Gene Ontology (GO) annotation predictions for P. falciparum genes based on postgenomic data. Predictions of PlasmoDraft are achieved with a Guilt By Association method named Gonna. This involves (1) a predictor that proposes GO annotations for a gene based on the similarity of its profile (measured with transcriptome, proteome or interactome data) with genes already annotated by GeneDB; (2) a procedure that estimates the confidence of the predictions achieved with each data source; (3) a procedure that combines all data sources to provide a global summary and confidence estimate of the predictions. Gonna has been applied to all P. falciparum genes using most publicly available transcriptome, proteome and interactome data sources. Gonna provides predictions for numerous genes without any annotations. For example, 2,434 genes without any annotations in the Biological Process ontology are associated with specific GO terms (e.g. Rosetting, Antigenic variation), and among these, 841 have confidence values above 50%. In the Cellular Component and Molecular Function ontologies, 1,905 and 1,540 uncharacterized genes are associated with specific GO terms, respectively (740 and 329 with confidence value above 50%). All predictions along with their confidence values have been compiled in PlasmoDraft, which thus provides an extensive database of GO annotation predictions that can be achieved with these data sources. The database can be accessed in different ways. A global view allows for a quick inspection of the GO terms that are predicted with high confidence, depending on the various data sources. A gene view and a GO term view allow for the search of potential GO terms attached to a given gene, and genes that potentially belong to a given GO term.
    Full-text · Article · Nov 2008 · BMC Bioinformatics
  • Source
    • "The reason for selecting Pearson correlation is that it has been consistently proven to provide better performance for microarray data compared with other similarity measures [23]. The sub-matrix SM c with the maximum absolute correlation with Y represents the best correlation of Y (Step 4). "
    [Show abstract] [Hide abstract]
    ABSTRACT: Microarrays are able to measure the patterns of expression of thousands of genes in a genome to give profiles that facilitate much faster analysis of biological processes for diagnosis, prognosis and tailored drug discovery. Microarrays, however, commonly have missing values which can result in erroneous downstream analysis. To impute these missing values, various algorithms have been proposed including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute) and K-Nearest Neighbour (KNN). Most of these imputation algorithms exploit either the global or local correlation structure of the data, which normally leads to larger estimation errors. This paper presents an enhanced Heuristic Non Parametric Collateral Missing Value Imputation (HCMVI) algorithm which uses CMVE as its core estimator and Heuristic Non Parametric strategy to compute optimal number of estimator genes to exploit optimally both local and global correlations.
    Full-text · Article · Oct 2008
Show more