ABSTRACT: The effectiveness of information retrieval systems is evaluated by running a set of test queries against a collection of documents
for which, for each query, the list of relevant documents is known. This evaluation framework
also includes performance measures that make it possible to assess the impact of a change in search parameters. The program
trec_eval calculates a large number of measures, some of which, such as mean average precision or recall-precision
curves, are used more often than others. The motivation of our work is to compare all the measures and to help users choose a small number of them when evaluating
different information retrieval systems. In this paper, we present a study based on a massive data analysis of
TREC results. Relationships between the 130 measures calculated by trec_eval for individual queries are investigated, and
we show that they can be grouped into homogeneous clusters.
Keywords: Information retrieval – Performance measures – Evaluation – Statistical data analysis
Knowledge and Information Systems 03/2011; 30(3):693-713. DOI:10.1007/s10115-011-0391-7 · 1.78 Impact Factor
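As an illustration of one of the measures named in the abstract above, the following is a minimal sketch of per-query average precision and its mean over queries (mean average precision). It is not trec_eval's code: the function names and the dictionary-based input format are assumptions made for this example, and queries with no relevant documents are simply scored 0.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: a query without relevant documents scores 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # precision at the rank of each relevant document retrieved
            precision_sum += hits / rank
    # relevant documents never retrieved contribute a precision of 0
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision.

    runs:  {query_id: [doc_id, ...]} ranked lists (hypothetical format)
    qrels: {query_id: {doc_id, ...}} relevance judgments (hypothetical format)
    """
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["a", "b"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"b"}}
    print(mean_average_precision(runs, qrels))
```

Computing such values per query, rather than only their mean, gives the kind of query-by-measure table on which relationships between measures can be studied.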
ABSTRACT: Biological data produced by high throughput technologies are becoming more and more abundant and are raising many statistical questions. This paper addresses one of them: how to highlight significant relationships between gene expression and other variables when the two are jointly observed. One relevant statistical method to explore these relationships is Canonical Correlation Analysis (CCA). Unfortunately, in the context of postgenomic data, the number of variables (gene expressions) is usually greater than the number of units (samples), so CCA cannot be performed directly: a regularized version is required.
We applied regularized CCA to data sets from two different studies and show that its interpretation brings out both previously validated relationships and new hypotheses. From the first data sets (nutrigenomic study), we generated interesting hypotheses on the transcription factor pathways potentially linking hepatic fatty acids and gene expression. From the second data sets (pharmacogenomic study on the NCI-60 cancer cell line panel), we identified new candidate ABC transporter substrates, whose relevance is illustrated by the concomitant identification of several known substrates.
In conclusion, the use of regularized CCA is likely to be relevant to a number and a variety of biological experiments involving the generation of high throughput data. We demonstrated here its ability to enhance the range of relevant conclusions that can be drawn from these relatively expensive experiments.
Journal of Biological Systems 06/2009; 17:173-199. DOI:10.1142/S0218339009002831 · 0.38 Impact Factor
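The regularization idea described in the abstract above can be sketched as follows. This is an illustrative ridge-type regularized CCA written with NumPy, not the authors' implementation: the function names and the parameters `lam_x` and `lam_y` are assumptions made for this example. Adding a small multiple of the identity to each covariance matrix makes it invertible even when there are more variables than samples. In practice the two ridge parameters would need to be tuned, for example by cross-validation.

```python
import numpy as np


def _inv_sqrt(sym):
    """Inverse square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(sym)
    return (v / np.sqrt(w)) @ v.T


def regularized_cca(X, Y, lam_x, lam_y):
    """Ridge-regularized CCA (illustrative sketch).

    X: (n, p) array, Y: (n, q) array, rows are the same n units.
    lam_x, lam_y: positive ridge parameters (hypothetical names).
    Returns the canonical correlations and the weight matrices for X and Y.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariance matrices: invertible even when p > n or q > n
    Cxx = Xc.T @ Xc / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations
    U, rho, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    return rho, Kx @ U, Ky @ Vt.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 50))  # more variables than units
    Y = rng.normal(size=(10, 40))
    rho, a, b = regularized_cca(X, Y, lam_x=0.1, lam_y=0.1)
    print(rho[:3])
```

With both parameters close to zero and full-rank covariance matrices, the sketch reduces to ordinary CCA.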
ABSTRACT: Meat from rabbits reared under a standard (STAND), a high-quality norm (LABEL) or a low-growth breeding (RUSSE) system was submitted to a sensory evaluation and to a large set of physicochemical measurements (weight of retail cuts, colour parameters, ultimate pH, femur flexure test, Warner-Bratzler shear test, water holding capacities and cooking losses). STAND rabbits had the juiciest meat in both back and leg (p<0.01). Leg tenderness decreased significantly (p<0.001) in the rank order STAND>LABEL>RUSSE. Canonical correlation analysis showed strong correlations between physicochemical and sensory variables (R²=0.73 and 0.68 for the first two pairs of canonical variates). In particular, sensory tenderness was correlated with the WB shear test variables assessed on raw longissimus muscle (LL), and the fibrous attribute in back was correlated with cooking loss in LL. When the rearing systems were analysed separately, only RUSSE rabbits exhibited the same relations between variables as those calculated on the whole dataset.
ABSTRACT: Canonical correlation analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units. The cancor() function in R (R Development Core Team 2007) performs the core of the computations, but further work was required to provide the user with additional tools to facilitate the interpretation of the results. We implemented an R package, CCA, freely available from the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/), to provide numerical and graphical outputs and to enable the user to handle missing values. The CCA package also includes a regularized version of CCA to deal with data sets with more variables than units. Illustrations are given through the analysis of a data set coming from a nutrigenomic study in the mouse.
ABSTRACT: Microarray data acquired during time-course experiments allow the temporal variations in gene expression to be monitored. An original postprandial fasting experiment was conducted in the mouse and the expression of 200 genes was monitored with a dedicated macroarray at 11 time points between 0 and 72 hours of fasting. The aim of this study was to provide a relevant clustering of gene expression temporal profiles. This was achieved by focusing on the shapes of the curves rather than on the absolute level of expression. Specifically, we combined spline smoothing and first derivative computation with hierarchical and partitioning clustering. A heuristic approach was proposed to tune the spline smoothing parameter using both statistical and biological considerations. Clusters are illustrated a posteriori through principal component analysis and heatmap visualization. Most results were found to be in agreement with the literature on the effects of fasting on the mouse liver and provide promising directions for future biological investigations.
EURASIP Journal on Bioinformatics and Systems Biology 02/2007;
Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2007, 8th International Conference, Carnegie Mellon University, Pittsburgh, PA, USA, May 30 - June 1, 2007. Proceedings, CD-ROM; 01/2007
ABSTRACT: This paper investigates two aspects. In a first step, linguistic techniques are used to categorize queries. This classification is then used to analyze system performance in a TREC context. More precisely, we cluster TREC topics using 13 linguistic features (Mothe et al., 2005), and use the systems that participated in the TREC 3, 5, 6 and 7 campaigns. The results show that our method can improve the results of the retrieval process compared with the CombMNZ technique (Lee, 1997) and with the best systems of each TREC campaign. When evaluated in a training/testing mode, we obtain an improvement, depending on the years considered, of 3.72% to 5.97% for P@5 and of 1.48% to 6.73% for P@10.
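For readers unfamiliar with the CombMNZ baseline cited in the abstract above, here is a minimal sketch of the fusion rule as it is commonly described: each document's normalized scores are summed over the systems that retrieved it, and the sum is multiplied by the number of such systems. The min-max normalization, the function names and the input format are assumptions made for this example. The sketch does not reproduce the paper's experimental setup.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(system_scores):
    """Fuse several retrieval results with the CombMNZ rule.

    system_scores: list of {doc_id: score} dictionaries, one per system
    (hypothetical format). Returns (doc_id, fused_score) pairs, best first.
    """
    summed, counts = {}, {}
    for scores in system_scores:
        for doc, s in min_max_normalize(scores).items():
            summed[doc] = summed.get(doc, 0.0) + s
            counts[doc] = counts.get(doc, 0) + 1
    # CombMNZ: sum of normalized scores times the number of systems retrieving the doc
    fused = {doc: summed[doc] * counts[doc] for doc in summed}
    # Ties are broken by document id to keep the output deterministic
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    system_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
    system_b = {"d3": 10.0, "d4": 0.0}
    for doc, score in comb_mnz([system_a, system_b]):
        print(doc, score)
```

Documents retrieved by several systems are favoured, which is why "d3" ranks first in the usage example even though it is not the top result of the first system.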
ABSTRACT: Affiliations: (2) Laboratoire de Pharmacologie et Toxicologie, UR 66, INRA, 180 chemin de Tournefeuille, B.P. 3, 31931 Toulouse Cedex; (3) Station d'Amélioration Génétique des Animaux, INRA.
In order to illustrate the variety of strategies applicable to transcriptomic data analysis, we first apply methods from exploratory statistics (PCA, multidimensional scaling, clustering), modelling (ANOVA, mixed models, tests) and learning (random forests) to a dataset from a nutrition study in mice. In a second stage, the results obtained are related to clinical parameters measured on the same animals, this time using canonical correlation analysis. Most of the methods provide biologically relevant results on these data. From this experiment we draw a few elementary lessons: there is no a priori best approach; one has to find the "right" strategy combining exploration and modelling, suited to both the data and the biological purpose.
From this point of view, close collaboration between statistician and biologist is essential.