A method for processing multivariate data in medical studies

The Neuropsychological Laboratory, CNS-Fed, 39 rue Meaux, 75019 Paris, France.
Statistics in Medicine (Impact Factor: 1.83). 09/2013; 32(20):3436-3448. DOI: 10.1002/sim.5788
Source: PubMed

ABSTRACT Traditional displays of principal component analyses are hard to read when one needs to discriminate between putative clusters of variables or cases. Here, the author proposes a method that clusters and visualizes variables or cases through principal component analyses, thus facilitating their analysis. The method displays pre-determined clusters of variables or cases as urchins, each of which has a soma (the average point) and spines (the individual variables or cases). Through three examples in the field of neuropsychology, the author illustrates how urchins help examine the modularity of cognitive tasks on the one hand and identify groups of healthy versus brain-damaged participants on the other hand. Some of the data used in this article were obtained from the Alzheimer's Disease Neuroimaging Initiative database. The urchin method was implemented in MATLAB, and the source code is available in the Supporting information. Urchins can be useful in biomedical studies to identify, at first glance, distinct phenomena that each have several measures (clusters of variables) or distinct groups of participants (clusters of cases). Copyright © 2013 John Wiley & Sons, Ltd.
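
The soma-and-spines construction described in the abstract can be illustrated with a minimal Python sketch. This is not the author's MATLAB implementation: the function name `build_urchins`, the cluster labels, and the example scores are hypothetical, and the sketch assumes that scores on the first two principal components have already been computed elsewhere. It only shows how each pre-determined cluster yields one soma (the mean point) and one spine per member.

```python
from collections import defaultdict
from statistics import fmean


def build_urchins(scores, labels):
    """Group 2-D component scores by cluster label and return, per cluster,
    the soma (mean point) and the spines (segments from the soma to each point).

    Hypothetical helper for illustration; not taken from the article's code.
    """
    groups = defaultdict(list)
    for point, label in zip(scores, labels):
        groups[label].append(tuple(point))

    urchins = {}
    for label, points in groups.items():
        # The soma is the average point of the cluster.
        soma = (fmean(p[0] for p in points), fmean(p[1] for p in points))
        # Each spine links the soma to one individual variable or case.
        spines = [(soma, p) for p in points]
        urchins[label] = {"soma": soma, "spines": spines}
    return urchins


# Made-up scores on the first two principal components (assumption).
scores = [(1.0, 2.0), (3.0, 4.0), (-2.0, 0.0), (-4.0, 2.0)]
labels = ["memory", "memory", "attention", "attention"]
urchins = build_urchins(scores, labels)
```

Each entry of `urchins` could then be passed to any plotting routine that draws the soma as a marker and each spine as a line segment.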

  • ABSTRACT: A key question when analyzing high throughput data is whether the information provided by the measured biological entities (gene or metabolite expression, for example) is related to the experimental conditions or, rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to better understand the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data. We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA). On simulation studies and real data sets, we showed that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach is well able to highlight relevant genes in the data with respect to the biological experiment. IPCA and sIPCA are both implemented in the R package mixomics dedicated to the analysis and exploration of high dimensional biological data sets, and on mixomics' web-interface.
    BMC Bioinformatics 02/2012; 13(1):24. DOI:10.1186/1471-2105-13-24 · 2.58 Impact Factor
  • ABSTRACT: In spite of the common belief that Europe is reasonably homogeneous at the genetic level, advances in high-throughput genotyping technology have resolved several gradients which define different geographical areas with good precision. When Northern and Southern European groups were considered separately, there were clear genetic distinctions. Intra-country genetic differences were also evident, especially in Finland and, to a lesser extent, within other European populations. Here, we present the first analysis using the 125,799 genome-wide Single Nucleotide Polymorphism (SNP) data of 1,014 Italians with wide geographical coverage. We showed, by using Principal Component Analysis and model-based individual ancestry analysis, that the current population of Sardinia can be clearly differentiated genetically from mainland Italy and Sicily, and that a certain degree of genetic differentiation is detectable within the current Italian peninsula population. The pair-wise F(ST) statistic amounts to approximately 0.001 between Northern and Southern Italy, and around 0.002 between Northern Italy and Utah residents with Northern and Western European ancestry (CEU). The Italian population also revealed a fine genetic substructure underscored by the genomic inflation (Sardinia vs. Northern Italy = 3.040 and Northern Italy vs. CEU = 1.427), warning against confounding effects of hidden relatedness and population substructure in association studies.
    PLoS ONE 09/2012; 7(9):e43759. DOI:10.1371/journal.pone.0043759 · 3.23 Impact Factor
  • ABSTRACT: SeqExpress, a gene-expression analysis suite, has been extended to offer a number of cluster generation, refinement and visualization techniques. The cluster generation methods have been specialized to deal with aspects of the sparseness and extreme values that occur within microarray data. The results of such cluster analysis can then be refined using either a functional enrichment based procedure, which examines each cluster to see if it possesses an unusually high or low concentration of ontology terms, or Expectation-Maximization to find a mixture of model based distributions within the datasets. Visualizations are provided both to explore and compare the results of the cluster generation algorithms. In addition, a tool has been developed which integrates SeqExpress with the Gene-Expression Omnibus repository. The tool provides seamless access to the large number of experimental results in the repository, so that they can be visualized and analysed locally using SeqExpress. AVAILABILITY: SeqExpress is available as a 6 MB download from and runs under Windows. A server-based version is available and is required for the GEO integration. SeqExpress is not affiliated with any academic institution, funding body or commercial organization and is free to use by all.
    Bioinformatics 06/2005; 21(10):2550-1. DOI:10.1093/bioinformatics/bti355 · 4.98 Impact Factor