ORIGINAL ARTICLE
Normalization techniques for PARAFAC modeling of urine
metabolomic data
Alžběta Gardlo 1,2 · Age K. Smilde 3 · Karel Hron 1 · Marcela Hrdá 2 · Radana Karlíková 2 · David Friedecký 2,4 · Tomáš Adam 2,4
Received: 5 January 2016 / Accepted: 14 June 2016 / Published online: 23 June 2016
© Springer Science+Business Media New York 2016
Abstract
Introduction One of the body fluids often used in metabolomics studies is urine. The concentrations of metabolites in urine are affected by the hydration status of an individual, resulting in dilution differences. This therefore requires normalization of the data to correct for such differences. Two normalization techniques are commonly applied to urine samples prior to their further statistical analysis. First, AUC normalization aims to normalize a group of signals with peaks by standardizing the area under the curve (AUC) within a sample to the median, mean or any other proper representation of the amount of dilution. The second approach uses specific end-product metabolites such as creatinine: all intensities within a sample are expressed relative to the creatinine intensity.
Objectives Another way of looking at urine metabolomics data is by realizing that the ratios between peak intensities are the information-carrying features. This opens up possibilities to use another class of data analysis techniques designed to deal with such ratios: compositional data analysis. The aim of this paper is to develop PARAFAC modeling of three-way urine metabolomics data in the context of compositional data analysis and to compare this with standard normalization techniques.
Methods In the compositional data analysis approach, special coordinate systems are defined to deal with the ratio problem. In essence, it comes down to using distance measures other than the Euclidean distance that is used in the conventional analysis of metabolomic data.
Results We illustrate the use of this approach in combination with three-way methods (i.e., PARAFAC) on a longitudinal urine metabolomics study and two simulations. In both cases, the advantage of the compositional approach is established in terms of improved interpretability of the scores and loadings of the PARAFAC model.
Conclusion For urine metabolomics studies, we advocate the use of compositional data analysis approaches. They are easy to use, well established and proven to give reliable results.
Keywords Parallel factor analysis (PARAFAC) · Compositional data · Metabolomics · Creatinine · Area under the curve
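The two normalization strategies summarized in the abstract (total-sum/AUC normalization and creatinine normalization) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the toy matrix `X` and the choice of creatinine column are hypothetical.

```python
import numpy as np

def total_sum_normalize(X):
    """Scale each sample (row) so its intensities sum to 1,
    a common form of AUC / total-sum normalization."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def creatinine_normalize(X, creatinine_idx):
    """Express every intensity in a sample relative to that
    sample's creatinine intensity (given by its column index)."""
    X = np.asarray(X, dtype=float)
    return X / X[:, [creatinine_idx]]

# Two measurements of the same urine at different dilutions:
# the second sample is a twofold-concentrated copy of the first.
X = np.array([[10.0, 20.0, 5.0],
              [20.0, 40.0, 10.0]])
print(total_sum_normalize(X))      # the two rows become identical
print(creatinine_normalize(X, 2))  # likewise, taking column 2 as creatinine
```

On this toy data both normalizations map the two samples onto identical rows, which is exactly the dilution correction being discussed.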
1 Introduction
Metabolomics, a young subfield of the omics sciences, concerns the comprehensive characterization of metabolites in biological systems. It is applied to plants, bacteria, animals and humans; in humans, all biological materials from biofluids (blood, urine) to tissues can be analyzed. Metabolomics is increasingly being used in almost all fields of health science, including pharmacology, pre-clinical drug trials, toxicology, newborn screening and many
Corresponding author: Alžběta Gardlo (alzbeta.gardlo@gmail.com)
1 Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University, Olomouc, Czech Republic
2 Laboratory of Metabolomics, Institute of Molecular and Translational Medicine, University Hospital Olomouc, Palacký University Olomouc, Olomouc, Czech Republic
3 Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
4 Department of Clinical Biochemistry, University Hospital Olomouc, Olomouc, Czech Republic
Metabolomics (2016) 12:117
DOI 10.1007/s11306-016-1059-9
... In fact, data normalization is a critical step in MS data processing to adjust size effect, due to the difference in the sample amount or dilution across samples, as well as other technical variations. Various data normalization methods, such as housekeeping normalization [18,25,26], centred logratio transformation [25], probabilistic quotient normalization [25,27], total sum normalization [25], and variance stabilization normalization [27,28], have been proposed. The choice of an appropriate normalization method depends on the type of biological samples, the study design, and the investigator's experience. ...
Article
Full-text available
Background: Identifying differentially abundant features between different experimental groups is a common goal for many metabolomics and proteomics studies. However, analyzing data from mass spectrometry (MS) is difficult because the data may not be normally distributed and there is often a large fraction of zero values. Although several statistical methods have been proposed, they either require the data normality assumption or are inefficient. Results: We propose a new semi-parametric differential abundance analysis (SDA) method for metabolomics and proteomics data from MS. The method considers a two-part model, a logistic regression for the zero proportion and a semi-parametric log-linear model for the possibly non-normally distributed non-zero values, to characterize data from each feature. A kernel-smoothed likelihood method is developed to estimate model coefficients and a likelihood ratio test is constructed for differential abundance analysis. The method has been implemented in an R package, SDAMS, which is available at https://www.bioconductor.org/packages/release/bioc/html/SDAMS.html . Conclusion: By introducing the two-part semi-parametric model, SDA is able to handle both non-normally distributed data and a large fraction of zero values in an MS dataset. It also allows for adjustment of covariates. Simulations and real data analyses demonstrate that SDA outperforms existing methods.
... Recently, a new approach based on Compositional Data Analysis (CODA) is receiving attention [10,11]. This approach is based on an attractive idea of working with log-ratios, thus eliminating the data normalization step. ...
... This comparison showed that the CODA approach (clr transformation) should not be applied to identify biomarkers [13]. Unfortunately, our conclusions did not coincide with that presented in Ref. [10]. As stated in Ref. [10], the observed discrepancy was probably caused by a limited number of variables considered in our study. Thus, we undertook a new simulation study, working with data sets with much larger numbers of variables. ...
Article
Instrumental signals of samples cannot be compared and/or analysed directly if their concentrations are unknown. Differences in overall concentration need to be removed at the data normalization step. The choice of normalization method has a profound effect on the final results of data analysis, and especially on biomarker identification. One of the possible approaches to deal with the ‘size effect’ is to work with size-irrelevant (log) ratios instead of the original variables. In the presented study, the performance of log-ratio methods, namely pairwise log-ratio (plr) and centered log-ratio (clr), is discussed for real and simulated data sets with different characteristics. It was found that the clr method can lead to distribution of local differences along an entire signal and as such, it should be avoided in all studies aiming to identify biomarkers.
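The centred log-ratio (clr) transformation compared in the study above can be written in a few lines. This is a minimal NumPy sketch of the standard clr definition, not the code used in the cited study; the toy matrix is hypothetical.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform: the log of each part divided by
    the geometric mean of its sample (row). Working in clr space makes
    Euclidean distances meaningful for compositional (ratio) data."""
    L = np.log(np.asarray(X, dtype=float))
    return L - L.mean(axis=1, keepdims=True)

# Same urine at two dilutions: the second row is 2x the first.
X = np.array([[10.0, 20.0, 5.0],
              [20.0, 40.0, 10.0]])
print(clr(X))  # both rows map to the same clr coordinates
```

Because multiplying a row by a constant leaves its clr coordinates unchanged, the transform is invariant to sample dilution, which is why it removes the need for a separate normalization step. The study above cautions, however, that clr can distribute a local difference along the entire signal, which matters for biomarker identification.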
... The parameters of width at 5% height and peak-to-peak baseline noise were automatically calculated. The total peak area was normalized according to the reference (Gardlo et al., 2016); that is, the ion intensities for each detected peak were normalized against the sum of the peak intensities within the sample. The data lists of RT, m/z and normalized peak area of each peak in positive and negative ion mode were generated, respectively. ...
... Due to the limitation of sample collection and time arrangement, rats' urine was collected in parallel for 12 h according to the references (Giri et al., 2007; Zhou et al., 2011). The mass spectrum data were normalized by total peak area (Gardlo et al., 2016). The metabolic information of plasma and urine detected by UPLC-QTOF/MS was analyzed using PCA. ...
Article
Full-text available
The seed of Ziziphus jujuba Mill. var. spinosa (Bunge) Hu ex H. F. Chou (ZSS) is often used as a traditional Chinese medicine for insomnia due to its sedative and hypnotic effects, but the mechanism underlying this effect has not been thoroughly elucidated. In this study, an insomnia model induced by intraperitoneal injection of DL-4-chlorophenylalanine suspension in Sprague-Dawley rats was adopted to investigate the therapeutic effect of ZSS extract. Metabolomics analyses of plasma and urine as well as 16S rRNA gene sequencing of the intestinal flora were performed. The relationships between the plasma and urine metabolites and the intestinal flora in insomnia rats were also analyzed. The results showed that changes in plasma and urine metabolites caused by insomnia were reversed after administration of ZSS, and these changes were mainly related to amino acid metabolism, especially phenylalanine metabolism. The results of 16S rRNA gene sequencing and short-chain fatty acid determination showed that the ZSS extract could reverse the imbalance of intestinal flora caused by insomnia and increase the contents of SCFAs in feces. All of these improvements are mainly related to the regulation of inflammation. Therefore, it is concluded that insomnia, which alters metabolic profiles and the intestinal flora, could be alleviated effectively by ZSS extract.
... Urinary metabolomics is a valuable source in the early disease diagnosis process: many studies report that certain metabolites are differentially expressed in the presence of diseases like cancer [1][2][3]. Normally, patients are hesitant to have their organs and tissues damaged to give samples during the disease diagnosis process. ...
... In recent years, there has been an increasing amount of attention given to the evaluation of normalization methods. Much work has focused on the development and comparison of different data-driven normalization techniques that utilize advanced statistical methods [2,[6][7][8][9]. Others have explored strategies relying primarily on biological values such as creatinine and osmolality for normalization [5,[10][11][12][13]. ...
Article
Human urine recently became a popular medium for metabolomics biomarker discovery because its collection is non-invasive. Sometimes renal dilution of urine can be problematic in this type of urinary biomarker analysis. Currently, various normalization techniques such as creatinine ratio, osmolality, specific gravity, dry mass, urine volume, and area under the curve are used to account for the renal dilution. However, these normalization techniques have their own drawbacks. In this project, mass spectrometry-based urinary metabolomic data obtained from prostate cancer (n = 56), bladder cancer (n = 57) and control (n = 69) groups were analyzed using statistical normalization techniques. The normalization techniques investigated in this study are Creatinine Ratio, Log Value, Linear Baseline, Cyclic Loess, Quantile, Probabilistic Quotient, Auto Scaling, Pareto Scaling, and Variance Stabilizing Normalization. The appropriate summary statistics for comparison of normalization techniques were created using variances, coefficients of variation, and boxplots. For each normalization technique, a principal component analysis was performed to identify clusters based on cancer type. In addition, hypothesis tests were conducted to determine if the normalized biomarkers could be used to differentiate between the cancer types. The results indicate that the determination of statistical significance can be dependent upon which normalization method is utilized. Therefore, careful consideration should go into choosing an appropriate normalization technique as no method had universally superior performance.
... From the compositional perspective, the rotational invariance of the ALS algorithm (Kruskal 1989) is of particular importance, because it enables the use of any logratio coordinates with the isometry property (like clr coefficients) for estimation purposes (Di Palma, Gallo, Filzmoser, and Hron 2016). Although PARAFAC or, more generally, statistical modeling of three-way data was recently successfully employed for economic applications (Dell'Anno and Amendola 2015; Veldscholte, Kroonenberg, and Antonides 1998) and its specifics for compositional data were developed (Gallo 2013; Gardlo, Smilde, Hron, Hrdá, Karlíková, Friedecký, and Adam 2016), the combination of both aspects (as far as is known to the authors) is not available in the literature. ...
Article
Full-text available
The present article aims to point and interval estimation of the parameters of generalised exponential distribution (GED) under progressive interval type-I (PITI) censoring scheme with random removals. The considered censoring scheme is most useful in those cases where continuous examination is not possible. Maximum likelihood, expectation-maximization and Bayesian procedures have been developed for the estimation of parameters of the GED, based on a PITI censored sample. Real datasets have been considered to illustrate the applicability of the proposed work. Further, we have compared the performances of the proposed estimators under PITI censoring to that of the complete sample.
Article
Clinical metabolomics aims at finding statistically significant differences in metabolic statuses of patient and control groups with the intention of understanding pathobiochemical processes and identification of clinically useful biomarkers of particular diseases. After the raw measurements are integrated and pre-processed as intensities of chromatographic peaks, the differences between controls and patients are evaluated by both univariate and multivariate statistical methods. The traditional univariate approach relies on t-tests (or their nonparametric alternatives) and the results from multiple testing are misleadingly compared merely by p-values using the so-called volcano plot. This paper proposes a Bayesian counterpart to the widespread univariate analysis, taking into account the compositional character of a metabolome. Since each metabolome is a collection of some small-molecule metabolites in a biological material, the relative structure of metabolomic data, which is inherently contained in ratios between metabolites, is of the main interest. Therefore, a proper choice of logratio coordinates is an essential step for any statistical analysis of such data. In addition, a concept of b-values is introduced together with a Bayesian version of the volcano plot incorporating distance levels of the posterior highest density intervals from zero. The theoretical background of the contribution is illustrated using two data sets containing samples of patients suffering from 3-hydroxy-3-methylglutaryl-CoA lyase deficiency and medium-chain acyl-CoA dehydrogenase deficiency. To evaluate the stability of the proposed method as well as the benefits of the compositional approach, two simulations designed to mimic a loss of samples and a systematical measurement error, respectively, are added.
Chapter
Compositional data are multivariate observations carrying relative information. Their specific properties are captured by the so-called Aitchison geometry with the Euclidean vector space structure. Accordingly, it is possible to construct real orthonormal coordinate systems, where most of the popular multivariate statistical methods can be performed. The main point in the construction of coordinates is their interpretation. It should reflect the fact that the relevant information in compositional data is contained in the log-ratios between compositional parts. The article summarizes recent advances in compositional data analysis concerning the definition of compositional data, their geometrical properties, and their possible coordinate representations, with emphasis on orthonormal coordinates, which are most reliable for further statistical processing. Finally, an outline for multivariate statistical analysis of compositional data in orthonormal coordinates is provided.
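Pivot coordinates, one common choice of the orthonormal (ilr) coordinates mentioned above, can be sketched as follows. This is an illustrative NumPy implementation of the standard textbook definition, not code from the chapter itself.

```python
import numpy as np

def pivot_coordinates(x):
    """Pivot (ilr) coordinates of a D-part composition x:
    z_j = sqrt((D-j)/(D-j+1)) * ln( x_j / gmean(x_{j+1}, ..., x_D) ),
    for j = 1, ..., D-1 (1-based indexing in the formula)."""
    x = np.asarray(x, dtype=float)
    D = x.size
    lx = np.log(x)
    z = np.empty(D - 1)
    for j in range(D - 1):            # j is 0-based here
        gmean_rest = lx[j + 1:].mean()  # log of geometric mean of the rest
        z[j] = np.sqrt((D - j - 1) / (D - j)) * (lx[j] - gmean_rest)
    return z

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pivot_coordinates(x))
print(pivot_coordinates(10 * x))  # identical: dilution does not change them
```

Because the coordinates are orthonormal, Euclidean distances between pivot-coordinate vectors coincide with Aitchison distances between the original compositions, which is what makes standard multivariate methods applicable.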
Chapter
Regression analysis is used to model the relationship between a response variable and one or more explanatory variables (covariates). In the compositional case, the proper choice of logratio coordinates matters, both for the interpretation of the regression parameters and because of the properties of the regression models. Again, orthonormal coordinates, particularly in their pivot version, are preferable. Moreover, in the case of regression with a compositional response and real covariates, ilr coordinates enable decomposing the multivariate regression model into separate multiple regressions. The coordinate representation of compositions is essential also for statistical inference such as hypothesis testing, which is frequently of interest in the regression context. In this chapter, all basic regression cases are covered: the mentioned regression with compositional response and real covariates, the case of real response and compositional explanatory variables, regression between two compositions, and finally also regression between the parts within one composition. A further important task is considered: variable selection of relevant covariates by forward and backward selection. Robustness issues are also of particular importance in the regression context: outliers in the response or in the covariates will have limited effect on robust regression estimates.
Chapter
With increasing dimensionality of compositional data much more care needs to be devoted to a reasonable coordinate representation and selection of methods to be used for their statistical processing. This situation frequently occurs with chemometric data, particularly when dealing with observations from “omics”-fields (genomics, proteomics, or metabolomics). In principle, all methods that are popular in the context of high-dimensional data, like principal component analysis and partial least squares regression, can also be used for compositional data with far more parts than observations. On the other hand, while pivot coordinates are still useful in terms of interpretation also in the high-dimensional context, this is not so clear for other types of balances: defining an interpretable sequential binary partition for compositions with hundreds or thousands of parts, where many of them may just be related to noise, is nearly impossible. Accordingly, it is meaningful here to consider even the elemental information, contained in pairwise logratios, to build up a relevant method for marker identification or for the detection of cell-wise outliers. The latter one can be used to reveal which observations are deviating from the majority in order to identify possible measurement errors or other artifacts. Moreover, it may be possible with these methods to identify parts or groups of parts that show a different behavior in all or in subsets of the observations.
Article
The increasing complexity of omics research has encouraged the development of new instrumental technologies able to deal with these challenging samples. In this respect, the rise of multidimensional separations should be highlighted due to the massive amounts of information they provide, with enhanced analyte determination. Both proteomics and metabolomics benefit from the higher separation capacity achieved when different chromatographic dimensions are combined, either in LC or GC. However, this vast quantity of experimental information requires the application of chemometric data analysis strategies to retrieve this hidden knowledge, especially in the case of non-targeted studies. In this work, we review the most common chemometric tools and approaches for the analysis of this multidimensional chromatographic data. First, different options for data preprocessing and enhancement of the instrumental signal are introduced. Next, the most used chemometric methods for the detection of chromatographic peaks and the resolution of chromatographic and spectral contributions (profiling) are presented. The description of these data analysis approaches is complemented with enlightening examples from omics fields that demonstrate the exceptional potential of combining multidimensional separation techniques and chemometric data analysis tools.
Article
Full-text available
The R package ThreeWay is presented and its main features are illustrated. The aim of ThreeWay is to offer a suite of functions for handling three-way arrays. In particular, the most relevant available functions are T3 and CP, which implement, respectively, the Tucker3 and Candecomp/Parafac methods. They are the two most popular tools for summarizing three-way arrays in terms of components. After briefly recalling both techniques from a theoretical point of view, the functions T3 and CP are described by considering three real life examples.
Book
It is difficult to imagine that the statistical analysis of compositional data has been a major issue of concern for more than 100 years. It is even more difficult to realize that so many statisticians and users of statistics are unaware of the particular problems affecting compositional data, as well as their solutions. The issue of "spurious correlation", as the situation was phrased by Karl Pearson back in 1897, affects all data that measures parts of some whole, such as percentages, proportions, ppm and ppb. Such measurements are present in all fields of science, ranging from geology, biology, environmental sciences, forensic sciences, medicine and hydrology. This book presents the history and development of compositional data analysis along with Aitchison's log-ratio approach. Compositional Data Analysis describes the state of the art both in theoretical fields as well as applications in the different fields of science. Key Features: • Reflects the state-of-the-art in compositional data analysis. • Gives an overview of the historical development of compositional data analysis, as well as basic concepts and procedures. • Looks at advances in algebra and calculus on the simplex. • Presents applications in different fields of science, including, genomics, ecology, biology, geochemistry, planetology, chemistry and economics. • Explores connections to correspondence analysis and the Dirichlet distribution. • Presents a summary of three available software packages for compositional data analysis. • Supported by an accompanying website featuring R code. Applied scientists working on compositional data analysis in any field of science, both in academia and professionals will benefit from this book, along with graduate students in any field of science working with compositional data.
Article
Compositional data are characterized by values containing relative information, and thus the ratios between the data values are of interest for the analysis. Due to specific features of compositional data, standard statistical methods should be applied to compositions expressed in a proper coordinate system with respect to an orthonormal basis. It is discussed how three-way compositional data can be analyzed with the Parafac model. When data are contaminated by outliers, robust estimates for the Parafac model parameters should be employed. It is demonstrated how robust estimation can be done in the context of compositional data and how the results can be interpreted. A real data example from macroeconomics underlines the usefulness of this approach.
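To make the Parafac model concrete, a minimal, non-robust alternating least squares (ALS) fit of a rank-R CP/PARAFAC model for a three-way array can be sketched in NumPy. This toy implementation omits the robustness and compositional-coordinate steps discussed in the article above; all names and the toy tensor are illustrative.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (J*K, R) from (J, R) and (K, R)."""
    J, R = U.shape
    K = V.shape[0]
    return np.einsum('jr,kr->jkr', U, V).reshape(J * K, R)

def parafac_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by plain ALS."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-1, mode-2 and mode-3 unfoldings of X
    X1 = X.reshape(I, J * K)
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# A noiseless rank-1 toy tensor is recovered essentially exactly.
a, b, c = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 3.0)
X = np.einsum('i,j,k->ijk', a, b, c)
A, B, C = parafac_als(X, rank=1)
Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.allclose(X, Xhat, atol=1e-6))
```

Each ALS step solves a linear least squares problem for one factor matrix while the other two are held fixed; the rotational invariance noted in the citation context above is what allows this same machinery to be run on logratio-transformed compositional arrays.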
Book
Modeling and Analysis of Compositional Data presents a practical and comprehensive introduction to the analysis of compositional data along with numerous examples to illustrate both theory and application of each method. Based upon short courses delivered by the authors, it provides a complete and current compendium of fundamental to advanced methodologies along with exercises at the end of each chapter to improve understanding, as well as data and a solutions manual which is available on an accompanying website. Complementing Pawlowsky-Glahn's earlier collective text that provides an overview of the state-of-the-art in this field, Modeling and Analysis of Compositional Data fills a gap in the literature for a much-needed manual for teaching, self learning or consulting.
Article
This paper presents a standardized notation and terminology to be used for three- and multiway analyses, especially when these involve (variants of) the CANDECOMP/PARAFAC model and the Tucker model. The notation also deals with basic aspects such as symbols for different kinds of products, and terminology for three- and higher-way data. The choices for terminology and symbols to be used have to some extent been based on earlier (informal) conventions. Simplicity and reduction of the possibility of confusion have also played a role in the choices made.